Help with regular expression

PeterJones

@Ben-1 said in Help with regular expression:

(?<=")[^"\r\n]+?\.latest:/
The file in question has 5.5 million lines of text in it (296MB), maybe it’s just too much.

I thought there might be too many quotes… try changing to

(?<=: ")[^"\r\n]+?\.latest:/

… that should only match things that come after : " , which should help narrow down the range of matches.

I just pasted

"id" : "file.name.latest:/Custom/path/file.exe"
"uid" : "file.name.latest:/Custom/path/file.exe"
"sid" : "file.name.latest:/Custom/path/file.exe"

4 million times. Well, really, it was select-all, copy, paste, so the text doubles every time: 3×2ⁿ = 3×2²² = 3×4.194M = 12.582M lines, which comes to about 625MB. Notepad++ does not claim the expression is invalid, and while it takes a long time, it seemed to be working.
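The arithmetic above can be checked with a short sketch. It assumes Windows (CRLF) line endings and plain ASCII text, neither of which is stated in the post; `doubled_size` is a hypothetical helper name used only for illustration.

```python
# The three sample lines that were pasted and then repeatedly doubled.
LINES = [
    '"id" : "file.name.latest:/Custom/path/file.exe"',
    '"uid" : "file.name.latest:/Custom/path/file.exe"',
    '"sid" : "file.name.latest:/Custom/path/file.exe"',
]


def doubled_size(doublings: int, eol: str = "\r\n") -> tuple[int, int]:
    """Return (line count, byte count) after doubling the sample N times.

    The CRLF default is an assumption about the test file, not a known fact.
    """
    copies = 2 ** doublings
    bytes_per_copy = sum(len((line + eol).encode("ascii")) for line in LINES)
    return len(LINES) * copies, bytes_per_copy * copies


line_count, byte_count = doubled_size(22)
print(f"{line_count:,} lines, {byte_count:,} bytes")
```

Under those assumptions, 22 doublings give 12,582,912 lines and 624,951,296 bytes, which is consistent with the roughly 625MB figure.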

I first tried with the original, but accidentally had a space before the (?<="), and it gave me the error you showed when I tried to FIND; when I deleted that space, and did just a FIND, it had no problem finding the next few occurrences. So it’s not giving the error. Though when I did the Replace All, it’s taken at least 5 minutes without any apparent progress… Notepad++ does have performance issues with files that are more than a hundred megabytes.

Unfortunately, after another 5 minutes while I wrote up my reply, it still hasn’t finished for me, so I’m cancelling it – oh, no, just as I wrote that, it finished. The search with (?<=")[^"\r\n]+?\.latest:/ worked for me with 625MB.

But maybe your machine has different memory limitations than mine does.

Try seeing if the more-restrictive (?<=: ")[^"\r\n]+?\.latest:/ works for you. If you have a way to break up the 300MB file into smaller chunks, I might recommend trying that, too.

And Terry posted this while I was writing mine up:

Find What: "[^:]+:/

I’d recommend a modified "[^:"\r\n]+?" , otherwise it might still hit the problem of trying to wrap lines and being too greedy, if you have some data in the middle that doesn’t fit your simple format

A Former User

This is strange. None of those work for these larger files. Sadly, these files contain confidential information; otherwise I’d just upload one. Tried restarting the computer just to make sure it wasn’t some odd memory issue, but this system has 64GB of RAM on a 64-bit OS, so it shouldn’t be a memory limitation.

There are multiple "s in every single line, and it seems to break after millions of these entries:
“BFCE9424314C03B3E”,
“B9603434384AC353E”,
“B00204343730E233E”,
{
“t” : “13.68”,
“v” : “-0.7822026”,
“c” : “3”,
“i” : “-0.7824571”,
“o” : “-0.7819693”
},

I am trying to wrap my head around a better way to detect what I need to replace/delete without simply looking for ". The text before what I need to replace always contains the following:

" : "

So any text after (not including it) I need to delete, up to and including this text.

.latest:/

That might reduce the amount of things it needs to look for?
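One way to express that rule outside Notepad++ is a small Python sketch that streams the file one line at a time, so that no match attempt can ever span more than a single line. The pattern is the more restrictive one suggested earlier in the thread, anchored on `" : "`. The function names, the UTF-8 encoding, and the file paths are assumptions for illustration only, and this has not been run against the original poster's data.

```python
import re

# Text after `" : "` up to and including `.latest:/`, never crossing a quote
# or a line break. The lookbehind leaves the `" : "` prefix in place.
PREFIX = re.compile(r'(?<=" : ")[^"\r\n]+?\.latest:/')


def strip_latest_prefix(line: str) -> str:
    """Remove the first `<anything>.latest:/` that follows `" : "`."""
    return PREFIX.sub("", line, count=1)


def strip_file(src_path: str, dst_path: str, encoding: str = "utf-8") -> int:
    """Copy src to dst line by line; return how many lines were changed."""
    changed = 0
    # newline="" keeps the original line endings untouched on read and write.
    with open(src_path, encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        for line in src:
            new_line = strip_latest_prefix(line)
            changed += new_line != line
            dst.write(new_line)
    return changed
```

A call such as `strip_file("input.json", "output.json")` (hypothetical file names) writes a modified copy and leaves the source file alone, which makes it easy to compare the two before replacing anything.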

Terry R

@Ben-1
With your latest “examples” it seems that your file isn’t just lines of what you originally wanted edited, rather it contains a mixture of other types of lines, is that correct?

If so, then another method might be to:

1. number all the lines
2. mark all the lines which contain the text to be edited
3. cut these lines and paste into a blank tab
4. edit these lines in the new tab
5. paste the edited lines back
6. re-sort the lines so they are back in line order
7. remove the line numbers that were added in step #1

Terry

A Former User

@Terry-R
It would take thousands of hours with the amount of data and files there are. I think I am just going to tell this client I cannot figure out how to achieve what he wants and not bill him. I’ve already put hours of my personal time trying to figure it out and I thought it’d just be a pretty simple notepad++ mass edit.

For some reason the problem has gotten even worse: the original expression now produces the error on files as small as 80MB, files that were processed successfully yesterday with this exact expression.

(?<=")[^"]+\.latest:/

I do appreciate the help given though.

Coises

@Ben-1 I can’t promise this will work, but I would try:

Find what : ^([^:\r\n]++: ")[^:\r\n"]++(?<=\.latest):/
Replace with: $1

A Former User

@Coises said in Help with regular expression:

@Ben-1 I can’t promise this will work, but I would try:

Find what : ^([^:\r\n]++: ")[^:\r\n"]++(?<=\.latest):/
Replace with: $1

Thanks, it produces the results
Can’t find the text “^([^:\r\n]++: “)[^:\r\n”]++(?<=.latest):/”

Mark Olson

FWIW no regex-based solution to this problem really addresses the core issue here, which is that you are trying to use regular expressions to parse a non-regular language, namely JSON. Regex are awesome, but they are not the right tool for every job.

I didn’t post earlier here because the examples you gave earlier looked like they might not be syntactically valid JSON, and therefore would probably be impossible for most JSON libraries to work with.

For example

"a": "foo"
"b": "bar"

is not valid JSON
but

{
"a": "foo",
"b": "bar"
}

is valid JSON even though it looks similar.

And guess what?

{
“a”: ”foo”,
”b”: ”bar”
}

is not valid JSON, because the curly quote character ” is not the same as the ASCII double quote character ".

Hopefully this illustrates the importance of using the code boxes when posting in the forum!

While we’re at it, this also illustrates another important idea, which is that you should assume that every single character is relevant when working with computer data. For example, you may not see the point of the commas after the colon-separated strings, but most JSON parsers are very opinionated and they will refuse to parse your input if they’re missing.
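The three cases above, plus the missing-comma case, can be checked with Python's standard `json` module. This is a minimal sketch; `is_valid_json` and the sample variable names are hypothetical, and the samples are the short snippets from this post rather than real data.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if the whole text parses as a single JSON document."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Bare key/value pairs with no enclosing braces and no comma.
bare_pairs = '"a": "foo"\n"b": "bar"'
# The same pairs wrapped in braces and separated by a comma.
wrapped = '{\n"a": "foo",\n"b": "bar"\n}'
# Curly quotes (U+201C / U+201D) instead of the ASCII double quote.
curly_quotes = '{\n\u201ca\u201d: \u201dfoo\u201d,\n\u201db\u201d: \u201dbar\u201d\n}'
# Braces present, but the comma between the pairs is missing.
missing_comma = '{\n"a": "foo"\n"b": "bar"\n}'

for name, sample in [("bare_pairs", bare_pairs), ("wrapped", wrapped),
                     ("curly_quotes", curly_quotes),
                     ("missing_comma", missing_comma)]:
    print(name, is_valid_json(sample))
```

Only the `wrapped` sample parses; the other three are rejected.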

Sadly, the JSON plugins for NPP are largely useless here because you are trying to work with a large number of files.

I would strongly recommend giving up on regular expressions for this task, and instead learn Python and try using the json module for this task.
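As a rough illustration of what that could look like, here is a minimal sketch using only the standard `json` module. It assumes each file is valid JSON (which this thread has not established), that it fits in memory, and that every string value containing `.latest:/` should lose everything up to and including that marker. The names `strip_prefix` and `convert_file` are hypothetical.

```python
import json

MARKER = ".latest:/"


def strip_prefix(value):
    """Recursively drop everything up to and including MARKER in string values.

    Dictionary keys are left untouched; only values are rewritten.
    """
    if isinstance(value, dict):
        return {key: strip_prefix(item) for key, item in value.items()}
    if isinstance(value, list):
        return [strip_prefix(item) for item in value]
    if isinstance(value, str):
        _before, found, after = value.partition(MARKER)
        return after if found else value
    return value  # numbers, booleans and null pass through unchanged


def convert_file(src_path: str, dst_path: str) -> None:
    """Load one JSON file, rewrite its string values, and save a copy."""
    with open(src_path, encoding="utf-8") as src:
        data = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(strip_prefix(data), dst, indent=2, ensure_ascii=False)
```

Note that `json.dump` re-serializes the data, so whitespace and indentation in the output will generally differ from the input even where no value changed.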

If you choose to use Python, this forum is not the place to ask for help. Go to a general programming forum instead.

Coises

@Ben-1 said in Help with regular expression:

@Coises said in Help with regular expression:

@Ben-1 I can’t promise this will work, but I would try:

Find what : ^([^:\r\n]++: ")[^:\r\n"]++(?<=\.latest):/
Replace with: $1

Thanks, it produces the results
Can’t find the text “^([^:\r\n]++: “)[^:\r\n”]++(?<=.latest):/”

Hmmm… is your real data a little more complex than your examples? Perhaps you have something like:

"id" : "http://file.name.latest:/Custom/path/file.exe"
"uid" : "ftp://file.name.latest:/Custom/path/file.exe"
"sid" : "https://file.name.latest:/Custom/path/file.exe"

which includes a colon inside the second set of quotes before the colon in .latest:/? If so, then indeed, my expression would not match; fixing that would require knowing more about the exact patterns in your real data.
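The difference between the two kinds of line can be reproduced with a short Python sketch. Python's `re` module is not the Boost engine that Notepad++ uses, and the possessive `++` quantifiers are written here as plain `+` so the pattern also compiles on Python versions before 3.11; for these inputs that should not change which lines match. The URL-style sample is the hypothetical data from this post, not the original poster's real data.

```python
import re

# Same structure as the suggested expression, with `++` relaxed to `+`.
PATTERN = re.compile(r'^([^:\r\n]+: ")[^:\r\n"]+(?<=\.latest):/', re.MULTILINE)


def apply_replacement(text: str) -> str:
    """Equivalent of "Replace with: $1" (written as \\1 in Python)."""
    return PATTERN.sub(r"\1", text)


simple = '"id" : "file.name.latest:/Custom/path/file.exe"'
with_scheme = '"id" : "http://file.name.latest:/Custom/path/file.exe"'

print(apply_replacement(simple))       # prefix removed
print(apply_replacement(with_scheme))  # unchanged: the colon in http: blocks the match
```

The second line is left alone because `[^:\r\n"]+` stops at the colon in `http:`, where the text before it does not end in `.latest`.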

PeterJones

@Ben-1 said in Help with regular expression:

It would take thousands of hours with the amount of data and files there are. I think I am just going to tell this client I cannot figure out how to achieve what he wants and not bill him. I’ve already put hours of my personal time trying to figure it out and I thought it’d just be a pretty simple notepad++ mass edit.

Wait, what?! If we had given you a working regex, you would have billed a client and gotten paid for the work we did? Yeesh.

A Former User

I installed Sublime Text, put in the expression and it works flawlessly with all the files. Out of curiosity, I put in my original expression that I came up with before posting in the original thread and it works flawlessly in Sublime Text as well. So in the end, this is a notepad++ issue.

You are being ridiculous. I work at a computer repair company and a customer wanted help with something. I told him I’d look into it, couldn’t figure it out, so I sought help. Newsflash… that’s what anyone you are paying to do any task is going to do when they hit a roadblock. YOU decide to be a part of an online forum and help people for free. If you don’t enjoy doing that, don’t do it.

Thanks for your efforts and help but seriously the interactions I had on this forum were beyond strange and over controlling. I’m out of here.

PeterJones

@A-Former-User said in Help with regular expression:

the interactions I had on this forum were beyond strange and over controlling

For future readers, to sum up:

This individual came, asking us to do his job for him without telling us that’s what he was doing.
His question wasn’t asked well enough that he got any upvotes, but he got answers which matched our best guess for what he was trying to do. This is trying to be helpful, not controlling.
When he tried to reply, it triggered a problem with the forum software (which is provided to this Community for free, and we have no control over how it’s implemented), so he decided to delete his account because he couldn’t figure out how to delete a link from his reply.
When he asked his question as a new user, it was pointed out that it was a continuation of the original question (to better help people who were answering to have the full context for his question); and I went to the effort to explain to him how he could avoid that problem; and a user upvoted his explanation of why he re-created his account, in order to make it easier for him to post. This was me trying to be helpful, not controlling.
In both the original topic and again in this follow-on topic, when it was obvious that he was having difficulty expressing his question in a way that we could supply him with an answer that would solve his problem, I suggested he read the FAQs, and follow their instructions for how to ask his question better. This was not to “control” him, but rather so that he could better get his ideas across, so that we’d be able to better help him, and to avoid wasting his time (and ours) by continuing to post without formatting the data in a way that it comes across correctly. (And earlier, I had even reformatted a post for him, to help him better communicate his ideas.)
Once he was able to communicate his problem, we were able to give him an expression which worked, both on the example data he gave us, and reportedly on the initial data he was testing it on.
As he started to change the original parameters to the question, we still continued to provide alternate solutions.
When he hit a limitation for the file size, there were still multiple attempts to help him make it work.
But at this point, it was revealed that there were hidden constraints which he hadn’t mentioned to us, which made the data look quite different from the simple examples, which were exacerbating the difficulty with the file size… but attempts were still made to help.
When he gave up, he revealed that he was actually trying to earn money by having us solve this problem for him, and then tried to blame us (me) for being controlling when I highlighted the fact that he was trying to earn money on our free labor.

Please understand: if your job includes using Notepad++, and you have a Notepad++ question, you are still allowed to ask it. But when you make us go through so many iterations, and keep on changing the parameters of the question, then in the end reveal that all of this was solely so that you could hopefully earn some extra cash, it’s bound to raise some hackles. Be honest up front, respond to questions, and show a willingness to learn, and we bend over backward to try to help. Violate that trust, and we will sour on helping you.

----

Please note: This Community Forum is not a data transformation service; you should not expect to be able to always say “I have data like X and want it to look like Y” and have us do all the work for you. If you are new to the Forum, and new to regular expressions, we will often give help on the first one or two data-transformation questions, especially if they are well-asked and you show a willingness to learn; and we will point you to the documentation where you can learn how to do the data transformations for yourself in the future. But if you repeatedly ask us to do your work for you, you will find that the patience of usually-helpful Community members wears thin. The best way to learn regular expressions is by experimenting with them yourself, and getting a feel for how they work; having us spoon-feed you the answers without you putting in the effort doesn’t help you in the long term and is uninteresting and annoying for us.

PeterJones

As a parting shot, the original poster wrote,

I installed Sublime Text, put in the expression and it works flawlessly with all the files. Out of curiosity, I put in my original expression that I came up with before posting in the original thread and it works flawlessly in Sublime Text as well. So in the end, this is a notepad++ issue.

Do any of the regulars have enough experience with Sublime to know how its implementation differs from Notepad++ in a way that might allow it to run that regex on a huge file without complaint, whereas Notepad++ does complain?

Is Sublime just better at chunking large files so that the memory never blows up?

Or does it use a different regex engine that’s more performant on large files?

Or do we think that the reports of errors were due to user error, not an actual issue with Notepad++?

If we can come up with an easy-to-reproduce large data file, and a simplified regex that works fine on Sublime and works fine on a small dataset in Notepad++, but blows up with a large dataset in Notepad++, we might be able to put in an issue – pointing out that the competition works better in that instance might light a fire under the developers, to see if they can improve that performance, which might in turn improve large-file search/replace performance in general, which would be a boon to all large-file users of Notepad++.

mkupper

@PeterJones It looks like they use Boost for the main finder. This Sublime Text forum post from 2022 seems to confirm that. That editor uses the Oniguruma regex library for parsing its configuration files.

Sublime Text has a portable edition available but I’m not sufficiently invested in the OP’s problem to verify that a particular regexp and set of data fails in Notepad++ but works in Sublime Text. I ran the portable edition and verified that simple multi-line spanning using [^x]* regexp works as that was my first guess as to a possible difference.

In 2016 there was mention of a custom regex engine but I did not dig further to see what that was about.

I found I liked the user interface of that editor’s find and search/replace box. It uses icons with hover text meaning there is not as much visual clutter as Notepad++. It’s also docked to the bottom. I did not experiment to see if it can be undocked as I suspect some people like Notepad++'s ability to have the finder box outside of the text window.

Coises

@PeterJones said in Help with regular expression:

Or does it use a different regex engine that’s more performant on large files?

Documentation here states, “Sublime Text uses the Perl Compatible Regular Expressions (PCRE) engine from the Boost library.”

Since Sublime Text is not an open source project, we have no way of knowing if they have modified the Boost.Regex engine in some way.

I’ve looked into this error message a bit, but I haven’t learned enough yet to understand what is happening. What I think I know thus far is that the exception leading to this message is thrown when the regex engine determines that its set of internal states is getting very large, and/or that it is growing out of proportion to the length of text it is examining. I believe this sort of thing happens when you have multiple quantifiers that can each lead to as many alternatives as there are characters in the text (e.g., *, +, *? or +?; but not *+ or ++) and there are long stretches of text that do not match (but can take many — probably millions — of attempts to determine that).

There is a configuration define, BOOST_REGEX_MAX_STATE_COUNT, which has an effect (the details of which I have not yet worked out) on this. Unless I’m missing something, Notepad++ does not set this, leaving it at the Boost.Regex default.

guy038

Hello, @a-former-user, @terry-r, @peterjones, @Coises, @mark-olson and All

Ah… another interesting topic, regarding two different perspectives !

First, I will talk about some technical points
Secondly, I’ll give my opinion about the original poster’s behaviour

To begin with, a general remark :

As this regex S/R simply concerns a suppression of some characters in one single line, at a time, it should work nicely in all cases ! Of course, depending on the total size of the file, the operation can take some minutes but it should necessarily end up correctly ! Furthermore, the message :

The complexity of matching the regular expression exceeded predefined bounds. Try refactoring the regular expression to make each choice made by the state machine unambiguous. This exception is thrown to prevent “eternal” matches that take an indefinite period time to locate.

Should not appear for this simple line by line replacement !

So, what happened ? Did you notice that, in all the regex’s syntaxes provided, no one thought about adding leading modifiers ! And, given that, in his first post, @a-former-user said :

To accomplish this, I have tried using the following Find/Replace expressions and settings

Find What = (?<=")[^"]+\.latest:(?=/)
Replace With =
Search Mode = REGULAR EXPRESSION
Dot Matches Newline = Yes

Maybe the fact that the . Matches newline option was checked led to the The complexity… message ?

Moreover, when modifying a very big file, I think it’s best to :

First stop and restart Notepad++
Change, temporarily, the file extension to .txt, in order to avoid any lexing behaviour !
Not use the Word wrap feature ( Not totally sure ! )

Now, I, personally, did a try with this regex S/R :

SEARCH (?-is)(?<=" : ").+\.latest:/
REPLACE Leave EMPTY

I created an INPUT file with the following template :

1,000 times the text below :

{
"id" : "file.name.latest:/Custom/path/file.exe"
"uid" : "file.name.latest:/Custom/path/file.exe"
"sid" : "file.name.latest:/Custom/path/file.exe"
}

Then, the part :

{
"t" : "13.68",
"v" : "-0.7822026",
"o" : "-0.7819693"
},

And this set of 6,006 lines was repeated, itself, 1,000 times.

So, I got a total file of 157,066,000 bytes, containing 6,006,000 lines, of which 3,000,000 contained the part to delete
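A file of this shape can be generated with a short Python sketch. The blank line after each block and the CRLF line endings are assumptions on my part, not details given in the post; they are chosen because, with them, one set comes to 157,066 bytes and 6,006 lines, which matches the totals reported above when repeated 1,000 times. `build_set` and `write_test_file` are hypothetical names.

```python
# One block containing the text to delete, followed by a blank line (assumed).
BLOCK_A = [
    "{",
    '"id" : "file.name.latest:/Custom/path/file.exe"',
    '"uid" : "file.name.latest:/Custom/path/file.exe"',
    '"sid" : "file.name.latest:/Custom/path/file.exe"',
    "}",
    "",
]

# The block without any text to delete, also followed by a blank line (assumed).
BLOCK_B = [
    "{",
    '"t" : "13.68",',
    '"v" : "-0.7822026",',
    '"o" : "-0.7819693"',
    "},",
    "",
]


def build_set(eol: str = "\r\n") -> str:
    """Return 1,000 copies of BLOCK_A followed by one copy of BLOCK_B."""
    block_a = "".join(line + eol for line in BLOCK_A)
    block_b = "".join(line + eol for line in BLOCK_B)
    return block_a * 1000 + block_b


def write_test_file(path: str, sets: int = 1000) -> None:
    """Write the set repeatedly; the default of 1,000 sets is about 157MB."""
    one_set = build_set().encode("ascii")
    with open(path, "wb") as out:
        for _ in range(sets):
            out.write(one_set)
```

Passing a smaller `sets` value gives a scaled-down file with the same structure, which is handy for checking an expression quickly before trying the full-size file.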

Running, first, this regex S/R, below, against this INPUT file, with my old Win XP machine with 1 Gb of RAM :
- SEARCH (?-is)(?<=" : ").+\.latest:/
- REPLACE Leave EMPTY

=> The S/R was complete after 5m and 45s with the message Replace All: 3000000 occurrences were replaced from caret to end-of-file

Now, running this regex S/R, against this same file, with my recent Windows 10 laptop, with 16 Gb of RAM :

=> The S/R was complete after 67s, with the same message ! The modified file is now 103,066,000 bytes long for 6,006,000 lines

Admittedly, this file is still half the size of @a-former-user’s file but, with his strong configuration, he should have been able to get the correct replacements !

Now, some thoughts about the @a-former-user’s behaviour. I hope that the DeepL translation will be accurate enough to correctly express my feelings !

Peter, I understand your disappointment : you’ve all been working hard to find solutions, while @a-former-user was simply waiting for THE perfect solution, for which he also had to receive financial compensation! Very frustrating for all of us!

Now, let’s ask ourselves: to what extent can we be sure that the solutions provided, as far as regular expressions are concerned, will be used by users solely for their own personal needs? I myself have, on a few occasions, had the opportunity to send back, by email, certain corrected files, and of course I was unaware of the subsequent use that was made of them !

So I think it’s pointless to take offence at @a-former-user’s attitude. We have no control over the future of the proposed solutions. We’re just doing the best we can, free of charge, and that’s all there is to it ! We just have to remember that, of all the contributions we make, a very small number will eventually end in a transfer of money.

Of course, IF we had been able to find a solution that satisfied @a-former-user, it would have been elegant, and rather “fair-play”, on his part, in the event of a real financial reward, to make a donation to @Don-Ho of all or part of it !

Let’s not be bitter ! It’s just our privilege, like many other sites, to provide free information to as many people as possible. If some people do not want to play the game, we’ll let them.

Best Regards,

guy038

Coises

@guy038 said in Help with regular expression:

Maybe the fact that the . Matches newline option was checked led to the The complexity… message ?

All documentation I’ve seen says that option is equivalent to setting the “default” condition of the (?s)/(?-s) flag; and, in turn, that flag affects exactly one thing: whether or not an unescaped period matches end-of-line characters.

There are no unescaped periods in the regular expressions under consideration here.

I was not successful in constructing a file similar to the original poster’s examples that caused the complexity message to appear with the expressions considered in this thread. However, the original poster’s response to my last proposal — that it found no matches — tells me his data was different from the samples he showed us in some way that influenced matching. I’m reasonably confident that the complexity message nearly always appears when there are very many possibilities for a match to fail between successes. Regardless of the (?s) flag, these expressions (aside from my last suggestion) could match across lines. Without knowing how or how effectively Boost.Regex optimizes, it makes sense that the number of trials could become absurd if, for example, there were many thousands of successive lines containing no quote marks.
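One way to test that hypothesis on a real file, without sharing its contents, would be to measure the longest stretch of text that contains no quote mark at all, since that is the most a `[^"]+` could consume in one attempt. The following is a minimal sketch; `longest_quote_free_run` is a hypothetical helper, and a large result only hints at a problem rather than proving that it causes the complexity message.

```python
def longest_quote_free_run(chunks) -> int:
    """Return the length of the longest run of characters without a `"`.

    `chunks` is any iterable of strings, for example an open text file,
    so the whole file never has to be held in memory at once.
    """
    longest = 0
    current = 0  # length of the run carried over from earlier chunks
    for chunk in chunks:
        parts = chunk.split('"')
        # The first piece continues whatever run was already in progress.
        current += len(parts[0])
        if len(parts) > 1:
            # A quote ended the carried run; pieces between quotes are
            # complete runs, and the last piece starts a new run.
            longest = max(longest, current, *(len(p) for p in parts[1:-1]))
            current = len(parts[-1])
    return max(longest, current)
```

Used as `with open(path, encoding="utf-8", newline="") as f: print(longest_quote_free_run(f))`, where `path` and the encoding are assumptions, it reports a single number that could be posted without revealing any confidential data.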

Mark Olson

@Coises
If the regex they were using included [^"], which can match newlines, it seems plausible that their issues could happen whether or not . Matches Newline was checked.

guy038

Hi, @mark-olson and All,

Oh…, my God : you’re a thousand times right, Mark ! Of course, a negated character class like [^...] does not depend at all on the status of the . matches newline option, nor on any leading modifiers (?s) and (?-s). I should have remembered that !

Indeed, place this simple text in a new tab :

Test in order to verify
the maximum range of
characters matched
by the regex expression

And use, for example, any of the regexes below :

With the . matches newline option unchecked

 - SEARCH [^=]+.+[^=]+

 - SEARCH (?-s)[^=]+.+[^=]+

 - SEARCH (?s)[^=]+.+[^=]+

With the . matches newline option checked

 - SEARCH [^=]+.+[^=]+

 - SEARCH (?-s)[^=]+.+[^=]+

 - SEARCH (?s)[^=]+.+[^=]+

You can easily verify that, whatever the regex used, all text, from the very beginning to its very end, is always matched !
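The same check can be sketched in Python. Python's `re` module is a different engine from the Boost one in Notepad++, and it does not accept a bare leading `(?-s)`, so the "dot matches newline" setting is passed as the `re.DOTALL` flag instead; treat this as an illustration of the principle rather than a test of Notepad++ itself.

```python
import re

TEXT = (
    "Test in order to verify\n"
    "the maximum range of\n"
    "characters matched\n"
    "by the regex expression\n"
)

PATTERN = r"[^=]+.+[^=]+"


def matched_span(flags: int = 0) -> tuple[int, int]:
    """Return the (start, end) offsets of the first match of PATTERN in TEXT."""
    match = re.search(PATTERN, TEXT, flags)
    return match.span()


print(matched_span())           # dot does not match newline
print(matched_span(re.DOTALL))  # dot matches newline
print(len(TEXT))
```

In both cases the match runs from offset 0 to the end of the text, because the negated class `[^=]` matches line breaks regardless of the dot setting.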

Remark :

In fact, as the in-line modifiers override the . matches newline option, there are only four distinct cases :

SEARCH [^=]+.+[^=]+ with the . matches newline option unchecked
SEARCH [^=]+.+[^=]+ with the . matches newline option checked
SEARCH (?-s)[^=]+.+[^=]+
SEARCH (?s)[^=]+.+[^=]+

Thus, in my previous post, I should have expressed myself :

Maybe the fact that a negated character class is used ( [^...] syntax ) led to the The complexity… message ?

Best Regards,

guy038