Help with regular expression
-
@Ben-1 said in Help with regular expression:
(?<=")[^"\r\n]+?\.latest:/
The file in question has 5.5 million lines of text in it (296MB), maybe it’s just too much.
I thought there might be too many quotes… try changing to
(?<=: ")[^"\r\n]+?\.latest:/
… that should only match things that come after
: "
, which should help narrow down the range of matches.
I just pasted
"id" : "file.name.latest:/Custom/path/file.exe"
"uid" : "file.name.latest:/Custom/path/file.exe"
"sid" : "file.name.latest:/Custom/path/file.exe"
4 million times – well, really, select-all, copy, paste, so it doubles every time, so it’s really 3*2ⁿ = 3*2²² = 3*4.194M = 12.582M lines, which comes to 625MB. It does not claim it’s invalid, and while it takes a long time, it seemed to be working.
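The doubling arithmetic above can be checked, and a comparable test file rebuilt, with a short script. This is only a sketch in Python (a language chosen for illustration, not one used in the post); `SAMPLE_LINES`, `lines_after_doublings`, `build_test_file` and `count_lines` are hypothetical names, and it assumes the pasted sample is three separate lines.

```python
import os
import tempfile

# The three sample lines from the post (assumed to be one entry per line).
SAMPLE_LINES = [
    '"id" : "file.name.latest:/Custom/path/file.exe"',
    '"uid" : "file.name.latest:/Custom/path/file.exe"',
    '"sid" : "file.name.latest:/Custom/path/file.exe"',
]


def lines_after_doublings(start_lines: int, doublings: int) -> int:
    """Each select-all/copy/paste round doubles the number of lines."""
    return start_lines * 2 ** doublings


def build_test_file(path: str, doublings: int) -> int:
    """Write the sample block 2**doublings times and return the line count."""
    block = "\n".join(SAMPLE_LINES) + "\n"
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for _ in range(2 ** doublings):
            handle.write(block)
    return lines_after_doublings(len(SAMPLE_LINES), doublings)


def count_lines(path: str) -> int:
    """Count lines by streaming the file instead of loading it whole."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


if __name__ == "__main__":
    # 3 lines doubled 22 times, as described in the post.
    print(lines_after_doublings(3, 22))
    # A small file for a quick check; raise `doublings` for a large one.
    demo_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
    print(build_test_file(demo_path, 3), count_lines(demo_path))
```

Calling `build_test_file` with `doublings=22` should give a file on the order of the 625MB mentioned above, so try a small value first.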
I first tried with the original, but accidentally had a space before the
(?<=")
, and it gave me the error you showed when I tried to FIND; when I deleted that space, and did just a FIND, it had no problem finding the next few occurrences. So it’s not giving the error. Though when I did the Replace All, it’s taken at least 5 minutes without any apparent progress… Notepad++ does have performance issues with files that are more than a hundred megabytes.
Unfortunately, after another 5 minutes while I wrote up my reply, it still hasn’t finished for me, so I’m cancelling it – oh, no, just as I wrote that, it finished. The search with
(?<=")[^"\r\n]+?\.latest:/
worked for me with 625MB.
But maybe your machine has different memory limitations than mine does.
Try seeing if the more-restrictive
(?<=: ")[^"\r\n]+?\.latest:/
works for you. If you have a way to break up the 300MB file into smaller chunks, I might recommend trying that, too.
And Terry posted this while I was writing mine up:
Find What:
"[^:]+:/
I’d recommend a modified
"[^:"\r\n]+?"
, otherwise it might still hit the problem of trying to wrap lines and being too greedy, if you have some data in the middle that doesn’t fit your simple format.
-
This is strange. None of those work for these larger files. Sadly, these files contain confidential information, otherwise I’d just upload it. Tried restarting the computer just to make sure it wasn’t some odd memory issue, but this system has 64GB of RAM on 64bit OS so it shouldn’t be a memory limitation.
There are multiple "s in every single line, and it seems to break after millions of these entries:
“BFCE9424314C03B3E”,
“B9603434384AC353E”,
“B00204343730E233E”,
{
“t” : “13.68”,
“v” : “-0.7822026”,
“c” : “3”,
“i” : “-0.7824571”,
“o” : “-0.7819693”
},
I am trying to wrap my head around a better way to detect what I need to replace/delete without simply looking for ". The text before what I need to replace always contains the following:
" : "
So any text after (not including it) I need to delete, up to and including this text.
.latest:/
That might reduce the amount of things it needs to look for?
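For comparison, the rule described above (keep the `" : "` marker, delete everything after it up to and including `.latest:/`) can be sketched outside Notepad++ as a script that streams the file one line at a time, so memory use does not depend on file size. This is an illustrative Python sketch, not something tried in this thread; `PREFIX_RE`, `strip_prefix` and `process_file` are hypothetical names, and UTF-8 encoding and straight ASCII quotes in the real files are assumptions.

```python
import re

# Lookbehind for the literal `" : "`, then the shortest run of characters
# that stays inside the quoted value and ends with `.latest:/`.
PREFIX_RE = re.compile(r'(?<=" : ")[^"\r\n]*?\.latest:/')


def strip_prefix(line: str) -> str:
    """Delete the text between `" : "` and the end of `.latest:/`."""
    return PREFIX_RE.sub("", line)


def process_file(src_path: str, dst_path: str) -> int:
    """Stream src to dst line by line; return how many lines were changed."""
    changed = 0
    # newline="" keeps the original line endings untouched.
    with open(src_path, encoding="utf-8", newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            new_line = strip_prefix(line)
            if new_line != line:
                changed += 1
            dst.write(new_line)
    return changed


if __name__ == "__main__":
    print(strip_prefix('"id" : "file.name.latest:/Custom/path/file.exe"'))
    print(strip_prefix('"t" : "13.68",'))
```

Because each line is handled on its own, a match can never run across line boundaries, which is the behaviour the more restrictive expressions in this thread were trying to enforce.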
-
@Ben-1
With your latest “examples” it seems that your file isn’t just lines of what you originally wanted edited, rather it contains a mixture of other types of lines, is that correct?
If so, then another method might be to:
- number all the lines
- mark all the lines which contain the text to be edited
- cut these lines and paste into a blank tab
- edit these lines in the new tab
- paste the edited lines back
- re-sort the lines so they are back in line order
- remove the line numbers that were added in step #1
Terry
-
@Terry-R
It would take thousands of hours with the amount of data and files there are. I think I am just going to tell this client I cannot figure out how to achieve what he wants and not bill him. I’ve already put hours of my personal time trying to figure it out and I thought it’d just be a pretty simple notepad++ mass edit.
For some reason the problem has gotten even worse, now the original expression is producing the error on files as little as 80MB, files that successfully processed yesterday with this exact expression.
(?<=")[^"]+\.latest:/
I do appreciate the help given though.
-
@Ben-1 I can’t promise this will work, but I would try:
Find what :
^([^:\r\n]++: ")[^:\r\n"]++(?<=\.latest):/
Replace with:$1
-
@Coises said in Help with regular expression:
@Ben-1 I can’t promise this will work, but I would try:
Find what :
^([^:\r\n]++: ")[^:\r\n"]++(?<=\.latest):/
Replace with:$1
Thanks, it produces the results
Can’t find the text "^([^:\r\n]++: ")[^:\r\n"]++(?<=\.latest):/"
-
FWIW no regex-based solution to this problem really addresses the core issue here, which is that you are trying to use regular expressions to parse a non-regular language, namely JSON . Regex are awesome, but they are not the right tool for every job.
I didn’t post earlier here because the examples you gave earlier looked like they might not be syntactically valid JSON, and therefore would probably be impossible for most JSON libraries to work with.
For example
"a": "foo" "b": "bar"
is not valid JSON
but
{ "a": "foo", "b": "bar" }
is valid JSON even though it looks similar.
And guess what?
{ “a”: ”foo”, ”b”: ”bar” }
is not valid JSON, because the curly quote character
”
is not the same as the ASCII double quote character
"
.
Hopefully this illustrates the importance of using the code boxes when posting in the forum!
While we’re at it, this also illustrates another important idea, which is that you should assume that every single character is relevant when working with computer data. For example, you may not see the point of the commas after the colon-separated strings, but most JSON parsers are very opinionated and they will refuse to parse your input if they’re missing.
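The validity claims above are easy to check with Python's standard `json` module. The sketch below is only an illustration; `is_valid_json` is a hypothetical helper name, and the test strings are the examples from this post plus one with the commas removed.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if the standard-library parser accepts the text."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Examples from the post above.
NO_BRACES = '"a": "foo" "b": "bar"'
VALID = '{ "a": "foo", "b": "bar" }'
# The same object typed with curly quotes (U+201C and U+201D).
CURLY = '{ \u201ca\u201d: \u201dfoo\u201d, \u201db\u201d: \u201dbar\u201d }'
# Braces present, but the comma between the pairs is missing.
NO_COMMA = '{ "a": "foo" "b": "bar" }'

if __name__ == "__main__":
    for sample in (NO_BRACES, VALID, CURLY, NO_COMMA):
        print(is_valid_json(sample), sample)
```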
Sadly, the JSON plugins for NPP are largely useless here because you are trying to work with a large number of files.
I would strongly recommend giving up on regular expressions for this task, and instead learn Python and try using the json module for this task.
If you choose to use Python, this forum is not the place to ask for help. Go to a general programming forum instead.
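As a rough idea of what the `json`-module approach could look like, here is a minimal sketch. It assumes the real files are valid JSON (which the samples in this thread do not confirm) and that the goal is to drop everything up to and including `.latest:/` from every string value; `MARKER`, `strip_value` and `rewrite_json_text` are hypothetical names.

```python
import json

# Everything up to and including this marker is removed from string values.
MARKER = ".latest:/"


def strip_value(value):
    """Recursively strip the marker prefix from strings in parsed JSON."""
    if isinstance(value, str):
        _head, found, tail = value.partition(MARKER)
        return tail if found else value
    if isinstance(value, list):
        return [strip_value(item) for item in value]
    if isinstance(value, dict):
        return {key: strip_value(item) for key, item in value.items()}
    # Numbers, booleans and null are returned unchanged.
    return value


def rewrite_json_text(text: str) -> str:
    """Parse, transform and re-serialise one JSON document."""
    return json.dumps(strip_value(json.loads(text)), indent=2)


if __name__ == "__main__":
    sample = '{"id": "file.name.latest:/Custom/path/file.exe", "t": "13.68"}'
    print(rewrite_json_text(sample))
```

Note that `json.loads` builds the whole document in memory, so files of several hundred megabytes may need a lot of RAM with this approach.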
-
@Ben-1 said in Help with regular expression:
@Coises said in Help with regular expression:
@Ben-1 I can’t promise this will work, but I would try:
Find what :
^([^:\r\n]++: ")[^:\r\n"]++(?<=\.latest):/
Replace with:$1
Thanks, it produces the results
Can’t find the text "^([^:\r\n]++: ")[^:\r\n"]++(?<=\.latest):/"
Hmmm… is your real data a little more complex than your examples? Perhaps you have something like:
"id" : "http://file.name.latest:/Custom/path/file.exe"
"uid" : "ftp://file.name.latest:/Custom/path/file.exe"
"sid" : "https://file.name.latest:/Custom/path/file.exe"
which includes a colon inside the second set of quotes before the colon in
.latest:/
? If so, then indeed, my expression would not match; fixing that would require knowing more about the exact patterns in your real data. -
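The effect of an extra colon inside the quoted value can be reproduced in a few lines. This is a sketch using Python's `re` module rather than the Boost engine in Notepad++, and the possessive `++` is written as a plain `+` (an assumption made here so the code also runs on Python versions before 3.11); for these two sample lines that change does not affect whether the expression matches. `PATTERN` and `replace_with_group1` are hypothetical names.

```python
import re

# The expression from the post, with `++` relaxed to `+` (see note above).
PATTERN = re.compile(r'^([^:\r\n]+: ")[^:\r\n"]+(?<=\.latest):/', re.MULTILINE)

SIMPLE = '"id" : "file.name.latest:/Custom/path/file.exe"'
WITH_SCHEME = '"id" : "http://file.name.latest:/Custom/path/file.exe"'


def replace_with_group1(text: str) -> str:
    """Replace each match with capture group 1, like `$1` in Notepad++."""
    return PATTERN.sub(r"\1", text)


if __name__ == "__main__":
    print(replace_with_group1(SIMPLE))
    # The colon in "http:" stops the second character class early,
    # so the lookbehind for ".latest" fails and nothing is replaced.
    print(replace_with_group1(WITH_SCHEME))
```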
@Ben-1 said in Help with regular expression:
It would take thousands of hours with the amount of data and files there are. I think I am just going to tell this client I cannot figure out how to achieve what he wants and not bill him. I’ve already put hours of my personal time trying to figure it out and I thought it’d just be a pretty simple notepad++ mass edit.
Wait, what?! If we had given you a working regex, you would have billed a client and gotten paid for the work we did? Yeesh.
-
I installed Sublime Text, put in the expression and it works flawlessly with all the files. Out of curiosity, I put in my original expression that I came up with before posting in the original thread and it works flawlessly in Sublime Text as well. So in the end, this is a notepad++ issue.
You are being ridiculous. I work at a computer repair company and a customer wanted help with something, I told him I’d look into it, couldn’t figure it out so I sought help. Newsflash… that’s what anyone you are paying to do any task is going to do when they hit a roadblock. YOU decide to be a part of an online forum and help people for free. If you don’t enjoy doing that, don’t do it.
Thanks for your efforts and help but seriously the interactions I had on this forum were beyond strange and over controlling. I’m out of here.
-
@A-Former-User said in Help with regular expression:
the interactions I had on this forum were beyond strange and over controlling
For future readers, to sum up:
- This individual came, asking us to do his job for him without telling us that’s what he was doing.
- His question wasn’t asked well enough that he got any upvotes, but he got answers which matched our best guess for what he was trying to do. This is trying to be helpful, not controlling.
- When he tried to reply, it triggered a problem with the forum software (which is provided to this Community for free, and we have no control over how it’s implemented), so he decided to delete his account because he couldn’t figure out how to delete a link from his reply.
- When he asked his question as a new user, it was pointed out that it was a continuation of the original question (to better help people who were answering to have the full context for his question); and I went to the effort to explain to him how he could avoid that problem; and a user upvoted his explanation of why he re-created his account, in order to make it easier for him to post. This was me trying to be helpful, not controlling.
- In both the original topic and again in this follow-on topic, when it was obvious that he was having difficulty expressing his question in a way that we could supply him with an answer that would solve his problem, I suggested he read the FAQs, and follow their instructions for how to ask his question better. This was not to “control” him, but rather so that he could better get his ideas across, so that we’d be able to better help him, and to avoid wasting his time (and ours) by continuing to post without formatting the data in a way that it comes across correctly. (And earlier, I had even reformatted a post for him, to help him better communicate his ideas.)
- Once he was able to communicate his problem, we were able to give him an expression which worked, both on the example data he gave us, and reportedly on the initial data he was testing it on.
- As he started to change the original parameters to the question, we still continued to provide alternate solutions.
- When he hit a limitation for the file size, there were still multiple attempts to help him make it work.
- But at this point, it was revealed that there were hidden constraints which he hadn’t mentioned to us, which made the data look quite different from the simple examples, which were exacerbating the difficulty with the file size… but attempts were still made to help.
- When he gave up, he revealed that he was actually trying to earn money by having us solve this problem for him, and then tried to blame us (me) for being controlling when I highlighted the fact that he was trying to earn money on our free labor.
Please understand: if your job includes using Notepad++, and you have a Notepad++ question, you are still allowed to ask it. But when you make us go through so many iterations, and keep on changing the parameters of the question, then in the end reveal that all of this was solely so that you could hopefully earn some extra cash, it’s bound to raise some hackles. Be honest up front, respond to questions, and show a willingness to learn, and we bend over backward to try to help. Violate that trust, and we will sour on helping you.
----
Please note: This Community Forum is not a data transformation service; you should not expect to be able to always say “I have data like X and want it to look like Y” and have us do all the work for you. If you are new to the Forum, and new to regular expressions, we will often give help on the first one or two data-transformation questions, especially if they are well-asked and you show a willingness to learn; and we will point you to the documentation where you can learn how to do the data transformations for yourself in the future. But if you repeatedly ask us to do your work for you, you will find that the patience of usually-helpful Community members wears thin. The best way to learn regular expressions is by experimenting with them yourself, and getting a feel for how they work; having us spoon-feed you the answers without you putting in the effort doesn’t help you in the long term and is uninteresting and annoying for us.
-
As a parting shot, the original poster wrote,
I installed Sublime Text, put in the expression and it works flawlessly with all the files. Out of curiosity, I put in my original expression that I came up with before posting in the original thread and it works flawlessly in Sublime Text as well. So in the end, this is a notepad++ issue.
Do any of the regulars have enough experience with Sublime to know how it’s implemented differently from Notepad++, that might allow it to not complain with that regex in huge file, whereas Notepad++ does complain?
Is Sublime just better at chunking large files so that the memory never blows up?
Or does it use a different regex engine that’s more performant on large files?
Or do we think that the reports of errors were due to user error, not an actual issue with Notepad++?
If we can come up with an easy-to-reproduce large data file, and a simplified regex that works fine on Sublime and works fine on a small dataset in Notepad++, but blows up with a large dataset in Notepad++, we might be able to put in an issue – pointing out that the competition works better in that instance might light a fire under the developers, to see if they can improve that performance, which might in turn improve large-file search/replace performance in general, which would be a boon to all large-file users of Notepad++.
-
@PeterJones It looks like they use Boost for the main finder. This Sublime Text forum post from 2022 seems to confirm that. That editor uses the Oniguruma regex library for parsing its configuration files.
Sublime Text has a portable edition available but I’m not sufficiently invested in the OP’s problem to verify that a particular regexp and set of data fails in Notepad++ but works in Sublime Text. I ran the portable edition and verified that simple multi-line spanning using
[^x]*
regexp works as that was my first guess as to a possible difference.
In 2016 there was mention of a custom regex engine but I did not dig further to see what that was about.
I found I liked the user interface of that editor’s find and search/replace box. It uses icons with hover text meaning there is not as much visual clutter as Notepad++. It’s also docked to the bottom. I did not experiment to see if it can be undocked as I suspect some people like Notepad++'s ability to have the finder box outside of the text window.
-
@PeterJones said in Help with regular expression:
Or does it use a different regex engine that’s more performant on large files?
Documentation here states, “Sublime Text uses the Perl Compatible Regular Expressions (PCRE) engine from the Boost library.”
Since Sublime Text is not an open source project, we have no way of knowing if they have modified the Boost.Regex engine in some way.
I’ve looked into this error message a bit, but I haven’t learned enough yet to understand what is happening. What I think I know thus far is that the exception leading to this message is thrown when the regex engine determines that its set of internal states is getting very large, and/or that it is growing out of proportion to the length of text it is examining. I believe this sort of thing happens when you have multiple quantifiers that can each lead to as many alternatives as there are characters in the text (e.g., *, +, *? or +?; but not *+ or ++) and there are long stretches of text that do not match (but can take many — probably millions — of attempts to determine that).
There is a configuration define, BOOST_REGEX_MAX_STATE_COUNT, which has an effect (the details of which I have not yet worked out) on this. Unless I’m missing something, Notepad++ does not set this, leaving it at the Boost.Regex default.
-
Hello, @a-former-user, @terry-r, @peterjones, @Coises, @mark-olson and All
Ah… another interesting topic, regarding two different perspectives !
-
First, I will talk about some technical points
-
Secondly, I’ll give my opinion about the original poster’s behaviour
To begin with, a general remark :
As this regex S/R simply concerns a suppression of some characters in one single line, at a time, it should work nicely in all cases ! Of course, depending on the total size of the file, the operation can take some minutes but it should necessarily end up correctly ! Furthermore, the message :
The complexity of matching the regular expression exceeded predefined bounds. Try refactoring the regular expression to make each choice made by the state machine unambiguous. This exception is thrown to prevent “eternal” matches that take an indefinite period time to locate.
Should not appear for this simple line by line replacement !
So, what happened ? Did you notice that, in all the regex’s syntaxes provided, no one thought about adding leading modifiers ! And, given that, in his first post, @a-former-user said :
To accomplish this, I have tried using the following Find/Replace expressions and settings
Find What =
(?<=")[^"]+\.latest:(?=/)
Replace With =
Search Mode = REGULAR EXPRESSION
Dot Matches Newline = Yes
May be, the fact that the
. Matches newline
option was checked, led to the The complexity… message ??
Moreover, when modifying a very big file, I think it’s best to :
-
First stop and restart Notepad++
-
Change, temporarily, the file extension to
.txt
, in order to avoid any lexing behaviour !
-
Not use the
Word wrap
feature ( Not totally sure ! )
Now, I, personally, did a try with this regex S/R :
-
SEARCH
(?-is)(?<=" : ").+\.latest:/
-
REPLACE
Leave EMPTY
I created an INPUT file with the following template :
1,000
times the text below :
{ "id" : "file.name.latest:/Custom/path/file.exe" "uid" : "file.name.latest:/Custom/path/file.exe" "sid" : "file.name.latest:/Custom/path/file.exe" }
Then, the part :
{ "t" : "13.68", "v" : "-0.7822026", "o" : "-0.7819693" },
- And this set of
6,006
lines was repeated, itself,
1,000
times.
So, I got a total file of
157,066,000
bytes, containing
6,006,000
lines, of which
3,000,000
lines contained the part to delete
-
Running, first, this regex S/R, below, against this INPUT file, with my old
Win XP
machine with
1 Gb
of RAM :
-
SEARCH
(?-is)(?<=" : ").+\.latest:/
-
REPLACE
Leave EMPTY
-
=> The S/R was complete after
5m
and
45s
with the message
Replace All: 3000000 occurrences were replaced from caret to end-of-file
- Now, running this regex S/R, against this same file, with my recent
Windows 10
laptop, with
16 Gb
of RAM :
=> The S/R was complete after
67s
, with the same message ! The modified file is now
103,066,000
bytes long for
6,006,000
lines
Admittedly, this file is still half the size of @a-former-user’s file but, with his powerful machine, he should have been able to get the correct replacements !
Now, some thoughts about the @a-former-user’s behaviour. I hope that the
DeepL
translation will be accurate enough to correctly express my feelings !
Peter, I understand your disappointment : you’ve all been working hard to find solutions, while @a-former-user was simply waiting for THE perfect solution, for which he also had to receive financial compensation! Very frustrating for all of us!
Now, let’s ask ourselves: to what extent can we be sure that the solutions provided, as far as regular expressions are concerned, will be used by users solely, for their own personal needs? I, myself, have, on few occasions, had the opportunity to send back, by email, certain corrected files, of which, of course, I was unaware of the subsequent use that was made of them !
So I think it’s pointless to take offence at @a-former-user’s attitude. We have no control over the future of the proposed solutions. We’re just doing the best we can, free of charge, and that’s all there is to it ! We just have to remember that, of all the contributions we make, a very small number will eventually end in a transfer of money.
Of course, IF we had been able to find a solution that satisfied @a-former-user, it would have been elegant, and rather “fair-play”, on his part, in the event of a real financial reward, to make a donation to @Don-Ho of all or part of it !
Let’s not be bitter ! It’s just our privilege, like many other sites, to provide free information to as many people as possible. If some people do not want to play the game, we’ll let them.
Best Regards,
guy038
-
-
@guy038 said in Help with regular expression:
May be, the fact that the . Matches newline option was checked, led to the The complexity… message ?
All documentation I’ve seen says that option is equivalent to setting the “default” condition of the (?s)/(?-s) flag; and, in turn, that flag affects exactly one thing: whether or not an unescaped period matches end-of-line characters.
There are no unescaped periods in the regular expressions under consideration here.
I was not successful in constructing a file similar to the original poster’s examples that caused the complexity message to appear with the expressions considered in this thread. However, the original poster’s response to my last proposal — that it found no matches — tells me his data was different from the samples he showed us in some way that influenced matching. I’m reasonably confident that the complexity message nearly always appears when there are very many possibilities for a match to fail between successes. Regardless of the (?s) flag, these expressions (aside from my last suggestion) could match across lines. Without knowing how or how effectively Boost.Regex optimizes, it makes sense that the number of trials could become absurd if, for example, there were many thousands of successive lines containing no quote marks.
-
@Coises
If the regex they were using included
[^"]
, which can match newlines, it seems plausible that their issues could happen whether or not
. Matches Newline
was checked.
-
Hi, @mark-olson and All,
Oh…, my God : you’re a thousand times right, Mark ! Of course, a negated character class like
[^...]
does not depend at all on the status of the
. matches newline
option, nor on the possible leading modifiers
(?s)
and
(?-s)
. I should have remembered that !
Indeed, place this simple text in a new tab :
Test in order to verify the maximum range of characters matched by the regex expression
And use, for example, any of the regexes below :
-
With the
. matches newline
option unchecked
- SEARCH [^=]+.+[^=]+
- SEARCH (?-s)[^=]+.+[^=]+
- SEARCH (?s)[^=]+.+[^=]+
-
With the
. matches newline
option checked
- SEARCH [^=]+.+[^=]+
- SEARCH (?-s)[^=]+.+[^=]+
- SEARCH (?s)[^=]+.+[^=]+
You can easily verify that, whatever the regex used, all text, from the very beginning to its very end, is always matched !
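The same behaviour can be illustrated outside Notepad++. The sketch below uses Python's `re` module, not the Boost engine, so it only shows the general rule that a negated character class matches line breaks whatever the dot setting is; `TEXT` is a hypothetical three-line sample without any `=` character, and `matched_text` is a hypothetical helper.

```python
import re

# Hypothetical multi-line sample; it contains no "=" character.
TEXT = "Test in order\nto verify the maximum range\nof characters matched"

PATTERN = r"[^=]+.+[^=]+"


def matched_text(flags: int) -> str:
    """Return the text matched by PATTERN with the given flags."""
    match = re.search(PATTERN, TEXT, flags)
    return match.group(0) if match else ""


if __name__ == "__main__":
    # flags=0 is like ". matches newline" unchecked,
    # re.DOTALL is like having it checked.
    for flags in (0, re.DOTALL):
        print(matched_text(flags) == TEXT)
```

In both cases the two `[^=]+` parts are free to cross line breaks, so the whole text is matched even when the dot itself cannot match a newline.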
Remark :
In fact, as the in-line modifiers override the
. matches newline
option, there are only four distinct cases :
-
SEARCH
[^=]+.+[^=]+
with the
. matches newline
option unchecked
-
SEARCH
[^=]+.+[^=]+
with the
. matches newline
option checked
-
SEARCH
(?-s)[^=]+.+[^=]+
-
SEARCH
(?s)[^=]+.+[^=]+
Thus, in my previous post, I should have expressed myself :
May be, the fact, that a negative class character is used (
[^ ...]
syntax ), led to the The complexity… message ??
Best Regards,
guy038
-