Help with regular expression
-
@Ben-1 said in Help with regular expression:
It would take thousands of hours with the amount of data and files there are. I think I am just going to tell this client I cannot figure out how to achieve what he wants and not bill him. I’ve already put hours of my personal time trying to figure it out and I thought it’d just be a pretty simple notepad++ mass edit.
Wait, what?! If we had given you a working regex, you would have billed a client and gotten paid for the work we did? Yeesh.
-
I installed Sublime Text, put in the expression and it works flawlessly with all the files. Out of curiosity, I put in my original expression that I came up with before posting in the original thread and it works flawlessly in Sublime Text as well. So in the end, this is a notepad++ issue.
You are being ridiculous. I work at a computer repair company and a customer wanted help with something, I told him I’d look into it, couldn’t figure it out so I sought help. Newsflash… that’s what anyone you are paying to do any task is going to do when they hit a roadblock. YOU decide to be a part of an online forum and help people for free. If you don’t enjoy doing that, don’t do it.
Thanks for your efforts and help but seriously the interactions I had on this forum were beyond strange and over controlling. I’m out of here.
-
@A-Former-User said in Help with regular expression:
the interactions I had on this forum were beyond strange and over controlling
For future readers, to sum up:
- This individual came, asking us to do his job for him without telling us that’s what he was doing.
- His question wasn’t asked well enough that he got any upvotes, but he got answers which matched our best guess for what he was trying to do. This is trying to be helpful, not controlling.
- When he tried to reply, it triggered a problem with the forum software (which is provided to this Community for free, and we have no control over how it’s implemented), so he decided to delete his account because he couldn’t figure out how to delete a link from his reply.
- When he asked his question as a new user, it was pointed out that it was a continuation of the original question (to better help people who were answering to have the full context for his question); and I went to the effort to explain to him how he could avoid that problem; and a user upvoted his explanation of why he re-created his account, in order to make it easier for him to post. This was me trying to be helpful, not controlling.
- In both the original topic and again in this follow-on topic, when it was obvious that he was having difficulty expressing his question in a way that we could supply him with an answer that would solve his problem, I suggested he read the FAQs, and follow their instructions for how to ask his question better. This was not to “control” him, but rather so that he could better get his ideas across, so that we’d be able to better help him, and to avoid wasting his time (and ours) by continuing to post without formatting the data in a way that it comes across correctly. (And earlier, I had even reformatted a post for him, to help him better communicate his ideas.)
- Once he was able to communicate his problem, we were able to give him an expression which worked, both on the example data he gave us, and reportedly on the initial data he was testing it on.
- As he started to change the original parameters to the question, we still continued to provide alternate solutions.
- When he hit a limitation for the file size, there were still multiple attempts to help him make it work.
- But at this point, it was revealed that there were hidden constraints which he hadn’t mentioned to us, which made the data look quite different from the simple examples, which were exacerbating the difficulty with the file size… but attempts were still made to help.
- When he gave up, he revealed that he was actually trying to earn money by having us solve this problem for him, and then tried to blame us (me) for being controlling when I highlighted the fact that he was trying to earn money on our free labor.
Please understand: if your job includes using Notepad++, and you have a Notepad++ question, you are still allowed to ask it. But when you make us go through so many iterations, and keep on changing the parameters of the question, then in the end reveal that all of this was solely so that you could hopefully earn some extra cash, it’s bound to raise some hackles. Be honest up front, respond to questions, and show a willingness to learn, and we will bend over backward to try to help. Violate that trust, and we will sour on helping you.
----
Please note: This Community Forum is not a data transformation service; you should not expect to be able to always say “I have data like X and want it to look like Y” and have us do all the work for you. If you are new to the Forum, and new to regular expressions, we will often give help on the first one or two data-transformation questions, especially if they are well-asked and you show a willingness to learn; and we will point you to the documentation where you can learn how to do the data transformations for yourself in the future. But if you repeatedly ask us to do your work for you, you will find that the patience of usually-helpful Community members wears thin. The best way to learn regular expressions is by experimenting with them yourself, and getting a feel for how they work; having us spoon-feed you the answers without you putting in the effort doesn’t help you in the long term and is uninteresting and annoying for us.
-
As a parting shot, the original poster wrote,
I installed Sublime Text, put in the expression and it works flawlessly with all the files. Out of curiosity, I put in my original expression that I came up with before posting in the original thread and it works flawlessly in Sublime Text as well. So in the end, this is a notepad++ issue.
Do any of the regulars have enough experience with Sublime to know how it’s implemented differently from Notepad++, in a way that might allow it to not complain about that regex in a huge file, whereas Notepad++ does complain?
Is Sublime just better at chunking large files so that the memory never blows up?
Or does it use a different regex engine that’s more performant on large files?
Or do we think that the reports of errors were due to user error, not an actual issue with Notepad++?
If we can come up with an easy-to-reproduce large data file, and a simplified regex that works fine on Sublime and works fine on a small dataset in Notepad++, but blows up with a large dataset in Notepad++, we might be able to put in an issue – pointing out that the competition works better in that instance might light a fire under the developers, to see if they can improve that performance, which might in turn improve large-file search/replace performance in general, which would be a boon to all large-file users of Notepad++.
-
@PeterJones It looks like they use Boost for the main finder. This Sublime Text forum post from 2022 seems to confirm that. That editor uses the Oniguruma regex library for parsing its configuration files.
Sublime Text has a portable edition available but I’m not sufficiently invested in the OP’s problem to verify that a particular regexp and set of data fails in Notepad++ but works in Sublime Text. I ran the portable edition and verified that simple multi-line spanning using the [^x]* regexp works, as that was my first guess as to a possible difference.
In 2016 there was mention of a custom regex engine but I did not dig further to see what that was about.
I found I liked the user interface of that editor’s find and search/replace box. It uses icons with hover text meaning there is not as much visual clutter as Notepad++. It’s also docked to the bottom. I did not experiment to see if it can be undocked as I suspect some people like Notepad++'s ability to have the finder box outside of the text window.
-
@PeterJones said in Help with regular expression:
Or does it use a different regex engine that’s more performant on large files?
Documentation here states, “Sublime Text uses the Perl Compatible Regular Expressions (PCRE) engine from the Boost library.”
Since Sublime Text is not an open source project, we have no way of knowing if they have modified the Boost.Regex engine in some way.
I’ve looked into this error message a bit, but I haven’t learned enough yet to understand what is happening. What I think I know thus far is that the exception leading to this message is thrown when the regex engine determines that its set of internal states is getting very large, and/or that it is growing out of proportion to the length of text it is examining. I believe this sort of thing happens when you have multiple quantifiers that can each lead to as many alternatives as there are characters in the text (e.g., *, +, *? or +?; but not *+ or ++) and there are long stretches of text that do not match (but can take many — probably millions — of attempts to determine that).
There is a configuration define, BOOST_REGEX_MAX_STATE_COUNT, which has an effect (the details of which I have not yet worked out) on this. Unless I’m missing something, Notepad++ does not set this, leaving it at the Boost.Regex default.
-
Hello, @a-former-user, @terry-r, @peterjones, @Coises, @mark-olson and All
Ah… another interesting topic, regarding two different perspectives !
- First, I will talk about some technical points
- Secondly, I’ll give my opinion about the original poster’s behaviour
To begin with, a general remark :
As this regex S/R simply concerns a suppression of some characters in one single line, at a time, it should work nicely in all cases ! Of course, depending on the total size of the file, the operation can take some minutes but it should necessarily end up correctly ! Furthermore, the message :
The complexity of matching the regular expression exceeded predefined bounds. Try refactoring the regular expression to make each choice made by the state machine unambiguous. This exception is thrown to prevent “eternal” matches that take an indefinite period time to locate.
Should not appear for this simple line by line replacement !
So, what happened ? Did you notice that, in all the regex’s syntaxes provided, no one thought about adding leading modifiers ! And, given that, in his first post, @a-former-user said :
To accomplish this, I have tried using the following Find/Replace expressions and settings
Find What =
(?<=")[^"]+\.latest:(?=/)
Replace With =
Search Mode = REGULAR EXPRESSION
Dot Matches Newline = Yes
May be, the fact that the . Matches newline option was checked, led to the The complexity… message ??
Moreover, when modifying a very big file, I think it’s best to :
- First stop and restart Notepad++
- Change, temporarily, the file extension to .txt, in order to avoid any lexing behaviour !
- Not use the Word wrap feature ( Not totally sure ! )
Now, I, personally, ran a test with this regex S/R :
- SEARCH (?-is)(?<=" : ").+\.latest:/
- REPLACE Leave EMPTY
I created an INPUT file with the following template :
- 1,000 times the text below :
{ "id" : "file.name.latest:/Custom/path/file.exe" "uid" : "file.name.latest:/Custom/path/file.exe" "sid" : "file.name.latest:/Custom/path/file.exe" }
- Then, the part :
{ "t" : "13.68", "v" : "-0.7822026", "o" : "-0.7819693" },
- And this set of 6,006 lines was repeated, itself, 1,000 times.
So, I got a total file of 157,066,000 bytes, containing 6,006,000 lines, of which 3,000,000 lines contained the part to delete
- Running, first, this regex S/R, below, against this INPUT file, with my old Win XP machine with 1 GB of RAM :
  - SEARCH (?-is)(?<=" : ").+\.latest:/
  - REPLACE Leave EMPTY
=> The S/R was complete after 5m and 45s with the message Replace All: 3000000 occurrences were replaced from caret to end-of-file
- Now, running this regex S/R, against this same file, with my recent Windows 10 laptop, with 16 GB of RAM :
=> The S/R was complete after 67s, with the same message ! The modified file is now 103,066,000 bytes long for 6,006,000 lines
Admittedly, this file is still half the size of @a-former-user’s file but, with his strong configuration, he should have managed to get the correct replacements !
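As a rough cross-check of the figures above, here is a minimal Python sketch. It uses Python's `re` module only as a stand-in, not Notepad++ or Boost.Regex, so it says nothing about speed or about the complexity message. The leading `(?-is)` is dropped because Python does not accept that global form and its defaults already behave that way. The one-key-per-line layout of `sample` is an assumption, since the post does not show the line breaks.

```python
import re

# Stand-in for (?-is)(?<=" : ").+\.latest:/ ; Python's defaults are already
# case-sensitive, with "." not matching newlines.
PATTERN = re.compile(r'(?<=" : ").+\.latest:/')

# Assumed layout: one key per line (not shown in the original post).
sample = (
    '"id" : "file.name.latest:/Custom/path/file.exe"\n'
    '"uid" : "file.name.latest:/Custom/path/file.exe"\n'
    '"sid" : "file.name.latest:/Custom/path/file.exe"\n'
)

# Replace every match with the empty string, as in the S/R above.
result, count = PATTERN.subn("", sample)
removed_per_match = (len(sample) - len(result)) // count

# File sizes and replacement count reported in the post.
size_before = 157_066_000
size_after = 103_066_000
replacements = 3_000_000
bytes_per_replacement = (size_before - size_after) // replacements

print(result)
print(count, removed_per_match, bytes_per_replacement)
```

Each match removes the text `file.name.latest:/`, and the number of characters removed per match in the sample agrees with the per-replacement size difference implied by the reported file sizes.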
Now, some thoughts about the @a-former-user’s behaviour. I hope that the DeepL translation will be accurate enough to correctly express my feelings !
Peter, I understand your disappointment : you’ve all been working hard to find solutions, while @a-former-user was simply waiting for THE perfect solution, for which he also had to receive financial compensation! Very frustrating for all of us!
Now, let’s ask ourselves: to what extent can we be sure that the solutions provided, as far as regular expressions are concerned, will be used by users solely for their own personal needs? I, myself, have, on a few occasions, had the opportunity to send back, by email, certain corrected files, of which, of course, I was unaware of the subsequent use that was made of them !
So I think it’s pointless to take offence at @a-former-user’s attitude. We have no control over the future of the proposed solutions. We’re just doing the best we can, free of charge, and that’s all there is to it ! We just have to remember that, of all the contributions we make, a very small number will eventually end in a transfer of money.
Of course, IF we had been able to find a solution that satisfied @a-former-user, it would have been elegant, and rather “fair-play”, on his part, in the event of a real financial reward, to make a donation to @Don-Ho of all or part of it !
Let’s not be bitter ! It’s just our privilege, like many other sites, to provide free information to as many people as possible. If some people do not want to play the game, we’ll let them.
Best Regards,
guy038
-
-
@guy038 said in Help with regular expression:
May be, the fact that the . Matches newline option was checked, led to the The complexity… message ?
All documentation I’ve seen says that option is equivalent to setting the “default” condition of the (?s)/(?-s) flag; and, in turn, that flag affects exactly one thing: whether or not an unescaped period matches end-of-line characters.
There are no unescaped periods in the regular expressions under consideration here.
I was not successful in constructing a file similar to the original poster’s examples that caused the complexity message to appear with the expressions considered in this thread. However, the original poster’s response to my last proposal — that it found no matches — tells me his data was different from the samples he showed us in some way that influenced matching. I’m reasonably confident that the complexity message nearly always appears when there are very many possibilities for a match to fail between successes. Regardless of the (?s) flag, these expressions (aside from my last suggestion) could match across lines. Without knowing how or how effectively Boost.Regex optimizes, it makes sense that the number of trials could become absurd if, for example, there were many thousands of successive lines containing no quote marks.
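To illustrate the point that these expressions can match across lines regardless of the (?s) setting, here is a small Python sketch. Python's `re` module is only a stand-in: it is a backtracking engine but not Boost.Regex, and it has no complexity limit, so this shows the matching semantics only. The `text` value is hypothetical data, made up to have a quote mark followed by lines with no further quote marks.

```python
import re

# The original poster's Find What expression.
PATTERN = r'(?<=")[^"]+\.latest:(?=/)'

# Hypothetical data: after the opening quote mark there is no other quote
# mark until the end, so [^"]+ is free to run across the line breaks.
text = '"first line\nsecond line\nfile.name.latest:/Custom/path/file.exe"'

plain = re.search(PATTERN, text)              # "." does not match newline
dotall = re.search(PATTERN, text, re.DOTALL)  # "." matches newline

print(repr(plain.group(0)))
print(plain.group(0) == dotall.group(0))
```

The match starts right after the first quote mark and runs through both line breaks, with or without the dot-matches-newline flag. On real data with thousands of quote-free lines, the engine would have to scan and then backtrack over all of them for each candidate starting point.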
-
@Coises
If the regex they were using included [^"], which can match newlines, it seems plausible that their issues could happen whether or not . Matches Newline was checked.
-
Hi, @mark-olson and All,
Oh…, my God : you’re a thousand times right, Mark ! Of course, a negated character class like [^...] does not depend at all on the status of the . matches newline option, nor on the possible leading modifiers (?s) and (?-s). I should have remembered that !
Indeed, place this simple text in a new tab :
Test in order to verify the maximum range of characters matched by the regex expression
And use, for example, any of the regexes below :
- With the . matches newline option unchecked
  - SEARCH [^=]+.+[^=]+
  - SEARCH (?-s)[^=]+.+[^=]+
  - SEARCH (?s)[^=]+.+[^=]+
- With the . matches newline option checked
  - SEARCH [^=]+.+[^=]+
  - SEARCH (?-s)[^=]+.+[^=]+
  - SEARCH (?s)[^=]+.+[^=]+
You can easily verify that, whatever the regex used, all text, from the very beginning to its very end, is always matched !
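The same check can be sketched outside Notepad++ with Python's `re` module, which is only a stand-in for Boost.Regex here. The line breaks in `text` are an assumption, because the post only shows the words of the test text. Python does not accept a leading `(?-s)`, so the `re.DOTALL` flag plays the role of the . matches newline option.

```python
import re

# Assumed line breaks; the original post only shows the words.
text = (
    "Test in order\n"
    "to verify the maximum range\n"
    "of characters matched\n"
    "by the regex expression"
)

PATTERN = r"[^=]+.+[^=]+"

# Comparable to ". matches newline" unchecked and checked, respectively.
unchecked = re.search(PATTERN, text)
checked = re.search(PATTERN, text, re.DOTALL)

print(unchecked.group(0) == text)
print(checked.group(0) == text)
```

Both comparisons print True: the negated class [^=] consumes the line breaks on its own, so the dot only ever has to match ordinary characters near the end of the text.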
Remark :
In fact, as the in-line modifiers override the . matches newline option, there are only four distinct cases :
- SEARCH [^=]+.+[^=]+ with the . matches newline option unchecked
- SEARCH [^=]+.+[^=]+ with the . matches newline option checked
- SEARCH (?-s)[^=]+.+[^=]+
- SEARCH (?s)[^=]+.+[^=]+
Thus, in my previous post, I should have expressed myself :
May be, the fact, that a negated character class is used ( [^ ...] syntax ), led to the The complexity… message ??
Best Regards,
guy038
-