A new bug found 180304
-
Hello @guy038 , long time no see. I found a new bug today. Here’s the bug:
If we have textd
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Which is, exact 5 lines, each ending with
\r\nexcept line 5.
line 1: 1 letterd
line 2: 197 lettera
line 3: 823 lettera
line 4:empty
line 5: 12 letteraAnd the regx is
d(.*\R.*){1,5}c. Of course . match newline is not checkedObviously, no match should be found, because there is no letter
cin the text. But in fact, when I press Find, after a few seconds busy running, it will match the whole text. Even if I insert another letter in front ofd, it will still match the whole text.OK. 2 more detailes of this bug.
- If you replace the
\Rwith\r\nin the regx, you’ll still get the same bug if you add moreato each line. So this bug is not caused by\R. - Why 197, 823 and 12? If you remove 1
afrom any of line 2, 3 or 5, then the search result will be no match. However, the searching time is never short(I understand this is not a very efficient regx expression).
I have done all the experiments on the newest release of NPP, and the François-R Boyer version(which is NPP 6.9.0 with some modification) that you introduced in this post a year ago.
I want know what could be the problem. If it is because the text is too long, I believe there should be some document or statement that declares this limitation.
Why does regx has so many bugs? Is it too difficult to implement? Is there any more stable platform that can execute regx? I know there are some websites that we can execute regx, but we cannot use the search in files functionality of NPP, and I am not sure they’re bugless.Maybe you are curious how I came across this bug. Here’s the story. I want to search for the whole article 2 key words,
dandc, with the restriction thatcis followingdbut it’s not farther than a few lines. So i used regx liked(.*\R.*){1,2}c, where2is increasing. It was fine from 1 to 4, but when I searched withd(.*\R.*){1,5}c, it started to match the whole text.
The text I searched was 2000 lines. I reduce the text little by little, and finally reduce it to 5 lines.Best Regards!
- If you replace the
-
Hello, @古旮 and all
First of all, just read this short reply to @marc-lalonde, below :
Obviously, in your example, there is no recursion feature, nor big amounts of text ! But, I think that all the troubles comes from the
.*\R.*syntaxI succeeded to simplify the problem ! Let start with the following text :
Line 1 :
1letterd, with its line-break
Line 2 :14000lettersa, with its line -break, tooAnd let discuss about this similar regex :
d(.*\R.*){1,2}cAllow me to use the
(?-s)modifier to ensures that the dot will match standard chars, only ! Then, my regexd(.*\R.*){1,2}ccan be re-written :(?-s)d(.*\R.*)(.*\R.*)c( Regex R2 )And, if NO match can be found, the regex engine goes on and, then, tries the regex :
(?-s)d(.*\R.*)c( Regex R1 )Finally, if a match still cannot be found, the Find dialog displays the message Find: Can’t find the text “(?-s)d(.*\R.*){1,2}c”
Well, let’s go back to my example ! To begin with, just create a new line
3with the four letterscdef, only ( IMPORTANT )Now, let’s try the regex
(?-s)d(.*\R.*){1,2}cagainst my text : the regex engine tries the regex R2, first, which does match, immediately, all lettersa, between the lettersdandcincluded !d : Letter d FIRST block .*\R.* : Nothing + \R + 14500 letters a SECOND block .*\R.* : Nothing + \R + Nothing c : Letter cNow, get rid of the string
cdef, in line3and re-try the regex(?-s)d(.*\R.*){1,2}c. This time, troubles begin and, as you said, after 8s about, it wrongly selects all file contents !Then, reduce the number of letters a, in line
2, to14000letters. This time, after 8s, as expected, the Find dialog answers :Find: Can’t find the text “(?-s)d(.*\R.*){1,2}c”
IMPORTANT : Depending of your configuration, and the amount of memory, on your laptop, the limit (
14000-14500) may be quite different than mime, but should occur, anyway !So, how to explain this difference ? Well, at first, as the quantifiers
*are greedy, the regex engine tries the case :d : Letter d FIRST block .*\R.* : Nothing + \R + 14500 letters a SECOND block .*\R.* : Nothing + \R + Nothing c : MISSINGNO match can be found and, as the regex could be rewritten
(?-s)d.*\R.*.*\R.*c, the regex engine, still keeping the regex R2, then, begins to backtrack and tries this other configuration :d : Letter d FIRST block .*\R.* : Nothing + \R + 14499 letters a SECOND block .*\R.* : 1 letter a + \R + Nothing c : MISSINGOf course, as the letter
cis always missing, then it continues to backtrack and chooses to test :d : Letter d FIRST block .*\R.* : Nothing + \R + 14498 letters a SECOND block .*\R.* : 2 letters a + \R + Nothing c : MISSING… and going on, testing
14499cases, trying to reach the last case :d : Letter d FIRST block .*\R.* : Nothing + \R + Nothing SECOND block .*\R.* : 14500 letters a + \R + Nothing c : MISSINGUnfortunately, while testing all the combinations, a catastrophic backtracking error occurred and the regex engine wrongly matches all file contents :-((
Personally, I would advice you to strongly avoid regexes like
(.*)(.*)or(.*)+or even(x+x+)+y!!
Finally, as you said :
I want to search for the whole article 2 key words, d and c, with the restriction that c is following d but it’s not farther than a few lines
I think, @古旮, that the right regex should be, simply,
(?-s)d(.*\R){1,5}c. For instance, this regex would match the text below :d line 1 line 2 line 3 line 4 cbut would not match the following one :
d line 1 line 2 line 3 line 4 line 5 line 6 cI, personally, did a test with the regex
(?-s)d(.*\R){1,2}cand the following textLine 1 :
1letterd+ its line-break
Line 2 :100,000lettersa+ its line-breakIt correctly displays the message Find: Can’t find “(?-s)d(.*\R){1,2}c”
Now, adding a line
3with the stringcdef, without any line-break, it selected, as expected, all text between the first chardand the singlecletter, leaving the final def string unselected !Best Regards,
guy038
P.S. :
Refer, also, to this article, about catastrophic backtracking, by Jan Goyvaerts,
THEdefinitive regexGURU: -
Thanks for the reply. And yes, it seems to be the Catastrophic Backtracking thing. And it seems this stackoverflow exception is not caught.
The regex should be(?-s)d(.*\R){1,5}?.*cbecausecis not always the first of a line.
I know I should avoid low efficient regex expressions, but it is buggy to directly show a wrong result instead of telling me the limitation is reached.