A new bug found 180304

古旮

Hello @guy038 , long time no see. I found a new bug today. Here’s the bug:
If we have text

d
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

aaaaaaaaaaaa

Which is, exact 5 lines, each ending with \r\n except line 5.
line 1: 1 letter d
line 2: 197 letter a
line 3: 823 letter a
line 4: empty
line 5: 12 letter a

And the regx is d(.*\R.*){1,5}c. Of course . match newline is not checked

Obviously, no match should be found, because there is no letter c in the text. But in fact, when I press Find, after a few seconds busy running, it will match the whole text. Even if I insert another letter in front of d, it will still match the whole text.

OK. 2 more detailes of this bug.

If you replace the \R with \r\n in the regx, you’ll still get the same bug if you add more a to each line. So this bug is not caused by \R.
Why 197, 823 and 12? If you remove 1 a from any of line 2, 3 or 5, then the search result will be no match. However, the searching time is never short(I understand this is not a very efficient regx expression).

I have done all the experiments on the newest release of NPP, and the François-R Boyer version(which is NPP 6.9.0 with some modification) that you introduced in this post a year ago.
I want know what could be the problem. If it is because the text is too long, I believe there should be some document or statement that declares this limitation.
Why does regx has so many bugs? Is it too difficult to implement? Is there any more stable platform that can execute regx? I know there are some websites that we can execute regx, but we cannot use the search in files functionality of NPP, and I am not sure they’re bugless.

Maybe you are curious how I came across this bug. Here’s the story. I want to search for the whole article 2 key words, d and c, with the restriction that c is following d but it’s not farther than a few lines. So i used regx like d(.*\R.*){1,2}c, where 2 is increasing. It was fine from 1 to 4, but when I searched with d(.*\R.*){1,5}c, it started to match the whole text.
The text I searched was 2000 lines. I reduce the text little by little, and finally reduce it to 5 lines.

Best Regards!

guy038

Hello, @古旮 and all

First of all, just read this short reply to @marc-lalonde, below :

https://notepad-plus-plus.org/community/topic/15247/replacing-duped-words-across-a-block-block-of-text-respecting/23

Obviously, in your example, there is no recursion feature, nor big amounts of text ! But, I think that all the troubles comes from the .*\R.* syntax

I succeeded to simplify the problem ! Let start with the following text :

Line 1 : 1 letter d, with its line-break
Line 2 : 14000 letters a, with its line -break, too

And let discuss about this similar regex :d(.*\R.*){1,2}c

Allow me to use the (?-s) modifier to ensures that the dot will match standard chars, only ! Then, my regex d(.*\R.*){1,2}c can be re-written :

(?-s)d(.*\R.*)(.*\R.*)c ( Regex R2 )

And, if NO match can be found, the regex engine goes on and, then, tries the regex :

(?-s)d(.*\R.*)c ( Regex R1 )

Finally, if a match still cannot be found, the Find dialog displays the message Find: Can’t find the text “(?-s)d(.*\R.*){1,2}c”

Well, let’s go back to my example ! To begin with, just create a new line 3 with the four letters cdef, only ( IMPORTANT )

Now, let’s try the regex (?-s)d(.*\R.*){1,2}c against my text : the regex engine tries the regex R2, first, which does match, immediately, all letters a, between the letters d and c included !

d                     :   Letter d

FIRST  block .*\R.*   :   Nothing   +   \R   +  14500 letters a

SECOND block .*\R.*   :   Nothing   +   \R   +  Nothing

c                     :   Letter c

Now, get rid of the string cdef, in line 3 and re-try the regex (?-s)d(.*\R.*){1,2}c. This time, troubles begin and, as you said, after 8s about, it wrongly selects all file contents !

Then, reduce the number of letters a, in line 2, to 14000 letters. This time, after 8s, as expected, the Find dialog answers :

Find: Can’t find the text “(?-s)d(.*\R.*){1,2}c”

IMPORTANT : Depending of your configuration, and the amount of memory, on your laptop, the limit ( 14000 - 14500 ) may be quite different than mime, but should occur, anyway !

So, how to explain this difference ? Well, at first, as the quantifiers * are greedy, the regex engine tries the case :

d                     :   Letter d

FIRST  block .*\R.*   :   Nothing   +   \R   +  14500 letters a

SECOND block .*\R.*   :   Nothing   +   \R   +  Nothing

c                     :   MISSING

NO match can be found and, as the regex could be rewritten (?-s)d.*\R.*.*\R.*c, the regex engine, still keeping the regex R2, then, begins to backtrack and tries this other configuration :

d                     :   Letter d

FIRST  block .*\R.*   :   Nothing      +   \R   +  14499 letters a

SECOND block .*\R.*   :   1 letter a   +   \R   +  Nothing

c                     :   MISSING

Of course, as the letter c is always missing, then it continues to backtrack and chooses to test :

d                     :   Letter d

FIRST  block .*\R.*   :   Nothing       +   \R   +  14498 letters a

SECOND block .*\R.*   :   2 letters a   +   \R   +  Nothing

c                     :   MISSING

… and going on, testing 14499 cases, trying to reach the last case :

d                     :   Letter d

FIRST  block .*\R.*   :   Nothing           +   \R   +  Nothing

SECOND block .*\R.*   :   14500 letters a   +   \R   +  Nothing

c                     :   MISSING

Unfortunately, while testing all the combinations, a catastrophic backtracking error occurred and the regex engine wrongly matches all file contents :-((

Personally, I would advice you to strongly avoid regexes like (.*)(.*) or (.*)+ or even (x+x+)+y !!

Finally, as you said :

I want to search for the whole article 2 key words, d and c, with the restriction that c is following d but it’s not farther than a few lines

I think, @古旮, that the right regex should be, simply, (?-s)d(.*\R){1,5}c. For instance, this regex would match the text below :

d
line 1
line 2
line 3
line 4
c

but would not match the following one :

d
line 1
line 2
line 3
line 4
line 5
line 6
c

I, personally, did a test with the regex (?-s)d(.*\R){1,2}c and the following text

Line 1 : 1 letter d + its line-break
Line 2 : 100,000 letters a + its line-break

It correctly displays the message Find: Can’t find “(?-s)d(.*\R){1,2}c”

Now, adding a line 3 with the string cdef, without any line-break, it selected, as expected, all text between the first char d and the single c letter, leaving the final def string unselected !

Best Regards,

guy038

P.S. :

Refer, also, to this article, about catastrophic backtracking, by Jan Goyvaerts, THE definitive regex GURU :

http://www.regular-expressions.info/catastrophic.html

古旮

Thanks for the reply. And yes, it seems to be the Catastrophic Backtracking thing. And it seems this stackoverflow exception is not caught.
The regex should be (?-s)d(.*\R){1,5}?.*c because c is not always the first of a line.
I know I should avoid low efficient regex expressions, but it is buggy to directly show a wrong result instead of telling me the limitation is reached.