Reply to A new bug found 180304 on Sun, 04 Mar 2018 09:43:42 GMT

古旮 — Sun, 04 Mar 2018 09:43:42 GMT

Thanks for the reply. And yes, it does seem to be catastrophic backtracking, and it seems the resulting stack overflow exception is not caught.
The regex should be (?-s)d(.*\R){1,5}?.*c, because c is not always the first character of a line.
I know I should avoid inefficient regular expressions, but showing a wrong result directly, instead of telling me that a limit has been reached, is a bug.

Reply to A new bug found 180304 on Sun, 04 Mar 2018 15:44:31 GMT

guy038 — Sun, 04 Mar 2018 15:44:31 GMT

Hello, @古旮 and all

First of all, just read this short reply to @marc-lalonde, below :

https://notepad-plus-plus.org/community/topic/15247/replacing-duped-words-across-a-block-block-of-text-respecting/23

Obviously, in your example, there is no recursion feature, nor big amounts of text ! But I think that all the trouble comes from the .*\R.* syntax

I managed to simplify the problem ! Let's start with the following text :

Line 1 : 1 letter d, with its line-break
Line 2 : 14500 letters a, with its line-break, too

And let's discuss this similar regex : d(.*\R.*){1,2}c

Allow me to use the (?-s) modifier to ensure that the dot matches standard characters only ! As the {1,2} quantifier is greedy, the regex engine first tries two repetitions of the group, so my regex d(.*\R.*){1,2}c can be rewritten :

(?-s)d(.*\R.*)(.*\R.*)c ( Regex R2 )

And, if NO match can be found with two repetitions, the regex engine goes on and tries a single repetition, that is to say the regex :

(?-s)d(.*\R.*)c ( Regex R1 )

Finally, if a match still cannot be found, the Find dialog displays the message Find: Can’t find the text “(?-s)d(.*\R.*){1,2}c”
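As a rough illustration of this R2-then-R1 reading, here is a minimal Python sketch. It rests on assumptions that are not part of the original test: Python's re module has no \R, so a plain "\n" stands in for it, and the (?-s) modifier is dropped because the Python dot excludes "\n" by default. The sample strings are made up. The sketch only checks that the combined regex agrees with R2, then R1, on two tiny inputs; it says nothing about the engine's internal order.

```python
import re

# Assumption: "\n" stands in for \R, which Python's re does not support.
# The dot excludes "\n" by default, which plays the role of (?-s).
COMBINED = re.compile(r"d(.*\n.*){1,2}c")
R2 = re.compile(r"d(.*\n.*)(.*\n.*)c")
R1 = re.compile(r"d(.*\n.*)c")


def matched(pattern, text):
    """Return the matched text, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


two_breaks = "d\nac\nc"  # hypothetical sample: R2 can match here
one_break = "d\nac"      # hypothetical sample: only R1 can match here

print(repr(matched(COMBINED, two_breaks)))  # same text as R2 finds
print(repr(matched(COMBINED, one_break)))   # same text as R1 finds
```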

Well, let’s go back to my example ! To begin with, just create a new line 3 with the four letters cdef, only ( IMPORTANT )

Now, let’s try the regex (?-s)d(.*\R.*){1,2}c against my text : the regex engine tries the regex R2, first, which immediately matches everything from the letter d to the letter c, included, with all the letters a in between !

d                     :   Letter d

FIRST  block .*\R.*   :   Nothing   +   \R   +  14500 letters a

SECOND block .*\R.*   :   Nothing   +   \R   +  Nothing

c                     :   Letter c
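The breakdown above can be reproduced on a small scale. The following is only a sketch under assumptions: it uses Python's re module instead of the Notepad++ engine, "\n" instead of \R, and 20 letters a instead of 14500. It uses the explicit R2 form so that each block has its own capturing group.

```python
import re

N = 20  # scaled down from the 14500 letters a used in the post
text = "d\n" + "a" * N + "\ncdef"

# Assumption: "\n" stands in for \R (Python's re lacks \R).
r2 = re.compile(r"d(.*\n.*)(.*\n.*)c")
m = r2.search(text)

first_block, second_block = m.group(1), m.group(2)
print(repr(first_block))     # line-break followed by the N letters a
print(repr(second_block))    # a lone line-break: Nothing + break + Nothing
print(repr(text[m.end():]))  # 'def' is left outside the match
```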

Now, get rid of the string cdef, in line 3, and retry the regex (?-s)d(.*\R.*){1,2}c. This time, trouble begins and, as you said, after about 8s, it wrongly selects all the file contents !

Then, reduce the number of letters a, in line 2, to 14000 letters. This time, after 8s, as expected, the Find dialog answers :

Find: Can’t find the text “(?-s)d(.*\R.*){1,2}c”

IMPORTANT : Depending on your configuration and the amount of memory on your laptop, the limit ( 14000 - 14500 ) may be quite different from mine, but it should occur, anyway !

So, how to explain this difference ? Well, at first, as the quantifiers * are greedy, the regex engine tries the case :

d                     :   Letter d

FIRST  block .*\R.*   :   Nothing   +   \R   +  14500 letters a

SECOND block .*\R.*   :   Nothing   +   \R   +  Nothing

c                     :   MISSING

NO match can be found and, as the regex could be rewritten (?-s)d.*\R.*.*\R.*c, the regex engine, still keeping the regex R2, then, begins to backtrack and tries this other configuration :

d                     :   Letter d

FIRST  block .*\R.*   :   Nothing      +   \R   +  14499 letters a

SECOND block .*\R.*   :   1 letter a   +   \R   +  Nothing

c                     :   MISSING

Of course, as the letter c is still missing, it continues to backtrack and chooses to test :

d                     :   Letter d

FIRST  block .*\R.*   :   Nothing       +   \R   +  14498 letters a

SECOND block .*\R.*   :   2 letters a   +   \R   +  Nothing

c                     :   MISSING

… and so on, testing 14499 cases, trying to reach the last case :

d                     :   Letter d

FIRST  block .*\R.*   :   Nothing           +   \R   +  Nothing

SECOND block .*\R.*   :   14500 letters a   +   \R   +  Nothing

c                     :   MISSING

Unfortunately, while testing all the combinations, a catastrophic backtracking error occurs and the regex engine wrongly matches all the file contents :-((

Personally, I would strongly advise you to avoid regexes like (.*)(.*) or (.*)+ or even (x+x+)+y !!
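To get a feel for how the work grows when the letter c is missing, here is a hedged timing sketch. It runs on Python's re module, not on the Notepad++ engine, with "\n" standing in for \R, so it will not reproduce the wrong selection: Python is expected to report no match. The sizes are arbitrary and the timings depend on the machine, so none are quoted here.

```python
import re
import time

# Assumption: "\n" stands in for \R; Python's dot already excludes "\n".
PATTERN = re.compile(r"d(.*\n.*){1,2}c")


def failing_search(n):
    """Search a text made of d, a line-break, n letters a and a line-break."""
    text = "d\n" + "a" * n + "\n"  # no letter c anywhere
    start = time.perf_counter()
    m = PATTERN.search(text)
    return m, time.perf_counter() - start


for n in (250, 500, 1000, 2000):
    m, elapsed = failing_search(n)
    print(n, m, f"{elapsed:.4f}s")
```

Each time n doubles, the elapsed time would be expected to grow much faster than twofold, since every split of the letters a between the two blocks has to be tried before the engine gives up.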

Finally, as you said :

I want to search for the whole article 2 key words, d and c, with the restriction that c is following d but it’s not farther than a few lines

I think, @古旮, that the right regex should be, simply, (?-s)d(.*\R){1,5}c. For instance, this regex would match the text below :

d
line 1
line 2
line 3
line 4
c

but would not match the following one :

d
line 1
line 2
line 3
line 4
line 5
line 6
c
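Both sample texts can be checked quickly. This is only a sketch, run with Python's re module rather than Notepad++, so "\n" replaces \R and the (?-s) modifier is left out because the Python dot does not match "\n" by default.

```python
import re

# Assumption: "\n" replaces \R; Python's dot already excludes "\n".
PROPOSED = re.compile(r"d(.*\n){1,5}c")

within_range = "d\nline 1\nline 2\nline 3\nline 4\nc"
too_far = "d\nline 1\nline 2\nline 3\nline 4\nline 5\nline 6\nc"


def is_found(text):
    """True when the letter c starts a line at most 5 line-breaks after d."""
    return PROPOSED.search(text) is not None


print(is_found(within_range))  # True
print(is_found(too_far))       # False
```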

I, personally, did a test with the regex (?-s)d(.*\R){1,2}c and the following text

Line 1 : 1 letter d + its line-break
Line 2 : 100,000 letters a + its line-break

It correctly displays the message Find: Can’t find “(?-s)d(.*\R){1,2}c”

Now, adding a line 3 with the string cdef, without any line-break, it selected, as expected, all text between the first char d and the single c letter, leaving the final def string unselected !
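The same test can be sketched outside Notepad++. The assumptions are the same as before: Python's re module, and "\n" in place of \R. Because the group (.*\R) ends on the line-break, there is only one way to split the text between the repetitions, so the search should stay quick even with 100,000 letters a.

```python
import re

# Assumption: "\n" replaces \R; Python's dot already excludes "\n".
SIMPLE = re.compile(r"d(.*\n){1,2}c")

base = "d\n" + "a" * 100_000 + "\n"   # lines 1 and 2 of the test text
with_line_3 = base + "cdef"           # line 3, without any line-break

no_c = SIMPLE.search(base)
with_c = SIMPLE.search(with_line_3)

print(no_c)                               # None: no letter c to find
print(repr(with_line_3[with_c.end():]))   # 'def' stays unselected
```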

Best Regards,

guy038

P.S. :

Refer, also, to this article, about catastrophic backtracking, by Jan Goyvaerts, THE definitive regex GURU :

http://www.regular-expressions.info/catastrophic.html