A new bug found 180304
-
Hello @guy038 , long time no see. I found a new bug today. Here’s the bug:
If we have textd
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Which is, exact 5 lines, each ending with
\r\n
except line 5.
line 1: 1 letterd
line 2: 197 lettera
line 3: 823 lettera
line 4:empty
line 5: 12 lettera
And the regx is
d(.*\R.*){1,5}c
. Of course . match newline is not checkedObviously, no match should be found, because there is no letter
c
in the text. But in fact, when I press Find, after a few seconds busy running, it will match the whole text. Even if I insert another letter in front ofd
, it will still match the whole text.OK. 2 more detailes of this bug.
- If you replace the
\R
with\r\n
in the regx, you’ll still get the same bug if you add morea
to each line. So this bug is not caused by\R
. - Why 197, 823 and 12? If you remove 1
a
from any of line 2, 3 or 5, then the search result will be no match. However, the searching time is never short(I understand this is not a very efficient regx expression).
I have done all the experiments on the newest release of NPP, and the François-R Boyer version(which is NPP 6.9.0 with some modification) that you introduced in this post a year ago.
I want know what could be the problem. If it is because the text is too long, I believe there should be some document or statement that declares this limitation.
Why does regx has so many bugs? Is it too difficult to implement? Is there any more stable platform that can execute regx? I know there are some websites that we can execute regx, but we cannot use the search in files functionality of NPP, and I am not sure they’re bugless.Maybe you are curious how I came across this bug. Here’s the story. I want to search for the whole article 2 key words,
d
andc
, with the restriction thatc
is followingd
but it’s not farther than a few lines. So i used regx liked(.*\R.*){1,2}c
, where2
is increasing. It was fine from 1 to 4, but when I searched withd(.*\R.*){1,5}c
, it started to match the whole text.
The text I searched was 2000 lines. I reduce the text little by little, and finally reduce it to 5 lines.Best Regards!
- If you replace the
-
Hello, @古旮 and all
First of all, just read this short reply to @marc-lalonde, below :
Obviously, in your example, there is no recursion feature, nor big amounts of text ! But, I think that all the troubles comes from the
.*\R.*
syntaxI succeeded to simplify the problem ! Let start with the following text :
Line 1 :
1
letterd
, with its line-break
Line 2 :14000
lettersa
, with its line -break, tooAnd let discuss about this similar regex :
d(.*\R.*){1,2}c
Allow me to use the
(?-s)
modifier to ensures that the dot will match standard chars, only ! Then, my regexd(.*\R.*){1,2}c
can be re-written :(?-s)d(.*\R.*)(.*\R.*)c
( Regex R2 )And, if NO match can be found, the regex engine goes on and, then, tries the regex :
(?-s)d(.*\R.*)c
( Regex R1 )Finally, if a match still cannot be found, the Find dialog displays the message Find: Can’t find the text “(?-s)d(.*\R.*){1,2}c”
Well, let’s go back to my example ! To begin with, just create a new line
3
with the four letterscdef
, only ( IMPORTANT )Now, let’s try the regex
(?-s)d(.*\R.*){1,2}c
against my text : the regex engine tries the regex R2, first, which does match, immediately, all lettersa
, between the lettersd
andc
included !d : Letter d FIRST block .*\R.* : Nothing + \R + 14500 letters a SECOND block .*\R.* : Nothing + \R + Nothing c : Letter c
Now, get rid of the string
cdef
, in line3
and re-try the regex(?-s)d(.*\R.*){1,2}c
. This time, troubles begin and, as you said, after 8s about, it wrongly selects all file contents !Then, reduce the number of letters a, in line
2
, to14000
letters. This time, after 8s, as expected, the Find dialog answers :Find: Can’t find the text “(?-s)d(.*\R.*){1,2}c”
IMPORTANT : Depending of your configuration, and the amount of memory, on your laptop, the limit (
14000
-14500
) may be quite different than mime, but should occur, anyway !So, how to explain this difference ? Well, at first, as the quantifiers
*
are greedy, the regex engine tries the case :d : Letter d FIRST block .*\R.* : Nothing + \R + 14500 letters a SECOND block .*\R.* : Nothing + \R + Nothing c : MISSING
NO match can be found and, as the regex could be rewritten
(?-s)d.*\R.*.*\R.*c
, the regex engine, still keeping the regex R2, then, begins to backtrack and tries this other configuration :d : Letter d FIRST block .*\R.* : Nothing + \R + 14499 letters a SECOND block .*\R.* : 1 letter a + \R + Nothing c : MISSING
Of course, as the letter
c
is always missing, then it continues to backtrack and chooses to test :d : Letter d FIRST block .*\R.* : Nothing + \R + 14498 letters a SECOND block .*\R.* : 2 letters a + \R + Nothing c : MISSING
… and going on, testing
14499
cases, trying to reach the last case :d : Letter d FIRST block .*\R.* : Nothing + \R + Nothing SECOND block .*\R.* : 14500 letters a + \R + Nothing c : MISSING
Unfortunately, while testing all the combinations, a catastrophic backtracking error occurred and the regex engine wrongly matches all file contents :-((
Personally, I would advice you to strongly avoid regexes like
(.*)(.*)
or(.*)+
or even(x+x+)+y
!!
Finally, as you said :
I want to search for the whole article 2 key words, d and c, with the restriction that c is following d but it’s not farther than a few lines
I think, @古旮, that the right regex should be, simply,
(?-s)d(.*\R){1,5}c
. For instance, this regex would match the text below :d line 1 line 2 line 3 line 4 c
but would not match the following one :
d line 1 line 2 line 3 line 4 line 5 line 6 c
I, personally, did a test with the regex
(?-s)d(.*\R){1,2}c
and the following textLine 1 :
1
letterd
+ its line-break
Line 2 :100,000
lettersa
+ its line-breakIt correctly displays the message Find: Can’t find “(?-s)d(.*\R){1,2}c”
Now, adding a line
3
with the stringcdef
, without any line-break, it selected, as expected, all text between the first chard
and the singlec
letter, leaving the final def string unselected !Best Regards,
guy038
P.S. :
Refer, also, to this article, about catastrophic backtracking, by Jan Goyvaerts,
THE
definitive regexGURU
: -
Thanks for the reply. And yes, it seems to be the Catastrophic Backtracking thing. And it seems this stackoverflow exception is not caught.
The regex should be(?-s)d(.*\R){1,5}?.*c
becausec
is not always the first of a line.
I know I should avoid low efficient regex expressions, but it is buggy to directly show a wrong result instead of telling me the limitation is reached.