Community
    • Login

    A new bug found 180304

    Scheduled Pinned Locked Moved Help wanted · · · – – – · · ·
    regexbug
    3 Posts 2 Posters 1.3k Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • 古旮古
      古旮
      last edited by

      Hello @guy038 , long time no see. I found a new bug today. Here’s the bug:
      If we have text

      d
      aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
      aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

      aaaaaaaaaaaa

      Which is, exact 5 lines, each ending with \r\n except line 5.
      line 1: 1 letter d
      line 2: 197 letter a
      line 3: 823 letter a
      line 4: empty
      line 5: 12 letter a

      And the regx is d(.*\R.*){1,5}c. Of course . match newline is not checked

      Obviously, no match should be found, because there is no letter c in the text. But in fact, when I press Find, after a few seconds busy running, it will match the whole text. Even if I insert another letter in front of d, it will still match the whole text.

      OK. 2 more detailes of this bug.

      1. If you replace the \R with \r\n in the regx, you’ll still get the same bug if you add more a to each line. So this bug is not caused by \R.
      2. Why 197, 823 and 12? If you remove 1 a from any of line 2, 3 or 5, then the search result will be no match. However, the searching time is never short(I understand this is not a very efficient regx expression).

      I have done all the experiments on the newest release of NPP, and the François-R Boyer version(which is NPP 6.9.0 with some modification) that you introduced in this post a year ago.
      I want know what could be the problem. If it is because the text is too long, I believe there should be some document or statement that declares this limitation.
      Why does regx has so many bugs? Is it too difficult to implement? Is there any more stable platform that can execute regx? I know there are some websites that we can execute regx, but we cannot use the search in files functionality of NPP, and I am not sure they’re bugless.

      Maybe you are curious how I came across this bug. Here’s the story. I want to search for the whole article 2 key words, d and c, with the restriction that c is following d but it’s not farther than a few lines. So i used regx like d(.*\R.*){1,2}c, where 2 is increasing. It was fine from 1 to 4, but when I searched with d(.*\R.*){1,5}c, it started to match the whole text.
      The text I searched was 2000 lines. I reduce the text little by little, and finally reduce it to 5 lines.

      Best Regards!

      1 Reply Last reply Reply Quote 0
      • guy038G
        guy038
        last edited by guy038

        Hello, @古旮 and all

        First of all, just read this short reply to @marc-lalonde, below :

        https://notepad-plus-plus.org/community/topic/15247/replacing-duped-words-across-a-block-block-of-text-respecting/23

        Obviously, in your example, there is no recursion feature, nor big amounts of text ! But, I think that all the troubles comes from the .*\R.* syntax

        I succeeded to simplify the problem ! Let start with the following text :

        Line 1 : 1 letter d, with its line-break
        Line 2 : 14000 letters a, with its line -break, too

        And let discuss about this similar regex :d(.*\R.*){1,2}c

        Allow me to use the (?-s) modifier to ensures that the dot will match standard chars, only ! Then, my regex d(.*\R.*){1,2}c can be re-written :

        (?-s)d(.*\R.*)(.*\R.*)c ( Regex R2 )

        And, if NO match can be found, the regex engine goes on and, then, tries the regex :

        (?-s)d(.*\R.*)c ( Regex R1 )

        Finally, if a match still cannot be found, the Find dialog displays the message Find: Can’t find the text “(?-s)d(.*\R.*){1,2}c”


        Well, let’s go back to my example ! To begin with, just create a new line 3 with the four letters cdef, only ( IMPORTANT )

        Now, let’s try the regex (?-s)d(.*\R.*){1,2}c against my text : the regex engine tries the regex R2, first, which does match, immediately, all letters a, between the letters d and c included !

        d                     :   Letter d
        
        FIRST  block .*\R.*   :   Nothing   +   \R   +  14500 letters a
        
        SECOND block .*\R.*   :   Nothing   +   \R   +  Nothing
        
        c                     :   Letter c
        

        Now, get rid of the string cdef, in line 3 and re-try the regex (?-s)d(.*\R.*){1,2}c. This time, troubles begin and, as you said, after 8s about, it wrongly selects all file contents !

        Then, reduce the number of letters a, in line 2, to 14000 letters. This time, after 8s, as expected, the Find dialog answers :

        Find: Can’t find the text “(?-s)d(.*\R.*){1,2}c”

        IMPORTANT : Depending of your configuration, and the amount of memory, on your laptop, the limit ( 14000 - 14500 ) may be quite different than mime, but should occur, anyway !

        So, how to explain this difference ? Well, at first, as the quantifiers * are greedy, the regex engine tries the case :

        d                     :   Letter d
        
        FIRST  block .*\R.*   :   Nothing   +   \R   +  14500 letters a
        
        SECOND block .*\R.*   :   Nothing   +   \R   +  Nothing
        
        c                     :   MISSING
        

        NO match can be found and, as the regex could be rewritten (?-s)d.*\R.*.*\R.*c, the regex engine, still keeping the regex R2, then, begins to backtrack and tries this other configuration :

        d                     :   Letter d
        
        FIRST  block .*\R.*   :   Nothing      +   \R   +  14499 letters a
        
        SECOND block .*\R.*   :   1 letter a   +   \R   +  Nothing
        
        c                     :   MISSING
        

        Of course, as the letter c is always missing, then it continues to backtrack and chooses to test :

        d                     :   Letter d
        
        FIRST  block .*\R.*   :   Nothing       +   \R   +  14498 letters a
        
        SECOND block .*\R.*   :   2 letters a   +   \R   +  Nothing
        
        c                     :   MISSING
        

        … and going on, testing 14499 cases, trying to reach the last case :

        d                     :   Letter d
        
        FIRST  block .*\R.*   :   Nothing           +   \R   +  Nothing
        
        SECOND block .*\R.*   :   14500 letters a   +   \R   +  Nothing
        
        c                     :   MISSING
        

        Unfortunately, while testing all the combinations, a catastrophic backtracking error occurred and the regex engine wrongly matches all file contents :-((

        Personally, I would advice you to strongly avoid regexes like (.*)(.*) or (.*)+ or even (x+x+)+y !!


        Finally, as you said :

        I want to search for the whole article 2 key words, d and c, with the restriction that c is following d but it’s not farther than a few lines

        I think, @古旮, that the right regex should be, simply, (?-s)d(.*\R){1,5}c. For instance, this regex would match the text below :

        d
        line 1
        line 2
        line 3
        line 4
        c
        

        but would not match the following one :

        d
        line 1
        line 2
        line 3
        line 4
        line 5
        line 6
        c
        

        I, personally, did a test with the regex (?-s)d(.*\R){1,2}c and the following text

        Line 1 : 1 letter d + its line-break
        Line 2 : 100,000 letters a + its line-break

        It correctly displays the message Find: Can’t find “(?-s)d(.*\R){1,2}c”

        Now, adding a line 3 with the string cdef, without any line-break, it selected, as expected, all text between the first char d and the single c letter, leaving the final def string unselected !

        Best Regards,

        guy038

        P.S. :

        Refer, also, to this article, about catastrophic backtracking, by Jan Goyvaerts, THE definitive regex GURU :

        http://www.regular-expressions.info/catastrophic.html

        1 Reply Last reply Reply Quote 0
        • 古旮古
          古旮
          last edited by

          Thanks for the reply. And yes, it seems to be the Catastrophic Backtracking thing. And it seems this stackoverflow exception is not caught.
          The regex should be (?-s)d(.*\R){1,5}?.*c because c is not always the first of a line.
          I know I should avoid low efficient regex expressions, but it is buggy to directly show a wrong result instead of telling me the limitation is reached.

          1 Reply Last reply Reply Quote 1
          • First post
            Last post
          The Community of users of the Notepad++ text editor.
          Powered by NodeBB | Contributors