    Is 27333 magical?

    Help wanted · · · – – – · · ·
    • mkupper

      I’m using Notepad++ v7.9 (32-bit)
      Build time : Sep 22 2020 - 03:24:22
      Path : C:\Program Files (x86)\Notepad++\notepad++.exe
      Admin mode : OFF
      Local Conf mode : OFF
      OS Name : Windows 10 Home (64-bit)
      OS Version : 1909
      OS Build : 18363.1082
      Current ANSI codepage : 1252
      Plugins : DSpellCheck.dll HexEditor.dll mimeTools.dll NppConverter.dll NppExport.dll NppTextFX.dll PythonScript.dll

      I was reviewing a plain text log file and wanted to deal with consecutive lines matching a pattern. My regexp worked for small runs of consecutive lines, but when it hit runs of thousands of lines it started gobbling up everything from that point on down rather than stopping where the pattern no longer matched.

      I have reduced this to a test case:

      The plain text file contains around 30000 CRLF-terminated lines of:

      2020-10-12 00:00:00
      2020-10-12 00:00:00
      2020-10-12 00:00:00
      ...
      2020-10-12 00:00:00
      2020-10-12 00:00:00
      2020-10-12 00:00:00
      

      My expression is: ^(2020-10-12 00:00:00\r\n)+

      The expression works if there are 27333 or fewer consecutive lines of 2020-10-12 00:00:00.
      If there are 27334 or more lines then the expression starts matching all following lines in the file.
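A minimal sketch for building such a test file outside the editor, assuming Python is at hand. The helper names, the output file name, and the trailing line are hypothetical; only the repeated line and the CRLF endings come from the description above.

```python
def build_test_text(line: str, count: int, trailer: str = "") -> str:
    """Return `count` CRLF-terminated copies of `line`, followed by `trailer`."""
    return (line + "\r\n") * count + trailer


def write_test_file(path: str, text: str) -> None:
    """Write `text` as-is; newline="" stops Python from translating line endings."""
    with open(path, "w", newline="", encoding="ascii") as f:
        f.write(text)


# Hypothetical usage: 27334 identical lines plus one line that should NOT match.
# write_test_file("run_27334.txt",
#                 build_test_text("2020-10-12 00:00:00", 27334, "some other line\r\n"))
```

Changing `count` by one either way gives the two files on each side of the make/break point.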

      Note that this also happens if I use something like

      abcdefghijklmnopqrstuvwxyz
      abcdefghijklmnopqrstuvwxyz
      abcdefghijklmnopqrstuvwxyz
      ...
      abcdefghijklmnopqrstuvwxyz
      abcdefghijklmnopqrstuvwxyz
      abcdefghijklmnopqrstuvwxyz
      

      with the expression (abcdefghijklmnopqrstuvwxyz\r\n)+

      27333 to 27334 lines is also the make/break case for this. However, 27333 is not always the make/break case.
      When I was reducing my regular expression down to a simplified test case the number of lines changed though not in ways that seemed useful from a debugging or understanding viewpoint.

      For example, I tried (?:abcdefghijklmnopqrstuvwxyz\r\n)+ which shifted the make/break from 27333 lines to 49713 lines.

      The size of the file or matching region in bytes does not seem to be the issue.

      I was running npp v7.8.9 when I noticed this, so I upgraded to v7.9 and re-tested with the same results. So it’s not a v7.9 issue: either it was introduced earlier, or that’s the way it’s supposed to work and I’m not understanding something.

      1. What is causing this?
      2. Is there a simple workaround?
      • Terry R

        @mkupper said in Is 27333 magical?:

        What is causing this?
        Is there a simple workaround?

        There are numerous posts on this forum about regexes whose match covers so much text that it overwhelms buffers (for want of a better word). It occurs primarily with lookaheads (?=...) that span many lines. Whilst the problem is known, I don’t believe much can be done currently except to be aware of the possibility and work around it.

        In your case I’d suggest limiting the expression by changing it to:
        ^(2020-10-12 00:00:00\r\n){1,27000} or perhaps an even lower number so it’s dealing in smaller chunks that would not encounter this issue.
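As a rough illustration of how the bounded quantifier splits one long run into several smaller matches, here is a sketch using Python's `re` module. That is an assumption made for illustration: Python's engine is not the one inside Notepad++, so this does not reproduce the failure, it only shows the chunking. The function name `chunked_runs` is hypothetical.

```python
import re

LINE = "2020-10-12 00:00:00\r\n"


def chunked_runs(text: str, max_lines: int = 10000) -> list:
    """Return (start, end) offsets of each match of the bounded pattern."""
    # Same shape as the suggested expression, with the upper bound made a parameter.
    pattern = re.compile(
        r"^(?:2020-10-12 00:00:00\r\n){1," + str(max_lines) + "}",
        re.MULTILINE,
    )
    return [m.span() for m in pattern.finditer(text)]


text = LINE * 27334 + "some other line\r\n"
sizes = [(end - start) // len(LINE) for start, end in chunked_runs(text)]
print(sizes)  # [10000, 10000, 7334]
```

Each match stops after at most `max_lines` lines, and the next search picks up at the start of the following line, so a repeated Find or a Replace All still walks through the whole run.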

        Terry

        • mkupper

          Thank you @Terry-R - lowering the chunk size using {1,10000} works well as a work-around.

          • Terry R

            @mkupper said in Is 27333 magical?:

            lowering the chunk size using {1,10000} works well as a work-around

            If you are interested in more info about the issue, read the following post:
            https://community.notepad-plus-plus.org/topic/14729/deleting-lines-that-repeat-the-first-15-characters

            Some testing was carried out in that instance, and the problem was encountered at around 67000 lines. Other testing since has shown that the number of lines at which the problem occurs varies considerably. Indeed, it’s likely more about the number of “bytes” than the number of lines. Those of us who regularly supply regexes are aware of this, and we sometimes ask about the size of the file being worked on, as “big” files will often encounter this issue, depending on the type of regex used (capturing vast amounts of bytes in a single bite). ;-)
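Since the trouble comes from one match having to swallow a very large stretch of text, another way around it is to skip the single big match and scan line by line. The following is only a sketch of that idea in plain Python, not something proposed in the thread; `find_runs`, `predicate`, and `min_len` are hypothetical names.

```python
def find_runs(lines, predicate, min_len=1):
    """Return (start, end) index pairs, end exclusive, for each run of
    consecutive lines where predicate(line) is true."""
    runs = []
    start = None  # index where the current run began, or None if not in a run
    for i, line in enumerate(lines):
        if predicate(line):
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that reaches the end of the input.
    if start is not None and len(lines) - start >= min_len:
        runs.append((start, len(lines)))
    return runs


lines = ["2020-10-12 00:00:00"] * 30000 + ["some other line"]
print(find_runs(lines, lambda s: s == "2020-10-12 00:00:00"))  # [(0, 30000)]
```

The work per line is constant, so the length of a run does not matter here.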

            Terry

            • mkupper

              @Terry-R - that other thread was excellent. One of the posts from @scott-sumner provided a good clue with this error message:

              <type 'exceptions.RuntimeError'>: The complexity of matching the regular expression exceeded predefined bounds. Try refactoring the regular expression to make each choice made by the state machine unambiguous. This exception is thrown to prevent "eternal" matches that take an indefinite period time to locate.

              It appears the regex engine has some mechanism, possibly a timer, that aborts the match if the expression starts taking too long. The goal is to prevent catastrophic backtracking. Googling for “regex catastrophic backtracking” will find articles about that.
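To make the term concrete, here is a toy backtracking matcher for the textbook pattern (a+)+b, which is not the pattern from this thread. It counts its own steps and gives up once a budget is exceeded. This is a sketch of the general idea only: the budget, the `StepBudgetExceeded` class, and `toy_match` are hypothetical, and nothing here is claimed to be how the real engine measures "complexity" (the quoted message suggests a bound, not necessarily a timer).

```python
class StepBudgetExceeded(RuntimeError):
    """Raised when the toy matcher exceeds its step budget (hypothetical)."""


def toy_match(text: str, max_steps: int = 100_000):
    """Naively match (a+)+b against all of `text`; return (matched, steps)."""
    steps = 0
    n = len(text)

    def groups(pos: int) -> bool:
        nonlocal steps
        steps += 1
        if steps > max_steps:
            raise StepBudgetExceeded(f"gave up after {max_steps} steps")
        # Find how far the run of 'a' characters extends from pos.
        run_end = pos
        while run_end < n and text[run_end] == "a":
            run_end += 1
        # Try every possible length for the next a+ group, longest first.
        for end in range(run_end, pos, -1):
            # After a group: a final 'b' completes the match ...
            if end == n - 1 and text[end] == "b":
                return True
            # ... otherwise try to start another group at `end`.
            if groups(end):
                return True
        return False

    return groups(0), steps


print(toy_match("aaab"))      # (True, 1)
print(toy_match("a" * 10))    # (False, 1024)
```

With no trailing b, every way of splitting the run into groups is tried, so the step count doubles with each extra character; a call such as `toy_match("a" * 40)` raises `StepBudgetExceeded` instead of running for a very long time.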

              If that’s the case though we still have the mystery of why the expression then quietly matched the remainder of the file rather than throwing up an error message.

              • Alan Kilborn

                @mkupper said in Is 27333 magical?:

                If that’s the case though we still have the mystery of why the expression then quietly matched the remainder of the file rather than throwing up an error message.

                Since you seem to like to do some extra reading, that aspect of it is addressed HERE. Enjoy. :-)

                • Alan Kilborn

                  @Alan-Kilborn said in Is 27333 magical?:

                  Since you seem to like to do some extra reading, that aspect of it is addressed HERE. Enjoy. :-)

                  BTW, the poster called “ghost” in that github topic is our old friend from here, @Claudia-Frank.
                  I know this because I had some separate notes on this regex problem, and my notes referenced Claudia saying something. Now when I look in that thread for her comments, they are attributed to “ghost”. I suppose she terminated her github account (for whatever reason), and when you do that, your old comments get reassigned to a ghost account.
