Is 27333 magical?



  • I’m using Notepad++ v7.9 (32-bit)
    Build time : Sep 22 2020 - 03:24:22
    Path : C:\Program Files (x86)\Notepad++\notepad++.exe
    Admin mode : OFF
    Local Conf mode : OFF
    OS Name : Windows 10 Home (64-bit)
    OS Version : 1909
    OS Build : 18363.1082
    Current ANSI codepage : 1252
    Plugins : DSpellCheck.dll HexEditor.dll mimeTools.dll NppConverter.dll NppExport.dll NppTextFX.dll PythonScript.dll

    I was reviewing a plain text log file and wanted to deal with consecutive lines matching a pattern. My regex worked for small runs of consecutive lines, but when it hit runs of thousands of lines the expression started gobbling up everything from that point down instead of stopping where the pattern no longer matched.

    I have reduced this to a test case:

    The plain text file contains around 30000 CRLF-terminated lines of the form:

    2020-10-12 00:00:00
    2020-10-12 00:00:00
    2020-10-12 00:00:00
    ...
    2020-10-12 00:00:00
    2020-10-12 00:00:00
    2020-10-12 00:00:00
    

    My expression is: ^(2020-10-12 00:00:00\r\n)+

    The expression works if there are 27333 or fewer consecutive lines of 2020-10-12 00:00:00.
    If there are 27334 or more lines then the expression starts matching all following lines in the file.
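
    For anyone who wants to reproduce this, a test file like the one described can be generated with a short script. This is only a sketch: the function names, the trailing non-matching lines, and the output file name are made up for illustration, and the script is plain Python rather than anything specific to Notepad++.

```python
def build_test_text(run_length, line="2020-10-12 00:00:00"):
    """Return `run_length` identical CRLF-terminated lines plus a short tail.

    The two tail lines are hypothetical; they only exist so there is
    something after the run that the expression should NOT match.
    """
    run = (line + "\r\n") * run_length
    tail = "first non-matching line\r\nsecond non-matching line\r\n"
    return run + tail


def write_test_file(path, run_length):
    """Write the test text in binary mode so the CRLF endings stay untouched."""
    with open(path, "wb") as handle:
        handle.write(build_test_text(run_length).encode("ascii"))


# Example (file names are placeholders):
# write_test_file("run_27333.txt", 27333)
# write_test_file("run_27334.txt", 27334)
```

    Writing one file at the reported threshold and one just above it makes it easy to compare the two cases side by side.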

    Note that this also happens if I use something like

    abcdefghijklmnopqrstuvwxyz
    abcdefghijklmnopqrstuvwxyz
    abcdefghijklmnopqrstuvwxyz
    ...
    abcdefghijklmnopqrstuvwxyz
    abcdefghijklmnopqrstuvwxyz
    abcdefghijklmnopqrstuvwxyz
    

    with the expression (abcdefghijklmnopqrstuvwxyz\r\n)+

    The make/break point for this is also between 27333 and 27334 lines. However, 27333 is not always the threshold.
    While I was reducing my regular expression to a simplified test case, the threshold changed, though not in ways that seemed useful for debugging or understanding.

    For example, I tried (?:abcdefghijklmnopqrstuvwxyz\r\n)+ which shifted the make/break from 27333 lines to 49713 lines.

    The size of the file or matching region in bytes does not seem to be the issue.

    I was running Notepad++ v7.8.9 when I noticed this, so I upgraded to v7.9 and re-tested with the same results. So it is not specific to v7.9: either it was introduced earlier, or that’s the way it’s supposed to work and I’m not understanding something.

    1. What is causing this?
    2. Is there a simple workaround?


  • @mkupper said in Is 27333 magical?:

    What is causing this?
    Is there a simple workaround?

    There are numerous posts on this forum about cases where a match spanning a very large stretch of text overwhelms buffers (for want of a better word). It occurs primarily with lookaheads (?=...) that cover many lines. Although the problem is known, I don’t believe much can be done about it currently except to be aware of the possibility and work around it.

    In your case I’d suggest limiting the expression by changing it to:
    ^(2020-10-12 00:00:00\r\n){1,27000} or perhaps an even lower number so it’s dealing in smaller chunks that would not encounter this issue.
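
    To show what the bounded quantifier does to a long run, here is a small sketch using Python's `re` module. Python's engine is not the one Notepad++ uses, so this does not reproduce the failure; it only illustrates that `{1,N}` splits one long run into several consecutive matches. The helper name and the tiny sample sizes are made up for the example.

```python
import re


def chunked_run_sizes(text, line, chunk_size):
    """Return how many lines each successive bounded match consumes."""
    unit = re.escape(line) + r"\r\n"
    # Non-capturing group with an upper bound, anchored at the start of a line.
    pattern = re.compile(r"^(?:%s){1,%d}" % (unit, chunk_size), re.MULTILINE)
    line_bytes = len(line) + 2  # the line plus its CRLF
    return [len(m.group(0)) // line_bytes for m in pattern.finditer(text)]


sample = "2020-10-12 00:00:00\r\n" * 25 + "some other line\r\n"
print(chunked_run_sizes(sample, "2020-10-12 00:00:00", 10))  # [10, 10, 5]
```

    Each match stops after at most `chunk_size` lines, and the next match picks up where the previous one ended, so the run is still covered in full.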

    Terry



  • Thank you @Terry-R - lowering the chunk size using {1,10000} works well as a work-around.
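
    If the regex route stays awkward for very large runs, another option is to find the runs without one giant match at all. The following is a minimal sketch in plain Python, assuming the file has already been split into lines without their line endings; `find_runs` and its parameters are hypothetical names, not part of any Notepad++ or plugin API.

```python
import re
from itertools import groupby


def find_runs(lines, pattern, min_length=2):
    """Return (start_index, length) for each run of consecutive matching lines.

    Each line is tested on its own, so no single match ever has to span
    thousands of lines.
    """
    regex = re.compile(pattern)
    runs = []
    index = 0
    for matched, group in groupby(lines, key=lambda s: regex.fullmatch(s) is not None):
        length = sum(1 for _ in group)
        if matched and length >= min_length:
            runs.append((index, length))
        index += length
    return runs


lines = ["header"] + ["2020-10-12 00:00:00"] * 5 + ["other", "2020-10-12 00:00:00"]
print(find_runs(lines, r"2020-10-12 00:00:00"))  # [(1, 5)]
```

    The last line in the sample matches the pattern but stands alone, so it is skipped by the default `min_length` of 2.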



  • @mkupper said in Is 27333 magical?:

    lowering the chunk size using {1,10000} works well as a work-around

    If you were interested in more info about the issue read the following post:
    https://community.notepad-plus-plus.org/topic/14729/deleting-lines-that-repeat-the-first-15-characters

    Some testing was carried out in this instance and at around 67000 lines the problem was encountered. Other testing since has shown the number of lines at which the problem occurs varies considerably. Indeed it’s likely more about the number of “bytes” rather than lines. Those of us who regularly supply regexes are aware and we sometimes ask about the size of the file being worked as “big” files will often encounter this issue, depending on the type of regex used (capturing vast amounts of bytes in a single bite). ;-)

    Terry



  • @Terry-R - that other thread was excellent. One of the posts from @scott-sumner provided a good clue with this error message:

    <type 'exceptions.RuntimeError'>: The complexity of matching the regular expression exceeded predefined bounds. Try refactoring the regular expression to make each choice made by the state machine unambiguous. This exception is thrown to prevent "eternal" matches that take an indefinite period time to locate.

    It appears the regex engine has something, possibly a timer, that aborts the match if the expression starts taking too long. The goal is to prevent catastrophic backtracking. Googling “regex catastrophic backtracking” will turn up articles about that.

    If that’s the case though we still have the mystery of why the expression then quietly matched the remainder of the file rather than throwing up an error message.
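
    The "predefined bounds" idea from the error message can be illustrated with a toy example. This is not how the Notepad++ regex engine is implemented; it is a hand-written backtracking matcher for the single pattern ^(a+)+b$, with a made-up step budget, just to show why a limit is useful when backtracking explodes.

```python
class BudgetExceeded(RuntimeError):
    """Raised when the toy matcher uses more steps than it was allowed."""


def matches_nested_plus(text, budget):
    """Naively match ^(a+)+b$ against text, giving up after `budget` steps."""
    steps = 0

    def one_or_more_groups(pos):
        nonlocal steps
        steps += 1
        if steps > budget:
            raise BudgetExceeded("gave up after %d steps" % budget)
        # Find the longest run of 'a' starting at pos.
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1
        # Greedy first, then back off one character at a time.
        for stop in range(end, pos, -1):
            if text[stop:] == "b":
                return True
            if one_or_more_groups(stop):
                return True
        return False

    return one_or_more_groups(0)


print(matches_nested_plus("a" * 30 + "b", budget=1000))  # True
try:
    matches_nested_plus("a" * 30 + "c", budget=1000)
except BudgetExceeded as exc:
    print("aborted:", exc)
```

    The matching input succeeds on the first attempt, while the non-matching one would need roughly 2**30 attempts to fail, so the budget cuts it off with an explicit error instead of leaving it to run.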



  • @mkupper said in Is 27333 magical?:

    If that’s the case though we still have the mystery of why the expression then quietly matched the remainder of the file rather than throwing up an error message.

    Since you seem to like to do some extra reading, that aspect of it is addressed HERE. Enjoy. :-)



  • @Alan-Kilborn said in Is 27333 magical?:

    Since you seem to like to do some extra reading, that aspect of it is addressed HERE. Enjoy. :-)

    BTW, the poster called “ghost” in that github topic is our old friend from here, @Claudia-Frank.
    I know this because I had some separate notes on this regex problem, and my notes referenced Claudia saying something. Now when I look in that thread for her comments, they are attributed to “ghost”. I suppose she terminated her github account (for whatever reason), and when you do that, your old comments get reassigned to a ghost account.

