    Regex not found in large file, known limitation or bug?

    • Kjell Rilbe

      Notepad++ version 8.9.1 but this problem has been with me for quite some time.

      I have a large ANSI text file containing CSV data (400000+ lines) and at least some regex searches fail to find matching text that I know is there. Is this a known limitation, i.e. is it known that regex searches do not work correctly in large files?

      Background:
      I receive this file monthly from a third party, and need to do some quality checks/validation. In particular I search for lines containing an incorrect number of semicolons. There should be 12 semicolons on each line.
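As a cross-check outside Notepad++, the validation described above can be sketched in a few lines of Python. This is a minimal sketch, not part of the original post: the function name `find_bad_lines` and the sample data are hypothetical, and it assumes (as the regex below also does) that semicolons never appear inside quoted fields.

```python
def find_bad_lines(lines, expected=12, sep=";"):
    """Return (line_number, count) for each line whose separator count differs from expected."""
    bad = []
    for number, line in enumerate(lines, start=1):
        # Strip the line ending first so it cannot affect the count.
        count = line.rstrip("\r\n").count(sep)
        if count != expected:
            bad.append((number, count))
    return bad


# Hypothetical sample data, not taken from the real file.
sample = [
    "a;b;c;d;e;f;g;h;i;j;k;l;m\n",    # 12 semicolons: valid
    "a;b;c;d;e;f;g;h;i;j;k;l\n",      # 11 semicolons: too few
    "a;b;c;d;e;f;g;h;i;j;k;l;m;n\n",  # 13 semicolons: too many
]
print(find_bad_lines(sample))  # [(2, 11), (3, 13)]
```

For a real file, the same function could be fed an open file object instead of a list. The right encoding for an "ANSI" file depends on the system code page, so any specific value (for example `cp1252`) would be an assumption.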

      What fails:
      I search for this regex in this month’s file:

      ^[^;\r\n]*(?:(?:;[^;\r\n]*){0,11}|(?:;[^;\r\n]*){13,})$
      

      I know there’s a match at line 358660, and another one at line 382650.

      If I put my cursor at the top of the file, Notepad++ doesn’t find these matches.

      If I put my cursor at line 50178 or below, then it DOES find the first match (and from there it also finds the second one). In fact, I can put the cursor after the first character on line 50177 and it will find the first match, but not if the cursor is at the end of that line.

      I’ve seen similar problems with some other regexes as well, but I can’t currently give any additional reproducible examples.

      So, should I report it as a bug or is it already known and documented or “as designed”? In that case, I’d appreciate a pointer to this information.

      • PeterJones, replying to @Kjell Rilbe

        @Kjell-Rilbe ,

        I created a document that was 524288 lines of

        one;two;three;four;five;six;seven;eight;nine;ten;eleven;twelve;end
        

        so that they all had exactly 12 semicolons.

        When I ran your regex on that file, which contains no matches, it told me the complexity was too great. That’s generally an indicator that your * quantifiers are requiring too many backtracking steps. The threshold is a function of your exact memory, resource allocation, and other such things, so it cannot be documented as “there is a hard limit at XYZ”.

        I then changed line 358660 to only go through ...;eleven;end (11 semicolons) and line 382650 to go ...;eleven;twelve;thirteen;end (13 semicolons) so that both should match. Starting with the caret at position 1 in the file, it was able to find both of those matches.
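For anyone who wants to reproduce this test document, the construction can be sketched in Python. This is only an illustration under the description above: the function name `build_test_lines` is hypothetical, and writing the lines to disk is left out.

```python
WORDS = ["one", "two", "three", "four", "five", "six",
         "seven", "eight", "nine", "ten", "eleven", "twelve"]


def build_test_lines(total, short_at, long_at):
    """Build `total` lines with 12 semicolons each, except two altered lines (1-based)."""
    good = ";".join(WORDS + ["end"])                # 12 semicolons
    short = ";".join(WORDS[:11] + ["end"])          # ...;eleven;end, 11 semicolons
    long_ = ";".join(WORDS + ["thirteen", "end"])   # ...;twelve;thirteen;end, 13 semicolons
    lines = [good] * total
    lines[short_at - 1] = short
    lines[long_at - 1] = long_
    return lines


# Line count and altered line numbers as given in the posts above.
lines = build_test_lines(524288, 358660, 382650)
print(lines[358659])
print(lines[382649])
```

Joining `lines` with a newline and saving the result gives a file that can be opened in Notepad++ to try the searches discussed here.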

        I was hoping that adding the + after the * to make it “possessive” (which prevents backtracking into that quantifier; a plain * is already greedy) would make it more efficient, but when I had no matches in the file, it still said it was too complex.

        If I split it into two regexes, ^[^;\r\n]*+(?:;[^;\r\n]*+){0,11}$ to find fewer than 12 semicolons and ^[^;\r\n]*+(?:;[^;\r\n]*+){13} to find at least 13, each one works when run individually: both catch what they are supposed to, and on a file that has only twelve-semicolon lines, neither finds anything and neither complains about the complexity. Notice that the second regex has no $ end-of-line anchor. If you really want the match to extend to the end of the line, ^[^;\r\n]*+(?:;[^;\r\n]*+){13}(?-s).*+$ might be more efficient than {13,}, because after the thirteenth semicolon it won’t care what characters come between there and EOL. So you might want to see whether breaking it into two regexes instead of one makes it more reliable for you.
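The two-pattern approach can be tried outside Notepad++ as well. The following is a minimal sketch using Python's `re` module, which is a different engine from the one in Notepad++, so it says nothing about the complexity error itself. The possessive `+` is left out here because Python only accepts it from version 3.11 on; the helper name `matching_line_numbers` and the sample text are hypothetical.

```python
import re

# Fewer than 12 semicolons: anchored at both ends of the line.
TOO_FEW = re.compile(r"^[^;\r\n]*(?:;[^;\r\n]*){0,11}$", re.MULTILINE)
# At least 13 semicolons: no end-of-line anchor is needed.
TOO_MANY = re.compile(r"^[^;\r\n]*(?:;[^;\r\n]*){13}", re.MULTILINE)


def matching_line_numbers(pattern, text):
    """Return the 1-based line numbers on which `pattern` starts a match."""
    return [text.count("\n", 0, m.start()) + 1 for m in pattern.finditer(text)]


# Hypothetical sample with "\n" line endings.
text = "\n".join([
    "a;b;c;d;e;f;g;h;i;j;k;l;m",    # 12 semicolons: valid
    "a;b;c;d;e;f;g;h;i;j;k;l",      # 11 semicolons
    "a;b;c;d;e;f;g;h;i;j;k;l;m",    # 12 semicolons: valid
    "a;b;c;d;e;f;g;h;i;j;k;l;m;n",  # 13 semicolons
])
print(matching_line_numbers(TOO_FEW, text))   # [2]
print(matching_line_numbers(TOO_MANY, text))  # [4]
```

Each pattern only has to answer one question, so neither needs an alternation between two repeated groups.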

        My guess is that it was just too much backtracking, though I’m surprised that it just “failed to find matching text” rather than giving an error. Unless you just didn’t notice the error message.


        update: also, if @guy038 or some of the other forum regex gurus have other ideas beyond what I suggested, feel free to add your input. And I’ve started working on a small section in the User Manual to give a couple of brief ideas to help people get past such an error… the PR is here, and if @guy038 or others want to comment on that PR (i.e., not in this forum discussion, so we don’t hijack this specific question), feel free.

        • Coises

          @Kjell-Rilbe, @PeterJones:

          Try:

          ^[^;\r\n]*+(?:;[^;\r\n]*+){12}$(*SKIP)(*FAIL)|^.*$
          

          This is one of those cases where it’s much less confusing to skip what you don’t want to match than to try to specify what you do want to match.
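The same "skip what you don't want" idea can be illustrated in Python, with one caveat: Python's `re` module has no `(*SKIP)(*FAIL)` verbs, so this sketch swaps in a different technique. The first alternative consumes each valid line, and only lines that fall through to the second alternative fill capture group 1. The function name `bad_lines` and the sample data are hypothetical, and `\n` line endings are assumed.

```python
import re

# Alternative 1 consumes lines with exactly 12 semicolons (group 1 stays unset).
# Alternative 2 captures any other line in group 1.
PATTERN = re.compile(r"^[^;\r\n]*(?:;[^;\r\n]*){12}$|^(.*)$", re.MULTILINE)


def bad_lines(text):
    """Return the lines that do not have exactly 12 semicolons."""
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1) is not None]


# Hypothetical sample with "\n" line endings and no trailing newline.
text = "\n".join([
    "a;b;c;d;e;f;g;h;i;j;k;l;m",    # 12 semicolons: skipped
    "a;b;c;d;e;f;g;h;i;j;k;l",      # 11 semicolons: reported
    "a;b;c;d;e;f;g;h;i;j;k;l;m;n",  # 13 semicolons: reported
])
print(bad_lines(text))
```

As with `^.*$` in the regex above, an empty line counts as a bad line here, since it has zero semicolons.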
