    RegEx bug with big files

    • niente0

      Hello, I have a problem with RegEx on big files. In particular, if I search for a regex like this:
      [^@]abcdef
      in a file that has no occurrences of it and that is bigger than 33336155 characters, the whole text is selected and no “not found” message appears.
      Can you check?
      Thanks

      • Ekopalypse @niente0

        @niente0

        There are already some known issues which, unfortunately, do not get the necessary attention.

        • Alan Kilborn @niente0

          @niente0 said:

          the whole text is selected and no “not found” message appears

          This sounds like the classic bug with the regex engine described here (and other places):

          • https://github.com/notepad-plus-plus/notepad-plus-plus/issues/4761
          • https://notepad-plus-plus.org/community/topic/12179/regex-select-everything-before-a-particular-word-included-the-line-with-word
          • https://github.com/notepad-plus-plus/notepad-plus-plus/issues/30

          But your regex is fairly simple so I am surprised.

          • guy038

            Hello, @niente0, @meta-cchuh, @alan-kilborn, @eko-palypse and All,

            I did some tests with a not-so-big file (a bit more than 32.4 MB!), because this is the size at which the behaviour switches from correct to incorrect :-((

            I used the single line below, which ends with a space character followed by the usual CR + LF characters. Its total size is exactly 100 bytes.

            > 2.  Switch from certificate verification to hashes verification due to "Notepad++" is rejected by CRLF
            

            After duplicating that line a bit more than 340,130 times, I obtained a file of approximately the right size to test some regexes. Here are the results, obtained with N++ v7.6.4 on my old Win XP SP3 laptop:
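
A test file like this can also be produced with a short script rather than by hand. The sketch below is only an assumption about how one might build it, not the procedure used for the results here: it repeats the 100-byte line and, when the requested size is not a multiple of 100, pads with a prefix of that same line. The names `LINE`, `build_test_data` and `write_test_file` are hypothetical.

```python
# The 100-byte test line: the text, one trailing space, then CR + LF.
LINE = (b'2.  Switch from certificate verification to hashes verification '
        b'due to "Notepad++" is rejected by \r\n')


def build_test_data(size: int) -> bytes:
    """Return exactly `size` bytes made of repeated copies of LINE.

    If `size` is not a multiple of len(LINE), the remainder is filled with
    a prefix of LINE (an assumption; any filler without "@" would do).
    """
    full_lines, remainder = divmod(size, len(LINE))
    return LINE * full_lines + LINE[:remainder]


def write_test_file(path: str, size: int) -> None:
    """Write a test file of exactly `size` bytes to `path`."""
    with open(path, "wb") as handle:
        handle.write(build_test_data(size))
```

For instance, `write_test_file("regex_test.txt", 34_013_608)` would give a file of the smallest size reported as failing for most of the regexes in table A.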

            
            A) With a file WITHOUT any occurrence of the "@" character, NOR the string "abcdef" :
               ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
            
            •-----------------------------•-----------------------------------------•---------------------------------------------------•
            |     REGULAR expressions     |   Size MAXIMUM with a CORRECT Result    |       Size MINIMUM with an INCORRECT Result       |
            •-----------------------------•--------------------•--------------------•--------------------•------------------------------•
            |  [^@]abcdef                 |  33,333,333 Bytes  |    0 occurrence    |  33,333,334 Bytes  |  1 occ. = ALL File Contents  |
            |                             |                    |                    |                    |                              |
            |                             |                    |                    |                    |                              |
            |  [^@\r\n]abcdef             |  34,013,607 Bytes  |    0 occurrence    |  34,013,608 Bytes  |  1 occ. = ALL File Contents  |
            |                             |                    |                    |                    |                              |
            |  [^@\x00-\x1F]abcdef        |  34,013,607 Bytes  |    0 occurrence    |  34,013,608 Bytes  |  1 occ. = ALL File Contents  |
            |                             |                    |                    |                    |                              |
            |  [\x20-?A-~]abcdef          |  34,013,607 Bytes  |    0 occurrence    |  34,013,608 Bytes  |  1 occ. = ALL File Contents  |
            |                             |                    |                    |                    |                              |
            |  [\x20-?@A-~]abcdef         |  34,013,607 Bytes  |    0 occurrence    |  34,013,608 Bytes  |  1 occ. = ALL File Contents  |
            |                             |                    |                    |                    |                              |
            |  [\x20-~]abcdef             |  34,013,607 Bytes  |    0 occurrence    |  34,013,608 Bytes  |  1 occ. = ALL File Contents  |
            |                             |                    |                    |                    |                              |
            |  [\x{0020}-\x{FFFF}]abcdef  |  34,013,607 Bytes  |    0 occurrence    |  34,013,608 Bytes  |  1 occ. = ALL File Contents  |
            •-----------------------------•--------------------•--------------------•--------------------•------------------------------•
            |  @abcdef                    |                    Tested, up to a 200 Mb file  =>  0 occurrence ( OK )                     |
            •-----------------------------•-----------------------------------------•--------------------•------------------------------•
            
            
            
            B) With a file containing only ONE occurrence of "@abcdef", near the END :
               ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
            
            •-----------------------------•-----------------------------------------•---------------------------------------------------•
            |     REGULAR expressions     |   Size MAXIMUM with a CORRECT Result    |       Size MINIMUM with an INCORRECT Result       |
            •-----------------------------•--------------------•--------------------•--------------------•------------------------------•
            |  [^@]abcdef                 |  33,333,334 Bytes  |    0 occurrence    |  33,333,335 Bytes  |  1 occ. = ALL File Contents  |
            |                             |                    |                    |                    |                              |
            |                             |                    |                    |                    |                              |
            |  [^@\r\n]abcdef             |  34,013,608 Bytes  |    0 occurrence    |  34,013,609 Bytes  |  1 occ. = ALL File Contents  |
            |                             |                    |                    |                    |                              |
            |  [^@\x00-\x1F]abcdef        |  34,013,608 Bytes  |    0 occurrence    |  34,013,609 Bytes  |  1 occ. = ALL File Contents  |
            |                             |                    |                    |                    |                              |
            |  [\x20-?A-~]abcdef          |  34,013,608 Bytes  |    0 occurrence    |  34,013,609 Bytes  |  1 occ. = ALL File Contents  |
            •-----------------------------•--------------------•--------------------•--------------------•------------------------------•
            |  [\x20-?@A-~]abcdef         |  34,013,617 Bytes  |  1 occ. = @abcdef  |  34,013,618 Bytes  |  1 occ. = ALL File Contents  |
            |                             |                    |                    |                    |                              |
            |  [\x20-~]abcdef             |  34,013,617 Bytes  |  1 occ. = @abcdef  |  34,013,618 Bytes  |  1 occ. = ALL File Contents  |
            |                             |                    |                    |                    |                              |
            |  [\x{0020}-\x{FFFF}]abcdef  |  34,013,617 Bytes  |  1 occ. = @abcdef  |  34,013,618 Bytes  |  1 occ. = ALL File Contents  |
            •-----------------------------•--------------------•--------------------•--------------------•------------------------------•
            |  @abcdef                    |                   Tested up to a 200 Mb File  =>  1 occurrence = @abcdef                    |
            •-----------------------------•-----------------------------------------•---------------------------------------------------•
            
            

            Remark: I suppose that the problem arises because, when you search for the regex @bcdef, for instance, there is only ONE possibility, but when you search for [^@]bcdef there are quite a lot of possible matches!!

            I also tried searching for the regex abcdef and clicking the Find All in Current Document button. It did show the different lines containing the “abcdef” string! Then, right-clicking inside the Find Result panel, I chose the option Find in these found results....

            In this new dialog, I typed in the regex [^@]abcdef and ticked the option Search only in found lines. Unfortunately, with big files, it showed only the first line instead of these specific lines :-((


            Finally, the best approach for your specific regex would be (just common sense!):

            • Firstly, replace @abcdef with, for instance, the string @abcdez, using the Replace All button, in Normal or Regular expression search mode

            • Secondly, search for abcdef and click on the Find Next button OR the Find All in Current Document button, in order to find, implicitly, all the occurrences of the regex [^@]abcdef

            • When you have finished processing the [^@]abcdef occurrences, just do the reverse operation, searching for @abcdez and replacing with @abcdef
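
The three steps above can also be sketched outside Notepad++. The Python below is a minimal illustration under stated assumptions, not a Notepad++ feature: the helper names (`mask`, `unmask`, `find_unprefixed`) are hypothetical, and the placeholder `abcdez` is assumed not to occur in the text already. Because the placeholder has the same length as the original word, offsets found in the masked text are also valid in the original text.

```python
WORD = "abcdef"
PREFIX = "@"
PLACEHOLDER = "abcdez"  # assumed absent from the text; same length as WORD


def mask(text: str) -> str:
    """Step 1: hide every "@abcdef" by rewriting it as "@abcdez"."""
    if PLACEHOLDER in text:
        raise ValueError("placeholder already present; pick another one")
    return text.replace(PREFIX + WORD, PREFIX + PLACEHOLDER)


def unmask(text: str) -> str:
    """Step 3: restore every "@abcdez" to "@abcdef"."""
    return text.replace(PREFIX + PLACEHOLDER, PREFIX + WORD)


def find_unprefixed(text: str) -> list[int]:
    """Step 2: offsets of each "abcdef" that is not preceded by "@"."""
    masked = mask(text)
    offsets = []
    position = masked.find(WORD)
    while position != -1:
        offsets.append(position)
        # Continue after this hit; "abcdef" cannot overlap itself.
        position = masked.find(WORD, position + len(WORD))
    return offsets
```

One difference from the regex: `[^@]abcdef` needs a character before the word, so it would not match `abcdef` at the very start of the text, whereas this plain search reports offset 0 in that case.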

            Best Regards,

            guy038
