RegEx bug with big files

  niente0 (Mar 25, 2019, 1:22 PM)

    Hello, I have a problem with regex searches on big files. If I search for a regex like this:
    [^@]abcdef
    in a file that contains no match for it and is larger than 33336155 characters, the whole text is selected and no “not found” message appears.
    Can you check?
    Thanks
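For comparison, here is a minimal sketch of the behaviour one would expect from this pattern, using Python's `re` module as a stand-in (Notepad++ uses a different regex engine, so this only illustrates the intended semantics; the helper name `find_first` is made up for the example):

```python
import re

# The pattern from the post: any character except '@', followed by "abcdef".
PATTERN = re.compile(r"[^@]abcdef")

def find_first(text):
    """Return the (start, end) span of the first match, or None if there is none."""
    match = PATTERN.search(text)
    return match.span() if match else None

print(find_first("no occurrence here"))  # None: nothing should be selected
print(find_first("see @abcdef only"))    # None: '@' is excluded by [^@]
print(find_first("see xabcdef here"))    # (4, 11): the span of 'xabcdef'
```

With no match, the search should report "not found" rather than select anything, whatever the size of the input.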

    Ekopalypse, replying to @niente0 (Mar 25, 2019, 1:36 PM)

      @niente0

      There are already some known issues of this kind which, unfortunately, are not getting the attention they need.

      Alan Kilborn, replying to @niente0 (Mar 25, 2019, 1:36 PM)

        @niente0 said:

        the whole text is selected and no “not found” message appears

        This sounds like the classic bug with the regex engine described here (and other places):

        • https://github.com/notepad-plus-plus/notepad-plus-plus/issues/4761
        • https://notepad-plus-plus.org/community/topic/12179/regex-select-everything-before-a-particular-word-included-the-line-with-word
        • https://github.com/notepad-plus-plus/notepad-plus-plus/issues/30

        But your regex is fairly simple so I am surprised.

        guy038 (Mar 27, 2019, 10:27 PM; last edited Mar 27, 2019, 10:34 PM)

          Hello, @niente0, @meta-cchuh, @alan-kilborn, @eko-palypse and All,

          I ran some tests with a file that is not that big (a bit more than 32.4 MB!), because this is the size at which the behaviour switches from correct to incorrect :-((

          I used the single line below, which ends with a space character followed by the usual CR + LF characters. Its total size is exactly 100 bytes:

          > 2.  Switch from certificate verification to hashes verification due to "Notepad++" is rejected by CRLF
          

          After duplicating that line a bit more than 340,130 times, I obtained a file of approximately the right size to test some regexes. Here are the results, obtained with N++ v7.6.4 on my old Win XP SP3 laptop:
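A test file of this kind could be generated with a short script. This is only a sketch: the function names and the idea of cutting the data to an exact byte count are assumptions, and the double space after "2." is taken from the line as quoted above.

```python
# The line from the post: 98 visible bytes (the last one a space) + CR + LF.
LINE = (b'2.  Switch from certificate verification to hashes verification '
        b'due to "Notepad++" is rejected by \r\n')

def build_test_data(total_bytes):
    """Repeat LINE and cut the result to exactly total_bytes bytes."""
    copies = total_bytes // len(LINE) + 1
    return (LINE * copies)[:total_bytes]

def write_test_file(path, total_bytes):
    """Write the repeated-line test data to path (hypothetical helper)."""
    with open(path, "wb") as handle:
        handle.write(build_test_data(total_bytes))

print(len(LINE))  # 100
```

For example, `write_test_file("test.txt", 34_013_608)` would produce a file at the smallest size reported as failing for most of the patterns in table A.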

          
          A) With a file WITHOUT any occurrence of the "@" character, NOR the string "abcdef" :
             ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
          
          •-----------------------------•-----------------------------------------•---------------------------------------------------•
          |     REGULAR expressions     |   Size MAXIMUM with a CORRECT Result    |       Size MINIMUM with an INCORRECT Result       |
          •-----------------------------•--------------------•--------------------•--------------------•------------------------------•
          |  [^@]abcdef                 |  33,333,333 Bytes  |    0 occurrence    |  33,333,334 Bytes  |  1 occ. = ALL File Contents  |
          |                             |                    |                    |                    |                              |
          |                             |                    |                    |                    |                              |
          |  [^@\r\n]abcdef             |  34,013,607 Bytes  |    0 occurrence    |  34,013,608 Bytes  |  1 occ. = ALL File Contents  |
          |                             |                    |                    |                    |                              |
          |  [^@\x00-\x1F]abcdef        |  34,013,607 Bytes  |    0 occurrence    |  34,013,608 Bytes  |  1 occ. = ALL File Contents  |
          |                             |                    |                    |                    |                              |
          |  [\x20-?A-~]abcdef          |  34,013,607 Bytes  |    0 occurrence    |  34,013,608 Bytes  |  1 occ. = ALL File Contents  |
          |                             |                    |                    |                    |                              |
          |  [\x20-?@A-~]abcdef         |  34,013,607 Bytes  |    0 occurrence    |  34,013,608 Bytes  |  1 occ. = ALL File Contents  |
          |                             |                    |                    |                    |                              |
          |  [\x20-~]abcdef             |  34,013,607 Bytes  |    0 occurrence    |  34,013,608 Bytes  |  1 occ. = ALL File Contents  |
          |                             |                    |                    |                    |                              |
          |  [\x{0020}-\x{FFFF}]abcdef  |  34,013,607 Bytes  |    0 occurrence    |  34,013,608 Bytes  |  1 occ. = ALL File Contents  |
          •-----------------------------•--------------------•--------------------•--------------------•------------------------------•
          |  @abcdef                    |                    Tested, up to a 200 MB file  =>  0 occurrence ( OK )                     |
          •-----------------------------•-----------------------------------------•--------------------•------------------------------•
          
          
          
          B) With a file containing only ONE occurrence of "@abcdef", near the END :
             ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
          
          •-----------------------------•-----------------------------------------•---------------------------------------------------•
          |     REGULAR expressions     |   Size MAXIMUM with a CORRECT Result    |       Size MINIMUM with an INCORRECT Result       |
          •-----------------------------•--------------------•--------------------•--------------------•------------------------------•
          |  [^@]abcdef                 |  33,333,334 Bytes  |    0 occurrence    |  33,333,335 Bytes  |  1 occ. = ALL File Contents  |
          |                             |                    |                    |                    |                              |
          |                             |                    |                    |                    |                              |
          |  [^@\r\n]abcdef             |  34,013,608 Bytes  |    0 occurrence    |  34,013,609 Bytes  |  1 occ. = ALL File Contents  |
          |                             |                    |                    |                    |                              |
          |  [^@\x00-\x1F]abcdef        |  34,013,608 Bytes  |    0 occurrence    |  34,013,609 Bytes  |  1 occ. = ALL File Contents  |
          |                             |                    |                    |                    |                              |
          |  [\x20-?A-~]abcdef          |  34,013,608 Bytes  |    0 occurrence    |  34,013,609 Bytes  |  1 occ. = ALL File Contents  |
          •-----------------------------•--------------------•--------------------•--------------------•------------------------------•
          |  [\x20-?@A-~]abcdef         |  34,013,617 Bytes  |  1 occ. = @abcdef  |  34,013,618 Bytes  |  1 occ. = ALL File Contents  |
          |                             |                    |                    |                    |                              |
          |  [\x20-~]abcdef             |  34,013,617 Bytes  |  1 occ. = @abcdef  |  34,013,618 Bytes  |  1 occ. = ALL File Contents  |
          |                             |                    |                    |                    |                              |
          |  [\x{0020}-\x{FFFF}]abcdef  |  34,013,617 Bytes  |  1 occ. = @abcdef  |  34,013,618 Bytes  |  1 occ. = ALL File Contents  |
          •-----------------------------•--------------------•--------------------•--------------------•------------------------------•
          |  @abcdef                    |                   Tested up to a 200 MB File  =>  1 occurrence = @abcdef                    |
          •-----------------------------•-----------------------------------------•---------------------------------------------------•
          
          

          Remark: I suppose that the problem arises because, when you search for, say, the regex @bcdef, there is only ONE possibility, but when you search for [^@]bcdef, there are quite a lot of possible matches!!
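The thresholds in table A look roughly consistent with this supposition. The sketch below counts how many bytes the leading character class can match at each threshold. It assumes the file is made of whole 100-byte lines followed by a truncated line with no CR or LF, which is a guess about how the file was cut. It is an observation on the reported numbers, not a confirmed explanation of the bug.

```python
LINE_LEN = 100      # bytes per line, CR + LF included (from the post)
NEWLINE_BYTES = 2   # CR and LF, which [^@\r\n] cannot match

def candidate_positions(file_size, excluded_per_line):
    """Count the bytes that the leading character class can match.

    Assumes whole 100-byte lines followed by a truncated line that
    contains no CR or LF (an assumption, not stated in the post).
    """
    full_lines, remainder = divmod(file_size, LINE_LEN)
    per_line = LINE_LEN - excluded_per_line
    return full_lines * per_line + min(remainder, per_line)

# [^@] matches every byte of a file without '@', CR and LF included.
print(candidate_positions(33_333_333, 0))              # 33333333
# [^@\r\n] skips the two line-ending bytes of each line.
print(candidate_positions(34_013_607, NEWLINE_BYTES))  # 33333335
```

Under these assumptions, both patterns stop working at about 33.3 million candidate positions, even though the file sizes differ by roughly 680,000 bytes.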

          I also tried searching for the regex abcdef and clicking on the Find All in Current Document button. It did show the different lines containing the “abcdef” string! Then, right-clicking inside the Find result panel, I chose the option Find in these found results....

          In this new dialog, I typed the regex [^@]abcdef and ticked the option Search only in found lines. Unfortunately, with big files, it showed only the first line instead of those specific lines :-((


          Finally, the best approach for your specific regex would be (just common sense!):

          • First, replace @abcdef with, for instance, the string @abcdez, using the Replace All button, in Normal or Regular expression search mode

          • Second, search for abcdef and click on the Find Next button OR on Find All in Current Document, in order to find, implicitly, all the occurrences of the regex [^@]abcdef

          • When you have finished processing the [^@]abcdef occurrences, just do the reverse operation, searching for @abcdez and replacing it with @abcdef
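The three steps above can be sketched in Python to show why they find the same places. This is an illustration outside Notepad++, and the function name `find_with_workaround` is made up for the example. It assumes the placeholder @abcdez does not already occur in the text.

```python
import re

def find_with_workaround(text):
    """Return (offsets of 'abcdef' not preceded by '@', restored text)."""
    # Step 1: mask the unwanted hits with a same-length placeholder.
    masked = text.replace("@abcdef", "@abcdez")
    # Step 2: a plain literal search now finds only the wanted occurrences.
    offsets = [m.start() for m in re.finditer("abcdef", masked)]
    # Step 3: undo the masking (safe only if '@abcdez' was absent originally).
    restored = masked.replace("@abcdez", "@abcdef")
    return offsets, restored

sample = "xabcdef @abcdef yabcdef"
offsets, restored = find_with_workaround(sample)
print(offsets)             # [1, 17]
print(restored == sample)  # True

# The direct regex matches one character earlier, because the match
# includes the non-'@' character in front of "abcdef".
print([m.start() + 1 for m in re.finditer(r"[^@]abcdef", sample)])  # [1, 17]
```

One difference to keep in mind: [^@]abcdef needs a character in front of abcdef, so it cannot match at the very start of the file, whereas the literal search for abcdef can.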

          Best Regards,

          guy038
