RegEx bug with big files



  • Hello, I have a problem with RegEx on big files. In particular, if I search for a regex like this:
    [^@]abcdef
    in a file that has no occurrences of it and that is bigger than 33336155 characters, the whole text is selected and no “not found” message appears.
    Can you check?
    Thanks



  • @niente0

    there are already some known issues which, unfortunately, do not get the necessary attention.



  • @niente0 said:

    the whole text is selected and no “not found” message appears

    This sounds like the classic bug with the regex engine described here (and other places):

    But your regex is fairly simple so I am surprised.



  • Hello, @niente0, @meta-cchuh, @alan-kilborn, @eko-palypse and All,

    I ran some tests with a not-so-big file (a bit more than 32.4 MB!), because this is the size at which the behaviour flips from correct to incorrect :-((

    I used the single line below, which ends with a space character followed by the usual CR + LF characters. Its total size is exactly 100 bytes:

    > 2.  Switch from certificate verification to hashes verification due to "Notepad++" is rejected by CRLF
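
    For anyone who wants to reproduce this, here is a minimal Python sketch that builds test data of an exact size from the 100-byte line above. The names make_test_data and write_test_file are made up for illustration, and cutting the last copy short to hit an exact byte count is only an assumption about how the odd sizes in the tables were reached.

```python
# The 100-byte test line: 98 characters (the last one a space) + CR + LF.
LINE = (b'2.  Switch from certificate verification to hashes verification '
        b'due to "Notepad++" is rejected by \r\n')


def make_test_data(size: int) -> bytes:
    """Return exactly `size` bytes made of repeated copies of LINE.

    The last copy is cut short when `size` is not a multiple of len(LINE).
    """
    full, rest = divmod(size, len(LINE))
    return LINE * full + LINE[:rest]


def write_test_file(path: str, size: int) -> None:
    """Write a test file of exactly `size` bytes (hypothetical helper)."""
    with open(path, "wb") as f:
        f.write(make_test_data(size))
```

    For example, write_test_file("regex_test.txt", 34_013_608) would produce a file of the smallest size reported as failing for most of the regexes in case A.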
    

    After duplicating that line a bit more than 340,130 times, I obtained a file of approximately the right size to test some regexes. Here are the results, obtained with N++ v7.6.4 on my old Win XP SP3 laptop:

    
    A) With a file WITHOUT any occurrence of the "@" character, NOR the string "abcdef" :
       ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
    
    •-----------------------------•-----------------------------------------•---------------------------------------------------•
    |     REGULAR expressions     |   Size MAXIMUM with a CORRECT Result    |       Size MINIMUM with an INCORRECT Result       |
    •-----------------------------•--------------------•--------------------•--------------------•------------------------------•
    |  [^@]abcdef                 |  33,333,333 Bytes  |    0 occurrence    |  33,333,334 Bytes  |  1 occ. = ALL File Contents  |
    |                             |                    |                    |                    |                              |
    |                             |                    |                    |                    |                              |
    |  [^@\r\n]abcdef             |  34,013,607 Bytes  |    0 occurrence    |  34,013,608 Bytes  |  1 occ. = ALL File Contents  |
    |                             |                    |                    |                    |                              |
    |  [^@\x00-\x1F]abcdef        |  34,013,607 Bytes  |    0 occurrence    |  34,013,608 Bytes  |  1 occ. = ALL File Contents  |
    |                             |                    |                    |                    |                              |
    |  [\x20-?A-~]abcdef          |  34,013,607 Bytes  |    0 occurrence    |  34,013,608 Bytes  |  1 occ. = ALL File Contents  |
    |                             |                    |                    |                    |                              |
    |  [\x20-?@A-~]abcdef         |  34,013,607 Bytes  |    0 occurrence    |  34,013,608 Bytes  |  1 occ. = ALL File Contents  |
    |                             |                    |                    |                    |                              |
    |  [\x20-~]abcdef             |  34,013,607 Bytes  |    0 occurrence    |  34,013,608 Bytes  |  1 occ. = ALL File Contents  |
    |                             |                    |                    |                    |                              |
    |  [\x{0020}-\x{FFFF}]abcdef  |  34,013,607 Bytes  |    0 occurrence    |  34,013,608 Bytes  |  1 occ. = ALL File Contents  |
    •-----------------------------•--------------------•--------------------•--------------------•------------------------------•
    |  @abcdef                    |                    Tested, up to a 200 MB file  =>  0 occurrence ( OK )                     |
    •-----------------------------•-----------------------------------------•--------------------•------------------------------•
    
    
    
    B) With a file containing only ONE occurrence of "@abcdef", near the END :
       ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
    
    •-----------------------------•-----------------------------------------•---------------------------------------------------•
    |     REGULAR expressions     |   Size MAXIMUM with a CORRECT Result    |       Size MINIMUM with an INCORRECT Result       |
    •-----------------------------•--------------------•--------------------•--------------------•------------------------------•
    |  [^@]abcdef                 |  33,333,334 Bytes  |    0 occurrence    |  33,333,335 Bytes  |  1 occ. = ALL File Contents  |
    |                             |                    |                    |                    |                              |
    |                             |                    |                    |                    |                              |
    |  [^@\r\n]abcdef             |  34,013,608 Bytes  |    0 occurrence    |  34,013,609 Bytes  |  1 occ. = ALL File Contents  |
    |                             |                    |                    |                    |                              |
    |  [^@\x00-\x1F]abcdef        |  34,013,608 Bytes  |    0 occurrence    |  34,013,609 Bytes  |  1 occ. = ALL File Contents  |
    |                             |                    |                    |                    |                              |
    |  [\x20-?A-~]abcdef          |  34,013,608 Bytes  |    0 occurrence    |  34,013,609 Bytes  |  1 occ. = ALL File Contents  |
    •-----------------------------•--------------------•--------------------•--------------------•------------------------------•
    |  [\x20-?@A-~]abcdef         |  34,013,617 Bytes  |  1 occ. = @abcdef  |  34,013,618 Bytes  |  1 occ. = ALL File Contents  |
    |                             |                    |                    |                    |                              |
    |  [\x20-~]abcdef             |  34,013,617 Bytes  |  1 occ. = @abcdef  |  34,013,618 Bytes  |  1 occ. = ALL File Contents  |
    |                             |                    |                    |                    |                              |
    |  [\x{0020}-\x{FFFF}]abcdef  |  34,013,617 Bytes  |  1 occ. = @abcdef  |  34,013,618 Bytes  |  1 occ. = ALL File Contents  |
    •-----------------------------•--------------------•--------------------•--------------------•------------------------------•
    |  @abcdef                    |                   Tested up to a 200 MB File  =>  1 occurrence = @abcdef                    |
    •-----------------------------•-----------------------------------------•---------------------------------------------------•
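
    The "correct result" columns above can be cross-checked outside Notepad++. Below is a minimal sketch using Python's re module, which is a different engine from the one in Notepad++, so it only confirms the expected match counts and says nothing about the size limit itself. The function name count_matches and the filler text are made up for illustration.

```python
import re

# Patterns from the tables that Python's `re` can parse as-is.
# The [\x{0020}-\x{FFFF}] form is not valid Python `re` syntax, so it is left out.
PATTERNS = [
    rb"[^@]abcdef",
    rb"[^@\r\n]abcdef",
    rb"[^@\x00-\x1F]abcdef",
    rb"[\x20-?A-~]abcdef",
    rb"[\x20-?@A-~]abcdef",
    rb"[\x20-~]abcdef",
    rb"@abcdef",
]


def count_matches(data: bytes) -> dict[bytes, int]:
    """Map each pattern to its number of non-overlapping matches in `data`."""
    return {p: len(re.findall(p, data)) for p in PATTERNS}


if __name__ == "__main__":
    filler = b"some text without the search string \r\n" * 1000
    case_a = filler                    # no "@" and no "abcdef"
    case_b = filler + b"@abcdef\r\n"   # one "@abcdef" near the end
    for name, data in (("A", case_a), ("B", case_b)):
        for pattern, n in count_matches(data).items():
            print(name, pattern.decode(), n)
```

    In case B, only the patterns whose character class includes "@" (and the literal @abcdef) should report one match; every pattern that excludes "@" should report none.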
    
    

    Remark: I suppose that the problem arises because, when you search for the regex @bcdef, for instance, there is only ONE possibility, but when you search for [^@]bcdef there are quite a lot of possible matches!!

    I also tried searching for the regex abcdef and clicking on the Find All in Current Document button. It did show the different lines containing the “abcdef” string! Then, right-clicking inside the Find Result panel, I chose the option Find in these found results....

    In this new dialog, I typed in the regex [^@]abcdef and ticked the option Search only in found lines. Unfortunately, when dealing with big files, it didn’t show these specific lines but only the first one :-((


    Finally, the best approach for your specific regex would be (just common sense!):

    • Firstly, replace @abcdef with, for instance, the string @abcdez, with the Replace All button, in Normal or Regular expression search mode

    • Secondly, search for abcdef and click on the Find Next button OR the Find All in Current Document, in order to find out, implicitly, all the occurrences of the regex [^@]abcdef

    • Once you have finished processing the [^@]abcdef occurrences, just do the reverse operation, searching for @abcdez and replacing it with @abcdef
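
    The three steps above can be sketched in Python as follows. This is only an illustration of the masking idea on an in-memory string, not something Notepad++ runs; the function name find_unprefixed is made up, and the sketch assumes the placeholder @abcdez does not already occur in the text. Note that, unlike [^@]abcdef, a plain search for abcdef also finds an occurrence at the very start of the file and does not include the preceding character in the match.

```python
def find_unprefixed(text: str, needle: str = "abcdef", prefix: str = "@") -> list[int]:
    """Return offsets of `needle` occurrences not directly preceded by `prefix`."""
    unwanted = prefix + needle
    # Same-length placeholder ("@abcdez"), so offsets stay valid in the original.
    placeholder = prefix + needle[:-1] + "z"
    if placeholder == unwanted or placeholder in text:
        raise ValueError("placeholder clashes with the text; pick another one")

    # Step 1: mask the occurrences we do NOT want to find.
    masked = text.replace(unwanted, placeholder)

    # Step 2: plain literal search for the needle.
    offsets = []
    pos = masked.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = masked.find(needle, pos + 1)

    # Step 3: the reverse replacement restores the original text.
    assert masked.replace(placeholder, unwanted) == text
    return offsets
```

    For instance, on the text "xabcdef @abcdef abcdef" this returns the offsets 1 and 16, skipping the occurrence that follows the "@".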

    Best Regards,

    guy038

