RegEx bug with big files
-
Hello, I have a problem with RegEx on big files, in particular if I search for a regex like this:
[^@]abcdef
in a file that has no occurrences of it and that is bigger than 33336155 characters, the whole text is selected and no “not found” message appears.
Can you check?
Thanks
-
There are already some known issues which, unfortunately, do not get the necessary attention.
-
@niente0 said:
the whole text is selected and no “not found” message appears
This sounds like the classic bug with the regex engine described here (and other places):
- https://github.com/notepad-plus-plus/notepad-plus-plus/issues/4761
- https://notepad-plus-plus.org/community/topic/12179/regex-select-everything-before-a-particular-word-included-the-line-with-word
- https://github.com/notepad-plus-plus/notepad-plus-plus/issues/30
But your regex is fairly simple so I am surprised.
-
Hello, @niente0, @meta-cchuh, @alan-kilborn, @eko-palypse and All,
I did some tests with a not-so-big file (a bit more than 32.4 MB!), simply because this is the size at which the behaviour switches from correct to incorrect :-((
I used the single line below, which ends with a space character and, then, the usual CR + LF chars. Its total size is, exactly, 100 bytes:
> 2. Switch from certificate verification to hashes verification due to "Notepad++" is rejected by CRLF
After duplicating that line a bit more than 340,130 times, I obtained a file of approximately the right size to test some regexes. Here are the results, obtained with N++ v7.6.4, on my old Win XP SP3 laptop:

A) With a file WITHOUT any occurrence of the "@" character, NOR the string "abcdef" :
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

•-----------------------------•-----------------------------------------•---------------------------------------------------•
|     REGULAR expressions     |   Size MAXIMUM with a CORRECT Result    |       Size MINIMUM with an INCORRECT Result       |
•-----------------------------•--------------------•--------------------•--------------------•------------------------------•
|  [^@]abcdef                 |  33,333,333 Bytes  |    0 occurrence    |  33,333,334 Bytes  |  1 occ. = ALL File Contents  |
|  [^@\r\n]abcdef             |  34,013,607 Bytes  |    0 occurrence    |  34,013,608 Bytes  |  1 occ. = ALL File Contents  |
|  [^@\x00-\x1F]abcdef        |  34,013,607 Bytes  |    0 occurrence    |  34,013,608 Bytes  |  1 occ. = ALL File Contents  |
|  [\x20-?A-~]abcdef          |  34,013,607 Bytes  |    0 occurrence    |  34,013,608 Bytes  |  1 occ. = ALL File Contents  |
|  [\x20-?@A-~]abcdef         |  34,013,607 Bytes  |    0 occurrence    |  34,013,608 Bytes  |  1 occ. = ALL File Contents  |
|  [\x20-~]abcdef             |  34,013,607 Bytes  |    0 occurrence    |  34,013,608 Bytes  |  1 occ. = ALL File Contents  |
|  [\x{0020}-\x{FFFF}]abcdef  |  34,013,607 Bytes  |    0 occurrence    |  34,013,608 Bytes  |  1 occ. = ALL File Contents  |
•-----------------------------•--------------------•--------------------•--------------------•------------------------------•
|  @abcdef                    |                     Tested, up to a 200 MB file => 0 occurrence ( OK )                      |
•-----------------------------•-----------------------------------------•---------------------------------------------------•

B) With a file containing only ONE occurrence of "@abcdef", near the END :
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

•-----------------------------•-----------------------------------------•---------------------------------------------------•
|     REGULAR expressions     |   Size MAXIMUM with a CORRECT Result    |       Size MINIMUM with an INCORRECT Result       |
•-----------------------------•--------------------•--------------------•--------------------•------------------------------•
|  [^@]abcdef                 |  33,333,334 Bytes  |    0 occurrence    |  33,333,335 Bytes  |  1 occ. = ALL File Contents  |
|  [^@\r\n]abcdef             |  34,013,608 Bytes  |    0 occurrence    |  34,013,609 Bytes  |  1 occ. = ALL File Contents  |
|  [^@\x00-\x1F]abcdef        |  34,013,608 Bytes  |    0 occurrence    |  34,013,609 Bytes  |  1 occ. = ALL File Contents  |
|  [\x20-?A-~]abcdef          |  34,013,608 Bytes  |    0 occurrence    |  34,013,609 Bytes  |  1 occ. = ALL File Contents  |
•-----------------------------•--------------------•--------------------•--------------------•------------------------------•
|  [\x20-?@A-~]abcdef         |  34,013,617 Bytes  |  1 occ. = @abcdef  |  34,013,618 Bytes  |  1 occ. = ALL File Contents  |
|  [\x20-~]abcdef             |  34,013,617 Bytes  |  1 occ. = @abcdef  |  34,013,618 Bytes  |  1 occ. = ALL File Contents  |
|  [\x{0020}-\x{FFFF}]abcdef  |  34,013,617 Bytes  |  1 occ. = @abcdef  |  34,013,618 Bytes  |  1 occ. = ALL File Contents  |
•-----------------------------•--------------------•--------------------•--------------------•------------------------------•
|  @abcdef                    |                    Tested up to a 200 MB File => 1 occurrence = @abcdef                     |
•-----------------------------•-----------------------------------------•---------------------------------------------------•
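For anyone who wants to reproduce these size thresholds, a test file of an exact byte count can be generated with a short script. This is only a sketch under assumptions: the names `make_line` and `make_test_file`, the output path, and the `x` filler are hypothetical and are not the line used in the tests above. The only properties carried over are a 100-byte line ending with a space plus CR LF, and the absence of "@" and "abcdef".

```python
import os

def make_line(length=100):
    """Build one line of exactly `length` bytes: filler, a space, then CR LF."""
    # Hypothetical filler: contains neither "@" nor "abcdef"
    filler = b"x" * (length - 3)
    return filler + b" \r\n"

def make_test_file(path, target_size, line=None):
    """Repeat `line`, then pad with filler so the file is exactly target_size bytes."""
    line = line or make_line()
    full_lines, remainder = divmod(target_size, len(line))
    with open(path, "wb") as f:
        for _ in range(full_lines):
            f.write(line)
        # Partial last line, so that the exact byte count is reached
        f.write(b"x" * remainder)
    return os.path.getsize(path)
```

For instance, `make_test_file("big.txt", 33_333_334)` should produce a file of the smallest size reported as failing for `[^@]abcdef` in table A. Whether the threshold is the same with a different filler line or another N++ version is not something this sketch can tell.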
Remark : I suppose that the problem arises because, when you search, for instance, for the regex @bcdef, there is only ONE possibility, but when you search for [^@]bcdef, there are quite a lot of possible matches !!
I also tried to search for the regex abcdef and clicked on the Find All in Current Document button. It did show the different lines containing the “abcdef” string ! Then, right-clicking inside the Find Result panel, I chose the option Find in these found results... In this new dialog, I typed in the regex [^@]abcdef and ticked the option Search only in found lines. Unfortunately, when dealing with big files, it didn’t show these specific lines but only the first one :-((
Finally, the best approach, regarding your specific regex, would be ( just common sense ! ) :
- Firstly, replace @abcdef with, for instance, the string @abcdez, with the Replace All button, in Normal or Regular expression search mode
- Secondly, search for abcdef and click on the Find Next button OR the Find All in Current Document button, in order to find, implicitly, all the occurrences of the regex [^@]abcdef
- When your treatment of the [^@]abcdef occurrences is over, just do the reverse operation, searching for @abcdez and replacing with @abcdef
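The three steps above can also be sketched outside Notepad++. The snippet below is only an illustration under assumptions: the helper name `find_not_preceded_by_at` is hypothetical, the placeholder `@abcdez` must not already occur in the text, and Python's `re` module is mentioned only as a cross-check, not because it behaves like the N++ regex engine.

```python
WORD = "abcdef"
UNWANTED = "@abcdef"
PLACEHOLDER = "@abcdez"  # same length as UNWANTED, so offsets are unchanged

def find_not_preceded_by_at(text):
    """Return the start offsets of 'abcdef' occurrences not preceded by '@'."""
    if PLACEHOLDER in text:
        raise ValueError("placeholder already occurs in the text")
    # Step 1: hide every '@abcdef' behind the placeholder
    masked = text.replace(UNWANTED, PLACEHOLDER)
    # Step 2: a plain literal search now hits only the wanted occurrences
    offsets = []
    pos = masked.find(WORD)
    while pos != -1:
        offsets.append(pos)
        pos = masked.find(WORD, pos + 1)
    # Step 3 (restoring '@abcdef') is implicit: `text` itself is never modified
    return offsets
```

On small inputs, the result can be compared with `[m.start() + 1 for m in re.finditer(r"[^@]abcdef", text)]`. One difference to keep in mind: an "abcdef" at the very beginning of the text is found by the literal search but not by `[^@]abcdef`, since that regex needs one preceding character.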
Best Regards,
guy038
-