Hi, @coises and All,
You said,
My thought is that it should be the same things Scintilla recognizes as line breaks and the Notepad++ documentation states: just \n and \r.
I think that this reasoning is the right one ! Moreover, note that we use the same reasoning when we want to match every char but a specific one, within each single line : we use the regex [^c\r\n], where c is the character we want to exclude !
Thus, against my Total_Chars.txt file, the regex (?s). should return 325,590 occurrences and the regex (?-s). should return 325,588 occurrences
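As a rough illustration of why the two counts should differ exactly by the number of \r and \n characters, here is a minimal Python sketch. Note that this is Python's re module, not the Boost engine: Python's plain dot only excludes \n, so the explicit class [^\r\n] stands in for (?-s). here. The sample string and the function name count_matches are hypothetical and have nothing to do with the Total_Chars.txt file.

```python
import re

def count_matches(text):
    """Return (count of all chars, count of chars that are not CR or LF)."""
    # (?s). matches every character, line breaks included.
    with_breaks = len(re.findall(r"(?s).", text))
    # Python's plain '.' only excludes \n, so the explicit class [^\r\n]
    # is used to mimic the (?-s). behaviour discussed above.
    without_breaks = len(re.findall(r"[^\r\n]", text))
    return with_breaks, without_breaks

sample = "abc\r\ndef"  # hypothetical sample: 6 letters plus one CR and one LF
total, non_breaks = count_matches(sample)
print(total, non_breaks, total - non_breaks)  # 8 6 2
```

The difference between the two counts is simply the number of \r and \n characters in the text.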
Now, regarding my question :
Just because you do not allow backward searches when choosing the Regular expression search mode ! Maybe you could add it among all the Columns++ options ?
I do understand all the reasons why you are not inclined to do so ! However, note that, as I regularly use the regexBackward4PowerUser="yes" option, in the FindHistory node of the config.xml file, I can assure you that a lot of regexes, but not all, can be processed in the backward direction ! Unfortunately, with our present Boost regex engine, you can verify the following assertion :
Backward regex searches, for NON ANSI files, stop as soon as they match a character with a code-point over \x{007F}
I also tested the search for invalid UTF-8 bytes. To do so :
Open a new N++ tab. ( I assume that its current encoding is UTF-8 ! )
Run the Encoding > Convert to ANSI menu option
Paste the text below, in this new ANSI tab
ABC ퟿ XYZ \x{D7FF} ED 9F BF LAST valid char BEFORE Surrogates range
ABC í € XYZ \x{D800} ED A0 80 FIRST SURROGATE char
ABC í¿¿ XYZ \x{DFFF} ED BF BF LAST SURROGATE char
ABC  XYZ \x{E000} EE 80 80 First valid char AFTER Surrogates range
ABC € XYZ
ABC XYZ
ABC ‚ XYZ
ABC ƒ XYZ
ABC „ XYZ
ABC … XYZ
ABC † XYZ
ABC ‡ XYZ
ABC ˆ XYZ
ABC ‰ XYZ
ABC Š XYZ
ABC ‹ XYZ
ABC Œ XYZ
ABC XYZ
ABC Ž XYZ
ABC XYZ
ABC XYZ
ABC ‘ XYZ
ABC ’ XYZ
ABC “ XYZ
ABC ” XYZ
ABC • XYZ
ABC – XYZ
ABC — XYZ
ABC ˜ XYZ
ABC ™ XYZ
ABC š XYZ
ABC › XYZ
ABC œ XYZ
ABC XYZ
ABC ž XYZ
ABC Ÿ XYZ
ABC XYZ
ABC ¡ XYZ
ABC ¢ XYZ
ABC £ XYZ
ABC ¤ XYZ
ABC ¥ XYZ
ABC ¦ XYZ
ABC § XYZ
ABC ¨ XYZ
ABC © XYZ
ABC ª XYZ
ABC « XYZ
ABC ¬ XYZ
ABC XYZ
ABC ® XYZ
ABC ¯ XYZ
ABC ° XYZ
ABC ± XYZ
ABC ² XYZ
ABC ³ XYZ
ABC ´ XYZ
ABC µ XYZ
ABC ¶ XYZ
ABC · XYZ
ABC ¸ XYZ
ABC ¹ XYZ
ABC º XYZ
ABC » XYZ
ABC ¼ XYZ
ABC ½ XYZ
ABC ¾ XYZ
ABC ¿ XYZ
ABC À XYZ
ABC Á XYZ
ABC Â XYZ
ABC Ã XYZ
ABC Ä XYZ
ABC Å XYZ
ABC Æ XYZ
ABC Ç XYZ
ABC È XYZ
ABC É XYZ
ABC Ê XYZ
ABC Ë XYZ
ABC Ì XYZ
ABC Í XYZ
ABC Î XYZ
ABC Ï XYZ
ABC Ð XYZ
ABC Ñ XYZ
ABC Ò XYZ
ABC Ó XYZ
ABC Ô XYZ
ABC Õ XYZ
ABC Ö XYZ
ABC × XYZ
ABC Ø XYZ
ABC Ù XYZ
ABC Ú XYZ
ABC Û XYZ
ABC Ü XYZ
ABC Ý XYZ
ABC Þ XYZ
ABC ß XYZ
ABC à XYZ
ABC á XYZ
ABC â XYZ
ABC ã XYZ
ABC ä XYZ
ABC å XYZ
ABC æ XYZ
ABC ç XYZ
ABC è XYZ
ABC é XYZ
ABC ê XYZ
ABC ë XYZ
ABC ì XYZ
ABC í XYZ
ABC î XYZ
ABC ï XYZ
ABC ð XYZ
ABC ñ XYZ
ABC ò XYZ
ABC ó XYZ
ABC ô XYZ
ABC õ XYZ
ABC ö XYZ
ABC ÷ XYZ
ABC ø XYZ
ABC ù XYZ
ABC ú XYZ
ABC û XYZ
ABC ü XYZ
ABC ý XYZ
ABC þ XYZ
ABC ÿ XYZ
Now, choose the Encoding > UTF-8 option. So all the characters of this ANSI file are re-interpreted as if they were UTF-8 chars
=> You should see, between the strings ABC and XYZ :
The last VALID UTF-8 char ( ED 9F BF ) before the SURROGATE range
The 3-byte sequence of the first SURROGATE char, which is an INVALID sequence
The 3-byte sequence of the last SURROGATE char, which is an INVALID sequence
The first VALID UTF-8 char ( EE 80 80 ) after the SURROGATE range
Then, a list of the 128 INVALID UTF-8 characters, as the UTF-8 encoding does NOT allow any 1-byte character OVER \x{007F} !
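The valid / invalid status of these byte sequences can be cross-checked outside Notepad++. Below is a minimal sketch that relies on Python's strict UTF-8 decoder, which rejects encoded surrogates and lone bytes over \x7F. The helper name is_valid_utf8 is hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode as strict UTF-8, False otherwise."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False

samples = {
    "U+D7FF, last char before surrogates": b"\xED\x9F\xBF",
    "U+D800, first surrogate":             b"\xED\xA0\x80",
    "U+DFFF, last surrogate":              b"\xED\xBF\xBF",
    "U+E000, first char after surrogates": b"\xEE\x80\x80",
}
for label, raw in samples.items():
    print(label, raw.hex(" ").upper(), is_valid_utf8(raw))

# Every byte from 0x80 to 0xFF, taken on its own, is an invalid sequence.
lone_invalid = [b for b in range(0x80, 0x100) if not is_valid_utf8(bytes([b]))]
print(len(lone_invalid))  # 128
```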
Now :
Move the caret to the first empty line
Run the option Plugins > Columns++ > Search...
Enter the range [\x{DC80}-\x{DCFF}] in the Find what : zone
Click on the Find First button
=>
The Search region is set to the entire document
The first INVALID byte \xED is selected
Click on the Find Next button => It will select, one after another, all the other INVALID UTF-8 characters of this new tab !
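For comparison, Python offers a similar convention : its surrogateescape error handler maps each undecodable byte \xNN to the code-point \x{DCNN}, so the same range [\x{DC80}-\x{DCFF}] finds the invalid bytes. Whether Columns++ uses this exact mapping internally is only an assumption on my side, based on the range used above. The function name find_invalid_bytes is hypothetical.

```python
import re

# Lone code points U+DC80..U+DCFF stand for undecodable bytes 0x80..0xFF
# when decoding with Python's "surrogateescape" error handler.
INVALID_BYTE = re.compile(r"[\udc80-\udcff]")

def find_invalid_bytes(data: bytes) -> list:
    """Return the values of the bytes that are not part of valid UTF-8."""
    text = data.decode("utf-8", errors="surrogateescape")
    # Subtracting 0xDC00 recovers the original byte value.
    return [ord(ch) - 0xDC00 for ch in INVALID_BYTE.findall(text)]

print(find_invalid_bytes(b"ABC \xED\x9F\xBF XYZ"))  # valid U+D7FF, nothing found
print([hex(b) for b in find_invalid_bytes(b"ABC \xED\xA0\x80 XYZ")])
print([hex(b) for b in find_invalid_bytes(b"ABC \x80 XYZ")])
```

In this sketch too, the first byte reported for the first SURROGATE sequence is \xED, then each following byte is reported separately.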
So, @coises, your new implementation works correctly, regarding the INVALID UTF-8 chars and I’m longing for your second experimental version ;-))
Best Regards,
guy038