Hi, @coises and All,
You said,
My thought is that it should be the same things Scintilla recognizes as line breaks and the Notepad++ documentation states: just \n and \r.
I think that this reasoning is the right one ! Moreover, note that we use the same reasoning when we want to find all chars but a specific one, in each single line : we use the regex [^c\r\n], where c is the character that we want to exclude !
Thus, against my Total_Chars.txt file, the regex (?s). should return 325,590 occurrences and the regex (?-s). should return 325,588 occurrences
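To illustrate the counting logic above, here is a minimal Python sketch. It does not use the Boost engine of Notepad++ or Columns++ : the classes [\s\S] and [^\r\n] are only stand-ins for (?s). and (?-s). under the rule that just \n and \r are line breaks, and the sample string is hypothetical.

```python
import re

# Hypothetical sample text containing a single CRLF line break
sample = "first line\r\nsecond line"

# Stand-in for (?s). : every character, line breaks included
all_chars = re.findall(r"[\s\S]", sample)

# Stand-in for (?-s). when only \r and \n count as line breaks
non_breaks = re.findall(r"[^\r\n]", sample)

# Every character of a line except the letter 'c'
not_c = re.findall(r"[^c\r\n]", sample)

print(len(all_chars), len(non_breaks), len(not_c))
```

With this sample, the first two counts differ by exactly the number of \r and \n characters, which is the same kind of gap as between the two totals expected for Total_Chars.txt.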
Now, regarding my question :
Just because you do not allow backward searches when choosing the Regular expression search mode ! Maybe you could add it among all the Columns++ options ?
I do understand all the reasons why you are not inclined to do so ! However, note that, as I regularly use the regexBackward4PowerUser="yes" option, in the FindHistory node of the config.xml file, I can assure you that a lot of regexes, but not all, can be processed in the backward direction ! Unfortunately, with our present Boost regex engine, you can verify the following limitation :
Backward regex searches, for NON-ANSI files, stop as soon as they match a character with code-point over \x{007F}
I also tested the search of invalid UTF-8 bytes. To do so :
Open a new N++ tab. ( I assume that its current encoding is UTF-8 ! )
Run the Encoding > Convert to ANSI menu option
Paste the text below, in this new ANSI tab
ABC ퟿ XYZ \x{D7FF} ED 9F BF LAST valid char BEFORE Surrogates range ABC í € XYZ \x{D800} ED A0 80 FIRST SURROGATE char ABC í¿¿ XYZ \x{DFFF} ED BF BF LAST SURROGATE char ABC  XYZ \x{E000} EE 80 80 First valid char AFTER Surrogates range ABC € XYZ ABC XYZ ABC ‚ XYZ ABC ƒ XYZ ABC „ XYZ ABC … XYZ ABC † XYZ ABC ‡ XYZ ABC ˆ XYZ ABC ‰ XYZ ABC Š XYZ ABC ‹ XYZ ABC Œ XYZ ABC XYZ ABC Ž XYZ ABC XYZ ABC XYZ ABC ‘ XYZ ABC ’ XYZ ABC “ XYZ ABC ” XYZ ABC • XYZ ABC – XYZ ABC — XYZ ABC ˜ XYZ ABC ™ XYZ ABC š XYZ ABC › XYZ ABC œ XYZ ABC XYZ ABC ž XYZ ABC Ÿ XYZ ABC XYZ ABC ¡ XYZ ABC ¢ XYZ ABC £ XYZ ABC ¤ XYZ ABC ¥ XYZ ABC ¦ XYZ ABC § XYZ ABC ¨ XYZ ABC © XYZ ABC ª XYZ ABC « XYZ ABC ¬ XYZ ABC XYZ ABC ® XYZ ABC ¯ XYZ ABC ° XYZ ABC ± XYZ ABC ² XYZ ABC ³ XYZ ABC ´ XYZ ABC µ XYZ ABC ¶ XYZ ABC · XYZ ABC ¸ XYZ ABC ¹ XYZ ABC º XYZ ABC » XYZ ABC ¼ XYZ ABC ½ XYZ ABC ¾ XYZ ABC ¿ XYZ ABC À XYZ ABC Á XYZ ABC  XYZ ABC à XYZ ABC Ä XYZ ABC Å XYZ ABC Æ XYZ ABC Ç XYZ ABC È XYZ ABC É XYZ ABC Ê XYZ ABC Ë XYZ ABC Ì XYZ ABC Í XYZ ABC Î XYZ ABC Ï XYZ ABC Ð XYZ ABC Ñ XYZ ABC Ò XYZ ABC Ó XYZ ABC Ô XYZ ABC Õ XYZ ABC Ö XYZ ABC × XYZ ABC Ø XYZ ABC Ù XYZ ABC Ú XYZ ABC Û XYZ ABC Ü XYZ ABC Ý XYZ ABC Þ XYZ ABC ß XYZ ABC à XYZ ABC á XYZ ABC â XYZ ABC ã XYZ ABC ä XYZ ABC å XYZ ABC æ XYZ ABC ç XYZ ABC è XYZ ABC é XYZ ABC ê XYZ ABC ë XYZ ABC ì XYZ ABC í XYZ ABC î XYZ ABC ï XYZ ABC ð XYZ ABC ñ XYZ ABC ò XYZ ABC ó XYZ ABC ô XYZ ABC õ XYZ ABC ö XYZ ABC ÷ XYZ ABC ø XYZ ABC ù XYZ ABC ú XYZ ABC û XYZ ABC ü XYZ ABC ý XYZ ABC þ XYZ ABC ÿ XYZ Now, choose the Encoding > UTF-8 encoding. So all characters of this ANSI file are re-interpreted as they were UTf_8 chars=> You should see, between the strings ABC and XYZ :
The last VALID UTF-8 char ( ED 9F BF ) before the SURROGATE range
The 3-byte sequence of the first SURROGATE char, which is an INVALID sequence
The 3-byte sequence of the last SURROGATE char, which is an INVALID sequence
The first VALID UTF-8 char ( EE 80 80 ) after the SURROGATE range
Then, a list of the 128 INVALID UTF-8 characters, as the UTF-8 encoding does NOT allow any 1-byte character OVER \x{007F} !
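The validity claims in the list above can be cross-checked outside Notepad++. The sketch below relies only on Python's strict UTF-8 decoder; the helper name is_valid_utf8 is hypothetical and has nothing to do with Columns++ itself.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# The four 3-byte sequences around the surrogate range
boundaries = {
    "ED 9F BF": is_valid_utf8(bytes.fromhex("ED9FBF")),  # U+D7FF
    "ED A0 80": is_valid_utf8(bytes.fromhex("EDA080")),  # first surrogate
    "ED BF BF": is_valid_utf8(bytes.fromhex("EDBFBF")),  # last surrogate
    "EE 80 80": is_valid_utf8(bytes.fromhex("EE8080")),  # U+E000
}

# The 128 single bytes from 0x80 to 0xFF, each taken alone
lone_high_bytes_valid = [is_valid_utf8(bytes([b])) for b in range(0x80, 0x100)]

print(boundaries)
print(any(lone_high_bytes_valid))
```

Only the two sequences just outside the surrogate range are reported as valid, and none of the 128 lone bytes is.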
Now :
Move the caret to the first empty line
Run the option Plugins > Columns++ > Search...
Enter the range [\x{DC80}-\x{DCFF}] in the Find what : zone
Click on the Find First button
=>
The Search region is set to the entire document
The first INVALID byte \xED is selected
Click on the Find Next button => It will select, one after another, all the other INVALID UTF-8 characters of this new tab !
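The range [\x{DC80}-\x{DCFF}] suggests that each invalid byte is exposed to the regex as the code-point \x{DC00} plus the byte value. I assume this from the test above; I have not checked the Columns++ source. Python's surrogateescape error handler follows the same convention, so the sketch below mimics the search with hypothetical sample bytes :

```python
import re

# Hypothetical sample: a surrogate sequence (ED A0 80) and two lone high bytes
raw = b"ABC \xed\xa0\x80 XYZ\r\nABC \x80 XYZ\r\nABC \xff XYZ"

# Each undecodable byte 0xNN becomes the lone surrogate U+DC00 + 0xNN
text = raw.decode("utf-8", errors="surrogateescape")

# Same range as the one typed in the Find what : zone
invalid = re.compile("[\udc80-\udcff]")

# Report (position, original byte) for every invalid byte found
hits = [(m.start(), hex(ord(m.group()) - 0xDC00)) for m in invalid.finditer(text)]
print(hits)
```

As in the Columns++ test, the first hit is the byte \xED of the first SURROGATE sequence, and each of its three bytes is reported separately.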
So, @coises, your new implementation works correctly, regarding the INVALID UTF-8 chars and I’m longing for your second experimental version ;-))
Best Regards,
guy038