Columns++ version 1.2: better Unicode search
-
@Alan-Kilborn said in Columns++ version 1.2: better Unicode search:
I would go so far as to suggest something that looks and operates like the Notepad++ Find dialog and its tabs.
Then someone using the plugin would not have to learn anything new, and would feel “right at home”.
You know why I suggest this, right? :-)

I think I do, but to be honest, if and when I take on such a project, non-trivial user interface changes would be the whole point. Given that, I'm not sure I'd want to tie myself to recreating a legacy user interface and using it as an underlying model. Familiarity would be a plus, but I am unlikely to impose it on myself as a constraint.
This is all far enough down the road that someone else might well get to it before I do, anyway. I have at least two other self-assignments that would come first, and that’s just in the realm of computer programming.
-
Hello, @coises and All,
Refer to this FAQ, which I've just updated with references to your latest `Columns++` v1.2 release.

Best Regards,
guy038
-
@guy038 said in Columns++ version 1.2: better Unicode search:
Refer to this FAQ, which I've just updated with references to your latest `Columns++` v1.2 release.

Thank you for mentioning Columns++. Might I suggest a couple of things?
-
Columns++ version 1.2 is available using Plugins Admin in Notepad++ 8.7.8; there is no need to do the more complicated installation procedure.
-
I fear that your FAQ might make some readers expect that the search in Columns++ is a replacement for Notepad++ search, when it really isn’t. There are many things Notepad++ search can do (finding in all open files, finding and replacing in multiple files, etc.) that Columns++ search does not do and almost certainly never will. Its original, and still primary, reason for existence is to make it possible to find and replace within a rectangular selection — something Notepad++ search cannot do. There is also the extension of using mathematical formulas in replacements. I would recommend perhaps a link to the online help file sections about Search and Regular Expressions to clarify when Columns++ might be useful.
-
It might be unclear that while the progress dialog change applies to all Count, Select and Replace All actions in Columns++ search, the other changes you mentioned apply only to searches in Unicode files. Searches in “ANSI” files work the same as always. (The ability to search in regions based on a rectangular or multiple selection also applies to all searches, and the ability to use formulas in replacements applies to all regular expression searches.)
-
-
Hello, @coises and All,
You said :
Columns++ version 1.2 is available using Plugins Admin in Notepad++ 8.7.8; there is no need to do the more complicated installation procedure.
Well, I updated the regex documentation with the N++ release `v8.7.6` and, at that time, `Columns++` did not seem to belong to the plugins list!?
You said :
I fear that your FAQ might make some readers expect that the search in Columns++ is a replacement for Notepad++ search, when it really isn’t.
I agree that I did not present your plugin the right way. So I made some modifications, and I hope you'll agree with the new phrasing!
You said :
… the other changes you mentioned apply only to searches in Unicode files. Searches in “ANSI” files work the same as always.
Well, your assertion is a bit paradoxical, regarding the title of this post! Indeed, your title says:
Columns++ version 1.2: better Unicode search !!
And anyway, against an `ANSI` file, any search for a Unicode property triggers an `Invalid Regex` message! So the benefit of this improved version is not so obvious for `ANSI` files. However, I did add a mention which clearly says that search and replace are correct with `ANSI` files, too.

However, I noticed an odd thing:
-

Write these five characters `,¼½¾,` in a new `UTF-8` tab

-

Ask Columns++ to select all the punctuation characters with the `[[:punct:]]` regex

=> It correctly finds the two commas only, as the fractions have the Unicode `\p{other Number}` property and are not punctuation chars

-

Now, convert this `UTF-8` file to an `ANSI` file, with the `Encoding > Convert to ANSI` option

-

Re-try the `[[:punct:]]` regex against this, from now on, `ANSI` file

=> This time, all five characters are selected!?

If you try the `\p{other Number}` regex, it returns, as expected, an error message!
In your documentation, regarding your last sequence `[[.x80.]]–[[.xff.]]`, right before the `Search` file section:

At first sight, it's not a character class and it seems to be an invalid regular expression. Surprisingly, it's not! Actually, it's a sequence of three consecutive characters:
-

An invalid `\x80` UTF-8 byte

-

An EN DASH character ( `\x{2013}` )

-

An invalid `\xff` UTF-8 byte
So, @Coises, just modify this regex to `[[.x80.]-[.xff.]]`, with a HYPHEN-MINUS character, which, indeed, finds any invalid `UTF-8` character!

BTW, I like the two buttons at the bottom right of your documentation ( `txt-` / `TXT+` ), which allow us to zoom in or out. They will surely help a lot of people!
Best Regards
guy038
P.S.:

In the first of my three consecutive posts, which ended my test period of your plugin ( https://community.notepad-plus-plus.org/post/100087 ), I wrote:

`\p{Ascii}` = `(?s)\o` => `128` when applied against my `Total_Chars.txt` file! Now, I understand that the `(?s)` modifier does not change anything for the Count results. Indeed, the `(?s)` or `(?-s)` modifiers are ONLY needed if there is at least one `.` regex character in the entire regex! ( a stand-alone cross-check follows the list below )

So, if we want to omit the `\r` and `\n` in the above regex, we must use the `(?![\r\n])\p{Ascii}` or the `(?![\r\n])\o` syntax, which correctly returns `126` matches.

Note that this is only true for a non-`ANSI` file. For an `ANSI` file:

-

The regex `(?![\r\n])\p{Ascii}` is invalid, as explained above.

-

The regex `(?![\r\n])\o` does work but returns just one match: the lower-case letter `o`!! ( the `Match case` option was set )
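A quick way to sanity-check the `(?s)` / `(?-s)` point outside the plugin is plain Boost::regex ( a minimal sketch; ordinary Boost, not Columns++ — `\o` and `\p{Ascii}` are Columns++ extensions, so `\w` stands in here for a pattern containing no `.` ):

```cpp
// Minimal sketch: show that (?s)/(?-s) only matter when the regex contains
// a '.', by counting matches with plain Boost::regex (not Columns++ itself).
#include <boost/regex.hpp>
#include <iostream>
#include <iterator>
#include <string>

static long countMatches(const std::string& text, const char* pattern) {
    boost::regex re(pattern);
    return std::distance(boost::sregex_iterator(text.begin(), text.end(), re),
                         boost::sregex_iterator());
}

int main() {
    const std::string text = "ab\r\ncd";
    std::cout << countMatches(text, "(?s).")    << '\n'; // 6: '.' also matches \r and \n
    std::cout << countMatches(text, "(?-s).")   << '\n'; // 4: '.' skips the line-break chars
    std::cout << countMatches(text, "(?s)\\w")  << '\n'; // 4: no '.', so (?s) changes nothing
    std::cout << countMatches(text, "(?-s)\\w") << '\n'; // 4: same count either way
}
```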
-
-
@guy038 said in Columns++ version 1.2: better Unicode search:
Well, I updated the regex documentation with the N++ release `v8.7.6` and, at that time, `Columns++` did not seem to belong to the plugins list!?

The previous "stable" version was there, but I got the pull request to update to version 1.2 in just barely in time to make it into Notepad++ 8.7.8.
I did some modifications and I hope you’ll agree with the new phrasing !
Thank you. I like that. I just didn’t want people to install it and then be disappointed that it’s no help if they want to use one of the many features of Notepad++ search that Columns++ does not attempt to replicate.
However, I noticed an odd thing:

-

Write these five characters `,¼½¾,` in a new `UTF-8` tab

-

Ask Columns++ to select all the punctuation characters with the `[[:punct:]]` regex

=> It correctly finds the two commas only, as the fractions have the Unicode `\p{other Number}` property and are not punctuation chars

-

Now, convert this `UTF-8` file to an `ANSI` file, with the `Encoding > Convert to ANSI` option

-

Re-try the `[[:punct:]]` regex against this, from now on, `ANSI` file

=> This time, all five characters are selected!?
Yes, that is something I don't like about my own work: there are now inconsistencies between ANSI and UTF-8, because I changed nothing about ANSI regular expressions. For example, `(?i)\u` still matches all alphabetic characters in ANSI files. (For obscure technical reasons involving C++ template specialization and how Boost::regex is implemented, it may prove to be more difficult to make the corresponding changes to ANSI than it was to make them to Unicode. So far, I haven't even tried.)
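To illustrate that separation ( a minimal sketch of ordinary Boost::regex usage, not the actual Columns++ code ): the narrow and wide engines are distinct template instantiations, each with its own character classification, so changing one leaves the other untouched.

```cpp
// Minimal sketch: boost::basic_regex<charT> is instantiated separately for
// narrow and wide characters, and each instantiation classifies characters
// on its own (per the notes later in this thread, on Windows the defaults
// consult GetStringTypeExA and GetStringTypeExW respectively).
#include <boost/regex.hpp>
#include <iostream>
#include <iterator>
#include <string>

int main() {
    std::string  narrow = "\xBC\xBD\xBE";        // ¼ ½ ¾ as Windows-1252 bytes
    std::wstring wide   = L"\u00BC\u00BD\u00BE"; // the same characters as code points

    boost::regex  narrowPunct("[[:punct:]]");    // boost::basic_regex<char>
    boost::wregex widePunct(L"[[:punct:]]");     // boost::basic_regex<wchar_t>

    std::cout << std::distance(
        boost::sregex_iterator(narrow.begin(), narrow.end(), narrowPunct),
        boost::sregex_iterator()) << '\n';
    std::cout << std::distance(
        boost::wsregex_iterator(wide.begin(), wide.end(), widePunct),
        boost::wsregex_iterator()) << '\n';
    // The two counts can differ for the very same three characters, because
    // the classification tables behind the two instantiations are independent.
}
```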
In your documentation, regarding your last sequence `[[.x80.]]–[[.xff.]]`, right before the `Search` file section: At first sight, it's not a class of characters and it seems to be an invalid regular expression. Surprisingly, it's not! Actually, it's a sequence of three consecutive characters
It looks like I failed to convey what I meant in that entry. What I was trying to say was that you can use `[[.xhh.]]` as a symbolic character reference to find an invalid byte; so that, for example, `[[.xB2.]]` will find any byte 0xB2 that is part of an invalid UTF-8 sequence. (There is no way to isolate bytes 0xB2 that are parts of valid UTF-8 sequences, though; for that, you'd have to reinterpret as — not convert to — ANSI.) I added those as I was updating the documentation, because I thought it was less confusing than telling people they could use expressions like `\x{DCB2}` to find specific invalid bytes. This mirrors how control and invisible characters have symbolic names that match the way Scintilla displays them.
-
Hi, @coises,
You said :
It looks like I failed to convey what I meant in that entry
Ah…, now I understand what you meant! Thus, maybe the two following entries would just mean what you expected:

```
•-----------------------------•-----------•------------------------------------•
| From [[.x00.]] to [[.xff.]] | [[.x##.]] | The invalid UTF-8 byte [[.x##.]]   |
| [[.x80.]-[.xff.]]           |           | Any invalid UTF-8 byte             |
•-----------------------------•-----------•------------------------------------•
```
Like you, I'm a bit upset about the differences in behavior of your `Columns++` plugin between `ANSI` and `UNICODE` files. So I will do additional tests to narrow down where these differences occur! Like my `Total_Chars.txt` UNICODE file, I'll create an ANSI file containing the `256` characters of the `Windows-1252` encoding for this purpose!

https://en.wikipedia.org/wiki/Windows-1252
See you later,
BR
guy038
-
Hello, @coises and All,
I’ve decided to use the same canvas to describe results with an
ANSIfile as I did for results with aUNICODEfile. This description will spread over two posts !So, I first created this ANSI file, named
Total_ANSI.txt:•---------------•-----------------•------------•-----------------------------•-----------•-----------------•-----------• | Range | Description | Status | COUNT / MARK of ALL chars | # Chars | ANSI Encoding | # Bytes | •---------------•-----------------•------------•-----------------------------•-----------•-----------------•- ---------• | 0000 - 007F | PLANE 0 - BMP | Included | [\x00-\x7F] | 128 | | 128 | | | | | | | 1 Byte | | | 0080 - 00FF | PLANE 0 - BMP | Included | [\x80-\xFF] | 128 | | 128 | •---------------•-----------------•------------•-----------------------------•-----------•-----------------•-----------•
Against this file, the following results are correct:

```
[\x00-\xFF]    =>  256 chars, coded with one byte  =  TOTAL of characters
[[:unicode:]]  =>    0 chars                       =  Total chars OVER \x{00FF}
```
I tried some expressions with look-aheads and look-behinds, containing overlapping zones!

For instance, against this text `aaaabaaababbbaabbabb`, pasted in a new `ANSI` tab, with a final line-break, all the regexes below give the correct number of matches ( cross-checked below ):

```
ba*(?=a)   => 4 matches
ba*(?!a)   => 9 matches
ba*(?=b)   => 8 matches
ba*(?!b)   => 5 matches

(?<=a)ba*  => 5 matches
(?<!b)ba*  => 5 matches
(?<=b)ba*  => 4 matches
(?<!a)ba*  => 4 matches
```
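These counts can be reproduced outside the plugin; a minimal sketch with plain Boost::regex ( ordinary Boost usage, not Columns++ code ):

```cpp
// Minimal sketch: count matches of the overlapping lookaround patterns
// above with plain Boost::regex, to cross-check the numbers.
#include <boost/regex.hpp>
#include <iostream>
#include <iterator>
#include <string>

int main() {
    const std::string text = "aaaabaaababbbaabbabb";
    const char* patterns[] = { "ba*(?=a)", "ba*(?!a)", "ba*(?=b)", "ba*(?!b)",
                               "(?<=a)ba*", "(?<!b)ba*", "(?<=b)ba*", "(?<!a)ba*" };
    for (const char* p : patterns) {
        boost::regex re(p);
        auto n = std::distance(boost::sregex_iterator(text.begin(), text.end(), re),
                               boost::sregex_iterator());
        std::cout << p << " => " << n << " matches\n";  // e.g. "ba*(?=a) => 4 matches"
    }
}
```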
But, on the other hand, the search of the regex:

```
[[.NUL.][.SOH.][.STX.][.ETX.][.EOT.][.ENQ.][.ACK.][.BEL.][.BS.][.HT.][.VT.][.FF.][.SO.][.SI.][.DLE.][.DC1.][.DC2.][.DC3.][.DC4.][.NAK.][.SYN.][.ETB.][.CAN.][.EM.][.SUB.][.ESC.][.FS.][.GS.][.RS.][.US.][.DEL.][.HOP.][.RI.][.SS3.][.DCS.][.OCS.][.SHY.]]
```

leads to an `Invalid Regex` message. Logical, as this kind of search concerns `Unicode` files only.
Now, against the `Total_ANSI.txt` file, all the following results are correct:

```
(?s).        =  [\x00-\xFF]      => 256     Total = 256
(?-s).       =  [^\x0A\x0C\x0D]  => 253

\p{Unicode}  =  [[:Unicode:]]    =>   0  |
                                         |  Total = 256
\P{Unicode}  =  [[:^Unicode:]]   => 256  |

\X                               => 256  |
                                         |  Total = 256
(?!\X).                          =>   0  |
```
Here are the correct results concerning all the POSIX character classes, against the `Total_ANSI.txt` file:

```
[[:unicode:]]  =  \p{unicode}                               an OVER \x{00FF} character             0  =  [^\x00-\xFF]
[[:space:]]    =  \p{space} = [[:s:]] = \p{s} = \ps = \s    a WHITE-SPACE character                7  =  [\t\n\x0B\f\r\x20\xA0]
[[:h:]]        =  \p{h} = \ph = \h                          an HORIZONTAL white space character    3  =  [\t\x20\xA0]
[[:blank:]]    =  \p{blank}                                 a BLANK character                      3  =  [\t\x20\xA0]
[[:v:]]        =  \p{v} = \pv = \v                          a VERTICAL white space character       4  =  [\n\x0B\f\r]
[[:cntrl:]]    =  \p{cntrl}                                 a CONTROL code character              39  =  [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D\xAD]
[[:upper:]]    =  \p{upper} = [[:u:]] = \p{u} = \pu = \u    an UPPER case letter                  60  =  [A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]
[[:lower:]]    =  \p{lower} = [[:l:]] = \p{l} = \pl = \l    a LOWER case letter                   65  =  [a-zƒšœžªµºßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]
[[:digit:]]    =  \p{digit} = [[:d:]] = \p{d} = \pd = \d    a DECIMAL number                      13  =  [0-9²³¹]
_              =  \x{005F}                                  the LOW_LINE character                 1
                                                                                              -------
[[:word:]]     =  \p{word} = [[:w:]] = \p{w} = \pw = \w     a WORD character                     139  =  [[:alnum:]]|\x5F = \p{alnum}|\x5F

(?i)[[:upper:]] = (?i)[[:lower:]]                           a LETTER, whatever its CASE          125  =  (?-i)[[:upper:][:lower:]]
[[:alnum:]]    =  \p{alnum}                                 an ALPHANUMERIC character            138  =  (?-i)[[:upper:][:lower:][:digit:]]
[[:alpha:]]    =  \p{alpha}                                 any LETTER character                 125  =  (?-i)[[:upper:][:lower:]]
[[:graph:]]    =  \p{graph}                                 any VISIBLE character                212  =  [^\x00-\x1F\x20\x7F\x80\x81\x88\x8D\x8F\x90\x98\x99\x9D\xA0]
[[:print:]]    =  \p{print}                                 any PRINTABLE character              219  =  [[:graph:]]|\s
[[:punct:]]    =  \p{punct}                                 any PUNCTUATION or SYMBOL character   80  =  [\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E\x82\x84-\x87\x89\x8B\x91-\x97\x9B\xA1-\xBF\xD7\xF7]
[[:xdigit:]]                                                an HEXADECIMAL character              22  =  [0-9A-Fa-f] = (?i)[0-9A-F]
```
NO results regarding the Unicode character classes against the `Total_ANSI.txt` file, because any such class logically returns an `Invalid Regular Expression` message.

Remark:

- A negative POSIX character class can be expressed as `[^[:........:]]` or `[[:^........:]]`

No INVALID `UTF-8` chars can be found, as we're dealing with an `ANSI` file!
I tested ALL the `Equivalence` classes feature:

You can use any other equivalent character of the `a` letter to get the `15` matches ( for instance: `[[=ª=]]`, `[[=Å=]]`, `[[=ã=]]`, … )

Below is the list of all the equivalences of any char of the `Windows-1252` code-page, from `\x00` till `\xDE`, against the `Total_ANSI.txt` file. Note that I did not consider the equivalence classes which return only one match!

```
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]] = [[=alert=]]            => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]] = [[=backspace=]]        => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]] = [[=IS4=]]              => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]] = [[=IS3=]]              => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]] = [[=IS2=]]              => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]] = [[=IS1=]]              => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[='=]] = [[=apostrophe=]]      => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[=-=]] = [[=hyphen=]]          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[=–=]]                         => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[=—=]]                         => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]

[[=1=]] = [[=one=]]             =>  2  [1¹]
[[=2=]] = [[=two=]]             =>  2  [2²]
[[=3=]] = [[=three=]]           =>  2  [3³]

[[=A=]]                         => 15  [AaªÀÁÂÃÄÅàáâãäå]
[[=B=]]                         =>  2  [Bb]
[[=C=]]                         =>  4  [CcÇç]
[[=D=]]                         =>  4  [DdÐð]
[[=E=]]                         => 10  [EeÈÉÊËèéêë]
[[=F=]]                         =>  3  [Ffƒ]
[[=G=]]                         =>  2  [Gg]
[[=H=]]                         =>  2  [Hh]
[[=I=]]                         => 10  [IiÌÍÎÏìíîï]
[[=J=]]                         =>  2  [Jj]
[[=K=]]                         =>  2  [Kk]
[[=L=]]                         =>  2  [Ll]
[[=M=]]                         =>  2  [Mm]
[[=N=]]                         =>  4  [NnÑñ]
[[=O=]]                         => 15  [OoºÒÓÔÕÖØòóôõöø]
[[=P=]]                         =>  2  [Pp]
[[=Q=]]                         =>  2  [Qq]
[[=R=]]                         =>  2  [Rr]
[[=S=]]                         =>  4  [SsŠš]
[[=T=]]                         =>  2  [Tt]
[[=U=]]                         => 10  [UuÙÚÛÜùúûü]
[[=V=]]                         =>  2  [Vv]
[[=W=]]                         =>  2  [Ww]
[[=X=]]                         =>  2  [Xx]
[[=Y=]]                         =>  6  [YyÝýÿŸ]
[[=Z=]]                         =>  4  [ZzŽž]
[[=^=]] = [[=circumflex=]]      =>  2  [^ˆ]
[[=Œ=]]                         =>  2  [Œœ]
[[=Þ=]]                         =>  2  [Þþ]
```
Some double-letter strings have equivalences which give you the right single char to use, instead of the two separate letters:

```
[[=AE=]] = [[=Ae=]] = [[=ae=]]  =>  2  [Ææ]
[[=SS=]] = [[=Ss=]] = [[=ss=]]  =>  1  [ß]
```
An example: let's suppose that we run the regex `(?-i)[A-F[:lower:]]` against my `Total_ANSI.txt` file. It does give `71` matches, so `6` UPPER letters + `65` LOWER letters.

As, in an `ANSI` file, the `Match case` option or the `(?i)` modifier is effective for `POSIX` character classes, if we run the same regex in an insensitive way, the `(?i)[A-F[:lower:]]` regex returns, this time, `125` matches.

And note that the regex `(?-i)[[:upper:][:lower:]]` or `(?i)[[:upper:][:lower:]]` acts as an insensitive regex and returns `125` matches ( so `60` UPPER letters + `65` LOWER letters ).

The regexes `(?-i)\u(?<=\l)` and `(?-i)(?=\l)\u` do not find any match. This implies that the sets of UPPER and LOWER letters are totally disjoint.
Finally, for `ANSI` files, the regex syntax `\X` is rather useless. Indeed, the Unicode block of `Combining Diacritical Marks` cannot be used anyway, and the `Emoji`, which are Unicode characters, are totally inaccessible to `ANSI` files. Thus, the `\X` regex is just equivalent to the simple regex `(?s).`
So, from this set of ANSI results, which ones seem quite odd, compared with the UNICODE results?

-

Maybe the regex `(?-s).` should just be equal to `[^\x0A\x0D]` and return `254` matches

-

The `[[:cntrl:]]` or `\p{cntrl}` class should be equal to `[\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]` and return `38` characters or, maybe, `[\x00-\x1F\x7F]`, so `33` chars only
Regarding the `[[:graph:]]` character class, I created an identical `UTF-8` file, named `Total_UTF-8.txt`. Here are the results for the characters between `\x80` and `\xBF`, in both files ( all the other chars being identical ):

```
•---------•---------•--------------------•
|  ANSI   |  UTF-8  |  UNICODE Category  |
•---------•---------•--------------------•
|         |    €    |         Sc         |
|    ‚    |    ‚    |                    |
|    ƒ    |    ƒ    |                    |
|    „    |    „    |                    |
|    …    |    …    |                    |
|    †    |    †    |                    |
|    ‡    |    ‡    |                    |
|         |    ˆ    |         Po         |
|    ‰    |    ‰    |                    |
|    Š    |    Š    |                    |
|    ‹    |    ‹    |                    |
|    Œ    |    Œ    |                    |
|    Ž    |    Ž    |                    |
|    ‘    |    ‘    |                    |
|    ’    |    ’    |                    |
|    “    |    “    |                    |
|    ”    |    ”    |                    |
|    •    |    •    |                    |
|    –    |    –    |                    |
|    —    |    —    |                    |
|         |    ˜    |         Sk         |
|         |    ™    |         So         |
|    š    |    š    |                    |
|    ›    |    ›    |                    |
|    œ    |    œ    |                    |
|    ž    |    ž    |                    |
|    Ÿ    |    Ÿ    |                    |
|    ¡    |    ¡    |                    |
|    ¢    |    ¢    |                    |
|    £    |    £    |                    |
|    ¤    |    ¤    |                    |
|    ¥    |    ¥    |                    |
|    ¦    |    ¦    |                    |
|    §    |    §    |                    |
|    ¨    |    ¨    |                    |
|    ©    |    ©    |                    |
|    ª    |    ª    |                    |
|    «    |    «    |                    |
|    ¬    |    ¬    |                    |
|         |         |         Cf         |
|    ®    |    ®    |                    |
|    ¯    |    ¯    |                    |
|    °    |    °    |                    |
|    ±    |    ±    |                    |
|    ²    |    ²    |                    |
|    ³    |    ³    |                    |
|    ´    |    ´    |                    |
|    µ    |    µ    |                    |
|    ¶    |    ¶    |                    |
|    ·    |    ·    |                    |
|    ¸    |    ¸    |                    |
|    ¹    |    ¹    |                    |
|    º    |    º    |                    |
|    »    |    »    |                    |
|    ¼    |    ¼    |                    |
|    ½    |    ½    |                    |
|    ¾    |    ¾    |                    |
|    ¿    |    ¿    |                    |
•---------•---------•--------------------•
```
\x80,\x88,\x98,\x99are not supposed to be part of the[[:graph:]]which represents the class of visible characters !?So, to harmonize the results, the rule should be :
-

When using the `[[:graph:]]` POSIX character class against an `ANSI` file:

-

The `[\x80\x88\x98\x99]` ANSI list of characters ( corresponding to the `[\x{20AC}\x{02C6}\x{02DC}\x{2122}]` UTF-8 list ) should be included in that class

-

The `\xAD` character ( or `\x{00AD}` ) should be excluded from that class!

-

Now, as the `[[:print:]]` POSIX character class is simply identical to the regex `[[:graph:]]|\s`, there is no need to investigate that character class!

See next post
-
Hi @Coises and All,
End of my reply:
In the same way, regarding the `[[:punct:]]` character class, here are the results for both the `Total_ANSI.txt` and `Total_UTF-8.txt` files:

```
•---------•---------•--------------------•
|  ANSI   |  UTF-8  |  UNICODE Category  |
•---------•---------•--------------------•
|    !    |    !    |         Po         |
|    "    |    "    |         Po         |
|    #    |    #    |         Po         |
|    $    |    $    |         Sc         |
|    %    |    %    |         Po         |
|    &    |    &    |         Po         |
|    '    |    '    |         Po         |
|    (    |    (    |         Ps         |
|    )    |    )    |         Pe         |
|    *    |    *    |         Po         |
|    +    |    +    |         Sm         |
|    ,    |    ,    |         Po         |
|    -    |    -    |         Pd         |
|    .    |    .    |         Po         |
|    /    |    /    |         Po         |
|    :    |    :    |         Po         |
|    ;    |    ;    |         Po         |
|    <    |    <    |         Sm         |
|    =    |    =    |         Sm         |
|    >    |    >    |         Sm         |
|    ?    |    ?    |         Po         |
|    @    |    @    |         Po         |
|    [    |    [    |         Ps         |
|    \    |    \    |         Po         |
|    ]    |    ]    |         Pe         |
|    ^    |    ^    |         Sk         |
|    _    |    _    |         Pc         |
|    `    |    `    |         Sk         |
|    {    |    {    |         Ps         |
|    |    |    |    |         Sm         |
|    }    |    }    |         Pe         |
|    ~    |    ~    |         Sm         |
•---------•---------•--------------------•
|         |    €    |         Sc         |
|    ‚    |    ‚    |         Ps         |
|    „    |    „    |         Ps         |
|    …    |    …    |         Po         |
|    †    |    †    |         Po         |
|    ‡    |    ‡    |         Po         |
|    ‰    |    ‰    |         Po         |
|    ‹    |    ‹    |         Pi         |
|    ‘    |    ‘    |         Pi         |
|    ’    |    ’    |         Pf         |
|    “    |    “    |         Pi         |
|    ”    |    ”    |         Pf         |
|    •    |    •    |         Po         |
|    –    |    –    |         Pd         |
|    —    |    —    |         Pd         |
|         |    ˜    |         Sk         |
|         |    ™    |         So         |
|    ›    |    ›    |         Pf         |
|    ¡    |    ¡    |         Po         |
|    ¢    |    ¢    |         Sc         |
|    £    |    £    |         Sc         |
|    ¤    |    ¤    |         Sc         |
|    ¥    |    ¥    |         Sc         |
|    ¦    |    ¦    |         So         |
|    §    |    §    |         Po         |
|    ¨    |    ¨    |         Sk         |
|    ©    |    ©    |         So         |
|    ª    |         |         Lo         |
|    «    |    «    |         Pi         |
|    ¬    |    ¬    |         Sm         |
|         |         |         Cf         |
|    ®    |    ®    |         So         |
|    ¯    |    ¯    |         Sk         |
|    °    |    °    |         So         |
|    ±    |    ±    |         Sm         |
|    ²    |         |         No         |
|    ³    |         |         No         |
|    ´    |    ´    |         Sk         |
|    µ    |         |         Ll         |
|    ¶    |    ¶    |         Po         |
|    ·    |    ·    |         Po         |
|    ¸    |    ¸    |         Sk         |
|    ¹    |         |         No         |
|    º    |         |         Lo         |
|    »    |    »    |         Pf         |
|    ¼    |         |         No         |
|    ½    |         |         No         |
|    ¾    |         |         No         |
|    ¿    |    ¿    |         Po         |
|    ×    |    ×    |         Sm         |
|    ÷    |    ÷    |         Sm         |
•---------•---------•--------------------•
```
[[:punct:]]POSIX character class is the union of the TWO Unicode classes\p{P*}and\p{S*}, this means that all the[[:punct:]]characters, found inTotal_UTF-8.txt, are exact !However, it’s obvious that it’s not the case for the
[[:punct:]]characters found inTotal_ANSI.txt:So, again, to harmonize the results, the rule should be :
-

When using the `[[:punct:]]` POSIX character class against an `ANSI` file:

-

The `[\xAA\xAD\xB2\xB3\xB5\xB9\xBA\xBC\xBD\xBE]` list of characters should be excluded from that class!

-

The `[\x80\x98\x99]` ANSI list of characters ( corresponding to the `[\x{20AC}\x{02DC}\x{2122}]` UTF-8 list ) should be included in that class

-

And this result would confirm that the POSIX `[[:punct:]]` character class is equal to the `\p{P*}|\p{S*}` regex, in all cases!
-
Regarding the `Equivalence` classes whose results are presently `33`, the rule should be:

-

All the `Control` codes should just match their own character. For example, `[[==]]` ( with the DEL character between the `=` signs ) should return 1 match, `[\x7F]`

-

`[[='=]]` = `[[=apostrophe=]]` should return 1 match, `[\x27]`

-

`[[=-=]]` = `[[=hyphen=]]` should return 1 match, `[\x2D]`

-

`[[=–=]]` should return 1 match, `[\x96]`

-

`[[=—=]]` should return 1 match, `[\x97]`

-

`[[==]]` ( with the SOFT HYPHEN character between the `=` signs ) should return 1 match, `[\xAD]`
-
Now, when doing tests with UNICODE files, I forgot the equivalence classes of the `Control C0/C1` and `Control Format` characters! So the results, against my `Total_Chars.txt` UTF-8 file, are:

```
[[=nul=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cc
[[=soh=]]                          =>     1  [\x{0001}]                  Cc
[[=stx=]]                          =>     1  [\x{0002}]                  Cc
[[=etx=]]                          =>     1  [\x{0003}]                  Cc
[[=eot=]]                          =>     1  [\x{0004}]                  Cc
[[=enq=]]                          =>     1  [\x{0005}]                  Cc
[[=ack=]]                          =>     1  [\x{0006}]                  Cc
[[=bel=]] = [[=alert=]]            =>     1  [\x{0007}]                  Cc
[[=bs=]]  = [[=backspace=]]        =>     1  [\x{0008}]                  Cc
[[=ht=]]  = [[=tab=]]              =>     1  [\x{0009}]                  Cc
[[=lf=]]  = [[=newline=]]          =>     1  [\x{000A}]                  Cc
[[=vt=]]  = [[=vertical-tab=]]     =>     1  [\x{000B}]                  Cc
[[=ff=]]  = [[=form-feed=]]        =>     1  [\x{000C}]                  Cc
[[=cr=]]  = [[=carriage-return=]]  =>     1  [\x{000D}]                  Cc
[[=so=]]                           =>     1  [\x{000E}]                  Cc
[[=si=]]                           =>     1  [\x{000F}]                  Cc
[[=dle=]]                          =>     1  [\x{0010}]                  Cc
[[=dc1=]]                          =>     1  [\x{0011}]                  Cc
[[=dc2=]]                          =>     1  [\x{0012}]                  Cc
[[=dc3=]]                          =>     1  [\x{0013}]                  Cc
[[=dc4=]]                          =>     1  [\x{0014}]                  Cc
[[=nak=]]                          =>     1  [\x{0015}]                  Cc
[[=syn=]]                          =>     1  [\x{0016}]                  Cc
[[=etb=]]                          =>     1  [\x{0017}]                  Cc
[[=can=]]                          =>     1  [\x{0018}]                  Cc
[[=em=]]                           =>     1  [\x{0019}]                  Cc
[[=sub=]]                          =>     1  [\x{001A}]                  Cc
[[=esc=]]                          =>     1  [\x{001B}]                  Cc
[[=fs=]]                           =>     1  [\x{001C}]                  Cc
[[=gs=]]                           =>     1  [\x{001D}]                  Cc
[[=rs=]]                           =>     1  [\x{001E}]                  Cc
[[=us=]]                           =>     1  [\x{001F}]                  Cc
[[= =]]                            =>     3  [\x{0020}\x{205F}\x{3000}]  Zs
[[=del=]]                          =>     1  [\x{007F}]                  Cc
[[=pad=]]                          =>     1  [\x{0080}]                  Cc
[[=hop=]]                          =>     1  [\x{0081}]                  Cc
[[=bph=]]                          =>     1  [\x{0082}]                  Cc
[[=nbh=]]                          =>     1  [\x{0083}]                  Cc
[[=ind=]]                          =>     1  [\x{0084}]                  Cc
[[=nel=]]                          =>     1  [\x{0085}]                  Cc
[[=ssa=]]                          =>     1  [\x{0086}]                  Cc
[[=esa=]]                          =>     1  [\x{0087}]                  Cc
[[=hts=]]                          =>     1  [\x{0088}]                  Cc
[[=htj=]]                          =>     1  [\x{0089}]                  Cc
[[=lts=]]                          =>     1  [\x{008A}]                  Cc
[[=pld=]]                          =>     1  [\x{008B}]                  Cc
[[=plu=]]                          =>     1  [\x{008C}]                  Cc
[[=ri=]]                           =>     1  [\x{008D}]                  Cc
[[=ss2=]]                          =>     1  [\x{008E}]                  Cc
[[=ss3=]]                          =>     1  [\x{008F}]                  Cc
[[=dcs=]]                          =>     1  [\x{0090}]                  Cc
[[=pu1=]]                          =>     1  [\x{0091}]                  Cc
[[=pu2=]]                          =>     1  [\x{0092}]                  Cc
[[=sts=]]                          =>     1  [\x{0093}]                  Cc
[[=cch=]]                          =>     1  [\x{0094}]                  Cc
[[=mw=]]                           =>     1  [\x{0095}]                  Cc
[[=spa=]]                          =>     1  [\x{0096}]                  Cc
[[=epa=]]                          =>     1  [\x{0097}]                  Cc
[[=sos=]]                          =>     1  [\x{0098}]                  Cc
[[=sgci=]]                         =>     1  [\x{0099}]                  Cc
[[=sci=]]                          =>     1  [\x{009A}]                  Cc
[[=csi=]]                          =>     1  [\x{009B}]                  Cc
[[=st=]]                           =>     1  [\x{009C}]                  Cc
[[=osc=]]                          =>     1  [\x{009D}]                  Cc
[[=pm=]]                           =>     1  [\x{009E}]                  Cc
[[=apc=]]                          =>     1  [\x{009F}]                  Cc
[[=nbsp=]]                         =>     1  [\x{00A0}]                  Cc
[[=shy=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=alm=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=sam=]]                          =>     2  [\x{070F}\x{2E1A}]          Po
[[=ospm=]]                         =>     1  [\x{1680}]                  Zs
[[=mvs=]]                          =>     1  [\x{180E}]                  Cf
[[=nqsp=]]                         =>     2  [\x{2000}\x{2002}]          Zs
[[=mqsp=]]                         =>     2  [\x{2001}\x{2003}]          Zs
[[=ensp=]]                         =>     2  [\x{2000}\x{2002}]          Zs
[[=emsp=]]                         =>     2  [\x{2001}\x{2003}]          Zs
[[=3/msp=]]                        =>     1  [\x{2004}]                  Zs
[[=4/msp=]]                        =>     1  [\x{2005}]                  Zs
[[=6/msp=]]                        =>     1  [\x{2006}]                  Zs
[[=fsp=]]                          =>     1  [\x{2007}]                  Zs
[[=psp=]]                          =>     1  [\x{2008}]                  Zs
[[=thsp=]]                         =>     1  [\x{2009}]                  Zs
[[=hsp=]]                          =>     1  [\x{200A}]                  Zs
[[=zwsp=]]                         =>     1  [\x{200B}]                  Cf
[[=zwnj=]]                         => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=zwj=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=lrm=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=rlm=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=ls=]]                           =>     2  [\x{2028}\x{FE47}]          Zl
[[=ps=]]                           =>     1  [\x{2029}]                  Zp
[[=lre=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=rle=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=pdf=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=lro=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=rlo=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=nnbsp=]]                        =>     1  [\x{202F}]                  Zs
[[=mmsp=]]                         =>     3  [\x{0020}\x{205F}\x{3000}]  Zs
[[=wj=]]                           => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=(fa)=]]                         => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=(it)=]]                         => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=(is)=]]                         => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=(ip)=]]                         => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=lri=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=rli=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=fsi=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=pdi=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=iss=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=ass=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=iafs=]]                         => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=aafs=]]                         => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=nads=]]                         => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=nods=]]                         => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=idsp=]]                         =>     3  [\x{0020}\x{205F}\x{3000}]  Zs
[[=zwnbsp=]]                       => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=iaa=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=ias=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=iat=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=sflo=]]                         =>     1  [\x{1BCA0}]                 Cf
[[=sfco=]]                         =>     1  [\x{1BCA1}]                 Cf
[[=sfds=]]                         =>     1  [\x{1BCA2}]                 Cf
[[=sfus=]]                         =>     1  [\x{1BCA3}]                 Cf
```
Formatcharacters give the erroneous result of3,309occurrences. But we’re not going to bother about these wrongequivalenceclasses, as long as the similarcollatingnames, with the[[.XXX.]]syntax, are totally correct !Luckily, all the other equivalence classes are quite correct, except for
[[=ls=]]which returns2matches\x{2028}and\x{FE47}??Also a detail !
Best Regards,
guy038
-
-
@guy038 said in Columns++ version 1.2: better Unicode search:
But, on the other hand, the search of the regex :
```
[[.NUL.][.SOH.][.STX.][.ETX.][.EOT.][.ENQ.][.ACK.][.BEL.][.BS.][.HT.][.VT.][.FF.][.SO.][.SI.][.DLE.][.DC1.][.DC2.][.DC3.][.DC4.][.NAK.][.SYN.][.ETB.][.CAN.][.EM.][.SUB.][.ESC.][.FS.][.GS.][.RS.][.US.][.DEL.][.HOP.][.RI.][.SS3.][.DCS.][.OCS.][.SHY.]]
```

leads to an `Invalid Regex` message. Logical, as this kind of search concerns `Unicode` files only.
There is a typo in that. The next-to-last symbolic name should be OSC, not OCS. (See list at the end of this help section.)
However, it still won’t work in ANSI search, because ANSI search only supports these POSIX symbolic names as defined by Boost::regex.
The regular expression language for ANSI files is exactly the same as it is in Notepad++ search, because I have not changed the underlying Boost::regex engine’s behavior for ANSI files. I only changed the way the engine works for UTF-8 files.
Some things, like stepwise find and replace with `\K`, formulas in replacement strings, and counting null matches (my Count counts them, Notepad++'s doesn't), differ for both ANSI and UTF-8 because I changed the surrounding code that uses the Boost::regex engine; but the matching itself is unchanged for ANSI.
This is why the character classes behave differently as well. Boost::regex relies on GetStringTypeExA (which is similar to GetStringTypeExW except for the third argument being `char*` instead of `wchar_t*`) to classify 8-bit characters according to the Ctype 1 list here. The classification depends on the current locale (which should imply the system default code page, which is the only code page Notepad++ ever uses as ANSI — documents in other code pages are converted to UTF-8). ANSI regular expressions, per Boost::regex design, use whatever information Windows gives them.
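A minimal sketch of what that looks like from the API side ( ordinary Win32 code, not Columns++ itself ): it dumps the Ctype 1 bits Windows assigns to each high byte of the ANSI code page, which is exactly the information the narrow Boost::regex engine gets to work with.

```cpp
// Minimal sketch: ask Windows how it classifies each 8-bit character, via
// the same GetStringTypeExA call that Boost::regex relies on for ANSI text.
#include <windows.h>
#include <cstdio>

int main() {
    for (int c = 0x80; c <= 0xFF; ++c) {
        char ch = static_cast<char>(c);
        WORD type = 0;
        if (!GetStringTypeExA(LOCALE_USER_DEFAULT, CT_CTYPE1, &ch, 1, &type))
            continue;
        std::printf("0x%02X:%s%s%s%s%s%s%s\n", c,
                    (type & C1_CNTRL) ? " cntrl" : "",
                    (type & C1_PUNCT) ? " punct" : "",
                    (type & C1_UPPER) ? " upper" : "",
                    (type & C1_LOWER) ? " lower" : "",
                    (type & C1_DIGIT) ? " digit" : "",
                    (type & C1_SPACE) ? " space" : "",
                    (type & C1_ALPHA) ? " alpha" : "");
    }
    // On a Windows-1252 system this shows, for instance, 0xBC-0xBE (¼ ½ ¾)
    // flagged as punct, which is why the ANSI [[:punct:]] class matches them.
}
```

-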
Hi, @coises and All,
I think this will be the last answer concerning your `Columns++` v1.2 plugin!

Here is the recapitulation of the way to access the invisible characters, whatever the file type:

For `ANSI` files: just one possible syntax for these `collating` names:

```
[[.NUL.][.SOH.][.STX.][.ETX.][.EOT.][.ENQ.][.ACK.][.alert.][.backspace.][.tab.][.newline.][.vertical-tab.][.form-feed.][.carriage-return.][.SO.][.SI.][.DLE.][.DC1.][.DC2.][.DC3.][.DC4.][.NAK.][.SYN.][.ETB.][.CAN.][.EM.][.SUB.][.ESC.][.IS4.][.IS3.][.IS2.][.IS1.][.DEL.]]
```

which returns `33` matches against the `Total_ANSI.txt` file, which contains the `256` characters of the `Win-1252` encoding

-

Note that the lowercase syntax is NOT allowed, in `ANSI` files, for ANY of the `collating` names written above in UPPER case

-

Note also that the four chars from `\x1c` to `\x1f` must be referred to as `IS4` to `IS1`, in UPPER case ( and NOT as `fs` to `us`! )

For `UTF-8` files: two possible syntaxes for these `collating` names:

```
[[.nul.][.soh.][.stx.][.etx.][.eot.][.enq.][.ack.][.bel.][.bs.][.ht.][.lf.][.vt.][.ff.][.cr.][.so.][.si.][.dle.][.dc1.][.dc2.][.dc3.][.dc4.][.nak.][.syn.][.etb.][.can.][.em.][.sub.][.esc.][.fs.][.gs.][.rs.][.us.][.del.][.pad.][.hop.][.bph.][.nbh.][.ind.][.nel.][.ssa.][.esa.][.hts.][.htj.][.lts.][.pld.][.plu.][.ri.][.ss2.][.ss3.][.dcs.][.pu1.][.pu2.][.sts.][.cch.][.mw.][.spa.][.epa.][.sos.][.sgci.][.sci.][.csi.][.st.][.osc.][.pm.][.apc.][.nbsp.][.shy.][.alm.][.sam.][.ospm.][.mvs.][.nqsp.][.mqsp.][.ensp.][.emsp.][.3/msp.][.4/msp.][.6/msp.][.fsp.][.psp.][.thsp.][.hsp.][.zwsp.][.zwnj.][.zwj.][.lrm.][.rlm.][.ls.][.ps.][.lre.][.rle.][.pdf.][.lro.][.rlo.][.nnbsp.][.mmsp.][.wj.][.(fa).][.(it).][.(is).][.(ip).][.lri.][.rli.][.fsi.][.pdi.][.iss.][.ass.][.iafs.][.aafs.][.nads.][.nods.][.idsp.][.zwnbsp.][.iaa.][.ias.][.iat.][.sflo.][.sfco.][.sfds.][.sfus.]]
```

which returns `120` matches against the `Total_Chars.txt` file

```
[[.nul.][.soh.][.stx.][.etx.][.eot.][.enq.][.ack.][.alert.][.backspace.][.tab.][.newline.][.vertical-tab.][.form-feed.][.carriage-return.][.so.][.si.][.dle.][.dc1.][.dc2.][.dc3.][.dc4.][.nak.][.syn.][.etb.][.can.][.em.][.sub.][.esc.][.fs.][.gs.][.rs.][.us.][.del.][.pad.][.hop.][.bph.][.nbh.][.ind.][.nel.][.ssa.][.esa.][.hts.][.htj.][.lts.][.pld.][.plu.][.ri.][.ss2.][.ss3.][.dcs.][.pu1.][.pu2.][.sts.][.cch.][.mw.][.spa.][.epa.][.sos.][.sgci.][.sci.][.csi.][.st.][.osc.][.pm.][.apc.][.nbsp.][.shy.][.alm.][.sam.][.ospm.][.mvs.][.nqsp.][.mqsp.][.ensp.][.emsp.][.3/msp.][.4/msp.][.6/msp.][.fsp.][.psp.][.thsp.][.hsp.][.zwsp.][.zwnj.][.zwj.][.lrm.][.rlm.][.ls.][.ps.][.lre.][.rle.][.pdf.][.lro.][.rlo.][.nnbsp.][.mmsp.][.wj.][.(fa).][.(it).][.(is).][.(ip).][.lri.][.rli.][.fsi.][.pdi.][.iss.][.ass.][.iafs.][.aafs.][.nads.][.nods.][.idsp.][.zwnbsp.][.iaa.][.ias.][.iat.][.sflo.][.sfco.][.sfds.][.sfus.]]
```

which returns `120` matches against the `Total_Chars.txt` file

-

Note that the UPPERCASE syntax is allowed, in `UTF-8` files, for ANY `collating` name written above in lower case

Finally, for an `ANSI` file containing the `256` chars of the `Win-1252` encoding and converted to a `UTF-8` file ( `Encoding > Convert to UTF-8` ), two syntaxes are possible:

```
[[.nul.][.soh.][.stx.][.etx.][.eot.][.enq.][.ack.][.bel.][.bs.][.ht.][.lf.][.vt.][.ff.][.cr.][.so.][.si.][.dle.][.dc1.][.dc2.][.dc3.][.dc4.][.nak.][.syn.][.etb.][.can.][.em.][.sub.][.esc.][.fs.][.gs.][.rs.][.us.][.del.][.pad.][.hop.][.ri.][.ss3.][.dcs.][.osc.][.nbsp.][.shy.]]
```

which returns `40` matches against the `Total_UTF-8.txt` file

```
[[.nul.][.soh.][.stx.][.etx.][.eot.][.enq.][.ack.][.alert.][.backspace.][.tab.][.newline.][.vertical-tab.][.form-feed.][.carriage-return.][.so.][.si.][.dle.][.dc1.][.dc2.][.dc3.][.dc4.][.nak.][.syn.][.etb.][.can.][.em.][.sub.][.esc.][.fs.][.gs.][.rs.][.us.][.del.][.hop.][.ri.][.ss3.][.dcs.][.osc.][.nbsp.][.shy.]]
```

which returns `40` matches against the `Total_UTF-8.txt` file

-

Note that the UPPERCASE syntax is allowed, in `UTF-8` files, for ANY `collating` name written above in lower case
Now, against the `Total_ANSI.txt` file, containing the first `256` UNICODE characters, we get these results:

```
(?s).                          ANY character                          => 256
(?-s).                         ANY character other than LINE-BREAKS   => 253  =  [^\x0A\x0C\x0D]
[[:unicode:]]  =  \p{unicode}  an OVER \x{00FF} character             =>   0  =  [^\x00-\xFF]
[[:cntrl:]]    =  \p{cntrl}    a CONTROL code character               =>  39  =  [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D\xAD]
[[:space:]]    =  \p{space}    a WHITE-SPACE character                =>   7  =  [\t\n\x0B\f\r\x20\xA0]
[[:blank:]]    =  \p{blank}    a BLANK character                      =>   3  =  [\t\x20\xA0]
[[:upper:]]    =  \p{upper}    an UPPER case letter                   =>  60  =  [A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]
[[:lower:]]    =  \p{lower}    a LOWER case letter                    =>  65  =  [a-zƒšœžªµºßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]
[[:digit:]]    =  \p{digit}    a DECIMAL number                       =>  13  =  [0-9²³¹]
[[:word:]]     =  \p{word}     a WORD character                       => 139  =  [[:alnum:]]|\x5F = \p{alnum}|\x5F
[[:punct:]]    =  \p{punct}    any PUNCTUATION or SYMBOL character    =>  80  =  [\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E\x82\x84-\x87\x89\x8B\x91-\x97\x9B\xA1-\xBF\xD7\xF7]
[[:alpha:]]    =  \p{alpha}    any LETTER character                   => 125  =  (?-i)[[:upper:][:lower:]]
[[:alnum:]]    =  \p{alnum}    an ALPHANUMERIC character              => 138  =  (?-i)[[:upper:][:lower:][:digit:]]
[[:graph:]]    =  \p{graph}    any VISIBLE character                  => 212  =  [^\x00-\x1F\x20\x7F\x80\x81\x88\x8D\x8F\x90\x98\x99\x9D\xA0]
[[:print:]]    =  \p{print}    any PRINTABLE character                => 219  =  [[:graph:][:space:]] = [^\x00-\x1F\x20\x7F\x80\x81\x88\x8D\x8F\x90\x98\x99\x9D\xA0]|[[:space:]]
[[:xdigit:]]                   an HEXADECIMAL character               =>  22  =  [0-9A-Fa-f] = (?i)[0-9A-F]
```

Remark: the `[[:unicode:]]` class, for characters OVER `\x{00FF}`, must correspond to the C1_DEFINED type from the Ctype 1 list here.
From this same article, and after I realized that the POSIX classes are not totally independent, I deduced this layout:

```
C1_DEFINED   Other characters       0
C1_CNTRL     Control characters    39
C1_SPACE     Space characters       2   ( only the SPACE and NBSP chars, OUT of 7, as ALL the others are
                                          ALREADY included in the CNTRL chars class )
C1_UPPER     Uppercase             60
C1_LOWER     Lowercase             65
C1_DIGIT     Decimal digits        13
C1_PUNCT     Punctuation           73   ( and NOT 80, because the \xAD char is ALREADY included in the CNTRL chars class,
                                          because the \xAA, \xB5 and \xBA are ALREADY included in the LOWER chars class,
                                          and because the \xB2, \xB3 and \xB9 are ALREADY included in the DIGIT chars class )
                                 -----
                          TOTAL :  252 chars
```

So, if I exclude from my `Total_ANSI.txt` file all the following classes, with this S/R:

FIND `[[:cntrl:][:space:][:upper:][:lower:][:digit:][:punct:]]`

REPLACE `Leave EMPTY`

either with your plugin or with native N++, there remain `4` characters ( 256 - 252 ), which are the € ( `\x{20AC}` ), ˆ ( `\x{02C6}` ), ˜ ( `\x{02DC}` ) and ™ ( `\x{2122}` ) characters.

Moreover, absolutely no POSIX character class, and of course no UNICODE character class, can find these `4` characters!

Thus, the only way to find one of these `4` characters, in an `ANSI` file, is to use the regex `[\x80\x88\x98\x99]` or to use the characters themselves :-((
In this article, it is also said:

Printable | Graphic characters and blanks (all C1_* types except C1_CNTRL)

Thus, from the previous total of chars of my `Total_ANSI.txt` file, the `[[:print:]]` class should detect `252 - 39`, so `213` matches.

And, as `[[:graph:]]` = `[[:print:]]` - `[[:space:]]`, this means that `[[:graph:]]` should return `213 - 2`, so `211` matches.

But the current result is `212` matches. The difference of one unit comes from the `\xAD` char, which is part of both the `[[:cntrl:]]` and `[[:graph:]]` POSIX character classes!

If we remember the `4` lacking chars, which, obviously, are visible and printable, this means that `[[:graph:]]` and `[[:print:]]` should return, respectively, `215` ( 211 + 4 ) and `217` ( 213 + 4 ) matches, for `ANSI` files.

And it is easy to verify that `[[:print:]]` + `[[:cntrl:]]` = 217 + 39 = `256`!
Just for info: from the `Total_UTF-8.txt` file, containing these same chars, we get these results:

```
(?s).                          ANY character                          => 256
(?-s).                         ANY character other than LINE-BREAKS   => 254  =  [^\x0A\x0D]
[[:ascii:]]                    an UNDER \x{0080} character            => 128  =  [\x{0000}-\x{007F}] = \p{ascii}
[[:unicode:]]  =  \p{unicode}  an OVER \x{00FF} character             =>  27  =  [^\x00-\xFF] = [\x{20AC}\x{201A}\x{0192}\x{201E}\x{2026}\x{2020}\x{2021}\x{02C6}\x{2030}\x{0160}\x{2039}\x{0152}\x{017D}\x{2018}\x{2019}\x{201C}\x{201D}\x{2022}\x{2013}\x{2014}\x{02DC}\x{2122}\x{0161}\x{203A}\x{0153}\x{017E}\x{0178}]
[[:cntrl:]]    =  \p{cntrl}    a CONTROL code character               =>  38  =  [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D] = \p{Cc}
[[:space:]]    =  \p{space}    a WHITE-SPACE character                =>   7  =  [\t\n\x0B\f\r\x20\xA0]
[[:blank:]]    =  \p{blank}    a BLANK character                      =>   3  =  [\t\x{0020}\x{00A0}] = \p{Zs}|\t
[[:upper:]]    =  \p{upper}    an UPPER case letter                   =>  60  =  [A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ] = \p{Lu}
[[:lower:]]    =  \p{lower}    a LOWER case letter                    =>  63  =  [a-zƒšœžµßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ] = \p{Ll}
[[:digit:]]    =  \p{digit}    a DECIMAL number                       =>  10  =  [0-9] = \p{Nd}
[[:word:]]     =  \p{word}     a WORD character                       => 137  =  \p{L*}|\p{Nd}|_
[[:graph:]]    =  \p{graph}    any VISIBLE character                  => 215  =  [^\x00-\x1F\x20\x7F\x81\x8D\x8F\x90\x9D\xA0\xAD] = (?![\x20\xA0\xAD])\P{Cc}
[[:print:]]    =  \p{print}    any PRINTABLE character                => 222  =  [[:graph:][:space:]] = [^\x00-\x1F\x20\x7F\x81\x8D\x8F\x90\x9D\xA0\xAD]|[[:space:]]
[[:punct:]]    =  \p{punct}    any PUNCTUATION or SYMBOL character    =>  73  =  \p{P*}|\p{S*} = [\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E\x{20AC}\x{201A}\x{201E}\x{2026}\x{2020}\x{2021}\x{2030}\x{2039}\x{2018}\x{2019}\x{201C}\x{201D}\x{2022}\x{2013}\x{2014}\x{02DC}\x{2122}\x{203A}\xA1-\xA9\xAB\xAC\xAE-\xB1\xB4\xB6-\xB8\xBB\xBF\xD7\xF7]
[[:alpha:]]    =  \p{alpha}    any LETTER character                   => 126  =  \p{L*} = \p{Lu}|\p{Ll}|[ˆªº]
[[:alnum:]]    =  \p{alnum}    an ALPHANUMERIC character              => 136  =  \p{L*}|\p{Nd}
[[:xdigit:]]                   an HEXADECIMAL character               =>  22  =  [0-9A-Fa-f] = (?i)[0-9A-F]
```

Best regards,
guy038
-