Hello, @coises and All,
I’ve found a small anomaly concerning hexadecimal characters :
If I use the native Notepad++ search to match any hexadecimal character, with the regex [[:xdigit:]], against my Total_Chars.txt file, it returns 44 matches
If I use the Columns++ search to match any hexadecimal character, with the regex [[:xdigit:]], against my Total_Chars.txt file, it returns 22 matches
I suppose that the N++ answer is the right one. Indeed, in the https://www.unicode.org/reports/tr18/#Compatibility_Properties article ( Annex C of UNICODE REGULAR EXPRESSIONS ), it is said :
Hex_Digit contains 0-9 A-F fullwidth and halfwidth, upper and lowercase
Note that the \p{Hex_Digit} regex is erroneous ! The right one is \p{xdigit}, at least, within Columns++
Here is another proof, from https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt. Search for the string Hex in your browser : it clearly shows that the total should be 44 !
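As a quick sanity check of that total, here is a minimal Python sketch. It simply expands the six Hex_Digit ranges as they are listed in PropList.txt; the names `HEX_DIGIT_RANGES` and `expand` are illustrative only and have nothing to do with Notepad++ or Columns++ :

```python
# Code point ranges carrying the Hex_Digit property, as listed in PropList.txt
HEX_DIGIT_RANGES = [
    (0x0030, 0x0039),  # 0-9
    (0x0041, 0x0046),  # A-F
    (0x0061, 0x0066),  # a-f
    (0xFF10, 0xFF19),  # fullwidth 0-9
    (0xFF21, 0xFF26),  # fullwidth A-F
    (0xFF41, 0xFF46),  # fullwidth a-f
]


def expand(ranges):
    """Return the set of characters covered by inclusive (first, last) ranges."""
    return {chr(cp) for first, last in ranges for cp in range(first, last + 1)}


hex_digits = expand(HEX_DIGIT_RANGES)
ascii_hex_digits = {c for c in hex_digits if ord(c) < 0x80}

print(len(hex_digits))        # 44 : the full Hex_Digit set
print(len(ascii_hex_digits))  # 22 : the ASCII-only subset [0-9A-Fa-f]
```

So a count of 22 corresponds to the ASCII-only subset, and 44 to the full Hex_Digit property.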
Now, I found out some other syntaxes about the Unicode classes :
Any Unicode class regex can be expressed with one among these four syntaxes :
\p{Xx} , \p{Xxxxxxx} , [[:Xx:]] , [[:Xxxxxxx:]]
Therefore, here is an update of my previous post https://community.notepad-plus-plus.org/post/104377 :
Against the Total_Chars.txt file, all these general results, below, are correct :
(?s). = \I = \p{Any} = [\x{0000}-\x{EFFFD}] => Total = 325,590
\p{Unicode} = [[:Unicode:]] => 325,334 |
| Total = 325,590
\P{Unicode} = [[:^Unicode:]] => 256 |
\p{Ascii} = \o => 128 |
| Total = 325,590
\P{Ascii} = \O => 325,462 |
\X => 322,586 |
| Total = 325,590
(?!\X). => 3,004 |
[\x{E000}-\x{F8FF}]|\y = [\x{E000}-\x{F8FF}]|[[:defined:]] = \p{Assigned} => 166,266 |
| Total = 325,590
(?![\x{E000}-\x{F8FF}])\Y = (?![\x{E000}-\x{F8FF}])[^[:defined:]] = \p{Not Assigned} => 159,324 |
Note : if we add, to the number of characters of Total_Chars.txt, the contents of the omitted planes ( Planes 4 to 13, 15 and 16 ), less the TWO non-characters of each, plus the Surrogate characters and all the Unicode non-characters, we obtain :
325,590 + (65536 - 2) * 12 + 2,048 + 66 = 1,114,112 which is, indeed, the total number of Unicode code points, both assigned and not assigned !
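The arithmetic above can be re-checked with a few lines of Python. This is only a sketch : the constant names are mine, and the derivation of 325,590 assumes that Total_Chars.txt covers exactly the five planes spanned by [\x{0000}-\x{EFFFD}] ( Planes 0 to 3 and 14 ), without surrogates and non-characters :

```python
PLANE_SIZE = 0x10000                  # 65,536 code points per plane
PLANE_COUNT = 17                      # 17 planes in the Unicode code space
SURROGATES = 0xDFFF - 0xD800 + 1      # 2,048 surrogate code points
# Non-characters : U+FDD0..U+FDEF (32) plus the last two code points of every plane
NONCHARACTERS = 32 + 2 * PLANE_COUNT  # 66

TOTAL_CHARS_FILE = 325_590            # count reported in the post
OMITTED_PLANES = 12                   # planes absent from Total_Chars.txt
INCLUDED_PLANES = PLANE_COUNT - OMITTED_PLANES  # 5 (assumption : planes 0-3 and 14)

# Expected size of the file : five full planes, minus the surrogates,
# minus the 32 + 2 * 5 non-characters located in those five planes
derived_file_total = INCLUDED_PLANES * PLANE_SIZE - SURROGATES - (32 + 2 * INCLUDED_PLANES)

# The sum given in the post
reconstructed = (TOTAL_CHARS_FILE
                 + (PLANE_SIZE - 2) * OMITTED_PLANES
                 + SURROGATES
                 + NONCHARACTERS)

print(derived_file_total)                         # 325590
print(reconstructed)                              # 1114112
print(reconstructed == PLANE_SIZE * PLANE_COUNT)  # True
```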
Here are the correct results, concerning all the Posix character classes, against the Total_Chars.txt file
[[:ascii:]] an UNDER \x{0080} char 128 = [\x{0000}-\x{007F}] = \p{ascii} = \o
[[:unicode:]] = \p{unicode} an OVER \x{00FF} char 325,334 = [\x{0100}-\x{EFFFD}] ( RESTRICTED to 'Total_Chars.txt' )
[[:space:]] = \p{space} = [[:s:]] = \p{s} = \ps = \s a WHITE-SPACE char 25 = [\t\n\x{000B}\f\r\x20\x{0085}\x{00A0}\x{1680}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}]
[[:h:]] = \p{h} = \ph = \h an HORIZONTAL white space char 18 = [\t\x20\x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}] = \p{Zs}|\t
[[:blank:]] = \p{blank} a BLANK char 18 = [\t\x20\x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}] = \p{Zs}|\t
[[:v:]] = \p{v} = \pv = \v a VERTICAL white space char 7 = [\n\x{000B}\f\r\x{0085}\x{2028}\x{2029}]
[[:cntrl:]] = \p{cntrl} a CONTROL code char 65 = [\x{0000}-\x{001F}\x{007F}\x{0080}-\x{009F}]
[[:upper:]] = \p{upper} = [[:u:]] = \p{u} = \pu = \u an UPPER case letter char 1,886 = \p{Lu}
[[:lower:]] = \p{lower} = [[:l:]] = \p{l} = \pl = \l a LOWER case letter char 2,283 = \p{Ll}
a DI-GRAPHIC letter char 31 = \p{Lt}
a MODIFIER letter char 410 = \p{Lm}
an OTHER letter char 141,062 = \p{Lo}
+ SYLLABLES / IDEOGRAPHS
[[:digit:]] = \p{digit} = [[:d:]] = \p{d} = \pd = \d a DECIMAL number 770 = \p{Nd}
_ = \x{005F} the LOW_LINE char 1
---------
[[:word:]] = \p{word} = [[:w:]] = \p{w} = \pw = \w a WORD char 146,443 = \p{L*}|\p{Nd}|_ ( But it should be \p{L*}|\p{Nd}|\p{M*}|\p{Pc}|\x{200C}|\x{200D} ! )
[[:alnum:]] = \p{alnum} an ALPHANUMERIC char 146,442 = \p{L*}|\p{Nd}
[[:alpha:]] = \p{alpha} any LETTER char 145,672 = \p{L*}
[[:graph:]] = \p{graph} any VISIBLE char 159,612 = [^\s[:C*:]] = (?=\S)\P{Other}
[[:print:]] = \p{print} any PRINTABLE char 159,637 = [[:graph:]]|\s
[[:punct:]] = \p{punct} any PUNCTUATION or SYMBOL char 9,473 = \p{P*}|\p{S*} = \p{Punctuation}|\p{Symbol} = 856 + 8,617
[[:xdigit:]] = \p{xdigit} an HEXADECIMAL char 22 = [0-9A-Fa-f] ( But it should be [\x{0030}-\x{0039}\x{0041}-\x{0046}\x{0061}-\x{0066}\x{FF10}-\x{FF19}\x{FF21}-\x{FF26}\x{FF41}-\x{FF46}] ! )
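The three white-space counts of this table ( 25, 18 and 7 ) can be cross-checked by expanding the explicit character lists given for \s, \h and \v. The Python sketch below does only that : it does not run any Notepad++ or Columns++ regex, and the helper `expand` is a name of my own :

```python
def expand(*items):
    """Expand single code points and inclusive (first, last) ranges into a set."""
    code_points = set()
    for item in items:
        first, last = item if isinstance(item, tuple) else (item, item)
        code_points.update(range(first, last + 1))
    return code_points


# [\t\n\x{000B}\f\r\x20\x{0085}\x{00A0}\x{1680}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}]
space = expand((0x09, 0x0D), 0x20, 0x85, 0xA0, 0x1680, (0x2000, 0x200A),
               0x2028, 0x2029, 0x202F, 0x205F, 0x3000)

# [\t\x20\x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}]
horizontal = expand(0x09, 0x20, 0xA0, 0x1680, (0x2000, 0x200A),
                    0x202F, 0x205F, 0x3000)

# [\n\x{000B}\f\r\x{0085}\x{2028}\x{2029}]
vertical = expand(0x0A, 0x0B, 0x0C, 0x0D, 0x85, 0x2028, 0x2029)

print(len(space), len(horizontal), len(vertical))  # 25 18 7
print(space == horizontal | vertical)              # True : \s is exactly \h plus \v
print(horizontal.isdisjoint(vertical))             # True : no char is both \h and \v
print(len(horizontal - {0x09}))                    # 17 : \h without the TAB char
```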
And here are the correct results regarding the Unicode character classes, against the Total_Chars.txt file :
\p{Any} = [[:Any:]] = ANY char 325,590 = (?s). = \I = [\x{0000}-\x{EFFFD}]
\p{Ascii} = [[:Ascii:]] = an UNDER \x80 char 128 = [[:ascii:]] = \o
\p{Assigned} = [[:Assigned:]] = an ASSIGNED char 166,266 ( of Total_Chars.txt, ONLY )
\p{Cc} = \p{Control} = [[:Cc:]] = [[:Control:]] = a C0 or C1 CONTROL code char 65
\p{Cf} = \p{Format} = [[:Cf:]] = [[:Format:]] = a FORMAT CONTROL char 170
\p{Cn} = \p{Not Assigned} = [[:Cn:]] = [[:Not Assigned:]] = an UNASSIGNED or NON-CHARACTER char 159,324 ( 'Total_Chars.txt' does NOT contain the 66 NON-CHARACTER chars )
\p{Co} = \p{Private Use} = [[:Co:]] = [[:Private Use:]] = a PRIVATE-USE char 6,400
\p{Cs} = \p{Surrogate} = [[:Cs:]] = [[:Surrogate:]] = a SURROGATE char [2,048] ( 'Total_Chars.txt' does NOT contain the 2,048 SURROGATE chars )
-----------
\p{C*} = \p{Other} = [[:C*:]] = [[:Other:]] = 165,959 = \p{Cc}|\p{Cf}|\p{Cn}|\p{Co}
\p{Lu} = \p{Uppercase Letter} = [[:Lu:]] = [[:Uppercase Letter:]] = an UPPER case letter char 1,886 = \u = [[:upper:]] = \p{upper}
\p{Ll} = \p{Lowercase Letter} = [[:Ll:]] = [[:Lowercase Letter:]] = a LOWER case letter char 2,283 = \l = [[:lower:]] = \p{lower}
\p{Lt} = \p{Titlecase} = [[:Lt:]] = [[:Titlecase:]] = a DI-GRAPHIC letter char 31
\p{Lm} = \p{Modifier Letter} = [[:Lm:]] = [[:Modifier Letter:]] = a MODIFIER letter char 410
\p{Lo} = \p{Other Letter} = [[:Lo:]] = [[:Other Letter:]] = an OTHER letter char 141,062
+ SYLLABLES / IDEOGRAPHS
-----------
\p{L*} = \p{Letter} = [[:L*:]] = [[:Letter:]] = 145,672 = \p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo} = [[:alpha:]] = \p{alpha}
\p{Mc} = \p{Spacing Combining Mark} = [[:Mc:]] = [[:Spacing Combining Mark:]] = a SPACING COMBINING char 471
\p{Me} = \p{Enclosing Mark} = [[:Me:]] = [[:Enclosing Mark:]] = an ENCLOSING char 13
\p{Mn} = \p{Non-Spacing Mark} = [[:Mn:]] = [[:Non-Spacing Mark:]] = a NON-SPACING COMBINING char 2,059
--------
\p{M*} = \p{Mark} = [[:M*:]] = [[:Mark:]] 2,543 = \p{Mc}|\p{Me}|\p{Mn}
\p{Nd} = \p{Decimal Digit Number} = [[:Nd:]] = [[:Decimal Digit Number:]] = a DECIMAL number char 770
\p{Nl} = \p{Letter Number} = [[:Nl:]] = [[:Letter Number:]] = a LETTERLIKE numeric char 239
\p{No} = \p{Other Number} = [[:No:]] = [[:Other Number:]] = OTHER NUMERIC char 915
--------
\p{N*} = \p{Number} = [[:N*:]] = [[:Number:]] 1,924 = \p{Nd}|\p{Nl}|\p{No}
\p{Pd} = \p{Dash Punctuation} = [[:Pd:]] = [[:Dash Punctuation:]] = a DASH or HYPHEN punctuation char 27
\p{Ps} = \p{Open Punctuation} = [[:Ps:]] = [[:Open Punctuation:]] = an OPENING PUNCTUATION char 79
\p{Pc} = \p{Connector Punctuation} = [[:Pc:]] = [[:Connector Punctuation:]] = a CONNECTING PUNCTUATION char 10
\p{Pe} = \p{Close Punctuation} = [[:Pe:]] = [[:Close Punctuation:]] = a CLOSING PUNCTUATION char 77
\p{Pi} = \p{Initial Punctuation} = [[:Pi:]] = [[:Initial Punctuation:]] = an INITIAL QUOTATION char 12
\p{Pf} = \p{Final Punctuation} = [[:Pf:]] = [[:Final Punctuation:]] = a FINAL QUOTATION char 10
\p{Po} = \p{Other Punctuation} = [[:Po:]] = [[:Other Punctuation:]] = OTHER PUNCTUATION char 641
-------
\p{P*} = \p{Punctuation} = [[:P*:]] = [[:Punctuation:]] = 856 = \p{Pd}|\p{Ps}|\p{Pc}|\p{Pe}|\p{Pi}|\p{Pf}|\p{Po}
\p{Sm} = \p{Math Symbol} = [[:Sm:]] = [[:Math Symbol:]] = a MATHEMATICAL symbol char 960
\p{Sc} = \p{Currency Symbol} = [[:Sc:]] = [[:Currency Symbol:]] = a CURRENCY char 64
\p{Sk} = \p{Modifier Symbol} = [[:Sk:]] = [[:Modifier Symbol:]] = a NON-LETTERLIKE MODIFIER char 125
\p{So} = \p{Other Symbol} = [[:So:]] = [[:Other Symbol:]] = OTHER SYMBOL char 7,468
---------
\p{S*} = \p{Symbol} = [[:S*:]] = [[:Symbol:]] = 8,617 = \p{Sm}|\p{Sc}|\p{Sk}|\p{So}
\p{Zs} = \p{Space Separator} = [[:Zs:]] = [[:Space Separator:]] = a NON-ZERO width SPACE char 17 = [\x{0020}\x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}] = (?!\t)\h
\p{Zl} = \p{Line Separator} = [[:Zl:]] = [[:Line Separator:]] = the LINE SEPARATOR char 1 = \x{2028}
\p{Zp} = \p{Paragraph Separator} = [[:Zp:]] = [[:Paragraph Separator:]] = the PARAGRAPH SEPARATOR char 1 = \x{2029}
------
\p{Z*} = \p{Separator} = [[:Z*:]] = [[:Separator:]] = 19 = \p{Zs}|\p{Zl}|\p{Zp}
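To double-check that the sub-totals above are consistent with each other, here is a small Python sketch. It only re-adds the numbers reported in this post ( it does not recount anything from Total_Chars.txt ), and the dictionary layout is just an illustration :

```python
# General-category counts as reported above : group -> (sub-total, {category: count})
reported = {
    "C*": (165_959, {"Cc": 65, "Cf": 170, "Cn": 159_324, "Co": 6_400}),
    "L*": (145_672, {"Lu": 1_886, "Ll": 2_283, "Lt": 31, "Lm": 410, "Lo": 141_062}),
    "M*": (2_543, {"Mc": 471, "Me": 13, "Mn": 2_059}),
    "N*": (1_924, {"Nd": 770, "Nl": 239, "No": 915}),
    "P*": (856, {"Pd": 27, "Ps": 79, "Pc": 10, "Pe": 77, "Pi": 12, "Pf": 10, "Po": 641}),
    "S*": (8_617, {"Sm": 960, "Sc": 64, "Sk": 125, "So": 7_468}),
    "Z*": (19, {"Zs": 17, "Zl": 1, "Zp": 1}),
}
TOTAL = 325_590  # number of characters in Total_Chars.txt, per the post

# Groups whose categories do not add up to the reported sub-total
mismatches = [group for group, (subtotal, parts) in reported.items()
              if sum(parts.values()) != subtotal]

# Every character belongs to exactly one group, so the sub-totals must add up to TOTAL
grand_total = sum(subtotal for subtotal, _ in reported.values())

print(mismatches)            # []
print(grand_total == TOTAL)  # True
```

Each group adds up to its sub-total, and the seven sub-totals add up to 325,590.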
Remark :
A negative POSIX character class can be expressed as [^[:........:]] or [[:^........:]]
A negative UNICODE character class can be expressed as \P{..}, with an uppercase letter P
Now, if you follow the procedure explained in the last part of this post :
https://community.notepad-plus-plus.org/post/99844
The regexes [\x{DC80}-\x{DCFF}] or \i or [[:invalid:]] do give 134 occurrences, which is the exact number of invalid UTF-8 characters of that example !
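For comparison, Python offers a similar convention with its `surrogateescape` error handler, which maps each undecodable byte 0x80-0xFF to a lone surrogate U+DC80-U+DCFF. Whether Columns++ uses exactly this mapping internally is an assumption on my side, inferred only from the [\x{DC80}-\x{DCFF}] range above; the sample bytes are invented for the illustration :

```python
import re

# "café" in valid UTF-8, then two invalid bytes, then a truncated 2-byte sequence
raw = b"caf\xc3\xa9 \xff\xfe ok \xc3"

# Each byte that is not part of a valid UTF-8 sequence becomes U+DC00 + byte value
text = raw.decode("utf-8", errors="surrogateescape")

# Same range as the regex [\x{DC80}-\x{DCFF}] used in the post
invalid = re.findall("[\udc80-\udcff]", text)

print(len(invalid))                       # 3
print([hex(ord(c)) for c in invalid])     # ['0xdcff', '0xdcfe', '0xdcc3']

# The mapping is reversible : encoding with the same handler restores the bytes
print(text.encode("utf-8", errors="surrogateescape") == raw)  # True
```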
Best Regards,
guy038