Columns++ version 1.3: All Unicode, all the time
-
Columns++ version 1.3 brings the enhancements for regular expressions in Unicode documents to ANSI documents as well:
-
Regular expressions now match based on Unicode code points in all documents, so the syntax and semantics of regular expressions are no longer dependent on the underlying representation in Scintilla. The features added in version 1.2 for Unicode documents now work in all documents.
-
Regular expressions did not work properly in ANSI documents for the system default code pages 932 (Japanese, Shift-JIS), 936 (Chinese Simplified, GB2312), 949 (Korean, Windows-949 / Unified Hangul Code) or 950 (Chinese Traditional, Big5) in version 1.2. Regular expressions now match these documents based on Unicode code points.
-
Character information tables for regular expressions were updated to reflect Unicode 17.0.0 (released on 2025-09-09).
I haven’t yet set this as “stable” (and of course it isn’t yet in the plugins admin list). Anyone feeling adventuresome is welcome to try it and see if it behaves. There is documentation describing how Unicode-based matching in Columns++ differs from Notepad++ matching.
If anyone reading this routinely has one of the CJK code pages as their system default code page, I would be very interested to know if regular expressions in this version appear to work as expected on your locale’s ANSI files.
-
-
Hello, @coises and All,
I’ve just tried your latest ColumnsPlusPlus v1.3 release and, indeed, the search is now a true Unicode search, whatever the encoding of each file!
Let’s consider this simple UTF-8 text:
This ‟ is a † very • small ‰ text ‱ for › test 201F 2020 2022 2030 2031 203A in Unicode UTF-8 encoding
And this ANSI text:
This ? is a † very • small ‰ text ? for › test ? 0086 0095 0089 ? 009B in Windows-1252 encoding
IMPORTANT: when this second text is opened in N++, don’t forget to run the Encoding > Convert to ANSI option first!
Now, we can create the following table, which recapitulates the non-ASCII characters used in my examples:
•--------•-----------------•-----------------•
|        |  Windows-1252   |     Unicode     |
|        •--------•--------•--------•--------•
|  Char  |  Dec   |  Hex   |  Dec   |  Hex   |
•--------•--------•--------•--------•--------•
|   ‟    |   ?    |   ?    |  8223  |  201F  |
|        |        |        |        |        |
|   †    |  0134  |  0086  |  8224  |  2020  |
|        |        |        |        |        |
|   •    |  0149  |  0095  |  8226  |  2022  |
|        |        |        |        |        |
|   ‰    |  0137  |  0089  |  8240  |  2030  |
|        |        |        |        |        |
|   ‱    |   ?    |   ?    |  8241  |  2031  |
|        |        |        |        |        |
|   ›    |  0155  |  009B  |  8250  |  203A  |
•--------•--------•--------•--------•--------•
- In Notepad++:
  - Within an ANSI file, the regexes [†-‰] or [\x86-\x89] would only find the characters † and ‰, but not the •, whose Win-1252 code (\x95) is after \x89
  - Within a UTF-8 file, the regexes [†-‰] or [\x{2020}-\x{2030}] would find the characters † and ‰ and also the •, whose Unicode code point is between 2020 and 2030
- In Columns++:
  - Within an ANSI file, the regexes [†-‰] or [\x{2020}-\x{2030}] would find the characters † and ‰ and also the •, whose Unicode code point is between 2020 and 2030
  - Within a UTF-8 file, the regexes [†-‰] or [\x{2020}-\x{2030}] would find the characters † and ‰ and also the •, whose Unicode code point is between 2020 and 2030
Note that, with the range [†-›] within an ANSI file, a N++ search would also have found the • char, as its Win-1252 code (\x95) lies between \x86 and \x9B, just as its code point (2022) lies between 2020 and 203A!
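The difference between a byte-based range and a code-point-based range can be reproduced outside Notepad++. The following is only a minimal Python sketch, not the Boost or Columns++ implementation: it assumes that a byte-oriented engine sees the Windows-1252 bytes of the ANSI file, and it imitates the two views with Python's `re` module on `bytes` and on `str`. The sample string and the variable names are mine.

```python
import re

# Sample text limited to the four characters that exist in Windows-1252.
text = "This is a † very • small ‰ text for › test"

# Code-point view: the range U+2020..U+2030 contains †, • and ‰.
codepoint_hits = re.findall("[\u2020-\u2030]", text)

# Byte view of the same text stored as Windows-1252 ("ANSI"):
# † = 0x86, • = 0x95, ‰ = 0x89, › = 0x9B.
ansi = text.encode("cp1252")
narrow_hits = [m.decode("cp1252") for m in re.findall(rb"[\x86-\x89]", ansi)]
wide_hits = [m.decode("cp1252") for m in re.findall(rb"[\x86-\x9B]", ansi)]

print(codepoint_hits)  # †, • and ‰
print(narrow_hits)     # † and ‰ only: 0x95 lies outside 0x86..0x89
print(wide_hits)       # †, •, ‰ and ›
```

The byte range only catches the bullet once its upper bound reaches 0x95 or beyond, which is the behaviour described for N++ with ANSI files.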
Now, @coises, I cannot easily test the CJK behaviour of your new search engine, as I obviously do not have a default CJK code page, which is needed for such a study! However, I do not see why your new search behavior couldn’t be applied to any kind of Unicode chars ;-)
Best Regards,
guy038
-
-
Hello, @coises and All,
When I first used the v1.3 release of Columns++, I did not pay attention to the fact that, among the new features, there was this point:
Character information tables for regular expressions were updated to reflect Unicode 17.0.0 (released on 2025-09-09)
So, in this post, I re-tested all the regex features of the v1.3 release, which you’ll find below, and I am pleased to tell you that ALL results are correct, EXCEPT for one thing:
Indeed, there’s a bug, somehow, regarding the Mark characters:
- Open this file in your browser: https://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt
- Then, hit the Ctrl + F shortcut within your browser and search for the string Nonsp, within the DerivedGeneralCategory.txt file
- Under the first occurrence, you should see:
# General_Category=Nonspacing_Mark
                   ¯¯¯¯¯
0300..036F    ; Mn # [112] COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X
0483..0487    ; Mn #   [5] COMBINING CYRILLIC TITLO..COMBINING CYRILLIC POKRYTIE
0591..05BD    ; Mn #  [45] HEBREW ACCENT ETNAHTA..HEBREW POINT METEG
The first line clearly shows that the 112 characters of the COMBINING DIACRITICAL MARKS Unicode block (refer to https://www.unicode.org/charts/PDF/U0300.pdf ) are considered, by the Unicode Consortium, as Non Spacing Mark characters!
And, indeed, if I use the regex [\x{0300}-\x{036F}] against my Total_Chars.txt file, it correctly returns 112 occurrences, and if I use the \p{Mn} regex, it correctly returns 2,059 occurrences, too.
However, when I test the regexes (?=[\x{0300}-\x{036F}])\p{M*} or (?=\p{M*})[\x{0300}-\x{036F}] or, more precisely, the regexes (?=[\x{0300}-\x{036F}])\p{Mn} or (?=\p{Mn})[\x{0300}-\x{036F}], they ONLY return 111 occurrences and NOT 112! Did I make a mistake?
Now, against the Total_Chars.txt file, all these general results, below, are correct:
(?s).  =  \I  =  \p{Any}  =  [\x{0000}-\x{EFFFD}]                                            =>  Total = 325,590

\p{Unicode}  =  [[:Unicode:]]                                                                  =>  325,334  |
                                                                                                            |  Total = 325,590
\P{Unicode}  =  [[:^Unicode:]]                                                                 =>      256  |

\p{Ascii}  =  \o                                                                               =>      128  |
                                                                                                            |  Total = 325,590
\P{Ascii}  =  \O                                                                               =>  325,462  |

\X                                                                                             =>  322,586  |
                                                                                                            |  Total = 325,590
(?!\X).                                                                                        =>    3,004  |

[\x{E000}-\x{F8FF}]|\y  =  [\x{E000}-\x{F8FF}]|[[:defined:]]  =  \p{Assigned}                  =>  166,266  |
                                                                                                            |  Total = 325,590
(?![\x{E000}-\x{F8FF}])\Y  =  (?![\x{E000}-\x{F8FF}])[^[:defined:]]  =  \p{Not Assigned}       =>  159,324  |
Note: if we add, to the number of characters of Total_Chars.txt, the contents of the 12 omitted planes (planes 4 to 13, 15 and 16), less the TWO non-characters of each, plus the Surrogate characters and all the Unicode non-characters, we obtain:
325,590 + (65,536 - 2) * 12 + 2,048 + 66 = 1,114,112
which is, indeed, the total number of Unicode code points, whether assigned or not!
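The sum in the note above can be re-checked mechanically. This is just a small Python sketch of that arithmetic; the figures are the ones reported in this post, and the variable names are mine.

```python
PLANE_SIZE = 0x10000        # 65,536 code points per plane

counted_in_file = 325_590   # characters matched in Total_Chars.txt (reported above)
omitted_planes = 12         # planes that are not part of the file
surrogates = 2_048          # U+D800..U+DFFF
noncharacters = 66          # all the Unicode non-characters

total = (counted_in_file
         + (PLANE_SIZE - 2) * omitted_planes
         + surrogates
         + noncharacters)

print(total)                     # grand total of the sum above
print(total == 17 * PLANE_SIZE)  # compare with 17 planes of 65,536 code points
```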
Here are the correct results, concerning all the Posix character classes, against the Total_Chars.txt file:
[[:ascii:]]                                             an UNDER \x{0080} character                    128  =  [\x{0000}-\x{007F}]  =  \p{ascii}  =  \o
[[:unicode:]] = \p{unicode}                             an OVER \x{00FF} character                 325,334  =  [\x{0100}-\x{EFFFD}]  ( restricted to 'Total_Chars.txt' )
[[:space:]] = \p{space} = [[:s:]] = \p{s} = \ps = \s    a WHITE-SPACE character                         25  =  [\t\n\x{000B}\f\r\x20\x{0085}\x{00A0}\x{1680}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}]
[[:h:]] = \p{h} = \ph = \h                              an HORIZONTAL white space character             18  =  [\t\x20\x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}]  =  \p{Zs}|\t
[[:blank:]] = \p{blank}                                 a BLANK character                               18  =  [\t\x20\x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}]  =  \p{Zs}|\t
[[:v:]] = \p{v} = \pv = \v                              a VERTICAL white space character                 7  =  [\n\x{000B}\f\r\x{0085}\x{2028}\x{2029}]
[[:cntrl:]] = \p{cntrl}                                 a CONTROL code character                        65  =  [\x{0000}-\x{001F}\x{007F}\x{0080}-\x{009F}]
[[:upper:]] = \p{upper} = [[:u:]] = \p{u} = \pu = \u    an UPPER case letter                         1,886  =  \p{Lu}
[[:lower:]] = \p{lower} = [[:l:]] = \p{l} = \pl = \l    a LOWER case letter                          2,283  =  \p{Ll}
                                                        a DI-GRAPHIC letter                             31  =  \p{Lt}
                                                        a MODIFIER letter                              410  =  \p{Lm}
                                                        an OTHER letter + SYLLABLES / IDEOGRAPHS   141,062  =  \p{Lo}
[[:digit:]] = \p{digit} = [[:d:]] = \p{d} = \pd = \d    a DECIMAL number                               770  =  \p{Nd}
_ = \x{005F}                                            the LOW_LINE character                           1
                                                                                               -----------
[[:word:]] = \p{word} = [[:w:]] = \p{w} = \pw = \w      a WORD character                           146,443  =  \p{L*}|\p{nd}|_
[[:alnum:]] = \p{alnum}                                 an ALPHANUMERIC character                  146,442  =  \p{L*}|\p{nd}
[[:alpha:]] = \p{alpha}                                 any LETTER character                       145,672  =  \p{L*}
[[:graph:]] = \p{graph}                                 any VISIBLE character                      159,612
[[:print:]] = \p{print}                                 any PRINTABLE character                    159,637  =  [[:graph:]]|\s
[[:punct:]] = \p{punct}                                 any PUNCTUATION or SYMBOL character          9,473  =  \p{P*}|\p{S*}  =  \p{Punctuation}|\p{Symbol}  =  856 + 8,617
[[:xdigit:]]                                            an HEXADECIMAL character                        22  =  [0-9A-Fa-f]
And here are the correct results regarding the Unicode character classes, against the Total_Chars.txt file:
\p{Any}                                    Any character                                        325,590  =  (?s).  =  \I  =  [\x{0000}-\x{EFFFD}]
\p{Ascii}                                  a character UNDER \x80                                   128  =  [[:ascii:]]  =  \o
\p{Assigned}                               an ASSIGNED character                                166,266  ( of Total_Chars.txt, ONLY )
\p{Cc}  =  \p{Control}                     a C0 or C1 CONTROL code character                         65
\p{Cf}  =  \p{Format}                      a FORMAT CONTROL character                               170
\p{Cn}  =  \p{Not Assigned}                an UNASSIGNED or NON-CHARACTER character             159,324  ( 'Total_Chars.txt' does NOT contain the 66 NON-CHARACTER chars )
\p{Co}  =  \p{Private Use}                 a PRIVATE-USE character                                6,400
\p{Cs}  =  \p{Surrogate}  (INVALID regex)  a SURROGATE character                                [2,048]  ( 'Total_Chars.txt' does NOT contain the 2,048 SURROGATE chars )
                                                                                            -----------
\p{C*}  =  \p{Other}                                                                            165,959  =  \p{Cc}|\p{Cf}|\p{Cn}|\p{Co}
\p{Lu}  =  \p{Uppercase Letter}            an UPPER case letter                                   1,886  =  \u  =  [[:upper:]]  =  \p{upper}
\p{Ll}  =  \p{Lowercase Letter}            a LOWER case letter                                    2,283  =  \l  =  [[:lower:]]  =  \p{lower}
\p{Lt}  =  \p{Titlecase}                   a DI-GRAPHIC letter                                       31
\p{Lm}  =  \p{Modifier Letter}             a MODIFIER letter                                        410
\p{Lo}  =  \p{Other Letter}                OTHER LETTER, including SYLLABLES and IDEOGRAPHS     141,062
                                                                                            -----------
\p{L*}  =  \p{Letter}                                                                           145,672  =  \p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}  =  [[:alpha:]]  =  \p{alpha}
\p{Mc}  =  \p{Spacing Combining Mark}      a SPACING COMBINING mark (POSITIVE advance width)        471
\p{Me}  =  \p{Enclosing Mark}              an ENCLOSING COMBINING mark                               13
\p{Mn}  =  \p{Non-Spacing Mark}            a NON-SPACING COMBINING mark (ZERO advance width)      2,059
                                                                                              ---------
\p{M*}  =  \p{Mark}                                                                               2,543  =  \p{Mc}|\p{Me}|\p{Mn}
\p{Nd}  =  \p{Decimal Digit Number}        a DECIMAL number character                               770
\p{Nl}  =  \p{Letter Number}               a LETTERLIKE numeric character                           239
\p{No}  =  \p{Other Number}                OTHER NUMERIC character                                  915
                                                                                              ---------
\p{N*}  =  \p{Number}                                                                             1,924  =  \p{Nd}|\p{Nl}|\p{No}
\p{Pd}  =  \p{Dash Punctuation}            a DASH or HYPHEN punctuation mark                         27
\p{Ps}  =  \p{Open Punctuation}            an OPENING PUNCTUATION mark, in a pair                    79
\p{Pc}  =  \p{Connector Punctuation}       a CONNECTING PUNCTUATION mark                             10
\p{Pe}  =  \p{Close Punctuation}           a CLOSING PUNCTUATION mark, in a pair                     77
\p{Pi}  =  \p{Initial Punctuation}         an INITIAL QUOTATION mark                                 12
\p{Pf}  =  \p{Final Punctuation}           a FINAL QUOTATION mark                                    10
\p{Po}  =  \p{Other Punctuation}           OTHER PUNCTUATION mark                                   641
                                                                                                -------
\p{P*}  =  \p{Punctuation}                                                                          856  =  \p{Pd}|\p{Ps}|\p{Pc}|\p{Pe}|\p{Pi}|\p{Pf}|\p{Po}
\p{Sm}  =  \p{Math Symbol}                 a MATHEMATICAL symbol character                          960
\p{Sc}  =  \p{Currency Symbol}             a CURRENCY character                                      64
\p{Sk}  =  \p{Modifier Symbol}             a NON-LETTERLIKE MODIFIER character                      125
\p{So}  =  \p{Other Symbol}                OTHER SYMBOL character                                 7,468
                                                                                                -------
\p{S*}  =  \p{Symbol}                                                                             8,617  =  \p{Sm}|\p{Sc}|\p{Sk}|\p{So}
\p{Zs}  =  \p{Space Separator}             a NON-ZERO width SPACE character                          17  =  [\x{0020}\x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}]  =  (?!\t)\h
\p{Zl}  =  \p{Line Separator}              the LINE SEPARATOR character                               1  =  \x{2028}
\p{Zp}  =  \p{Paragraph Separator}         the PARAGRAPH SEPARATOR character                          1  =  \x{2029}
                                                                                                 ------
\p{Z*}  =  \p{Separator}                                                                             19  =  \p{Zs}|\p{Zl}|\p{Zp}
Remark:
- A negative POSIX character class can be expressed as [^[:........:]] or [[:^........:]]
- A negative UNICODE character class can be expressed as \P{..}, with an uppercase letter P
Now, if you follow the procedure explained in the last part of this post:
https://community.notepad-plus-plus.org/post/99844
The regexes [\x{DC80}-\x{DCFF}] or \i or [[:invalid:]] do give 134 occurrences, which is the exact number of invalid UTF-8 characters of that example!
Continuation on next post
-
-
Hi @Coises and All,
Continuation and end of my reply :
I also tested the whole Equivalence classes feature, against the Total_Chars.txt file.
With Columns++, we can use ANY equivalent character to get the total number of matches of its equivalence class.
For instance, [[=Ⱥ=]] = [[=ⱥ=]] = [[=Ɐ=]] always gives 86 matches, whereas the native N++ Boost engine is less coherent and sometimes displays a wrong number of occurrences :-((
Here is, below, the list of all the equivalences of any char of the Windows-1252 code page, from \x{0020} till \x{00DE}.
Note that, except for the DEL character, kept as an example, I did not consider the equivalence classes which only return 1 match!
I also confirm that I did not find any character over
\x{FFFF}which would be part of a regex equivalence class, either with our Boost engine or with theColumns++search engine ![[= =]] = [[=space=]] => 3 ( ) [[=!=]] = [[=exclamation-mark=]] => 2 ( !! ) [[="=]] = [[=quotation-mark=]] => 3 ( "⁍" ) [[=#=]] = [[=number-sign=]] => 4 ( #؞⁗# ) [[=$=]] = [[=dollar-sign=]] => 3 ( $⁒$ ) [[=%=]] = [[=percent-sign=]] => 3 ( %⁏% ) [[=&=]] = [[=ampersand=]] => 3 ( &⁋& ) [[='=]] = [[=apostrophe=]] => 2 ( '' ) [[=(=]] = [[=left-parenthesis=]] => 4 ( (⁽₍( ) [[=)=]] = [[=right-parenthesis=]] => 4 ( )⁾₎) ) [[=*=]] = [[=asterisk=]] => 2 ( ** ) [[=+=]] = [[=plus-sign=]] => 6 ( +⁺₊﬩﹢+ ) [[=,=]] = [[=comma=]] => 2 ( ,, ) [[=-=]] = [[=hyphen=]] => 3 ( -﹣- ) [[=.=]] = [[=period=]] => 3 ( .․. ) [[=/=]] = [[=slash=]] => 2 ( // ) [[=0=]] = [[=zero=]] => 48 ( 0٠۟۠۰߀०০੦૦୦୵௦౦౸೦൦๐໐༠၀႐០᠐᥆᧐᪀᪐᭐᮰᱀᱐⁰₀↉⓪⓿〇㍘꘠ꛯ꠳꣐꤀꧐꩐꯰0 ) [[=1=]] = [[=one=]] => 54 ( 1¹١۱߁१১੧૧୧௧౧౹౼೧൧๑໑༡၁႑፩១᠑᥇᧑᧚᪁᪑᭑᮱᱁᱑₁①⑴⒈⓵⚀❶➀➊〡㋀㍙㏠꘡ꛦ꣑꤁꧑꩑꯱1 ) [[=2=]] = [[=two=]] => 54 ( 2²ƻ٢۲߂२২੨૨୨௨౨౺౽೨൨๒໒༢၂႒፪២᠒᥈᧒᪂᪒᭒᮲᱂᱒₂②⑵⒉⓶⚁❷➁➋〢㋁㍚㏡꘢ꛧ꣒꤂꧒꩒꯲2 ) [[=3=]] = [[=three=]] => 53 ( 3³٣۳߃३৩੩૩୩௩౩౻౾೩൩๓໓༣၃႓፫៣᠓᥉᧓᪃᪓᭓᮳᱃᱓₃③⑶⒊⓷⚂❸➂➌〣㋂㍛㏢꘣ꛨ꣓꤃꧓꩓꯳3 ) [[=4=]] = [[=four=]] => 51 ( 4٤۴߄४৪੪૪୪௪౪೪൪๔໔༤၄႔፬៤᠔᥊᧔᪄᪔᭔᮴᱄᱔⁴₄④⑷⒋⓸⚃❹➃➍〤㋃㍜㏣꘤ꛩ꣔꤄꧔꩔꯴4 ) [[=5=]] = [[=five=]] => 53 ( 5Ƽƽ٥۵߅५৫੫૫୫௫౫೫൫๕໕༥၅႕፭៥᠕᥋᧕᪅᪕᭕᮵᱅᱕⁵₅⑤⑸⒌⓹⚄❺➄➎〥㋄㍝㏤꘥ꛪ꣕꤅꧕꩕꯵5 ) [[=6=]] = [[=six=]] => 52 ( 6٦۶߆६৬੬૬୬௬౬೬൬๖໖༦၆႖፮៦᠖᥌᧖᪆᪖᭖᮶᱆᱖⁶₆ↅ⑥⑹⒍⓺⚅❻➅➏〦㋅㍞㏥꘦ꛫ꣖꤆꧖꩖꯶6 ) [[=7=]] = [[=seven=]] => 50 ( 7٧۷߇७৭੭૭୭௭౭೭൭๗໗༧၇႗፯៧᠗᥍᧗᪇᪗᭗᮷᱇᱗⁷₇⑦⑺⒎⓻❼➆➐〧㋆㍟㏦꘧ꛬ꣗꤇꧗꩗꯷7 ) [[=8=]] = [[=eight=]] => 50 ( 8٨۸߈८৮੮૮୮௮౮೮൮๘໘༨၈႘፰៨᠘᥎᧘᪈᪘᭘᮸᱈᱘⁸₈⑧⑻⒏⓼❽➇➑〨㋇㍠㏧꘨ꛭ꣘꤈꧘꩘꯸8 ) [[=9=]] = [[=nine=]] => 50 ( 9٩۹߉९৯੯૯୯௯౯೯൯๙໙༩၉႙፱៩᠙᥏᧙᪉᪙᭙᮹᱉᱙⁹₉⑨⑼⒐⓽❾➈➒〩㋈㍡㏨꘩ꛮ꣙꤉꧙꩙꯹9 ) [[=:=]] = [[=colon=]] => 2 ( :: ) [[=;=]] = [[=semicolon=]] => 3 ( ;;; ) [[=<=]] = [[=less-than-sign=]] => 3 ( <﹤< ) [[===]] = [[=equals-sign=]] => 5 ( =⁼₌﹦= ) [[=>=]] = [[=greater-than-sign=]] => 3 ( >﹥> ) [[=?=]] = [[=question-mark=]] => 2 ( ?? 
) [[=@=]] = [[=commercial-at=]] => 2 ( @@ ) [[=A=]] => 86 ( AaªÀÁÂÃÄÅàáâãäåĀāĂ㥹ǍǎǞǟǠǡǺǻȀȁȂȃȦȧȺɐɑɒͣᴀᴬᵃᵄᶏᶐᶛᷓḀḁẚẠạẢảẤấẦầẨẩẪẫẬậẮắẰằẲẳẴẵẶặₐÅ⒜ⒶⓐⱥⱭⱯⱰAa ) [[=B=]] => 29 ( BbƀƁƂƃƄƅɃɓʙᴃᴮᴯᵇᵬᶀḂḃḄḅḆḇℬ⒝ⒷⓑBb ) [[=C=]] => 40 ( CcÇçĆćĈĉĊċČčƆƇƈȻȼɔɕʗͨᴄᴐᵓᶗᶜᶝᷗḈḉℂ℃ℭ⒞ⒸⓒꜾꜿCc ) [[=D=]] => 44 ( DdÐðĎďĐđƊƋƌƍɗʤͩᴅᴆᴰᵈᵭᶁᶑᶞᷘᷙḊḋḌḍḎḏḐḑḒḓⅅⅆ⒟ⒹⓓꝹꝺDd ) [[=E=]] => 82 ( EeÈÉÊËèéêëĒēĔĕĖėĘęĚěƎƏǝȄȅȆȇȨȩɆɇɘəɚͤᴇᴱᴲᵉᵊᶒᶕḔḕḖḗḘḙḚḛḜḝẸẹẺẻẼẽẾếỀềỂểỄễỆệₑₔ℮ℯℰ⅀ⅇ⒠ⒺⓔⱸⱻEe ) [[=F=]] => 22 ( FfƑƒᵮᶂᶠḞḟ℉ℱℲⅎ⒡ⒻⓕꜰꝻꝼꟻFf ) [[=G=]] => 47 ( GgĜĝĞğĠġĢģƓƔǤǥǦǧǴǵɠɡɢɣɤʛˠᴳᵍᵷᶃᶢᷚᷛḠḡℊ⅁⒢ⒼⓖꝾꝿꞠꞡꞬGg ) [[=H=]] => 42 ( HhĤĥĦħȞȟɥɦʜʰʱͪᴴᶣḢḣḤḥḦḧḨḩḪḫẖₕℋℌℍℎℏ⒣ⒽⓗⱧⱨꞍꞪHh ) [[=I=]] => 62 ( IiÌÍÎÏìíîïĨĩĪīĬĭĮįİıƖƗǏǐȈȉȊȋɨɩɪͥᴉᴵᵎᵢᵻᵼᶖᶤᶥᶦᶧḬḭḮḯỈỉỊịⁱℐℑⅈ⒤ⒾⓘꞮꟾIi ) [[=J=]] => 24 ( JjĴĵǰȷɈɉɟʄʝʲᴊᴶᶡᶨⅉ⒥ⒿⓙⱼꞲJj ) [[=K=]] => 39 ( KkĶķĸƘƙǨǩʞᴋᴷᵏᶄᷜḰḱḲḳḴḵₖK⒦ⓀⓚⱩⱪꝀꝁꝂꝃꝄꝅꞢꞣꞰKk ) [[=L=]] => 58 ( LlĹĺĻļĽľĿŀŁłƚƛȽɫɬɭɮʟˡᴌᴸᶅᶩᶪᶫᷝᷞḶḷḸḹḺḻḼḽₗℒℓ⅂⅃⒧ⓁⓛⱠⱡⱢꝆꝇꝈꝉꞀꞁꞭLl ) [[=M=]] => 33 ( MmƜɯɰɱͫᴍᴟᴹᵐᵚᵯᶆᶬᶭᷟḾḿṀṁṂṃₘℳ⒨ⓂⓜⱮꝳꟽMm ) [[=N=]] => 47 ( NnÑñŃńŅņŇňʼnƝƞǸǹȠɲɳɴᴎᴺᴻᵰᶇᶮᶯᶰᷠᷡṄṅṆṇṈṉṊṋⁿₙℕ⒩ⓃⓝꞤꞥNn ) [[=O=]] => 106 ( OoºÒÓÔÕÖØòóôõöøŌōŎŏŐőƟƠơƢƣǑǒǪǫǬǭǾǿȌȍȎȏȪȫȬȭȮȯȰȱɵɶɷͦᴏᴑᴒᴓᴕᴖᴗᴼᵒᵔᵕᶱṌṍṎṏṐṑṒṓỌọỎỏỐốỒồỔổỖỗỘộỚớỜờỞởỠỡỢợₒℴ⒪ⓄⓞⱺꝊꝋꝌꝍOo ) [[=P=]] => 33 ( PpƤƥɸᴘᴾᵖᵱᵽᶈᶲṔṕṖṗₚ℘ℙ⒫ⓅⓟⱣⱷꝐꝑꝒꝓꝔꝕꟼPp ) [[=Q=]] => 16 ( QqɊɋʠℚ℺⒬ⓆⓠꝖꝗꝘꝙQq ) [[=R=]] => 64 ( RrŔŕŖŗŘřƦȐȑȒȓɌɍɹɺɻɼɽɾɿʀʁʳʴʵʶͬᴙᴚᴿᵣᵲᵳᶉᷢᷣṘṙṚṛṜṝṞṟℛℜℝ⒭ⓇⓡⱤⱹꝚꝛꝜꝝꝵꝶꞂꞃRr ) [[=S=]] => 50 ( SsŚśŜŝŞşŠšſƧƨƩƪȘșȿʂʃʅʆˢᵴᶊᶋᶘᶳᶴᷤṠṡṢṣṤṥṦṧṨṩẛₛ⒮ⓈⓢⱾꜱꟅSs ) [[=T=]] => 47 ( TtŢţŤťƫƬƭƮȚțȶȾʇʈʧʨͭᴛᵀᵗᵵᶵᶿṪṫṬṭṮṯṰṱẗₜ⒯ⓉⓣⱦꜨꜩꝷꞆꞇꞱTt ) [[=U=]] => 82 ( UuÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųƯưƱǓǔǕǖǗǘǙǚǛǜȔȕȖȗɄʉʊͧᴜᵁᵘᵤᵾᵿᶙᶶᶷᶸṲṳṴṵṶṷṸṹṺṻỤụỦủỨứỪừỬửỮữỰự⒰ⓊⓤUu ) [[=V=]] => 29 ( VvƲɅʋʌͮᴠᵛᵥᶌᶹᶺṼṽṾṿỼỽ⒱ⓋⓥⱱⱴⱽꝞꝟVv ) [[=W=]] => 28 ( WwŴŵƿǷʍʷᴡᵂẀẁẂẃẄẅẆẇẈẉẘ⒲ⓌⓦⱲⱳWw ) [[=X=]] => 15 ( XxˣͯᶍẊẋẌẍₓ⒳ⓍⓧXx ) [[=Y=]] => 36 ( YyÝýÿŶŷŸƳƴȲȳɎɏʎʏʸẎẏẙỲỳỴỵỶỷỸỹỾỿ⅄⒴ⓎⓨYy ) [[=Z=]] => 42 ( ZzŹźŻżŽžƵƶȤȥɀʐʑᴢᵶᶎᶻᶼᶽᷦẐẑẒẓẔẕℤ℥ℨ⒵ⓏⓩⱫⱬⱿꝢꝣꟆZz ) [[=[=]] = [[=left-square-bracket=]] => 2 ( [[ ) [[=\=]] = [[=backslash=]] => 2 ( \\ ) [[=]=]] = [[=right-square-bracket=]] => 2 ( ]] ) [[=^=]] = [[=circumflex=]] => 3 ( ^ˆ^ ) [[=_=]] = [[=underscore=]] => 2 ( __ ) [[=`=]] = [[=grave-accent=]] => 4 ( `ˋ`` ) 
[[={=]] = [[=left-curly-bracket=]] => 2 ( {{ ) [[=|=]] = [[=vertical-line=]] => 2 ( || ) [[=}=]] = [[=right-curly-bracket=]] => 2 ( }} ) [[=~=]] = [[=tilde=]] => 2 ( ~~ ) [[==]] = [[=DEL=]] => 1 ( ) [[=Œ=]] => 2 ( Œœ ) [[=¢=]] => 3 ( ¢《¢ ) [[=£=]] => 3 ( £︽£ ) [[=¤=]] => 2 ( ¤》 ) [[=¥=]] => 3 ( ¥︾¥ ) [[=¦=]] => 2 ( ¦¦ ) [[=¬=]] => 2 ( ¬¬ ) [[=¯=]] => 2 ( ¯ ̄ ) [[=´=]] => 2 ( ´´ ) [[=·=]] => 2 ( ·· ) [[=¼=]] => 4 ( ¼୲൳꠰ ) [[=½=]] => 6 ( ½୳൴༪⳽꠱ ) [[=¾=]] => 4 ( ¾୴൵꠲ ) [[=Þ=]] => 6 ( ÞþꝤꝥꝦꝧ )
Some double-letter characters give some equivalences which allow you to get the right single char to use, instead of the two trivial letters:
[[=AE=]]  =  [[=Ae=]]  =  [[=ae=]]  =>  11  ( ÆæǢǣǼǽᴁᴂᴭᵆᷔ )
[[=CH=]]  =  [[=Ch=]]  =  [[=ch=]]  =>   0  ( ? )
[[=DZ=]]  =  [[=Dz=]]  =  [[=dz=]]  =>   6  ( DŽDždžDZDzdz )
[[=LJ=]]  =  [[=Lj=]]  =  [[=lj=]]  =>   3  ( LJLjlj )
[[=LL=]]  =  [[=Ll=]]  =  [[=ll=]]  =>   2  ( Ỻỻ )
[[=NJ=]]  =  [[=Nj=]]  =  [[=nj=]]  =>   3  ( NJNjnj )
[[=SS=]]  =  [[=Ss=]]  =  [[=ss=]]  =>   2  ( ßẞ )
You said in a previous post:
With Columns++, properties (like \p{digit} or \P{digit}), named classes (like [[:lower:]] or [[:^lower:]]) and escapes (like \u or \U) now ignore the Match case setting and the (?i) flag: they are always case sensitive
Thus:
- The regexes (?=[[:ascii:]])\p{punct} or (?=\p{punct})[[:ascii:]] always give 32 matches
- The regexes (?=[[:ascii:]])\u or (?=\u)[[:ascii:]] always give 26 matches
- The regexes (?=[[:ascii:]])\l or (?=\l)[[:ascii:]] always give 26 matches
- The regexes (?=[[:ascii:]])[\u\l] or (?=[\u\l])[[:ascii:]] always return 52 matches
Other examples:
- The regex [A-F[:lower:]] does give 2,289 matches, so 6 UPPER letters + 2,283 LOWER letters
- The regexes [[:upper:]]|[[:lower:]] and [[:upper:][:lower:]] act as case-insensitive regexes and return 4,169 matches (i.e. 1,886 UPPER letters + 2,283 LOWER letters)
So, everything works as expected so far, apart from the slight annoyance described at the beginning of my previous post!
Best Regards
guy038
-
-
@guy038 said in Columns++ version 1.3: All Unicode, all the time:
And, indeed, if I use the regex [\x{0300}-\x{036F}], against my Total_Chars.txt file, it correctly returns 112 occurrences and if I use the \p{Mn} regex, it correctly returns 2,059 occurrences, either.
However, then I test the regexes (?=[\x{0300}-\x{036F}])\p{M*} or (?=\p{M*})[\x{0300}-\x{036F}] or, more precisely, the regexes (?=[\x{0300}-\x{036F}])\p{Mn} or (?=\p{Mn})[\x{0300}-\x{036F}], it ONLY returns 111 occurrences and NOT 112 ! Did I make a mistake ?
This appears to be related to character U+0345. This character is a combining character, but it has a case folding (U+03B9) which is not a combining character.
I think at least some of your tests must have been without match case checked?
I do, however, find that with match case not checked, I see a count of 111 for [\x{0300}-\x{036F}] as well as for your other expressions. With match case checked, I see 112 for all of them.
In regular Notepad++ Find, I get 112 either way for [\x{0300}-\x{036F}]. So there is something I am doing differently that is affecting ranges. I don’t yet know what it is. I will look into it.
Thank you for the alert.
Edit to add:
I think what is happening is that when processing a range with match case unchecked (or (?i) in effect), the regex engine first does a case fold operation on both ends of the range, then does a case fold on each character to be matched to see if it falls in the range. All the characters from U+0300 to U+036F case fold to themselves except for U+0345, which case folds to U+03B9.
No doubt Notepad++ native Find behaves differently because Boost::regex does not implement full Unicode case folding without either including ICU or otherwise supplying customized character traits (as Columns++ does).
I agree that it is a somewhat bizarre behavior, but it is not clear what, if anything, I can do about it. Regex ranges with case insensitive matching, I think, are prone to unanticipated quirks. For example, in Notepad++ Find, [A-z] matches 58 characters when case sensitive and 52 characters when case insensitive. In Columns++ search, when case insensitive it matches 54 characters, because there are two non-ASCII characters, ſ, U+017F and K, U+212A, which case fold to ASCII s and k.
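The folding behaviour described above can be approximated in a few lines of Python. This is only a rough model of the explanation given in this post (fold both ends of the range, then fold each candidate character), not the actual Boost::regex or Columns++ code; it relies on `str.casefold()`, ignores characters whose folding is longer than one character, and the helper name `count_folded_range` is hypothetical.

```python
import sys

def fold(ch: str) -> str:
    """Unicode case folding of one character (may yield several characters)."""
    return ch.casefold()

def count_folded_range(lo: str, hi: str) -> int:
    """Count the code points whose case folding is a single character lying
    between the case-folded endpoints of the range lo..hi."""
    lo_f, hi_f = fold(lo), fold(hi)
    total = 0
    for cp in range(sys.maxunicode + 1):
        folded = fold(chr(cp))
        if len(folded) == 1 and lo_f <= folded <= hi_f:
            total += 1
    return total

# U+0345 is the only character of U+0300..U+036F that does not fold to itself.
print(hex(ord(fold("\u0345"))))

# The long s (U+017F) and the Kelvin sign (U+212A) fold to ASCII letters.
print(fold("\u017f"), fold("\u212a"))

print(count_folded_range("\u0300", "\u036f"))  # model of [\x{0300}-\x{036F}]
print(count_folded_range("A", "z"))            # model of [A-z]
```

Under this model, the two counts should come out as the 111 and 54 matches reported above for case-insensitive searches.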
-
Hi, @coises and all,
Yes, @coises, you were right about it. So, in short, against my Total_Chars.txt file:
- The regex \p{Mn} does return 2,059 occurrences, whether the Match case option is checked or not
- The regexes [\x{0300}-\x{036F}], (?=[\x{0300}-\x{036F}])\p{Mn} and (?=\p{Mn})[\x{0300}-\x{036F}] return 112 occurrences, when the Match case option is checked
- The regexes [\x{0300}-\x{036F}], (?=[\x{0300}-\x{036F}])\p{Mn} and (?=\p{Mn})[\x{0300}-\x{036F}] return 111 occurrences, when the Match case option is not checked
You said:
All the characters, in range [\x{0300}-\x{036F}], case fold to themselves, except for the single character U+0345 which case folds to U+03B9
This certainly explains why Columns++, which takes case folding into account, finds just 111 occurrences for this specific range [\x{0300}-\x{036F}], when the Match case option is not checked!
I would say that any range with explicitly defined characters (so, not one of the constructs that you force to be case sensitive):
- When the Match case option is checked:
  - Finds the exact number of Unicode chars between the two boundaries of that range. For example, the regex [A-z] returns 58 occurrences and is identical to the range [ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]\^_\x60abcdefghijklmnopqrstuvwxyz], with both N++ and Columns++
- When the Match case option is not checked:
  - With N++, finds ONLY the characters of that range which case fold to a character of this range. Thus, the regexes [A-z] and [ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz] return 52 occurrences ( 26 + 26 )
  - With Columns++, finds ALL the Unicode characters which case fold to a character of that range. Thus, the regex [A-z] returns 54 occurrences: 52 + 2 chars, whose case folding ( s and k ) belongs to the specific range [A-z]
And note that the regex [ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]\^_\x60abcdefghijklmnopqrstuvwxyzſK] and even [ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]\^_\x60abcdefghijklmnopqrstuvwxyz] return 60 occurrences ( 58 + 2 ), with Columns++, when the Match case option is not checked!
Best Regards,
guy038
-
-
Hello, @coises and All,
Now, here are the new tests regarding the Total_ANSI.txt file, described below:
•---------------•-----------------•------------•-----------------------------•-----------•-----------------•-----------•
|     Range     |   Description   |   Status   |  COUNT / MARK of ALL chars  |  # Chars  |  ANSI Encoding  |  # Bytes  |
•---------------•-----------------•------------•-----------------------------•-----------•-----------------•-----------•
|  0000 - 007F  |  PLANE 0 - BMP  |  Included  |         [\x00-\x7F]         |    128    |                 |    128    |
|               |                 |            |                             |           |     1 Byte      |           |
|  0080 - 00FF  |  PLANE 0 - BMP  |  Included  |         [\x80-\xFF]         |    128    |                 |    128    |
•---------------•-----------------•------------•-----------------------------•-----------•-----------------•-----------•
Against this file, the following general results are correct:
(?s).  =  \I  =  \p{Any}  =  [\x{0000}-\x{EFFFD}]                                               =>  256

[[:unicode:]]   =  \p{unicode}   ( Total chars with Unicode value OVER  \x{00FF} )              =>   27  |
                                                                                                         |  Total = 256
[^[:unicode:]]  =  \P{unicode}   ( Total chars with Unicode value UNDER \x{0100} )              =>  229  |

\p{Ascii}  =  \o                                                                                =>  128  |
                                                                                                         |  Total = 256
\P{Ascii}  =  \O                                                                                =>  128  |

\X                               ( Character with possible combining MARKS )                    =>  256  |
                                                                                                         |  Total = 256
(?!\X).                          ( A combining mark ALONE )                                     =>    0  |

\y  =  [[:defined:]]   =  \p{Assigned}                                                          =>  256  |
                                                                                                         |  Total = 256
\Y  =  [^[:defined:]]  =  \p{Not Assigned}                                                      =>    0  |

\i  =  [[:invalid:]]             ( NO byte in invalid UTF-8 sequence, as ANSI file )            =>    0  |
                                                                                                         |  Total = 256
\I  =  [^[:invalid:]]            ( All VALID bytes, as ANSI file )                              =>  256  |
However, note that, with the Columns++ regex engine:
[\x00-\xFF]            ( Total chars with Unicode value UNDER \x{0100} )   =>  229  =  [\x00-\x7F\x81\x8D\x8F\x90\x9D\xA0-\xFF]
[\x{0000}-\x{00FF}]    ( Total chars with Unicode value UNDER \x{0100} )   =>  229  =  [\x00-\x7F\x81\x8D\x8F\x90\x9D\xA0-\xFF]
(?-s).                                                                     =>  254  =  [^\x0A\x0D]
Whereas, with the N++ Boost regex engine:
[\x00-\xFF]                                                                =>  256
[\x{0000}-\x{00FF}]                                                        =>  INVALID regex syntax ( as ANSI file )
(?-s).                                                                     =>  253  =  [^\x0A\x0C\x0D]
I tried some expressions with look-aheads and look-behinds, containing overlapping zones!
For instance, against this text aaaabaaababbbaabbabb, pasted in a new ANSI tab, with a final line-break, all the regexes, below, give the correct number of matches:
ba*(?=a)    =>  4 matches
ba*(?!a)    =>  9 matches
ba*(?=b)    =>  8 matches
ba*(?!b)    =>  5 matches
(?<=a)ba*   =>  5 matches
(?<!b)ba*   =>  5 matches
(?<=b)ba*   =>  4 matches
(?<!a)ba*   =>  4 matches
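These look-around counts can be cross-checked with another regex engine. The snippet below is a small Python sketch using the standard `re` module; it assumes that the final line-break is a Windows CR LF and it counts non-overlapping matches, as a Count operation does. It says nothing about how Columns++ itself is implemented.

```python
import re

# The test text, followed by the final line-break (assumed here to be CR LF).
text = "aaaabaaababbbaabbabb\r\n"

patterns = [
    r"ba*(?=a)", r"ba*(?!a)", r"ba*(?=b)", r"ba*(?!b)",
    r"(?<=a)ba*", r"(?<!b)ba*", r"(?<=b)ba*", r"(?<!a)ba*",
]

# re.findall returns the non-overlapping matches, scanned from left to right.
counts = {pattern: len(re.findall(pattern, text)) for pattern in patterns}

for pattern, n in counts.items():
    print(f"{pattern:<12} => {n} matches")
```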
Here are the correct results, concerning all the Posix character classes, against the Total_ANSI.txt file:
[[:ascii:]]                                             an UNDER \x{0080} character            128  =  [\x{0000}-\x{007F}]  =  [\x{00}-\x{7F}]  =  [\x00-\x7F]
[[:unicode:]] = \p{unicode}                             an OVER \x{00FF} character              27  =  [\x{0100}-\x{EFFFD}]  =  [^\x{0000}-\x{00FF}]  =  [^\x{00}-\x{FF}]  =  [^\x00-\xFF]
[[:space:]] = \p{space} = [[:s:]] = \p{s} = \ps = \s    a WHITE-SPACE character                  7  =  [\t\n\x0B\f\r\x20\xA0]
[[:h:]] = \p{h} = \ph = \h                              an HORIZONTAL white space character      3  =  [\t\x20\xA0]
[[:blank:]] = \p{blank}                                 a BLANK character                        3  =  [\t\x20\xA0]
[[:v:]] = \p{v} = \pv = \v                              a VERTICAL white space character         4  =  [\n\x0B\f\r]
[[:cntrl:]] = \p{cntrl}                                 a CONTROL code character                38  =  [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]  =  [[.NUL.]-[.US.][.DEL.][.HOP.][.RI.][.SS3.][.DCS.][.OSC.]]
[[:upper:]] = \p{upper} = [[:u:]] = \p{u} = \pu = \u    an UPPER case letter                    60  =  [A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]
[[:lower:]] = \p{lower} = [[:l:]] = \p{l} = \pl = \l    a LOWER case letter                     63  =  [a-zƒšœžµßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]
[ªº] = [\xAA\xBA]                                       2 OTHER Letters                          2
ˆ = \x{02C6}                                            a MODIFIER letter                        1
[[:digit:]] = \p{digit} = [[:d:]] = \p{d} = \pd = \d    a DECIMAL number                        10  =  [0-9]
_ = \x5F                                                the LOW_LINE character                   1
                                                                                           -------
[[:word:]] = \p{word} = [[:w:]] = \p{w} = \pw = \w      a WORD character                       137  =  [0-9A-Z_a-zƒˆŠŒŽšœžŸªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]  =  [[:alnum:]]|\x5F  =  \p{alnum}|\x5F
[[:upper:]]|[[:lower:]] = [[:upper:][:lower:]] = \u|\l  Any LETTER, whatever its CASE          123
[[:alnum:]] = \p{alnum}                                 an ALPHANUMERIC character              136  =  [0-9A-Za-zƒˆŠŒŽšœžŸªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]  =  [[:upper:][:lower:][:digit:]\xAA\xBA\x{02C6}]
[[:alpha:]] = \p{alpha}                                 any LETTER character                   126  =  [[:upper:][:lower:]\xAA\xBA\x{02C6}]
[[:graph:]] = \p{graph}                                 any VISIBLE character                  215  =  [^\x00-\x1F\x20\x7F\x81\x8D\x8F\x90\x9D\xA0\xAD]
[[:print:]] = \p{print}                                 any PRINTABLE character                222  =  [[:graph:]]|\s
[[:punct:]] = \p{punct}                                 any PUNCTUATION or SYMBOL character     73  =  \p{Punctuation}|\p{Symbol}
                                                                                                    =  [\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E\x{20AC}\x{201A}\x{201E}\x{2026}\x{2020}\x{2021}\x{2030}\x{2039}\x{2018}\x{2019}\x{201C}\x{201D}\x{2022}\x{2013}\x{2014}\x{02DC}\x{2122}\x{203A}\xA1-\xA9\xAB\xAC\xAE-\xB1\xB4\xB6-\xB8\xBB\xBF\xD7\xF7]
                                                                                                    =  [^[:cntrl:]\w\x20\xA0\xAD\xB2\xB3\xB9\xBC\xBD\xBE]|\x5F
[[:xdigit:]]                                            an HEXADECIMAL character                22  =  [0-9A-Fa-f]  =  (?i)[0-9A-F]
Below, the correct results for all Unicode character classes, against the Total_ANSI.txt file ( since Columns++ v1.3, Unicode classes work in ANSI files, as well ):
\p{Any}                                    Any character                                        256  =  (?s).  =  \I  =  [\x{0000}-\x{EFFFD}]
\p{Ascii}                                  a character UNDER \x80                               128  =  [[:ascii:]]  =  \o
\p{Assigned}                               an ASSIGNED character                                256
\p{Cc}  =  \p{Control}                     a C0 or C1 CONTROL code character                     38  =  [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]
\p{Cf}  =  \p{Format}                      a FORMAT CONTROL character                             1  =  \xAD
\p{Cn}  =  \p{Not Assigned}                an UNASSIGNED or NON-CHARACTER character               0
\p{Co}  =  \p{Private Use}                 a PRIVATE-USE character                                0
\p{Cs}  =  \p{Surrogate}  (INVALID regex)  a SURROGATE character                                  0
                                                                                             ------
\p{C*}  =  \p{Other}                                                                             39  =  \p{Cc}|\p{Cf}|\p{Cn}|\p{Co}
\p{Lu}  =  \p{Uppercase Letter}            an UPPER case letter                                  60  =  \u  =  [[:upper:]]  =  \p{upper}
\p{Ll}  =  \p{Lowercase Letter}            a LOWER case letter                                   63  =  \l  =  [[:lower:]]  =  \p{lower}
\p{Lt}  =  \p{Titlecase}                   a DI-GRAPHIC letter                                    0
\p{Lm}  =  \p{Modifier Letter}             a MODIFIER letter                                      1  =  \x{02C6}
\p{Lo}  =  \p{Other Letter}                OTHER letter                                           2  =  [\xAA\xBA]
                                                                                            -------
\p{L*}  =  \p{Letter}                                                                           126  =  \p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}  =  [[:alpha:]]  =  \p{alpha}
\p{Mc}  =  \p{Spacing Combining Mark}      a SPACING COMBINING mark (POSITIVE advance width)      0
\p{Me}  =  \p{Enclosing Mark}              an ENCLOSING COMBINING mark                            0
\p{Mn}  =  \p{Non-Spacing Mark}            a NON-SPACING COMBINING mark (ZERO advance width)      0
                                                                                              -----
\p{M*}  =  \p{Mark}                                                                               0  =  \p{Mc}|\p{Me}|\p{Mn}
\p{Nd}  =  \p{Decimal Digit Number}        a DECIMAL number character                            10  =  \d  =  [[:digit:]]  =  \p{digit}
\p{Nl}  =  \p{Letter Number}               a LETTERLIKE numeric character                         0
\p{No}  =  \p{Other Number}                OTHER NUMERIC character                                6  =  [\xB2\xB3\xB9\xBC\xBD\xBE]
                                                                                             ------
\p{N*}  =  \p{Number}                                                                            16  =  \p{Nd}|\p{Nl}|\p{No}  =  [0-9\xB2\xB3\xB9\xBC\xBD\xBE]
\p{Pd}  =  \p{Dash Punctuation}            a DASH or HYPHEN punctuation mark                      3  =  [\x2D\x{2013}\x{2014}]
\p{Ps}  =  \p{Open Punctuation}            an OPENING PUNCTUATION mark, in a pair                 5  =  [\x28\x5B\x7B\x{201A}\x{201E}]
\p{Pc}  =  \p{Connector Punctuation}       a CONNECTING PUNCTUATION mark                          1  =  \x5F
\p{Pe}  =  \p{Close Punctuation}           a CLOSING PUNCTUATION mark, in a pair                  3  =  [\x29\x5D\x7D]
\p{Pi}  =  \p{Initial Punctuation}         an INITIAL QUOTATION mark                              4  =  [\x{2039}\x{2018}\x{201C}\xAB]
\p{Pf}  =  \p{Final Punctuation}           a FINAL QUOTATION mark                                 4  =  [\x{2019}\x{201D}\x{203A}\xBB]
\p{Po}  =  \p{Other Punctuation}           OTHER PUNCTUATION mark                                25  =  [\x21-\x23\x25-\x27\x2A\x2C\x2E\x2F\x3A\x3B\x3F\x40\x5C\x{2026}\x{2020}\x{2021}\x{2030}\x{2022}\xA1\xA7\xB6\xB7\xBF]
                                                                                             ------
\p{P*}  =  \p{Punctuation}                                                                       45  =  \p{Pd}|\p{Ps}|\p{Pc}|\p{Pe}|\p{Pi}|\p{Pf}|\p{Po}
\p{Sm}  =  \p{Math Symbol}                 a MATHEMATICAL symbol character                       10  =  [\x2B\x3C-\x3E\x7C\x7E\xAC\xB1\xD7\xF7]
\p{Sc}  =  \p{Currency Symbol}             a CURRENCY character                                   6  =  [\x24\x{20AC}\xA2-\xA5]
\p{Sk}  =  \p{Modifier Symbol}             a NON-LETTERLIKE MODIFIER character                    7  =  [\x5E\x60\x{02DC}\xA8\xAF\xB4\xB8]
\p{So}  =  \p{Other Symbol}                OTHER SYMBOL character                                 5  =  [\x{2122}\xA6\xA9\xAE\xB0]
                                                                                             ------
\p{S*}  =  \p{Symbol}                                                                            28  =  \p{Sm}|\p{Sc}|\p{Sk}|\p{So}
\p{Zs}  =  \p{Space Separator}             a NON-ZERO width SPACE character                       2  =  [\x20\xA0]  =  (?!\t)\h
\p{Zl}  =  \p{Line Separator}              the LINE SEPARATOR character                           0
\p{Zp}  =  \p{Paragraph Separator}         the PARAGRAPH SEPARATOR character                      0
                                                                                              -----
\p{Z*}  =  \p{Separator}                                                                          2  =  \p{Zs}|\p{Zl}|\p{Zp}
Remark:
- A negative POSIX character class can be expressed as [^[:........:]] or [[:^........:]]
- A negative UNICODE character class can be expressed as \P{..}, with an uppercase letter P
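The per-category counts for the 256 Windows-1252 characters can be cross-checked independently of any regex engine. The following Python sketch tallies the Unicode general category of each byte value once decoded; it is an illustration, not the method used by Columns++. One assumption is made loudly here: the five bytes that Python's `cp1252` codec leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are mapped to the C1 control of the same value, which agrees with the control characters listed in the table above. The helper name `cp1252_char` is mine.

```python
import unicodedata
from collections import Counter

def cp1252_char(byte: int) -> str:
    """Decode one Windows-1252 byte. The five bytes that Python's codec
    leaves undefined are mapped (assumption) to the C1 control character
    with the same value."""
    try:
        return bytes([byte]).decode("cp1252")
    except UnicodeDecodeError:
        return chr(byte)

chars = [cp1252_char(b) for b in range(256)]

# Tally the Unicode general category (Lu, Ll, Cc, ...) of each character.
tally = Counter(unicodedata.category(ch) for ch in chars)
for category in sorted(tally):
    print(category, tally[category])

# Characters whose code point is above U+00FF, i.e. the [[:unicode:]] ones.
beyond_latin1 = sum(1 for ch in chars if ord(ch) > 0xFF)
print("code points above U+00FF:", beyond_latin1)
```

The exact figures depend on the Unicode version bundled with the Python interpreter, but these characters have had stable categories for many versions.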
With this last release, @coises, results are totally coherent between ANSI and UTF-8 files!
Continuation on next post
-
-
Hello, @coises and All,
Continuation and end of my post
I also tested ALL the `equivalence class feature :
You can use ANY equivalent character to get the total number of matches of the equivalence class character. For example,
[[=ª=]]=[[=Å=]]=[[=ã=]]= … )Here is, below, the list of all the equivalences of any char of the
Windows-1252code-page, against theTotal_ANSI.txtfile. Note that I did not consider the equivalence classes which returns only one match ![[=1=]] = [[=one=]] => 2 [1¹] [[=2=]] = [[=two=]] => 2 [2²] [[=3=]] = [[=three=]] => 2 [3³] [[=A=]] => 15 [AaªÀÁÂÃÄÅàáâãäå] [[=B=]] => 2 [Bb] [[=C=]] => 4 [CcÇç] [[=D=]] => 4 [DdÐð] [[=E=]] => 10 [EeÈÉÊËèéêë] [[=F=]] => 3 [Ffƒ] [[=G=]] => 2 [Gg] [[=H=]] => 2 [Hh] [[=I=]] => 10 [IiÌÍÎÏìíîï] [[=J=]] => 2 [Jj] [[=K=]] => 2 [Kk] [[=L=]] => 2 [Ll] [[=M=]] => 2 [Mm] [[=N=]] => 4 [NnÑñ] [[=O=]] => 15 [OoºÒÓÔÕÖØòóôõöø] [[=P=]] => 2 [Pp] [[=Q=]] => 2 [Qq] [[=R=]] => 2 [Rr] [[=S=]] => 4 [SsŠš] [[=T=]] => 2 [Tt] [[=U=]] => 10 [UuÙÚÛÜùúûü] [[=V=]] => 2 [Vv] [[=W=]] => 2 [Ww] [[=X=]] => 2 [Xx] [[=Y=]] => 6 [YyÝýÿŸ] [[=Z=]] => 4 [ZzŽž] [[=^=]] = [[=circumflex=]] => 2 [^ˆ] = [\x5E\x{02C6}] [[=Œ=]] => 2 [Œœ] = [\x{0152}\x{0153}] [[==]] => 2 [[.NUL.][.SHY.]] = [\x00\xAD] [[=Þ=]] => 2 [Þþ] = [\xDE\xFE]
Some double-letter character equivalences :

[[=AE=]]  =  [[=Ae=]]  =  [[=ae=]]    =>   2   [Ææ]   =  [\xC6\xE6]
[[=SS=]]  =  [[=Ss=]]  =  [[=ss=]]    =>   1   [ß]    =  [\xDF]
An example : let’s suppose that we run the regex [A-F[:lower:]] against my Total_ANSI.txt file. It does give 69 matches, so 6 UPPER letters + 63 LOWER letters

The regexes [[:upper:]]|[[:lower:]] and [[:upper:][:lower:]] act as case-insensitive regexes and return 123 matches ( So 60 UPPER letters + 63 LOWER letters )

The regexes (?=\u)\l and (?=\l)\u do not find anything. This implies that the sets of UPPER and LOWER letters, in Total_ANSI.txt, are totally disjoint

Best Regards
guy038
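The letter counts quoted above can be cross-checked outside Notepad++ with a short Python sketch. It rests on two assumptions that are not confirmed in this thread: that Total_ANSI.txt contains each Windows-1252 character exactly once, and that [[:upper:]] and [[:lower:]] correspond to the Unicode general categories Lu and Ll. The helper name `cp1252_chars` is made up for this illustration.

```python
import unicodedata


def cp1252_chars():
    """Return every character that Python's cp1252 codec defines.

    The five bytes left undefined by the codec decode to an empty
    string with errors="ignore" and are skipped.
    """
    chars = []
    for byte in range(256):
        text = bytes([byte]).decode("cp1252", errors="ignore")
        if text:
            chars.append(text)
    return chars


chars = cp1252_chars()

# Assumption: [[:upper:]] ~ category Lu, [[:lower:]] ~ category Ll.
upper = {ch for ch in chars if unicodedata.category(ch) == "Lu"}
lower = {ch for ch in chars if unicodedata.category(ch) == "Ll"}

# Rough equivalent of the class [A-F[:lower:]].
a_to_f_or_lower = set("ABCDEF") | lower

print(len(upper), len(lower), len(upper & lower), len(a_to_f_or_lower))
```

With a recent Python this should report 60 upper-case letters, 63 lower-case letters, an empty intersection and 69 characters for the combined class, which agrees with the counts above. Note that ª and º fall in category Lo, so they are counted in neither set.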
P.S. :
BTW, I forgot to list the equivalence classes returning more than one match for the Control C0/C1 and Control Format characters, against the Total_Chars.txt file ! Here are the results, below :

[[=nul=]]       =>  3,240   [\x{0000}\x{00AD}....]        Cc
[[= =]]         =>      3   [\x{0020}\x{205F}\x{3000}]    Zs
[[=mmsp=]]      =>      3   [\x{0020}\x{205F}\x{3000}]    Zs
[[=idsp=]]      =>      3   [\x{0020}\x{205F}\x{3000}]    Zs
[[=shy=]]       =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=alm=]]       =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=sam=]]       =>      2   [\x{070F}\x{2E1A}]            Po
[[=nqsp=]]      =>      2   [\x{2000}\x{2002}]            Zs
[[=ensp=]]      =>      2   [\x{2000}\x{2002}]            Zs
[[=mqsp=]]      =>      2   [\x{2001}\x{2003}]            Zs
[[=emsp=]]      =>      2   [\x{2001}\x{2003}]            Zs
[[=zwnj=]]      =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=zwj=]]       =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=lrm=]]       =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=rlm=]]       =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=ls=]]        =>      2   [\x{2028}\x{FE47}]            Zl
[[=lre=]]       =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=rle=]]       =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=pdf=]]       =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=lro=]]       =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=rlo=]]       =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=wj=]]        =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=(fa)=]]      =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=(it)=]]      =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=(is)=]]      =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=(ip)=]]      =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=lri=]]       =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=rli=]]       =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=fsi=]]       =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=pdi=]]       =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=iss=]]       =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=ass=]]       =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=iafs=]]      =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=aafs=]]      =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=nads=]]      =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=nods=]]      =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=zwnbsp=]]    =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=iaa=]]       =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=ias=]]       =>  3,240   [\x{0000}\x{00AD}....]        Cf
[[=iat=]]       =>  3,240   [\x{0000}\x{00AD}....]        Cf

As you can see, a lot of Format characters return an erroneous result of 3,240 occurrences. But we’re not going to bother about these wrong equivalence classes, as long as the similar collating names, with the [[.XXX.]] syntax, are totally correct !

Luckily, all the other equivalence classes are also correct, except for [[=ls=]] which returns 2 matches \x{2028} and \x{FE47} ??
-
@guy038 said in Columns++ version 1.3: All Unicode, all the time:
As you can see, a lot of Format characters return an erroneous result of 3,240 occurrences. But we’re not going to bother about these wrong equivalence classes, as long as the similar collating names, with the [[.XXX.]] syntax, are totally correct !
Luckily, all the other equivalence classes are also correct, except for [[=ls=]] which returns 2 matches \x{2028} and \x{FE47} ??
Thank you for the observation. I will have to look into this more closely. I believe the Boost::regex engine uses the transform_primary member function of the character traits class to determine equivalence: if the sort key returned by that function for two characters is the same, then they are equivalent. I implemented transform_primary using LCMapStringEx, as that is normally how one does Unicode sorting. But how is sorting relevant to regular expressions?
It could be — despite the documented requirement for the function — that what is needed from transform_primary isn’t a sort key, but rather a case folding followed by a compatibility decomposition.
Again, thank you for all your testing, and for calling this to my attention.
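To make the idea of "case folding followed by a compatibility decomposition" concrete, here is a minimal Python sketch of such an equivalence key. It is not the Columns++ implementation and does not use LCMapStringEx or Boost::regex. The names `equivalence_key` and `equivalence_class` are hypothetical, and dropping combining marks after the decomposition is an extra assumption added here so that accented letters fall into the class of their base letter.

```python
import unicodedata


def equivalence_key(text):
    """Hypothetical equivalence key: case fold, then NFKD decomposition.

    Assumption for this sketch only: combining marks are dropped after
    the decomposition, so that 'Å' and 'a' share a key.
    """
    folded = text.casefold()
    decomposed = unicodedata.normalize("NFKD", folded)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def equivalence_class(ch, alphabet):
    """Return the members of alphabet that share ch's key."""
    key = equivalence_key(ch)
    return [c for c in alphabet if equivalence_key(c) == key]


# Every character defined by Python's cp1252 codec (undefined bytes skipped).
CP1252 = [
    text
    for text in (bytes([b]).decode("cp1252", errors="ignore") for b in range(256))
    if text
]

class_of_a = equivalence_class("A", CP1252)
print(len(class_of_a), "".join(class_of_a))

# Under this key, U+2028 and U+FE47 are no longer equivalent,
# and neither are NUL and the soft hyphen.
print(equivalence_key("\u2028") == equivalence_key("\ufe47"))
print(equivalence_key("\x00") == equivalence_key("\u00ad"))
```

Under these assumptions the class of A over Windows-1252 has the same 15 members as the [[=A=]] result reported earlier in the thread, while the two questionable groupings are kept apart. The sketch does not reproduce every reported class: for example, Ø and Ð have no decomposition, so they would not join the classes of O and D.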