Columns++ version 1.3: All Unicode, all the time
-
Hello, @coises and All,
When I first used the `v1.3` release of Columns++, I did not pay attention to the fact that, among the new features, there was this point :

Character information tables for regular expressions were updated to reflect Unicode 17.0.0 (released on 2025-09-09)

So, in this post, I re-tested all the regex features of the `v1.3` release, which you'll find below, and I'm pleased to tell you that ALL results are correct, EXCEPT for one thing :

Indeed, there's a bug, somehow, regarding the `Mark` characters :
- Open this file in your browser : https://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt

- Then, hit the `Ctrl + F` shortcut within your browser and search for the string `Nonsp`, within the `DerivedGeneralCategory.txt` file

- Under the first occurrence, you should see :

    # General_Category=Nonspacing_Mark
                       ¯¯¯¯¯
    0300..036F    ; Mn # [112] COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X
    0483..0487    ; Mn #   [5] COMBINING CYRILLIC TITLO..COMBINING CYRILLIC POKRYTIE
    0591..05BD    ; Mn #  [45] HEBREW ACCENT ETNAHTA..HEBREW POINT METEG

The first line clearly shows that the `112` characters of the COMBINING DIACRITICAL MARKS Unicode block ( refer to https://www.unicode.org/charts/PDF/U0300.pdf ) are considered, by the Unicode Consortium, as `Non Spacing Mark` characters !

And, indeed, if I use the regex `[\x{0300}-\x{036F}]`, against my `Total_Chars.txt` file, it correctly returns `112` occurrences and if I use the `\p{Mn}` regex, it correctly returns `2,059` occurrences, too.

However, when I test the regexes `(?=[\x{0300}-\x{036F}])\p{M*}` or `(?=\p{M*})[\x{0300}-\x{036F}]` or, more precisely, the regexes `(?=[\x{0300}-\x{036F}])\p{Mn}` or `(?=\p{Mn})[\x{0300}-\x{036F}]`, they ONLY return `111` occurrences and NOT `112` ! Did I make a mistake ?
Now, against the `Total_Chars.txt` file, all these general results, below, are correct :

    (?s). = \I = \p{Any} = [\x{0000}-\x{EFFFD}]         =>  Total = 325,590

    \p{Unicode} = [[:Unicode:]]                         =>  325,334  |
    \P{Unicode} = [[:^Unicode:]]                        =>      256  |  Total = 325,590

    \p{Ascii} = \o                                      =>      128  |
    \P{Ascii} = \O                                      =>  325,462  |  Total = 325,590

    \X                                                  =>  322,586  |
    (?!\X).                                             =>    3,004  |  Total = 325,590

    [\x{E000}-\x{F8FF}]|\y
      = [\x{E000}-\x{F8FF}]|[[:defined:]]
      = \p{Assigned}                                    =>  166,266  |
    (?![\x{E000}-\x{F8FF}])\Y
      = (?![\x{E000}-\x{F8FF}])[^[:defined:]]
      = \p{Not Assigned}                                =>  159,324  |  Total = 325,590

Note : if we add, to the number of characters of `Total_Chars.txt`, the contents of all the omitted planes ( Planes `4` to `13`, `15` and `16` ), less the TWO non-characters for each, plus the Surrogate characters and all the Unicode non-chars, we obtain :

`325,590` + `(65536 - 2) * 12` + `2,048` + `66` = `1,114,112`

which is, indeed, the total amount of Unicode chars, both assigned or not assigned !
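The sum in the note above can be re-checked with a few lines of Python. This is only an arithmetic sanity check; the variable names are mine, and the `325,590` figure is simply taken from the post, not recomputed here.

```python
# Arithmetic check of the code-point total quoted in the note above.
PLANE_SIZE = 0x10000                  # 65,536 code points per plane
chars_in_file = 325_590               # characters in Total_Chars.txt (figure from the post)
omitted_planes = 12                   # the twelve planes left out of the file
surrogates = 0xDFFF - 0xD800 + 1      # 2,048 surrogate code points
noncharacters = 32 + 2 * 17           # U+FDD0..U+FDEF, plus the last two code points of each plane

total = (chars_in_file
         + (PLANE_SIZE - 2) * omitted_planes
         + surrogates
         + noncharacters)

print(f"{total:,}")                   # same value as 17 * 65,536
```

The result equals `17 × 65,536`, i.e. the size of the whole Unicode code space from `U+0000` to `U+10FFFF`.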
Here are the correct results, concerning all the POSIX character classes, against the `Total_Chars.txt` file :

    [[:ascii:]]                                           an UNDER \x{0080} character                   128  = [\x{0000}-\x{007F}] = \p{ascii} = \o
    [[:unicode:]] = \p{unicode}                           an OVER \x{00FF} character                325,334  = [\x{0100}-\x{EFFFD}] ( restricted to 'Total_Chars.txt' )

    [[:space:]] = \p{space} = [[:s:]] = \p{s} = \ps = \s  a WHITE-SPACE character                        25  = [\t\n\x{000B}\f\r\x20\x{0085}\x{00A0}\x{1680}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}]
    [[:h:]] = \p{h} = \ph = \h                            an HORIZONTAL white space character            18  = [\t\x20\x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}] = \p{Zs}|\t
    [[:blank:]] = \p{blank}                               a BLANK character                              18  = [\t\x20\x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}] = \p{Zs}|\t
    [[:v:]] = \p{v} = \pv = \v                            a VERTICAL white space character                7  = [\n\x{000B}\f\r\x{0085}\x{2028}\x{2029}]

    [[:cntrl:]] = \p{cntrl}                               a CONTROL code character                       65  = [\x{0000}-\x{001F}\x{007F}\x{0080}-\x{009F}]

    [[:upper:]] = \p{upper} = [[:u:]] = \p{u} = \pu = \u  an UPPER case letter                        1,886  = \p{Lu}
    [[:lower:]] = \p{lower} = [[:l:]] = \p{l} = \pl = \l  a LOWER case letter                         2,283  = \p{Ll}
                                                          a DI-GRAPHIC letter                            31  = \p{Lt}
                                                          a MODIFIER letter                             410  = \p{Lm}
                                                          an OTHER letter + SYLLABLES / IDEOGRAPHS  141,062  = \p{Lo}
    [[:digit:]] = \p{digit} = [[:d:]] = \p{d} = \pd = \d  a DECIMAL number                              770  = \p{Nd}
    _ = \x{005F}                                          the LOW_LINE character                          1
                                                                                                -----------
    [[:word:]] = \p{word} = [[:w:]] = \p{w} = \pw = \w    a WORD character                          146,443  = [\p{L*}\p{Nd}_]

    [[:alnum:]] = \p{alnum}                               an ALPHANUMERIC character                 146,442  = [\p{L*}\p{Nd}]
    [[:alpha:]] = \p{alpha}                               any LETTER character                      145,672  = \p{L*}
    [[:graph:]] = \p{graph}                               any VISIBLE character                     159,612
    [[:print:]] = \p{print}                               any PRINTABLE character                   159,637  = [[:graph:]]|\s
    [[:punct:]] = \p{punct}                               any PUNCTUATION or SYMBOL character         9,473  = \p{P*}|\p{S*} = \p{Punctuation}|\p{Symbol} = 856 + 8,617
    [[:xdigit:]]                                          an HEXADECIMAL character                       22  = [0-9A-Fa-f]
And here are the correct results regarding the Unicode character classes, against the `Total_Chars.txt` file :

    \p{Any}                                  Any character                                        325,590  = (?s). = \I = [\x{0000}-\x{EFFFD}]
    \p{Ascii}                                a character UNDER \x80                                   128  = [[:ascii:]] = \o
    \p{Assigned}                             an ASSIGNED character                                166,266  ( of Total_Chars.txt, ONLY )

    \p{Cc} = \p{Control}                     a C0 or C1 CONTROL code character                         65
    \p{Cf} = \p{Format}                      a FORMAT CONTROL character                               170
    \p{Cn} = \p{Not Assigned}                an UNASSIGNED or NON-CHARACTER character             159,324  ( 'Total_Chars.txt' does NOT contain the 66 NON-CHARACTER chars )
    \p{Co} = \p{Private Use}                 a PRIVATE-USE character                                6,400
    \p{Cs} = \p{Surrogate} (INVALID regex)   a SURROGATE character                                [2,048]  ( 'Total_Chars.txt' does NOT contain the 2,048 SURROGATE chars )
                                                                                              -----------
    \p{C*} = \p{Other}                                                                            165,959  = \p{Cc}|\p{Cf}|\p{Cn}|\p{Co}

    \p{Lu} = \p{Uppercase Letter}            an UPPER case letter                                   1,886  = \u = [[:upper:]] = \p{upper}
    \p{Ll} = \p{Lowercase Letter}            a LOWER case letter                                    2,283  = \l = [[:lower:]] = \p{lower}
    \p{Lt} = \p{Titlecase}                   a DI-GRAPHIC letter                                       31
    \p{Lm} = \p{Modifier Letter}             a MODIFIER letter                                        410
    \p{Lo} = \p{Other Letter}                OTHER LETTER, including SYLLABLES and IDEOGRAPHS     141,062
                                                                                              -----------
    \p{L*} = \p{Letter}                                                                           145,672  = \p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo} = [[:alpha:]] = \p{alpha}

    \p{Mc} = \p{Spacing Combining Mark}      a SPACING COMBINING mark                                 471
    \p{Me} = \p{Enclosing Mark}              an ENCLOSING mark (POSITIVE advance width)                13
    \p{Mn} = \p{Non-Spacing Mark}            a NON-SPACING COMBINING mark (ZERO advance width)      2,059
                                                                                                ---------
    \p{M*} = \p{Mark}                                                                               2,543  = \p{Mc}|\p{Me}|\p{Mn}

    \p{Nd} = \p{Decimal Digit Number}        a DECIMAL number character                               770
    \p{Nl} = \p{Letter Number}               a LETTERLIKE numeric character                           239
    \p{No} = \p{Other Number}                OTHER NUMERIC character                                  915
                                                                                                ---------
    \p{N*} = \p{Number}                                                                             1,924  = \p{Nd}|\p{Nl}|\p{No}

    \p{Pd} = \p{Dash Punctuation}            a DASH or HYPHEN punctuation mark                         27
    \p{Ps} = \p{Open Punctuation}            an OPENING PUNCTUATION mark, in a pair                    79
    \p{Pc} = \p{Connector Punctuation}       a CONNECTING PUNCTUATION mark                             10
    \p{Pe} = \p{Close Punctuation}           a CLOSING PUNCTUATION mark, in a pair                     77
    \p{Pi} = \p{Initial Punctuation}         an INITIAL QUOTATION mark                                 12
    \p{Pf} = \p{Final Punctuation}           a FINAL QUOTATION mark                                    10
    \p{Po} = \p{Other Punctuation}           OTHER PUNCTUATION mark                                   641
                                                                                                  -------
    \p{P*} = \p{Punctuation}                                                                          856  = \p{Pd}|\p{Ps}|\p{Pc}|\p{Pe}|\p{Pi}|\p{Pf}|\p{Po}

    \p{Sm} = \p{Math Symbol}                 a MATHEMATICAL symbol character                          960
    \p{Sc} = \p{Currency Symbol}             a CURRENCY character                                      64
    \p{Sk} = \p{Modifier Symbol}             a NON-LETTERLIKE MODIFIER character                      125
    \p{So} = \p{Other Symbol}                OTHER SYMBOL character                                 7,468
                                                                                                ---------
    \p{S*} = \p{Symbol}                                                                             8,617  = \p{Sm}|\p{Sc}|\p{Sk}|\p{So}

    \p{Zs} = \p{Space Separator}             a NON-ZERO width SPACE character                          17  = [\x{0020}\x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}] = (?!\t)\h
    \p{Zl} = \p{Line Separator}              the LINE SEPARATOR character                               1  = \x{2028}
    \p{Zp} = \p{Paragraph Separator}         the PARAGRAPH SEPARATOR character                          1  = \x{2029}
                                                                                                   ------
    \p{Z*} = \p{Separator}                                                                             19  = \p{Zs}|\p{Zl}|\p{Zp}

Remark :

- A negative POSIX character class can be expressed as `[^[:........:]]` or `[[:^........:]]`

- A negative UNICODE character class can be expressed as `\P{..}`, with an uppercase letter `P`
Now, if you follow the procedure explained in the last part of this post :
https://community.notepad-plus-plus.org/post/99844
The regexes `[\x{DC80}-\x{DCFF}]` or `\i` or `[[:invalid:]]` do give `134` occurrences, which is the exact number of invalid `UTF-8` characters of that example !

Continuation on next post
-
-
Hi @Coises and All,
Continuation and end of my reply :
I also tested ALL the `Equivalence` classes feature, against the `Total_Chars.txt` file.

With Columns++, we can use ANY equivalent character to get the total number of matches of the equivalence class

For instance, `[[=Ⱥ=]]` = `[[=ⱥ=]]` = `[[=Ɐ=]]` always gives `86` matches, whereas the native N++ Boost engine is less coherent and sometimes displays a wrong number of occurrences :-((

Here is, below, the list of all equivalences of any char of the `Windows-1252` code-page, from `\x{0020}` till `\x{00DE}`. Note that, except for the DEL character, kept as an example, I did not consider the equivalence classes which only return `1` match !

I also confirm, that I did not find any character over
\x{FFFF}which would be part of a regex equivalence class, either with our Boost engine or with theColumns++search engine ![[= =]] = [[=space=]] => 3 ( ) [[=!=]] = [[=exclamation-mark=]] => 2 ( !! ) [[="=]] = [[=quotation-mark=]] => 3 ( "⁍" ) [[=#=]] = [[=number-sign=]] => 4 ( #؞⁗# ) [[=$=]] = [[=dollar-sign=]] => 3 ( $⁒$ ) [[=%=]] = [[=percent-sign=]] => 3 ( %⁏% ) [[=&=]] = [[=ampersand=]] => 3 ( &⁋& ) [[='=]] = [[=apostrophe=]] => 2 ( '' ) [[=(=]] = [[=left-parenthesis=]] => 4 ( (⁽₍( ) [[=)=]] = [[=right-parenthesis=]] => 4 ( )⁾₎) ) [[=*=]] = [[=asterisk=]] => 2 ( ** ) [[=+=]] = [[=plus-sign=]] => 6 ( +⁺₊﬩﹢+ ) [[=,=]] = [[=comma=]] => 2 ( ,, ) [[=-=]] = [[=hyphen=]] => 3 ( -﹣- ) [[=.=]] = [[=period=]] => 3 ( .․. ) [[=/=]] = [[=slash=]] => 2 ( // ) [[=0=]] = [[=zero=]] => 48 ( 0٠۟۠۰߀०০੦૦୦୵௦౦౸೦൦๐໐༠၀႐០᠐᥆᧐᪀᪐᭐᮰᱀᱐⁰₀↉⓪⓿〇㍘꘠ꛯ꠳꣐꤀꧐꩐꯰0 ) [[=1=]] = [[=one=]] => 54 ( 1¹١۱߁१১੧૧୧௧౧౹౼೧൧๑໑༡၁႑፩១᠑᥇᧑᧚᪁᪑᭑᮱᱁᱑₁①⑴⒈⓵⚀❶➀➊〡㋀㍙㏠꘡ꛦ꣑꤁꧑꩑꯱1 ) [[=2=]] = [[=two=]] => 54 ( 2²ƻ٢۲߂२২੨૨୨௨౨౺౽೨൨๒໒༢၂႒፪២᠒᥈᧒᪂᪒᭒᮲᱂᱒₂②⑵⒉⓶⚁❷➁➋〢㋁㍚㏡꘢ꛧ꣒꤂꧒꩒꯲2 ) [[=3=]] = [[=three=]] => 53 ( 3³٣۳߃३৩੩૩୩௩౩౻౾೩൩๓໓༣၃႓፫៣᠓᥉᧓᪃᪓᭓᮳᱃᱓₃③⑶⒊⓷⚂❸➂➌〣㋂㍛㏢꘣ꛨ꣓꤃꧓꩓꯳3 ) [[=4=]] = [[=four=]] => 51 ( 4٤۴߄४৪੪૪୪௪౪೪൪๔໔༤၄႔፬៤᠔᥊᧔᪄᪔᭔᮴᱄᱔⁴₄④⑷⒋⓸⚃❹➃➍〤㋃㍜㏣꘤ꛩ꣔꤄꧔꩔꯴4 ) [[=5=]] = [[=five=]] => 53 ( 5Ƽƽ٥۵߅५৫੫૫୫௫౫೫൫๕໕༥၅႕፭៥᠕᥋᧕᪅᪕᭕᮵᱅᱕⁵₅⑤⑸⒌⓹⚄❺➄➎〥㋄㍝㏤꘥ꛪ꣕꤅꧕꩕꯵5 ) [[=6=]] = [[=six=]] => 52 ( 6٦۶߆६৬੬૬୬௬౬೬൬๖໖༦၆႖፮៦᠖᥌᧖᪆᪖᭖᮶᱆᱖⁶₆ↅ⑥⑹⒍⓺⚅❻➅➏〦㋅㍞㏥꘦ꛫ꣖꤆꧖꩖꯶6 ) [[=7=]] = [[=seven=]] => 50 ( 7٧۷߇७৭੭૭୭௭౭೭൭๗໗༧၇႗፯៧᠗᥍᧗᪇᪗᭗᮷᱇᱗⁷₇⑦⑺⒎⓻❼➆➐〧㋆㍟㏦꘧ꛬ꣗꤇꧗꩗꯷7 ) [[=8=]] = [[=eight=]] => 50 ( 8٨۸߈८৮੮૮୮௮౮೮൮๘໘༨၈႘፰៨᠘᥎᧘᪈᪘᭘᮸᱈᱘⁸₈⑧⑻⒏⓼❽➇➑〨㋇㍠㏧꘨ꛭ꣘꤈꧘꩘꯸8 ) [[=9=]] = [[=nine=]] => 50 ( 9٩۹߉९৯੯૯୯௯౯೯൯๙໙༩၉႙፱៩᠙᥏᧙᪉᪙᭙᮹᱉᱙⁹₉⑨⑼⒐⓽❾➈➒〩㋈㍡㏨꘩ꛮ꣙꤉꧙꩙꯹9 ) [[=:=]] = [[=colon=]] => 2 ( :: ) [[=;=]] = [[=semicolon=]] => 3 ( ;;; ) [[=<=]] = [[=less-than-sign=]] => 3 ( <﹤< ) [[===]] = [[=equals-sign=]] => 5 ( =⁼₌﹦= ) [[=>=]] = [[=greater-than-sign=]] => 3 ( >﹥> ) [[=?=]] = [[=question-mark=]] => 2 ( ?? 
) [[=@=]] = [[=commercial-at=]] => 2 ( @@ ) [[=A=]] => 86 ( AaªÀÁÂÃÄÅàáâãäåĀāĂ㥹ǍǎǞǟǠǡǺǻȀȁȂȃȦȧȺɐɑɒͣᴀᴬᵃᵄᶏᶐᶛᷓḀḁẚẠạẢảẤấẦầẨẩẪẫẬậẮắẰằẲẳẴẵẶặₐÅ⒜ⒶⓐⱥⱭⱯⱰAa ) [[=B=]] => 29 ( BbƀƁƂƃƄƅɃɓʙᴃᴮᴯᵇᵬᶀḂḃḄḅḆḇℬ⒝ⒷⓑBb ) [[=C=]] => 40 ( CcÇçĆćĈĉĊċČčƆƇƈȻȼɔɕʗͨᴄᴐᵓᶗᶜᶝᷗḈḉℂ℃ℭ⒞ⒸⓒꜾꜿCc ) [[=D=]] => 44 ( DdÐðĎďĐđƊƋƌƍɗʤͩᴅᴆᴰᵈᵭᶁᶑᶞᷘᷙḊḋḌḍḎḏḐḑḒḓⅅⅆ⒟ⒹⓓꝹꝺDd ) [[=E=]] => 82 ( EeÈÉÊËèéêëĒēĔĕĖėĘęĚěƎƏǝȄȅȆȇȨȩɆɇɘəɚͤᴇᴱᴲᵉᵊᶒᶕḔḕḖḗḘḙḚḛḜḝẸẹẺẻẼẽẾếỀềỂểỄễỆệₑₔ℮ℯℰ⅀ⅇ⒠ⒺⓔⱸⱻEe ) [[=F=]] => 22 ( FfƑƒᵮᶂᶠḞḟ℉ℱℲⅎ⒡ⒻⓕꜰꝻꝼꟻFf ) [[=G=]] => 47 ( GgĜĝĞğĠġĢģƓƔǤǥǦǧǴǵɠɡɢɣɤʛˠᴳᵍᵷᶃᶢᷚᷛḠḡℊ⅁⒢ⒼⓖꝾꝿꞠꞡꞬGg ) [[=H=]] => 42 ( HhĤĥĦħȞȟɥɦʜʰʱͪᴴᶣḢḣḤḥḦḧḨḩḪḫẖₕℋℌℍℎℏ⒣ⒽⓗⱧⱨꞍꞪHh ) [[=I=]] => 62 ( IiÌÍÎÏìíîïĨĩĪīĬĭĮįİıƖƗǏǐȈȉȊȋɨɩɪͥᴉᴵᵎᵢᵻᵼᶖᶤᶥᶦᶧḬḭḮḯỈỉỊịⁱℐℑⅈ⒤ⒾⓘꞮꟾIi ) [[=J=]] => 24 ( JjĴĵǰȷɈɉɟʄʝʲᴊᴶᶡᶨⅉ⒥ⒿⓙⱼꞲJj ) [[=K=]] => 39 ( KkĶķĸƘƙǨǩʞᴋᴷᵏᶄᷜḰḱḲḳḴḵₖK⒦ⓀⓚⱩⱪꝀꝁꝂꝃꝄꝅꞢꞣꞰKk ) [[=L=]] => 58 ( LlĹĺĻļĽľĿŀŁłƚƛȽɫɬɭɮʟˡᴌᴸᶅᶩᶪᶫᷝᷞḶḷḸḹḺḻḼḽₗℒℓ⅂⅃⒧ⓁⓛⱠⱡⱢꝆꝇꝈꝉꞀꞁꞭLl ) [[=M=]] => 33 ( MmƜɯɰɱͫᴍᴟᴹᵐᵚᵯᶆᶬᶭᷟḾḿṀṁṂṃₘℳ⒨ⓂⓜⱮꝳꟽMm ) [[=N=]] => 47 ( NnÑñŃńŅņŇňʼnƝƞǸǹȠɲɳɴᴎᴺᴻᵰᶇᶮᶯᶰᷠᷡṄṅṆṇṈṉṊṋⁿₙℕ⒩ⓃⓝꞤꞥNn ) [[=O=]] => 106 ( OoºÒÓÔÕÖØòóôõöøŌōŎŏŐőƟƠơƢƣǑǒǪǫǬǭǾǿȌȍȎȏȪȫȬȭȮȯȰȱɵɶɷͦᴏᴑᴒᴓᴕᴖᴗᴼᵒᵔᵕᶱṌṍṎṏṐṑṒṓỌọỎỏỐốỒồỔổỖỗỘộỚớỜờỞởỠỡỢợₒℴ⒪ⓄⓞⱺꝊꝋꝌꝍOo ) [[=P=]] => 33 ( PpƤƥɸᴘᴾᵖᵱᵽᶈᶲṔṕṖṗₚ℘ℙ⒫ⓅⓟⱣⱷꝐꝑꝒꝓꝔꝕꟼPp ) [[=Q=]] => 16 ( QqɊɋʠℚ℺⒬ⓆⓠꝖꝗꝘꝙQq ) [[=R=]] => 64 ( RrŔŕŖŗŘřƦȐȑȒȓɌɍɹɺɻɼɽɾɿʀʁʳʴʵʶͬᴙᴚᴿᵣᵲᵳᶉᷢᷣṘṙṚṛṜṝṞṟℛℜℝ⒭ⓇⓡⱤⱹꝚꝛꝜꝝꝵꝶꞂꞃRr ) [[=S=]] => 50 ( SsŚśŜŝŞşŠšſƧƨƩƪȘșȿʂʃʅʆˢᵴᶊᶋᶘᶳᶴᷤṠṡṢṣṤṥṦṧṨṩẛₛ⒮ⓈⓢⱾꜱꟅSs ) [[=T=]] => 47 ( TtŢţŤťƫƬƭƮȚțȶȾʇʈʧʨͭᴛᵀᵗᵵᶵᶿṪṫṬṭṮṯṰṱẗₜ⒯ⓉⓣⱦꜨꜩꝷꞆꞇꞱTt ) [[=U=]] => 82 ( UuÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųƯưƱǓǔǕǖǗǘǙǚǛǜȔȕȖȗɄʉʊͧᴜᵁᵘᵤᵾᵿᶙᶶᶷᶸṲṳṴṵṶṷṸṹṺṻỤụỦủỨứỪừỬửỮữỰự⒰ⓊⓤUu ) [[=V=]] => 29 ( VvƲɅʋʌͮᴠᵛᵥᶌᶹᶺṼṽṾṿỼỽ⒱ⓋⓥⱱⱴⱽꝞꝟVv ) [[=W=]] => 28 ( WwŴŵƿǷʍʷᴡᵂẀẁẂẃẄẅẆẇẈẉẘ⒲ⓌⓦⱲⱳWw ) [[=X=]] => 15 ( XxˣͯᶍẊẋẌẍₓ⒳ⓍⓧXx ) [[=Y=]] => 36 ( YyÝýÿŶŷŸƳƴȲȳɎɏʎʏʸẎẏẙỲỳỴỵỶỷỸỹỾỿ⅄⒴ⓎⓨYy ) [[=Z=]] => 42 ( ZzŹźŻżŽžƵƶȤȥɀʐʑᴢᵶᶎᶻᶼᶽᷦẐẑẒẓẔẕℤ℥ℨ⒵ⓏⓩⱫⱬⱿꝢꝣꟆZz ) [[=[=]] = [[=left-square-bracket=]] => 2 ( [[ ) [[=\=]] = [[=backslash=]] => 2 ( \\ ) [[=]=]] = [[=right-square-bracket=]] => 2 ( ]] ) [[=^=]] = [[=circumflex=]] => 3 ( ^ˆ^ ) [[=_=]] = [[=underscore=]] => 2 ( __ ) [[=`=]] = [[=grave-accent=]] => 4 ( `ˋ`` ) 
[[={=]] = [[=left-curly-bracket=]] => 2 ( {{ ) [[=|=]] = [[=vertical-line=]] => 2 ( || ) [[=}=]] = [[=right-curly-bracket=]] => 2 ( }} ) [[=~=]] = [[=tilde=]] => 2 ( ~~ ) [[==]] = [[=DEL=]] => 1 ( ) [[=Œ=]] => 2 ( Œœ ) [[=¢=]] => 3 ( ¢《¢ ) [[=£=]] => 3 ( £︽£ ) [[=¤=]] => 2 ( ¤》 ) [[=¥=]] => 3 ( ¥︾¥ ) [[=¦=]] => 2 ( ¦¦ ) [[=¬=]] => 2 ( ¬¬ ) [[=¯=]] => 2 ( ¯ ̄ ) [[=´=]] => 2 ( ´´ ) [[=·=]] => 2 ( ·· ) [[=¼=]] => 4 ( ¼୲൳꠰ ) [[=½=]] => 6 ( ½୳൴༪⳽꠱ ) [[=¾=]] => 4 ( ¾୴൵꠲ ) [[=Þ=]] => 6 ( ÞþꝤꝥꝦꝧ )
Some double-letter characters give some equivalences which allow you to get the right single char to use, instead of the two trivial letters :
[[=AE=]] = [[=Ae=]] = [[=ae=]] => 11 ( ÆæǢǣǼǽᴁᴂᴭᵆᷔ ) [[=CH=]] = [[=Ch=]] = [[=ch=]] => 0 ( ? ) [[=DZ=]] = [[=Dz=]] = [[=dz=]] => 6 ( DŽDždžDZDzdz ) [[=LJ=]] = [[=Lj=]] = [[=lj=]] => 3 ( LJLjlj ) [[=LL=]] = [[=Ll=]] = [[=ll=]] => 2 ( Ỻỻ ) [[=NJ=]] = [[=Nj=]] = [[=nj=]] => 3 ( NJNjnj ) [[=SS=]] = [[=Ss=]] = [[=ss=]] => 2 ( ßẞ )
You said in a previous post :

With Columns++, properties (like `\p{digit}` or `\P{digit}`), named classes (like `[[:lower:]]` or `[[:^lower:]]`) and escapes ( like `\u` or `\U`) now ignore the Match case setting and the `(?i)` flag: they are always case sensitive

Thus :

- The regexes `(?=[[:ascii:]])\p{punct}` or `(?=\p{punct})[[:ascii:]]` always give `32` matches

- The regexes `(?=[[:ascii:]])\u` or `(?=\u)[[:ascii:]]` always give `26` matches

- The regexes `(?=[[:ascii:]])\l` or `(?=\l)[[:ascii:]]` always give `26` matches

- The regexes `(?=[[:ascii:]])[\u\l]` or `(?=[\u\l])[[:ascii:]]` always return `52` matches
Other examples :
- The regex `[A-F[:lower:]]` does give `2,289` matches, so `6` UPPER letters + `2,283` LOWER letters

- The regexes `[[:upper:]]|[[:lower:]]` and `[[:upper:][:lower:]]` act as insensitive regexes and return `4,169` matches ( i.e. `1,886` UPPER letters + `2,283` LOWER letters )

So, everything works as expected, so far, except for the slight annoyance described at the beginning of my previous post !
Best Regards
guy038
-
-
@guy038 said in Columns++ version 1.3: All Unicode, all the time:
And, indeed, if I use the regex `[\x{0300}-\x{036F}]`, against my `Total_Chars.txt` file, it correctly returns 112 occurrences and if I use the `\p{Mn}` regex, it correctly returns 2,059 occurrences, too.

However, when I test the regexes `(?=[\x{0300}-\x{036F}])\p{M*}` or `(?=\p{M*})[\x{0300}-\x{036F}]` or, more precisely, the regexes `(?=[\x{0300}-\x{036F}])\p{Mn}` or `(?=\p{Mn})[\x{0300}-\x{036F}]`, they ONLY return 111 occurrences and NOT 112 ! Did I make a mistake ?
This appears to be related to character U+0345. This character is a combining character, but it has ~~an uppercase equivalent (U+0399)~~ a case folding (U+03B9) which is not a combining character.

I think at least some of your tests must have been without match case checked?

I do, however, find that with match case not checked, I see a count of 111 for `[\x{0300}-\x{036F}]` as well as for your other expressions. With match case checked, I see 112 for all of them.

In regular Notepad++ Find, I get 112 either way for `[\x{0300}-\x{036F}]`. So there is something I am doing differently that is affecting ranges. I don’t yet know what it is. I will look into it.

Thank you for the alert.
Edit to add:
I think what is happening is that when processing a range with match case unchecked (or `(?i)` in effect), the regex engine first does a case fold operation on both ends of the range, then does a case fold on each character to be matched to see if it falls in the range. All the characters from U+0300 to U+036F case fold to themselves except for U+0345, which case folds to U+03B9.

No doubt Notepad++ native Find behaves differently because Boost::regex does not implement full Unicode case folding without either including ICU or otherwise supplying customized character traits (as Columns++ does).

I agree that it is a somewhat bizarre behavior, but it is not clear what, if anything, I can do about it. Regex ranges with case insensitive matching, I think, are prone to unanticipated quirks. For example, in Notepad++ Find, `[A-z]` matches 58 characters when case sensitive and 52 characters when case insensitive. In Columns++ search, when case insensitive it matches 54 characters, because there are two non-ASCII characters, `ſ` (U+017F) and `K` (U+212A), which case fold to ASCII `s` and `k`.
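The folding behaviour hypothesised above can be imitated in a few lines of Python, which may help to see where the counts `111` and `54` could come from. This is only a sketch of the idea "fold both ends of the range, then fold each candidate character": `folded_range_count` is a name made up for this illustration, it relies on Python's own `str.casefold()`, and it is not the Columns++ or Boost::regex implementation.

```python
import sys


def folded_range_count(lo: str, hi: str) -> int:
    """Count code points whose one-character case folding falls inside
    the range [casefold(lo), casefold(hi)].

    Hypothetical model of a case-insensitive regex range; characters whose
    full case folding expands to several characters are ignored.
    """
    folded_lo, folded_hi = lo.casefold(), hi.casefold()
    count = 0
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:          # surrogates are not characters
            continue
        folded = chr(cp).casefold()
        if len(folded) == 1 and folded_lo <= folded <= folded_hi:
            count += 1
    return count


# U+0345 COMBINING GREEK YPOGEGRAMMENI folds outside the combining block
print(f"U+0345 folds to U+{ord(chr(0x0345).casefold()):04X}")

# Range U+0300..U+036F under this model
print(folded_range_count("\u0300", "\u036F"))

# Range [A-z] under this model: the ends fold to 'a' and 'z'
print(folded_range_count("A", "z"))
```

Under this model the combining block loses exactly one member (U+0345), and `[A-z]` gains `ſ` and the Kelvin sign while losing the six punctuation characters between `Z` and `a`, which agrees with the figures reported in this thread.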
Hi, @coises and all,
Yes, @coises, you were right about it. So, in short, against my `Total_Chars.txt` file :

- The regex `\p{Mn}` does return `2,059` occurrences, whether the `Match case` option is checked or not

- The regexes `[\x{0300}-\x{036F}]`, `(?=[\x{0300}-\x{036F}])\p{Mn}` and `(?=\p{Mn})[\x{0300}-\x{036F}]` return `112` occurrences, when the `Match case` option is checked

- The regexes `[\x{0300}-\x{036F}]`, `(?=[\x{0300}-\x{036F}])\p{Mn}` and `(?=\p{Mn})[\x{0300}-\x{036F}]` return `111` occurrences, when the `Match case` option is not checked

You said :

All the characters, in range `[\x{0300}-\x{036F}]`, case fold to themselves, except for the single character `U+0345` which case folds to `U+03B9`

This certainly explains why `Columns++`, taking account of case folding, in this specific range `[\x{0300}-\x{036F}]` ONLY, just finds `111` occurrences, when the `Match case` option is not checked !
I would say that any range, with defined characters ( so, not using your restriction to be automatically sensitive ) :

- When the `Match case` option is checked :

    - Finds the exact number of Unicode chars between the two boundaries of that range. For example, the regex `[A-z]` returns `58` occurrences and is identical to the range `[ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]\^_\x60abcdefghijklmnopqrstuvwxyz]` with both N++ and Columns++

- When the `Match case` option is not checked :

    - Finds ONLY the characters of that range which case fold to a character of this range. Thus, the regexes `[A-z]` and `[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]` return `52` occurrences with `N++` ( 26 + 26 )

    - Finds ALL the Unicode characters which case fold to a character of that range. Thus, the regex `[A-z]` returns `54` occurrences with `Columns++` : 52 + 2 chars, whose case folding ( `s` and `k` ) belongs to the specific range `[A-z]`

And note that the regex `[ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]\^_\x60abcdefghijklmnopqrstuvwxyzſK]` and even `[ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]\^_\x60abcdefghijklmnopqrstuvwxyz]` return `60` occurrences ( 58 + 2 ), with Columns++, when the `Match case` option is not checked !

Best Regards,
guy038
-
-
Hello, @coises and All,
Now, here are the new tests regarding the `Total_ANSI.txt` file, described below :

    •---------------•-----------------•------------•-----------------------------•-----------•-----------------•-----------•
    |     Range     |   Description   |   Status   |  COUNT / MARK of ALL chars  |  # Chars  |  ANSI Encoding  |  # Bytes  |
    •---------------•-----------------•------------•-----------------------------•-----------•-----------------•-----------•
    |  0000 - 007F  |  PLANE 0 - BMP  |  Included  |  [\x00-\x7F]                |      128  |                 |      128  |
    |               |                 |            |                             |           |     1 Byte      |           |
    |  0080 - 00FF  |  PLANE 0 - BMP  |  Included  |  [\x80-\xFF]                |      128  |                 |      128  |
    •---------------•-----------------•------------•-----------------------------•-----------•-----------------•-----------•
Against this file, the following general results are correct :

    (?s). = \I = \p{Any} = [\x{0000}-\x{EFFFD}]                                       =>  256

    [[:unicode:]] = \p{unicode}   ( Total chars with Unicode value OVER \x{00FF} )    =>   27  |
    [^[:unicode:]] = \P{unicode}  ( Total chars with Unicode value UNDER \x{0100} )   =>  229  |  Total = 256

    \p{Ascii} = \o                                                                    =>  128  |
    \P{Ascii} = \O                                                                    =>  128  |  Total = 256

    \X       ( Character with possible combining MARKS )                              =>  256  |
    (?!\X).  ( A combining mark ALONE )                                               =>    0  |  Total = 256

    \y = [[:defined:]] = \p{Assigned}                                                 =>  256  |
    \Y = [^[:defined:]] = \p{Not Assigned}                                            =>    0  |  Total = 256

    \i = [[:invalid:]]   ( NO byte in invalid UTF-8 sequence, as ANSI file )          =>    0  |
    \I = [^[:invalid:]]  ( All VALID bytes, as ANSI file )                            =>  256  |  Total = 256

However, note that, with the `Columns++` regex engine :

    [\x00-\xFF]          ( Total chars with Unicode value UNDER \x{0100} )  =>  229  = [\x00-\x7F\x81\x8D\x8F\x90\x9D\xA0-\xFF]
    [\x{0000}-\x{00FF}]  ( Total chars with Unicode value UNDER \x{0100} )  =>  229  = [\x00-\x7F\x81\x8D\x8F\x90\x9D\xA0-\xFF]
    (?-s).                                                                  =>  254  = [^\x0A\x0D]

Whereas, with the N++ `Boost` regex engine :

    [\x00-\xFF]          =>  256
    [\x{0000}-\x{00FF}]  =>  INVALID regex syntax ( as ANSI file )
    (?-s).               =>  253  = [^\x0A\x0C\x0D]
I tried some expressions with look-aheads and look-behinds, containing overlapping zones !
For instance, against this text `aaaabaaababbbaabbabb`, pasted in a new `ANSI` tab, with a final line-break, all the regexes, below, give the correct number of matches :

    ba*(?=a)    =>  4 matches
    ba*(?!a)    =>  9 matches
    ba*(?=b)    =>  8 matches
    ba*(?!b)    =>  5 matches

    (?<=a)ba*   =>  5 matches
    (?<!b)ba*   =>  5 matches
    (?<=b)ba*   =>  4 matches
    (?<!a)ba*   =>  4 matches
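As an independent cross-check, the same eight expressions can be run through Python's `re` module. Python's engine is a different implementation from both Boost and Columns++, so this is only a sketch confirming that the expected counts follow from ordinary backtracking semantics; the line-break is assumed here to be a single `\n`.

```python
import re

# Test string from the post, with a final line-break (assumed to be "\n")
TEXT = "aaaabaaababbbaabbabb\n"

PATTERNS = [
    r"ba*(?=a)", r"ba*(?!a)", r"ba*(?=b)", r"ba*(?!b)",
    r"(?<=a)ba*", r"(?<!b)ba*", r"(?<=b)ba*", r"(?<!a)ba*",
]

# Count non-overlapping matches, scanning left to right
counts = {pattern: len(re.findall(pattern, TEXT)) for pattern in PATTERNS}

for pattern, n in counts.items():
    print(f"{pattern:<12} => {n} matches")
```

The counts obtained this way are the same as those listed above, in the same order.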
Here are the correct results, concerning all the Posix character classes, against the
Total_ANSI.txtfile[[:ascii:]] an UNDER \x{0080} character 128 = [\x{0000}-\x{007F}] = [\x{00}-\x{7F}] = [\x00-\x7F] [[:unicode:]] = \p{unicode} an OVER \x{00FF} character 27 = [\x{0100}-\x{EFFFD}] = [^\x{0000}-\x{00FF}] = [^\x{00}-\x{FF}] = [^\x00-\xFF] = [[:space:]] = \p{space} = [[:s:]] = \p{s} = \ps = \s a WHITE-SPACE character 7 = [\t\n\x0B\f\r\x20\xA0] [[:h:]] = \p{h} = \ph = \h an HORIZONTAL white space character 3 = [\t\x20\xA0] [[:blank:]] = \p{blank} a BLANK character 3 = [\t\x20\xA0] [[:v:]] = \p{v} = \pv = \v a VERTICAL white space character 4 = [\n\x0B\f\r] [[:cntrl:]] = \p{cntrl} a CONTROL code character 38 = [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D] = [[.NUL.]-[.US.][.DEL.][.HOP.][.RI.][.SS3.][.DCS.][.OSC.]] [[:upper:]] = \p{upper} = [[:u:]] = \p{u} = \pu = \u an UPPER case letter 60 = [A-ZŠŒŽÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞß] [[:lower:]] = \p{lower} = [[:l:]] = \p{l} = \pl = \l a LOWER case letter 63 = [a-zƒšœžµßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ] [ªº] = [\xAA\xBA] 2 OTHER Letters 2 ˆ = \x{02C6} a MODIFIER letter 1 [[:digit:]] = \p{digit} = [[:d:]] = \p{d} = \pd = \d a DECIMAL number 10 = [0-9] _ = \x5F the LOW_LINE character 1 ------- [[:word:]] = \p{word} = [[:w:]] = \p{w} = \pw = \w a WORD character 137 = [0-9A-Z_a-zƒˆŠŒŽšœžŸªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ] = [[:alnum:]]|\x5F = \p{alnum}|\x5F [[:upper:]]|[[:lower:]] = [[:upper:][:lower:]] = \u|\l Any LETTER, whatever its CASE 123 [[:alnum:]] = \p{alnum} an ALPHANUMERIC character 136 = [0-9A-Za-zƒˆŠŒŽšœžŸªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ] = [[:upper:][:lower:][:digit:]\xAA\xBA\x{02C6}] [[:alpha:]] = \p{alpha} any LETTER character 126 = [[:upper:][:lower:]\xAA\xBA\x{02C6}] [[:graph:]] = \p{graph} any VISIBLE character 215 = [^\x00-\x1F\x20\x7F\x81\x8D\x8F\x90\x9D\xA0\xAD] [[:print:]] = \p{print} any PRINTABLE character 222 = [[:graph:]]|\s [[:punct:]] = \p{punct} any PUNCTUATION or SYMBOL character 73 = \p{Punctuation}|\p{Symbol} = 
[\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E\x{20AC}\x{201A}\x{201E}\x{2026}\x{2020}\x{2021}\x{2030}\x{2039}\x{2018}\x{2019}\x{201C}\x{201D}\x{2022}\x{2013}\x{2014}\x{02DC}\x{2122}\x{203A}\xA1-\xA9\xAB\xAC\xAE-\xB1\xB4\xB6-\xB8\xBB\xBF\xD7\xF7] = [^[:cntrl:]\w\x20\xA0\xAD\xB2\xB3\xB9\xBC\xBD\xBE]|\x5F [[:xdigit:]] an HEXADECIMAL character 22 = [0-9A-Fa-f] = (?i)[0-9A-F]
Below, the correct results for all Unicode character classes, against the `Total_ANSI.txt` file ( since `Columns++ v1.3`, Unicode classes work in `ANSI` files, as well ) :

    \p{Any}                                  Any character                                        256  = (?s). = \I = [\x{0000}-\x{EFFFD}]
    \p{Ascii}                                a character UNDER \x80                               128  = [[:ascii:]] = \o
    \p{Assigned}                             an ASSIGNED character                                256

    \p{Cc} = \p{Control}                     a C0 or C1 CONTROL code character                     38  = [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]
    \p{Cf} = \p{Format}                      a FORMAT CONTROL character                             1  = \xAD
    \p{Cn} = \p{Not Assigned}                an UNASSIGNED or NON-CHARACTER character               0
    \p{Co} = \p{Private Use}                 a PRIVATE-USE character                                0
    \p{Cs} = \p{Surrogate} (INVALID regex)   a SURROGATE character                                  0
                                                                                               ------
    \p{C*} = \p{Other}                                                                             39  = \p{Cc}|\p{Cf}|\p{Cn}|\p{Co}

    \p{Lu} = \p{Uppercase Letter}            an UPPER case letter                                  60  = \u = [[:upper:]] = \p{upper}
    \p{Ll} = \p{Lowercase Letter}            a LOWER case letter                                   63  = \l = [[:lower:]] = \p{lower}
    \p{Lt} = \p{Titlecase}                   a DI-GRAPHIC letter                                    0
    \p{Lm} = \p{Modifier Letter}             a MODIFIER letter                                      1  = \x{02C6}
    \p{Lo} = \p{Other Letter}                OTHER letter                                           2  = [\xAA\xBA]
                                                                                              -------
    \p{L*} = \p{Letter}                                                                           126  = \p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo} = [[:alpha:]] = \p{alpha}

    \p{Mc} = \p{Spacing Combining Mark}      a SPACING COMBINING mark                               0
    \p{Me} = \p{Enclosing Mark}              an ENCLOSING mark (POSITIVE advance width)             0
    \p{Mn} = \p{Non-Spacing Mark}            a NON-SPACING COMBINING mark (ZERO advance width)      0
                                                                                                -----
    \p{M*} = \p{Mark}                                                                               0  = \p{Mc}|\p{Me}|\p{Mn}

    \p{Nd} = \p{Decimal Digit Number}        a DECIMAL number character                            10  = \d = [[:digit:]] = \p{digit}
    \p{Nl} = \p{Letter Number}               a LETTERLIKE numeric character                         0
    \p{No} = \p{Other Number}                OTHER NUMERIC character                                6  = [\xB2\xB3\xB9\xBC\xBD\xBE]
                                                                                               ------
    \p{N*} = \p{Number}                                                                            16  = \p{Nd}|\p{Nl}|\p{No} = [0-9\xB2\xB3\xB9\xBC\xBD\xBE]

    \p{Pd} = \p{Dash Punctuation}            a DASH or HYPHEN punctuation mark                      3  = [\x2D\x{2013}\x{2014}]
    \p{Ps} = \p{Open Punctuation}            an OPENING PUNCTUATION mark, in a pair                 5  = [\x28\x5B\x7B\x{201A}\x{201E}]
    \p{Pc} = \p{Connector Punctuation}       a CONNECTING PUNCTUATION mark                          1  = \x5F
    \p{Pe} = \p{Close Punctuation}           a CLOSING PUNCTUATION mark, in a pair                  3  = [\x29\x5D\x7D]
    \p{Pi} = \p{Initial Punctuation}         an INITIAL QUOTATION mark                              4  = [\x{2039}\x{2018}\x{201C}\xAB]
    \p{Pf} = \p{Final Punctuation}           a FINAL QUOTATION mark                                 4  = [\x{2019}\x{201D}\x{203A}\xBB]
    \p{Po} = \p{Other Punctuation}           OTHER PUNCTUATION mark                                25  = [\x21-\x23\x25-\x27\x2A\x2C\x2E\x2F\x3A\x3B\x3F\x40\x5C\x{2026}\x{2020}\x{2021}\x{2030}\x{2022}\xA1\xA7\xB6\xB7\xBF]
                                                                                               ------
    \p{P*} = \p{Punctuation}                                                                       45  = \p{Pd}|\p{Ps}|\p{Pc}|\p{Pe}|\p{Pi}|\p{Pf}|\p{Po}

    \p{Sm} = \p{Math Symbol}                 a MATHEMATICAL symbol character                       10  = [\x2B\x3C-\x3E\x7C\x7E\xAC\xB1\xD7\xF7]
    \p{Sc} = \p{Currency Symbol}             a CURRENCY character                                   6  = [\x24\x{20AC}\xA2-\xA5]
    \p{Sk} = \p{Modifier Symbol}             a NON-LETTERLIKE MODIFIER character                    7  = [\x5E\x60\x{02DC}\xA8\xAF\xB4\xB8]
    \p{So} = \p{Other Symbol}                OTHER SYMBOL character                                 5  = [\x{2122}\xA6\xA9\xAE\xB0]
                                                                                               ------
    \p{S*} = \p{Symbol}                                                                            28  = \p{Sm}|\p{Sc}|\p{Sk}|\p{So}

    \p{Zs} = \p{Space Separator}             a NON-ZERO width SPACE character                       2  = [\x20\xA0] = (?!\t)\h
    \p{Zl} = \p{Line Separator}              the LINE SEPARATOR character                           0
    \p{Zp} = \p{Paragraph Separator}         the PARAGRAPH SEPARATOR character                      0
                                                                                                -----
    \p{Z*} = \p{Separator}                                                                          2  = \p{Zs}|\p{Zl}|\p{Zp}

Remark :

- A negative POSIX character class can be expressed as `[^[:........:]]` or `[[:^........:]]`

- A negative UNICODE character class can be expressed as `\P{..}`, with an uppercase letter `P`
With this last release, @coises, results are totally coherent between `ANSI` and `UTF-8` files !

Continuation on next post
-
-
Hello, @coises and All,
Continuation and end of my post
I also tested ALL the `equivalence` class feature :

You can use ANY equivalent character to get the total number of matches of the equivalence class ( for example, `[[=ª=]]` = `[[=Å=]]` = `[[=ã=]]` = … )

Here is, below, the list of all the equivalences of any char of the
Windows-1252code-page, against theTotal_ANSI.txtfile. Note that I did not consider the equivalence classes which returns only one match ![[=1=]] = [[=one=]] => 2 [1¹] [[=2=]] = [[=two=]] => 2 [2²] [[=3=]] = [[=three=]] => 2 [3³] [[=A=]] => 15 [AaªÀÁÂÃÄÅàáâãäå] [[=B=]] => 2 [Bb] [[=C=]] => 4 [CcÇç] [[=D=]] => 4 [DdÐð] [[=E=]] => 10 [EeÈÉÊËèéêë] [[=F=]] => 3 [Ffƒ] [[=G=]] => 2 [Gg] [[=H=]] => 2 [Hh] [[=I=]] => 10 [IiÌÍÎÏìíîï] [[=J=]] => 2 [Jj] [[=K=]] => 2 [Kk] [[=L=]] => 2 [Ll] [[=M=]] => 2 [Mm] [[=N=]] => 4 [NnÑñ] [[=O=]] => 15 [OoºÒÓÔÕÖØòóôõöø] [[=P=]] => 2 [Pp] [[=Q=]] => 2 [Qq] [[=R=]] => 2 [Rr] [[=S=]] => 4 [SsŠš] [[=T=]] => 2 [Tt] [[=U=]] => 10 [UuÙÚÛÜùúûü] [[=V=]] => 2 [Vv] [[=W=]] => 2 [Ww] [[=X=]] => 2 [Xx] [[=Y=]] => 6 [YyÝýÿŸ] [[=Z=]] => 4 [ZzŽž] [[=^=]] = [[=circumflex=]] => 2 [^ˆ] = [\x5E\x{02C6}] [[=Œ=]] => 2 [Œœ] = [\x{0152}\x{0153}] [[==]] => 2 [[.NUL.][.SHY.]] = [\x00\xAD] [[=Þ=]] => 2 [Þþ] = [\xDE\xFE]
Some double-letter characters equivalences :
[[=AE=]] = [[=Ae=]] = [[=ae=]] => 2 [Ææ] = [\xC6\xE6] [[=SS=]] = [[=Ss=]] = [[=ss=]] => 1 [ß] = [\xDF]
An example : let’s suppose that we run this regex `[A-F[:lower:]]`, against my `Total_ANSI.txt` file. It does give `69` matches, so `6` UPPER letters + `63` LOWER letters

The regexes `[[:upper:]]|[[:lower:]]` and `[[:upper:][:lower:]]` act as insensitive regexes and return `123` matches ( so `60` UPPER letters + `63` LOWER letters )

The regexes `(?=\u)\l` and `(?=\l)\u` do not find anything. This implies that the sets of UPPER and LOWER letters, in `Total_ANSI.txt`, are totally disjoint

Best Regards
guy038
P.S. :
BTW, I forgot to list the equivalence classes, with more than `1` match, of the `Control C0/C1` and `Control Format` characters, against the `Total_Chars.txt` file ! Here are the results, below :

    [[=nul=]]     =>  3,240   [\x{0000}\x{00AD}....]      Cc
    [[= =]]       =>      3   [\x{0020}\x{205F}\x{3000}]  Zs
    [[=mmsp=]]    =>      3   [\x{0020}\x{205F}\x{3000}]  Zs
    [[=idsp=]]    =>      3   [\x{0020}\x{205F}\x{3000}]  Zs
    [[=shy=]]     =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=alm=]]     =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=sam=]]     =>      2   [\x{070F}\x{2E1A}]          Po
    [[=nqsp=]]    =>      2   [\x{2000}\x{2002}]          Zs
    [[=ensp=]]    =>      2   [\x{2000}\x{2002}]          Zs
    [[=mqsp=]]    =>      2   [\x{2001}\x{2003}]          Zs
    [[=emsp=]]    =>      2   [\x{2001}\x{2003}]          Zs
    [[=zwnj=]]    =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=zwj=]]     =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=lrm=]]     =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=rlm=]]     =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=ls=]]      =>      2   [\x{2028}\x{FE47}]          Zl
    [[=lre=]]     =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=rle=]]     =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=pdf=]]     =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=lro=]]     =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=rlo=]]     =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=wj=]]      =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=(fa)=]]    =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=(it)=]]    =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=(is)=]]    =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=(ip)=]]    =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=lri=]]     =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=rli=]]     =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=fsi=]]     =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=pdi=]]     =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=iss=]]     =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=ass=]]     =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=iafs=]]    =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=aafs=]]    =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=nads=]]    =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=nods=]]    =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=zwnbsp=]]  =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=iaa=]]     =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=ias=]]     =>  3,240   [\x{0000}\x{00AD}....]      Cf
    [[=iat=]]     =>  3,240   [\x{0000}\x{00AD}....]      Cf

As you can see, a lot of `Format` characters return an erroneous result of `3,240` occurrences. But we’re not going to bother about these wrong `equivalence` classes, as long as the similar `collating` names, with the `[[.XXX.]]` syntax, are totally correct !

Luckily, all the other equivalence classes are also correct, except for `[[=ls=]]` which returns `2` matches, `\x{2028}` and `\x{FE47}` ??
@guy038 said in Columns++ version 1.3: All Unicode, all the time:
As you can see, a lot of Format characters return an erroneous result of 3,240 occurrences. But we’re not going to bother about these wrong equivalence classes, as long as the similar collating names, with the [[.XXX.]] syntax, are totally correct !
Luckily, all the other equivalence classes are also correct, except for [[=ls=]] which returns 2 matches \x{2028} and \x{FE47} ??
Thank you for the observation. I will have to look into this more closely. I believe the Boost::regex engine uses the transform_primary member function of the character traits class to determine equivalence: if the sort key returned by that function for two characters is the same, then they are equivalent. I implemented transform_primary using LCMapStringEx, as that is normally how one does Unicode sorting. But how is sorting relevant to regular expressions?
It could be — despite the documented requirement for the function — that what is needed from transform_primary isn’t a sort key, but rather a case folding followed by a compatibility decomposition.
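To make that alternative concrete, here is a minimal Python sketch of an equivalence test built on case folding followed by compatibility decomposition. It is only an illustration, not the Columns++ or Boost::regex code: the names `primary_key` and `equivalent` are hypothetical, and dropping nonspacing marks after decomposition is an extra assumption added so that accented letters compare equal to their base letter.

```python
import unicodedata


def primary_key(ch: str) -> str:
    """Hypothetical equivalence key: case-fold, decompose (NFKD), drop nonspacing marks.

    Assumption: stripping Mn characters stands in for "ignoring accents";
    this is not what LCMapStringEx or Boost::regex actually computes.
    """
    decomposed = unicodedata.normalize("NFKD", ch.casefold())
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def equivalent(a: str, b: str) -> bool:
    """Two characters are treated as equivalent when their keys are identical."""
    return primary_key(a) == primary_key(b)


if __name__ == "__main__":
    # Accented and upper-case variants of 'a' should collapse onto 'a'.
    for ch in "aAÀÅå":
        print(repr(ch), equivalent(ch, "a"))
    # The line separator and U+FE47 get different keys under this scheme.
    print(equivalent("\u2028", "\ufe47"))
```

With a key of this kind, format characters would no longer collapse into one large class just because a collation table gives them no weight, since each one keeps its own decomposition.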
Again, thank you for all your testing, and for calling this to my attention.
-
Hi, @coises,
If you need my Total_Chars.txt file, simply extract it from the Unicode.zip archive, within my Google Drive account :

https://drive.google.com/file/d/1kYtbIGPRLdypY7hNMI-vAJXoE7ilRMOC/view?usp=sharing

You do not need the other files of this archive, as the main information is described below !

The Total_Chars.txt file is a true UTF-8 file with a BOM, which contains each Unicode code-point, assigned or unassigned, once only, from \x{0000} to \x{EFFFD}

Physically, it contains 3 lines :
- A first line, from \x{0000} to \x{0009}, with the \x{000A} line-break
- A second line, from \x{000B} to \x{000C}, with the \x{000D} line-break
- A third very LONG line with all characters, from \x{000E} to \x{EFFFD}, without some excluded ones ( refer below )
In UTF-8 terms, the Total_Chars.txt file can be decomposed as :

•  [\x{0000}-\x{007F}]                          128 chars coded with 1 byte    =>          128
•  [\x{0080}-\x{07FF}]                        1,920 chars coded with 2 bytes   =>        3,840
•  [\x{0800}-\x{FFFD}]                       61,406 chars coded with 3 bytes   =>      184,218
•  Planes 1, 2, 3, 14 = 4 × 65,534 =        262,136 chars coded with 4 bytes   =>    1,048,544
                                        -----------                               ------------
                                            325,590 chars                            1,236,730 bytes
•  BOM                                                                                       3 bytes
                                        -----------                               ------------
                                            325,590 chars                            1,236,733 bytes
As mentioned above, the Total_Chars.txt file does NOT contain the following zones :

•  The SURROGATES block, from \x{D800} to \x{DFFF}
•  The 32 NOT-Unicode chars, from \x{FDD0} to \x{FDEF}
•  The two NOT-Unicode chars, ending the Plane 0     \x{FFFE} and \x{FFFF}
•  The two NOT-Unicode chars, ending the Plane 1    \x{1FFFE} and \x{1FFFF}
•  The two NOT-Unicode chars, ending the Plane 2    \x{2FFFE} and \x{2FFFF}
•  The two NOT-Unicode chars, ending the Plane 3    \x{3FFFE} and \x{3FFFF}
•  The COMPLETE planes 4 to 13, from \x{40000} to \x{DFFFF}
•  The two NOT-Unicode chars, ending the Plane 14   \x{EFFFE} and \x{EFFFF}
•  The PRIVATE-USE planes 15 to 16, from \x{F0000} to \x{10FFFF}
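The character and byte totals of the file can be re-derived from the exclusion rules alone. The following Python sketch does that; the function names are hypothetical, and it assumes nothing about the file beyond the zones described in this post (it does not read Total_Chars.txt itself).

```python
def in_total_chars(cp: int) -> bool:
    """True if the code point belongs to the zones described for Total_Chars.txt."""
    if cp > 0xEFFFD:                 # planes 15 and 16, plus the end of plane 14
        return False
    if 0xD800 <= cp <= 0xDFFF:       # surrogates
        return False
    if 0xFDD0 <= cp <= 0xFDEF:       # 32 non-characters
        return False
    if (cp & 0xFFFF) >= 0xFFFE:      # the two non-characters ending each plane
        return False
    if 0x40000 <= cp <= 0xDFFFF:     # planes 4 to 13
        return False
    return True


def utf8_length(cp: int) -> int:
    """Number of bytes used by UTF-8 to encode one code point."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


CODE_POINTS = [cp for cp in range(0x110000) if in_total_chars(cp)]
CHAR_COUNT = len(CODE_POINTS)                          # included characters
BYTE_COUNT = sum(utf8_length(cp) for cp in CODE_POINTS)  # bytes without the BOM
EXCLUDED_COUNT = 0x110000 - CHAR_COUNT                 # everything left out

if __name__ == "__main__":
    print(f"{CHAR_COUNT:,} chars, {BYTE_COUNT:,} bytes, {BYTE_COUNT + 3:,} bytes with BOM")
    print(f"{EXCLUDED_COUNT:,} excluded code points")
```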
Here is, below, the list of all INCLUDED planes, followed with all the EXCLUDED zones of the Total_Chars.txt file :

•=========================================•================•=========•=============•
|  Zones INCLUDED in 'Total_Chars.txt'    |     Range      |  Plane  |   # Chars   |
•=========================================•================•=========•=============•
|                                         |   0000..FFFD   |    0    |     63,454  |
•-----------------------------------------•----------------•---------•-------------•
|                                         |  10000..1FFFD  |    1    |     65,534  |
•-----------------------------------------•----------------•---------•-------------•
|                                         |  20000..2FFFD  |    2    |     65,534  |
•-----------------------------------------•----------------•---------•-------------•
|                                         |  30000..3FFFD  |    3    |     65,534  |
•-----------------------------------------•----------------•---------•-------------•
|                                         |  E0000..EFFFD  |   14    |     65,534  |
•=========================================•================•=========•=============•
|  Total INCLUDED characters              |                |         |    325,590  |
•=========================================•================•=========•=============•

•=========================================•================•=========•=============•
|  Zones EXCLUDED from 'Total_Chars.txt'  |     Range      |  Plane  |   # Chars   |
•=========================================•================•=========•=============•
|  Surrogates                             |   D800..DFFF   |    0    |      2,048  |
|  Not Unicode                            |   FDD0..FDEF   |    0    |         32  |
|  Not Unicode                            |   FFFE..FFFF   |    0    |          2  |
•-----------------------------------------•----------------•---------•-------------•
|  Not Unicode                            |  1FFFE..1FFFF  |    1    |          2  |
•-----------------------------------------•----------------•---------•-------------•
|  Not Unicode                            |  2FFFE..2FFFF  |    2    |          2  |
•-----------------------------------------•----------------•---------•-------------•
|  Not Unicode                            |  3FFFE..3FFFF  |    3    |          2  |
•-----------------------------------------•----------------•---------•-------------•
|  Unassigned                             |  40000..4FFFD  |    4    |     65,534  |
|  Not Unicode                            |  4FFFE..4FFFF  |    4    |          2  |
•-----------------------------------------•----------------•---------•-------------•
|  Unassigned                             |  50000..5FFFD  |    5    |     65,534  |
|  Not Unicode                            |  5FFFE..5FFFF  |    5    |          2  |
•-----------------------------------------•----------------•---------•-------------•
|  Unassigned                             |  60000..6FFFD  |    6    |     65,534  |
|  Not Unicode                            |  6FFFE..6FFFF  |    6    |          2  |
•-----------------------------------------•----------------•---------•-------------•
|  Unassigned                             |  70000..7FFFD  |    7    |     65,534  |
|  Not Unicode                            |  7FFFE..7FFFF  |    7    |          2  |
•-----------------------------------------•----------------•---------•-------------•
|  Unassigned                             |  80000..8FFFD  |    8    |     65,534  |
|  Not Unicode                            |  8FFFE..8FFFF  |    8    |          2  |
•-----------------------------------------•----------------•---------•-------------•
|  Unassigned                             |  90000..9FFFD  |    9    |     65,534  |
|  Not Unicode                            |  9FFFE..9FFFF  |    9    |          2  |
•-----------------------------------------•----------------•---------•-------------•
|  Unassigned                             |  A0000..AFFFD  |   10    |     65,534  |
|  Not Unicode                            |  AFFFE..AFFFF  |   10    |          2  |
•-----------------------------------------•----------------•---------•-------------•
|  Unassigned                             |  B0000..BFFFD  |   11    |     65,534  |
|  Not Unicode                            |  BFFFE..BFFFF  |   11    |          2  |
•-----------------------------------------•----------------•---------•-------------•
|  Unassigned                             |  C0000..CFFFD  |   12    |     65,534  |
|  Not Unicode                            |  CFFFE..CFFFF  |   12    |          2  |
•-----------------------------------------•----------------•---------•-------------•
|  Unassigned                             |  D0000..DFFFD  |   13    |     65,534  |
|  Not Unicode                            |  DFFFE..DFFFF  |   13    |          2  |
•-----------------------------------------•----------------•---------•-------------•
|  Not Unicode                            |  EFFFE..EFFFF  |   14    |          2  |
•-----------------------------------------•----------------•---------•-------------•
|  Supplementary_Private_Use_Area-A       |  F0000..FFFFD  |   15    |     65,534  |
|  Not Unicode                            |  FFFFE..FFFFF  |   15    |          2  |
•-----------------------------------------•----------------•---------•-------------•
|  Supplementary_Private_Use_Area-B       | 100000..10FFFD |   16    |     65,534  |
|  Not Unicode                            | 10FFFE..10FFFF |   16    |          2  |
•=========================================•================•=========•=============•
|  Total EXCLUDED characters              |                |         |    788,522  |
•=========================================•================•=========•=============•

•-----------------------------------------•----------------•---------•-------------•
|  Total UNICODE characters               |  0000..10FFFF  |  0 - 16 |  1,114,112  |
•-----------------------------------------•----------------•---------•-------------•

Best Regards,
guy038
-
-
@guy038 said in Columns++ version 1.3: All Unicode, all the time:
As you can see, a lot of Format characters return an erroneous result of 3,240 occurrences. But we’re not going to bother about these wrong equivalence classes, as long as the similar collating names, with the [[.XXX.]] syntax, are totally correct !
Luckily, all the other equivalence classes are also correct, except for [[=ls=]] which returns 2 matches \x{2028} and \x{FE47} ??
Still looking into this, I find this statement in the Boost::regex documentation (emphasis mine):
An expression of the form [[=col=]], matches any character or collating element whose primary sort key is the same as that for collating element col, as with collating elements the name col may be a symbolic name. A primary sort key is one that ignores case, accentation, or locale-specific tailorings; so for example [[=a=]] matches any of the characters: a, À, Á, Â, Ã, Ä, Å, A, à, á, â, ã, ä and å. Unfortunately implementation of this is reliant on the platform’s collation and localisation support; this feature can not be relied upon to work portably across all platforms, or even all locales on one platform.
I used :

LCMapStringEx(locale.data(), LCMAP_SORTKEY | LINGUISTIC_IGNOREDIACRITIC | NORM_IGNORECASE | NORM_IGNOREKANATYPE | NORM_IGNOREWIDTH | NORM_LINGUISTIC_CASING, ...

as my best guess at how to do this.
There are some differences other than the format characters between my search and Notepad++. For example, [[=k=]] matches Ʞ (U+A7B0) in Columns++ search, but not in Notepad++ native search; though both match its lower-case counterpart, ʞ (U+029E).

I do wonder why
[[=ls=]]matches ﹇ (U+FE47) as well as U+2028. Though Notepad++ native search does not accept the[[=ls=]]syntax, substituting the actual U+2028 character,[[= =]](you can copy that even though you can’t see it), yields 12 matches, including U+FE47.Do you know if there is a precise definition of what should count as an equivalence class in Unicode regular expressions? It is unclear to me for what target I should be aiming.
-
Hello, @coises and All,
I’m compiling a list of ALL the word characters of ANY Unicode block and I’ve noticed a strange behavior in three Unicode blocks ( Latin Extended-A, Georgian and Latin Extended-C )

Indeed, when you use the following regexes, against my Total_Chars.txt file, with the Columns++ plugin :
- (?=\w)[\x{0100}-\x{017F}]
- (?=\w)[\x{10A0}-\x{10FF}]
- (?=\w)[\x{2C60}-\x{2C7F}]

They all return an error ?!

However, note that the regexes :
- (?=\w)[\x{0100}-\x{017E}] returns 127 word chars
- (?=\w)\x{017F} returns 1 word char

Giving the exact total of word chars of the Latin Extended-A Unicode block ( 128 )

Note also that the regexes :
- (?=\w)[\x{10A0}-\x{10C7}] returns 39 word chars
- (?=\w)[\x{10C8}-\x{10FF}] returns 48 word chars

Giving the exact number of word chars of the Georgian Unicode block ( 87 )

Finally, note that the regexes :
- (?=\w)[\x{2C60}-\x{2C7D}] returns 30 word chars
- (?=\w)[\x{2C7E}-\x{2C7F}] returns 2 word chars

Giving the exact number of word chars of the Latin Extended-C Unicode block ( 32 )

TIA, @coises, for investigating !
Best Regards,
guy038
-
-
@guy038 said in Columns++ version 1.3: All Unicode, all the time:
They all return an error ?!
Thank you for discovering this!
I’ve identified the problem. It is an error in how I handle match case. If you test with (?-i) before the expressions you’ll find that they work.

To follow the explanation, note these characteristics of ranges in Boost::regex:
- Ranges must have the lower bound first and the upper bound second. Reverse order is not allowed and produces an error message.
- Case insensitive ranges are processed by first case folding both ends of the range, then accepting any character which case folds to a character within the range.

The reason the ranges you tried don’t work with match case checked is that I neglected to include that switch when testing the validity of a regex, thinking (wrongly) that case sensitivity could not affect the validity of a regex. The validity test therefore ran case-insensitively: it folded both ends of each range, and for these ranges the folded lower bound ends up above the folded upper bound.

I am reasonably certain (but haven’t yet verified in detail) that the reason the first and third expressions work case-insensitive in Notepad++ native search, but don’t work case-insensitive in Columns++ search, is that Columns++ uses Unicode-defined case folding, while I believe Notepad++ (as a Boost::regex default) uses Windows lower-casing. Those two aren’t always the same.
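The effect of the two range rules can be sketched in Python. This is only an illustration under stated assumptions: `fold` and `range_survives_folding` are hypothetical names, and Python’s `str.casefold()` (Unicode full case folding) stands in for whatever folding Columns++ applies; for the range endpoints discussed in this thread it folds each one to a single character.

```python
def fold(cp: int) -> int:
    """Case-fold one code point; keep it unchanged if folding expands to several chars."""
    folded = chr(cp).casefold()
    return ord(folded) if len(folded) == 1 else cp


def range_survives_folding(lo: int, hi: int) -> bool:
    """A range stays valid only if the folded lower bound is not above the folded upper bound."""
    return fold(lo) <= fold(hi)


if __name__ == "__main__":
    ranges = [
        (0x0100, 0x017F), (0x10A0, 0x10FF), (0x2C60, 0x2C7F),   # reported as errors
        (0x0100, 0x017E), (0x10A0, 0x10C7), (0x10C8, 0x10FF),   # reported as working
        (0x2C60, 0x2C7D), (0x2C7E, 0x2C7F),
    ]
    for lo, hi in ranges:
        print(f"[{lo:04X}-{hi:04X}] folds to [{fold(lo):04X}-{fold(hi):04X}]",
              "ok" if range_survives_folding(lo, hi) else "reversed")
```

For example, U+017F (long s) folds to U+0073 while U+0100 folds to U+0101, so the first range becomes reversed once both ends are folded.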
I will prepare a new version of Columns++ to fix this. In the meantime, you can work around it by prefixing (?-i) to case sensitive searches instead of depending on the match case check box.
-
-
@guy038 said in Columns++ version 1.3: All Unicode, all the time:
Indeed, when you use the following regexes, against my Total_Chars.txt file, with the Columns++ plugin :
(?=\w)[\x{0100}-\x{017F}]
(?=\w)[\x{10A0}-\x{10FF}]
(?=\w)[\x{2C60}-\x{2C7F}]
They all return an error ?!
Columns++ version 1.3.1 should fix this (when Match case is checked; odd behavior for ranges seems unavoidable when case insensitive mode is in effect; note that Notepad++ native search also gives an error on the second expression with Match case not checked).
Notepad++ version 8.9.1 release candidate is expected any day now, so I rushed this in… hopefully I didn’t make any major mistakes.
Thank you again, @guy038, for catching this bug.
-
Hello, @coises and All,
I’ve found a small anomaly concerning hexadecimal characters :
- If I use the native Notepad++ search to match any hexadecimal character, with the regex [[:xdigit:]], against my Total_Chars.txt file, it returns 44 matches
- If I use the Columns++ search to match any hexadecimal character, with the regex [[:xdigit:]], against my Total_Chars.txt file, it returns 22 matches

I suppose that the N++ answer is the right one. Indeed, in the https://www.unicode.org/reports/tr18/#Compatibility_Properties article ( Annex C about UNICODE REGULAR EXPRESSIONS ), it is said :

Hex_Digit contains 0-9 A-F fullwidth and halfwidth, upper and lowercase

Note that the \p{Hex_Digit} regex is erroneous ! The right one is \p{xdigit}, at least, within Columns++

Here is another proof from https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt. Search for the string Hex in your browser : it clearly shows that the total should be 44 !
Now, I found out some other syntaxes about the Unicode classes :

Any Unicode class regex can be expressed with one among these four syntaxes : \p{Xx}, \p{Xxxxxxx}, [[:Xx:]], [[:Xxxxxxx:]]

Therefore, here is an update of my previous post https://community.notepad-plus-plus.org/post/104377 :

Against the Total_Chars.txt file, all these general results, below, are correct :

(?s).  =  \I  =  \p{Any}  =  [\x{0000}-\x{EFFFD}]                                                   =>  Total = 325,590

\p{Unicode}  =  [[:Unicode:]]                                                                       =>  325,334  |
                                                                                                                 |  Total = 325,590
\P{Unicode}  =  [[:^Unicode:]]                                                                      =>      256  |

\p{Ascii}  =  \o                                                                                    =>      128  |
                                                                                                                 |  Total = 325,590
\P{Ascii}  =  \O                                                                                    =>  325,462  |

\X                                                                                                  =>  322,586  |
                                                                                                                 |  Total = 325,590
(?!\X).                                                                                             =>    3,004  |

[\x{E000}-\x{F8FF}]|\y  =  [\x{E000}-\x{F8FF}]|[[:defined:]]  =  \p{Assigned}                       =>  166,266  |
                                                                                                                 |  Total = 325,590
(?![\x{E000}-\x{F8FF}])\Y  =  (?![\x{E000}-\x{F8FF}])[^[:defined:]]  =  \p{Not Assigned}            =>  159,324  |

Note : if we add, to the number of characters of Total_Chars.txt, the contents of all the omitted planes ( Planes 4 to 13, 15 and 16 ), less the TWO non-characters for each, plus the Surrogate characters and all the Unicode non-chars, we obtain :

325,590 + (65536 - 2) * 12 + 2,048 + 66 = 1,114,112

which is, indeed, the total amount of Unicode chars, both assigned or not assigned !
Here are the correct results, concerning all the Posix character classes, against the Total_Chars.txt file :

[[:ascii:]]                                               an UNDER \x{0080} char               128   =  [\x{0000}-\x{007F}]  =  \p{ascii}  =  \o

[[:unicode:]] = \p{unicode}                               an OVER \x{00FF} char            325,334   =  [\x{0100}-\x{EFFFD}]  ( RESTRICTED to 'Total_Chars.txt' )

[[:space:]] = \p{space} = [[:s:]] = \p{s} = \ps = \s      a WHITE-SPACE char                    25   =  [\t\n\x{000B}\f\r\x20\x{0085}\x{00A0}\x{1680}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}]
[[:h:]] = \p{h} = \ph = \h                                an HORIZONTAL white space char        18   =  [\t\x20\x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}]  =  \p{Zs}|\t
[[:blank:]] = \p{blank}                                   a BLANK char                          18   =  [\t\x20\x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}]  =  \p{Zs}|\t
[[:v:]] = \p{v} = \pv = \v                                a VERTICAL white space char            7   =  [\n\x{000B}\f\r\x{0085}\x{2028}\x{2029}]

[[:cntrl:]] = \p{cntrl}                                   a CONTROL code char                   65   =  [\x{0000}-\x{001F}\x{007F}\x{0080}-\x{009F}]

[[:upper:]] = \p{upper} = [[:u:]] = \p{u} = \pu = \u      an UPPER case letter char          1,886   =  \p{Lu}
[[:lower:]] = \p{lower} = [[:l:]] = \p{l} = \pl = \l      a LOWER case letter char           2,283   =  \p{Ll}
                                                          a DI-GRAPHIC letter char              31   =  \p{Lt}
                                                          a MODIFIER letter char               410   =  \p{Lm}
                                                          an OTHER letter char             141,062   =  \p{Lo}  +  SYLLABLES / IDEOGRAPHS
[[:digit:]] = \p{digit} = [[:d:]] = \p{d} = \pd = \d      a DECIMAL number                     770   =  \p{Nd}
_ = \x{005F}                                              the LOW_LINE char                      1
                                                                                         ---------
[[:word:]] = \p{word} = [[:w:]] = \p{w} = \pw = \w        a WORD char                      146,443   =  \p{L*}|\p{Nd}|_

[[:alnum:]] = \p{alnum}                                   an ALPHANUMERIC char             146,442   =  \p{L*}|\p{Nd}
[[:alpha:]] = \p{alpha}                                   any LETTER char                  145,672   =  \p{L*}

[[:graph:]] = \p{graph}                                   any VISIBLE char                 159,612   =  [^\s[:C*:]]  =  (?=\S)\P{Other}
[[:print:]] = \p{print}                                   any PRINTABLE char               159,637   =  [[:graph:]]|\s
[[:punct:]] = \p{punct}                                   any PUNCTUATION or SYMBOL char     9,473   =  \p{P*}|\p{S*}  =  \p{Punctuation}|\p{Symbol}  =  856 + 8,617

[[:xdigit:]] = \p{xdigit}                                 an HEXADECIMAL char                   22   =  [0-9A-Fa-f]
And here are the correct results regarding the Unicode character classes, against the Total_Chars.txt file :

\p{Any} = [[:Any:]]                                                                 =  ANY char                              325,590   =  (?s).  =  \I  =  [\x{0000}-\x{EFFFD}]
\p{Ascii} = [[:Ascii:]]                                                             =  an UNDER \x80 char                        128   =  [[:ascii:]]  =  \o
\p{Assigned} = [[:Assigned:]]                                                       =  an ASSIGNED char                      166,266   ( of Total_Chars.txt, ONLY )

\p{Cc} = \p{Control} = [[:Cc:]] = [[:Control:]]                                     =  a C0 or C1 CONTROL code char               65
\p{Cf} = \p{Format} = [[:Cf:]] = [[:Format:]]                                       =  a FORMAT CONTROL char                     170
\p{Cn} = \p{Not Assigned} = [[:Cn:]] = [[:Not Assigned:]]                           =  an UNASSIGNED or NON-CHARACTER char   159,324   ( 'Total_Chars.txt' does NOT contain the 66 NON-CHARACTER chars )
\p{Co} = \p{Private Use} = [[:Co:]] = [[:Private Use:]]                             =  a PRIVATE-USE char                      6,400
\p{Cs} = \p{Surrogate} = [[:Cs:]] = [[:Surrogate:]]                                 =  a SURROGATE char                      [2,048]   ( 'Total_Chars.txt' does NOT contain the 2,048 SURROGATE chars )
                                                                                                                         -----------
\p{C*} = \p{Other} = [[:C*:]] = [[:Other:]]                                         =                                        165,959   =  \p{Cc}|\p{Cf}|\p{Cn}|\p{Co}

\p{Lu} = \p{Uppercase Letter} = [[:Lu:]] = [[:Uppercase Letter:]]                   =  an UPPER case letter char               1,886   =  \u  =  [[:upper:]]  =  \p{upper}
\p{Ll} = \p{Lowercase Letter} = [[:Ll:]] = [[:Lowercase Letter:]]                   =  a LOWER case letter char                2,283   =  \l  =  [[:lower:]]  =  \p{lower}
\p{Lt} = \p{Titlecase} = [[:Lt:]] = [[:Titlecase:]]                                 =  a DI-GRAPHIC letter char                   31
\p{Lm} = \p{Modifier Letter} = [[:Lm:]] = [[:Modifier Letter:]]                     =  a MODIFIER letter char                    410
\p{Lo} = \p{Other Letter} = [[:Lo:]] = [[:Other Letter:]]                           =  an OTHER letter char                  141,062   +  SYLLABLES / IDEOGRAPHS
                                                                                                                         -----------
\p{L*} = \p{Letter} = [[:L*:]] = [[:Letter:]]                                       =                                        145,672   =  \p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}  =  [[:alpha:]]  =  \p{alpha}

\p{Mc} = \p{Spacing Combining Mark} = [[:Mc:]] = [[:Spacing Combining Mark:]]       =  a SPACING COMBINING char                  471
\p{Me} = \p{Enclosing Mark} = [[:Me:]] = [[:Enclosing Mark:]]                       =  an ENCLOSING char                          13
\p{Mn} = \p{Non-Spacing Mark} = [[:Mn:]] = [[:Non-Spacing Mark:]]                   =  a NON-SPACING COMBINING char            2,059
                                                                                                                         -----------
\p{M*} = \p{Mark} = [[:M*:]] = [[:Mark:]]                                           =                                          2,543   =  \p{Mc}|\p{Me}|\p{Mn}

\p{Nd} = \p{Decimal Digit Number} = [[:Nd:]] = [[:Decimal Digit Number:]]           =  a DECIMAL number char                     770
\p{Nl} = \p{Letter Number} = [[:Nl:]] = [[:Letter Number:]]                         =  a LETTERLIKE numeric char                 239
\p{No} = \p{Other Number} = [[:No:]] = [[:Other Number:]]                           =  OTHER NUMERIC char                        915
                                                                                                                         -----------
\p{N*} = \p{Number} = [[:N*:]] = [[:Number:]]                                       =                                          1,924   =  \p{Nd}|\p{Nl}|\p{No}

\p{Pd} = \p{Dash Punctuation} = [[:Pd:]] = [[:Dash Punctuation:]]                   =  a DASH or HYPHEN punctuation char          27
\p{Ps} = \p{Open Punctuation} = [[:Ps:]] = [[:Open Punctuation:]]                   =  an OPENING PUNCTUATION char                79
\p{Pc} = \p{Connector Punctuation} = [[:Pc:]] = [[:Connector Punctuation:]]         =  a CONNECTING PUNCTUATION char              10
\p{Pe} = \p{Close Punctuation} = [[:Pe:]] = [[:Close Punctuation:]]                 =  a CLOSING PUNCTUATION char                 77
\p{Pi} = \p{Initial Punctuation} = [[:Pi:]] = [[:Initial Punctuation:]]             =  an INITIAL QUOTATION char                  12
\p{Pf} = \p{Final Punctuation} = [[:Pf:]] = [[:Final Punctuation:]]                 =  a FINAL QUOTATION char                     10
\p{Po} = \p{Other Punctuation} = [[:Po:]] = [[:Other Punctuation:]]                 =  OTHER PUNCTUATION char                    641
                                                                                                                         -----------
\p{P*} = \p{Punctuation} = [[:P*:]] = [[:Punctuation:]]                             =                                            856   =  \p{Pd}|\p{Ps}|\p{Pc}|\p{Pe}|\p{Pi}|\p{Pf}|\p{Po}

\p{Sm} = \p{Math Symbol} = [[:Sm:]] = [[:Math Symbol:]]                             =  a MATHEMATICAL symbol char                960
\p{Sc} = \p{Currency Symbol} = [[:Sc:]] = [[:Currency Symbol:]]                     =  a CURRENCY char                            64
\p{Sk} = \p{Modifier Symbol} = [[:Sk:]] = [[:Modifier Symbol:]]                     =  a NON-LETTERLIKE MODIFIER char            125
\p{So} = \p{Other Symbol} = [[:So:]] = [[:Other Symbol:]]                           =  OTHER SYMBOL char                       7,468
                                                                                                                         -----------
\p{S*} = \p{Symbol} = [[:S*:]] = [[:Symbol:]]                                       =                                          8,617   =  \p{Sm}|\p{Sc}|\p{Sk}|\p{So}

\p{Zs} = \p{Space Separator} = [[:Zs:]] = [[:Space Separator:]]                     =  a NON-ZERO width SPACE char                17   =  [\x{0020}\x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}]  =  (?!\t)\h
\p{Zl} = \p{Line Separator} = [[:Zl:]] = [[:Line Separator:]]                       =  the LINE SEPARATOR char                     1   =  \x{2028}
\p{Zp} = \p{Paragraph Separator} = [[:Zp:]] = [[:Paragraph Separator:]]             =  the PARAGRAPH SEPARATOR char                1   =  \x{2029}
                                                                                                                         -----------
\p{Z*} = \p{Separator} = [[:Z*:]] = [[:Separator:]]                                 =                                             19   =  \p{Zs}|\p{Zl}|\p{Zp}

Remark :
- A negative POSIX character class can be expressed as [^[:........:]] or [[:^........:]]
- A negative UNICODE character class can be expressed as \P{..}, with an uppercase letter P
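As a quick consistency check of the General Category counts reported above, the per-category figures can be summed and compared with the group totals and with the size of Total_Chars.txt. This Python sketch only re-adds the numbers quoted in this post; the dictionary name `CATEGORY_COUNTS` is hypothetical and nothing here queries Columns++ or the file.

```python
# Counts copied from the results above (Cs is absent from Total_Chars.txt, so it is left out).
CATEGORY_COUNTS = {
    "Cc": 65, "Cf": 170, "Cn": 159_324, "Co": 6_400,
    "Lu": 1_886, "Ll": 2_283, "Lt": 31, "Lm": 410, "Lo": 141_062,
    "Mc": 471, "Me": 13, "Mn": 2_059,
    "Nd": 770, "Nl": 239, "No": 915,
    "Pd": 27, "Ps": 79, "Pc": 10, "Pe": 77, "Pi": 12, "Pf": 10, "Po": 641,
    "Sm": 960, "Sc": 64, "Sk": 125, "So": 7_468,
    "Zs": 17, "Zl": 1, "Zp": 1,
}


def group_total(letter: str) -> int:
    """Sum every category whose two-letter code starts with the given letter."""
    return sum(n for code, n in CATEGORY_COUNTS.items() if code.startswith(letter))


GROUP_TOTALS = {letter: group_total(letter) for letter in "CLMNPSZ"}
GRAND_TOTAL = sum(CATEGORY_COUNTS.values())

if __name__ == "__main__":
    for letter, total in GROUP_TOTALS.items():
        print(f"\\p{{{letter}*}} = {total:,}")
    print(f"All categories = {GRAND_TOTAL:,}")
```

The grand total matching the 325,590 characters of the file means every character of Total_Chars.txt is counted in exactly one category.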
Now, if you follow the procedure explained in the last part of this post :
https://community.notepad-plus-plus.org/post/99844
The regexes [\x{DC80}-\x{DCFF}] or \i or [[:invalid:]] do give 134 occurrences, which is the exact number of invalid UTF-8 characters of that example !

Best Regards,
guy038
-
-
@guy038 said in Columns++ version 1.3: All Unicode, all the time:
Note that the \p{Hex_Digit} regex is erroneous ! The right one is \p{xdigit}, at least, within Columns++

What’s going on there is that I followed the structure of Boost::regex character classes:
Character Classes that are Always Supported
Character classes that are supported by Unicode Regular Expressions
which are mainly the POSIX character classes plus Unicode General Categories interpreted as character classes. Also, note that in Boost::regex, character classes and character properties are the same thing. I didn’t make any attempt to change that. I believe this is different both from Unicode regular expressions and from PCRE.
(I did add a couple new character classes unique to Columns++: [:defined:] and [:invalid:], and aliases \i, \o and \y for [:invalid:], [:ASCII:] and [:defined:]. Also, Columns++ does not support [:Cs:] / [:Surrogate:] since Unicode in Scintilla can only be UTF-8, which cannot contain surrogates; it can contain invalid byte sequences which appear to encode surrogates, as in WTF-8, but Scintilla treats these as invalid UTF-8 bytes, and so does Columns++.)

Hex_Digit isn’t one of the Boost::regex character classes, and I never defined it. Defining it to be equivalent to xdigit would be trivial; re-defining xdigit to include non-ASCII characters is a bit more complicated:

I’ve found out a small anomaly concerning hexadecimal characters :
- If I use the native Notepad++ search to match any hexadecimal character, with the regex [[:xdigit:]], against my Total_Chars.txt file, it returns 44 matches
- If I use the Columns++ search to match any hexadecimal character, with the regex [[:xdigit:]], against my Total_Chars.txt file, it returns 22 matches

I suppose that the N++ answer is the right one. Indeed, in the https://www.unicode.org/reports/tr18/#Compatibility_Properties article , ( Annexe C about UNICODE REGULAR EXPRESSIONS ), it is said :

Hex_Digit contains 0-9 A-F fullwidth and halfwidth, upper and lowercase
Yes, it would seem the standard is to include those non-ASCII characters as hex digits. Further, the comments at your link under lower and upper are troublesome, as Columns++ treats them as aliases for Ll and Lu. Word and word boundaries are probably faulty as well.

I followed the Boost::regex principle that to extend the traditional POSIX mappings, the only Unicode property that is used to determine membership in a character class is the General Category.

I hard-coded (that is, they are written explicitly rather than being derived from Unicode tables) the POSIX mappings for ASCII characters, since that’s the only place they are really well-defined; plus there is a hard-coded exception for the non-ASCII character U+0085, the Next Line control character, because it should be part of \v, which is implemented in Boost::regex as [[:v:]]. I don’t see any reason [[:xdigit:]] can’t be extended with similar hard-coded logic; I just didn’t know until now that I should do it.

The other parts, though: whatever they are saying is supposed to be included in [:lower:] and [:upper:] besides letters, and whatever they are talking about in regard to word characters and boundaries… that might be problematic. I have a condensed set of tables built from a few Unicode files, instead of trying to import the ghastly large and complex ICU. Those tables include the General Category, but if that is not enough to determine membership in a character class… reorganizing them to include whatever additional information I need (it’s not yet clear to me what that will be) is not likely to be simple.

Thank you for your observation. Indeed, there are flaws. It is not yet clear to me if and how it will be practical to address them, though I can probably fix the [:xdigit:] behavior without much difficulty.
-
-
Hello, @coises and All,
@Coises, it’s been a while since I last replied to you. In the meantime, I’ve discovered two very useful tools in the world of Unicode :
-
First, the https://codepoints.net/ site, still with the v16.0 release and which should migrate to the v17.0 very soon. Look at the search part !
-
Secondly, a powerful command-line tool, called uni.exe, at https://github.com/arp242/uni. Download the latest uni-v2.9.0-windows-amd64.exe.gz archive. Extract its single uni-v2.9.0-windows-amd64.exe file, which can be renamed as uni.exe, and run it in a command line window ! Awesome, indeed !
However, I found some issues, reported in https://github.com/arp242/uni/issues/58 and https://github.com/arp242/uni/issues/59
But let’s get back to your plugin before someone tells me I’m OFF TOPIC. Ha ha!
In the https://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt, within the Binary Properties section, there are two lines :

# ================================================
# Binary Properties
# ================================================
AHex                     ; ASCII_Hex_Digit
Alpha                    ; Alphabetic
Bidi_C                   ; Bidi_Control
Bidi_M                   ; Bidi_Mirrored
........                 ; ..................
Hex                      ; Hex_Digit
........                 ; ..................
XO_NFC                   ; Expands_On_NFC
XO_NFD                   ; Expands_On_NFD
XO_NFKC                  ; Expands_On_NFKC
XO_NFKD                  ; Expands_On_NFKD

Thus, I suppose that the [[:xdigit:]] and \p{xdigit} properties are rather a POSIX property which, naturally, corresponds to the [0-9A-Fa-f] character class

And the Hex_Digit property is rather a Unicode property = [\x{0030}-\x{0039}\x{0041}-\x{0046}\x{0061}-\x{0066}\x{FF10}-\x{FF19}\x{FF21}-\x{FF26}\x{FF41}-\x{FF46}]

So, the present definition of a hex-digit character is OK and I updated my last post !
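The difference between the two definitions is easy to reproduce. This Python sketch builds both sets from the ranges quoted above and checks the 22 versus 44 counts; the names `ASCII_HEX`, `FULLWIDTH_HEX` and `HEX_DIGIT` are hypothetical, and the NFKC step is an added illustration showing that each fullwidth form normalizes to its ASCII counterpart.

```python
import unicodedata

# POSIX-style xdigit: ASCII only.
ASCII_HEX = set("0123456789ABCDEFabcdef")

# Fullwidth forms listed in the Hex_Digit range expression above.
FULLWIDTH_RANGES = [(0xFF10, 0xFF19), (0xFF21, 0xFF26), (0xFF41, 0xFF46)]
FULLWIDTH_HEX = {chr(cp) for lo, hi in FULLWIDTH_RANGES for cp in range(lo, hi + 1)}

# Unicode Hex_Digit: ASCII hex digits plus their fullwidth counterparts.
HEX_DIGIT = ASCII_HEX | FULLWIDTH_HEX


def to_ascii_hex(ch: str) -> str:
    """Map a fullwidth hex digit to its ASCII form via compatibility normalization."""
    return unicodedata.normalize("NFKC", ch)


if __name__ == "__main__":
    print(len(ASCII_HEX), "ASCII hex digits")
    print(len(HEX_DIGIT), "Hex_Digit characters")
    print(all(to_ascii_hex(ch) in ASCII_HEX for ch in FULLWIDTH_HEX))
```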
Now, from https://unicode.org/reports/tr18/tr18-23.html#Compatibility_Properties, I tried to re-formulate these Compatibility Properties and add some comments about them, in the list below :

•----------------•----------•----------------------------------------------------------------------------------------------------------------------------------------------
| Property       |          | UNICODE Standard / Comments
•----------------•----------•----------------------------------------------------------------------------------------------------------------------------------------------
| \p{Uppercase}  | Standard | Thus = \p{Lu}|\p{Other_Uppercase} = 1,886 + 120 = 2,006 chars
|                | Comments | Uppercase includes more than gc = Uppercase_Letter (Lu). See "PropList.txt" for "Other_Uppercase" definition
•----------------•----------•----------------------------------------------------------------------------------------------------------------------------------------------
| \p{Lowercase}  | Standard | Thus = \p{Ll}|\p{Other_Lowercase} = 2,283 + 312 = 2,595 chars
|                | Comments | Lowercase includes more than gc = Lowercase_Letter (Ll). See "PropList.txt" for "Other_Lowercase" definition
•----------------•----------•----------------------------------------------------------------------------------------------------------------------------------------------
| \p{Alphabetic} | Standard | Thus = \p{Uppercase Letter}|\p{Lowercase Letter}|\p{Titlecase}|\p{Modifier Letter}|\p{Other Letter}|\p{Letter Number}|\p{Other_Alphabetic}
|                |          |      = 1,886 + 2,283 + 31 + 410 + 141,062 + 239 + 1,510 = 147,421 chars
|                |          |      = \p{Letter}|\p{Letter Number}|\p{Other_Alphabetic}
|                |          |      = 145,672 + 239 + 1,510 = 147,421 chars
|                |          |
|                |          | Note that : \p{Other_Alphabetic} contains some [but not all] \p{Mark} chars ( 1,380 ) and some [but not all] \p{Other Symbol} chars ( 130 )
|                |          |           : \p{Other_Alphabetic} contains some [but not all] \p{Other_Uppercase} chars ( 104 ) and some [but not all] \p{Other_Lowercase} chars ( 27 )
|                | Comments | Alphabetic includes more than gc = Letter. See "PropList.txt" for "Other_Alphabetic" definition
|                |          | Note that combining marks (Me, Mn, Mc) are required for words of many languages
|                |          | While they could be applied to non-alphabetics, their principal use is on alphabetics.
|                |          | Alphabetic should not be used as an approximation for word boundaries. See "word" below.
•----------------•----------•----------------------------------------------------------------------------------------------------------------------------------------------
| \p{punct}      | Standard | \p{gc=Punctuation}     Thus = \p{Punctuation} = 856 chars
|                | Comments | POSIX adds symbols. Not recommended generally, due to the confusion of having 'punct' include non-punctuation marks.
•----------------•----------•----------------------------------------------------------------------------------------------------------------------------------------------
| \p{digit}  \d  | Standard | \p{gc=Decimal_Number}     Thus = \p{Decimal Digit Number} = 770 chars
|                | Comments | Non-decimal numbers (like Roman numerals) are normally excluded.
•----------------•----------•----------------------------------------------------------------------------------------------------------------------------------------------
| \p{xdigit}     | Standard | A Hex_Digit char     Thus = \p{xdigit} = [0-9A-Fa-f]|[\x{FF10}-\x{FF19}]|[\x{FF21}-\x{FF26}]|[\x{FF41}-\x{FF46}] = 44 chars
|                | Comments | Hex_Digit contains 0-9 A-F, fullwidth and halfwidth, upper and lowercase.
•----------------•----------•----------------------------------------------------------------------------------------------------------------------------------------------
| \p{alnum}      | Standard | Thus = \p{alpha}|\p{digit} = \p{L*}|\p{Nd} = 145,672 + 770 = 146,442 chars ( = Columns++ WORD chars minus \x{005F} )
|                | Comments | Simple combination of other properties
•----------------•----------•----------------------------------------------------------------------------------------------------------------------------------------------
| \p{space}  \s  | Standard | A whitespace character     Thus = [\t\n\x{000B}\f\r\x20\x{0085}\x{00A0}\x{1680}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}] = 25 chars
|                | Comments | See "PropList.txt" for the definition of Whitespace.
•----------------•----------•----------------------------------------------------------------------------------------------------------------------------------------------
| \p{blank}      | Standard | \p{gc=Space_Separator}|\N{CHARACTER TABULATION}.     Thus = \p{Space Separator}|\t = 18 chars
|                | Comments | "horizontal" whitespace: space separators plus U+0009 tab character
•----------------•----------•----------------------------------------------------------------------------------------------------------------------------------------------
| \p{cntrl}      | Standard | \p{gc=Control}     Thus = \p{Control} = 65 characters
|                | Comments | The characters in \p{gc=Format} share some, but not all aspects of control characters.
|                |          | Many format characters are required in the representation of plain text.
•----------------•----------•----------------------------------------------------------------------------------------------------------------------------------------------
| \p{graph}      | Standard | NON ( \p{space} OR ( \p{gc=Control} AND \p{gc=Surrogate} AND \p{Private Use} AND \p{gc=Unassigned} ) )
|                |          | Thus: = (?!\p{Cc}|\p{Co}|\p{Cn})\S = 159,782 chars
|                | Comments | Warning: the set shown here is defined by excluding space, controls, and so on, with ^.
|                |          | Note the negative look-ahead (?!\p{Cc}|\p{Co}|\p{Cn}) and the negative Unicode class \S
•----------------•----------•----------------------------------------------------------------------------------------------------------------------------------------------
| \p{print}      | Standard | \p{graph}|\p{blank} -- \p{cntrl}  =>  (?!\p{Control})(\p{graph}|\p{blank}) = 159,629 chars = (?![\t\n\x{000B}\f\r\x{0085}\x{2028}\x{2029}])(\p{graph}|\s)
|                | Comments | Includes graph and space-like characters.
•----------------•----------•----------------------------------------------------------------------------------------------------------------------------------------------
| \p{word}  \w   | Standard | \p{alpha}|\p{gc=Mark}|\p{digit}|\p{gc=Connector_Punctuation}|\p{Join_Control}
|                |          | Thus = \p{alpha}|\p{digit}|\p{Mark}|\p{Connector Punctuation}|\x{200C}|\x{200D}
|                |          |      = 145,672 + 770 + 2,543 + 10 + 1 + 1 = 148,997 chars
|                | Comments | This is only an approximation to Word Boundaries. The Connector Punctuation is added in
|                |          | for programming language identifiers, thus adding `_` and similar characters.
|                |          | Note : \p{Connector Punctuation} includes \x{005F}
•----------------•----------•----------------------------------------------------------------------------------------------------------------------------------------------

•----------------•-----------------------------------------------------------------------------------------------------------------------------------------
| Property       | POSIX compatible
•----------------•-----------------------------------------------------------------------------------------------------------------------------------------
| \p{punct}      | \p{gc=Punctuation}|\p{gc=Symbol} -- \p{alpha}     Thus = (?!\p{Alpha})\p{Punctuation}|\p{Symbol} = 9,473 chars
•----------------•-----------------------------------------------------------------------------------------------------------------------------------------
| \p{digit}  \d  | = [0-9] = 10 chars
•----------------•-----------------------------------------------------------------------------------------------------------------------------------------
| \p{xdigit}     | = [0-9A-Fa-f] = 22 chars
•----------------•-----------------------------------------------------------------------------------------------------------------------------------------
| \p{graph}      | NON ( \p{space} OR ( \p{gc=Control} AND \p{gc=Format} AND \p{gc=Surrogate} AND \p{Private Use} AND \p{gc=Unassigned} ) )     Thus :
|                | = [^\s[:C*:]] = (?=\S)\P{Other} = 159,612 chars  /  Note the negative POSIX class [^\s[:C*:]] and Unicode classes \S and \P{other}
•----------------•-----------------------------------------------------------------------------------------------------------------------------------------
| \p{print}      | = \p{graph}|\s = 159,637 chars
•----------------•-----------------------------------------------------------------------------------------------------------------------------------------
| \p{word}  \w   | = \p{alpha}|\p{digit}|\x{005F}
|                | = 145,672 + 770 + 1 = 146,443 chars
•----------------•-----------------------------------------------------------------------------------------------------------------------------------------

As you can see, I listed the UNICODE properties in a first table. Then, I ONLY listed the POSIX properties which have a different meaning, in a second table.
Now, from all these links :
• https://unicode.org/reports/tr18/tr18-23.html#word
• https://github.com/frohoff/jdk8u-jdk/blob/master/src/share/classes/java/util/regex/UnicodeProp.java
It happens that the correct interpretation of the Word character class should be :

\p{alpha}|\p{digit}|\p{Mark}|\p{Connector Punctuation}|\x{200C}|\x{200D} ( 148,997 chars )

whereas the current value, used in Columns++, is the POSIX one :

\p{alpha}|\p{digit}|\x{005F} ( 146,443 chars )

And from this link :
• https://stackoverflow.com/questions/47361430/about-the-meaning-of-perl-w/47361944#47361944
It would appear that this class should even be extended to :

\p{alphabetic}|\p{digit}|\p{Mark}|\p{Connector Punctuation}|\x{200C}|\x{200D} ( 150,746 chars )

So, I don't see exactly which rule should be applied, regarding the word definition !?
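To make the difference between the first two candidate definitions concrete, here is a minimal Python sketch. It assumes, as this post does, that \p{alpha} means any letter (L*), and it relies on the Unicode tables bundled with the Python interpreter, which are not necessarily Unicode 17.0, so any counts it prints may differ from the figures quoted here. The function names are hypothetical.

```python
import unicodedata

# Categories accepted by each candidate definition. A one-letter entry such as
# "L" stands for the whole group (Lu, Ll, Lt, Lm, Lo).
POSIX_LIKE = {"L", "Nd"}               # plus U+005F, handled separately
UTS18_LIKE = {"L", "Nd", "M", "Pc"}    # plus U+200C and U+200D, handled separately


def _in_categories(ch, wanted):
    gc = unicodedata.category(ch)      # e.g. "Lu", "Mn", "Nd", "Pc"
    return gc in wanted or gc[0] in wanted


def is_word_posix_like(ch):
    """\\p{alpha}|\\p{digit}|\\x{005F}"""
    return ch == "_" or _in_categories(ch, POSIX_LIKE)


def is_word_uts18_like(ch):
    """\\p{alpha}|\\p{digit}|\\p{Mark}|\\p{Connector Punctuation}|\\x{200C}|\\x{200D}"""
    return ch in "\u200c\u200d" or _in_categories(ch, UTS18_LIKE)


def count_matching(predicate):
    """Count code points accepted by a predicate, using Python's own tables."""
    return sum(predicate(chr(cp)) for cp in range(0x110000))


if __name__ == "__main__":
    print("Unicode tables:", unicodedata.unidata_version)
    for cp in (0x005F, 0x0300, 0x203F, 0x200D):
        ch = chr(cp)
        print(f"U+{cp:04X}", is_word_posix_like(ch), is_word_uts18_like(ch))
```

Calling `count_matching(is_word_uts18_like)` gives a total comparable to the 148,997 figure only when the interpreter ships the same Unicode version.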
Best Regards
guy038
-
-
Hi, @coises and All,
From this link https://www.unicode.org/Public/UCD/latest/ucd/Blocks.txt, I created a list of all Unicode blocks and I verified the number of word characters of each block with each of :
- Columns++
- Notepad++
- MultiReplace
Just download the text file Words_in_Blocks.txt, from my Google Drive account, below :

https://drive.google.com/file/d/1hFXLBhrKghjoMTvDk46QSk4BjlzOAPKP/view?usp=sharing
As you can see, from left to right :
- Column 1 : regex needed to get the number of Word characters
- Column 2 : name of each Unicode block
- Column 3 : total number of code points of each block
- Column 4 : number of assigned code points of each block, so far
- Column 5 : number of Word characters found by Columns++
- Column 6 : number of Word characters found by the N++ search and MultiReplace
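For readers who want to reproduce a per-block count outside of Notepad++, the sketch below mimics the column 1 regex, `(?=\w)[\x{first}-\x{last}]`, in Python. It assumes that the Columns++ `\w` is "any letter (L*), any decimal digit (Nd) or U+005F", as described earlier in this thread, and it uses the Unicode tables bundled with Python, so results for recently extended blocks may differ from column 5. The function names are hypothetical.

```python
import unicodedata


def is_columns_word(ch):
    """Assumed Columns++ \\w : a letter (L*), a decimal digit (Nd) or U+005F."""
    gc = unicodedata.category(ch)
    return ch == "_" or gc.startswith("L") or gc == "Nd"


def count_word_chars(first, last):
    """Count code points in [first, last] that satisfy the assumed \\w."""
    return sum(is_columns_word(chr(cp)) for cp in range(first, last + 1))


if __name__ == "__main__":
    print("Unicode tables:", unicodedata.unidata_version)
    print("Samaritan:", count_word_chars(0x0800, 0x083F))
    print("Lisu:", count_word_chars(0xA4D0, 0xA4FF))
```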
At this point, we can deduce some major points :
- First, for any character outside the BMP, the N++ search and MultiReplace always return the 0 value whereas Columns++, implemented in UTF-32, gives the correct results ! So, from now on, I'll speak about results regarding the BMP Unicode plane, ONLY !
Secondly, in the table below, I listed all blocks where the N++ search and MultiReplace return 0 for Word chars. As I added a column which shows in which release each block was created, it's easy to see that no block created in Unicode 5.2 or later is handled by our Boost regex engine !

•---------------------------•------------------------------------------------•----------•----------•-----------•------------•---------•
| Block range               | Block name                                     | Total    | Assigned | Columns++ | N++ / MRep | Unicode |
|                           |                                                | Code-Pts | Code-Pts | Word Chrs | Word Chrs  | Version |
•---------------------------•------------------------------------------------•----------•----------•-----------•------------•---------•
| (?=\w)[\x{0800}-\x{083F}] | Samaritan                                      |       64 |       61 |        25 |          0 |     5.2 |
| (?=\w)[\x{18B0}-\x{18FF}] | Unified Canadian Aboriginal Syllabics Extended |       80 |       70 |        70 |          0 |     5.2 |
| (?=\w)[\x{1A20}-\x{1AAF}] | Tai Tham                                       |      144 |      127 |        74 |          0 |     5.2 |
| (?=\w)[\x{1CD0}-\x{1CFF}] | Vedic Extensions                               |       48 |       43 |        13 |          0 |     5.2 |
| (?=\w)[\x{A4D0}-\x{A4FF}] | Lisu                                           |       48 |       48 |        46 |          0 |     5.2 |
| (?=\w)[\x{A6A0}-\x{A6FF}] | Bamum                                          |       96 |       88 |        70 |          0 |     5.2 |
| (?=\w)[\x{A8E0}-\x{A8FF}] | Devanagari Extended                            |       32 |       32 |         9 |          0 |     5.2 |
| (?=\w)[\x{A960}-\x{A97F}] | Hangul Jamo Extended-A                         |       32 |       29 |        29 |          0 |     5.2 |
| (?=\w)[\x{A980}-\x{A9DF}] | Javanese                                       |       96 |       91 |        58 |          0 |     5.2 |
| (?=\w)[\x{AA60}-\x{AA7F}] | Myanmar Extended-A                             |       32 |       32 |        26 |          0 |     5.2 |
| (?=\w)[\x{AA80}-\x{AADF}] | Tai Viet                                       |       96 |       72 |        61 |          0 |     5.2 |
| (?=\w)[\x{ABC0}-\x{ABFF}] | Meetei Mayek                                   |       64 |       56 |        45 |          0 |     5.2 |
| (?=\w)[\x{D7B0}-\x{D7FF}] | Hangul Jamo Extended-B                         |       80 |       72 |        72 |          0 |     5.2 |
| (?=\w)[\x{0840}-\x{085F}] | Mandaic                                        |       32 |       29 |        25 |          0 |     6.0 |
| (?=\w)[\x{1BC0}-\x{1BFF}] | Batak                                          |       64 |       56 |        38 |          0 |     6.0 |
| (?=\w)[\x{AB00}-\x{AB2F}] | Ethiopic Extended-A                            |       48 |       32 |        32 |          0 |     6.0 |
| (?=\w)[\x{08A0}-\x{08FF}] | Arabic Extended-A                              |       96 |       96 |        42 |          0 |     6.1 |
| (?=\w)[\x{AAE0}-\x{AAFF}] | Meetei Mayek Extensions                        |       32 |       23 |        14 |          0 |     6.1 |
| (?=\w)[\x{A9E0}-\x{A9FF}] | Myanmar Extended-B                             |       32 |       31 |        30 |          0 |     7.0 |
| (?=\w)[\x{AB30}-\x{AB6F}] | Latin Extended-E                               |       64 |       60 |        57 |          0 |     7.0 |
| (?=\w)[\x{AB70}-\x{ABBF}] | Cherokee Supplement                            |       80 |       80 |        80 |          0 |     8.0 |
| (?=\w)[\x{1C80}-\x{1C8F}] | Cyrillic Extended-C                            |       16 |       11 |        11 |          0 |     9.0 |
| (?=\w)[\x{0860}-\x{086F}] | Syriac Supplement                              |       16 |       11 |        11 |          0 |    10.0 |
| (?=\w)[\x{1C90}-\x{1CBF}] | Georgian Extended                              |       48 |       46 |        46 |          0 |    11.0 |
| (?=\w)[\x{0870}-\x{089F}] | Arabic Extended-B                              |       48 |       43 |        31 |          0 |    14.0 |
•---------------------------•------------------------------------------------•----------•----------•-----------•------------•---------•

I did a quick test with N++ v8.9.1 which says :
- Update to Boost 1.90.0.
But the results do not change at all. So, if I understand correctly, the Boost regex engine hasn't updated Unicode since version 5.2 ? Very surprising !
Thirdly, in the table below, I listed all blocks where the N++ search and MultiReplace return a number of WORD chars smaller than in the Columns++ column :

•---------------------------•---------------------------------------•----------•----------•-----------•------------•---------•
| Block range               | Block name                            | Total    | Assigned | Columns++ | N++ / MRep | Unicode |
|                           |                                       | Code-Pts | Code-Pts | Word Chrs | Word Chrs  | Version |
•---------------------------•---------------------------------------•----------•----------•-----------•------------•---------•
| (?=\w)[\x{02B0}-\x{02FF}] | Spacing Modifier Letters              |       80 |       80 |        37 |         24 |     1.0 |
| (?=\w)[\x{0370}-\x{03FF}] | Greek and Coptic                      |      144 |      135 |       129 |        127 |     1.0 |
| (?=\w)[\x{0530}-\x{058F}] | Armenian                              |       96 |       91 |        80 |         78 |     1.0 |
| (?=\w)[\x{0590}-\x{05FF}] | Hebrew                                |      112 |       88 |        31 |         30 |     1.0 |
| (?=\w)[\x{0600}-\x{06FF}] | Arabic                                |      256 |      256 |       173 |        172 |     1.0 |
| (?=\w)[\x{0900}-\x{097F}] | Devanagari                            |      128 |      128 |        91 |         83 |     1.0 |
| (?=\w)[\x{0980}-\x{09FF}] | Bengali                               |      128 |       96 |        65 |         63 |     1.0 |
| (?=\w)[\x{0A80}-\x{0AFF}] | Gujarati                              |      128 |       91 |        63 |         62 |     1.0 |
| (?=\w)[\x{0C00}-\x{0C7F}] | Telugu                                |      128 |      101 |        68 |         64 |     1.0 |
| (?=\w)[\x{0C80}-\x{0CFF}] | Kannada                               |      128 |       92 |        68 |         63 |     1.0 |
| (?=\w)[\x{0D00}-\x{0D7F}] | Malayalam                             |      128 |      118 |        77 |         69 |     1.0 |
| (?=\w)[\x{0D80}-\x{0DFF}] | Sinhala                               |      128 |       91 |        69 |         59 |     1.0 |
| (?=\w)[\x{0E80}-\x{0EFF}] | Lao                                   |      128 |       83 |        66 |         50 |     1.0 |
| (?=\w)[\x{0F00}-\x{0FFF}] | Tibetan                               |      256 |      211 |        60 |         59 |     1.0 |
| (?=\w)[\x{10A0}-\x{10FF}] | Georgian                              |       96 |       88 |        87 |         82 |     1.0 |
| (?=\w)[\x{2070}-\x{209F}] | Superscripts and Subscripts           |       48 |       42 |        15 |          7 |     1.0 |
| (?=\w)[\x{3100}-\x{312F}] | Bopomofo                              |       48 |       43 |        43 |         41 |     1.0 |
| (?=\w)[\x{4E00}-\x{9FFF}] | CJK Unified Ideographs                |    20992 |    20992 |     20992 |      20932 |   1.0.1 |
| (?=\w)[\x{F900}-\x{FAFF}] | CJK Compatibility Ideographs          |      512 |      472 |       472 |        467 |   1.0.1 |
| (?=\w)[\x{16A0}-\x{16FF}] | Runic                                 |       96 |       89 |        83 |         78 |     3.0 |
| (?=\w)[\x{13A0}-\x{13FF}] | Cherokee                              |       96 |       92 |        92 |         85 |     3.0 |
| (?=\w)[\x{1400}-\x{167F}] | Unified Canadian Aboriginal Syllabics |      640 |      640 |       637 |        628 |     3.0 |
| (?=\w)[\x{3400}-\x{4DBF}] | CJK Unified Ideographs Extension A    |     6592 |     6592 |      6592 |       6582 |     3.0 |
| (?=\w)[\x{31A0}-\x{31BF}] | Bopomofo Extended                     |       32 |       32 |        32 |         24 |     3.0 |
| (?=\w)[\x{1100}-\x{11FF}] | Hangul Jamo                           |      256 |      256 |       256 |        240 |     3.1 |
| (?=\w)[\x{1700}-\x{171F}] | Tagalog                               |       32 |       23 |        19 |         17 |     3.2 |
| (?=\w)[\x{0500}-\x{052F}] | Cyrillic Supplement                   |       48 |       48 |        48 |         36 |     3.2 |
| (?=\w)[\x{1900}-\x{194F}] | Limbu                                 |       80 |       68 |        41 |         39 |     4.0 |
| (?=\w)[\x{2C00}-\x{2C5F}] | Glagolitic                            |       96 |       96 |        96 |         94 |     4.1 |
| (?=\w)[\x{2C80}-\x{2CFF}] | Coptic                                |      128 |      123 |       107 |        101 |     4.1 |
| (?=\w)[\x{2D00}-\x{2D2F}] | Georgian Supplement                   |       48 |       40 |        40 |         38 |     4.1 |
| (?=\w)[\x{2E00}-\x{2E7F}] | Supplemental Punctuation              |      128 |       94 |         1 |          0 |     4.1 |
| (?=\w)[\x{1980}-\x{19DF}] | New Tai Lue                           |       96 |       83 |        80 |         59 |     4.1 |
| (?=\w)[\x{2D30}-\x{2D7F}] | Tifinagh                              |       80 |       59 |        57 |         55 |     4.1 |
| (?=\w)[\x{A700}-\x{A71F}] | Modifier Tone Letters                 |       32 |       32 |         9 |          0 |     4.1 |
| (?=\w)[\x{2C60}-\x{2C7F}] | Latin Extended-C                      |       32 |       32 |        32 |         29 |     5.0 |
| (?=\w)[\x{1B00}-\x{1B7F}] | Balinese                              |      128 |      127 |        65 |         64 |     5.0 |
| (?=\w)[\x{A720}-\x{A7FF}] | Latin Extended-D                      |      224 |      204 |       200 |        109 |     5.0 |
| (?=\w)[\x{1B80}-\x{1BBF}] | Sundanese                             |       64 |       64 |        48 |         42 |     5.1 |
| (?=\w)[\x{A640}-\x{A69F}] | Cyrillic Extended-B                   |       96 |       96 |        78 |         69 |     5.1 |
•---------------------------•---------------------------------------•----------•----------•-----------•------------•---------•

This time, we can see that the Unicode releases, listed in this table, are all earlier than the Unicode 5.2 release. I haven't exactly identified the problem, so far, for these blocks !
Fourthly, in the table below, I listed all blocks where the N++ search and MultiReplace return a number of WORD chars greater than in the Columns++ column :

•---------------------------•-----------------------------•----------•----------•-----------•------------•---------•
| Block range               | Block name                  | Total    | Assigned | Columns++ | N++ / MRep | Unicode |
|                           |                             | Code-Pts | Code-Pts | Word Chrs | Word Chrs  | Version |
•---------------------------•-----------------------------•----------•----------•-----------•------------•---------•
| (?=\w)[\x{0080}-\x{00FF}] | Latin-1 Supplement          |      128 |      128 |        65 |         68 |     1.0 |
| (?=\w)[\x{0E00}-\x{0E7F}] | Thai                        |      128 |       87 |        67 |         83 |     1.0 |
| (?=\w)[\x{2150}-\x{218F}] | Number Forms                |       64 |       60 |         2 |         41 |     1.0 |
| (?=\w)[\x{3000}-\x{303F}] | CJK Symbols and Punctuation |       64 |       64 |         9 |         22 |     1.0 |
| (?=\w)[\x{1800}-\x{18AF}] | Mongolian                   |      176 |      158 |       139 |        140 |     3.0 |
•---------------------------•-----------------------------•----------•----------•-----------•------------•---------•

Again, I don't clearly understand these differences between the last two columns !
Best Regards,
guy038
-
-
@guy038 said in Columns++ version 1.3: All Unicode, all the time:
So, if I understand correctly, the Boost regex engine hasn’t updated Unicode since version 5.2 ? Very surprising !
That part, at least, is easy to answer.
For the most part, Boost::regex doesn’t directly implement Unicode properties. It relies on either the operating system’s character classification routines or ICU.
It’s also possible to define a custom character traits class in C++ for use by Boost::regex.
Notepad++ and (I think) MultiReplace let Boost::regex fall back to Windows’ character classification. So that will update when and only when Windows updates.
Windows only handles “ANSI” and UTF-16. To work with the full range of Unicode code points, Boost::regex requires either ICU or a custom character traits class.
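As an illustration of why UTF-16 matters here, the Python sketch below shows what a classifier sees if it is handed one UTF-16 code unit at a time. That per-unit behaviour is an assumption made for the example, not something confirmed about Notepad++ or Windows internals. A letter outside the BMP is stored as two surrogate code units, and each surrogate, taken alone, has General Category Cs rather than a letter category.

```python
import unicodedata


def utf16_code_units(ch):
    """Return the UTF-16 code units (as integers) that encode one character."""
    data = ch.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


ch = "\U00010400"                      # DESERET CAPITAL LETTER LONG I, outside the BMP
units = utf16_code_units(ch)
unit_categories = [unicodedata.category(chr(u)) for u in units]

if __name__ == "__main__":
    print("code point category :", unicodedata.category(ch))
    print("UTF-16 code units   :", [f"{u:04X}" for u in units])
    print("per-unit categories :", unit_categories)
```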
I wanted to use ICU in Columns++, but after searching and asking in a couple forums, I could not find a way to incorporate ICU in a plugin. Everything I could find talked about installing ICU on the operating system. I finally gave up, never having determined if it is even possible to deploy ICU at the application/plugin level as opposed to installing it as an operating system component.
Instead, Columns++ uses the custom character traits class approach to provide character traits for 32-bit Unicode characters — which means I had to invent my own process for analyzing the Unicode character files, compiling them into something reasonably compact and fast, and translating that into character properties. So that’s why it was possible for me to update to Unicode 17.0. That wouldn’t apply to Notepad++/MultiReplace or Boost::regex itself, because they don’t directly include anything to do with Unicode character properties; they’re dependent on Windows.
-
@guy038 said in Columns++ version 1.3: All Unicode, all the time:
So, I don’t see exactly which rule should be applied, regarding the word definition !?
and in Columns++ version 1.3: All Unicode, all the time:
Again, I don’t understand clearly these differences between the two last columns !
This is not going to be a complete response yet, but some further explanation.
Even when using ICU, Boost::regex does not implement the same regex language as described in Unicode Technical Standard #18: Unicode Regular Expressions. Some of the differences are more-or-less dictated by the architecture of Boost::regex; others appear to be choices.
This is a list of category definitions used by Boost::regex when using ICU; the table comes from matching up char_pointer_range in get_default_class_id and char_class_type in lookup_classname:
alnum      U_GC_L_MASK | U_GC_ND_MASK
alpha      U_GC_L_MASK
blank      mask_blank
cntrl      U_GC_CC_MASK | U_GC_CF_MASK | U_GC_ZL_MASK | U_GC_ZP_MASK
d          U_GC_ND_MASK
digit      U_GC_ND_MASK
graph      (0x3FFFFFFFu) & ~(U_GC_CC_MASK | U_GC_CF_MASK | U_GC_CS_MASK | U_GC_CN_MASK | U_GC_Z_MASK)
h          mask_horizontal
l          U_GC_LL_MASK
lower      U_GC_LL_MASK
print      ~(U_GC_C_MASK)
punct      U_GC_P_MASK
s          U_GC_Z_MASK | mask_space
space      U_GC_Z_MASK | mask_space
u          U_GC_LU_MASK
unicode    mask_unicode
upper      U_GC_LU_MASK
v          mask_vertical
w          U_GC_L_MASK | U_GC_ND_MASK | U_GC_MN_MASK | mask_underscore
word       U_GC_L_MASK | U_GC_ND_MASK | U_GC_MN_MASK | mask_underscore
xdigit     U_GC_ND_MASK | mask_xdigit

Comparison with the table you referenced shows that Boost::regex does not use the same definitions. In particular, lower and upper are defined to be identical to General Categories Ll and Lu, alpha is defined to be identical to General Category L, and word does not contain all the characters mentioned in the Unicode specification.
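The way such masks combine can be sketched in a few lines of Python. The bit positions below are hypothetical and chosen only for the example; they are not ICU's actual U_GC_*_MASK values, and `MASK_UNDERSCORE` merely stands in for the extra flag that the table calls mask_underscore. The second class, `WORD_WITHOUT_MN`, is an assumed model of a word class that leaves out Mn.

```python
import unicodedata

# Hypothetical bit positions, one per General Category (NOT ICU's numbering).
CATEGORIES = ["Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me", "Nd", "Nl", "No",
              "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sm", "Sc", "Sk", "So",
              "Zs", "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn"]
BIT = {gc: 1 << i for i, gc in enumerate(CATEGORIES)}
MASK_UNDERSCORE = 1 << len(CATEGORIES)     # extra flag outside the category bits

GC_L = BIT["Lu"] | BIT["Ll"] | BIT["Lt"] | BIT["Lm"] | BIT["Lo"]
WORD_BOOST_ICU = GC_L | BIT["Nd"] | BIT["Mn"] | MASK_UNDERSCORE   # "word" row above
WORD_WITHOUT_MN = GC_L | BIT["Nd"] | MASK_UNDERSCORE              # same, minus Mn


def char_mask(ch):
    """Bit for the character's General Category, plus the underscore flag."""
    mask = BIT[unicodedata.category(ch)]
    if ch == "_":
        mask |= MASK_UNDERSCORE
    return mask


def in_class(ch, class_mask):
    """A character belongs to a class when the two masks share at least one bit."""
    return bool(char_mask(ch) & class_mask)


if __name__ == "__main__":
    for cp in (0x0041, 0x005F, 0x0300, 0x0903):
        ch = chr(cp)
        print(f"U+{cp:04X}", in_class(ch, WORD_BOOST_ICU), in_class(ch, WORD_WITHOUT_MN))
```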
For the most part, Columns++ follows the Boost::regex definitions, though I did not include Mn in word. Also the Boost::regex code for isctype implements some of the classifications directly; I think I am close, but not necessarily identical, for those. It looks as if Boost::regex does define xdigit according to the Unicode spec.
I think that Boost::regex defines word boundaries in terms of word characters (i.e. \b is equivalent to (?<!\w)(?=\w)|(?<=\w)(?!\w)) and that I wouldn't be able to change that without forking and modifying Boost::regex code.

I think the questions are whether Boost::regex is more accurately considered wrong, or just different in its implementation of character classes; and if the latter, which is preferable.
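The word-boundary equivalence mentioned above can be checked on a small sample. The sketch below uses Python's `re` module rather than Boost::regex, so it only shows that the two formulations agree in an engine where `\b` is likewise defined in terms of `\w`; it says nothing about Boost internals.

```python
import re

BOUNDARY = re.compile(r"\b")
LOOKAROUND = re.compile(r"(?<!\w)(?=\w)|(?<=\w)(?!\w)")


def positions(pattern, text):
    """Offsets of every (zero-length) match of the pattern in the text."""
    return [m.start() for m in pattern.finditer(text)]


sample = "foo bar_1, baz"

if __name__ == "__main__":
    print(positions(BOUNDARY, sample))
    print(positions(LOOKAROUND, sample))
```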
At present, my estimation is that it would be time-consuming, but not impossible or fragile, to implement the Unicode definitions (aside from word boundaries) as listed in Annex C: Compatibility Properties in Columns++.
Whether that’s what should be done might still be an open question.
-
Hello, @coises, @thomas-knoefel, @peterjones and All,
@coises, many thanks for your additional info. But, please, don't be too upset by these regex oddities ! Of course, some class definitions seem different but, in all cases, Columns++ gives more accurate results than the native N++ search, anyway !

In fact, I did all this research on the Unicode world as I wanted to clarify the status of identifiers, particularly with Perl, in order to find a simplified formulation for the Function List Perl parser created by @peterjones and improved with your help, by using atomic structures !

My first attempt was clearly insufficient because I only took ASCII characters into account. Peter advised me to refer to the article, below :

https://perldoc.perl.org/perldata#Identifier-parsing

which explains that, when using UTF-8, the Perl identifier syntax should be :

/ (?[ ( \p{Word} & \p{XID_Start} ) + [_] ])
  (?[ ( \p{Word} & \p{XID_Continue} ) ]) * /x

or in a SINGLE line

(?[ ( \p{Word} & \p{XID_Start} ) + [_] ])(?[ ( \p{Word} & \p{XID_Continue} ) ]) *

Although the properties \p{XID_Start} and \p{XID_Continue} are NOT part of the General Category list and are not functional with the Boost regex engine, this Perl syntax could be expressed, in theory, with our Boost regex engine as :

(?:(?=\p{XID_Start})\w|_)(?:(?=\p{XID_Continue})\w)*
Now, with the v17.0 release of the BabelMap software, I was able to get the complete and exact list of these properties : \p{WORD}, \p{ID_Start}, \p{ID_Continue}, \p{XID_Start}, \p{XID_Continue}.

Then, from these lists, I could deduce the Unicode character counts of the regexes (?:(?=\p{XID_Start})\w|_) and (?=\p{XID_Continue})\w. Refer below :

# ==================================================================================================
#
#                                         Unicode 17.0.0
#
# From article https://unicode.org/reports/tr18/tr18-23.html#word
#
# Derived Property WORD :
#
# Lu + Ll + Lt + Lm + Lo =       # L*  145,672  = \p{letter} or [[:alpha:]]
#
# + Decimal_Number               # Nd      770  = \p{Decimal Digit Number}
#                                      -------
# Total :                              146,442  = Columns++ WORD chars - \x{005F}
#
# + Mc + Me + Mn                 # M*    2,543  = \p{Mark}
#
# + Connector_Punctuation        # Pc       10  ( including the LOW LINE character \x{005F} )
#
# + 200C ; Other_ID_Continue     # Cf        1  ZERO WIDTH NON-JOINER ( JOIN-CONTROL character )
#
# + 200D ; Other_ID_Continue     # Cf        1  ZERO WIDTH JOINER ( JOIN-CONTROL character )
#
# => Total = 148,997 characters
#
# ==================================================================================================
#
# From file 'DerivedCoreProperties.txt' :
#
# https://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt
#
# Derived Property ID_Start :
#
# Lu + Ll + Lt + Lm + Lo =       # L*  145,672  ( = [[:alpha:]] )
#
# + Letter_Number                # Nl      239
#
# + 1885 ; Other_ID_Start        # Mn        1  MONGOLIAN LETTER ALI GALI BALUDA
#
# + 1886 ; Other_ID_Start        # Mn        1  MONGOLIAN LETTER ALI GALI THREE BALUDA
#
# + 2118 ; Other_ID_Start        # Sm        1  SCRIPT CAPITAL P
#
# + 212E ; Other_ID_Start        # So        1  ESTIMATED SYMBOL
#
# + 309B ; Other_ID_Start        # Sk        1  KATAKANA-HIRAGANA VOICED SOUND MARK
#
# + 309C ; Other_ID_Start        # Sk        1  KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
#
# - 2E2F ;                       # Lm        1  VERTICAL TILDE ( as INCLUDED in L* )
#
# => Total = 145,916 characters
#
# ==================================================================================================
#
# Derived Property XID_Start ( ID_Start MODIFIED for closure under NFKx ) :
#
# ID_Start                             145,916
#
# - 037A ; ID_Start              # Lm        1  GREEK YPOGEGRAMMENI
#
# - 0E33 ; ID_Start              # Lo        1  THAI CHARACTER SARA AM
#
# - 0EB3 ; ID_Start              # Lo        1  LAO VOWEL SIGN AM
#
# - 309B ; Other_ID_Start        # Sk        1  KATAKANA-HIRAGANA VOICED SOUND MARK
#
# - 309C ; Other_ID_Start        # Sk        1  KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
#
# - FC5E ; ID_Start              # Lo        1  ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
# - FC5F ; ID_Start              # Lo        1  ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
# - FC60 ; ID_Start              # Lo        1  ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
# - FC61 ; ID_Start              # Lo        1  ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
# - FC62 ; ID_Start              # Lo        1  ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
# - FC63 ; ID_Start              # Lo        1  ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
#
# - FDFA ; ID_Start              # Lo        1  ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
# - FDFB ; ID_Start              # Lo        1  ARABIC LIGATURE JALLAJALALOUHOU
#
# - FE70 ; ID_Start              # Lm        1  ARABIC FATHATAN ISOLATED FORM
# - FE72 ; ID_Start              # Lo        1  ARABIC DAMMATAN ISOLATED FORM
# - FE74 ; ID_Start              # Lo        1  ARABIC KASRATAN ISOLATED FORM
# - FE76 ; ID_Start              # Lo        1  ARABIC FATHA ISOLATED FORM
# - FE78 ; ID_Start              # Lo        1  ARABIC DAMMA ISOLATED FORM
# - FE7A ; ID_Start              # Lo        1  ARABIC KASRA ISOLATED FORM
# - FE7C ; ID_Start              # Lo        1  ARABIC SHADDA ISOLATED FORM
# - FE7E ; ID_Start              # Lo        1  ARABIC SUKUN ISOLATED FORM
#
# - FF9E ; ID_Start              # Lm        1  HALFWIDTH KATAKANA VOICED SOUND MARK
# - FF9F ; ID_Start              # Lm        1  HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
#
# => Total = 145,893 characters
#
# ==================================================================================================
#
# Derived Property ID_Continue :
#
# ID_Start                             145,916
#
# - 1885 ; Other_ID_Start        # Mn        1  MONGOLIAN LETTER ALI GALI BALUDA
#
# - 1886 ; Other_ID_Start        # Mn        1  MONGOLIAN LETTER ALI GALI THREE BALUDA
#
# The TWO characters above must be SUBTRACTED because they are, both, INCLUDED in 'Other_ID_Start' and in 'Nonspacing Mark'
#
# + Nonspacing_Mark              # Mn    2,059
#
# + Spacing_Mark                 # Mc      471
#
# + Decimal_Number               # Nd      770
#
# + Connector_Punctuation        # Pc       10  ( including the LOW LINE char : 005F _ )
#
# + 00B7 ; Other_ID_Continue     # Po        1  MIDDLE DOT
# + 0387 ; Other_ID_Continue     # Po        1  GREEK ANO TELEIA
# + 1369 ; Other_ID_Continue     # No        1  ETHIOPIC DIGIT ONE
# + 136A ; Other_ID_Continue     # No        1  ETHIOPIC DIGIT TWO
# + 136B ; Other_ID_Continue     # No        1  ETHIOPIC DIGIT THREE
# + 136C ; Other_ID_Continue     # No        1  ETHIOPIC DIGIT FOUR
# + 136D ; Other_ID_Continue     # No        1  ETHIOPIC DIGIT FIVE
# + 136E ; Other_ID_Continue     # No        1  ETHIOPIC DIGIT SIX
# + 136F ; Other_ID_Continue     # No        1  ETHIOPIC DIGIT SEVEN
# + 1370 ; Other_ID_Continue     # No        1  ETHIOPIC DIGIT EIGHT
# + 1371 ; Other_ID_Continue     # No        1  ETHIOPIC DIGIT NINE
# + 19DA ; Other_ID_Continue     # No        1  NEW TAI LUE THAM DIGIT ONE
# + 200C ; Other_ID_Continue     # Cf        1  ZERO WIDTH NON-JOINER
# + 200D ; Other_ID_Continue     # Cf        1  ZERO WIDTH JOINER
# + 30FB ; Other_ID_Continue     # Po        1  KATAKANA MIDDLE DOT
# + FF65 ; Other_ID_Continue     # Po        1  HALFWIDTH KATAKANA MIDDLE DOT
#
# => Total = 149,240 characters
#
# ==================================================================================================
#
# Derived Property XID_Continue ( ID_Continue MODIFIED for closure under NFKx ) :
#
# ID_Continue                          149,240
#
# - 037A ; ID_Continue           # Lm        1  GREEK YPOGEGRAMMENI
#
# - 309B ; ID_Continue           # Sk        1  KATAKANA-HIRAGANA VOICED SOUND MARK
#
# - 309C ; ID_Continue           # Sk        1  KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
#
# - FC5E ; ID_Continue           # Lo        1  ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
# - FC5F ; ID_Continue           # Lo        1  ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
# - FC60 ; ID_Continue           # Lo        1  ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
# - FC61 ; ID_Continue           # Lo        1  ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
# - FC62 ; ID_Continue           # Lo        1  ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
# - FC63 ; ID_Continue           # Lo        1  ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
#
# - FDFA ; ID_Continue           # Lo        1  ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
# - FDFB ; ID_Continue           # Lo        1  ARABIC LIGATURE JALLAJALALOUHOU
#
# - FE70 ; ID_Continue           # Lm        1  ARABIC FATHATAN ISOLATED FORM
# - FE72 ; ID_Continue           # Lo        1  ARABIC DAMMATAN ISOLATED FORM
# - FE74 ; ID_Continue           # Lo        1  ARABIC KASRATAN ISOLATED FORM
# - FE76 ; ID_Continue           # Lo        1  ARABIC FATHA ISOLATED FORM
# - FE78 ; ID_Continue           # Lo        1  ARABIC DAMMA ISOLATED FORM
# - FE7A ; ID_Continue           # Lo        1  ARABIC KASRA ISOLATED FORM
# - FE7C ; ID_Continue           # Lo        1  ARABIC SHADDA ISOLATED FORM
# - FE7E ; ID_Continue           # Lo        1  ARABIC SUKUN ISOLATED FORM
#
# => Total = 149,221 characters
#
# ==================================================================================================
#
# From https://perldoc.perl.org/perldata#Identifier-parsing
#
# Intersection of WORD and XID_Start properties + LOW LINE char :
#
# Lu + Ll + Lt + Lm + Lo =       # L*  145,672  ( = \p{letter} or [[:alpha:]] )
#
# + 005F ; Connector_Punctuation # Pc        1  LOW LINE
#
# + 1885 ; Other_ID_Start        # Mn        1  MONGOLIAN LETTER ALI GALI BALUDA ( NON-SPACING mark, common in WORD and XID_Start )
#
# + 1886 ; Other_ID_Start        # Mn        1  MONGOLIAN LETTER ALI GALI THREE BALUDA ( NON-SPACING mark, common in WORD and XID_Start )
#
# - 037A ; ID_Start              # Lm        1  GREEK YPOGEGRAMMENI
#
# - 0E33 ; ID_Start              # Lo        1  THAI CHARACTER SARA AM
#
# - 0EB3 ; ID_Start              # Lo        1  LAO VOWEL SIGN AM
#
# - 2E2F ;                       # Lm        1  VERTICAL TILDE ( as ALREADY included in L* )
#
# - FC5E ; ID_Start              # Lo        1  ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
# - FC5F ; ID_Start              # Lo        1  ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
# - FC60 ; ID_Start              # Lo        1  ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
# - FC61 ; ID_Start              # Lo        1  ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
# - FC62 ; ID_Start              # Lo        1  ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
# - FC63 ; ID_Start              # Lo        1  ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
#
# - FDFA ; ID_Start              # Lo        1  ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
# - FDFB ; ID_Start              # Lo        1  ARABIC LIGATURE JALLAJALALOUHOU
#
# - FE70 ; ID_Start              # Lm        1  ARABIC FATHATAN ISOLATED FORM
# - FE72 ; ID_Start              # Lo        1  ARABIC DAMMATAN ISOLATED FORM
# - FE74 ; ID_Start              # Lo        1  ARABIC KASRATAN ISOLATED FORM
# - FE76 ; ID_Start              # Lo        1  ARABIC FATHA ISOLATED FORM
# - FE78 ; ID_Start              # Lo        1  ARABIC DAMMA ISOLATED FORM
# - FE7A ; ID_Start              # Lo        1  ARABIC KASRA ISOLATED FORM
# - FE7C ; ID_Start              # Lo        1  ARABIC SHADDA ISOLATED FORM
# - FE7E ; ID_Start              # Lo        1  ARABIC SUKUN ISOLATED FORM
#
# - FF9E ; ID_Start              # Lm        1  HALFWIDTH KATAKANA VOICED SOUND MARK
# - FF9F ; ID_Start              # Lm        1  HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
#
# => Total = 145,653 characters, which can START an IDENTIFIER
#
# ==================================================================================================
#
# From https://perldoc.perl.org/perldata#Identifier-parsing
#
# Intersection of WORD and XID_Continue properties :
#
# Lu + Ll + Lt + Lm + Lo =       # L*  145,672  ( = \p{letter} or [[:alpha:]] )
#
# + Nonspacing_Mark              # Mn    2,059
#
# + Spacing_Mark                 # Mc      471
#
# + Decimal_Number               # Nd      770
#
# + Connector_Punctuation        # Pc       10  ( including the LOW LINE char : 005F _ )
#
# + 200C ; Other_ID_Continue     # Cf        1  ZERO WIDTH NON-JOINER ( FORMAT character, common in WORD and XID_Continue )
#
# + 200D ; Other_ID_Continue     # Cf        1  ZERO WIDTH JOINER ( FORMAT character, common in WORD and XID_Continue )
#
# - 037A ; ID_Continue           # Lm        1  GREEK YPOGEGRAMMENI
#
# - 2E2F ;                       # Lm        1  VERTICAL TILDE ( as ALREADY included in L* )
#
# - FC5E ; ID_Continue           # Lo        1  ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
# - FC5F ; ID_Continue           # Lo        1  ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
# - FC60 ; ID_Continue           # Lo        1  ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
# - FC61 ; ID_Continue           # Lo        1  ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
# - FC62 ; ID_Continue           # Lo        1  ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
# - FC63 ; ID_Continue           # Lo        1  ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
#
# - FDFA ; ID_Continue           # Lo        1  ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
# - FDFB ; ID_Continue           # Lo        1  ARABIC LIGATURE JALLAJALALOUHOU
#
# - FE70 ; ID_Continue           # Lm        1  ARABIC FATHATAN ISOLATED FORM
# - FE72 ; ID_Continue           # Lo        1  ARABIC DAMMATAN ISOLATED FORM
# - FE74 ; ID_Continue           # Lo        1  ARABIC KASRATAN ISOLATED FORM
# - FE76 ; ID_Continue           # Lo        1  ARABIC FATHA ISOLATED FORM
# - FE78 ; ID_Continue           # Lo        1  ARABIC DAMMA ISOLATED FORM
# - FE7A ; ID_Continue           # Lo        1  ARABIC KASRA ISOLATED FORM
# - FE7C ; ID_Continue           # Lo        1  ARABIC SHADDA ISOLATED FORM
# - FE7E ; ID_Continue           # Lo        1  ARABIC SUKUN ISOLATED FORM
#
# => Total = 148,966 characters, which can CONTINUE an IDENTIFIER
#
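The totals in the listing above can be re-derived mechanically from its own tallies. The short Python check below only redoes the additions and subtractions; the per-category counts are copied from the listing and are not recomputed from the Unicode data files.

```python
# Per-category counts as reported in the listing (Unicode 17.0.0).
L, ND, M, MN, MC, PC, NL = 145_672, 770, 2_543, 2_059, 471, 10, 239

# WORD : L* + Nd + M* + Pc + U+200C + U+200D
word = L + ND + M + PC + 2

# ID_Start : L* + Nl + 6 Other_ID_Start characters - U+2E2F
id_start = L + NL + 6 - 1

# XID_Start : ID_Start minus the 23 characters listed
xid_start = id_start - 23

# ID_Continue : ID_Start - U+1885 - U+1886 + Mn + Mc + Nd + Pc + 16 Other_ID_Continue
id_continue = id_start - 2 + MN + MC + ND + PC + 16

# XID_Continue : ID_Continue minus the 19 characters listed
xid_continue = id_continue - 19

# WORD & XID_Start, plus LOW LINE : L* + U+005F + U+1885 + U+1886 - 22 characters
perl_start = L + 1 + 2 - 22

# WORD & XID_Continue : L* + Mn + Mc + Nd + Pc + U+200C + U+200D - 18 characters
perl_continue = L + MN + MC + ND + PC + 2 - 18

if __name__ == "__main__":
    for name, value in [("WORD", word), ("ID_Start", id_start), ("XID_Start", xid_start),
                        ("ID_Continue", id_continue), ("XID_Continue", xid_continue),
                        ("identifier start", perl_start), ("identifier continue", perl_continue)]:
        print(f"{name:20} {value:>9,}")
```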
However, the last two results, for (?:(?=\p{XID_Start})\w|_) and (?=\p{XID_Continue})\w, above, would hold ONLY IF the regex engine respected all Unicode properties. Unfortunately, our Boost regex engine :
- Only considers word characters located in the BMP
- Generally considers only the word characters defined before the Unicode 5.2 release !

I verified that, presently, only 47,681 characters can begin a Perl identifier and only 48,011 characters can continue a Perl identifier !

So, @PeterJones, in all cases, the regex rules used in the Function List for Perl are a rough approximation of what they should be !

Now, Peter, the goal is to get a Perl parser using the approximate Boost \w definition, without the help of atomic structures.

Refer to https://community.notepad-plus-plus.org/post/104861
Best Regards,
guy038
-