Hi, @coises and All,
First, here is a summary of the contents of the Total_Chars.txt file:
•--------------•-------•----------------------------------•---------•-----------------------------------------------•---------•
| Range        | Plane | COUNT / MARK of ALL characters   | # Chars | COUNT / MARK of ALL UNASSIGNED characters     | # Unas. |
•--------------•-------•----------------------------------•---------•-----------------------------------------------•---------•
| 0000...FFFD  |     0 | [\x{0000}-\x{FFFD}]              |  63,454 | (?=[\x{0000}-\x{D7FF}]|[\x{F900}-\x{FFFD}])\Y |   1,398 |
| 10000..1FFFD |     1 | [\x{10000}-\x{1FFFD}]            |  65,534 | (?=[\x{10000}-\x{1FFFD}])\Y                   |  37,090 |
| 20000..2FFFD |     2 | [\x{20000}-\x{2FFFD}]            |  65,534 | (?=[\x{20000}-\x{2FFFD}])\Y                   |   4,039 |
| 30000..3FFFD |     3 | [\x{30000}-\x{3FFFD}]            |  65,534 | (?=[\x{30000}-\x{3FFFD}])\Y                   |  56,403 |
| E0000..EFFFD |    14 | [\x{E0000}-\x{EFFFD}]            |  65,534 | (?=[\x{E0000}-\x{EFFFD}])\Y                   |  65,197 |
•--------------•-------•----------------------------------•---------•-----------------------------------------------•---------•
| 00000..EFFFD |       | (?s). \I \p{Any} [\x0-\x{EFFFD}] | 325,590 | (?![\x{E000}-\x{F8FF}])\Y \p{Not Assigned}    | 164,127 |
•--------------•-------•----------------------------------•---------•-----------------------------------------------•---------•
Unfortunately, I cannot post my new Unicode_Col++.txt file in its entirety, with the details of all the Unicode blocks (too large!). However, it will be part of my future Unicode.zip archive, which I'll post on my Google Drive account!
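For reference, here is a small C++ sketch of how such a file could be generated. This is only my assumption about its exact layout ( every code point of planes 0 to 3 and plane 14, written as one continuous UTF-8 sequence, with the surrogates and the noncharacters excluded ), and the output file name is just a placeholder:

```cpp
// Sketch only: writes every code point of planes 0-3 and plane 14,
// skipping the surrogates (D800-DFFF) and the noncharacters
// (FDD0-FDEF and the last two code points xxFFFE / xxFFFF of each plane),
// as one continuous UTF-8 sequence.
#include <cstdint>
#include <fstream>
#include <string>

static void appendUtf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

int main() {
    const std::uint32_t planes[] = { 0x0, 0x1, 0x2, 0x3, 0xE };    // planes 0-3 and 14
    std::string buffer;
    for (std::uint32_t plane : planes) {
        const std::uint32_t base = plane << 16;
        for (std::uint32_t cp = base; cp <= base + 0xFFFD; ++cp) { // skips xxFFFE / xxFFFF
            if (cp >= 0xD800 && cp <= 0xDFFF) continue;            // surrogates
            if (cp >= 0xFDD0 && cp <= 0xFDEF) continue;            // noncharacters FDD0-FDEF
            appendUtf8(buffer, cp);
        }
    }
    std::ofstream out("Total_Chars.txt", std::ios::binary);        // placeholder file name
    out << buffer;
    return 0;
}
```

With these exclusions, that gives 63,454 + 4 × 65,534 = 325,590 characters, which matches the total row of the table above.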
Now, I have tested your third experimental version of Columns++, and everything works exactly as you would expect!
You said:
Search in Columns++ shows a progress dialog when it estimates that a count, select or replace all operation will take more than two seconds…
I am pleased to tell you that, with this new feature, my laptop no longer hangs! For example, I tried to select all the matches of the regex (?s). against my Total_Chars.txt file and, with the progress dialog, on my HP ProBook 450 G8 / Windows 10 Pro 64-bit, version 21H1 / Intel® Core™ i7 / 32 GB DDR4-3200 RAM, after 8 m 21 s the green zone was complete and it said: 325590 matches selected! I even copied this whole selection into a new tab and, after deleting all the \r\n line breaks, the ComparePlus plugin did not find any difference between Total_Chars.txt and this new tab!
You said:
[[:cntrl:]] matches only Unicode General Category Cc characters. Mnemonics for formatting characters [[.sflo.]], [[.sfco.]], [[.sfds.]] and [[.sfus.]] work.
I confirm that these two changes are effective!
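Just as a side note for other readers: the set that [[:cntrl:]] now matches can be listed with a few lines of ICU4C ( this is not the plugin's code, only an illustration that I assume to be equivalent ):

```cpp
// Illustration only, using ICU4C: count the characters whose Unicode
// General Category is Cc -- the set that [[:cntrl:]] is now expected to match.
// The loop finds 65 code points: U+0000-U+001F, U+007F and U+0080-U+009F.
#include <unicode/uchar.h>
#include <cstdio>

int main() {
    int count = 0;
    for (UChar32 cp = 0; cp <= 0x10FFFF; ++cp) {
        if (u_charType(cp) == U_CONTROL_CHAR) {   // U_CONTROL_CHAR == General Category Cc
            ++count;
        }
    }
    std::printf("%d Cc characters\n", count);     // prints: 65 Cc characters
    return 0;
}
```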
Now, I particularly tested the equivalence classes feature. You can refer to the following link:
https://unicode.org/charts/collation/index.html
For the letter a, this site lists 160 equivalents of the a letter.
However, against the Total_Chars.txt file, the regex [[=a=]] returns 86 matches. So we can deduce that:
A lot of these equivalents are not found with the [[=a=]] regex
Some equivalents, not shown at this link, can be found with the [[=a=]] regex. It's the case for the \x{249C} character ( PARENTHESIZED LATIN SMALL LETTER A )!
This situation happens with any character: for example, the regex [[=1=]] finds 54 matches, whereas the site lists 209 equivalents of the digit 1.
Now, with your experimental UTF-32 version, you can use any other equivalent character of the a letter and still get the 86 matches ( [[=Ⱥ=]], [[=ⱥ=]], [[=Ɐ=]], … ). Note that, with our present Boost regex engine, some equivalence classes do not return the 86 matches. It's the case for the regexes:
[[=ɐ=]], [[=ɑ=]], [[=ɒ=]], [[=ͣ=]] , [[=ᵃ=]], [[=ᵄ=]], [[=ⱥ=]], [[=Ɑ=]], [[=Ɐ=]], [[=Ɒ=]]
Thus, your version is more coherent, as it gives the same result whatever character is used in the equivalence class regex!
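For anyone who wants to cross-check such counts outside Notepad++, here is a small sketch, assuming an ICU-enabled build of Boost.Regex ( which is neither the Notepad++ engine nor the Columns++ one, so the figures may well differ ), that counts the matches of any pattern, for instance [[=a=]], in a UTF-8 file:

```cpp
// Sketch only, assuming Boost.Regex built with ICU support (boost/regex/icu.hpp).
// Counts the matches of a pattern, e.g. [[=a=]], in a UTF-8 encoded file.
// Because equivalence classes depend on the collation data of the regex traits,
// the count may differ from both the Notepad++ and the Columns++ engines.
#include <boost/regex/icu.hpp>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main(int argc, char* argv[]) {
    if (argc < 3) {
        std::cerr << "usage: count <utf8-file> <regex>\n";
        return 1;
    }
    std::ifstream in(argv[1], std::ios::binary);
    std::string text((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());

    boost::u32regex re = boost::make_u32regex(argv[2]);   // e.g. "[[=a=]]"
    auto it  = boost::make_u32regex_iterator(text, re);
    decltype(it) end;                                     // default-constructed = end of sequence
    std::cout << std::distance(it, end) << " matches\n";
    return 0;
}
```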
Below is the list of all the equivalence classes for the characters of the Windows-1252 code page, from \x{0020} to \x{00DE}. Note that, except for the DEL character, given as an example, I did not include the equivalence classes which return only one match!
I also confirm that I did not find any character over \x{FFFF} which would be part of a regex equivalence class, either with our Boost engine or with your experimental Columns++ version!
[[= =]] = [[=space=]] => 3 ( )
[[=!=]] = [[=exclamation-mark=]] => 2 ( !! )
[[="=]] = [[=quotation-mark=]] => 3 ( "⁍" )
[[=#=]] = [[=number-sign=]] => 4 ( #؞⁗# )
[[=$=]] = [[=dollar-sign=]] => 3 ( $⁒$ )
[[=%=]] = [[=percent-sign=]] => 3 ( %⁏% )
[[=&=]] = [[=ampersand=]] => 3 ( &⁋& )
[[='=]] = [[=apostrophe=]] => 2 ( '' )
[[=(=]] = [[=left-parenthesis=]] => 4 ( (⁽₍( )
[[=)=]] = [[=right-parenthesis=]] => 4 ( )⁾₎) )
[[=*=]] = [[=asterisk=]] => 2 ( ** )
[[=+=]] = [[=plus-sign=]] => 6 ( +⁺₊﬩﹢+ )
[[=,=]] = [[=comma=]] => 2 ( ,, )
[[=-=]] = [[=hyphen=]] => 3 ( -﹣- )
[[=.=]] = [[=period=]] => 3 ( .․. )
[[=/=]] = [[=slash=]] => 2 ( // )
[[=0=]] = [[=zero=]] => 48 ( 0٠۟۠۰߀०০੦૦୦୵௦౦౸೦൦๐໐༠၀႐០᠐᥆᧐᪀᪐᭐᮰᱀᱐⁰₀↉⓪⓿〇㍘꘠ꛯ꠳꣐꤀꧐꩐꯰0 )
[[=1=]] = [[=one=]] => 54 ( 1¹١۱߁१১੧૧୧௧౧౹౼೧൧๑໑༡၁႑፩១᠑᥇᧑᧚᪁᪑᭑᮱᱁᱑₁①⑴⒈⓵⚀❶➀➊〡㋀㍙㏠꘡ꛦ꣑꤁꧑꩑꯱1 )
[[=2=]] = [[=two=]] => 54 ( 2²ƻ٢۲߂२২੨૨୨௨౨౺౽೨൨๒໒༢၂႒፪២᠒᥈᧒᪂᪒᭒᮲᱂᱒₂②⑵⒉⓶⚁❷➁➋〢㋁㍚㏡꘢ꛧ꣒꤂꧒꩒꯲2 )
[[=3=]] = [[=three=]] => 53 ( 3³٣۳߃३৩੩૩୩௩౩౻౾೩൩๓໓༣၃႓፫៣᠓᥉᧓᪃᪓᭓᮳᱃᱓₃③⑶⒊⓷⚂❸➂➌〣㋂㍛㏢꘣ꛨ꣓꤃꧓꩓꯳3 )
[[=4=]] = [[=four=]] => 51 ( 4٤۴߄४৪੪૪୪௪౪೪൪๔໔༤၄႔፬៤᠔᥊᧔᪄᪔᭔᮴᱄᱔⁴₄④⑷⒋⓸⚃❹➃➍〤㋃㍜㏣꘤ꛩ꣔꤄꧔꩔꯴4 )
[[=5=]] = [[=five=]] => 53 ( 5Ƽƽ٥۵߅५৫੫૫୫௫౫೫൫๕໕༥၅႕፭៥᠕᥋᧕᪅᪕᭕᮵᱅᱕⁵₅⑤⑸⒌⓹⚄❺➄➎〥㋄㍝㏤꘥ꛪ꣕꤅꧕꩕꯵5 )
[[=6=]] = [[=six=]] => 52 ( 6٦۶߆६৬੬૬୬௬౬೬൬๖໖༦၆႖፮៦᠖᥌᧖᪆᪖᭖᮶᱆᱖⁶₆ↅ⑥⑹⒍⓺⚅❻➅➏〦㋅㍞㏥꘦ꛫ꣖꤆꧖꩖꯶6 )
[[=7=]] = [[=seven=]] => 50 ( 7٧۷߇७৭੭૭୭௭౭೭൭๗໗༧၇႗፯៧᠗᥍᧗᪇᪗᭗᮷᱇᱗⁷₇⑦⑺⒎⓻❼➆➐〧㋆㍟㏦꘧ꛬ꣗꤇꧗꩗꯷7 )
[[=8=]] = [[=eight=]] => 50 ( 8٨۸߈८৮੮૮୮௮౮೮൮๘໘༨၈႘፰៨᠘᥎᧘᪈᪘᭘᮸᱈᱘⁸₈⑧⑻⒏⓼❽➇➑〨㋇㍠㏧꘨ꛭ꣘꤈꧘꩘꯸8 )
[[=9=]] = [[=nine=]] => 50 ( 9٩۹߉९৯੯૯୯௯౯೯൯๙໙༩၉႙፱៩᠙᥏᧙᪉᪙᭙᮹᱉᱙⁹₉⑨⑼⒐⓽❾➈➒〩㋈㍡㏨꘩ꛮ꣙꤉꧙꩙꯹9 )
[[=:=]] = [[=colon=]] => 2 ( :: )
[[=;=]] = [[=semicolon=]] => 3 ( ;;; )
[[=<=]] = [[=less-than-sign=]] => 3 ( <﹤< )
[[===]] = [[=equals-sign=]] => 5 ( =⁼₌﹦= )
[[=>=]] = [[=greater-than-sign=]] => 3 ( >﹥> )
[[=?=]] = [[=question-mark=]] => 2 ( ?? )
[[=@=]] = [[=commercial-at=]] => 2 ( @@ )
[[=A=]] => 86 ( AaªÀÁÂÃÄÅàáâãäåĀāĂ㥹ǍǎǞǟǠǡǺǻȀȁȂȃȦȧȺɐɑɒͣᴀᴬᵃᵄᶏᶐᶛᷓḀḁẚẠạẢảẤấẦầẨẩẪẫẬậẮắẰằẲẳẴẵẶặₐÅ⒜ⒶⓐⱥⱭⱯⱰAa )
[[=B=]] => 29 ( BbƀƁƂƃƄƅɃɓʙᴃᴮᴯᵇᵬᶀḂḃḄḅḆḇℬ⒝ⒷⓑBb )
[[=C=]] => 40 ( CcÇçĆćĈĉĊċČčƆƇƈȻȼɔɕʗͨᴄᴐᵓᶗᶜᶝᷗḈḉℂ℃ℭ⒞ⒸⓒꜾꜿCc )
[[=D=]] => 44 ( DdÐðĎďĐđƊƋƌƍɗʤͩᴅᴆᴰᵈᵭᶁᶑᶞᷘᷙḊḋḌḍḎḏḐḑḒḓⅅⅆ⒟ⒹⓓꝹꝺDd )
[[=E=]] => 82 ( EeÈÉÊËèéêëĒēĔĕĖėĘęĚěƎƏǝȄȅȆȇȨȩɆɇɘəɚͤᴇᴱᴲᵉᵊᶒᶕḔḕḖḗḘḙḚḛḜḝẸẹẺẻẼẽẾếỀềỂểỄễỆệₑₔ℮ℯℰ⅀ⅇ⒠ⒺⓔⱸⱻEe )
[[=F=]] => 22 ( FfƑƒᵮᶂᶠḞḟ℉ℱℲⅎ⒡ⒻⓕꜰꝻꝼꟻFf )
[[=G=]] => 45 ( GgĜĝĞğĠġĢģƓƔǤǥǦǧǴǵɠɡɢɣɤʛˠᴳᵍᵷᶃᶢᷚᷛḠḡℊ⅁⒢ⒼⓖꝾꝿꞠꞡGg )
[[=H=]] => 41 ( HhĤĥĦħȞȟɥɦʜʰʱͪᴴᶣḢḣḤḥḦḧḨḩḪḫẖₕℋℌℍℎℏ⒣ⒽⓗⱧⱨꞍHh )
[[=I=]] => 61 ( IiÌÍÎÏìíîïĨĩĪīĬĭĮįİıƖƗǏǐȈȉȊȋɨɩɪͥᴉᴵᵎᵢᵻᵼᶖᶤᶥᶦᶧḬḭḮḯỈỉỊịⁱℐℑⅈ⒤ⒾⓘꟾIi )
[[=J=]] => 23 ( JjĴĵǰȷɈɉɟʄʝʲᴊᴶᶡᶨⅉ⒥ⒿⓙⱼJj )
[[=K=]] => 38 ( KkĶķĸƘƙǨǩʞᴋᴷᵏᶄᷜḰḱḲḳḴḵₖK⒦ⓀⓚⱩⱪꝀꝁꝂꝃꝄꝅꞢꞣKk )
[[=L=]] => 56 ( LlĹĺĻļĽľĿŀŁłƚƛȽɫɬɭɮʟˡᴌᴸᶅᶩᶪᶫᷝᷞḶḷḸḹḺḻḼḽₗℒℓ⅂⅃⒧ⓁⓛⱠⱡⱢꝆꝇꝈꝉꞀꞁLl )
[[=M=]] => 33 ( MmƜɯɰɱͫᴍᴟᴹᵐᵚᵯᶆᶬᶭᷟḾḿṀṁṂṃₘℳ⒨ⓂⓜⱮꝳꟽMm )
[[=N=]] => 47 ( NnÑñŃńŅņŇňʼnƝƞǸǹȠɲɳɴᴎᴺᴻᵰᶇᶮᶯᶰᷠᷡṄṅṆṇṈṉṊṋⁿₙℕ⒩ⓃⓝꞤꞥNn )
[[=O=]] => 106 ( OoºÒÓÔÕÖØòóôõöøŌōŎŏŐőƟƠơƢƣǑǒǪǫǬǭǾǿȌȍȎȏȪȫȬȭȮȯȰȱɵɶɷͦᴏᴑᴒᴓᴕᴖᴗᴼᵒᵔᵕᶱṌṍṎṏṐṑṒṓỌọỎỏỐốỒồỔổỖỗỘộỚớỜờỞởỠỡỢợₒℴ⒪ⓄⓞⱺꝊꝋꝌꝍOo )
[[=P=]] => 33 ( PpƤƥɸᴘᴾᵖᵱᵽᶈᶲṔṕṖṗₚ℘ℙ⒫ⓅⓟⱣⱷꝐꝑꝒꝓꝔꝕꟼPp )
[[=Q=]] => 16 ( QqɊɋʠℚ℺⒬ⓆⓠꝖꝗꝘꝙQq )
[[=R=]] => 64 ( RrŔŕŖŗŘřƦȐȑȒȓɌɍɹɺɻɼɽɾɿʀʁʳʴʵʶͬᴙᴚᴿᵣᵲᵳᶉᷢᷣṘṙṚṛṜṝṞṟℛℜℝ⒭ⓇⓡⱤⱹꝚꝛꝜꝝꝵꝶꞂꞃRr )
[[=S=]] => 47 ( SsŚśŜŝŞşŠšƧƨƩƪȘșȿʂʃʅʆˢᵴᶊᶋᶘᶳᶴᷤṠṡṢṣṤṥṦṧṨṩₛ⒮ⓈⓢⱾꜱSs )
[[=T=]] => 46 ( TtŢţŤťƫƬƭƮȚțȶȾʇʈʧʨͭᴛᵀᵗᵵᶵᶿṪṫṬṭṮṯṰṱẗₜ⒯ⓉⓣⱦꜨꜩꝷꞆꞇTt )
[[=U=]] => 82 ( UuÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųƯưƱǓǔǕǖǗǘǙǚǛǜȔȕȖȗɄʉʊͧᴜᵁᵘᵤᵾᵿᶙᶶᶷᶸṲṳṴṵṶṷṸṹṺṻỤụỦủỨứỪừỬửỮữỰự⒰ⓊⓤUu )
[[=V=]] => 29 ( VvƲɅʋʌͮᴠᵛᵥᶌᶹᶺṼṽṾṿỼỽ⒱ⓋⓥⱱⱴⱽꝞꝟVv )
[[=W=]] => 28 ( WwŴŵƿǷʍʷᴡᵂẀẁẂẃẄẅẆẇẈẉẘ⒲ⓌⓦⱲⱳWw )
[[=X=]] => 15 ( XxˣͯᶍẊẋẌẍₓ⒳ⓍⓧXx )
[[=Y=]] => 36 ( YyÝýÿŶŷŸƳƴȲȳɎɏʎʏʸẎẏẙỲỳỴỵỶỷỸỹỾỿ⅄⒴ⓎⓨYy )
[[=Z=]] => 41 ( ZzŹźŻżŽžƵƶȤȥɀʐʑᴢᵶᶎᶻᶼᶽᷦẐẑẒẓẔẕℤ℥ℨ⒵ⓏⓩⱫⱬⱿꝢꝣZz )
[[=[=]] = [[=left-square-bracket=]] => 2 ( [[ )
[[=\=]] = [[=backslash=]] => 2 ( \\ )
[[=]=]] = [[=right-square-bracket=]] => 2 ( ]] )
[[=^=]] = [[=circumflex=]] => 3 ( ^ˆ^ )
[[=_=]] = [[=underscore=]] => 2 ( __ )
[[=`=]] = [[=grave-accent=]] => 4 ( `ˋ`` )
[[={=]] = [[=left-curly-bracket=]] => 2 ( {{ )
[[=|=]] = [[=vertical-line=]] => 2 ( || )
[[=}=]] = [[=right-curly-bracket=]] => 2 ( }} )
[[=~=]] = [[=tilde=]] => 2 ( ~~ )
[[==]] = [[=DEL=]] => 1 ( )
[[=Œ=]] => 2 ( Œœ )
[[=¢=]] => 3 ( ¢《¢ )
[[=£=]] => 3 ( £︽£ )
[[=¤=]] => 2 ( ¤》 )
[[=¥=]] => 3 ( ¥︾¥ )
[[=¦=]] => 2 ( ¦¦ )
[[=¬=]] => 2 ( ¬¬ )
[[=¯=]] => 2 ( ¯ ̄ )
[[=´=]] => 2 ( ´´ )
[[=·=]] => 2 ( ·· )
[[=¼=]] => 4 ( ¼୲൳꠰ )
[[=½=]] => 6 ( ½୳൴༪⳽꠱ )
[[=¾=]] => 4 ( ¾୴൵꠲ )
[[=Þ=]] => 6 ( ÞþꝤꝥꝦꝧ )
Some double-letter equivalence classes allow you to find the right single character to use, instead of the two separate letters:
[[=AE=]] = [[=Ae=]] = [[=ae=]] => 11 ( ÆæǢǣǼǽᴁᴂᴭᵆᷔ )
[[=CH=]] = [[=Ch=]] = [[=ch=]] => 0 ( ? )
[[=DZ=]] = [[=Dz=]] = [[=dz=]] => 6 ( DŽDždžDZDzdz )
[[=LJ=]] = [[=Lj=]] = [[=lj=]] => 3 ( LJLjlj )
[[=LL=]] = [[=Ll=]] = [[=ll=]] => 2 ( Ỻỻ )
[[=NJ=]] = [[=Nj=]] = [[=nj=]] => 3 ( NJNjnj )
[[=SS=]] = [[=Ss=]] = [[=ss=]] => 2 ( ßẞ )
However, the use of these digraph characters is quite delicate! Let's consider the 7 digraph collating elements below, in their various cases:
[[.AE.]] [[.Ae.]] [[.ae.]] ( European Ligature )
[[.CH.]] [[.Ch.]] [[.ch.]] ( Spanish )
[[.DZ.]] [[.Dz.]] [[.dz.]] ( Hungarian, Polish, Slovakian, Serbo-Croatian )
[[.LJ.]] [[.Lj.]] [[.lj.]] ( Serbo-Croatian )
[[.LL.]] [[.Ll.]] [[.ll.]] ( Spanish )
[[.NJ.]] [[.Nj.]] [[.nj.]] ( Serbo-Croatian )
[[.SS.]] [[.Ss.]] [[.ss.]] ( German )
Given that:
LJ 01C7 LATIN CAPITAL LETTER LJ
Lj 01C8 LATIN CAPITAL LETTER L WITH SMALL LETTER J
lj 01C9 LATIN SMALL LETTER LJ
DZ 01F1 LATIN CAPITAL LETTER DZ
Dz 01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z
dz 01F3 LATIN SMALL LETTER DZ
If we apply the regex [[.dz.]-[.lj.][=dz=][=lj=]] against the text bcddzʣefghiijjklljljmn, pasted into a new tab, Columns++ finds these 12 matches:
dz
dz
e
f
g
h
i
j
k
l
lj
lj
To sum up, @coises, the key points of your third experimental version are:
A major regex engine, implemented in UTF-32, which correctly handles all the Unicode characters, from \x{0} to \x{0010FFFF}, and correctly manages all the Unicode character classes \p{Xy} or [[:Xy:]]
Additional features such as \i, \m, \o and \y, and their complements
The \X regex feature ( \M\m* ) correctly works for characters OVER the BMP
The invalid UTF-8 characters may be kept, replaced or deleted ( FIND \i+, REPLACE ABC $1 XYZ )
The NUL character can be placed in replacement ( FIND ABC\x00XYZ, REPLACE \x0--$0--\x{00} )
Correct handling of case replacements, even for accented characters ( FIND (?-s)., REPLACE \U$0 )
The \K feature ALSO works in a step-by-step replacement with the Replace button ( FIND ^.{20}\K(.+), REPLACE --\1-- )
Finally, @coises, do you think it's worth testing some regex examples with possible replacements? I could test some tricky regexes to check the robustness of your final UTF-32 version, if necessary.
Best Regards,
guy038