    Columns++: Where regex meets Unicode (Here there be dragons!)

    • guy038G
      guy038
      last edited by guy038

      Hello, @coises and All,

      Thanks for adding the 4 remaining elements : so we’ll get a round number of collating elements : 120 !

      You said :

      Notepad++ search appears to include most of the Cc and Cf characters in the basic multilingual plane, so I made it Cc + Cf. I’ll change that to Cc only

      I confirm that, in your second version, [[:cntrl:]] = \p{Cc} + \p{Cf} = 65 + 170 = 235 and thanks for the future modification


      You said :

      Shouldn’t \l / \u, [:lower:] / [:upper:] and [:Ll:] / [:Lu:] all be the same? The title case characters are [:Lt:].

      What do you mean ? Presently, in your second version, that is exactly the case, as shown below, or am I missing something obvious !?

      
      [[:upper:]]   =  \p{upper}  =  [[:u:]]  =  \p{u}  =  \pu  =  \u  =  \p{Lu}  =  \p{Uppercase Letter}  =  [[:Lu:]]   an UPPER case letter  =  1,858
      
      [[:lower:]]   =  \p{lower}  =  [[:l:]]  =  \p{l}  =  \pl  =  \l  =  \p{Ll}  =  \p{Lowercase Letter}  =  [[:Ll:]]   a  LOWER case letter  =  2,258
      

      BTW, I didn’t know that the syntax of a Unicode character class \p{Xy} could also be expressed as [[:Xy:]] !

      Best Regards,

      guy038

      • CoisesC
        Coises @guy038
        last edited by

        @guy038 said in Columns++: Where regex meets Unicode (Here there be dragons!):

        You said :

        Shouldn’t \l / \u, [:lower:] / [:upper:] and [:Ll:] / [:Lu:] all be the same? The title case characters are [:Lt:].

        What do you mean ? Presently, in your second version, that is exactly the case, as shown below, or am I missing something obvious !?

        I don’t think you missed anything. I think I might have misunderstood you. I thought you were saying that [:lower:] and [:upper:] and/or \l and \u should match the [:Lt:] characters, so that those 31 characters are both upper case and lower case. Perhaps we are agreed that they are neither.

        BTW, I didn’t know that the syntax of a Unicode character class \p{Xy} could also be expressed as [[:Xy:]] !

        Boost::regex is built such that \p{whatever} and [[:whatever:]] are the same. It also “delegates” backslash lower case letter escapes that don’t have any other meaning to classes with the same name, and upper case escapes without another meaning to the complements; so \s is internally “defined” as [[:s:]] and \S as [^[:s:]]. That’s how I was able to define \i, \m, \o and \y. It’s also why we have to write \p{L*} instead of \p{L}: class names are case insensitive, and “l” already defines \l as lower case. For consistency, all the Unicode general category groups use the asterisk notation.
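
        If it helps to see that delegation from the outside, here is a minimal sketch (assuming a stock Boost::regex build; the exact list of recognized class names can vary by version) that counts matches for an escape and for its named-class spellings:

            #include <boost/regex.hpp>
            #include <iostream>
            #include <iterator>
            #include <string>

            // Count how many times `pattern` matches inside `text`.
            static int count(const std::string& pattern, const std::string& text) {
                boost::regex re(pattern);
                return static_cast<int>(std::distance(
                    boost::sregex_iterator(text.begin(), text.end(), re),
                    boost::sregex_iterator()));
            }

            int main() {
                std::string text = "abc 123 def 45";
                std::cout << count("\\d", text)     << "\n";   // 5 digits
                std::cout << count("\\p{d}", text)  << "\n";   // 5: \p{d} names the same class
                std::cout << count("[[:d:]]", text) << "\n";   // 5: so does [[:d:]]
                std::cout << count("\\D", text)     << "\n";   // 9: the complement, like [^[:d:]]
            }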

        • Alan KilbornA
          Alan Kilborn @Coises
          last edited by

          @Coises

          I’m hopeful that the real end goal for all of this is integration into native Notepad++, and that the plugin is really just a “testbed” for what you’re doing. Columns++ is great, but this is about core unicode searching, and as such really belongs in the standard product.

          It’s great that a person has finally been found that’s capable of (and interested in) doing this stuff, and it would be a shame if Notepad++ moves forward without the benefits of this work.

          Thank you for your work.

          • CoisesC
            Coises @Alan Kilborn
            last edited by Coises

            @Alan-Kilborn said in Columns++: Where regex meets Unicode (Here there be dragons!):

            I’m hopeful that the real end goal for all of this is integration into native Notepad++, and that the plugin is really just a “testbed” for what you’re doing. Columns++ is great, but this is about core unicode searching, and as such really belongs in the standard product.

            For now I’m focusing on making the search in Columns++ as good as I can make it within the bounds of what I’ve intended search in Columns++ to accomplish. I don’t know that I can get this to where I’m comfortable calling it “stable” before the plugins list for the just-announced Notepad++ 8.7.8 release is frozen, but that’s the limit of my ambition at this point.

            I do hope that once it has been in use for a time, it can serve as a proof of viability — and maybe a bit of pressure — to incorporate better Unicode searching into Notepad++. That would be a massive code change, though, and unfortunately not everything can be simply copied from the way I’m doing it. (Columns++ uses Boost::regex directly; Notepad++ integrates Boost::regex with Scintilla and then uses the upgraded Scintilla search. Most of the same principles should apply, but details, details… details are where the bugs live.)

            There will also surely be a repeat of the same question I faced: whether to use ad hoc code or somehow incorporate ICU, which Boost::regex can use. And since Windows 10 version 1703 (but changing in 1709 and again in 1903), Windows incorporates a stripped-down version of ICU. It appears that Boost::regex can’t use that, but perhaps Boost will fix that someday, or perhaps I or someone else will find a way to connect them. By the time this could be considered for Notepad++, it might be plausible to limit new versions to Win 10 version 1903 or later. Avoiding bespoke code would minimize the possibility of future maintenance burdens for Notepad++. So there will be a lot to consider.

            Thank you for your kind words and encouragement, Alan.

            • CoisesC
              Coises
              last edited by Coises

              I’ve posted Columns++ for Notepad++ version 1.1.5.3-Experimental.

              Changes:

              • Search in Columns++ shows a progress dialog when it estimates that a count, select or replace all operation will take more than two seconds. That should make apparent freezes (which were observed when attempting select all for expressions that make tens or hundreds of thousands of separate matches) far less likely to happen. (Note that this is not connected to the “Expression too complex” situation; this happens when the expression is reasonable, but there are an extremely high number of matches.)

              • [[:cntrl:]] matches only Unicode General Category Cc characters. Mnemonics for formatting characters [[.sflo.]], [[.sfco.]], [[.sfds.]] and [[.sfus.]] work.

              • I corrected an error that would have caused equivalence classes (e.g., [[=a=]]) to fail for characters U+10000 and above. However, I don’t know if there are any working equivalence classes for characters U+10000 and above, anyway. (Present support for those is dependent on a Windows function; it appears to me that it might not process surrogate pairs in a useful way.)

              • There were other organizational changes.

              As always comments, observations and suggestions are most welcome. My aim is for this to be the last “experimental” release in this series, if nothing awful happens… in which case the major remaining thing to be done before a normal release is documentation.

              • guy038G
                guy038
                last edited by guy038

                Hi, @coises and All,

                First, here is the summary of the contents of the Total_Chars.txt file :

                    •----------------•---------•------------------------------------------•-----------•-------------------------------------------------•-----------•
                    |     Range      |  Plane  |      COUNT / MARK of ALL characters      |  # Chars  |    COUNT / MARK of ALL UNASSIGNED characters    |  # Unas.  |
                    •----------------•---------•------------------------------------------•-----------•-------------------------------------------------•-----------•
                    |   0000...FFFD  |     0   |  [\x{0000}-\x{FFFD}]                     |   63,454  |  (?=[\x{0000}-\x{D7FF}]|[\x{F900}-\x{FFFD}])\Y  |    1,398  |
                    |  10000..1FFFD  |     1   |  [\x{10000}-\x{1FFFD}]                   |   65,534  |  (?=[\x{10000}-\x{1FFFD}])\Y                    |   37,090  |
                    |  20000..2FFFD  |     2   |  [\x{20000}-\x{2FFFD}]                   |   65,534  |  (?=[\x{20000}-\x{2FFFD}])\Y                    |    4,039  |
                    |  30000..3FFFD  |     3   |  [\x{30000}-\x{3FFFD}]                   |   65,534  |  (?=[\x{30000}-\x{3FFFD}])\Y                    |   56,403  |
                    |  E0000..EFFFD  |    14   |  [\x{E01F0}-\x{EFFFD}]                   |   65,534  |  (?=[\x{E0000}-\x{EFFFD}])\Y                    |   65,197  |
                    •----------------•---------•------------------------------------------•-----------•-------------------------------------------------•-----------•
                    |  00000..EFFFD  |         |  (?s).   \I   \p{Any}   [\x0-\x{EFFFD}]  |  325,590  |  (?![\x{E000}-\x{F8FF}])\Y   \p{Not Assigned}   |  164,127  |
                    •----------------•---------•------------------------------------------•-----------•-------------------------------------------------•-----------•
                

                Indeed, I cannot post my new Unicode_Col++.txt file, in its entirety, with the detail of all the Unicode blocks ( Too large ! ). However, it will be part of my future Unicode.zip archive that I’ll post on my Google Drive account !


                Now, I tested your third experimental version of Columns++ and everything works as you surely expect !

                You said :

                Search in Columns++ shows a progress dialog when it estimates that a count, select or replace all operation will take more than two seconds…

                I am pleased to tell you that, with this new feature, my laptop did not hang any more ! For example, I tried to select all the matches of the regex (?s)., against my Total_Chars.txt file and, with the progress dialog on my HP ProBook 450 G8 / Windows 10 Pro 64 / Version 21H1 / Intel® Core™ i7 / RAM 32 GB DDR4-3200 MHz, after 8 m 21s, the green zone was complete and it said : 325 590 matches selected ! I even copied all this selection into a new tab and, after removing all \r\n line-breaks, the ComparePlus plugin did not find any difference between Total_Chars.txt and this new tab !


                You said :

                [[:cntrl:]] matches only Unicode General Category Cc characters. Mnemonics for formatting characters [[.sflo.]], [[.sfco.]], [[.sfds.]] and [[.sfus.]] work.

                I confirm that these two changes are effective


                Now, I particularly tested the Equivalence classes feature. You can refer to the following link :

                https://unicode.org/charts/collation/index.html

                And also consult the help at :

                https://unicode.org/charts/collation/help.html

                For the letter a, this site lists 160 equivalents of the a letter

                However, against the Total_Chars.txt file, the regex [[=a=]] returns 86 matches. So we can deduce that :

                • A lot of equivalences are not found with the [[=a=]] regex

                • Some equivalents, not shown at this link, can be found with the [[=a=]] regex. It’s the case with the \x{249C} character ( PARENTHESIZED LATIN SMALL LETTER A ) !

                This situation happens with any character : for example, the regex [[=1=]] finds 54 matches, but, on the site, it shows 209 equivalences to the digit 1

                Now, with your experimental UTF-32 version, you can use any other equivalent character of the a letter to get the 86 matches ( [[=Ⱥ=]], [[=ⱥ=]], [[=Ɐ=]], … ). Note that, with our present Boost regex engine, some equivalences do not return the 86 matches. It’s the case for the regexes :

                [[=ɐ=]], [[=ɑ=]], [[=ɒ=]], [[=ͣ=]] , [[=ᵃ=]], [[=ᵄ=]], [[=ⱥ=]], [[=Ɑ=]], [[=Ɐ=]], [[=Ɒ=]]

                Thus, your version is more coherent, as it does give the same result, whatever the char used in the equivalence class regex !


                Below is the list of all the equivalences of any char of the Windows-1252 code-page, from \x{0020} to \x{00DE}. Note that, except for the DEL character, given as an example, I did not consider the equivalence classes which return only one match !

                I also confirm that I did not find any character over \x{FFFF} which would be part of a regex equivalence class, either with our Boost engine or with your Columns++ experimental version !

                [[= =]]    =   [[=space=]]                 =>     3    (     )
                [[=!=]]    =   [[=exclamation-mark=]]      =>     2    ( !! )
                [[="=]]    =   [[=quotation-mark=]]        =>     3    ( "⁍" )
                [[=#=]]    =   [[=number-sign=]]           =>     4    ( #؞⁗# )
                [[=$=]]    =   [[=dollar-sign=]]           =>     3    ( $⁒$ )
                [[=%=]]    =   [[=percent-sign=]]          =>     3    ( %⁏% )
                [[=&=]]    =   [[=ampersand=]]             =>     3    ( &⁋& )
                [[='=]]    =   [[=apostrophe=]]            =>     2    ( '' )
                [[=(=]]    =   [[=left-parenthesis=]]      =>     4    ( (⁽₍( )
                [[=)=]]    =   [[=right-parenthesis=]]     =>     4    ( )⁾₎) )
                [[=*=]]    =   [[=asterisk=]]              =>     2    ( ** )
                [[=+=]]    =   [[=plus-sign=]]             =>     6    ( +⁺₊﬩﹢+ )
                [[=,=]]    =   [[=comma=]]                 =>     2    ( ,, )
                [[=-=]]    =   [[=hyphen=]]                =>     3    ( -﹣- )
                [[=.=]]    =   [[=period=]]                =>     3    ( .․. )
                [[=/=]]    =   [[=slash=]]                 =>     2    ( // )
                [[=0=]]    =   [[=zero=]]                  =>    48    ( 0٠۟۠۰߀०০੦૦୦୵௦౦౸೦൦๐໐༠၀႐០᠐᥆᧐᪀᪐᭐᮰᱀᱐⁰₀↉⓪⓿〇㍘꘠ꛯ꠳꣐꤀꧐꩐꯰0 )
                [[=1=]]    =   [[=one=]]                   =>    54    ( 1¹١۱߁१১੧૧୧௧౧౹౼೧൧๑໑༡၁႑፩១᠑᥇᧑᧚᪁᪑᭑᮱᱁᱑₁①⑴⒈⓵⚀❶➀➊〡㋀㍙㏠꘡ꛦ꣑꤁꧑꩑꯱1 )
                [[=2=]]    =   [[=two=]]                   =>    54    ( 2²ƻ٢۲߂२২੨૨୨௨౨౺౽೨൨๒໒༢၂႒፪២᠒᥈᧒᪂᪒᭒᮲᱂᱒₂②⑵⒉⓶⚁❷➁➋〢㋁㍚㏡꘢ꛧ꣒꤂꧒꩒꯲2 )
                [[=3=]]    =   [[=three=]]                 =>    53    ( 3³٣۳߃३৩੩૩୩௩౩౻౾೩൩๓໓༣၃႓፫៣᠓᥉᧓᪃᪓᭓᮳᱃᱓₃③⑶⒊⓷⚂❸➂➌〣㋂㍛㏢꘣ꛨ꣓꤃꧓꩓꯳3 )
                [[=4=]]    =   [[=four=]]                  =>    51    ( 4٤۴߄४৪੪૪୪௪౪೪൪๔໔༤၄႔፬៤᠔᥊᧔᪄᪔᭔᮴᱄᱔⁴₄④⑷⒋⓸⚃❹➃➍〤㋃㍜㏣꘤ꛩ꣔꤄꧔꩔꯴4 )
                [[=5=]]    =   [[=five=]]                  =>    53    ( 5Ƽƽ٥۵߅५৫੫૫୫௫౫೫൫๕໕༥၅႕፭៥᠕᥋᧕᪅᪕᭕᮵᱅᱕⁵₅⑤⑸⒌⓹⚄❺➄➎〥㋄㍝㏤꘥ꛪ꣕꤅꧕꩕꯵5 )
                [[=6=]]    =   [[=six=]]                   =>    52    ( 6٦۶߆६৬੬૬୬௬౬೬൬๖໖༦၆႖፮៦᠖᥌᧖᪆᪖᭖᮶᱆᱖⁶₆ↅ⑥⑹⒍⓺⚅❻➅➏〦㋅㍞㏥꘦ꛫ꣖꤆꧖꩖꯶6 )
                [[=7=]]    =   [[=seven=]]                 =>    50    ( 7٧۷߇७৭੭૭୭௭౭೭൭๗໗༧၇႗፯៧᠗᥍᧗᪇᪗᭗᮷᱇᱗⁷₇⑦⑺⒎⓻❼➆➐〧㋆㍟㏦꘧ꛬ꣗꤇꧗꩗꯷7 )
                [[=8=]]    =   [[=eight=]]                 =>    50    ( 8٨۸߈८৮੮૮୮௮౮೮൮๘໘༨၈႘፰៨᠘᥎᧘᪈᪘᭘᮸᱈᱘⁸₈⑧⑻⒏⓼❽➇➑〨㋇㍠㏧꘨ꛭ꣘꤈꧘꩘꯸8 )
                [[=9=]]    =   [[=nine=]]                  =>    50    ( 9٩۹߉९৯੯૯୯௯౯೯൯๙໙༩၉႙፱៩᠙᥏᧙᪉᪙᭙᮹᱉᱙⁹₉⑨⑼⒐⓽❾➈➒〩㋈㍡㏨꘩ꛮ꣙꤉꧙꩙꯹9 )
                [[=:=]]    =   [[=colon=]]                 =>     2    ( :: )
                [[=;=]]    =   [[=semicolon=]]             =>     3    ( ;;; )
                [[=<=]]    =   [[=less-than-sign=]]        =>     3    ( <﹤< )
                [[===]]    =   [[=equals-sign=]]           =>     5    ( =⁼₌﹦= )
                [[=>=]]    =   [[=greater-than-sign=]]     =>     3    ( >﹥> )
                [[=?=]]    =   [[=question-mark=]]         =>     2    ( ?? )
                [[=@=]]    =   [[=commercial-at=]]         =>     2    ( @@ )
                [[=A=]]                                    =>    86    ( AaªÀÁÂÃÄÅàáâãäåĀāĂ㥹ǍǎǞǟǠǡǺǻȀȁȂȃȦȧȺɐɑɒͣᴀᴬᵃᵄᶏᶐᶛᷓḀḁẚẠạẢảẤấẦầẨẩẪẫẬậẮắẰằẲẳẴẵẶặₐÅ⒜ⒶⓐⱥⱭⱯⱰAa )
                [[=B=]]                                    =>    29    ( BbƀƁƂƃƄƅɃɓʙᴃᴮᴯᵇᵬᶀḂḃḄḅḆḇℬ⒝ⒷⓑBb )
                [[=C=]]                                    =>    40    ( CcÇçĆćĈĉĊċČčƆƇƈȻȼɔɕʗͨᴄᴐᵓᶗᶜᶝᷗḈḉℂ℃ℭ⒞ⒸⓒꜾꜿCc )
                [[=D=]]                                    =>    44    ( DdÐðĎďĐđƊƋƌƍɗʤͩᴅᴆᴰᵈᵭᶁᶑᶞᷘᷙḊḋḌḍḎḏḐḑḒḓⅅⅆ⒟ⒹⓓꝹꝺDd )
                [[=E=]]                                    =>    82    ( EeÈÉÊËèéêëĒēĔĕĖėĘęĚěƎƏǝȄȅȆȇȨȩɆɇɘəɚͤᴇᴱᴲᵉᵊᶒᶕḔḕḖḗḘḙḚḛḜḝẸẹẺẻẼẽẾếỀềỂểỄễỆệₑₔ℮ℯℰ⅀ⅇ⒠ⒺⓔⱸⱻEe )
                [[=F=]]                                    =>    22    ( FfƑƒᵮᶂᶠḞḟ℉ℱℲⅎ⒡ⒻⓕꜰꝻꝼꟻFf )
                [[=G=]]                                    =>    45    ( GgĜĝĞğĠġĢģƓƔǤǥǦǧǴǵɠɡɢɣɤʛˠᴳᵍᵷᶃᶢᷚᷛḠḡℊ⅁⒢ⒼⓖꝾꝿꞠꞡGg )
                [[=H=]]                                    =>    41    ( HhĤĥĦħȞȟɥɦʜʰʱͪᴴᶣḢḣḤḥḦḧḨḩḪḫẖₕℋℌℍℎℏ⒣ⒽⓗⱧⱨꞍHh )
                [[=I=]]                                    =>    61    ( IiÌÍÎÏìíîïĨĩĪīĬĭĮįİıƖƗǏǐȈȉȊȋɨɩɪͥᴉᴵᵎᵢᵻᵼᶖᶤᶥᶦᶧḬḭḮḯỈỉỊịⁱℐℑⅈ⒤ⒾⓘꟾIi )
                [[=J=]]                                    =>    23    ( JjĴĵǰȷɈɉɟʄʝʲᴊᴶᶡᶨⅉ⒥ⒿⓙⱼJj )
                [[=K=]]                                    =>    38    ( KkĶķĸƘƙǨǩʞᴋᴷᵏᶄᷜḰḱḲḳḴḵₖK⒦ⓀⓚⱩⱪꝀꝁꝂꝃꝄꝅꞢꞣKk )
                [[=L=]]                                    =>    56    ( LlĹĺĻļĽľĿŀŁłƚƛȽɫɬɭɮʟˡᴌᴸᶅᶩᶪᶫᷝᷞḶḷḸḹḺḻḼḽₗℒℓ⅂⅃⒧ⓁⓛⱠⱡⱢꝆꝇꝈꝉꞀꞁLl )
                [[=M=]]                                    =>    33    ( MmƜɯɰɱͫᴍᴟᴹᵐᵚᵯᶆᶬᶭᷟḾḿṀṁṂṃₘℳ⒨ⓂⓜⱮꝳꟽMm )
                [[=N=]]                                    =>    47    ( NnÑñŃńŅņŇňʼnƝƞǸǹȠɲɳɴᴎᴺᴻᵰᶇᶮᶯᶰᷠᷡṄṅṆṇṈṉṊṋⁿₙℕ⒩ⓃⓝꞤꞥNn )
                [[=O=]]                                    =>   106    ( OoºÒÓÔÕÖØòóôõöøŌōŎŏŐőƟƠơƢƣǑǒǪǫǬǭǾǿȌȍȎȏȪȫȬȭȮȯȰȱɵɶɷͦᴏᴑᴒᴓᴕᴖᴗᴼᵒᵔᵕᶱṌṍṎṏṐṑṒṓỌọỎỏỐốỒồỔổỖỗỘộỚớỜờỞởỠỡỢợₒℴ⒪ⓄⓞⱺꝊꝋꝌꝍOo )
                [[=P=]]                                    =>    33    ( PpƤƥɸᴘᴾᵖᵱᵽᶈᶲṔṕṖṗₚ℘ℙ⒫ⓅⓟⱣⱷꝐꝑꝒꝓꝔꝕꟼPp )
                [[=Q=]]                                    =>    16    ( QqɊɋʠℚ℺⒬ⓆⓠꝖꝗꝘꝙQq )
                [[=R=]]                                    =>    64    ( RrŔŕŖŗŘřƦȐȑȒȓɌɍɹɺɻɼɽɾɿʀʁʳʴʵʶͬᴙᴚᴿᵣᵲᵳᶉᷢᷣṘṙṚṛṜṝṞṟℛℜℝ⒭ⓇⓡⱤⱹꝚꝛꝜꝝꝵꝶꞂꞃRr )
                [[=S=]]                                    =>    47    ( SsŚśŜŝŞşŠšƧƨƩƪȘșȿʂʃʅʆˢᵴᶊᶋᶘᶳᶴᷤṠṡṢṣṤṥṦṧṨṩₛ⒮ⓈⓢⱾꜱSs )
                [[=T=]]                                    =>    46    ( TtŢţŤťƫƬƭƮȚțȶȾʇʈʧʨͭᴛᵀᵗᵵᶵᶿṪṫṬṭṮṯṰṱẗₜ⒯ⓉⓣⱦꜨꜩꝷꞆꞇTt )
                [[=U=]]                                    =>    82    ( UuÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųƯưƱǓǔǕǖǗǘǙǚǛǜȔȕȖȗɄʉʊͧᴜᵁᵘᵤᵾᵿᶙᶶᶷᶸṲṳṴṵṶṷṸṹṺṻỤụỦủỨứỪừỬửỮữỰự⒰ⓊⓤUu )
                [[=V=]]                                    =>    29    ( VvƲɅʋʌͮᴠᵛᵥᶌᶹᶺṼṽṾṿỼỽ⒱ⓋⓥⱱⱴⱽꝞꝟVv )
                [[=W=]]                                    =>    28    ( WwŴŵƿǷʍʷᴡᵂẀẁẂẃẄẅẆẇẈẉẘ⒲ⓌⓦⱲⱳWw )
                [[=X=]]                                    =>    15    ( XxˣͯᶍẊẋẌẍₓ⒳ⓍⓧXx )
                [[=Y=]]                                    =>    36    ( YyÝýÿŶŷŸƳƴȲȳɎɏʎʏʸẎẏẙỲỳỴỵỶỷỸỹỾỿ⅄⒴ⓎⓨYy )
                [[=Z=]]                                    =>    41    ( ZzŹźŻżŽžƵƶȤȥɀʐʑᴢᵶᶎᶻᶼᶽᷦẐẑẒẓẔẕℤ℥ℨ⒵ⓏⓩⱫⱬⱿꝢꝣZz )
                [[=[=]]    =  [[=left-square-bracket=]]    =>     2    ( [[ )
                [[=\=]]    =  [[=backslash=]]              =>     2    ( \\ )
                [[=]=]]    =  [[=right-square-bracket=]]   =>     2    ( ]] )
                [[=^=]]    =  [[=circumflex=]]             =>     3    ( ^ˆ^ )
                [[=_=]]    =  [[=underscore=]]             =>     2    ( __ )
                [[=`=]]    =  [[=grave-accent=]]           =>     4    ( `ˋ`` )
                [[={=]]    =  [[=left-curly-bracket=]]     =>     2    ( {{ )
                [[=|=]]    =  [[=vertical-line=]]          =>     2    ( || )
                [[=}=]]    =  [[=right-curly-bracket=]]    =>     2    ( }} )
                [[=~=]]    =  [[=tilde=]]                  =>     2    ( ~~ )
                [[==]]  =  [[=DEL=]]                    =>     1    (  )
                [[=Œ=]]                                    =>     2    ( Œœ )
                [[=¢=]]                                    =>     3    ( ¢《¢ )
                [[=£=]]                                    =>     3    ( £︽£ )
                [[=¤=]]                                    =>     2    ( ¤》 )
                [[=¥=]]                                    =>     3    ( ¥︾¥ )
                [[=¦=]]                                    =>     2    ( ¦¦ )
                [[=¬=]]                                    =>     2    ( ¬¬ )
                [[=¯=]]                                    =>     2    ( ¯ ̄ )
                [[=´=]]                                    =>     2    ( ´´ )
                [[=·=]]                                    =>     2    ( ·· )
                [[=¼=]]                                    =>     4    ( ¼୲൳꠰ )
                [[=½=]]                                    =>     6    ( ½୳൴༪⳽꠱ )
                [[=¾=]]                                    =>     4    ( ¾୴൵꠲ )
                [[=Þ=]]                                    =>     6    ( ÞþꝤꝥꝦꝧ )
                

                Some double-letter characters give some equivalences which allow you to get the right single char to use, instead of the two trivial letters :

                [[=AE=]] = [[=Ae=]] = [[=ae=]] =>  11 ( ÆæǢǣǼǽᴁᴂᴭᵆᷔ )
                [[=CH=]] = [[=Ch=]] = [[=ch=]] =>   0 ( ? )
                [[=DZ=]] = [[=Dz=]] = [[=dz=]] =>   6 ( DŽDždžDZDzdz )
                [[=LJ=]] = [[=Lj=]] = [[=lj=]] =>   3 ( LJLjlj )
                [[=LL=]] = [[=Ll=]] = [[=ll=]] =>   2 ( Ỻỻ )
                [[=NJ=]] = [[=Nj=]] = [[=nj=]] =>   3 ( NJNjnj )
                [[=SS=]] = [[=Ss=]] = [[=ss=]] =>   2 ( ßẞ )
                

                However, the use of these di-graph characters is quite delicate ! Let’s consider these 7 di-graph collating elements, below, with various cases :

                [[.AE.]]    [[.Ae.]]    [[.ae.]]    ( European Ligature )
                [[.CH.]]    [[.Ch.]]    [[.ch.]]    ( Spanish )
                [[.DZ.]]    [[.Dz.]]    [[.dz.]]    ( Hungarian, Polish, Slovakian, Serbo-Croatian )
                [[.LJ.]]    [[.Lj.]]    [[.lj.]]    ( Serbo-Croatian )
                [[.LL.]]    [[.Ll.]]    [[.ll.]]    ( Spanish )
                [[.NJ.]]    [[.Nj.]]    [[.nj.]]    ( Serbo-Croatian )
                [[.SS.]]    [[.Ss.]]    [[.ss.]]    ( German )
                

                As we know that :

                  LJ  01C7  LATIN CAPITAL LETTER LJ
                  Lj  01C8  LATIN CAPITAL LETTER L WITH SMALL LETTER J
                  lj  01C9  LATIN SMALL LETTER LJ
                
                  DZ  01F1  LATIN CAPITAL LETTER DZ
                  Dz  01F2  LATIN CAPITAL LETTER D WITH SMALL LETTER Z
                  dz  01F3  LATIN SMALL LETTER DZ
                

                If we apply the regex [[.dz.]-[.lj.][=dz=][=lj=]] against the text bcddzdzefghiijjklljljmn, pasted in a new tab, Columns++ would find 12 matches :

                dz
                dz
                e
                f
                g
                h
                i
                j
                k
                l
                lj
                lj
                

                To sum up, @coises, the key points, of your third experimental version, are :

                • A major regex engine, implemented in UTF-32, which correctly handles all the Unicode characters, from \x{0} to \x{0010FFFF}, and correctly manages all the Unicode character classes \p{Xy} or [[:Xy:]]

                • Additional features such as \i, \m, \o and \y and their complements

                • The \X regex feature ( \M\m* ) correctly works for characters OVER the BMP

                • The invalid UTF-8 characters may be kept, replaced or deleted ( FIND \i+, REPLACE ABC $1 XYZ )

                • The NUL character can be placed in replacement ( FIND ABC\x00XYZ, REPLACE \x0--$0--\x{00} )

                • Correct handling of case replacements, even for accented characters ( FIND (?-s). REPLACE \U$0 )

                • The \K feature ALSO works in a step-by-step replacement with the Replace button ( FIND ^.{20}\K(.+), REPLACE --\1-- )


                To end, @coises, do you think it’s worth testing some regex examples with possible replacements ? I could test some tricky regexes to check the robustness of your final UTF-32 version, if necessary ?

                Best Regards,

                guy038

                • Alan KilbornA
                  Alan Kilborn @guy038
                  last edited by Alan Kilborn

                  @guy038 said:

                  The \K feature ALSO works in a step-by-step replacement with the Replace button

                  That’s major. Perhaps whatever change allows that could be factored out and put into native Notepad++?

                  (Again, I’m not one to say often that functionality that’s in a plugin should “go native”…but, when we’re talking about important find/replace functionality…it should).


                  I could test some tricky regexes to check the robustness of your final UTF-32 version

                  First, I’d encourage this further testing.

                  Second, is there a reason to mention UTF-32, specifically?

                  • CoisesC
                    Coises @guy038
                    last edited by Coises

                    @guy038 said in Columns++: Where regex meets Unicode (Here there be dragons!):

                    To end, @coises, do you think it’s worth testing some regex examples with possible replacements ? I could test some tricky regexes to check the robustness of your final UTF-32 version, if necessary ?

                    It would surely be helpful; but I now know there will be at least one more experimental version, so you might as well wait for that. I no longer expect to have this ready in time to be included in the plugins list for the next Notepad++ release.

                    It turns out that \X does not work correctly. Consider this text:

                    👍👍🏻👍🏼👍🏽👍🏾👍🏿
                    

                    There are six “graphical characters” there, but \X finds eleven (if you copy without a line ending). It turns out the rules for identifying grapheme cluster breaks are complex, and Boost::regex does not implement them correctly. (As far as I can tell, Scintilla is agnostic about this. Selections go by code point — stepping with the right arrow key, you can see the cursor move to the middle of any character comprised of multiple code points. I think Scintilla depends on the fonts and the display engine to render grapheme clusters properly, but I haven’t verified that.)

                    So I’m working on making that work properly. I think I’ve found a way, but work is still in progress.
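
                    For anyone who wants to see those segmentation rules in action outside a regex engine, here is a rough sketch using ICU’s character (grapheme cluster) break iterator, which implements the UAX #29 rules; it is only an illustration, not the approach used in Columns++:

                        #include <unicode/brkiter.h>   // icu::BreakIterator
                        #include <unicode/locid.h>
                        #include <unicode/unistr.h>
                        #include <iostream>
                        #include <memory>

                        // Count extended grapheme clusters ("user-perceived characters").
                        int main() {
                            // Six thumbs-up: one plain, five followed by a skin-tone modifier.
                            const UChar32 cps[] = { 0x1F44D,
                                                    0x1F44D, 0x1F3FB,  0x1F44D, 0x1F3FC,
                                                    0x1F44D, 0x1F3FD,  0x1F44D, 0x1F3FE,
                                                    0x1F44D, 0x1F3FF };
                            icu::UnicodeString text;
                            for (UChar32 c : cps) text.append(c);

                            UErrorCode status = U_ZERO_ERROR;
                            std::unique_ptr<icu::BreakIterator> brk(
                                icu::BreakIterator::createCharacterInstance(
                                    icu::Locale::getRoot(), status));
                            brk->setText(text);

                            int clusters = 0;
                            brk->first();
                            while (brk->next() != icu::BreakIterator::DONE) ++clusters;
                            std::cout << clusters << " grapheme clusters\n";   // expected: 6, not 11
                        }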

                    I will also look at the equivalence classes problems you identified. Thank you for that information! It will help greatly.

                    I’ve had a couple thoughts, and I’m wondering what others think:

                    • I find the character class matching when Match case is not checked (or (?i) is used) absurd. Boost::regex makes \l and \u match all [[:alpha:]] characters (not just cased letters), and the Unicode classes become entirely erratic. I can’t think of any named character classes that would be less useful if case insensitivity were ignored when matching them. If it’s possible to do that — so that, for example, \u still matches only upper case characters even when Match case is not checked — would others find that an improvement? Would anyone find it problematic? (This wouldn’t affect classes specified with explicit characters, like [aeiou] or [A-F]: Match case would still control how those match. If I can accomplish this as I intend, only the Unicode “General Category” character classes, \l, \u, [:lower:], [:upper:] and obvious correlates would be changed to ignore case insensitivity and always test the document text as written.)

                    • Should there be an option to make the POSIX classes and their escapes (such as \s, \w, [[:alnum:]], [[:punct:]]) match only ASCII characters? Unfortunately, I don’t see any reasonable way to make that an in-expression switch like (?i); if it were done at all, it would have to be a checkbox that would apply to the entire expression. Would this help anyone, or just add complication for little value?

                    • Does anyone care much about having Unicode script properties available as regex properties (e.g., \p{Greek}, \p{Hebrew}, \p{Latin})?

                    • Does anyone care much about having Unicode character names available (e.g., [[.GREEK SMALL LETTER FINAL SIGMA.]] equivalent to \x{03C2})? My thought is that including those will make the module much larger, and that by the time you’ve looked up the exact way the name has to be given, you could just look up the hexadecimal code point anyway.

                    • CoisesC
                      Coises @Alan Kilborn
                      last edited by

                      @Alan-Kilborn said in Columns++: Where regex meets Unicode (Here there be dragons!):

                      @guy038 said:

                      The \K feature ALSO works in a step-by-step replacement with the Replace button

                      That’s major. Perhaps whatever change allows that could be factored out and put into native Notepad++?

                      When doing a Replace, the first step is to check that the selection matches the search expression. In general, this fails when \K is used because an expression using \K doesn’t select starting from where it matched.

                      What Columns++ does is to remember the starting position for the last successful find (whether it was from Find or as part of a Replace). Then, when Replace is clicked, it checks starting from there rather than from the selection.

                      It’s been a while since I wrote that code, and I don’t remember exactly why, but I put in a number of checks to be sure the remembered starting point is still valid. In particular, if focus leaves the Search dialog (meaning the user might have changed the selection or the text), the position memory is marked invalid and a normal search starting from the beginning of the selection is used. I think the reason was to be sure that if the user intends to start a new search (by clicking in a new position or changing the selection), it would be important not to start from some remembered (and now meaningless) point.

                      Off hand, I don’t see any reason the same principle couldn’t be applied to Notepad++ search. It wouldn’t be a matter of just copying, though; someone would have to think through the logic from scratch in the context of how Notepad++ implements search.
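
                      For what it’s worth, the idea from the previous paragraphs could be sketched roughly like this (hypothetical names, standard C++ only, not the actual Columns++ code):

                          #include <cstddef>
                          #include <optional>

                          // Sketch of the "remembered start" idea: keep the position where the
                          // last successful Find began, and have Replace re-check the match from
                          // there instead of from the selection start, so \K still validates.
                          struct SearchMemory {
                              std::optional<std::size_t> lastFindStart;
                              void recordFind(std::size_t start) { lastFindStart = start; }
                              void invalidate() { lastFindStart.reset(); }  // e.g. dialog lost focus
                          };

                          std::size_t replaceCheckStart(const SearchMemory& mem,
                                                        std::size_t selectionStart) {
                              // Fall back to the normal behaviour when nothing valid is remembered.
                              return mem.lastFindStart.value_or(selectionStart);
                          }

                          int main() {
                              SearchMemory mem;
                              mem.recordFind(120);                             // Find matched from offset 120
                              std::size_t from = replaceCheckStart(mem, 135);  // selection starts later (\K)
                              return from == 120 ? 0 : 1;                      // Replace re-checks from 120
                          }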

                      • guy038G
                        guy038
                        last edited by guy038

                        Hello, @coises and All,

                        You said :

                        It turns out that \X does not work correctly. Consider this text:

                        Ah…, indeed, I spoke too quickly and/or did not test this part thoroughly ! Thanks for your investigations in this matter !


                        You said :

                        Should there be an option to make the POSIX classes and their escapes (such as \s, \w, [[:alnum:]], [[:punct:]]) match only ASCII characters ?

                        I do not think it’s necessary as we can provide the same behaviour with the following regexes :

                        • (?-i)(?=[[:ascii:]])\p{punct} or (?-i)(?=\p{punct})[[:ascii:]] gives 32 matches

                        • (?-i)(?=[[:ascii:]])\u or (?-i)(?=\u)[[:ascii:]] gives 26 matches

                        • (?-i)(?=[[:ascii:]])\l or (?-i)(?=\l)[[:ascii:]] gives 26 matches

                        However, note that the insensitive regexes (?i)(?=[[:ascii:]])\u or (?i)(?=\u)[[:ascii:]] or (?i)(?=[[:ascii:]])\l or (?i)(?=\l)[[:ascii:]] return a wrong result : 54 matches !

                        But, luckily, the sensitive regexes (?-i)(?=[[:ascii:]])[\u\l] or (?-i)(?=[\u\l])[[:ascii:]] do return 52 matches
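
                        As an illustration of this look-ahead intersection trick, here is a small sketch (using an explicit \x00-\x7F range in place of [[:ascii:]] so that it runs on a stock Boost::regex build; it only hints at the behaviour against Total_Chars.txt):

                            #include <boost/regex.hpp>
                            #include <iostream>
                            #include <iterator>
                            #include <string>

                            // Count how many times `pattern` matches inside `text`.
                            static int count(const std::string& pattern, const std::string& text) {
                                boost::regex re(pattern);
                                return static_cast<int>(std::distance(
                                    boost::sregex_iterator(text.begin(), text.end(), re),
                                    boost::sregex_iterator()));
                            }

                            int main() {
                                std::string text = "Hello, WORLD! 42 ...";
                                // Look-ahead intersection: characters that are both ASCII and punctuation.
                                std::cout << count("(?=[\\x00-\\x7F])[[:punct:]]", text) << "\n";   // 5
                                // The same intersection written the other way around.
                                std::cout << count("(?=[[:punct:]])[\\x00-\\x7F]", text) << "\n";   // 5
                            }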

                        See, right after, my opinion on the sensitive vs insensitive ways :


                        You said :

                        I find the character class matching when Match case is not checked (or (?i) is used) absurd. …

                        For example, let’s suppose that we run this regex (?-i)[A-F[:lower:]] against my Total_Chars.txt file. It does give 2,264 matches, so 6 UPPER letters + 2,258 LOWER letters

                        Now, if we run this same regex in an insensitive way, the (?i)[A-F[:lower:]] regex returns 141,029 matches. Of course, this result is erroneous but, first oddity, why 141,029 instead of 141,028 ( the total number of letters ) ?

                        Well, the ˮ character ( \x{02EE} ) is the last lowercase letter of the SPACING MODIFIER LETTERS block. As, within my file, this Unicode block is followed by the COMBINING DIACRITICAL MARKS block, it happens that an additional \x{0345} combining diacritical mark is tied to that \x{02EE} character ( I don’t know why !? )

                        But, actually, the non-sensitive regex (?i)[A-F[:lower:]] should be rewritten as the sensitive regex (?-i)[A-Fa-f[:upper:][:lower:]] which, in turn, is identical to the regex (?-i)[[:upper:][:lower:]] and correctly returns 4,116 matches ( so 1,858 UPPER letters + 2,258 LOWER letters )

                        So, as you cannot check the Match Case option of your own accord, I think that the simplest way would be, as long as the Regular expression radio button is checked :

                        • When a (?i) modifier is found within the regex

                        or

                        • When the Match Case option is unchecked

                        To show a message, saying :

                        The given regex may produce wrong results, particularly, if replacement is involved. Try to refactor this **insensitive** regex in a **sensitive** way !
                        

                        You said :

                        Does anyone care much about having Unicode script properties available as regex properties (e.g., \p{Greek}, \p{Hebrew}, \p{Latin})?

                        It might be useful, sometimes, to differentiate Unicode characters of a text, according to their scripts ( regions ). But it’s up to you : this should not be a primary goal !


                        You said :

                        Does anyone care much about having Unicode character names available (e.g., [[.GREEK SMALL LETTER FINAL SIGMA.]] equivalent to \x{03C2} ) ? …

                        I agree with your reasoning and I think there would be little interest in doing so. So, let’s keep this module fast enough, with the Unicode features included so far !


                        During my tests, I once searched for the [\p{L*}] regex and I was surprised to get 5 matches. Note that a Unicode character class CANNOT be part of a usual character class ! Thus my regex [\p{L*}] simply matched the five literal characters ‘*’, ‘L’, ‘p’, ‘{’ and ‘}’ !

                        Regarding the 31 characters of the Title Case Unicode category ( \p{Lt} ), you said, in a previous post, that you saw this particularity, with our present Boost engine, thanks to the regex (?-i)\u(?<=\l) or also (?-i)(?=\l)\u. This is possible because, presently, these chars are included as both UPPER and LOWER chars. However, with the Columns++ plugin, these two regexes correctly return 0 matches because the \p{Lt} class is not concerned !


                        I wish you good luck with the development of the next, and probably last, version of your experimental Columns++ plugin !

                        Best Regards,

                        guy038

                        P.S. :

                        When the Match case option is unchecked, in your Columns++ plugin, the following POSIX classes return a wrong number of occurrences, when applied against the Total_Chars file :

                        [[:ascii:]], [[:unicode:]], [[:upper:]], [[:lower:]], [[:word:]], [[:alnum:]] and [[:alpha:]]

                        • CoisesC
                          Coises @guy038
                          last edited by

                          @guy038 said in Columns++: Where regex meets Unicode (Here there be dragons!):

                          Now, if we run this same regex in an insensitive way, the (?i)[A-F[:lower:]] regex returns 141,029 matches. Of course, this result is erroneous but, first oddity, why 141,029 instead of 141,028 ( the total number of letters ) ?

                          Well, the ˮ character ( \x{02EE} ) is the last lowercase letter of the SPACING MODIFIER LETTERS block. As, within my file, this Unicode block is followed by the COMBINING DIACRITICAL MARKS block, it happens that an additional \x{0345} combining diacritical mark is tied to that \x{02EE} character ( I don’t know why !? )

                          This is a combination of the way Boost::regex handles case-insensitivity for character classes and a peculiarity of U+0345.

                          In general, case insensitivity makes use of “case folding.” (This is subtly different from just lower casing; for example, Greek capital sigma, small sigma and small final sigma all case fold to small sigma; that way, a case insensitive search for capital sigma will match both small sigma and small final sigma. But it also means a case-insensitive search for either small sigma or small final sigma will match both.) Boost::regex supports explicitly specifying the case folding algorithm for a custom character type, and I’m using the “simple case folding” defined by Unicode.

                          The Unicode file that defines it, CaseFolding.txt, includes the line:
                          U+0345; C; 03B9; # COMBINING GREEK YPOGEGRAMMENI
                          which says that U+0345 should case fold to U+03B9. U+0345 is a combining diacritical mark; U+03B9 is a lowercase letter. (Presumably this is because both uppercase to U+0399, Greek Capital Letter Iota. Since I don’t know Greek, I’d be speculating as to why it works this way, but the Unicode people probably know what they’re doing.)

                          Boost::regex does the case folding translation when matching character classes. This makes sense for [A-F], but it makes for nonsense when applied to [[:lower:]]. (Further confusion results from the fact that Boost::regex adds [[:alpha:]] to [[:lower:]] and [[:upper:]] when in case-insensitive mode. So all three match any code point which case folds to a letter.)
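
                          A toy model of that translation (a hand-made three-entry fold table, nothing like the real Unicode data) shows how a code point such as U+0345 ends up matching a case-insensitive "lower case" test:

                              #include <iostream>
                              #include <map>

                              // Toy model: apply simple case folding first, then test membership
                              // on the folded code point. U+0345 folds to U+03B9 (a lowercase
                              // letter), so a case-insensitive "lower" test accepts U+0345.
                              char32_t simpleFold(char32_t c) {
                                  static const std::map<char32_t, char32_t> fold = {
                                      { 0x03A3, 0x03C3 },  // capital sigma     -> small sigma
                                      { 0x03C2, 0x03C3 },  // small final sigma -> small sigma
                                      { 0x0345, 0x03B9 },  // COMBINING GREEK YPOGEGRAMMENI -> iota
                                  };
                                  auto it = fold.find(c);
                                  return it == fold.end() ? c : it->second;
                              }

                              bool isLowerToy(char32_t c) {        // stand-in for [[:lower:]]
                                  return (c >= U'a' && c <= U'z') || c == 0x03B9 || c == 0x03C3;
                              }

                              bool matchesLower(char32_t c, bool caseInsensitive) {
                                  return isLowerToy(caseInsensitive ? simpleFold(c) : c);
                              }

                              int main() {
                                  std::cout << matchesLower(0x0345, false) << "\n";  // 0: not lowercase itself
                                  std::cout << matchesLower(0x0345, true)  << "\n";  // 1: its fold U+03B9 is
                              }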

                          Behavior for the Unicode classes is even more bizarre, since not everything that changes when case folding changes to lowercase. (?i)\p{Lu} finds 644 matches.

                          I haven’t yet deeply investigated whether it is practical to change this behavior. You can perhaps see, though, why I think it should be changed.

                          Note that a Unicode character class CANNOT be part of a usual character class!

                          This is a Boost::regex characteristic (peculiarity)? The \p{...} escapes do not work inside square brackets. However, in all cases, \p{something} is equivalent to [[:something:]] and you can combine that class as usual; so (?-i)[A-F[:Ll:]] will work. It is equivalent to (?-i)[A-F[:lower:]] — but the case-insensitive versions are not equivalent (because Boost::regex silently adds [:alpha:] to case-insensitive [:lower:], but it has no knowledge of or special behavior for the Unicode classes).

                          • CoisesC
                            Coises @guy038
                            last edited by

                            @guy038 said in Columns++: Where regex meets Unicode (Here there be dragons!):

                            When the Match case option is unchecked, in your Columns++ plugin, the following POSIX classes return a wrong number of occurrences, when applied against the Total_Chars file :

                            [[:ascii:]], [[:unicode:]], [[:upper:]], [[:lower:]], [[:word:]], [[:alnum:]] and [[:alpha:]]

                            For [[:ascii:]], that’s because the two characters ſK (U+017F and U+212A) case fold to the ASCII letters s and k.

                            For [[:unicode:]], the five characters ŸſẞKÅ match (?-i)[[:unicode:]] but not (?i)[[:unicode:]], while µ matches (?i)[[:unicode:]] but not (?-i)[[:unicode:]]. I have no doubt the same sort of case folding analysis would explain this, but I haven’t done it.

                            For the others, the explanation is as in my previous post.

                            None of this can be “fixed” without changing the way the Boost::regex engine processes case-insensitivity for POSIX classes (and the Unicode general category property values, which are treated as named classes), because it is, in fact, behaving as designed. Whether it will be practical for me to override that design decision is next on my list of things to investigate (after I finish correcting \X).

                            • guy038G
                              guy038
                              last edited by guy038

                              Hi, @coises,

                              First, when I supposed that the management of \X was correct, in your experimental version, I was thinking of combining characters only, placed after the main character itself. And I did not think about the possible emoji combinations !


                              Now, I remember this post, from 2020, which could interest you :

                              https://community.notepad-plus-plus.org/post/57965

                              I simply recopied this part of this post, below, because it could help you to gather all information for a correct management of the \X regex syntax :

                              To be exhaustive, the different special characters, involved with the Emoji characters, are :

                              • The 2 Format Characters, U+200C and U+200D, in the Unicode block General Punctuation ( U+2000 - U+206F )

                              https://www.unicode.org/charts/PDF/U2000.pdf

                              • The 26 Regional Indicator Symbols, U+1F1E6 - U+1F1FF, in the Unicode block Enclosed Alphanumeric Supplement (U+1F100 – U+1F1FF )

                              https://www.unicode.org/charts/PDF/U1F100.pdf

                              • The 5 Emoji Modifiers, U+1F3FB - U+1F3FF, in the Unicode block Miscellaneous Symbols and Pictographs ( U+1F300 – U+1F5FF )

                              https://www.unicode.org/charts/PDF/U1F300.pdf

                              • The 4 Emoji Components, U+1F9B0 - U+1F9B3, in the Unicode block Supplemental Symbols and Pictographs ( U+1F900 – U+1F9FF )

                              https://www.unicode.org/charts/PDF/U1F900.pdf

                              • The 2 Emoji Variation Selectors, U+FE0E and U+FE0F, in the Unicode block Variation Selectors ( U+FE00 – U+FE0F )

                              https://www.unicode.org/charts/PDF/UFE00.pdf

                              Best Regards,

                              guy038

                              See the beauty :

                              For example, from these four characters :

                              🏳 ( \x{1F3F3} )
                              ️ ( \x{FE0F} = VS-16 )
                              ‍ ( \x{200D} = ZWJ )
                              🌈 ( \x{1F308} )

                              The sequence \x{1F3F3}\x{FE0F}\x{200D}\x{1F308} would return the RAINBOW FLAG / PRIDE FLAG emoji :

                              🏳️‍🌈

                              Note that the VS-16 char is not even necessary. Thus, the sequence \x{1F3F3}\x{200D}\x{1F308} would work, as well :

                              🏳‍🌈

                              Now, if we omit the Zero Width Joiner char too, we simply get the two characters, not glued !

                              🏳🌈


                              In the same way, from these four characters :

                              🏴 ( \x{1F3F4} )
                              ️ ( \x{FE0F} = VS-16 )
                              ‍ ( \x{200D} = ZWJ )
                              ☠ ( \x{2620} )

                              The sequence \x{1F3F4}\x{FE0F}\x{200D}\x{2620} would display the PIRATE FLAG emoji !

                              🏴️‍☠
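
                              To make the point concrete, here is a tiny sketch (plain standard C++, nothing specific to Columns++) showing that such a flag is several code points but should count as a single \X match :

                                  #include <iostream>
                                  #include <string>

                                  int main() {
                                      // U+1F3F3 U+FE0F U+200D U+1F308 : white flag + VS-16 + ZWJ + rainbow
                                      std::u32string prideFlag = U"\U0001F3F3\U0000FE0F\U0000200D\U0001F308";
                                      std::cout << prideFlag.size() << " code points\n";   // 4
                                      // A correct \X should treat this whole ZWJ sequence as ONE
                                      // grapheme cluster, i.e. a single match.
                                  }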

                              • CoisesC
                                Coises
                                last edited by Coises

                                I’ve posted Columns++ for Notepad++ version 1.1.5.4-Experimental.

                                • \X should work properly now. I haven’t been able to invent “real-world” tests for the Hangul and Indic rules, because I know nothing about those languages. I think I got it right. (Note that if you have View | Show Symbol | Show Non-Printing Characters checked, some examples, like the compound flags in @guy038’s post, won’t show correctly in Notepad++. \X will still select the entire group as one, since it’s based on Unicode rules, not how Scintilla displays the text.)

                                • Properties (\p{...} and \P{...}), named classes (like [[:ascii:]] or [[:lower:]]) and the \l and \u escapes now ignore the Match case setting and the (?i) flag: they are always case sensitive. The old behavior was bizarre and all but useless, so even though this introduces a change in semantics, I think it does more good than harm. (A downside is that the “trick” I used to change the Boost::regex behavior for Unicode matching is not applicable to ANSI matching, so there is a discrepancy in behavior between the two.)

                                I will probably remove the \m and \M escapes, because they don’t accomplish their original intended purpose, which was that I thought they identified code points which combine with the previous code point to make a single character. \X finds whole characters. You can use (?!\X). to find code points that are not the first or only code point in a combining group. (I made \m equivalent to \p{M*}, but that’s probably not as useful as I thought.) There is, unfortunately, no straightforward way to define a class or escape sequence that separates “simple” and “compound” characters. (Classes depend on individual code point values, but which code points are part of a “grapheme cluster” — what normal people call a character — depends on context. Boost::regex already had the \X available, and I was able to find a “trick” to modify its behavior for Unicode searches. I haven’t found a way to add an entirely new escape that wouldn’t mean messing with the Boost code to a greater degree than I think is wise.)

                                Remaining tasks (if there are no important problems in this version) will be updating the documentation and perhaps investigating if equivalence classes can work better. (I don’t think I will let equivalence classes be a blocker, though. Probably few people even know what they are, and I’m pretty sure they don’t work any worse than before.)

                                • guy038G
                                  guy038
                                  last edited by guy038

                                  Hello, @coises and All,

                                  Due to the total size of my text, I’ll have to split it into three consecutive posts. I’m going to recapitulate some parts given in previous posts, in order to give a coherent and full description of my tests of your Columns++ for Notepad++ version 1.1.5.4-Experimental

                                  As a preamble, big thanks for your progress dialog : Notepad++ does not freeze anymore, when using the Select All feature. To my mind, that’s an important improvement !

                                  Here are the different Unicode planes included in my Total_Chars file ( 325,590 characters for a size of 1,236,733 bytes )

                                      •--------------------•-------------------•------------•---------------------------•----------------•-------------------•
                                      |       Range        |    Description    |   Status   |      Number of Chars      | UTF-8 Encoding |  Number of Bytes  |
                                      •--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
                                      |    0000  -   007F  |  PLANE 0 - BMP    |  Included  |             |        128  |    1 Byte      |              128  |
                                      •--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
                                      |    0080  -   00FF  |  PLANE 0 - BMP    |  Included  |             |    +   128  |                |              256  |
                                      |                    |                   |            |             |             |    2 Bytes     |                   |-------
                                      |    0100  -   07FF  |  PLANE 0 - BMP    |  Included  |             |    + 1,792  |                |            3,584  |      \
                                      •--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•      |
                                      |    0800  -   D7FF  |  PLANE 0 - BMP    |  Included  |             |   + 53,248  |                |          159,744  |      |
                                      |                    |                   |            |             |             |                |                   |      |
                                      |    D800  -   DFFF  |  SURROGATES zone  |  EXCLUDED  |    - 2,048  |             |                |                   |      |
                                      |                    |                   |            |             |             |                |                   |      |
                                      |    E000  -   F8FF  |  PLANE 0 - PUA    |  Included  |             |    + 6,400  |                |           19,200  |      |
                                      |                    |                   |            |             |             |                |                   |      |
                                      |    F900  -   FDCF  |  PLANE 0 - BMP    |  Included  |             |    + 1,232  |    3 Bytes     |            3,696  |      |
                                      |                    |                   |            |             |             |                |                   |      |
                                      |    FDD0  -   FDEF  |  NON-characters   |  EXCLUDED  |       - 32  |             |                |                   |      |
                                      |                    |                   |            |             |             |                |                   |      |
                                      |    FDF0  -   FFFD  |  PLANE 0 - BMP    |  Included  |             |      + 526  |                |            1,578  |      |
                                      |                    |                   |            |             |             |                |                   |      |==>  [[:unicode:]]
                                      |    FFFE  -   FFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |      |
                                      •--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•      |
                                      |                       Plane 0 - BMP    | SUB-Totals |    - 2,082  |   + 63,454  |                |          188,186  |      |
                                      •--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•      |
                                      |   10000  -  1FFFD  |  PLANE 1 - SMP    |  Included  |             |   + 65,534  |                |          262,136  |      |
                                      |                    |                   |            |             |             |                |                   |      |
                                      |   1FFFE  -  1FFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |      |
                                      •--------------------•-------------------•------------•-------------•-------------•                •-------------------•      |
                                      |   20000  -  2FFFD  |  PLANE 2 - SIP    |  Included  |             |   + 65,534  |                |          262,136  |      |
                                      |                    |                   |            |             |             |    4 Bytes     |                   |      |
                                      |   2FFFE  -  2FFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |      |
                                      •--------------------•-------------------•------------•-------------•-------------•                •-------------------•      |
                                      |   30000  -  3FFFD  |  PLANE 3 - TIP    |  Included  |             |   + 65,534  |                |          262,136  |      |
                                      |                    |                   |            |             |             |                |                   |      |
                                      |   3FFFE  -  3FFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |      |
                                      •--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•      |
                                      |   40000  -  DFFFF  |  PLANES 4 to 13   |  NOT USED  |  - 655,360  |             |    4 Bytes     |                   |      |
                                      •--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•      |
                                      |   E0000  -  EFFFD  |  PLANE 14 - SPP   |  Included  |             |   + 65,534  |                |          262,136  |      /
                                      |                    |                   |            |             |             |                |                   |
                                      |   EFFFE  -  EFFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |
                                      •--------------------•-------------------•------------•-------------•-------------•                •-------------------•
                                      |   F0000  -  FFFFD  |  PLANE 15 - SPUA  |  NOT USED  |   - 65,534  |             |                |                   |
                                      |                    |                   |            |             |             |                |                   |
                                      |   FFFFE  -  FFFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |
                                      •--------------------•-------------------•------------•-------------•-------------•    4 Bytes     •-------------------•
                                       |  100000  - 10FFFD  |  PLANE 16 - SPUA  |  NOT USED  |   - 65,534  |             |                |                   |
                                      |                    |                   |            |             |             |                |                   |
                                      |  10FFFE  - 10FFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |
                                      •--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
                                      |                                       GRAND Totals  |  - 788,522  |  + 325,590  |                |        1,236,730  |
                                      |                                                     |             |             |                |                   |
                                      |                              Byte Order Mark - BOM  |             |             |                |                3  |
                                      •-----------------------------------------------------•-------------•-------------•                •-------------------•
                                      |                                                     |  1,114,112 Unicode chars  |                |  Size  1,236,733  |
                                      •-----------------------------------------------------•---------------------------•----------------•-------------------•
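
                                   As a cross-check, independent of Columns++ or Notepad++, the totals of the table above can be recomputed with a few lines of Python ( a minimal sketch ; it simply re-applies the inclusion rules shown in the table, i.e. planes 0 to 3 and 14 only, minus the surrogates, the \x{FDD0}-\x{FDEF} block and the two xFFFE / xFFFF non-characters of each plane ) :

                                       def included(cp):
                                           plane = cp >> 16
                                           if plane not in (0, 1, 2, 3, 14):      return False   # unused / private-use planes
                                           if 0xD800 <= cp <= 0xDFFF:             return False   # surrogates
                                           if 0xFDD0 <= cp <= 0xFDEF:             return False   # non-characters of plane 0
                                           if (cp & 0xFFFF) in (0xFFFE, 0xFFFF):  return False   # non-characters of every plane
                                           return True
                                       
                                       chars = [cp for cp in range(0x110000) if included(cp)]
                                       size  = sum(len(chr(cp).encode('utf-8')) for cp in chars)
                                       print(len(chars), size + 3)    # expected : 325590  1236733  ( + 3 bytes for the UTF-8 BOM )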
                                  

                                  Presently, here is the summary of the contents of the Total_Chars.txt file :

                                      •----------------•---------•------------------------------------------•-----------•-------------------------------------------------•-----------•
                                      |     Range      |  Plane  |      COUNT / MARK of ALL characters      |  # Chars  |    COUNT / MARK of ALL UNASSIGNED characters    |  # Unas.  |
                                      •----------------•---------•------------------------------------------•-----------•-------------------------------------------------•-----------•
                                      |   0000...FFFD  |     0   |  [\x{0000}-\x{FFFD}]                     |   63,454  |  (?=[\x{0000}-\x{D7FF}]|[\x{F900}-\x{FFFD}])\Y  |    1,398  |
                                      |  10000..1FFFD  |     1   |  [\x{10000}-\x{1FFFD}]                   |   65,534  |  (?=[\x{10000}-\x{1FFFD}])\Y                    |   37,090  |
                                      |  20000..2FFFD  |     2   |  [\x{20000}-\x{2FFFD}]                   |   65,534  |  (?=[\x{20000}-\x{2FFFD}])\Y                    |    4,039  |
                                      |  30000..3FFFD  |     3   |  [\x{30000}-\x{3FFFD}]                   |   65,534  |  (?=[\x{30000}-\x{3FFFD}])\Y                    |   56,403  |
                                       |  E0000..EFFFD  |    14   |  [\x{E0000}-\x{EFFFD}]                   |   65,534  |  (?=[\x{E0000}-\x{EFFFD}])\Y                    |   65,197  |
                                      •----------------•---------•------------------------------------------•-----------•-------------------------------------------------•-----------•
                                      |  00000..EFFFD  |         |  (?s).   \I   \p{Any}   [\x0-\x{EFFFD}]  |  325,590  |  (?![\x{E000}-\x{F8FF}])\Y   \p{Not Assigned}   |  164,127  |
                                      •----------------•---------•------------------------------------------•-----------•-------------------------------------------------•-----------•
                                  

                                  Indeed, I cannot post my new Unicode_Col++.txt file, in its entirety, with the detail of all the Unicode blocks ( Too large ! ). However, it will be part of my future Unicode.zip archive that I’ll post on my Google Drive account !


                                   To begin with, all the ranges of characters, from the Total_Chars.txt file, are correct :

                                      [\x{0000}-\x{007F}]   =>      128 chars ( OK )  \
                                      [\x{0080}-\x{00FF}]   =>      128 chars ( OK )  |
                                      [\x{0100}-\x{07FF}]   =>    1,792 chars ( OK )  |
                                      [\x{0800}-\x{D7FF}]   =>   53,248 chars ( OK )  |  Plane 0
                                      [\x{E000}-\x{F8FF}]   =>    6,400 chars ( OK )  |
                                      [\x{F900}-\x{FDCF}]   =>    1,232 chars ( OK )  |
                                      [\x{FDF0}-\x{FFFD}]   =>      526 chars ( OK )  /
                                  
                                      [\x{10000}-\x{1FFFD}]  =>   65,534 chars  Total characters PLANE  1    ( OK )
                                       [\x{20000}-\x{2FFFD}]  =>   65,534 chars  Total characters PLANE  2    ( OK )
                                       [\x{30000}-\x{3FFFD}]  =>   65,534 chars  Total characters PLANE  3    ( OK )
                                       [\x{E0000}-\x{EFFFD}]  =>   65,534 chars  Total characters PLANE 14    ( OK )
                                  
                                      [\x{0000}-\x{007F}]    =>      128 chars, coded with 1 byte            ( OK )
                                      [\x{0080}-\x{07FF}]    =>    1,920 chars, coded with 2 bytes           ( OK )
                                      [\x{0800}-\x{FFFD}]    =>   61,406 chars, coded with 3 bytes           ( OK )
                                      [\x{10000}-\x{EFFFD}]  =>  262,136 chars, coded with 4 bytes           ( OK )
                                  
                                      [\x{0000}-\x{EFFFD}]   =>  325,590 chars  = TOTAL of characters        ( OK )
                                      
                                       [\x{0100}-\x{EFFFD}]   =>  325,334 chars  = Total chars OVER \x{00FF}  ( OK )
                                       
                                       [[:unicode:]]          =>  325,334 chars  = Total chars OVER \x{00FF}  ( OK )
                                  

                                  I tried some expressions with look-aheads and look-behinds, containing overlapping zones !

                                  For instance, against this text aaaabaaababbbaabbabb, pasted in a new tab, with a final line-break, all the regexes, below, give the correct number of matches :

                                  ba*(?=a)   =>  4 matches
                                  ba*(?!a)   =>  9 matches
                                  ba*(?=b)   =>  8 matches
                                  ba*(?!b)   =>  5 matches
                                  
                                  (?<=a)ba*  =>  5 matches
                                  (?<!b)ba*  =>  5 matches
                                  
                                  (?<=b)ba*  =>  4 matches
                                  (?<!a)ba*  =>  4 matches
                                  
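                                   These lookaround counts can also be double-checked outside Notepad++ ; here is a minimal Python sketch ( it uses the third-party regex module, but any engine with lookarounds, including the standard re module, would do ) :

                                       import regex   # third-party module ( pip install regex )
                                       
                                       text = "aaaabaaababbbaabbabb\n"
                                       for p in ("ba*(?=a)", "ba*(?!a)", "ba*(?=b)", "ba*(?!b)",
                                                 "(?<=a)ba*", "(?<!b)ba*", "(?<=b)ba*", "(?<!a)ba*"):
                                           print(p, len(regex.findall(p, text)))
                                       # expected counts : 4, 9, 8, 5, 5, 5, 4, 4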

                                  On the other hand, the search of the regex :

                                  [[.NUL.][.SOH.][.STX.][.ETX.][.EOT.][.ENQ.][.ACK.][.BEL.][.BS.][.HT.][.VT.][.FF.][.SO.][.SI.][.DLE.][.DC1.][.DC2.][.DC3.][.DC4.][.NAK.][.SYN.][.ETB.][.CAN.][.EM.][.SUB.][.ESC.][.FS.][.GS.][.RS.][.US.][.DEL.][.PAD.][.HOP.][.BPH.][.NBH.][.IND.][.NEL.][.SSA.][.ESA.][.HTS.][.HTJ.][.LTS.][.PLD.][.PLU.][.RI.][.SS2.][.SS3.][.DCS.][.PU1.][.PU2.][.STS.][.CCH.][.MW.][.SPA.][.EPA.][.SOS.][.SGCI.][.SCI.][.CSI.][.ST.][.OSC.][.PM.][.APC.][.NBSP.][.SHY.][.ALM.][.SAM.][.OSPM.][.MVS.][.NQSP.][.MQSP.][.ENSP.][.EMSP.][.3/MSP.][.4/MSP.][.6/MSP.][.FSP.][.PSP.][.THSP.][.HSP.][.ZWSP.][.ZWNJ.][.ZWJ.][.LRM.][.RLM.][.LS.][.PS.][.LRE.][.RLE.][.PDF.][.LRO.][.RLO.][.NNBSP.][.MMSP.][.WJ.][.(FA).][.(IT).][.(IS).][.(IP).][.LRI.][.RLI.][.FSI.][.PDI.][.ISS.][.ASS.][.IAFS.][.AAFS.][.NADS.][.NODS.][.IDSP.][.ZWNBSP.][.IAA.][.IAS.][.IAT.][.SFLO.][.SFCO.][.SFDS.][.SFUS.]]

                                   does return 118 matches, to which we must add the [[.LF.]] and [[.CR.]] symbolic char names, giving a total of 120 symbolic character names, each of which can be found, independently, with the [[.XXX.]] syntax !


                                  Now, against the Total_Chars.txt file, all the following results are correct :

                                  (?s).  =  \I  =  \p{Any}  =  [\x{0000}-\x{EFFFD}]                                         =>                Total =  325,590
                                  
                                  
                                  \p{Unicode}  =  [[:Unicode:]]                                                             =>  325,334    |
                                                                                                                                           |  Total =  325,590
                                  \P{Unicode}  =  [[:^Unicode:]]                                                            =>      256    |
                                  
                                  
                                  \p{Ascii}  =  (?s)\o                                                                      =>      128    |
                                                                                                                                           |  Total =  325,590
                                  \P{Ascii}  =  \O                                                                          =>  325,462    |
                                  
                                  
                                  \X                                                                                        =>  322,628    |
                                                                                                                                           |  Total =  325,590
                                  (?!\X).                                                                                   =>    2,962    |
                                  
                                  
                                  [\x{E000}-\x{F8FF}]|\y     =  [\x{E000}-\x{F8FF}]|[[:defined:]]      =  \p{Assigned}      =>  161,463    |
                                                                                                                                           |  Total =  325,590
                                  (?![\x{E000}-\x{F8FF}])\Y  =  (?![\x{E000}-\x{F8FF}])[^[:defined:]]  =  \p{Not Assigned}  =>  164,127    |
                                  

                                  See next post !

                                  1 Reply Last reply Reply Quote 1
                                  • guy038G
                                    guy038
                                    last edited by guy038

                                    Hi @Coises and All,

                                    Continuation of my reply :

                                    Here are the correct results, concerning all the Posix character classes, against the Total_Chars.txt file

                                    [[:ascii:]]                                                       an UNDER \x{0080}         character                     128   =  [\x{0000}-\x{007F}]  =  \p{ascii}
                                    [[:unicode:]]  =  \p{unicode}                                     an OVER  \x{00FF}         character                 325,334   =  [\x{0100}-\x{EFFFD}] ( in 'Total_Chars.txt' )
                                    
                                    [[:space:]]   =  \p{space}  =  [[:s:]]  =  \p{s}  =  \ps  =  \s   a             WHITE-SPACE character                      25   =  [\t\n\x{000B}\f\r \x{0085}\x{00A0}\x{1680}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}]
                                                                   [[:h:]]  =  \p{h}  =  \ph  =  \h   an HORIZONTAL white space character                      18   =  [\t \x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}]  =  \p{Zs}|\t
                                    [[:blank:]]   =  \p{blank}                                        a  BLANK                  character                      18   =  [\t \x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}]  =  \p{Zs}|\t
                                                                   [[:v:]]  =  \p{v}  =  \pv  =  \v   a  VERTICAL   white space character                       7   =  [\n\x{000B}\f\r\x{0085}\x{2028}\x{2029}]
                                    
                                    [[:cntrl:]]   =  \p{cntrl}                                        a  CONTROL code           character                      65   =  [\x{0000}-\x{001F}\x{007F}\x{0080}-\x{009F}]
                                    
                                    [[:upper:]]   =  \p{upper}  =  [[:u:]]  =  \p{u}  =  \pu  =  \u   an  UPPER case    letter                              1,858   =  \p{Lu}
                                    [[:lower:]]   =  \p{lower}  =  [[:l:]]  =  \p{l}  =  \pl  =  \l   a   LOWER case    letter                              2,258   =  \p{Ll}
                                                                                                       a   DI-GRAPHIC   letter                                 31   =  \p{Lt}
                                                                                                      a   MODIFIER      letter                                404   =  \p{Lm}
                                                                                                       an  OTHER         letter + SYLLABLES / IDEOGRAPHS   136,477   =  \p{Lo}
                                    [[:digit:]]   =  \p{digit}  =  [[:d:]]  =  \p{d}  =  \pd   = \d   a   DECIMAL       number                                760   =  \p{Nd}
                                     _            =  \x{005F}                                         the LOW_LINE      character                               1
                                                                                                                                                        -----------
                                    [[:word:]]    =  \p{word}   =  [[:w:]]  =  \p{w}  =  \pw   = \w   a   WORD                  character                 141,789   =  \p{L*}|\p{nd}|_
                                    
                                    [[:alnum:]]   =  \p{alnum}                                        an  ALPHANUMERIC          character                 141,788   =  \p{L*}|\p{nd}
                                    
                                    [[:alpha:]]   =  \p{alpha}                                        any LETTER                character                 141,028   =  \p{L*}
                                    
                                    [[:graph:]]   =  \p{graph}                                        any VISIBLE               character                 154,809
                                    
                                    [[:print:]]   =  \p{print}                                        any PRINTABLE             character                 154,834   =  [[:graph:]]|\s
                                    
                                    [[:punct:]]   =  \p{punct}                                        any PUNCTUATION or SYMBOL character                   9,369   =  \p{P*} + \p{S*}  =  \p{Punctuation} + \p{Symbol}  = 855 + 8,514
                                    
                                    [[:xdigit:]]                                                      an HEXADECIMAL            character                      22   =  [0-9A-Fa-f]
                                    

                                     BTW, there are 31 di-graphic or title-case characters, which are treated neither as upper-case nor as lower-case letters ; they can be found with the Unicode character class \p{Lt}
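
                                     If you want to reproduce this kind of census outside Columns++, here is a small Python sketch using the third-party regex module ( the exact counts depend on the Unicode version bundled with the module, so they will not necessarily match the figures above, which come from the Total_Chars.txt file ) :

                                         import regex   # third-party module ( pip install regex ) ; it also accepts POSIX-style classes such as [[:alpha:]]
                                         
                                         # all code points except the surrogates ( a slightly larger inventory than Total_Chars.txt )
                                         text = ''.join(chr(cp) for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF)
                                         
                                         for cls in (r'\p{Lu}', r'\p{Ll}', r'\p{Lt}', r'\p{Nd}', r'[[:alpha:]]', r'\w'):
                                             print(cls, len(regex.findall(cls, text)))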


                                    And here are the correct results regarding the Unicode character classes, against the Total_Chars.txt file :

                                              \p{Any}                             any                            character                   325,590  =  (?s). =  \I  =  [\x{0000}-\x{EFFFD}]
                                    
                                             \p{Ascii}                           a                              character UNDER \x80            128
                                    
                                             \p{Assigned}                        an ASSIGNED                    character                   161,463
                                    
                                    \p{Cc}   \p{Control}                         a  C0 or C1 CONTROL code       character                        65
                                    \p{Cf}   \p{Format}                          a  FORMAT CONTROL              character                       170
                                    \p{Cn}   \p{Not Assigned}                    an UNASSIGNED or NON-CHARACTER character                   164,127     ( 'Total_Chars.txt' does NOT contain the 66 NON-CHARACTER chars )
                                    \p{Co}   \p{Private Use}                     a  PRIVATE-USE                 character                     6,400
                                    \p{Cs}   \p{Surrogate}                       a  SURROGATE                   character  INVALID regex    ( 2,048 )   ( 'Total_Chars.txt' does NOT contain the 2,048 SURROGATE chars )
                                                                                                                                          -----------
                                    \p{C*}   \p{Other}                                                                                      170,762  =  \p{Cc}|\p{Cf}|\p{Cn}|\p{Co}
                                    
                                    \p{Lu}   \p{Uppercase Letter}                an UPPER case letter                                         1,858
                                    \p{Ll}   \p{Lowercase Letter}                a  LOWER case letter                                         2,258
                                    \p{Lt}   \p{Titlecase}                       a  DI-GRAPHIC letter                                            31
                                    \p{Lm}   \p{Modifier Letter}                 a  MODIFIER   letter                                           404
                                    \p{Lo}   \p{Other Letter}                    OTHER LETTER, including SYLLABLES and IDEOGRAPHS           136,477
                                                                                                                                          -----------
                                    \p{L*}   \p{Letter}                                                                                     141,028  =  \p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}  =  [[:alpha:]]   =  \p{alpha}
                                    
                                     \p{Mc}   \p{Spacing Combining Mark}          a  SPACING COMBINING     mark (POSITIVE advance width)         468
                                     \p{Me}   \p{Enclosing Mark}                  an ENCLOSING COMBINING   mark                                    13
                                     \p{Mn}   \p{Non-Spacing Mark}                a  NON-SPACING COMBINING mark (ZERO     advance width)       2,020
                                                                                                                                            ---------
                                    \p{M*}   \p{Mark}                                                                                         2,501  =  \p{Mc}|\p{Me}|\p{Mn}  ( = \m, which should be REMOVED )
                                    
                                    \p{Nd}   \p{Decimal Digit Number}            a DECIMAL number     character                                 760
                                    \p{Nl}   \p{Letter Number}                   a LETTERLIKE numeric character                                 236
                                    \p{No}   \p{Other Number}                    OTHER NUMERIC        character                                 915
                                                                                                                                            ---------
                                    \p{N*}   \p{Number}                                                                                       1,911  =  \p{Nd}|\p{Nl}|\p{No}
                                    
                                    \p{Pd}   \p{Dash Punctuation}                a  DASH or HYPHEN punctuation mark                              27
                                    \p{Ps}   \p{Open Punctuation}                an OPENING    PUNCTUATION     mark in a pair                    79
                                    \p{Pc}   \p{Connector Punctuation}           a  CONNECTING PUNCTUATION     mark                              10
                                    \p{Pe}   \p{Close Punctuation}               a  CLOSING    PUNCTUATION     mark in a pair                    77
                                    \p{Pi}   \p{Initial Punctuation}             an INITIAL QUOTATION          mark                              12
                                    \p{Pf}   \p{Final Punctuation}               a  FINAL   QUOTATION          mark                              10
                                    \p{Po}   \p{Other Punctuation}               OTHER PUNCTUATION             mark                             640
                                                                                                                                              -------
                                    \p{P*}   \p{Punctuation}                                                                                    855  =  \p{Pd}|\p{Ps}|\p{Pc}|\p{Pe}|\p{Pi}|\p{Pf}|\p{Po}
                                    
                                    \p{Sm}   \p{Math Symbol}                     a MATHEMATICAL symbol     character                            950
                                    \p{Sc}   \p{Currency Symbol}                 a CURRENCY                character                             63
                                    \p{Sk}   \p{Modifier Symbol}                 a NON-LETTERLIKE MODIFIER character                            125
                                    \p{So}   \p{Other Symbol}                    OTHER SYMBOL              character                          7,376
                                    
                                    \p{S*}   \p{Symbol}                                                                                       8,514  =  \p{Sm}|\p{Sc}|\p{Sk}|\p{So}
                                    
                                    \p{Zs}   \p{Space Separator}                 a   NON-ZERO width SPACE   character                            17  =  [\x{0020}\x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}]  =  (?!\t)\h
                                    \p{Zl}   \p{Line Separator}                  the LINE SEPARATOR         character                             1  =  \x{2028}
                                    \p{Zp}   \p{Paragraph Separator}             the PARAGRAPH SEPARATOR    character                             1  =  \x{2029}
                                                                                                                                               ------   
                                    \p{Z*}   \p{Separator}                                                                                       19  =  \p{Zs}|\p{Zl}|\p{Zp}  
                                    
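                                     For what it's worth, the same kind of tally can be obtained from Python's standard unicodedata module, which maps each code point to its two-letter general category ( again, the counts depend on the Unicode version of your Python build, so small differences with the figures above are expected ) :

                                         from collections import Counter
                                         import unicodedata
                                         
                                         tally = Counter(unicodedata.category(chr(cp))
                                                         for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF)
                                         print(tally['Lu'], tally['Ll'], tally['Lt'], tally['Lm'], tally['Lo'])
                                         print('Letters :', sum(n for cat, n in tally.items() if cat.startswith('L')))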

                                    For example, the regexes (?=[\x{0300}-\x{036F}])\p{M*} or (?=\p{M*})[\x{0300}-\x{036F}] would return 112 occurrences, i.e. all Mark characters of the COMBINING DIACRITICAL MARKS Unicode block ( refer https://www.unicode.org/charts/PDF/U0300.pdf )
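
                                     As a side note, in the Python regex module the same intersection can be written either with the look-ahead trick or directly with set operations ( the && operator, available in its V1 mode ) ; this is a small sketch in that module's syntax, not in Columns++ syntax :

                                         import regex
                                         
                                         text = ''.join(chr(cp) for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF)
                                         print(len(regex.findall(r'(?=[\u0300-\u036F])\p{M}', text)))       # look-ahead trick    -> 112
                                         print(len(regex.findall(r'(?V1)[\p{M}&&[\u0300-\u036F]]', text)))  # set intersection    -> 112  ( needs the module's V1 behaviour )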

                                    Remark :

                                    • A negative POSIX character class can be expressed as [^[:........:]] or [[:^........:]]

                                    • A negative UNICODE character class can be expressed as \P{..}, with an uppercase letter P


                                    Now, if you follow the procedure explained in the last part of this post :

                                    https://community.notepad-plus-plus.org/post/99844

                                    The regexes [\x{DC80}-\x{DCFF}] or \i or [[:invalid:]] do give 134 occurrences, which is the exact number of invalid characters of that example !
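
                                     Incidentally, the \x{DC80}-\x{DCFF} convention is the same one Python uses for undecodable bytes with its surrogateescape error handler, so the count of invalid bytes can be cross-checked like this ( my own tiny example below, not the file from the linked post ) :

                                         raw = b"abc\xC3\x28\xFFxyz"         # two invalid UTF-8 bytes : \xC3 without a continuation byte, and \xFF
                                         s = raw.decode('utf-8', errors='surrogateescape')
                                         invalid = [c for c in s if 0xDC80 <= ord(c) <= 0xDCFF]
                                         print(len(invalid))                 # -> 2 here ( the linked example would give 134 )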


                                    I tested ALL the Equivalence classes feature. You can refer to the following link, for comparison :

                                    https://unicode.org/charts/collation/index.html

                                    And also consult the help at :

                                    https://unicode.org/charts/collation/help.html

                                     For the letter a, it lists 160 equivalences of the a letter

                                    However, against the Total_Chars.txt file, the regex [[=a=]] returns 86 matches. So we can deduce that :

                                    • A lot of equivalences are not found with the [[=a=]] regex

                                     • Some equivalents, not shown at this link, can be found with the [[=a=]] regex. It's the case with the \x{249C} character ( PARENTHESIZED LATIN SMALL LETTER A ) !

                                    This situation happens with any character : for example, the regex [[=1=]] finds 54 matches, but, on the site, it shows 209 equivalences to the digit 1

                                     Now, with your experimental UTF-32 version, we can use any other equivalent character of the a letter to get the 86 matches ( for instance : [[=Ⱥ=]], [[=ⱥ=]], [[=Ɐ=]], … ) whereas, with our present Boost regex engine, some equivalences of a specific char do not return the right result. Thus, your version is more coherent, as it does give the same result, whatever the char used in the equivalence class regex !
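
                                     Outside Boost there is no direct counterpart of [[=a=]] that I can use for cross-checking ; as a very rough approximation only, one can look at the characters whose NFKD decomposition starts with the base letter ( this is NOT what the Boost equivalence classes do — they are based on collation data — but it gives a similar flavour ) :

                                         import unicodedata
                                         
                                         def rough_equivalents(base):
                                             """Characters whose NFKD decomposition begins with 'base' ( a very rough stand-in for [[=base=]] )."""
                                             out = []
                                             for cp in range(0x110000):
                                                 if 0xD800 <= cp <= 0xDFFF:
                                                     continue
                                                 decomp = unicodedata.normalize('NFKD', chr(cp))
                                                 if decomp and decomp[0].lower() == base:
                                                     out.append(chr(cp))
                                             return out
                                         
                                         print(len(rough_equivalents('a')))   # not expected to equal the 86 matches of [[=a=]]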

                                     Here is, below, the list of all the equivalences for any char of the Windows-1252 code-page, from \x{0020} to \x{00DE}. Note that, except for the DEL character, given as an example, I did not consider the equivalence classes which return only one match !

                                    I also confirm, that I did not find any character over \x{FFFF} which would be part of a regex equivalence class, either with our Boost engine or with your Columns++ experimental version !

                                    [[= =]]    =   [[=space=]]                 =>     3    (     )
                                    [[=!=]]    =   [[=exclamation-mark=]]      =>     2    ( !! )
                                    [[="=]]    =   [[=quotation-mark=]]        =>     3    ( "⁍" )
                                    [[=#=]]    =   [[=number-sign=]]           =>     4    ( #؞⁗# )
                                    [[=$=]]    =   [[=dollar-sign=]]           =>     3    ( $⁒$ )
                                    [[=%=]]    =   [[=percent-sign=]]          =>     3    ( %⁏% )
                                    [[=&=]]    =   [[=ampersand=]]             =>     3    ( &⁋& )
                                    [[='=]]    =   [[=apostrophe=]]            =>     2    ( '' )
                                    [[=(=]]    =   [[=left-parenthesis=]]      =>     4    ( (⁽₍( )
                                    [[=)=]]    =   [[=right-parenthesis=]]     =>     4    ( )⁾₎) )
                                    [[=*=]]    =   [[=asterisk=]]              =>     2    ( ** )
                                    [[=+=]]    =   [[=plus-sign=]]             =>     6    ( +⁺₊﬩﹢+ )
                                    [[=,=]]    =   [[=comma=]]                 =>     2    ( ,, )
                                    [[=-=]]    =   [[=hyphen=]]                =>     3    ( -﹣- )
                                    [[=.=]]    =   [[=period=]]                =>     3    ( .․. )
                                    [[=/=]]    =   [[=slash=]]                 =>     2    ( // )
                                    [[=0=]]    =   [[=zero=]]                  =>    48    ( 0٠۟۠۰߀०০੦૦୦୵௦౦౸೦൦๐໐༠၀႐០᠐᥆᧐᪀᪐᭐᮰᱀᱐⁰₀↉⓪⓿〇㍘꘠ꛯ꠳꣐꤀꧐꩐꯰0 )
                                    [[=1=]]    =   [[=one=]]                   =>    54    ( 1¹١۱߁१১੧૧୧௧౧౹౼೧൧๑໑༡၁႑፩១᠑᥇᧑᧚᪁᪑᭑᮱᱁᱑₁①⑴⒈⓵⚀❶➀➊〡㋀㍙㏠꘡ꛦ꣑꤁꧑꩑꯱1 )
                                    [[=2=]]    =   [[=two=]]                   =>    54    ( 2²ƻ٢۲߂२২੨૨୨௨౨౺౽೨൨๒໒༢၂႒፪២᠒᥈᧒᪂᪒᭒᮲᱂᱒₂②⑵⒉⓶⚁❷➁➋〢㋁㍚㏡꘢ꛧ꣒꤂꧒꩒꯲2 )
                                    [[=3=]]    =   [[=three=]]                 =>    53    ( 3³٣۳߃३৩੩૩୩௩౩౻౾೩൩๓໓༣၃႓፫៣᠓᥉᧓᪃᪓᭓᮳᱃᱓₃③⑶⒊⓷⚂❸➂➌〣㋂㍛㏢꘣ꛨ꣓꤃꧓꩓꯳3 )
                                    [[=4=]]    =   [[=four=]]                  =>    51    ( 4٤۴߄४৪੪૪୪௪౪೪൪๔໔༤၄႔፬៤᠔᥊᧔᪄᪔᭔᮴᱄᱔⁴₄④⑷⒋⓸⚃❹➃➍〤㋃㍜㏣꘤ꛩ꣔꤄꧔꩔꯴4 )
                                    [[=5=]]    =   [[=five=]]                  =>    53    ( 5Ƽƽ٥۵߅५৫੫૫୫௫౫೫൫๕໕༥၅႕፭៥᠕᥋᧕᪅᪕᭕᮵᱅᱕⁵₅⑤⑸⒌⓹⚄❺➄➎〥㋄㍝㏤꘥ꛪ꣕꤅꧕꩕꯵5 )
                                    [[=6=]]    =   [[=six=]]                   =>    52    ( 6٦۶߆६৬੬૬୬௬౬೬൬๖໖༦၆႖፮៦᠖᥌᧖᪆᪖᭖᮶᱆᱖⁶₆ↅ⑥⑹⒍⓺⚅❻➅➏〦㋅㍞㏥꘦ꛫ꣖꤆꧖꩖꯶6 )
                                    [[=7=]]    =   [[=seven=]]                 =>    50    ( 7٧۷߇७৭੭૭୭௭౭೭൭๗໗༧၇႗፯៧᠗᥍᧗᪇᪗᭗᮷᱇᱗⁷₇⑦⑺⒎⓻❼➆➐〧㋆㍟㏦꘧ꛬ꣗꤇꧗꩗꯷7 )
                                    [[=8=]]    =   [[=eight=]]                 =>    50    ( 8٨۸߈८৮੮૮୮௮౮೮൮๘໘༨၈႘፰៨᠘᥎᧘᪈᪘᭘᮸᱈᱘⁸₈⑧⑻⒏⓼❽➇➑〨㋇㍠㏧꘨ꛭ꣘꤈꧘꩘꯸8 )
                                    [[=9=]]    =   [[=nine=]]                  =>    50    ( 9٩۹߉९৯੯૯୯௯౯೯൯๙໙༩၉႙፱៩᠙᥏᧙᪉᪙᭙᮹᱉᱙⁹₉⑨⑼⒐⓽❾➈➒〩㋈㍡㏨꘩ꛮ꣙꤉꧙꩙꯹9 )
                                    [[=:=]]    =   [[=colon=]]                 =>     2    ( :: )
                                    [[=;=]]    =   [[=semicolon=]]             =>     3    ( ;;; )
                                    [[=<=]]    =   [[=less-than-sign=]]        =>     3    ( <﹤< )
                                    [[===]]    =   [[=equals-sign=]]           =>     5    ( =⁼₌﹦= )
                                    [[=>=]]    =   [[=greater-than-sign=]]     =>     3    ( >﹥> )
                                    [[=?=]]    =   [[=question-mark=]]         =>     2    ( ?? )
                                    [[=@=]]    =   [[=commercial-at=]]         =>     2    ( @@ )
                                    

                                    See next post !

                                    1 Reply Last reply Reply Quote 1
                                    • guy038G
                                      guy038
                                      last edited by

                                      Hi @Coises and All,

                                      End of my reply :

                                      [[=A=]]                                    =>    86    ( AaªÀÁÂÃÄÅàáâãäåĀāĂ㥹ǍǎǞǟǠǡǺǻȀȁȂȃȦȧȺɐɑɒͣᴀᴬᵃᵄᶏᶐᶛᷓḀḁẚẠạẢảẤấẦầẨẩẪẫẬậẮắẰằẲẳẴẵẶặₐÅ⒜ⒶⓐⱥⱭⱯⱰAa )
                                      [[=B=]]                                    =>    29    ( BbƀƁƂƃƄƅɃɓʙᴃᴮᴯᵇᵬᶀḂḃḄḅḆḇℬ⒝ⒷⓑBb )
                                      [[=C=]]                                    =>    40    ( CcÇçĆćĈĉĊċČčƆƇƈȻȼɔɕʗͨᴄᴐᵓᶗᶜᶝᷗḈḉℂ℃ℭ⒞ⒸⓒꜾꜿCc )
                                      [[=D=]]                                    =>    44    ( DdÐðĎďĐđƊƋƌƍɗʤͩᴅᴆᴰᵈᵭᶁᶑᶞᷘᷙḊḋḌḍḎḏḐḑḒḓⅅⅆ⒟ⒹⓓꝹꝺDd )
                                      [[=E=]]                                    =>    82    ( EeÈÉÊËèéêëĒēĔĕĖėĘęĚěƎƏǝȄȅȆȇȨȩɆɇɘəɚͤᴇᴱᴲᵉᵊᶒᶕḔḕḖḗḘḙḚḛḜḝẸẹẺẻẼẽẾếỀềỂểỄễỆệₑₔ℮ℯℰ⅀ⅇ⒠ⒺⓔⱸⱻEe )
                                      [[=F=]]                                    =>    22    ( FfƑƒᵮᶂᶠḞḟ℉ℱℲⅎ⒡ⒻⓕꜰꝻꝼꟻFf )
                                      [[=G=]]                                    =>    45    ( GgĜĝĞğĠġĢģƓƔǤǥǦǧǴǵɠɡɢɣɤʛˠᴳᵍᵷᶃᶢᷚᷛḠḡℊ⅁⒢ⒼⓖꝾꝿꞠꞡGg )
                                      [[=H=]]                                    =>    41    ( HhĤĥĦħȞȟɥɦʜʰʱͪᴴᶣḢḣḤḥḦḧḨḩḪḫẖₕℋℌℍℎℏ⒣ⒽⓗⱧⱨꞍHh )
                                      [[=I=]]                                    =>    61    ( IiÌÍÎÏìíîïĨĩĪīĬĭĮįİıƖƗǏǐȈȉȊȋɨɩɪͥᴉᴵᵎᵢᵻᵼᶖᶤᶥᶦᶧḬḭḮḯỈỉỊịⁱℐℑⅈ⒤ⒾⓘꟾIi )
                                      [[=J=]]                                    =>    23    ( JjĴĵǰȷɈɉɟʄʝʲᴊᴶᶡᶨⅉ⒥ⒿⓙⱼJj )
                                      [[=K=]]                                    =>    38    ( KkĶķĸƘƙǨǩʞᴋᴷᵏᶄᷜḰḱḲḳḴḵₖK⒦ⓀⓚⱩⱪꝀꝁꝂꝃꝄꝅꞢꞣKk )
                                      [[=L=]]                                    =>    56    ( LlĹĺĻļĽľĿŀŁłƚƛȽɫɬɭɮʟˡᴌᴸᶅᶩᶪᶫᷝᷞḶḷḸḹḺḻḼḽₗℒℓ⅂⅃⒧ⓁⓛⱠⱡⱢꝆꝇꝈꝉꞀꞁLl )
                                      [[=M=]]                                    =>    33    ( MmƜɯɰɱͫᴍᴟᴹᵐᵚᵯᶆᶬᶭᷟḾḿṀṁṂṃₘℳ⒨ⓂⓜⱮꝳꟽMm )
                                      [[=N=]]                                    =>    47    ( NnÑñŃńŅņŇňʼnƝƞǸǹȠɲɳɴᴎᴺᴻᵰᶇᶮᶯᶰᷠᷡṄṅṆṇṈṉṊṋⁿₙℕ⒩ⓃⓝꞤꞥNn )
                                      [[=O=]]                                    =>   106    ( OoºÒÓÔÕÖØòóôõöøŌōŎŏŐőƟƠơƢƣǑǒǪǫǬǭǾǿȌȍȎȏȪȫȬȭȮȯȰȱɵɶɷͦᴏᴑᴒᴓᴕᴖᴗᴼᵒᵔᵕᶱṌṍṎṏṐṑṒṓỌọỎỏỐốỒồỔổỖỗỘộỚớỜờỞởỠỡỢợₒℴ⒪ⓄⓞⱺꝊꝋꝌꝍOo )
                                      [[=P=]]                                    =>    33    ( PpƤƥɸᴘᴾᵖᵱᵽᶈᶲṔṕṖṗₚ℘ℙ⒫ⓅⓟⱣⱷꝐꝑꝒꝓꝔꝕꟼPp )
                                      [[=Q=]]                                    =>    16    ( QqɊɋʠℚ℺⒬ⓆⓠꝖꝗꝘꝙQq )
                                      [[=R=]]                                    =>    64    ( RrŔŕŖŗŘřƦȐȑȒȓɌɍɹɺɻɼɽɾɿʀʁʳʴʵʶͬᴙᴚᴿᵣᵲᵳᶉᷢᷣṘṙṚṛṜṝṞṟℛℜℝ⒭ⓇⓡⱤⱹꝚꝛꝜꝝꝵꝶꞂꞃRr )
                                      [[=S=]]                                    =>    47    ( SsŚśŜŝŞşŠšƧƨƩƪȘșȿʂʃʅʆˢᵴᶊᶋᶘᶳᶴᷤṠṡṢṣṤṥṦṧṨṩₛ⒮ⓈⓢⱾꜱSs )
                                      [[=T=]]                                    =>    46    ( TtŢţŤťƫƬƭƮȚțȶȾʇʈʧʨͭᴛᵀᵗᵵᶵᶿṪṫṬṭṮṯṰṱẗₜ⒯ⓉⓣⱦꜨꜩꝷꞆꞇTt )
                                      [[=U=]]                                    =>    82    ( UuÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųƯưƱǓǔǕǖǗǘǙǚǛǜȔȕȖȗɄʉʊͧᴜᵁᵘᵤᵾᵿᶙᶶᶷᶸṲṳṴṵṶṷṸṹṺṻỤụỦủỨứỪừỬửỮữỰự⒰ⓊⓤUu )
                                      [[=V=]]                                    =>    29    ( VvƲɅʋʌͮᴠᵛᵥᶌᶹᶺṼṽṾṿỼỽ⒱ⓋⓥⱱⱴⱽꝞꝟVv )
                                      [[=W=]]                                    =>    28    ( WwŴŵƿǷʍʷᴡᵂẀẁẂẃẄẅẆẇẈẉẘ⒲ⓌⓦⱲⱳWw )
                                      [[=X=]]                                    =>    15    ( XxˣͯᶍẊẋẌẍₓ⒳ⓍⓧXx )
                                      [[=Y=]]                                    =>    36    ( YyÝýÿŶŷŸƳƴȲȳɎɏʎʏʸẎẏẙỲỳỴỵỶỷỸỹỾỿ⅄⒴ⓎⓨYy )
                                      [[=Z=]]                                    =>    41    ( ZzŹźŻżŽžƵƶȤȥɀʐʑᴢᵶᶎᶻᶼᶽᷦẐẑẒẓẔẕℤ℥ℨ⒵ⓏⓩⱫⱬⱿꝢꝣZz )
                                      [[=[=]]    =  [[=left-square-bracket=]]    =>     2    ( [[ )
                                      [[=\=]]    =  [[=backslash=]]              =>     2    ( \\ )
                                      [[=]=]]    =  [[=right-square-bracket=]]   =>     2    ( ]] )
                                      [[=^=]]    =  [[=circumflex=]]             =>     3    ( ^ˆ^ )
                                      [[=_=]]    =  [[=underscore=]]             =>     2    ( __ )
                                      [[=`=]]    =  [[=grave-accent=]]           =>     4    ( `ˋ`` )
                                      [[={=]]    =  [[=left-curly-bracket=]]     =>     2    ( {{ )
                                      [[=|=]]    =  [[=vertical-line=]]          =>     2    ( || )
                                      [[=}=]]    =  [[=right-curly-bracket=]]    =>     2    ( }} )
                                      [[=~=]]    =  [[=tilde=]]                  =>     2    ( ~~ )
                                      [[==]]  =  [[=DEL=]]                    =>     1    (  )
                                      [[=Œ=]]                                    =>     2    ( Œœ )
                                      [[=¢=]]                                    =>     3    ( ¢《¢ )
                                      [[=£=]]                                    =>     3    ( £︽£ )
                                      [[=¤=]]                                    =>     2    ( ¤》 )
                                      [[=¥=]]                                    =>     3    ( ¥︾¥ )
                                      [[=¦=]]                                    =>     2    ( ¦¦ )
                                      [[=¬=]]                                    =>     2    ( ¬¬ )
                                      [[=¯=]]                                    =>     2    ( ¯ ̄ )
                                      [[=´=]]                                    =>     2    ( ´´ )
                                      [[=·=]]                                    =>     2    ( ·· )
                                      [[=¼=]]                                    =>     4    ( ¼୲൳꠰ )
                                      [[=½=]]                                    =>     6    ( ½୳൴༪⳽꠱ )
                                      [[=¾=]]                                    =>     4    ( ¾୴൵꠲ )
                                      [[=Þ=]]                                    =>     6    ( ÞþꝤꝥꝦꝧ )
                                      

                                      Some double-letter characters give some equivalences which allow you to get the right single char to use, instead of the two trivial letters :

                                      [[=AE=]] = [[=Ae=]] = [[=ae=]] =>  11 ( ÆæǢǣǼǽᴁᴂᴭᵆᷔ )
                                      [[=CH=]] = [[=Ch=]] = [[=ch=]] =>   0 ( ? )
                                      [[=DZ=]] = [[=Dz=]] = [[=dz=]] =>   6 ( DŽDždžDZDzdz )
                                      [[=LJ=]] = [[=Lj=]] = [[=lj=]] =>   3 ( LJLjlj )
                                      [[=LL=]] = [[=Ll=]] = [[=ll=]] =>   2 ( Ỻỻ )
                                      [[=NJ=]] = [[=Nj=]] = [[=nj=]] =>   3 ( NJNjnj )
                                      [[=SS=]] = [[=Ss=]] = [[=ss=]] =>   2 ( ßẞ )
                                      

                                       In conclusion, there is no difference, regarding the management of the Equivalence classes, between your third and the present fourth version. ( It's just an observation, not a reproach ! )


                                       In some cases, the use of these di-graph characters is quite delicate ! Let's consider these 7 di-graph collating elements, below, with various cases :

                                      [[.AE.]]    [[.Ae.]]    [[.ae.]]    ( European Ligature )
                                      [[.CH.]]    [[.Ch.]]    [[.ch.]]    ( Spanish )
                                      [[.DZ.]]    [[.Dz.]]    [[.dz.]]    ( Hungarian, Polish, Slovakian, Serbo-Croatian )
                                      [[.LJ.]]    [[.Lj.]]    [[.lj.]]    ( Serbo-Croatian )
                                      [[.LL.]]    [[.Ll.]]    [[.ll.]]    ( Spanish )
                                      [[.NJ.]]    [[.Nj.]]    [[.nj.]]    ( Serbo-Croatian )
                                      [[.SS.]]    [[.Ss.]]    [[.ss.]]    ( German )
                                      

                                      As we know that :

                                        LJ  01C7  LATIN CAPITAL LETTER LJ
                                        Lj  01C8  LATIN CAPITAL LETTER L WITH SMALL LETTER J
                                        lj  01C9  LATIN SMALL LETTER LJ
                                      
                                        DZ  01F1  LATIN CAPITAL LETTER DZ
                                        Dz  01F2  LATIN CAPITAL LETTER D WITH SMALL LETTER Z
                                        dz  01F3  LATIN SMALL LETTER DZ
                                      

                                      If we apply the regex [[.dz.]-[.lj.][=dz=][=lj=]] against the text bcddzdzefghiijjklljljmn, pasted in a new tab, Columns++ would find 12 matches :

                                      dz
                                      dz
                                      e
                                      f
                                      g
                                      h
                                      i
                                      j
                                      k
                                      l
                                      lj
                                      lj
                                      

                                      Note that there is still one lower-case letter ʣ which is a di-graph letter. It could be added to the regex. Then, the regex [[.dz.]-[.lj.][=dz=][=lj=]]|\x{02A3}, against the text bcddzdzʣefghiijjklljljmn, would correctly return 13 matches :

                                      dz
                                      dz  ( \x{01F1} )
                                      ʣ  ( \x{02A3} )
                                      e
                                      f
                                      g
                                      h
                                      i
                                      j
                                      k
                                      l
                                      lj
                                      lj  ( \x{01C9} )
                                      

                                      You said in a previous post

                                      Should there be an option to make the POSIX classes and their escapes (such as \s, \w, [[:alnum:]], [[:punct:]]) match only ASCII characters ?

                                      And in your last post :

                                       Properties (\p{...} and \P{...}), named classes (like [[:ascii:]] or [[:lower:]]) and the \l and \u escapes now ignore the Match case setting and the (?i) flag: they are always case sensitive

                                      I do not think that this option is necessary as we can provide the same behaviour with the following regexes :

                                      • (?-i)(?=[[:ascii:]])\p{punct} or (?-i)(?=\p{punct})[[:ascii:]] gives 32 matches

                                      • (?-i)(?=[[:ascii:]])\u or (?-i)(?=\u)[[:ascii:]] gives 26 matches

                                      • (?-i)(?=[[:ascii:]])\l or (?-i)(?=\l)[[:ascii:]] gives 26 matches

                                       With your fourth experimental version, note that, obviously, the insensitive regexes (?i)(?=[[:ascii:]])\u or (?i)(?=\u)[[:ascii:]] or (?i)(?=[[:ascii:]])\l or (?i)(?=\l)[[:ascii:]] do give the same result

                                      And the regexes (?-i)(?=[[:ascii:]])[\u\l] or (?-i)(?=[\u\l])[[:ascii:]] do return 52 matches

                                       Another example : let's suppose that we run this regex (?-i)[A-F[:lower:]], against my Total_Chars.txt file. It does give 2,264 matches, so 6 UPPER letters + 2,258 LOWER letters

                                      With this fourth version, if we run the same regex, in an insensitive way, the (?i)[A-F[:lower:]] regex returns the same result 2,264 matches

                                       And note that the regex (?-i)[[:upper:][:lower:]] or (?i)[[:upper:][:lower:]] gives the same result in both cases and returns 4,116 matches ( so 1,858 UPPER letters + 2,258 LOWER letters )

                                       As I said in a previous post, the regexes (?-i)\u(?<=\l) and (?-i)(?=\l)\u do not find matches anymore, as the 31 characters of the Title Case Unicode category ( \p{Lt} ) are not concerned.


                                       Let’s see the behavior of the \X feature when accented characters are involved :

                                       Paste this accented letter ŏ̈̅ in a new tab. This character is composed of a single lowercase letter o, followed by three combining diacritical marks, i.e. the sequence \x{006F}\x{0306}\x{0308}\x{0305}. Note that if you exchange the order of two accents in the search regex ( for example \x{006F}\x{0308}\x{0306}\x{0305} ), it logically returns 0 matches !

                                      We can see the different relations :

                                      • \x{006F}\x{0306}\x{0308}\x{0305} returns 1 match

                                      • \X returns 1 match as it considers the base letter with ALL its diacritical marks ( idem as above )

                                       • (?!\X). returns 3 matches, i.e. the number of combining diacritical marks ( accents ) only
 
                                       • . considers each character individually, so returns 4 characters

                                       Now, let’s imagine that I reverse the order of all the accents. So, paste this accented letter o̅̈̆ in a new tab. This character is composed of a single lowercase letter o, followed by three combining diacritical marks, i.e. the sequence \x{006F}\x{0305}\x{0308}\x{0306}. However, note that the corresponding glyph looks slightly different from the previous one !!

                                      This time :

                                      • \x{006F}\x{0305}\x{0308}\x{0306} returns 1 match

                                      • \X returns 1 match as it considers the base letter with ALL its diacritical marks ( idem as above )

                                       • (?!\X). returns 3 matches, i.e. the number of combining diacritical marks ( accents ) only
 
                                       • . considers each character individually, so returns 4 characters


                                      Let’s see, now, the behavior of the \X feature when emoji characters are involved :

                                      Paste the RAINBOW FLAG / PRIDE FLAG emoji 🏳️‍🌈 in a new tab. This pseudo-character is composed of the four individual chars :

                                      🏳 ( \x{1F3F3} )
                                      ️ ( \x{FE0F} = VS-16 )
                                      ‍ ( \x{200D} = ZWJ )
                                      🌈 ( \x{1F308} )

                                      Again, we deduce the same relations :

                                      • \x{1F3F3}\x{FE0F}\x{200D}\x{1F308} returns 1 match : the RAINBOW FLAG / PRIDE FLAG emoji

                                       • \X returns 1 match as it considers the base flag and ALL the subsequent characters which define the RAINBOW FLAG / PRIDE FLAG ( idem as above )

                                       • (?!\X). returns 3 matches, i.e. the number of characters associated with the base flag only.

                                      Note that, here, the last char 🌈 ( \x{1F308} ) is not considered as a true char, but rather as an associated char defining the RAINBOW FLAG

                                       • . considers each character individually, so returns 4 characters

                                       In contrast, if we consider this sequence 🏳🌈, i.e. the two characters \x{1F3F3} followed by \x{1F308} :

                                      • \x{1F3F3}\x{1F308} returns 1 match

                                      • \X returns 2 matches as it considers each char as a base character

                                       • (?!\X). returns 0 matches as no associated character is involved in this sequence

                                       • . considers each character individually, so returns 2 characters
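
                                       For comparison, the Python regex module implements the same kind of extended-grapheme-cluster rules for \X ( assuming a reasonably recent version with up-to-date Unicode data ), so both the accented-letter case and the emoji case can be checked outside Notepad++ :

                                           import regex
                                           
                                           o_marks = "o\u0306\u0308\u0305"                  # o + three combining diacritical marks
                                           flag    = "\U0001F3F3\uFE0F\u200D\U0001F308"     # RAINBOW FLAG emoji ZWJ sequence
                                           for s in (o_marks, flag):
                                               print(len(regex.findall(r'(?s).', s)),       # individual characters -> 4
                                                     len(regex.findall(r'\X', s)))          # grapheme clusters     -> 1
                                           
                                           bare = "\U0001F3F3\U0001F308"                    # the two emoji without VS-16 / ZWJ
                                           print(len(regex.findall(r'\X', bare)))           # -> 2 separate clusters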


                                       I also tried a search-and-replace containing two backtracking control verbs, (*SKIP) and (*FAIL) ( or (*F), which is equivalent to (?!) ), against this text : This is a test with some {text} to see {if} it works :

                                      • FIND {[^}]*}(*SKIP)(*FAIL)|\b\w+\b and REPLACE --$0--

                                      • FIND {[^}]*}(*SKIP)(*F)|\b\w+\b and REPLACE --$0--

                                      • FIND {[^}]*}(*SKIP)(?!)|\b\w+\b and REPLACE --$0--

                                      All these replacements correctly surrounded all the words of the example with two dashes, except for the words already between braces !

                                      Remark :

                                      Of course, these replacements could have been done without the use of the backtracking control verbs :

                                      FIND {[^}]*}|(\b\w+\b) and REPLACE ?1--$0--:$0

                                      FIND ({[^}]*}.*?\K)?\b\w+\b and REPLACE --$0--
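
                                       These verbs are not specific to Boost ; for instance, recent versions of the Python regex module also document (*SKIP), (*FAIL) and \K, so the same replacement can be sketched there ( an illustration only — Columns++ itself uses the Boost engine ) :

                                           import regex   # third-party module ; recent versions support (*SKIP), (*FAIL) and \K
                                           
                                           text = "This is a test with some {text} to see {if} it works"
                                           print(regex.sub(r'\{[^}]*\}(*SKIP)(*FAIL)|\b\w+\b', r'--\g<0>--', text))
                                           # -> --This-- --is-- --a-- --test-- --with-- --some-- {text} --to-- --see-- {if} --it-- --works--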


                                       To sum up, @coises, the key points of your fourth experimental version are :

                                       • A major regex engine, implemented in UTF-32, which correctly handles all the Unicode characters, from \x{0} to \x{0010FFFF}, and correctly manages all the Unicode character classes \p{Xy} or [[:Xy:]] as well as all the POSIX character classes.

                                       • Additional features such as \i, \o and \y and their complements ( the \m and \M options should be removed ! )

                                       • The \X regex feature correctly works with any character, UNDER or OVER the BMP ( accented letters, Emoji and probably the Indic and Hangul scripts, although not tested )

                                      • The invalid UTF-8 characters may be kept, replaced or deleted ( FIND \i+, REPLACE --> $0 <--Invalid )

                                      • The NUL character can be placed in replacement ( FIND ABC\x00XYZ, REPLACE \x0--$0--\x{00} )

                                       • The correct handling of case replacements, even with accented characters ( FIND (?-s). REPLACE \U$0 )

                                      • The \K feature ALSO works in a step-by-step replacement with the Replace button ( FIND ^.{20}\K(.+), REPLACE --\1-- )


                                       So, @coises, all your improvements for a powerful UNICODE regex engine, based on the Boost regex engine, are really awesome. I think that your project is fully mature, and that you just have to find some time to build the documentation, in the same way as the present one, which is a pleasure to consult !

                                      Best Regards,

                                      guy038

                                      P.S. :

                                       At the end, I’ll put a new version of my Unicode.zip archive, in my Google Drive account, referring to your latest experimental version of ColumnsPlusPlus, which greatly simplifies the regex syntax needed to count or mark all chars of Unicode ranges !

                                      CoisesC 1 Reply Last reply Reply Quote 1
                                      • CoisesC
                                        Coises @guy038
                                        last edited by

                                        @guy038

                                        Thank you so much for all your testing!

                                        As I think you’ve seen, I decided to release a nominally “stable” version that has only a few small changes from the fourth experimental version and with (I think) up-to-date documentation.

                                        I will be taking a closer look at the equivalence classes, and your tests and references will be very helpful. I also need to sort out the situation with digraphs and ligatures. (Unicode distinguishes between the two, and I might be forgetting a third thing — all cases where what we normally think of as multiple characters act as one, sometimes visually, sometimes for collation even when they are visually separate.) Particularly vexing is that this should affect character ranges depending on the active locale; e.g., in Slovak, “ch” sorts between “h” and “i”; in Spanish, it sorts between “c” and “d” and in English, it isn’t a digraph at all, just two letters. So [[.ch.]-i] should match “f” in Spanish but not in Slovak… and in English it makes no sense at all, because ch isn’t a digraph in English.

                                        This sort of thing is why I decided to table the whole notion until I can rest and regroup.

                                        1 Reply Last reply Reply Quote 1