    Columns++ version 1.3: All Unicode, all the time

    Notepad++ & Plugin Development
    • CoisesC
      Coises
      last edited by Coises

      Columns++ version 1.3 brings the enhancements for regular expressions in Unicode documents to ANSI documents as well:

      • Regular expressions now match based on Unicode code points in all documents, so the syntax and semantics of regular expressions are no longer dependent on the underlying representation in Scintilla. The features added in version 1.2 for Unicode documents now work in all documents.

      • Regular expressions did not work properly in ANSI documents for the system default code pages 932 (Japanese, Shift-JIS), 936 (Chinese Simplified, GB2312), 949 (Korean, Windows-949 / Unified Hangul Code) or 950 (Chinese Traditional, Big5) in version 1.2. Regular expressions now match these documents based on Unicode code points.

      • Character information tables for regular expressions were updated to reflect Unicode 17.0.0 (released on 2025-09-09).
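To illustrate why matching on code points matters for the multibyte ANSI code pages listed above, here is a minimal Python sketch. It is not Columns++ code; the sample string and variable names are made up for illustration. It decodes Shift-JIS (code page 932) bytes to Unicode before applying a range, so the range is read as code points rather than bytes.

```python
import re

# Hypothetical sample: two hiragana followed by ASCII text.
sample = "かな kana"

# What an ANSI document on a code page 932 system would hold: Shift-JIS bytes.
raw = sample.encode("cp932")

# Each hiragana occupies two bytes, so byte offsets and character offsets differ.
byte_length = len(raw)
char_length = len(sample)

# Decode first, then match: the range below is a range of Unicode code points
# (U+3041..U+3096, the hiragana letters), independent of the byte encoding.
decoded = raw.decode("cp932")
hiragana = re.findall("[\u3041-\u3096]", decoded)

print(byte_length, char_length, hiragana)
```

The same range applied directly to the raw bytes would have no sensible meaning, since a single hiragana spans two bytes.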

      I haven’t yet set this as “stable” (and of course it isn’t yet in the plugins admin list). Anyone feeling adventuresome is welcome to try it and see if it behaves. There is documentation describing how Unicode-based matching in Columns++ differs from Notepad++ matching.

      If anyone reading this routinely has one of the CJK code pages as their system default code page, I would be very interested to know if regular expressions in this version appear to work as expected on your locale’s ANSI files.

      • guy038G
        guy038
        last edited by

        Hello, @coises and All,

        I’ve just tried your latest ColumnsPlusPlus v1.3 release and, indeed, the search now behaves as a true Unicode search, whatever the encoding of each individual file !

        Let’s consider this simple UTF-8 text :

        This ‟ is a † very • small ‰ text ‱ for › test
           201F    2020   2022   2030    2031    203A  in Unicode UTF-8 encoding
        

        And this ANSI text :

        This ? is a † very • small ‰ text ? for › test
             ?     0086   0095    0089    ?    009B   in Windows-1252 encoding
        

        IMPORTANT : When this second text is opened in N++, don’t forget to run the Encoding > Convert to ANSI option first !


        Now, we can create the following table, which recapitulates the Non-ASCII characters used in my examples :

            •--------•-----------------•-----------------•
            |        |   Windows-1252  |     Unicode     |
            |        •--------•--------•--------•--------•
            |  Char  |   Dec  |   Hex  |   Dec  |   Hex  |
            •--------•--------•--------•--------•--------•
            |   ‟    |   ?    |   ?    |  8223  |  201F  |
            |        |        |        |        |        |
            |   †    |  0134  |  0086  |  8224  |  2020  |
            |        |        |        |        |        |
            |   •    |  0149  |  0095  |  8226  |  2022  |
            |        |        |        |        |        |
            |   ‰    |  0137  |  0089  |  8240  |  2030  |
            |        |        |        |        |        |
            |   ‱  |   ?    |   ?    |  8241  |  2031  |
            |        |        |        |        |        |
            |   ›    |  0155  |  009B  |  8250  |  203A  |
            •--------•--------•--------•--------•--------•
        

        • In Notepad++ :

          • Within an ANSI file, the regexes [†-‰] or [\x86-\x89] would only find the characters † and ‰ but not the •, whose Win-1252 code ( \x95 ) is after \x89

          • Within a UTF-8 file, the regexes [†-‰] or [\x{2020}-\x{2030}] would find the characters † and ‰ and also the •, whose Unicode code-point is between 2020 and 2030

        • In Columns++ :

          • Within an ANSI file, the regexes [†-‰] or [\x{2020}-\x{2030}] would find the characters † and ‰ and also the •, whose Unicode code-point is between 2020 and 2030

          • Within a UTF-8 file, the regexes [†-‰] or [\x{2020}-\x{2030}] would find the characters † and ‰ and also the •, whose Unicode code-point is between 2020 and 2030

        Note that, using the range [†-›] within an ANSI file, an N++ search of the • char would have been successful, as its Win-1252 code ( \x95 ) lies within the \x86 to \x9B range !
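The difference between the two kinds of range can be sketched in a few lines of Python. This is only an illustration of the comparison rule, not the code of either search engine, and the function names are hypothetical.

```python
def in_range_by_code_point(ch, lo, hi):
    """Range test on Unicode code points."""
    return ord(lo) <= ord(ch) <= ord(hi)


def in_range_by_cp1252_byte(ch, lo, hi):
    """Range test on single Windows-1252 byte values."""
    def value(c):
        return c.encode("cp1252")[0]
    return value(lo) <= value(ch) <= value(hi)


# [†-‰] read as code points U+2020..U+2030: the bullet U+2022 is inside.
bullet_in_unicode_range = in_range_by_code_point("•", "†", "‰")

# [†-‰] read as bytes \x86..\x89: the bullet \x95 is outside.
bullet_in_byte_range = in_range_by_cp1252_byte("•", "†", "‰")

# [†-›] read as bytes \x86..\x9B: the bullet \x95 is inside.
bullet_in_wider_byte_range = in_range_by_cp1252_byte("•", "†", "›")

print(bullet_in_unicode_range, bullet_in_byte_range, bullet_in_wider_byte_range)
```

Note that ‟ and ‱ have no Windows-1252 byte at all, so `"‟".encode("cp1252")` raises `UnicodeEncodeError`; this corresponds to the ? entries in the table.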


        Now, @coises, I cannot easily test the CJK behaviour of your new search engine, as I obviously do not have a default CJK code page, which such a study needs ! However, I do not see why your new search behavior couldn’t be applied to any kind of Unicode chars ;-)

        Best Regards,

        guy038

        • guy038G
          guy038
          last edited by guy038

          Hello, @coises and All,

          When I first used the v1.3 release of Columns++, I did not pay attention to the fact that, among the new features, there was the point :

          Character information tables for regular expressions were updated to reflect Unicode 17.0.0 (released on 2025-09-09)

          So, in this post, I re-tested all the regex features of the v1.3 release, which you’ll find below, and I am pleased to tell you that ALL results are correct, EXCEPT for one thing :

          Indeed, there’s a bug, somehow, regarding the Mark characters :

          • Open this file in your browser : https://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt

          • Then, hit the Ctrl + F shortcut within your browser and search for the string Nonsp, within the DerivedGeneralCategory.txt file

          • Under the first occurrence, you should see :

          # General_Category=Nonspacing_Mark
                             ¯¯¯¯¯
          0300..036F    ; Mn # [112] COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X
          0483..0487    ; Mn #   [5] COMBINING CYRILLIC TITLO..COMBINING CYRILLIC POKRYTIE
          0591..05BD    ; Mn #  [45] HEBREW ACCENT ETNAHTA..HEBREW POINT METEG
          

          The first line clearly shows that the 112 characters of the COMBINING DIACRITICAL MARKS Unicode block (refer to https://www.unicode.org/charts/PDF/U0300.pdf ) are considered, by the Unicode Consortium, as Non Spacing Mark characters !
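As a quick cross-check of the bracketed counts in those sample lines, here is a small Python sketch that parses them and recomputes each range size. The regular expression and the helper name are my own assumptions, and it only handles range lines of the form `XXXX..YYYY`, not the single code point lines that the real file also contains.

```python
import re

# The three sample lines quoted above from DerivedGeneralCategory.txt.
sample = """\
0300..036F    ; Mn # [112] COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X
0483..0487    ; Mn #   [5] COMBINING CYRILLIC TITLO..COMBINING CYRILLIC POKRYTIE
0591..05BD    ; Mn #  [45] HEBREW ACCENT ETNAHTA..HEBREW POINT METEG
"""

# first..last ; category # [stated count]
LINE = re.compile(r"^([0-9A-F]+)\.\.([0-9A-F]+)\s*;\s*(\w+)\s*#\s*\[(\d+)\]")


def range_sizes(text):
    """Yield (first, last, category, computed_size, stated_size) per range line."""
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            first, last = int(m.group(1), 16), int(m.group(2), 16)
            yield first, last, m.group(3), last - first + 1, int(m.group(4))


rows = list(range_sizes(sample))
for first, last, category, computed, stated in rows:
    print(f"{first:04X}..{last:04X} {category} computed={computed} stated={stated}")
```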

          And, indeed, if I use the regex [\x{0300}-\x{036F}] against my Total_Chars.txt file, it correctly returns 112 occurrences, and if I use the \p{Mn} regex, it correctly returns 2,059 occurrences as well.

          However, when I test the regexes (?=[\x{0300}-\x{036F}])\p{M*} or (?=\p{M*})[\x{0300}-\x{036F}] or, more precisely, the regexes (?=[\x{0300}-\x{036F}])\p{Mn} or (?=\p{Mn})[\x{0300}-\x{036F}], they ONLY return 111 occurrences and NOT 112 ! Did I make a mistake ?


          Now, against the Total_Chars.txt file, all these general results, below, are correct :

          (?s).  =  \I  =  \p{Any}  =  [\x{0000}-\x{EFFFD}]                                         =>                Total =  325,590
          
          
          \p{Unicode}  =  [[:Unicode:]]                                                             =>  325,334    |
                                                                                                                   |  Total =  325,590
          \P{Unicode}  =  [[:^Unicode:]]                                                            =>      256    |
          
          
          \p{Ascii}  =  \o                                                                          =>      128    |
                                                                                                                   |  Total =  325,590
          \P{Ascii}  =  \O                                                                          =>  325,462    |
          
          
          \X                                                                                        =>  322,586    |
                                                                                                                   |  Total =  325,590
          (?!\X).                                                                                   =>    3,004    |
          
          
          [\x{E000}-\x{F8FF}]|\y     =  [\x{E000}-\x{F8FF}]|[[:defined:]]      =  \p{Assigned}      =>  166,266    |
                                                                                                                   |  Total =  325,590
          (?![\x{E000}-\x{F8FF}])\Y  =  (?![\x{E000}-\x{F8FF}])[^[:defined:]]  =  \p{Not Assigned}  =>  159,324    |
          
          

          Note : if we add, to the number of characters of Total_Chars.txt, the contents of any omitted planes ( Planes 4 to 13, 16 and 17 ), less the TWO non-characters for each, plus the Surrogate characters and all the Unicode non-chars, we obtain :

          325,590 + (65536 - 2) * 12 + 2,048 + 66 = 1,114,112 which is, indeed, the total number of Unicode code points, both assigned and not assigned !
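The arithmetic can be checked mechanically. In this short Python sketch the constant names are mine; the figures are the ones given in this post.

```python
TOTAL_CHARS_FILE = 325_590   # characters counted in Total_Chars.txt
OMITTED_PLANES = 12          # the planes left out of the file
NONCHARS_PER_PLANE = 2       # the last two code points of each plane
SURROGATES = 2_048           # U+D800..U+DFFF
NONCHARACTERS = 66           # all Unicode noncharacters

grand_total = (
    TOTAL_CHARS_FILE
    + (65_536 - NONCHARS_PER_PLANE) * OMITTED_PLANES
    + SURROGATES
    + NONCHARACTERS
)

# The Unicode code space: 17 planes of 65,536 code points each.
CODE_SPACE = 0x110000

print(grand_total, CODE_SPACE, grand_total == CODE_SPACE)
```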


          Here are the correct results, concerning all the Posix character classes, against the Total_Chars.txt file

          [[:ascii:]]                                                        an UNDER \x{0080}         character                     128   =  [\x{0000}-\x{007F}]  =  \p{ascii} = \o
          
          [[:unicode:]]  =  \p{unicode}                                      an OVER  \x{00FF}         character                 325,334   =  [\x{0100}-\x{EFFFD}] ( restricted to 'Total_Chars.txt' )
          
          
          [[:space:]]    =  \p{space}  =  [[:s:]]  =  \p{s}  =  \ps  =  \s   a             WHITE-SPACE character                      25   =  [\t\n\x{000B}\f\r\x20\x{0085}\x{00A0}\x{1680}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}]
                                          [[:h:]]  =  \p{h}  =  \ph  =  \h   an HORIZONTAL white space character                      18   =  [\t\x20\x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}]  =  \p{Zs}|\t
          [[:blank:]]    =  \p{blank}                                        a  BLANK                  character                      18   =  [\t\x20\x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}]  =  \p{Zs}|\t
                                          [[:v:]]  =  \p{v}  =  \pv  =  \v   a  VERTICAL   white space character                       7   =  [\n\x{000B}\f\r\x{0085}\x{2028}\x{2029}]
          
          [[:cntrl:]]    =  \p{cntrl}                                        a  CONTROL code           character                      65   =  [\x{0000}-\x{001F}\x{007F}\x{0080}-\x{009F}]
          
          [[:upper:]]    =  \p{upper}  =  [[:u:]]  =  \p{u}  =  \pu  =  \u   an  UPPER case    letter                              1,886   =  \p{Lu}
          [[:lower:]]    =  \p{lower}  =  [[:l:]]  =  \p{l}  =  \pl  =  \l   a   LOWER case    letter                              2,283   =  \p{Ll}
                                                                             a   DI-GRAPHIC    letter                                 31   =  \p{Lt}
                                                                             a   MODIFIER      letter                                410   =  \p{Lm}
                                                                             an  OTHER         letter + SYLLABLES / IDEOGRAPHS   141,062   =  \p{Lo}
          [[:digit:]]    =  \p{digit}  =  [[:d:]]  =  \p{d}  =  \pd  =  \d   a   DECIMAL       number                                770   =  \p{Nd}
           _             =  \x{005F}                                         the LOW_LINE      character                               1
                                                                                                                              -----------
          [[:word:]]     =  \p{word}   =  [[:w:]]  =  \p{w}  =  \pw  =  \w   a   WORD                  character                 146,443   =  \p{L*}|\p{nd}|_
          
          [[:alnum:]]    =  \p{alnum}                                        an  ALPHANUMERIC          character                 146,442   =  \p{L*}|\p{nd}
          
          [[:alpha:]]    =  \p{alpha}                                        any LETTER                character                 145,672   =  \p{L*}
          
          [[:graph:]]    =  \p{graph}                                        any VISIBLE               character                 159,612
          
          [[:print:]]    =  \p{print}                                        any PRINTABLE             character                 159,637   =  [[:graph:]]|\s
          
          [[:punct:]]    =  \p{punct}                                        any PUNCTUATION or SYMBOL character                   9,473   =  \p{P*}|\p{S*}  =  \p{Punctuation}|\p{Symbol}  = 856 + 8,617
          
          [[:xdigit:]]                                                       an HEXADECIMAL            character                      22   =  [0-9A-Fa-f]
          

          And here, are the correct results regarding the Unicode character classes, against the Total_Chars.txt file :

                     \p{Any}                           Any character                                              325,590  =  (?s).  =  \I  =  [\x{0000}-\x{EFFFD}]
          
                     \p{Ascii}                         a character UNDER \x80                                         128  =  [[:ascii:]]  =  \o
          
                     \p{Assigned}                      an ASSIGNED character                                      166,266   ( of Total_Chars.txt, ONLY )
          
          \p{Cc}  =  \p{Control}                       a  C0 or C1 CONTROL code       character                        65
          \p{Cf}  =  \p{Format}                        a  FORMAT CONTROL              character                       170
          \p{Cn}  =  \p{Not Assigned}                  an UNASSIGNED or NON-CHARACTER character                   159,324   ( 'Total_Chars.txt' does NOT contain the 66 NON-CHARACTER chars )
          \p{Co}  =  \p{Private Use}                   a  PRIVATE-USE                 character                     6,400
          \p{Cs}  =  \p{Surrogate}    (INVALID regex)  a  SURROGATE                   character                    [2,048]  ( 'Total_Chars.txt' does NOT contain the 2,048 SURROGATE chars )
                                                                                                                -----------
          \p{C*}  =  \p{Other}                                                                                    165,959  =  \p{Cc}|\p{Cf}|\p{Cn}|\p{Co}
          
          \p{Lu}  =  \p{Uppercase Letter}              an UPPER case letter                                         1,886  =  \u  =  [[:upper:]]  =  \p{upper}
          \p{Ll}  =  \p{Lowercase Letter}              a  LOWER case letter                                         2,283  =  \l  =  [[:lower:]]  =  \p{lower}
          \p{Lt}  =  \p{Titlecase}                     a  DI-GRAPHIC letter                                            31
          \p{Lm}  =  \p{Modifier Letter}               a  MODIFIER   letter                                           410
          \p{Lo}  =  \p{Other Letter}                  OTHER LETTER, including SYLLABLES and IDEOGRAPHS           141,062
                                                                                                                -----------
          \p{L*}  =  \p{Letter}                                                                                   145,672  =  \p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}  =  [[:alpha:]]   =  \p{alpha}
          
          \p{Mc}  =  \p{Spacing Combining Mark}        a  SPACING COMBINING     mark (POSITIVE advance width)         471
          \p{Me}  =  \p{Enclosing Mark}                an ENCLOSING COMBINING   mark                                   13
          \p{Mn}  =  \p{Non-Spacing Mark}              a  NON-SPACING COMBINING mark (ZERO     advance width)       2,059
                                                                                                                  ---------
          \p{M*}  =  \p{Mark}                                                                                       2,543  =  \p{Mc}|\p{Me}|\p{Mn}
          
          \p{Nd}  =  \p{Decimal Digit Number}          a DECIMAL number     character                                 770
          \p{Nl}  =  \p{Letter Number}                 a LETTERLIKE numeric character                                 239
          \p{No}  =  \p{Other Number}                  OTHER NUMERIC        character                                 915
                                                                                                                  ---------
          \p{N*}  =  \p{Number}                                                                                     1,924  =  \p{Nd}|\p{Nl}|\p{No}
          
          \p{Pd}  =  \p{Dash Punctuation}              a  DASH or HYPHEN punctuation mark                              27
          \p{Ps}  =  \p{Open Punctuation}              an OPENING    PUNCTUATION     mark, in a pair                   79
          \p{Pc}  =  \p{Connector Punctuation}         a  CONNECTING PUNCTUATION     mark                              10
          \p{Pe}  =  \p{Close Punctuation}             a  CLOSING    PUNCTUATION     mark, in a pair                   77
          \p{Pi}  =  \p{Initial Punctuation}           an INITIAL QUOTATION          mark                              12
          \p{Pf}  =  \p{Final Punctuation}             a  FINAL   QUOTATION          mark                              10
          \p{Po}  =  \p{Other Punctuation}             OTHER PUNCTUATION             mark                             641
                                                                                                                    -------
          \p{P*}  =  \p{Punctuation}                                                                                  856  =  \p{Pd}|\p{Ps}|\p{Pc}|\p{Pe}|\p{Pi}|\p{Pf}|\p{Po}
          
          \p{Sm}  =  \p{Math Symbol}                   a MATHEMATICAL symbol     character                            960
          \p{Sc}  =  \p{Currency Symbol}               a CURRENCY                character                             64
          \p{Sk}  =  \p{Modifier Symbol}               a NON-LETTERLIKE MODIFIER character                            125
          \p{So}  =  \p{Other Symbol}                  OTHER SYMBOL              character                          7,468
          
          \p{S*}  =  \p{Symbol}                                                                                     8,617  =  \p{Sm}|\p{Sc}|\p{Sk}|\p{So}
          
          \p{Zs}  =  \p{Space Separator}               a   NON-ZERO width SPACE   character                            17  =  [\x{0020}\x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}]  =  (?!\t)\h
          \p{Zl}  =  \p{Line Separator}                the LINE SEPARATOR         character                             1  =  \x{2028}
          \p{Zp}  =  \p{Paragraph Separator}           the PARAGRAPH SEPARATOR    character                             1  =  \x{2029}
                                                                                                                     ------
          \p{Z*}  =  \p{Separator}                                                                                     19  =  \p{Zs}|\p{Zl}|\p{Zp}
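
The subtotals in this table are consistent with each other and with the overall total. The Python sketch below simply re-adds the per-category figures reported above (Cs is left out, as Total_Chars.txt contains no surrogates); the dictionary layout is only an illustration.

```python
# Per-category counts as reported in the table above.
counts = {
    "C": {"Cc": 65, "Cf": 170, "Cn": 159_324, "Co": 6_400},
    "L": {"Lu": 1_886, "Ll": 2_283, "Lt": 31, "Lm": 410, "Lo": 141_062},
    "M": {"Mc": 471, "Me": 13, "Mn": 2_059},
    "N": {"Nd": 770, "Nl": 239, "No": 915},
    "P": {"Pd": 27, "Ps": 79, "Pc": 10, "Pe": 77, "Pi": 12, "Pf": 10, "Po": 641},
    "S": {"Sm": 960, "Sc": 64, "Sk": 125, "So": 7_468},
    "Z": {"Zs": 17, "Zl": 1, "Zp": 1},
}

# Sum each general category group, then all groups together.
subtotals = {group: sum(members.values()) for group, members in counts.items()}
grand_total = sum(subtotals.values())

# \p{Assigned} should be everything except \p{Cn}.
assigned = grand_total - counts["C"]["Cn"]

print(subtotals)
print(grand_total, assigned)
```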
          

          Remark :

          • A negative POSIX character class can be expressed as [^[:........:]] or [[:^........:]]

          • A negative UNICODE character class can be expressed as \P{..}, with an uppercase letter P


          Now, if you follow the procedure explained in the last part of this post :

          https://community.notepad-plus-plus.org/post/99844

          The regexes [\x{DC80}-\x{DCFF}] or \i or [[:invalid:]] do give 134 occurrences, which is the exact number of invalid UTF-8 characters of that example !

          Continuation on next post

          • guy038G
            guy038
            last edited by guy038

            Hi @Coises and All,

            Continuation and end of my reply :

            I also tested ALL the Equivalence classes feature, against the Total_Chars.txt file.

            With Columns++, we can use ANY equivalent character to get the total number of matches of the equivalence class character

            For instance, [[=Ⱥ=]] = [[=ⱥ=]] = [[=Ɐ=]] always gives 86 matches, whereas the native N++ Boost engine is less coherent and sometimes displays a wrong number of occurrences :-((

            Here is, below, the list of all equivalences of any char of the Windows-1252 code-page, from \x{0020} till \x{00DE}. Note that, except for the DEL character, kept as an example, I did not consider the equivalence classes which only return 1 match !

            I also confirm, that I did not find any character over \x{FFFF} which would be part of a regex equivalence class, either with our Boost engine or with the Columns++ search engine !

            [[= =]]   =   [[=space=]]                  =>     3    (     )
            [[=!=]]   =   [[=exclamation-mark=]]       =>     2    ( !! )
            [[="=]]   =   [[=quotation-mark=]]         =>     3    ( "⁍" )
            [[=#=]]   =   [[=number-sign=]]            =>     4    ( #؞⁗# )
            [[=$=]]   =   [[=dollar-sign=]]            =>     3    ( $⁒$ )
            [[=%=]]   =   [[=percent-sign=]]           =>     3    ( %⁏% )
            [[=&=]]   =   [[=ampersand=]]              =>     3    ( &⁋& )
            [[='=]]   =   [[=apostrophe=]]             =>     2    ( '' )
            [[=(=]]   =   [[=left-parenthesis=]]       =>     4    ( (⁽₍( )
            [[=)=]]   =   [[=right-parenthesis=]]      =>     4    ( )⁾₎) )
            [[=*=]]   =   [[=asterisk=]]               =>     2    ( ** )
            [[=+=]]   =   [[=plus-sign=]]              =>     6    ( +⁺₊﬩﹢+ )
            [[=,=]]   =   [[=comma=]]                  =>     2    ( ,, )
            [[=-=]]   =   [[=hyphen=]]                 =>     3    ( -﹣- )
            [[=.=]]   =   [[=period=]]                 =>     3    ( .․. )
            [[=/=]]   =   [[=slash=]]                  =>     2    ( // )
            
            [[=0=]]   =   [[=zero=]]                   =>    48    ( 0٠۟۠۰߀०০੦૦୦୵௦౦౸೦൦๐໐༠၀႐០᠐᥆᧐᪀᪐᭐᮰᱀᱐⁰₀↉⓪⓿〇㍘꘠ꛯ꠳꣐꤀꧐꩐꯰0 )
            [[=1=]]   =   [[=one=]]                    =>    54    ( 1¹١۱߁१১੧૧୧௧౧౹౼೧൧๑໑༡၁႑፩១᠑᥇᧑᧚᪁᪑᭑᮱᱁᱑₁①⑴⒈⓵⚀❶➀➊〡㋀㍙㏠꘡ꛦ꣑꤁꧑꩑꯱1 )
            [[=2=]]   =   [[=two=]]                    =>    54    ( 2²ƻ٢۲߂२২੨૨୨௨౨౺౽೨൨๒໒༢၂႒፪២᠒᥈᧒᪂᪒᭒᮲᱂᱒₂②⑵⒉⓶⚁❷➁➋〢㋁㍚㏡꘢ꛧ꣒꤂꧒꩒꯲2 )
            [[=3=]]   =   [[=three=]]                  =>    53    ( 3³٣۳߃३৩੩૩୩௩౩౻౾೩൩๓໓༣၃႓፫៣᠓᥉᧓᪃᪓᭓᮳᱃᱓₃③⑶⒊⓷⚂❸➂➌〣㋂㍛㏢꘣ꛨ꣓꤃꧓꩓꯳3 )
            [[=4=]]   =   [[=four=]]                   =>    51    ( 4٤۴߄४৪੪૪୪௪౪೪൪๔໔༤၄႔፬៤᠔᥊᧔᪄᪔᭔᮴᱄᱔⁴₄④⑷⒋⓸⚃❹➃➍〤㋃㍜㏣꘤ꛩ꣔꤄꧔꩔꯴4 )
            [[=5=]]   =   [[=five=]]                   =>    53    ( 5Ƽƽ٥۵߅५৫੫૫୫௫౫೫൫๕໕༥၅႕፭៥᠕᥋᧕᪅᪕᭕᮵᱅᱕⁵₅⑤⑸⒌⓹⚄❺➄➎〥㋄㍝㏤꘥ꛪ꣕꤅꧕꩕꯵5 )
            [[=6=]]   =   [[=six=]]                    =>    52    ( 6٦۶߆६৬੬૬୬௬౬೬൬๖໖༦၆႖፮៦᠖᥌᧖᪆᪖᭖᮶᱆᱖⁶₆ↅ⑥⑹⒍⓺⚅❻➅➏〦㋅㍞㏥꘦ꛫ꣖꤆꧖꩖꯶6 )
            [[=7=]]   =   [[=seven=]]                  =>    50    ( 7٧۷߇७৭੭૭୭௭౭೭൭๗໗༧၇႗፯៧᠗᥍᧗᪇᪗᭗᮷᱇᱗⁷₇⑦⑺⒎⓻❼➆➐〧㋆㍟㏦꘧ꛬ꣗꤇꧗꩗꯷7 )
            [[=8=]]   =   [[=eight=]]                  =>    50    ( 8٨۸߈८৮੮૮୮௮౮೮൮๘໘༨၈႘፰៨᠘᥎᧘᪈᪘᭘᮸᱈᱘⁸₈⑧⑻⒏⓼❽➇➑〨㋇㍠㏧꘨ꛭ꣘꤈꧘꩘꯸8 )
            [[=9=]]   =   [[=nine=]]                   =>    50    ( 9٩۹߉९৯੯૯୯௯౯೯൯๙໙༩၉႙፱៩᠙᥏᧙᪉᪙᭙᮹᱉᱙⁹₉⑨⑼⒐⓽❾➈➒〩㋈㍡㏨꘩ꛮ꣙꤉꧙꩙꯹9 )
            
            [[=:=]]   =   [[=colon=]]                  =>     2    ( :: )
            [[=;=]]   =   [[=semicolon=]]              =>     3    ( ;;; )
            [[=<=]]   =   [[=less-than-sign=]]         =>     3    ( <﹤< )
            [[===]]   =   [[=equals-sign=]]            =>     5    ( =⁼₌﹦= )
            [[=>=]]   =   [[=greater-than-sign=]]      =>     3    ( >﹥> )
            [[=?=]]   =   [[=question-mark=]]          =>     2    ( ?? )
            [[=@=]]   =   [[=commercial-at=]]          =>     2    ( @@ )
            
            [[=A=]]                                    =>    86    ( AaªÀÁÂÃÄÅàáâãäåĀāĂ㥹ǍǎǞǟǠǡǺǻȀȁȂȃȦȧȺɐɑɒͣᴀᴬᵃᵄᶏᶐᶛᷓḀḁẚẠạẢảẤấẦầẨẩẪẫẬậẮắẰằẲẳẴẵẶặₐÅ⒜ⒶⓐⱥⱭⱯⱰAa )
            [[=B=]]                                    =>    29    ( BbƀƁƂƃƄƅɃɓʙᴃᴮᴯᵇᵬᶀḂḃḄḅḆḇℬ⒝ⒷⓑBb )
            [[=C=]]                                    =>    40    ( CcÇçĆćĈĉĊċČčƆƇƈȻȼɔɕʗͨᴄᴐᵓᶗᶜᶝᷗḈḉℂ℃ℭ⒞ⒸⓒꜾꜿCc )
            [[=D=]]                                    =>    44    ( DdÐðĎďĐđƊƋƌƍɗʤͩᴅᴆᴰᵈᵭᶁᶑᶞᷘᷙḊḋḌḍḎḏḐḑḒḓⅅⅆ⒟ⒹⓓꝹꝺDd )
            [[=E=]]                                    =>    82    ( EeÈÉÊËèéêëĒēĔĕĖėĘęĚěƎƏǝȄȅȆȇȨȩɆɇɘəɚͤᴇᴱᴲᵉᵊᶒᶕḔḕḖḗḘḙḚḛḜḝẸẹẺẻẼẽẾếỀềỂểỄễỆệₑₔ℮ℯℰ⅀ⅇ⒠ⒺⓔⱸⱻEe )
            [[=F=]]                                    =>    22    ( FfƑƒᵮᶂᶠḞḟ℉ℱℲⅎ⒡ⒻⓕꜰꝻꝼꟻFf )
            [[=G=]]                                    =>    47    ( GgĜĝĞğĠġĢģƓƔǤǥǦǧǴǵɠɡɢɣɤʛˠᴳᵍᵷᶃᶢᷚᷛḠḡℊ⅁⒢ⒼⓖꝾꝿꞠꞡꞬꟋGg )
            [[=H=]]                                    =>    42    ( HhĤĥĦħȞȟɥɦʜʰʱͪᴴᶣḢḣḤḥḦḧḨḩḪḫẖₕℋℌℍℎℏ⒣ⒽⓗⱧⱨꞍꞪHh )
            [[=I=]]                                    =>    62    ( IiÌÍÎÏìíîïĨĩĪīĬĭĮįİıƖƗǏǐȈȉȊȋɨɩɪͥᴉᴵᵎᵢᵻᵼᶖᶤᶥᶦᶧḬḭḮḯỈỉỊịⁱℐℑⅈ⒤ⒾⓘꞮꟾIi )
            [[=J=]]                                    =>    24    ( JjĴĵǰȷɈɉɟʄʝʲᴊᴶᶡᶨⅉ⒥ⒿⓙⱼꞲJj )
            [[=K=]]                                    =>    39    ( KkĶķĸƘƙǨǩʞᴋᴷᵏᶄᷜḰḱḲḳḴḵₖK⒦ⓀⓚⱩⱪꝀꝁꝂꝃꝄꝅꞢꞣꞰKk )
            [[=L=]]                                    =>    58    ( LlĹĺĻļĽľĿŀŁłƚƛȽɫɬɭɮʟˡᴌᴸᶅᶩᶪᶫᷝᷞḶḷḸḹḺḻḼḽₗℒℓ⅂⅃⒧ⓁⓛⱠⱡⱢꝆꝇꝈꝉꞀꞁꞭꟜLl )
            [[=M=]]                                    =>    33    ( MmƜɯɰɱͫᴍᴟᴹᵐᵚᵯᶆᶬᶭᷟḾḿṀṁṂṃₘℳ⒨ⓂⓜⱮꝳꟽMm )
            [[=N=]]                                    =>    47    ( NnÑñŃńŅņŇňʼnƝƞǸǹȠɲɳɴᴎᴺᴻᵰᶇᶮᶯᶰᷠᷡṄṅṆṇṈṉṊṋⁿₙℕ⒩ⓃⓝꞤꞥNn )
            [[=O=]]                                    =>   106    ( OoºÒÓÔÕÖØòóôõöøŌōŎŏŐőƟƠơƢƣǑǒǪǫǬǭǾǿȌȍȎȏȪȫȬȭȮȯȰȱɵɶɷͦᴏᴑᴒᴓᴕᴖᴗᴼᵒᵔᵕᶱṌṍṎṏṐṑṒṓỌọỎỏỐốỒồỔổỖỗỘộỚớỜờỞởỠỡỢợₒℴ⒪ⓄⓞⱺꝊꝋꝌꝍOo )
            [[=P=]]                                    =>    33    ( PpƤƥɸᴘᴾᵖᵱᵽᶈᶲṔṕṖṗₚ℘ℙ⒫ⓅⓟⱣⱷꝐꝑꝒꝓꝔꝕꟼPp )
            [[=Q=]]                                    =>    16    ( QqɊɋʠℚ℺⒬ⓆⓠꝖꝗꝘꝙQq )
            [[=R=]]                                    =>    64    ( RrŔŕŖŗŘřƦȐȑȒȓɌɍɹɺɻɼɽɾɿʀʁʳʴʵʶͬᴙᴚᴿᵣᵲᵳᶉᷢᷣṘṙṚṛṜṝṞṟℛℜℝ⒭ⓇⓡⱤⱹꝚꝛꝜꝝꝵꝶꞂꞃRr )
            [[=S=]]                                    =>    50    ( SsŚśŜŝŞşŠšſƧƨƩƪȘșȿʂʃʅʆˢᵴᶊᶋᶘᶳᶴᷤṠṡṢṣṤṥṦṧṨṩẛₛ⒮ⓈⓢⱾꜱꟅSs )
            [[=T=]]                                    =>    47    ( TtŢţŤťƫƬƭƮȚțȶȾʇʈʧʨͭᴛᵀᵗᵵᶵᶿṪṫṬṭṮṯṰṱẗₜ⒯ⓉⓣⱦꜨꜩꝷꞆꞇꞱTt )
            [[=U=]]                                    =>    82    ( UuÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųƯưƱǓǔǕǖǗǘǙǚǛǜȔȕȖȗɄʉʊͧᴜᵁᵘᵤᵾᵿᶙᶶᶷᶸṲṳṴṵṶṷṸṹṺṻỤụỦủỨứỪừỬửỮữỰự⒰ⓊⓤUu )
            [[=V=]]                                    =>    29    ( VvƲɅʋʌͮᴠᵛᵥᶌᶹᶺṼṽṾṿỼỽ⒱ⓋⓥⱱⱴⱽꝞꝟVv )
            [[=W=]]                                    =>    28    ( WwŴŵƿǷʍʷᴡᵂẀẁẂẃẄẅẆẇẈẉẘ⒲ⓌⓦⱲⱳWw )
            [[=X=]]                                    =>    15    ( XxˣͯᶍẊẋẌẍₓ⒳ⓍⓧXx )
            [[=Y=]]                                    =>    36    ( YyÝýÿŶŷŸƳƴȲȳɎɏʎʏʸẎẏẙỲỳỴỵỶỷỸỹỾỿ⅄⒴ⓎⓨYy )
            [[=Z=]]                                    =>    42    ( ZzŹźŻżŽžƵƶȤȥɀʐʑᴢᵶᶎᶻᶼᶽᷦẐẑẒẓẔẕℤ℥ℨ⒵ⓏⓩⱫⱬⱿꝢꝣꟆZz )
            
            [[=[=]]   =   [[=left-square-bracket=]]    =>     2    ( [[ )
            [[=\=]]   =   [[=backslash=]]              =>     2    ( \\ )
            [[=]=]]   =   [[=right-square-bracket=]]   =>     2    ( ]] )
            [[=^=]]   =   [[=circumflex=]]             =>     3    ( ^ˆ^ )
            [[=_=]]   =   [[=underscore=]]             =>     2    ( __ )
            [[=`=]]   =   [[=grave-accent=]]           =>     4    ( `ˋ`` )
            [[={=]]   =   [[=left-curly-bracket=]]     =>     2    ( {{ )
            [[=|=]]   =   [[=vertical-line=]]          =>     2    ( || )
            [[=}=]]   =   [[=right-curly-bracket=]]    =>     2    ( }} )
            [[=~=]]   =   [[=tilde=]]                  =>     2    ( ~~ )
            [[==]] =   [[=DEL=]]                    =>     1    (  )
            
            [[=Œ=]]                                    =>     2    ( Œœ )
            [[=¢=]]                                    =>     3    ( ¢《¢ )
            [[=£=]]                                    =>     3    ( £︽£ )
            [[=¤=]]                                    =>     2    ( ¤》 )
            [[=¥=]]                                    =>     3    ( ¥︾¥ )
            [[=¦=]]                                    =>     2    ( ¦¦ )
            [[=¬=]]                                    =>     2    ( ¬¬ )
            [[=¯=]]                                    =>     2    ( ¯ ̄ )
            [[=´=]]                                    =>     2    ( ´´ )
            [[=·=]]                                    =>     2    ( ·· )
            [[=¼=]]                                    =>     4    ( ¼୲൳꠰ )
            [[=½=]]                                    =>     6    ( ½୳൴༪⳽꠱ )
            [[=¾=]]                                    =>     4    ( ¾୴൵꠲ )
            [[=Þ=]]                                    =>     6    ( ÞþꝤꝥꝦꝧ )
            

            Some double-letter characters give some equivalences which allow you to get the right single char to use, instead of the two trivial letters :

            [[=AE=]]   =   [[=Ae=]]   =   [[=ae=]]     =>    11 ( ÆæǢǣǼǽᴁᴂᴭᵆᷔ )
            [[=CH=]]   =   [[=Ch=]]   =   [[=ch=]]     =>     0 ( ? )
            [[=DZ=]]   =   [[=Dz=]]   =   [[=dz=]]     =>     6 ( DŽDždžDZDzdz )
            [[=LJ=]]   =   [[=Lj=]]   =   [[=lj=]]     =>     3 ( LJLjlj )
            [[=LL=]]   =   [[=Ll=]]   =   [[=ll=]]     =>     2 ( Ỻỻ )
            [[=NJ=]]   =   [[=Nj=]]   =   [[=nj=]]     =>     3 ( NJNjnj )
            [[=SS=]]   =   [[=Ss=]]   =   [[=ss=]]     =>     2 ( ßẞ )
            

            You said in a previous post :

            With Columns++, properties (like \p{digit} or \P{digit}), named classes (like [[:lower:]] or [[:^lower:]]) and escapes ( like \u or \U ) now ignore the Match case setting and the (?i) flag: they are always case sensitive

            Thus :

            • The regexes (?=[[:ascii:]])\p{punct} or (?=\p{punct})[[:ascii:]] always give 32 matches

            • The regexes (?=[[:ascii:]])\u or (?=\u)[[:ascii:]] always give 26 matches

            • The regexes (?=[[:ascii:]])\l or (?=\l)[[:ascii:]] always give 26 matches

            • The regexes (?=[[:ascii:]])[\u\l] or (?=[\u\l])[[:ascii:]] always return 52 matches
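
These ASCII counts can be cross-checked outside the editor. The Python sketch below uses the standard `unicodedata` module and assumes, as in the tables of my previous post, that punct means general categories P* and S*, \u means Lu and \l means Ll.

```python
import unicodedata

# The 128 ASCII characters, U+0000..U+007F.
ascii_chars = [chr(cp) for cp in range(0x80)]

# punct taken as \p{P*}|\p{S*}: general category starts with P or S.
ascii_punct = [c for c in ascii_chars if unicodedata.category(c)[0] in "PS"]

# \u taken as \p{Lu}, \l taken as \p{Ll}.
ascii_upper = [c for c in ascii_chars if unicodedata.category(c) == "Lu"]
ascii_lower = [c for c in ascii_chars if unicodedata.category(c) == "Ll"]

print(len(ascii_punct), len(ascii_upper), len(ascii_lower))
```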

            Other examples :

            • The regex [A-F[:lower:]] does give 2,289 matches, i.e. 6 UPPER letters + 2,283 LOWER letters

            • The regexes [[:upper:]]|[[:lower:]] and [[:upper:][:lower:]] act as insensitive regexes and return 4,169 matches ( i.e. 1,886 UPPER letters + 2,283 LOWER letters )


            So far, everything works as expected, except for the slight annoyance described at the beginning of my previous post !

            Best Regards

            guy038

            • CoisesC
              Coises @guy038
              last edited by Coises

              @guy038 said in Columns++ version 1.3: All Unicode, all the time:

              And, indeed, if I use the regex [\x{0300}-\x{036F}] against my Total_Chars.txt file, it correctly returns 112 occurrences, and if I use the \p{Mn} regex, it correctly returns 2,059 occurrences as well.

              However, when I test the regexes (?=[\x{0300}-\x{036F}])\p{M*} or (?=\p{M*})[\x{0300}-\x{036F}] or, more precisely, the regexes (?=[\x{0300}-\x{036F}])\p{Mn} or (?=\p{Mn})[\x{0300}-\x{036F}], they ONLY return 111 occurrences and NOT 112 ! Did I make a mistake ?

              This appears to be related to character U+0345. This character is a combining character, but it has an uppercase equivalent (U+0399) and a case folding (U+03B9) which is not a combining character.

              I think at least some of your tests must have been without match case checked?

              I do, however, find that with match case not checked, I see a count of 111 for [\x{0300}-\x{036F}] as well as for your other expressions. With match case checked, I see 112 for all of them.

              In regular Notepad++ Find, I get 112 either way for [\x{0300}-\x{036F}]. So there is something I am doing differently that is affecting ranges. I don’t yet know what it is. I will look into it.

              Thank you for the alert.

              Edit to add:

              I think what is happening is that when processing a range with match case unchecked (or (?i) in effect), the regex engine first does a case fold operation on both ends of the range, then does a case fold on each character to be matched to see if it falls in the range. All the characters from U+0300 to U+036F case fold to themselves except for U+0345, which case folds to U+03B9.

              No doubt Notepad++ native Find behaves differently because Boost::regex does not implement full Unicode case folding without either including ICU or otherwise supplying customized character traits (as Columns++ does).

              I agree that it is a somewhat bizarre behavior, but it is not clear what, if anything, I can do about it. Regex ranges with case insensitive matching, I think, are prone to unanticipated quirks. For example, in Notepad++ Find, [A-z] matches 58 characters when case sensitive and 52 characters when case insensitive. In Columns++ search, when case insensitive it matches 54 characters, because there are two non-ASCII characters, ſ, U+017F and K, U+212A, which case fold to ASCII s and k.
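The range behaviour described above can be modelled in a few lines. This is only a sketch in Python of the folding logic as explained in this post, not the Columns++ or Boost::regex code. The helper names are hypothetical, and `simple_fold` approximates Unicode simple case folding by using `str.casefold()` only when it yields a single character.

```python
def simple_fold(ch: str) -> str:
    """Approximate simple case folding: keep one-to-one folds only."""
    folded = ch.casefold()
    return folded if len(folded) == 1 else ch

def count_range(lo: int, hi: int, ignore_case: bool) -> int:
    """Count the code points matched by the range [lo-hi] under this model."""
    if not ignore_case:
        return hi - lo + 1
    # Fold both ends of the range, then fold each candidate character
    # and test whether the folded character falls inside the folded range.
    flo, fhi = simple_fold(chr(lo)), simple_fold(chr(hi))
    return sum(1 for cp in range(0x110000)
               if flo <= simple_fold(chr(cp)) <= fhi)

# U+0345 is the only character of U+0300..U+036F that does not fold to itself.
print(hex(ord(simple_fold("\u0345"))))
print(count_range(0x0300, 0x036F, ignore_case=False))
print(count_range(0x0300, 0x036F, ignore_case=True))
```

Under this model the case-insensitive count drops by one because the folded U+0345 (U+03B9) lies outside the folded range, which is consistent with the 112 versus 111 results reported in this thread.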

              • guy038G
                guy038
                last edited by guy038

                Hi, @coises and all,

                Yes, @coises, you were right about it. So, in short, against my Total_Chars.txt file :

                • The regex \p{Mn} does return 2,059 occurrences, whether the Match case option is checked or not

                • The regexes [\x{0300}-\x{036F}], (?=[\x{0300}-\x{036F}])\p{Mn} and (?=\p{Mn})[\x{0300}-\x{036F}] return 112 occurrences, when the Match case option is checked

                • The regexes [\x{0300}-\x{036F}], (?=[\x{0300}-\x{036F}])\p{Mn} and (?=\p{Mn})[\x{0300}-\x{036F}] return 111 occurrences, when the Match case option is not checked


                You said :

                All the characters, in range [\x{0300}-\x{036F}], case fold to themselves, except for the single character U+0345 which case folds to U+03B9

                This certainly explains why Columns++, taking account of the folding cases, in this specific range [\x{0300}-\x{036F}] ONLY, just finds 111 occurrences, when the Match case option is not checked !


                I would say that any range, with defined characters ( so, not using your restriction to be automatically sensitive ) :

                • When the Match case option is checked :

                  • Finds the exact number of Unicode chars between the two boundaries of that range. For example, the regex [A-z] returns 58 occurrences and is identical to the range [ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]\^_\x60abcdefghijklmnopqrstuvwxyz] with both N++ and Columns++

                • When the Match case option is not checked :

                  • Finds ONLY the characters of that range which case fold to a character of this range. Thus, the regexes [A-z] and [ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz] return 52 occurrences with N++ ( 26 + 26 )

                  • Finds ALL the Unicode characters which case fold to a character of that range. Thus, the regex [A-z] returns 54 occurrences with Columns++ : 52 + 2 chars, whose case folding ( s and k ) belongs to the specific range [A-z]

                And note that the regex [ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]\^_\x60abcdefghijklmnopqrstuvwxyzſK] and even [ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]\^_\x60abcdefghijklmnopqrstuvwxyz] return 60 occurrences ( 58 + 2 ), with Columns++, when the Match case option is not checked !
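The difference between the range [A-z] ( 54 ) and the explicit list ( 60 ) can be reproduced with a small model. This is a hedged sketch in Python, assuming that a range folds its two end points while an explicit list folds each member separately. The function names are hypothetical and this is not the plugin's code.

```python
def simple_fold(ch: str) -> str:
    """Approximate simple case folding: keep one-to-one folds only."""
    folded = ch.casefold()
    return folded if len(folded) == 1 else ch

def count_range_ci(lo: str, hi: str) -> int:
    """Case-insensitive range: fold both ends, then fold each candidate."""
    flo, fhi = simple_fold(lo), simple_fold(hi)
    return sum(1 for cp in range(0x110000)
               if flo <= simple_fold(chr(cp)) <= fhi)

def count_list_ci(members: str) -> int:
    """Case-insensitive explicit list: fold each member individually."""
    folded = {simple_fold(m) for m in members}
    return sum(1 for cp in range(0x110000)
               if simple_fold(chr(cp)) in folded)

# All 58 characters from 'A' to 'z', written out as an explicit list.
explicit = "".join(chr(cp) for cp in range(ord("A"), ord("z") + 1))

print(count_range_ci("A", "z"))   # folded range is [a-z]
print(count_list_ci(explicit))    # folded list is [a-z] plus [ \ ] ^ _ `
```

In this model, folding the end points turns [A-z] into [a-z], so the six characters between Z and a are lost, while the explicit list keeps them. Both forms pick up U+017F and U+212A, whose case foldings are s and k.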

                Best Regards,

                guy038

                • guy038G
                  guy038
                  last edited by guy038

                  Hello, @coises and All,

                  Now, here are the new tests regarding the Total_ANSI.txt file, described below :

                  •---------------•-----------------•------------•-----------------------------•-----------•-----------------•-----------•
                  |     Range     |  Description    |   Status   |  COUNT / MARK of ALL chars  |  # Chars  |  ANSI Encoding  |  # Bytes  |
                  •---------------•-----------------•------------•-----------------------------•-----------•-----------------•-----------•
                  |  0000 - 007F  |  PLANE 0 - BMP  |  Included  |  [\x00-\x7F]                |      128  |                 |      128  |
                  |               |                 |            |                             |           |     1 Byte      |           |
                  |  0080 - 00FF  |  PLANE 0 - BMP  |  Included  |  [\x80-\xFF]                |      128  |                 |      128  |
                  •---------------•-----------------•------------•-----------------------------•-----------•-----------------•-----------•
                  

                  Against this file, the following general results are correct :

                  
                  (?s).  =  \I  =  \p{Any}  =  [\x{0000}-\x{EFFFD}]                                         =>      256
                  
                  
                  [[:unicode:]]  =  \p{unicode}    ( Total chars with Unicode value OVER  \x{00FF} )        =>       27    |
                                                                                                                           |  Total =  256
                  [^[:unicode:]]  =  \P{unicode}   ( Total chars with Unicode value UNDER \x{0100} )        =>      229    |
                  
                  
                  \p{Ascii}  =  \o                                                                          =>      128    |
                                                                                                                           |  Total =  256
                  \P{Ascii}  =  \O                                                                          =>      128    |
                  
                  
                  \X         ( Character with possible combining MARKS )                                     =>      256    |
                                                                                                                            |  Total =  256
                  (?!\X).    ( A combining mark ALONE )                                                      =>        0    |
                  
                  
                  \y  =  [[:defined:]]  =  \p{Assigned}                                                     =>      256    |
                                                                                                                           |  Total =  256
                  \Y  =  [^[:defined:]]  =  \p{Not Assigned}                                                =>        0    |
                  
                  
                  \i  =  [[:invalid:]]    ( NO byte in invalid UTF-8 sequence, as ANSI file )               =>        0    |
                                                                                                                           |  Total =  256
                  \I  =  [^[:invalid:]]   ( All VALID bytes, as ANSI file )                                 =>      256    |
                  

                  However, note that, with the Columns++ regex engine :

                  [\x00-\xFF]            ( Total chars with Unicode value UNDER \x{0100} )                  =>      229  =  [\x00-\x7F\x81\x8D\x8F\x90\x9D\xA0-\xFF]
                  
                  [\x{0000}-\x{00FF}]    ( Total chars with Unicode value UNDER \x{0100} )                  =>      229  =  [\x00-\x7F\x81\x8D\x8F\x90\x9D\xA0-\xFF]
                  
                  (?-s).                                                                                    =>      254  =  [^\x0A\x0D]
                  

                  Whereas, with the N++ Boost regex engine :

                  [\x00-\xFF]                                                                               =>      256
                  
                  [\x{0000}-\x{00FF}]                                                                       =>      INVALID regex syntax ( as ANSI file )
                  
                  (?-s).                                                                                    =>      253  =  [^\x0A\x0C\x0D]
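The 229 / 27 split seen with Columns++ follows from matching on Unicode code points instead of bytes. A minimal sketch in Python, assuming the ANSI code page is Windows-1252 and that the five bytes left undefined by Python's `cp1252` codec ( 0x81, 0x8D, 0x8F, 0x90, 0x9D ) map to the C1 control codes of the same value, as the Columns++ results above suggest :

```python
def cp1252_to_unicode(byte: int) -> str:
    """Map one Windows-1252 byte to a Unicode character."""
    try:
        return bytes([byte]).decode("cp1252")
    except UnicodeDecodeError:
        # Assumption: undefined bytes map to the same-valued C1 control.
        return chr(byte)

chars = [cp1252_to_unicode(b) for b in range(256)]

# Characters whose code point is above U+00FF, e.g. the euro sign at 0x80.
over_ff = [ch for ch in chars if ord(ch) > 0xFF]
under_100 = [ch for ch in chars if ord(ch) <= 0xFF]

print(len(over_ff), len(under_100))
```

The 27 characters above U+00FF all come from the byte range 0x80 to 0x9F, which is why [\x00-\xFF] no longer covers the whole file when the engine compares code points.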
                  

                  I tried some expressions with look-aheads and look-behinds, containing overlapping zones !

                  For instance, against this text aaaabaaababbbaabbabb, pasted in a new ANSI tab, with a final line-break, all the regexes, below, give the correct number of matches :

                  ba*(?=a)   =>  4 matches
                  ba*(?!a)   =>  9 matches
                  ba*(?=b)   =>  8 matches
                  ba*(?!b)   =>  5 matches
                  
                  (?<=a)ba*  =>  5 matches
                  (?<!b)ba*  =>  5 matches
                  
                  (?<=b)ba*  =>  4 matches
                  (?<!a)ba*  =>  4 matches
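These look-around counts do not depend on anything specific to Notepad++, so they can be cross-checked with another backtracking engine. Here is a small sketch using Python's standard `re` module, assuming a Windows ( CR LF ) final line-break :

```python
import re

# The test text, followed by a final line-break.
text = "aaaabaaababbbaabbabb\r\n"

patterns = [
    r"ba*(?=a)", r"ba*(?!a)", r"ba*(?=b)", r"ba*(?!b)",
    r"(?<=a)ba*", r"(?<!b)ba*", r"(?<=b)ba*", r"(?<!a)ba*",
]

# Number of non-overlapping matches for each pattern.
counts = {p: len(re.findall(p, text)) for p in patterns}

for pattern, n in counts.items():
    print(f"{pattern:<12} => {n} matches")
```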
                  

                  Here are the correct results, concerning all the Posix character classes, against the Total_ANSI.txt file

                  [[:ascii:]]                                                       an UNDER \x{0080} character              128  =  [\x{0000}-\x{007F}]  =  [\x{00}-\x{7F}]  =  [\x00-\x7F]
                                                                                    
                  [[:unicode:]]  =  \p{unicode}                                     an OVER  \x{00FF} character               27  =  [\x{0100}-\x{EFFFD}]  =  [^\x{0000}-\x{00FF}]  =  [^\x{00}-\x{FF}]  =  [^\x00-\xFF]
                  
                  
                  [[:space:]]  =  \p{space}  =  [[:s:]]  =  \p{s}  =  \ps  =  \s    a             WHITE-SPACE character        7  =  [\t\n\x0B\f\r\x20\xA0]
                  
                                                 [[:h:]]  =  \p{h}  =  \ph  =  \h   an HORIZONTAL white space character        3  =  [\t\x20\xA0]
                  [[:blank:]]  =  \p{blank}                                         a  BLANK                  character        3  =  [\t\x20\xA0]
                  
                                                 [[:v:]]  =  \p{v}  =  \pv  =  \v   a  VERTICAL   white space character        4  =  [\n\x0B\f\r]
                  
                  [[:cntrl:]]  =  \p{cntrl}                                         a  CONTROL code           character       38  =  [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]
                                                                                                                                  =  [[.NUL.]-[.US.][.DEL.][.HOP.][.RI.][.SS3.][.DCS.][.OSC.]]
                  
                  
                  [[:upper:]]  =  \p{upper}  =  [[:u:]]  =  \p{u}  =  \pu  =  \u    an  UPPER case    letter                  60  =  [A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]
                  [[:lower:]]  =  \p{lower}  =  [[:l:]]  =  \p{l}  =  \pl  =  \l    a   LOWER case    letter                  63  =  [a-zƒšœžµßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]
                  [ªº]         =  [\xAA\xBA]                                        2   OTHER Letters                          2
                  ˆ            =  \x{02C6}                                          a   MODIFIER letter                        1
                  [[:digit:]]  =  \p{digit}  =  [[:d:]]  =  \p{d}  =  \pd   = \d    a   DECIMAL       number                  10  =  [0-9]
                  _            =  \x5F                                              the LOW_LINE      character                1
                                                                                                                           -------
                  [[:word:]]   =  \p{word}   =  [[:w:]]  =  \p{w}  =  \pw   = \w    a   WORD                  character      137  =  [0-9A-Z_a-zƒˆŠŒŽšœžŸªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]
                                                                                                                                  =  [[:alnum:]]|\x5F  =  \p{alnum}|\x5F
                  
                  
                  [[:upper:]]|[[:lower:]]  =  [[:upper:][:lower:]]  =  \u|\l        Any LETTER, whatever its CASE            123
                  
                  
                  [[:alnum:]]  =  \p{alnum}                                         an  ALPHANUMERIC          character      136  =  [0-9A-Za-zƒˆŠŒŽšœžŸªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]
                                                                                                                                  =  [[:upper:][:lower:][:digit:]\xAA\xBA\x{02C6}]
                  
                  [[:alpha:]]  =  \p{alpha}                                         any LETTER                character      126  =  [[:upper:][:lower:]\xAA\xBA\x{02C6}]
                  
                  
                  [[:graph:]]  =  \p{graph}                                         any VISIBLE               character      215  =  [^\x00-\x1F\x20\x7F\x81\x8D\x8F\x90\x9D\xA0\xAD]
                  
                  [[:print:]]  =  \p{print}                                         any PRINTABLE             character      222  =  [[:graph:]]|\s
                  
                  
                  [[:punct:]]  =  \p{punct}                                         any PUNCTUATION or SYMBOL character       73  =  \p{Punctuation}|\p{Symbol}
                                                                                                                                  =  [\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E\x{20AC}\x{201A}\x{201E}\x{2026}\x{2020}\x{2021}\x{2030}\x{2039}\x{2018}\x{2019}\x{201C}\x{201D}\x{2022}\x{2013}\x{2014}\x{02DC}\x{2122}\x{203A}\xA1-\xA9\xAB\xAC\xAE-\xB1\xB4\xB6-\xB8\xBB\xBF\xD7\xF7]
                                                                                                                                  =  [^[:cntrl:]\w\x20\xA0\xAD\xB2\xB3\xB9\xBC\xBD\xBE]|\x5F 
                  
                  
                  [[:xdigit:]]                                                      an HEXADECIMAL            character       22  =  [0-9A-Fa-f]  =  (?i)[0-9A-F]
                  

                  Below, the correct results for all Unicode character classes, against the Total_ANSI.txt file ( since Columns++ v1.3, Unicode classes work in ANSI files, as well ) :

                             \p{Any}                           Any character                                                  256  =  (?s).  =  \I  =  [\x{0000}-\x{EFFFD}]
                  
                             \p{Ascii}                         a character UNDER \x80                                         128  =  [[:ascii:]]  =  \o
                  
                             \p{Assigned}                      an ASSIGNED character                                          256
                  
                  \p{Cc}  =  \p{Control}                       a  C0 or C1 CONTROL code       character                        38  =  [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]
                  \p{Cf}  =  \p{Format}                        a  FORMAT CONTROL              character                         1  =  \xAD
                  \p{Cn}  =  \p{Not Assigned}                  an UNASSIGNED or NON-CHARACTER character                         0
                  \p{Co}  =  \p{Private Use}                   a  PRIVATE-USE                 character                         0
                  \p{Cs}  =  \p{Surrogate}    (INVALID regex)  a  SURROGATE                   character                         0
                                                                                                                             ------
                  \p{C*}  =  \p{Other}                                                                                         39  =  \p{Cc}|\p{Cf}|\p{Cn}|\p{Co}
                  
                  
                  \p{Lu}  =  \p{Uppercase Letter}              an UPPER case letter                                            60  =  \u  =  [[:upper:]]  =  \p{upper}
                  \p{Ll}  =  \p{Lowercase Letter}              a  LOWER case letter                                            63  =  \l  =  [[:lower:]]  =  \p{lower}
                  \p{Lt}  =  \p{Titlecase}                     a  DI-GRAPHIC letter                                             0
                  \p{Lm}  =  \p{Modifier Letter}               a  MODIFIER   letter                                             1  =  \x{02C6}
                  \p{Lo}  =  \p{Other Letter}                  OTHER         letter                                             2  =  [\xAA\xBA]
                                                                                                                            -------
                  \p{L*}  =  \p{Letter}                                                                                       126  =  \p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}  =  [[:alpha:]]   =  \p{alpha}
                  
                  
                  \p{Mc}  =  \p{Spacing Combining Mark}        a  SPACING COMBINING     mark (POSITIVE advance width)           0
                  \p{Me}  =  \p{Enclosing Mark}                an ENCLOSING COMBINING   mark                                    0
                  \p{Mn}  =  \p{Non-Spacing Mark}              a  NON-SPACING COMBINING mark (ZERO     advance width)           0
                                                                                                                              -----
                  \p{M*}  =  \p{Mark}                                                                                           0  =  \p{Mc}|\p{Me}|\p{Mn}
                  
                  \p{Nd}  =  \p{Decimal Digit Number}          a DECIMAL number     character                                  10  =  \d  =  [[:digit:]]  =  \p{digit}
                  \p{Nl}  =  \p{Letter Number}                 a LETTERLIKE numeric character                                   0
                  \p{No}  =  \p{Other Number}                  OTHER NUMERIC        character                                   6  =  [\xB2\xB3\xB9\xBC\xBD\xBE]
                                                                                                                             ------
                  \p{N*}  =  \p{Number}                                                                                        16  =  \p{Nd}|\p{Nl}|\p{No}  =  [0-9\xB2\xB3\xB9\xBC\xBD\xBE]
                  
                  
                  \p{Pd}  =  \p{Dash Punctuation}              a  DASH or HYPHEN punctuation mark                               3  =  [\x2D\x{2013}\x{2014}]
                  \p{Ps}  =  \p{Open Punctuation}              an OPENING    PUNCTUATION     mark, in a pair                    5  =  [\x28\x5B\x7B\x{201A}\x{201E}]
                  \p{Pc}  =  \p{Connector Punctuation}         a  CONNECTING PUNCTUATION     mark                               1  =  \x5F
                  \p{Pe}  =  \p{Close Punctuation}             a  CLOSING    PUNCTUATION     mark, in a pair                    3  =  [\x29\x5D\x7D]
                  \p{Pi}  =  \p{Initial Punctuation}           an INITIAL QUOTATION          mark                               4  =  [\x{2039}\x{2018}\x{201C}\xAB]
                  \p{Pf}  =  \p{Final Punctuation}             a  FINAL   QUOTATION          mark                               4  =  [\x{2019}\x{201D}\x{203A}\xBB]
                  \p{Po}  =  \p{Other Punctuation}             OTHER PUNCTUATION             mark                              25  =  [\x21-\x23\x25-\x27\x2A\x2C\x2E\x2F\x3A\x3B\x3F\x40\x5C\x{2026}\x{2020}\x{2021}\x{2030}\x{2022}\xA1\xA7\xB6\xB7\xBF]
                                                                                                                             ------
                  \p{P*}  =  \p{Punctuation}                                                                                   45  =  \p{Pd}|\p{Ps}|\p{Pc}|\p{Pe}|\p{Pi}|\p{Pf}|\p{Po}
                  
                  
                  \p{Sm}  =  \p{Math Symbol}                   a MATHEMATICAL symbol     character                             10  =  [\x2B\x3C-\x3E\x7C\x7E\xAC\xB1\xD7\xF7]
                  \p{Sc}  =  \p{Currency Symbol}               a CURRENCY                character                              6  =  [\x24\x{20AC}\xA2-\xA5]
                  \p{Sk}  =  \p{Modifier Symbol}               a NON-LETTERLIKE MODIFIER character                              7  =  [\x5E\x60\x{02DC}\xA8\xAF\xB4\xB8]
                  \p{So}  =  \p{Other Symbol}                  OTHER SYMBOL              character                              5  =  [\x{2122}\xA6\xA9\xAE\xB0]
                                                                                                                             ------
                  \p{S*}  =  \p{Symbol}                                                                                        28  =  \p{Sm}|\p{Sc}|\p{Sk}|\p{So}
                  
                  
                  \p{Zs}  =  \p{Space Separator}               a   NON-ZERO width SPACE   character                             2  =  [\x20\xA0]  =  (?!\t)\h
                  \p{Zl}  =  \p{Line Separator}                the LINE SEPARATOR         character                             0
                  \p{Zp}  =  \p{Paragraph Separator}           the PARAGRAPH SEPARATOR    character                             0
                                                                                                                              -----
                  \p{Z*}  =  \p{Separator}                                                                                      2  =  \p{Zs}|\p{Zl}|\p{Zp}
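The totals in this table can be cross-checked with the Unicode character database shipped with Python. This is a sketch only, assuming the ANSI code page is Windows-1252 and that the five bytes undefined in Python's `cp1252` codec map to the C1 controls of the same value. The Unicode version used by `unicodedata` may differ from the one in Columns++, although the categories of these 256 characters have been stable for a long time.

```python
import unicodedata
from collections import Counter

def cp1252_to_unicode(byte: int) -> str:
    """Map one Windows-1252 byte to a Unicode character."""
    try:
        return bytes([byte]).decode("cp1252")
    except UnicodeDecodeError:
        # Assumption: undefined bytes map to the same-valued C1 control.
        return chr(byte)

# Two-letter general category ( Lu, Ll, Cc, ... ) of each of the 256 characters.
categories = Counter(unicodedata.category(cp1252_to_unicode(b))
                     for b in range(256))

# Totals per major category ( L, M, N, P, S, Z, C ), like the \p{L*} rows.
major = Counter()
for category, n in categories.items():
    major[category[0]] += n

for letter in "CLMNPSZ":
    print(letter, major[letter])
```

The seven major totals add up to 256, so every character of the file falls into exactly one category.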
                  

                  Remark :

                  • A negative POSIX character class can be expressed as [^[:........:]] or [[:^........:]]

                  • A negative UNICODE character class can be expressed as \P{..}, with an uppercase letter P


                  With this last release, @coises, results are totally coherent between ANSI and UTF-8 files !

                  Continuation on next post

                  • guy038G
                    guy038
                    last edited by

                    Hello, @coises and All,

                    Continuation and end of my post

                    I also tested the equivalence class feature in full :

                    You can use ANY equivalent character to get the total number of matches of the equivalence class ( for example, [[=ª=]] = [[=Å=]] = [[=ã=]] = … )

                    Here is, below, the list of all the equivalences of any char of the Windows-1252 code-page, against the Total_ANSI.txt file. Note that I did not consider the equivalence classes which return only one match !

                    [[=1=]]    =   [[=one=]]         =>     2   [1¹]
                    [[=2=]]    =   [[=two=]]         =>     2   [2²]
                    [[=3=]]    =   [[=three=]]       =>     2   [3³]
                    
                    [[=A=]]                          =>    15   [AaªÀÁÂÃÄÅàáâãäå]
                    [[=B=]]                          =>     2   [Bb]
                    [[=C=]]                          =>     4   [CcÇç]
                    [[=D=]]                          =>     4   [DdÐð]
                    [[=E=]]                          =>    10   [EeÈÉÊËèéêë]
                    [[=F=]]                          =>     3   [Ffƒ]
                    [[=G=]]                          =>     2   [Gg]
                    [[=H=]]                          =>     2   [Hh]
                    [[=I=]]                          =>    10   [IiÌÍÎÏìíîï]
                    [[=J=]]                          =>     2   [Jj]
                    [[=K=]]                          =>     2   [Kk]
                    [[=L=]]                          =>     2   [Ll]
                    [[=M=]]                          =>     2   [Mm]
                    [[=N=]]                          =>     4   [NnÑñ]
                    [[=O=]]                          =>    15   [OoºÒÓÔÕÖØòóôõöø]
                    [[=P=]]                          =>     2   [Pp]
                    [[=Q=]]                          =>     2   [Qq]
                    [[=R=]]                          =>     2   [Rr]
                    [[=S=]]                          =>     4   [SsŠš]
                    [[=T=]]                          =>     2   [Tt]
                    [[=U=]]                          =>    10   [UuÙÚÛÜùúûü]
                    [[=V=]]                          =>     2   [Vv]
                    [[=W=]]                          =>     2   [Ww]
                    [[=X=]]                          =>     2   [Xx]
                    [[=Y=]]                          =>     6   [YyÝýÿŸ]
                    [[=Z=]]                          =>     4   [ZzŽž]
                    
                    [[=^=]]    =  [[=circumflex=]]   =>     2   [^ˆ]  =  [\x5E\x{02C6}]
                    [[=Œ=]]                          =>     2   [Œœ]  =  [\x{0152}\x{0153}]
                    [[=­=]]                           =>     2   [[.NUL.][.SHY.]]  =  [\x00\xAD]
                    [[=Þ=]]                          =>     2   [Þþ]  =  [\xDE\xFE]
                    

                    Some double-letter characters equivalences :

                    [[=AE=]] = [[=Ae=]] = [[=ae=]]   =>   2   [Ææ]  =  [\xC6\xE6]
                    [[=SS=]] = [[=Ss=]] = [[=ss=]]   =>   1   [ß]  =  [\xDF]
                    

                    An example : let’s suppose that we run this regex [A-F[:lower:]], against my Total_ANSI.txt file. It does give 69 matches, so 6 UPPER letters + 63 LOWER letters

                    The regexes [[:upper:]]|[[:lower:]] and [[:upper:][:lower:]] act as insensitive regexes and return 123 matches ( So 60 UPPER letters + 63 LOWER letters )

                    The regexes (?=\u)\l and (?=\l)\u do not find anything. This implies that the sets of UPPER and LOWER letters, in Total_ANSI.txt, are totally disjoint

                    Best Regards

                    guy038

                    P.S. :

                    BTW, I forgot to list the equivalence classes, > 1, of the Control C0/C1 and Control Format characters, against the Total_Chars.txt file ! Here are the results, below :

                    [[=nul=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cc
                    
                    [[= =]]                              =>     3  [\x{0020}\x{205F}\x{3000}]   Zs
                    [[=mmsp=]]                           =>     3  [\x{0020}\x{205F}\x{3000}]   Zs
                    [[=idsp=]]                           =>     3  [\x{0020}\x{205F}\x{3000}]   Zs
                    
                    [[=shy=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
                    [[=alm=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
                    
                    [[=sam=]]                            =>     2  [\x{070F}\x{2E1A}]           Po
                    
                    [[=nqsp=]]                           =>     2  [\x{2000}\x{2002}]           Zs
                    [[=ensp=]]                           =>     2  [\x{2000}\x{2002}]           Zs
                    
                    [[=mqsp=]]                           =>     2  [\x{2001}\x{2003}]           Zs
                    [[=emsp=]]                           =>     2  [\x{2001}\x{2003}]           Zs
                    
                    [[=zwnj=]]                           => 3,240  [\x{0000}\x{00AD}....]       Cf
                    [[=zwj=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
                    [[=lrm=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
                    [[=rlm=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
                    
                    [[=ls=]]                             =>     2  [\x{2028}\x{FE47}]           Zl
                    
                    [[=lre=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
                    [[=rle=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
                    [[=pdf=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
                    [[=lro=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
                    [[=rlo=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
                    
                    [[=wj=]]                             => 3,240  [\x{0000}\x{00AD}....]       Cf
                    
                    [[=(fa)=]]                           => 3,240  [\x{0000}\x{00AD}....]       Cf
                    [[=(it)=]]                           => 3,240  [\x{0000}\x{00AD}....]       Cf
                    [[=(is)=]]                           => 3,240  [\x{0000}\x{00AD}....]       Cf
                    [[=(ip)=]]                           => 3,240  [\x{0000}\x{00AD}....]       Cf
                    
                    [[=lri=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
                    [[=rli=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
                    [[=fsi=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
                    [[=pdi=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
                    
                    [[=iss=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
                    [[=ass=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
                    [[=iafs=]]                           => 3,240  [\x{0000}\x{00AD}....]       Cf
                    [[=aafs=]]                           => 3,240  [\x{0000}\x{00AD}....]       Cf
                    [[=nads=]]                           => 3,240  [\x{0000}\x{00AD}....]       Cf
                    [[=nods=]]                           => 3,240  [\x{0000}\x{00AD}....]       Cf
                    
                    [[=zwnbsp=]]                         => 3,240  [\x{0000}\x{00AD}....]       Cf
                    
                    [[=iaa=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
                    [[=ias=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
                    [[=iat=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
                    

                    As you can see, a lot of Format characters return an erroneous result of 3,240 occurrences. But we’re not going to bother about these wrong equivalence classes, as long as the similar collating names, with the [[.XXX.]] syntax, are totally correct !

                    Luckily, all the other equivalence classes are also correct, except for [[=ls=]] which returns 2 matches \x{2028} and \x{FE47} ??

                    • CoisesC
                      Coises @guy038
                      last edited by Coises

                      @guy038 said in Columns++ version 1.3: All Unicode, all the time:

                      As you can see, a lot of Format characters return an erroneous result of 3,240 occurrences. But we’re not going to bother about these wrong equivalence classes, as long as the similar collating names, with the [[.XXX.]] syntax, are totally correct !

                      Luckily, all the other equivalence classes are also correct, except for [[=ls=]] which returns 2 matches \x{2028} and \x{FE47} ??

                      Thank you for the observation. I will have to look into this more closely. I believe the Boost::regex engine uses the transform_primary member function of the character traits class to determine equivalence: if the sort key returned by that function for two characters is the same, then they are equivalent. I implemented transform_primary using LCMapStringEx, as that is normally how one does Unicode sorting. But how is sorting relevant to regular expressions?

                      It could be — despite the documented requirement for the function — that what is needed from transform_primary isn’t a sort key, but rather a case folding followed by a compatibility decomposition.
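To make the alternative concrete, here is a hedged sketch in Python of an equivalence key built from a case folding followed by a compatibility decomposition (NFKD). The name `equivalence_key` is hypothetical. This illustrates the idea only: it is not the plugin's `transform_primary`, and it does not call `LCMapStringEx`.

```python
import unicodedata

def equivalence_key(ch: str) -> str:
    """Case fold, then apply compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", ch.casefold())

def equivalent(a: str, b: str) -> bool:
    """Two characters are equivalent when their keys are identical."""
    return equivalence_key(a) == equivalence_key(b)

print(equivalent("\u212a", "k"))       # KELVIN SIGN and k
print(equivalent("\u00aa", "a"))       # FEMININE ORDINAL INDICATOR and a
print(equivalent("\u2028", "\ufe47"))  # the [[=ls=]] pair reported above
print(equivalent("\u00ad", "\x00"))    # SOFT HYPHEN and NUL
print(equivalent("\u00c0", "A"))       # A WITH GRAVE and A
```

With such a key, format characters would no longer fall into the same class as NUL, and U+2028 would not pair with U+FE47. On the other hand, accented letters would stop matching their base letter ( À folds and decomposes to a + U+0300, which differs from a ), unless combining marks were also removed from the key.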

                      Again, thank you for all your testing, and for calling this to my attention.
