    Columns++ version 1.3: All Unicode, all the time

    • guy038G
      guy038
      last edited by guy038

      Hi, @coises and all,

      Yes, @coises, you were right about it. So, in short, against my Total_Chars.txt file :

      • The regex \p{Mn} does return 2,059 occurrences, whether the Match case option is checked or not

      • The regexes [\x{0300}-\x{036F}], (?=[\x{0300}-\x{036F}])\p{Mn} and (?=\p{Mn})[\x{0300}-\x{036F}] return 112 occurrences, when the Match case option is checked

      • The regexes [\x{0300}-\x{036F}], (?=[\x{0300}-\x{036F}])\p{Mn} and (?=\p{Mn})[\x{0300}-\x{036F}] return 111 occurrences, when the Match case option is not checked


      You said :

      All the characters, in range [\x{0300}-\x{036F}], case fold to themselves, except for the single character U+0345 which case folds to U+03B9

      This certainly explains why Columns++, taking account of the folding cases, in this specific range [\x{0300}-\x{036F}] ONLY, just finds 111 occurrences, when the Match case option is not checked !
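The drop from 112 to 111 can be cross-checked outside the editor. The snippet below is only a sketch: it assumes that Python's str.casefold() is a reasonable stand-in for the case folding applied by Columns++, and the helper name folds_outside_range is made up for this illustration.

```python
# Sketch: which code points of U+0300..U+036F case fold outside that range?
# Assumption: str.casefold() approximates the folding used by Columns++.
LOW, HIGH = 0x0300, 0x036F


def folds_outside_range(low=LOW, high=HIGH):
    """Return (code point, folded string) pairs whose folding leaves [low, high]."""
    leaving = []
    for cp in range(low, high + 1):
        folded = chr(cp).casefold()
        if not all(low <= ord(c) <= high for c in folded):
            leaving.append((cp, folded))
    return leaving


outside = folds_outside_range()
total = HIGH - LOW + 1

for cp, folded in outside:
    targets = " ".join(f"U+{ord(c):04X}" for c in folded)
    print(f"U+{cp:04X} folds to {targets}")
print(total, "code points in the range,", total - len(outside), "fold within it")
```

Only U+0345 is reported, folding to U+03B9, which leaves 111 of the 112 code points folding within the range.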


      I would say that any range, with defined characters ( so, not using your restriction to be automatically sensitive ) :

      • When the Match case option is checked :

        • Finds the exact number of Unicode chars between the two boundaries of that range. For example, the regex [A-z] returns 58 occurrences and is identical to the range [ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]\^_\x60abcdefghijklmnopqrstuvwxyz] with, either, N++ and Columns++

      • When the Match case option is not checked :

        • Finds ONLY the characters of that range which case fold to a character of this range. Thus, the regexes [A-z] and [ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz] return 52 occurrences with N++ ( 26 + 26 )

        • Finds ALL the Unicode characters which case fold to a character of that range. Thus, the regex [A-z] returns 54 occurrences with Columns++ : 52 + 2 chars, whose case folding ( s and k ) belongs to the specific range [A-z]

      And note that the regex [ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]\^_\x60abcdefghijklmnopqrstuvwxyzſK] and even [ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]\^_\x60abcdefghijklmnopqrstuvwxyz] return 60 occurrences ( 58 + 2 ), with Columns++, when the Match case option is not checked !
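The two extra characters can be found by brute force. This is a minimal sketch, again assuming that Python's str.casefold() approximates the folding used by Columns++; the function name is hypothetical.

```python
import string

# ASCII letters A-Z and a-z, the 52 "expected" members of [A-z]
ASCII_LETTERS = set(string.ascii_letters)


def non_ascii_folding_to_ascii_letters():
    """List non-ASCII characters whose case folding is a single ASCII letter."""
    found = []
    for cp in range(0x80, 0x110000):
        folded = chr(cp).casefold()
        # Multi-character foldings (for example U+00DF -> "ss") are ignored here.
        if len(folded) == 1 and folded in ASCII_LETTERS:
            found.append(chr(cp))
    return found


extra = non_ascii_folding_to_ascii_letters()
for ch in extra:
    print(f"U+{ord(ch):04X} folds to {ch.casefold()!r}")
```

The scan reports U+017F ( LATIN SMALL LETTER LONG S ) and U+212A ( KELVIN SIGN ), which fold to s and k respectively.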

      Best Regards,

      guy038

      • guy038G
        guy038
        last edited by guy038

        Hello, @coises and All,

        Now, here are the new tests regarding the Total_ANSI.txt file, described below :

        •---------------•-----------------•------------•-----------------------------•-----------•-----------------•-----------•
        |     Range     |  Description    |   Status   |  COUNT / MARK of ALL chars  |  # Chars  |  ANSI Encoding  |  # Bytes  |
        •---------------•-----------------•------------•-----------------------------•-----------•-----------------•-----------•
        |  0000 - 007F  |  PLANE 0 - BMP  |  Included  |  [\x00-\x7F]                |      128  |                 |      128  |
        |               |                 |            |                             |           |     1 Byte      |           |
        |  0080 - 00FF  |  PLANE 0 - BMP  |  Included  |  [\x80-\xFF]                |      128  |                 |      128  |
        •---------------•-----------------•------------•-----------------------------•-----------•-----------------•-----------•
        

        Against this file, the following general results are correct :

        
        (?s).  =  \I  =  \p{Any}  =  [\x{0000}-\x{EFFFD}]                                         =>      256
        
        
        [[:unicode:]]  =  \p{unicode}    ( Total chars with Unicode value OVER  \x{00FF} )        =>       27    |
                                                                                                                 |  Total =  256
        [^[:unicode:]]  =  \P{unicode}   ( Total chars with Unicode value UNDER \x{0100} )        =>      229    |
        
        
        \p{Ascii}  =  \o                                                                          =>      128    |
                                                                                                                 |  Total =  256
        \P{Ascii}  =  \O                                                                          =>      128    |
        
        
        \X         ( Character with possible combining MARKS )                                     =>      256    |
                                                                                                                  |  Total =  256
        (?!\X).    ( A combining mark ALONE )                                                      =>        0    |
        
        
        \y  =  [[:defined:]]  =  \p{Assigned}                                                     =>      256    |
                                                                                                                 |  Total =  256
        \Y  =  [^[:defined:]]  =  \p{Not Assigned}                                                =>        0    |
        
        
        \i  =  [[:invalid:]]    ( NO byte in invalid UTF-8 sequence, as ANSI file )               =>        0    |
                                                                                                                 |  Total =  256
        \I  =  [^[:invalid:]]   ( All VALID bytes, as ANSI file )                                 =>      256    |
        

        However, note that, with the Columns++ regex engine :

        [\x00-\xFF]            ( Total chars with Unicode value UNDER \x{0100} )                  =>      229  =  [\x00-\x7F\x81\x8D\x8F\x90\x9D\xA0-\xFF]
        
        [\x{0000}-\x{00FF}]    ( Total chars with Unicode value UNDER \x{0100} )                  =>      229  =  [\x00-\x7F\x81\x8D\x8F\x90\x9D\xA0-\xFF]
        
        (?-s).                                                                                    =>      254  =  [^\x0A\x0D]
        

        Whereas, with the N++ Boost regex engine :

        [\x00-\xFF]                                                                               =>      256
        
        [\x{0000}-\x{00FF}]                                                                       =>      INVALID regex syntax ( as ANSI file )
        
        (?-s).                                                                                    =>      253  =  [^\x0A\x0C\x0D]
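The 229 / 27 split reported for Columns++ matches what you get when each of the 256 ANSI bytes is translated to its Unicode code point. The sketch below assumes the Windows-1252 code page and assumes that the five bytes left undefined by Python's cp1252 codec ( 0x81, 0x8D, 0x8F, 0x90 and 0x9D ) map to the C1 control codes of the same value; ansi_to_unicode is a name invented for this example.

```python
def ansi_to_unicode(byte_value):
    """Decode one Windows-1252 byte to a one-character string."""
    try:
        return bytes([byte_value]).decode("cp1252")
    except UnicodeDecodeError:
        # Assumption: undefined bytes keep their own value (C1 control codes).
        return chr(byte_value)


chars = [ansi_to_unicode(b) for b in range(256)]
over_ff = [c for c in chars if ord(c) > 0xFF]
under_100 = [c for c in chars if ord(c) <= 0xFF]

print(len(under_100), "characters with a code point under U+0100")
print(len(over_ff), "characters with a code point over U+00FF")
```

So a range written with Unicode values, like [\x00-\xFF], covers 229 of the 256 characters, and the remaining 27 are the ones matched by [[:unicode:]].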
        

        I tried some expressions with look-aheads and look-behinds, containing overlapping zones !

        For instance, against this text aaaabaaababbbaabbabb, pasted in a new ANSI tab, with a final line-break, all the regexes, below, give the correct number of matches :

        ba*(?=a)   =>  4 matches
        ba*(?!a)   =>  9 matches
        ba*(?=b)   =>  8 matches
        ba*(?!b)   =>  5 matches
        
        (?<=a)ba*  =>  5 matches
        (?<!b)ba*  =>  5 matches
        
        (?<=b)ba*  =>  4 matches
        (?<!a)ba*  =>  4 matches
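These counts can be reproduced with another backtracking engine. The following sketch uses Python's re module rather than Boost, so it only confirms that the expected numbers are right, not how Columns++ computes them.

```python
import re

# Sample text from the post, with a final line break
TEXT = "aaaabaaababbbaabbabb\n"

PATTERNS = [
    r"ba*(?=a)", r"ba*(?!a)", r"ba*(?=b)", r"ba*(?!b)",
    r"(?<=a)ba*", r"(?<!b)ba*", r"(?<=b)ba*", r"(?<!a)ba*",
]


def count_matches(pattern, text=TEXT):
    """Count non-overlapping matches, scanning left to right."""
    return sum(1 for _ in re.finditer(pattern, text))


counts = {pattern: count_matches(pattern) for pattern in PATTERNS}
for pattern, n in counts.items():
    print(f"{pattern:<12} => {n} matches")
```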
        

        Here are the correct results, concerning all the Posix character classes, against the Total_ANSI.txt file

        [[:ascii:]]                                                       an UNDER \x{0080} character              128  =  [\x{0000}-\x{007F}]  =  [\x{00}-\x{7F}]  =  [\x00-\x7F]
                                                                          
        [[:unicode:]]  =  \p{unicode}                                     an OVER  \x{00FF} character               27  =  [\x{0100}-\x{EFFFD}]  =  [^\x{0000}-\x{00FF}]  =  [^\x{00}-\x{FF}]  =  [^\x00-\xFF]
        
        
        [[:space:]]  =  \p{space}  =  [[:s:]]  =  \p{s}  =  \ps  =  \s    a             WHITE-SPACE character        7  =  [\t\n\x0B\f\r\x20\xA0]
        
                                       [[:h:]]  =  \p{h}  =  \ph  =  \h   an HORIZONTAL white space character        3  =  [\t\x20\xA0]
        [[:blank:]]  =  \p{blank}                                         a  BLANK                  character        3  =  [\t\x20\xA0]
        
                                       [[:v:]]  =  \p{v}  =  \pv  =  \v   a  VERTICAL   white space character        4  =  [\n\x0B\f\r]
        
        [[:cntrl:]]  =  \p{cntrl}                                         a  CONTROL code           character       38  =  [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]
                                                                                                                        =  [[.NUL.]-[.US.][.DEL.][.HOP.][.RI.][.SS3.][.DCS.][.OSC.]]
        
        
        [[:upper:]]  =  \p{upper}  =  [[:u:]]  =  \p{u}  =  \pu  =  \u    an  UPPER case    letter                  60  =  [A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]
        [[:lower:]]  =  \p{lower}  =  [[:l:]]  =  \p{l}  =  \pl  =  \l    a   LOWER case    letter                  63  =  [a-zƒšœžµßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]
        [ªº]         =  [\xAA\xBA]                                        2   OTHER Letters                          2
        ˆ            =  \x{02C6}                                          a   MODIFIER letter                        1
        [[:digit:]]  =  \p{digit}  =  [[:d:]]  =  \p{d}  =  \pd   = \d    a   DECIMAL       number                  10  =  [0-9]
        _            =  \x5F                                              the LOW_LINE      character                1
                                                                                                                 -------
        [[:word:]]   =  \p{word}   =  [[:w:]]  =  \p{w}  =  \pw   = \w    a   WORD                  character      137  =  [0-9A-Z_a-zƒˆŠŒŽšœžŸªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]
                                                                                                                        =  [[:alnum:]]|\x5F  =  \p{alnum}|\x5F
        
        
        [[:upper:]]|[[:lower:]]  =  [[:upper:][:lower:]]  =  \u|\l        Any LETTER, whatever its CASE            123
        
        
        [[:alnum:]]  =  \p{alnum}                                         an  ALPHANUMERIC          character      136  =  [0-9A-Za-zƒˆŠŒŽšœžŸªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]
                                                                                                                        =  [[:upper:][:lower:][:digit:]\xAA\xBA\x{02C6}]
        
        [[:alpha:]]  =  \p{alpha}                                         any LETTER                character      126  =  [[:upper:][:lower:]\xAA\xBA\x{02C6}]
        
        
        [[:graph:]]  =  \p{graph}                                         any VISIBLE               character      215  =  [^\x00-\x1F\x20\x7F\x81\x8D\x8F\x90\x9D\xA0\xAD]
        
        [[:print:]]  =  \p{print}                                         any PRINTABLE             character      222  =  [[:graph:]]|\s
        
        
        [[:punct:]]  =  \p{punct}                                         any PUNCTUATION or SYMBOL character       73  =  \p{Punctuation}|\p{Symbol}
                                                                                                                        =  [\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E\x{20AC}\x{201A}\x{201E}\x{2026}\x{2020}\x{2021}\x{2030}\x{2039}\x{2018}\x{2019}\x{201C}\x{201D}\x{2022}\x{2013}\x{2014}\x{02DC}\x{2122}\x{203A}\xA1-\xA9\xAB\xAC\xAE-\xB1\xB4\xB6-\xB8\xBB\xBF\xD7\xF7]
                                                                                                                        =  [^[:cntrl:]\w\x20\xA0\xAD\xB2\xB3\xB9\xBC\xBD\xBE]|\x5F 
        
        
        [[:xdigit:]]                                                      an HEXADECIMAL            character       22  =  [0-9A-Fa-f]  =  (?i)[0-9A-F]
        

        Below, the correct results for all Unicode character classes, against the Total_ANSI.txt file ( since Columns++ v1.3, Unicode classes work in ANSI files, as well ) :

                   \p{Any}                           Any character                                                  256  =  (?s).  =  \I  =  [\x{0000}-\x{EFFFD}]
        
                   \p{Ascii}                         a character UNDER \x80                                         128  =  [[:ascii:]]  =  \o
        
                   \p{Assigned}                      an ASSIGNED character                                          256
        
        \p{Cc}  =  \p{Control}                       a  C0 or C1 CONTROL code       character                        38  =  [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]
        \p{Cf}  =  \p{Format}                        a  FORMAT CONTROL              character                         1  =  \xAD
        \p{Cn}  =  \p{Not Assigned}                  an UNASSIGNED or NON-CHARACTER character                         0
        \p{Co}  =  \p{Private Use}                   a  PRIVATE-USE                 character                         0
        \p{Cs}  =  \p{Surrogate}    (INVALID regex)  a  SURROGATE                   character                         0
                                                                                                                   ------
        \p{C*}  =  \p{Other}                                                                                         39  =  \p{Cc}|\p{Cf}|\p{Cn}|\p{Co}
        
        
        \p{Lu}  =  \p{Uppercase Letter}              an UPPER case letter                                            60  =  \u  =  [[:upper:]]  =  \p{upper}
        \p{Ll}  =  \p{Lowercase Letter}              a  LOWER case letter                                            63  =  \l  =  [[:lower:]]  =  \p{lower}
        \p{Lt}  =  \p{Titlecase}                     a  DI-GRAPHIC letter                                             0
        \p{Lm}  =  \p{Modifier Letter}               a  MODIFIER   letter                                             1  =  \x{02C6}
        \p{Lo}  =  \p{Other Letter}                  OTHER         letter                                             2  =  [\xAA\xBA]
                                                                                                                  -------
        \p{L*}  =  \p{Letter}                                                                                       126  =  \p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}  =  [[:alpha:]]   =  \p{alpha}
        
        
        \p{Mc}  =  \p{Spacing Combining Mark}        a   SPACING  COMBINING    mark                                   0
        \p{Me}  =  \p{Enclosing Mark}                an  ENCLOSING             mark (POSITIVE advance width)          0
        \p{Mn}  =  \p{Non-Spacing Mark}              a   NON-SPACING COMBINING mark (ZERO     advance width)          0
                                                                                                                    -----
        \p{M*}  =  \p{Mark}                                                                                           0  =  \p{Mc}|\p{Me}|\p{Mn}
        
        
        \p{Nd}  =  \p{Decimal Digit Number}          a DECIMAL number     character                                  10  =  \d  =  [[:digit:]]  =  \p{digit}
        \p{Nl}  =  \p{Letter Number}                 a LETTERLIKE numeric character                                   0
        \p{No}  =  \p{Other Number}                  OTHER NUMERIC        character                                   6  =  [\xB2\xB3\xB9\xBC\xBD\xBE]
                                                                                                                   ------
        \p{N*}  =  \p{Number}                                                                                        16  =  \p{Nd}|\p{Nl}|\p{No}  =  [0-9\xB2\xB3\xB9\xBC\xBD\xBE]
        
        
        \p{Pd}  =  \p{Dash Punctuation}              a  DASH or HYPHEN punctuation mark                               3  =  [\x2D\x{2013}\x{2014}]
        \p{Ps}  =  \p{Open Punctuation}              an OPENING    PUNCTUATION     mark, in a pair                    5  =  [\x28\x5B\x7B\x{201A}\x{201E}]
        \p{Pc}  =  \p{Connector Punctuation}         a  CONNECTING PUNCTUATION     mark                               1  =  \x5F
        \p{Pe}  =  \p{Close Punctuation}             a  CLOSING    PUNCTUATION     mark, in a pair                    3  =  [\x29\x5D\x7D]
        \p{Pi}  =  \p{Initial Punctuation}           an INITIAL QUOTATION          mark                               4  =  [\x{2039}\x{2018}\x{201C}\xAB]
        \p{Pf}  =  \p{Final Punctuation}             a  FINAL   QUOTATION          mark                               4  =  [\x{2019}\x{201D}\x{203A}\xBB]
        \p{Po}  =  \p{Other Punctuation}             OTHER PUNCTUATION             mark                              25  =  [\x21-\x23\x25-\x27\x2A\x2C\x2E\x2F\x3A\x3B\x3F\x40\x5C\x{2026}\x{2020}\x{2021}\x{2030}\x{2022}\xA1\xA7\xB6\xB7\xBF]
                                                                                                                   ------
        \p{P*}  =  \p{Punctuation}                                                                                   45  =  \p{Pd}|\p{Ps}|\p{Pc}|\p{Pe}|\p{Pi}|\p{Pf}|\p{Po}
        
        
        \p{Sm}  =  \p{Math Symbol}                   a MATHEMATICAL symbol     character                             10  =  [\x2B\x3C-\x3E\x7C\x7E\xAC\xB1\xD7\xF7]
        \p{Sc}  =  \p{Currency Symbol}               a CURRENCY                character                              6  =  [\x24\x{20AC}\xA2-\xA5]
        \p{Sk}  =  \p{Modifier Symbol}               a NON-LETTERLIKE MODIFIER character                              7  =  [\x5E\x60\x{02DC}\xA8\xAF\xB4\xB8]
        \p{So}  =  \p{Other Symbol}                  OTHER SYMBOL              character                              5  =  [\x{2122}\xA6\xA9\xAE\xB0]
                                                                                                                   ------
        \p{S*}  =  \p{Symbol}                                                                                        28  =  \p{Sm}|\p{Sc}|\p{Sk}|\p{So}
        
        
        \p{Zs}  =  \p{Space Separator}               a   NON-ZERO width SPACE   character                             2  =  [\x20\xA0]  =  (?!\t)\h
        \p{Zl}  =  \p{Line Separator}                the LINE SEPARATOR         character                             0
        \p{Zp}  =  \p{Paragraph Separator}           the PARAGRAPH SEPARATOR    character                             0
                                                                                                                    -----
        \p{Z*}  =  \p{Separator}                                                                                      2  =  \p{Zs}|\p{Zl}|\p{Zp}
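The per-category counts above can be tallied independently with Python's unicodedata module. This is a sketch under two assumptions: the file uses the Windows-1252 code page, and the five bytes undefined in Python's cp1252 codec map to the C1 control codes of the same value. The Unicode data version bundled with Python may differ from the one used by Columns++, and ansi_to_unicode is a made-up helper name.

```python
import unicodedata
from collections import Counter


def ansi_to_unicode(byte_value):
    """Decode one Windows-1252 byte to a one-character string."""
    try:
        return bytes([byte_value]).decode("cp1252")
    except UnicodeDecodeError:
        # Assumption: undefined bytes keep their own value (C1 control codes).
        return chr(byte_value)


# General category (Lu, Ll, Cc, ...) of each of the 256 characters
tally = Counter(unicodedata.category(ansi_to_unicode(b)) for b in range(256))

for category in sorted(tally):
    print(category, tally[category])
print("Total", sum(tally.values()))
```

Categories with a count of 0 in the table ( Mn, Lt, Zl, ... ) simply do not appear in the tally.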
        

        Remark :

        • A negative POSIX character class can be expressed as [^[:........:]] or [[:^........:]]

        • A negative UNICODE character class can be expressed as \P{..}, with an uppercase letter P


        With this last release, @coises, results are totally coherent between ANSI and UTF-8 files !

        Continuation on next post

        • guy038G
          guy038
          last edited by guy038

          Hello, @coises and All,

          Continuation and end of my post

          I also tested ALL the equivalence classes :

          You can use ANY equivalent character to get the total number of matches of the equivalence class character ( for example, [[=ª=]] = [[=Å=]] = [[=ã=]] = … )

          Here is, below, the list of all the equivalences of any char of the Windows-1252 code-page, against the Total_ANSI.txt file. Note that I did not consider the equivalence classes which return only one match !

          [[=1=]]    =   [[=one=]]         =>     2   [1¹]
          [[=2=]]    =   [[=two=]]         =>     2   [2²]
          [[=3=]]    =   [[=three=]]       =>     2   [3³]
          
          [[=A=]]                          =>    15   [AaªÀÁÂÃÄÅàáâãäå]
          [[=B=]]                          =>     2   [Bb]
          [[=C=]]                          =>     4   [CcÇç]
          [[=D=]]                          =>     4   [DdÐð]
          [[=E=]]                          =>    10   [EeÈÉÊËèéêë]
          [[=F=]]                          =>     3   [Ffƒ]
          [[=G=]]                          =>     2   [Gg]
          [[=H=]]                          =>     2   [Hh]
          [[=I=]]                          =>    10   [IiÌÍÎÏìíîï]
          [[=J=]]                          =>     2   [Jj]
          [[=K=]]                          =>     2   [Kk]
          [[=L=]]                          =>     2   [Ll]
          [[=M=]]                          =>     2   [Mm]
          [[=N=]]                          =>     4   [NnÑñ]
          [[=O=]]                          =>    15   [OoºÒÓÔÕÖØòóôõöø]
          [[=P=]]                          =>     2   [Pp]
          [[=Q=]]                          =>     2   [Qq]
          [[=R=]]                          =>     2   [Rr]
          [[=S=]]                          =>     4   [SsŠš]
          [[=T=]]                          =>     2   [Tt]
          [[=U=]]                          =>    10   [UuÙÚÛÜùúûü]
          [[=V=]]                          =>     2   [Vv]
          [[=W=]]                          =>     2   [Ww]
          [[=X=]]                          =>     2   [Xx]
          [[=Y=]]                          =>     6   [YyÝýÿŸ]
          [[=Z=]]                          =>     4   [ZzŽž]
          
          [[=^=]]    =  [[=circumflex=]]   =>     2   [^ˆ]  =  [\x5E\x{02C6}]
          [[=Œ=]]                          =>     2   [Œœ]  =  [\x{0152}\x{0153}]
          [[=­=]]                           =>     2   [[.NUL.][.SHY.]]  =  [\x00\xAD]
          [[=Þ=]]                          =>     2   [Þþ]  =  [\xDE\xFE]
          

          Some double-letter characters equivalences :

          [[=AE=]] = [[=Ae=]] = [[=ae=]]   =>   2   [Ææ]  =  [\xC6\xE6]
          [[=SS=]] = [[=Ss=]] = [[=ss=]]   =>   1   [ß]  =  [\xDF]
          

          An example : let’s suppose that we run this regex [A-F[:lower:]], against my Total_ANSI.txt file. It does give 69 matches, so 6 UPPER letters + 63 LOWER letters

          The regexes [[:upper:]]|[[:lower:]] and [[:upper:][:lower:]] act as insensitive regexes and return 123 matches ( So 60 UPPER letters + 63 LOWER letters )

          The regexes (?=\u)\l and (?=\l)\u do not find anything. This implies that the sets of UPPER and LOWER letters, in Total_ANSI.txt, are totally disjoint

          Best Regards

          guy038

          P.S. :

          BTW, I forgot to list the equivalence classes, > 1, of the Control C0/C1 and Control Format characters, against the Total_Chars.txt file ! Here are the results, below :

          [[=nul=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cc
          
          [[= =]]                              =>     3  [\x{0020}\x{205F}\x{3000}]   Zs
          [[=mmsp=]]                           =>     3  [\x{0020}\x{205F}\x{3000}]   Zs
          [[=idsp=]]                           =>     3  [\x{0020}\x{205F}\x{3000}]   Zs
          
          [[=shy=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
          [[=alm=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
          
          [[=sam=]]                            =>     2  [\x{070F}\x{2E1A}]           Po
          
          [[=nqsp=]]                           =>     2  [\x{2000}\x{2002}]           Zs
          [[=ensp=]]                           =>     2  [\x{2000}\x{2002}]           Zs
          
          [[=mqsp=]]                           =>     2  [\x{2001}\x{2003}]           Zs
          [[=emsp=]]                           =>     2  [\x{2001}\x{2003}]           Zs
          
          [[=zwnj=]]                           => 3,240  [\x{0000}\x{00AD}....]       Cf
          [[=zwj=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
          [[=lrm=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
          [[=rlm=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
          
          [[=ls=]]                             =>     2  [\x{2028}\x{FE47}]           Zl
          
          [[=lre=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
          [[=rle=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
          [[=pdf=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
          [[=lro=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
          [[=rlo=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
          
          [[=wj=]]                             => 3,240  [\x{0000}\x{00AD}....]       Cf
          
          [[=(fa)=]]                           => 3,240  [\x{0000}\x{00AD}....]       Cf
          [[=(it)=]]                           => 3,240  [\x{0000}\x{00AD}....]       Cf
          [[=(is)=]]                           => 3,240  [\x{0000}\x{00AD}....]       Cf
          [[=(ip)=]]                           => 3,240  [\x{0000}\x{00AD}....]       Cf
          
          [[=lri=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
          [[=rli=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
          [[=fsi=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
          [[=pdi=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
          
          [[=iss=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
          [[=ass=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
          [[=iafs=]]                           => 3,240  [\x{0000}\x{00AD}....]       Cf
          [[=aafs=]]                           => 3,240  [\x{0000}\x{00AD}....]       Cf
          [[=nads=]]                           => 3,240  [\x{0000}\x{00AD}....]       Cf
          [[=nods=]]                           => 3,240  [\x{0000}\x{00AD}....]       Cf
          
          [[=zwnbsp=]]                         => 3,240  [\x{0000}\x{00AD}....]       Cf
          
          [[=iaa=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
          [[=ias=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
          [[=iat=]]                            => 3,240  [\x{0000}\x{00AD}....]       Cf
          
          

          As you can see, a lot of Format characters return an erroneous result of 3,240 occurrences. But we’re not going to bother about these wrong equivalence classes, as long as the similar collating names, with the [[.XXX.]] syntax, are totally correct !

          Luckily, all the other equivalence classes are also correct, except for [[=ls=]] which returns 2 matches \x{2028} and \x{FE47} ??

          • CoisesC
            Coises @guy038
            last edited by Coises

            @guy038 said in Columns++ version 1.3: All Unicode, all the time:

            As you can see, a lot of Format characters return an erroneous result of 3,240 occurrences. But we’re not going to bother about these wrong equivalence classes, as long as the similar collating names, with the [[.XXX.]] syntax, are totally correct !

            Luckily, all the other equivalence classes are also correct, except for [[=ls=]] which returns 2 matches \x{2028} and \x{FE47} ??

            Thank you for the observation. I will have to look into this more closely. I believe the Boost::regex engine uses the transform_primary member function of the character traits class to determine equivalence: if the sort key returned by that function for two characters is the same, then they are equivalent. I implemented transform_primary using LCMapStringEx, as that is normally how one does Unicode sorting. But how is sorting relevant to regular expressions?

            It could be — despite the documented requirement for the function — that what is needed from transform_primary isn’t a sort key, but rather a case folding followed by a compatibility decomposition.
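To make the alternative concrete, here is a rough sketch of an equivalence key built as a case folding followed by a compatibility decomposition. It is written in Python with unicodedata, not against the Boost traits class or LCMapStringEx, and fold_key is a hypothetical name; it only illustrates the idea, not the Columns++ code.

```python
import unicodedata


def fold_key(ch):
    """Case fold, then apply compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", ch.casefold())


def equivalent(a, b):
    """Two characters are treated as equivalent when their keys are equal."""
    return fold_key(a) == fold_key(b)


# Pairs grouped together by the sort-key approach in the tests above
print(equivalent("\u2028", "\ufe47"))   # LINE SEPARATOR vs U+FE47
print(equivalent("\x00", "\xad"))       # NUL vs SOFT HYPHEN

# Pairs that a compatibility-based key keeps together
print(equivalent("A", "\xaa"))          # A vs FEMININE ORDINAL INDICATOR
print(equivalent("1", "\xb9"))          # 1 vs SUPERSCRIPT ONE

# Accented letters keep their combining mark in the key
print(equivalent("A", "\xc5"))          # A vs A WITH RING ABOVE
```

With such a key, U+2028 and U+FE47 are no longer equivalent, and neither are U+0000 and U+00AD. On the other hand, accented letters would no longer match their base letter unless combining marks were also stripped from the key, which is an extra step not mentioned above.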

            Again, thank you for all your testing, and for calling this to my attention.

            • guy038G
              guy038
              last edited by guy038

              Hi, @coises,

              If you need my Total_Chars.txt file, simply extract it from the Unicode.zip archive, within my Google Drive account :

              https://drive.google.com/file/d/1kYtbIGPRLdypY7hNMI-vAJXoE7ilRMOC/view?usp=sharing

              You do not need the other files of this archive, as the main information is described below !


              The Total_Chars.txt file is a true UTF-8 file with a BOM, which contains each Unicode assigned and unassigned code-point, once only, from \x{0000} to \x{EFFFD}

              Physically, it contains 3 lines :

              • A first line, from \x{0000} to \x{0009}, with the \x{000A} line-break

              • A second line, from \x{000B} to \x{000C}, with the \x{000D} line-break

              • A third very LONG line with all characters, from \x{000E} to \x{EFFFD}, without some excluded ones ( refer below )


              In UTF-8 terms, the Total_Chars.txt file can be decomposed as :

                  •  [\x{0000}-\x{007F}]                       128  chars coded with 1 byte   =>         128
              
                  •  [\x{0080}-\x{07FF}]                     1,920  chars coded with 2 bytes  =>       3,840
              
                  •  [\x{0800}-\x{FFFD}]                    61,406  chars coded with 3 bytes  =>     184,218
              
                  •  Planes 1, 2, 3, 14 =  4 × 65,534  =   262,136  chars coded with 4 bytes  =>   1,048,544
                                                         -----------                             --------------
                                                           325,590  chars                          1,236,730  bytes
              
                  •  BOM                                                                                   3  bytes
                                                         -----------                             --------------
                                                           325,590  chars                          1,236,733  bytes
              

              As mentioned above, the Total_Chars.txt file does NOT contain the following zones :

                  • The SURROGATES block, from                       \x{D800}  to    \x{DFFF}
              
                  • The 32 NOT-Unicode chars, from                   \x{FDD0}  to    \x{FDEF}
              
                  • The two NOT-Unicode chars, ending the Plane 0    \x{FFFE}  and   \x{FFFF}
              
                  • The two NOT-Unicode chars, ending the Plane 1   \x{1FFFE}  and  \x{1FFFF}
              
                  • The two NOT-Unicode chars, ending the Plane 2   \x{2FFFE}  and  \x{2FFFF}
              
                  • The two NOT-Unicode chars, ending the Plane 3   \x{3FFFE}  and  \x{3FFFF}
              
                  • The COMPLETE planes 4 to 13, from               \x{40000}  to   \x{DFFFF}
              
                  • The two NOT-Unicode chars, ending the plane 14  \x{EFFFE}  and  \x{EFFFF}
              
                  • The PRIVATE-USE planes 15 to 16, from           \x{F0000}  to  \x{10FFFF}
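The character and byte totals can be recomputed from the exclusion list above. The sketch below rebuilds the same set of code points in memory ( it does not read the real file, and it ignores the order of the three physical lines ); the names INCLUDED_PLANES, is_excluded and total_chars_text are invented for this example.

```python
import codecs
from collections import Counter

INCLUDED_PLANES = (0, 1, 2, 3, 14)


def is_excluded(cp):
    """Surrogates block and the 32 code points U+FDD0..U+FDEF."""
    return 0xD800 <= cp <= 0xDFFF or 0xFDD0 <= cp <= 0xFDEF


def total_chars_text():
    """Build a string with every included code point, once only."""
    parts = []
    for plane in INCLUDED_PLANES:
        base = plane << 16
        # The last two code points of each plane (xxFFFE, xxFFFF) are left out.
        parts.extend(chr(cp) for cp in range(base, base + 0xFFFE)
                     if not is_excluded(cp))
    return "".join(parts)


text = total_chars_text()
payload = text.encode("utf-8")
by_length = Counter(len(ch.encode("utf-8")) for ch in text)

print(len(text), "chars")
print(dict(sorted(by_length.items())), "chars per UTF-8 length")
print(len(payload), "bytes without BOM")
print(len(payload) + len(codecs.BOM_UTF8), "bytes with BOM")
```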
              

              Here is, below, the list of all INCLUDED planes, followed by all the EXCLUDED zones of the Total_Chars.txt file :

                  •=========================================•=======================================•
                  |   Zones INCLUDED in 'Total_Chars.txt'   |     Range      |  Plane  |   # Chars  |
                  •=========================================•================•=========•============•
                  |                                         |   0000..FFFD   |     0   |    63,454  |
                  •-----------------------------------------•----------------•---------•------------•
                  |                                         |  10000..1FFFD  |     1   |    65,534  |
                  •-----------------------------------------•----------------•---------•------------•
                  |                                         |  20000..2FFFD  |     2   |    65,534  |
                  •-----------------------------------------•----------------•---------•------------•
                  |                                         |  30000..3FFFD  |     3   |    65,534  |
                  •-----------------------------------------•----------------•---------•------------•
                  |                                         |  E0000..EFFFD  |    14   |    65,534  |
                  •=========================================•================•=========•============•
                  |       Total INCLUDED characters         |                |         |   325,590  |
                  •=========================================•================•=========•============•
              
              
                  •=========================================•================•=========•===========•
                  |  Zones EXCLUDED from 'Total_Chars.txt'  |     Range      |  Plane  |  # Chars  |
                  •=========================================•================•=========•===========•
                  |  Surrogates                             |   D800..DFFF   |    0    |    2,048  |
                  |  Not Unicode                            |   FDD0..FDEF   |    0    |       32  |
                  |  Not Unicode                            |   FFFE..FFFF   |    0    |        2  |
                  •----------------------------------------------------------•---------•-----------•
                  |  Not Unicode                            |  1FFFE..1FFFF  |    1    |        2  |
                  •----------------------------------------------------------•---------•-----------•
                  |  Not Unicode                            |  2FFFE..2FFFF  |    2    |        2  |
                  •----------------------------------------------------------•---------•-----------•
                  |  Not Unicode                            |  3FFFE..3FFFF  |    3    |        2  |
                  •----------------------------------------------------------•---------•-----------•
                  |  Unassigned                             |  40000..4FFFD  |    4    |   65,534  |
                  |  Not Unicode                            |  4FFFE..4FFFF  |    4    |        2  |
                  •----------------------------------------------------------•---------•-----------•
                  |  Unassigned                             |  50000..5FFFD  |    5    |   65,534  |
                  |  Not Unicode                            |  5FFFE..5FFFF  |    5    |        2  |
                  •----------------------------------------------------------•---------•-----------•
                  |  Unassigned                             |  60000..6FFFD  |    6    |   65,534  |
                  |  Not Unicode                            |  6FFFE..6FFFF  |    6    |        2  |
                  •----------------------------------------------------------•---------•-----------•
                  |  Unassigned                             |  70000..7FFFD  |    7    |   65,534  |
                  |  Not Unicode                            |  7FFFE..7FFFF  |    7    |        2  |
                  •----------------------------------------------------------•---------•-----------•
                  |  Unassigned                             |  80000..8FFFD  |    8    |   65,534  |
                  |  Not Unicode                            |  8FFFE..8FFFF  |    8    |        2  |
                  •----------------------------------------------------------•---------•-----------•
                  |  Unassigned                             |  90000..9FFFD  |    9    |   65,534  |
                  |  Not Unicode                            |  9FFFE..9FFFF  |    9    |        2  |
                  •----------------------------------------------------------•---------•-----------•
                  |  Unassigned                             |  A0000..AFFFD  |   10    |   65,534  |
                  |  Not Unicode                            |  AFFFE..AFFFF  |   10    |        2  |
                  •----------------------------------------------------------•---------•-----------•
                  |  Unassigned                             |  B0000..BFFFD  |   11    |   65,534  |
                  |  Not Unicode                            |  BFFFE..BFFFF  |   11    |        2  |
                  •----------------------------------------------------------•---------•-----------•
                  |  Unassigned                             |  C0000..CFFFD  |   12    |   65,534  |
                  |  Not Unicode                            |  CFFFE..CFFFF  |   12    |        2  |
                  •----------------------------------------------------------•---------•-----------•
                  |  Unassigned                             |  D0000..DFFFD  |   13    |   65,534  |
                  |  Not Unicode                            |  DFFFE..DFFFF  |   13    |        2  |
                  •----------------------------------------------------------•---------•-----------•
                  |  Not Unicode                            |  EFFFE..EFFFF  |   14    |        2  |
                  •----------------------------------------------------------•---------•-----------•
                  |  Supplementary_Private_Use_Area-A       |  F0000..FFFFD  |   15    |   65,534  |
                  |  Not Unicode                            |  FFFFE..FFFFF  |   15    |        2  |
                  •----------------------------------------------------------•---------•-----------•
                  |  Supplementary_Private_Use_Area-B       | 100000..10FFFD |   16    |   65,534  |
                  |  Not Unicode                            | 10FFFE..10FFFF |   16    |        2  |
                  •=========================================•================•=========•===========•
                  |        Total EXCLUDED characters        |                |         |  788,522  |
                  •=========================================•================•=========•===========•
              
              
                  •-----------------------------------------•----------------•---------•-----------•
                  |        Total UNICODE characters         |  0000..10FFFF  | 0 - 16  | 1,114,112 |
                  •-----------------------------------------•----------------•---------•-----------•
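
              The three totals of these tables can be rebuilt from the ranges alone. This is a small Python sketch of that arithmetic; the helper name size is hypothetical.

```python
# Sketch: rebuild the INCLUDED / EXCLUDED totals from the ranges in the tables.

def size(first, last):
    """Number of code points in the inclusive range first..last."""
    return last - first + 1

def plane_body(plane):
    """Code points xx0000..xxFFFD of a plane (its two final non-characters excluded)."""
    base = plane << 16
    return size(base, base + 0xFFFD)

surrogates = size(0xD800, 0xDFFF)          # 2,048
fdd0_block = size(0xFDD0, 0xFDEF)          # 32
plane_ends = 17 * 2                        # xxFFFE and xxFFFF of the 17 planes

included = plane_body(0) - surrogates - fdd0_block
included += sum(plane_body(p) for p in (1, 2, 3, 14))

excluded = surrogates + fdd0_block + plane_ends
excluded += sum(plane_body(p) for p in list(range(4, 14)) + [15, 16])

print(included, excluded, included + excluded)
```

              The sum of both totals must equal 0x110000, the size of the whole Unicode code space.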
              

              Best Regards,

              guy038

              • CoisesC
                Coises @guy038
                last edited by

                @guy038 said in Columns++ version 1.3: All Unicode, all the time:

                As you can see, a lot of Format characters return an erroneous result of 3,240 occurrences. But we’re not going to bother about these wrong equivalence classes, as long as the similar collating names, with the [[.XXX.]] syntax, are totally correct !

                Luckily, all the other equivalence classes are also correct, except for [[=ls=]] which returns 2 matches \x{2028} and \x{FE47} ??

                Still looking into this, I find this statement in the Boost::regex documentation (emphasis mine):

                An expression of the form [[=col=]], matches any character or collating element whose primary sort key is the same as that for collating element col, as with collating elements the name col may be a symbolic name. A primary sort key is one that ignores case, accentation, or locale-specific tailorings; so for example [[=a=]] matches any of the characters: a, À, Á, Â, Ã, Ä, Å, A, à, á, â, ã, ä and å. Unfortunately implementation of this is reliant on the platform’s collation and localisation support; this feature can not be relied upon to work portably across all platforms, or even all locales on one platform.

                I used:

                LCMapStringEx(locale.data(),
                    LCMAP_SORTKEY | LINGUISTIC_IGNOREDIACRITIC | NORM_IGNORECASE | NORM_IGNOREKANATYPE
                    | NORM_IGNOREWIDTH | NORM_LINGUISTIC_CASING,
                    ...
                

                as my best guess at how to do this.
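
                To illustrate the idea of a primary sort key from the quoted paragraph, here is a rough Python sketch. It is not the LCMapStringEx approach shown above: it substitutes NFD decomposition, removal of combining marks and case folding, which only approximates a real collation key. The function names are hypothetical.

```python
import unicodedata

def rough_primary_key(ch):
    """Approximate a primary sort key: drop diacritics, then ignore case."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.casefold()

def rough_equivalents(col, candidates):
    """Return the candidates that share the approximate primary key of col."""
    key = rough_primary_key(col)
    return [c for c in candidates if rough_primary_key(c) == key]

# a, A-grave .. A-ring, A, a-grave .. a-ring (the list from the Boost documentation),
# followed by two characters that should not match: b and sharp s.
sample = ("a\u00C0\u00C1\u00C2\u00C3\u00C4\u00C5"
          "A\u00E0\u00E1\u00E2\u00E3\u00E4\u00E5" "b\u00DF")

print(rough_equivalents("a", sample))
```

                A platform collation API can group characters differently from this approximation, which is part of why results vary between implementations.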

                There are some differences other than the format characters between my search and Notepad++. For example, [[=k=]] matches Ʞ (U+A7B0) in Columns++ search, but not in Notepad++ native search; though both match its lower-case counterpart, ʞ (U+029E).

                I do wonder why [[=ls=]] matches ﹇ (U+FE47) as well as U+2028. Though Notepad++ native search does not accept the [[=ls=]] syntax, substituting the actual U+2028 character, [[=
=]] (you can copy that even though you can’t see it), yields 12 matches, including U+FE47.

                Do you know if there is a precise definition of what should count as an equivalence class in Unicode regular expressions? It is unclear to me for what target I should be aiming.

                • guy038G
                  guy038
                  last edited by guy038

                  Hello, @coises and All,

                  I’m compiling a list of ALL the word characters of EVERY Unicode block and I’ve noticed a strange behavior in three Unicode blocks ( Latin Extended-A, Georgian and Latin Extended-C )

                  Indeed, when you use the following regexes, against my Total_Chars.txt file, with the Columns++ plugin :

                  • (?=\w)[\x{0100}-\x{017F}]

                  • (?=\w)[\x{10A0}-\x{10FF}]

                  • (?=\w)[\x{2C60}-\x{2C7F}]

                  They all return an error ?!


                  However, note that the regexes :

                  • (?=\w)[\x{0100}-\x{017E}] returns 127 word chars

                  • (?=\w)\x{017F} returns 1 word char

                  Giving the exact total of word chars of the Latin Extended-A Unicode block ( 128 )


                  Note also that the regexes :

                  • (?=\w)[\x{10A0}-\x{10C7}] returns 39 word chars

                  • (?=\w)[\x{10C8}-\x{10FF}] returns 48 word chars

                  Giving the exact number of word chars of the Georgian Unicode block ( 87 )


                  Finally, note that the regexes :

                  • (?=\w)[\x{2C60}-\x{2C7D}] returns 30 word chars

                  • (?=\w)[\x{2C7E}-\x{2C7F}] returns 2 word chars

                  Giving the exact number of word chars of the Latin Extended-C Unicode block ( 32 )

                  TIA, @coises, for investigating !

                  Best Regards,

                  guy038

                  • CoisesC
                    Coises @guy038
                    last edited by

                    @guy038 said in Columns++ version 1.3: All Unicode, all the time:

                    They all return an error ?!

                    Thank you for discovering this!

                    I’ve identified the problem. It is an error in how I handle match case. If you test with (?-i) before the expressions you’ll find that they work.

                    To follow the explanation, note these characteristics of ranges in Boost::regex:

                    • Ranges must have the lower bound first and the upper bound second. Reverse order is not allowed and produces an error message.

                    • Case insensitive ranges are processed by first case folding both ends of the range, then accepting any character which case folds to a character within the range.

                    The reason the ranges you tried don’t work with match case checked is that I neglected to include that switch when testing the validity of a regex, thinking (wrongly) that case sensitivity could not affect the validity of a regex.

                    I am reasonably certain (but haven’t yet verified in detail) that the reason the first and third expressions work case-insensitive in Notepad++ native search, but don’t work case-insensitive in Columns++ search, is that Columns++ uses Unicode-defined case folding, while I believe Notepad++ (as a Boost::regex default) uses Windows lower-casing. Those two aren’t always the same.

                    I will prepare a new version of Columns++ to fix this. In the meantime, you can work around it by prefixing (?-i) to case sensitive searches instead of depending on the match case check box.
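
                    The effect of case folding both ends of a range can be sketched in Python. This is only an illustration: it uses Python's str.casefold() as a stand-in for Unicode case folding, assumes each bound folds to a single code point, and is not the Boost::regex or Columns++ code. The function names are hypothetical.

```python
def fold(cp):
    """Case fold one code point (assumption: keep it unchanged if the fold is multi-char)."""
    folded = chr(cp).casefold()
    return ord(folded) if len(folded) == 1 else cp

def folded_bounds(lo, hi):
    """Case fold both ends of the range lo..hi."""
    return fold(lo), fold(hi)

def range_is_valid_when_folded(lo, hi):
    """A range is rejected when its folded lower bound lies above its folded upper bound."""
    flo, fhi = folded_bounds(lo, hi)
    return flo <= fhi

ranges = [
    (0x0100, 0x017F), (0x0100, 0x017E),                     # Latin Extended-A
    (0x10A0, 0x10FF), (0x10A0, 0x10C7), (0x10C8, 0x10FF),   # Georgian
    (0x2C60, 0x2C7F), (0x2C60, 0x2C7D), (0x2C7E, 0x2C7F),   # Latin Extended-C
]
for lo, hi in ranges:
    flo, fhi = folded_bounds(lo, hi)
    verdict = "ok" if flo <= fhi else "reversed"
    print(f"[{lo:04X}-{hi:04X}] folds to [{flo:04X}-{fhi:04X}] : {verdict}")
```

                    Under these assumptions, the three full-block ranges come out reversed after folding, while the split ranges stay in order, which is consistent with the errors reported above.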

                    • CoisesC
                      Coises @guy038
                      last edited by

                      @guy038 said in Columns++ version 1.3: All Unicode, all the time:

                      Indeed, when you use the following regexes, against my Total_Chars.txt file, with the Columns++ plugin :

                      (?=\w)[\x{0100}-\x{017F}]
                      
                      (?=\w)[\x{10A0}-\x{10FF}]
                      
                      (?=\w)[\x{2C60}-\x{2C7F}]
                      

                      They all return an error ?!

                      Columns++ version 1.3.1 should fix this (when Match case is checked; odd behavior for ranges seems unavoidable when case insensitive mode is in effect; note that Notepad++ native search also gives an error on the second expression with Match case not checked).

                      Notepad++ version 8.9.1 release candidate is expected any day now, so I rushed this in… hopefully I didn’t make any major mistakes.

                      Thank you again, @guy038, for catching this bug.

                      • guy038G
                        guy038
                        last edited by guy038

                        Hello, @coises and All,

                        I’ve found a small anomaly concerning hexadecimal characters :

                        • If I use the native Notepad++ search to match any hexadecimal character, with the regex [[:xdigit:]], against my Total_Chars.txt file, it returns 44 matches

                        • If I use the Columns++ search to match any hexadecimal character, with the regex [[:xdigit:]], against my Total_Chars.txt file, it returns 22 matches

                        I suppose that the N++ answer is the right one. Indeed, Annex C of the UNICODE REGULAR EXPRESSIONS report, at https://www.unicode.org/reports/tr18/#Compatibility_Properties , says :

                        Hex_Digit contains 0-9 A-F fullwidth and halfwidth, upper and lowercase

                        Note that the regex \p{Hex_Digit} is invalid ! The right syntax is \p{xdigit}, at least within Columns++

                        Here is another proof from https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt. Search for the string Hex in your browser : it clearly shows that the total should be 44 !
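
                        The difference between 22 and 44 can be shown with a short Python sketch. The six ranges are typed in by hand from the Hex_Digit entries of PropList.txt, so they are an assumption to re-check against that file; the NFKC test is only a sanity check that each fullwidth form maps back to an ASCII hex digit.

```python
import string
import unicodedata

# Hex_Digit ranges, copied by hand from PropList.txt (assumption: still current).
HEX_DIGIT_RANGES = [
    (0x0030, 0x0039), (0x0041, 0x0046), (0x0061, 0x0066),   # 0-9  A-F  a-f
    (0xFF10, 0xFF19), (0xFF21, 0xFF26), (0xFF41, 0xFF46),   # fullwidth forms
]

hex_digits = [chr(cp)
              for first, last in HEX_DIGIT_RANGES
              for cp in range(first, last + 1)]
ascii_only = [c for c in hex_digits if ord(c) < 0x80]

# Sanity check: every listed character normalizes (NFKC) to an ASCII hex digit.
all_map_to_ascii = all(unicodedata.normalize("NFKC", c) in string.hexdigits
                       for c in hex_digits)

print(len(ascii_only), len(hex_digits), all_map_to_ascii)
```

                        The ASCII-only count is what Columns++ currently returns, and the full count is what the native search returns.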


                        I also found some other syntaxes for the Unicode classes :

                        Any Unicode class regex can be expressed with one of these four syntaxes :

                        \p{Xx} , \p{Xxxxxxx} , [[:Xx:]] , [[:Xxxxxxx:]]

                        Therefore, here is an update of my previous post https://community.notepad-plus-plus.org/post/104377 :


                        Against the Total_Chars.txt file, all these general results, below, are correct :

                        (?s).  =  \I  =  \p{Any}  =  [\x{0000}-\x{EFFFD}]                                         =>                Total =  325,590
                        
                        
                        \p{Unicode}  =  [[:Unicode:]]                                                             =>  325,334    |
                                                                                                                                 |  Total =  325,590
                        \P{Unicode}  =  [[:^Unicode:]]                                                            =>      256    |
                        
                        
                        \p{Ascii}  =  \o                                                                          =>      128    |
                                                                                                                                 |  Total =  325,590
                        \P{Ascii}  =  \O                                                                          =>  325,462    |
                        
                        
                        \X                                                                                        =>  322,586    |
                                                                                                                                 |  Total =  325,590
                        (?!\X).                                                                                   =>    3,004    |
                        
                        
                        [\x{E000}-\x{F8FF}]|\y     =  [\x{E000}-\x{F8FF}]|[[:defined:]]      =  \p{Assigned}      =>  166,266    |
                                                                                                                                 |  Total =  325,590
                        (?![\x{E000}-\x{F8FF}])\Y  =  (?![\x{E000}-\x{F8FF}])[^[:defined:]]  =  \p{Not Assigned}  =>  159,324    |
                        
                        

                        Note : if we add, to the number of characters of Total_Chars.txt, the contents of all the omitted planes ( Planes 4 to 13, 15 and 16 ), minus the TWO non-characters of each, plus the Surrogate characters and all the Unicode non-characters, we obtain :

                        325,590 + (65536 - 2) * 12 + 2,048 + 66 = 1,114,112 which is, indeed, the total number of Unicode chars, whether assigned or not assigned !


                        Here are the correct results for all the POSIX character classes, against the Total_Chars.txt file :

                        [[:ascii:]]                                              an UNDER \x{0080}         char        128   =  [\x{0000}-\x{007F}]  =  \p{ascii} = \o
                        
                        [[:unicode:]] = \p{unicode}                              an OVER  \x{00FF}         char    325,334   =  [\x{0100}-\x{EFFFD}] ( RESTRICTED to 'Total_Chars.txt' )
                        
                        
                        [[:space:]]   = \p{space} = [[:s:]] = \p{s} = \ps = \s   a             WHITE-SPACE char         25   =  [\t\n\x{000B}\f\r\x20\x{0085}\x{00A0}\x{1680}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}]
                                                    [[:h:]] = \p{h} = \ph = \h   a  HORIZONTAL white space char         18   =  [\t\x20\x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}]  =  \p{Zs}|\t
                        [[:blank:]]   = \p{blank}                                a  BLANK                  char         18   =  [\t\x20\x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}]  =  \p{Zs}|\t
                                                    [[:v:]] = \p{v} = \pv = \v   a  VERTICAL   white space char          7   =  [\n\x{000B}\f\r\x{0085}\x{2028}\x{2029}]
                        
                        [[:cntrl:]]   = \p{cntrl}                                a  CONTROL code           char         65   =  [\x{0000}-\x{001F}\x{007F}\x{0080}-\x{009F}]
                        
                        [[:upper:]]   = \p{upper} = [[:u:]] = \p{u} = \pu = \u   an  UPPER case letter     char      1,886   =  \p{Lu}
                        [[:lower:]]   = \p{lower} = [[:l:]] = \p{l} = \pl = \l   a   LOWER case letter     char      2,283   =  \p{Ll}
                                                                                 a   DI-GRAPHIC letter     char         31   =  \p{Lt}
                                                                                 a   MODIFIER letter       char        410   =  \p{Lm}
                                                                                 an  OTHER letter          char    141,062   =  \p{Lo}
                                                                                   + SYLLABLES / IDEOGRAPHS
                        [[:digit:]]   = \p{digit} = [[:d:]] = \p{d} = \pd = \d   a   DECIMAL       number              770   =  \p{Nd}
                         _            = \x{005F}                                 the LOW_LINE              char          1
                                                                                                                  ---------
                        [[:word:]]    = \p{word}  = [[:w:]] = \p{w} = \pw = \w   a   WORD                  char    146,443   =  \p{L*}|\p{Nd}|_   ( But it should be \p{L*}|\p{Nd}|\p{M*}|\p{Pc}|\x{200C}|\x{200D} ! )
                        
                        [[:alnum:]]   = \p{alnum}                                an  ALPHANUMERIC          char    146,442   =  \p{L*}|\p{Nd}
                        
                        [[:alpha:]]   = \p{alpha}                                any LETTER                char    145,672   =  \p{L*}
                        
                        [[:graph:]]   = \p{graph}                                any VISIBLE               char    159,612   =  [^\s[:C*:]]  =  (?=\S)\P{Other}
                        
                        [[:print:]]   = \p{print}                                any PRINTABLE             char    159,637   =  [[:graph:]]|\s
                        
                        [[:punct:]]   = \p{punct}                                any PUNCTUATION or SYMBOL char      9,473   =  \p{P*}|\p{S*}  =  \p{Punctuation}|\p{Symbol}  =  856 + 8,617
                        
                        [[:xdigit:]]  = \p{xdigit}                               a  HEXADECIMAL            char         22   =  [0-9A-Fa-f]   ( But it should be [\x{0030}-\x{0039}\x{0041}-\x{0046}\x{0061}-\x{0066}\x{FF10}-\x{FF19}\x{FF21}-\x{FF26}\x{FF41}-\x{FF46}] ! )
                        

                        And here are the correct results for the Unicode character classes, against the Total_Chars.txt file :

                                   \p{Any}                               = [[:Any:]]                    = ANY                            char        325,590  =  (?s).  =  \I  =  [\x{0000}-\x{EFFFD}]
                        
                                   \p{Ascii}                             = [[:Ascii:]]                  = an UNDER \x80                  char            128  =  [[:ascii:]]  =  \o
                        
                                   \p{Assigned}                          = [[:Assigned:]]               = an ASSIGNED                    char        166,266   ( of Total_Chars.txt, ONLY )
                        
                        \p{Cc}  =  \p{Control}                = [[:Cc:]] = [[:Control:]]                = a  C0 or C1 CONTROL code       char             65
                        \p{Cf}  =  \p{Format}                 = [[:Cf:]] = [[:Format:]]                 = a  FORMAT CONTROL              char            170
                        \p{Cn}  =  \p{Not Assigned}           = [[:Cn:]] = [[:Not Assigned:]]           = an UNASSIGNED or NON-CHARACTER char        159,324   ( 'Total_Chars.txt' does NOT contain the 66 NON-CHARACTER chars )
                        \p{Co}  =  \p{Private Use}            = [[:Co:]] = [[:Private Use:]]            = a  PRIVATE-USE                 char          6,400
                        \p{Cs}  =  \p{Surrogate}              = [[:Cs:]] = [[:Surrogate:]]              = a  SURROGATE                   char         [2,048]  ( 'Total_Chars.txt' does NOT contain the 2,048 SURROGATE chars )
                                                                                                                                                   -----------
                        \p{C*}  =  \p{Other}                  = [[:C*:]] = [[:Other:]]                  =                                            165,959  =  \p{Cc}|\p{Cf}|\p{Cn}|\p{Co}
                        
                        \p{Lu}  =  \p{Uppercase Letter}       = [[:Lu:]] = [[:Uppercase Letter:]]       = an UPPER case letter           char          1,886  =  \u  =  [[:upper:]]  =  \p{upper}
                        \p{Ll}  =  \p{Lowercase Letter}       = [[:Ll:]] = [[:Lowercase Letter:]]       = a  LOWER case letter           char          2,283  =  \l  =  [[:lower:]]  =  \p{lower}
                        \p{Lt}  =  \p{Titlecase}              = [[:Lt:]] = [[:Titlecase:]]              = a  DI-GRAPHIC letter           char             31
                        \p{Lm}  =  \p{Modifier Letter}        = [[:Lm:]] = [[:Modifier Letter:]]        = a  MODIFIER   letter           char            410
                        \p{Lo}  =  \p{Other Letter}           = [[:Lo:]] = [[:Other Letter:]]           = an OTHER letter                char        141,062
                                                                                                            + SYLLABLES / IDEOGRAPHS              -----------
                        \p{L*}  =  \p{Letter}                 = [[:L*:]] = [[:Letter:]]                 =                                            145,672  =  \p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}  =  [[:alpha:]]   =  \p{alpha}
                        
                        \p{Mc}  =  \p{Spacing Combining Mark} = [[:Mc:]] = [[:Spacing Combining Mark:]] = a   SPACING  COMBINING         char            471
                        \p{Me}  =  \p{Enclosing Mark}         = [[:Me:]] = [[:Enclosing Mark:]]         = an  ENCLOSING                  char             13
                        \p{Mn}  =  \p{Non-Spacing Mark}       = [[:Mn:]] = [[:Non-Spacing Mark:]]       = a   NON-SPACING COMBINING      char          2,059
                                                                                                                                                      --------
                        \p{M*}  =  \p{Mark}                   = [[:M*:]] = [[:Mark:]]                                                                  2,543  =  \p{Mc}|\p{Me}|\p{Mn}
                        
                        
                        \p{Nd}  =  \p{Decimal Digit Number}   = [[:Nd:]] = [[:Decimal Digit Number:]]   = a DECIMAL number               char            770
                        \p{Nl}  =  \p{Letter Number}          = [[:Nl:]] = [[:Letter Number:]]          = a LETTERLIKE numeric           char            239
                        \p{No}  =  \p{Other Number}           = [[:No:]] = [[:Other Number:]]           = OTHER NUMERIC                  char            915
                                                                                                                                                      --------
                        \p{N*}  =  \p{Number}                 = [[:N*:]] = [[:Number:]]                                                                1,924  =  \p{Nd}|\p{Nl}|\p{No}
                        
                        \p{Pd}  =  \p{Dash Punctuation}       = [[:Pd:]] = [[:Dash Punctuation:]]       = a  DASH or HYPHEN punctuation  char             27
                        \p{Ps}  =  \p{Open Punctuation}       = [[:Ps:]] = [[:Open Punctuation:]]       = an OPENING    PUNCTUATION      char             79
                        \p{Pc}  =  \p{Connector Punctuation}  = [[:Pc:]] = [[:Connector Punctuation:]]  = a  CONNECTING PUNCTUATION      char             10
                        \p{Pe}  =  \p{Close Punctuation}      = [[:Pe:]] = [[:Close Punctuation:]]      = a  CLOSING    PUNCTUATION      char             77
                        \p{Pi}  =  \p{Initial Punctuation}    = [[:Pi:]] = [[:Initial Punctuation:]]    = an INITIAL QUOTATION           char             12
                        \p{Pf}  =  \p{Final Punctuation}      = [[:Pf:]] = [[:Final Punctuation:]]      = a  FINAL   QUOTATION           char             10
                        \p{Po}  =  \p{Other Punctuation}      = [[:Po:]] = [[:Other Punctuation:]]      = OTHER PUNCTUATION              char            641
                                                                                                                                                       -------
                        \p{P*}  =  \p{Punctuation}            = [[:P*:]] = [[:Punctuation:]]            =                                                856  =  \p{Pd}|\p{Ps}|\p{Pc}|\p{Pe}|\p{Pi}|\p{Pf}|\p{Po}
                        
                        \p{Sm}  =  \p{Math Symbol}            = [[:Sm:]] = [[:Math Symbol:]]            = a MATHEMATICAL symbol          char            960
                        \p{Sc}  =  \p{Currency Symbol}        = [[:Sc:]] = [[:Currency Symbol:]]        = a CURRENCY                     char             64
                        \p{Sk}  =  \p{Modifier Symbol}        = [[:Sk:]] = [[:Modifier Symbol:]]        = a NON-LETTERLIKE MODIFIER      char            125
                        \p{So}  =  \p{Other Symbol}           = [[:So:]] = [[:Other Symbol:]]           = OTHER SYMBOL                   char          7,468
                                                                                                                                                     ---------
                        \p{S*}  =  \p{Symbol}                 = [[:S*:]] = [[:Symbol:]]                 =                                              8,617  =  \p{Sm}|\p{Sc}|\p{Sk}|\p{So}
                        
                        \p{Zs}  =  \p{Space Separator}        = [[:Zs:]] = [[:Space Separator:]]        = a NON-ZERO width SPACE         char             17  =  [\x{0020}\x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}]  =  (?!\t)\h
                        \p{Zl}  =  \p{Line Separator}         = [[:Zl:]] = [[:Line Separator:]]         = the LINE SEPARATOR             char              1  =  \x{2028}
                        \p{Zp}  =  \p{Paragraph Separator}    = [[:Zp:]] = [[:Paragraph Separator:]]    = the PARAGRAPH SEPARATOR        char              1  =  \x{2029}
                                                                                                                                                        ------
                        \p{Z*}  =  \p{Separator}              = [[:Z*:]] = [[:Separator:]]              =                                                 19  =  \p{Zs}|\p{Zl}|\p{Zp}
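
                        As a cross-check outside of Notepad++, the general categories can be counted with Python's unicodedata module. This sketch assumes the same contents for Total_Chars.txt as described earlier in this thread. The per-category figures depend on the Unicode version bundled with the Python build, so they may differ from the table above; only the grand total and a few long-stable categories are expected to agree everywhere.

```python
import unicodedata
from collections import Counter

def included_code_points():
    """Code points assumed to be in Total_Chars.txt (planes 0-3 and 14)."""
    for plane in (0, 1, 2, 3, 14):
        base = plane << 16
        for cp in range(base, base + 0xFFFE):            # skips xxFFFE and xxFFFF
            if 0xD800 <= cp <= 0xDFFF or 0xFDD0 <= cp <= 0xFDEF:
                continue                                 # surrogates, non-characters
            yield cp

# Count per two-letter general category (Lu, Ll, Mn, ...).
counts = Counter(unicodedata.category(chr(cp)) for cp in included_code_points())

# Roll the sub-categories up into the major classes (C, L, M, N, P, S, Z).
groups = Counter()
for category, n in counts.items():
    groups[category[0]] += n

print("Unicode version :", unicodedata.unidata_version)
for category in sorted(counts):
    print(f"{category}  {counts[category]:>9,}")
for group in sorted(groups):
    print(f"{group}*  {groups[group]:>9,}")
print(f"Total  {sum(counts.values()):,}")
```

                        Whatever the Unicode version, the seven major classes must add up to the 325,590 characters of the file.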
                        

                        Remark :

                        • A negative POSIX character class can be expressed as [^[:........:]] or [[:^........:]]

                        • A negative UNICODE character class can be expressed as \P{..}, with an uppercase letter P


                        Now, if you follow the procedure explained in the last part of this post :

                        https://community.notepad-plus-plus.org/post/99844

                        The regexes [\x{DC80}-\x{DCFF}] or \i or [[:invalid:]] do give 134 occurrences, which is the exact number of invalid UTF-8 characters of that example !
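
                        The same range shows up in Python, whose surrogateescape error handler maps each undecodable byte 0x80..0xFF to a code point in U+DC80..U+DCFF. The sketch below is only an analogy that uses this Python feature; it makes no claim about how Columns++ is implemented, and the function name and sample bytes are hypothetical.

```python
def count_invalid_utf8_bytes(raw: bytes) -> int:
    """Count the bytes of raw that cannot be decoded as UTF-8."""
    # Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
    text = raw.decode("utf-8", errors="surrogateescape")
    return sum(1 for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF)

# 'abc', two bytes that never occur in UTF-8, a valid two-byte 'e-acute',
# then a lone continuation byte.
sample = b"abc\xff\xfe\xc3\xa9\x80"
print(count_invalid_utf8_bytes(sample))
```

                        Here the three stray bytes are counted and the valid two-byte sequence is not.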

                        Best Regards,

                        guy038
