Search for character classes but not replace them

guy038

Hello, @carypt, @benjamin-sasse, @peterjones, @alan-kilborn and All,

Here is, at last, my reply !

First, you said :

so now i know npp has a faster speed for dropping higher unicode characters , ok , the main used chinese etc characters seem to be contained in the base multi plane of unicode .

This statement is not correct and, maybe, I was misunderstood !

As I said, not using the full Unicode support in the N++ Boost regex implementation surely speeds up the regex engine but, in return, prevents us from using any of the Unicode regex syntaxes listed on this page :

https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/syntax/character_classes/optional_char_class_names.html

However, we should still be able to match individual characters with a code-point above the BMP, using the logical syntax \x{.....}
( from \x{10000} to \x{10FFFF} )

Luckily, we can reach an individual character above the BMP by using the surrogate mechanism ! For instance, to match the 🚂 character ( STEAM LOCOMOTIVE ), with Unicode code-point U+1F682, we can use the pair \x{D83D}\x{DE82}, as the values D83D and DE82 are the high and low surrogates of the UTF-16 encoding of the code-point U+1F682 !
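To find the surrogate pair of any other character above the BMP, the standard UTF-16 arithmetic can be scripted. Here is a minimal Python sketch; the function names to_surrogate_pair and to_regex are mine, for illustration only, and are not part of N++ or Boost :

```python
def to_surrogate_pair(code_point):
    """Return the UTF-16 (high, low) surrogates of a code point above the BMP."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point must be in the range U+10000..U+10FFFF")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def to_regex(code_point):
    """Build the \\x{HHHH}\\x{LLLL} search string for a code point above the BMP."""
    high, low = to_surrogate_pair(code_point)
    return f"\\x{{{high:04X}}}\\x{{{low:04X}}}"


print(to_regex(0x1F682))   # \x{D83D}\x{DE82}
```

The printed string can be pasted, as is, in the Find what: zone, with the Regular expression search mode selected.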

As I said in my short previous post, I succeeded in creating a UTF-8-BOM encoded file containing all existing Unicode characters. But, contrary to what I said, I don’t have to store all the Unicode code-points ( 1,114,112 ), because :

Some zones are forbidden, as they are permanently declared NON-character zones by the Unicode Consortium
The Surrogates zone ( [\x{D800}-\x{DFFF}] ), used to encode the characters above the BMP in a UTF-16 encoded file, is forbidden
Some Unicode planes ( Planes 4 to 13 ) are totally empty, as not used up to now, and probably for a long time
The Unicode planes 15 and 16, standing for the Supplementary Private Use Areas, are generally not used, either

Here is a table which summarizes the layout of all the Unicode characters :

    •--------------------•-------------------•------------•---------------------------•------------•-------------------•
    |       Range        |    Description    |   Status   |      Number of Chars      |  Encoding  |  Number of Bytes  |
    •--------------------•-------------------•------------•-------------•-------------•------------•-------------------•
    |    0000  -   007F  |  PLANE 0 - BMP    |  Included  |             |        128  |   1 Byte   |              128  |
    •--------------------•-------------------•------------•-------------•-------------•------------•-------------------•
    |    0080  -   07FF  |  PLANE 0 - BMP    |  Included  |             |    + 1,920  |   2 Bytes  |            3,840  |
    •--------------------•-------------------•------------•-------------•-------------•------------•-------------------•
    |    0800  -   D7FF  |  PLANE 0 - BMP    |  Included  |             |   + 53,248  |            |          159,744  |
    |                    |                   |            |             |             |            |                   |
    |    D800  -   DFFF  |  SURROGATES zone  |  EXCLUDED  |    - 2,048  |             |            |                   |
    |                    |                   |            |             |             |            |                   |
    |    E000  -   F8FF  |  PLANE 0 - PUA    |  Included  |             |    + 6,400  |            |           19,200  |
    |                    |                   |            |             |             |            |                   |
    |    F900  -   FDCF  |  PLANE 0 - BMP    |  Included  |             |    + 1,232  |   3 Bytes  |            3,696  |
    |                    |                   |            |             |             |            |                   |
    |    FDD0  -   FDEF  |  NON-characters   |  EXCLUDED  |       - 32  |             |            |                   |
    |                    |                   |            |             |             |            |                   |
    |    FDF0  -   FFFD  |  PLANE 0 - BMP    |  Included  |             |      + 526  |            |            1,578  |
    |                    |                   |            |             |             |            |                   |
    |    FFFE  -   FFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |            |                   |
    •--------------------•-------------------•------------•-------------•-------------•------------•-------------------•
    |                       Plane 0 - BMP    | SUB-Totals |    - 2,082  |   + 63,454  |      /     |          188,186  |
    •--------------------•-------------------•------------•-------------•-------------•------------•-------------------•
    |   10000  -  1FFFD  |  PLANE 1 - SMP    |  Included  |             |   + 65,534  |            |          262,136  |
    |                    |                   |            |             |             |            |                   |
    |   1FFFE  -  1FFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |            |                   |
    •--------------------•-------------------•------------•-------------•-------------•            •-------------------•
    |   20000  -  2FFFD  |  PLANE 2 - SIP    |  Included  |             |   + 65,534  |            |          262,136  |
    |                    |                   |            |             |             |   4 Bytes  |                   |
    |   2FFFE  -  2FFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |            |                   |
    •--------------------•-------------------•------------•-------------•-------------•            •-------------------•
    |   30000  -  3FFFD  |  PLANE 3 - TIP    |  Included  |             |   + 65,534  |            |          262,136  |
    |                    |                   |            |             |             |            |                   |
    |   3FFFE  -  3FFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |            |                   |
    •--------------------•-------------------•------------•-------------•-------------•------------•-------------------•
    |   40000  -  DFFFF  |  PLANES 4 to 13   |  NOT USED  |  - 655,360  |             |   4 Bytes  |                   |
    •--------------------•-------------------•------------•-------------•-------------•------------•-------------------•
    |   E0000  -  EFFFD  |  PLANE 14 - SPP   |  Included  |             |   + 65,534  |            |          262,136  |
    |                    |                   |            |             |             |            |                   |
    |   EFFFE  -  EFFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |            |                   |
    •--------------------•-------------------•------------•-------------•-------------•            •-------------------•
    |   F0000  -  FFFFD  |  PLANE 15 - SPUA  |  NOT USED  |   - 65,534  |             |            |                   |
    |                    |                   |            |             |             |            |                   |
    |   FFFFE  -  FFFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |            |                   |
    •--------------------•-------------------•------------•-------------•-------------•  4 Bytes   •-------------------•
    |  100000  - 10FFFD  |  PLANE 16 - SPUA  |  NOT USED  |   - 65,534  |             |            |                   |
    |                    |                   |            |             |             |            |                   |
    |  10FFFE  - 10FFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |            |                   |
    •--------------------•-------------------•------------•-------------•-------------•------------•-------------------•
    |                                       GRAND Totals  |  - 788,522  |  + 325,590  |            |        1,236,730  |
    |                                                     |             |             |            |                   |
    |                              Byte Order Mark - BOM  |             |             |     /      |                3  |
    •-----------------------------------------------------•-------------•-------------•            •-------------------•
    |                                                     |  1,114,112 Unicode chars  |            |  Size  1,236,733  |
    •-----------------------------------------------------•---------------------------•------------•-------------------•

Refer here for additional information

Thus, I’m left with a file of size 1,236,733 bytes, containing exactly 325,590 Unicode characters. Of course, depending on the current font, N++ is generally not able to display the glyphs of all these characters ! But it doesn’t matter, because we just want to know which, and how many, characters are matched by a specific Character class, POSIX or not ;-))

I stop this post here, because any post is limited to about 16,000 bytes !
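The totals of the table can be cross-checked with a short script. This is only a sketch in Python, not the way the file was actually built : the helper names is_included and utf8_length are mine, and the exclusion rules are simply the EXCLUDED and NOT USED rows of the table above.

```python
def is_included(cp):
    """True if the code point belongs to one of the 'Included' rows of the table."""
    if 0xD800 <= cp <= 0xDFFF:             # surrogates zone
        return False
    if 0xFDD0 <= cp <= 0xFDEF:             # non-characters of the BMP
        return False
    if (cp & 0xFFFF) >= 0xFFFE:            # last two code points of every plane
        return False
    plane = cp >> 16
    if 4 <= plane <= 13 or plane >= 15:    # unused planes and supplementary PUA planes
        return False
    return True


def utf8_length(cp):
    """Number of bytes needed to encode the code point in UTF-8."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


count = 0
size = 0
for cp in range(0x110000):                 # all 1,114,112 code points
    if is_included(cp):
        count += 1
        size += utf8_length(cp)

BOM_SIZE = 3                               # EF BB BF at the very beginning of the file
print(count)                               # 325590
print(size + BOM_SIZE)                     # 1236733
```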

guy038

Hi, @carypt, @benjamin-sasse, @peterjones, @alan-kilborn and All,

Continuation of the discussion :

Now, from this page, here is the summary list of the 16 available Character classes, known to our Boost regex engine :

    •=========================•===============================•==============•===========•===========================================•===============================================================================================================================================•
    |  INSIDE a Class [....]  |     OUTSIDE a Class [....]    |  EVERYWHERE  |   Total   |  SIMPLIFIED and / or APPROXIMATE regex    |                                                EXACT or Win-1252-EQUIVALENT regex                                                             |
    •=========================•===============================•==============•===========•===========================================•===============================================================================================================================================•
    |        [:alpha:]        |  \p{alpha}  |         |       |              |   45,813  |  (?i)[A-Z]                                |  [^\W\d\x5f]                                                                                                                                  |
    •---------------•---------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------•
    |   [:digit:]   |  [:d:]  |  \p{digit}  |  \p{d}  |  \pd  |      \d      |      201  |  [0-9]                                    |  [0-9¹²³.....]   or  [0-9¹²³]  ( with "Win-1252" Encoding )                                                                                   |
    •---------------•---------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------•
    |        [:alnum:]        |  \p{alnum}  |         |       |              |   46,014  |  (?i)[0-9A-Z]                             |  [^\W\x5f]                                                                                                                                    |
    •---------------•---------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------•
    |   [:word:]    |  [:w:]  |  \p{word}   |  \p{w}  |  \pw  |      \w      |   46,015  |  (?i)[0-9_A-Z]                            |  [[:alnum:]\x5F]                                                                                                                              |
    •---------------•---------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------•
    |        [:punct:]        |  \p{punct}  |         |       |              |      334  |  (?!\w)[[:graph:]]                        |  [!"#$%&'()*+,-./:;<=>?@[\\]^`{|}~‚„…†‡‰‹‘’“”•–—›¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿×÷]    ( with "Win-1252" Encoding )                             |
    •---------------•---------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------•
    |        [:graph:]        |  \p{graph}  |         |       |              |   46,342  |  [[:punct:]\w]                            |  (?![ªº_¹²³µ])[[:punct:]]|[[:word:]]  or  [^[:^punct:][:^word:]]                                                                              |
    •---------------•---------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------•
    |        [:print:]        |  \p{print}  |         |       |              |   46,368  |  [[:punct:]\w\s]                          |  [[:space:][:graph:]\x{FEFF}]                                                                                                                 |
    •---------------•---------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------•
    |   [:space:]   |  [:s:]  |  \p{space}  |  \p{s}  |  \ps  |      \s      |       25  |  [\t\n\r\x20]                             |  [\t\n\x0B\f\r\x20\x85\xA0\x{1680}\x{2000}-\x{200B}\x{2028}\x{2029}\x{202F}\x{3000}]                                                          |
    •---------------•---------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------•
    |          [:h:]          |         \p{h}         |  \ph  |      \h      |       18  |  [\t\x20]                                 |  [\t\x20\xA0\x{1680}\x{2000}-\x{200B}\x{202F}\x{3000}]                                                                                        |
    •-------------------------•-----------------------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------•
    |          [:v:]          |         \p{v}         |  \pv  |      \v      |        7  |  [\r\n]                                   |  [\n\x0b\f\r\x85\x{2028}\x{2029}]                                                                                                             |
    •---------------•---------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------•
    |   [:upper:]   |  [:u:]  |  \p{upper}  |  \p{u}  |  \pu  |      \u      |      717  |  (?-i)[A-Z]                               |  (?-i)[A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ..........]  or  (?-i)[A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]       ( with "Win-1252" Encoding )  |
    •---------------•---------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------•
    |   [:lower:]   |  [:l:]  |  \p{lower}  |  \p{l}  |  \pl  |      \l      |      835  |  (?-i)[a-z]                               |  (?-i)[a-zƒšœžªµºßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ.....]  or  (?-i)[a-zƒšœžªµºßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]  ( with "Win-1252" Encoding )  |
    •---------------•---------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------•
    |        [:cntrl:]        |  \p{cntrl}  |         |       |              |       89  |  [\x00-\x1F\x7F\x80-\x9F]                 |  [\x00-\x1F\x7F\x80-\x9F\x{070F}\x{180B}-\x{180E}\x{200C}-\x{200F}\x{202A}-\x{202E}\x{206A}-\x{206F}\x{FEFF}\x{FFF9}-\x{FFFB}]                |
    •-------------------------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------•
    |       [:xdigit:]        | \p{xdigit}  |         |       |              |       44  |  (?i)[A-F0-9]                             |  (?i)[A-F0-9\x{FF10}-\x{FF19}\x{FF21}-\x{FF26}]                                                                                               |
    •-------------------------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------•
    |        [:blank:]        |  \p{blank}  |         |       |              |        5  |  [\t\x20\xA0]                             |  [\t\x20\xA0\x{3000}\x{FEFF}]                                                                                                                 |
    •-------------------------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------•
    |       [:unicode:]       | \p{unicode} |         |       |              |  325,334  |  [^\x00-\xFF]                             |  [^\x00-\xFF]                                                                                                                                 |
    •------------------------------------------------------------------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------•
    |                         ANY Unicode character                          |  325,590  |  (?s).                                    |  (?s).                                                                                                                                        |
    •========================================================================•===========•===========================================•===============================================================================================================================================•

Notes :

As you can see, the regex syntaxes are different according to the location of the Character class !
The Total column shows the number of characters, matched by the respective Character class, out of the 325,590 characters
To express a negative Character class, use the syntax :
- [:^class:] or [:^c:] when this POSIX class is located inside a classical [.....] Character class
- \P{class} or \P{c} or \Pc, when located outside a classical [.....] Character class
- \<Uppercase_letter>, whatever its location
If a POSIX class stands alone inside a Character class, you can use either the [^[:class:]] or the [[:^class:]] syntax

Between Character Classes, we have the following mathematical relations :

[[:alnum:]] = [[:alpha:]] + [[:digit:]]
[[:word:]] = [[:alnum:]] + \x5F ( _ char )
[[:graph:]] = [[:punct:]] - [ªº_¹²³µ] + [[:word:]]
[[:print:]] = [[:space:]] + [[:graph:]] + \x{FEFF} ( ZWNBSP = Zero Width No-Break Space )
[[:space:]] = [[:h:]] + [[:v:]]
[[:unicode:]] = <All> ( 325,590 ) - First 256 ( from \x00 to \xFF )
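These relations can be checked against the Total column of the table. The small Python sketch below only re-does the arithmetic on the published counts; it does not run any regex, and the dictionary names are mine :

```python
# Totals copied from the "Total" column of the table above
totals = {
    "alpha": 45_813, "digit": 201, "alnum": 46_014, "word": 46_015,
    "punct": 334, "graph": 46_342, "print": 46_368,
    "space": 25, "h": 18, "v": 7, "unicode": 325_334,
}
ALL_CHARS = 325_590     # characters stored in the test file
SHARED = 7              # the 7 chars ªº_¹²³µ, counted in both [:punct:] and [:word:]

relations = {
    "alnum = alpha + digit":           totals["alnum"] == totals["alpha"] + totals["digit"],
    "word = alnum + 1":                totals["word"] == totals["alnum"] + 1,
    "graph = punct - 7 + word":        totals["graph"] == totals["punct"] - SHARED + totals["word"],
    "print = space + graph + 1":       totals["print"] == totals["space"] + totals["graph"] + 1,
    "space = h + v":                   totals["space"] == totals["h"] + totals["v"],
    "unicode = all - 256":             totals["unicode"] == ALL_CHARS - 256,
}

for name, holds in relations.items():
    print(f"{name:28} {'OK' if holds else 'MISMATCH'}")
```

Every line should print OK, which means that the counts of the table are consistent with each other.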

See you in the next post !

guy038

Hi, @carypt, @benjamin-sasse, @peterjones, @alan-kilborn and All,

End of the discussion :

Now, our Boost regex engine correctly handles the Collating symbol syntax ( [.•••••.] )

Here is the table of all the available POSIX symbolic names, as described here

    •============================•============================•=============•=============•=============•
    |    POSIX symbolic name     |   ESCAPED symbolic name    |  Character  |  DEC value  |  HEX Value  |
    •============================•============================•=============•=============•=============•
    |  [.NUL.]                   |  \N{NUL}                   |     NUL     |     000     |     \x00    |
    |  [.SOH.]                   |  \N{SOH}                   |     SOH     |     001     |     \x01    |
    |  [.STX.]                   |  \N{STX}                   |     STX     |     002     |     \x02    |
    |  [.ETX.]                   |  \N{ETX}                   |     ETX     |     003     |     \x03    |
    |  [.EOT.]                   |  \N{EOT}                   |     EOT     |     004     |     \x04    |
    |  [.ENQ.]                   |  \N{ENQ}                   |     ENQ     |     005     |     \x05    |
    |  [.ACK.]                   |  \N{ACK}                   |     ACK     |     006     |     \x06    |
    |  [.alert.]                 |  \N{alert}                 |     BEL     |     007     |     \x07    |
    |  [.backspace.]             |  \N{backspace}             |     BS      |     008     |     \x08    |
    |  [.tab.]                   |  \N{tab}                   |     TAB     |     009     |     \x09    |
    |  [.newline.]               |  \N{newline}               |     LF      |     010     |     \x0A    |
    |  [.vertical-tab.]          |  \N{vertical-tab}          |     VT      |     011     |     \x0B    |
    |  [.form-feed.]             |  \N{form-feed}             |     FF      |     012     |     \x0C    |
    |  [.carriage-return.]       |  \N{carriage-return}       |     CR      |     013     |     \x0D    |
    |  [.SO.]                    |  \N{SO}                    |     SO      |     014     |     \x0E    |
    |  [.SI.]                    |  \N{SI}                    |     SI      |     015     |     \x0F    |
    |  [.DLE.]                   |  \N{DLE}                   |     DLE     |     016     |     \x10    |
    |  [.DC1.]                   |  \N{DC1}                   |     DC1     |     017     |     \x11    |
    |  [.DC2.]                   |  \N{DC2}                   |     DC2     |     018     |     \x12    |
    |  [.DC3.]                   |  \N{DC3}                   |     DC3     |     019     |     \x13    |
    |  [.DC4.]                   |  \N{DC4}                   |     DC4     |     020     |     \x14    |
    |  [.NAK.]                   |  \N{NAK}                   |     NAK     |     021     |     \x15    |
    |  [.SYN.]                   |  \N{SYN}                   |     SYN     |     022     |     \x16    |
    |  [.ETB.]                   |  \N{ETB}                   |     ETB     |     023     |     \x17    |
    |  [.CAN.]                   |  \N{CAN}                   |     CAN     |     024     |     \x18    |
    |  [.EM.]                    |  \N{EM}                    |     EM      |     025     |     \x19    |
    |  [.SUB.]                   |  \N{SUB}                   |     SUB     |     026     |     \x1A    |
    |  [.ESC.]                   |  \N{ESC}                   |     ESC     |     027     |     \x1B    |
    |  [.IS4.]                   |  \N{IS4}                   |     FS      |     028     |     \x1C    |
    |  [.IS3.]                   |  \N{IS3}                   |     GS      |     029     |     \x1D    |
    |  [.IS2.]                   |  \N{IS2}                   |     RS      |     030     |     \x1E    |
    |  [.IS1.]                   |  \N{IS1}                   |     US      |     031     |     \x1F    |
    |  [.space.]                 |  \N{space}                 |     SP      |     032     |     \x20    |
    |  [.exclamation-mark.]      |  \N{exclamation-mark}      |      !      |     033     |     \x21    |
    |  [.quotation-mark.]        |  \N{quotation-mark}        |      "      |     034     |     \x22    |
    |  [.number-sign.]           |  \N{number-sign}           |      #      |     035     |     \x23    |
    |  [.dollar-sign.]           |  \N{dollar-sign}           |      $      |     036     |     \x24    |
    |  [.percent-sign.]          |  \N{percent-sign}          |      %      |     037     |     \x25    |
    |  [.ampersand.]             |  \N{ampersand}             |      &      |     038     |     \x26    |
    |  [.apostrophe.]            |  \N{apostrophe}            |      '      |     039     |     \x27    |
    |  [.left-parenthesis.]      |  \N{left-parenthesis}      |      (      |     040     |     \x28    |
    |  [.right-parenthesis.]     |  \N{right-parenthesis}     |      )      |     041     |     \x29    |
    |  [.asterisk.]              |  \N{asterisk}              |      *      |     042     |     \x2A    |
    |  [.plus-sign.]             |  \N{plus-sign}             |      +      |     043     |     \x2B    |
    |  [.comma.]                 |  \N{comma}                 |      ,      |     044     |     \x2C    |
    |  [.hyphen.]                |  \N{hyphen}                |      -      |     045     |     \x2D    |
    |  [.period.]                |  \N{period}                |      .      |     046     |     \x2E    |
    |  [.slash.]                 |  \N{slash}                 |      /      |     047     |     \x2F    |
    |  [.zero.]                  |  \N{zero}                  |      0      |     048     |     \x30    |
    |  [.one.]                   |  \N{one}                   |      1      |     049     |     \x31    |
    |  [.two.]                   |  \N{two}                   |      2      |     050     |     \x32    |
    |  [.three.]                 |  \N{three}                 |      3      |     051     |     \x33    |
    |  [.four.]                  |  \N{four}                  |      4      |     052     |     \x34    |
    |  [.five.]                  |  \N{five}                  |      5      |     053     |     \x35    |
    |  [.six.]                   |  \N{six}                   |      6      |     054     |     \x36    |
    |  [.seven.]                 |  \N{seven}                 |      7      |     055     |     \x37    |
    |  [.eight.]                 |  \N{eight}                 |      8      |     056     |     \x38    |
    |  [.nine.]                  |  \N{nine}                  |      9      |     057     |     \x39    |
    |  [.colon.]                 |  \N{colon}                 |      :      |     058     |     \x3A    |
    |  [.semicolon.]             |  \N{semicolon}             |      ;      |     059     |     \x3B    |
    |  [.less-than-sign.]        |  \N{less-than-sign}        |      <      |     060     |     \x3C    |
    |  [.equals-sign.]           |  \N{equals-sign}           |      =      |     061     |     \x3D    |
    |  [.greater-than-sign.]     |  \N{greater-than-sign}     |      >      |     062     |     \x3E    |
    |  [.question-mark.]         |  \N{question-mark}         |      ?      |     063     |     \x3F    |
    |  [.commercial-at.]         |  \N{commercial-at}         |      @      |     064     |     \x40    |
    |  [.A.]                     |  \N{A}                     |      A      |     065     |     \x41    |
    | .......                    | ........                   |     ...     |    .....    |    ......   |
    |  [.Z.]                     |  \N{Z}                     |      Z      |     090     |     \x5A    |
    |  [.left-square-bracket.]   |  \N{left-square-bracket}   |      [      |     091     |     \x5B    |
    |  [.backslash.]             |  \N{backslash}             |      \      |     092     |     \x5C    |
    |  [.right-square-bracket.]  |  \N{right-square-bracket}  |      ]      |     093     |     \x5D    |
    |  [.circumflex.]            |  \N{circumflex}            |      ^      |     094     |     \x5E    |
    |  [.underscore.]            |  \N{underscore}            |      _      |     095     |     \x5F    |
    |  [.grave-accent.]          |  \N{grave-accent}          |      `      |     096     |     \x60    |
    |  [.a.]                     |  \N{a}                     |      a      |     097     |     \x61    |
    | .......                    | ........                   |     ...     |    .....    |    ......   |
    |  [.z.]                     |  \N{z}                     |      z      |     122     |     \x7A    |
    |  [.left-curly-bracket.]    |  \N{left-curly-bracket}    |      {      |     123     |     \x7B    |
    |  [.vertical-line.]         |  \N{vertical-line}         |      |      |     124     |     \x7C    |
    |  [.right-curly-bracket.]   |  \N{right-curly-bracket}   |      }      |     125     |     \x7D    |
    |  [.tilde.]                 |  \N{tilde}                 |      ~      |     126     |     \x7E    |
    |  [.DEL.]                   |  \N{DEL}                   |     DEL     |     127     |     \x7F    |
    •============================•============================•=============•=============•=============•

Notes :

The case of the symbolic name must be exactly respected !
The POSIX [.•••.] syntax must be used inside a classical Character class, only
The \N{...} syntax can be used whatever its location
A POSIX symbolic name can also be the character itself !

Examples :

[[.IS2.]] represents the RECORD SEPARATOR ( RS ) character, of code \x1E
\N{plus-sign} is the + sign, of code \x2B
[\N{number-sign}[.six.][.].]] represents the # sign or the 6 digit or the closing bracket ]
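For regex engines which do not know these symbolic names, the \N{...} escapes can be translated into plain \xHH escapes before the search. Here is a small Python sketch of that idea. The POSIX_NAMES dictionary is only a subset of the table above, and the expand_names function is hypothetical : it is not part of N++, Boost or the Python re module.

```python
import re

# Subset of the table above : POSIX symbolic name -> code of the character
POSIX_NAMES = {
    "IS2": 0x1E,
    "space": 0x20,
    "number-sign": 0x23,
    "plus-sign": 0x2B,
    "six": 0x36,
    "right-square-bracket": 0x5D,
}

_NAME_ESCAPE = re.compile(r"\\N\{([^}]+)\}")


def expand_names(pattern):
    """Replace each \\N{name} escape with the equivalent \\xHH escape.

    The lookup is case-sensitive, as in the notes above; an unknown
    name raises a KeyError.
    """
    def replace(match):
        return "\\x{:02X}".format(POSIX_NAMES[match.group(1)])
    return _NAME_ESCAPE.sub(replace, pattern)


print(expand_names(r"\N{plus-sign}"))                       # \x2B
regex = expand_names(r"[\N{number-sign}\N{six}\N{right-square-bracket}]")
print(regex)                                                # [\x23\x36\x5D]
print(re.findall(regex, "a#b6c]d"))                         # ['#', '6', ']']
```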

As you can see, @carypt, the list above follows exactly the Portable Character Set standard, as described in these articles :

https://en.wikipedia.org/wiki/Portable_character_set

https://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap06.html

Our Boost regex engine also knows some digraphs, when used as a collating name :

    •----------•----------•------------------------------------------------•
    |  Regex   |  Digraph |                     Origin                     |
    •----------•----------•------------------------------------------------•
    |  [.AE.]  |    AE    |                                                |
    |  [.Ae.]  |    Ae    |  Latin ligature                                |
    |  [.ae.]  |    ae    |                                                |
    •----------•----------•------------------------------------------------•
    |  [.CH.]  |    CH    |                                                |
    |  [.Ch.]  |    Ch    |  Spanish                                       |
    |  [.ch.]  |    ch    |                                                |
    •----------•----------•------------------------------------------------•
    |  [.DZ.]  |    DZ    |                                                |
    |  [.Dz.]  |    Dz    |  Hungarian - Polish - Slovak - Serbo-Croatian  |
    |  [.dz.]  |    dz    |                                                | 
    •----------•----------•------------------------------------------------•
    |  [.LJ.]  |    LJ    |                                                |
    |  [.Lj.]  |    Lj    |  Serbo-Croatian                                |
    |  [.lj.]  |    lj    |                                                |
    •----------•----------•------------------------------------------------•
    |  [.LL.]  |    LL    |                                                |
    |  [.Ll.]  |    Ll    |  Spanish                                       |
    |  [.ll.]  |    ll    |                                                |
    •----------•----------•------------------------------------------------•
    |  [.NJ.]  |    NJ    |                                                |
    |  [.Nj.]  |    Nj    |  Serbo-Croatian                                |
    |  [.nj.]  |    nj    |                                                |
    •----------•----------•------------------------------------------------•
    |  [.SS.]  |    SS    |                                                |
    |  [.Ss.]  |    Ss    |  German                                        |
    |  [.ss.]  |    ss    |                                                |
    •----------•----------•------------------------------------------------•

Refer here and here for further information !

Example :

The regex (?-i)[[.Dz.]-[.Lj.]] matches the digraph Dz ( but not D ), or one of the uppercase letters [EFGHIJKL] or the digraph Lj. Test this regex against this text :

C  c  D  d  DZ  Dz  dz  E  e  F  f  G  g  H  h  I  i  J  j  K  k  L  l LJ  Lj  lj LL  Ll  ll  M  m  N  n 
                --      -     -     -     -     -     -     -     -    ••  --     ••  •

Note that, if the N++ Boost library had been built with full Unicode support, all the Unicode names would have been recognized ! For example, in this page :

https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/syntax/collating_names/named_unicode.html

Instead of using the classical syntax \x{0418} to match the Cyrillic capital letter I, we could use a Unicode symbolic name, with the collating name [[.CYRILLIC CAPITAL LETTER I.]] , which matches the Cyrillic letter И

Finally, we must speak of an interesting feature, named Equivalence class :

https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.equivalence_classes

An equivalence class matches all the equivalent characters of a specific Unicode character, whatever the case, the accentuation, the size and other specificities of these characters

Its syntax is [=char=], where char represents a single character, and it must be inserted in a classical Character class

For instance :

The regex [[=A=]] matches one <A> character of the range : [AaªÀÁÂÃÄÅàáâãäåĀāĂăĄąǍǎǞǟǠǡǺǻȀȁȂȃɐɑɒḀḁẚẠạẢảẤấẦầẨẩẪẫẬậẮắẰằẲẳẴẵẶặÅ⒜ⒶⓐＡａ]
The regex [[=1=]] is equivalent to the regex [1¹₁⅟①⑴⒈❶➀➊１]
The [[===]] finds any single character of the range [=⁼₌⊜＝]
The [[=plus-sign=]] regex matches one character from [⁺₊⊕⊞＋]
The [[=Ae=]] syntax finds any one-char from the range [ÆæǢǣǼǽ]

Notes :

The char, between the two = signs, may also be a digraph or a symbolic name
Any single character of the range may be used. For instance, the regexes [[=A=]] , [[=⒜=]], and [[=Ȁ=]] are equivalent and would match the same characters

To this purpose, have a look to the collation charts here
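
To get a rough feel for what an equivalence class groups together, here is a small Python sketch. It is only an approximation, under the assumption that compatibility decomposition plus case folding behaves roughly like a primary collation key. The Boost engine relies on the platform's collation data instead, so the two sets will not coincide exactly ( for instance, ɐ and ⒜ belong to the [[=A=]] list above but are not folded to a by this sketch ). The names primary_key and same_class are made up for the illustration.

```python
import unicodedata

def primary_key(ch: str) -> str:
    """Approximate a primary collation key: decompose, drop marks, fold case."""
    decomposed = unicodedata.normalize("NFKD", ch)
    # Remove the combining marks (accents, rings, dots) left by the decomposition.
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.casefold()

def same_class(a: str, b: str) -> bool:
    """True when both characters fold to the same approximate key."""
    return primary_key(a) == primary_key(b)

# A few members of the [[=A=]] list, compared with the plain letter A.
for ch in "AaªÀåǞẬⒶＡ":
    print(ch, same_class(ch, "A"))
```

Every character of that sample folds to a, which is why the loop reports True for each of them.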

I hope, @carypt, that you have found some interesting things for your daily work !

Best Regards,

guy038

Alan Kilborn

@guy038 said:

the right regex, in order to match any character with code-point over \x{00FF}, is, rather : (?![\x00-\xFF]).[\x{D800}-\x{DFFF}]?

I’m looking for a regex to match “any single UTF-8 character”, and I found this topic thread.

At first I obviously tried . which clearly did NOT work. Why should it be that easy? :-)

BEFORE I found this thread, I was trying [^\x00-\xFF]|[\x00-\xFF] which seems to work but I’m guessing doesn’t always due to @guy038’s more complicated regex above. In other words, I’m sure the @guy038 regex is not shorter because it needs to be longer. :-)
Plus…this one maybe just seems “odd”. :-P

So then I modified @guy038’s regex to be (?![\x00-\xFF]).[\x{D800}-\x{DFFF}]?|[\x00-\xFF] and, again, that seems to work but I wonder if it really does?

Maybe @guy038 has some comments and advice…

guy038

Hello, @alan-kilborn and All,

You said :

I’m looking for a regex to match “any single UTF-8 character” and I found this topic thread.

Well, personally, for such a goal, I would simply use the regex (?s). and run it against my Total_Chars.txt file, which I spoke of in the second part of this post, above :

https://community.notepad-plus-plus.org/post/66322

It does find 325,590 characters, which is the total number of chars for this UTF-8-BOM file of size 1,236,733 bytes ( EF + BB + BF + LF + CR + 126 × 1 byte + 1,920 × 2 bytes + 61,406 × 3 bytes + 262,136 × 4 bytes )

Now, Alan, I have edited my second post, as I was wrong in many ways !

IMPORTANT : for testing the following regexes, you must check the Match case option

The regex [[:unicode:]] does find all unicode characters over \x{00FF}, as well as the regex [^\x00-\xFF]. So 325,334 chars in my Total_Chars.txt file
Thus the regex [^[:unicode:]] ( or [[:^unicode:]] ) is identical to the regex [\x{0000}-\x{00FF}] or simply [\x00-\xFF] which finds 256 chars
Finally, my complicated regex (?![\x00-\xFF]).[\x{D800}-\x{DFFF}]? finds 325,332 characters and is equivalent to the regex (?!\x{2028})(?!\x{2029})[[:unicode:]] ( Note : \x{2028} is the LS char and \x{2029} is the PS char ). I don’t know why there is this tiny difference of two characters.

BTW, Alan, the following regex would also grab all characters of a UTF-8[BOM] file :

[\x00-\xFF]|[[:unicode:]]

Like the (?s). regex, it would match the 325,590 characters of my Total_Chars.txt file !

Best Regards,

guy038

mkupper

@guy038 said in Search for character classes but not replace them:

( Note : \x{2028} is the LS char and \x{2029} is the PS char ). Don’t know why the tiny difference of two characters ?

LS and PS are among the characters classified as “end of line” characters. LS and PS will get matched by things such as \R and \v. If you don’t have dot matches newline enabled then dot will not match either LS or PS. Searching for [[:unicode:]] will match both LS and PS but a search for . does not match either of those. All of the other characters matched by \R and \v have character values less than \xFF. The LS and PS characters are the exception.

I don’t know if that detail explains why @guy038 needed to special case them.

Alan Kilborn

@guy038 said:

for such a goal, I would simply use this regex (?s).

Now, you don’t think I’d bother you, or revive this old thread, if I were finding things that simple, do you? I can easily show that that doesn’t work, on just a small bit of text:

💙☀🡢⮃🠧🠉…👍👌👎

I see and count 10 characters there.

If I do a Find All in Current Document, it yields 11 hits, but I only see 3 characters highlighted as matches:

Worse, if I put my caret at the start of line 1 and repeatedly press Find Next, I have to press it 18 times before it runs out of matches (Wrap around not enabled) – many of these matches are “zero-length”, not one character at a time.

I have yet to try some of the other suggestions…but I will.

guy038

Hello, @alan-kilborn, @mkupper and All,

@mkupper, a BIG thanks to you : your assumption was exact !

Indeed, my complicated regex (?![\x00-\xFF]).[\x{D800}-\x{DFFF}]? must be rewritten as (?s-i)(?![\x00-\xFF]).[\x{D800}-\x{DFFF}]?

And, against my Total_Chars.txt file, this new formulation does give the same number of chars ( 325,334 ) as the [[:unicode:]] or the [^\x00-\xFF] regexes !

BTW, the nice thing about my Total_Chars.txt is that it does not care whether the Unicode code-point is assigned or unassigned to a character !

Probably, depending on your current font, a lot of glyphs will not be rendered correctly, but we don’t care about that. We just want to be able to search for any character from its code-point \x{####} if inside the BMP, or from its surrogate pair \x{D###}\x{D###} if outside the BMP

Presently, it just lists, one character after another, all valid characters from U+0000 to U+EFFFD, described below ( as long as the Unicode Consortium does not decide to use the planes 4 to 13 )

    •--------------------•-------------------•------------•---------------------------•----------------•-------------------•
    |       Range        |    Description    |   Status   |      Number of Chars      | UTF-8 Encoding |  Number of Bytes  |
    •--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
    |    0000  -   007F  |  PLANE 0 - BMP    |  Included  |             |        128  |    1 Byte      |              128  |
    •--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
    |    0080  -   07FF  |  PLANE 0 - BMP    |  Included  |             |    + 1,920  |    2 Bytes     |            3,840  |
    •--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
    |    0800  -   D7FF  |  PLANE 0 - BMP    |  Included  |             |   + 53,248  |                |          159,744  |
    |                    |                   |            |             |             |                |                   |
    |    D800  -   DFFF  |  SURROGATES zone  |  EXCLUDED  |    - 2,048  |             |                |                   |
    |                    |                   |            |             |             |                |                   |
    |    E000  -   F8FF  |  PLANE 0 - PUA    |  Included  |             |    + 6,400  |                |           19,200  |
    |                    |                   |            |             |             |                |                   |
    |    F900  -   FDCF  |  PLANE 0 - BMP    |  Included  |             |    + 1,232  |    3 Bytes     |            3,696  |
    |                    |                   |            |             |             |                |                   |
    |    FDD0  -   FDEF  |  NON-characters   |  EXCLUDED  |       - 32  |             |                |                   |
    |                    |                   |            |             |             |                |                   |
    |    FDF0  -   FFFD  |  PLANE 0 - BMP    |  Included  |             |      + 526  |                |            1,578  |
    |                    |                   |            |             |             |                |                   |
    |    FFFE  -   FFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |
    •--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
    |                       Plane 0 - BMP    | SUB-Totals |    - 2,082  |   + 63,454  |                |          188,186  |
    •--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
    |   10000  -  1FFFD  |  PLANE 1 - SMP    |  Included  |             |   + 65,534  |                |          262,136  |
    |                    |                   |            |             |             |                |                   |
    |   1FFFE  -  1FFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |
    •--------------------•-------------------•------------•-------------•-------------•                •-------------------•
    |   20000  -  2FFFD  |  PLANE 2 - SIP    |  Included  |             |   + 65,534  |                |          262,136  |
    |                    |                   |            |             |             |    4 Bytes     |                   |
    |   2FFFE  -  2FFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |
    •--------------------•-------------------•------------•-------------•-------------•                •-------------------•
    |   30000  -  3FFFD  |  PLANE 3 - TIP    |  Included  |             |   + 65,534  |                |          262,136  |
    |                    |                   |            |             |             |                |                   |
    |   3FFFE  -  3FFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |
    •--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
    |   40000  -  DFFFF  |  PLANES 4 to 13   |  NOT USED  |  - 655,360  |             |    4 Bytes     |                   |
    •--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
    |   E0000  -  EFFFD  |  PLANE 14 - SPP   |  Included  |             |   + 65,534  |                |          262,136  |
    |                    |                   |            |             |             |                |                   |
    |   EFFFE  -  EFFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |
    •--------------------•-------------------•------------•-------------•-------------•                •-------------------•
    |   F0000  -  FFFFD  |  PLANE 15 - SPUA  |  NOT USED  |   - 65,534  |             |                |                   |
    |                    |                   |            |             |             |                |                   |
    |   FFFFE  -  FFFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |
    •--------------------•-------------------•------------•-------------•-------------•    4 Bytes     •-------------------•
    |  100000  - 10FFFD  |  PLANE 16 - SPUA  |  NOT USED  |   - 65,534  |             |                |                   |
    |                    |                   |            |             |             |                |                   |
    |  10FFFE  - 10FFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |
    •--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
    |                                       GRAND Totals  |  - 788,522  |  + 325,590  |                |        1,236,730  |
    |                                                     |             |             |                |                   |
    |                              Byte Order Mark - BOM  |             |             |                |                3  |
    •-----------------------------------------------------•-------------•-------------•                •-------------------•
    |                                                     |  1,114,112 Unicode chars  |                |  Size  1,236,733  |
    •-----------------------------------------------------•---------------------------•----------------•-------------------•

Of course, due to the line-breaks, produced by the LF and CR characters, this file contains three physical lines :

A first line from \x00 to \x0A, so 11 chars
A second line from \x0B to \x0D, so 3 chars
A third long line from \x0E to \x{EFFFD}, so 325,576 chars

If anyone is interested in this file, I could send it by e-mail. Just tell me ! But I suppose that it could be easily generated with a Python script.

Simply list, in a UTF-8-BOM file, all ranges of characters defined as Included in the Status column of the above table !

You should get a file containing 325,590 characters for an exact size of 1,236,733 bytes
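
As a minimal sketch of such a script, the Python lines below build the text in memory from the Included ranges and check the two totals. The range list is my reading of the table ( the BMP range starting at F900 is taken to stop at FDCF, right before the FDD0 - FDEF non-characters ), and the file name in the commented lines is only a suggestion.

```python
# Inclusive code-point ranges marked "Included" in the table above.
# Assumption: the BMP range starting at F900 stops at FDCF, just
# before the non-character block FDD0-FDEF.
INCLUDED = [
    (0x0000, 0xD7FF),
    (0xE000, 0xFDCF),
    (0xFDF0, 0xFFFD),
    (0x10000, 0x1FFFD),
    (0x20000, 0x2FFFD),
    (0x30000, 0x3FFFD),
    (0xE0000, 0xEFFFD),
]

def build_text(ranges=INCLUDED) -> str:
    """Concatenate every code point of the given ranges, in order."""
    return "".join(chr(cp) for lo, hi in ranges for cp in range(lo, hi + 1))

text = build_text()
data = text.encode("utf-8-sig")          # "utf-8-sig" prepends the 3-byte BOM

print(len(text))                         # number of characters
print(len(data))                         # size in bytes, BOM included
print(sum(ch > "\xff" for ch in text))   # characters above U+00FF

# To write the file, use binary mode so that CR and LF are left untouched:
# with open("Total_Chars.txt", "wb") as f:
#     f.write(data)
```

The three printed values should be 325590, 1236733 and 325334, in agreement with the totals of the table and with the count of the [[:unicode:]] regex.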

Now, if you decide to include all the NOT USED areas, too, you’ll get a Total_UNICODE_Chars.txt file, of 1,111,998 chars for a size of 4,382,365 bytes ( 188,186 bytes for the BMP + 16 × 262,136 bytes for the planes 1 to 16 + 3 bytes for the BOM ) which would be exact for… eternity ;-))

Alan, I’ve just seen your last post ! Give me some time to study your example and I’ll answer you very soon !

Best Regards,

guy038

P.S. : I created a macro which changes any selected regex syntax \x{#####} into its corresponding surrogate pair \x{D###}\x{D###}:

        <Macro name="Surrogates Pairs in Selection" Ctrl="no" Alt="no" Shift="no" Key="0">
            <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
            <Action type="3" message="1601" wParam="0" lParam="0" sParam="(?-i)\\x\{(10|[[:xdigit:]])[[:xdigit:]]{4}" />
            <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
            <Action type="3" message="1602" wParam="0" lParam="0" sParam="$0\x1F" />
            <Action type="3" message="1702" wParam="0" lParam="640" sParam="" />
            <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
            <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
            <Action type="3" message="1601" wParam="0" lParam="0" sParam="(?i)(?:(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(A)|(B)|(C)|(D)|(E)|(F)|(10))(?=[[:xdigit:]]{4}\x1F\})|(?:(0)|(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(A)|(B)|(C)|(D)|(E)|(F))(?=[[:xdigit:]]{0,3}\x1F\})" />
            <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
            <Action type="3" message="1602" wParam="0" lParam="0" sParam="(?{1}0000)(?{2}0001)(?{3}0010)(?{4}0011)(?{5}0100)(?{6}0101)(?{7}0110)(?{8}0111)(?{9}1000)(?{10}1001)(?{11}1010)(?{12}1011)(?{13}1100)(?{14}1101)(?{15}1110)(?{16}1111)(?{17}0000)(?{18}0001)(?{19}0010)(?{20}0011)(?{21}0100)(?{22}0101)(?{23}0110)(?{24}0111)(?{25}1000)(?{26}1001)(?{27}1010)(?{28}1011)(?{29}1100)(?{30}1101)(?{31}1110)(?{32}1111)" />
            <Action type="3" message="1702" wParam="0" lParam="640" sParam="" />
            <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
            <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
            <Action type="3" message="1601" wParam="0" lParam="0" sParam="([01]{10})([01]{10})(?=\x1F)" />
            <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
            <Action type="3" message="1602" wParam="0" lParam="0" sParam="110110\1\x1F}\\x{110111\2" />
            <Action type="3" message="1702" wParam="0" lParam="640" sParam="" />
            <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
            <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
            <Action type="3" message="1601" wParam="0" lParam="0" sParam="(?:(0000)|(0001)|(0010)|(0011)|(0100)|(0101)|(0110)|(0111)|(1000)|(1001)|(1010)|(1011)|(1100)|(1101)|(1110)|(1111))(?=[[:xdigit:]]*\x1F\})|\x1F" />
            <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
            <Action type="3" message="1602" wParam="0" lParam="0" sParam="(?{1}0)(?{2}1)(?{3}2)(?{4}3)(?{5}4)(?{6}5)(?{7}6)(?{8}7)(?{9}8)(?{10}9)(?11A)(?12B)(?13C)(?14D)(?15E)(?16F)" />
            <Action type="3" message="1702" wParam="0" lParam="640" sParam="" />
            <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
        </Macro>

For instance, the selected regex \x{10000}\x72\x{27}\x0\x{EFFFD}

Is changed, with this macro, into \x{D800}\x{DC00}\x72\x{27}\x0\x{DB7F}\x{DFFD} which correctly matches the 𐀀R’ 󯿽 string !
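
For readers who prefer a script to a macro, the same conversion can be sketched in Python. This is not a translation of the macro steps, just the standard UTF-16 surrogate arithmetic applied to every \x{#####} escape of 5 or 6 hex digits. The function names are hypothetical, and a value outside the 10000 - 10FFFF range raises an error instead of being left alone.

```python
import re

def surrogate_pair(cp: int) -> tuple[int, int]:
    """Return the UTF-16 (high, low) surrogates of a code point above U+FFFF."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError(f"U+{cp:04X} is not outside the BMP")
    offset = cp - 0x10000                 # 20 significant bits
    high = 0xD800 | (offset >> 10)        # 110110 + upper 10 bits
    low = 0xDC00 | (offset & 0x3FF)       # 110111 + lower 10 bits
    return high, low

# A \x{...} escape with 5 or 6 hex digits, i.e. a code point over the BMP.
ESCAPE = re.compile(r"\\x\{([0-9A-Fa-f]{5,6})\}")

def to_surrogates(regex: str) -> str:
    """Rewrite each 5 or 6 digit escape of a regex as a surrogate pair."""
    def repl(m: re.Match) -> str:
        high, low = surrogate_pair(int(m.group(1), 16))
        return f"\\x{{{high:04X}}}\\x{{{low:04X}}}"
    return ESCAPE.sub(repl, regex)

print(to_surrogates(r"\x{10000}\x72\x{27}\x0\x{EFFFD}"))
```

Run on the example above, it prints \x{D800}\x{DC00}\x72\x{27}\x0\x{DB7F}\x{DFFD}, the same result as the macro. The short escapes \x72, \x{27} and \x0 are left untouched.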

Alan Kilborn

@guy038 said:

P.S. : I created a macro which changes any selected regex syntax \x{#####} into its corresponding surrogate pair \x{D###}\x{D###}

This macro is also HERE.
Perhaps it would have been better to link to it rather than reposting?

guy038

Hi, @alan-kilborn, @mkupper and All,

Ah…, indeed, the (?s). regex seems to give incoherent results and the total number of hits is erroneous, too :-(( However, note that the Count operation remains correct !

But, luckily, the [[:unicode:]] regex does work nicely !

Thus, I extended your example to three other characters which lie in the [\x00-\xFF] range, so this string : Aé💙☀🡢⮃🠧🠉…👍👌👎. And, if we use the [\x00-\xFF]|[[:unicode:]] regex, it correctly matches 13 characters, as shown in the snapshot below :

Regarding my macro, I’m going to ask Don Ho to add a C++ equivalent ! A nice improvement would be to analyse the Search and Replace fields and modify all the \x{#####} regex syntaxes with their surrogate equivalents \x{D###}\x{D###} for correct searches and replacements in all circumstances. What’s your feeling about it ?

BR

guy038

Alan Kilborn

@guy038 said:

I’m going to ask Don Ho to add a C++ equivalent ! A nice improvement would be to analyse the Search and Replace fields and modify all the \x{#####} regex syntaxes with their surrogate equivalents \x{D###}\x{D###} for correct searches and replacements in all circumstances. What’s your feeling about it ?

It doesn’t sound like something Don Ho would be interested in, but…go for it.

mkupper

@guy038 said in Search for character classes but not replace them:

If anyone is interested in this file, I could send it by e-mail. Just tell me !

Is this the same Total_Chars.txt that you uploaded to a Google drive as part of this forum post?

mkupper

@Alan-Kilborn said in Search for character classes but not replace them:

If I do a Find All in Current Document, it yields 11 hits, but I only see 3 characters highlighted as matches:

The three you see are “☀⮃…” which can be searched using \x{2600}, \x{2B83}, and \x{2026}. All three are Basic Multilingual Plane (BMP) characters.

The other 7 characters, or “💙🡢🠧🠉👍👌👎”, are all extended Unicode. Here’s how to search for them using surrogate pairs:

💙 U+1F499 \x{D83D}\x{DC99}
🡢 U+1F862 \x{D83E}\x{DC62}
🠧 U+1F827 \x{D83E}\x{DC27}
🠉 U+1F809 \x{D83E}\x{DC09}
👍 U+1F44D \x{D83D}\x{DC4D}
👌 U+1F44C \x{D83D}\x{DC4C}
👎 U+1F44E \x{D83D}\x{DC4E}

While Notepad++ and Scintilla seem to store text as UTF-8 the search function has the appearance of converting what we search for into UTF-16 strings and seems to convert the text from UTF-8 into UTF-16 on the fly when searching it. This seems like a lot of overhead. I have never dug hard into what happens under the hood. My guess is that the search computes the surrogate pairs and then extracts the lower 10 bits from each word and spreads the 20 bits out into where they would appear in UTF-8 encoded data. I think that would work and be fast for scanning UTF-8 encoded data.
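
To make that guess concrete, here is a hedged Python sketch of how a surrogate pair maps onto the 4-byte UTF-8 form. It only illustrates the standard bit layout and says nothing about what the Notepad++ search really does. One detail: the 20 bits cannot be copied across unchanged, because UTF-16 stores the code point minus 0x10000, so the offset has to be added back first. The function name pair_to_utf8 is made up.

```python
def pair_to_utf8(high: int, low: int) -> bytes:
    """Pack a UTF-16 surrogate pair into the 4-byte UTF-8 form."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    # 10 payload bits from each word, plus the 0x10000 offset.
    cp = 0x10000 + (((high & 0x3FF) << 10) | (low & 0x3FF))
    return bytes([
        0xF0 | (cp >> 18),           # 11110xxx
        0x80 | ((cp >> 12) & 0x3F),  # 10xxxxxx
        0x80 | ((cp >> 6) & 0x3F),   # 10xxxxxx
        0x80 | (cp & 0x3F),          # 10xxxxxx
    ])

encoded = pair_to_utf8(0xD83D, 0xDC99)           # the pair given above for U+1F499
print(encoded.hex(" ").upper())
print(encoded == "\U0001F499".encode("utf-8"))   # compare with Python's own encoder
```

For this pair the bytes are F0 9F 92 99, identical to what Python's UTF-8 encoder produces for U+1F499.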

Coises

@guy038 said in Search for character classes but not replace them:

correct searches and replacements in all circumstances. What’s your feeling about it ?

I can add this note, for what (if anything) it’s worth.

At some point in the development of Columns++, I realized that to get around some limitations in the Scintilla search interface I’d need to use Boost Regex directly. I really wanted, as part of that, to handle Unicode properly, as Unicode characters instead of as UTF-16 code units. Boost Regex includes support for Unicode, but to do that it depends on ICU.

I could not figure out how to include the necessary dependencies (whatever they are) from ICU as part of a DLL compilation. All instructions discussed installing it at the operating system level. I didn’t want to tell users they had to install something separate system-wide. I gave up on that approach.

So then I thought I could at least write a proper iterator for UTF-32 instead of wchar_t. And ran into character traits. I thought seriously of trying to leverage the traits for wchar_t and “guess” at what to do outside the BMP. (Looking into this made it clear why Boost relies on ICU instead of doing it themselves.) I eventually gave up and implemented UTF-16/wchar_t, essentially what Notepad++ does. It works reasonably well with Windows (which is also UTF-16 as wchar_t) when searching for specific character sequences and/or working with characters in the BMP.

Full and proper Unicode support, as best I can figure out, involves a large amount of detail, which is continuously being updated. (For those who don’t know: not every Unicode character is a single Unicode code point. And unlike the UTF-8/16/32 relationship, there’s no fixed algorithm to tell you which code points combine with others. Then there’s knowing what’s a capital letter, what’s a lower case letter, which letters are equal when case ignored… none of it follows a formula.) If there’s a more compact, contained implementation than ICU, that would be great, but I couldn’t find one. (The C++ standards committee has punted and deprecated the little bit of Unicode support C++ ever had. There are types defined, but nothing that does anything useful with them.)

I did, however, discover after reading this thread that my search doesn’t handle [[:unicode:]] the way Notepad++ does. There must be something clever hidden in the Notepad++ implementation that I missed which lets it “understand” characters outside the basic multilingual plane.

Alan Kilborn

@mkupper said:

which can be searched using…

Here’s how to search for them using surrogate pairs

Clearly you see why this isn’t a good answer to the original query?
I don’t want to search specifically, I want to search generically.

I started with (?s). as the simplest thing from this thread, as it was stated earlier that it “works”.
I showed (using some specific characters) that this generic search didn’t work.

Sure, I can try [[:unicode:]] for what I’m trying to do, and see what else – problemwise – I run into.

mkupper

@Alan-Kilborn said in Search for character classes but not replace them:

Sure, I can try [[:unicode:]] for what I’m trying to do, and see what else – problemwise – I run into.

I did an experiment with searching for [[:unicode:]] on @guy038’s Total_Chars.txt file and learned the following:

It does not match \x{0000} to \x{00FF}
It matches \x{0100} to \x{0177}
It does not match Ÿ which is \x{0178}
It matches \x{0179} to \x{FFFF}

Starting at U+10000 it gets weird. I made a UTF-8 encoded test file that has 78343 lines where each line starts with a Unicode character starting at U+10000 and running up to U+10FFFF. Each character is followed by a tab and then notes about the character. For example line 15125 has:

🌵		U+1F335		\x{D83C}\x{DF35}	\xF0\x9F\x8C\xB5

It lets me know the Unicode code point, the surrogate pairs, and the UTF-8 encoding for that character.

A count for [[:unicode:]] says 78343 which is the number of lines.
A search for ^[[:unicode:]] or \R[[:unicode:]] gets zero hits.
A search for [[:unicode:]]\t gets 78343 hits.

It seems that [[:unicode:]] is matching the second word of the surrogate pair but not the first. The first word of the pairs ranges from \x{D800} to \x{DBFF} while the second word is always in the range \x{DC00} to \x{DFFF}. The weird thing is that [[:unicode:]] matches orphan words in the range \x{D800} to \x{DBFF} and also matches orphans in the range \x{DC00} to \x{DFFF}. It’s possible that Notepad++ does something special with those orphans, as you are not supposed to have them as orphans; plus, there are intentional gaps in the UTF-8 encoding system so they can’t be encoded as UTF-8 … if you follow the rules.

guy038

Hello, @alan-kilborn, @mkupper, @coises and All,

First, @mkupper, you made the same mistake that I did when we spoke about the LS and PS characters and for which you had given me the solution !

Indeed, the regex (?i)[[:unicode:]] does not match the \x{0178} character
Luckily, the regexes (?-i)[[:unicode:]], even (?s-i)[[:unicode:]], do match the \x{0178} character as well as any character over \x{00FF}

Oh…, My God : regarding the Total_Chars.txt file, I’m really embarrassed because I’d completely forgotten that this file was accessible, among some others, on my Google Drive account ! So, for people interested, simply click on the link below :

https://drive.google.com/file/d/1kYtbIGPRLdypY7hNMI-vAJXoE7ilRMOC/view?usp=sharing

As a precaution, once the Total_Chars.txt file is loaded in Notepad++, you can right-click on its tab and choose the Read-Only option

Thank you, @mkupper, for refreshing my memory ;-))

Best Regards

guy038

Coises

@mkupper said in Search for character classes but not replace them:

It’s possible that Notepad++ does something special with those orphans, as you are not supposed to have them as orphans; plus, there are intentional gaps in the UTF-8 encoding system so they can’t be encoded as UTF-8 … if you follow the rules.

I’ve been making attempts to follow this under debug in Visual Studio, but so far… I’m lost in the murky depths of Boost regex.

The iterator for UTF-8 documents is implemented in these files:

UTF8DocumentIterator.h
UTF8DocumentIterator.cxx

and you can see here how UTF-8 sequences are mapped to wchar_t/UTF-16.

But why . matches one of a surrogate pair but [[:unicode:]] matches both escapes me. (In my search in Columns++ both only match a single wchar_t. I don’t use the same iterator code, but I don’t know what I do that would produce different results, other than handling invalid UTF-8 differently.)

To make sense of invalid UTF-16, we’d have to look at the process by which Notepad++ loads UTF-16 and transforms it into UTF-8. I think there is some method of encoding wchar_t sequences that don’t represent valid UTF-16 as invalid, but still round-trip-able, UTF-8.

If you uncover a clue, I would welcome one.

mkupper

@guy038 said in Search for character classes but not replace them:

(?-i)[[:unicode:]]

Thank you for doing that test as I was thinking about doing something similar. I had seen that Ÿ - \x{0178} was the upper case form of ÿ - \x{00FF} and wondered if the failure to match was a one-off edge error. The failure to match still seems like a bug to me unless the rule for (?i)[[:unicode:]] is that it only matches if both the upper and lower case forms of a letter have a character code of \x{0100} or higher. FWIW, Notepad++'s convert case functions work on ÿŸ.

I did a search for other letters where one case of the letter was in \x{0000} to \x{00FF} and the other was \x{0100} or higher, and found

ß	\x{00DF}	\xC3\x9F		LATIN SMALL   LETTER SHARP S
ẞ	\x{1E9E}	\xE1\xBA\x9E	LATIN CAPITAL LETTER SHARP S

(?i)[[:unicode:]] matches ẞ (U+1E9E) as expected. However, I also see that Notepad++'s case conversion functions fail to convert that letter to its upper or lower case version. A search using (?i)ß or (?i)ẞ also fails to match both cases of that letter. According to U+00DF and U+1E9E on fileformat.info that pair should be case-convertible.

mkupper

@Coises, the UTF8DocumentIterator code seems straightforward and does a more or less mindless conversion. It barely cares about invalid codes, etc. The logic silently allows overlong encodings where, for example, a 3-byte UTF-8 sequence is used to encode a value from 0x00 to 0x7F, which is normally a 1-byte sequence, or 0x0080 to 0x07FF, which is normally a 2-byte sequence.

The logic also silently allows 4-byte UTF-8 sequences that encode 0x110000 to 0x1FFFFF which is beyond the range assigned to Unicode. It will attempt to convert those values into surrogate pairs. The first word of the pair will overflow the 0xD800 to 0xDBFF range assigned to the first word. The second word is ok and will be a value in the range 0xDC00 to 0xDFFF which is correct for the second word of the pair. I’d have to trace a bit more carefully but the code also seems to silently allow for 5 and 6 byte long encodings that either contain underlong values or will overflow the first word of the surrogate pairs. Overall, it’s not a huge issue that results in garbage in, garbage out, but it should not crash the editor unless something is unhappy about orphan parts of surrogate pairs.
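
Here is a small Python sketch of what such “mindless” conversion looks like. To be clear, this is a hypothetical decoder written for the illustration, not the UTF8DocumentIterator code: it only masks and shifts bits, handles sequences of up to 4 bytes, and never checks for overlong forms or for values above 0x10FFFF. Python's own strict decoder is used at the end as a point of comparison.

```python
def naive_decode(seq: bytes) -> int:
    """Decode one UTF-8 sequence by bit masking only, with no validity checks."""
    lead = seq[0]
    if lead < 0x80:
        return lead
    if lead >> 5 == 0b110:
        n, cp = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:
        n, cp = 3, lead & 0x0F
    else:
        n, cp = 4, lead & 0x07
    for b in seq[1:n]:
        cp = (cp << 6) | (b & 0x3F)      # append 6 payload bits per byte
    return cp

def naive_pair(cp: int) -> tuple[int, int]:
    """Surrogate arithmetic applied without any range check."""
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

overlong = bytes([0xE0, 0x81, 0x81])         # 3 bytes carrying U+0041
beyond = bytes([0xF4, 0x90, 0x80, 0x80])     # 4 bytes carrying 0x110000

print(hex(naive_decode(overlong)))
print([hex(w) for w in naive_pair(naive_decode(beyond))])

for raw in (overlong, beyond):
    try:
        raw.decode("utf-8")                  # Python's strict decoder
    except UnicodeDecodeError as exc:
        print("rejected:", exc.reason)
```

The overlong sequence decodes to 0x41, and the value 0x110000 gives a first word of 0xDC00, which is already outside the 0xD800 to 0xDBFF range of a first surrogate. A strict decoder rejects both inputs.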

I’m now wondering if the internal storage is UTF-16. That would explain some of the search behavior.