Hi, @carypt, @benjamin-sasse, @peterjones, @alan-kilborn and All,
End of the discussion :
Now, our Boost regex engine correctly handles the Collating symbol syntax ( [.•••••.] )
Here is the table of all the available POSIX symbolic names, as described here
•============================•============================•=============•=============•=============•
|  POSIX symbolic name       |  ESCAPED symbolic name     |  Character  |  DEC value  |  HEX value  |
•============================•============================•=============•=============•=============•
|  [.NUL.]                   |  \N{NUL}                   |  NUL        |  000        |  \x00       |
|  [.SOH.]                   |  \N{SOH}                   |  SOH        |  001        |  \x01       |
|  [.STX.]                   |  \N{STX}                   |  STX        |  002        |  \x02       |
|  [.ETX.]                   |  \N{ETX}                   |  ETX        |  003        |  \x03       |
|  [.EOT.]                   |  \N{EOT}                   |  EOT        |  004        |  \x04       |
|  [.ENQ.]                   |  \N{ENQ}                   |  ENQ        |  005        |  \x05       |
|  [.ACK.]                   |  \N{ACK}                   |  ACK        |  006        |  \x06       |
|  [.alert.]                 |  \N{alert}                 |  BEL        |  007        |  \x07       |
|  [.backspace.]             |  \N{backspace}             |  BS         |  008        |  \x08       |
|  [.tab.]                   |  \N{tab}                   |  TAB        |  009        |  \x09       |
|  [.newline.]               |  \N{newline}               |  LF         |  010        |  \x0A       |
|  [.vertical-tab.]          |  \N{vertical-tab}          |  VT         |  011        |  \x0B       |
|  [.form-feed.]             |  \N{form-feed}             |  FF         |  012        |  \x0C       |
|  [.carriage-return.]       |  \N{carriage-return}       |  CR         |  013        |  \x0D       |
|  [.SO.]                    |  \N{SO}                    |  SO         |  014        |  \x0E       |
|  [.SI.]                    |  \N{SI}                    |  SI         |  015        |  \x0F       |
|  [.DLE.]                   |  \N{DLE}                   |  DLE        |  016        |  \x10       |
|  [.DC1.]                   |  \N{DC1}                   |  DC1        |  017        |  \x11       |
|  [.DC2.]                   |  \N{DC2}                   |  DC2        |  018        |  \x12       |
|  [.DC3.]                   |  \N{DC3}                   |  DC3        |  019        |  \x13       |
|  [.DC4.]                   |  \N{DC4}                   |  DC4        |  020        |  \x14       |
|  [.NAK.]                   |  \N{NAK}                   |  NAK        |  021        |  \x15       |
|  [.SYN.]                   |  \N{SYN}                   |  SYN        |  022        |  \x16       |
|  [.ETB.]                   |  \N{ETB}                   |  ETB        |  023        |  \x17       |
|  [.CAN.]                   |  \N{CAN}                   |  CAN        |  024        |  \x18       |
|  [.EM.]                    |  \N{EM}                    |  EM         |  025        |  \x19       |
|  [.SUB.]                   |  \N{SUB}                   |  SUB        |  026        |  \x1A       |
|  [.ESC.]                   |  \N{ESC}                   |  ESC        |  027        |  \x1B       |
|  [.IS4.]                   |  \N{IS4}                   |  FS         |  028        |  \x1C       |
|  [.IS3.]                   |  \N{IS3}                   |  GS         |  029        |  \x1D       |
|  [.IS2.]                   |  \N{IS2}                   |  RS         |  030        |  \x1E       |
|  [.IS1.]                   |  \N{IS1}                   |  US         |  031        |  \x1F       |
|  [.space.]                 |  \N{space}                 |  SP         |  032        |  \x20       |
|  [.exclamation-mark.]      |  \N{exclamation-mark}      |  !          |  033        |  \x21       |
|  [.quotation-mark.]        |  \N{quotation-mark}        |  "          |  034        |  \x22       |
|  [.number-sign.]           |  \N{number-sign}           |  #          |  035        |  \x23       |
|  [.dollar-sign.]           |  \N{dollar-sign}           |  $          |  036        |  \x24       |
|  [.percent-sign.]          |  \N{percent-sign}          |  %          |  037        |  \x25       |
|  [.ampersand.]             |  \N{ampersand}             |  &          |  038        |  \x26       |
|  [.apostrophe.]            |  \N{apostrophe}            |  '          |  039        |  \x27       |
|  [.left-parenthesis.]      |  \N{left-parenthesis}      |  (          |  040        |  \x28       |
|  [.right-parenthesis.]     |  \N{right-parenthesis}     |  )          |  041        |  \x29       |
|  [.asterisk.]              |  \N{asterisk}              |  *          |  042        |  \x2A       |
|  [.plus-sign.]             |  \N{plus-sign}             |  +          |  043        |  \x2B       |
|  [.comma.]                 |  \N{comma}                 |  ,          |  044        |  \x2C       |
|  [.hyphen.]                |  \N{hyphen}                |  -          |  045        |  \x2D       |
|  [.period.]                |  \N{period}                |  .          |  046        |  \x2E       |
|  [.slash.]                 |  \N{slash}                 |  /          |  047        |  \x2F       |
|  [.zero.]                  |  \N{zero}                  |  0          |  048        |  \x30       |
|  [.one.]                   |  \N{one}                   |  1          |  049        |  \x31       |
|  [.two.]                   |  \N{two}                   |  2          |  050        |  \x32       |
|  [.three.]                 |  \N{three}                 |  3          |  051        |  \x33       |
|  [.four.]                  |  \N{four}                  |  4          |  052        |  \x34       |
|  [.five.]                  |  \N{five}                  |  5          |  053        |  \x35       |
|  [.six.]                   |  \N{six}                   |  6          |  054        |  \x36       |
|  [.seven.]                 |  \N{seven}                 |  7          |  055        |  \x37       |
|  [.eight.]                 |  \N{eight}                 |  8          |  056        |  \x38       |
|  [.nine.]                  |  \N{nine}                  |  9          |  057        |  \x39       |
|  [.colon.]                 |  \N{colon}                 |  :          |  058        |  \x3A       |
|  [.semicolon.]             |  \N{semicolon}             |  ;          |  059        |  \x3B       |
|  [.less-than-sign.]        |  \N{less-than-sign}        |  <          |  060        |  \x3C       |
|  [.equals-sign.]           |  \N{equals-sign}           |  =          |  061        |  \x3D       |
|  [.greater-than-sign.]     |  \N{greater-than-sign}     |  >          |  062        |  \x3E       |
|  [.question-mark.]         |  \N{question-mark}         |  ?          |  063        |  \x3F       |
|  [.commercial-at.]         |  \N{commercial-at}         |  @          |  064        |  \x40       |
|  [.A.]                     |  \N{A}                     |  A          |  065        |  \x41       |
|  .......                   |  ........                  |  ...        |  .....      |  ......     |
|  [.Z.]                     |  \N{Z}                     |  Z          |  090        |  \x5A       |
|  [.left-square-bracket.]   |  \N{left-square-bracket}   |  [          |  091        |  \x5B       |
|  [.backslash.]             |  \N{backslash}             |  \          |  092        |  \x5C       |
|  [.right-square-bracket.]  |  \N{right-square-bracket}  |  ]          |  093        |  \x5D       |
|  [.circumflex.]            |  \N{circumflex}            |  ^          |  094        |  \x5E       |
|  [.underscore.]            |  \N{underscore}            |  _          |  095        |  \x5F       |
|  [.grave-accent.]          |  \N{grave-accent}          |  `          |  096        |  \x60       |
|  [.a.]                     |  \N{a}                     |  a          |  097        |  \x61       |
|  .......                   |  ........                  |  ...        |  .....      |  ......     |
|  [.z.]                     |  \N{z}                     |  z          |  122        |  \x7A       |
|  [.left-curly-bracket.]    |  \N{left-curly-bracket}    |  {          |  123        |  \x7B       |
|  [.vertical-line.]         |  \N{vertical-line}         |  |          |  124        |  \x7C       |
|  [.right-curly-bracket.]   |  \N{right-curly-bracket}   |  }          |  125        |  \x7D       |
|  [.tilde.]                 |  \N{tilde}                 |  ~          |  126        |  \x7E       |
|  [.DEL.]                   |  \N{DEL}                   |  DEL        |  127        |  \x7F       |
•============================•============================•=============•=============•=============•
Notes :
The case of the symbolic name must be respected exactly !
The POSIX [.•••.] syntax can only be used inside a classical character class
The \N{...} syntax can be used anywhere in the regex
A POSIX symbolic name can also represent the character itself !
Examples :
[[.IS2.]] represents the RECORD SEPARATOR ( RS ) character, of code \x1E
\N{plus-sign} is the + sign, of code \x2B
[\N{number-sign}[.six.][.].]] represents the # sign or the 6 digit or the closing bracket ]
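To make the mapping in the table concrete, here is a minimal Python sketch that rewrites symbolic names into their \xHH form. It is only an illustration and not how Boost does it : the POSIX_NAMES dictionary is a hand-copied excerpt of the table, and the function name expand_symbolic_names is hypothetical.

```python
import re

# Hand-copied excerpt of the table above (assumption: extend it as needed).
POSIX_NAMES = {
    "NUL": 0x00,
    "tab": 0x09,
    "newline": 0x0A,
    "IS2": 0x1E,
    "space": 0x20,
    "number-sign": 0x23,
    "plus-sign": 0x2B,
    "six": 0x36,
    "right-square-bracket": 0x5D,
    "DEL": 0x7F,
}

# Matches either spelling of a symbolic name: [.name.] or \N{name}.
_SYMBOL = re.compile(r"\[\.(?P<posix>.+?)\.\]|\\N\{(?P<escaped>[^}]+)\}")


def expand_symbolic_names(pattern: str) -> str:
    """Replace every symbolic name in `pattern` with its \\xHH escape."""

    def to_hex(match: re.Match) -> str:
        name = match.group("posix") or match.group("escaped")
        if name in POSIX_NAMES:      # lookup is case-sensitive, like the names
            code = POSIX_NAMES[name]
        elif len(name) == 1:         # the name may be the character itself
            code = ord(name)
        else:
            raise KeyError(f"unknown symbolic name: {name!r}")
        return f"\\x{code:02X}"

    return _SYMBOL.sub(to_hex, pattern)


print(expand_symbolic_names(r"[[.IS2.]]"))
print(expand_symbolic_names(r"\N{plus-sign}"))
print(expand_symbolic_names(r"[\N{number-sign}[.six.][.].]]"))
```

The three calls reuse the examples above. A name with the wrong case, such as [.Tab.], raises a KeyError in this sketch, which mirrors the first note.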
As you can see, @carypt, the list above follows, exactly, the Portable Character Set standard, as described in these articles :
https://en.wikipedia.org/wiki/Portable_character_set
https://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap06.html
Our Boost regex engine also knows some digraphs, when used as a collating name :
•----------•----------•------------------------------------------------•
|  Regex   | Digraph  |  Origin                                        |
•----------•----------•------------------------------------------------•
|  [.AE.]  |    AE    |                                                |
|  [.Ae.]  |    Ae    |  Latin ligature                                |
|  [.ae.]  |    ae    |                                                |
•----------•----------•------------------------------------------------•
|  [.CH.]  |    CH    |                                                |
|  [.Ch.]  |    Ch    |  Spanish                                       |
|  [.ch.]  |    ch    |                                                |
•----------•----------•------------------------------------------------•
|  [.DZ.]  |    DZ    |                                                |
|  [.Dz.]  |    Dz    |  Hungarian - Polish - Slovak - Serbo-Croatian  |
|  [.dz.]  |    dz    |                                                |
•----------•----------•------------------------------------------------•
|  [.LJ.]  |    LJ    |                                                |
|  [.Lj.]  |    Lj    |  Serbo-Croatian                                |
|  [.lj.]  |    lj    |                                                |
•----------•----------•------------------------------------------------•
|  [.LL.]  |    LL    |                                                |
|  [.Ll.]  |    Ll    |  Spanish                                       |
|  [.ll.]  |    ll    |                                                |
•----------•----------•------------------------------------------------•
|  [.NJ.]  |    NJ    |                                                |
|  [.Nj.]  |    Nj    |  Serbo-Croatian                                |
|  [.nj.]  |    nj    |                                                |
•----------•----------•------------------------------------------------•
|  [.SS.]  |    SS    |                                                |
|  [.Ss.]  |    Ss    |  German                                        |
|  [.ss.]  |    ss    |                                                |
•----------•----------•------------------------------------------------•
Refer here and here for further information !
Example :
The regex (?-i)[[.Dz.]-[.Lj.]] matches the digraph Dz ( but not D ), or one of the uppercase letters [EFGHIJKL], or the digraph Lj. Test this regex against this text :
C c D d DZ Dz dz E e F f G g H h I i J j K k L l LJ Lj lj LL Ll ll M m N n
-- - - - - - - - - •• -- •• •
Note that, if the N++ Boost library had been built with full Unicode support, all the Unicode names would have been recognized ! For example, in this page :
Instead of using the classical syntax \x{0418} to match the Cyrillic capital letter I, we could use a Unicode symbolic name, with the collating name [[.CYRILLIC CAPITAL LETTER I.]] , which matches the Cyrillic letter И
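As a point of comparison only, Python's own re module ( version 3.8 or later ) already accepts Unicode names with the \N{...} escape. The short sketch below is not Boost syntax and says nothing about Notepad++ ; it just shows that matching by name and matching by code point select the same character. The sample text is made up for the illustration.

```python
import re
import unicodedata

# Look the character up by its official Unicode name.
cyrillic_i = unicodedata.lookup("CYRILLIC CAPITAL LETTER I")

# Python 3.8+ accepts \N{UNICODE NAME} directly inside a regex pattern.
by_name = re.compile(r"\N{CYRILLIC CAPITAL LETTER I}")
# The same character, written with its code point (Python spells it \u0418).
by_code = re.compile(r"\u0418")

sample = "ИИ и"  # two capital letters, a space, one lowercase letter
print(by_name.findall(sample) == by_code.findall(sample))
```

Both patterns are case-sensitive here, so the lowercase и of the sample is not matched.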
Finally, we must mention an interesting feature, named Equivalence class :
An equivalence class matches all the characters equivalent to a specific Unicode character, whatever the case, the accentuation, the size and other specificities of these characters
Its syntax is [=Char=], where Char represents a single character, and it must be inserted in a classical character class
For instance :
The regex [[=A=]] matches one <A> character of the range : [AaªÀÁÂÃÄÅàáâãäåĀāĂ㥹ǍǎǞǟǠǡǺǻȀȁȂȃɐɑɒḀḁẚẠạẢảẤấẦầẨẩẪẫẬậẮắẰằẲẳẴẵẶặÅ⒜ⒶⓐAa]
The regex [[=1=]] is equivalent to the regex [1¹₁⅟①⑴⒈❶➀➊1]
The regex [[===]] finds any single character of the range [=⁼₌⊜=]
The [[=plus-sign=]] regex matches one character from [⁺₊⊕⊞+]
The [[=Ae=]] syntax finds any single character from the range [ÆæǢǣǼǽ]
Notes :
The char between the two = signs may also be a digraph or a symbolic name
Any single character of the range may be used. For instance, the regexes [[=A=]] , [[=⒜=]], and [[=Ȁ=]] are equivalent and would match the same characters
For this purpose, have a look at the collation charts here
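To get a feeling for what "equivalent" means, here is a rough Python approximation. It is an assumption on my side and not the way our Boost engine computes equivalence classes : it simply compares two characters after Unicode NFKD decomposition, removal of the combining marks and case folding. The function names base_form and roughly_equivalent are hypothetical.

```python
import unicodedata


def base_form(ch: str) -> str:
    """Decompose `ch`, drop combining marks (accents) and fold the case."""
    decomposed = unicodedata.normalize("NFKD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


def roughly_equivalent(a: str, b: str) -> bool:
    """True when both characters reduce to the same base form."""
    return base_form(a) == base_form(b)


for candidate in "aªÀǟⓐＡ":
    print(candidate, roughly_equivalent("A", candidate))
for candidate in "¹₁①":
    print(candidate, roughly_equivalent("1", candidate))
```

This approximation is narrower than the real equivalence classes : characters such as ɐ or ⒜, which belong to the [[=A=]] range above, are not recognized by it, because they do not decompose to a plain a.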
I hope, @carypt, that you have found some interesting things for your daily work !
Best Regards,
guy038