Search for character classes but not replace them
-
There are multiple ways of doing it.
You could put the alpha’s in capture groups, with
([[:alpha:]])2([[:alpha:]])
, then use a substitution of${1}4${2}
which will insert the contents of the first group using the ${ℕ} substitution escape sequence, then the literal4
, then the contents of the second group.Alternately, you could use lookahead and lookbehind assertions in the search expression to require that the characters are there, but don’t include them in the “matched text” for the replacement:
(?<=[[:alpha:]])2(?=[[:alpha:]])
with a replacement of4
(the links go to the appropriate section of https://npp-user-manual.org/docs/searching/)
-
@Alan-Kilborn, @PeterJones Thanks a lot, that worked! Thanks also for pointing towards the relevant passages of the user manual :)
-
@Alan-Kilborn said in Search for character classes but not replace them:
Also, not sure why you show only single [ and ] in your posting – that isn’t going to work for what you’re trying to do.
@Benjamin-Sasse , in case you check back… This is a subtle issue, that depending on your “test data” might not have come up.
[:alpha:]
is a normal character class which will match any one of the literal characters:
,a
,l
,p
, orh
. If your test expression wasa2a
, you would have thought it was working.[[:alpha:]]
is a named character class, which will match any alphabetic characters:
The usermanual tries to emphasize the need for both brackets for the named character class, but there’s a lot to read, so you might not have noticed that.
-
Hello @benjamin-sasse, @peterjones, @alan-kilborn, and All
I think that true
Posix
character classes are, indeed, defined as[:xxxxxx:]
However, the important point is that a
Posix
character class is active ONLY IF it is contained in a standard character class !
So, for instance, the character class
(?-i)[AB[:digit:]x-z]
would match, either :-
The uppercase letter
A
orB
-
Any single digit from
0
to9
-
The lowercase letter
x
,y
orz
As you can see, the outer character class contains two distinct values
A
andB
, aPosix
character class[:digit:]
and, finally, a range of charactersx-z
Thus, the negative class character
(?-i)[^AB[:digit:]x-z]
would match any character different from[AB0123456789xyz]
If a
Posix
character class is used alone, within a standard character class, two syntaxes are possible for the negative form :[^[:.....:]]
or[[:^.....:]]
For instance, all these regexes are equivalent :
[^[:word:]]
=[[:^word:]]
=[^[:w:]]
=[[:^w:]]
=\W
=[^_\d\l\u]
Our
Boost
regex engine handles the15
character classesPosix
, below, when embedded in a standard positive or negative character class :[:space:] [:digit:] [:lower:] [:upper:] [:word:] [:blank:] [:v:] [:alnum:] [:alpha:] [:cntrl:] [:graph:] [:print:] [:punct:] [:xdigit:] [:unicode:]
Best Regards,
guy038
-
-
@guy038 just for understanding : the posix character class
[:unicode:]
can only contain characters of the basic posix character set “portable character set” (256 characters) ? maybe outdated , but regex buddy only lists 12 posix character classeshttps://www.regular-expressions.info/posixbrackets.html#class .uhmm , when i follow the npp-manual , guiding me to the boost - perl-regex 1.7.0 there are again other character classes , default always supported and ones that are unicode extended . https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/syntax/character_classes/std_char_classes.html https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/syntax/character_classes/optional_char_class_names.html .
but still my question is : can only the posix portable character set match the search posix character classes ? i wouldnt have cared about , but the “posix” definition now confuses me .
-
Hello, @carypt and all,
- Firstly, Notepad++ uses the standard
Boost
regex library, which is not compiled with fullUnicode
support. Thus, all the regexes syntaxes of the page, below, in order to get all characters of a particularUnicode General Category
category, cannot be used and just output anInvalid regular expression
message ! Of course, we lose a functionality but, in return, we gain in speed of execution, as the regex engine does not have to distinguish the numerousUnicode
characters and properties !
- Secondly, from the three links, below :
https://www.regular-expressions.info/posixbrackets.html#clas
https://en.wikipedia.org/wiki/Regular_expression#Character_classes
We deduce that our
Boost
regex engine :-
Does not handle the
[[:ascii:]]
Posix character class, which is not standard and which can easily be replaced with the[\x00-\x7F]
regex -
Can handle the
[[:v:]]
Posix character class, equivalent to\v
, which is not a standardPosix
class and matches a vertical blank character, so the regex[\n\x0B\f\r\x85\x{2028}\x{2029}]
-
Can handle the
[[:unicode:]]
Posix character class. Apparently, after a lot ofGoogle
searches, it does not seem to be aPosix
standard class ( An addition to theBoost
regex library ? ). It matches any Unicode character which code-point over\x{00xFF}
Note, however, that the present N++ implementation misses all Unicode chars with code over\x{FFFF}
, so all the characters over theBMP
:-((
Its name seems also misnommed as, anyway, all characters are
Unicode
characters ! Actually, it should be the class of all characters which do not belong to theC0 Control and Basic Latin (ASCII)
andC1 Control and Latin-1 Supplement
Unicode scripts and should be found wih the regex[^\x00-\xFF]
Unfortunately, given the above restriction, the right regex, in order to match any character with code-point over
\x{00FF}
, is, rather :(?![\x00-\xFF]).[\x{D800}-\x{DFFF}]?
Best Regards,
guy038
- Firstly, Notepad++ uses the standard
-
@guy038 ty , for your detailed and elaborate answer . oh my , that question has caused much research i presume . excuse me . its a mess that all is dependent on widespread definitions .
so now i know npp has a faster speed for dropping higher unicode characters , ok , the main used chinese etc characters seem to be contained in the base multi plane of unicode .
i was interested in what characters are element of the set of posix character class (as in set theory). posix should give a set of characters for region specific needs , the set can be modified with localedef (in posix) and gives then a “posix locale” . the normal posix character set is the “portable character set” (103 characters).
i was interested if the posix-character-classes
[[:????:]]
were only matching to these specific “posix locale”- set of characters . as for a not posix-supporting operating system like windows this would be obsolete anyway .i assume the posix character class is just another writing for a bracket expression
[??-??]
, just the posix-syntax of regular expression , not the perl-syntax. a better name would then be posix-syntax-character-class . ffff , it is just a boost-regex-implemented-syntax from posix . so it does not only match a posix-set of characters . excuse my misinterpretation .posix resource : https://pubs.opengroup.org/onlinepubs/9699919799/
posix locale : https://pubs.opengroup.org/onlinepubs/9699919799/
posix portable character set : https://pubs.opengroup.org/onlinepubs/9699919799/ -
Hi, @carypt,
Be patient ! I will answer you, later. Right now, I’m trying to build up a valid
UTF-8
test file containing all the1,114,112
Unicode code-points !BR
guy038
-
@guy038 ty , there is no need to answer , i think i got it now . i just confused the posix character set with the posix syntax for regexes , a kind of dumb mistake . sry
-
Hello, @carypt, @benjamin-sasse, @peterjones, @alan-kilborn and All,
Finally, I answer you !
First, you said :
so now i know npp has a faster speed for dropping higher unicode characters , ok , the main used chinese etc characters seem to be contained in the base multi plane of unicode .
This statement is not correct and, may be, I was misunderstood !
As I said, the fact to not use the full Unicode support with the N++ Boost regex implementation surely speed up the regex engine, but, in return, prevent us to use any the Unicode regex syntaxes, listed in this page :
However, we still should be able to get the individual characters, with code-point over the
BMP
plane, with the logical syntax\x{.....}
( from\x{10000}
to\x{10FFFF}
)Luckily we can access to an individual character, over the
BMP
, by using thesurrogate
mechanism ! For instance, to match the🚂
character (STEAM LOCOMOTIVE), with Unicode code-pointU+1F682
, we can use the couple\x{D83D}\x{DE82}
as the valuesD83D
andDE82
represent the high and lowsurrogate
pair, of theUTF-16
encoding of the code-pointU+1F682
!
As I said, in my short previous post, I succeeded to create an
UTF-8-BOM
encoded file containing all existing Unicode characters. But, unlike I said, I don’t have to store all the Unicode characters (1,114,112
) as :-
Some zones are forbidden, as definitively declared
NON-Characters
zones by the Unicode Consortium -
The Surrogates zone (
[\x{D800}-\x{DFFF}]
), used to code the characters over theBMP
in anUTF-16
encoded file are forbidden -
Some Unicode planes ( Planes
3
to14
) are totally empty, as not used, up to now and probably for a long time -
The Unicode planes
15
and16
, standing for theSupplementary Private Use Areas
, are generally not used, either
Here is a table which recapitulates the layout of all the Unicode characters :
•--------------------•-------------------•------------•---------------------------•------------•-------------------• | Range | Description | Status | Number of Chars | Encoding | Number of Bytes | •--------------------•-------------------•------------•-------------•-------------•------------•-------------------• | 0000 - 007F | PLANE 0 - BMP | Included | | 128 | 1 Byte | 128 | •--------------------•-------------------•------------•-------------•-------------•------------•-------------------• | 0080 - 0FFF | PLANE 0 - BMP | Included | | + 1.920 | 2 Bytes | 3,840 | •--------------------•-------------------•------------•-------------•-------------•------------•-------------------• | 0800 - D7FF | PLANE 0 - BMP | Included | | + 53,248 | | 159,744 | | | | | | | | | | D800 - DFFF | SURROGATES zone | EXCLUDED | - 2.048 | | | | | | | | | | | | | E000 - F8FF | PLANE 0 - PUA | Included | | + 6,400 | | 19,200 | | | | | | | | | | F900 - FDFC | PLANE 0 - BMP | Included | | + 1,232 | 3 Bytes | 3,696 | | | | | | | | | | FDD0 - FDEF | NON-characters | EXCLUDED | - 32 | | | | | | | | | | | | | FDF0 - FFFD | PLANE 0 - BMP | Included | | + 526 | | 1,578 | | | | | | | | | | FFFE - FFFF | NON-characters | EXCLUDED | - 2 | | | | •--------------------•-------------------•------------•-------------•-------------•------------•-------------------• | Plane 0 - BMP | SUB-Totals | - 2,082 | + 63,454 | / | 188,186 | •--------------------•-------------------•------------•-------------•-------------•------------•-------------------• | 10000 - 1FFFD | PLANE 1 - SMP | Included | | + 65,534 | | 262,136 | | | | | | | | | | 1FFFE - 1FFFF | NON-characters | EXCLUDED | - 2 | | | | •--------------------•-------------------•------------•-------------•-------------• •-------------------• | 20000 - 2FFFD | PLANE 2 - SIP | Included | | + 65,534 | | 262,136 | | | | | | | 4 Bytes | | | 2FFFE - 2FFFF | NON-characters | EXCLUDED | - 2 | | | | •--------------------•-------------------•------------•-------------•-------------• •-------------------• | 30000 - 3FFFD | PLANE 3 - TIP | Included | | + 65,534 | | 262,136 | | | | | | | | | | 3FFFE - 3FFFF | NON-characters | EXCLUDED | - 2 | | | | •--------------------•-------------------•------------•-------------•-------------•------------•-------------------• | 40000 - DFFFF | PLANES 4 to 13 | NOT USED | - 655,360 | | 4 Bytes | | •--------------------•-------------------•------------•-------------•-------------•------------•-------------------• | E0000 - EFFFD | PLANE 14 - SPP | Included | | + 65,534 | | 262,136 | | | | | | | | | | EFFFE - EFFFF | NON-characters | EXCLUDED | - 2 | | | | •--------------------•-------------------•------------•-------------•-------------• •-------------------• | FFFF0 - FFFFD | PLANE 15 - SPUA | NOT USED | - 65,334 | | | | | | | | | | | | | FFFFE - FFFFF | NON-characters | EXCLUDED | - 2 | | | | •--------------------•-------------------•------------•-------------•-------------• 4 Bytes •-------------------• | 100000 - 10FFFD | PLANE 16 - SPUA | NOT USED | - 65,334 | | | | | | | | | | | | | 10FFFE - 10FFFF | NON-characters | EXCLUDED | - 2 | | | | •--------------------•-------------------•------------•-------------•-------------•------------•-------------------• | GRAND Totals | - 788,522 | + 325,590 | | 1,236,730 | | | | | | | | Byte Order Mark - BOM | | | / | 3 | •-----------------------------------------------------•-------------•-------------• •-------------------• | | 1,114,112 Unicode chars | | Size 1,236,733 | •-----------------------------------------------------•---------------------------•------------•-------------------•
Refer here for additional information
Thus, I’m left with a file with size
1,236,733
and containing, exactly,325,590
Unicode characters. Of course, depending on the current font used, it is generally not able to display the glyphs of all the characters ! But, it doesn’t matter because we just want to know which, and how many, characters are matched by a specific,POSIX
or not,Character class
;-))I close this post because any post is limited to
16,000
bytes about ! -
-
Hi, @carypt, @benjamin-sasse, @peterjones, @alan-kilborn and All,
Continuation of the discussion :
Now, from this page, here is the summary list of the
15
availableCharacter class
, known of ourBoost
regex engine :•=========================•===============================•==============•===========•===========================================•===============================================================================================================================================• | INSIDE a Class [....] | OUTSIDE a Class [....] | EVERYTHERE | Total | SIMPLIFIED and / or APPROXIMATIVE regex | EXACT or Win-1252-EQUIVALENT regex | •=========================•===============================•==============•===========•===========================================•===============================================================================================================================================• | [:alpha:] | \p{alpha} | | | | 45,813 | (?i)[A-Z] | [^\W\d\x5f] | •---------------•---------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------• | [:digit:] | [:d:] | \p{digit} | \p{d} | \pd | \d | 201 | [0-9] | [0-9¹²³.....] or [0-9¹²³] ( with "Win-1252" Encoding ) | •---------------•---------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------• | [:alnum:] | \p{alnum} | | | | 46,014 | (?i)[0-9A-Z] | [^\W\x5f] | •---------------•---------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------• | [:word:] | [:w:] | \p{word} | \p{w} | \pw | \w | 46,015 | (?i)[0-9_A-Z] | [[:alnum:]\x5F] | •---------------•---------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------• | [:punct:] | \p{punct} | | | | 334 | (?!\w)[[:graph:]] | [!"#$%&'()*+,-./:;<=>?@[\\]^`{|}~‚„…†‡‰‹‘’“”•–—›¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿×÷] ( with "Win-1252" Encoding ) | •---------------•---------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------• | [:graph:] | \p{graph} | | | | 46,342 | [[:punct:]\w] | (?!ªº_¹²³µ)[[:punct:]]|[[:word:]] or [^[:^punct:][:^word:]] | •---------------•---------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------• | [:print:] | \p{print} | | | | 46,368 | [[:punct:]\w\s] | [[:space:][:graph:]\x{FEFF}] | •---------------•---------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------• | [:space:] | [:s:] | \p{space} | \p{s} | \ps | \s | 25 | [\t\n\r\x20] | [\t\n\x0B\f\r\x20\x85\xA0\x{1680}\x{2000}-\x{200B}\x{2028}\x{2029}\x{202F}\x{3000}] | •---------------•---------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------• | [:h:] | \p{h} | \ph | \h | 18 | [\t\x20] | [\t\x20\xA0\x{1680}\x{2000}-\x{200B}\x{202F}\x{3000}] | •-------------------------•-----------------------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------• | [:v:] | \p{v} | \pv | \v | 7 | [\r\n] | [\n\x0b\f\r\x85\x{2028}\x{2029}] | •---------------•---------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------• | [:upper:] | [:u:] | \p{upper} | \p{u} | \pu | \u | 717 | (?-i)[A-Z] | (?-i)[A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ..........] or (?-i)[A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ] ( with "Win-1252" Encoding ) | •---------------•---------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------• | [:lower:] | [:l:] | \p{lower} | \p{l} | \pl | \l | 835 | (?-i)[a-z] | (?-i)[a-zƒšœžªµºßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ.....] or (?-i)[a-zƒšœžªµºßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ] ( with "Win-1252" Encoding ) | •---------------•---------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------• | [:cntrl:] | \p{cntrl} | | | | 89 | [\x00-\x1F\x7F\x80-\x9F] | [\x00-\x1F\x7F\x80-\x9F\x{070F}\x{180B}-\x{180E}\x{200C}-\x{200F}\x{202A}-\x{202E}\x{206A}-\x{206F}\x{FEFF}\x{FFF9}-\x{FFFB}] | •-------------------------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------• | [:xdigit:] | \p{xdigit} | | | | 44 | (?i)[A-F0-9] | (?i)[A-F0-9\x{FF10}-\x{FF19}\x{FF21}-\x{FF26}] | •-------------------------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------• | [:blank:] | \p{blank} | | | | 5 | [\t\x20\xA0] | [\t\x20\xA0\x{3000}\x{FEFF}] | •-------------------------•-------------•---------•-------•--------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------• | [:unicode:] | \p{unicode} | | | | 325,334 | [^\x00-\xFF] | [^\x00-\xFF] | •------------------------------------------------------------------------•-----------•-------------------------------------------•-----------------------------------------------------------------------------------------------------------------------------------------------• | ANY Unicode character | 325,590 | (?s). | (?s). | •========================================================================•===========•===========================================•===============================================================================================================================================•
Notes :
-
As you can see, the regex syntaxes are different according to the location of the
Character class
! -
The Total column shows the number of characters, matched by the respective
Character class
, out of the325,590
characters -
To express a negative
Character class
, use the syntax :-
[:^class:]
or[:^c:]
when thisPOSIX
class is located inside a classical[.....]
Character class -
\P{class}
or\P{c}
or\Pc
, when located outside a classical[.....]
Character class -
\<Uppercase_letter>
, whatever its location
-
-
If a
POSIX
class is isolated into aCharacter class
, you can use, either, the[^[:class:]]
or[^[:class:]]
syntax
Between
Character Classes
, we have the following mathematical relations :-
[[:alnum:]]
=[[alpha:]]
+[[digit:]]
-
[[word:]]
=[[:alnum:]]
+\x5F
(_
char ) -
[[:graph:]]
=[[:punct:]]
-[ªº_¹²³µ]
+[[word:]]
-
[[:print:]]
=[[:space:]]
+[[:graph:]]
+\x{FEFF}
( ZWNBSP = Zero_With_No_Break_Space ) -
[[:space:]]
=[[:h:]]
+[[:v:]]
-
[[unicode:]]
= <All
> (325,590
) - First256
( from\x00
to\xFF
)
See you in next post !
-
-
Hi, @carypt, @benjamin-sasse, @peterjones, @alan-kilborn and All,
End of the discussion :
Now, our
Boost
regex engine correctly handles the Collating symbol syntax ([.•••••.]
)Here is the table of all the available
POSIX
symbolic names, as described here•============================•============================•=============•=============•=============• | POSIX symbolic name | ESCAPED symbolic name | Character | DEC value | HEX Value | •============================•============================•===========================•=============• | [.NUL.] | \N{NUL} | NUL | 000 | \x00 | | [.SOH.] | \N{SOH} | SOH | 001 | \x01 | | [.STX.] | \N{STX} | STX | 002 | \x02 | | [.ETX.] | \N{ETX} | ETX | 003 | \x03 | | [.EOT.] | \N{EOT} | EOT | 004 | \x04 | | [.ENQ.] | \N{ENQ} | ENQ | 005 | \x05 | | [.ACK.] | \N{ACK} | ACK | 006 | \x06 | | [.alert.] | \N{alert} | BEL | 007 | \x07 | | [.backspace.] | \N{backspace} | BS | 008 | \x08 | | [.tab.] | \N{tab} | TAB | 009 | \x09 | | [.newline.] | \N{newline} | LF | 010 | \x0A | | [.vertical-tab.] | \N{vertical-tab} | VT | 011 | \x0B | | [.form-feed.] | \N{form-feed} | FF | 012 | \x0C | | [.carriage-return.] | \N{carriage-return} | CR | 013 | \x0D | | [.SO.] | \N{SO} | SO | 014 | \x0E | | [.SI.] | \N{SI} | SI | 015 | \x0F | | [.DLE.] | \N{DLE} | DLE | 016 | \x10 | | [.DC1.] | \N{DC1} | DC1 | 017 | \x11 | | [.DC2.] | \N{DC2} | DC2 | 018 | \x12 | | [.DC3.] | \N{DC3} | DC3 | 019 | \x13 | | [.DC4.] | \N{DC4} | DC4 | 020 | \x14 | | [.NAK.] | \N{NAK} | NAK | 021 | \x15 | | [.SYN.] | \N{SYN} | SYN | 022 | \x16 | | [.ETB.] | \N{ETB} | ETB | 023 | \x17 | | [.CAN.] | \N{CAN} | CAN | 024 | \x18 | | [.EM.] | \N{EM} | EM | 025 | \x19 | | [.SUB.] | \N{SUB} | SUB | 026 | \x1A | | [.ESC.] | \N{ESC} | ESC | 027 | \x1B | | [.IS4.] | \N{IS4} | FS | 028 | \x1C | | [.IS3.] | \N{IS3} | GS | 029 | \x1D | | [.IS2.] | \N{IS2} | RS | 030 | \x1E | | [.IS1.] | \N{IS1} | US | 031 | \x1F | | [.space.] | \N{space} | SP | 032 | \x20 | | [.exclamation-mark.] | \N{exclamation-mark} | ! | 033 | \x21 | | [.quotation-mark.] | \N{quotation-mark} | " | 034 | \x22 | | [.number-sign.] | \N{number-sign} | # | 035 | \x23 | | [.dollar-sign.] | \N{dollar-sign} | $ | 036 | \x24 | | [.percent-sign.] | \N{percent-sign} | % | 037 | \x25 | | [.ampersand.] | \N{ampersand} | & | 038 | \x26 | | [.apostrophe.] | \N{apostrophe} | ' | 039 | \x27 | | [.left-parenthesis.] | \N{left-parenthesis} | ( | 040 | \x28 | | [.right-parenthesis.] | \N{right-parenthesis} | ) | 041 | \x29 | | [.asterisk.] | \N{asterisk} | * | 042 | \x2A | | [.plus-sign.] | \N{plus-sign} | + | 043 | \x2B | | [.comma.] | \N{comma} | , | 044 | \x2C | | [.hyphen.] | \N{hyphen} | - | 045 | \x2D | | [.period.] | \N{period} | . | 046 | \x2E | | [.slash.] | \N{slash} | / | 047 | \x2F | | [.zero.] | \N{zero} | 0 | 048 | \x30 | | [.one.] | \N{one} | 1 | 049 | \x31 | | [.two.] | \N{two} | 2 | 050 | \x32 | | [.three.] | \N{three} | 3 | 051 | \x33 | | [.four.] | \N{four} | 4 | 052 | \x34 | | [.five.] | \N{five} | 5 | 053 | \x35 | | [.six.] | \N{six} | 6 | 054 | \x36 | | [.seven.] | \N{seven} | 7 | 055 | \x37 | | [.eight.] | \N{eight} | 8 | 056 | \x38 | | [.nine.] | \N{nine} | 9 | 057 | \x39 | | [.colon.] | \N{colon} | : | 058 | \x3A | | [.semicolon.] | \N{semicolon} | ; | 059 | \x3B | | [.less-than-sign.] | \N{less-than-sign} | < | 060 | \x3C | | [.equals-sign.] | \N{equals-sign} | = | 061 | \x3D | | [.greater-than-sign.] | \N{greater-than-sign} | > | 062 | \x3E | | [.question-mark.] | \N{question-mark} | ? | 063 | \x3F | | [.commercial-at.] | \N{commercial-at} | @ | 064 | \x40 | | [.A.] | \N{A} | A | 065 | \x41 | | ....... | ........ | ... | ..... | ...... | | [.Z.] | \N{Z} | Z | 090 | \x5A | | [.left-square-bracket.] | \N{left-square-bracket} | [ | 091 | \x5B | | [.backslash.] | \N{backslash} | \ | 092 | \x5C | | [.right-square-bracket.] | \N{right-square-bracket} | ] | 093 | \x5D | | [.circumflex.] | \N{circumflex} | ^ | 094 | \x5E | | [.underscore.] | \N{underscore} | _ | 095 | \x5F | | [.grave-accent.] | \N{grave-accent} | ` | 096 | \x60 | | [.a.] | \N{a} | a | 097 | \x61 | | ....... | ........ | ... | ..... | ...... | | [.z.] | \N{z} | z | 122 | \x7A | | [.left-curly-bracket.] | \N{left-curly-bracket} | { | 123 | \x7B | | [.vertical-line.] | \N{vertical-line} | | | 124 | \x7C | | [.right-curly-bracket.] | \N{right-curly-bracket} | } | 125 | \x7D | | [.tilde.] | \N{tilde} | ~ | 126 | \x7E | | [.DEL.] | \N{DEL} | DEL | 127 | \x7F | •============================•============================•=============•=============•=============•
Notes :
-
The case of the symbolic name must be exactly respected !
-
The POSIX
[.•••.]
syntax must be used inside a classicalCharacter class
, only -
The
\N{...}
syntax can be used whatever its location -
A
POSIX
symbolic name can also represents the character itself !
Examples :
-
[[.IS2.]]
represents the RECORD SEPARATOR (RS
) character, of code\x1E
-
\N{plus-sign}
is the+
sign, of code\x2B
-
[\N{number-sign}[.six.][.].]]
represents the#
sign or the6
digit or the closing bracket]
As you can see, @carypt, this above list respects, exactly, the
Portable character Set
norm, as described in these articles :https://en.wikipedia.org/wiki/Portable_character_set
https://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap06.html
Our
Boost
regex engine also knows somedigraphs
, when used as acollating
name :•----------•----------•------------------------------------------------• | Regex | Digraph | Origin | •----------•----------•------------------------------------------------• | [.AE.] | AE | | | [.Ae.] | Ae | Latin ligature | | [.ae.] | ae | | •----------•----------•------------------------------------------------• | [.CH.] | CH | | | [.Ch.] | Ch | Spanish | | [.ch.] | ch | | •----------•----------•------------------------------------------------• | [.DZ.] | DZ | | | [.Dz.] | Dz | Hungarian - Polish - Slovak - Serbo-Croatian | | [.dz.] | dz | | •----------•----------•------------------------------------------------• | [.LJ.] | LJ | | | [.Lj.] | Lj | Serbo-Croatian | | [.lj.] | lj | | •----------•----------•------------------------------------------------• | [.LL.] | LL | | | [.Ll.] | Ll | Spanish | | [.ll.] | ll | | •----------•----------•------------------------------------------------• | [.NJ.] | NJ | | | [.Nj.] | Nj | Serbo-Croatian | | [.nj.] | nj | | •----------•----------•------------------------------------------------• | [.SS.] | SS | | | [.Ss.] | Ss | German | | [.ss.] | ss | | •----------•----------•------------------------------------------------•
Refer here and here for further information !
Example :
- The regex
(?-i)[[.Dz.]-[.Lj.]]
matches the digraphDz
( but notD
), or one of the uppercase letters[EFGHIJKL]
or the digraphLj
. Test this regex against this text :
C c D d DZ Dz dz E e F f G g H h I i J j K k L l LJ Lj lj LL Ll ll M m N n -- - - - - - - - - •• -- •• •
Note that, if the N++
Boost
library had been build with full Unicode support, all the Unicode names would had been recognized ! For example, in this page :Instead of using the classical syntax
\x{0418}
to match the Cyrillic capital letterI
, we could use a Unicode symbolic name, with the collating name[[.CYRILLIC CAPITAL LETTER I.]]
, which match the Cyrillic letterИ
Finally, we must speak of an interesting feature, named
Equivalence class
:An equivalent class matches all the equivalent characters of a specific Unicode character, whatever the case, the accentuation, the size and other specificities of these characters
Its syntax is
[=Char=]
, where char represents an unique character and must be inserted in a classicalCharacter class
For instance :
-
The regex
[[=A=]]
matches one<A>
character of the range :[AaªÀÁÂÃÄÅàáâãäåĀāĂ㥹ǍǎǞǟǠǡǺǻȀȁȂȃɐɑɒḀḁẚẠạẢảẤấẦầẨẩẪẫẬậẮắẰằẲẳẴẵẶặÅ⒜ⒶⓐAa]
-
The regex
[[=1=]]
is equivalent to the regex[1¹₁⅟①⑴⒈❶➀➊1]
-
The
[[===]]
finds any single character of the range[=⁼₌⊜=]
-
The
[[=plus-sign=]]
regex matches one character from[⁺₊⊕⊞+]
-
The
[[=Ae=]]
syntax finds any one-char from the range[ÆæǢǣǼǽ]
Notes :
-
The char, between the two
=
signs may also bedigraph
or asymbolic name
-
Any single character of the range may be used. For instance, the regexes
[[=A=]]
,[[=⒜=]]
, and[[=Ȁ=]]
are equivalent and would match the same characters
To this purpose, have a look to the collation charts here
I hope, @carypt, that you have found some interesting things for your daily work !
Best Regards,
guy038
-