Hello, @steven123, @ekopalypse and All,
@steven123, I'm pleased to tell you that your request can be solved easily by using a POSIX equivalence class inside a character class / bracket expression. It stands for the list of all the characters equivalent to the character named in it.
The regex syntax of a POSIX equivalence class is [=<Char>=]. Note also that it must itself be placed inside a normal character class [....] or a negated character class [^....] in order to match all the characters equivalent to the character Char.
For additional information, refer to :
https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.equivalence_classes
https://www.regular-expressions.info/posixbrackets.html#eq
For instance, the regex (?-i)[A-E[=0=]5-9] would match an uppercase letter from A to E, a digit from 5 to 9, the digit 0, or any of the 4 characters equivalent to the digit 0 in the list below :
0030 Nd 0 Basic Latin DIGIT ZERO
2070 No ⁰ Superscripts and Subscripts SUPERSCRIPT ZERO
2080 No ₀ Superscripts and Subscripts SUBSCRIPT ZERO
24EA No ⓪ Enclosed Alphanumerics CIRCLED DIGIT ZERO
FF10 Nd ０ Halfwidth and Fullwidth Forms FULLWIDTH DIGIT ZERO
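Outside Notepad++, a rough way to see why these five characters group together is Unicode compatibility decomposition. The Python sketch below is only an approximation under that assumption: according to the Boost documentation linked above, the regex engine compares primary sort keys, not NFKD forms, so the two notions may disagree for other characters. The names ZERO_VARIANTS and folds_to are made up for this illustration.

```python
import unicodedata

# Code points copied from the list above.
ZERO_VARIANTS = ["\u0030", "\u2070", "\u2080", "\u24EA", "\uFF10"]


def folds_to(ch: str) -> str:
    """Return the NFKD (compatibility) decomposition of one character."""
    return unicodedata.normalize("NFKD", ch)


# Show each variant next to the plain character it decomposes to.
for ch in ZERO_VARIANTS:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {folds_to(ch)!r}")

# True when every variant in the list reduces to the ASCII digit 0.
all_fold_to_zero = all(folds_to(ch) == "0" for ch in ZERO_VARIANTS)
print(all_fold_to_zero)
```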
So, in order to match both the string Ho Chi Minh and the string Hồ Chí Minh, simply use the regex :
H[[=o=]] Ch[[=i=]] Minh, with [=o=] and [=i=] each embedded in a character class
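If you ever need the same kind of loose comparison in a script, note that Python's re module does not support the [=o=] syntax. A minimal sketch of a workaround, assuming that stripping combining marks after an NFKD decomposition is close enough for your data (it is not how the Boost engine decides equivalence), could look like this. The helper names strip_marks and loosely_equal are hypothetical.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose text (NFKD) and drop combining marks, keeping base characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def loosely_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring accents and case."""
    return strip_marks(a).casefold() == strip_marks(b).casefold()


print(strip_marks("Hồ Chí Minh"))                   # Ho Chi Minh
print(loosely_equal("Ho Chi Minh", "Hồ Chí Minh"))  # True
```

Letters without any Unicode decomposition are left untouched by this approach, so it will not reproduce every match of the regex above.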
Some other examples :
The regex \t[[=a=]]\t, run against the list below, would match all the characters equivalent to the letter a, whatever the case, the accentuation, the size and other specifics of these equivalent chars. That is to say 69 characters !
0041 A ; Upper_Letter # Lu LATIN CAPITAL LETTER A
0061 a ; Lower_Letter # Ll LATIN SMALL LETTER A
00AA ª ; Other_Letter # Lo FEMININE ORDINAL INDICATOR
00C0 À ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH GRAVE
00C1 Á ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH ACUTE
00C2 Â ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH CIRCUMFLEX
00C3 Ã ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH TILDE
00C4 Ä ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH DIAERESIS
00C5 Å ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH RING ABOVE
00E0 à ; Lower_Letter # Ll LATIN SMALL LETTER A WITH GRAVE
00E1 á ; Lower_Letter # Ll LATIN SMALL LETTER A WITH ACUTE
00E2 â ; Lower_Letter # Ll LATIN SMALL LETTER A WITH CIRCUMFLEX
00E3 ã ; Lower_Letter # Ll LATIN SMALL LETTER A WITH TILDE
00E4 ä ; Lower_Letter # Ll LATIN SMALL LETTER A WITH DIAERESIS
00E5 å ; Lower_Letter # Ll LATIN SMALL LETTER A WITH RING ABOVE
0100 Ā ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH MACRON
0101 ā ; Lower_Letter # Ll LATIN SMALL LETTER A WITH MACRON
0102 Ă ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH BREVE
0103 ă ; Lower_Letter # Ll LATIN SMALL LETTER A WITH BREVE
0104 Ą ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH OGONEK
0105 ą ; Lower_Letter # Ll LATIN SMALL LETTER A WITH OGONEK
01CD Ǎ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH CARON
01CE ǎ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH CARON
01DE Ǟ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON
01DF ǟ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
01E0 Ǡ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON
01E1 ǡ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON
01FA Ǻ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE
01FB ǻ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE
0200 Ȁ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH DOUBLE GRAVE
0201 ȁ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH DOUBLE GRAVE
0202 Ȃ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH INVERTED BREVE
0203 ȃ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH INVERTED BREVE
0250 ɐ ; Lower_Letter # Ll LATIN SMALL LETTER TURNED A
0251 ɑ ; Lower_Letter # Ll LATIN SMALL LETTER ALPHA
0252 ɒ ; Lower_Letter # Ll LATIN SMALL LETTER TURNED ALPHA
1E00 Ḁ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH RING BELOW
1E01 ḁ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH RING BELOW
1E9A ẚ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH RIGHT HALF RING
1EA0 Ạ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH DOT BELOW
1EA1 ạ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH DOT BELOW
1EA2 Ả ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH HOOK ABOVE
1EA3 ả ; Lower_Letter # Ll LATIN SMALL LETTER A WITH HOOK ABOVE
1EA4 Ấ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE
1EA5 ấ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
1EA6 Ầ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND GRAVE
1EA7 ầ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE
1EA8 Ẩ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
1EA9 ẩ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
1EAA Ẫ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND TILDE
1EAB ẫ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE
1EAC Ậ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW
1EAD ậ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
1EAE Ắ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH BREVE AND ACUTE
1EAF ắ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH BREVE AND ACUTE
1EB0 Ằ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH BREVE AND GRAVE
1EB1 ằ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH BREVE AND GRAVE
1EB2 Ẳ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH BREVE AND HOOK ABOVE
1EB3 ẳ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE
1EB4 Ẵ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH BREVE AND TILDE
1EB5 ẵ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH BREVE AND TILDE
1EB6 Ặ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW
1EB7 ặ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH BREVE AND DOT BELOW
212B Å ; Upper_Letter # Lu ANGSTROM SIGN
249C ⒜ ; Other_Symbol # So PARENTHESIZED LATIN SMALL LETTER A
24B6 Ⓐ ; Other_Alpha. # So CIRCLED LATIN CAPITAL LETTER A
24D0 ⓐ ; Other_Alpha. # So CIRCLED LATIN SMALL LETTER A
FF21 Ａ ; Upper_Letter # Lu FULLWIDTH LATIN CAPITAL LETTER A
FF41 ａ ; Lower_Letter # Ll FULLWIDTH LATIN SMALL LETTER A
In the same way, the regex \t[[===]]\t matches all the characters equivalent to the equals sign = :
003D Sm = Basic Latin EQUALS SIGN
207C Sm ⁼ Superscripts and Subscripts SUPERSCRIPT EQUALS SIGN
208C Sm ₌ Superscripts and Subscripts SUBSCRIPT EQUALS SIGN
229C Sm ⊜ Mathematical Operators CIRCLED EQUALS
FF1D Sm ＝ Halfwidth and Fullwidth Forms FULLWIDTH EQUALS SIGN
And the regex \t[[=Q=]]\t matches all the characters equivalent to the letter Q :
0051 Lu Q Basic Latin LATIN CAPITAL LETTER Q
0071 Ll q Basic Latin LATIN SMALL LETTER Q
02A0 Ll ʠ IPA Extensions LATIN SMALL LETTER Q WITH HOOK
211A Lu ℚ Letterlike Symbols DOUBLE-STRUCK CAPITAL Q
24AC So ⒬ Enclosed Alphanumerics PARENTHESIZED LATIN SMALL LETTER Q
24C6 So Ⓠ Enclosed Alphanumerics CIRCLED LATIN CAPITAL LETTER Q
24E0 So ⓠ Enclosed Alphanumerics CIRCLED LATIN SMALL LETTER Q
FF31 Lu Ｑ Halfwidth and Fullwidth Forms FULLWIDTH LATIN CAPITAL LETTER Q
FF51 Ll ｑ Halfwidth and Fullwidth Forms FULLWIDTH LATIN SMALL LETTER Q
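For a regex engine that lacks the [[=Q=]] syntax, one possible fallback is to spell the class out by hand from the list above. The following Python sketch does just that with the nine code points listed. It is an illustration only, and the names Q_CODE_POINTS, q_class and q_between_tabs are invented for it.

```python
import re

# Code points copied from the Q list above.
Q_CODE_POINTS = [0x0051, 0x0071, 0x02A0, 0x211A, 0x24AC,
                 0x24C6, 0x24E0, 0xFF31, 0xFF51]

# Build an explicit character class; re.escape keeps every member literal.
q_class = "[" + "".join(re.escape(chr(cp)) for cp in Q_CODE_POINTS) + "]"

# Same shape as the regex above: one class member between two tabs.
q_between_tabs = re.compile(r"\t" + q_class + r"\t")

# Three members of the class, followed by a letter that is not in it.
sample = "\tQ\t \t\u211A\t \t\uFF51\t \tR\t"
matches = q_between_tabs.findall(sample)
print(len(matches))  # 3
```

The obvious drawback is that such a hand-written class only covers the characters you list, whereas the equivalence class is resolved by the regex engine itself.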
Best Regards,
guy038