Hello, @steven123, @ekopalypse and All,
@steven123, I’m pleased to tell you that your request can be solved easily by using a POSIX equivalence class (which I loosely call a collating sequence in this post) inside a character class / bracket expression. It stands for the list of all the characters equivalent to the character that it mentions.
The regex syntax of a POSIX collating sequence is [=<Char>=]. Note also that this collating sequence must itself be placed inside a normal character class [....] or a negated character class [^....] in order to match all the characters equivalent to the character Char.
Refer, for additional information, to :
https://www.regular-expressions.info/posixbrackets.html#eq
For instance, the regex (?-i)[A-E[=0=]5-9] would match an uppercase letter from A to E, a digit from 5 to 9, the digit 0 or any of the 4 characters equivalent to the digit 0 in the list below :
0030  Nd  0  Basic Latin                    DIGIT ZERO
2070  No  ⁰  Superscripts and Subscripts    SUPERSCRIPT ZERO
2080  No  ₀  Superscripts and Subscripts    SUBSCRIPT ZERO
24EA  No  ⓪  Enclosed Alphanumerics         CIRCLED DIGIT ZERO
FF10  Nd  ０  Halfwidth and Fullwidth Forms  FULLWIDTH DIGIT ZERO

So, in order to match either the string Ho Chi Minh or the string Hồ Chí Minh, simply use the regex :
H[[=o=]] Ch[[=i=]] Minh, with the collating sequences [=o=] and [=i=] embedded in a character class
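As a side note, for readers who need the same kind of search outside Notepad++ : Python's built-in re module does not support the [=<Char>=] syntax, so the sketch below only approximates it. It builds an ordinary character class from every code point whose Unicode compatibility decomposition, once combining marks are removed and case is folded, equals the base letter. The helper names base_of, equivalents and char_class, as well as the 0x3000 scan limit, are assumptions made for this illustration. The resulting set is not guaranteed to coincide with the one used by Notepad++ (a letter without any Unicode decomposition, such as ɐ, is not picked up).

```python
import re
import unicodedata


def base_of(ch: str) -> str:
    """Fold one character: compatibility-decompose it, drop combining marks, casefold."""
    decomposed = unicodedata.normalize("NFKD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


def equivalents(base: str, limit: int = 0x3000) -> str:
    """Collect the code points below `limit` (an arbitrary cut-off) that fold to `base`."""
    return "".join(chr(cp) for cp in range(limit) if base_of(chr(cp)) == base)


def char_class(base: str) -> str:
    """Build a plain character class that approximates [[=base=]]."""
    return "[" + re.escape(equivalents(base)) + "]"


# Rough counterpart of the regex H[[=o=]] Ch[[=i=]] Minh
pattern = re.compile("H" + char_class("o") + " Ch" + char_class("i") + " Minh")

for candidate in ("Ho Chi Minh", "H\u1ed3 Ch\u00ed Minh", "Ha Chi Minh"):
    print(candidate, "->", bool(pattern.fullmatch(candidate)))
```

The first two candidates are matched and the third one is rejected. Note that the text to search should be in composed (NFC) form, because a letter followed by separate combining marks is not covered by a single character class.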
Some other examples :
The regex \t[[=a=]]\t, against the list below, would match all the characters equivalent to the letter a, whatever their case, accentuation, size and other specifics. That is to say, 69 characters !

0041 A ; Upper_Letter # Lu LATIN CAPITAL LETTER A
0061 a ; Lower_Letter # Ll LATIN SMALL LETTER A
00AA ª ; Other_Letter # Lo FEMININE ORDINAL INDICATOR
00C0 À ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH GRAVE
00C1 Á ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH ACUTE
00C2 Â ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH CIRCUMFLEX
00C3 Ã ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH TILDE
00C4 Ä ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH DIAERESIS
00C5 Å ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH RING ABOVE
00E0 à ; Lower_Letter # Ll LATIN SMALL LETTER A WITH GRAVE
00E1 á ; Lower_Letter # Ll LATIN SMALL LETTER A WITH ACUTE
00E2 â ; Lower_Letter # Ll LATIN SMALL LETTER A WITH CIRCUMFLEX
00E3 ã ; Lower_Letter # Ll LATIN SMALL LETTER A WITH TILDE
00E4 ä ; Lower_Letter # Ll LATIN SMALL LETTER A WITH DIAERESIS
00E5 å ; Lower_Letter # Ll LATIN SMALL LETTER A WITH RING ABOVE
0100 Ā ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH MACRON
0101 ā ; Lower_Letter # Ll LATIN SMALL LETTER A WITH MACRON
0102 Ă ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH BREVE
0103 ă ; Lower_Letter # Ll LATIN SMALL LETTER A WITH BREVE
0104 Ą ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH OGONEK
0105 ą ; Lower_Letter # Ll LATIN SMALL LETTER A WITH OGONEK
01CD Ǎ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH CARON
01CE ǎ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH CARON
01DE Ǟ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON
01DF ǟ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
01E0 Ǡ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON
01E1 ǡ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON
01FA Ǻ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE
01FB ǻ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE
0200 Ȁ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH DOUBLE GRAVE
0201 ȁ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH DOUBLE GRAVE
0202 Ȃ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH INVERTED BREVE
0203 ȃ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH INVERTED BREVE
0250 ɐ ; Lower_Letter # Ll LATIN SMALL LETTER TURNED A
0251 ɑ ; Lower_Letter # Ll LATIN SMALL LETTER ALPHA
0252 ɒ ; Lower_Letter # Ll LATIN SMALL LETTER TURNED ALPHA
1E00 Ḁ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH RING BELOW
1E01 ḁ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH RING BELOW
1E9A ẚ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH RIGHT HALF RING
1EA0 Ạ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH DOT BELOW
1EA1 ạ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH DOT BELOW
1EA2 Ả ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH HOOK ABOVE
1EA3 ả ; Lower_Letter # Ll LATIN SMALL LETTER A WITH HOOK ABOVE
1EA4 Ấ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE
1EA5 ấ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
1EA6 Ầ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND GRAVE
1EA7 ầ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE
1EA8 Ẩ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
1EA9 ẩ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
1EAA Ẫ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND TILDE
1EAB ẫ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE
1EAC Ậ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW
1EAD ậ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
1EAE Ắ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH BREVE AND ACUTE
1EAF ắ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH BREVE AND ACUTE
1EB0 Ằ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH BREVE AND GRAVE
1EB1 ằ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH BREVE AND GRAVE
1EB2 Ẳ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH BREVE AND HOOK ABOVE
1EB3 ẳ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE
1EB4 Ẵ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH BREVE AND TILDE
1EB5 ẵ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH BREVE AND TILDE
1EB6 Ặ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW
1EB7 ặ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH BREVE AND DOT BELOW
212B Å ; Upper_Letter # Lu ANGSTROM SIGN
249C ⒜ ; Other_Symbol # So PARENTHESIZED LATIN SMALL LETTER A
24B6 Ⓐ ; Other_Alpha. # So CIRCLED LATIN CAPITAL LETTER A
24D0 ⓐ ; Other_Alpha. # So CIRCLED LATIN SMALL LETTER A
FF21 Ａ ; Upper_Letter # Lu FULLWIDTH LATIN CAPITAL LETTER A
FF41 ａ ; Lower_Letter # Ll FULLWIDTH LATIN SMALL LETTER A

In the same way, the regex \t[[===]]\t matches all the characters equivalent to the equals sign = :

003D  Sm  =  Basic Latin                    EQUALS SIGN
207C  Sm  ⁼  Superscripts and Subscripts    SUPERSCRIPT EQUALS SIGN
208C  Sm  ₌  Superscripts and Subscripts    SUBSCRIPT EQUALS SIGN
229C  Sm  ⊜  Mathematical Operators         CIRCLED EQUALS
FF1D  Sm  ＝  Halfwidth and Fullwidth Forms  FULLWIDTH EQUALS SIGN

And the regex \t[[=Q=]]\t matches all the characters equivalent to the letter Q :

0051  Lu  Q  Basic Latin                    LATIN CAPITAL LETTER Q
0071  Ll  q  Basic Latin                    LATIN SMALL LETTER Q
02A0  Ll  ʠ  IPA Extensions                 LATIN SMALL LETTER Q WITH HOOK
211A  Lu  ℚ  Letterlike Symbols             DOUBLE-STRUCK CAPITAL Q
24AC  So  ⒬  Enclosed Alphanumerics         PARENTHESIZED LATIN SMALL LETTER Q
24C6  So  Ⓠ  Enclosed Alphanumerics         CIRCLED LATIN CAPITAL LETTER Q
24E0  So  ⓠ  Enclosed Alphanumerics         CIRCLED LATIN SMALL LETTER Q
FF31  Lu  Ｑ  Halfwidth and Fullwidth Forms  FULLWIDTH LATIN CAPITAL LETTER Q
FF51  Ll  ｑ  Halfwidth and Fullwidth Forms  FULLWIDTH LATIN SMALL LETTER Q

Best Regards,
guy038