Ignore diacritics in search
-
Is it possible to search for text that matches with the only difference being diacritics? E.g., a search for “Ho Chi Minh” should include “Hồ Chí Minh” in the results.
-
As far as I know, only with a regex that lists all the alternatives, but I assume this is not the answer you are looking for.
-
Hello, @steven123, @ekopalypse and All,
@steven123, I’m pleased to tell you that your request can be solved easily by using a POSIX equivalence class (loosely, a collating sequence) inside a character class / bracket expression. It defines the list of characters that are equivalent to the character named in it.
The regex syntax of a POSIX equivalence class is [=<Char>=]. Note that it must itself be placed inside a normal character class [....] or a negated character class [^....] in order to match all the characters equivalent to the character Char.
For additional information, refer to:
https://www.regular-expressions.info/posixbrackets.html#eq
For instance, the regex (?-i)[A-E[=0=]5-9] would match an uppercase letter from A to E, a digit from 5 to 9, the digit 0, or any of the 4 characters equivalent to the digit 0 in the list below:
0030  Nd  0  Basic Latin                    DIGIT ZERO
2070  No  ⁰  Superscripts and Subscripts    SUPERSCRIPT ZERO
2080  No  ₀  Superscripts and Subscripts    SUBSCRIPT ZERO
24EA  No  ⓪  Enclosed Alphanumerics         CIRCLED DIGIT ZERO
FF10  Nd  ０  Halfwidth and Fullwidth Forms  FULLWIDTH DIGIT ZERO
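As a rough cross-check outside the regex engine, the list above can be approximated with Python's standard unicodedata module. This is only a sketch under an assumption of mine: a character counts as equivalent when its compatibility decomposition (NFKD) is the base character followed only by combining marks. The regex engine discussed here may define equivalence differently, so the two lists can differ. The helper name equivalents is hypothetical.

```python
import unicodedata


def equivalents(base, limit=0x10000):
    """Return characters below `limit` whose NFKD form is `base` plus combining marks."""
    found = []
    for code_point in range(limit):
        if 0xD800 <= code_point <= 0xDFFF:
            continue  # skip surrogate code points, they are not characters
        char = chr(code_point)
        decomposed = unicodedata.normalize("NFKD", char)
        if decomposed[0] != base:
            continue
        # Assumption: everything after the base must be a combining mark.
        if all(unicodedata.combining(mark) for mark in decomposed[1:]):
            found.append(char)
    return found


for char in equivalents("0"):
    name = unicodedata.name(char, "<unnamed>")
    print(f"{ord(char):04X} {unicodedata.category(char)} {char} {name}")
```

Calling equivalents("o") in the same way collects accented forms such as ồ, which is the kind of set that a hand-written "all the alternatives" character class would have to spell out.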
So, in order to match either the string Ho Chi Minh or the string Hồ Chí Minh, simply use the regex:
H[[=o=]] Ch[[=i=]] Minh
with the equivalence classes [=o=] and [=i=] each embedded in a character class.
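Writing the brackets by hand gets tedious for longer queries. The following Python sketch builds such a pattern from a plain query. The function name to_equivalence_regex and the default set of letters to wrap are my own assumptions, and the query is assumed to contain no regex metacharacters. Python's own re module does not support [=x=], so the output is meant to be pasted into a search field whose engine does.

```python
def to_equivalence_regex(query, letters="aeiouy"):
    """Wrap each chosen letter of `query` in a bracketed equivalence class."""
    parts = []
    for char in query:
        if char.lower() in letters:
            parts.append(f"[[={char}=]]")
        else:
            # Assumption: any other character is a plain literal.
            parts.append(char)
    return "".join(parts)


print(to_equivalence_regex("Ho Chi Minh"))
# H[[=o=]] Ch[[=i=]] M[[=i=]]nh
```

This also wraps the i of Minh, which only widens the match compared with the hand-written regex.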
Some other examples:
- The regex \t[[=a=]]\t, run against the list below, matches all the characters equivalent to the letter a, whatever their case, accentuation, size or other particulars. That is, 69 characters:
  0041 A ; Upper_Letter # Lu LATIN CAPITAL LETTER A
  0061 a ; Lower_Letter # Ll LATIN SMALL LETTER A
  00AA ª ; Other_Letter # Lo FEMININE ORDINAL INDICATOR
  00C0 À ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH GRAVE
  00C1 Á ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH ACUTE
  00C2 Â ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH CIRCUMFLEX
  00C3 Ã ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH TILDE
  00C4 Ä ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH DIAERESIS
  00C5 Å ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH RING ABOVE
  00E0 à ; Lower_Letter # Ll LATIN SMALL LETTER A WITH GRAVE
  00E1 á ; Lower_Letter # Ll LATIN SMALL LETTER A WITH ACUTE
  00E2 â ; Lower_Letter # Ll LATIN SMALL LETTER A WITH CIRCUMFLEX
  00E3 ã ; Lower_Letter # Ll LATIN SMALL LETTER A WITH TILDE
  00E4 ä ; Lower_Letter # Ll LATIN SMALL LETTER A WITH DIAERESIS
  00E5 å ; Lower_Letter # Ll LATIN SMALL LETTER A WITH RING ABOVE
  0100 Ā ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH MACRON
  0101 ā ; Lower_Letter # Ll LATIN SMALL LETTER A WITH MACRON
  0102 Ă ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH BREVE
  0103 ă ; Lower_Letter # Ll LATIN SMALL LETTER A WITH BREVE
  0104 Ą ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH OGONEK
  0105 ą ; Lower_Letter # Ll LATIN SMALL LETTER A WITH OGONEK
  01CD Ǎ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH CARON
  01CE ǎ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH CARON
  01DE Ǟ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON
  01DF ǟ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
  01E0 Ǡ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON
  01E1 ǡ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON
  01FA Ǻ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE
  01FB ǻ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE
  0200 Ȁ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH DOUBLE GRAVE
  0201 ȁ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH DOUBLE GRAVE
  0202 Ȃ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH INVERTED BREVE
  0203 ȃ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH INVERTED BREVE
  0250 ɐ ; Lower_Letter # Ll LATIN SMALL LETTER TURNED A
  0251 ɑ ; Lower_Letter # Ll LATIN SMALL LETTER ALPHA
  0252 ɒ ; Lower_Letter # Ll LATIN SMALL LETTER TURNED ALPHA
  1E00 Ḁ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH RING BELOW
  1E01 ḁ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH RING BELOW
  1E9A ẚ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH RIGHT HALF RING
  1EA0 Ạ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH DOT BELOW
  1EA1 ạ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH DOT BELOW
  1EA2 Ả ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH HOOK ABOVE
  1EA3 ả ; Lower_Letter # Ll LATIN SMALL LETTER A WITH HOOK ABOVE
  1EA4 Ấ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE
  1EA5 ấ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
  1EA6 Ầ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND GRAVE
  1EA7 ầ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE
  1EA8 Ẩ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
  1EA9 ẩ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
  1EAA Ẫ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND TILDE
  1EAB ẫ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE
  1EAC Ậ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW
  1EAD ậ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
  1EAE Ắ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH BREVE AND ACUTE
  1EAF ắ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH BREVE AND ACUTE
  1EB0 Ằ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH BREVE AND GRAVE
  1EB1 ằ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH BREVE AND GRAVE
  1EB2 Ẳ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH BREVE AND HOOK ABOVE
  1EB3 ẳ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE
  1EB4 Ẵ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH BREVE AND TILDE
  1EB5 ẵ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH BREVE AND TILDE
  1EB6 Ặ ; Upper_Letter # Lu LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW
  1EB7 ặ ; Lower_Letter # Ll LATIN SMALL LETTER A WITH BREVE AND DOT BELOW
  212B Å ; Upper_Letter # Lu ANGSTROM SIGN
  249C ⒜ ; Other_Symbol # So PARENTHESIZED LATIN SMALL LETTER A
  24B6 Ⓐ ; Other_Alpha. # So CIRCLED LATIN CAPITAL LETTER A
  24D0 ⓐ ; Other_Alpha. # So CIRCLED LATIN SMALL LETTER A
  FF21 Ａ ; Upper_Letter # Lu FULLWIDTH LATIN CAPITAL LETTER A
  FF41 ａ ; Lower_Letter # Ll FULLWIDTH LATIN SMALL LETTER A
- In the same way, the regex \t[[===]]\t matches all the characters equivalent to the equals sign =:
  003D  Sm  =  Basic Latin                    EQUALS SIGN
  207C  Sm  ⁼  Superscripts and Subscripts    SUPERSCRIPT EQUALS SIGN
  208C  Sm  ₌  Superscripts and Subscripts    SUBSCRIPT EQUALS SIGN
  229C  Sm  ⊜  Mathematical Operators         CIRCLED EQUALS
  FF1D  Sm  ＝  Halfwidth and Fullwidth Forms  FULLWIDTH EQUALS SIGN
- And the regex \t[[=Q=]]\t matches all the characters equivalent to the letter Q:
  0051  Lu  Q  Basic Latin                    LATIN CAPITAL LETTER Q
  0071  Ll  q  Basic Latin                    LATIN SMALL LETTER Q
  02A0  Ll  ʠ  IPA Extensions                 LATIN SMALL LETTER Q WITH HOOK
  211A  Lu  ℚ  Letterlike Symbols             DOUBLE-STRUCK CAPITAL Q
  24AC  So  ⒬  Enclosed Alphanumerics         PARENTHESIZED LATIN SMALL LETTER Q
  24C6  So  Ⓠ  Enclosed Alphanumerics         CIRCLED LATIN CAPITAL LETTER Q
  24E0  So  ⓠ  Enclosed Alphanumerics         CIRCLED LATIN SMALL LETTER Q
  FF31  Lu  Ｑ  Halfwidth and Fullwidth Forms  FULLWIDTH LATIN CAPITAL LETTER Q
  FF51  Ll  ｑ  Halfwidth and Fullwidth Forms  FULLWIDTH LATIN SMALL LETTER Q
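The rows in these lists give a code point, a general category, the character itself and its Unicode name. Here is a small Python sketch, using only the standard unicodedata module, that prints rows in a similar shape for any characters you want to inspect. The helper name describe is hypothetical, and the Unicode block column is left out because unicodedata does not expose block names.

```python
import unicodedata


def describe(char):
    """Format one character as 'CODE CATEGORY CHAR NAME'."""
    name = unicodedata.name(char, "<unnamed>")
    return f"{ord(char):04X} {unicodedata.category(char)} {char} {name}"


# A few of the Q variants from the list above.
for char in "Qq\u02A0\u211A":
    print(describe(char))
```

This is handy for checking which of two visually identical characters, such as a plain Q and a fullwidth Q, actually appears in a file.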
Best Regards,
guy038
- The regex