Ignore diacritics in search



  • Is it possible to search for text that matches with the only difference being diacritics? E.g., a search for “Ho Chi Minh” should include “Hồ Chí Minh” in the results.



  • As far as I know, only with a regex that lists all the alternatives,
    but I assume this is not the answer you want to get.



  • Hello, @steven123, @ekopalypse and All,

    @steven123, I’m pleased to tell you that your request can be solved easily by using a POSIX equivalence class inside a character class / bracket expression. An equivalence class stands for the list of all characters that are equivalent to the character it names.

    The regex syntax of a POSIX equivalence class is [=<Char>=]. Note that it must itself be placed inside a normal character class [....] or a negative character class [^....] in order to match, or exclude, all the characters equivalent to the character Char.

    For additional information, refer to :

    https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.equivalence_classes

    https://www.regular-expressions.info/posixbrackets.html#eq


    For instance, the regex (?-i)[A-E[=0=]5-9] would match an uppercase letter from A to E, a digit from 5 to 9, the digit 0, or any of the 4 other characters equivalent to the digit 0 in the list below :

     0030   Nd	0	Basic Latin                                       DIGIT ZERO
     2070   No	⁰	Superscripts and Subscripts                       SUPERSCRIPT ZERO
     2080   No	₀	Superscripts and Subscripts                       SUBSCRIPT ZERO
     24EA   No	⓪	Enclosed Alphanumerics                            CIRCLED DIGIT ZERO
     FF10   Nd	０	Halfwidth and Fullwidth Forms                     FULLWIDTH DIGIT ZERO
    

    So, to match both the string Ho Chi Minh and the string Hồ Chí Minh, simply use the regex :

    H[[=o=]] Ch[[=i=]] Minh, with the equivalence classes [=o=] and [=i=] each embedded in a character class
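
    Writing these brackets by hand gets tedious for longer search strings. As a minimal sketch, the small Python helper below builds such a regex from a plain query. The function name to_equivalence_regex and its only parameter are hypothetical, not part of Notepad++ or Boost; the script only generates the text to paste into the Find what field with Regular expression mode selected.

```python
from typing import Optional

# Characters escaped with a backslash when they appear literally in the query.
REGEX_META = set(r"\.^$|?*+()[]{}")


def to_equivalence_regex(query: str, only: Optional[str] = None) -> str:
    """Hypothetical helper: wrap letters/digits of `query` in [[=x=]].

    If `only` is given, just the characters listed in it are wrapped;
    every other letter or digit is kept as a literal.
    """
    parts = []
    for ch in query:
        if ch.isalnum() and (only is None or ch in only):
            parts.append(f"[[={ch}=]]")   # equivalence class inside a bracket expression
        elif ch in REGEX_META:
            parts.append("\\" + ch)       # escape regex metacharacters
        else:
            parts.append(ch)              # spaces and other literals pass through
    return "".join(parts)


# Wrap only the vowels that may carry diacritics in this query.
print(to_equivalence_regex("Ho Chi Minh", only="oi"))
```

    Note that wrapping every letter (the default when only is omitted) also makes the search ignore case, since an equivalence class covers both the uppercase and lowercase forms.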


    Some other examples :

    • The regex \t[[=a=]]\t, run against the list below, matches every character equivalent to the letter a, regardless of case, accent, width or other variations. That is to say, 69 characters !
     0041	A	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A
     0061	a	  ; Lower_Letter # Ll         LATIN SMALL LETTER A
     00AA	ª	  ; Other_Letter # Lo         FEMININE ORDINAL INDICATOR
     00C0	À	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH GRAVE
     00C1	Á	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH ACUTE
     00C2	Â	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CIRCUMFLEX
     00C3	Ã	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH TILDE
     00C4	Ä	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH DIAERESIS
     00C5	Å	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH RING ABOVE
     00E0	à	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH GRAVE
     00E1	á	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH ACUTE
     00E2	â	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CIRCUMFLEX
     00E3	ã	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH TILDE
     00E4	ä	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH DIAERESIS
     00E5	å	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH RING ABOVE
     0100	Ā	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH MACRON
     0101	ā	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH MACRON
     0102	Ă	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH BREVE
     0103	ă	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH BREVE
     0104	Ą	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH OGONEK
     0105	ą	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH OGONEK
     01CD	Ǎ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CARON
     01CE	ǎ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CARON
     01DE	Ǟ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON
     01DF	ǟ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
     01E0	Ǡ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON
     01E1	ǡ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON
     01FA	Ǻ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE
     01FB	ǻ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE
     0200	Ȁ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH DOUBLE GRAVE
     0201	ȁ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH DOUBLE GRAVE
     0202	Ȃ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH INVERTED BREVE
     0203	ȃ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH INVERTED BREVE
     0250	ɐ	  ; Lower_Letter # Ll         LATIN SMALL LETTER TURNED A
     0251	ɑ	  ; Lower_Letter # Ll         LATIN SMALL LETTER ALPHA
     0252	ɒ	  ; Lower_Letter # Ll         LATIN SMALL LETTER TURNED ALPHA
     1E00	Ḁ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH RING BELOW
     1E01	ḁ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH RING BELOW
     1E9A	ẚ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH RIGHT HALF RING
     1EA0	Ạ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH DOT BELOW
     1EA1	ạ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH DOT BELOW
     1EA2	Ả	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH HOOK ABOVE
     1EA3	ả	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH HOOK ABOVE
     1EA4	Ấ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE
     1EA5	ấ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
     1EA6	Ầ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND GRAVE
     1EA7	ầ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE
     1EA8	Ẩ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
     1EA9	ẩ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
     1EAA	Ẫ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND TILDE
     1EAB	ẫ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE
     1EAC	Ậ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW
     1EAD	ậ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
     1EAE	Ắ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH BREVE AND ACUTE
     1EAF	ắ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH BREVE AND ACUTE
     1EB0	Ằ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH BREVE AND GRAVE
     1EB1	ằ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH BREVE AND GRAVE
     1EB2	Ẳ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH BREVE AND HOOK ABOVE
     1EB3	ẳ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE
     1EB4	Ẵ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH BREVE AND TILDE
     1EB5	ẵ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH BREVE AND TILDE
     1EB6	Ặ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW
     1EB7	ặ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH BREVE AND DOT BELOW
     212B	Å	  ; Upper_Letter # Lu         ANGSTROM SIGN
     249C	⒜	  ; Other_Symbol # So         PARENTHESIZED LATIN SMALL LETTER A
     24B6	Ⓐ	  ; Other_Alpha. # So         CIRCLED LATIN CAPITAL LETTER A
     24D0	ⓐ	  ; Other_Alpha. # So         CIRCLED LATIN SMALL LETTER A
     FF21	Ａ	  ; Upper_Letter # Lu         FULLWIDTH LATIN CAPITAL LETTER A
     FF41	ａ	  ; Lower_Letter # Ll         FULLWIDTH LATIN SMALL LETTER A
    
    • In the same way, the regex \t[[===]]\t matches all the characters equivalent to the equals sign = :
     003D   Sm	=	Basic Latin                                       EQUALS SIGN
     207C   Sm	⁼	Superscripts and Subscripts                       SUPERSCRIPT EQUALS SIGN
     208C   Sm	₌	Superscripts and Subscripts                       SUBSCRIPT EQUALS SIGN
     229C   Sm	⊜	Mathematical Operators                            CIRCLED EQUALS
     FF1D   Sm	＝	Halfwidth and Fullwidth Forms                     FULLWIDTH EQUALS SIGN
    
    • And the regex \t[[=Q=]]\t matches all the characters equivalent to the letter Q :
     0051   Lu	Q	Basic Latin                                       LATIN CAPITAL LETTER Q
     0071   Ll	q	Basic Latin                                       LATIN SMALL LETTER Q
     02A0   Ll	ʠ	IPA Extensions                                    LATIN SMALL LETTER Q WITH HOOK
     211A   Lu	ℚ	Letterlike Symbols                                DOUBLE-STRUCK CAPITAL Q
     24AC   So	⒬	Enclosed Alphanumerics                            PARENTHESIZED LATIN SMALL LETTER Q
     24C6   So	Ⓠ	Enclosed Alphanumerics                            CIRCLED LATIN CAPITAL LETTER Q
     24E0   So	ⓠ	Enclosed Alphanumerics                            CIRCLED LATIN SMALL LETTER Q
     FF31   Lu	Ｑ	Halfwidth and Fullwidth Forms                     FULLWIDTH LATIN CAPITAL LETTER Q
     FF51   Ll	ｑ	Halfwidth and Fullwidth Forms                     FULLWIDTH LATIN SMALL LETTER Q
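
    For readers who need a similar diacritic-insensitive comparison in a script rather than in the editor, here is a rough sketch in Python. It is a different technique from the equivalence classes above: it relies on Unicode NFKD decomposition from the standard unicodedata module, and the function names fold and contains_ignoring_diacritics are made up for illustration. It will not reproduce the lists above exactly; for example, characters with no Unicode decomposition, such as ɐ or ʠ, are left unchanged.

```python
import unicodedata


def fold(text: str) -> str:
    """Approximate folding: decompose (NFKD), drop combining marks, ignore case."""
    decomposed = unicodedata.normalize("NFKD", text)
    # combining() returns a non-zero class for accents and other combining marks
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


def contains_ignoring_diacritics(haystack: str, needle: str) -> bool:
    """True if `needle` occurs in `haystack` once both are folded."""
    return fold(needle) in fold(haystack)


print(contains_ignoring_diacritics("Hồ Chí Minh", "Ho Chi Minh"))
```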
    

    Best Regards,

    guy038

