    Ignore diacritics in search

    • Steven123

      Is it possible to search for text that matches with the only difference being diacritics? E.g., a search for “Ho Chi Minh” should include “Hồ Chí Minh” in the results.

      • Ekopalypse

        As far as I know, only with a regex that lists all the alternatives,
        but I assume this is not the answer you want.
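
To illustrate what "a regex that lists all the alternatives" could look like, here is a minimal Python sketch. It is not part of Notepad++. The helper name `alternatives_class` is hypothetical, and it approximates "equivalent" characters through Unicode NFKD decomposition, which is an assumption and may not coincide with what a given regex engine treats as equivalent.

```python
import re
import unicodedata


def alternatives_class(base: str) -> str:
    """Build a bracket expression listing BMP characters that reduce to `base`.

    "Reduce" here means: NFKD-decompose, drop combining marks, then compare
    case-insensitively. This is an approximation chosen for the sketch.
    """
    members = []
    for cp in range(0x10000):
        if 0xD800 <= cp <= 0xDFFF:  # skip surrogate code points
            continue
        ch = chr(cp)
        decomposed = unicodedata.normalize("NFKD", ch)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        if stripped.lower() == base.lower():
            members.append(re.escape(ch))
    return "[" + "".join(members) + "]"


# Spell out every alternative for "o" and "i" inside the search string.
pattern = re.compile(
    "H" + alternatives_class("o") + " Ch" + alternatives_class("i") + " Minh"
)

print(bool(pattern.fullmatch("Ho Chi Minh")))
print(bool(pattern.fullmatch("H\u1ed3 Ch\u00ed Minh")))  # "Hồ Chí Minh"
```

The generated classes are long, which is presumably why this route is unattractive for typing by hand.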

        • guy038

          Hello, @steven123, @ekopalypse and All,

          @steven123, I’m pleased to tell you that your request can be solved easily by using a POSIX equivalence class inside a character class / bracket expression. An equivalence class stands for the list of all characters that are equivalent to the character it names.

          The regex syntax of a POSIX equivalence class is [=<Char>=]. Note that it must itself be placed inside a normal character class [....] or a negative character class [^....] in order to match all the characters equivalent to Char.

          For additional information, refer to:

          https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.equivalence_classes

          https://www.regular-expressions.info/posixbrackets.html#eq


          For instance, the regex (?-i)[A-E[=0=]5-9] matches an uppercase letter from A to E, a digit from 5 to 9, the digit 0, or any of the 4 other characters equivalent to the digit 0 in the list below:

           0030   Nd	0	Basic Latin                                       DIGIT ZERO
           2070   No	⁰	Superscripts and Subscripts                       SUPERSCRIPT ZERO
           2080   No	₀	Superscripts and Subscripts                       SUBSCRIPT ZERO
           24EA   No	⓪	Enclosed Alphanumerics                            CIRCLED DIGIT ZERO
           FF10   Nd	０	Halfwidth and Fullwidth Forms                     FULLWIDTH DIGIT ZERO
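
As a rough cross-check outside Notepad++, the same set can be written out explicitly. The sketch below uses Python, whose `re` module has no `[=0=]` syntax, so the five zero characters from the table are listed by code point. The names `ZERO_LIKE` and `matches` are made up for this illustration.

```python
import re

# Explicit stand-in for (?-i)[A-E[=0=]5-9]: the five "zero" characters from
# the table above are spelled out, because Python's re lacks [=0=].
ZERO_LIKE = "0\u2070\u2080\u24ea\uff10"
pattern = re.compile("[A-E" + ZERO_LIKE + "5-9]")


def matches(ch: str) -> bool:
    """Return True if the single character `ch` is in the explicit class."""
    return pattern.fullmatch(ch) is not None


for sample in ["A", "\u2070", "\u24ea", "7", "F", "3", "a"]:
    print(repr(sample), matches(sample))
```

No case-insensitive flag is set, so lowercase a to e stay outside the class, mirroring the (?-i) modifier.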
          

          So, in order to match either the string Ho Chi Minh or the string Hồ Chí Minh, simply use the regex:

          H[[=o=]] Ch[[=i=]] Minh, with the equivalence classes [=o=] and [=i=] each embedded in a character class
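
For readers who need the same effect in a script rather than in the Find dialog, a different technique gives a comparable result: decompose the text with Unicode NFD and drop the combining marks before comparing. The Python sketch below is only an analogy, not what the regex engine does internally, and the function names are hypothetical.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Decompose to NFD and drop nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def contains_ignoring_diacritics(needle: str, haystack: str) -> bool:
    """Substring test after folding diacritics and case on both sides."""
    return fold_diacritics(needle).casefold() in fold_diacritics(haystack).casefold()


accented = "H\u1ed3 Ch\u00ed Minh"  # "Hồ Chí Minh"
print(fold_diacritics(accented))
print(contains_ignoring_diacritics("Ho Chi Minh", "Born as " + accented))
```

Letters that have no canonical decomposition, such as đ, are left unchanged by this approach.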


          Some other examples:

          • The regex \t[[=a=]]\t, run against the list below, matches every character equivalent to the letter a, regardless of case, accent, width, or other variations. That is 69 characters in total:
           0041	A	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A
           0061	a	  ; Lower_Letter # Ll         LATIN SMALL LETTER A
           00AA	ª	  ; Other_Letter # Lo         FEMININE ORDINAL INDICATOR
           00C0	À	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH GRAVE
           00C1	Á	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH ACUTE
           00C2	Â	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CIRCUMFLEX
           00C3	Ã	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH TILDE
           00C4	Ä	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH DIAERESIS
           00C5	Å	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH RING ABOVE
           00E0	à	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH GRAVE
           00E1	á	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH ACUTE
           00E2	â	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CIRCUMFLEX
           00E3	ã	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH TILDE
           00E4	ä	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH DIAERESIS
           00E5	å	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH RING ABOVE
           0100	Ā	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH MACRON
           0101	ā	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH MACRON
           0102	Ă	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH BREVE
           0103	ă	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH BREVE
           0104	Ą	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH OGONEK
           0105	ą	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH OGONEK
           01CD	Ǎ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CARON
           01CE	ǎ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CARON
           01DE	Ǟ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON
           01DF	ǟ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
           01E0	Ǡ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON
           01E1	ǡ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON
           01FA	Ǻ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE
           01FB	ǻ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE
           0200	Ȁ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH DOUBLE GRAVE
           0201	ȁ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH DOUBLE GRAVE
           0202	Ȃ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH INVERTED BREVE
           0203	ȃ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH INVERTED BREVE
           0250	ɐ	  ; Lower_Letter # Ll         LATIN SMALL LETTER TURNED A
           0251	ɑ	  ; Lower_Letter # Ll         LATIN SMALL LETTER ALPHA
           0252	ɒ	  ; Lower_Letter # Ll         LATIN SMALL LETTER TURNED ALPHA
           1E00	Ḁ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH RING BELOW
           1E01	ḁ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH RING BELOW
           1E9A	ẚ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH RIGHT HALF RING
           1EA0	Ạ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH DOT BELOW
           1EA1	ạ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH DOT BELOW
           1EA2	Ả	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH HOOK ABOVE
           1EA3	ả	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH HOOK ABOVE
           1EA4	Ấ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE
           1EA5	ấ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
           1EA6	Ầ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND GRAVE
           1EA7	ầ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE
           1EA8	Ẩ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
           1EA9	ẩ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
           1EAA	Ẫ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND TILDE
           1EAB	ẫ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE
           1EAC	Ậ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW
           1EAD	ậ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
           1EAE	Ắ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH BREVE AND ACUTE
           1EAF	ắ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH BREVE AND ACUTE
           1EB0	Ằ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH BREVE AND GRAVE
           1EB1	ằ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH BREVE AND GRAVE
           1EB2	Ẳ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH BREVE AND HOOK ABOVE
           1EB3	ẳ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE
           1EB4	Ẵ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH BREVE AND TILDE
           1EB5	ẵ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH BREVE AND TILDE
           1EB6	Ặ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW
           1EB7	ặ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH BREVE AND DOT BELOW
           212B	Å	  ; Upper_Letter # Lu         ANGSTROM SIGN
           249C	⒜	  ; Other_Symbol # So         PARENTHESIZED LATIN SMALL LETTER A
           24B6	Ⓐ	  ; Other_Alpha. # So         CIRCLED LATIN CAPITAL LETTER A
           24D0	ⓐ	  ; Other_Alpha. # So         CIRCLED LATIN SMALL LETTER A
           FF21	Ａ	  ; Upper_Letter # Lu         FULLWIDTH LATIN CAPITAL LETTER A
           FF41	ａ	  ; Lower_Letter # Ll         FULLWIDTH LATIN SMALL LETTER A
          
          • In the same way, the regex \t[[===]]\t matches all the characters equivalent to the equals sign =:
           003D   Sm	=	Basic Latin                                       EQUALS SIGN
           207C   Sm	⁼	Superscripts and Subscripts                       SUPERSCRIPT EQUALS SIGN
           208C   Sm	₌	Superscripts and Subscripts                       SUBSCRIPT EQUALS SIGN
           229C   Sm	⊜	Mathematical Operators                            CIRCLED EQUALS
           FF1D   Sm	＝	Halfwidth and Fullwidth Forms                     FULLWIDTH EQUALS SIGN
          
          • And the regex \t[[=Q=]]\t matches all the characters equivalent to the letter Q:
           0051   Lu	Q	Basic Latin                                       LATIN CAPITAL LETTER Q
           0071   Ll	q	Basic Latin                                       LATIN SMALL LETTER Q
           02A0   Ll	ʠ	IPA Extensions                                    LATIN SMALL LETTER Q WITH HOOK
           211A   Lu	ℚ	Letterlike Symbols                                DOUBLE-STRUCK CAPITAL Q
           24AC   So	⒬	Enclosed Alphanumerics                            PARENTHESIZED LATIN SMALL LETTER Q
           24C6   So	Ⓠ	Enclosed Alphanumerics                            CIRCLED LATIN CAPITAL LETTER Q
           24E0   So	ⓠ	Enclosed Alphanumerics                            CIRCLED LATIN SMALL LETTER Q
           FF31   Lu	Ｑ	Halfwidth and Fullwidth Forms                     FULLWIDTH LATIN CAPITAL LETTER Q
           FF51   Ll	ｑ	Halfwidth and Fullwidth Forms                     FULLWIDTH LATIN SMALL LETTER Q
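
Listings like the ones above can be approximated from the Unicode database. The Python sketch below is an assumption-laden illustration: it treats a character as "equivalent" when its NFKD decomposition, minus combining marks, equals the base letter, ignoring case. That rule is not the one the regex engine uses, so the output will not reproduce the lists above exactly; for example, ʠ has no Unicode decomposition and is omitted. The function name `list_equivalents` is hypothetical.

```python
import unicodedata


def list_equivalents(base: str):
    """Return (code point, category, character, name) rows for BMP characters
    whose NFKD decomposition, minus combining marks, equals `base` (any case)."""
    rows = []
    for cp in range(0x10000):
        if 0xD800 <= cp <= 0xDFFF:  # skip surrogate code points
            continue
        ch = chr(cp)
        name = unicodedata.name(ch, "")
        if not name:  # unassigned or unnamed code point
            continue
        decomposed = unicodedata.normalize("NFKD", ch)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        if stripped.casefold() == base.casefold():
            rows.append((cp, unicodedata.category(ch), ch, name))
    return rows


# Print rows in a layout similar to the tables in this post.
for cp, cat, ch, name in list_equivalents("Q"):
    print(f"{cp:04X}   {cat}\t{ch}\t{name}")
```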
          

          Best Regards,

          guy038
