    Ignore diacritics in search

    • Steven123

      Is it possible to search for text that matches with the only difference being diacritics? E.g., a search for “Ho Chi Minh” should include “Hồ Chí Minh” in the results.

      • Ekopalypse

        Afaik only with a regex that has all the alternatives
        but I assume this is not the answer you want to get.
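A minimal sketch of what "a regex that has all the alternatives" could look like, written in Python for illustration. The `VARIANTS` table and both function names are hypothetical, and the table is hand-written and far from exhaustive; it only shows the idea of replacing each letter with a class of its accented variants.

```python
import re
import unicodedata

# Hypothetical, hand-written variant table (not exhaustive); extend as needed.
VARIANTS = {
    "o": "oòóôõöōŏőơồốổỗộ",
    "i": "iìíîïĩīĭįỉị",
}


def with_alternatives(text: str) -> str:
    """Turn each listed letter into a character class of its variants."""
    parts = []
    for ch in text:
        variants = VARIANTS.get(ch)
        if variants:
            # NFC keeps every variant as one precomposed code point.
            parts.append("[" + unicodedata.normalize("NFC", variants) + "]")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def search_with_alternatives(needle: str, haystack: str):
    """Search the haystack with the expanded pattern built from the needle."""
    haystack = unicodedata.normalize("NFC", haystack)
    return re.search(with_alternatives(needle), haystack)


print(with_alternatives("Ho Chi Minh"))
print(bool(search_with_alternatives("Ho Chi Minh", "H\u1ed3 Ch\u00ed Minh")))
```

The obvious drawback is that every letter needs its own hand-maintained list, which is why this approach scales poorly.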

        • guy038

          Hello, @steven123, @ekopalypse and All,

          @steven123, I’m pleased to tell you that your request can be solved easily with a POSIX equivalence class (often loosely called a collating sequence) placed inside a character class / bracket expression. It stands for the list of all characters equivalent to the character named in the class.

          The regex syntax of a POSIX equivalence class is [=<Char>=]. Note that it must itself be placed inside a normal character class [....] or a negated character class [^....] in order to match all the characters equivalent to Char!

          For additional information, refer to:

          https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.equivalence_classes

          https://www.regular-expressions.info/posixbrackets.html#eq


          For instance, the regex (?-i)[A-E[=0=]5-9] would match an uppercase letter from A to E, a digit from 5 to 9, the digit 0, or any of the 4 characters equivalent to the digit 0 in the list below:

           0030   Nd	0	Basic Latin                                       DIGIT ZERO
           2070   No	⁰	Superscripts and Subscripts                       SUPERSCRIPT ZERO
           2080   No	₀	Superscripts and Subscripts                       SUBSCRIPT ZERO
           24EA   No	⓪	Enclosed Alphanumerics                            CIRCLED DIGIT ZERO
           FF10   Nd	０	Halfwidth and Fullwidth Forms                     FULLWIDTH DIGIT ZERO
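For readers whose regex engine lacks the [=0=] syntax, here is a hedged Python sketch that spells the same class out by hand. Python's `re` module does not support POSIX equivalence classes, so the five code points from the list above are written as escapes; the name `ZERO_EQUIVALENTS` is only illustrative.

```python
import re

# The five code points from the list above (digit zero and its 4 equivalents).
ZERO_EQUIVALENTS = "\u0030\u2070\u2080\u24EA\uFF10"

# Hand-expanded stand-in for [A-E[=0=]5-9]; case-sensitive by default,
# which mirrors the (?-i) flag of the original regex.
pattern = re.compile("[A-E" + ZERO_EQUIVALENTS + "5-9]")

for sample in ["A", "0", "\u2070", "\uFF10", "7", "F", "3", "a"]:
    print(repr(sample), bool(pattern.fullmatch(sample)))
```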
          

          So, to match either the string Ho Chi Minh or the string Hồ Chí Minh, simply use the regex:

          H[[=o=]] Ch[[=i=]] Minh, with the equivalence classes [=o=] and [=i=] each embedded in a character class
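Outside Notepad++, a similar diacritic-insensitive search can be approximated with Unicode normalization rather than equivalence classes. The Python sketch below is a different technique from the regex above: it decomposes both strings and drops combining marks before comparing. The function names are hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def contains_ignoring_diacritics(haystack: str, needle: str) -> bool:
    """Plain substring test on the diacritic-free forms of both strings."""
    return strip_diacritics(needle) in strip_diacritics(haystack)


print(strip_diacritics("H\u1ed3 Ch\u00ed Minh"))
print(contains_ignoring_diacritics("born in H\u1ed3 Ch\u00ed Minh City", "Ho Chi Minh"))
```

Unlike the equivalence class, this keeps case distinctions and does not cover characters without a canonical decomposition, such as circled letters.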


          Some other examples :

          • The regex \t[[=a=]]\t, run against the list below, would match every character equivalent to the letter a, whatever its case, accentuation, size or other variation. That is 69 characters!
           0041	A	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A
           0061	a	  ; Lower_Letter # Ll         LATIN SMALL LETTER A
           00AA	ª	  ; Other_Letter # Lo         FEMININE ORDINAL INDICATOR
           00C0	À	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH GRAVE
           00C1	Á	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH ACUTE
           00C2	Â	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CIRCUMFLEX
           00C3	Ã	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH TILDE
           00C4	Ä	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH DIAERESIS
           00C5	Å	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH RING ABOVE
           00E0	à	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH GRAVE
           00E1	á	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH ACUTE
           00E2	â	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CIRCUMFLEX
           00E3	ã	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH TILDE
           00E4	ä	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH DIAERESIS
           00E5	å	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH RING ABOVE
           0100	Ā	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH MACRON
           0101	ā	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH MACRON
           0102	Ă	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH BREVE
           0103	ă	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH BREVE
           0104	Ą	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH OGONEK
           0105	ą	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH OGONEK
           01CD	Ǎ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CARON
           01CE	ǎ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CARON
           01DE	Ǟ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON
           01DF	ǟ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
           01E0	Ǡ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON
           01E1	ǡ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON
           01FA	Ǻ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE
           01FB	ǻ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE
           0200	Ȁ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH DOUBLE GRAVE
           0201	ȁ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH DOUBLE GRAVE
           0202	Ȃ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH INVERTED BREVE
           0203	ȃ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH INVERTED BREVE
           0250	ɐ	  ; Lower_Letter # Ll         LATIN SMALL LETTER TURNED A
           0251	ɑ	  ; Lower_Letter # Ll         LATIN SMALL LETTER ALPHA
           0252	ɒ	  ; Lower_Letter # Ll         LATIN SMALL LETTER TURNED ALPHA
           1E00	Ḁ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH RING BELOW
           1E01	ḁ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH RING BELOW
           1E9A	ẚ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH RIGHT HALF RING
           1EA0	Ạ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH DOT BELOW
           1EA1	ạ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH DOT BELOW
           1EA2	Ả	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH HOOK ABOVE
           1EA3	ả	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH HOOK ABOVE
           1EA4	Ấ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE
           1EA5	ấ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
           1EA6	Ầ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND GRAVE
           1EA7	ầ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE
           1EA8	Ẩ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
           1EA9	ẩ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
           1EAA	Ẫ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND TILDE
           1EAB	ẫ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE
           1EAC	Ậ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW
           1EAD	ậ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
           1EAE	Ắ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH BREVE AND ACUTE
           1EAF	ắ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH BREVE AND ACUTE
           1EB0	Ằ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH BREVE AND GRAVE
           1EB1	ằ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH BREVE AND GRAVE
           1EB2	Ẳ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH BREVE AND HOOK ABOVE
           1EB3	ẳ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE
           1EB4	Ẵ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH BREVE AND TILDE
           1EB5	ẵ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH BREVE AND TILDE
           1EB6	Ặ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW
           1EB7	ặ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH BREVE AND DOT BELOW
           212B	Å	  ; Upper_Letter # Lu         ANGSTROM SIGN
           249C	⒜	  ; Other_Symbol # So         PARENTHESIZED LATIN SMALL LETTER A
           24B6	Ⓐ	  ; Other_Alpha. # So         CIRCLED LATIN CAPITAL LETTER A
           24D0	ⓐ	  ; Other_Alpha. # So         CIRCLED LATIN SMALL LETTER A
           FF21	Ａ	  ; Upper_Letter # Lu         FULLWIDTH LATIN CAPITAL LETTER A
           FF41	ａ	  ; Lower_Letter # Ll         FULLWIDTH LATIN SMALL LETTER A
          
          • In the same way, the regex \t[[===]]\t matches all the characters equivalent to the equals sign =:
           003D   Sm	=	Basic Latin                                       EQUALS SIGN
           207C   Sm	⁼	Superscripts and Subscripts                       SUPERSCRIPT EQUALS SIGN
           208C   Sm	₌	Superscripts and Subscripts                       SUBSCRIPT EQUALS SIGN
           229C   Sm	⊜	Mathematical Operators                            CIRCLED EQUALS
           FF1D   Sm	＝	Halfwidth and Fullwidth Forms                     FULLWIDTH EQUALS SIGN
          
          • And the regex \t[[=Q=]]\t matches all the characters equivalent to the letter Q:
           0051   Lu	Q	Basic Latin                                       LATIN CAPITAL LETTER Q
           0071   Ll	q	Basic Latin                                       LATIN SMALL LETTER Q
           02A0   Ll	ʠ	IPA Extensions                                    LATIN SMALL LETTER Q WITH HOOK
           211A   Lu	ℚ	Letterlike Symbols                                DOUBLE-STRUCK CAPITAL Q
           24AC   So	⒬	Enclosed Alphanumerics                            PARENTHESIZED LATIN SMALL LETTER Q
           24C6   So	Ⓠ	Enclosed Alphanumerics                            CIRCLED LATIN CAPITAL LETTER Q
           24E0   So	ⓠ	Enclosed Alphanumerics                            CIRCLED LATIN SMALL LETTER Q
           FF31   Lu	Ｑ	Halfwidth and Fullwidth Forms                     FULLWIDTH LATIN CAPITAL LETTER Q
           FF51   Ll	ｑ	Halfwidth and Fullwidth Forms                     FULLWIDTH LATIN SMALL LETTER Q
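To get a rough feel for which characters belong to such a class, the following Python sketch tests whether a character reduces to a given base character under Unicode compatibility decomposition (NFKD) plus case folding. This is only an approximation and an assumption on my part, not how the regex engine builds its equivalence classes: entries from the lists above that have no decomposition, such as ɐ (0250) or ʠ (02A0), are not recognized by it. The name `folds_to` is hypothetical.

```python
import unicodedata


def folds_to(ch: str, base: str) -> bool:
    """True if ch reduces to base after NFKD, mark removal and case folding."""
    decomposed = unicodedata.normalize("NFKD", ch)
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return stripped.casefold() == base.casefold()


# A few code points taken from the three lists above.
for code_point, base in [(0x1EA4, "a"), (0xFF41, "a"), (0x207C, "="),
                         (0x211A, "q"), (0x0250, "a"), (0x02A0, "q")]:
    print(f"{code_point:04X}", base, folds_to(chr(code_point), base))
```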
          

          Best Regards,

          guy038

          The Community of users of the Notepad++ text editor.