Ignore diacritics in search

    Steven123
    last edited by Oct 24, 2020, 6:06 PM

    Is it possible to search for text that matches with the only difference being diacritics? E.g., a search for “Ho Chi Minh” should include “Hồ Chí Minh” in the results.

      Ekopalypse
      last edited by Oct 24, 2020, 9:05 PM

      Afaik only with a regex that has all the alternatives
      but I assume this is not the answer you want to get.
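      To give an idea of what "a regex that has all the alternatives" could look like, here is a minimal Python sketch. It assumes that "alternatives" means every precomposed character in the Basic Multilingual Plane whose canonical (NFD) decomposition is the base letter followed only by combining marks. The helper names `build_variant_table` and `pattern_with_alternatives` are made up for this illustration, and Python's `re` module is used only because it has no equivalence-class syntax of its own.

```python
import re
import unicodedata
from collections import defaultdict


def build_variant_table() -> dict:
    """Map each base character to the precomposed BMP characters built on it."""
    table = defaultdict(list)
    for code in range(0x10000):
        if 0xD800 <= code <= 0xDFFF:  # skip surrogate code points
            continue
        ch = chr(code)
        decomposed = unicodedata.normalize("NFD", ch)
        # Keep characters that decompose to "base + combining marks only".
        if len(decomposed) > 1 and all(
            unicodedata.combining(c) for c in decomposed[1:]
        ):
            table[decomposed[0]].append(ch)
    return {base: "".join(chars) for base, chars in table.items()}


def pattern_with_alternatives(text: str, table: dict) -> str:
    """Replace each character of `text` by a class listing it and its accented forms."""
    parts = []
    for ch in text:
        variants = table.get(ch, "")
        if variants:
            parts.append("[" + re.escape(ch + variants) + "]")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


table = build_variant_table()
pattern = pattern_with_alternatives("Ho Chi Minh", table)
print(bool(re.search(pattern, "H\u1ed3 Ch\u00ed Minh")))  # True
```

      The generated pattern gets long quickly, and it only covers precomposed characters: text stored as a plain letter followed by separate combining marks would not be matched by this sketch.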

        guy038
        last edited by guy038 Oct 25, 2020, 1:20 AM Oct 25, 2020, 1:18 AM

        Hello, @steven123, @ekopalypse and All,

        @steven123, I’m pleased to tell you that your request can be solved easily by using a POSIX equivalence class inside a character class (bracket expression). An equivalence class stands for the list of all characters that are equivalent to the character it names.

        The regex syntax of a POSIX equivalence class is [=<Char>=]. Note that this construct must itself be placed inside a normal character class [....] or a negative character class [^....] in order to match all the characters equivalent to the character Char!

        For additional information, refer to:

        https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.equivalence_classes

        https://www.regular-expressions.info/posixbrackets.html#eq


        For instance, the regex (?-i)[A-E[=0=]5-9] matches an uppercase letter from A to E, a digit from 5 to 9, the digit 0, or any of the 4 characters equivalent to the digit 0 in the list below:

         0030   Nd	0	Basic Latin                                       DIGIT ZERO
         2070   No	⁰	Superscripts and Subscripts                       SUPERSCRIPT ZERO
         2080   No	₀	Superscripts and Subscripts                       SUBSCRIPT ZERO
         24EA   No	⓪	Enclosed Alphanumerics                            CIRCLED DIGIT ZERO
         FF10   Nd	０	Halfwidth and Fullwidth Forms                     FULLWIDTH DIGIT ZERO
        

        So, to match both the string Ho Chi Minh and the string Hồ Chí Minh, simply use the regex:

        H[[=o=]] Ch[[=i=]] Minh, with the equivalence classes [=o=] and [=i=] each embedded in a character class
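        The [[=o=]] syntax depends on the regex engine; Python's `re` module, for one, does not support it. As a rough sketch of how a comparable diacritic-insensitive search might be done there, the code below swaps in a different technique: it decomposes both strings with Unicode NFD normalization and drops the combining marks before comparing. The function names are assumptions made for this example, not part of any library.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def contains_ignoring_diacritics(haystack: str, needle: str) -> bool:
    """Substring test on the diacritic-free forms of both strings."""
    return strip_diacritics(needle) in strip_diacritics(haystack)


print(strip_diacritics("H\u1ed3 Ch\u00ed Minh"))  # Ho Chi Minh
print(contains_ignoring_diacritics("Th\u00e0nh ph\u1ed1 H\u1ed3 Ch\u00ed Minh", "Ho Chi Minh"))  # True
```

        Unlike the equivalence classes above, this comparison stays case-sensitive and only removes accents that Unicode expresses as combining marks, so letters such as đ are left untouched.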


        Some other examples :

        • The regex \t[[=a=]]\t, run against the list below, matches every character equivalent to the letter a, whatever its case, accent, size or other variation. That is 69 characters in total!
         0041	A	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A
         0061	a	  ; Lower_Letter # Ll         LATIN SMALL LETTER A
         00AA	ª	  ; Other_Letter # Lo         FEMININE ORDINAL INDICATOR
         00C0	À	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH GRAVE
         00C1	Á	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH ACUTE
         00C2	Â	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CIRCUMFLEX
         00C3	Ã	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH TILDE
         00C4	Ä	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH DIAERESIS
         00C5	Å	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH RING ABOVE
         00E0	à	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH GRAVE
         00E1	á	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH ACUTE
         00E2	â	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CIRCUMFLEX
         00E3	ã	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH TILDE
         00E4	ä	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH DIAERESIS
         00E5	å	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH RING ABOVE
         0100	Ā	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH MACRON
         0101	ā	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH MACRON
         0102	Ă	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH BREVE
         0103	ă	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH BREVE
         0104	Ą	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH OGONEK
         0105	ą	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH OGONEK
         01CD	Ǎ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CARON
         01CE	ǎ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CARON
         01DE	Ǟ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON
         01DF	ǟ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
         01E0	Ǡ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON
         01E1	ǡ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON
         01FA	Ǻ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE
         01FB	ǻ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE
         0200	Ȁ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH DOUBLE GRAVE
         0201	ȁ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH DOUBLE GRAVE
         0202	Ȃ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH INVERTED BREVE
         0203	ȃ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH INVERTED BREVE
         0250	ɐ	  ; Lower_Letter # Ll         LATIN SMALL LETTER TURNED A
         0251	ɑ	  ; Lower_Letter # Ll         LATIN SMALL LETTER ALPHA
         0252	ɒ	  ; Lower_Letter # Ll         LATIN SMALL LETTER TURNED ALPHA
         1E00	Ḁ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH RING BELOW
         1E01	ḁ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH RING BELOW
         1E9A	ẚ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH RIGHT HALF RING
         1EA0	Ạ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH DOT BELOW
         1EA1	ạ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH DOT BELOW
         1EA2	Ả	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH HOOK ABOVE
         1EA3	ả	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH HOOK ABOVE
         1EA4	Ấ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE
         1EA5	ấ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
         1EA6	Ầ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND GRAVE
         1EA7	ầ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE
         1EA8	Ẩ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
         1EA9	ẩ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
         1EAA	Ẫ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND TILDE
         1EAB	ẫ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE
         1EAC	Ậ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW
         1EAD	ậ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
         1EAE	Ắ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH BREVE AND ACUTE
         1EAF	ắ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH BREVE AND ACUTE
         1EB0	Ằ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH BREVE AND GRAVE
         1EB1	ằ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH BREVE AND GRAVE
         1EB2	Ẳ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH BREVE AND HOOK ABOVE
         1EB3	ẳ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE
         1EB4	Ẵ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH BREVE AND TILDE
         1EB5	ẵ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH BREVE AND TILDE
         1EB6	Ặ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW
         1EB7	ặ	  ; Lower_Letter # Ll         LATIN SMALL LETTER A WITH BREVE AND DOT BELOW
         212B	Å	  ; Upper_Letter # Lu         ANGSTROM SIGN
         249C	⒜	  ; Other_Symbol # So         PARENTHESIZED LATIN SMALL LETTER A
         24B6	Ⓐ	  ; Other_Alpha. # So         CIRCLED LATIN CAPITAL LETTER A
         24D0	ⓐ	  ; Other_Alpha. # So         CIRCLED LATIN SMALL LETTER A
         FF21	Ａ	  ; Upper_Letter # Lu         FULLWIDTH LATIN CAPITAL LETTER A
         FF41	ａ	  ; Lower_Letter # Ll         FULLWIDTH LATIN SMALL LETTER A
        
        • In the same way, the regex \t[[===]]\t matches all the characters equivalent to the equals sign =:
         003D   Sm	=	Basic Latin                                       EQUALS SIGN
         207C   Sm	⁼	Superscripts and Subscripts                       SUPERSCRIPT EQUALS SIGN
         208C   Sm	₌	Superscripts and Subscripts                       SUBSCRIPT EQUALS SIGN
         229C   Sm	⊜	Mathematical Operators                            CIRCLED EQUALS
         FF1D   Sm	＝	Halfwidth and Fullwidth Forms                     FULLWIDTH EQUALS SIGN
        
        • And the regex \t[[=Q=]]\t matches all the characters equivalent to the letter Q:
         0051   Lu	Q	Basic Latin                                       LATIN CAPITAL LETTER Q
         0071   Ll	q	Basic Latin                                       LATIN SMALL LETTER Q
         02A0   Ll	ʠ	IPA Extensions                                    LATIN SMALL LETTER Q WITH HOOK
         211A   Lu	ℚ	Letterlike Symbols                                DOUBLE-STRUCK CAPITAL Q
         24AC   So	⒬	Enclosed Alphanumerics                            PARENTHESIZED LATIN SMALL LETTER Q
         24C6   So	Ⓠ	Enclosed Alphanumerics                            CIRCLED LATIN CAPITAL LETTER Q
         24E0   So	ⓠ	Enclosed Alphanumerics                            CIRCLED LATIN SMALL LETTER Q
         FF31   Lu	Ｑ	Halfwidth and Fullwidth Forms                     FULLWIDTH LATIN CAPITAL LETTER Q
         FF51   Ll	ｑ	Halfwidth and Fullwidth Forms                     FULLWIDTH LATIN SMALL LETTER Q
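        To get a feel for why superscript, subscript, circled and fullwidth forms end up grouped with their base character, here is a small Python sketch that approximates such lists from Unicode data alone. It is an assumption-laden stand-in, not the mechanism Notepad++ or Boost uses: a character counts as "equivalent" here when its compatibility decomposition (NFKD), with combining marks removed and case folded, equals the folded base character. Both function names are hypothetical.

```python
import unicodedata


def fold(ch: str) -> str:
    """Reduce a character to a bare, caseless form: NFKD, no combining marks, casefold."""
    decomposed = unicodedata.normalize("NFKD", ch)
    bare = "".join(c for c in decomposed if not unicodedata.combining(c))
    return bare.casefold()


def approximate_equivalents(base: str) -> list:
    """List the BMP characters whose folded form equals that of `base`."""
    target = fold(base)
    found = []
    for code in range(0x10000):
        if 0xD800 <= code <= 0xDFFF:  # skip surrogate code points
            continue
        ch = chr(code)
        if fold(ch) == target:
            found.append(ch)
    return found


zeros = approximate_equivalents("0")
# The five characters from the DIGIT ZERO list above are all found.
print(all(c in zeros for c in "0\u2070\u2080\u24ea\uff10"))  # True
```

        The results will not be identical to the lists above. Characters that have no Unicode decomposition (ʠ, for example) are not found by this approach, and it may return a few characters that the lists above do not include.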
        

        Best Regards,

        guy038

        The Community of users of the Notepad++ text editor.