Regex Misidentifying Foreign Characters



  • I’m trying to use the NPP regex search to find soft hyphens (ISO 8859: 0xAD, Unicode U+00AD SOFT HYPHEN) next to non-word characters, for diagnostic purposes. Here’s the regular expression I’m using (NOTA BENE: The browser hides soft hyphens, so you can’t see them here. For an example of search text with embedded soft hyphens, look at the source code of the Web page at http://www.hymntime.com/tch/bio/r/i/v/i/rivier_jft.htm). In this regex, there is one soft hyphen before and one after the vertical bar character (the regex “OR” character).

    [^\w]­|­[^\w]
    

    Searching the text below using the regex above gives false positives (again, the browser makes the soft hyphens invisible; you can see examples of soft hyphens being used at the URL given above).

    <body>
    <p lang="fr">Mé­di­a­teur de l’An­ci­enne</p>
    </body>
    

    Although the browser hides soft hyphens, there is one before the letter ‘d’ in the word Médiateur. The regex reports that this soft hyphen follows a non-word character (in this case, the accented e, é).

    I tried other regexes, like the one below, but still get the false positives (again one soft hyphen before and after the vertical bar):

    [^[:alpha:]æáéôöüúíîß]­|­[^[:alpha:]æáéôöüúíîß]
    

    What am I doing wrong? Is this a bug in the regex engine?



  • Hello, @sylvester-bullitt and All,

    No, your regex does not give false positives! Your regex can also be written:

    [^\w]\xAD|\xAD[^\w]

    And, indeed, it matches as expected :

    • A non-word char, followed by a soft hyphen \xAD

    • A soft hyphen \xAD, followed by a non-word char


    Assuming your HTML example, below :

    <body>
    <p lang="fr">Mé­di­a­teur de l’An­ci­enne</p>
    </body>
    
    • Place your caret, right before the upper-case M

    • Press the Right Arrow key => The caret is now located right after the letter M

    • Press again the Right Arrow key => The caret seems to be in the middle of the letter é !!

    • Press again the Right Arrow key => The caret is now right after the letter é

    Why this behaviour? Simply because, after the letter M, there is no single é character at all, but 2 characters:

    • The letter e, of Unicode code-point \x{0065}, followed by

    • The COMBINING ACUTE ACCENT, of Unicode code-point \x{0301}, which is a character from the Combining Diacritical Marks Unicode block, in range [\x{0300}–\x{036F}]!

    Refer to all these characters of that block, below :

    http://www.unicode.org/charts/PDF/U0300.pdf


    Note that any character (not only vowels!) may carry any number of additional diacritical marks, which all attach to the base char!

    IMPORTANT: this assumes that your current font is able to draw all these diacritical marks. Otherwise, your system will probably substitute a fallback font in order to display such glyphs, usually less accurately!

    For instance, the accented character p̸͚̀͟͠, based on the lowercase letter p, can be found with the regex p\x{0300}\x{0338}\x{035A}\x{035F}\x{0360} because:

    • I first wrote the lower-case letter p

    • I added the diacritical mark COMBINING GRAVE ACCENT ( \x{0300} )

    • I added the diacritical mark COMBINING LONG SOLIDUS OVERLAY ( \x{0338} )

    • I added the diacritical mark COMBINING DOUBLE RING BELOW ( \x{035A} )

    • I added the diacritical mark COMBINING DOUBLE MACRON BELOW ( \x{035F} )

    • I added the diacritical mark COMBINING DOUBLE TILDE ( \x{0360} )

    Any other combination will not work, of course !
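
    To make the "one glyph, several characters" point concrete, here is a minimal sketch in Python (not Notepad++ itself). It rebuilds the decorated p from the five code points listed above and prints each character with its Unicode name. The names MARKS, decorated_p and describe are only illustrative.

```python
import unicodedata

# Base letter followed by the five combining marks listed above, in that order.
MARKS = "\u0300\u0338\u035A\u035F\u0360"
decorated_p = "p" + MARKS


def describe(text):
    """Return (code point, Unicode name) pairs for each character of text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


for code_point, name in describe(decorated_p):
    print(code_point, name)

# A single visible glyph, but six separate characters for a regex engine.
print(len(decorated_p))
```

    Every character after the p has a non-zero combining class, which is why a regex has to list each mark explicitly after the base letter.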


    So, in your example, the part beginning with the letter M and ending right before the letter d contains, in fact, 4 characters, which can be found with the regex:

    Me\x{0301}\xAD

    And, with your regex [^\w]\xAD|\xAD[^\w], the first alternative does match a non-word char ( \x{0301} ) followed by a soft hyphen ( \xAD )

    So, a regex like (\w[\x{0300}-\x{036F}]*)+ would match any run of word characters, each one possibly followed by diacritical mark chars!
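
    The same match can be reproduced outside Notepad++. The sketch below uses Python's re module, which is an assumption on my side: it is not the Boost engine that Notepad++ uses, but it also treats U+0301 as a non-word character, so the original pattern reports the same "false positive" on a decomposed é. The word pattern is written with a non-capturing group and \u escapes, which is Python syntax for the same idea.

```python
import re

SHY = "\u00AD"  # SOFT HYPHEN
# "Médiateur" typed with a decomposed é: e + U+0301 COMBINING ACUTE ACCENT.
decomposed = "Me\u0301" + SHY + "di" + SHY + "a" + SHY + "teur"

# The pattern from the question, with the soft hyphens written as \xAD.
original = re.compile(r"[^\w]\xAD|\xAD[^\w]")
hit = original.search(decomposed)
print(hit.start(), [f"U+{ord(ch):04X}" for ch in hit.group()])

# Word characters, each optionally followed by combining diacritical marks.
word = re.compile(r"(?:\w[\u0300-\u036F]*)+")
print(word.match(decomposed).group() == "Me\u0301")
```

    The match starts on the combining accent, not on the letter e, which is exactly what the caret experiment above shows.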

    If we change your regex in order to include the Combining Diacritical Marks range in the negated character class, an easy solution could be:

    [^\w\x{0300}-\x{036F}]\xAD|\xAD[^\w\x{0300}-\x{036F}]

    However, this new regex does not find any occurrence in your text, below :

    http://www.hymntime.com/tch/bio/r/i/v/i/rivier_jft.htm

    This could be your expected result ;-))
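
    As a quick check of the idea, here is a hedged Python sketch of the extended pattern, assuming the second alternative is meant to start with \xAD. It uses Python's re rather than the Notepad++ engine, and the two sample strings are made up for illustration. The first one mimics the French line with a decomposed é, the second one contains a soft hyphen that really does touch a space.

```python
import re

SHY = "\u00AD"  # SOFT HYPHEN

# Negated class extended with the Combining Diacritical Marks block.
fixed = re.compile(r"[^\w\u0300-\u036F]\xAD|\xAD[^\w\u0300-\u036F]")

# Every soft hyphen sits between letters (or after a combining accent).
clean = ("Me\u0301" + SHY + "di" + SHY + "a" + SHY + "teur de l\u2019An"
         + SHY + "ci" + SHY + "enne")
print(fixed.search(clean))  # no match object expected here

# A soft hyphen directly after a space is still reported.
stray = "word " + SHY + "word"
print(fixed.search(stray) is not None)
```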


    Note that it could be wise to run a regex S/R to replace each base letter + combining character pair with the equivalent single precomposed character!

    Referring to your example, the S/R :

    SEARCH (?-i)e\x{0301}

    REPLACE \x{00E9}

    And you get this time :

    <body>
    <p lang="fr">Mé­di­a­teur de l’An­ci­enne</p>
    </body>
    

    Remark: use the menu option Edit > Character Panel to get the right hexadecimal code of the replacement character!
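
    Outside Notepad++, the same clean-up can be done for all such pairs at once with Unicode normalization. This is a different technique from the S/R above, shown only as a sketch: Python's unicodedata.normalize with the NFC form composes e + U+0301 into U+00E9 and leaves the soft hyphens alone. The sample string is an assumption for illustration.

```python
import re
import unicodedata

SHY = "\u00AD"  # SOFT HYPHEN
decomposed = "Me\u0301" + SHY + "diateur"

# NFC composes base letter + combining mark into one precomposed character.
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))

# The unchanged pattern from the question no longer fires after composition.
original = re.compile(r"[^\w]\xAD|\xAD[^\w]")
print(original.search(decomposed) is not None)
print(original.search(composed) is not None)
```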

    On the other hand, to search for any individual Combining Diacritical Mark, preferably use the regex [\x{0300}-\x{036F}] and the Mark feature ;-))

    Best Regards

    guy038



  • @guy038
    Wow. Sounds like I’ve got a lot to learn. Will start digging in to this tomorrow.
    Thanks for taking the time to give such a detailed answer on such an arcane topic!



  • Hi, @sylvester-bullitt,

    You may also see all the characters of the N++ Character Panel with code-points over \x{007F} in the C1 Controls and Latin-1 Supplement Unicode block, below:

    http://www.unicode.org/charts/PDF/U0080.pdf

    You’ll notice the soft hyphen character, named SHY, of code-point \x{00AD}

    Cheers,

    guy038



  • I think I found a quick fix, but first let me give some background on how this mess happened in the first place.

    Since I don’t have an easy way to enter diacritical marks with my keyboard, I had been using the Windows Charmap application to copy the é character (and many others) to the clipboard. Then I pasted them in a Windows Notepad document, and saved in UTF-8 encoding, so I could just copy them later whenever needed, and paste them to documents I was working on.

    The solution I found was to simply copy the é from Charmap again, then use NPP’s Find & Replace function to replace the faulty é (that is, multi-character version) with the one I just copied from Charmap.

    After making the changes in a test document, I ran the (unchanged) regular expression search again, and what do you know? All the false positives disappeared! In other words, once I removed Windows Notepad as the middleman, and copied the é to NPP, it worked as expected. My take on this: I’m guessing Windows Notepad is so old (even though I use Windows 10) it was never updated to correctly handle characters with diacritical marks. I hope (but doubt) somebody from Microsoft is reading this forum.

    Thanks for the insights!



  • Hello, @sylvester-bullitt and All,

    In this post, I’ll describe all the Windows input methods which, by combining the ALT key and the numeric keypad, allow you to enter any character with a Unicode code-point between \x{0000} and \x{FFFF}, from:

    • The current Windows OEM Code page, used by your system

    • The current Windows ANSI Code page, used by your system

    • The Unicode Basic Multilingual Plane

    There are 4 Windows Input methods :


    The first two, and best-known, methods are:

    • ALT + a number n, from 001 to 255, writes the character of code n from the Windows OEM code page used on your system

      • Press and hold the ALT key

      • Type a number between 001 and 255 on your numeric keypad

      • Release the ALT key


    • ALT + a number n, from 0001 to 0255, writes the character of code n from the Windows ANSI code page used on your system for any NON-Unicode program (generally Windows-1252). You can also see the list of all these characters in Notepad++, by clicking on the menu option Edit > Character Panel

      • Press and hold the ALT key

      • Hit the 0 key FIRST on your numeric keypad (IMPORTANT)

      • Then, type a number between 001 and 255 on your numeric keypad

      • Release the ALT key


    A third Windows input method, rarely used, which works ONLY in a file with a Unicode encoding, is:

    • ALT + a number n, from 1 to 31, writes the old symbol of the control character of code n

      • Press and hold the ALT key

      • Type a number between 1 and 31 on your numeric keypad, WITHOUT any leading zero!

      • Release the ALT key

    => You’ll obtain the following 31 characters:

    ☺ ☻ ♥ ♦ ♣ ♠ • ◘ ○ ◙ ♂ ♀ ♪ ♫ ☼ ► ◄ ↕ ‼ ¶ § ▬ ↨ ↑ ↓ → ← ∟ ↔ ▲ ▼
    

    A fourth, powerful Windows input method becomes available after creating a new registry entry on your system:

    • ALT + the + sign + a hexadecimal number n, from 0000 to FFFF, writes the character of code-point n from the Basic Multilingual Plane

      • Hold down the ALT key

      • Type the + key on the NUMERIC keypad

      • Type the hexadecimal code-point of the character, using the 0 to 9 keys of the numeric keypad AND/OR the normal A to F keys of the alphanumeric keyboard

      • Release the ALT key

    • Note that this fourth input method cannot write any character with a code-point beyond the BMP, that is, between \x{10000} and \x{10FFFF}!


    As said above, to be able to use this fourth input method, you must modify the registry:

    • Run the application regedit.exe

    • Preferably, back up your whole registry first

    • Move to the HKEY_CURRENT_USER\Control Panel\Input Method location

    • Create a new REG_SZ entry, named EnableHexNumpad

    • Enter, as data, the value 1

    • Confirm the dialog

    • Close the registry editor

    • Restart your system or, on Windows 7 and above, simply log off and on again


    For instance, if you want to write the EM DASH character, of Unicode code-point \x{2014} and of code 151 in the Windows-1252 encoding, two solutions are possible:

    • Hold ALT and hit, successively, 0, 1, 5, 1 on your numeric keypad (second input method)

    • Hold ALT and hit, successively, +, 2, 0, 1, 4 on your numeric keypad (fourth input method)
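
    The relation between these ALT codes and code pages can be sketched in Python by decoding a single byte with the matching codec. This is only a model of what the keystrokes are expected to produce, not a way to send them. The helper names alt_decimal and alt_hex are hypothetical, cp437 is an assumed OEM code page (yours may differ, for instance cp850), and codes below 32 are left out because the third method's symbols are not part of these codecs.

```python
def alt_decimal(digits: str, oem: str = "cp437", ansi: str = "cp1252") -> str:
    """Character expected from ALT + decimal digits (first and second methods).

    A leading zero selects the ANSI code page, otherwise the OEM one.
    Codes below 32 (the third method's symbols) are not modelled here.
    """
    code = int(digits)
    if not 32 <= code <= 255:
        raise ValueError(f"unsupported ALT code: {digits}")
    codec = ansi if digits.startswith("0") else oem
    return bytes([code]).decode(codec)


def alt_hex(hex_digits: str) -> str:
    """Character expected from ALT + '+' + hex digits (fourth method, BMP only)."""
    code_point = int(hex_digits, 16)
    if code_point > 0xFFFF:
        raise ValueError("beyond the Basic Multilingual Plane")
    return chr(code_point)


# EM DASH through the second and the fourth method.
print(alt_decimal("0151") == alt_hex("2014"))
# The same letter through the OEM and the ANSI code page.
print(repr(alt_decimal("130")), repr(alt_decimal("0233")))
```

    With these assumptions, ALT + 0173 gives the soft hyphen itself, since byte 0xAD in Windows-1252 is U+00AD.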


    For some additional examples of the 4th input method, refer to the end of this post:

    https://notepad-plus-plus.org/community/topic/11962/alt-codes-not-working/5

    In that post, note that the present 4th input method is called the 3rd input method!


    Of course, it’s always better to work with documents in a Unicode encoding:

    • The UTF-8 and UTF-8-BOM encodings allow you to store any character from \x{0000} to \x{10FFFF}, so all characters of any Unicode plane, from 0 to 16

    • The UCS-2 BE BOM and UCS-2 LE BOM encodings allow you to store characters from \x{0000} to \x{FFFF} only, so all characters of Unicode Plane 0, also named the Basic Multilingual Plane (BMP)
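
    A small Python sketch of that difference, using two sample characters chosen for illustration: the EM DASH, which lies inside the BMP, and U+1F600, which lies outside it. The helper fits_in_ucs2 is a hypothetical name and simply tests the code-point against \x{FFFF}.

```python
def fits_in_ucs2(ch: str) -> bool:
    """True when the character lies in the Basic Multilingual Plane."""
    return ord(ch) <= 0xFFFF


em_dash = "\u2014"         # inside the BMP
grinning = "\U0001F600"    # outside the BMP (Emoticons block)

for ch in (em_dash, grinning):
    utf8_bytes = len(ch.encode("utf-8"))
    # Number of 16-bit code units; 2 means a surrogate pair is required.
    utf16_units = len(ch.encode("utf-16-le")) // 2
    print(f"U+{ord(ch):04X}", utf8_bytes, utf16_units, fits_in_ucs2(ch))
```

    A character that needs two 16-bit units cannot be stored as a single UCS-2 value, whereas UTF-8 simply uses a longer byte sequence for it.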

    Whatever the encoding, the most important thing is that the current font used in N++ is able to display the glyphs of all these characters. Traditional monospaced fonts, such as Courier New or Consolas, display Latin, Greek and Cyrillic letters and general symbols, but lack a great number of Unicode characters!

    I have the Symbola Monospacified for Liberation Mono font, a monospaced font which contains 9,622 characters and 9,827 glyphs and can handle all diacritical marks.

    Note that this font does not contain the Arabic, Hebrew and Asian (including Japanese) Unicode scripts but contains, in addition to all European scripts, the Punctuation, Mathematical, Arrows, Technical, Dingbats, Emoticons and Pictographs blocks and many others, as listed below:

        •-------------------------------------------------------------•-------------------------------•
        |             Unicode 11.0 Block            |      Range      | In font | In block | Complete |
        •-------------------------------------------•-----------------•---------•----------•----------•
        |  Basic Latin                              |   0000 -  007F  |    128  |     128  |          |
        |  Latin-1 Supplement                       |   0080 -  00FF  |    128  |     128  |          |
        •-------------------------------------------•-----------------•---------•----------•----------•
        |  Latin Extended-A                         |   0100 -  017F  |    128  |     128  |          |
        |  Latin Extended-B                         |   0180 -  024F  |    208  |     208  |          |
        |  IPA Extensions                           |   0250 -  02AF  |     96  |      96  |          |
        |  Spacing Modifier Letters                 |   02B0 -  02FF  |     80  |      80  |          |
        |  Combining Diacritical Marks              |   0300 -  036F  |    112  |     112  |          |
        |  Greek and Coptic                         |   0370 -  03FF  |    135  |     135  |          |
        |  Cyrillic                                 |   0400 -  04FF  |    256  |     256  |          |
        |  Cyrillic Supplement                      |   0500 -  052F  |     48  |      48  |          |
        |  Combining Diacritical Marks Extended     |   1AB0 -  1AFF  |     15  |      15  |          |
        |  Cyrillic Extended-C                      |   1C80 -  1C8F  |      9  |       9  |          |
        |  Phonetic Extensions                      |   1D00 -  1D7F  |    128  |     128  |          |
        |  Phonetic Extensions Supplement           |   1D80 -  1DBF  |     64  |      64  |          |
        |  Combining Diacritical Marks Supplement   |   1DC0 -  1DFF  |     63  |      63  |          |
        |  Latin Extended Additional                |   1E00 -  1EFF  |    256  |     256  |          |
        |  Greek Extended                           |   1F00 -  1FFF  |    233  |     233  |          |
        |  General Punctuation                      |   2000 -  206F  |    111  |     111  |          |
        |  Superscripts and Subscripts              |   2070 -  209F  |     42  |      42  |          |
        |  Currency Symbols                         |   20A0 -  20CF  |     32  |      32  |          |
        |  Combining Diacritical Marks for Symbols  |   20D0 -  20FF  |     33  |      33  |          |
        |  Letterlike Symbols                       |   2100 -  214F  |     80  |      80  |          |
        |  Number Forms                             |   2150 -  218F  |     60  |      60  |          |
        |  Arrows                                   |   2190 -  21FF  |    112  |     112  |          |
        |  Mathematical Operators                   |   2200 -  22FF  |    256  |     256  |          |
        |  Miscellaneous Technical                  |   2300 -  23FF  |    256  |     256  |          |
        |  Control Pictures                         |   2400 -  243F  |     39  |      39  |          |
        |  Optical Character Recognition            |   2440 -  245F  |     11  |      11  |          |
        |  Enclosed Alphanumerics                   |   2460 -  24FF  |    160  |     160  |          |
        |  Box Drawing                              |   2500 -  257F  |    128  |     128  |          |
        |  Block Elements                           |   2580 -  259F  |     32  |      32  |          |
        |  Geometric Shapes                         |   25A0 -  25FF  |     96  |      96  |          |
        |  Miscellaneous Symbols                    |   2600 -  26FF  |    256  |     256  |          |
        |  Dingbats                                 |   2700 -  27BF  |    192  |     192  |          |
        |  Miscellaneous Mathematical Symbols-A     |   27C0 -  27EF  |     48  |      48  |          |
        |  Supplemental Arrows-A                    |   27F0 -  27FF  |     16  |      16  |          |
        |  Braille Patterns                         |   2800 -  28FF  |    256  |     256  |          |
        |  Supplemental Arrows-B                    |   2900 -  297F  |    128  |     128  |          |
        |  Miscellaneous Mathematical Symbols-B     |   2980 -  29FF  |    128  |     128  |          |
        |  Supplemental Mathematical Operators      |   2A00 -  2AFF  |    256  |     256  |          |
        •-------------------------------------------•-----------------•---------•----------•----------•
        |  Miscellaneous Symbols and Arrows         |   2B00 -  2BFF  |    207  |     250  |    No    |
        •-------------------------------------------•-----------------•---------•----------•----------•
        |  Latin Extended-C                         |   2C60 -  2C7F  |     32  |      32  |          |
        |  Coptic                                   |   2C80 -  2CFF  |    123  |     123  |          |
        |  Cyrillic Extended-A                      |   2DE0 -  2DFF  |     32  |      32  |          |
        •-------------------------------------------•-----------------•---------•----------•----------•
        |  Supplemental Punctuation                 |   2E00 -  2E7F  |     74  |      79  |    No    |
        •-------------------------------------------•-----------------•---------•----------•----------•
        |  Yijing Hexagram Symbols                  |   4DC0 -  4DFF  |     64  |      64  |          |
        |  Cyrillic Extended-B                      |   A640 -  A69F  |     96  |      96  |          |
        •-------------------------------------------•-----------------•---------•----------•----------•
        |  Latin Extended-D                         |   A720 -  A7FF  |    160  |     163  |    No    |
        •-------------------------------------------•-----------------•---------•----------•----------•
        |  Latin Extended-E                         |   AB30 -  AB6F  |     54  |      54  |          |
        |  Variation Selectors                      |   FE00 -  FE0F  |     16  |      16  |          |
        |  Combining Half Marks                     |   FE20 -  FE2F  |     16  |      16  |          |
        |  Specials                                 |   FFF0 -  FFFF  |      5  |       5  |          |
        |                                           |                 |         |          |          |
        |  Aegean Numbers                           |  10100 - 1013F  |     57  |      57  |          |
        |  Ancient Greek Numbers                    |  10140 - 1018F  |     79  |      79  |          |
        |  Ancient Symbols                          |  10190 - 101CF  |     13  |      13  |          |
        |  Phaistos Disc                            |  101D0 - 101FF  |     46  |      46  |          |
        |  Coptic Epact Numbers                     |  102E0 - 102FF  |     28  |      28  |          |
        |  Byzantine Musical Symbols                |  1D000 - 1D0FF  |    246  |     246  |          |
        |  Musical Symbols                          |  1D100 - 1D1FF  |    231  |     231  |          |
        |  Ancient Greek Musical Notation           |  1D200 - 1D24F  |     70  |      70  |          |
        |  Tai Xuan Jing Symbols                    |  1D300 - 1D35F  |     87  |      87  |          |
        •-------------------------------------------•-----------------•---------•----------•----------•
        |  Counting Rod Numerals                    |  1D360 - 1D37F  |     18  |      25  |    No    |
        •-------------------------------------------•-----------------•---------•----------•----------•
        |  Mathematical Alphanumeric Symbols        |  1D400 - 1D7FF  |    996  |     996  |          |
        |  Mahjong Tiles                            |  1F000 - 1F02F  |     44  |      44  |          |
        |  Domino Tiles                             |  1F030 - 1F09F  |    100  |     100  |          |
        |  Playing Cards                            |  1F0A0 - 1F0FF  |     82  |      82  |          |
        •-------------------------------------------•-----------------•---------•----------•----------•
        |  Enclosed Alphanumeric Supplement         |  1F100 - 1F1FF  |    191  |     192  |    No    |
        •-------------------------------------------•-----------------•---------•----------•----------•
        |  Enclosed Ideographic Supplement          |  1F200 - 1F2FF  |     64  |      64  |          |
        |  Miscellaneous Symbols and Pictographs    |  1F300 - 1F5FF  |    768  |     768  |          |
        |  Emoticons                                |  1F600 - 1F64F  |     80  |      80  |          |
        |  Ornamental Dingbats                      |  1F650 - 1F67F  |     48  |      48  |          |
        •-------------------------------------------•-----------------•---------•----------•----------•
        |  Transport and Map Symbols                |  1F680 - 1F6FF  |    107  |     108  |    No    |
        •-------------------------------------------•-----------------•---------•----------•----------•
        |  Alchemical Symbols                       |  1F700 - 1F77F  |    116  |     116  |          |
        •-------------------------------------------•-----------------•---------•----------•----------•
        |  Geometric Shapes Extended                |  1F780 - 1F7FF  |     85  |      89  |    No    |
        •-------------------------------------------•-----------------•---------•----------•----------•
        |  Supplemental Arrows-C                    |  1F800 - 1F8FF  |    148  |     148  |          |
        •-------------------------------------------•-----------------•---------•----------•----------•
        |  Supplemental Symbols and Pictographs     |  1F900 - 1F9FF  |    148  |     213  |    No    |
        |  Supplementary Private Use Area-A         |  F0000 - FFFFF  |    118  |  65,534  |    No    |
        •-------------------------------------------•-----------------•---------•----------•----------•
    

    If you want, I can send you the Symbola Monospacified for Liberation Mono font by e-mail, and I could add the complete list of characters handled by that font.

    Once installed on your system, you could use it, within N++, as the global default font, for instance.

    My e-mail address is tguy.038@gmail.com

    Best Regards,

    guy038



  • @guy038 said in Regex Misidentifying Foreign Characters:

    numeric keypad

    Any good advice for these techniques for those of us that prefer a keyboard without a numeric keypad (for use on cramped desktops)? :-)



  • Hello, @sylvester-bullitt, @alan-kilborn and All,

    Alan, good question ;-)) Personally, my old NEC 350 laptop does not have a numeric keypad, so I keep a standard USB keyboard (105 keys) permanently plugged into the laptop!

    When the Caps Lock key is set, my laptop’s French keyboard looks like this:

    1234567890°+
    AZERTYUIOP^£
    QSDFGHJKLM%
    >WXCVBN?./§

    And if I want to use the pseudo-numeric keypad, I just hit the Num Lock key and the keyboard then changes as below:

    123456789*°+
    AZERTY456-
    QSDFGH123+%
    >WXCVBN0../

    So :

    • The keys 7890 are mapped to keys 789*

    • The keys UIOP are mapped to keys 456-

    • The keys JKLM are mapped to keys 123+

    • The keys ?/§ are mapped to keys 0./

    As the A, B, C, D, E and F keys keep their default mapping, I’m always able, even without an additional keyboard (when travelling, for instance), to use, in conjunction with the Alt key, all the input methods described in my previous post ;-))

    Unfortunately, I don’t use any recent mini-laptop with a special keyboard layout, so I cannot say anything else about this subject :-((. Even my wife’s laptop has a physical keypad!

    So, I’m sorry: without the hardware, it’s impossible for me to give relevant advice on how to handle these Windows input methods with atypical keyboard configurations!

    Best Regards,

    guy038

