    Regex Misidentifying Foreign Characters

      Sylvester Bullitt

      I’m trying to use the NPP regex search to find soft hyphens (ISO 8859: 0xAD, Unicode U+00AD SOFT HYPHEN) adjacent to non-word characters, for diagnostic purposes. Here’s the regular expression I’m using (NOTA BENE: the browser hides soft hyphens, so you can’t see them here. For an example of search text with embedded soft hyphens, look at the source code of the Web page at http://www.hymntime.com/tch/bio/r/i/v/i/rivier_jft.htm). In this regex, there is one soft hyphen before and one after the vertical bar character (the regex “OR” character).

      [^\w]­|­[^\w]
      

      Searching the text below using the regex above gives false positives (again, the browser makes the soft hyphens invisible; you can see examples of soft hyphens being used at the URL given above).

      <body>
      <p lang="fr">Mé­di­a­teur de l’An­ci­enne</p>
      </body>
      

      Although the browser hides soft hyphens, there is one before the letter ‘d’ in the word Mé­di­a­teur. The regex reports that this soft hyphen follows a non-word character (in this case, the accented e, é).

      I tried other regexes, like the one below, but still get the false positives (again one soft hyphen before and after the vertical bar):

      [^[:alpha:]æáéôöüúíîß]­|­[^[:alpha:]æáéôöüúíîß]
      

      What am I doing wrong? Is this a bug in the regex engine?

        guy038

        Hello, @sylvester-bullitt and All,

        No, your regex does not give false positives! It can be written as:

        [^\w]\xAD|\xAD[^\w]

        And, indeed, it matches as expected :

        • A non-word char, followed by a soft hyphen \xAD

        • A soft hyphen \xAD, followed by a non-word char


        Assuming your HTML example, below :

        <body>
        <p lang="fr">Mé­di­a­teur de l’An­ci­enne</p>
        </body>
        
        • Place your caret, right before the upper-case M

        • Press the Right Arrow key => The caret is now located right after the letter M

        • Press again the Right Arrow key => The caret seems to be in the middle of the letter é !!

        • Press again the Right Arrow key => The caret is now right after the letter é

        Why this behaviour ? Just because, after the M letter, there is no é letter at all, but 2 characters :

        • The letter e of Unicode code \x{0065}, followed with

        • The COMBINING ACUTE ACCENT, of Unicode code-point \x{0301}, which is a character from the combining diacritical marks Unicode block, in range [\x{0300}–\x{036F}] !

        Refer to all these characters of that block, below :

        http://www.unicode.org/charts/PDF/U0300.pdf


        Note that any character ( not only vowels ! ) may carry any number of additional diacritical marks, each applying to the base char !

        IMPORTANT : this condition implies that your current font is able to draw all these diacritical marks. Otherwise, your system will probably make a font substitution or use a fallback font, in order to visualize such glyphs, usually less accurately !

        For instance, the accentuated character,p̸͚̀͟͠ , based on the lowercase letter p, can be found with the regex : p\x{0300}\x{0338}\x{035A}\x{035F}\x{0360} because :

        • I first wrote the lower-case letter p

        • I added the diacritical mark COMBINING GRAVE ACCENT ( \x{0300} )

        • I added the diacritical mark COMBINING LONG SOLIDUS OVERLAY ( \x{0338} )

        • I added the diacritical mark COMBINING DOUBLE RING BELOW ( \x{035A} )

        • I added the diacritical mark COMBINING DOUBLE MACRON BELOW ( \x{035F} )

        • I added the diacritical mark COMBINING DOUBLE TILDE ( \x{0360} )

        Any other combination will not work, of course !
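The composition described above can also be checked outside Notepad++. This is a minimal sketch in Python, using only the standard `unicodedata` module; the helper name `describe` is made up for illustration. It lists each character of the sequence with its code point, general category and official name. The marks are all in category `Mn` (nonspacing mark), which is not a letter or digit category.

```python
import unicodedata

# Base letter p plus the five combining marks listed above, in the same order.
SEQUENCE = "p\u0300\u0338\u035A\u035F\u0360"


def describe(text):
    """Return (code point, Unicode name, general category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]


for code_point, name, category in describe(SEQUENCE):
    print(code_point, category, name)
```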


        So, regarding your example, the part, beginning with the M letter and ending right before the d letter, contains, in fact, 4 characters, which can be found with the regex :

        Me\x{0301}\xAD

        And, with your regex [^\w]\xAD|\xAD[^\w], the first alternative does match a non-word char ( \x{0301} ), followed with a soft hyphen ( \xAD )
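To see the same effect in a script, here is a small sketch using Python's `re` module. Python's engine is not the Boost engine used by Notepad++, so this only illustrates the principle: in both, the combining accent is not a word character. The sample strings are shortened versions of the example text.

```python
import re

# The original regex: a soft hyphen (U+00AD) next to a non-word character.
SHY_NEXT_TO_NON_WORD = re.compile(r"[^\w]\xad|\xad[^\w]")

decomposed = "Me\u0301\u00addi"   # e + COMBINING ACUTE ACCENT, then a soft hyphen
precomposed = "M\u00e9\u00addi"   # single precomposed letter U+00E9, then a soft hyphen

hit = SHY_NEXT_TO_NON_WORD.search(decomposed)
miss = SHY_NEXT_TO_NON_WORD.search(precomposed)

print(hit is not None)   # the combining accent counts as a non-word char
print(miss)              # no match when the accented letter is a single character
```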

        So, a regex like (\w[\x{0300}-\x{036F}]*)+ would get any word containing word characters followed with possible diacritical mark chars !
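As a rough illustration of that word regex, the sketch below uses Python's `re` module (again, not the Boost engine of Notepad++). A non-capturing group is used so that `findall` returns whole matches; the sample text is a simplified version of the example, without soft hyphens.

```python
import re

# A word character followed by any number of Combining Diacritical Marks, repeated.
WORD_WITH_MARKS = re.compile(r"(?:\w[\u0300-\u036f]*)+")

text = "Me\u0301diateur de l\u2019Ancienne"

with_marks = WORD_WITH_MARKS.findall(text)
plain = re.findall(r"\w+", text)

print(with_marks)  # the decomposed word stays in one piece
print(plain)       # plain \w+ splits the word at the combining accent
```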

        If we change your regex, in order to include the Combining Diacritical Marks range in the negative class character, an easy solution could be :

        [^\w\x{0300}-\x{036F}]\xAD|\xAD[^\w\x{0300}-\x{036F}]
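A hedged Python equivalent of this adjusted regex is sketched below, with the soft hyphen written as `\xad` in both alternatives. It assumes Python's `re` syntax (`\uXXXX` instead of `\x{XXXX}`), and the sample strings are invented for the test.

```python
import re

COMBINING = r"\u0300-\u036f"  # Combining Diacritical Marks block

# Soft hyphen next to a character that is neither a word char nor a combining mark.
FIXED = re.compile(rf"[^\w{COMBINING}]\xad|\xad[^\w{COMBINING}]")

samples = {
    "decomposed accent": "Me\u0301\u00addi",   # should no longer be reported
    "space before": "a \u00adb",               # should be reported
    "space after": "a\u00ad b",                # should be reported
}

results = {label: FIXED.search(text) is not None for label, text in samples.items()}
print(results)
```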

        However, this new regex does not find any occurrence in your text, below :

        http://www.hymntime.com/tch/bio/r/i/v/i/rivier_jft.htm

        This could be your expected result ;-))


        Note that it could be judicious to run a regex S/R to get rid of any combining character and simply use the appropriate character !

        Referring to your example, the S/R :

        SEARCH (?-i)e\x{0301}

        REPLACE \x{00E9}

        And you get this time :

        <body>
        <p lang="fr">Mé­di­a­teur de l’An­ci­enne</p>
        </body>
        

        Remark : use the menu option Edit > Character Panel to get the right hexadecimal code of the replacement character !
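Outside the editor, the same clean-up can be sketched in Python. The first call mirrors the S/R above for the single pair e + U+0301. The second uses Unicode NFC normalization from the standard `unicodedata` module, which composes every base letter and mark that has a precomposed form. This is an alternative technique, not something the S/R itself does.

```python
import re
import unicodedata

decomposed = "Me\u0301\u00addi\u00ada\u00adteur"

# Equivalent of the S/R above: e + U+0301 becomes the single character U+00E9.
replaced = re.sub("e\u0301", "\u00e9", decomposed)

# NFC normalization composes base letter + combining mark pairs where possible.
normalized = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(replaced), len(normalized))
```

The soft hyphens are left untouched by both approaches.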

        On the other hand, in order to search for any individual Combining Diacritical Mark, use, preferably, the regex [\x{0300}-\x{036F}] and the Mark feature ;-))

        Best Regards

        guy038

          Sylvester Bullitt @guy038

          @guy038
          Wow. Sounds like I’ve got a lot to learn. Will start digging in to this tomorrow.
          Thanks for taking the time to give such a detailed answer on such an arcane topic!

            guy038

            Hi, @sylvester-bullitt,

            You may also see all the characters of the N++ Character Panel, with code-point over \x{007f}, in the C1 Controls and Latin-1 Supplement Unicode block, below :

            http://www.unicode.org/charts/PDF/U0080.pdf

            You’ll notice the soft hyphen character, named SHY, of code-point \x{00AD}
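For reference, the properties of the soft hyphen can be inspected with Python's standard `unicodedata` module. This short sketch shows its name, its general category (`Cf`, a format character, hence not a word character) and its byte forms in ISO 8859-1 and UTF-8.

```python
import unicodedata

SHY = "\u00ad"

shy_name = unicodedata.name(SHY)
shy_category = unicodedata.category(SHY)
shy_latin1 = SHY.encode("latin-1")   # one byte, 0xAD
shy_utf8 = SHY.encode("utf-8")       # two bytes in UTF-8

print(shy_name, shy_category, shy_latin1, shy_utf8)
```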

            Cheers,

            guy038

              Sylvester Bullitt

              I think I found a quick fix, but first let me give some background on how this mess happened in the first place.

              Since I don’t have an easy way to enter diacritical marks with my keyboard, I had been using the Windows Charmap application to copy the é character (and many others) to the clipboard. Then I pasted them in a Windows Notepad document, and saved in UTF-8 encoding, so I could just copy them later whenever needed, and paste them to documents I was working on.

              The solution I found was to simply copy the é from Charmap again, then use NPP’s Find & Replace function to replace the faulty é (that is, multi-character version) with the one I just copied from Charmap.

              After making the changes in a test document, I ran the (unchanged) regular expression search again, and what do you know? All the false positives disappeared! In other words, once I removed Windows Notepad as the middleman, and copied the é to NPP, it worked as expected. My take on this: I’m guessing Windows Notepad is so old (even though I use Windows 10) it was never updated to correctly handle characters with diacritical marks. I hope (but doubt) somebody from Microsoft is reading this forum.

              Thanks for the insights!

                guy038

                Hello, @sylvester-bullitt and All,

                In this post, I’ll describe all the Windows Input methods which, with the combined use of the ALT key and the numeric keypad, allow you to enter any character with Unicode code-point between \x{0000} and \x{FFFF}, from :

                • The current Windows OEM Code page, used by your system

                • The current Windows ANSI Code page, used by your system

                • The Unicode Basic Multilingual Plane

                There are 4 Windows Input methods :


                The first TWO, and best-known, methods are :

                • ALT + a number n, from 001 to 255, writes the character, of code n, from the appropriate Windows OEM Code page, on your system

                  • Press the ALT key

                  • Type a number between 001 and 255, on your numeric keypad

                  • Release the ALT key


                • ALT + a number n, from 0001 to 0255, writes the character, of code n, from the appropriate Windows ANSI code-page, used, on your system, for any NON-Unicode program ( generally Windows-1252 ). You can, also, see that list, with all these characters, in Notepad++, by clicking on the menu option Edit > Character Panel

                  • Press the ALT key

                  • Hit the 0 key, FIRST, on your numeric keypad ( IMPORTANT )

                  • Then, type a number between 001 and 255, on your numeric keypad

                  • Release the ALT key
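The two methods above can be mimicked in Python by decoding a single byte with the matching code page. This is only a sketch: the function name `alt_code` is hypothetical, and the defaults `cp437` (OEM) and `cp1252` (ANSI) are assumptions, since the real code pages depend on the system.

```python
def alt_code(digits, oem_codec="cp437", ansi_codec="cp1252"):
    """Return the character an ALT + digits keypad sequence would produce.

    Four digits with a leading zero select the ANSI code page (second method),
    otherwise the OEM code page is used (first method). Both codec defaults
    are assumptions. Codes below 32 are not modelled faithfully here.
    """
    n = int(digits)
    if not 1 <= n <= 255:
        raise ValueError("ALT codes range from 1 to 255")
    use_ansi = len(digits) == 4 and digits.startswith("0")
    codec = ansi_codec if use_ansi else oem_codec
    # Bytes that the code page leaves undefined become U+FFFD instead of raising.
    return bytes([n]).decode(codec, errors="replace")


print(alt_code("130"))    # OEM code 130
print(alt_code("0233"))   # ANSI code 233
```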


                A third Windows input method, very little used, which works ONLY in a file with a Unicode encoding, is :

                • ALT + a number n, from 1 to 31, writes the old symbol of the Control character, of code n

                  • Press the ALT key

                  • Type a number between 1 and 31, on your numeric keypad, WITHOUT any leading zero !

                  • Release the ALT key

                => You’ll obtain the 31 following characters, below :

                ☺ ☻ ♥ ♦ ♣ ♠ • ◘ ○ ◙ ♂ ♀ ♪ ♫ ☼ ► ◄ ↕ ‼ ¶ § ▬ ↨ ↑ ↓ → ← ∟ ↔ ▲ ▼
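The list above amounts to a simple lookup table from code n to symbol. A minimal Python sketch, with the hypothetical helper `control_symbol` and the table copied from the line above:

```python
# The 31 symbols listed above, in order, for codes 1 to 31.
CONTROL_SYMBOLS = "☺☻♥♦♣♠•◘○◙♂♀♪♫☼►◄↕‼¶§▬↨↑↓→←∟↔▲▼"


def control_symbol(n):
    """Return the symbol written by ALT + n (no leading zero), for 1 <= n <= 31."""
    if not 1 <= n <= 31:
        raise ValueError("n must be between 1 and 31")
    return CONTROL_SYMBOLS[n - 1]


for n in (1, 3, 31):
    symbol = control_symbol(n)
    print(n, symbol, f"U+{ord(symbol):04X}")
```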
                

                A fourth and powerful Windows input method can be obtained, after creating a new registry entry, on your system :

                • ALT + the + sign + a hexadecimal number n, from 0000 to FFFF, writes the character, of code-point n, from the Basic Multilingual Plane

                  • Hold down the ALT key

                  • Type the + key, on the NUMERIC keypad

                  • Type the hexadecimal code-point of the character, using the 0 to 9 keys, on the numeric keypad AND/OR the normal A to F keys, of the alphanumeric keyboard

                  • Release the ALT key

                • Note that this fourth input method cannot write any character with code-point over the BMP, so between \x{10000} and \x{10FFFF} !
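The conversion behind this fourth method is just "hexadecimal digits to code point", limited to the BMP. A small Python sketch of that idea, where `hex_numpad_char` is a made-up name and the BMP check mirrors the note above:

```python
def hex_numpad_char(hex_digits):
    """Return the character for ALT + '+' + hex_digits, limited to the BMP."""
    code_point = int(hex_digits, 16)
    if code_point > 0xFFFF:
        raise ValueError("code point is outside the Basic Multilingual Plane")
    return chr(code_point)


print(repr(hex_numpad_char("2014")))   # EM DASH
print(repr(hex_numpad_char("00AD")))   # SOFT HYPHEN
```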


                As said above, in order to be able to use this fourth Input method, right above, you must modify the registry :

                • Run the application regedit.exe

                • Preferably, backup all your registry, first

                • Move to the HKEY_CURRENT_USER\Control Panel\Input Method location

                • Create a new REG_SZ entry, named EnableHexNumpad

                • Enter, as data, the value 1

                • Confirm the dialog

                • Close the registry editor

                • Re-start your system or simply, log Off/On, from Windows 7 and above
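For those who prefer a script to regedit, the steps above could be sketched with Python's standard `winreg` module, which exists on Windows only. The function name `enable_hex_numpad` is hypothetical, the function is deliberately not called here, and a registry backup is still advisable before running it.

```python
KEY_PATH = r"Control Panel\Input Method"
VALUE_NAME = "EnableHexNumpad"


def enable_hex_numpad():
    """Create the REG_SZ value EnableHexNumpad = "1" under HKEY_CURRENT_USER."""
    import winreg  # standard library, available on Windows only

    # CreateKey opens the key if it already exists.
    with winreg.CreateKey(winreg.HKEY_CURRENT_USER, KEY_PATH) as key:
        winreg.SetValueEx(key, VALUE_NAME, 0, winreg.REG_SZ, "1")
```

Calling `enable_hex_numpad()` once on Windows, then logging off and on, should correspond to the manual procedure.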


                For instance, if you want to write the EM DASH character, of Unicode code-point \x{2014} and with code = 151, in the Windows-1252 encoding, two solutions are possible :

                • Hit ALT and successively, 0, 1, 5, 1, on your numeric keypad ( Second Input method )

                • Hit ALT and successively, +, 2, 0, 1, 4, on your numeric keypad ( Fourth Input method )


                For some additional examples of the 4th Input method, refer to the end of this post :

                https://notepad-plus-plus.org/community/topic/11962/alt-codes-not-working/5

                In that post, note that the present 4th Input method is named the 3rd Input method !


                Of course, it’s always better to work with documents with a Unicode encoding :

                • The UTF-8 and UTF-8-BOM encodings allow you to store any character, from \x{0000} to \x{10FFFF}, so all characters of any Unicode plane, from 0 to 16

                • The UCS-2 BE BOM and UCS-2 LE BOM encodings allow you to store any character from \x{0000} to \x{FFFF} only, so all characters of the Unicode Plane 0, also named the Basic Multilingual Plane ( BMP )
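The difference between the two families can be illustrated in Python. Python has no UCS-2 codec, so this sketch simply checks whether every code point fits in 16 bits; the helper name `fits_ucs2` and the two sample characters are chosen for illustration.

```python
def fits_ucs2(text):
    """True if every character lies in the BMP, i.e. is storable as UCS-2."""
    return all(ord(ch) <= 0xFFFF for ch in text)


em_dash = "\u2014"       # BMP character
emoji = "\U0001F600"     # character from the Emoticons block, outside the BMP

print(fits_ucs2(em_dash), em_dash.encode("utf-8"))
print(fits_ucs2(emoji), emoji.encode("utf-8"))

# UTF-16 needs a surrogate pair (two 16-bit units) for the non-BMP character.
print(len(emoji.encode("utf-16-le")))
```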

                But, the most important thing is that your current font, used in N++, is able to display all the glyphs of these numerous characters. Traditional mono-spaced fonts, such as Courier New or Consolas, display Latin, Greek and Cyrillic letters and general symbols but lack a great number of Unicode characters !

                I own the Symbola Monospacified for Liberation Mono font, a monospaced font which contains 9,622 characters and 9,827 glyphs and can manage all diacritical marks.

                Note that this font does not contain the Arabic, Hebrew, Asiatic and Japanese Unicode scripts, but contains, in addition to all European scripts, Punctuation, Mathematical, Arrows, Technical, Dingbats, Emoticons, Pictographs scripts and many others, as listed below :

                    •-------------------------------------------------------------•-------------------------------•
                    |             Unicode 11.0 Block            |      Range      | In font | In block | Complete |
                    •-------------------------------------------•-----------------•---------•----------•----------•
                    |  Basic Latin                              |   0000 -  007F  |    128  |     128  |          |
                    |  Latin-1 Supplement                       |   0080 -  00FF  |    128  |     128  |          |
                    •-------------------------------------------•-----------------•---------•----------•----------•
                    |  Latin Extended-A                         |   0100 -  017F  |    128  |     128  |          |
                    |  Latin Extended-B                         |   0180 -  024F  |    208  |     208  |          |
                    |  IPA Extensions                           |   0250 -  02AF  |     96  |      96  |          |
                    |  Spacing Modifier Letters                 |   02B0 -  02FF  |     80  |      80  |          |
                    |  Combining Diacritical Marks              |   0300 -  036F  |    112  |     112  |          |
                    |  Greek and Coptic                         |   0370 -  03FF  |    135  |     135  |          |
                    |  Cyrillic                                 |   0400 -  04FF  |    256  |     256  |          |
                    |  Cyrillic Supplement                      |   0500 -  052F  |     48  |      48  |          |
                    |  Combining Diacritical Marks Extended     |   1AB0 -  1AFF  |     15  |      15  |          |
                    |  Cyrillic Extended-C                      |   1C80 -  1C8F  |      9  |       9  |          |
                    |  Phonetic Extensions                      |   1D00 -  1D7F  |    128  |     128  |          |
                    |  Phonetic Extensions Supplement           |   1D80 -  1DBF  |     64  |      64  |          |
                    |  Combining Diacritical Marks Supplement   |   1DC0 -  1DFF  |     63  |      63  |          |
                    |  Latin Extended Additional                |   1E00 -  1EFF  |    256  |     256  |          |
                    |  Greek Extended                           |   1F00 -  1FFF  |    233  |     233  |          |
                    |  General Punctuation                      |   2000 -  206F  |    111  |     111  |          |
                    |  Superscripts and Subscripts              |   2070 -  209F  |     42  |      42  |          |
                    |  Currency Symbols                         |   20A0 -  20CF  |     32  |      32  |          |
                    |  Combining Diacritical Marks for Symbols  |   20D0 -  20FF  |     33  |      33  |          |
                    |  Letterlike Symbols                       |   2100 -  214F  |     80  |      80  |          |
                    |  Number Forms                             |   2150 -  218F  |     60  |      60  |          |
                    |  Arrows                                   |   2190 -  21FF  |    112  |     112  |          |
                    |  Mathematical Operators                   |   2200 -  22FF  |    256  |     256  |          |
                    |  Miscellaneous Technical                  |   2300 -  23FF  |    256  |     256  |          |
                    |  Control Pictures                         |   2400 -  243F  |     39  |      39  |          |
                    |  Optical Character Recognition            |   2440 -  245F  |     11  |      11  |          |
                    |  Enclosed Alphanumerics                   |   2460 -  24FF  |    160  |     160  |          |
                    |  Box Drawing                              |   2500 -  257F  |    128  |     128  |          |
                    |  Block Elements                           |   2580 -  259F  |     32  |      32  |          |
                    |  Geometric Shapes                         |   25A0 -  25FF  |     96  |      96  |          |
                    |  Miscellaneous Symbols                    |   2600 -  26FF  |    256  |     256  |          |
                    |  Dingbats                                 |   2700 -  27BF  |    192  |     192  |          |
                    |  Miscellaneous Mathematical Symbols-A     |   27C0 -  27EF  |     48  |      48  |          |
                    |  Supplemental Arrows-A                    |   27F0 -  27FF  |     16  |      16  |          |
                    |  Braille Patterns                         |   2800 -  28FF  |    256  |     256  |          |
                    |  Supplemental Arrows-B                    |   2900 -  297F  |    128  |     128  |          |
                    |  Miscellaneous Mathematical Symbols-B     |   2980 -  29FF  |    128  |     128  |          |
                    |  Supplemental Mathematical Operators      |   2A00 -  2AFF  |    256  |     256  |          |
                    •-------------------------------------------•-----------------•---------•----------•----------•
                    |  Miscellaneous Symbols and Arrows         |   2B00 -  2BFF  |    207  |     250  |    No    |
                    •-------------------------------------------•-----------------•---------•----------•----------•
                    |  Latin Extended-C                         |   2C60 -  2C7F  |     32  |      32  |          |
                    |  Coptic                                   |   2C80 -  2CFF  |    123  |     123  |          |
                    |  Cyrillic Extended-A                      |   2DE0 -  2DFF  |     32  |      32  |          |
                    •-------------------------------------------•-----------------•---------•----------•----------•
                    |  Supplemental Punctuation                 |   2E00 -  2E7F  |     74  |      79  |    No    |
                    •-------------------------------------------•-----------------•---------•----------•----------•
                    |  Yijing Hexagram Symbols                  |   4DC0 -  4DFF  |     64  |      64  |          |
                    |  Cyrillic Extended-B                      |   A640 -  A69F  |     96  |      96  |          |
                    •-------------------------------------------•-----------------•---------•----------•----------•
                    |  Latin Extended-D                         |   A720 -  A7FF  |    160  |     163  |    No    |
                    •-------------------------------------------•-----------------•---------•----------•----------•
                    |  Latin Extended-E                         |   AB30 -  AB6F  |     54  |      54  |          |
                    |  Variation Selectors                      |   FE00 -  FE0F  |     16  |      16  |          |
                    |  Combining Half Marks                     |   FE20 -  FE2F  |     16  |      16  |          |
                    |  Specials                                 |   FFF0 -  FFFF  |      5  |       5  |          |
                    |                                           |                 |         |          |          |
                    |  Aegean Numbers                           |  10100 - 1013F  |     57  |      57  |          |
                    |  Ancient Greek Numbers                    |  10140 - 1018F  |     79  |      79  |          |
                    |  Ancient Symbols                          |  10190 - 101CF  |     13  |      13  |          |
                    |  Phaistos Disc                            |  101D0 - 101FF  |     46  |      46  |          |
                    |  Coptic Epact Numbers                     |  102E0 - 102FF  |     28  |      28  |          |
                    |  Byzantine Musical Symbols                |  1D000 - 1D0FF  |    246  |     246  |          |
                    |  Musical Symbols                          |  1D100 - 1D1FF  |    231  |     231  |          |
                    |  Ancient Greek Musical Notation           |  1D200 - 1D24F  |     70  |      70  |          |
                    |  Tai Xuan Jing Symbols                    |  1D300 - 1D35F  |     87  |      87  |          |
                    •-------------------------------------------•-----------------•---------•----------•----------•
                    |  Counting Rod Numerals                    |  1D360 - 1D37F  |     18  |      25  |    No    |
                    •-------------------------------------------•-----------------•---------•----------•----------•
                    |  Mathematical Alphanumeric Symbols        |  1D400 - 1D7FF  |    996  |     996  |          |
                    |  Mahjong Tiles                            |  1F000 - 1F02F  |     44  |      44  |          |
                    |  Domino Tiles                             |  1F030 - 1F09F  |    100  |     100  |          |
                    |  Playing Cards                            |  1F0A0 - 1F0FF  |     82  |      82  |          |
                    •-------------------------------------------•-----------------•---------•----------•----------•
                    |  Enclosed Alphanumeric Supplement         |  1F100 - 1F1FF  |    191  |     192  |    No    |
                    •-------------------------------------------•-----------------•---------•----------•----------•
                    |  Enclosed Ideographic Supplement          |  1F200 - 1F2FF  |     64  |      64  |          |
                    |  Miscellaneous Symbols and Pictographs    |  1F300 - 1F5FF  |    768  |     768  |          |
                    |  Emoticons                                |  1F600 - 1F64F  |     80  |      80  |          |
                    |  Ornamental Dingbats                      |  1F650 - 1F67F  |     48  |      48  |          |
                    •-------------------------------------------•-----------------•---------•----------•----------•
                    |  Transport and Map Symbols                |  1F680 - 1F6FF  |    107  |     108  |    No    |
                    •-------------------------------------------•-----------------•---------•----------•----------•
                    |  Alchemical Symbols                       |  1F700 - 1F77F  |    116  |     116  |          |
                    •-------------------------------------------•-----------------•---------•----------•----------•
                    |  Geometric Shapes Extended                |  1F780 - 1F7FF  |     85  |      89  |    No    |
                    •-------------------------------------------•-----------------•---------•----------•----------•
                    |  Supplemental Arrows-C                    |  1F800 - 1F8FF  |    148  |     148  |          |
                    •-------------------------------------------•-----------------•---------•----------•----------•
                    |  Supplemental Symbols and Pictographs     |  1F900 - 1F9FF  |    148  |     213  |    No    |
                    |  Supplementary Private Use Area-A         |  F0000 - FFFFF  |    118  |  65,534  |    No    |
                    •-------------------------------------------•-----------------•---------•----------•----------•
                

                If you want, I may send you the Symbola Monospacified for Liberation Mono font, by e-mail and I could add the complete list of characters, handled by that font.

                Once installed on your system, you could use it, within N++, as the global default font, for instance.

                My e-mail address is

                Best Regards,

                guy038

                  Alan Kilborn @guy038

                  @guy038 said in Regex Misidentifying Foreign Characters:

                  numeric keypad

                  Any good advice for these techniques for those of us that prefer a keyboard without a numeric keypad (for use on cramped desktops)? :-)

                    guy038

                    Hello, @sylvester-bullitt, @alan-kilborn and All,

                    Alan, good question ;-)) Personally, my old NEC 350 laptop does not have a numeric keypad. So, I’ve got a standard USB keyboard ( 105 keys ) permanently plugged into the laptop !

                    When the Caps Lock key is set, my laptop’s French keyboard looks like, below :

                    1234567890°+
                    AZERTYUIOP^£
                    QSDFGHJKLM%
                    >WXCVBN?./§

                    And if I want to use the pseudo-numeric keypad, I just hit the Num Lock key and the keyboard is then changed as below :

                    123456789*°+
                    AZERTY456-^£
                    QSDFGH123+%
                    >WXCVBN0../

                    So :

                    • The keys 7890 are mapped to keys 789*

                    • The keys UIOP are mapped to keys 456-

                    • The keys JKLM are mapped to keys 123+

                    • The keys ?/§ are mapped to keys 0./
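Read the other way round, this mapping tells which physical key to press for each pseudo-keypad character. A small Python sketch of that lookup, specific to the French laptop layout described above, with the hypothetical helper `keys_for`:

```python
# Pseudo-keypad character -> physical key, taken from the four mappings above.
KEYPAD_TO_KEY = {
    "7": "7", "8": "8", "9": "9", "*": "0",
    "4": "U", "5": "I", "6": "O", "-": "P",
    "1": "J", "2": "K", "3": "L", "+": "M",
    "0": "?", ".": "/", "/": "§",
}


def keys_for(sequence):
    """Return the physical keys to press, with Num Lock on, for a keypad sequence.

    Characters that are not remapped (such as the hex letters A to F) pass through.
    """
    return "".join(KEYPAD_TO_KEY.get(ch, ch) for ch in sequence)


print(keys_for("0151"))    # second Input method, EM DASH
print(keys_for("+2014"))   # fourth Input method, EM DASH
```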

                    As the A, B, C, D, E and F keys are mapped to their default, I’m always able, even without any additional keyboard (in case of travel, for instance), to use, in conjunction with the Alt key, all the Input methods, described in my previous post ;-))

                    Unfortunately, I don’t use any new mini-laptop, with a special keyboard layout, so I cannot tell anything else about this subject :-((. Even the laptop of my wife has a physical keypad !

                    So, I’m sorry : without the hardware at hand, it’s impossible for me to give pertinent clues about the way to handle these Windows Input methods with atypical keyboard configurations !

                    Best Regards,

                    guy038
