Regex Misidentifying Foreign Characters
-
I’m trying to use the NPP regex search to find soft hyphens (ISO 8859: 0xAD, Unicode U+00AD SOFT HYPHEN) next to non-word characters, for diagnostic purposes. Here’s the regular expression I’m using (NOTA BENE: The browser hides soft hyphens, so you can’t see them here. For an example of search text with embedded soft hyphens, look at the source code of the Web page at http://www.hymntime.com/tch/bio/r/i/v/i/rivier_jft.htm). In this regex, there is one soft hyphen before and one after the vertical bar character (the regex “OR” character).
[^\w]|[^\w]
Searching the text below using the regex above gives false positives (again, the browser makes the soft hyphens invisible; you can see examples of soft hyphens being used at the URL given above).
<body> <p lang="fr">Médiateur de l’Ancienne</p> </body>
Although the browser hides soft hyphens, there is one before the letter ‘d’ in the word Médiateur. The regex reports that the soft hyphen follows a non-word character (in this case, the accented e, é).
I tried other regexes, like the one below, but still get the false positives (again one soft hyphen before and after the vertical bar):
[^[:alpha:]æáéôöüúíîß]|[^[:alpha:]æáéôöüúíîß]
What am I doing wrong? Is this a bug in the regex engine?
-
Hello, @sylvester-bullitt and All,
No, your regex does not give false positives! Your regex can be written:
[^\w]\xAD|\xAD[^\w]
And, indeed, it matches as expected:
- A non-word char, followed by a soft hyphen \xAD
- A soft hyphen \xAD, followed by a non-word char
Assuming your HTML example, below :
<body> <p lang="fr">Médiateur de l’Ancienne</p> </body>
- Place your caret right before the upper-case M
- Press the Right Arrow key => The caret is now located right after the letter M
- Press the Right Arrow key again => The caret seems to be in the middle of the letter é !!
- Press the Right Arrow key again => The caret is now right after the letter é
Why this behaviour? Simply because, after the letter M, there is no é letter at all, but 2 characters:
- The letter e, of Unicode code-point \x{0065}, followed by
- The COMBINING ACUTE ACCENT, of Unicode code-point \x{0301}, which is a character from the Combining Diacritical Marks Unicode block, in range [\x{0300}-\x{036F}] !
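To see those separate characters outside Notepad++, here is a minimal Python sketch. The `sample` string and the `describe` helper are my own illustrative names, not something from the posts above; the string is rebuilt by hand from the code-points just listed.

```python
import unicodedata

# "M", "e", COMBINING ACUTE ACCENT, SOFT HYPHEN, "d": rebuilt by hand from the
# code-points discussed in this thread (hypothetical sample, not the real page).
sample = "Me\u0301\u00ADd"

def describe(text):
    """Return (code-point, general category, Unicode name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

for row in describe(sample):
    print(*row)
```

The accent shows up as its own character with category `Mn` (non-spacing mark), which is why a class that only accepts letters treats it as a non-word char.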
Refer to all these characters of that block, below :
http://www.unicode.org/charts/PDF/U0300.pdf
Note that any character (not only vowels!) may carry any number of additional diacritical marks, which all apply to the base char!
IMPORTANT: this requires that your current font is able to draw all these diacritical marks. Otherwise, your system will probably make a font substitution or use a fallback font in order to display such glyphs, usually less accurately!
For instance, the accented character p̸͚̀͟͠, based on the lowercase letter p, can be found with the regex:
p\x{0300}\x{0338}\x{035A}\x{035F}\x{0360}
because:
- I first wrote the lower-case letter p
- I added the diacritical mark COMBINING GRAVE ACCENT ( \x{0300} )
- I added the diacritical mark COMBINING LONG SOLIDUS OVERLAY ( \x{0338} )
- I added the diacritical mark COMBINING DOUBLE RING BELOW ( \x{035A} )
- I added the diacritical mark COMBINING DOUBLE MACRON BELOW ( \x{035F} )
- I added the diacritical mark COMBINING DOUBLE TILDE ( \x{0360} )
Any other combination will not work, of course !
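The point about the order of the marks can be checked with a small Python sketch. Note that Python's `re` module writes code-points as `\uXXXX` instead of `\x{XXXX}`, and that its engine is not the one used by Notepad++, so take this as an illustration only; `in_order`, `swapped` and `matches` are made-up names.

```python
import re

# Same sequence as the regex above, in Python's \uXXXX notation.
pattern = re.compile(r"p\u0300\u0338\u035A\u035F\u0360")

# The base letter followed by the five marks, in the order they were added.
in_order = "p\u0300\u0338\u035A\u035F\u0360"
# Same marks, but with the first two exchanged.
swapped = "p\u0338\u0300\u035A\u035F\u0360"

def matches(text):
    """True if the exact mark sequence is found somewhere in text."""
    return pattern.search(text) is not None

print(matches(in_order), matches(swapped))
```

Both strings may look the same on screen, yet only the first one is matched, because a regex compares code-points and not rendered glyphs.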
So, regarding your example, the part beginning with the letter M and ending right before the letter d contains, in fact, 4 characters, which can be found with the regex:
Me\x{0301}\xAD
And, with your regex
[^\w]\xAD|\xAD[^\w]
the first alternative does match a non-word char ( \x{0301} ), followed by a soft hyphen ( \xAD )
So, a regex like
(\w[\x{0300}-\x{036F}]*)+
would match any word made of word characters, each possibly followed by diacritical mark chars!
If we change your regex in order to include the Combining Diacritical Marks range in the negated character class, an easy solution could be:
[^\w\x{0300}-\x{036F}]\xAD|\xAD[^\w\x{0300}-\x{036F}]
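For readers who want to compare the two searches outside Notepad++, here is a rough Python sketch. It assumes that Python's `\w` treats combining marks as non-word characters, as the Notepad++ engine does in this thread; the two engines are different, so this is only an approximation. The `text` string is a hand-built stand-in for the HTML example, and `hits` is a hypothetical helper.

```python
import re

# Hand-built stand-in for the example: "e" + combining acute + soft hyphen.
text = '<p lang="fr">Me\u0301\u00ADdiateur de l\u2019Ancienne</p>'

# The regex from the question, with the soft hyphens written as \xAD.
original = re.compile(r"[^\w]\xAD|\xAD[^\w]")
# Same regex, with the Combining Diacritical Marks block excluded as well.
adjusted = re.compile(r"[^\w\u0300-\u036F]\xAD|\xAD[^\w\u0300-\u036F]")

def hits(pattern, s):
    """Return every substring matched by pattern in s."""
    return [m.group(0) for m in pattern.finditer(s)]

print(hits(original, text))   # the accent + soft hyphen pair is reported
print(hits(adjusted, text))   # nothing is reported
```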
However, this new regex does not find any occurrence in your text, below :
http://www.hymntime.com/tch/bio/r/i/v/i/rivier_jft.htm
This could be your expected result ;-))
Note that it could be wise to run a regex S/R that replaces each base letter + combining mark sequence with the single, equivalent precomposed character!
Referring to your example, the S/R :
SEARCH
(?-i)e\x{0301}
REPLACE
\x{00E9}
And you get this time :
<body> <p lang="fr">Médiateur de l’Ancienne</p> </body>
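Outside Notepad++, the same clean-up can be sketched in Python with Unicode normalization. This is a swapped-in technique, not the S/R above: NFC composes every base letter + combining mark pair that has a precomposed form, not only e + acute. The variable names are illustrative.

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT, as in the faulty text.
decomposed = "Me\u0301diateur"

# NFC normalization composes base + mark pairs into precomposed characters.
composed = unicodedata.normalize("NFC", decomposed)

# The explicit replacement from the S/R above, for comparison.
replaced = decomposed.replace("e\u0301", "\u00E9")

print(len(decomposed), len(composed), composed == replaced)
```

The decomposed word has 10 characters and the composed one 9, and both approaches give the same string for this example.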
Remark: use the option Edit > Character Panel to get the right hexadecimal code of the replacement character!
On the other hand, in order to search for any individual Combining Diacritical Mark, preferably use the regex
[\x{0300}-\x{036F}]
and the Mark feature ;-))
Best Regards,
guy038
-
-
@guy038
Wow. Sounds like I’ve got a lot to learn. Will start digging in to this tomorrow.
Thanks for taking the time to give such a detailed answer on such an arcane topic!
-
Hi, @sylvester-bullitt,
You may also see all the characters of the N++ Character Panel with code-point over \x{007f} in the C1 Controls and Latin-1 Supplement Unicode block, below:
http://www.unicode.org/charts/PDF/U0080.pdf
You’ll notice the soft hyphen character, named SHY, of code-point \x{00AD}
Cheers,
guy038
-
I think I found a quick fix, but first let me give some background on how this mess happened in the first place.
Since I don’t have an easy way to enter diacritical marks with my keyboard, I had been using the Windows Charmap application to copy the é character (and many others) to the clipboard. Then I pasted them in a Windows Notepad document, and saved in UTF-8 encoding, so I could just copy them later whenever needed, and paste them to documents I was working on.
The solution I found was to simply copy the é from Charmap again, then use NPP’s Find & Replace function to replace the faulty é (that is, multi-character version) with the one I just copied from Charmap.
After making the changes in a test document, I ran the (unchanged) regular expression search again, and what do you know? All the false positives disappeared! In other words, once I removed Windows Notepad as the middleman, and copied the é to NPP, it worked as expected. My take on this: I’m guessing Windows Notepad is so old (even though I use Windows 10) it was never updated to correctly handle characters with diacritical marks. I hope (but doubt) somebody from Microsoft is reading this forum.
Thanks for the insights!
-
Hello, @sylvester-bullitt and All,
In this post, I’ll describe all the Windows input methods which, with the combined use of the ALT key and the numeric keypad, allow you to enter any character with Unicode code-point between \x{0000} and \x{FFFF}, from:
- The current Windows OEM code page, used by your system
- The current Windows ANSI code page, used by your system
- The Unicode Basic Multilingual Plane
There are 4 Windows input methods:
The first TWO, best-known methods are:
- ALT + a number n, from 001 to 255, writes the character of code n from the appropriate Windows OEM code page on your system
  - Hold down the ALT key
  - Type a number between 001 and 255 on your numeric keypad
  - Release the ALT key
- ALT + a number n, from 0001 to 0255, writes the character of code n from the appropriate Windows ANSI code page used, on your system, for any NON-Unicode program (generally Windows-1252). You can also see the list of all these characters in Notepad++, by clicking on the menu option Edit > Character Panel
  - Hold down the ALT key
  - Hit the 0 key FIRST, on your numeric keypad (IMPORTANT)
  - Then, type a number between 001 and 255 on your numeric keypad
  - Release the ALT key
A third Windows input method, very little used, which works ONLY in a file with a Unicode encoding, is:
- ALT + a number n, from 1 to 31, writes the old symbol of the control character of code n
  - Hold down the ALT key
  - Type a number between 1 and 31 on your numeric keypad, WITHOUT any leading zero!
  - Release the ALT key
=> You’ll obtain the 31 following characters:
☺ ☻ ♥ ♦ ♣ ♠ • ◘ ○ ◙ ♂ ♀ ♪ ♫ ☼ ► ◄ ↕ ‼ ¶ § ▬ ↨ ↑ ↓ → ← ∟ ↔ ▲ ▼
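As a quick reference, the list above can be turned into a lookup from the typed number to its symbol. This is only a sketch built from the 31 characters listed in this post, assuming they are given in order from 1 to 31; `GLYPHS` and `alt_symbol` are hypothetical names.

```python
# The 31 symbols listed above, assumed to be in order for ALT + 1 .. ALT + 31.
GLYPHS = "☺ ☻ ♥ ♦ ♣ ♠ • ◘ ○ ◙ ♂ ♀ ♪ ♫ ☼ ► ◄ ↕ ‼ ¶ § ▬ ↨ ↑ ↓ → ← ∟ ↔ ▲ ▼".split()

def alt_symbol(n):
    """Return the symbol expected from ALT + n (no leading zero), for 1 <= n <= 31."""
    if not 1 <= n <= 31:
        raise ValueError("this input method only covers the numbers 1 to 31")
    return GLYPHS[n - 1]

print(alt_symbol(3), alt_symbol(20), alt_symbol(21))
```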
A fourth, powerful Windows input method becomes available after creating a new registry entry on your system:
- ALT + the + sign + a hexadecimal number n, from 0000 to FFFF, writes the character of code-point n from the Basic Multilingual Plane
  - Hold down the ALT key
  - Type the + key on the NUMERIC keypad
  - Type the hexadecimal code-point of the character, using the 0 to 9 keys on the numeric keypad AND/OR the normal A to F keys of the alphanumeric keyboard
  - Release the ALT key
Note that this fourth input method cannot write any character with a code-point beyond the BMP, that is, between \x{10000} and \x{10FFFF} !
As said above, in order to be able to use this fourth input method, you must modify the registry:
- Run the application regedit.exe
- Preferably, back up your whole registry first
- Move to the HKEY_CURRENT_USER\Control Panel\Input Method location
- Create a new REG_SZ entry, named EnableHexNumpad
- Enter, as data, the value 1
- Validate the dialog
- Close the registry editor
- Restart your system or, from Windows 7 onwards, simply log off and on again
For instance, if you want to write the EM DASH character, of Unicode code-point \x{2014} and of code 151 in the Windows-1252 encoding, two solutions are possible:
- Hold ALT and hit, successively, 0, 1, 5, 1 on your numeric keypad (second input method)
- Hold ALT and hit, successively, +, 2, 0, 1, 4 on your numeric keypad (fourth input method)
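To double-check which character a given code should produce, the code page lookups can be imitated in Python. This is a sketch under assumptions: `cp437` stands in for the OEM code page and `cp1252` for the ANSI one, but the real pages depend on your Windows settings. The function names are made up, and codes below 32 are left out because Python decodes them as control characters.

```python
def from_code_page(n, codepage):
    """Character stored at position n (32..255) of the given 8-bit code page."""
    return bytes([n]).decode(codepage)

def from_hex(hex_digits):
    """Character whose Unicode code-point is given in hexadecimal."""
    return chr(int(hex_digits, 16))

# Second method: ALT + 0151, looked up in Windows-1252 (assumed ANSI page).
print(from_code_page(151, "cp1252"))
# Fourth method: ALT + "+" + 2014, a direct Unicode code-point.
print(from_hex("2014"))
# First method: ALT + 130, looked up in code page 437 (assumed OEM page).
print(from_code_page(130, "cp437"))
```

The first two calls both give the EM DASH, which matches the two key sequences above.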
For some additional examples of the 4th input method, refer to the end of this post:
https://notepad-plus-plus.org/community/topic/11962/alt-codes-not-working/5
In that post, note that the present 4th input method is named the 3rd input method!
Of course, it’s always better to work with documents with a Unicode encoding:
- The UTF-8 and UTF-8-BOM encodings allow you to store any character from \x{0000} to \x{10FFFF}, so all characters of any Unicode plane, from 0 to 16
- The UCS-2 BE BOM and UCS-2 LE BOM encodings allow you to store any character from \x{0000} to \x{FFFF} only, so all characters of Unicode Plane 0, also named the Basic Multilingual Plane (BMP)
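The difference between the two families can be illustrated with a short Python sketch. Python has no strict UCS-2 codec, so `utf-16-le` is used here only to show that a character beyond the BMP needs two 16-bit units; `fits_in_bmp` is a hypothetical helper.

```python
def fits_in_bmp(text):
    """True if every character has a code-point of at most U+FFFF."""
    return all(ord(ch) <= 0xFFFF for ch in text)

em_dash = "\u2014"        # EM DASH, inside the BMP
emoticon = "\U0001F600"   # a character from the Emoticons block, beyond the BMP

for ch in (em_dash, emoticon):
    print(
        f"U+{ord(ch):04X}",
        fits_in_bmp(ch),
        len(ch.encode("utf-8")),       # number of UTF-8 bytes
        len(ch.encode("utf-16-le")),   # 2 bytes per 16-bit unit
    )
```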
But the most important thing is that the current font used in N++ is able to display all the glyphs of these numerous characters. Traditional monospaced fonts, such as Courier New or Consolas, display Latin, Greek and Cyrillic letters and general symbols but lack a great number of Unicode characters!
I own the Symbola Monospacified for Liberation Mono font, a monospaced font which contains 9,622 characters and 9,827 glyphs and can manage all diacritical marks.
Note that this font does not contain the Arabic, Hebrew, Asiatic and Japanese Unicode scripts but contains, in addition to all European scripts, Punctuation, Mathematical, Arrows, Technical, Dingbats, Emoticons, Pictographs scripts and many others, as listed below:
•-------------------------------------------------------------•-------------------------------•
| Unicode 11.0 Block                        |      Range      |  Chars  |  Total   | Complete |
•-------------------------------------------•-----------------•---------•----------•----------•
| Basic Latin                               |   0000 - 007F   |   128   |    128   |          |
| Latin-1 Supplement                        |   0080 - 00FF   |   128   |    128   |          |
•-------------------------------------------•-----------------•---------•----------•----------•
| Latin Extended-A                          |   0100 - 017F   |   128   |    128   |          |
| Latin Extended-B                          |   0180 - 024F   |   208   |    208   |          |
| IPA Extensions                            |   0250 - 02AF   |    96   |     96   |          |
| Spacing Modifier Letters                  |   02B0 - 02FF   |    80   |     80   |          |
| Combining Diacritical Marks               |   0300 - 036F   |   112   |    112   |          |
| Greek and Coptic                          |   0370 - 03FF   |   135   |    135   |          |
| Cyrillic                                  |   0400 - 04FF   |   256   |    256   |          |
| Cyrillic Supplement                       |   0500 - 052F   |    48   |     48   |          |
| Combining Diacritical Marks Extended      |   1AB0 - 1AFF   |    15   |     15   |          |
| Cyrillic Extended-C                       |   1C80 - 1C8F   |     9   |      9   |          |
| Phonetic Extensions                       |   1D00 - 1D7F   |   128   |    128   |          |
| Phonetic Extensions Supplement            |   1D80 - 1DBF   |    64   |     64   |          |
| Combining Diacritical Marks Supplement    |   1DC0 - 1DFF   |    63   |     63   |          |
| Latin Extended Additional                 |   1E00 - 1EFF   |   256   |    256   |          |
| Greek Extended                            |   1F00 - 1FFF   |   233   |    233   |          |
| General Punctuation                       |   2000 - 206F   |   111   |    111   |          |
| Superscripts and Subscripts               |   2070 - 209F   |    42   |     42   |          |
| Currency Symbols                          |   20A0 - 20CF   |    32   |     32   |          |
| Combining Diacritical Marks for Symbols   |   20D0 - 20FF   |    33   |     33   |          |
| Letterlike Symbols                        |   2100 - 214F   |    80   |     80   |          |
| Number Forms                              |   2150 - 218F   |    60   |     60   |          |
| Arrows                                    |   2190 - 21FF   |   112   |    112   |          |
| Mathematical Operators                    |   2200 - 22FF   |   256   |    256   |          |
| Miscellaneous Technical                   |   2300 - 23FF   |   256   |    256   |          |
| Control Pictures                          |   2400 - 243F   |    39   |     39   |          |
| Optical Character Recognition             |   2440 - 245F   |    11   |     11   |          |
| Enclosed Alphanumerics                    |   2460 - 24FF   |   160   |    160   |          |
| Box Drawing                               |   2500 - 257F   |   128   |    128   |          |
| Block Elements                            |   2580 - 259F   |    32   |     32   |          |
| Geometric Shapes                          |   25A0 - 25FF   |    96   |     96   |          |
| Miscellaneous Symbols                     |   2600 - 26FF   |   256   |    256   |          |
| Dingbats                                  |   2700 - 27BF   |   192   |    192   |          |
| Miscellaneous Mathematical Symbols-A      |   27C0 - 27EF   |    48   |     48   |          |
| Supplemental Arrows-A                     |   27F0 - 27FF   |    16   |     16   |          |
| Braille Patterns                          |   2800 - 28FF   |   256   |    256   |          |
| Supplemental Arrows-B                     |   2900 - 297F   |   128   |    128   |          |
| Miscellaneous Mathematical Symbols-B      |   2980 - 29FF   |   128   |    128   |          |
| Supplemental Mathematical Operators       |   2A00 - 2AFF   |   256   |    256   |          |
•-------------------------------------------•-----------------•---------•----------•----------•
| Miscellaneous Symbols and Arrows          |   2B00 - 2BFF   |   207   |    250   |    No    |
•-------------------------------------------•-----------------•---------•----------•----------•
| Latin Extended-C                          |   2C60 - 2C7F   |    32   |     32   |          |
| Coptic                                    |   2C80 - 2CFF   |   123   |    123   |          |
| Cyrillic Extended-A                       |   2DE0 - 2DFF   |    32   |     32   |          |
•-------------------------------------------•-----------------•---------•----------•----------•
| Supplemental Punctuation                  |   2E00 - 2E7F   |    74   |     79   |    No    |
•-------------------------------------------•-----------------•---------•----------•----------•
| Yijing Hexagram Symbols                   |   4DC0 - 4DFF   |    64   |     64   |          |
| Cyrillic Extended-B                       |   A640 - A69F   |    96   |     96   |          |
•-------------------------------------------•-----------------•---------•----------•----------•
| Latin Extended-D                          |   A720 - A7FF   |   160   |    163   |    No    |
•-------------------------------------------•-----------------•---------•----------•----------•
| Latin Extended-E                          |   AB30 - AB6F   |    54   |     54   |          |
| Variation Selectors                       |   FE00 - FE0F   |    16   |     16   |          |
| Combining Half Marks                      |   FE20 - FE2F   |    16   |     16   |          |
| Specials                                  |   FFF0 - FFFF   |     5   |      5   |          |
|                                           |                 |         |          |          |
| Aegean Numbers                            |  10100 - 1013F  |    57   |     57   |          |
| Ancient Greek Numbers                     |  10140 - 1018F  |    79   |     79   |          |
| Ancient Symbols                           |  10190 - 101CF  |    13   |     13   |          |
| Phaistos Disc                             |  101D0 - 101FF  |    46   |     46   |          |
| Coptic Epact Numbers                      |  102E0 - 102FF  |    28   |     28   |          |
| Byzantine Musical Symbols                 |  1D000 - 1D0FF  |   246   |    246   |          |
| Musical Symbols                           |  1D100 - 1D1FF  |   231   |    231   |          |
| Ancient Greek Musical Notation            |  1D200 - 1D24F  |    70   |     70   |          |
| Tai Xuan Jing Symbols                     |  1D300 - 1D35F  |    87   |     87   |          |
•-------------------------------------------•-----------------•---------•----------•----------•
| Counting Rod Numerals                     |  1D360 - 1D37F  |    18   |     25   |    No    |
•-------------------------------------------•-----------------•---------•----------•----------•
| Mathematical Alphanumeric Symbols         |  1D400 - 1D7FF  |   996   |    996   |          |
| Mahjong Tiles                             |  1F000 - 1F02F  |    44   |     44   |          |
| Domino Tiles                              |  1F030 - 1F09F  |   100   |    100   |          |
| Playing Cards                             |  1F0A0 - 1F0FF  |    82   |     82   |          |
•-------------------------------------------•-----------------•---------•----------•----------•
| Enclosed Alphanumeric Supplement          |  1F100 - 1F1FF  |   191   |    192   |    No    |
•-------------------------------------------•-----------------•---------•----------•----------•
| Enclosed Ideographic Supplement           |  1F200 - 1F2FF  |    64   |     64   |          |
| Miscellaneous Symbols and Pictographs     |  1F300 - 1F5FF  |   768   |    768   |          |
| Emoticons                                 |  1F600 - 1F64F  |    80   |     80   |          |
| Ornamental Dingbats                       |  1F650 - 1F67F  |    48   |     48   |          |
•-------------------------------------------•-----------------•---------•----------•----------•
| Transport and Map Symbols                 |  1F680 - 1F6FF  |   107   |    108   |    No    |
•-------------------------------------------•-----------------•---------•----------•----------•
| Alchemical Symbols                        |  1F700 - 1F77F  |   116   |    116   |          |
•-------------------------------------------•-----------------•---------•----------•----------•
| Geometric Shapes Extended                 |  1F780 - 1F7FF  |    85   |     89   |    No    |
•-------------------------------------------•-----------------•---------•----------•----------•
| Supplemental Arrows-C                     |  1F800 - 1F8FF  |   148   |    148   |          |
•-------------------------------------------•-----------------•---------•----------•----------•
| Supplemental Symbols and Pictographs      |  1F900 - 1F9FF  |   148   |    213   |    No    |
| Supplementary Private Use Area-A          |  F0000 - FFFFF  |   118   | 65,534   |    No    |
•-------------------------------------------•-----------------•---------•----------•----------•
If you want, I may send you the Symbola Monospacified for Liberation Mono font by e-mail, and I could add the complete list of characters handled by that font.
Once installed on your system, you could use it within N++ as the global default font, for instance.
My e-mail address is
Best Regards,
guy038
-
-
@guy038 said in Regex Misidentifying Foreign Characters:
numeric keypad
Any good advice for these techniques for those of us that prefer a keyboard without a numeric keypad (for use on cramped desktops)? :-)
-
Hello, @sylvester-bullitt, @alan-kilborn and All,
Alan, good question ;-)) Personally, my old NEC 350 laptop does not have a numeric keypad. So, I’ve got a standard USB keyboard (105 keys) permanently plugged into the laptop!
When the Caps Lock key is set, my laptop’s French keyboard looks like this:
1234567890°+
AZERTYUIOP^£
QSDFGHJKLM%
>WXCVBN?./§
And if I want to use the pseudo-numeric keypad, I just hit the Num Lock key and the keyboard then changes as below:
123456789*°+
AZERTY456-^£
QSDFGH123+%
>WXCVBN0../
So:
- The keys 7890 are mapped to keys 789*
- The keys UIOP are mapped to keys 456-
- The keys JKLM are mapped to keys 123+
- The keys ?/§ are mapped to keys 0./
As the A, B, C, D, E and F keys keep their default mapping, I’m always able, even without any additional keyboard (when travelling, for instance), to use, in conjunction with the Alt key, all the input methods described in my previous post ;-))
Unfortunately, I don’t use any recent mini-laptop with a special keyboard layout, so I cannot say anything else about this subject :-((. Even my wife’s laptop has a physical keypad!
So, I’m sorry: without the hardware, it’s impossible for me to give pertinent clues about the way to handle these Windows input methods with atypical keyboard configurations!
Best Regards,
guy038
-