Hello, @petyo-vodenicharov, @peterjones and All,
In addition to Peter’s two valuable posts above, here is my contribution about these strange characters ;-))
Below is a list of all the Unicode characters with the General Category property = Cf ( Format Character ) which both have a code value < FFFF and do NOT, strictly, depend on a specific language !
•--------•--------•-------------------------------------------•------•---------•
|  Code  |  Abbr. |  Complete Name                            | Cat. |  >Car<  |
•--------•--------•-------------------------------------------•------•---------•
|  00AD  |  SHY   |  SOFT HYPHEN                              |  Cf  |   ><    |
•--------•--------•-------------------------------------------•------•---------•
|  200B  |  ZWSP  |  ZERO WIDTH SPACE                         |  Cf  |   ><    |
|  200C  |  ZWNJ  |  ZERO WIDTH NON-JOINER                    |  Cf  |   ><    |
|  200D  |  ZWJ   |  ZERO WIDTH JOINER                        |  Cf  |   ><    |
|  200E  |  LRM   |  LEFT-TO-RIGHT MARK                       |  Cf  |   ><    |
|  200F  |  RLM   |  RIGHT-TO-LEFT MARK                       |  Cf  |   ><    |
•--------•--------•-------------------------------------------•------•---------•
|  202A  |  LRE   |  LEFT-TO-RIGHT EMBEDDING                  |  Cf  |   ><    |
|  202B  |  RLE   |  RIGHT-TO-LEFT EMBEDDING                  |  Cf  |   ><    |
|  202C  |  PDF   |  POP DIRECTIONAL FORMATTING               |  Cf  |   ><    |
|  202D  |  LRO   |  LEFT-TO-RIGHT OVERRIDE                   |  Cf  |   ><    |
|  202E  |  RLO   |  RIGHT-TO-LEFT OVERRIDE                   |  Cf  |   ><    |
•--------•--------•-------------------------------------------•------•---------•
|  2060  |  WJ    |  WORD JOINER                              |  Cf  |   ><    |
|  2061  |  ƒ()   |  FUNCTION APPLICATION                     |  Cf  |   ><    |
|  2062  |  ×     |  INVISIBLE TIMES                          |  Cf  |   ><    |
|  2063  |  ,     |  INVISIBLE SEPARATOR                      |  Cf  |   ><    |
|  2064  |  +     |  INVISIBLE PLUS                           |  Cf  |   ><    |
|  2066  |  LRI   |  LEFT-TO-RIGHT ISOLATE                    |  Cf  |   ><    |
|  2067  |  RLI   |  RIGHT-TO-LEFT ISOLATE                    |  Cf  |   ><    |
|  2068  |  FSI   |  FIRST STRONG ISOLATE                     |  Cf  |   ><    |
|  2069  |  PDI   |  POP DIRECTIONAL ISOLATE                  |  Cf  |   ><    |
|  206A  |  ISS   |  INHIBIT SYMMETRIC SWAPPING               |  Cf  |   ><    |
|  206B  |  ASS   |  ACTIVATE SYMMETRIC SWAPPING              |  Cf  |   ><    |
|  206C  |  IAFS  |  INHIBIT ARABIC FORM SHAPING              |  Cf  |   ><    |
|  206D  |  AAFS  |  ACTIVATE ARABIC FORM SHAPING             |  Cf  |   ><    |
|  206E  |  NADS  |  NATIONAL DIGIT SHAPES                    |  Cf  |   ><    |
|  206F  |  NODS  |  NOMINAL DIGIT SHAPES                     |  Cf  |   ><    |
•--------•--------•-------------------------------------------•------•---------•
|  FEFF  | ZWNBSP |  ZERO WIDTH NO-BREAK SPACE                |  Cf  |   ><    |
•--------•--------•-------------------------------------------•------•---------•
|  FFF9  |  IAA   |  INTERLINEAR ANNOTATION ANCHOR            |  Cf  |   ><    |
|  FFFA  |  IAS   |  INTERLINEAR ANNOTATION SEPARATOR         |  Cf  |   ><    |
|  FFFB  |  IAT   |  INTERLINEAR ANNOTATION TERMINATOR        |  Cf  |   ><    |
•--------•--------•-------------------------------------------•------•---------•
Now, depending on the current font used in N++, the glyph of these characters may :
Be invisible ( A true Zero Width character )
Display a square or a thin rectangular box ( Character not handled by current font )
Display a specific character ( case of the Soft Hyphen )
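For anyone who wants to cross-check the Cf list above outside N++, here is a minimal Python sketch. It assumes the standard unicodedata module; the function name bmp_format_chars is purely illustrative. The result depends on the Unicode version shipped with your Python, and it also contains the script-specific format characters ( for instance, the Arabic ones ) that the table leaves out on purpose.

```python
import unicodedata


def bmp_format_chars():
    """Return (code point, name) pairs for every BMP character of category Cf."""
    result = []
    for cp in range(0x10000):  # code values 0000 to FFFF
        ch = chr(cp)
        if unicodedata.category(ch) == "Cf":
            # Fall back to a placeholder if the database has no name
            result.append((cp, unicodedata.name(ch, "<unnamed>")))
    return result


if __name__ == "__main__":
    for cp, name in bmp_format_chars():
        print(f"{cp:04X}  {name}")
```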
Of course, with the simple regular expression \x{####}, you can match the character of Unicode value = ####. But it would be better to find a regex which matches any of these format characters !
I noticed that the POSIX character class [[:cntrl:]] matches most of these characters :
The 4 characters, from \x{200C} to \x{200F}
The 5 characters, from \x{202A} to \x{202E}
The 6 characters, from \x{206A} to \x{206F}
The character \x{FEFF}
The 3 characters, from \x{FFF9} to \x{FFFB}
Unfortunately, the [[:cntrl:]] regex also matches the Control characters :
The 32 C0 characters, from \x{0000} to \x{001F}
The 32 C1 characters, from \x{0080} to \x{009F}
Moreover, the [[:cntrl:]] regex misses some characters :
The Soft Hyphen \x{00AD}
The Zero Width Space \x{200B}
The 9 characters, from \x{2060} to \x{2069} ( the code point \x{2065} is unassigned )
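So, a correct regex to match all the format characters above, in a Unicode-encoded file, could be :
(?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD
The look-ahead (?=[[:unicode:]]) restricts the class to characters with a code value over \x{00FF}, which rules out the C0 and C1 control characters, and the Soft Hyphen is added back with the alternative \xAD.
Now, how to visualize a zero-width character ? If you just hit the Find Next button, you see that a specific line is reached, but you do not know the exact location of the zero-width char(s) :-((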
Two solutions are possible :
(?-s).((?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD)+.
which matches two standard characters separated by one or several consecutive format character(s)
((?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD)+
which marks all these format chars when you click on the Mark All button ( the best solution, to my mind ! )
So, trying the simple regex \x{200B} against the sentence below, with the Mark option, will convince you that this sentence does contain some Zero Width Space characters !
For example, I’ve inserted 10 zero-width spaces into this sentence, can you tell me where ?Note that, between the two letters l of the verb tell, there are two consecutive chars \x{200B} !
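The same marking idea can be sketched outside N++, for instance in Python. Python's re module has no [[:cntrl:]] or [[:unicode:]] classes, so this sketch swaps in an explicit character class built from the code values of the table above. The names FORMAT_CHARS, reveal and locate, and the sample string, are only illustrative.

```python
import re

# Explicit class listing the code values of the table above
# (a stand-in for the [[:cntrl:]] / [[:unicode:]] classes of the N++ regex engine)
FORMAT_CHARS = re.compile(
    r"[\u00AD\u200B-\u200F\u202A-\u202E\u2060-\u2064\u2066-\u206F\uFEFF\uFFF9-\uFFFB]"
)


def reveal(text):
    """Replace each format character with a visible <U+XXXX> tag."""
    return FORMAT_CHARS.sub(lambda m: "<U+%04X>" % ord(m.group()), text)


def locate(text):
    """Return (offset, code) pairs for every format character in text."""
    return [(m.start(), "U+%04X" % ord(m.group()))
            for m in FORMAT_CHARS.finditer(text)]


if __name__ == "__main__":
    # Hypothetical sample : two zero-width spaces between the two letters l
    sample = "tel\u200b\u200bl me\u00ad where"
    print(reveal(sample))
    print(locate(sample))
```

Writing the characters as \uXXXX escapes keeps the script itself free of invisible characters, so what you read is what gets run.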
You can find a description of these format characters at the following links :
http://www.unicode.org/charts/PDF/U2000.pdf
http://www.unicode.org/charts/PDF/UFE70.pdf
http://www.unicode.org/charts/PDF/UFFF0.pdf
Refer, also, to that post :
Best Regards,
guy038
P.S. :
Simply copy/paste the list and the sentence above, shown in inverse video, into a new tab and enjoy !