Visualization for zero-width characters
-
Hello Community,
I just ran into this post:
https://medium.com/@umpox/be-careful-what-you-copy-invisibly-inserting-usernames-into-text-with-zero-width-characters-18b4e6f17b66
In summary, you can have zero-width (invisible) characters that most applications don’t display.
I’ve tested with Notepad++, and it seems to register that there are characters when I navigate the text (I need an extra press of the arrow key when I’m at a location with zero-width characters).
Even with the “Show All Characters” option, I don’t see any characters there.
Please consider adding visualization for zero-width characters. It would be very helpful for security-conscious people.
PS. Thanks to everyone who is involved in the existence of this great software.
All the best,
Petyo -
See my previous post on the subject of zero-width characters, and an earlier post where I even shared a pair of PythonScript scripts to give a fuller Show All Characters and Don't Show All Characters functionality
-
Hello, @petyo-vodenicharov, @peterjones and All,
In addition to Peter’s two valuable posts above, here is my contribution on these strange characters ;-))
Here is, below, a list of all the Unicode characters with the General Category property = Cf ( Format Character ) which both have a code value < FFFF and do NOT strictly depend on a specific language :
•--------•--------•-------------------------------------------•------•---------•
| Code   | Abbr.  | Complete Name                             | Cat. | >Car<   |
•--------•--------•-------------------------------------------•------•---------•
| 00AD   | SHY    | SOFT HYPHEN                               | Cf   | ><      |
•--------•--------•-------------------------------------------•------•---------•
| 200B   | ZWSP   | ZERO WIDTH SPACE                          | Cf   | ><      |
| 200C   | ZWNJ   | ZERO WIDTH NON-JOINER                     | Cf   | ><      |
| 200D   | ZWJ    | ZERO WIDTH JOINER                         | Cf   | ><      |
| 200E   | LRM    | LEFT-TO-RIGHT MARK                        | Cf   | ><      |
| 200F   | RLM    | RIGHT-TO-LEFT MARK                        | Cf   | ><      |
•--------•--------•-------------------------------------------•------•---------•
| 202A   | LRE    | LEFT-TO-RIGHT EMBEDDING                   | Cf   | ><      |
| 202B   | RLE    | RIGHT-TO-LEFT EMBEDDING                   | Cf   | ><      |
| 202C   | PDF    | POP DIRECTIONAL FORMATTING                | Cf   | ><      |
| 202D   | LRO    | LEFT-TO-RIGHT OVERRIDE                    | Cf   | ><      |
| 202E   | RLO    | RIGHT-TO-LEFT OVERRIDE                    | Cf   | ><      |
•--------•--------•-------------------------------------------•------•---------•
| 2060   | WJ     | WORD JOINER                               | Cf   | ><      |
| 2061   | ƒ()    | FUNCTION APPLICATION                      | Cf   | ><      |
| 2062   | ×      | INVISIBLE TIMES                           | Cf   | ><      |
| 2063   | ,      | INVISIBLE SEPARATOR                       | Cf   | ><      |
| 2064   | +      | INVISIBLE PLUS                            | Cf   | ><      |
| 2066   | LRI    | LEFT-TO-RIGHT ISOLATE                     | Cf   | ><      |
| 2067   | RLI    | RIGHT-TO-LEFT ISOLATE                     | Cf   | ><      |
| 2068   | FSI    | FIRST STRONG ISOLATE                      | Cf   | ><      |
| 2069   | PDI    | POP DIRECTIONAL ISOLATE                   | Cf   | ><      |
| 206A   | ISS    | INHIBIT SYMMETRIC SWAPPING                | Cf   | ><      |
| 206B   | ASS    | ACTIVATE SYMMETRIC SWAPPING               | Cf   | ><      |
| 206C   | IAFS   | INHIBIT ARABIC FORM SHAPING               | Cf   | ><      |
| 206D   | AAFS   | ACTIVATE ARABIC FORM SHAPING              | Cf   | ><      |
| 206E   | NADS   | NATIONAL DIGIT SHAPES                     | Cf   | ><      |
| 206F   | NODS   | NOMINAL DIGIT SHAPES                      | Cf   | ><      |
•--------•--------•-------------------------------------------•------•---------•
| FEFF   | ZWNBSP | ZERO WIDTH NO-BREAK SPACE                 | Cf   | ><      |
•--------•--------•-------------------------------------------•------•---------•
| FFF9   | IAA    | INTERLINEAR ANNOTATION ANCHOR             | Cf   | ><      |
| FFFA   | IAS    | INTERLINEAR ANNOTATION SEPARATOR          | Cf   | ><      |
| FFFB   | IAT    | INTERLINEAR ANNOTATION TERMINATOR         | Cf   | ><      |
•--------•--------•-------------------------------------------•------•---------•
Now, depending on the current font used in N++, the glyph of these characters may :
- Be invisible ( A true Zero Width character )
- Display a square or a thin rectangular box ( Character not handled by current font )
- Display a specific character ( case of the Soft Hyphen )
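As a cross-check of the list of format characters above, here is a minimal sketch in Python, assuming the standard unicodedata module of a reasonably recent Python 3. The names LISTED_RANGES, listed_code_points and category_report are only illustrative. It prints the general category and name of each listed code point, so you can verify that they are all reported as Cf :

```python
import unicodedata

# Code points taken from the list above (ranges are inclusive).
# U+2065 is left out, as in the list.
LISTED_RANGES = [
    (0x00AD, 0x00AD),
    (0x200B, 0x200F),
    (0x202A, 0x202E),
    (0x2060, 0x2064),
    (0x2066, 0x206F),
    (0xFEFF, 0xFEFF),
    (0xFFF9, 0xFFFB),
]


def listed_code_points():
    """Yield every code point covered by LISTED_RANGES."""
    for first, last in LISTED_RANGES:
        yield from range(first, last + 1)


def category_report():
    """Return (hex code, general category, Unicode name) triples."""
    return [
        (
            f"{cp:04X}",
            unicodedata.category(chr(cp)),
            unicodedata.name(chr(cp), "<unnamed>"),
        )
        for cp in listed_code_points()
    ]


for code, cat, name in category_report():
    print(code, cat, name)
```

The names printed come from the Unicode database shipped with your Python version, so they may differ slightly from the abbreviations used in the list.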
Of course, with the simple regular expression \x{####}, you can match the character of Unicode value = ####. But it would be better to find a regex that matches any of these format characters !
I noticed that the POSIX character class [[:cntrl:]] matches most of these characters :
- The 4 characters, from \x{200C} to \x{200F}
- The 5 characters, from \x{202A} to \x{202E}
- The 6 characters, from \x{206A} to \x{206F}
- The character \x{FEFF}
- The 3 characters, from \x{FFF9} to \x{FFFB}
Unfortunately, the [[:cntrl:]] regex also matches the Control characters :
- The 32 C0 characters, from \x{0000} to \x{001F}
- The 32 C1 characters, from \x{0080} to \x{009F}
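Outside of the N++ regex engine, one way to tell these two groups apart is the Unicode general category : control characters are Cc and format characters are Cf. The sketch below assumes Python's standard unicodedata module and a hypothetical helper named categories. It says nothing about how [[:cntrl:]] is implemented in N++ :

```python
import unicodedata


def categories(first, last):
    """Return the set of general categories found in an inclusive range."""
    return {unicodedata.category(chr(cp)) for cp in range(first, last + 1)}


c0 = categories(0x0000, 0x001F)   # the 32 C0 control characters
c1 = categories(0x0080, 0x009F)   # the 32 C1 control characters
fmt = categories(0x200C, 0x200F)  # ZWNJ, ZWJ, LRM, RLM

print("C0 :", c0)
print("C1 :", c1)
print("200C-200F :", fmt)
```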
Moreover, the [[:cntrl:]] regex misses some characters :
- The Soft Hyphen \x{00AD}
- The Zero Width Space \x{200B}
- The 9 characters, from \x{2060} to \x{2069}
So, a correct regex to match all the format characters above, in a Unicode-encoded file, could be :
(?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD
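For readers who want to run the same check from a script, here is a rough Python equivalent. Python's re module supports neither POSIX classes like [[:cntrl:]] nor the \x{####} syntax, so this sketch spells out the code points of the list explicitly ( U+2065, which is not in the list, is left out ). The names FORMAT_CHARS and find_format_chars are only illustrative :

```python
import re

# Explicit character class covering the format characters listed above.
FORMAT_CHARS = re.compile(
    r"[\u00AD\u200B-\u200F\u202A-\u202E\u2060-\u2064\u2066-\u206F\uFEFF\uFFF9-\uFFFB]"
)


def find_format_chars(text):
    """Return (offset, 'U+XXXX') pairs for every format character in text."""
    return [
        (m.start(), f"U+{ord(m.group()):04X}")
        for m in FORMAT_CHARS.finditer(text)
    ]


# Hypothetical sample : one Zero Width Space and one Soft Hyphen.
sample = "zero\u200Bwidth and soft\u00ADhyphen"
print(find_format_chars(sample))
```

With this sample, the function should report one hit at offset 4 ( U+200B ) and one at offset 19 ( U+00AD ).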
Now, how to visualize a zero-width character ? If you just hit the Find Next button, you see that a specific line is reached but you do not know the exact location of this/these zero-width char(s) :-((
Two solutions are possible :
- (?-s).((?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD)+.
  which matches two standard characters separated by one or several consecutive format character(s)
- ((?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD)+
  which marks all these format chars when you click on the Mark All button ( the best solution, to my mind ! )
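The two approaches can also be sketched in Python, again with an explicit character class instead of [[:cntrl:]], which Python's re module does not support. The names FMT, WITH_CONTEXT, RUN_ONLY, context_matches and run_spans are only illustrative, and the sample string is made up :

```python
import re

# Explicit class for the format characters listed earlier (U+2065 left out).
FMT = r"[\u00AD\u200B-\u200F\u202A-\u202E\u2060-\u2064\u2066-\u206F\uFEFF\uFFF9-\uFFFB]"

# First approach : one character, a run of format characters, one character.
# In Python, "." does not match a newline by default, like (?-s). does.
WITH_CONTEXT = re.compile(rf"(.)({FMT}+)(.)")

# Second approach : only the run itself, which is what a marker would highlight.
RUN_ONLY = re.compile(rf"{FMT}+")


def context_matches(text):
    """Return (char before, run length, char after) for each run found."""
    return [
        (m.group(1), len(m.group(2)), m.group(3))
        for m in WITH_CONTEXT.finditer(text)
    ]


def run_spans(text):
    """Return (start, end) offsets of each run of format characters."""
    return [m.span() for m in RUN_ONLY.finditer(text)]


# Hypothetical sample : two Zero Width Spaces, then one Soft Hyphen.
sample = "A\u200B\u200BB and C\u00ADD"
print(context_matches(sample))
print(run_spans(sample))
```

The first function gives the visible neighbours of each run, and the second gives the exact offsets, which is closer to what the Mark All button shows.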
So, trying the simple regex \x{200B} against the sentence below, with the Mark option, will convince you that this sentence does contain some Zero Width Space characters !
For example, I’ve inserted 10 zero-width spaces into this sentence, can you tell me where ?
Note that, between the two letters l of the verb tell, there are two consecutive \x{200B} chars !
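If you prefer to see the hidden characters spelled out rather than marked, here is a small Python sketch that replaces each format character with a visible <U+XXXX> tag. The function name reveal and the sample string are made up for the illustration. The sample only reproduces the two consecutive Zero Width Spaces in the verb tell, not the full sentence :

```python
import re

# Explicit class for the format characters listed earlier (U+2065 left out).
FORMAT_CHARS = re.compile(
    r"[\u00AD\u200B-\u200F\u202A-\u202E\u2060-\u2064\u2066-\u206F\uFEFF\uFFF9-\uFFFB]"
)


def reveal(text):
    """Replace each format character with a visible <U+XXXX> tag."""
    return FORMAT_CHARS.sub(lambda m: f"<U+{ord(m.group()):04X}>", text)


# Hypothetical sample : two Zero Width Spaces between the two letters l.
sample = "can you tel\u200B\u200Bl me where ?"
print(reveal(sample))
```

Text without any format character is returned unchanged, so the function can be run safely over a whole file.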
You can find a description of these format characters at the following links :
http://www.unicode.org/charts/PDF/U2000.pdf
http://www.unicode.org/charts/PDF/UFE70.pdf
http://www.unicode.org/charts/PDF/UFFF0.pdf
Refer, also, to that post :
Best Regards,
guy038
P.S. :
Simply, copy/paste the list and the sentence, above, in inverse video, in a new tab and enjoy !