Visualization for zero-width characters
-
Hello Community,
I just ran into this post:
https://medium.com/@umpox/be-careful-what-you-copy-invisibly-inserting-usernames-into-text-with-zero-width-characters-18b4e6f17b66
In summary, you can have zero-width (invisible) characters that most applications don’t register.
I’ve tested with Notepad++: it seems to register that there are characters when I navigate the text (I need an extra press of the arrow key when I’m at a location with a zero-width character).
Even with the “Show All Characters” option I don’t see any characters there.
Please consider adding visualization for zero-width characters; it would be very helpful for security-conscious people.
PS. Thanks to everyone who is involved in the existence of this great software.
All the best,
Petyo -
See my previous post on the subject of zero-width characters, and an earlier post where I even shared a pair of PythonScript scripts to give a fuller Show All Characters and Don't Show All Characters functionality.
-
Hello, @petyo-vodenicharov, @peterjones and All,
In addition to Peter’s two valuable posts above, here is my contribution on these strange characters ;-))
Here is, below, a list of all the Unicode characters with the General Category property = Cf (Format Character) which both have a code value < FFFF and do NOT strictly depend on a specific language!

•--------•--------•-------------------------------------------•------•---------•
|  Code  | Abbr.  |  Complete Name                            | Cat. |  >Car<  |
•--------•--------•-------------------------------------------•------•---------•
|  00AD  |  SHY   |  SOFT HYPHEN                              |  Cf  |   ><    |
•--------•--------•-------------------------------------------•------•---------•
|  200B  |  ZWSP  |  ZERO WIDTH SPACE                         |  Cf  |   ><    |
|  200C  |  ZWNJ  |  ZERO WIDTH NON-JOINER                    |  Cf  |   ><    |
|  200D  |  ZWJ   |  ZERO WIDTH JOINER                        |  Cf  |   ><    |
|  200E  |  LRM   |  LEFT-TO-RIGHT MARK                       |  Cf  |   ><    |
|  200F  |  RLM   |  RIGHT-TO-LEFT MARK                       |  Cf  |   ><    |
•--------•--------•-------------------------------------------•------•---------•
|  202A  |  LRE   |  LEFT-TO-RIGHT EMBEDDING                  |  Cf  |   ><    |
|  202B  |  RLE   |  RIGHT-TO-LEFT EMBEDDING                  |  Cf  |   ><    |
|  202C  |  PDF   |  POP DIRECTIONAL FORMATTING               |  Cf  |   ><    |
|  202D  |  LRO   |  LEFT-TO-RIGHT OVERRIDE                   |  Cf  |   ><    |
|  202E  |  RLO   |  RIGHT-TO-LEFT OVERRIDE                   |  Cf  |   ><    |
•--------•--------•-------------------------------------------•------•---------•
|  2060  |  WJ    |  WORD JOINER                              |  Cf  |   ><    |
|  2061  |  ƒ()   |  FUNCTION APPLICATION                     |  Cf  |   ><    |
|  2062  |  ×     |  INVISIBLE TIMES                          |  Cf  |   ><    |
|  2063  |  ,     |  INVISIBLE SEPARATOR                      |  Cf  |   ><    |
|  2064  |  +     |  INVISIBLE PLUS                           |  Cf  |   ><    |
|  2066  |  LRI   |  LEFT-TO-RIGHT ISOLATE                    |  Cf  |   ><    |
|  2067  |  RLI   |  RIGHT-TO-LEFT ISOLATE                    |  Cf  |   ><    |
|  2068  |  FSI   |  FIRST STRONG ISOLATE                     |  Cf  |   ><    |
|  2069  |  PDI   |  POP DIRECTIONAL ISOLATE                  |  Cf  |   ><    |
|  206A  |  ISS   |  INHIBIT SYMMETRIC SWAPPING               |  Cf  |   ><    |
|  206B  |  ASS   |  ACTIVATE SYMMETRIC SWAPPING              |  Cf  |   ><    |
|  206C  |  IAFS  |  INHIBIT ARABIC FORM SHAPING              |  Cf  |   ><    |
|  206D  |  AAFS  |  ACTIVATE ARABIC FORM SHAPING             |  Cf  |   ><    |
|  206E  |  NADS  |  NATIONAL DIGIT SHAPES                    |  Cf  |   ><    |
|  206F  |  NODS  |  NOMINAL DIGIT SHAPES                     |  Cf  |   ><    |
•--------•--------•-------------------------------------------•------•---------•
|  FEFF  | ZWNBSP |  ZERO WIDTH NO-BREAK SPACE                |  Cf  |   ><    |
•--------•--------•-------------------------------------------•------•---------•
|  FFF9  |  IAA   |  INTERLINEAR ANNOTATION ANCHOR            |  Cf  |   ><    |
|  FFFA  |  IAS   |  INTERLINEAR ANNOTATION SEPARATOR         |  Cf  |   ><    |
|  FFFB  |  IAT   |  INTERLINEAR ANNOTATION TERMINATOR        |  Cf  |   ><    |
•--------•--------•-------------------------------------------•------•---------•
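For anyone who wants to cross-check the list outside of N++, here is a minimal sketch in plain Python (not PythonScript) that asks the standard unicodedata module for every Cf character in the Basic Multilingual Plane. The function name format_chars_in_bmp is just an illustrative choice. Note that the output is longer than the list above, because it also includes the language-specific format characters (Arabic, Syriac, Mongolian...) that were deliberately left out, and the exact result depends on the Unicode version shipped with your Python.

```python
import unicodedata

def format_chars_in_bmp():
    """Return (code point, name) pairs for every Cf character up to U+FFFF."""
    found = []
    for cp in range(0x10000):
        ch = chr(cp)
        # General Category "Cf" = Format Character
        if unicodedata.category(ch) == "Cf":
            found.append((cp, unicodedata.name(ch, "<unnamed>")))
    return found

if __name__ == "__main__":
    for cp, name in format_chars_in_bmp():
        print(f"{cp:04X}  {name}")
```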
Now, depending on the current font used in N++, the glyph of these characters may:
- Be invisible (a true Zero Width character)
- Display a square or a thin rectangular box (character not handled by the current font)
- Display a specific character (case of the Soft Hyphen)
Of course, with the simple regular expression \x{####}, you can match the character of Unicode value = ####. But it would be better to find a regex that matches any of these format characters!
I noticed that the POSIX character class [[:cntrl:]] matches most of these characters:
- The 4 characters from \x{200C} to \x{200F}
- The 5 characters from \x{202A} to \x{202E}
- The 6 characters from \x{206A} to \x{206F}
- The character \x{FEFF}
- The 3 characters from \x{FFF9} to \x{FFFB}
Unfortunately, the [[:cntrl:]] regex also matches the Control characters:
- The 32 C0 characters from \x{0000} to \x{001F}
- The 32 C1 characters from \x{0080} to \x{009F}
Moreover, the [[:cntrl:]] regex misses some characters:
- The Soft Hyphen \x{00AD}
- The Zero Width Space \x{200B}
- The 9 characters from \x{2060} to \x{2069}
So a correct regex to match all the format characters above, in a Unicode-encoded file, could be:
(?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD
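Outside of N++, the same set can be approximated in plain Python. This is only a sketch under assumptions: Python's re module has no [[:cntrl:]] or [[:unicode:]] classes, so the ranges from the list above are spelled out explicitly instead, and the names FORMAT_CHARS and is_format_char are illustrative, not part of any tool.

```python
import re

# Explicit ranges for the format characters listed in this post.
# U+2065 is unassigned, so the 2060..2069 block is split around it.
FORMAT_CHARS = re.compile(
    r"[\u00AD\u200B-\u200F\u202A-\u202E"
    r"\u2060-\u2064\u2066-\u206F\uFEFF\uFFF9-\uFFFB]"
)

def is_format_char(ch):
    """True if the single character ch is one of the listed format characters."""
    return FORMAT_CHARS.fullmatch(ch) is not None

if __name__ == "__main__":
    # Format characters are matched, C0/C1 control characters are not
    for cp in (0x00AD, 0x200B, 0x202E, 0xFEFF, 0x0009, 0x0085):
        print(f"U+{cp:04X}", is_format_char(chr(cp)))
```

Unlike the [[:cntrl:]] approach, an explicit class needs no lookahead to exclude the C0 and C1 control characters, at the cost of having to maintain the ranges by hand.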
Now, how to visualize a zero-width character? If you just hit the Find Next button, you see that a specific line is reached, but you do not know the exact location of the zero-width char(s) :-((
Two solutions are possible:
- (?-s).((?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD)+.
  which matches two standard characters separated by one or several consecutive format character(s)
- ((?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD)+
  which marks all these format chars when you click on the Mark All button (the best solution, to my mind!)
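The same "make them visible" idea can be sketched in plain Python, for text that you want to check before pasting it anywhere. This is an assumption-laden illustration rather than anything N++ does: the helper names reveal and locate are made up, and the test uses the Unicode category Cf from the unicodedata module, so it also flags the language-specific format characters that the regexes above ignore.

```python
import unicodedata

def reveal(text):
    """Replace every Cf (format) character with a visible <U+XXXX> tag."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            out.append(f"<U+{ord(ch):04X}>")
        else:
            out.append(ch)
    return "".join(out)

def locate(text):
    """Return (offset, code point label) for each format character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]

if __name__ == "__main__":
    # Hypothetical sample: one ZWSP after "te", two between the letters l
    sample = "te\u200bl\u200b\u200bl me"
    print(reveal(sample))   # te<U+200B>l<U+200B><U+200B>l me
    print(locate(sample))   # [(2, 'U+200B'), (4, 'U+200B'), (5, 'U+200B')]
```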
So, trying the simple regex
\x{200B}
, against the sentence, below and using the Mark option, will convince you that this sentence does contain some Zero Width Space characters, inside !For example, I’ve inserted 10 zero-width spaces into this sentence, can you tell me where ?
Note that, between the two letters l of the verb tell, there are two consecutive \x{200B} chars!
You can find a description of these format characters at the following links:
http://www.unicode.org/charts/PDF/U2000.pdf
http://www.unicode.org/charts/PDF/UFE70.pdf
http://www.unicode.org/charts/PDF/UFFF0.pdf
Refer, also, to that post :
Best Regards,
guy038
P.S. :
Simply select the list and the sentence above (so that they show in inverse video), copy/paste them into a new tab, and enjoy!
-