Visualization for zero-width characters

Petyo Vodenicharov

Hello Community,
I just ran into this post:
https://medium.com/@umpox/be-careful-what-you-copy-invisibly-inserting-usernames-into-text-with-zero-width-characters-18b4e6f17b66

In summary, you can have zero-width (invisible) characters that most applications don’t register.
I’ve tested with Notepad++, and it does seem to register that there are characters when I navigate the text (I need an extra press of the arrow key when I’m at a location with a zero-width character).
Even with the “Show All Characters” option enabled, I don’t see any characters there.

Please consider adding visualization for zero-width characters; it would be very helpful for security-conscious people.

PS. Thanks to everyone who is involved in the existence of this great software.

All the best,
Petyo

PeterJones

See my previous post on the subject of zero-width characters, and an earlier post where I even shared a pair of PythonScript scripts that give fuller Show All Characters and Don't Show All Characters functionality.

guy038

Hello, @petyo-vodenicharov, @peterjones and All,

In addition to Peter’s two valuable posts above, here is my contribution on these strange characters ;-))

Below is a list of all the Unicode characters with the General Category property = Cf ( Format Character ) which both have a code value < FFFF and do NOT strictly depend on a specific language !

        •--------•--------•-------------------------------------------•------•---------•
        |  Code  |  Abbr. |              Complete Name                | Cat. |  >Car<  |
        •--------•--------•-------------------------------------------•------•---------•
        |  00AD  |  SHY   |  SOFT HYPHEN                              |  Cf  |    ><  |
        •--------•--------•-------------------------------------------•------•---------•
        |  200B  |  ZWSP  |  ZERO WIDTH SPACE                         |  Cf  |    ><   |
        |  200C  |  ZWNJ  |  ZERO WIDTH NON-JOINER                    |  Cf  |    >‌<   |
        |  200D  |  ZWJ   |  ZERO WIDTH JOINER                        |  Cf  |    >‍<   |
        |  200E  |  LRM   |  LEFT-TO-RIGHT MARK                       |  Cf  |    >‎<   |
        |  200F  |  RLM   |  RIGHT-TO-LEFT MARK                       |  Cf  |    >‏<   |
        •--------•--------•-------------------------------------------•------•---------•
        |  202A  |  LRE   |  LEFT-TO-RIGHT EMBEDDING                  |  Cf  |    >‪<   |
        |  202B  |  RLE   |  RIGHT-TO-LEFT EMBEDDING                  |  Cf  |    >‫<   |
        |  202C  |  PDF   |  POP DIRECTIONAL FORMATTING               |  Cf  |    >‬<   |
        |  202D  |  LRO   |  LEFT-TO-RIGHT OVERRIDE                   |  Cf  |    >‭<   |
        |  202E  |  RLO   |  RIGHT-TO-LEFT OVERRIDE                   |  Cf  |    >‮<   |
        •--------•--------•-------------------------------------------•------•---------•
        |  2060  |  WJ    |  WORD JOINER                              |  Cf  |    >⁠<  |
        |  2061  |  ƒ()   |  FUNCTION APPLICATION                     |  Cf  |    >⁡<  |
        |  2062  |  ×     |  INVISIBLE TIMES                          |  Cf  |    >⁢<  |
        |  2063  |  ,     |  INVISIBLE SEPARATOR                      |  Cf  |    >⁣<  |
        |  2064  |  +     |  INVISIBLE PLUS                           |  Cf  |    >⁤<  |
        |  2066  |  LRI   |  LEFT-TO-RIGHT ISOLATE                    |  Cf  |    >⁦<  |
        |  2067  |  RLI   |  RIGHT-TO-LEFT ISOLATE                    |  Cf  |    >⁧<  |
        |  2068  |  FSI   |  FIRST STRONG ISOLATE                     |  Cf  |    >⁨<  |
        |  2069  |  PDI   |  POP DIRECTIONAL ISOLATE                  |  Cf  |    >⁩<  |
        |  206A  |  ISS   |  INHIBIT SYMMETRIC SWAPPING               |  Cf  |    >⁪<   |
        |  206B  |  ASS   |  ACTIVATE SYMMETRIC SWAPPING              |  Cf  |    >⁫<   |
        |  206C  |  IAFS  |  INHIBIT ARABIC FORM SHAPING              |  Cf  |    >⁬<   |
        |  206D  |  AAFS  |  ACTIVATE ARABIC FORM SHAPING             |  Cf  |    >⁭<   |
        |  206E  |  NADS  |  NATIONAL DIGIT SHAPES                    |  Cf  |    >⁮<   |
        |  206F  |  NODS  |  NOMINAL DIGIT SHAPES                     |  Cf  |    >⁯<   |
        •--------•--------•-------------------------------------------•------•---------•
        |  FEFF  | ZWNBSP |  ZERO WIDTH NO-BREAK SPACE                |  Cf  |    ><   |
        •--------•--------•-------------------------------------------•------•---------•
        |  FFF9  |  IAA   |  INTERLINEAR ANNOTATION ANCHOR            |  Cf  |    >￹<   |
        |  FFFA  |  IAS   |  INTERLINEAR ANNOTATION SEPARATOR         |  Cf  |    >￺<   |
        |  FFFB  |  IAT   |  INTERLINEAR ANNOTATION TERMINATOR        |  Cf  |    >￻<   |
        •--------•--------•-------------------------------------------•------•---------•
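For anyone who wants to rebuild such a list outside N++, here is a minimal Python sketch using the standard unicodedata module. The function name bmp_format_chars is my own invention. Note that it applies no language filter, so it also returns the language-specific Cf characters (for instance the Arabic number sign U+0600) that the table above leaves out on purpose, and the exact result depends on the Unicode version shipped with your Python.

```python
import unicodedata

def bmp_format_chars(limit=0xFFFF):
    """Return (code point, name) pairs for every Cf character below `limit`."""
    found = []
    for code in range(limit):
        ch = chr(code)
        if unicodedata.category(ch) == "Cf":
            # The fallback name is only a safeguard for unnamed code points.
            found.append((code, unicodedata.name(ch, "<unnamed>")))
    return found

if __name__ == "__main__":
    for code, name in bmp_format_chars():
        print(f"{code:04X}  {name}")
```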

Now, depending on the current font used in N++, the glyph of these characters may :

Be invisible ( A true Zero Width character )
Display a square or a thin rectangular box ( Character not handled by current font )
Display a specific character ( case of the Soft Hyphen )

Of course, with the simple regular expression \x{####}, you can match the character of Unicode value = ####. But it would be better to find a regex that matches any of these format characters !

I noticed that the POSIX character class [[:cntrl:]] matches most of these characters :

The 4 characters, from \x{200C} to \x{200F}
The 5 characters, from \x{202A} to \x{202E}
The 6 characters, from \x{206A} to \x{206F}
The character \x{FEFF}
The 3 characters, from \x{FFF9} to \x{FFFB}

Unfortunately, the [[:cntrl:]] regex also matches the Control characters :

The 32 C0 characters, from \x{0000} to \x{001F}
The 32 C1 characters, from \x{0080} to \x{009F}

Moreover, the [[:cntrl:]] regex misses some characters :

The Soft Hyphen \x{00AD}
The Zero Width Space \x{200B}
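The 9 characters, from \x{2060} to \x{2069} ( \x{2065} is unassigned )

So, a correct regex to match all the format characters above, in a Unicode-encoded file, could be the following, where the look-ahead (?=[[:unicode:]]) limits [[:cntrl:]] to code points above \x{00FF} and so excludes the C0 and C1 control characters :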

(?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD
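The same idea can be sketched outside N++ in Python. Python's re module has no POSIX classes such as [[:cntrl:]] or [[:unicode:]], so this sketch simply spells out the ranges from the table; it is an approximation of the regex above, not a translation of it, and the name FORMAT_CHARS is mine.

```python
import re

# Explicit ranges for the characters listed in the table; the unassigned
# code point U+2065 is deliberately skipped.
FORMAT_CHARS = re.compile(
    r"["
    r"\u00AD"         # soft hyphen
    r"\u200B-\u200F"  # ZWSP, ZWNJ, ZWJ, LRM, RLM
    r"\u202A-\u202E"  # embeddings, overrides and PDF
    r"\u2060-\u2064"  # word joiner and invisible operators
    r"\u2066-\u206F"  # isolates and deprecated format characters
    r"\uFEFF"         # zero width no-break space
    r"\uFFF9-\uFFFB"  # interlinear annotation characters
    r"]"
)

def contains_format_chars(text):
    """Return True if `text` holds at least one of the listed format characters."""
    return FORMAT_CHARS.search(text) is not None
```

Unlike [[:cntrl:]], this class never touches the C0 and C1 control characters, so no look-ahead is needed.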

Now, how to visualize a zero-width character ? If you just hit the Find Next button, you see that a specific line is reached, but you do not know the exact location of the zero-width char(s) on that line :-((

Two solutions are possible :

(?-s).((?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD)+.

Which matches two standard characters separated by one or more consecutive format character(s)

((?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD)+

Which marks all these format chars when you click on the Mark All button ( the best solution, to my mind ! )
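If you prefer exact positions to marks, here is a small Python sketch that reports the line and column of each run of format characters. The names FORMAT_RUN and locate_format_runs are hypothetical, the character class is written out by hand from the table, and columns are counted in characters, starting at 1.

```python
import re

# One or more consecutive format characters from the table (U+2065 excluded).
FORMAT_RUN = re.compile(
    r"[\u00AD\u200B-\u200F\u202A-\u202E\u2060-\u2064"
    r"\u2066-\u206F\uFEFF\uFFF9-\uFFFB]+"
)

def locate_format_runs(text):
    """Yield (line, column, code points) for each run of format characters."""
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in FORMAT_RUN.finditer(line):
            codes = [f"U+{ord(ch):04X}" for ch in match.group()]
            yield line_no, match.start() + 1, codes

if __name__ == "__main__":
    sample = "ok\ntel\u200b\u200bl me\u200c"
    for line_no, column, codes in locate_format_runs(sample):
        print(f"line {line_no}, column {column}: {' '.join(codes)}")
```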

So, trying the simple regex \x{200B} against the sentence below, with the Mark option, will convince you that this sentence does contain some Zero Width Space characters !

For example, I’ve inserted 10 zero-width spaces into this sentence, can you tell me where ?

Note that, between the two letters l of the verb tell, there are two consecutive chars \x{200B} !
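Another way to make such characters visible, sketched here in Python, is to replace each one with a readable tag. This version relies on the General Category from the unicodedata module instead of a regex, so it tags every Cf character, including language-specific ones and those above FFFF. The function name and the sample string are my own, not the sentence above.

```python
import unicodedata

def reveal_format_chars(text):
    """Replace every Cf (format) character with a visible <U+XXXX> tag."""
    return "".join(
        f"<U+{ord(ch):04X}>" if unicodedata.category(ch) == "Cf" else ch
        for ch in text
    )

# Hypothetical sample with two zero-width spaces inside the word "tell".
sample = "can you tel\u200b\u200bl me where ?"
print(reveal_format_chars(sample))
# prints: can you tel<U+200B><U+200B>l me where ?
```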

You can find a description of these format characters at the following links :

http://www.unicode.org/charts/PDF/U2000.pdf

http://www.unicode.org/charts/PDF/UFE70.pdf

http://www.unicode.org/charts/PDF/UFFF0.pdf

See also this post :

https://notepad-plus-plus.org/community/topic/14812/how-to-search-for-unknown-3-digit-characters-with-black-background/2

Best Regards,

guy038

P.S. :

Simply copy/paste the list and the sentence above ( both shown in inverse video ) into a new tab and enjoy !