Visualization for zero-width characters



  • Hello Community,
    I just ran into this post:
    https://medium.com/@umpox/be-careful-what-you-copy-invisibly-inserting-usernames-into-text-with-zero-width-characters-18b4e6f17b66

    In summary, you can have zero-width (invisible) characters that most applications don’t display.
    I’ve tested with Notepad++, and it does seem to register that there are characters when I navigate the text (I need an extra press of the arrow key when I’m at a location with a zero-width character).
    Even with the “Show All Characters” option enabled, I don’t see any characters there.

    Please consider adding visualization for zero-width characters. It would be very helpful for security-conscious people.

    PS. Thanks to everyone who is involved in the existence of this great software.

    All the best,
    Petyo



  • See my previous post on the subject of zero-width characters, and an earlier post where I even shared a pair of PythonScript scripts to give a fuller Show All Characters and Don't Show All Characters functionality.



  • Hello, @petyo-vodenicharov, @peterjones and All,

    In addition to Peter’s two valuable posts, above, here is my contribution on these strange characters ;-))

    Below is a list of all the Unicode characters with the General Category property = Cf ( Format Character ) which both have a code value < FFFF and do NOT strictly depend on a specific language !

            •--------•--------•-------------------------------------------•------•---------•
            |  Code  |  Abbr. |              Complete Name                | Cat. |  >Car<  |
            •--------•--------•-------------------------------------------•------•---------•
            |  00AD  |  SHY   |  SOFT HYPHEN                              |  Cf  |    >­<  |
            •--------•--------•-------------------------------------------•------•---------•
            |  200B  |  ZWSP  |  ZERO WIDTH SPACE                         |  Cf  |    >​<   |
            |  200C  |  ZWNJ  |  ZERO WIDTH NON-JOINER                    |  Cf  |    >‌<   |
            |  200D  |  ZWJ   |  ZERO WIDTH JOINER                        |  Cf  |    >‍<   |
            |  200E  |  LRM   |  LEFT-TO-RIGHT MARK                       |  Cf  |    >‎<   |
            |  200F  |  RLM   |  RIGHT-TO-LEFT MARK                       |  Cf  |    >‏<   |
            •--------•--------•-------------------------------------------•------•---------•
            |  202A  |  LRE   |  LEFT-TO-RIGHT EMBEDDING                  |  Cf  |    >‪<   |
            |  202B  |  RLE   |  RIGHT-TO-LEFT EMBEDDING                  |  Cf  |    >‫<   |
            |  202C  |  PDF   |  POP DIRECTIONAL FORMATTING               |  Cf  |    >‬<   |
            |  202D  |  LRO   |  LEFT-TO-RIGHT OVERRIDE                   |  Cf  |    >‭<   |
            |  202E  |  RLO   |  RIGHT-TO-LEFT OVERRIDE                   |  Cf  |    >‮<   |
            •--------•--------•-------------------------------------------•------•---------•
            |  2060  |  WJ    |  WORD JOINER                              |  Cf  |    >⁠<  |
            |  2061  |  ƒ()   |  FUNCTION APPLICATION                     |  Cf  |    >⁡<  |
            |  2062  |  ×     |  INVISIBLE TIMES                          |  Cf  |    >⁢<  |
            |  2063  |  ,     |  INVISIBLE SEPARATOR                      |  Cf  |    >⁣<  |
            |  2064  |  +     |  INVISIBLE PLUS                           |  Cf  |    >⁤<  |
            |  2066  |  LRI   |  LEFT-TO-RIGHT ISOLATE                    |  Cf  |    >⁦<  |
            |  2067  |  RLI   |  RIGHT-TO-LEFT ISOLATE                    |  Cf  |    >⁧<  |
            |  2068  |  FSI   |  FIRST STRONG ISOLATE                     |  Cf  |    >⁨<  |
            |  2069  |  PDI   |  POP DIRECTIONAL ISOLATE                  |  Cf  |    >⁩<  |
            |  206A  |  ISS   |  INHIBIT SYMMETRIC SWAPPING               |  Cf  |    ><   |
            |  206B  |  ASS   |  ACTIVATE SYMMETRIC SWAPPING              |  Cf  |    ><   |
            |  206C  |  IAFS  |  INHIBIT ARABIC FORM SHAPING              |  Cf  |    ><   |
            |  206D  |  AAFS  |  ACTIVATE ARABIC FORM SHAPING             |  Cf  |    ><   |
            |  206E  |  NADS  |  NATIONAL DIGIT SHAPES                    |  Cf  |    ><   |
            |  206F  |  NODS  |  NOMINAL DIGIT SHAPES                     |  Cf  |    ><   |
            •--------•--------•-------------------------------------------•------•---------•
            |  FEFF  | ZWNBSP |  ZERO WIDTH NO-BREAK SPACE                |  Cf  |    ><   |
            •--------•--------•-------------------------------------------•------•---------•
            |  FFF9  |  IAA   |  INTERLINEAR ANNOTATION ANCHOR            |  Cf  |    ><   |
            |  FFFA  |  IAS   |  INTERLINEAR ANNOTATION SEPARATOR         |  Cf  |    ><   |
            |  FFFB  |  IAT   |  INTERLINEAR ANNOTATION TERMINATOR        |  Cf  |    ><   |
            •--------•--------•-------------------------------------------•------•---------•
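
    A list like the one above can be regenerated outside Notepad++. Here is a minimal sketch, assuming Python 3 and its standard unicodedata module; the helper name format_characters is made up for illustration. Note that it returns every Cf character up to FFFF, including the language-specific ones (the Arabic ones, for instance) that the table leaves out, and that the exact result depends on the Unicode version bundled with your Python.

```python
import unicodedata


def format_characters(limit=0xFFFF):
    """Return (code point, name) pairs for every Cf character up to `limit`."""
    found = []
    for cp in range(limit + 1):
        ch = chr(cp)
        # General Category "Cf" = Format Character
        if unicodedata.category(ch) == "Cf":
            found.append((cp, unicodedata.name(ch, "<unnamed>")))
    return found
```

    Calling format_characters() and printing each pair with f"{cp:04X}  {name}" gives one line per character, in the same order as the table.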
    

    Now, depending on the current font used in N++, the glyph of these characters may :

    • Be invisible ( A true Zero Width character )

    • Display a square or a thin rectangular box ( Character not handled by current font )

    • Display a specific character ( case of the Soft Hyphen )


    Of course, with the simple regular expression \x{####}, you can match the character of Unicode value = ####. But it would be better to find a regex that matches any of these format characters !

    I noticed that the POSIX character class [[:cntrl:]] matches most of these characters :

    • The 4 characters, from \x{200C} to \x{200F}

    • The 5 characters, from \x{202A} to \x{202E}

    • The 6 characters, from \x{206A} to \x{206F}

    • The character \x{FEFF}

    • The 3 characters, from \x{FFF9} to \x{FFFB}


    Unfortunately, the [[:cntrl:]] regex also matches the Control characters :

    • The 32 C0 characters, from \x{0000} to \x{001F}

    • The 32 C1 characters, from \x{0080} to \x{009F}

    Moreover, the [[:cntrl:]] regex misses some characters :

    • The Soft Hyphen \x{00AD}

    • The Zero Width Space \x{200B}

    • The 9 assigned characters, from \x{2060} to \x{2069} ( \x{2065} is unassigned )


    So, a correct regex to match all the format characters above, in a Unicode encoded file, could be :

    (?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD

    But, probably, you may just need this shorter regex (?=[[:unicode:]])[[:cntrl:]\x{200B}] !
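
    Outside Notepad++, roughly the same set can be matched with Python's re module. This is only a sketch under assumptions : Python's re supports neither [[:cntrl:]] nor the \x{####} syntax, so the class below spells out the code point ranges of the list explicitly ( and, unlike [[:cntrl:]], leaves out the C0 and C1 control characters ). The names FORMAT_CHARS and find_format_chars are hypothetical.

```python
import re

# Explicit ranges taken from the list of format characters:
# SHY, 200B-200F, 202A-202E, 2060-206F, FEFF, FFF9-FFFB
FORMAT_CHARS = re.compile(
    r"[\u00AD\u200B-\u200F\u202A-\u202E\u2060-\u206F\uFEFF\uFFF9-\uFFFB]"
)


def find_format_chars(text):
    """Return (offset, 'U+XXXX') for each format character found in text."""
    return [
        (m.start(), f"U+{ord(m.group()):04X}")
        for m in FORMAT_CHARS.finditer(text)
    ]
```

    The offsets are character positions in the string, which tells you exactly where in a line each invisible character sits.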

    Now, how to visualize a zero-width character ? If you just hit the Find Next button, you see that a specific line is reached but you do not know the exact location of this/these zero-width char(s) :-((

    Two solutions are possible :

    • Firstly, use the regexes :

      • (?-s).((?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD)+.

      • (?-s).(?=[[:unicode:]])[[:cntrl:]\x{200B}]+.

    Which match two standard characters separated by one or several consecutive format character(s)

    • Secondly, use the regexes :

      • ((?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD)+

      • (?=[[:unicode:]])[[:cntrl:]\x{200B}]+

    Which mark all these format chars, while clicking on the Mark All button ( the best solution, to my mind ! )
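
    The same idea, making every format character visible, can be sketched in a few lines of Python for text that lives outside the editor. This assumes Python 3 and the standard unicodedata module; the helper name reveal is made up for illustration. Because it tests the General Category Cf directly, it also flags language-specific format characters that are not in the list above.

```python
import unicodedata


def reveal(text):
    """Replace every Cf (format) character with a visible <U+XXXX> tag."""
    return "".join(
        f"<U+{ord(ch):04X}>" if unicodedata.category(ch) == "Cf" else ch
        for ch in text
    )


# Two consecutive ZERO WIDTH SPACE characters hidden inside a word
sample = "tel\u200b\u200bl me"
print(reveal(sample))  # tel<U+200B><U+200B>l me
```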


    So, trying the simple regex \x{200B} against the sentence below, using the Mark option, will convince you that this sentence does contain some Zero Width Space characters !

    F​or exam​ple, I’ve ins​erted 10 ze​ro-width spa​ces in​to thi​s sentence, c​an you tel​​l me where ?
    

    Note that, between the two letters l of the verb tell, there are two consecutive chars \x{200B} !


    You can see a description of these format characters at the following links :

    http://www.unicode.org/charts/PDF/U2000.pdf

    http://www.unicode.org/charts/PDF/UFE70.pdf

    http://www.unicode.org/charts/PDF/UFFF0.pdf

    Refer, also, to this post :

    https://notepad-plus-plus.org/community/topic/14812/how-to-search-for-unknown-3-digit-characters-with-black-background/2

    Best Regards,

    guy038

    P.S. :

    Simply, copy/paste the list and the sentence, above, in inverse video, in a new tab and enjoy !

