Visualization for zero-width characters

  • P
    Petyo Vodenicharov
    last edited by Apr 19, 2018, 8:50 AM

    Hello Community,
    I just ran into this post:
    https://medium.com/@umpox/be-careful-what-you-copy-invisibly-inserting-usernames-into-text-with-zero-width-characters-18b4e6f17b66

    In summary, text can contain zero-width (invisible) characters that most applications don’t display.
    I tested with Notepad++, and it does seem to register that the characters are there when I navigate the text (I need an extra press of the arrow key at a location that holds a zero-width character).
    Even with the “Show All Characters” option enabled, I don’t see any characters there.

    Please consider adding visualization for zero-width characters; it would be very helpful for security-conscious people.

    PS. Thanks to everyone who is involved in the existence of this great software.

    All the best,
    Petyo

    • P
      PeterJones
      last edited by Apr 19, 2018, 1:17 PM

      See my previous post on the subject of zero-width characters, and an earlier post where I shared a pair of PythonScript scripts that provide fuller Show All Characters and Don't Show All Characters functionality.

      • G
        guy038
        posted Apr 19, 2018, 7:25 PM, last edited by guy038 Aug 1, 2022, 11:01 AM

        Hello, @petyo-vodenicharov, @peterjones and All,

        In addition to Peter’s two valuable posts above, here is my contribution on these strange characters ;-))

        Below is a list of all the Unicode characters with the General Category property Cf (Format Character) that both have a code value below FFFF and do NOT strictly depend on a specific language!

                •--------•--------•-------------------------------------------•------•---------•
                |  Code  |  Abbr. |              Complete Name                | Cat. |  >Char< |
                •--------•--------•-------------------------------------------•------•---------•
                |  00AD  |  SHY   |  SOFT HYPHEN                              |  Cf  |    >­<  |
                •--------•--------•-------------------------------------------•------•---------•
                |  200B  |  ZWSP  |  ZERO WIDTH SPACE                         |  Cf  |    >​<   |
                |  200C  |  ZWNJ  |  ZERO WIDTH NON-JOINER                    |  Cf  |    >‌<   |
                |  200D  |  ZWJ   |  ZERO WIDTH JOINER                        |  Cf  |    >‍<   |
                |  200E  |  LRM   |  LEFT-TO-RIGHT MARK                       |  Cf  |    >‎<   |
                |  200F  |  RLM   |  RIGHT-TO-LEFT MARK                       |  Cf  |    >‏<   |
                •--------•--------•-------------------------------------------•------•---------•
                |  202A  |  LRE   |  LEFT-TO-RIGHT EMBEDDING                  |  Cf  |    >‪<   |
                |  202B  |  RLE   |  RIGHT-TO-LEFT EMBEDDING                  |  Cf  |    >‫<   |
                |  202C  |  PDF   |  POP DIRECTIONAL FORMATTING               |  Cf  |    >‬<   |
                |  202D  |  LRO   |  LEFT-TO-RIGHT OVERRIDE                   |  Cf  |    >‭<   |
                |  202E  |  RLO   |  RIGHT-TO-LEFT OVERRIDE                   |  Cf  |    >‮<   |
                •--------•--------•-------------------------------------------•------•---------•
                |  2060  |  WJ    |  WORD JOINER                              |  Cf  |    >⁠<  |
                |  2061  |  ƒ()   |  FUNCTION APPLICATION                     |  Cf  |    >⁡<  |
                |  2062  |  ×     |  INVISIBLE TIMES                          |  Cf  |    >⁢<  |
                |  2063  |  ,     |  INVISIBLE SEPARATOR                      |  Cf  |    >⁣<  |
                |  2064  |  +     |  INVISIBLE PLUS                           |  Cf  |    >⁤<  |
                |  2066  |  LRI   |  LEFT-TO-RIGHT ISOLATE                    |  Cf  |    >⁦<  |
                |  2067  |  RLI   |  RIGHT-TO-LEFT ISOLATE                    |  Cf  |    >⁧<  |
                |  2068  |  FSI   |  FIRST STRONG ISOLATE                     |  Cf  |    >⁨<  |
                |  2069  |  PDI   |  POP DIRECTIONAL ISOLATE                  |  Cf  |    >⁩<  |
                |  206A  |  ISS   |  INHIBIT SYMMETRIC SWAPPING               |  Cf  |    ><   |
                |  206B  |  ASS   |  ACTIVATE SYMMETRIC SWAPPING              |  Cf  |    ><   |
                |  206C  |  IAFS  |  INHIBIT ARABIC FORM SHAPING              |  Cf  |    ><   |
                |  206D  |  AAFS  |  ACTIVATE ARABIC FORM SHAPING             |  Cf  |    ><   |
                |  206E  |  NADS  |  NATIONAL DIGIT SHAPES                    |  Cf  |    ><   |
                |  206F  |  NODS  |  NOMINAL DIGIT SHAPES                     |  Cf  |    ><   |
                •--------•--------•-------------------------------------------•------•---------•
                |  FEFF  | ZWNBSP |  ZERO WIDTH NO-BREAK SPACE                |  Cf  |    ><   |
                •--------•--------•-------------------------------------------•------•---------•
                |  FFF9  |  IAA   |  INTERLINEAR ANNOTATION ANCHOR            |  Cf  |    ><   |
                |  FFFA  |  IAS   |  INTERLINEAR ANNOTATION SEPARATOR         |  Cf  |    ><   |
                |  FFFB  |  IAT   |  INTERLINEAR ANNOTATION TERMINATOR        |  Cf  |    ><   |
                •--------•--------•-------------------------------------------•------•---------•
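A table like this can be cross-checked programmatically. The following is a minimal sketch, assuming Python's standard `unicodedata` module; the function name `format_characters` is made up for illustration. Note that it returns every Cf character in the range, including the script-specific ones (Arabic, Syriac, Mongolian and so on) that the table above leaves out, and that the result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata


def format_characters(limit=0xFFFF):
    """Return (code point, name) pairs for every character below `limit`
    whose Unicode General Category is Cf (Format)."""
    found = []
    for cp in range(limit):
        ch = chr(cp)
        if unicodedata.category(ch) == "Cf":
            # Every assigned Cf character has a name; the default is a safeguard.
            found.append((cp, unicodedata.name(ch, "<unnamed>")))
    return found
```

Printing each pair with `print(f"{cp:04X}  {name}")` gives a listing that can be compared with the Code and Complete Name columns.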
        

        Now, depending on the current font used in N++, the glyph of each of these characters may:

        • Be invisible (a true zero-width character)

        • Display as a square or a thin rectangular box (character not handled by the current font)

        • Display as a specific character (the case of the Soft Hyphen)


        Of course, with the simple regular expression \x{####}, you can match the character whose Unicode value is ####. But it would be better to find a regex that matches any of these format characters!

        I noticed that the POSIX character class [[:cntrl:]] matches most of these characters:

        • The 4 characters, from \x{200C} to \x{200F}

        • The 5 characters, from \x{202A} to \x{202E}

        • The 6 characters, from \x{206A} to \x{206F}

        • The character \x{FEFF}

        • The 3 characters, from \x{FFF9} to \x{FFFB}


        Unfortunately, the [[:cntrl:]] class also matches the control characters:

        • The 32 C0 characters, from \x{0000} to \x{001F}

        • The 32 C1 characters, from \x{0080} to \x{009F}

        Moreover, the [[:cntrl:]] class misses some characters:

        • The Soft Hyphen \x{00AD}

        • The Zero Width Space \x{200B}

        • The 9 characters, from \x{2060} to \x{2069} ( \x{2065} is unassigned )


        So, a correct regex to match all the format characters above, in a Unicode-encoded file, could be:

        • (?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD
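Outside Notepad++, the same set of characters can be matched with explicit ranges. This is a rough Python sketch, not a translation of the regex above: Python's `re` module has neither POSIX classes nor `[[:unicode:]]`, so the ranges are spelled out from the table, and the names `FORMAT_CHARS` and `has_format_chars` are illustrative only.

```python
import re

# Explicit code point ranges taken from the table in this post.
# U+2065 is unassigned, so the 2060 block is split around it.
FORMAT_CHARS = re.compile(
    "["
    r"\u00AD"         # soft hyphen
    r"\u200B-\u200F"  # zero-width space/joiners, LRM, RLM
    r"\u202A-\u202E"  # directional embeddings and overrides
    r"\u2060-\u2064"  # word joiner and invisible math operators
    r"\u2066-\u206F"  # directional isolates and deprecated format chars
    r"\uFEFF"         # zero-width no-break space (BOM)
    r"\uFFF9-\uFFFB"  # interlinear annotation characters
    "]"
)


def has_format_chars(text):
    """Return True if `text` contains at least one of the listed format characters."""
    return FORMAT_CHARS.search(text) is not None
```

Because the class lists only the format characters, the C0 and C1 control characters are never matched and no lookahead is needed.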

        Now, how to visualize a zero-width character? If you just hit the Find Next button, you see that a specific line is reached, but you do not know the exact location of the zero-width character(s) :-((

        Two solutions are possible:

        • (?-s).((?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD)+.

        which matches two standard characters separated by one or more consecutive format characters

        • ((?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD)+

        which marks all these format characters when you click the Mark All button (the best solution, to my mind!)
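For text handled outside the editor, a comparable effect to Mark All is to replace each format character with a visible tag. The sketch below assumes Python and the explicit ranges from the table; `reveal` and the `<U+XXXX>` tag format are hypothetical choices, not anything Notepad++ provides.

```python
import re

# Same explicit ranges as the table in this post (U+2065 is unassigned).
FORMAT_CHARS = re.compile(
    r"[\u00AD\u200B-\u200F\u202A-\u202E\u2060-\u2064"
    r"\u2066-\u206F\uFEFF\uFFF9-\uFFFB]"
)


def reveal(text):
    """Replace every format character with a visible <U+XXXX> tag."""
    return FORMAT_CHARS.sub(lambda m: f"<U+{ord(m.group()):04X}>", text)
```

For example, `reveal("tel\u200b\u200bl")` returns `tel<U+200B><U+200B>l`, which makes both the position and the identity of each hidden character obvious.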


        So, trying the simple regex \x{200B} against the sentence below, with the Mark option, will convince you that this sentence does contain some Zero Width Space characters!

        F​or exam​ple, I’ve ins​erted 10 ze​ro-width spa​ces in​to thi​s sentence, c​an you tel​​l me where ?
        

        Note that between the two letters l of the verb tell, there are two consecutive \x{200B} characters!
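The same check can be scripted. As a small sketch, assuming Python, the hypothetical helper `zwsp_runs` reports where each run of Zero Width Space characters starts and how long it is, so consecutive occurrences like the pair inside "tell" show up as a single run of length 2. The sample string is built with explicit escapes so the hidden characters are visible in the source.

```python
import re


def zwsp_runs(text):
    """Return (start index, run length) for each run of U+200B in `text`."""
    return [(m.start(), len(m.group())) for m in re.finditer(r"\u200B+", text)]


# The tail of the test sentence, with its zero-width spaces written as escapes.
sample = "c\u200ban you tel\u200b\u200bl me where ?"
runs = zwsp_runs(sample)          # [(1, 1), (12, 2)]
total = sum(length for _, length in runs)  # 3 zero-width spaces in this fragment
```

Summing the run lengths over the full sentence gives the total number of hidden characters.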


        You can find a description of these format characters at the following links:

        http://www.unicode.org/charts/PDF/U2000.pdf

        http://www.unicode.org/charts/PDF/UFE70.pdf

        http://www.unicode.org/charts/PDF/UFFF0.pdf

        Refer also to this post:

        https://notepad-plus-plus.org/community/topic/14812/how-to-search-for-unknown-3-digit-characters-with-black-background/2

        Best Regards,

        guy038

        P.S. :

        Simply copy/paste the list and the sentence above, shown in inverse video, into a new tab and enjoy!
