    Visualization for zero-width characters

    • Petyo Vodenicharov

      Hello Community,
      I just ran into this post:
      https://medium.com/@umpox/be-careful-what-you-copy-invisibly-inserting-usernames-into-text-with-zero-width-characters-18b4e6f17b66

      In summary, text can contain zero-width (invisible) characters that most applications don’t display.
      I’ve tested this with Notepad++: it does seem to register that the characters are there when I navigate the text (I need an extra press of the arrow key wherever a zero-width character is located).
      Even with the “Show All Characters” option enabled, I don’t see any characters there.

      Please consider adding visualization for zero-width characters. It would be very helpful for security-conscious people.

      PS. Thanks to everyone who is involved in the existence of this great software.

      All the best,
      Petyo

      • PeterJones

        See my previous post on the subject of zero-width characters, and an earlier post where I shared a pair of PythonScript scripts that provide fuller Show All Characters and Don't Show All Characters functionality.

        • guy038

          Hello, @petyo-vodenicharov, @peterjones and All,

          In addition to Peter’s two valuable posts above, here is my contribution on these strange characters ;-))

          Below is a list of all the Unicode characters with the General Category property = Cf ( Format Character ) which both have a code value < FFFF and do NOT strictly depend on a specific language !

                  •--------•--------•-------------------------------------------•------•---------•
                  |  Code  |  Abbr. |              Complete Name                | Cat. |  >Car<  |
                  •--------•--------•-------------------------------------------•------•---------•
                  |  00AD  |  SHY   |  SOFT HYPHEN                              |  Cf  |    >­<  |
                  •--------•--------•-------------------------------------------•------•---------•
                  |  200B  |  ZWSP  |  ZERO WIDTH SPACE                         |  Cf  |    >​<   |
                  |  200C  |  ZWNJ  |  ZERO WIDTH NON-JOINER                    |  Cf  |    >‌<   |
                  |  200D  |  ZWJ   |  ZERO WIDTH JOINER                        |  Cf  |    >‍<   |
                  |  200E  |  LRM   |  LEFT-TO-RIGHT MARK                       |  Cf  |    >‎<   |
                  |  200F  |  RLM   |  RIGHT-TO-LEFT MARK                       |  Cf  |    >‏<   |
                  •--------•--------•-------------------------------------------•------•---------•
                  |  202A  |  LRE   |  LEFT-TO-RIGHT EMBEDDING                  |  Cf  |    >‪<   |
                  |  202B  |  RLE   |  RIGHT-TO-LEFT EMBEDDING                  |  Cf  |    >‫<   |
                  |  202C  |  PDF   |  POP DIRECTIONAL FORMATTING               |  Cf  |    >‬<   |
                  |  202D  |  LRO   |  LEFT-TO-RIGHT OVERRIDE                   |  Cf  |    >‭<   |
                  |  202E  |  RLO   |  RIGHT-TO-LEFT OVERRIDE                   |  Cf  |    >‮<   |
                  •--------•--------•-------------------------------------------•------•---------•
                  |  2060  |  WJ    |  WORD JOINER                              |  Cf  |    >⁠<  |
                  |  2061  |  ƒ()   |  FUNCTION APPLICATION                     |  Cf  |    >⁡<  |
                  |  2062  |  ×     |  INVISIBLE TIMES                          |  Cf  |    >⁢<  |
                  |  2063  |  ,     |  INVISIBLE SEPARATOR                      |  Cf  |    >⁣<  |
                  |  2064  |  +     |  INVISIBLE PLUS                           |  Cf  |    >⁤<  |
                  |  2066  |  LRI   |  LEFT-TO-RIGHT ISOLATE                    |  Cf  |    >⁦<  |
                  |  2067  |  RLI   |  RIGHT-TO-LEFT ISOLATE                    |  Cf  |    >⁧<  |
                  |  2068  |  FSI   |  FIRST STRONG ISOLATE                     |  Cf  |    >⁨<  |
                  |  2069  |  PDI   |  POP DIRECTIONAL ISOLATE                  |  Cf  |    >⁩<  |
                  |  206A  |  ISS   |  INHIBIT SYMMETRIC SWAPPING               |  Cf  |    ><   |
                  |  206B  |  ASS   |  ACTIVATE SYMMETRIC SWAPPING              |  Cf  |    ><   |
                  |  206C  |  IAFS  |  INHIBIT ARABIC FORM SHAPING              |  Cf  |    ><   |
                  |  206D  |  AAFS  |  ACTIVATE ARABIC FORM SHAPING             |  Cf  |    ><   |
                  |  206E  |  NADS  |  NATIONAL DIGIT SHAPES                    |  Cf  |    ><   |
                  |  206F  |  NODS  |  NOMINAL DIGIT SHAPES                     |  Cf  |    ><   |
                  •--------•--------•-------------------------------------------•------•---------•
                  |  FEFF  | ZWNBSP |  ZERO WIDTH NO-BREAK SPACE                |  Cf  |    ><   |
                  •--------•--------•-------------------------------------------•------•---------•
                  |  FFF9  |  IAA   |  INTERLINEAR ANNOTATION ANCHOR            |  Cf  |    ><   |
                  |  FFFA  |  IAS   |  INTERLINEAR ANNOTATION SEPARATOR         |  Cf  |    ><   |
                  |  FFFB  |  IAT   |  INTERLINEAR ANNOTATION TERMINATOR        |  Cf  |    ><   |
                  •--------•--------•-------------------------------------------•------•---------•
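A list like the one above can be regenerated programmatically. The following is only a sketch in Python, not the way the table was built: it relies on the standard `unicodedata` module (whose data depends on the Unicode version bundled with your Python), and `bmp_format_chars` is a hypothetical helper name. Unlike the table, it does not filter out the language-specific format characters (Arabic, Syriac, Mongolian...), so it returns more entries.

```python
import unicodedata

def bmp_format_chars():
    """Return (code point, name) pairs for every character below 0x10000
    whose Unicode General Category is Cf (Format)."""
    result = []
    for cp in range(0x10000):
        ch = chr(cp)
        if unicodedata.category(ch) == "Cf":
            # The default name is only a fallback for characters without a name.
            result.append((cp, unicodedata.name(ch, "<no name>")))
    return result

if __name__ == "__main__":
    for cp, name in bmp_format_chars():
        print(f"{cp:04X}  {name}")
```

Filtering the result down to the language-independent characters still has to be done by hand, since that criterion is not a Unicode property.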
          

          Now, depending on the current font used in N++, the glyph of each of these characters may :

          • Be invisible ( A true Zero Width character )

          • Display a square or a thin rectangular box ( Character not handled by current font )

          • Display a specific character ( case of the Soft Hyphen )


          Of course, with the simple regular expression \x{####}, you can match the character whose Unicode value is ####. But it would be better to find a single regex that matches any of these format characters !

          I noticed that the POSIX character class [[:cntrl:]] matches most of these characters :

          • The 4 characters, from \x{200C} to \x{200F}

          • The 5 characters, from \x{202A} to \x{202E}

          • The 6 characters, from \x{206A} to \x{206F}

          • The character \x{FEFF}

          • The 3 characters, from \x{FFF9} to \x{FFFB}


          Unfortunately, the [[:cntrl:]] regex also matches the Control characters :

          • The 32 C0 characters, from \x{0000} to \x{001F}

          • The 32 C1 characters, from \x{0080} to \x{009F}

          Moreover, the [[:cntrl:]] regex misses some characters :

          • The Soft Hyphen \x{00AD}

          • The Zero Width Space \x{200B}

          • The 9 characters, from \x{2060} to \x{2069} ( the code point \x{2065} is unassigned )
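The difference between the Control characters and the Format characters discussed above can be checked against the Unicode General Category values Cc and Cf. This is a sketch in Python using the standard `unicodedata` module; it says nothing about how the regex engine inside N++ implements [[:cntrl:]], and `code_points_in_category` is a hypothetical helper name.

```python
import unicodedata

def code_points_in_category(category, start=0, stop=0x10000):
    """List the code points in [start, stop) whose General Category matches."""
    return [cp for cp in range(start, stop)
            if unicodedata.category(chr(cp)) == category]

# Category Cc: the C0 block, DEL (U+007F) and the C1 block.
controls = code_points_in_category("Cc")

# Category Cf inside the range 2060..2069: U+2065 is skipped because it is unassigned.
invisible_ops = code_points_in_category("Cf", 0x2060, 0x206A)

if __name__ == "__main__":
    print(len(controls), "control characters")
    print([f"{cp:04X}" for cp in invisible_ops])
```

Note that Unicode's Cc category also contains U+007F (DEL), which belongs to neither the C0 nor the C1 range listed above.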


          So, a correct regex to match all the format characters above, in a Unicode-encoded file, could be :

          • (?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD
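Outside N++, the same set of characters can be searched for with an explicit character class. The sketch below uses Python's `re` module, which has no [[:cntrl:]] or [[:unicode:]] classes, so the ranges from the table are spelled out by hand. This is a substitute technique, not a translation of the regex above, and the names `FORMAT_CHARS` and `find_format_chars` are hypothetical.

```python
import re

# Explicit ranges taken from the table: SHY, 200B-200F, 202A-202E,
# 2060-2064, 2066-206F, FEFF and FFF9-FFFB.
FORMAT_CHARS = re.compile(
    "[\u00AD\u200B-\u200F\u202A-\u202E"
    "\u2060-\u2064\u2066-\u206F\uFEFF\uFFF9-\uFFFB]"
)

def find_format_chars(text):
    """Return (index, 'U+XXXX') for each format character found in text."""
    return [(m.start(), f"U+{ord(m.group()):04X}")
            for m in FORMAT_CHARS.finditer(text)]

if __name__ == "__main__":
    print(find_format_chars("a\u200bb\u00adc"))
```

Because the class lists only format characters, ordinary control characters such as TAB or LF are not reported.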

          Now, how can you visualize a zero-width character ? If you just hit the Find Next button, you see that a specific line is reached, but you do not know the exact location of the zero-width char(s) on it :-((

          Two solutions are possible :

          • (?-s).((?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD)+.

          which matches two standard characters separated by one or several consecutive format character(s)

          • ((?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD)+

          which marks all these format chars when you click on the Mark All button ( the best solution, to my mind ! )
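When the text is processed outside the editor, a rough equivalent of marking is to replace each format character with a visible placeholder. This is a minimal sketch in Python, assuming a placeholder of the form <U+XXXX>; the character class is written out by hand from the table, and the names `FORMAT_CHAR` and `reveal` are hypothetical.

```python
import re

# Same hand-written character class as in the table: one format character per match.
FORMAT_CHAR = re.compile(
    "[\u00AD\u200B-\u200F\u202A-\u202E"
    "\u2060-\u2064\u2066-\u206F\uFEFF\uFFF9-\uFFFB]"
)

def reveal(text):
    """Replace every format character with a visible <U+XXXX> placeholder."""
    return FORMAT_CHAR.sub(lambda m: f"<U+{ord(m.group()):04X}>", text)

if __name__ == "__main__":
    print(reveal("tel\u200b\u200bl me"))
```

The output is meant for inspection only: the placeholders change the text, so it should not be written back over the original file.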


          So, trying the simple regex \x{200B} against the sentence below, with the Mark option, will convince you that this sentence does contain some Zero Width Space characters !

          F​or exam​ple, I’ve ins​erted 10 ze​ro-width spa​ces in​to thi​s sentence, c​an you tel​​l me where ?
          

          Note that, between the two letters l of the verb tell, there are two consecutive chars \x{200B} !
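The same check can be scripted. The sketch below is written in Python and uses a short made-up sample built with explicit escapes (not the sentence above, so that the code stays readable); it reports where each run of Zero Width Space characters starts and how long it is. `zwsp_runs` is a hypothetical helper name.

```python
import re

ZWSP = "\u200b"  # ZERO WIDTH SPACE

# Made-up sample: two consecutive ZWSP inside "tell", one inside "where".
sample = f"tel{ZWSP}{ZWSP}l me w{ZWSP}here"

def zwsp_runs(text):
    """Return (index, run length) for each run of consecutive ZWSP characters."""
    return [(m.start(), len(m.group())) for m in re.finditer(f"{ZWSP}+", text)]

if __name__ == "__main__":
    print(sample.count(ZWSP), "zero-width spaces in total")
    print(zwsp_runs(sample))
```

Pasting the real sentence into a string literal in place of the sample should give a total of 10, if the copy preserved the invisible characters.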


          You can find a description of these format characters at the following links :

          http://www.unicode.org/charts/PDF/U2000.pdf

          http://www.unicode.org/charts/PDF/UFE70.pdf

          http://www.unicode.org/charts/PDF/UFFF0.pdf

          See also this post :

          https://notepad-plus-plus.org/community/topic/14812/how-to-search-for-unknown-3-digit-characters-with-black-background/2

          Best Regards,

          guy038

          P.S. :

          Simply copy/paste the list and the sentence above ( shown in inverse video ) into a new tab, and enjoy !
