    Visualization for zero-width characters

    • Petyo Vodenicharov

      Hello Community,
      I just ran into this post:
      https://medium.com/@umpox/be-careful-what-you-copy-invisibly-inserting-usernames-into-text-with-zero-width-characters-18b4e6f17b66

      In summary, you can have zero-width (invisible) characters that most applications don’t display.
      I’ve tested with Notepad++, and it does seem to register that there are characters when I navigate the text (I need an extra press of the arrow key when I’m at a location with zero-width chars).
      However, even with the “Show All Characters” option, I don’t see any characters there.

      Please consider adding visualization for zero-width characters. It would be very helpful for security-conscious people.

      PS. Thanks to everyone who is involved in the existence of this great software.

      All the best,
      Petyo

      • PeterJones

        See my previous post on the subject of zero-width characters, and an earlier post where I shared a pair of PythonScript scripts that provide fuller "Show All Characters" and "Don't Show All Characters" functionality.

        • guy038

          Hello, @petyo-vodenicharov, @peterjones and All,

          In addition to Peter’s two valuable posts above, here is my contribution about these strange characters ;-))

          Below is a list of all the Unicode characters with General Category property = Cf ( Format Character ) which both have a code value < FFFF and do NOT strictly depend on a specific language !

                  •--------•--------•-------------------------------------------•------•---------•
                  |  Code  |  Abbr. |              Complete Name                | Cat. |  >Car<  |
                  •--------•--------•-------------------------------------------•------•---------•
                  |  00AD  |  SHY   |  SOFT HYPHEN                              |  Cf  |    >­<  |
                  •--------•--------•-------------------------------------------•------•---------•
                  |  200B  |  ZWSP  |  ZERO WIDTH SPACE                         |  Cf  |    >​<   |
                  |  200C  |  ZWNJ  |  ZERO WIDTH NON-JOINER                    |  Cf  |    >‌<   |
                  |  200D  |  ZWJ   |  ZERO WIDTH JOINER                        |  Cf  |    >‍<   |
                  |  200E  |  LRM   |  LEFT-TO-RIGHT MARK                       |  Cf  |    >‎<   |
                  |  200F  |  RLM   |  RIGHT-TO-LEFT MARK                       |  Cf  |    >‏<   |
                  •--------•--------•-------------------------------------------•------•---------•
                  |  202A  |  LRE   |  LEFT-TO-RIGHT EMBEDDING                  |  Cf  |    >‪<   |
                  |  202B  |  RLE   |  RIGHT-TO-LEFT EMBEDDING                  |  Cf  |    >‫<   |
                  |  202C  |  PDF   |  POP DIRECTIONAL FORMATTING               |  Cf  |    >‬<   |
                  |  202D  |  LRO   |  LEFT-TO-RIGHT OVERRIDE                   |  Cf  |    >‭<   |
                  |  202E  |  RLO   |  RIGHT-TO-LEFT OVERRIDE                   |  Cf  |    >‮<   |
                  •--------•--------•-------------------------------------------•------•---------•
                  |  2060  |  WJ    |  WORD JOINER                              |  Cf  |    >⁠<  |
                  |  2061  |  ƒ()   |  FUNCTION APPLICATION                     |  Cf  |    >⁡<  |
                  |  2062  |  ×     |  INVISIBLE TIMES                          |  Cf  |    >⁢<  |
                  |  2063  |  ,     |  INVISIBLE SEPARATOR                      |  Cf  |    >⁣<  |
                  |  2064  |  +     |  INVISIBLE PLUS                           |  Cf  |    >⁤<  |
                  |  2066  |  LRI   |  LEFT-TO-RIGHT ISOLATE                    |  Cf  |    >⁦<  |
                  |  2067  |  RLI   |  RIGHT-TO-LEFT ISOLATE                    |  Cf  |    >⁧<  |
                  |  2068  |  FSI   |  FIRST STRONG ISOLATE                     |  Cf  |    >⁨<  |
                  |  2069  |  PDI   |  POP DIRECTIONAL ISOLATE                  |  Cf  |    >⁩<  |
                  |  206A  |  ISS   |  INHIBIT SYMMETRIC SWAPPING               |  Cf  |    ><   |
                  |  206B  |  ASS   |  ACTIVATE SYMMETRIC SWAPPING              |  Cf  |    ><   |
                  |  206C  |  IAFS  |  INHIBIT ARABIC FORM SHAPING              |  Cf  |    ><   |
                  |  206D  |  AAFS  |  ACTIVATE ARABIC FORM SHAPING             |  Cf  |    ><   |
                  |  206E  |  NADS  |  NATIONAL DIGIT SHAPES                    |  Cf  |    ><   |
                  |  206F  |  NODS  |  NOMINAL DIGIT SHAPES                     |  Cf  |    ><   |
                  •--------•--------•-------------------------------------------•------•---------•
                  |  FEFF  | ZWNBSP |  ZERO WIDTH NO-BREAK SPACE                |  Cf  |    ><   |
                  •--------•--------•-------------------------------------------•------•---------•
                  |  FFF9  |  IAA   |  INTERLINEAR ANNOTATION ANCHOR            |  Cf  |    ><   |
                  |  FFFA  |  IAS   |  INTERLINEAR ANNOTATION SEPARATOR         |  Cf  |    ><   |
                  |  FFFB  |  IAT   |  INTERLINEAR ANNOTATION TERMINATOR        |  Cf  |    ><   |
                  •--------•--------•-------------------------------------------•------•---------•
          
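          For anyone who wants to rebuild such a list, here is a minimal Python sketch. It relies only on the standard `unicodedata` module; the function name `bmp_format_characters` is my own. Note that it reports every Cf character below 10000 hex, so the output is longer than the table above: it also includes the script-specific format characters (Arabic, Syriac, Mongolian, ...) that the table leaves out, and the exact set depends on the Unicode version bundled with your Python.

```python
import unicodedata


def bmp_format_characters():
    """Return (code point, name) pairs for BMP characters in General Category Cf."""
    found = []
    for cp in range(0x10000):
        ch = chr(cp)
        if unicodedata.category(ch) == "Cf":
            # Every assigned Cf character has a name; the default is only a safeguard.
            found.append((cp, unicodedata.name(ch, "<unnamed>")))
    return found


if __name__ == "__main__":
    for cp, name in bmp_format_characters():
        print(f"{cp:04X}  {name}")
```

          Filtering the result down to the rows of the table is then a matter of keeping only the code points you care about.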

          Now, depending on the current font used in N++, the glyph of these characters may :

          • Be invisible ( a true zero-width character )

          • Display as a square or a thin rectangular box ( character not handled by the current font )

          • Display as a specific character ( the case of the Soft Hyphen )


          Of course, with the simple regular expression \x{####}, you can match the character with Unicode value = ####. But it would be better to find a regex that matches any of these format characters !

          I noticed that the Posix character class [[:cntrl:]] matches most of these characters :

          • The 4 characters, from \x{200C} to \x{200F}

          • The 5 characters, from \x{202A} to \x{202E}

          • The 6 characters, from \x{206A} to \x{206F}

          • The character \x{FEFF}

          • The 3 characters, from \x{FFF9} to \x{FFFB}


          Unfortunately, the [[:cntrl:]] regex also matches the Control characters :

          • The 32 C0 characters, from \x{0000} to \x{001F}

          • The 32 C1 characters, from \x{0080} to \x{009F}

          Moreover, the [[:cntrl:]] regex misses some of the format characters :

          • The Soft Hyphen \x{00AD}

          • The Zero Width Space \x{200B}

          • The 9 characters, from \x{2060} to \x{2069}


          So, a correct regex to match all the format characters above, in a Unicode-encoded file, could be :

          • (?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD
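          Outside Notepad++, the same set can be approximated in Python. This is only a sketch under one assumption: Python's `re` module has no `[[:cntrl:]]` or `[[:unicode:]]` classes, so I spell out the code-point ranges from the table explicitly instead. The names `FORMAT_CHARS` and `find_format_chars` are illustrative.

```python
import re

# Explicit ranges standing in for the [[:cntrl:]] / [[:unicode:]] classes
# of the Notepad++ regex, which Python's re module does not provide.
FORMAT_CHARS = re.compile(
    "["
    "\u00ad"          # SOFT HYPHEN
    "\u200b-\u200f"   # ZWSP, ZWNJ, ZWJ, LRM, RLM
    "\u202a-\u202e"   # LRE, RLE, PDF, LRO, RLO
    "\u2060-\u206f"   # WJ through NODS (U+2065 is unassigned)
    "\ufeff"          # ZERO WIDTH NO-BREAK SPACE
    "\ufff9-\ufffb"   # interlinear annotation characters
    "]"
)


def find_format_chars(text):
    """Return (offset, 'U+XXXX') for every format character found in text."""
    return [(m.start(), f"U+{ord(m.group()):04X}") for m in FORMAT_CHARS.finditer(text)]
```

          For instance, `find_format_chars("a\u200bb\u00adc")` returns `[(1, 'U+200B'), (3, 'U+00AD')]`.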

          Now, how to visualize a zero-width character ? If you just hit the Find Next button, you see that a specific line is reached, but you do not know the exact location of the zero-width char(s) :-((

          Two solutions are possible :

          • (?-s).((?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD)+.

          which matches two standard characters separated by one or several consecutive format character(s)

          • ((?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD)+

          which marks all these format chars when you click on the Mark All button ( the best solution, to my mind ! )
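          If the text is processed outside the editor, a comparable way to "mark" these characters is to replace each one with a visible tag. The sketch below is not a Notepad++ feature, just a hedged Python illustration; the `reveal` function and the `<U+XXXX>` tag format are my own choices, and the character class is a hand-written equivalent of the ranges listed above.

```python
import re

# Hand-written equivalent of the format-character ranges from the table.
FORMAT_CHARS = re.compile(
    "[\u00ad\u200b-\u200f\u202a-\u202e\u2060-\u206f\ufeff\ufff9-\ufffb]"
)


def reveal(text):
    """Replace each format character with a visible <U+XXXX> tag."""
    return FORMAT_CHARS.sub(lambda m: f"<U+{ord(m.group()):04X}>", text)
```

          For example, `reveal("tel\u200b\u200bl")` gives `tel<U+200B><U+200B>l`, which makes both the position and the number of hidden characters obvious.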


          So, trying the simple regex \x{200B} against the sentence below, with the Mark option, will convince you that this sentence does contain some Zero Width Space characters !

          F​or exam​ple, I’ve ins​erted 10 ze​ro-width spa​ces in​to thi​s sentence, c​an you tel​​l me where ?
          

          Note that, between the two letters l of the verb tell, there are two consecutive \x{200B} chars !
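          The same check can be scripted. In this Python sketch the sentence is retyped with explicit `\u200b` escapes so that the hidden characters are visible in the source; the placement of each escape is my reading of the sentence above, and the helper names `zwsp_offsets` and `consecutive_runs` are hypothetical.

```python
ZWSP = "\u200b"

# The sample sentence, retyped with explicit escapes (positions are assumed).
sample = (
    "F\u200bor exam\u200bple, I've ins\u200berted 10 ze\u200bro-width "
    "spa\u200bces in\u200bto thi\u200bs sentence, c\u200ban you "
    "tel\u200b\u200bl me where ?"
)


def zwsp_offsets(text):
    """Return the offset of every ZERO WIDTH SPACE in text."""
    return [i for i, ch in enumerate(text) if ch == ZWSP]


def consecutive_runs(offsets):
    """Group adjacent offsets into (start, length) runs."""
    runs = []
    for off in offsets:
        if runs and runs[-1][0] + runs[-1][1] == off:
            # This offset directly follows the previous run: extend it.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((off, 1))
    return runs
```

          With this retyped sentence, `len(zwsp_offsets(sample))` is 10, and `consecutive_runs` reports a single run of length 2, located right after the letters `tel`.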


          You can find a description of these format characters at the following links :

          http://www.unicode.org/charts/PDF/U2000.pdf

          http://www.unicode.org/charts/PDF/UFE70.pdf

          http://www.unicode.org/charts/PDF/UFFF0.pdf

          Refer also to this post :

          https://notepad-plus-plus.org/community/topic/14812/how-to-search-for-unknown-3-digit-characters-with-black-background/2

          Best Regards,

          guy038

          P.S. :

          Simply copy/paste the list and the sentence above ( the parts shown in inverse video ) into a new tab and enjoy !
