Reply to Visualization for zero-width characters on Mon, 01 Aug 2022 11:01:29 GMT

guy038 — Mon, 01 Aug 2022 11:01:29 GMT

Hello, @petyo-vodenicharov, @peterjones and All,

In addition to the two valuable posts from Peter, above, here is my contribution on these strange characters ;-))

Below is a list of all the Unicode characters with the General Category property = Cf ( Format Character ) which both have a code value below FFFF and do NOT strictly depend on a specific language !

        •--------•--------•-------------------------------------------•------•---------•
        |  Code  |  Abbr. |              Complete Name                | Cat. |  >Car<  |
        •--------•--------•-------------------------------------------•------•---------•
        |  00AD  |  SHY   |  SOFT HYPHEN                              |  Cf  |    ><  |
        •--------•--------•-------------------------------------------•------•---------•
        |  200B  |  ZWSP  |  ZERO WIDTH SPACE                         |  Cf  |    ><   |
        |  200C  |  ZWNJ  |  ZERO WIDTH NON-JOINER                    |  Cf  |    >‌<   |
        |  200D  |  ZWJ   |  ZERO WIDTH JOINER                        |  Cf  |    >‍<   |
        |  200E  |  LRM   |  LEFT-TO-RIGHT MARK                       |  Cf  |    >‎<   |
        |  200F  |  RLM   |  RIGHT-TO-LEFT MARK                       |  Cf  |    >‏<   |
        •--------•--------•-------------------------------------------•------•---------•
        |  202A  |  LRE   |  LEFT-TO-RIGHT EMBEDDING                  |  Cf  |    >‪<   |
        |  202B  |  RLE   |  RIGHT-TO-LEFT EMBEDDING                  |  Cf  |    >‫<   |
        |  202C  |  PDF   |  POP DIRECTIONAL FORMATTING               |  Cf  |    >‬<   |
        |  202D  |  LRO   |  LEFT-TO-RIGHT OVERRIDE                   |  Cf  |    >‭<   |
        |  202E  |  RLO   |  RIGHT-TO-LEFT OVERRIDE                   |  Cf  |    >‮<   |
        •--------•--------•-------------------------------------------•------•---------•
        |  2060  |  WJ    |  WORD JOINER                              |  Cf  |    >⁠<  |
        |  2061  |  ƒ()   |  FUNCTION APPLICATION                     |  Cf  |    >⁡<  |
        |  2062  |  ×     |  INVISIBLE TIMES                          |  Cf  |    >⁢<  |
        |  2063  |  ,     |  INVISIBLE SEPARATOR                      |  Cf  |    >⁣<  |
        |  2064  |  +     |  INVISIBLE PLUS                           |  Cf  |    >⁤<  |
        |  2066  |  LRI   |  LEFT-TO-RIGHT ISOLATE                    |  Cf  |    >⁦<  |
        |  2067  |  RLI   |  RIGHT-TO-LEFT ISOLATE                    |  Cf  |    >⁧<  |
        |  2068  |  FSI   |  FIRST STRONG ISOLATE                     |  Cf  |    >⁨<  |
        |  2069  |  PDI   |  POP DIRECTIONAL ISOLATE                  |  Cf  |    >⁩<  |
        |  206A  |  ISS   |  INHIBIT SYMMETRIC SWAPPING               |  Cf  |    >⁪<   |
        |  206B  |  ASS   |  ACTIVATE SYMMETRIC SWAPPING              |  Cf  |    >⁫<   |
        |  206C  |  IAFS  |  INHIBIT ARABIC FORM SHAPING              |  Cf  |    >⁬<   |
        |  206D  |  AAFS  |  ACTIVATE ARABIC FORM SHAPING             |  Cf  |    >⁭<   |
        |  206E  |  NADS  |  NATIONAL DIGIT SHAPES                    |  Cf  |    >⁮<   |
        |  206F  |  NODS  |  NOMINAL DIGIT SHAPES                     |  Cf  |    >⁯<   |
        •--------•--------•-------------------------------------------•------•---------•
        |  FEFF  | ZWNBSP |  ZERO WIDTH NO-BREAK SPACE                |  Cf  |    ><   |
        •--------•--------•-------------------------------------------•------•---------•
        |  FFF9  |  IAA   |  INTERLINEAR ANNOTATION ANCHOR            |  Cf  |    >￹<   |
        |  FFFA  |  IAS   |  INTERLINEAR ANNOTATION SEPARATOR         |  Cf  |    >￺<   |
        |  FFFB  |  IAT   |  INTERLINEAR ANNOTATION TERMINATOR        |  Cf  |    >￻<   |
        •--------•--------•-------------------------------------------•------•---------•
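If you want to double-check the table outside of N++, here is a minimal Python sketch, assuming only the standard `unicodedata` module. The list `TABLE_CODE_POINTS` and the helper `describe` are names of my own, for illustration, and the code points are simply copied from the table above. It prints the code, the General Category and the official name of each character.

```python
import unicodedata

# Code points copied from the table above (hex values as listed there).
TABLE_CODE_POINTS = (
    [0x00AD]                        # SHY
    + list(range(0x200B, 0x2010))   # ZWSP .. RLM
    + list(range(0x202A, 0x202F))   # LRE .. RLO
    + list(range(0x2060, 0x2065))   # WJ .. INVISIBLE PLUS
    + list(range(0x2066, 0x2070))   # LRI .. NODS (2065 is skipped)
    + [0xFEFF]                      # ZWNBSP
    + list(range(0xFFF9, 0xFFFC))   # IAA .. IAT
)

def describe(cp):
    """Return (hex code, Unicode name, General Category) for a code point."""
    ch = chr(cp)
    return (f"{cp:04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))

rows = [describe(cp) for cp in TABLE_CODE_POINTS]
for code, name, cat in rows:
    print(f"{code}  {cat}  {name}")
```

Every printed row should show the category Cf, which is the property the table is built on.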

Now, depending on the current font used in N++, the glyph of these characters may :

Be invisible ( A true Zero Width character )
Display a square or a thin rectangular box ( Character not handled by current font )
Display a specific character ( case of the Soft Hyphen )

Of course, with the simple regular expression \x{####}, you can match the character of Unicode value = ####. But it would be better to find a regex that matches any of these format characters !

I noticed that the POSIX character class [[:cntrl:]] matches most of these characters :

The 4 characters, from \x{200C} to \x{200F}
The 5 characters, from \x{202A} to \x{202E}
The 6 characters, from \x{206A} to \x{206F}
The character \x{FEFF}
The 3 characters, from \x{FFF9} to \x{FFFB}

Unfortunately, the [[:cntrl:]] regex also matches the Control characters :

The 32 C0 characters, from \x{0000} to \x{001F}
The 32 C1 characters, from \x{0080} to \x{009F}

Moreover, the [[:cntrl:]] regex misses some characters :

The Soft Hyphen \x{00AD}
The Zero Width Space \x{200B}
The 9 characters, from \x{2060} to \x{2069} ( \x{2065} is unassigned )

So, a correct regex to match all the format characters above, in a Unicode encoded file, could be :

(?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD
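
For those who need the same search outside of N++, here is a rough Python equivalent. It is only a sketch under assumptions: Python's built-in `re` module has no [[:cntrl:]] or [[:unicode:]] class, so the ranges from the table are spelled out explicitly, and the names `FORMAT_CHARS` and `find_format_chars` are hypothetical, chosen for this example.

```python
import re

# Explicit ranges standing in for the N++ character class; the C0 and C1
# control characters are deliberately left out.
FORMAT_CHARS = re.compile(
    "[\u00AD\u200B-\u200F\u202A-\u202E\u2060-\u2064\u2066-\u206F\uFEFF\uFFF9-\uFFFB]"
)

def find_format_chars(text):
    """Return (offset, 'U+XXXX') for every format character found in text."""
    return [(m.start(), f"U+{ord(m.group()):04X}") for m in FORMAT_CHARS.finditer(text)]

sample = "soft\u00ADhyphen and zero\u200Bwidth"
print(find_format_chars(sample))
# → [(4, 'U+00AD'), (20, 'U+200B')]
```

Listing the ranges by hand is more verbose than the N++ regex, but it makes it obvious which characters are covered.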

Now, how to visualize a zero-width character ? If you just hit the Find Next button, you see that a specific line is reached, but you do not know the exact location of the zero-width char(s) :-((

Two solutions are possible :

(?-s).((?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD)+.

Which matches two standard characters separated by one or several consecutive format character(s)

((?=[[:unicode:]])[[:cntrl:]\x{200B}\x{2060}-\x{2069}]|\xAD)+

Which marks all these format chars when you click on the Mark All button ( the best solution, to my mind ! )

So, trying the simple regex \x{200B} against the sentence below, with the Mark option, will convince you that this sentence does contain some Zero Width Space characters !

For example, I’ve inserted 10 zero-width spaces into this sentence, can you tell me where ?

Note that, between the two letters l of the verb tell, there are two consecutive chars \x{200B} !
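
Another way to see where such characters hide, outside of N++, is to replace each of them with a visible tag. Here is a minimal Python sketch of that idea. The names `FORMAT_CHAR` and `reveal` are my own, for illustration, and the character class is an explicit list built from the table, not the N++ regex itself.

```python
import re

# One format character from the table, written as explicit ranges.
FORMAT_CHAR = re.compile(
    "[\u00AD\u200B-\u200F\u202A-\u202E\u2060-\u2064\u2066-\u206F\uFEFF\uFFF9-\uFFFB]"
)

def reveal(text):
    """Replace each format character with a visible <U+XXXX> tag."""
    return FORMAT_CHAR.sub(lambda m: f"<U+{ord(m.group()):04X}>", text)

print(reveal("tel\u200B\u200Bl me"))
# → tel<U+200B><U+200B>l me
```

Passing the test sentence through `reveal` would show each hidden \x{200B} at its exact position.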

You can see a description of these format characters at the following links :

http://www.unicode.org/charts/PDF/U2000.pdf

http://www.unicode.org/charts/PDF/UFE70.pdf

http://www.unicode.org/charts/PDF/UFFF0.pdf

Refer, also, to this post :

https://notepad-plus-plus.org/community/topic/14812/how-to-search-for-unknown-3-digit-characters-with-black-background/2

Best Regards,

guy038

P.S. :

Simply copy/paste the list and the sentence above, shown in inverse video, into a new tab and enjoy !

Reply to Visualization for zero-width characters on Thu, 19 Apr 2018 13:17:25 GMT

PeterJones — Thu, 19 Apr 2018 13:17:25 GMT

See my previous post on the subject of zero-width characters, and an earlier post where I even shared a pair of PythonScript scripts to give a fuller Show All Characters and Don't Show All Characters functionality