How do I recognize slightly different coding characters?

mahengrui1

Hello, I believed that what Notepad++ shows is the ‘purest’ form of the text, but now I find that two strings that look identical are not necessarily equal. For example:

shinobi_
shinobi_

1st row has bytes (decimal) 115 226 128 139 104 105 110 111 98 105 95
2nd row has bytes (decimal) 115 104 105 110 111 98 105 95
1st row has UTF-8 bytes \x73\xe2\x80\x8b\x68\x69\x6e\x6f\x62\x69\x5f
2nd row has UTF-8 bytes \x73\x68\x69\x6e\x6f\x62\x69\x5f

I’d say the 1st row contains a hidden character, and I would like to know the proper name for this type of character. Is there a quick way to detect them? My concern is that others could use such hidden content to fool me.

In Notepad++ I can see that the length and Pos values change, but for a long document it is impractical to find the hidden characters that way.
Thank you

Ekopalypse

@mahengrui1

The byte sequence \xe2\x80\x8b is the UTF-8 encoding of the Zero Width Space code point (U+200B).
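
To double-check this outside Notepad++, here is a minimal Python 3 sketch. It decodes the two byte sequences quoted in the question and asks the standard unicodedata module for the name of whatever is in the first row but not in the second. The variable names are only illustrative.

```python
import unicodedata

# The two byte sequences quoted in the question.
row1 = b"\x73\xe2\x80\x8b\x68\x69\x6e\x6f\x62\x69\x5f"
row2 = b"\x73\x68\x69\x6e\x6f\x62\x69\x5f"

text1 = row1.decode("utf-8")
text2 = row2.decode("utf-8")

# Characters that occur in the first row but nowhere in the second.
extra = [ch for ch in text1 if ch not in text2]

for ch in extra:
    # Print the code point and its official Unicode name.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}")
```

For these two rows the only extra character is U+200B, which unicodedata reports as ZERO WIDTH SPACE.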

One way to detect such characters is to have them displayed with a distinct visible symbol, as described here.
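
The same idea can be sketched outside the editor. The snippet below is only an illustration, not the approach from the linked post: the function name reveal_invisible is made up, and it assumes that the Unicode general category Cf ("format") is a good enough test for characters like the zero width space. That category will not catch every look-alike, for example a no-break space or a letter from another alphabet.

```python
import unicodedata


def reveal_invisible(text: str) -> str:
    """Replace each Unicode format character (category Cf) with a visible <U+XXXX> tag."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            # Invisible format character: show its code point instead.
            out.append(f"<U+{ord(ch):04X}>")
        else:
            out.append(ch)
    return "".join(out)


print(reveal_invisible("s\u200bhinobi_"))  # s<U+200B>hinobi_
```

Running the function over a whole file and searching the result for "<U+" would then point at every hidden character of this kind.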

guy038

Hello, @mahengrui1, @ekopalypse and All,

In addition to the valuable link provided by @ekopalypse, you may have a look at this other one:

https://community.notepad-plus-plus.org/post/31761

You may also use the on-line tool below:

https://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?

which gives you the main representations of a Unicode character and its exact name!

Best Regards,

guy038

mahengrui1

@Ekopalypse @guy038

Thank you very much. Besides, a method occurred to me while I was reading a reply here (I’m sorry, I don’t know how to quote a specific reply).

If I use Encoding > Convert to ANSI in Notepad++, it shows all Zero-Width characters, right? And if I then convert back to UTF-8, they all become normal ‘?’ characters.

Alan Kilborn

@mahengrui1 said in How do I recognize slightly different coding characters?:

If I use Encoding > Convert to ANSI in Notepad++, it shows all Zero-Width characters, right?

Well, it will show unconvertible characters (because an “ANSI” encoding has no concept of most Unicode characters) as a ? character.

And if I then convert back to UTF-8, they all become normal ‘?’ characters.

Ah, so this leads me to believe you understand what is happening when the ? appears…

And yes, converting back to UTF-8 will just keep the ? as a ?. I suppose your strategy is to then search for all ? and decide what to do for each one?
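
The round trip can be imitated in a few lines of Python. This is only a sketch of the effect, under the assumption that "ANSI" means Windows-1252 (the actual code page depends on the system), and it says nothing about how Notepad++ implements the conversion internally.

```python
original = "s\u200bhinobi_"

# Assumption: Windows-1252 stands in for "ANSI".
# errors="replace" substitutes "?" for anything the code page cannot represent.
ansi_bytes = original.encode("cp1252", errors="replace")

# Going back to Unicode text cannot restore the lost character.
round_tripped = ansi_bytes.decode("cp1252")

print(ansi_bytes)       # b's?hinobi_'
print(round_tripped)    # s?hinobi_
```

Note that after the round trip a ? that replaced a hidden character looks exactly like a ? that was already in the text, so each hit still has to be judged by eye.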

mahengrui1

@Alan-Kilborn yes, it is my current strategy

guy038

Hi, @mahengrui1, @ekopalypse, @alan-kilborn and All,

Well, an ANSI-encoded file, which encodes every character with 1 byte only, will never be able to represent the ZERO WIDTH SPACE character, of code point \x{200B}, which is encoded:

In a UTF-8 or UTF-8-BOM encoded file, as the 3 consecutive bytes \xE2, \x80 and \x8B
In a UCS-2 BE BOM encoded file, as the 2 consecutive bytes \x20 and \x0B
In a UCS-2 LE BOM encoded file, as the 2 consecutive bytes \x0B and \x20
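
These byte values can be confirmed with a short Python sketch. Two assumptions: Python's utf-16-be and utf-16-le codecs are used in place of UCS-2 (the two agree for a code point like U+200B, and these codecs do not write a BOM), and Windows-1252 stands in for "ANSI". The helper hex_bytes is just an illustrative name.

```python
zwsp = "\u200b"  # ZERO WIDTH SPACE


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


for codec in ("utf-8", "utf-16-be", "utf-16-le"):
    print(f"{codec:10} {hex_bytes(zwsp.encode(codec))}")

# Assumption: cp1252 stands in for "ANSI"; it has no mapping for U+200B.
try:
    zwsp.encode("cp1252")
    ansi_ok = True
except UnicodeEncodeError:
    ansi_ok = False
print("encodable in cp1252:", ansi_ok)
```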

However, if, within an ANSI file, you:

Insert, on purpose, the 3 bytes shown above at any location
Select the Notepad++ option Encoding > UTF-8 (NOT the option Encoding > Convert to UTF-8!)
Re-save the file

you will create a UTF-8 file containing one (or several) Zero-Width Space character(s).

Method to use :

Open your ANSI encoded file
Select the Edit > Character Panel option
Move your caret to the location where you want to insert a Zero-Width Space character
Click, successively, on the 3 characters in the Character column of this panel that are in the same rows as the Hex values E2, 80 and 8B, which gives the string â€‹ (that is, â, € and ‹)
Repeat the previous operation at any location where this special char is needed
Select the Encoding > UTF-8 option
Save your new UTF-8 file
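
Why this works can be shown with a small Python sketch, again assuming that the ANSI code page is Windows-1252. The three characters are stored as the bytes E2 80 8B, and reading those same bytes as UTF-8 yields a single U+200B.

```python
# The three characters picked from the Character Panel, as Windows-1252 text.
ansi_text = "\u00e2\u20ac\u2039"  # â, € and ‹

# Bytes as they would be stored in the ANSI file (assumption: cp1252).
raw = ansi_text.encode("cp1252")

# The same bytes, reinterpreted as UTF-8 without any conversion.
reinterpreted = raw.decode("utf-8")

print(raw.hex())                    # e2808b
print(f"U+{ord(reinterpreted):04X}")  # U+200B
```

This mirrors the difference between the two menu entries: Encoding > UTF-8 reinterprets the existing bytes, whereas Convert to UTF-8 would re-encode the three visible characters and keep them as they are.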

Now, to find the location of all the Zero-Width Space characters, simply use the Mark feature and search for \x{200B}, with the Regular expression search mode checked.
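
For text that is processed outside Notepad++, a comparable search can be sketched in Python. Note that Python's re module does not accept the \x{200B} syntax, so the escape \u200b is used instead. The function name find_zero_width_spaces is purely illustrative.

```python
import re


def find_zero_width_spaces(text: str):
    """Return (line, column) pairs, both 1-based, for every U+200B in text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in re.finditer("\u200b", line):
            # match.start() is 0-based, so add 1 for a 1-based column.
            hits.append((line_no, match.start() + 1))
    return hits


sample = "shinobi_\ns\u200bhinobi_\n"
print(find_zero_width_spaces(sample))  # [(2, 2)]
```

Columns are counted in characters, not bytes, so they will not match the Pos value of a UTF-8 file exactly.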

Best Regards,

guy038

PS :

Refer to this page :

https://www.unicode.org/charts/PDF/U2000.pdf

mahengrui1

@guy038 It looks good; however, I don’t have an ANSI file, so I can’t try it out.