How do I recognize slightly different coding characters?
-
Hello, I believed the information shown in Notepad++ was the ‘purest’, but now I find that a letter does not always equal a letter. For example:
shinobi_
shinobi_
1st row has ascii 115 226 128 139 104 105 110 111 98 105 95
2nd row has ascii 115 104 105 110 111 98 105 95
1st row has utf-8 \x73\xe2\x80\x8b\x68\x69\x6e\x6f\x62\x69\x5f
2nd row has utf-8 \x73\x68\x69\x6e\x6f\x62\x69\x5f
I’d say the 1st row contains a hidden, slightly different character, and I want to know the proper name for this type of character. Is there a quick way to detect them? My concern is that others could use hidden content like this to fool me.
In Notepad++ I can see the length and Pos values change, but for a long document it is impossible to spot them that way.
Thank you
-
The \xe2\x80\x8b byte sequence is the UTF-8 encoding of the Zero Width Space code point (U+200B).
One way to detect such characters would be to use a different symbol to represent these code points, such as described here.
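Outside Notepad++, the same "show it with a different symbol" idea can be sketched in a few lines of Python. This is only an illustration, not a Notepad++ feature: the function name reveal_invisible is made up, and it assumes that the characters of interest are the ones in the Unicode "Cf" (format) category, which is where the Zero Width Space lives.

```python
import unicodedata


def reveal_invisible(text):
    """Replace each format-category (Cf) character with a visible tag.

    The tag carries the code point and its official Unicode name, so a
    hidden character shows up as something like <U+200B ZERO WIDTH SPACE>.
    """
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            name = unicodedata.name(ch, "UNNAMED")
            out.append(f"<U+{ord(ch):04X} {name}>")
        else:
            out.append(ch)
    return "".join(out)


# The two rows from the question: the first hides a U+200B after the "s".
row1 = "s\u200bhinobi_"
row2 = "shinobi_"
print(reveal_invisible(row1))  # s<U+200B ZERO WIDTH SPACE>hinobi_
print(reveal_invisible(row2))  # shinobi_
```

The Cf category also covers other invisible code points such as the zero width joiner and the soft hyphen, so the check is broader than U+200B alone.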
-
Hello, @mahengrui1, @ekopalypse and All,
In addition to the valuable link provided by @ekopalypse, you may have a look at this other one :
https://community.notepad-plus-plus.org/post/31761
You may also use this on-line tool, below :
https://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?
which gives you the main representations of a Unicode character and its exact name !
Best Regards,
guy038
-
Thank you very much. Besides, a method occurred to me while I was reading a reply here (I’m sorry, I don’t know how to quote a specific reply):
If Notepad++\Encoding\Convert to ANSI it shows all Zero-Width characters right? And if I Convert to UTF-8 back they all become normal ‘?’
-
@mahengrui1 said in How do I recognize slightly different coding characters?:
If Notepad++\Encoding\Convert to ANSI it shows all Zero-Width characters right?
Well, it will show unconvertible characters (because an “ANSI” encoding has no concept of most Unicode characters) as a ? character.
And if I Convert to UTF-8 back they all become normal ‘?’
Ah, so this leads me to believe you understand what is happening when the ? appear… And yes, converting back to UTF-8 will just keep the ? as a ?. I suppose your strategy is to then search for all ? and decide what to do for each one?
-
@Alan-Kilborn yes, it is my current strategy
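For anyone who wants to see what that round trip does, here is a rough Python imitation. It is a sketch under assumptions: "ANSI" is taken to mean Windows-1252 (cp1252), which depends on the system, and Python's errors="replace" is used to stand in for the conversion Notepad++ performs. The helper name ansi_round_trip is invented for the example.

```python
def ansi_round_trip(text, codepage="cp1252"):
    """Encode to a single-byte code page and decode back.

    Characters the code page cannot represent are replaced by '?',
    one '?' per lost character.
    """
    ansi_bytes = text.encode(codepage, errors="replace")
    return ansi_bytes.decode(codepage)


original = "s\u200bhinobi_ ok?"
converted = ansi_round_trip(original)
print(converted)  # s?hinobi_ ok?

# Comparing both versions tells a substituted '?' from a genuine one.
lost = [i for i, (a, b) in enumerate(zip(original, converted)) if a != b]
print(lost)  # [1]
```

Note the weak point of searching for ? afterwards: a question mark that was already in the text looks the same as a substituted one, which is why the sketch compares against the original instead.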
-
Hi, @mahengrui1, @ekopalypse, @alan-kilborn and All,
Well, an ANSI encoded file, which codes any character with 1 byte only, will never be able to code the ZERO-WIDTH SPACE character, of code-point \x{200B}, which is coded :
- In a UTF-8 or UTF-8-BOM encoded file, with the 3 consecutive bytes \xE2, \x80 and \x8B
- In a UCS-2 BE BOM encoded file, with the 2 consecutive bytes \x20 and \x0B
- In a UCS-2 LE BOM encoded file, with the 2 consecutive bytes \x0B and \x20
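These byte sequences can be checked with a short Python sketch. It assumes that Python's utf-16-be and utf-16-le codecs are acceptable stand-ins for UCS-2 BE and UCS-2 LE, which holds for a character such as U+200B that fits in a single 16-bit unit; the BOM itself is not part of the output.

```python
zwsp = "\u200b"  # ZERO WIDTH SPACE

# Map each Notepad++ encoding label to the matching Python codec.
codecs = {
    "UTF-8": "utf-8",
    "UCS-2 BE": "utf-16-be",
    "UCS-2 LE": "utf-16-le",
}

# Hex dump of the bytes produced for the single character U+200B.
encoded = {label: zwsp.encode(codec).hex(" ") for label, codec in codecs.items()}

for label, hex_bytes in encoded.items():
    print(f"{label}: {hex_bytes}")
# UTF-8: e2 80 8b
# UCS-2 BE: 20 0b
# UCS-2 LE: 0b 20
```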
However, IF, within an ANSI file :
- You insert, on purpose, these 3 bytes, shown above, at any location
- Select the Notepad++ option Encoding > UTF-8 ( NOT the option Encoding > Convert to UTF-8 ! )
- RE-save this file
You’ll create a UTF-8 file, containing one ( or several ) Zero-Width Space character(s)
Method to use :
- Open your ANSI encoded file
- Select the Edit > Character Panel option
- Move your caret to the location where you want to insert a Zero-Width Space character
- Click, successively, on the 3 characters, in the Character column of this panel, which are in the same rows as the Hex values E2, 80 and 8B, so the string â€‹
- Repeat the previous operation at any location where this special char is needed
- Select the Encoding > UTF-8 option
- Save your new UTF-8 file
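The steps above amount to writing three ANSI characters and then re-reading the same bytes as UTF-8. A minimal Python sketch of that reinterpretation follows, assuming the ANSI code page is Windows-1252 (cp1252), where the bytes E2, 80 and 8B display as â, € and ‹. No Notepad++ is involved; the variable names are only for illustration.

```python
# The three characters picked from the Character Panel, between "s" and "hinobi_":
# U+00E2 (â), U+20AC (€) and U+2039 (‹).
ansi_text = "s\u00e2\u20ac\u2039hinobi_"

# Saving as ANSI (assumed cp1252) stores one byte per character.
raw = ansi_text.encode("cp1252")
print(raw)  # b's\xe2\x80\x8bhinobi_'

# "Encoding > UTF-8" reinterprets the same bytes instead of converting them.
as_utf8 = raw.decode("utf-8")
print(len(ansi_text), len(as_utf8))  # 11 9
print(as_utf8 == "s\u200bhinobi_")   # True
```

The byte content never changes; only the interpretation does, which is the difference between Encoding > UTF-8 and Encoding > Convert to UTF-8.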
Now, to find the location of all the Zero-Width Space characters, simply use the Mark feature and search for \x{200B}, with the Regular expression mode checked
Best Regards,
guy038
PS :
Refer to this page :
-
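For a long document handled outside Notepad++, the Mark search for \x{200B} mentioned above can be approximated with a regular expression in Python. This is a sketch only: find_zwsp is a hypothetical helper, and Python writes the code point as \u200b rather than \x{200B}.

```python
import re

# Python spells the code point \u200b; Notepad++ regex mode uses \x{200B}.
ZWSP = re.compile("\u200b")


def find_zwsp(text):
    """Return (line, column) pairs, both 1-based, for every U+200B found."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in ZWSP.finditer(line):
            hits.append((lineno, match.start() + 1))
    return hits


document = "shinobi_\ns\u200bhinobi_\nab\u200b\u200b"
print(find_zwsp(document))  # [(2, 2), (3, 3), (3, 4)]
```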
-
@guy038 It looks good; however, I don’t have an ANSI file, so I can’t practice it.