Invisible characters unwanted
-
Where does it come from?
Could it be that you played with different encodings? Copied and pasted from different sources?
Cheers
Claudia -
I copied the text from an ancient website, but from the browser and not from the page source itself.
-
A hex dump of the downloaded file is shown below. There appear to be three bytes, E2 80 8B, between the double quote and the ‘F’. I do not know what they represent or why Notepad++ does not show anything for them.

000000 3C646976 20636C61 |<div cla|
000008 73733D22 6D6F6461 |ss="moda|
000010 6C206661 64652070 |l fade p|
000018 726F6475 63745F76 |roduct_v|
000020 69657722 2069643D |iew" id=|
000028 22E2808B 46452D30 |"...FE-0|
000030 30302D32 223E     |00-2">  |
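As a cross-check on the dump above, here is a minimal Python sketch that reports every run of non-ASCII bytes in a byte string together with its offset. The sample bytes are retyped from the dump, and the function name `find_non_ascii_runs` is only illustrative.

```python
def find_non_ascii_runs(data):
    """Return (offset, hex string) for each run of bytes >= 0x80 in data."""
    runs = []
    start = None
    for i, b in enumerate(data):
        if b >= 0x80:
            if start is None:
                start = i          # a new non-ASCII run begins here
        elif start is not None:
            runs.append((start, data[start:i].hex().upper()))
            start = None
    if start is not None:          # the run extends to the end of the data
        runs.append((start, data[start:].hex().upper()))
    return runs


# Bytes retyped from the hex dump above.
sample = bytes.fromhex(
    "3C64697620636C61" "73733D226D6F6461" "6C20666164652070"
    "726F647563745F76" "696577222069643D" "22E2808B46452D30"
    "30302D32223E"
)

for offset, hex_bytes in find_non_ascii_runs(sample):
    print("offset 0x%06X: %s" % (offset, hex_bytes))
```

On a real file, the same function could be fed the result of `open(path, "rb").read()`.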
-
I originally assumed it was Unicode encoded in UTF-8. In UTF-8, bytes starting with a most-significant-bit of 1 (0x80 - 0xFF) are part of an encoded Unicode codepoint beyond U+007F, and when the first nibble in the encoded codepoint (the first hex character) is an E, it indicates that the codepoint is spread across three bytes:

1110_xxxx 10yy_yyyy 10zz_zzzz   the three bytes of the codepoint, showing the "prefixes" and the arbitrary bits x, y, z
E0        82        8B          the three values from your document
1110_0000 1000_0010 1000_1011   ... converted to binary
     0000   00 0010   00 1011   get rid of the prefixes
0000_00_0010_00_1011            compress the space
0000_0000_1000_1011             regroup to nibbles
U+008B                          convert to hex unicode
U+008B is a control character.
Oddly, U+008B should have been represented with only two bytes, not three: C2 8B, so I’m not sure why your browser gave you those three bytes when you copied, unless the browser wasn’t really presenting UTF-8. Windows-1252 (Latin-1) would be à‚‹. In CP850/OEM850 (“Multilingual”, which also calls itself Latin I in some documents), it is Óéï. In DOS CodePage 437 (which I cannot find in Notepad++ Encoding list, but it’s the one that had the old box-drawing characters), it would have been αéï. None of those strings make sense as being likely; maybe the browser misinterpreted or misencoded something when you copied from the browser, or over the years, whatever encoding was originally there had been corrupted into those three bytes.

I tried an experiment, and took your example file, and appended the hex characters 2020C28B2020 (two spaces, the proper UTF-8 encoding of U+008B, and two more spaces): Notepad++ doesn’t display the E0828B as anything, but does display the C28B as [PLD]. So it’s looking like Notepad++ just doesn’t display anything for an invalid UTF-8 sequence, but does display something for UTF-8 control-codes.
-
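The claims about E0 82 8B in the post above can be checked with a short Python sketch. It relies only on the standard codecs; the variable names are illustrative. The code points are printed rather than the characters themselves, so the output does not depend on the console encoding.

```python
overlong = bytes.fromhex("E0828B")   # the sequence discussed above
proper = bytes.fromhex("C28B")       # shortest-form UTF-8 for U+008B

# A strict UTF-8 decoder rejects the three-byte form, because U+008B
# fits in two bytes (an "overlong" encoding).
try:
    overlong.decode("utf-8")
    overlong_is_valid = True
except UnicodeDecodeError:
    overlong_is_valid = False
print("E0 82 8B valid UTF-8?", overlong_is_valid)

# The two-byte form decodes to the control character U+008B.
print("C2 8B -> U+%04X" % ord(proper.decode("utf-8")))

# The same three bytes read through the legacy code pages named above.
readings = {codec: overlong.decode(codec) for codec in ("cp1252", "cp850", "cp437")}
for codec, text in readings.items():
    print(codec, "->", " ".join("U+%04X" % ord(ch) for ch in text))
```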
Peter, great explanation of how to convert, but I think you transcribed the hex values wrongly. You converted E0 82 8B but I believe the values in the file are E2 80 8B. When I convert these using your method I get U+400B and it is described as “Unicode Han Character ‘(same as U+9E7D 鹽) salt’ (U+400B)”.
-
Oops. Actually, we were both wrong. It was E2 80 8B, but that decodes into…

1110_xxxx 10yy_yyyy 10zz_zzzz
E2        80        8B
1110 0010 1000 0000 1000 1011
     0010   00 0000   00 1011
0010_00_0000_00_1011
0010_0000_0000_1011
U+200B

U+200B is the Zero Width Space. And suddenly the lack of a glyph (or, rather, not seeing anything) in Notepad++ makes perfect sense! Of course you won’t see a zero-width space. ☺
Thanks.
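The bit manipulation above can be written out as a minimal Python sketch. The function name `decode_three_byte_utf8` is made up for illustration, and the sketch handles only the three-byte form (it does not check for surrogates); Python's own decoder is the reference to compare against.

```python
import unicodedata


def decode_three_byte_utf8(b1, b2, b3):
    """Decode one three-byte UTF-8 sequence by hand and return the code point."""
    if b1 >> 4 != 0b1110:
        raise ValueError("first byte lacks the 1110 prefix")
    if b2 >> 6 != 0b10 or b3 >> 6 != 0b10:
        raise ValueError("continuation byte lacks the 10 prefix")
    # Strip the prefixes and concatenate the 4 + 6 + 6 payload bits.
    code_point = ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)
    if code_point < 0x0800:
        raise ValueError("overlong encoding: fits in fewer than three bytes")
    return code_point


cp = decode_three_byte_utf8(0xE2, 0x80, 0x8B)
print("U+%04X %s" % (cp, unicodedata.name(chr(cp))))
```

The same function raises `ValueError` for E0 82 8B, which is consistent with that sequence being treated as invalid UTF-8.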
-
Hello, @benoît-lechat, @adrianHHH, @peterjones, and All,
For UTF-8 files, I very often use this simple and useful tool: http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?
For instance, given the line of Benoît’s example, from the link:
https://drive.google.com/file/d/0B-tMAt7OX-3OSFNjTFJVNkk3VEU/view
- Select all that line and paste it in a new N++ tab
- Now, place the cursor right before the opening double-quote of the string “FE-000-2”
- Hit the Right Arrow key, to be at the location right after the " character
- Then hit the Shift + Right Arrow shortcut to select this unknown character
- Copy it, with the Ctrl + C shortcut
- Open the following UTF-8 tool: http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?
- Paste it, with the Ctrl + V shortcut, in the box at the top of the page
- Choose the Interpret as character option
- Finally, click on the Go button
=> You’ll get the Zero Width space character :-))
Alternatively, you may:
- Enter the text E2 80 8B, with a space separator between bytes
- Choose the Interpret as Hex UTF-8 Bytes option
- Click on the Go button
You could also:
- Enter the text 200B, without any space character (hexadecimal Unicode code-point number)
- Choose the Interpret as Hex code-point option
- Click on the Go button
To end with:
- Enter the text 8203 (decimal Unicode code-point number)
- Choose the Interpret as Decimal code-point option
- Click on the Go button
Each time, you’ll get, again, the Zero Width space character :-))
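The four input forms used with the tool above (character, hex UTF-8 bytes, hex code-point, decimal code-point) are all views of the same value. A minimal Python sketch of the round trips, using only built-ins and illustrative variable names:

```python
zwsp = "\u200b"

# Character -> the three textual forms accepted by the tool.
utf8_hex = " ".join("%02X" % b for b in zwsp.encode("utf-8"))
hex_code_point = "%04X" % ord(zwsp)
decimal_code_point = str(ord(zwsp))

print(utf8_hex, hex_code_point, decimal_code_point)

# Each textual form converts back to the same character.
from_bytes = bytes.fromhex(utf8_hex).decode("utf-8")
from_hex = chr(int(hex_code_point, 16))
from_decimal = chr(int(decimal_code_point))
```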
Best Regards,
guy038
P.S. :
You can get additional and useful information on the zero-width space (ZWSP) character, from the link, below :
-
-
But shouldn’t we be able to see all the characters (including the zero-width space) if we use the menu option View > Show Symbol > Show All Characters?
This seems like a menu function that doesn’t work as described (ALL characters should include weird control characters and “noop” characters like the zero-width space).
-
I’m feeling the same: as long as the underlying font is able to represent it, it should be displayed, even when not using show all symbols.
Cheers
Claudia -
and it is shown
What am I missing here?
Cheers
Claudia -
I think what you’re missing is that your font doesn’t have a glyph for the character, so it shows the ? in a box.

My font, DejaVu Sans Mono, has a glyph for that character, which is a zero-width glyph, so you cannot see it (because it’s there, but zero-width). But I can highlight it (see the little green highlight on the first line, and the “Sel: 1|1” on the status bar).

@Nathan-Harvey, I think Notepad++ and my font are doing the right thing: there is a character (Zero-Width Space), and it is being shown, as zero-width. It’s not a control-character, so it doesn’t have a default CR LF-style box-glyph from show-all-characters.

However, using the PythonScript plugin (that Claudia’s screenshot implied), you can run
editor.setRepresentation(u'\u200B', "ZWS")
to get it to replace the normal zero-width space with ZWS in a black box (similar to the CR and LF boxes). To clear that alternate representation, run
editor.clearRepresentation(u'\u200B')
(There is similar notation for the NppExec plugin as well, but I do not know how to represent a unicode string in its syntax.)

By saving that to a script, and using the PythonScript Configuration menu to add that script to the Plugins > PythonScript menu, you can actually then assign a keyboard shortcut using Settings > Shortcut Mapper > Plugin Commands. If you make two scripts

# script = Show ZeroWidth Characters (give them a non-zero-width representation)
editor.setRepresentation(u'\u200B', "ZWS")
editor.setRepresentation(u'\u200C', "ZWNJ")
editor.setRepresentation(u'\u200D', "ZWJ")
editor.setRepresentation(u'\uFEFF', "ZWNBSP")

# script = Default ZeroWidth Characters (return them to their zero-width glyph from the selected font)
editor.clearRepresentation(u'\u200B')
editor.clearRepresentation(u'\u200C')
editor.clearRepresentation(u'\u200D')
editor.clearRepresentation(u'\uFEFF')

you can get all the “zero width” unicode characters that I can find to toggle visibility
-
Peter, thank you very much for your insight.
You could be right, and I’m already starting to think you are, about the glyph and the font I use.
I still feel it should be the other way around, as I don’t like having an invisible char in my code and wondering why it doesn’t do what it is supposed to do; but then, on the other hand, it doesn’t make sense to have a zero-width char. Hmmm.
I guess I’m good, as I can use my font or the setRepresentation function to see any “invisible” chars :-)
Your explanation makes sense - absolutely.
Thank you very much.
Claudia -
Also, if you want one command to do the normal Show All Characters plus showing these four Zero Width characters:

# script = Show All Characters (including ZeroWidth)
editor.setRepresentation(u'\u200B', "ZWS")
editor.setRepresentation(u'\u200C', "ZWNJ")
editor.setRepresentation(u'\u200D', "ZWJ")
editor.setRepresentation(u'\uFEFF', "ZWNBSP")
# if you want to _also_ show all characters with this script,
# first pick a different View > Show Symbols option,
# then pick this one (each is a toggle, so don't want to accidentally hide all characters if show-all was already selected)
notepad.menuCommand(MENUCOMMAND.VIEW_EOL)
notepad.menuCommand(MENUCOMMAND.VIEW_ALL_CHARACTERS)

And similarly to un-set Show All Characters as well as clearing the four Zero Width representations:

# script = Don't Show All Characters (including ZeroWidth)
editor.clearRepresentation(u'\u200B')
editor.clearRepresentation(u'\u200C')
editor.clearRepresentation(u'\u200D')
editor.clearRepresentation(u'\uFEFF')
# if you want to _also_ hide all characters with this script,
# first pick a different View > Show Symbols option,
# then pick this one twice (each is a toggle, so don't want to accidentally show all characters if show-all was already cleared)
notepad.menuCommand(MENUCOMMAND.VIEW_EOL)
notepad.menuCommand(MENUCOMMAND.VIEW_ALL_CHARACTERS)
notepad.menuCommand(MENUCOMMAND.VIEW_ALL_CHARACTERS)
As explained in my comments, I use the VIEW_EOL to change out of VIEW_ALL_CHARACTERS, no matter what the state of the VIEW_ALL_CHARACTERS toggle is; then I use VIEW_ALL_CHARACTERS once to set it or twice to clear it. If, instead, you’d like your DontShowAllCharacters to revert to “Show EOL” or “Show Whitespace and Tab”, then instead of VIEW_EOL/VIEW_ALL_CHARACTERS/VIEW_ALL_CHARACTERS sequence of three, you could just use VIEW_EOL (a sequence of one to show EOL) or VIEW_TAB_SPACE. (Though, to be safe, you might want a two-sequence of VIEW_ALL_CHARACTERS/VIEW_EOL or VIEW_ALL_CHARACTERS/VIEW_TAB_SPACE. It would be easier if there were a notepad.getMenuCommandState() or similar command that reads back the current state of a toggled menu command.)
-
thx,
there are editor.getViewEOL() and editor.getViewWS() functions available to retrieve the current state. But instead of using setView… I would recommend using notepad.menuCommand(MENUCOMMAND…) to stay in sync with Notepad++ itself.
Cheers
Claudia -
I’ve noticed that after doing an editor.setRepresentation() it shows the new character representation in the currently active tab, but if I switch tabs to one which also has characters that should be shown by this, they aren’t shown. Switching back to the tab I started in, the representation I set has also disappeared.
I think I know why this is (well, kinda, :-) ), but I’m surprised it wasn’t mentioned before in this thread.
-
@Alan-Kilborn said in Invisible characters unwanted:
I’ve noticed that after doing an editor.setRepresentation() it shows the new character representation in the currently active tab, but if I switch tabs to one which also has characters that should be shown by this, they aren’t shown. Switching back to the tab I started in, the representation I set has also disappeared.
Here’s a little script to avoid that problem, I call it SetRepresentationForSpecialCharacters.py:

# -*- coding: utf-8 -*-

from Npp import editor, notepad, NOTIFICATION

class SRFSC(object):

    def __init__(self):
        # re-apply the representations every time a buffer (tab) is activated
        notepad.callback(self.callback_npp_BUFFERACTIVATED, [NOTIFICATION.BUFFERACTIVATED])
        self.callback_npp_BUFFERACTIVATED(None)

    def callback_npp_BUFFERACTIVATED(self, args):
        editor.setRepresentation(u'\u200B', "ZWS")
        editor.setRepresentation(u'\u200C', "ZWNJ")
        editor.setRepresentation(u'\u200D', "ZWJ")
        editor.setRepresentation(u'\u200E', "LTR")  # left-to-right mark
        editor.setRepresentation(u'\uFEFF', "ZWNBSP")

I run it from my startup.py with this segment of code:

import SetRepresentationForSpecialCharacters
SetRepresentationForSpecialCharacters.SRFSC()
-
@Alan-Kilborn said in Invisible characters unwanted:
editor.setRepresentation(u'\u200E', "LTR") # left-to-right mark
Apparently this LTR issue is really annoying to you. :-)
Looking at http://www.fileformat.info/info/unicode/block/general_punctuation/images.htm, there are other control characters in that block, so if your data is more varied, I might expand that to:
# zero width in name
editor.setRepresentation(u'\u200B', "ZWS")
editor.setRepresentation(u'\u200C', "ZWNJ")
editor.setRepresentation(u'\u200D', "ZWJ")
editor.setRepresentation(u'\uFEFF', "ZWNBSP")
# also zero width
editor.setRepresentation(u'\u2060', "WJ")    # word joiner (separate from ZWJ, but still claims zero width)
# directional controls and other toggles
editor.setRepresentation(u'\u200E', "LTR")   # left-to-right mark
editor.setRepresentation(u'\u200F', "RTL")   # right-to-left mark
editor.setRepresentation(u'\u202A', "EMBL")  # left-to-right embedding
editor.setRepresentation(u'\u202B', "EMBR")  # right-to-left embedding
editor.setRepresentation(u'\u202C', "EMBP")  # pop directional formatting
editor.setRepresentation(u'\u202D', "OVRL")  # left-to-right override
editor.setRepresentation(u'\u202E', "OVRR")  # right-to-left override
editor.setRepresentation(u'\u2066', "ISOL")  # left-to-right isolate
editor.setRepresentation(u'\u2067', "ISOR")  # right-to-left isolate
editor.setRepresentation(u'\u2068', "ISO1")  # first strong isolate
editor.setRepresentation(u'\u2069', "ISOP")  # pop directional isolate
editor.setRepresentation(u'\u206A', "SYMI")  # inhibit symmetric swapping
editor.setRepresentation(u'\u206B', "SYMA")  # activate symmetric swapping
editor.setRepresentation(u'\u206C', "ARAI")  # inhibit arabic form shaping
editor.setRepresentation(u'\u206D', "ARAA")  # activate arabic form shaping
editor.setRepresentation(u'\u206E', "SHNA")  # national digit shapes
editor.setRepresentation(u'\u206F', "SHNO")  # nominal digit shapes
But, most important is to include the characters that you, as the user of the script, might run across.
-
Hello, @peterjones, @alan-kilborn and All,
From these two links :
https://www.unicode.org/charts/PDF/U2000.pdf
https://www.unicode.org/charts/PDF/UFE70.pdf
I just rewrote these 26 special characters:
- By increasing Unicode code-point order
- With their exact code-points (some typos corrected)
- With their normalized Unicode character representation
So, here is a new version of @alan-kilborn’s SetRepresentationForSpecialCharacters.py file, with the merged lines from @peterjones’s script, without using the startup.py file:

# -*- coding: utf-8 -*-

from Npp import editor, notepad, NOTIFICATION, MENUCOMMAND

class SRFSC(object):

    def __init__(self):
        notepad.callback(self.callback_npp_BUFFERACTIVATED, [NOTIFICATION.BUFFERACTIVATED])
        self.callback_npp_BUFFERACTIVATED(None)

    def callback_npp_BUFFERACTIVATED(self, args):

        # FORMAT chars
        editor.setRepresentation(u'\u200B', "ZWSP")  # zero width space
        editor.setRepresentation(u'\u200C', "ZWNJ")  # zero width non-joiner
        editor.setRepresentation(u'\u200D', "ZWJ")   # zero width joiner
        editor.setRepresentation(u'\u200E', "LRM")   # left-to-right mark
        editor.setRepresentation(u'\u200F', "RLM")   # right-to-left mark
        editor.setRepresentation(u'\u202A', "LRE")   # left-to-right embedding
        editor.setRepresentation(u'\u202B', "RLE")   # right-to-left embedding
        editor.setRepresentation(u'\u202C', "PDF")   # pop directional formatting
        editor.setRepresentation(u'\u202D', "LRO")   # left-to-right override
        editor.setRepresentation(u'\u202E', "RLO")   # right-to-left override
        editor.setRepresentation(u'\u2060', "WJ")    # word joiner ( zero width no-break space )

        # INVISIBLE chars
        editor.setRepresentation(u'\u2061', "FA")    # function application
        editor.setRepresentation(u'\u2062', "IT")    # invisible times
        editor.setRepresentation(u'\u2063', "IS")    # invisible separator
        editor.setRepresentation(u'\u2064', "IP")    # invisible plus

        # FORMAT chars
        editor.setRepresentation(u'\u2066', "LRI")   # left-to-right isolate
        editor.setRepresentation(u'\u2067', "RLI")   # right-to-left isolate
        editor.setRepresentation(u'\u2068', "FSI")   # first strong isolate
        editor.setRepresentation(u'\u2069', "PDI")   # pop directional isolate

        # DEPRECATED chars
        editor.setRepresentation(u'\u206A', "ISS")   # inhibit symmetric swapping
        editor.setRepresentation(u'\u206B', "ASS")   # activate symmetric swapping
        editor.setRepresentation(u'\u206C', "IAFS")  # inhibit arabic form shaping
        editor.setRepresentation(u'\u206D', "AAFS")  # activate arabic form shaping
        editor.setRepresentation(u'\u206E', "NADS")  # national digit shapes
        editor.setRepresentation(u'\u206F', "NODS")  # nominal digit shapes

        # SPECIAL char
        editor.setRepresentation(u'\uFEFF', "BOM")   # byte order mark ( zero width no-break space : deprecated, see U+2060 )

        notepad.menuCommand(MENUCOMMAND.VIEW_ALL_CHARACTERS)
        notepad.menuCommand(MENUCOMMAND.VIEW_ALL_CHARACTERS)

SRFSC()
-
On the other hand, you may quickly verify whether some special characters exist in the current file, using the Mark dialog:
- MARK [\x{200B}-\x{200F}\x{202A}-\x{202E}\x{2060}-\x{2064}\x{2066}-\x{206F}\x{FEFF}]
=> You should see some thin red marks and, since the v7.9.2 N++ version, you can copy all these chars into a new tab, for further examination, with the Copy Marked Text button!
-
A third solution could be to perform a regex S/R (which can be recorded as a macro) to replace any of these special characters with their Unicode representation:
- SEARCH (\x{200B})|(\x{200C})|(\x{200D})|(\x{200E})|(\x{200F})|(\x{202A})|(\x{202B})|(\x{202C})|(\x{202D})|(\x{202E})|(\x{2060})|(\x{2061})|(\x{2062})|(\x{2063})|(\x{2064})|(\x{2066})|(\x{2067})|(\x{2068})|(\x{2069})|(\x{206A})|(\x{206B})|(\x{206C})|(\x{206D})|(\x{206E})|(\x{206F})|(\x{FEFF})
- REPLACE (?1[ZWSP])(?2[ZWNJ])(?3[ZWJ])(?4[LRM])(?5[RLM])(?6[LRE])(?7[RLE])(?8[PDF])(?9[LRO])(?10[RLO])(?11[WJ])(?12[FA])(?13[IT])(?14[IS])(?15[IP])(?16[LRI])(?17[RLI])(?18[FSI])(?19[PDI])(?20[ISS])(?21[ASS])(?22[IAFS])(?23[AAFS])(?24[NADS])(?25[NODS])(?26[BOM])
- Once the characters have been noted and/or the lines bookmarked, for further analysis, just undo the replacements with Ctrl + Z
-
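Outside Notepad++, the Mark expression and the search/replace above can be approximated in plain Python. This is only a sketch: the names `LABELS`, `INVISIBLE`, `find_invisibles` and `reveal` are made up for illustration, the labels are the 26 abbreviations from this post, and Python's `re` module is used in place of the Boost regex engine of Notepad++.

```python
import re

# Code point -> abbreviation, following the 26 characters listed in this post.
LABELS = {
    0x200B: "ZWSP", 0x200C: "ZWNJ", 0x200D: "ZWJ", 0x200E: "LRM", 0x200F: "RLM",
    0x202A: "LRE", 0x202B: "RLE", 0x202C: "PDF", 0x202D: "LRO", 0x202E: "RLO",
    0x2060: "WJ", 0x2061: "FA", 0x2062: "IT", 0x2063: "IS", 0x2064: "IP",
    0x2066: "LRI", 0x2067: "RLI", 0x2068: "FSI", 0x2069: "PDI",
    0x206A: "ISS", 0x206B: "ASS", 0x206C: "IAFS", 0x206D: "AAFS",
    0x206E: "NADS", 0x206F: "NODS", 0xFEFF: "BOM",
}

# Same character class as the Mark expression, in Python's escape syntax.
INVISIBLE = re.compile(
    "[\u200B-\u200F\u202A-\u202E\u2060-\u2064\u2066-\u206F\uFEFF]"
)


def find_invisibles(text):
    """Return (index, label) for every listed invisible character in text."""
    return [(m.start(), LABELS[ord(m.group())]) for m in INVISIBLE.finditer(text)]


def reveal(text):
    """Replace each listed invisible character with its bracketed label."""
    return INVISIBLE.sub(lambda m: "[%s]" % LABELS[ord(m.group())], text)


line = 'id="\u200bFE-000-2"'
print(find_invisibles(line))
print(reveal(line))
```

Unlike the in-editor replacement, `reveal` returns a new string and leaves the original text untouched, so there is nothing to undo afterwards.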
Best Regards,
guy038
-
-
Is it your intent with your last posting to say that, with the PDFs from unicode.org, we now have a “complete” list of invisible characters, and a script can be made that covers them all, using correct abbreviations in their N++ representations?
At first look, it seems that anything in those docs that is shown inside a “dashed box”, e.g.:
is a good candidate for a new representation being assigned in a N++ script like the ones above?
If this is the case, I’m surprised that in the script you presented, not all of the seemingly invisible characters from the documents are in the script.
EDIT: Hmm, not sure now about the “dashed box”, as I just noticed some dashed boxes in the doc containing things like “,” and “+”, so probably the dashed box does not truly identify something as an “invisible character”.
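One possible way to approach the question of a “complete” list without reading the charts by eye is to ask the Unicode character database directly. The sketch below lists code points whose general category is Cf (“Format”) in the General Punctuation block, plus U+FEFF. Treating Cf as the criterion is an assumption made here for illustration, not an official definition of “invisible”; the function name `format_characters` is hypothetical, and the result depends on the Unicode version bundled with the Python build.

```python
import unicodedata


def format_characters(first, last):
    """Return the code points in [first, last] whose general category is Cf."""
    return [cp for cp in range(first, last + 1)
            if unicodedata.category(chr(cp)) == "Cf"]


# General Punctuation block (U+2000 to U+206F), plus the byte order mark.
candidates = format_characters(0x2000, 0x206F) + format_characters(0xFEFF, 0xFEFF)

for cp in candidates:
    print("U+%04X  %s" % (cp, unicodedata.name(chr(cp), "<unnamed>")))
```

Characters such as the spaces U+2000 to U+200A have a different category (Zs), so a broader notion of “invisible” would need further categories or an explicit list.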