Invisible characters unwanted

  • B
    Benoît Lechat
    last edited by Jun 28, 2017, 1:18 PM

    Hello everyone,
    Really strange problem with Notepad++ here…
    An invisible character appears between the other letters at a certain point in my HTML code.
    So my code doesn’t work.
    Where does it come from? Do you have a solution, such as “search and replace”?
    Check the file below: the character is just before the F of FE-000-2. If I try to delete the " character, I have to press Backspace twice!
    https://drive.google.com/open?id=0B-tMAt7OX-3OSFNjTFJVNkk3VEU

    Thx for help

    • S
      sambuccid
      last edited by Jun 28, 2017, 3:04 PM

      I don’t see the character.

      • C
        Claudia Frank @Benoît Lechat
        last edited by Jun 28, 2017, 10:38 PM

        @Benoît-Lechat

        Where does it come from ?

        Could it be that you played with different encodings? Copy and pasted from different sources?

        Cheers
        Claudia

        • B
          Benoît Lechat
          last edited by Jun 29, 2017, 7:19 AM

          I copied the text from an old website, but from the page as rendered in the browser and not from the source code itself.

          • A
            AdrianHHH
            last edited by Jun 29, 2017, 8:11 AM

            A hex dump of the downloaded file shows as below. There appear to be the three bytes E2 80 8B between the double quote and the ‘F’. I do not know what they represent or why Notepad++ does not show anything for them.

             000000   3C646976 20636C61   |<div cla|
             000008   73733D22 6D6F6461   |ss="moda|
             000010   6C206661 64652070   |l fade p|
             000018   726F6475 63745F76   |roduct_v|
             000020   69657722 2069643D   |iew" id=|
             000028   22E2808B 46452D30   |"...FE-0|
             000030   30302D32 223E       |00-2">  |
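
A hex dump like the one above can also be searched programmatically. The following is a minimal sketch, assuming the line from the file is available as raw bytes; the helper name `find_non_ascii` is made up for illustration and is not part of any tool mentioned in this thread.

```python
def find_non_ascii(data):
    """Return (offset, hex string) for each run of bytes >= 0x80 in data."""
    runs = []
    start = None
    for i, b in enumerate(data):
        if b >= 0x80:
            if start is None:
                start = i          # a new run of non-ASCII bytes begins here
        elif start is not None:
            runs.append((start, " ".join(f"{x:02X}" for x in data[start:i])))
            start = None
    if start is not None:          # run extends to the end of the data
        runs.append((start, " ".join(f"{x:02X}" for x in data[start:])))
    return runs


# The line from the hex dump, with the three suspicious bytes after the quote.
line = b'<div class="modal fade product_view" id="\xe2\x80\x8bFE-000-2">'
print(find_non_ascii(line))  # [(41, 'E2 80 8B')]
```

Offset 41 (0x29) matches the dump: the double quote sits at 0x28 and the three bytes follow it.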
            
            • P
              PeterJones
              last edited by Jun 29, 2017, 2:17 PM

              I originally assumed it was Unicode encoded in UTF-8. In UTF-8, bytes starting with a most-significant-bit of 1 (0x80 - 0xFF) are part of an encoded Unicode codepoint beyond U+007F, and when the first nibble in the encoded codepoint (the first hex character) is an E, it indicates that the codepoint is spread across three bytes:

              1110_xxxx   10yy_yyyy   10zz_zzzz       the three bytes of the codepoint, showing the "prefixes" and the arbitrary bits x, y, z
              E0          82          8B              the three values from your document
              1110_0000   1000_0010   1000_1011       ... converted to binary
                   0000     00 0010     00 1011       get rid of the prefixes
              0000_00_0010_00_1011                    compress the space
              0000_0000_1000_1011                     regroup to nibbles
              U+008B                                  convert to hex unicode
              

              U+008B is a control character.

              Oddly, U+008B should have been represented with only two bytes, not three: C2 8B, so I’m not sure why your browser gave you those three bytes when you copied, unless the browser wasn’t really presenting UTF-8. Windows-1252 (Latin-1) would be à‚‹. In CP850/OEM850 (“Multilingual”, which also calls itself Latin I in some documents), it is Óéï. In DOS CodePage 437 (which I cannot find in Notepad++ Encoding list, but it’s the one that had the old box-drawing characters), it would have been αéï. None of those strings make sense as being likely; maybe the browser misinterpreted or misencoded something when you copied from the browser, or over the years, whatever encoding was originally there had been corrupted into those three bytes.
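
The legacy-code-page readings in the paragraph above can be reproduced with Python's built-in codecs. This is only a sketch: it uses the byte values exactly as written in this post (E0 82 8B), and `cp1252`, `cp850` and `cp437` are Python's codec names for those code pages.

```python
# The three bytes as transcribed in this post.
raw = bytes([0xE0, 0x82, 0x8B])

# Interpret the same bytes under each single-byte code page.
for codec in ("cp1252", "cp850", "cp437"):
    print(codec, raw.decode(codec))
```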

              I tried an experiment, and took your example file, and appended the hex characters 2020C28B2020 (two spaces, the proper UTF-8 encoding of U+008B, and two more spaces) – Notepad++ doesn’t display the E0828B as anything, but does display the C28B as [PLD]. So it’s looking like Notepad++ just doesn’t display anything for an invalid UTF-8 sequence, but does display something for UTF-8 control-codes
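
The difference between the two sequences in that experiment can be checked outside Notepad++ as well. A small sketch, assuming Python's strict UTF-8 decoder as the reference (the helper `try_utf8` is a hypothetical name): the three-byte form E0 82 8B is an overlong encoding and is rejected, while the two-byte form C2 8B decodes to U+008B.

```python
def try_utf8(raw):
    """Decode raw as strict UTF-8; return None if the bytes are not valid."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


overlong = try_utf8(b"\xe0\x82\x8b")   # three bytes for a code point that needs two
proper = try_utf8(b"\xc2\x8b")         # the shortest (valid) form of U+008B

print(overlong)                 # None
print(f"U+{ord(proper):04X}")   # U+008B
```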

              • A
                AdrianHHH
                last edited by AdrianHHH Jun 30, 2017, 10:45 AM

                Peter, great on how to convert, but I think you transcribed the hex values wrongly. You converted E0 82 8B but I believe the values in the file are E2 80 8B. When I convert these using your method I get U+400B and it is described as “Unicode Han Character ‘(same as U+9E7D 鹽) salt’ (U+400B)”.

                • P
                  PeterJones
                  last edited by PeterJones Jun 30, 2017, 1:03 PM Jun 30, 2017, 1:01 PM

                  Oops. Actually, we were both wrong. It was E2 80 8B, but that decodes into…

                  1110_xxxx 10yy_yyyy 10zz_zzzz
                  E2        80        8B
                  1110 0010 1000 0000 1000 1011
                       0010   00 0000   00 1011
                  0010_00_0000_00_1011
                  0010_0000_0000_1011
                  U+200B
                  

                  U+200B is the Zero Width Space. And suddenly the lack of a glyph (or, rather, not seeing anything) in Notepad++ makes perfect sense! Of course you won’t see a zero-width space. ☺
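
The bit manipulation in the table above can be written as a few lines of Python. This is a sketch of the three-byte case only, not a full UTF-8 decoder; the function name `decode_three_byte_utf8` is invented for illustration, and it deliberately does not reject overlong forms so that both byte triples from this thread can be compared.

```python
def decode_three_byte_utf8(b0, b1, b2):
    """Strip the 1110/10/10 prefixes and join the remaining 4+6+6 bits."""
    if b0 & 0xF0 != 0xE0:
        raise ValueError("first byte does not have the 1110 prefix")
    if b1 & 0xC0 != 0x80 or b2 & 0xC0 != 0x80:
        raise ValueError("continuation bytes must have the 10 prefix")
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


print(f"U+{decode_three_byte_utf8(0xE2, 0x80, 0x8B):04X}")  # U+200B
print(f"U+{decode_three_byte_utf8(0xE0, 0x82, 0x8B):04X}")  # U+008B
```

The first call uses the bytes actually in the file; the second uses the mistranscribed values from the earlier post and reproduces its result.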

                  Thanks.

                  • G
                    guy038
                    last edited by guy038 Jul 1, 2017, 11:20 AM Jul 1, 2017, 11:12 AM

                    Hello, @benoît-lechat, @adrianHHH, @peterjones, and All,

                    For UTF-8 files, I very often use this simple and useful tool :

                    http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?


                    For instance, given the line of Benoît’s example, from the link :

                    https://drive.google.com/file/d/0B-tMAt7OX-3OSFNjTFJVNkk3VEU/view

                    • Select all that line and paste in a N++ new tab

                    • Now, place the cursor, right before the opening double-quote of the string “FE-000-2”

                    • Hit the Right Arrow, to be at the location, right after the " character

                    • Then hit the Shift + Right Arrow shortcut to select this unknown character

                    • Copy it, with the Ctrl + C shortcut

                    • Open the following UTF-8 tool : http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?

                    • Paste it, with the Ctrl + V shortcut, in the box, at the top of the page

                    • Choose the Interpret as character option

                    • Finally, click on the Go button

                    => You’ll get the Zero Width space character :-))


                    You may, either :

                    • Enter the text E2 80 8B, with a space separator character, between bytes

                    • Choose the Interpret as Hex UTF-8 Bytes option

                    • Click on the Go button

                    You could also :

                    • Enter the text 200B, without any space character ( Hexadecimal Unicode code-point number )

                    • Choose the Interpret as Hex code-point option

                    • Click on the Go button

                    To end with :

                    • Enter the text 8203 ( Decimal Unicode code-point number )

                    • Choose the Interpret as Decimal code-point option

                    • Click on the Go button

                    Each time, you’ll get, again, the Zero Width space character :-))
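
For anyone without access to that web tool, the same four views of the character (the character itself, its UTF-8 bytes, its hex code point and its decimal code point) can be cross-checked offline. A minimal sketch using only the Python standard library:

```python
import unicodedata

ch = "\u200b"  # the character found in the file

utf8_hex = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
print(utf8_hex)              # E2 80 8B
print(f"{ord(ch):04X}")      # 200B
print(ord(ch))               # 8203
print(unicodedata.name(ch))  # ZERO WIDTH SPACE

# Going the other way: from hex bytes, hex code point, or decimal code point.
from_bytes = bytes.fromhex("E2 80 8B").decode("utf-8")
from_hex = chr(int("200B", 16))
from_decimal = chr(8203)
```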

                    Best Regards,

                    guy038

                    P.S. :

                    You can get additional and useful information on the zero-width space (ZWSP) character, from the link, below :

                    https://en.wikipedia.org/wiki/Zero-width_space

                      • N
                        Nathan Harvey
                        last edited by Jul 11, 2017, 6:30 PM

                        But Shouldn’t we be able to see all the characters (including Zero-width space) if we use the menu option View↘Show Symbol↘Show All Characters ?

                        This seems like a menu function that doesn’t work as described (ALL characters should include weird control characters and “noop” characters like zero-width space).

                        • C
                          Claudia Frank @Nathan Harvey
                          last edited by Jul 11, 2017, 6:59 PM

                          @Nathan-Harvey

                          I’m feeling the same, as long as the underlying font is able to represent it, it should be displayed,
                          even when not using show all symbols.

                          Cheers
                          Claudia

                          • C
                            Claudia Frank
                            last edited by Jul 11, 2017, 7:16 PM

                            and it is shown

                            What am I missing here?

                            Cheers
                            Claudia

                            • P
                              PeterJones
                              last edited by Jul 11, 2017, 10:12 PM

                              @Claudia-Frank ,

                              I think what you’re missing is that your font doesn’t have a glyph for the character, so shows the ? in a box.

                              My font, DejaVu Sans Mono, has a glyph for that character, which is a zero-width glyph, so you cannot see it (because it’s there, but zero-width). But I can highlight it (see the little green highlight on the first line, and the “Sel: 1|1” on the status bar).

                              @Nathan-Harvey , I think Notepad++ and my font are doing the right thing: there is a character (Zero-Width Space), and it is being shown, as zero-width. It’s not a control-character, so it doesn’t have a default CR LF-style box-glyph from show-all-characters.

                              However, using the PythonScript plugin (that Claudia’s screenshot implied), you can run editor.setRepresentation(u'\u200B', "ZWS") to get it to replace the normal zero-width space with ZWS in a black box (similar to the CR and LF boxes). To clear that alternate representation, editor.clearRepresentation(u'\u200B'). (There is similar notation for the NppExec plugin as well, but I do not know how to represent a unicode string in its syntax.)

                              By saving that to a script, and using the PythonScript Configuration menu to add that script to the Plugins > PythonScript menu, you can actually then assign a keyboard shortcut using Settings > Shortcut Mapper > Plugin Commands. If you make two scripts

                              # script = Show ZeroWidth Characters (give them a non-zero-width representation)
                              editor.setRepresentation(u'\u200B', "ZWS")
                              editor.setRepresentation(u'\u200C', "ZWNJ")
                              editor.setRepresentation(u'\u200D', "ZWJ")
                              editor.setRepresentation(u'\uFEFF', "ZWNBSP")
                              
                              # script = Default ZeroWidth Characters (return them to their zero-width glyph from the selected font)
                              editor.clearRepresentation(u'\u200B')
                              editor.clearRepresentation(u'\u200C')
                              editor.clearRepresentation(u'\u200D')
                              editor.clearRepresentation(u'\uFEFF')
                              

                              you can get all the “zero width” unicode characters that I can find to toggle visibility
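
The scripts above only change how Notepad++ draws the characters. As a rough stand-alone counterpart, and as one possible answer to the original “search and replace” question, the same table of four characters can be used in plain Python to mark or remove them in a string. This is a sketch, not part of the PythonScript API; the names `ZERO_WIDTH_LABELS`, `reveal` and `strip_zero_width` are made up here.

```python
# Same four characters and labels as in the scripts above.
ZERO_WIDTH_LABELS = {
    "\u200b": "ZWS",
    "\u200c": "ZWNJ",
    "\u200d": "ZWJ",
    "\ufeff": "ZWNBSP",
}


def reveal(text):
    """Replace each zero-width character with a visible [LABEL] marker."""
    return "".join(
        f"[{ZERO_WIDTH_LABELS[c]}]" if c in ZERO_WIDTH_LABELS else c
        for c in text
    )


def strip_zero_width(text):
    """Delete the zero-width characters listed in ZERO_WIDTH_LABELS."""
    return text.translate({ord(c): None for c in ZERO_WIDTH_LABELS})


attribute = 'id="\u200bFE-000-2"'
print(reveal(attribute))            # id="[ZWS]FE-000-2"
print(strip_zero_width(attribute))  # id="FE-000-2"
```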

                              • C
                                Claudia Frank @PeterJones
                                last edited by Jul 11, 2017, 10:39 PM

                                @PeterJones

                                Peter, thank you very much for your insight.
                                You could be right, and I am already starting to think you are, about the glyph and the font I use.
                                I still feel it should be the other way around, as I don’t like having an invisible char in my code
                                and wondering why it doesn’t do what it is supposed to do; but then, on the other hand, it doesn’t make sense to have a zero-width char. Hmmm.
                                I guess I’m good, as I can use my font or the setRepresentation function to see any “invisible” chars :-)
                                Your explanation makes sense - absolutely.

                                Thank you very much.
                                Claudia

                                • P
                                  PeterJones
                                  last edited by Jul 12, 2017, 1:33 PM

                                  Also, if you want one command to do the normal Show All Characters plus showing these four Zero Width characters,

                                      # script = Show All Characters (including ZeroWidth)
                                      editor.setRepresentation(u'\u200B', "ZWS")
                                      editor.setRepresentation(u'\u200C', "ZWNJ")
                                      editor.setRepresentation(u'\u200D', "ZWJ")
                                      editor.setRepresentation(u'\uFEFF', "ZWNBSP")
                                      # if you want to _also_ show all characters with this script,
                                      #   first pick a different View > Show Symbols option,
                                      #   then pick this one (each is a toggle, so don't want to accidentally hide all characters if show-all was already selected)
                                      notepad.menuCommand(MENUCOMMAND.VIEW_EOL)
                                      notepad.menuCommand(MENUCOMMAND.VIEW_ALL_CHARACTERS)
                                  

                                  And similarly to un-set Show All Characters as well as clearing the four Zero Width representations:

                                      # script = Don't Show All Characters (including ZeroWidth)
                                      editor.clearRepresentation(u'\u200B')
                                      editor.clearRepresentation(u'\u200C')
                                      editor.clearRepresentation(u'\u200D')
                                      editor.clearRepresentation(u'\uFEFF')
                                      # if you want to _also_ hide all characters with this script,
                                      #   first pick a different View > Show Symbols option,
                                      #   then pick this one twice (each is a toggle, so don't want to accidentally show all characters if show-all was already cleared)
                                      notepad.menuCommand(MENUCOMMAND.VIEW_EOL)
                                      notepad.menuCommand(MENUCOMMAND.VIEW_ALL_CHARACTERS)
                                      notepad.menuCommand(MENUCOMMAND.VIEW_ALL_CHARACTERS)
                                  

                                  As explained in my comments, I use the VIEW_EOL to change out of VIEW_ALL_CHARACTERS, no matter what the state of the VIEW_ALL_CHARACTERS toggle is; then I use VIEW_ALL_CHARACTERS once to set it or twice to clear it. If, instead, you’d like your DontShowAllCharacters to revert to “Show EOL” or “Show Whitespace and Tab”, then instead of VIEW_EOL/VIEW_ALL_CHARACTERS/VIEW_ALL_CHARACTERS sequence of three, you could just use VIEW_EOL (a sequence of one to show EOL) or VIEW_TAB_SPACE. (Though, to be safe, you might want a two-sequence of VIEW_ALL_CHARACTERS/VIEW_EOL or VIEW_ALL_CHARACTERS/VIEW_TAB_SPACE. It would be easier if there were a notepad.getMenuCommandState() or similar command that reads back the current state of a toggled menu command.)
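
The reasoning behind those command sequences can be checked with a toy model. The sketch below assumes, as the comments in the scripts do, that the View > Show Symbol entries behave like mutually exclusive toggles: picking the active entry switches it off, picking another entry switches to that one. The class `ShowSymbolMenu` is purely illustrative and does not call Notepad++ or PythonScript; the real menu may behave differently.

```python
class ShowSymbolMenu:
    """Toy model: at most one of EOL / WS / ALL is active at a time."""

    def __init__(self, active=None):
        self.active = active

    def command(self, option):
        # Picking the active option toggles it off; otherwise switch to it.
        self.active = None if self.active == option else option


def force_show_all(menu):
    menu.command("EOL")   # guarantees ALL is not the active option
    menu.command("ALL")   # so one toggle always turns it on


def force_hide_all(menu):
    menu.command("EOL")
    menu.command("ALL")
    menu.command("ALL")   # second toggle turns it off again


for start in (None, "EOL", "WS", "ALL"):
    shown, hidden = ShowSymbolMenu(start), ShowSymbolMenu(start)
    force_show_all(shown)
    force_hide_all(hidden)
    print(start, "->", shown.active, hidden.active)
```

Under this assumed model, every starting state ends at ALL after the first sequence and at nothing shown after the second, which is the property the scripts rely on.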

                                  • C
                                    Claudia Frank @PeterJones
                                    last edited by Jul 12, 2017, 2:08 PM

                                    @PeterJones

                                    thx,
                                    there are editor.getViewEOL() and editor.getViewWS() functions available
                                    to retrieve current state. But instead of using setView… I would recommend using
                                    notepad.menuCommand(MENUCOMMAND…) to be in sync with notepad++ itself.

                                    Cheers
                                    Claudia

                                    • A
                                      Alan Kilborn @PeterJones
                                      last edited by Aug 7, 2020, 5:45 PM

                                      @PeterJones

                                      I’ve noticed that after doing an editor.setRepresentation() it shows the new character representation in the currently active tab, but if I switch tabs to one which also has characters that should be shown by this, they aren’t shown. Switching back to the tab I started in, the representation I set has also disappeared.

                                      I think I know why this is (well, kinda, :-) ), but I’m surprised it wasn’t mentioned before in this thread.

                                      • A
                                        Alan Kilborn @Alan Kilborn
                                        last edited by Aug 7, 2020, 6:00 PM

                                        @Alan-Kilborn said in Invisible characters unwanted:

                                        I’ve noticed that after doing an editor.setRepresentation() it shows the new character representation in the currently active tab, but if I switch tabs to one which also has characters that should be shown by this, they aren’t shown. Switching back to the tab I started in, the representation I set has also disappeared.

                                        Here’s a little script to avoid that problem, I call it SetRepresentationForSpecialCharacters.py :

                                        # -*- coding: utf-8 -*-
                                        
                                        from Npp import editor, notepad, NOTIFICATION
                                        
                                        class SRFSC(object):
                                        
                                            def __init__(self):
                                                notepad.callback(self.callback_npp_BUFFERACTIVATED, [NOTIFICATION.BUFFERACTIVATED])
                                                self.callback_npp_BUFFERACTIVATED(None)
                                        
                                            def callback_npp_BUFFERACTIVATED(self, args):
                                                editor.setRepresentation(u'\u200B', "ZWS")
                                                editor.setRepresentation(u'\u200C', "ZWNJ")
                                                editor.setRepresentation(u'\u200D', "ZWJ")
                                                editor.setRepresentation(u'\u200E', "LTR")  # left-to-right mark
                                                editor.setRepresentation(u'\uFEFF', "ZWNBSP")
                                        

                                        I run it from my startup.py with this segment of code:

                                        import SetRepresentationForSpecialCharacters
                                        SetRepresentationForSpecialCharacters.SRFSC()
                                        
                                        • P
                                          PeterJones @Alan Kilborn
                                          last edited by Aug 7, 2020, 6:32 PM

                                          @Alan-Kilborn said in Invisible characters unwanted:

                                          editor.setRepresentation(u'\u200E', "LTR")  # left-to-right mark
                                          

                                          Apparently this LTR issue is really annoying to you. :-)

                                          Looking at http://www.fileformat.info/info/unicode/block/general_punctuation/images.htm, there are other control characters in that block, so if your data is more varied, I might expand that to:

                                              # zero width in name
                                              editor.setRepresentation(u'\u200B', "ZWS")
                                              editor.setRepresentation(u'\u200C', "ZWNJ")
                                              editor.setRepresentation(u'\u200D', "ZWJ")
                                              editor.setRepresentation(u'\uFEFF', "ZWNBSP")
                                              # also zero width
                                              editor.setRepresentation(u'\u2060', "WJ")       # word joiner (separate from ZWJ, but still claims zero width)
                                              # directional controls and other toggles
                                              editor.setRepresentation(u'\u200E', "LTR")  # left-to-right mark
                                              editor.setRepresentation(u'\u200F', "RTL")  # right-to-left mark
                                              editor.setRepresentation(u'\u202A', "EMBL")  # left-to-right embedding
                                              editor.setRepresentation(u'\u202B', "EMBR")  # right-to-left embedding
                                              editor.setRepresentation(u'\u202C', "EMBP")  # pop directional formatting
                                              editor.setRepresentation(u'\u202D', "OVRL")  # left-to-right override
                                              editor.setRepresentation(u'\u202E', "OVRR")  # right-to-left override
                                              editor.setRepresentation(u'\u2066', "ISOL")  # left-to-right isolate
                                              editor.setRepresentation(u'\u2067', "ISOR")  # right-to-left isolate
                                              editor.setRepresentation(u'\u2068', "ISO1")  # first strong isolate
                                              editor.setRepresentation(u'\u2069', "ISOP")  # pop directional isolate
                                              editor.setRepresentation(u'\u206A', "SYMI")  # inhibit symmetric swapping
                                              editor.setRepresentation(u'\u206B', "SYMA")  # activate symmetric swapping
                                              editor.setRepresentation(u'\u206C', "ARAI")  # inhibit arabic form shaping
                                              editor.setRepresentation(u'\u206D', "ARAA")  # activate arabic form shaping
                                              editor.setRepresentation(u'\u206E', "SHNA")  # national digit shapes
                                              editor.setRepresentation(u'\u206F', "SHNO")  # nominal digit shapes
                                          

                                          But, most important is to include the characters that you, as the user of the script, might run across.
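
One way to find out which of these characters actually occur in your own data is to scan for the Unicode general category Cf ("format"), which covers the zero-width and directional characters listed above. A minimal sketch using the standard `unicodedata` module; the function name `find_format_characters` is hypothetical, and Cf also includes a few characters not in the list (for example the soft hyphen), so treat the result as a starting point.

```python
import unicodedata


def find_format_characters(text):
    """Return (index, 'U+XXXX', name) for every category-Cf character."""
    found = []
    for index, ch in enumerate(text):
        if unicodedata.category(ch) == "Cf":
            name = unicodedata.name(ch, "<unnamed>")
            found.append((index, f"U+{ord(ch):04X}", name))
    return found


sample = 'id="\u200bFE-000-2" \u200eabc'
for hit in find_format_characters(sample):
    print(hit)
# (4, 'U+200B', 'ZERO WIDTH SPACE')
# (15, 'U+200E', 'LEFT-TO-RIGHT MARK')
```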
