
    How do I recognize slightly different coding characters?

    • M
      mahengrui1
      last edited by

      Hello, I believed the information shown in Notepad++ was the ‘purest’, but now I find that one letter does not always equal another. For example:

      s​hinobi_
      shinobi_

      1st row has byte values (decimal) 115 226 128 139 104 105 110 111 98 105 95
      2nd row has byte values (decimal) 115 104 105 110 111 98 105 95
      1st row has UTF-8 bytes (hex) \x73\xe2\x80\x8b\x68\x69\x6e\x6f\x62\x69\x5f
      2nd row has UTF-8 bytes (hex) \x73\x68\x69\x6e\x6f\x62\x69\x5f

      I’d say the 1st row contains a hidden, slightly different character, and I want to know the correct name for this type of character. Is there a way to detect them quickly? My concern is that others could use this hidden content to fool me.

      In Notepad++ I can see the length and Pos values change, but in a long document it is impossible to spot the difference that way.
      Thank you

      • EkopalypseE
        Ekopalypse @mahengrui1
        last edited by Ekopalypse

        @mahengrui1

        The byte sequence \xe2\x80\x8b is the UTF-8 encoding of the Zero Width Space code point (U+200B).

        One way to detect such characters would be to have the editor display a different, visible symbol for these code points, as described here.
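
        As a quick cross-check outside Notepad++, here is a minimal Python sketch (Python is an assumption on my side, nobody in this thread uses it) that decodes the bytes quoted in the question and prints the official Unicode name of every non-ASCII character it finds:

```python
import unicodedata

# Bytes of the 1st row, copied from the question.
raw = b"\x73\xe2\x80\x8b\x68\x69\x6e\x6f\x62\x69\x5f"
text = raw.decode("utf-8")

# Report every character outside the plain ASCII range.
for ch in text:
    if ord(ch) > 127:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}")
```

        For these bytes it reports U+200B ZERO WIDTH SPACE.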

        • guy038G
          guy038
          last edited by guy038

          Hello, @mahengrui1, @ekopalypse and All,

          In addition to the valuable link provided by @ekopalypse, you may have a look at this other one:

          https://community.notepad-plus-plus.org/post/31761


          You may also use the on-line tool below:

          https://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?

          which gives you the main representations of a Unicode character and its exact name!

          Best Regards,

          guy038

          • M
            mahengrui1 @Ekopalypse
            last edited by

            @Ekopalypse @guy038

            Thank you very much. Besides, a method occurred to me while I was reading a reply here (I’m sorry, I don’t know how to quote a specific reply).

            If I use Notepad++ Encoding > Convert to ANSI, it shows all Zero-Width characters, right? And if I convert back to UTF-8, they all become normal ‘?’

            • Alan KilbornA
              Alan Kilborn @mahengrui1
              last edited by

              @mahengrui1 said in How do I recognize slightly different coding characters?:

              If I use Notepad++ Encoding > Convert to ANSI, it shows all Zero-Width characters, right?

              Well, it will show unconvertible characters (because an “ANSI” encoding has no concept of most Unicode characters) as a ? character.

              And if I convert back to UTF-8, they all become normal ‘?’

              Ah, so this leads me to believe you understand what is happening when the ? appears…

              And yes, converting back to UTF-8 will just keep the ? as a ?. I suppose your strategy is to then search for all ? and decide what to do for each one?
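
              To make the round trip concrete, here is a small Python sketch. It assumes the "ANSI" code page is Windows-1252 (on your machine it may be a different one) and uses Python's `errors="replace"` to imitate the ? substitution; it is not a description of how Notepad++ does the conversion internally:

```python
# Text from the question: "s", ZERO WIDTH SPACE, "hinobi_".
text = "s\u200bhinobi_"

# "Convert to ANSI": U+200B has no Windows-1252 byte, so it becomes "?".
ansi_bytes = text.encode("cp1252", errors="replace")

# "Convert to UTF-8" afterwards: the "?" is now an ordinary character.
round_trip = ansi_bytes.decode("cp1252")

# Positions (0-based) that the search-for-? strategy would flag.
positions = [i for i, ch in enumerate(round_trip) if ch == "?"]
print(round_trip, positions)
```

              Note the limitation: a ? that was already in the text before the conversion looks exactly the same as a substituted one, so every hit still has to be checked by hand.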

              • M
                mahengrui1 @Alan Kilborn
                last edited by

                @Alan-Kilborn yes, it is my current strategy

                • guy038G
                  guy038
                  last edited by guy038

                  Hi, @mahengrui1, @ekopalypse, @alan-kilborn and All,

                  Well, an ANSI-encoded file, which encodes every character with 1 byte only, will never be able to encode the ZERO WIDTH SPACE character, of code point \x{200B}, which is encoded:

                  • In a UTF-8 or UTF-8-BOM encoded file, with the 3 consecutive bytes \xE2, \x80 and \x8B

                  • In a UCS-2 BE BOM encoded file, with the 2 consecutive bytes \x20 and \x0B

                  • In a UCS-2 LE BOM encoded file, with the 2 consecutive bytes \x0B and \x20
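
                  These byte sequences can be double-checked with a short Python sketch. Python has no codec named UCS-2, so the sketch assumes `utf-16-be` and `utf-16-le` as stand-ins, which give the same bytes for any character in the Basic Multilingual Plane such as U+200B:

```python
zwsp = "\u200b"  # ZERO WIDTH SPACE

# Codec names are Python's; the labels follow the Notepad++ Encoding menu.
encodings = {
    "UTF-8": "utf-8",
    "UCS-2 BE": "utf-16-be",
    "UCS-2 LE": "utf-16-le",
}

for label, codec in encodings.items():
    # bytes.hex(sep) needs Python 3.8 or later.
    print(f"{label:9} {zwsp.encode(codec).hex(' ')}")
```

                  The BOM variants only add the byte order mark at the very start of the file; the bytes of the character itself stay the same.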


                  However, IF, within an ANSI file, you:

                  • Insert, on purpose, the 3 UTF-8 bytes shown above, at any location

                  • Select the Notepad++ option Encoding > UTF-8 ( NOT the option Encoding > Convert to UTF-8 ! ), which re-interprets the existing bytes instead of converting them

                  • RE-save this file

                  You’ll create a UTF-8 file containing one ( or several ) Zero-Width Space character(s)


                  Method to use :

                  • Open your ANSI encoded file

                  • Select the Edit > Character Panel option

                  • Move your caret to the location where you want to insert a Zero-Width Space character

                  • Click, successively, on the 3 characters, in the Character column of this panel, which are in the same rows as the Hex values E2, 80 and 8B ( with a Windows-1252 ANSI code page, this gives the string â€‹ )

                  • Repeat the previous operation at any location where this special char is needed

                  • Select the Encoding > UTF-8 option

                  • Save your new UTF-8 file
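
                  For those without an ANSI file at hand, the steps above can be imitated in a few lines of Python. This is only a sketch: it assumes the ANSI code page is Windows-1252 and works on in-memory bytes rather than on a real file:

```python
# Start from plain ANSI text.
ansi_text = "shinobi_"

# The 3 characters found in the Character Panel rows E2, 80 and 8B
# (Windows-1252 assumed): "â", "€" and "‹".
three_chars = bytes([0xE2, 0x80, 0x8B]).decode("cp1252")

# Insert them after the first letter, as one would with the caret.
edited = ansi_text[0] + three_chars + ansi_text[1:]

# Saving as ANSI writes one byte per character.
file_bytes = edited.encode("cp1252")

# Encoding > UTF-8 re-reads the SAME bytes as UTF-8, without converting.
reinterpreted = file_bytes.decode("utf-8")

print(repr(edited))         # three visible characters after the "s"
print(repr(reinterpreted))  # one zero-width space after the "s"
```

                  The bytes never change; only their interpretation does, which is why Encoding > UTF-8 and Encoding > Convert to UTF-8 give different results.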


                  Now, to find the location of all the Zero-Width Space characters, simply use the Mark feature and search for \x{200B}, with the Regular expression mode checked
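
                  The same kind of search can be done on a whole document outside the editor. Below is a hedged Python sketch; the function name `find_format_chars` is made up for this example, and it deliberately casts a wider net than \x{200B} by flagging every character of the Unicode general category Cf ( Format ), to which ZERO WIDTH SPACE belongs:

```python
import unicodedata

def find_format_chars(text):
    """Return (line, column, code point, name) for each Cf character.

    Lines and columns are 1-based. The name of this helper is
    hypothetical, chosen only for this example.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if unicodedata.category(ch) == "Cf":
                name = unicodedata.name(ch, "UNNAMED")
                hits.append((line_no, col, f"U+{ord(ch):04X}", name))
    return hits

# The two rows from the original question.
sample = "s\u200bhinobi_\nshinobi_"
for hit in find_format_chars(sample):
    print(hit)
```

                  Category Cf also covers characters such as the soft hyphen and the byte order mark, so some hits may be harmless; compare against `"\u200b"` only if you want the exact equivalent of the \x{200B} search.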

                  Best Regards,

                  guy038

                  PS :

                  Refer to this page :

                  https://www.unicode.org/charts/PDF/U2000.pdf

                  • M
                    mahengrui1 @guy038
                    last edited by

                    @guy038 it looks good; however, I don’t have an ANSI file, so I can’t try it out

                    The Community of users of the Notepad++ text editor.