    How do I recognize slightly different coding characters?

    • M
      mahengrui1 last edited by

      Hello, I believed the text shown in Notepad++ was the ‘purest’ form, but now I find that two strings that look identical are not necessarily equal. For example:

      s​hinobi_
      shinobi_

      1st row has byte values (decimal) 115 226 128 139 104 105 110 111 98 105 95
      2nd row has byte values (decimal) 115 104 105 110 111 98 105 95
      1st row has utf-8 \x73\xe2\x80\x8b\x68\x69\x6e\x6f\x62\x69\x5f
      2nd row has utf-8 \x73\x68\x69\x6e\x6f\x62\x69\x5f

      I’d say the 1st row contains a hidden, slightly differently encoded character, and I want to know the correct name for this type of character. Is there a quick way to detect them? My concern is that others could use such hidden content to fool me.

      In Notepad++ I can see that the length and Pos values change, but for a long document it is impractical to find them that way.
      Thank you

      • Ekopalypse
        Ekopalypse @mahengrui1 last edited by Ekopalypse

        @mahengrui1

        The byte sequence \xe2\x80\x8b is the UTF-8 encoding of the Zero Width Space code point (U+200B).

        One way to detect such characters is to have them displayed with a different, visible symbol, as described here.
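Outside the editor, the same kind of detection can be sketched in a few lines of Python. This is only an illustration, not a Notepad++ feature: the function name `find_format_chars` is made up for this example, and it assumes that "hidden" characters are those in the Unicode general category Cf (format characters), which includes the Zero Width Space.

```python
import unicodedata


def find_format_chars(text):
    """Return (index, 'U+XXXX', name) for each Unicode format character (category Cf)."""
    hits = []
    for index, char in enumerate(text):
        if unicodedata.category(char) == "Cf":
            # Fall back to a placeholder if the character has no official name.
            name = unicodedata.name(char, "<unnamed>")
            hits.append((index, f"U+{ord(char):04X}", name))
    return hits


# The first row from the question, with the hidden character written as an escape.
sample = "s\u200bhinobi_"
for hit in find_format_chars(sample):
    print(hit)
```

The category test is a deliberate simplification; other invisible characters (for example, unusual space characters in category Zs) would need an extra check.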

        • guy038
          guy038 last edited by guy038

          Hello, @mahengrui1, @ekopalypse and All,

          In addition to the valuable link provided by @ekopalypse, you may also have a look at this other one:

          https://community.notepad-plus-plus.org/post/31761


          You may also use the online tool below:

          https://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?

          which gives you the main representations of a Unicode character and its exact name!

          Best Regards,

          guy038

          • M
            mahengrui1 @Ekopalypse last edited by

            @Ekopalypse @guy038

            Thank you very much. Also, a method occurred to me while I was reading a reply here (sorry, I don’t know how to quote a specific reply):

            If I use Encoding > Convert to ANSI in Notepad++, it reveals all the Zero-Width characters, right? And if I then convert back to UTF-8, they all become a normal ‘?’

            • Alan Kilborn
              Alan Kilborn @mahengrui1 last edited by

              @mahengrui1 said in How do I recognize slightly different coding characters?:

              If I use Encoding > Convert to ANSI in Notepad++, it reveals all the Zero-Width characters, right?

              Well, it will show unconvertible characters (because an “ANSI” encoding has no concept of most Unicode characters) as a ? character.

              And if I then convert back to UTF-8, they all become a normal ‘?’

              Ah, so this leads me to believe you understand what is happening when the ? appears…

              And yes, converting back to UTF-8 will just keep the ? as a ?. I suppose your strategy is to then search for all ? and decide what to do for each one?
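The lossy round trip described above can be reproduced with a minimal Python sketch. It assumes that "ANSI" means Windows-1252 (`cp1252`); the actual ANSI code page depends on the Windows locale, and Python's `errors="replace"` is only an approximation of what Notepad++ does during conversion.

```python
# The first row from the question, with the hidden character written as an escape.
original = "s\u200bhinobi_"

# Assumption: "ANSI" is Windows-1252. Characters it cannot represent become "?".
ansi_bytes = original.encode("cp1252", errors="replace")

# Converting back cannot restore the lost character: the "?" stays a "?".
round_tripped = ansi_bytes.decode("cp1252")

print(ansi_bytes)
print(round_tripped == original)
```

One caveat with searching for ? afterwards: a question mark that was already in the text looks exactly like one produced by the conversion.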

              • M
                mahengrui1 @Alan Kilborn last edited by

                @Alan-Kilborn yes, it is my current strategy

                • guy038
                  guy038 last edited by guy038

                  Hi, @mahengrui1, @ekopalypse, @alan-kilborn and All,

                  Well, an ANSI-encoded file, which encodes every character with 1 byte only, can never encode the ZERO WIDTH SPACE character, of code point \x{200B}, which is encoded:

                  • In a UTF-8 or UTF-8-BOM encoded file, with the 3 consecutive bytes \xE2, \x80 and \x8B

                  • In a UCS-2 BE BOM encoded file, with the 2 consecutive bytes \x20 and \x0B

                  • In a UCS-2 LE BOM encoded file, with the 2 consecutive bytes \x0B and \x20
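These byte sequences can be checked with a short Python sketch. The helper name `encoded_hex` is made up for this example, and Python's `utf-16-be` / `utf-16-le` codecs stand in for UCS-2 BE / LE, which is an assumption that holds for this character because U+200B lies in the Basic Multilingual Plane. The byte order mark itself is not included in the output.

```python
def encoded_hex(char, codec):
    """Return the bytes of `char` in `codec` as space-separated hex (Python 3.8+)."""
    return char.encode(codec).hex(" ")


zwsp = "\u200b"  # ZERO WIDTH SPACE

for label, codec in [
    ("UTF-8", "utf-8"),
    ("UCS-2 BE", "utf-16-be"),  # stand-in codec, see the note above
    ("UCS-2 LE", "utf-16-le"),  # stand-in codec, see the note above
]:
    print(f"{label:9} {encoded_hex(zwsp, codec)}")
```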


                  However, IF, within an ANSI file :

                  • You insert, on purpose, these 3 bytes, shown above, at any location

                  • Select the Notepad++ option Encoding > UTF-8 (NOT the option Encoding > Convert to UTF-8!)

                  • RE-save this file

                  You’ll create a UTF-8 file, containing one (or several) Zero-Width Space character(s)


                  Method to use :

                  • Open your ANSI encoded file

                  • Select the Edit > Character Panel option

                  • Move your caret to the location where you want to insert a Zero-Width Space character

                  • Click, successively, on the 3 characters, in the Character column of this panel, which are in the same rows as the Hex values E2, 80 and 8B, so the string â€‹ (as displayed with a Windows-1252 ANSI code page)

                  • Repeat the previous operation at any location where this special char is needed

                  • Select the Encoding > UTF-8 option

                  • Save your new UTF-8 file
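What this method does at the byte level can be sketched in Python. The sketch assumes the ANSI code page is Windows-1252 (`cp1252`), in which the bytes E2, 80 and 8B display as â, € and ‹; other code pages show different characters for the same bytes.

```python
# The 3 characters clicked in the Character Panel, written as escapes:
# U+00E2 (â), U+20AC (€) and U+2039 (‹).
typed_in_ansi = "\u00e2\u20ac\u2039"

# Assumption: the ANSI file is stored as Windows-1252, one byte per character.
raw_bytes = typed_in_ansi.encode("cp1252")

# Reinterpreting the same bytes as UTF-8 (no conversion) yields a single character.
reinterpreted = raw_bytes.decode("utf-8")

print(raw_bytes.hex(" "))
print(f"U+{ord(reinterpreted):04X}")
```

The bytes on disk never change; only the way they are interpreted does, which is why Encoding > UTF-8 is used rather than Convert to UTF-8.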


                  Now, to find the location of all the Zero-Width Space characters, simply use the Mark feature and search for \x{200B}, with the Regular expression mode checked
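For text handled outside Notepad++, a rough equivalent of that search can be sketched in Python. Note that Python's `re` module writes the code point as `\u200b` rather than `\x{200B}`, and the function name `locate_zwsp` is made up for this example.

```python
import re

# Python spelling of the \x{200B} pattern used in Notepad++.
ZWSP_PATTERN = re.compile("\u200b")


def locate_zwsp(text):
    """Return 1-based (line, column) pairs for every Zero Width Space in `text`."""
    positions = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in ZWSP_PATTERN.finditer(line):
            positions.append((line_no, match.start() + 1))
    return positions


# The two rows from the question, in reverse order, with the hidden character as an escape.
print(locate_zwsp("shinobi_\ns\u200bhinobi_"))
```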

                  Best Regards,

                  guy038

                  PS :

                  Refer to this page :

                  https://www.unicode.org/charts/PDF/U2000.pdf

                  • M
                    mahengrui1 @guy038 last edited by

                    @guy038 It looks good; however, I don’t have an ANSI file, so I can’t try it out.
