How do I recongnize slightly different coding characters?

8 Posts 4 Posters 1.4k Views
  • M
    mahengrui1
    last edited by Aug 1, 2022, 7:05 AM

    Hello, I believed the text shown in Notepad++ was the ‘purest’ form, but now I find that two strings that look identical are not always equal. For example:

    s​hinobi_
    shinobi_

    1st row has byte values (decimal) 115 226 128 139 104 105 110 111 98 105 95
    2nd row has byte values (decimal) 115 104 105 110 111 98 105 95
    1st row has utf-8 \x73\xe2\x80\x8b\x68\x69\x6e\x6f\x62\x69\x5f
    2nd row has utf-8 \x73\x68\x69\x6e\x6f\x62\x69\x5f

    I’d say the 1st row contains a hidden character, and I want to know the correct name for this type of character. What is a quick way to detect them? My concern is that others could use such hidden content to fool me.

    In Notepad++ I can see the length and Pos values change, but in a long document it is impossible to spot them that way.
    Thank you

    • E
      Ekopalypse @mahengrui1
      last edited by Ekopalypse Aug 1, 2022, 9:56 AM Aug 1, 2022, 9:55 AM

      @mahengrui1

      The byte sequence \xe2\x80\x8b is the UTF-8 encoding of the Zero Width Space code point (U+200B).
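
A minimal Python sketch of how this can be checked, assuming Python 3 is at hand. It decodes the bytes quoted in the question and prints the official Unicode name of every code point; the variable names are illustrative only.

```python
import unicodedata

# Bytes of the first row, copied from the question.
raw = b"\x73\xe2\x80\x8b\x68\x69\x6e\x6f\x62\x69\x5f"
text = raw.decode("utf-8")

# Print each code point together with its Unicode name.
for ch in text:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch, '<unnamed>')}")

# The second character is the invisible one.
hidden = text[1]
```

The three bytes \xe2\x80\x8b collapse into a single character, which is why the visible text looks the same while the length differs.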

      One way to detect such characters would be to have the editor display a different, visible symbol for these code points, as described here.
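
Outside the editor, a similar check can be sketched in Python. This is only an illustration under one assumption: that "hidden" characters are those in the Unicode general category Cf (format characters), which includes U+200B. The helper name `find_format_chars` is hypothetical.

```python
import unicodedata


def find_format_chars(text):
    """Return (line, column, 'U+XXXX', name) for every Cf character in text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            # Cf = "format" characters: invisible, but part of the text.
            if unicodedata.category(ch) == "Cf":
                name = unicodedata.name(ch, "<unnamed>")
                hits.append((line_no, col, f"U+{ord(ch):04X}", name))
    return hits


# The two rows from the question; only the first contains U+200B.
sample = "s\u200bhinobi_\nshinobi_"
for hit in find_format_chars(sample):
    print(hit)
```

Note that category Cf also covers characters with legitimate uses (for example the zero width joiner in some scripts and emoji sequences), so each hit still needs a human decision.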

      • G
        guy038
        last edited by guy038 Aug 1, 2022, 11:38 AM Aug 1, 2022, 11:34 AM

        Hello, @mahengrui1, @ekopalypse and All,

        In addition to the valuable link provided by @ekopalypse, you may have a look at this other one:

        https://community.notepad-plus-plus.org/post/31761


        You may also use the online tool below:

        https://www.cogsci.ed.ac.uk/~richard/utf-8.cgi ?

        which gives you the main representations of a Unicode character and its exact name!

        Best Regards,

        guy038

        • M
          mahengrui1 @Ekopalypse
          last edited by Aug 1, 2022, 2:12 PM

          @Ekopalypse @guy038

          Thank you very much. Besides, a method occurred to me while I was reading a reply here (I’m sorry, I don’t know how to quote a specific reply):

          If I use Notepad++ Encoding > Convert to ANSI, it shows all Zero-Width characters, right? And if I Convert to UTF-8 back they all become normal ‘?’

          • A
            Alan Kilborn @mahengrui1
            last edited by Aug 1, 2022, 2:35 PM

            @mahengrui1 said in How do I recongnize slightly different coding characters?:

            If I use Notepad++ Encoding > Convert to ANSI, it shows all Zero-Width characters, right?

            Well, it will show unconvertible characters (because an “ANSI” encoding has no concept of most Unicode characters) as a ? character.

            And if I Convert to UTF-8 back they all become normal ‘?’

            Ah, so this leads me to believe you understand what is happening when the ? characters appear…

            And yes, converting back to UTF-8 will just keep the ? as a ?. I suppose your strategy is to then search for all ? and decide what to do for each one?
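
The round trip can be approximated in Python. This is a sketch, not a description of Notepad++ internals: it assumes the "ANSI" code page is Windows-1252 (`cp1252`) and uses Python's `errors="replace"` handler to mimic the substitution.

```python
text = "s\u200bhinobi_"

# "Convert to ANSI": characters with no cp1252 equivalent are replaced by "?".
ansi_bytes = text.encode("cp1252", errors="replace")

# "Convert to UTF-8" again: the "?" is now an ordinary question mark.
round_tripped = ansi_bytes.decode("cp1252")

print(ansi_bytes)
print(round_tripped)
```

One caveat with this strategy: after the round trip, a substituted ? cannot be told apart from a question mark that was already in the text, and the original character is lost, so it is safer to try it on a copy of the file.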

            • M
              mahengrui1 @Alan Kilborn
              last edited by Aug 1, 2022, 4:18 PM

              @Alan-Kilborn yes, it is my current strategy

              • G
                guy038
                last edited by guy038 Aug 1, 2022, 11:23 PM Aug 1, 2022, 11:16 PM

                Hi, @mahengrui1, @ekopalypse, @alan-kilborn and All,

                Well, an ANSI encoded file, which encodes every character with 1 byte only, will never be able to encode the ZERO-WIDTH SPACE character of code-point \x{200B}, which is encoded:

                • In a UTF-8 or UTF-8-BOM encoded file, with the 3 consecutive bytes \xE2, \x80 and \x8B

                • In a UCS-2 BE BOM encoded file, with the 2 consecutive bytes \x20 and \x0B

                • In a UCS-2 LE BOM encoded file, with the 2 consecutive bytes \x0B and \x20
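
These byte sequences can be reproduced with a short Python sketch. It uses the `utf-16-be` and `utf-16-le` codecs as stand-ins for UCS-2 BE and UCS-2 LE, which is an assumption that holds here because U+200B lies in the Basic Multilingual Plane, where both encodings give the same bytes.

```python
zwsp = "\u200b"  # ZERO WIDTH SPACE

# Encode the same code point with each codec and show the bytes in hex.
encoded = {codec: zwsp.encode(codec) for codec in ("utf-8", "utf-16-be", "utf-16-le")}

for codec, data in encoded.items():
    print(codec, " ".join(f"{b:02X}" for b in data))
```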


                However, IF, within an ANSI file :

                • You insert, on purpose, these 3 bytes, shown above, at any location

                • Select the Notepad++ option Encoding > UTF-8 (NOT the option Encoding > Convert to UTF-8!)

                • RE-save this file

                You’ll create a UTF-8 file containing one (or several) Zero-Width Space character(s)


                Method to use :

                • Open your ANSI encoded file

                • Select the Edit > Character Panel option

                • Move your caret to the location where you want to insert a Zero-Width Space character

                • Click, successively, on the 3 characters, in the Character column of this panel, which are in the same rows as the Hex values E2, 80 and 8B, so the string â€‹ (as displayed with a Windows-1252 ANSI code page)

                • Repeat the previous operation at any location where this special char is needed

                • Select the Encoding > UTF-8 option

                • Save your new UTF-8 file
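
Why this works can be sketched in Python, again assuming the ANSI code page is Windows-1252 (`cp1252`): the three ANSI characters are stored as the bytes E2 80 8B, and reading those same bytes as UTF-8 yields a single Zero-Width Space. The sample string is the one from the question, used for illustration only.

```python
# "s", then the three ANSI characters U+00E2, U+20AC, U+2039, then "hinobi_".
ansi_text = "s\u00e2\u20ac\u2039hinobi_"

# What the ANSI file contains on disk.
raw = ansi_text.encode("cp1252")

# Reinterpreting the unchanged bytes as UTF-8 (Encoding > UTF-8, not Convert).
as_utf8 = raw.decode("utf-8")

print(raw)
print([f"U+{ord(ch):04X}" for ch in as_utf8])
```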


                Now, to find the location of all the Zero-Width Space characters, simply use the Mark feature and search for \x{200B}, with the Regular expression mode checked
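
For files processed outside Notepad++, a comparable search can be sketched with Python's `re` module. Python's regex syntax does not accept `\x{200B}`, so the sketch uses the string escape `\u200b` instead; the helper name `locate_zwsp` is hypothetical.

```python
import re

ZWSP = re.compile("\u200b")


def locate_zwsp(text):
    """Return 1-based (line, column) positions of every U+200B in text."""
    positions = []
    for m in ZWSP.finditer(text):
        line = text.count("\n", 0, m.start()) + 1
        # Distance from the previous newline (or from the start of the text).
        col = m.start() - text.rfind("\n", 0, m.start())
        positions.append((line, col))
    return positions


sample = "shinobi_\ns\u200bhinobi_\n"
print(locate_zwsp(sample))

# Once reviewed, the characters can be stripped in one pass.
cleaned = ZWSP.sub("", sample)
```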

                Best Regards,

                guy038

                PS :

                Refer to this page :

                https://www.unicode.org/charts/PDF/U2000.pdf

                • M
                  mahengrui1 @guy038
                  last edited by Aug 3, 2022, 5:43 AM

                  @guy038 it looks good, however I don’t have an ANSI file so I can’t practice it
