    UCS-2 encoding problem

    Help wanted
    14 Posts 6 Posters 7.0k Views
    • Marek Jindra
      last edited by

      Thank you for the reply.

      Yes, saving as UCS-2 LE did save the correct bytes to disk.
      However, I also want to use NPP to verify whether the bytes in the file are correct. For now, I have to use other software, or an old version of NPP, because I am not able to view the Unicode file as raw bytes (an 8-bit encoding).
      Even the Hex-Editor plugin in NPP no longer works and does not show the real hex values in this situation.

      I sometimes used to view or edit binary files in NPP; that is not always reliable now.
      Imagine a binary file composed of ANSI parts and Unicode parts. Then there is no single correct encoding for the whole file, and several encodings might accidentally appear valid. I want to be able to switch between them.
      I need an editor that can both edit binary files and convert or re-interpret encodings.

      An invalid UTF-8 sequence would result in question marks or strange characters. I could use that to see which parts of my corrupted file are not valid UTF-8 sequences. I expected this to work because it worked in previous versions.

      I believe there are situations in which a file can be interpreted under several encodings at once and still produce human-readable content. Then it is just a matter of preference which encoding you show by default. Or you might have a partially broken file that is only readable if you select UTF-8, even though it contains several corrupted bytes.
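      (Not part of the original post: a short Python sketch illustrating the point above. The byte strings are made-up examples; the same bytes can decode cleanly under several encodings, while invalid sequences reveal themselves as replacement characters.)

```python
# One byte sequence, several plausible decodings.
data = b"caf\xc3\xa9"  # the UTF-8 encoding of "café"

print(data.decode("utf-8"))    # "café"  (the intended text)
print(data.decode("cp1252"))   # "café" (also displayable, but wrong)

# An invalid UTF-8 sequence shows up as U+FFFD replacement characters,
# which can be used to spot the corrupted parts of a file.
broken = b"caf\xe9"            # cp1252 / Latin-1 bytes for "café"
print(broken.decode("utf-8", errors="replace"))  # "caf�"
```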

      If NPP developers changed this function intentionally, I wish to have a setting to turn it off.

      • PeterJones
        last edited by

        @Marek-Jindra said:

        If NPP developers changed this function intentionally, I wish to have a setting to turn it off.

        If you wish to make a feature request or bug report, this FAQ explains how. You will probably want to reference this thread (https://notepad-plus-plus.org/community/topic/17196/ucs-2-encoding-problem) from your feature request, and it’s considered polite to paste a link to the feature request back in this discussion.

        • PeterJones
          last edited by

          @Marek-Jindra said:

          I expected it to work because it worked in the previous versions.

          Sometimes features change between versions. That’s why many people recommend not succumbing to upgraditis – if it’s not broke, don’t fix it. Others recommend doing every update, because of potential security problems – that’s great advice for front-facing applications like phone apps or web browsers, which do a lot of networking; but for local-focused applications like Notepad++, that’s not as critical.

          Since an older version works for you, you might consider re-installing the older version, and turning off auto-updates. In that case, you can either wait until your feature request is implemented and confirmed before upgrading, or just not bother upgrading.

          In the end, it’s up to you. Good luck.

          • Alan Kilborn
            last edited by

            BTW I have not found the hex editor plugin to be very good; in this case it may be best to use a separate hex editor. Notepad++, much as we want it to do everything and be good at everything, doesn't have the kind of development resources behind it to be all-powerful.

            • Meta Chuh moderator @Marek Jindra
              last edited by Meta Chuh

              @Marek-Jindra

              Now I have to use another software, or an old version of NPP, because I am not able to view the unicode file as bytes (8bit encoding).

              i get the same results on all tested notepad++ versions, from very old to newest.
              (5.9.3 ansi, 5.9.3 unicode, 7.5.5, 7.6.3)
              are you sure that it behaved differently on an old version of npp ?
              if yes, which version was it ?

              if you have time, you can download all older portable versions from here:
              https://notepad-plus-plus.org/download/all-versions.html
              (choose the zip packages. they will not interfere with your installed version)
              and find the version which did what you need.
              reason: as soon as you file an issue report, it might help if a notepad++ version whose behavior matches your expectation has ever existed, to serve as a reference.

              here are my test results:

              original content of "Pound.txt", saved as ucs-2 le bom, displayed as ucs-2 le bom:
              £1 = €1.17

              -----

              ansi/utf-8 view in notepad++ 7.5.5:

              encoding > encode in ansi:
              Â£1 = â‚¬1.17

              encoding > encode in utf-8:
              £1 = €1.17

              -----

              ansi/utf-8 view in notepad++ 7.6.3:

              encoding > encode in ansi:
              Â£1 = â‚¬1.17

              encoding > encode in utf-8:
              £1 = €1.17

              -----

              ansi/utf-8 view in notepad++ 5.9.3 unicode:

              encoding > encode in ansi:
              Â£1 = â‚¬1.17

              encoding > encode in utf-8:
              £1 = €1.17

              -----

              ansi/utf-8 view in notepad++ 5.9.3 ansi:

              encoding > encode in ansi:
              Â£1 = â‚¬1.17

              encoding > encode in utf-8:
              £1 = €1.17

              • Meta Chuh moderator @Alan Kilborn
                last edited by Meta Chuh

                i second @Alan-Kilborn with the separate hex editor (where are we now ? somewhere between 4096 and 65536 i guess ;-) )

                @Marek-Jindra @Alan-Kilborn @PeterJones and all:
                i currently use hxd 2.2.1 (https://mh-nexus.de/en/hxd/)
                which ones do you use ? maybe yours are even better for parsing character encodings, as hxd is good as a hex editor, but rather limited when it comes to file encodings.

                • Alan Kilborn @Meta Chuh
                  last edited by

                  @Meta-Chuh

                  Not sure hxd needs to be good at file encodings. I use it as well when I have the need to get to that level.

                  • PeterJones
                    last edited by

                    Apparently I haven’t needed a hex editor since my last computer upgrade at work, but when I do, HxD is what I use.

                    When all I need to do is do a quick hex dump, which I use much more often than a full-blown hex editor, I use the xxd that’s bundled with the windows version of gvim.

                    • Ekopalypse
                      last edited by

                      Yep, I have two run menu entries HxD and HxD load current document :-)

                      • Marek Jindra @Meta Chuh
                        last edited by

                        Thank you all for your input. I will also have a look at HxD.

                        @Meta-Chuh
                        I think this changed after I upgraded from NPP 7.5.9 to 7.6.2.
                        I am quite sure it behaved differently in the older version.
                        Now I tried the portable version and you are right, it behaves the same as the current version.
                        So it might be plugin-related or config-related.
                        I think I have got an older version of NPP on my other laptop, so I will investigate that and search for differences.

                        • guy038
                          last edited by

                          Hello, @marek-jindra, @peterjones, @meta-chuh, @alan-kilborn, @ekopalypse, and All,

                          I have an explanation for this behavior but, unfortunately, I cannot guarantee that it is the correct one :-/

                          I'm going to begin with some general notions. Then, I'll try to give you an accurate answer. I know, encodings are really a nightmare for every one of us :-((


                          If we write the string £1 = €1.17 in a new file, then use the Convert to UCS-2 LE BOM N++ option and save it as Pound.txt, the different bytes of this file, and their meaning, are as below :

                           BOM         £         1         SP        =        SP         €         1         .         1         7
                          -----      -----     -----     -----     -----     -----     -----     -----     -----     -----     -----
                          ff fe      a3 00     31 00     20 00     3d 00     20 00     ac 20     31 00     2e 00     31 00     37 00
                          

                          Everything is logical, here !
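                          For readers who want to reproduce this byte layout themselves, here is a small Python sketch ( Python is not part of the thread; note that, for BMP-only text, UCS-2 LE is byte-for-byte identical to UTF-16-LE ) :

```python
# Build the UCS-2 LE BOM byte layout shown above.
text = "£1 = €1.17"
data = b"\xff\xfe" + text.encode("utf-16-le")  # BOM + 16-bit LE code units

print(data.hex(" "))
# ff fe a3 00 31 00 20 00 3d 00 20 00 ac 20 31 00 2e 00 31 00 37 00
```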

                          • The UCS-2 encoding can only encode the Unicode characters of the BMP ( Basic Multilingual Plane ), in the range [\x{0000}-\x{D7FF}\x{E000}-\x{FFFF}], each in a single 16-bit code unit.

                          • The LE ( Little Endian ) terminology means that, for each 16-bit code unit, the least significant byte is written first and the most significant byte comes last.

                          • The BOM is an invisible Byte Order Mark, the Unicode character \x{FEFF}, logically written as the bytes FF FE according to the Little Endian rule, which identifies the byte order without ambiguity !

                          Refer to :

                          https://en.wikipedia.org/wiki/UTF-16

                          https://en.wikipedia.org/wiki/Endianness

                          Remarks :

                          • It’s important to point out that the two N++ encodings UCS-2 LE and UCS-2 BE cannot represent Unicode characters with code points over \x{FFFF}, i.e. outside the BMP ( Basic Multilingual Plane )

                          • In order to represent these characters ( for instance the emoji characters, in the range [\x{1F600}-\x{1F64F}] ), while keeping the two-byte architecture, the UTF-16 encoding ( BTW, the default Windows Unicode encoding ! ) codes each of them as two 16-bit units, called a surrogate pair

                          • These two 16-bit units lie in the range [\x{D800}-\x{DBFF}] ( high surrogates ) and in the range [\x{DC00}-\x{DFFF}] ( low surrogates ). Refer, below, for additional information :

                          https://en.wikipedia.org/wiki/UTF-16#U+010000_to_U+10FFFF

                          • This also means that, if your document contains characters with Unicode code points over \x{FFFF}, it must be saved with the N++ UTF-8 or UTF-8 BOM encodings !
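                          As an illustration only ( Python, not from the thread ), here is how the surrogate pair is computed for one emoji :

```python
# U+1F600 (grinning face) is outside the BMP, so UTF-16 uses a surrogate pair.
emoji = "\U0001F600"

print(emoji.encode("utf-16-le").hex(" "))  # 3d d8 00 de  (pair D83D DE00, LE)

# The pair can also be derived by hand from the code point:
cp = ord(emoji) - 0x10000        # offset into the supplementary planes
high = 0xD800 + (cp >> 10)       # high surrogate: 0xD83D
low = 0xDC00 + (cp & 0x3FF)      # low surrogate:  0xDE00
print(hex(high), hex(low))
```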

                          Now, Marek, let’s get back to your question :

                          By definition, re-interpreting a file under a new encoding ( the Encode in… options ) should not change the file contents; it should simply re-interpret the existing bytes according to the character map of that encoding

                          So, in theory, the result should be, strictly, as below ( I assume that the BOM is also ignored ) :

                                      £ NUL     1 NUL    SP NUL     = NUL    SP NUL     ¬ SP      1 NUL     . NUL     1 NUL     7 NUL
                          
                                     a3 00     31 00     20 00     3d 00     20 00     ac 20     31 00     2e 00     31 00     37 00
                          

                          Instead, after using the N++ Encode in ANSI option and saving the file, we get this strange layout :

                                      Â   £      1         SP        =        SP      â   ‚   ¬    1         .         1         7
                                     --  --     --         --       --        --     --  --  --   --        --        --        --
                                     c2  a3     31         20       3d        20     e2  82  ac   31        2e        31        37
                          

                          At first sight, we cannot see any logic ! Actually, two phases occur :

                          • Firstly, a transformation of the UCS-2 LE BOM representation of each character with code point > \x{007F} into the analogous UTF-8 representation of that character

                          • Secondly, the normal re-interpretation of these bytes in ANSI which, in my country ( France ), is essentially identical to the Windows-1252 encoding

                          So :

                          • The £ character, of Unicode code point \x{00A3}, represented in UTF-8 by the two-byte sequence C2 A3, is finally interpreted as the two ANSI characters Â and £

                          • The € character, of Unicode code point \x{20AC}, represented in UTF-8 by the three-byte sequence E2 82 AC, is finally interpreted as the three ANSI characters â, ‚ and ¬
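                          The two phases can be reproduced outside Notepad++; a minimal Python sketch, assuming the ANSI code page is Windows-1252 as in my locale :

```python
# Phase 1: the UCS-2 text is transcoded to UTF-8 bytes.
# Phase 2: those UTF-8 bytes are re-read through a Windows-1252 lens.
text = "£1 = €1.17"
mojibake = text.encode("utf-8").decode("cp1252")

print(mojibake)  # Â£1 = â‚¬1.17
```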

                          IMPORTANT : I don’t know if this behavior is a real bug or if some “hidden” rules could explain it :-(( In the meantime, we have to live with it !

                          Thus, when you performed your second operation, Encode in UTF-8, you saw, again, the £1 = €1.17 text, with the internal representation :

                                       £         1         SP        =        SP         €         1         .         1         7
                                     -----      --         --        -        --     --------     --        --        --        -- 
                                     c2 a3      31         20       3d        20     e2 82 ac     31        2e        31        37
                          

                          Now, let’s compare with some other N++ sequences of Encoding in / Convert to !

                          Let’s start, again, with your correct “Pound.txt” file, saved after the operation Convert to UCS-2 LE BOM :

                           BOM         £         1         SP        =        SP         €         1         .         1         7
                          -----      -----     -----     -----     -----     -----     -----     -----     -----     -----     -----
                          ff fe      a3 00     31 00     20 00     3d 00     20 00     ac 20     31 00     2e 00     31 00     37 00
                          

                          If we first use the Convert to UTF-8 BOM N++ option, we obtain the same text, with these byte contents :

                            BOM        £         1         SP        =        SP         €         1         .         1         7
                          --------   -----      --         --        -        --     --------     --        --        --        --
                          ef bb bf   c2 a3      31         20       3d        20     e2 82 ac     31        2e        31        37
                          

                          BTW, note that the beginning byte sequence EF BB BF is simply the UTF-8 representation of the Unicode character of the BOM ( \x{FEFF} )
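                          A quick Python check of that byte sequence ( not from the original post ) :

```python
# The UTF-8 BOM is just U+FEFF encoded in UTF-8: EF BB BF.
text = "£1 = €1.17"
data = "\ufeff".encode("utf-8") + text.encode("utf-8")

print(data.hex(" "))
# ef bb bf c2 a3 31 20 3d 20 e2 82 ac 31 2e 31 37
```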

                          Then, after an Encode in ANSI operation, we get this layout, identical to what you obtained when switching directly from Convert to UCS-2 LE BOM to Encode in ANSI :

                                      Â   £      1         SP        =        SP      â   ‚   ¬    1         .         1         7
                                     --  --     --         --       --        --     --  --  --   --        --        --        --
                                     c2  a3     31         20       3d        20     e2  82  ac   31        2e        31        37
                          

                          Finally, let’s click on the Encode in UTF-8 BOM option again. We read, logically, the correct text £1 = €1.17, with the byte sequence :

                            BOM        £         1         SP        =        SP         €         1         .         1         7
                          --------   -----      --         --        -        --     --------     --        --        --        --
                          ef bb bf   c2 a3      31         20       3d        20     e2 82 ac     31        2e        31        37
                          

                          Now, if we click on the Convert to ANSI option, we get the same text £1 = €1.17, corresponding to :

                                       £         1         SP        =        SP         €         1         .         1         7
                                      --        --         --       --        --        --        --        --        --        --
                                      a3        31         20       3d        20        80        31        2e        31        37
                          

                          IMPORTANT :

                          Unlike the re-interpretation process ( Encode in… ), a conversion to a new encoding ( Convert to… ) does modify the file contents : it rewrites each character displayed in the current encoding using that character’s byte representation in the new encoding !
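                          The distinction can be sketched in Python ( an analogy, not N++’s actual code; cp1252 stands in for ANSI ) :

```python
text = "£1 = €1.17"
on_disk = text.encode("utf-16-le")  # the UCS-2 LE bytes (BOM omitted)

# "Encode in ...": re-interpret the SAME bytes under another character map;
# the file contents do not change, only the displayed text does.
displayed = on_disk.decode("cp1252")
assert on_disk == text.encode("utf-16-le")  # bytes untouched

# "Convert to ...": re-write each displayed character in the new encoding;
# the file contents DO change.
converted = text.encode("cp1252")
print(converted.hex(" "))  # a3 31 20 3d 20 80 31 2e 31 37
```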

                          Hope that my answer gives you some hints !

                          Best Regards,

                          guy038

                          I’m quite used to this tiny but very useful on-line UTF-8 tool :

                          http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?

                          Before typing anything in the input box, I advise you :

                          • To read the notes, carefully, at the end of the page

                          • To select the right type for your entry which, generally, will be either Interpret as Character or Interpret as Hex code point ( for instance, the character € or the Unicode value 20AC )

                          • Marek Jindra @guy038
                            last edited by

                            @guy038
                            Thank you for the explanation. You described very thoroughly what happens.

                            I think this behavior is very good for people who want to see readable text and not bother with encodings. It doesn’t corrupt the characters even if you tell it to do so.
                            But I think NPP is not showing me the truth : what the UCS-2 LE bytes really look like when interpreted as ANSI.

                            The Community of users of the Notepad++ text editor.