UCS-2 encoding problem



  • @Marek-Jindra said:

    I expected it to work because it worked in the previous versions.

    Sometimes features change between versions. That’s why many people recommend not succumbing to upgraditis – if it’s not broke, don’t fix it. Others recommend doing every update, because of potential security problems – that’s great advice for front-facing applications like phone apps or web browsers, which do a lot of networking; but for local-focused applications like Notepad++, that’s not as critical.

    Since an older version works for you, you might consider re-installing the older version, and turning off auto-updates. In that case, you can either wait until your feature request is implemented and confirmed before upgrading, or just not bother upgrading.

    In the end, it’s up to you. Good luck.



  • BTW, I have not found the hex editor plugin to be very good; in this case it’s probably best to use a separate hex editor. Much as we’d like Notepad++ to do everything and be good at all of it, it doesn’t have the kind of development resources behind it to be all-powerful.



  • @Marek-Jindra

    Now I have to use another software, or an old version of NPP, because I am not able to view the unicode file as bytes (8bit encoding).

    i get the same results on all tested notepad++ versions, from very old to newest.
    (5.9.3 ansi, 5.9.3 unicode, 7.5.5, 7.6.3)
    are you sure that it behaved differently on an old version of npp ?
    if yes, which version was it ?

    if you have time, you can download all older portable versions from here:
    https://notepad-plus-plus.org/download/all-versions.html
    (choose the zip packages. they will not interfere with your installed version)
    and find the version which did what you need.
    reason: if you ever file an issue report, it will help to be able to point at a notepad++ version whose source code behaves the way you expect, if such a version has ever existed.

    here are my test results:

    original content of "Pound.txt", saved as ucs-2 le bom, displayed as ucs-2 le bom:
    £1 = €1.17
    
    -----
    
    ansi/utf-8 view in notepad++ 7.5.5:
    
    encoding > encode in ansi:
    £1 = €1.17
    
    encoding > encode in utf-8:
    £1 = €1.17
    
    -----
    
    ansi/utf-8 view in notepad++ 7.6.3:
    
    encoding > encode in ansi:
    £1 = €1.17
    
    encoding > encode in utf-8:
    £1 = €1.17
    
    -----
    
    ansi/utf-8 view in notepad++ 5.9.3 unicode:
    
    encoding > encode in ansi:
    £1 = €1.17
    
    encoding > encode in utf-8:
    £1 = €1.17
    
    -----
    
    ansi/utf-8 view in notepad++ 5.9.3 ansi:
    
    encoding > encode in ansi:
    £1 = €1.17
    
    encoding > encode in utf-8:
    £1 = €1.17
    


  • i second @Alan-Kilborn with the separate hex editor (where are we now ? somewhere between 4096 and 65536 i guess ;-) )

    @Marek-Jindra @Alan-Kilborn @PeterJones and all:
    i currently use hxd 2.2.1 (https://mh-nexus.de/en/hxd/)
    which ones do you use ? maybe yours are even better for parsing character encodings, as hxd is good as a hex editor, but rather limited when it comes to file encodings.



  • @Meta-Chuh

    Not sure hxd needs to be good at file encodings. I use it as well when I have the need to get to that level.



  • Apparently I haven’t needed a hex editor since my last computer upgrade at work, but when I do, HxD is what I use.

    When all I need is a quick hex dump, which is much more often than a full-blown hex editor, I use the xxd that’s bundled with the windows version of gvim.
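    In case xxd is not at hand, a quick hex dump is easy to script as well. Here is a minimal Python sketch ( the hexdump name and output layout are my own choices, not a standard tool ) :

```python
# Minimal hex dump: offset, hex bytes, and an ASCII column,
# similar in spirit to the output of xxd.
def hexdump(data, width=16):
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hexpart = " ".join(f"{b:02x}" for b in chunk)
        asciipart = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}: {hexpart:<{width * 3}} {asciipart}")
    return "\n".join(lines)

# Example: dump the UCS-2 LE BOM bytes of the "Pound.txt" sample.
print(hexdump("\ufeff£1 = €1.17".encode("utf-16-le")))
```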



  • Yep, I have two run menu entries HxD and HxD load current document :-)



  • Thank you all for your input. I will also have a look at the HxD.

    @Meta-Chuh
    I think this changed after I upgraded from NPP 7.5.9 to 7.6.2.
    I am quite sure it behaved differently in the older version.
    Now I tried the portable version and you are right, it behaves the same as the current version.
    So it might be plugin-related or config-related.
    I think I have got an older version of NPP on my other laptop, so I will investigate that and search for differences.



  • Hello, @marek-jindra, @peterjones, @meta chuh, @alan-kilborn, @ekopalypse, and All,

    I have an explanation for this behavior but, unfortunately, I cannot confirm that it is the correct one :-/

    I’m going to begin with some general notions. Then, I’ll try to give you an accurate answer. I know, encodings are really a nightmare for every one of us :-((


    If we write the string £1 = €1.17 in a new file, then use the Convert to UCS-2 LE BOM N++ option and save it as pound.txt, the bytes of this file and their meaning are as below :

     BOM         £         1         SP        =        SP         €         1         .         1         7
    -----      -----     -----     -----     -----     -----     -----     -----     -----     -----     -----
    ff fe      a3 00     31 00     20 00     3d 00     20 00     ac 20     31 00     2e 00     31 00     37 00
    

    Everything logical, here !
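    As a quick cross-check, the same byte layout can be reproduced in Python ( a sketch; Python’s utf-16-le codec matches the N++ UCS-2 LE encoding for BMP characters ) :

```python
import codecs

# UCS-2 LE BOM layout: the BOM comes first, then each BMP character
# as a 16-bit code unit with the least significant byte written first.
data = codecs.BOM_UTF16_LE + "£1 = €1.17".encode("utf-16-le")
print(data.hex(" "))
# ff fe a3 00 31 00 20 00 3d 00 20 00 ac 20 31 00 2e 00 31 00 37 00
```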

    • The UCS-2 encoding can only encode the Unicode characters of the BMP ( Basic Multilingual Plane ) of the range [\x{0000}-\x{D7FF}\x{E000}-\x{FFFF}] in a 16-bits code unit.

    • The LE terminology means that, for each 16-bit code unit, the least significant byte ( containing the least significant bits ) is written first and the most significant byte comes last

    • The BOM is an invisible Byte Order Mark, the Unicode character \x{FEFF}, logically written FF FE according to the Little Endian rule, which identifies the byte order without ambiguity !

    Refer to :

    https://en.wikipedia.org/wiki/UTF-16

    https://en.wikipedia.org/wiki/Endianness

    Remarks :

    • It’s important to point out that the two N++ encodings UCS-2 LE and UCS-2 BE cannot represent Unicode characters with code-points over \x{FFFF}, i.e. beyond the BMP ( Basic Multilingual Plane )

    • In order to represent these characters ( for instance the emoticon characters, in range [\x{1F600}-\x{1F64F}] ), while keeping the two-byte architecture, the UTF-16 encoding ( BTW, the default Windows Unicode encoding ! ) codes them in two 16-bit units, called a surrogate pair

    • These two 16-bit units are located in the range [\x{D800}-\x{DBFF}] ( High surrogates ) and in the range [\x{DC00}-\x{DFFF}] ( Low surrogates ). Refer, below, for additional information :

    https://en.wikipedia.org/wiki/UTF-16#U+010000_to_U+10FFFF

    • This also means that, if your document contains characters with Unicode code-point over \x{FFFF}, it must be saved with the N++ UTF-8 or UTF-8 BOM encodings, exclusively !
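    The surrogate-pair arithmetic mentioned above can be reproduced directly ( a Python sketch; U+1F600 is just an example code point beyond the BMP ) :

```python
# Split a supplementary-plane code point into a UTF-16 surrogate pair.
cp = 0x1F600                    # an emoticon, outside the BMP
v = cp - 0x10000                # the 20-bit value to be split
high = 0xD800 + (v >> 10)       # high surrogate, range D800-DBFF
low = 0xDC00 + (v & 0x3FF)      # low surrogate, range DC00-DFFF
print(f"{high:04X} {low:04X}")  # D83D DE00

# Python's codec agrees, with LE byte order inside each 16-bit unit :
assert chr(cp).encode("utf-16-le").hex() == "3dd800de"
```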

    Now, Marek, let’s get back to your question :

    From the definition of an encoding, this process should not change the file contents but simply re-interpret them, according to the character map of this encoding

    So, in theory, it should be, strictly, as below ( I assume that the BOM is also ignored ) :

                £ NUL     1 NUL    SP NUL     = NUL    SP NUL     ¬ SP      1 NUL     . NUL     1 NUL     7 NUL
    
               a3 00     31 00     20 00     3d 00     20 00     ac 20     31 00     2e 00     31 00     37 00
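    This theoretical re-interpretation is easy to simulate ( a sketch; Python’s cp1252 codec is assumed here as the ANSI code page ) :

```python
# The raw UCS-2 LE payload (BOM dropped), re-read byte by byte as ANSI:
# every NUL byte survives, and AC 20 shows up as "¬ " instead of €.
raw = "£1 = €1.17".encode("utf-16-le")
print(repr(raw.decode("cp1252")))
# '£\x001\x00 \x00=\x00 \x00¬ 1\x00.\x001\x007\x00'
```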
    

    Instead, after using the N++ Encode in ANSI option and saving the file, we get this strange layout :

                Â   £      1         SP        =        SP      â   ‚   ¬    1         .         1         7
               --  --     --         --       --        --     --  --  --   --        --        --        --
               c2  a3     31         20       3d        20     e2  82  ac   31        2e        31        37
    

    At first sight, we cannot see any logic ! Actually, two phases occur :

    • Firstly, a transformation of the UCS-2 LE BOM representation of characters with code-point > \x{007F} into the analogous UTF-8 representation of these characters

    • Secondly, the normal re-interpretation of these bytes in ANSI, which is, by the way, practically identical to the Windows-1252 encoding, in my country ( France )

    So :

    • The £ character, of Unicode code-point \x{00A3} and represented, in UTF-8, with the two-byte sequence C2 A3, is finally interpreted as the two ANSI characters Â and £

    • The € character, of Unicode code-point \x{20AC} and represented, in UTF-8, with the three-byte sequence E2 82 AC, is finally interpreted as the three ANSI characters â, ‚ and ¬
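    Both re-interpretations can be verified in one line each ( a sketch; cp1252 is assumed as the ANSI code page ) :

```python
# UTF-8 bytes of £ (C2 A3) and € (E2 82 AC), each misread as ANSI.
assert "£".encode("utf-8").hex() == "c2a3"
assert "£".encode("utf-8").decode("cp1252") == "Â£"
assert "€".encode("utf-8").hex() == "e282ac"
assert "€".encode("utf-8").decode("cp1252") == "â‚¬"
print("€ seen through ANSI glasses:", "€".encode("utf-8").decode("cp1252"))
```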

    IMPORTANT : I don’t know if this behavior is a real bug or if some “hidden” rules could explain it :-(( In the meanwhile, we have to live with it !

    Thus, when you performed your second operation, Encode in UTF-8, you saw, again, the £1 = €1.17 text, with the internal representation :

                 £         1         SP        =        SP         €         1         .         1         7
               -----      --         --        -        --     --------     --        --        --        -- 
               c2 a3      31         20       3d        20     e2 82 ac     31        2e        31        37
    

    Now, let’s compare with some other N++ sequences of Encoding in / Convert to !

    Let’s start, again, with your correct “Pound.txt” file, saved after the operation Convert to UCS-2 LE BOM :

     BOM         £         1         SP        =        SP         €         1         .         1         7
    -----      -----     -----     -----     -----     -----     -----     -----     -----     -----     -----
    ff fe      a3 00     31 00     20 00     3d 00     20 00     ac 20     31 00     2e 00     31 00     37 00
    

    If we first use the Convert to UTF-8 BOM N++ option, we obtain the same text, with the byte contents :

      BOM        £         1         SP        =        SP         €         1         .         1         7
    --------   -----      --         --        -        --     --------     --        --        --        --
    ef bb bf   c2 a3      31         20       3d        20     e2 82 ac     31        2e        31        37
    

    BTW, note that the beginning byte sequence EF BB BF is simply the UTF-8 representation of the Unicode character of the BOM ( \x{FEFF} )
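    This, too, can be cross-checked ( a Python sketch ) :

```python
import codecs

# The UTF-8 encoding of U+FEFF is exactly the 3-byte sequence EF BB BF.
assert "\ufeff".encode("utf-8") == b"\xef\xbb\xbf" == codecs.BOM_UTF8

# Full file contents after Convert to UTF-8 BOM :
print(("\ufeff" + "£1 = €1.17").encode("utf-8").hex(" "))
# ef bb bf c2 a3 31 20 3d 20 e2 82 ac 31 2e 31 37
```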

    Then, after an Encode in ANSI operation, we get this layout, identical to what you obtained when changing directly from Convert to UCS-2 LE BOM to Encode in ANSI :

                Â   £      1         SP        =        SP      â   ‚   ¬    1         .         1         7
               --  --     --         --       --        --     --  --  --   --        --        --        --
               c2  a3     31         20       3d        20     e2  82  ac   31        2e        31        37
    

    To end with, let’s, again, click on the Encode in UTF-8 BOM option. We read, logically, the correct text £1 = €1.17, with the bytes sequence :

      BOM        £         1         SP        =        SP         €         1         .         1         7
    --------   -----      --         --        -        --     --------     --        --        --        --
    ef bb bf   c2 a3      31         20       3d        20     e2 82 ac     31        2e        31        37
    

    Now, if we click on the Convert to ANSI option, we get the same text £1 = €1.17, corresponding to :

                 £         1         SP        =        SP         €         1         .         1         7
                --        --         --       --        --        --        --        --        --        --
                a3        31         20       3d        20        80        31        2e        31        37
    

    IMPORTANT :

    Unlike the encoding process, a conversion to a new encoding does modify the file contents, trying to rewrite all the characters displayed in the current encoding according to the byte representation of these characters in the new desired encoding !
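    The difference between the two operations can be sketched in Python: a re-interpretation is a decode of the same bytes, while a conversion is a re-encode of the same characters ( cp1252 is assumed as the ANSI code page ) :

```python
text = "£1 = €1.17"

# Conversion: re-encode the characters in the target encoding; the
# bytes on disk change (£ becomes A3, € becomes the single byte 80).
converted = text.encode("cp1252")
print(converted.hex(" "))
# a3 31 20 3d 20 80 31 2e 31 37

# Round trip: decoding those bytes as cp1252 restores the text.
assert converted.decode("cp1252") == text
```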

    Hope that my answer gives you some hints !

    Best Regards,

    guy038

    I’m quite used to this tiny but very useful on-line UTF-8 tool :

    http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?

    Before typing anything in the input zone, I advise you :

    • To read the notes carefully, at the end of the page

    • To select the right type for your entry, which, generally, will be either Interpret as Character or Interpret as Hex code point ( for instance, the € character or the Unicode value 20AC )



  • @guy038
    Thank you for the explanation. You described very thoroughly what happens.

    I think this behavior is very good for people who want to see readable text and not bother with encodings. It doesn’t corrupt the characters even if you tell it to do so.
    But I think NPP is not showing me the truth, i.e. how the UCS-2 LE file really looks when interpreted as ANSI.

