UCS-2 encoding problem

PeterJones

I can confirm that behavior. I think you’re the first poster who has complained that changing the encoding doesn’t mess up the display in the way that the poster expects. Most are starting with messed-up behavior, and trying to fix some unexpected interpretation.

If saving it as UCS-2 LE makes the file have the correct bytes on the disk, why do you care what it looks like when you force the wrong encoding to be interpreted? What are you really trying to accomplish here?

And, really, given that the £ (U+00A3) is stored as the bytes A3 00 in UCS-2 LE, and that those two bytes are not a valid UTF-8 sequence, why do you have any expectations as to the UTF-8 interpretation?
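
For anyone who wants to check that claim outside Notepad++, here is a minimal Python sketch. It assumes Python's `utf-16-le` codec as a stand-in for UCS-2 LE (the two give identical bytes for BMP characters), and the helper name `is_valid_utf8` is made up for illustration.

```python
# Encode the pound sign the way UCS-2 LE stores it (assumption: utf-16-le
# is used as a stand-in, which is identical for BMP characters).
pound_bytes = "£".encode("utf-16-le")
print(pound_bytes.hex(" "))  # a3 00


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xA3 is a UTF-8 continuation byte with no lead byte before it,
# so the two-byte sequence A3 00 is rejected.
print(is_valid_utf8(pound_bytes))          # False
print(is_valid_utf8("£".encode("utf-8")))  # True (bytes C2 A3)
```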

Marek Jindra

Thank you for the reply.

Yes, saving as UCS-2 LE did save the correct bytes to disk.
However, I also want to use NPP to verify if the bytes in the file are correct. Now I have to use another software, or an old version of NPP, because I am not able to view the unicode file as bytes (8bit encoding).
Even the Hex-editor plugin in NPP does not work anymore and does not show the real hex values in this situation.

Sometimes I used to view or edit binary files in NPP. That is not always reliable now.
Imagine a situation where a binary file is composed of ANSI parts and Unicode parts. Then there is no perfect encoding for the whole file, and several encodings might accidentally seem to be valid. I want to be able to switch between them.
I need an editor to both edit binary files and convert encodings/reinterpret encodings.

An invalid UTF-8 sequence could result in some question marks or strange characters. I could use this unlikely scenario to view which parts of my corrupted file are not valid UTF-8 sequences. I expected it to work because it worked in the previous versions.
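
As a rough illustration of that use case outside NPP, the following Python sketch lists the byte offsets where strict UTF-8 decoding fails. The function name `invalid_utf8_offsets` and the sample bytes are invented for this example; it is not how Notepad++ works internally.

```python
def invalid_utf8_offsets(data: bytes) -> list:
    """Return byte offsets at which strict UTF-8 decoding fails (hypothetical helper)."""
    offsets = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as err:
            bad = pos + err.start  # err.start is relative to the slice
            offsets.append(bad)
            pos = bad + 1  # skip the offending byte and resume
    return offsets


# A lone 0xA3 at offset 3 is invalid; the later C2 A3 pair is a valid "£".
sample = b"ok \xa3 then \xc2\xa3 fine"
print(invalid_utf8_offsets(sample))              # [3]
print(sample.decode("utf-8", errors="replace"))  # the bad byte shows as U+FFFD
```

Re-slicing the input after every error is quadratic in the worst case, which should be acceptable for small files but not for large binaries.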

I believe there are situations where a file can be interpreted in several encodings and still produce human-readable content. Then it is just a matter of preference which encoding you show by default. Or you might have a partially broken file, which is only readable if you select UTF-8, even though it contains several corrupted bytes.

If NPP developers changed this function intentionally, I wish to have a setting to turn it off.

PeterJones

@Marek-Jindra said:

If NPP developers changed this function intentionally, I wish to have a setting to turn it off.

If you wish to make a feature request or bug report, this FAQ explains how. You will probably want to reference this thread (https://notepad-plus-plus.org/community/topic/17196/ucs-2-encoding-problem) from your feature request, and it’s considered polite to paste a link to the feature request back in this discussion.

PeterJones

@Marek-Jindra said:

I expected it to work because it worked in the previous versions.

Sometimes features change between versions. That’s why many people recommend not succumbing to upgraditis – if it’s not broke, don’t fix it. Others recommend doing every update, because of potential security problems – that’s great advice for front-facing applications like phone apps or web browsers, which do a lot of networking; but for local-focused applications like Notepad++, that’s not as critical.

Since an older version works for you, you might consider re-installing the older version, and turning off auto-updates. In that case, you can either wait until your feature request is implemented and confirmed before upgrading, or just not bother upgrading.

In the end, it’s up to you. Good luck.

Alan Kilborn

BTW I have not found the hex editor plugin to be very good; in this case maybe best to use a separate hex editor. Notepad++, while we want it to do and be good at all things, isn’t the type of program with the necessary kinds of resources behind its development to support being all-powerful.

Meta Chuh

@Marek-Jindra

Now I have to use another software, or an old version of NPP, because I am not able to view the unicode file as bytes (8bit encoding).

i get the same results on all tested notepad++ versions, from very old to newest.
(5.9.3 ansi, 5.9.3 unicode, 7.5.5, 7.6.3)
are you sure that it behaved differently on an old version of npp ?
if yes, which version was it ?

if you have time, you can download all older portable versions from here:
https://notepad-plus-plus.org/download/all-versions.html
(choose the zip packages. they will not interfere with your installed version)
and find the version which did what you need.
reason: if you file an issue report, it helps to be able to point to a notepad++ version whose source code behaves the way you expect, if such a version has ever existed.

here are my test results:

original content of "Pound.txt", saved as ucs-2 le bom, displayed as ucs-2 le bom:
£1 = €1.17

-----

ansi/utf-8 view in notepad++ 7.5.5:

encoding > encode in ansi:
Â£1 = â‚¬1.17

encoding > encode in utf-8:
£1 = €1.17

-----

ansi/utf-8 view in notepad++ 7.6.3:

encoding > encode in ansi:
Â£1 = â‚¬1.17

encoding > encode in utf-8:
£1 = €1.17

-----

ansi/utf-8 view in notepad++ 5.9.3 unicode:

encoding > encode in ansi:
Â£1 = â‚¬1.17

encoding > encode in utf-8:
£1 = €1.17

-----

ansi/utf-8 view in notepad++ 5.9.3 ansi:

encoding > encode in ansi:
Â£1 = â‚¬1.17

encoding > encode in utf-8:
£1 = €1.17

Meta Chuh

i second @Alan-Kilborn with the separate hex editor (where are we now ? somewhere between 4096 and 65536 i guess ;-) )

@Marek-Jindra @Alan-Kilborn @PeterJones and all:
i currently use hxd 2.2.1 (https://mh-nexus.de/en/hxd/)
which ones do you use ? maybe yours are even better for parsing character encodings, as hxd is good as a hex editor, but rather limited when it comes to file encodings.

Alan Kilborn

@Meta-Chuh

Not sure hxd needs to be good at file encodings. I use it as well when I have the need to get to that level.

PeterJones

Apparently I haven’t needed a hex editor since my last computer upgrade at work, but when I do, HxD is what I use.

When all I need to do is do a quick hex dump, which I use much more often than a full-blown hex editor, I use the xxd that’s bundled with the windows version of gvim.

Ekopalypse

Yep, I have two run menu entries HxD and HxD load current document :-)

Marek Jindra

Thank you all for your input. I will also have a look at the HxD.

@Meta-Chuh
I think this changed after I upgraded from NPP 7.5.9 to 7.6.2.
I am quite sure it behaved differently in the older version.
Now I tried the portable version and you are right, it behaves the same as the current version.
So it might be plugin-related or config-related.
I think I have got an older version of NPP on my other laptop, so I will investigate that and search for differences.

guy038

Hello, @marek-jindra, @peterjones, @meta chuh, @alan-kilborn, @ekopalypse, and All,

I have an explanation of this behavior, but, unfortunately, I cannot confirm that it is the correct one :-/

I’m going to begin with some general notions. Then, I’ll try to give you an accurate answer. I know, encodings are really a nightmare for every one of us :-((

If we write the string £1 = €1.17, in a new file, then use the Convert to UCS-2 LE BOM N++ option and save it as pound.txt, the bytes of this file and their meaning are as below :

 BOM         £         1         SP        =        SP         €         1         .         1         7
-----      -----     -----     -----     -----     -----     -----     -----     -----     -----     -----
ff fe      a3 00     31 00     20 00     3d 00     20 00     ac 20     31 00     2e 00     31 00     37 00

Everything logical, here !

The UCS-2 encoding can only encode the Unicode characters of the BMP ( Basic Multilingual Plane ) of the range [\x{0000}-\x{D7FF}\x{E000}-\x{FFFF}] in a single 16-bit code unit.
The LE terminology means that, for each character, the least significant byte ( containing the least significant bits ) is written first and the most significant byte comes last
The BOM is an invisible Byte Order Mark, the Unicode character \x{FEFF}, logically written FF FE according to the Little Endian rule, which identifies the byte order, without ambiguity !
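
A minimal Python sketch that reproduces the byte table above. It assumes Python's `utf-16-le` codec as a stand-in for UCS-2 LE, which gives the same bytes as long as every character is in the BMP:

```python
import codecs

text = "£1 = €1.17"

# BOM (U+FEFF) followed by each character as one 16-bit little-endian code unit.
# Assumption: utf-16-le stands in for UCS-2 LE (identical for BMP characters).
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
print(data.hex(" "))
# ff fe a3 00 31 00 20 00 3d 00 20 00 ac 20 31 00 2e 00 31 00 37 00

# For contrast, big-endian order puts the most significant byte first.
print("£".encode("utf-16-be").hex(" "))  # 00 a3
```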

Refer to :

https://en.wikipedia.org/wiki/UTF-16

https://en.wikipedia.org/wiki/Endianness

Remarks :

It’s important to point out that the two N++ encodings UCS-2 LE and UCS-2 BE cannot represent Unicode characters with code-points over \x{FFFF}, so outside the BMP ( Basic Multilingual Plane )
In order to represent these characters ( for instance the emoticon characters, in range [\x{1F600}-\x{1F64F}] ), while keeping the two-byte architecture, the UTF-16 encoding ( BTW, the default Windows Unicode encoding ! ) codes them in two 16-bit units, called a surrogate pair
These two 16-bit units are located in range [\x{D800}-\x{DBFF}] ( High surrogates ) and in range [\x{DC00}-\x{DFFF}] ( Low surrogates ). Refer, below, for additional information :

https://en.wikipedia.org/wiki/UTF-16#U+010000_to_U+10FFFF
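
As a worked example of the surrogate-pair arithmetic described on that page, here is a small Python sketch. The function name `surrogate_pair` is made up for illustration; the character used is \x{1F600}, the first one of the emoticon range.

```python
def surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# Python's own UTF-16 LE encoder agrees (each unit is written low byte first).
print("\U0001F600".encode("utf-16-le").hex(" "))  # 3d d8 00 de
```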

This also means that, if your document contains characters with Unicode code-point over \x{FFFF}, it must be saved, exclusively, with the N++ UTF-8 or UTF-8 BOM encodings !

Now, Marek, let’s get back to your question :

By definition, an Encode in operation should not change the file contents but simply re-interpret them, according to the character map of the chosen encoding

So, in theory, it should be, strictly, as below ( I assume that the BOM is also ignored ) :

            £ NUL     1 NUL    SP NUL     = NUL    SP NUL     ¬ SP      1 NUL     . NUL     1 NUL     7 NUL

           a3 00     31 00     20 00     3d 00     20 00     ac 20     31 00     2e 00     31 00     37 00
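
This strict re-interpretation can be reproduced with a short Python sketch. It assumes that ANSI means Windows-1252 (code page `cp1252`) and uses `utf-16-le` as a stand-in for UCS-2 LE; the BOM is left out, as in the table above.

```python
# File bytes without the BOM (assumption: utf-16-le stands in for UCS-2 LE).
raw = "£1 = €1.17".encode("utf-16-le")

# Pure re-interpretation: every single byte becomes one ANSI character.
# Assumption: ANSI is Windows-1252 here.
as_ansi = raw.decode("cp1252")
print(repr(as_ansi))
# '£\x001\x00 \x00=\x00 \x00¬ 1\x00.\x001\x007\x00'

# Without the NUL characters, the € (bytes AC 20) shows up as "¬" plus a space.
print(as_ansi.replace("\x00", ""))  # £1 = ¬ 1.17
```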

Instead, after using the N++ Encode in ANSI option and saving the file, we get this strange layout :

            Â   £      1         SP        =        SP      â   ‚   ¬    1         .         1         7
           --  --     --         --       --        --     --  --  --   --        --        --        --
           c2  a3     31         20       3d        20     e2  82  ac   31        2e        31        37

At first sight, we cannot see any logic ! Actually, two phases occur :

Firstly, a transformation of the UCS-2 LE BOM representation of characters, with code-point > \x{007F}, into the equivalent UTF-8 representation of these characters
Secondly, the normal re-interpretation of these bytes in ANSI, which is, by the way, quite identical to the Windows-1252 encoding, in my country ( France )

So :

The £ character, of Unicode code-point \x{00A3}, and represented, in UTF-8, with the two-byte sequence C2 A3, is finally interpreted as the two ANSI characters Â and £
The € character, of Unicode code-point \x{20AC}, and represented, in UTF-8, with the three-byte sequence E2 82 AC, is finally interpreted as the three ANSI characters â, ‚ and ¬
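
The two phases can be imitated in a few lines of Python. This is only a sketch of the observed result, under the assumptions that ANSI is Windows-1252 (`cp1252`) and that Python's `utf-16` codec is an acceptable stand-in for UCS-2 LE BOM; it says nothing about how Notepad++ implements it.

```python
import codecs

# The file as saved on disk: BOM + 16-bit little-endian code units.
on_disk = codecs.BOM_UTF16_LE + "£1 = €1.17".encode("utf-16-le")

# Phase 1: the text is transformed into its UTF-8 representation.
text = on_disk.decode("utf-16")  # the BOM selects the byte order and is dropped
utf8_bytes = text.encode("utf-8")
print(utf8_bytes.hex(" "))  # c2 a3 31 20 3d 20 e2 82 ac 31 2e 31 37

# Phase 2: those UTF-8 bytes are re-interpreted as ANSI (assumed Windows-1252).
shown = utf8_bytes.decode("cp1252")
print(shown)  # Â£1 = â‚¬1.17
```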

IMPORTANT : I don’t know if this behavior is a real bug or if some “hidden” rules could explain it :-(( In the meanwhile, we have to live with it !

Thus, when you performed your second operation, Encode in UTF-8, you saw, again, the £1 = €1.17 text, with the internal representation :

             £         1         SP        =        SP         €         1         .         1         7
           -----      --         --        -        --     --------     --        --        --        -- 
           c2 a3      31         20       3d        20     e2 82 ac     31        2e        31        37

Now, let’s compare with some other N++ sequences of Encoding in / Convert to !

Let’s start, again, with your correct “Pound.txt” file, saved after the operation Convert to UCS-2 LE BOM :

 BOM         £         1         SP        =        SP         €         1         .         1         7
-----      -----     -----     -----     -----     -----     -----     -----     -----     -----     -----
ff fe      a3 00     31 00     20 00     3d 00     20 00     ac 20     31 00     2e 00     31 00     37 00

If we use the Convert to UTF-8 BOM N++ option, first, we obtain, the same text, with the byte contents :

  BOM        £         1         SP        =        SP         €         1         .         1         7
--------   -----      --         --        -        --     --------     --        --        --        --
ef bb bf   c2 a3      31         20       3d        20     e2 82 ac     31        2e        31        37

BTW, note that the beginning byte sequence EF BB BF is simply the UTF-8 representation of the Unicode character of the BOM ( \x{FEFF} )

Then, after an Encode in ANSI operation, we get this layout, identical to what you obtained when going, directly, from Convert to UCS-2 LE BOM to Encode in ANSI :

            Â   £      1         SP        =        SP      â   ‚   ¬    1         .         1         7
           --  --     --         --       --        --     --  --  --   --        --        --        --
           c2  a3     31         20       3d        20     e2  82  ac   31        2e        31        37

To end with, let’s, again, click on the Encode in UTF-8 BOM option. We read, logically, the correct text £1 = €1.17, with the bytes sequence :

  BOM        £         1         SP        =        SP         €         1         .         1         7
--------   -----      --         --        -        --     --------     --        --        --        --
ef bb bf   c2 a3      31         20       3d        20     e2 82 ac     31        2e        31        37

Now, if we click on the Convert to ANSI option, we get the same text £1 = €1.17, corresponding to :

             £         1         SP        =        SP         €         1         .         1         7
            --        --         --       --        --        --        --        --        --        --
            a3        31         20       3d        20        80        31        2e        31        37

IMPORTANT :

Unlike the encoding process, a conversion to a new encoding does modify file contents, trying to write all the characters displayed, in current encoding, according to the byte representation, of these characters, in the new desired encoding !
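
To make the difference concrete, here is a hedged Python sketch of the two kinds of operation, again assuming that ANSI is Windows-1252 (`cp1252`). The variable names are only illustrative.

```python
text = "£1 = €1.17"
utf8_bytes = text.encode("utf-8")

# "Encode in ANSI": the bytes stay the same, only their interpretation changes.
reinterpreted = utf8_bytes.decode("cp1252")
print(reinterpreted)                                    # Â£1 = â‚¬1.17
print(reinterpreted.encode("cp1252") == utf8_bytes)     # True, bytes untouched

# "Convert to ANSI": the text stays the same, the bytes are rewritten.
converted = text.encode("cp1252")
print(converted.hex(" "))                               # a3 31 20 3d 20 80 31 2e 31 37
print(converted.decode("cp1252") == text)               # True, text untouched
```

Note that a conversion can only succeed when every character exists in the target encoding, which is the case here because both £ and € are part of Windows-1252.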

Hope that my answer gives you some hints !

Best Regards,

guy038

I’m quite used to this tiny but very useful on-line UTF-8 tool :

http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?

Before typing anything in the zone, I advise you :

To read the notes, carefully, at end of the page
To select the right type of your entry which, generally, will be, either, Interpret as Character or Interpret as Hex code point ( For instance, character € or Unicode value 20AC )

Marek Jindra

@guy038
Thank you for the explanation. You described very thoroughly what happens.

I think this behavior is very good for people who want to see readable text and not bother with encodings. It doesn’t corrupt the characters even if you tell it to do so.
But I think NPP is not showing me the truth about what the UCS-2 LE file really looks like when interpreted as ANSI.