UCS-2 encoding problem
-
Hi,
I have encountered a weird encoding behavior in Notepad++ 7.6.3.
• Create a new file
• Type £1 = €1.17
• Convert it to UCS-2 Little Endian
• Save it as Pound.txt
• Change the encoding to ANSI
• Actual result: Â£1 = â‚¬1.17 (looks like UTF-8 bytes)
• Expected result: ÿþ£ 1 = ¬ 1 . 1 7 (should look like UTF-16 bytes)
• Change the encoding to UTF-8
• Actual result: £1 = €1.17 (does not change at all)
• Expected result: ��1 = ̠1 . 1 7 (should have broken non-ASCII characters)
• If I view the file in my external hex editor, it shows the UTF-16 bytes properly.
Is it a bug?
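For reference, this is what I mean by "UTF-16 bytes" vs "UTF-8 bytes"; a quick Python sketch (nothing Notepad++-specific, just the raw encodings):

```python
text = "£1 = €1.17"

# Bytes a "UCS-2 little endian" file should contain (BOM + UTF-16 LE):
print((b"\xff\xfe" + text.encode("utf-16-le")).hex(" "))
# ff fe a3 00 31 00 20 00 3d 00 20 00 ac 20 31 00 2e 00 31 00 37 00

# The same text encoded as UTF-8, for comparison:
print(text.encode("utf-8").hex(" "))
# c2 a3 31 20 3d 20 e2 82 ac 31 2e 31 37
```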
-
I can confirm that behavior. I think you’re the first poster who has complained that changing the encoding doesn’t mess up the display in the way they expect; most start with a messed-up display and are trying to fix some unexpected interpretation.
If saving as UCS-2 LE puts the correct bytes on disk, why do you care what the file looks like when you force it to be interpreted in the wrong encoding? What are you really trying to accomplish here?
And, really, given that the £ (U+00A3) is stored as the bytes
A3 00
in UCS-2 LE, and that those two bytes are not a valid UTF-8 sequence, why do you have any expectations as to the UTF-8 interpretation?
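If you want to check that outside of Notepad++, a small Python sketch shows exactly how a strict UTF-8 decoder reacts to those bytes:

```python
raw = b"\xa3\x00"  # the UCS-2 LE code unit for £

try:
    raw.decode("utf-8")                 # 0xA3 is a continuation byte with no lead byte
except UnicodeDecodeError as err:
    print(err)                          # ... invalid start byte

print(raw.decode("utf-8", errors="replace"))  # '\ufffd\x00' -- replacement char + NUL
```
-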
Thank you for the reply.
Yes, saving as UCS-2 LE did save the correct bytes to disk.
However, I also want to use NPP to verify that the bytes in the file are correct. Now I have to use other software, or an old version of NPP, because I am not able to view the Unicode file as bytes (in an 8-bit encoding).
Even the Hex-Editor plugin in NPP does not work anymore and does not show the real hex values in this situation. I sometimes used to view or edit binary files in NPP; that is not always reliable now.
Imagine a binary file composed of ANSI parts and Unicode parts. Then there is no perfect encoding for the whole file, and several encodings might accidentally seem valid. I want to be able to switch between them.
I need an editor that can both edit binary files and convert/reinterpret encodings. An invalid UTF-8 sequence should result in some question marks or strange characters; I could use this unlikely scenario to see which parts of my corrupted file are not valid UTF-8 sequences. I expected it to work because it worked in previous versions.
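To make concrete what I mean, here is the kind of check I want the editor to let me do; a rough Python sketch (the file name is made up, and the scan is naive but good enough):

```python
def invalid_utf8_offsets(data: bytes):
    """Yield the byte offset of each byte that breaks strict UTF-8 decoding."""
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            return                       # the remainder decodes cleanly
        except UnicodeDecodeError as err:
            yield pos + err.start        # offset of the offending byte
            pos += err.start + 1         # skip it and keep scanning

with open("corrupted.bin", "rb") as f:
    for offset in invalid_utf8_offsets(f.read()):
        print(f"invalid UTF-8 byte at offset {offset:#x}")
```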
I believe there are situations where a file can be interpreted in several encodings at once and still produce human-readable content. Then it is just a matter of preference which encoding you show by default. Or you might have a partially broken file that is only readable if you select UTF-8, even though it contains several corrupted bytes.
If NPP developers changed this function intentionally, I wish to have a setting to turn it off.
-
@Marek-Jindra said:
If NPP developers changed this function intentionally, I wish to have a setting to turn it off.
If you wish to make a feature request or bug report, this FAQ explains how. You will probably want to reference this thread (https://notepad-plus-plus.org/community/topic/17196/ucs-2-encoding-problem) from your feature request, and it’s considered polite to paste a link to the feature request back in this discussion.
-
@Marek-Jindra said:
I expected it to work because it worked in the previous versions.
Sometimes features change between versions. That’s why many people recommend not succumbing to upgraditis – if it’s not broke, don’t fix it. Others recommend doing every update, because of potential security problems – that’s great advice for front-facing applications like phone apps or web browsers, which do a lot of networking; but for local-focused applications like Notepad++, that’s not as critical.
Since an older version works for you, you might consider re-installing the older version, and turning off auto-updates. In that case, you can either wait until your feature request is implemented and confirmed before upgrading, or just not bother upgrading.
In the end, it’s up to you. Good luck.
-
BTW, I have not found the hex editor plugin to be very good; in this case it is probably best to use a separate hex editor. Much as we want Notepad++ to do everything and be good at everything, it doesn’t have the kind of development resources behind it to be all-powerful.
-
@Marek-Jindra said:
Now I have to use other software, or an old version of NPP, because I am not able to view the Unicode file as bytes (in an 8-bit encoding).
i get the same results on all tested notepad++ versions, from very old to newest.
(5.9.3 ansi, 5.9.3 unicode, 7.5.5, 7.6.3)
are you sure that it behaved differently on an old version of npp ?
if yes, which version was it ?
if you have time, you can download all older portable versions from here:
https://notepad-plus-plus.org/download/all-versions.html
(choose the zip packages. they will not interfere with your installed version)
and find the version which did what you need.
reason: as soon as you file an issue report, it might be of help if a notepad++ reference version, one that behaves like you would expect, has ever existed.
here are my test results:
original content of "Pound.txt", saved as ucs-2 le bom, displayed as ucs-2 le bom:
£1 = €1.17
-----
ansi/utf-8 view in notepad++ 7.5.5:
encoding > encode in ansi:  Â£1 = â‚¬1.17
encoding > encode in utf-8: £1 = €1.17
-----
ansi/utf-8 view in notepad++ 7.6.3:
encoding > encode in ansi:  Â£1 = â‚¬1.17
encoding > encode in utf-8: £1 = €1.17
-----
ansi/utf-8 view in notepad++ 5.9.3 unicode:
encoding > encode in ansi:  Â£1 = â‚¬1.17
encoding > encode in utf-8: £1 = €1.17
-----
ansi/utf-8 view in notepad++ 5.9.3 ansi:
encoding > encode in ansi:  Â£1 = â‚¬1.17
encoding > encode in utf-8: £1 = €1.17
-
i second @Alan-Kilborn with the separate hex editor (where are we now ? somewhere between 4096 and 65536 i guess ;-) )
@Marek-Jindra @Alan-Kilborn @PeterJones and all:
i currently use hxd 2.2.1 (https://mh-nexus.de/en/hxd/)
which ones do you use ? maybe yours are even better for parsing character encodings, as hxd is good as a hex editor, but rather limited when it comes to file encodings.
-
Not sure hxd needs to be good at file encodings. I use it as well when I have the need to get to that level.
-
Apparently I haven’t needed a hex editor since my last computer upgrade at work, but when I do, HxD is what I use.
When all I need is a quick hex dump, which is much more often than I need a full-blown hex editor, I use the xxd that’s bundled with the Windows version of gvim.
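And if xxd isn’t around, a few lines of Python produce a serviceable dump (just a sketch; pass the file name as an argument):

```python
import sys

with open(sys.argv[1], "rb") as f:
    data = f.read()

# Classic 16-bytes-per-row hex dump with a printable-ASCII column
for i in range(0, len(data), 16):
    chunk = data[i:i + 16]
    hexpart = " ".join(f"{b:02x}" for b in chunk)
    text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
    print(f"{i:08x}  {hexpart:<47}  {text}")
```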
-
Yep, I have two Run-menu entries, HxD and HxD load current document :-)
-
Thank you all for your input. I will also have a look at HxD.
@Meta-Chuh
I think this changed after I upgraded from NPP 7.5.9 to 7.6.2.
I am quite sure it behaved differently in the older version.
Now I tried the portable version and you are right, it behaves the same as the current version.
So it might be plugin-related or config-related.
I think I have got an older version of NPP on my other laptop, so I will investigate that and search for differences.
-
Hello, @marek-jindra, @peterjones, @meta-chuh, @alan-kilborn, @ekopalypse, and All,
I have an explanation of this behavior but, unfortunately, I cannot confirm that it is the correct one :-/
I’m going to begin with some general notions, then I’ll try to give you an accurate answer. I know, encodings are really a nightmare for every one of us :-((
If we write the string £1 = €1.17 in a new file, then use the Convert to UCS-2 LE BOM N++ option and save it as Pound.txt, the different bytes of this file and their meaning are as below:

BOM    £      1      SP     =      SP     €      1      .      1      7
-----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
ff fe  a3 00  31 00  20 00  3d 00  20 00  ac 20  31 00  2e 00  31 00  37 00

Everything is logical here!
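If you want to double-check this table yourself, a tiny Python sketch prints the same 16-bit code units, character by character:

```python
text = "£1 = €1.17"
for ch in text:
    print(f"{ch!r:6} U+{ord(ch):04X} -> {ch.encode('utf-16-le').hex(' ')}")
# '£'    U+00A3 -> a3 00
# '1'    U+0031 -> 31 00
# ...
# '€'    U+20AC -> ac 20
```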
- The UCS-2 encoding can only encode the Unicode characters of the BMP (Basic Multilingual Plane), i.e. the range [\x{0000}-\x{D7FF}\x{E000}-\x{FFFF}], in a single 16-bit code unit.
- The LE terminology means that, for each 16-bit code unit, the least significant byte (the one containing the least significant bits) is written first and the most significant byte comes last.
- The BOM is an invisible Byte Order Mark, the Unicode character \x{FEFF}, logically written as the bytes FF FE according to the Little Endian rule, which identifies the byte order without ambiguity!

Refer to:
https://en.wikipedia.org/wiki/UTF-16
https://en.wikipedia.org/wiki/Endianness
Remarks:
- It’s important to point out that the two N++ encodings UCS-2 LE and UCS-2 BE cannot represent Unicode characters with code points over \x{FFFF}, i.e. beyond the BMP (Basic Multilingual Plane).
- In order to represent these characters (for instance the emoticon characters, in the range [\x{1F600}-\x{1F64F}]) while keeping the two-byte architecture, the UTF-16 encoding (BTW, the default Windows Unicode encoding!) codes them in two 16-bit units, called a surrogate pair.
- These two 16-bit units are located in the range [\x{D800}-\x{DBFF}] (high surrogates) and the range [\x{DC00}-\x{DFFF}] (low surrogates); see the small sketch after these remarks, and refer below for additional information:
https://en.wikipedia.org/wiki/UTF-16#U+010000_to_U+10FFFF
- This also means that, if your document contains characters with Unicode code points over \x{FFFF}, it must be saved, exclusively, with the N++ UTF-8 or UTF-8 BOM encodings!
Now, Marek, let’s get back to your question:
From the definition of an encoding, this process should not change the file contents, but simply re-interpret them according to the character map of that encoding.
So, in theory, it should be, strictly, as below (I assume that the BOM is also ignored):

£   NUL 1   NUL SP  NUL =   NUL SP  NUL ¬   SP  1   NUL .   NUL 1   NUL 7   NUL
a3  00  31  00  20  00  3d  00  20  00  ac  20  31  00  2e  00  31  00  37  00
Instead, after using the N++ Encode in ANSI option and saving the file, we get this strange layout:

Â   £   1   SP  =   SP  â   ‚   ¬   1   .   1   7
--  --  --  --  --  --  --  --  --  --  --  --  --
c2  a3  31  20  3d  20  e2  82  ac  31  2e  31  37

At first sight, we cannot see any logic! Actually, two phases occur (a small sketch reproducing them follows the points below):
- Firstly, a transformation of the UCS-2 LE BOM representation of the characters with code points > \x{007F} into the analogous UTF-8 representation of those characters
- Secondly, the normal re-interpretation of these bytes in ANSI, which is, by the way, practically identical to the Windows-1252 encoding in my country (France)
So :
- The £ character, with Unicode code point \x{00A3}, represented in UTF-8 by the two-byte sequence C2 A3, is finally interpreted as the two ANSI characters Â and £
- The € character, with Unicode code point \x{20AC}, represented in UTF-8 by the three-byte sequence E2 82 AC, is finally interpreted as the three ANSI characters â, ‚ and ¬
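You can reproduce these two phases outside N++, with a small Python sketch (using cp1252, the Windows-1252 code page, as the ANSI encoding):

```python
text = "£1 = €1.17"

# Phase 1: the buffer content, re-encoded as UTF-8
utf8_bytes = text.encode("utf-8")
print(utf8_bytes.hex(" "))          # c2 a3 31 20 3d 20 e2 82 ac 31 2e 31 37

# Phase 2: those UTF-8 bytes, re-interpreted as ANSI (Windows-1252)
print(utf8_bytes.decode("cp1252"))  # Â£1 = â‚¬1.17
```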
IMPORTANT: I don’t know if this behavior is a real bug or if some “hidden” rules could explain it :-(( In the meantime, we have to live with it!
Thus, when you performed your second operation, Encode in UTF-8, you saw, again, the £1 = €1.17 text, with the internal representation:

£      1   SP  =   SP  €         1   .   1   7
-----  --  --  --  --  --------  --  --  --  --
c2 a3  31  20  3d  20  e2 82 ac  31  2e  31  37
Now, let’s compare with some other N++ sequences of Encode in / Convert to!

Let’s start, again, with your correct Pound.txt file, saved after the Convert to UCS-2 LE BOM operation:

BOM    £      1      SP     =      SP     €      1      .      1      7
-----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
ff fe  a3 00  31 00  20 00  3d 00  20 00  ac 20  31 00  2e 00  31 00  37 00
If we use the Convert to UTF-8 BOM N++ option first, we obtain the same text, with the byte contents:

BOM       £      1   SP  =   SP  €         1   .   1   7
--------  -----  --  --  --  --  --------  --  --  --  --
ef bb bf  c2 a3  31  20  3d  20  e2 82 ac  31  2e  31  37
BTW, note that the beginning byte sequence EF BB BF is simply the UTF-8 representation of the BOM character (\x{FEFF}).

Then, after an Encode in ANSI operation, we get this layout, identical to what you obtained when changing directly from Convert to UCS-2 LE BOM to Encode in ANSI:

Â   £   1   SP  =   SP  â   ‚   ¬   1   .   1   7
--  --  --  --  --  --  --  --  --  --  --  --  --
c2  a3  31  20  3d  20  e2  82  ac  31  2e  31  37
To end with, let’s click, again, on the Encode in UTF-8 BOM option. We read, logically, the correct text £1 = €1.17, with the byte sequence:

BOM       £      1   SP  =   SP  €         1   .   1   7
--------  -----  --  --  --  --  --------  --  --  --  --
ef bb bf  c2 a3  31  20  3d  20  e2 82 ac  31  2e  31  37
Now, if we click on the Convert to ANSI option, we get the same text £1 = €1.17, corresponding to:

£   1   SP  =   SP  €   1   .   1   7
--  --  --  --  --  --  --  --  --  --
a3  31  20  3d  20  80  31  2e  31  37
IMPORTANT:
Unlike the encoding process, a conversion to a new encoding does modify the file contents: it tries to rewrite every character displayed in the current encoding with the byte representation of that character in the new, desired encoding!
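In Python terms (just an analogy of mine, of course, not the actual N++ code), Encode in … is like decoding the same bytes with another codec, whereas Convert to … re-encodes the characters into new bytes:

```python
text = "£1 = €1.17"

# "Encode in ..." : the bytes are unchanged, only their interpretation differs
ucs2_bytes = text.encode("utf-16-le")
print(ucs2_bytes.decode("cp1252"))     # mojibake, NUL bytes included -- same bytes!

# "Convert to ANSI" : new bytes are written for the same characters
print(text.encode("cp1252").hex(" "))  # a3 31 20 3d 20 80 31 2e 31 37
```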
Hope that my answer gives you some hints!
Best Regards,
guy038
I’m quite used to this tiny but very useful on-line UTF-8 tool:
http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?
Before typing anything in the text zone, I advise you:
- To read the notes at the end of the page carefully
- To select the right type for your entry, which, generally, will be either Interpret as Character or Interpret as Hex code point (for instance, the character € or the Unicode value 20AC)
-
@guy038
Thank you for the explanation; you described very thoroughly what happens.
I think this behavior is very good for people who want to see readable text and not bother with encodings. It doesn’t corrupt the characters even if you tell it to.
But I think NPP is not showing me the truth: what the UCS-2 LE bytes really look like when interpreted as ANSI.