UCS-2 encoding problem
- 
 Thank you for the reply. Yes, saving as UCS-2 LE did save the correct bytes to disk. 
 However, I also want to use NPP to verify whether the bytes in the file are correct. Now I have to use other software, or an old version of NPP, because I am not able to view the Unicode file as bytes (an 8-bit encoding).
 Even the Hex-Editor plugin in NPP no longer works and does not show the real hex values in this situation. I sometimes used to view or edit binary files in NPP; that is not always reliable now.
 Imagine a situation where a binary file is composed of ANSI parts and Unicode parts. Then there is no single correct encoding for the whole file, and several encodings might accidentally seem to be valid. I wish to switch between them.
 I need an editor that can both edit binary files and convert/re-interpret encodings. An invalid UTF-8 sequence could result in some question marks or strange characters; I could use this to see which parts of my corrupted file are not valid UTF-8 sequences. I expected it to work because it worked in previous versions. I believe there are situations where a file might be interpretable as several encodings at once and even produce human-readable content in each; then it is just a matter of preference which encoding you show as the default. Or you might have a partially broken file which is only readable if you select UTF-8, even though it contains several corrupted bytes. If the NPP developers changed this behavior intentionally, I wish to have a setting to turn it off.
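 As an illustration of what I mean (just a rough Python sketch; the file name is only an example), decoding with a lenient error handler marks every invalid UTF-8 byte with the replacement character, so the corrupted regions become visible:

```
# rough sketch: locate the parts of a file that are not valid UTF-8
with open("corrupted.bin", "rb") as f:      # hypothetical example file
    raw = f.read()

text = raw.decode("utf-8", errors="replace")   # invalid bytes become U+FFFD
print(text.count("\ufffd"), "replacement character(s) mark invalid UTF-8 bytes")
```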
- 
 @Marek-Jindra said:
 If the NPP developers changed this behavior intentionally, I wish to have a setting to turn it off.

 If you wish to make a feature request or bug report, this FAQ explains how. You will probably want to reference this thread (https://notepad-plus-plus.org/community/topic/17196/ucs-2-encoding-problem) from your feature request, and it’s considered polite to paste a link to the feature request back in this discussion.
- 
 @Marek-Jindra said:
 I expected it to work because it worked in previous versions.

 Sometimes features change between versions. That’s why many people recommend not succumbing to upgraditis – if it’s not broke, don’t fix it. Others recommend doing every update because of potential security problems – that’s great advice for outward-facing applications like phone apps or web browsers, which do a lot of networking; but for local-focused applications like Notepad++, it’s not as critical. Since an older version works for you, you might consider re-installing the older version and turning off auto-updates. In that case, you can either wait until your feature request is implemented and confirmed before upgrading, or just not bother upgrading. In the end, it’s up to you. Good luck.
- 
 BTW, I have not found the hex editor plugin to be very good; in this case it is probably best to use a separate hex editor. As much as we want Notepad++ to do everything and be good at everything, it doesn’t have the kind of development resources behind it to be all-powerful.
- 
 Now I have to use other software, or an old version of NPP, because I am not able to view the Unicode file as bytes (an 8-bit encoding).

 i get the same results on all tested notepad++ versions, from very old to newest.
 (5.9.3 ansi, 5.9.3 unicode, 7.5.5, 7.6.3)
 are you sure that it behaved differently on an old version of npp ?
 if yes, which version was it ?
 if you have time, you can download all older portable versions from here:
 https://notepad-plus-plus.org/download/all-versions.html
 (choose the zip packages. they will not interfere with your installed version)
 and find the version which did what you need.
 reason: as soon as you file an issue report, it might be of help if a notepad++ reference version, whose source code behaves like you would expect, has ever existed.

 here are my test results:

 original content of "Pound.txt", saved as ucs-2 le bom, displayed as ucs-2 le bom:
 £1 = €1.17
 -----
 ansi/utf-8 view in notepad++ 7.5.5:
 encoding > encode in ansi: £1 = €1.17
 encoding > encode in utf-8: £1 = €1.17
 -----
 ansi/utf-8 view in notepad++ 7.6.3:
 encoding > encode in ansi: £1 = €1.17
 encoding > encode in utf-8: £1 = €1.17
 -----
 ansi/utf-8 view in notepad++ 5.9.3 unicode:
 encoding > encode in ansi: £1 = €1.17
 encoding > encode in utf-8: £1 = €1.17
 -----
 ansi/utf-8 view in notepad++ 5.9.3 ansi:
 encoding > encode in ansi: £1 = €1.17
 encoding > encode in utf-8: £1 = €1.17
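 ps: if you want a cross-check that does not depend on notepad++ at all, here is a minimal python sketch (just an illustration, using the same test file name as above) that dumps the raw bytes on disk:

```
# dump the raw bytes of the test file, independent of any editor's display
with open("Pound.txt", "rb") as f:
    data = f.read()

print(" ".join(f"{b:02x}" for b in data))
# for the ucs-2 le bom file this prints:
# ff fe a3 00 31 00 20 00 3d 00 20 00 ac 20 31 00 2e 00 31 00 37 00
```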
- 
 i second @Alan-Kilborn with the separate hex editor (where are we now ? somewhere between 4096 and 65536 i guess ;-) ) @Marek-Jindra @Alan-Kilborn @PeterJones and all: 
 i currently use hxd 2.2.1 (https://mh-nexus.de/en/hxd/)
 which ones do you use ? maybe yours are even better for parsing character encodings, as hxd is good as a hex editor, but rather limited when it comes to file encodings.
- 
 Not sure hxd needs to be good at file encodings. I use it as well when I have the need to get to that level. 
- 
 Apparently I haven’t needed a hex editor since my last computer upgrade at work, but when I do, HxD is what I use. When all I need is a quick hex dump, which is much more often than I need a full-blown hex editor, I use the xxd that’s bundled with the Windows version of gvim.
- 
 Yep, I have two run menu entries HxD and HxD load current document :-) 
- 
 Thank you all for your input. I will also have a look at HxD. @Meta-Chuh
 I think this changed after I upgraded from NPP 7.5.9 to 7.6.2.
 I am quite sure it behaved differently in the older version.
 Now I tried the portable version and you are right, it behaves the same as the current version.
 So it might be plugin-related or config-related.
 I think I have got an older version of NPP on my other laptop, so I will investigate that and search for differences.
- 
 Hello @marek-jindra, @peterjones, @meta-chuh, @alan-kilborn, @ekopalypse, and All, I have an explanation of this behavior but, unfortunately, I cannot confirm that it is the correct one :-/ I’m going to begin with some general notions; then I’ll try to give you an accurate answer. I know, encodings are really a nightmare for every one of us :-((
 If we write the string £1 = €1.17 in a new file, then use the Convert to UCS-2 LE BOM N++ option and save it as pound.txt, the different bytes of this file and their meaning are as below :

 BOM    £      1      SP     =      SP     €      1      .      1      7
 -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
 ff fe  a3 00  31 00  20 00  3d 00  20 00  ac 20  31 00  2e 00  31 00  37 00

 Everything is logical, here !
- 
The UCS-2 encoding can only encode the Unicode characters of the BMP ( Basic Multilingual Plane ), in the range [\x{0000}-\x{D7FF}\x{E000}-\x{FFFF}], in a single 16-bit code unit
- 
The LE terminology means that, for each character, the least significant byte ( containing the least significant bits ) is written first and the most significant byte comes last
- 
The BOM is an invisible Byte Order Mark, the Unicode character \x{FEFF}, logically written as the bytes FF FE according to the Little-Endian rule, which identifies the byte order without ambiguity !
 Refer to :
 https://en.wikipedia.org/wiki/UTF-16
 https://en.wikipedia.org/wiki/Endianness

 Remarks :
- 
It’s important to point out that the two N++ encodings, UCS-2 LE and UCS-2 BE, cannot represent Unicode characters with code points over \x{FFFF}, so outside the BMP ( Basic Multilingual Plane )
- 
In order to represent these characters ( for instance the emoticon characters, in range [\x{1F600}-\x{1F64F}] ), while keeping the two-byte architecture, the UTF-16 encoding ( BTW, the default Windows Unicode encoding ! ) codes them in two 16-bit units, called a surrogate pair
- 
These two 16-bit units are located in the range [\x{D800}-\x{DBFF}] ( High surrogates ) and in the range [\x{DC00}-\x{DFFF}] ( Low surrogates ). Refer, below, for additional information :
 https://en.wikipedia.org/wiki/UTF-16#U+010000_to_U+10FFFF

 This also means that, if your document contains characters with a Unicode code point over \x{FFFF}, it must be saved, exclusively, with the N++ UTF-8 or UTF-8 BOM encodings !
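 As a quick cross-check ( just a Python sketch, nothing that N++ does itself; Python’s utf-16-le codec matches UCS-2 LE as long as every character stays inside the BMP ), these byte layouts can be reproduced like this :

```
# Sketch: reproduce the UCS-2 LE BOM byte table with Python's utf-16-le codec
s = "£1 = €1.17"
data = "\ufeff".encode("utf-16-le") + s.encode("utf-16-le")
print(" ".join(f"{b:02x}" for b in data))
# -> ff fe a3 00 31 00 20 00 3d 00 20 00 ac 20 31 00 2e 00 31 00 37 00

# A character beyond the BMP needs a surrogate pair in UTF-16:
print(" ".join(f"{b:02x}" for b in "\U0001F600".encode("utf-16-le")))
# -> 3d d8 00 de   (high surrogate D83D + low surrogate DE00, little-endian)
```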
 
 Now, Marek, let’s get back to your question. From the definition of an encoding, this process should not change the file contents; it simply re-interprets the file contents according to the character map of that encoding. So, in theory, it should be, strictly, as below ( I assume that the BOM is also ignored ) :

 £   NUL 1   NUL SP  NUL =   NUL SP  NUL ¬   SP  1   NUL .   NUL 1   NUL 7   NUL
 a3  00  31  00  20  00  3d  00  20  00  ac  20  31  00  2e  00  31  00  37  00

 Instead, after using the N++ Encode in ANSI option and saving the file, we get this strange layout :

 Â   £   1   SP  =   SP  â   ‚   ¬   1   .   1   7
 --  --  --  --  --  --  --  --  --  --  --  --  --
 c2  a3  31  20  3d  20  e2  82  ac  31  2e  31  37

 At first sight, we cannot see any logic ! Actually, two phases occur :
- 
Firstly, a transformation of the UCS-2 LE BOM representation of the characters with code point > \x{007F} into the analogous UTF-8 representation of these characters
- 
Secondly, the normal re-interpretation of these bytes in ANSI, which is, by the way, practically identical to the Windows-1252 encoding in my country ( France )
 So : - 
The £ character, of Unicode code point \x{00A3} and represented in UTF-8 by the two-byte sequence C2 A3, is finally interpreted as the two ANSI characters Â and £
- 
The € character, of Unicode code point \x{20AC} and represented in UTF-8 by the three-byte sequence E2 82 AC, is finally interpreted as the three ANSI characters â, ‚ and ¬
 IMPORTANT : I don’t know if this behavior is a real bug or if some “hidden” rules could explain it :-(( In the meanwhile, we have to live with it ! Thus, when you performed your second operation, Encode in UTF-8, you saw, again, the £1 = €1.17 text, with the internal representation :

 £      1   SP  =   SP  €         1   .   1   7
 -----  --  --  --  --  --------  --  --  --  --
 c2 a3  31  20  3d  20  e2 82 ac  31  2e  31  37
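 If it helps, this tiny Python sketch ( my guess at the same two phases, certainly not the actual N++ code ) reproduces the strange layout exactly :

```
# Sketch of the two supposed phases: Unicode text -> UTF-8 bytes,
# then those bytes read through the ANSI ( Windows-1252 ) table
s = "£1 = €1.17"
utf8_bytes = s.encode("utf-8")        # c2 a3 31 20 3d 20 e2 82 ac 31 2e 31 37
print(utf8_bytes.decode("cp1252"))    # -> Â£1 = â‚¬1.17
```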
 Now, let’s compare with some other N++ sequences of Encode in / Convert to ! Let’s start, again, with your correct “Pound.txt” file, saved after the Convert to UCS-2 LE BOM operation :

 BOM    £      1      SP     =      SP     €      1      .      1      7
 -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
 ff fe  a3 00  31 00  20 00  3d 00  20 00  ac 20  31 00  2e 00  31 00  37 00

 If we first use the Convert to UTF-8 BOM N++ option, we obtain the same text, with the byte contents :

 BOM       £      1   SP  =   SP  €         1   .   1   7
 --------  -----  --  --  --  --  --------  --  --  --  --
 ef bb bf  c2 a3  31  20  3d  20  e2 82 ac  31  2e  31  37

 BTW, note that the beginning byte sequence EF BB BF is simply the UTF-8 representation of the Unicode BOM character ( \x{FEFF} ).

 Then, after an Encode in ANSI operation, we get this layout, identical to what you obtained when changing directly from Convert to UCS-2 LE BOM to Encode in ANSI :

 Â   £   1   SP  =   SP  â   ‚   ¬   1   .   1   7
 --  --  --  --  --  --  --  --  --  --  --  --  --
 c2  a3  31  20  3d  20  e2  82  ac  31  2e  31  37
 To end with, let’s click, again, on the Encode in UTF-8 BOM option. We read, logically, the correct text £1 = €1.17, with the byte sequence :

 BOM       £      1   SP  =   SP  €         1   .   1   7
 --------  -----  --  --  --  --  --------  --  --  --  --
 ef bb bf  c2 a3  31  20  3d  20  e2 82 ac  31  2e  31  37

 Now, if we click on the Convert to ANSI option, we get the same text £1 = €1.17, corresponding to :

 £   1   SP  =   SP  €   1   .   1   7
 --  --  --  --  --  --  --  --  --  --
 a3  31  20  3d  20  80  31  2e  31  37

 IMPORTANT : Unlike the encoding process, a conversion to a new encoding does modify the file contents, trying to rewrite all the characters displayed, in the current encoding, with the byte representation of these characters in the new desired encoding ! ( A small byte-level sketch of this difference follows the list below. )

 Hope that my answer gives you some hints !

 Best Regards,

 guy038

 I’m quite used to this tiny but very useful on-line UTF-8 tool : http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi? Before typing anything in the zone, I advise you :
- 
To read the notes, carefully, at the end of the page
- 
To select the right type for your entry, which, generally, will be either Interpret as Character or Interpret as Hex code point ( for instance, the character € or the Unicode value 20AC )
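 And here is the promised byte-level sketch of the Encode in / Convert to difference ( plain Python again, only an illustration of the byte arithmetic, not of N++’s internals ) :

```
s = "£1 = €1.17"

# "Convert to ANSI" : the characters are re-encoded, so the bytes change
print(" ".join(f"{b:02x}" for b in s.encode("cp1252")))
# -> a3 31 20 3d 20 80 31 2e 31 37   ( the € becomes the single byte 80 )

# A pure re-interpretation would keep the stored bytes and only change the
# way they are displayed, e.g. reading the UCS-2 LE bytes through ANSI :
print(repr(s.encode("utf-16-le").decode("cp1252")))
# -> '£\x001\x00 \x00=\x00 \x00¬ 1\x00.\x001\x007\x00'
```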
 
- 
 @guy038 
 Thank you for the explanation. You described very thoroughly what happens. I think this behavior is very good for people who want to see readable text and not bother with encodings. It doesn’t corrupt the characters even if you tell it to do so.
 But I think NPP is not showing me the truth: what the UCS-2 LE file really looks like if interpreted as ANSI.
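 To make concrete what I mean by “the truth”, here is a rough Python sketch ( the file name is only an example ) of the view I would expect from Encode in ANSI:

```
# read the UCS-2 LE BOM file and re-interpret its bytes as ANSI (Windows-1252)
# without changing them
with open("Pound.txt", "rb") as f:
    raw = f.read()

print(repr(raw.decode("cp1252")))
# -> 'ÿþ£\x001\x00 \x00=\x00 \x00¬ 1\x00.\x001\x007\x00'
```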




