Hex-Editor plugin failed to handle files other than UTF8 encoding
-
Notepad++ v9.7 (64-bit)
Hex-Editor plugin v0.9.8
I tested the hex editor with 3 files containing the same content but different encodings - UTF8, BIG5, & GB2312. However, no matter which encoding the file used, the hex editor always shows the content in UTF8 encoding.
File content : 你好嗎？
Correct result:
[UTF8] E4 BD A0 E5 A5 BD E5 97 8E EF BC 9F
[BIG5] A7 41 A6 6E B6 DC A1 48
[GB2312] C4 E3 BA C3 86 E1 A3 BF
Hex-Editor plugin result:
[UTF8] E4 BD A0 E5 A5 BD E5 97 8E EF BC 9F
[BIG5] E4 BD A0 E5 A5 BD E5 97 8E EF BC 9F
[GB2312] E4 BD A0 E5 A5 BD E5 97 8E EF BC 9F
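For reference, the “correct result” rows above can be reproduced with a few lines of plain Python (a sketch only; note that Notepad++’s GB2312 character set is effectively Windows code page 936, so the 'gbk' codec is used for that row):

```python
# -*- coding: utf-8 -*-
# Sketch: reproduce the expected on-disk byte dumps for each encoding.
text = u'你好嗎？'

for label, codec in [('UTF8', 'utf-8'), ('BIG5', 'big5'), ('GB2312', 'gbk')]:
    raw = text.encode(codec)   # bytes as they should appear on disk
    print('[%s] %s' % (label, ' '.join('%02X' % b for b in bytearray(raw))))
```

-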
@Li-Eric ,
That does not surprise me.
Notepad++ is a text editor, and it will treat text as text characters, not as individual bytes. Notepad++ only guarantees what the encoding is when it’s on the disk (for read or for write), not what it is in memory, nor what other plugins might do with the bytes from memory.
Based on your results, it is apparent that the Hex Editor gets the contents of the file from the Scintilla editor object, not from the bytes on the disk. And I believe that the Scintilla editor object stores the text in memory as UTF8, so Hex Editor would see the same.
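(You can check that yourself with a quick probe from the PythonScript plugin console; this is only a sketch, and it assumes the Python-3-based PythonScript plugin is installed:)

```python
# PythonScript sketch: compare Scintilla's in-memory bytes with the bytes on disk.
mem_bytes = editor.getText().encode('utf-8')   # Scintilla's buffer is UTF-8 internally

with open(notepad.getCurrentFilename(), 'rb') as f:
    disk_bytes = f.read()                      # whatever encoding the file really uses

console.write('memory: ' + ' '.join('%02X' % b for b in bytearray(mem_bytes)) + '\n')
console.write('disk  : ' + ' '.join('%02X' % b for b in bytearray(disk_bytes)) + '\n')
```

For a BIG5 or GB2312 file the two lines differ; for a UTF-8 file (without BOM) they match, which is consistent with what the Hex Editor shows.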
In theory, you could put in a feature request with the developer of HexEditor to allow it to use the real disk contents or the Scintilla-edited contents. However, the official repo (https://sourceforge.net/projects/npp-plugins/files/Hex Editor/) hasn’t been updated in years – the author appears to have abandoned the plugin. @chcg has provided a bugfix version (https://github.com/chcg/NPP_HexEdit/), but he makes it clear that it’s “unofficial”, and I don’t know whether or not he is actively taking feature requests.
If you want a true hex editor, which doesn’t hide the encoding, I suggest using a standalone one (possibly like HxD).
-
Hello, @peterjones, @li-eric and All,
Very informative answer, indeed, Peter !
This explains the first of the major problems found while testing the Summary feature, mentioned at the very beginning of this long post of mine : https://community.notepad-plus-plus.org/post/59069

So, we must remember that the Summary feature just looks into the Notepad++ buffer, which is UTF-8 encoded. This fact explains the Document length value seen for UCS-2 BE/LE BOM encoded files. However, the value of Characters (without line endings) is totally wrong for these encodings and seems to be, instead, the number of bytes of the corresponding UTF-8 [BOM] file !

For instance, the Unicode code points of the four characters of @li-eric’s text are 4F60 597D 55CE FF1F. As all these code points belong to the range [\x{0800}-\x{FFFF}], each ideograph is encoded with 3 bytes in a UTF-8 encoded file, as is the ？ fullwidth question mark. But, when the file is converted to the UCS-2 BE/LE BOM encoding, the View > Summary... option returns 12 ( so 4 chars x 3 bytes ) instead of the value 4, as correctly reported in the status bar after a Select All operation.

Best Regards

guy038
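(guy038’s arithmetic is easy to verify in plain Python; a small sketch, with nothing Notepad++-specific assumed:)

```python
# Sketch: the four code points each need 3 bytes in UTF-8 but only 2 in UCS-2/UTF-16.
text = u'\u4f60\u597d\u55ce\uff1f'   # 你好嗎？ (code points 4F60 597D 55CE FF1F)

utf8  = text.encode('utf-8')         # 12 bytes (4 chars x 3 bytes)
utf16 = text.encode('utf-16-be')     # 8 bytes  (4 chars x 2 bytes, no BOM added)

print('%d %d %d' % (len(text), len(utf8), len(utf16)))   # -> 4 12 8
```

So a Summary that reports 12 “characters” for a UCS-2 file is really counting the UTF-8 bytes of the buffer, not the characters.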
-
@PeterJones said in Hex-Editor plugin failed to handle files other than UTF8 encoding:
And I believe that the Scintilla editor object stores the text in memory as UTF8
Is this always true, though?
If it is, why does Scintilla have/need the SCI_SETCODEPAGE and SCI_GETCODEPAGE functions?
I probably have (many) misunderstandings of encoding issues…
-
@Alan-Kilborn said in Hex-Editor plugin failed to handle files other than UTF8 encoding:
@PeterJones said in Hex-Editor plugin failed to handle files other than UTF8 encoding:
And I believe that the Scintilla editor object stores the text in memory as UTF8
Is this always true, though?
Apparently not.
Using a file with your ➤ character (from the other post) in it, I started playing around. UTF8, UCS2-LE-BOM, UCS2-BE-BOM all show up differently (and with the number of bytes and endianness that I would expect), but if you choose one of the 8-bit character set “encodings”, it always uses the UTF-8 byte sequence for the ➤ character.
So, it appears the Hex Editor works as expected for any of the UTF8/UCS2 encodings, but not with the character sets. Given that character sets are treated differently in the code (there aren’t many NPPM_ or SCI_ messages dealing with character sets, compared to the Unicode encodings), it appears that a character set is mostly used during read-from-disk or write-to-disk, and not during the internal manipulation.
why does Scintilla have/need the SCI_SETCODEPAGE and SCI_GETCODEPAGE functions?
Not sure, really. No matter what character set or encoding I choose for a given editor tab, editor.getCodePage() always returns 65001. As far as I can tell, SCI_GETCODEPAGE only ever returns 65001. Even when I change the settings so that New documents are in something like OEM 855, which should be codepage 855, it returns 65001. Even when I go to cmd.exe, chcp 855, then launch a new instance of Notepad++ from that cmd.exe environment, it returns 65001. Even if I editor.setCodePage(855), a subsequent editor.getCodePage() returns 65001.

Oh, from the Scintilla SCI_GETCODEPAGE docs, “Code page can be set to 65001 (UTF-8), 932 (Japanese Shift-JIS), 936 (Simplified Chinese GBK), 949 (Korean Unified Hangul Code), 950 (Traditional Chinese Big5), or 1361 (Korean Johab).” So it’s there for choosing either UTF-8 or one of the Asian codepages. The docs also mention that 0 is valid (for disabling multibyte support). When I set 0, it reads back 0; similarly for the 6 explicit values quoted. But if I choose a charset like Shift-JIS, it does not change the SCI_GETCODEPAGE readback, so the codepage and the charset are separate entities.

But back to the higher level:
Basically, it appears that if it’s a real Unicode encoding (UTF8, UCS2), HexEditor will use that byte representation. But if it’s a “character set”, then it shows UTF8 bytes for non-ASCII characters, no matter how it happens to be encoded on disk. I am still guessing this is based on how Scintilla handles the bytes internally.
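(For anyone who wants to repeat the codepage probing described above, the PythonScript console calls look roughly like this; a sketch only, and the values echoed in the comments are simply the ones reported in this thread:)

```python
# PythonScript sketch: probe Scintilla's codepage setting for the active buffer.
console.write('codepage: %d\n' % editor.getCodePage())                 # 65001 here

editor.setCodePage(855)                                                # not an accepted value
console.write('after setCodePage(855): %d\n' % editor.getCodePage())   # still 65001

editor.setCodePage(932)                                                # a documented DBCS codepage
console.write('after setCodePage(932): %d\n' % editor.getCodePage())   # reads back 932

editor.setCodePage(65001)                                              # restore UTF-8 when done
```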