@Alan-Kilborn said in Hex-Editor plugin failed to handle files other than UTF8 encoding:
@PeterJones said in Hex-Editor plugin failed to handle files other than UTF8 encoding:
And I believe that the Scintilla editor object stores the text in memory as UTF8
Is this always true, though?
Apparently not.
Using a file containing your ➤ character (from the other post), I started playing around. UTF-8, UCS-2 LE BOM, and UCS-2 BE BOM all show up differently (with the byte counts and endianness I would expect), but if you choose one of the 8-bit character-set “encodings”, the Hex Editor always shows the UTF-8 byte sequence for the ➤ character.
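As a sanity check outside Notepad++, plain Python reproduces the byte sequences I saw for each real Unicode encoding (UCS-2 in the Notepad++ menus corresponds to UTF-16 here):

```python
# Byte representations of U+27A4 (the ➤ character) under the
# Unicode encodings Notepad++ offers, shown as hex strings.
ch = "\u27a4"  # ➤

print(ch.encode("utf-8").hex())      # three bytes: e29ea4
print(ch.encode("utf-16-le").hex())  # two bytes, little-endian: a427
print(ch.encode("utf-16-be").hex())  # two bytes, big-endian: 27a4
```

The BOM variants just prepend `fffe` or `feff` to the little- and big-endian forms; the payload bytes are the same.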
So, it appears the Hex Editor works as expected for any of the UTF-8/UCS-2 encodings, but not for the character sets. Given that character sets are treated differently in the code (there aren’t many NPPM_ or SCI_ messages dealing with character sets, compared to the Unicode encodings), it appears that the character set is mostly used during read-from-disk or write-to-disk, and not during internal manipulation.
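That “convert only at the disk boundary” model can be sketched in Python. This is my guess at the data flow, not a description of the actual Notepad++ code; the variable names are mine:

```python
# Model: the character set is applied only when loading and saving.
# In between, the buffer is held as UTF-8 no matter which character
# set the user picked.
disk_bytes = b"\xa7"                # '§' stored in Windows-1252, one byte
text = disk_bytes.decode("cp1252")  # decoded once, at load time
internal = text.encode("utf-8")     # what the buffer holds: b'\xc2\xa7'
saved = text.encode("cp1252")       # re-encoded to the charset only on save

assert internal == b"\xc2\xa7"
assert saved == disk_bytes
```

Under that model, a hex viewer that reads the live Scintilla buffer would see `c2 a7`, even though the file on disk contains the single byte `a7`.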
why does Scintilla have/need the SCI_SETCODEPAGE and SCI_GETCODEPAGE functions?
Not sure, really. No matter which character set or encoding I choose for a given editor tab, editor.getCodePage() always returns 65001. As far as I can tell, SCI_GETCODEPAGE only ever returns 65001. Even when I change the settings so that new documents default to something like OEM 855, which should be codepage 855, it returns 65001. Even when I go to cmd.exe, run chcp 855, then launch a new instance of Notepad++ from that cmd.exe environment, it returns 65001. Even if I call editor.setCodePage(855), a subsequent editor.getCodePage() still returns 65001.
Oh, from the Scintilla SCI_GETCODEPAGE docs: “Code page can be set to 65001 (UTF-8), 932 (Japanese Shift-JIS), 936 (Simplified Chinese GBK), 949 (Korean Unified Hangul Code), 950 (Traditional Chinese Big5), or 1361 (Korean Johab).” So it’s there for choosing either UTF-8 or one of the Asian codepages. The docs also mention that 0 is valid (for disabling multibyte support). When I set 0, it reads back 0; similarly for the six explicit values quoted above.
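That list makes some sense: every non-zero value is a multi-byte encoding, and Scintilla needs to know the code page so that caret movement and deletion don’t split a lead/trail byte pair. Python knows codecs for all six, so you can see the multi-byte sequences directly (codec names on the right are Python’s, not Scintilla’s):

```python
# The six non-zero code pages SCI_SETCODEPAGE accepts, with a sample
# character that encodes to more than one byte in each.
samples = {
    932:   ("cp932", "\u30ab"),  # Shift-JIS, katakana カ
    936:   ("gbk",   "\u4e2d"),  # Simplified Chinese GBK, 中
    949:   ("cp949", "\ud55c"),  # Korean Unified Hangul, 한
    950:   ("big5",  "\u4e2d"),  # Traditional Chinese Big5, 中
    1361:  ("johab", "\ud55c"),  # Korean Johab, 한
    65001: ("utf-8", "\u27a4"),  # UTF-8, ➤
}

for codepage, (codec, ch) in samples.items():
    encoded = ch.encode(codec)
    print(codepage, codec, encoded.hex(), len(encoded), "bytes")
```

Every sample encodes to two or three bytes, which is exactly the case where Scintilla has to treat a byte run as one character.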
But if I choose a charset like Shift-JIS, it does not change the SCI_GETCODEPAGE readback, so the codepage and the charset are separate entities.
But back to the higher level:
Basically, it appears that if it’s a real Unicode encoding (UTF8, UCS2), HexEditor will use that byte representation. But if it’s a “character set”, then it shows UTF8 bytes for non-ASCII characters, no matter how it happens to be encoded on disk. I am still guessing this is based on how Scintilla handles the bytes internally.
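If that guess is right, the symptom is easy to demonstrate with an accented character and an ANSI character set, again just as a sanity check in plain Python:

```python
# A file saved with the Windows-1252 character set stores 'é' as one
# byte on disk, but an in-memory UTF-8 buffer (which the Hex Editor
# appears to be reading) holds two bytes for the same character.
on_disk = "\u00e9".encode("cp1252")  # e9    -> what a raw hex dump shows
in_memory = "\u00e9".encode("utf-8") # c3a9  -> what HexEditor shows

print("on disk:  ", on_disk.hex())
print("in memory:", in_memory.hex())
```

So for any non-ASCII character under a character set, the Hex Editor would display one more byte than the file actually contains.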