2-byte characters recently broken? Or do I misremember?

Jay Libove

I was fairly sure that I recalled that Notepad++ supports 2-byte characters (e.g. an “a” with an umlaut over it, “ä”). However, I recently noticed that whenever I type such a character, save the text file in Notepad++, and then re-open the file, the ä is replaced by a question mark (?).

PeterJones

@jay-libove ,

Notepad++, even the newest v8.9.1.2, handles non-ASCII characters just fine.

You will want to check your encoding: make sure that Notepad++ thinks the encoding is what the file is actually encoded as. For example, if Notepad++ thinks it’s UTF8, but your file is actually one of the ANSI encodings (like the Windows 1252 character set), then the file will have a single byte 0xE4 for ä, which Notepad++ sees as an incomplete UTF8 sequence. To a UTF8 interpreter, 0xE4 says “this is the first byte of a 3-byte sequence”, but the bytes that follow don’t complete a proper UTF8 sequence, so it shows a ? to indicate its reaction of “huh, what?”.
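
To make the byte-level mismatch concrete, here is a minimal Python sketch. It uses Python's own codecs as a stand-in for whatever decoder Notepad++ uses internally, so treat it as an illustration of the encoding rules rather than of Notepad++'s actual behavior:

```python
# Illustration only: Python's codecs, not Notepad++ internals.
raw = "ä".encode("cp1252")           # Windows-1252 stores ä as one byte: 0xE4

# 0xE4 is 0b11100100. The 1110xxxx bit pattern marks the lead byte of a
# 3-byte UTF-8 sequence, so a lone 0xE4 is not valid UTF-8.
is_three_byte_lead = (raw[0] & 0xF0) == 0xE0

try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False               # the sequence is incomplete

# Decoding with the encoding the file really uses recovers the character.
as_ansi = raw.decode("cp1252")

# The same character saved as UTF-8 takes two bytes instead of one.
as_utf8_bytes = "ä".encode("utf-8")

print(raw, is_three_byte_lead, decoded_ok, as_ansi, as_utf8_bytes)
```

The same bytes are either a valid ä or an invalid sequence depending only on which encoding the reader assumes, which is why checking the status bar is the first step.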

So if you have a file that is showing ? instead of ä, look down in the status bar to see if Notepad++ thinks the file is UTF8 – it will say near the lower-right corner. If it does, try going to Encoding > ANSI and see if that now displays the file as you expect.

Jay Libove

@peterjones Apologies, I hadn’t seen that you’d replied.
Weirdness. The encoding is showing as “TIS-620”. (Thai …)
If I click on Encoding->ANSI or Encoding->UTF-8 the TIS-620 in the status bar does not change.
At the bottom left it says “Normal text file”.
Further thoughts appreciated, thanks. (n.b. this is now Notepad++ v8.1.9.3)
-Jay

Alan Kilborn

@jay-libove said in 2-byte characters recently broken? Or do I misremember?:

The encoding is showing as “TIS-620”. (Thai …)

It is probably your intent that the file is UTF-8?
And you have autodetection of encoding turned on in the Preferences?
Hmmm, there’s a known bug where UTF-8 files are detected as TIS-620 … maybe this is happening to you?

Here are some references to this bug:

Autodetection is not an exact science (well, it hasn’t been proven to be, anyway). I came up with a method to mitigate this bug somewhat, you may want to have a look HERE.

Another way to “solve” this problem is to turn autodetect of encoding off. Then, with N++ settings as default, your file probably will show UTF-8 on the status bar after loading.

@jay-libove said in 2-byte characters recently broken? Or do I misremember?:

If I click on Encoding->ANSI or Encoding->UTF-8 the TIS-620 in the status bar does not change.

This is because Notepad++ thinks your file is encoded as TIS-620 and you are telling it to reinterpret it (without changing it) as UTF-8. Probably the reinterpretation fails because of the corruption the bug has caused?
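
As a rough illustration of how a misdetected TIS-620 encoding could produce these symptoms, here is a small Python sketch. It uses Python's `tis_620` codec with `errors="replace"` as a stand-in; whether Notepad++ follows the same path when saving is an assumption, not something confirmed in this thread:

```python
# Illustration only: Python's codecs, not Notepad++ internals.
utf8_bytes = "ä".encode("utf-8")     # the two bytes 0xC3 0xA4 on disk

# Reinterpreting: same bytes, different decoder. Nothing on disk changes,
# but the two bytes now show up as two Thai characters instead of ä.
misread = utf8_bytes.decode("tis_620", errors="replace")

# Saving a real ä under TIS-620 loses it: the character has no code point
# in that encoding, so a lossy encoder substitutes "?".
saved = "ä".encode("tis_620", errors="replace")

print(repr(misread), saved)
```

Once the "?" has been written to disk, the original character is gone, so switching the encoding afterwards cannot bring it back.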

Jay Libove

Thanks very much @Alan-Kilborn
I’ll jump in to the other thread (levicki).
-Jay