Cannot change Encoding to correct encoding of UTF-8
-
@Coises said in Cannot change Encoding to correct encoding of UTF-8:
Open Settings | Preferences… | MISC. and look for Autodetect character encoding.
Thank you for your suggestion; this little “trick” of unchecking Autodetect fixed the issue for me.
-
I have the same problem. My UTF-8 file is identified as Chinese Big5 (Traditional) encoding. When I choose the correct UTF-8 encoding (not convert), nothing happens; the encoding that is used remains Big5. I was able to open the file with the correct encoding by disabling “Autodetect character encoding” as @Coises suggested. But I don’t understand why that is needed: I want character encoding autodetection to be on, but shouldn’t I be able to manually override the autodetected character encoding?
As @PeterJones suggested, I pared my text file to the minimum, and uploaded it at https://pastebin.com/rHpqwWLp . It’s just a file with the line
🥲ııııı
When I open it in Notepad++ v8.6.4 32-bit on Windows 10 22H2, I see Chinese characters and I am unable to change the encoding to UTF-8. I suspect it might not be possible to replicate this on Windows 11, because the 🥲 emoji does not exist in Windows 10 (I am unable to see it right now as I type this reply), which probably contributes to the issue. I tried this with other emojis that also don’t exist in Windows 10, but I don’t get the same issue with them. Only this emoji followed by five ı characters creates the problem (four characters do not reproduce it). Other sequences, e.g. five ş characters, do not reproduce it either, but something like ıışıı does. Anyway, I don’t have an issue with the autodetection being wrong; I just have an issue with being unable to override it without completely disabling autodetection.
-
@debiedowner said in Cannot change Encoding to correct encoding of UTF-8:
I just have an issue with being unable to override it without completely disabling autodetection.
Under most circumstances, you can (for example, the ones listed above). You have found a weird edge case.
I tried the following experiment:
- Settings > Preferences > MISC > autodetect off
- created a new UTF-8 file
- paste in 🥲ııııı (which can be copied from your post, and now this post)
- save it to a filename
- Encoding > Character Sets > Chinese > Big5 => it now looks like garbled Chinese text [screenshot not reproduced]
- Encoding > UTF-8 (not Convert To…), and it properly goes back to the original 🥲ııııı [screenshot not reproduced]
- Settings > Preferences > MISC > autodetect on
- same file is still loaded, and still showing as the correct UTF-8
- Encoding > Character Sets > Chinese > Big5 => it’s back to the weird version it was before, as expected when you interpret those bytes as Big5
- Encoding > UTF-8 =>
- I expected that, like before, it would go back to being the smiley,
- but instead it stays as Big5, which it should not do.
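For a byte-exact version of the “paste in” step (copy/paste of the emoji can be lossy, especially on Windows 10), here is a small Python 3 sketch; the bytes are just the UTF-8 encoding of U+1F972 followed by five U+0131, and the output filename is only an example:

    # Write U+1F972 followed by five U+0131 to a file as UTF-8, so the test
    # does not depend on the clipboard or on the emoji having a Windows 10 glyph.
    text = "\U0001F972" + "\u0131" * 5           # 🥲ııııı
    data = text.encode("utf-8")

    with open("big5-misdetect-test.txt", "wb") as f:   # filename is just an example
        f.write(data)

    print(data.hex(" "))   # f0 9f a5 b2 c4 b1 c4 b1 c4 b1 c4 b1 c4 b1 (Python 3.8+ for the separator)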
This seems like a bug to me. The Autodetect character encoding option should only be in force when opening a file; using the Encoding menu to change how the bytes are interpreted should override anything that was autodetected, because that’s the whole point of being able to change it while a file is loaded.
Do you have the ability / an account to put in an official bug report in the issue tracker? This is something the developer should be made aware of. If so, a combination of your steps-to-reproduce plus mine should give him a good picture of what’s going wrong. (Yours shows the real-world use case. Mine will help debug the process without having to open/close files frequently.)
(Maybe someone else [even the developer] will disagree with me, claiming that the Encoding menu should follow that option, too. I would be disappointed if that were the decision, but it’s a possibility.)
I want character encoding autodetection to be on
Out of curiosity, what character encodings do you normally use? If you have auto-detect turned off, does Notepad++ read your normal files incorrectly?
-
@PeterJones said in Cannot change Encoding to correct encoding of UTF-8:
You have found a weird edge case.
I was pursuing the same sort of tests when you posted. The edge cases seem to be:
- When a file is opened as a character set, it can’t be reinterpreted as ANSI or any of the UTF encodings. It can be reinterpreted as a different character set.
- When a file is opened as ANSI or any of the UTF encodings, it can be reinterpreted as a character set. If you then attempt to reinterpret it again as ANSI or any of the UTF encodings, it will be reinterpreted as the encoding with which it originally opened. (After that you can change it to a different ANSI or UTF encoding.)
It seems like those two behaviors might be related.
I’ve tried before to understand the code that reads files and works out character encoding. I was not successful. So I don’t know whether to think this is an accepted limitation or a mistake. From a user’s point of view it doesn’t make much sense.
-
Looks to me like an example of this known issue.
The common trigger is a sequence of emoji followed by ASCII alphanumerics.
I’m guessing the character boundaries get confused between the last emoji and the first ASCII character, with the “tail” of the emoji being mistaken for a DBCS lead byte.
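A quick way to see that at the byte level (a sketch using Python’s built-in big5 codec; Notepad++’s own detection and conversion go through different code, so this is only indicative):

    # UTF-8 bytes of U+1F972 followed by five U+0131 (the content of the test file):
    data = ("\U0001F972" + "\u0131" * 5).encode("utf-8")
    print(data.hex(" "))                  # f0 9f a5 b2 c4 b1 c4 b1 c4 b1 c4 b1 c4 b1

    # Most of these bytes pair up into valid Big5 double-byte codes (a5 b2, c4 b1),
    # which is presumably why a statistical detector settles on Big5. Decoding the
    # buffer as Big5 shows roughly what the editor displays under that interpretation.
    print(data.decode("big5", errors="replace"))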
-
@rdipardo said in Cannot change Encoding to correct encoding of UTF-8:
Looks to me like an example of this known issue.
That could be. Character set inference will always have a non-zero failure rate on Windows text files.
The bigger question is, why can’t the interpretation of the file be manually reset? If Notepad++ guesses ANSI, or UTF-8, and it’s wrong, you can change between them, or to a specific character set. But if Notepad++ guesses a character set, you can reinterpret the file as a different character set, but you can’t reinterpret it as Unicode (or ANSI). You have to turn off the Autodetect character encoding setting, so it won’t guess a character set, then open the file again.
-
The common trigger is a sequence of emoji followed by ASCII alphanumerics.
I’m guessing the character boundaries get confused between the last emoji and the first ASCII character, with the “tail” of the emoji being mistaken for a DBCS lead byte.

Interesting, in that this example is an emoji followed by U+0131, so there isn’t an ASCII character to trigger it. But it is still likely the same bug, or intimately related.
What surprises me, and what I think is different, is that the Encoding menu actions don’t behave the same depending on the option. I was under the impression that the Encoding menu actions (without “Convert to”) were just supposed to change how Notepad++ interprets the bytes in the file, primarily for the purpose of fixing things when autodetect is wrong. So the fact that the fix works with the option off but not with the option on looks like a separate bug to me. (Unless it’s expressed somewhere that the Encoding menu actions are intended to honor that setting, in which case they are useless for fixing autodetection issues.)
-
@Coises said in Cannot change Encoding to correct encoding of UTF-8:
You have to turn off the Autodetect character encoding setting, so it won’t guess a character set, then open the file again.
Or, as I showed, turn off the option, then just change the encoding in the menu. It properly re-interprets the bytes as whatever you choose, if you have the autodetect turned off. No closing/re-opening of the file needed.
-
@PeterJones said in Cannot change Encoding to correct encoding of UTF-8:
@Coises said in Cannot change Encoding to correct encoding of UTF-8:
You have to turn off the Autodetect character encoding setting, so it won’t guess a character set, then open the file again.
Or, as I showed, turn off the option, then just change the encoding in the menu. It properly re-interprets the bytes as whatever you choose, if you have the autodetect turned off. No closing/re-opening of the file needed.
Yes, you are correct. I missed that.
It seems like Notepad++ should, but doesn’t, ignore that setting and behave as if it were unchecked when you explicitly request reinterpreting the file as ANSI or Unicode.
-
Hello, @tom-sasson, @peterjones, @coises, @debiedowner and All,
Peter, I’ve just tried your Python 3 file, named hexdump.py, against the single line that you gave:
DOWNLOAD_FOLDER = r"C:\Users\MYUSER\Documents\עבודה\SOMFOLDER\תלושים"
And I did obtain the same dump as yours!
The usual Unicode Hebrew script lies between 0591 and 05F4. Thus, as this text is coded in UTF-8, any Hebrew character is always coded in two bytes, from D6 91 to D7 B4. So:
- The sequence d7 a2 d7 91 d7 95 d7 93 d7 94 corresponds, as expected, to 5 Hebrew characters (10 bytes), between the two backslashes.
- The sequence d7 aa d7 9c d7 95 d7 a9 d7 99 d7 9d corresponds, as expected, to 6 Hebrew characters (12 bytes), between the last backslash and the last double-quote.
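Those two dumps can be double-checked with a few lines of Python 3 (the two words are copied straight from the quoted path):

    # Encode the two Hebrew words from the path and dump their UTF-8 bytes.
    for word in ("עבודה", "תלושים"):
        encoded = word.encode("utf-8")
        print(len(word), "characters ->", len(encoded), "bytes:", encoded.hex(" "))

    # Output:
    # 5 characters -> 10 bytes: d7 a2 d7 91 d7 95 d7 93 d7 94
    # 6 characters -> 12 bytes: d7 aa d7 9c d7 95 d7 a9 d7 99 d7 9d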
To my mind, in this specific case, we should not speak about low byte, medium byte and high byte, but rather of:
- The leading byte, which can be D6 or D7
- The continuation byte, which can be any value between 91 and BF when the leading byte is D6, and any value between 80 and 87, OR between 90 and AA, OR between AF and B4, when the leading byte is D7

Given the latest Unicode v17.0 Hebrew code chart at: https://www.unicode.org/charts/PDF/U0590.pdf
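As a quick sanity check of those ranges, here is a minimal Python 3 sketch (the helper name is mine; it only covers the two-byte Hebrew sequences described above, not general UTF-8 validation):

    def is_hebrew_utf8_pair(lead, cont):
        """True if the two bytes are the UTF-8 encoding of a Hebrew character
        (U+0591..U+05F4), using the leading/continuation ranges listed above."""
        if lead == 0xD6:
            return 0x91 <= cont <= 0xBF
        if lead == 0xD7:
            return (0x80 <= cont <= 0x87) or (0x90 <= cont <= 0xAA) or (0xAF <= cont <= 0xB4)
        return False

    data = "עבודה".encode("utf-8")
    print(all(is_hebrew_utf8_pair(a, b) for a, b in zip(data[0::2], data[1::2])))   # True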
Best Regards,
guy038
-
@guy038 said in Cannot change Encoding to correct encoding of UTF-8:
we should not speak about low byte, medium byte and high byte,
It was a colloquial choice. I wasn’t talking about valid sequences, and especially not valid Hebrew sequences. I was assigning labels (“low”, “medium”, and “high”) to specific groups of byte values. In valid UTF-8, there will never be what I called a “low byte” [00-7F] followed by what I called a “medium byte” [80-BF], because UTF-8 doesn’t allow starting a new multi-byte sequence with anything in that range. Furthermore, in valid UTF-8, there will never be a “high byte” [C0-F7] followed by another “high byte” or a “low byte”: there will always be one or more “medium bytes” [80-BF] after a “high byte”. Thus, I was describing a way to find a bad sequence. (And I also wasn’t limiting myself to talking about Hebrew characters, because I wanted my description to work even if the file had other UTF-8 characters as well, and my description would have helped find those, too. Also, I wasn’t very optimistic that one existed: it was just something I suggested looking for, and as the reply indicated, the issue wasn’t bad UTF-8 sequences, but Notepad++'s failure to autodetect.)

update: If I had known/remembered the term, I probably should have used the word “continuation byte”, but I didn’t know that UTF-8 already had a term for that.
-
I think some back-to-basics logic might be needed, so I will explain in some detail how I understand the seemingly never-ending headache caused by multiple encodings and character sets/ranges.

I will try to explain it this way: ANSI encodings use character sets that reuse the extended ASCII byte values. Unicode encodings use character ranges that are unique and linear, with their own byte values.

Changing a character set does not require re-encoding, though Notepad++ wants to convert to wide-char/Unicode, which can at times screw it up. For example, German ANSI converted to Hebrew Unicode gets mangled; there are many issue reports about that, and I have seen it myself in testing. Notepad++, using the CharDet library, can select the incorrect character range because it can choose to implicitly convert to Unicode.
In SciTE, for example, I can change an ANSI character set by changing the value of the character.set property, without any re-encoding to Unicode. The code.page property can set a suitable character set, so that setting character.set may not be needed at times. What Notepad++ does is not like what other editors do, and in my view involves some probable error of assumption. Those learning the Notepad++ way without learning from other editors could be trapped in a belief that may not be correct, which means attempts to solve these related problems can get derailed in discussions.

Those who think that encoding is directly related to character sets are, IMO, missing the basic concept: the two are only indirectly related. Choosing a character set with ANSI is more like choosing font characters than choosing the encoding. With ANSI encoding, Asian languages, being more complex, can use 2 bytes per character to increase the size of the character set.

While automation can save the effort of re-doing things, computers have the processing speed while humans should have the fundamental logic. When it comes to CharDet, the user might be better at choosing the code page or the character set for ANSI files that use extended ASCII bytes than a library that can sometimes fail to make the correct choice. The implicit Unicode conversion can exacerbate the failure.
-
@mpheath said in Cannot change Encoding to correct encoding of UTF-8:
What Notepad++ does is not like what other editors do, and in my view involves some probable error of assumption.
Not only that — some of those faulty assumptions have even been patched in to N++'s copy of Scintilla!
See, for example, Neil’s critique of this modification to ScintillaWin.cxx.
-
@debiedowner said in Cannot change Encoding to correct encoding of UTF-8:
Anyway, I don’t have an issue with the autodetection being wrong, I just have an issue with being unable to override it without completely disabling autodetection.
In this thread @mpheath and @rdipardo have discussed other problems, and @PeterJones has noted that you can disable the setting, change encoding and immediately enable the setting again, but I agree that what you describe seems like unintended, and certainly unexpected, behavior.
The culprit is here:
if (nppGui._detectEncoding) fileFormat._encoding = detectCodepage(data, lenFile);
This occurs as the file is reloaded. (Changing the encoding in Notepad++ always reloads the file. That makes sense because it needs the original data to reinterpret using the new encoding. File data is not always stored in the editor the same way it is stored in the file itself.) It’s the same process used to load the file originally. There is no flag passed down to tell it to skip the codepage detection and just do what the user requested.
A few lines earlier there is a similar test for Unicode byte order marks. If a user whose default code page is Windows-1252 (which covers most places with a Latin-based alphabet) opens a new tab in Notepad++, sets the encoding to ANSI (if necessary), pastes in the text
ï»¿abc
and saves the file, then closes the tab and opens the file again, the first three characters will be missing, and the file will be opened as UTF-8 with BOM. Nothing you can do will show those three characters again in the editor; you can change the encoding to ANSI, but they will still be missing.
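The byte-level reason is easy to check with a short Python 3 sketch (the three characters ï » ¿ have the Windows-1252 byte values EF BB BF, which is exactly the UTF-8 byte order mark):

    import codecs

    # "ï»¿abc" saved as ANSI (Windows-1252) becomes the bytes EF BB BF 61 62 63.
    raw = "ï»¿abc".encode("cp1252")
    print(raw.hex(" "))                        # ef bb bf 61 62 63

    # The first three bytes are exactly the UTF-8 BOM, so the BOM test wins on
    # reload and those three characters vanish from the editor.
    print(raw.startswith(codecs.BOM_UTF8))     # True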
It seems to me that the file loading routine should somehow be made aware of the context and should skip all file format detection when it is being called to reinterpret a file using a different encoding. At this point, I don’t have a suggestion as to how best to make that information available to FileManager::loadFileData.
-
Reported as Issue #17033.
-