Cannot change Encoding to correct encoding of UTF-8
-
Looks to me like an example of this known issue.
The common trigger is a sequence of emoji followed by ASCII alphanumerics.
I’m guessing the character boundaries get confused between the last emoji and the first ASCII character, with the “tail” of the emoji being mistaken for a DBCS lead byte.
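Just to make that guess concrete, here is a minimal Python 3 sketch of my own (not taken from the linked issue; the Shift-JIS comparison is only an assumed example of a DBCS code page):

text = "🙂A"                    # an emoji followed by an ASCII character
data = text.encode("utf-8")
print(data.hex(" "))            # f0 9f 99 82 41
last_emoji_byte = data[3]       # 0x82, the final ("tail") byte of the emoji
# 0x81-0x9F is Shift-JIS's first lead-byte range, so a byte-level detector
# could pair this tail byte with the ASCII 0x41 that follows it.
print(0x81 <= last_emoji_byte <= 0x9F)   # True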
-
@rdipardo said in Cannot change Encoding to correct encoding of UTF-8:
Looks to me like an example of this known issue.
That could be. Character set inference will always have a non-zero failure rate on Windows text files.
The bigger question is, why can’t the interpretation of the file be manually reset? If Notepad++ guesses ANSI, or UTF-8, and it’s wrong, you can change between them, or to a specific character set. But if Notepad++ guesses a character set, you can reinterpret the file as a different character set, but you can’t reinterpret it as Unicode (or ANSI). You have to turn off the Autodetect character encoding setting, so it won’t guess a character set, then open the file again.
-
The common trigger is a sequence of emoji followed by ASCII alphanumerics.
I’m guessing the character boundaries get confused between the last emoji and the first ASCII character, with the “tail” of the emoji being mistaken for a DBCS lead byte.
Interesting, in that this example is emoji followed by U+0131. So there isn’t an ASCII character to trigger it. But it is still likely the same bug, or intimately related.
What surprises me, and what I think is different, is that the Encoding menu actions don’t behave the same depending on that option. I was under the impression that the Encoding menu actions (without “Convert to”) were just supposed to change how Notepad++ interprets the bytes in the file – primarily for the purpose of fixing things when autodetect is wrong. So the fact that the file can be fixed with the option off, but not with the option on, looks like a separate bug to me. (Unless it’s stated somewhere that the Encoding menu actions are intended to honor that setting, in which case they are useless for fixing autodetection issues.)
-
@Coises said in Cannot change Encoding to correct encoding of UTF-8:
You have to turn off the Autodetect character encoding setting, so it won’t guess a character set, then open the file again.
Or, as I showed, turn off the option, then just change the encoding in the menu. It properly re-interprets the bytes as whatever you choose, if you have the autodetect turned off. No closing/re-opening of the file needed.
-
@PeterJones said in Cannot change Encoding to correct encoding of UTF-8:
@Coises said in Cannot change Encoding to correct encoding of UTF-8:
You have to turn off the Autodetect character encoding setting, so it won’t guess a character set, then open the file again.
Or, as I showed, turn off the option, then just change the encoding in the menu. It properly re-interprets the bytes as whatever you choose, if you have the autodetect turned off. No closing/re-opening of the file needed.
Yes, you are correct. I missed that.
It seems like Notepad++ should, but doesn’t, ignore that setting and behave as if it were unchecked when you explicitly request reinterpreting the file as ANSI or Unicode.
-
Hello, @tom-sasson, @peterjones, @coises, @debiedowner and All,
Peter, I’ve just tried your Python 3 file, named hexdump.py, against the single line that you gave:
DOWNLOAD_FOLDER = r"C:\Users\MYUSER\Documents\עבודה\SOMFOLDER\תלושים"
And I did obtain the same dump as yours!
The usual Unicode Hebrew script lies between 0591 and 05F4. Thus, as this text is encoded in UTF-8, any Hebrew character is always encoded in two bytes, from D6 91 to D7 B4. So:
- The sequence d7 a2 d7 91 d7 95 d7 93 d7 94 corresponds, as expected, to 5 Hebrew characters (10 bytes), between the two backslashes.
- The sequence d7 aa d7 9c d7 95 d7 a9 d7 99 d7 9d corresponds, as expected, to 6 Hebrew characters (12 bytes), between the last backslash and the last double quote.
To my mind, in this specific case, we should not speak about low byte, medium byte and high byte, but rather of:
- The leading byte, which can be D6 or D7
- The continuation byte, which can be any value between 91 and BF when the leading byte is D6, and any value between 80 and 87, or between 90 and AA, or between AF and B4, when the leading byte is D7
This is based on the latest Unicode v17.0 Hebrew code chart at: https://www.unicode.org/charts/PDF/U0590.pdf
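By the way, these figures are easy to double-check without a hex dump; a tiny Python 3 snippet (my own quick sketch, independent of Peter’s hexdump.py) confirms them:

# Verify the byte sequences quoted above
for word in ("עבודה", "תלושים"):
    data = word.encode("utf-8")
    print(len(word), len(data), data.hex(" "))
# 5 10 d7 a2 d7 91 d7 95 d7 93 d7 94
# 6 12 d7 aa d7 9c d7 95 d7 a9 d7 99 d7 9d

# And the UTF-8 form of the Hebrew block boundaries U+0591 and U+05F4
print("\u0591".encode("utf-8").hex(" "), "\u05f4".encode("utf-8").hex(" "))
# d6 91 d7 b4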
Best Regards,
guy038
-
@guy038 said in Cannot change Encoding to correct encoding of UTF-8:
we should not speak about low byte, medium byte and high byte,
It was a colloquial choice. I wasn’t talking about valid sequences – and especially not valid Hebrew sequences. I was assigning labels (“low”, “medium”, and “high”) to specific groups of byte values. In valid UTF-8, there will never be what I called a “low byte” [00-7F] followed by what I called a “medium byte” [80-BF], because UTF-8 doesn’t allow starting a new multi-byte sequence with anything in that range. Furthermore, in valid UTF-8, there will never be a “high byte” [C0-F7] followed by another “high byte” or a “low byte”: there will always be one or more “medium bytes” [80-BF] after a “high byte”. Thus, I was describing a way to find a bad sequence. (And I also wasn’t limiting myself to talking about Hebrew characters, because I wanted my description to work even if the file had other UTF-8 characters as well, and my description would have helped find those, too. Also, I wasn’t very optimistic that one existed: it was just something I suggested looking for, and as the reply indicated, the issue wasn’t bad UTF-8 sequences, but Notepad++'s failure to autodetect.)
Update: If I had known/remembered the term, I probably should have used the word “continuation byte”, but I didn’t know that UTF-8 already had a term for that.
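If anyone wants to run that kind of search mechanically, here is a rough Python 3 sketch using the same informal byte classes (my own, and deliberately loose: it only flags the impossible pairings described above, it is not a full UTF-8 validator):

def classify(b):
    if b <= 0x7F:
        return "low"      # ASCII
    if b <= 0xBF:
        return "medium"   # continuation byte
    return "high"         # start of a multi-byte sequence

def suspicious_pairs(data):
    for i in range(len(data) - 1):
        a, b = classify(data[i]), classify(data[i + 1])
        # a "medium" byte may not follow a "low" byte, and a "high" byte
        # must always be followed by a "medium" byte
        if (a == "low" and b == "medium") or (a == "high" and b != "medium"):
            yield i, data[i:i + 2].hex(" ")

print(list(suspicious_pairs(bytes.fromhex("41 80"))))    # [(0, '41 80')] -- bad pairing
print(list(suspicious_pairs("עבודה".encode("utf-8"))))   # [] -- valid UTF-8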
-
I think some back-to-basics logic might be needed, so I will explain in some detail how I understand the seemingly never-ending headache caused by multiple encodings and character sets/ranges.
I will try to explain it this way: ANSI encodings use character sets that re-use the extended-ASCII byte values (0x80-0xFF), so the same byte can stand for different characters in different sets. Unicode encodings use character ranges that are unique and linear, each with its own byte values.
Changing a character set should not require re-encoding, though Notepad++ wants to convert to wide-char/Unicode, which can at times screw it up. For example, German ANSI converted to Hebrew Unicode gets mangled; there are many issue reports about that, and I have seen it myself in testing. Notepad++, using the CharDet library, can select the incorrect character range, because it can choose to implicitly convert to Unicode.
In SciTE, for example, I can change an ANSI character set by changing the value of the character.set property, without any re-encoding to Unicode. The code.page property can set a suitable character set, so setting character.set may not be needed at times. What Notepad++ does is not like what other editors do, and in my view it probably rests on a mistaken assumption. Those learning the Notepad++ way without learning from other editors could be trapped in a belief that may not be correct, which means discussions trying to solve these related problems can get derailed.
Those who think that encoding is directly related to character sets are, IMO, missing the basic concept: the two are only indirectly related. Choosing a character set in ANSI is more like choosing font characters rather than choosing the encoding. With ANSI encodings, Asian languages, being more complex, can use 2 bytes per character to increase the size of the character set.
Automation can save the effort of re-doing things: computers have the processing speed, while humans should supply the fundamental logic. When it comes to CharDet, the user might be better at choosing the code page or character set for ANSI files that use any extended-ASCII bytes, instead of trusting a library that can sometimes fail to make the correct choice. The implicit Unicode conversion can exacerbate the failure.
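A small Python 3 sketch (my own illustration, unrelated to SciTE or CharDet internals) makes the point that the byte value alone cannot identify the character set: the same ANSI byte is a different character under each code page, with no re-encoding involved.

raw = bytes([0xE4])
for codepage in ("cp1252", "cp1255", "cp1251"):
    print(codepage, raw.decode(codepage))
# cp1252 ä   (Western European)
# cp1255 ה   (Hebrew)
# cp1251 д   (Cyrillic)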
-
@mpheath said in Cannot change Encoding to correct encoding of UTF-8:
What Notepad++ does is not like what other editors do with some probable error of assumption of my view.
Not only that — some of those faulty assumptions have even been patched in to N++'s copy of Scintilla!
See, for example, Neil’s critique of this modification to ScintillaWin.cxx.
-
@debiedowner said in Cannot change Encoding to correct encoding of UTF-8:
Anyway, I don’t have an issue with the autodetection being wrong, I just have an issue with being unable to override it without completely disabling autodetection.
In this thread @mpheath and @rdipardo have discussed other problems, and @PeterJones has noted that you can disable the setting, change encoding and immediately enable the setting again, but I agree that what you describe seems like unintended, and certainly unexpected, behavior.
The culprit is here:
if (nppGui._detectEncoding) fileFormat._encoding = detectCodepage(data, lenFile);
This occurs as the file is reloaded. (Changing the encoding in Notepad++ always reloads the file. That makes sense because it needs the original data to reinterpret using the new encoding. File data is not always stored in the editor the same way it is stored in the file itself.) It’s the same process used to load the file originally. There is no flag passed down to tell it to skip the codepage detection and just do what the user requested.
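For what it’s worth, the “reinterpret the same bytes” step is easy to picture in a couple of lines of Python 3 (my sketch, reusing a byte sequence from guy038’s dump above):

raw = bytes.fromhex("d7 a2 d7 91 d7 95 d7 93 d7 94")
print(raw.decode("cp1252"))   # mojibake under an ANSI interpretation
print(raw.decode("utf-8"))    # עבודה under a UTF-8 interpretation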
A few lines earlier there is a similar test for Unicode byte order marks. If a user with default code page Windows-1252 (that covers most places with a latin-based alphabet) opens a new tab in Notepad++, sets encoding to ANSI (if necessary), pastes in the following text:
ï»¿abc
(the first three characters encode in Windows-1252 as the bytes EF BB BF, which is exactly the UTF-8 byte order mark) and saves the file, then closes the tab and opens the file again, the first three characters will be missing, and the file will be opened as UTF-8 with BOM. Nothing you can do will show those three characters again in the editor; you can change the encoding to ANSI, but they will still be missing.
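The byte-level mechanics of that scenario can be reproduced outside Notepad++ with a short Python 3 sketch (mine, just to illustrate the round trip):

text = "\u00ef\u00bb\u00bfabc"       # ï»¿abc, as typed into the ANSI tab
data = text.encode("cp1252")
print(data.hex(" "))                 # ef bb bf 61 62 63
# A BOM sniffer sees EF BB BF, decides the file is UTF-8 with BOM,
# strips it, and only "abc" remains:
print(data.decode("utf-8-sig"))      # abc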
It seems to me that the file loading routine should somehow be made aware of the context and should skip all file format detection when it is being called to reinterpret a file using a different encoding. At this point, I don’t have a suggestion as to how best to make that information available to FileManager::loadFileData.
-
Reported as Issue #17033.