ASCII compatibility questions
-
I’ve found myself down a bit of a rabbit hole. Considering possible solutions has brought back to mind some questions that concerned me while developing my plugin, which I had set aside because I didn’t know how to answer them.
I’ve seen that Notepad++ only uses two character set encodings within Scintilla: CP_ACP, the system’s default code page, and CP_UTF8. Other character sets are translated to UTF-8 before they are loaded into Scintilla.
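For reference, a plugin can ask the active view directly which of those two code pages it is using; here is a minimal sketch, assuming the standard plugin template’s nppData from PluginInterface.h:

```cpp
// Minimal sketch: ask the active Scintilla view which code page it is using.
// Assumes the standard plugin template (NppData filled in by setInfo()).
#include <windows.h>
#include "Scintilla.h"          // SCI_GETCODEPAGE, SC_CP_UTF8
#include "PluginInterface.h"    // NppData, NPPM_GETCURRENTSCINTILLA

extern NppData nppData;

static UINT currentScintillaCodePage()
{
    int which = 0;
    ::SendMessage(nppData._nppHandle, NPPM_GETCURRENTSCINTILLA, 0,
                  reinterpret_cast<LPARAM>(&which));
    HWND hSci = (which == 0) ? nppData._scintillaMainHandle
                             : nppData._scintillaSecondHandle;

    // Returns SC_CP_UTF8 (65001) for UTF-8 documents, or 0, meaning
    // "interpret the bytes using the system code page (CP_ACP)".
    return static_cast<UINT>(::SendMessage(hSci, SCI_GETCODEPAGE, 0, 0));
}
```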
The default code pages in Windows installations for Western European languages are single-byte character sets which are supersets of ASCII. My questions are about other systems (Japanese, Ukrainian, etc.). I have no experience with those systems, nor any way I know of to test what happens on them.
I tried to follow the Notepad++ code for file loading and encoding detection, but I am somewhat lost. Before I beat my head against that particular wall any further, I thought I would ask if someone here already knows:
1. In practice, will Notepad++ ever load a document into Scintilla using an encoding in which ASCII characters are not represented as single bytes with the same values as in ASCII?
2. In practice, will Notepad++ ever load a document into Scintilla using a multi-byte character set encoding in which a byte of a multi-byte character can be within the ASCII range (x00-x7F)? (Note that UTF-8 was designed so that this cannot happen in UTF-8, so long as “character” is taken to refer to Unicode code points and not combining graphemes — which is good enough for my purpose.)
3. If 1 or 2 above is possible, same questions for the printable ASCII range (x20-x7E). (A rough way to probe this on a given system is sketched after this list.)
4. If any of the above are possible, what happens with regular expressions when those code pages are active?
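For question 3, a rough way to probe a given double-byte code page outside of Notepad++ is to ask Windows whether any valid two-byte character has a trail byte in the printable ASCII range. A sketch (test code only, using 932/Shift-JIS as the example code page):

```cpp
// Sketch: probe whether a double-byte code page (e.g. 932, Shift-JIS) has any
// valid two-byte character whose trail byte falls in the printable ASCII range.
// Test code only; not related to how Notepad++ itself loads files.
#include <windows.h>
#include <cstdio>

static bool trailBytesOverlapPrintableAscii(UINT codePage)
{
    for (int lead = 0x80; lead <= 0xFF; ++lead)
    {
        if (!IsDBCSLeadByteEx(codePage, static_cast<BYTE>(lead)))
            continue;
        for (int trail = 0x20; trail <= 0x7E; ++trail)  // printable ASCII
        {
            const char pair[2] = { static_cast<char>(lead), static_cast<char>(trail) };
            wchar_t wide[4];
            // MB_ERR_INVALID_CHARS makes the conversion fail unless the byte
            // pair is a valid character in this code page.
            if (MultiByteToWideChar(codePage, MB_ERR_INVALID_CHARS, pair, 2, wide, 4) > 0)
                return true;    // an ASCII-valued byte can occur inside a character
        }
    }
    return false;
}

int main()
{
    std::printf("code page 932: %s\n",
                trailBytesOverlapPrintableAscii(932) ? "yes" : "no");
}
```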
-
@Coises said in ASCII compatibility questions:
- In practice, will Notepad++ ever load a document into Scintilla using an encoding in which ASCII characters are not represented as single bytes with the same values as in ASCII?
That’s exactly what happens when your Windows version is at least 1903 and you’ve configured the system’s “ANSI” code page as UTF-8:
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage]
"ACP"=REG_SZ:"65001"
"OEMCP"=REG_SZ:"437"
"MACCP"=REG_SZ:"65001"
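For what it’s worth, a plugin can detect that configuration at runtime; a minimal sketch:

```cpp
// Sketch: detect at runtime whether the system "ANSI" code page has been
// switched to UTF-8 (as in the registry override above).
#include <windows.h>

static bool systemAnsiCodePageIsUtf8()
{
    return ::GetACP() == CP_UTF8;   // CP_UTF8 == 65001, defined in <winnls.h>
}
```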
Basically all of N++'s encoding logic depends on the assumption of a single-byte system code page. There is currently zero fault tolerance for the times when that assumption is false, e.g.,
- Converting between ansi and utf-8 doesn’t work under 64-bit version (comment)
- Plugin Mgr - National chars in profile path - “the plugin package is not found” (comment)
- Encode/decode JS entities works on one byte at a time and is not reversible (comment)
- Reply to “Does it mean that UTF8 would directly match with ANSI in scintilla?”
-
@rdipardo said in ASCII compatibility questions:
@Coises said in ASCII compatibility questions:
- In practice, will Notepad++ ever load a document into Scintilla using an encoding in which ASCII characters are not represented as single bytes with the same values as in ASCII?
That’s exactly what happens when your Windows version is at least 1903 and you’ve configured the system’s “ANSI” code page as UTF-8:
Thanks, but that is not what I meant. (The cases you listed are interesting, though!)
UTF-8 by design represents ASCII characters as single bytes with the same values as in ASCII, and multi-byte characters never use valid ASCII codes (0-127) as any byte of the character. That makes it “safe” for searches, when what you’re searching for is ASCII, even if what you’re searching in is not.
I think that all single-byte character sets in use have the first property (the second doesn’t apply), but that not all legacy double-byte or multi-byte character sets have those properties. However, I got the impression somewhere — and now I cannot remember where — that Notepad++ converts certain character sets to UTF-8 on loading (and back again on saving). What I wondered was if, in fact, it is known that by the time Notepad++ has loaded a file into Scintilla, the first or both of the properties I mentioned will always be true of the Scintilla text.
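To make that property concrete: every byte of a multi-byte UTF-8 sequence is in the range 0x80-0xFF, so a plain byte-level search for an ASCII character can never produce a false hit inside one. A small self-contained illustration (my own example, not plugin code):

```cpp
// Illustration of the UTF-8 property described above: every byte of a
// multi-byte sequence is >= 0x80, so a naive byte search for an ASCII
// character cannot match in the middle of one.
#include <cassert>
#include <cstring>

int main()
{
    // "héllo" in UTF-8: 'é' is encoded as the two bytes 0xC3 0xA9,
    // both outside the ASCII range 0x00-0x7F.
    const char utf8[] = "h\xC3\xA9llo";

    // Searching byte-by-byte for the ASCII character 'l' skips straight over
    // the two bytes of 'é' and finds the real 'l' at index 3.
    const char* hit = static_cast<const char*>(std::memchr(utf8, 'l', sizeof utf8 - 1));
    assert(hit == utf8 + 3);
    return 0;
}
```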
-
@Coises said in ASCII compatibility questions:
I got the impression somewhere — and now I cannot remember where — that Notepad++ converts certain character sets to UTF-8 on loading (and back again on saving).
I know for certain that Geany does that. You may be thinking of N++’s option to encode “opened ANSI files” as UTF-8 under Settings > New Document > Encoding, which seems to be enabled by default. There’s at least one open issue suggesting that 8-bit ANSI is what you get when that option is turned off.
@Coises said in ASCII compatibility questions:
What I wondered was if, in fact, it is known that by the time Notepad++ has loaded a file into Scintilla, the first or both of the properties I mentioned will always be true of the Scintilla text.
My impression is that “Scintilla text” is always a stream of “raw” 32-bit code points; in other words, the API treats every “character” as an int, never an 8-bit char. Of course the application has to encode the stream at some point; exactly when is hard to pin down. It’s probably much earlier than anything a plugin could detect. Asking Scintilla about a document’s target encoding through any of the querying APIs returns only general information, at least in my limited experience.
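In practice the most a plugin seems to get from Notepad++ itself is the buffer’s “Unicode mode”; a sketch, again assuming the standard plugin template:

```cpp
// Sketch: querying Notepad++ (rather than Scintilla) for the current buffer's
// encoding. The result is one of N++'s UniMode values (ANSI, UTF-8 with or
// without BOM, UCS-2 variants), not a concrete ANSI code page number.
// Assumes the standard plugin template (NppData filled in by setInfo()).
#include <windows.h>
#include "PluginInterface.h"    // NppData, NPPM_GETCURRENTBUFFERID, NPPM_GETBUFFERENCODING

extern NppData nppData;

static LRESULT currentBufferEncoding()
{
    LRESULT bufferId = ::SendMessage(nppData._nppHandle, NPPM_GETCURRENTBUFFERID, 0, 0);
    return ::SendMessage(nppData._nppHandle, NPPM_GETBUFFERENCODING,
                         static_cast<WPARAM>(bufferId), 0);
}
```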