Encoding of files with ASCII only

Karl Karlser · Jan 9, 2023, 8:07 AM

Hello,

when I open a text file that contains only ASCII characters, Notepad++ shows the encoding as UTF-8 w/o BOM.
If I add a non-ASCII character like §, it shows the encoding as ANSI, which is what I actually set for this file.

Is this normal, or is my file really in some kind of UTF-8 encoding?

Ekopalypse · Jan 9, 2023, 9:35 AM

@Karl-Karlser

ASCII is a subset of many encodings such as UTF-8, ANSI, etc., so there is no way to figure out which encoding was intended.
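
A quick way to see this, as a minimal Python sketch (the sample string is made up for illustration): pure-ASCII text produces the same bytes under every ASCII-compatible encoding, so the bytes on disk carry no hint of which one was meant.

```python
# Hypothetical sample text containing only ASCII characters.
text = "Hello, Notepad++!"

# Encode the same text with several ASCII-compatible encodings.
encoded = {name: text.encode(name) for name in ("ascii", "utf-8", "cp1252", "latin-1")}

# True when every encoding produced byte-for-byte identical output.
all_same = len(set(encoded.values())) == 1
print(all_same)
```

If the result is True for your own text, any of these encodings is an equally valid label for the file.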

rdipardo · Jan 9, 2023, 10:58 AM

“ASCII is a subset of many encodings such as UTF-8, ANSI, etc., so there is no way to figure out which encoding was intended”

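All you need is a hex viewer (*1). Strictly speaking, ASCII is a 7-bit encoding that covers only the bytes 0x00 to 0x7F, but “ASCII” is often used loosely for any single-byte encoding. In such an encoding, expect to see a 1:1 correspondence between characters and bytes.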

§ is included in many single-byte encodings, like Windows-1252, the default ANSI code page on many Windows PCs. Go to ? > Debug Info... and check the Current ANSI codepage. If the number is 1252, then § is a valid single-byte character in that code page (“ASCII” in the loose sense), although it is not part of 7-bit ASCII. Or just type this into a Python REPL:

print('§'.encode('cp1252'))

The output will be the single byte b'\xa7'.

If the file is truly UTF-8, then § (and only §) will occupy multiple bytes. At the Python REPL:

print('§'.encode('utf8'))
# => b'\xc2\xa7'

(*1) I used the HexEdit plugin.
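
The same byte-level check can be sketched in Python. This is only a rough illustration of the reasoning above, not Notepad++'s actual detection logic; the `classify` helper and its labels are hypothetical, and a BOM is not handled.

```python
def classify(data: bytes) -> str:
    """Rough guess at how a byte string could be interpreted (illustrative only)."""
    # Every byte below 0x80: valid in ASCII, UTF-8, and the ANSI code pages alike.
    if all(b < 0x80 for b in data):
        return "ascii-only (ambiguous)"
    try:
        # Bytes >= 0x80 must form well-formed multi-byte sequences to be UTF-8.
        data.decode("utf-8")
        return "valid utf-8"
    except UnicodeDecodeError:
        return "not utf-8 (likely a single-byte code page)"

print(classify(b"plain text"))
print(classify("§".encode("cp1252")))
print(classify("§".encode("utf-8")))
```

A lone 0xA7 byte is not well-formed UTF-8, which fits what the OP saw: once § is saved as a single byte, the file can no longer be read as UTF-8.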

Ekopalypse · Jan 9, 2023, 11:07 AM

@rdipardo

If I understood the question correctly, the OP implicitly asked whether there is a way to have the file reported as, in his case, ANSI when it contains only ASCII characters. Based on my previous statement, this is not possible. Even with a hex editor, the bytes look the same either way, so there is no way to tell whether I wanted to use the file as ANSI or as some other encoding that has ASCII as a subset. If I misunderstood the question, sorry.