Encoding of files with ASCII only
-
Hello,
when I open a text file that contains only ASCII characters, Notepad++ shows the encoding as UTF-8 without BOM.
If I add a non-ASCII character like §, it shows the encoding as ANSI, which is what I actually defined for this file. Is this normal, or is my file really some sort of UTF-8 encoding?
-
ASCII is a subset of many encodings such as utf8, ansi, etc., so there is no way to figure out which encoding was intended
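To make the subset point concrete, here is a small Python sketch. The sample string and variable names are only illustrative, and cp1252 stands in for "ANSI" on the assumption of a Western European Windows setup:

```python
# The same ASCII-only text encoded three ways.
text = "Hello, world"

encoded = {name: text.encode(name) for name in ("ascii", "utf-8", "cp1252")}

for name, data in encoded.items():
    # bytes.hex() with a separator needs Python 3.8 or later
    print(f"{name:8} {data.hex(' ')}")

# All three byte strings are identical, so the bytes alone carry
# no hint about which encoding the author had in mind.
all_identical = len(set(encoded.values())) == 1
print(all_identical)
```

Every line of the output shows the same byte sequence, which is why an editor has to fall back on a default guess for such a file.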
-
ASCII is a subset of many encodings such as utf8, ansi, etc., so there is no way to figure out which encoding was intended
All you need is a hex viewer (*1). “ASCII” is often used loosely for any single-byte encoding, so expect to see a 1:1 correspondence between characters and bytes.
§ is included in many single-byte encodings, like the default ANSI code page on Windows PCs. Go to ? > Debug Info... and check the Current ANSI codepage. If the number is 1252, then § is a valid “ASCII” character in that loose sense. Or just type this into a Python REPL:
print('§'.encode('cp1252'))
The output will be the single byte b'\xa7'.
If the file is truly UTF-8, then § (and only §) will occupy multiple bytes. Or, at the Python REPL:
print('§'.encode('utf8')) # => b'\xc2\xa7'
(*1) I used the HexEdit plugin.
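If no hex viewer is at hand, roughly the same comparison can be done in Python. This is only a sketch: `hex_dump` is a made-up helper name, and the sample string "a§b" is chosen just to show that the ASCII letters keep one byte each in both encodings.

```python
def hex_dump(text: str, encoding: str) -> str:
    """Return the bytes of `text` in `encoding` as space-separated hex."""
    return text.encode(encoding).hex(" ")  # separator needs Python 3.8+

sample = "a§b"
print(hex_dump(sample, "cp1252"))  # 61 a7 62     (3 characters, 3 bytes)
print(hex_dump(sample, "utf-8"))   # 61 c2 a7 62  (3 characters, 4 bytes)
```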
-
If I understood the question correctly, OP implicitly asked whether Notepad++ can report the file's encoding as, in his case, ANSI when the file contains only ASCII characters. Based on my previous statement, this is not possible. Even with a hex editor, there is no way to tell whether I meant the file to be ANSI or some other encoding that has ASCII as a subset. If I misunderstood the question, sorry.
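As a rough illustration of what the bytes can and cannot tell you, here is a Python sketch. `classify` is a hypothetical helper, not Notepad++'s actual detection logic, and its result strings are only descriptive:

```python
def classify(data: bytes) -> str:
    """Hypothetical helper: describe what the raw bytes alone reveal."""
    if data.isascii():  # every byte is below 0x80 (Python 3.7+)
        return "ASCII only: valid in UTF-8 and ANSI alike, intent unknown"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid UTF-8: probably a single-byte code page"
    return "valid UTF-8 with non-ASCII characters"

print(classify(b"Hello"))              # the ambiguous case from the question
print(classify("§".encode("cp1252")))  # the lone byte 0xA7 is not valid UTF-8
print(classify("§".encode("utf-8")))   # the two bytes 0xC2 0xA7
```

Only the first case is undecidable from the content, which matches what OP observed: the displayed encoding changed as soon as a non-ASCII character was added.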