Encoding says file is UTF8 but it's not
-
I’m using 7.8.5 (32-bit). The encoding info always says the file is UTF-8.
But if I check with the Linux file command, it isn’t:

# file aaa.sh
aaa.sh: Bourne-Again shell script, ASCII text executable

It seems like a bug.
-
Even after I’ve done “Convert to ANSI”, it still shows UTF-8.
Do other people see this problem with version 7.8.5?
I want to know if it happens only to me…
-
It is not clear what your goal is; I assume you want a file to be ANSI and for Notepad++ to show it as such.
You need some background about Unicode encodings.
UTF-8 extends US-ASCII in a backward-compatible fashion: every pure-ASCII file is also valid UTF-8, but NOT vice versa.
Rest assured that if your UTF-8 file only contains symbols in the range 0-127, then it is also ANSI.
Just make sure it does not have a BOM, in case it needs to be consumed by older applications that fail to skip it.

Basically, the term ANSI, in the context of text encoding, should stop being used. People should be educated that a requirement for “ANSI” is in fact a requirement for “UTF-8”.
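To see this concretely, here is a minimal Python sketch (cp1252 stands in for a typical Windows ANSI code page; the sample text is arbitrary): pure-ASCII text is byte-for-byte identical in both encodings, and the optional BOM is just three extra bytes at the front.

```python
# Pure ASCII: the "ANSI" and UTF-8 byte sequences are identical.
text = "#!/bin/bash\necho hello\n"   # only code points 0-127
assert text.encode("cp1252") == text.encode("utf-8")

# A UTF-8 BOM is just the three bytes EF BB BF prepended to the file;
# Python's "utf-8-sig" codec writes it for us.
with_bom = text.encode("utf-8-sig")
assert with_bom == b"\xef\xbb\xbf" + text.encode("utf-8")
```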
-
Hello, @asbyonejj, @gstavi and All,
Reminders :
- An ANSI encoded file is generally a file using one of the encodings from Windows-1250 to Windows-1258, each of which codes 256 characters, divided in two parts:

  - Characters with Unicode code-point between \x00 and \x7F (from 0 to 127), coded with 1 byte, which belong to the old US-ASCII encoding

  - Characters with Unicode code-point between \x80 and \xFF (from 128 to 255), which are specific to each one-byte encoding: for instance, the ISO-... encodings family, the OEM-... encodings family and the Windows-... encodings family, for the best-known (see the sketch right after this list)
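To make the “two parts” concrete, here is a small Python sketch (the code-page names are standard Python codec aliases, used as stand-ins for a few ANSI encodings): a byte in the high half maps to a different character in each code page, while the low half is plain US-ASCII everywhere.

```python
# One byte in the high half (\x80-\xFF) means different things
# depending on which ANSI code page it is decoded with.
raw = b"\xe9"
print(raw.decode("cp1252"))   # 'é' (Windows-1252, Western Europe)
print(raw.decode("cp1251"))   # 'й' (Windows-1251, Cyrillic)
print(raw.decode("cp1253"))   # 'ι' (Windows-1253, Greek)

# The low half (\x00-\x7F) is plain US-ASCII in all of them.
assert b"A".decode("cp1252") == b"A".decode("cp1251") == "A"
```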
-
For an almost exhaustive list of encodings, refer to the table at:
https://en.wikipedia.org/wiki/Windows-1252#External_links
- A UTF-8 encoded file, due to the clever UTF-8 encoding (Universal Character Set Transformation Format - 8 bits), can encode any of the 1,114,112 possible Unicode characters with:

  - 1 byte if the Unicode code-point of this character is between \x{0000} and \x{007F}

  - 2 bytes if the Unicode code-point of this character is between \x{0080} and \x{07FF}

  - 3 bytes if the Unicode code-point of this character is between \x{0800} and \x{FFFF}

  - 4 bytes if the Unicode code-point of this character is between \x{10000} and \x{10FFFF} (a quick check follows this list)
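A quick Python check of those byte counts, with one arbitrarily chosen character from each range:

```python
# Number of UTF-8 bytes per character, one sample from each range.
samples = {
    "A":  1,   # U+0041,  range \x{0000}-\x{007F}
    "é":  2,   # U+00E9,  range \x{0080}-\x{07FF}
    "€":  3,   # U+20AC,  range \x{0800}-\x{FFFF}
    "𝄞": 4,   # U+1D11E, range \x{10000}-\x{10FFFF}
}
for ch, expected in samples.items():
    encoded = ch.encode("utf-8")
    assert len(encoded) == expected
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded!r}")
```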
-
However, note that the latest Unicode release, v13.0, contains only 283,506 assigned characters, and there are still 830,606 non-assigned characters, reserved for future use!
https://www.unicode.org/versions/stats/charcountv13_0.html
You may have noticed that the encoding of the first 128 characters, between \x{0000} and \x{007F}, is identical in an ANSI and in a UTF-8 encoded file. So, if your file contains only characters with a Unicode code-point lower than \x{0080}, and your file is not UTF-8-BOM encoded, it is impossible for any editor, including Notepad++, to guess whether the user assumes an ANSI or a UTF-8 encoding. See:
http://www.unicode.org/charts/PDF/U0000.pdf
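This ambiguity is easy to demonstrate in Python (again with cp1252 standing in for the ANSI code page): the same pure-ASCII bytes decode successfully, and identically, both ways, so nothing in the file itself tells an editor which encoding was meant.

```python
# A pure-ASCII file decodes successfully, and identically, both ways,
# so an editor can only guess which encoding the user intended.
raw = b"#!/bin/bash\necho hello\n"
assert raw.decode("cp1252") == raw.decode("utf-8")
print("both decodings succeed and agree -> undecidable from content alone")
```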
Thus, in order to remove the ambiguity and help N++ decide on the right user encoding, you have two possibilities:

- Include in your file at least 1 character with a code-point above \x{007F} (see the sketch right after this list)

- UN-tick the option Apply to opened ANSI files in Settings > Preferences... > New Document > Encoding, if the UTF-8 option is ticked
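To see why the first possibility works, here is a sketch (the ‘é’ is just an example of a character above \x{007F}): once such a character is saved with a one-byte ANSI encoding, the resulting bytes are no longer valid UTF-8, so the ANSI interpretation can actually be detected.

```python
# One character above \x{007F} breaks the ambiguity.
text = "# note: café\necho hello\n"

ansi_bytes = text.encode("cp1252")   # 'é' -> single byte E9
utf8_bytes = text.encode("utf-8")    # 'é' -> two bytes C3 A9
assert ansi_bytes != utf8_bytes      # the saved files now differ

# The ANSI bytes are not valid UTF-8, so a detector can rule UTF-8 out.
try:
    ansi_bytes.decode("utf-8")
except UnicodeDecodeError:
    print("invalid as UTF-8 -> must be an ANSI code page")
```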
This way, after saving your file with the ANSI encoding (Encoding > Convert to ANSI), it will now keep this ANSI encoding for future N++ startups ;-))

Best Regards,

guy038
-
@guy038 said in Encoding says file is UTF8 but it's not:
UN-tick the option Apply to opened ANSI files in Settings > Preferences… > New Document > Encoding, if the UTF-8 option is ticked
That was exactly my case. I didn’t know about this option. Thank you.