Encoding says file is UTF8 but it's not
-
I’m using 7.8.5 (32-bit). The encoding info always says the file is UTF8.
But if I check with the Linux file command, it isn’t:
# file aaa.sh
aaa.sh: Bourne-Again shell script, ASCII text executable
It seems to be a bug.
-
Even after I’ve done “Convert to ANSI”, it still shows UTF8.
Do other people see this problem with version 7.8.5?
I want to know if this happens only to me…
-
It is not clear what your goal is. I assume you want the file to be ANSI and for Notepad++ to show it as such.
You need some background about Unicode encoding.
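UTF8 extends ASCII, the 0-127 range that the ANSI code pages share, in a backward compatible fashion.
Every file limited to that range is valid both as ANSI and as UTF8, but a UTF8 file with other characters is NOT a valid ANSI file.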
Rest assured that if your UTF8 file only contains symbols in the range 0-127, then it is also ANSI.
Just make sure it does not have a BOM, in case it needs to be consumed by older applications that fail to skip it.
Basically, the term ANSI, in the context of text encoding, should stop being used. People should be educated that a requirement for “ANSI” is in fact a requirement for “UTF8”.
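A minimal Python sketch of the two points above, assuming Windows-1252 (Python codec cp1252) stands in for "ANSI". The sample text and the helper name strip_utf8_bom are only illustrative:

```python
import codecs

# Hypothetical sample: a shell script containing only characters in the 0-127 range.
text = "#!/bin/bash\necho hello\n"

ansi_bytes = text.encode("cp1252")   # assumption: "ANSI" means Windows-1252 here
utf8_bytes = text.encode("utf-8")

# For pure 0-127 content, both encodings produce exactly the same bytes.
print(ansi_bytes == utf8_bytes)

# A UTF-8 BOM is the 3-byte prefix EF BB BF; older tools may not skip it.
with_bom = codecs.BOM_UTF8 + utf8_bytes


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


print(strip_utf8_bom(with_bom) == utf8_bytes)
```

Without a BOM, nothing in the bytes themselves distinguishes the two encodings for such a file.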
-
Hello, @asbyonejj, @gstavi and All,
Reminders :
-
An ANSI encoded file is generally a file with an encoding from Windows-1250 to Windows-1258, which codes 256 characters, divided in two parts :
  - Characters with Unicode code-point between \x00 and \x7F ( from 0 to 127 ), coded with 1 byte, which belong to the old US-ASCII encoding
  - Characters coded with a byte value between \x80 and \xFF ( from 128 to 255 ), which are specific to each one-byte encoding, like, for instance, the ISO-... encodings family, the OEM-... encodings family and the Windows-... encodings family, for the best-known
-
For an almost exhaustive list of encodings, refer to the table, at :
https://en.wikipedia.org/wiki/Windows-1252#External_links
-
A UTF-8 encoded file, thanks to the clever UTF-8 encoding ( Universal Character Set Transformation Format - 8 bits ), can encode any of the 1,114,112 possible Unicode code-points with :
  - 1 byte if the Unicode code-point of the character is between \x{0000} and \x{007F}
  - 2 bytes if the Unicode code-point of the character is between \x{0080} and \x{07FF}
  - 3 bytes if the Unicode code-point of the character is between \x{0800} and \x{FFFF}
  - 4 bytes if the Unicode code-point of the character is between \x{10000} and \x{10FFFF}
-
However, note that the latest Unicode v13.0 release contains only 283,506 assigned code-points and there are still 830,606 non-assigned code-points, so reserved for future use ! See :
https://www.unicode.org/versions/stats/charcountv13_0.html
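The reminders above can be checked with a short Python sketch. The sample characters and the choice of cp1252 / cp1251 (Python's names for Windows-1252 and Windows-1251) are assumptions made only for illustration:

```python
# One sample character per UTF-8 length class (illustrative choices).
samples = {
    "A": 1,            # U+0041, in \x{0000}-\x{007F}
    "\u00e9": 2,       # U+00E9, in \x{0080}-\x{07FF}
    "\u20ac": 3,       # U+20AC, in \x{0800}-\x{FFFF}
    "\U0001F600": 4,   # U+1F600, in \x{10000}-\x{10FFFF}
}

utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, expected in samples.items():
    print(f"U+{ord(ch):04X} -> {utf8_lengths[ch]} byte(s), expected {expected}")

# The same single byte above \x7F means different things in different ANSI code pages.
byte = b"\xe9"
as_1252 = byte.decode("cp1252")   # Western European code page
as_1251 = byte.decode("cp1251")   # Cyrillic code page
print(repr(as_1252), repr(as_1251))

# Number of Unicode code-points: \x{0000} to \x{10FFFF} inclusive.
total_code_points = 0x10FFFF + 1
print(total_code_points, 283_506 + 830_606)
```

The byte \xE9 alone is also not valid UTF-8, which is why a file containing it can be told apart from a UTF-8 file.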
You may have noticed that the encoding of the first 128 characters, between \x{0000} and \x{007F}, is exactly identical in an ANSI and in a UTF-8 encoded file.
So, if your file contains only characters with Unicode code-point lower than \x{0080} and your file is not UTF-8-BOM encoded, it’s impossible for any editor, including Notepad++, to guess whether the user assumes an ANSI or a UTF-8 encoding. See :
http://www.unicode.org/charts/PDF/U0000.pdf
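To illustrate this ambiguity, here is a rough Python sketch of how an editor might guess an encoding from raw bytes. It is an assumption about the general approach, not Notepad++'s actual detection code, and the function name guess_encoding is hypothetical:

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Classify raw bytes; a simplified heuristic, not Notepad++'s real logic."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8-BOM"
    if all(b < 0x80 for b in data):
        # Only 0-127 bytes: valid as ANSI *and* as UTF-8, so the choice
        # can only come from a user preference.
        return "ambiguous (ANSI or UTF-8)"
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        return "ANSI"


print(guess_encoding(b"echo hello\n"))
print(guess_encoding("caf\u00e9\n".encode("utf-8")))
print(guess_encoding("caf\u00e9\n".encode("cp1252")))
print(guess_encoding(codecs.BOM_UTF8 + b"echo hello\n"))
```

For the first input, no byte-level test can settle the question; only a BOM or a character above \x{007F} gives the heuristic something to work with.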
Thus, in order to remove the ambiguity and help N++ decide on the right encoding, you have two possibilities :
-
Include in your file at least 1 character with code-point above \x{007F}
-
UN-tick the option Apply to opened ANSI files in Settings > Preferences... > New Document > Encoding, if the UTF-8 option is ticked
This way, after saving your file with the ANSI encoding ( Encoding > Convert to ANSI ), it will now keep this ANSI encoding for future N++ startups ;-))
Best Regards,
guy038
-
@guy038 said in Encoding says file is UTF8 but it's not:
UN-tick the option Apply to opened ANSI files in Settings > Preferences… > New Document > Encoding, if the UTF-8 option is ticked
This was my case. I didn’t know this option. Thank you.