Encoding says file is UTF8 but it's not

asbyonejj

I’m using 7.8.5 (32bit). Encoding info always says file is utf8.
But if I check using Linux file command, it isn’t.

# file aaa.sh
aaa.sh: Bourne-Again shell script, ASCII text executable

It seems to be a bug.

asbyonejj

Even after I do “Convert to ANSI”, it still says UTF8.
Does anyone else see this problem with version 7.8.5?
I want to know whether this happens only to me…

gstavi

It is not clear what your goal is. I assume you want a file to be ANSI and for Notepad++ to show it as such.

You need some background about unicode encoding.

UTF8 extends the 7-bit ASCII part of ANSI (symbols 0-127) in a backward compatible fashion.
Every ANSI file that uses only those symbols is also UTF8 by definition, but NOT vice versa.
Rest assured that if your UTF8 file only contains symbols in the range of 0-127 then it is also ANSI.
Just make sure it does not have a BOM in case it needs to be consumed by older applications that fail to skip it.
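
To illustrate, here is a minimal Python sketch. It assumes that “ANSI” means Windows-1252 (`cp1252`), although the real ANSI code page depends on the Windows locale. It shows that text limited to the range 0-127 produces the same bytes in both encodings, and that a BOM adds three extra bytes in front.

```python
# Assumption: "ANSI" is represented here by Windows-1252 (cp1252);
# the actual ANSI code page depends on the Windows locale.
text = "#!/bin/bash\necho hello\n"   # only symbols in the range 0-127

as_utf8 = text.encode("utf-8")
as_ansi = text.encode("cp1252")
same_bytes = as_utf8 == as_ansi      # the two encodings agree on 0-127

# "utf-8-sig" is Python's codec for UTF-8 with a BOM: it prepends EF BB BF.
with_bom = text.encode("utf-8-sig")
bom_prefix = with_bom[:3]

print(same_bytes)                    # True
print(bom_prefix.hex())              # efbbbf
print(with_bom[3:] == as_utf8)       # True
```

An application that does not skip the BOM would see those three bytes as garbage at the start of the file, which matters for things like the `#!` line of a shell script.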

Basically the term ANSI, in the context of text encoding, should stop being used. People should be educated that a requirement for “ANSI” is in fact a requirement for “UTF8”.

guy038

Hello, @asbyonejj, @gstavi and All,

Reminders :

An ANSI encoded file is generally a file that uses one of the Windows encodings, from Windows-1250 to Windows-1258, and codes up to 256 characters, divided in two parts :
- Characters with byte value between \x00 and \x7F ( from 0 to 127 ), coded with 1 byte, which belong to the old US-ASCII encoding
- Characters with byte value between \x80 and \xFF ( from 128 to 255 ), which are specific to each one-byte encoding, like, for instance, the ISO-... encodings family, the OEM-.... encodings family and the Windows-.... encodings family, for the best-known
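
As a small illustration of the second part, the Python sketch below decodes the same two bytes with several Windows code pages. The byte \xE9 and the four code pages are arbitrary examples chosen for this sketch.

```python
# The same byte above \x7F maps to a different character depending on
# which one-byte code page is used to decode it.
raw = bytes([0x41, 0xE9])   # 'A' (US-ASCII part) followed by byte \xE9

decoded = {cp: raw.decode(cp) for cp in ("cp1250", "cp1251", "cp1252", "cp1253")}

for cp, s in decoded.items():
    # Print code points rather than glyphs, so the output does not depend
    # on the console encoding.
    print(cp, [f"U+{ord(ch):04X}" for ch in s])
```

The first byte always decodes to U+0041, whereas \xE9 gives U+00E9 with Windows-1250 and Windows-1252, U+0439 with Windows-1251 and U+03B9 with Windows-1253.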

For an almost exhaustive list of encodings, refer to the table, at :

https://en.wikipedia.org/wiki/Windows-1252#External_links

A UTF-8 encoded file, due to the clever UTF-8 encoding ( Universal Character Set Transformation Format - 8 bits ), can encode any of the 1,114,112 possible Unicode code-points with :
- 1 byte if the Unicode code-point of this character is between \x{0000} and \x{007F}
- 2 bytes if the Unicode code-point of this character is between \x{0080} and \x{07FF}
- 3 bytes if the Unicode code-point of this character is between \x{0800} and \x{FFFF}
- 4 bytes if the Unicode code-point of this character is between \x{10000} and \x{10FFFF}
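
These byte counts can be checked with a short Python sketch, using the boundary code-points of each range as samples :

```python
# Boundary code-points of the four UTF-8 length ranges.
samples = [0x0000, 0x007F, 0x0080, 0x07FF, 0x0800, 0xFFFF, 0x10000, 0x10FFFF]

# Number of bytes produced when each code-point is encoded as UTF-8.
lengths = {cp: len(chr(cp).encode("utf-8")) for cp in samples}

for cp, n in lengths.items():
    print(f"\\x{{{cp:04X}}} -> {n} byte(s)")
```

The surrogate code-points, from \x{D800} to \x{DFFF}, are left out of the samples on purpose, because Python refuses to encode them as UTF-8.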

However, note that the latest Unicode v13.0 release contains only 283,506 assigned characters and there still are 830,606 unassigned characters, so reserved for future use !

https://www.unicode.org/versions/stats/charcountv13_0.html
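
The arithmetic behind these figures can be sketched in Python as follows. The total comes from the 17 Unicode planes of 65,536 code-points each, and the assigned count is simply the figure quoted above, not something the code verifies.

```python
PLANES = 17                       # planes 0 to 16
CODE_POINTS_PER_PLANE = 0x10000   # 65,536 code-points per plane

total = PLANES * CODE_POINTS_PER_PLANE   # same as \x{10FFFF} + 1
assigned = 283_506                       # figure quoted in this post for v13.0
unassigned = total - assigned

print(f"{total:,} {unassigned:,}")       # 1,114,112 830,606
```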

You may have noticed that the encoding of the first 128 characters, between \x{0000} and \x{007F}, is identical in an ANSI and in a UTF-8 encoded file.

So, if your file contains only characters with Unicode code-point lower than \x{0080} and is not UTF-8-BOM encoded, it’s impossible for any editor, including Notepad++, to guess whether the user assumes an ANSI or a UTF-8 encoding. See :

http://www.unicode.org/charts/PDF/U0000.pdf
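
To make this ambiguity concrete, here is a rough Python sketch of a byte-level guess. The function name `classify` and its labels are hypothetical, and this is NOT the detection algorithm that Notepad++ really uses. It only shows why a file with bytes below \x80 and no BOM gives no clue.

```python
UTF8_BOM = b"\xef\xbb\xbf"

def classify(data: bytes) -> str:
    """Hypothetical helper: guess an encoding from raw bytes only."""
    if data.startswith(UTF8_BOM):
        return "UTF-8-BOM"
    if all(b < 0x80 for b in data):
        # Valid as US-ASCII, as ANSI and as UTF-8: nothing to decide on.
        return "ambiguous (ASCII only)"
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        # Bytes above \x7F that are not valid UTF-8 sequences.
        return "ANSI (or another one-byte encoding)"

print(classify(b"#!/bin/bash\necho hello\n"))   # ambiguous (ASCII only)
print(classify("caf\u00e9".encode("utf-8")))    # UTF-8
print(classify("caf\u00e9".encode("cp1252")))   # ANSI (or another one-byte encoding)
print(classify(UTF8_BOM + b"abc"))              # UTF-8-BOM
```

In the ambiguous case, an editor can only fall back on a user preference, which is what the option mentioned in this post controls.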

Thus, in order to remove the ambiguity and help N++ to decide on the right user-encoding, you have two possibilities :

- Include in your file, at least, 1 character with code-point above \x{007F}
- UN-tick the option Apply to opened ANSI files in Settings > Preferences... > New Document > Encoding, if the UTF-8 option is ticked

This way, after saving your file with the ANSI encoding ( Encoding > Convert to ANSI ), it will now keep this ANSI encoding for future N++ startups ;-))

Best Regards,

guy038

asbyonejj

@guy038 said in Encoding says file is UTF8 but it's not:

UN-tick the option Apply to opened ANSI files in Settings > Preferences… > New Document > Encoding, if the UTF-8 option is ticked

This was my case. I didn’t know about this option. Thank you.