Convert to ANSI
-
When I open an HTML file, Notepad++ shows UTF-8 w/o BOM. I click Encoding in the menu bar, then Convert to ANSI, and the file is converted to ANSI, as indicated in the bottom right. Then I save the file, close it, and reopen it, and it opens in UTF-8 w/o BOM again. I want the HTML file to always be ANSI.
-
Try Settings > Preferences > New Document and uncheck Apply to opened ANSI files under the UTF-8 radio button in the Encoding section.
-
That didn’t work. The file still comes up as UTF-8 w/o BOM.
My file has an .html extension, and that is what is causing Notepad++ to always show UTF-8 w/o BOM. If I open a new text file, it shows ANSI; if I save my HTML file as a text file, it shows ANSI; but as long as the file has an .html extension, Notepad++ will always show UTF-8 w/o BOM.
-
Hello, Michael Liddiard,
Not 100% sure, but rather than trying the menu option Encoding > Encode in ANSI, here is a method that should work!
- Anywhere in your HTML file, add a comment line that contains at least one character with a Unicode code point higher than \x007F. Let’s say // € (just the Euro sign, whose Unicode code point is \x20AC)
- Select the option Encoding > Convert to ANSI
- Save your HTML file
- Close and restart Notepad++
=> From now on, your HTML file should always be opened with the ANSI encoding
If it works, I’ll explain next time the fundamental differences between the Encode and Convert actions
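In the meantime, here is a rough Python sketch of why the Euro-sign trick helps. It assumes Windows-1252 as the ANSI code page (what ANSI means depends on your system locale):

    # The Euro sign has different byte representations in the two encodings
    text = "// €"

    ansi_bytes = text.encode("cp1252")   # € -> the single byte 0x80
    utf8_bytes = text.encode("utf-8")    # € -> three bytes 0xE2 0x82 0xAC

    print(ansi_bytes.hex())              # 2f2f2080
    print(utf8_bytes.hex())              # 2f2f20e282ac

    # A lone 0x80 byte is not valid UTF-8, so any detector that tries
    # UTF-8 first must reject the ANSI bytes and fall back to ANSI
    try:
        ansi_bytes.decode("utf-8")
    except UnicodeDecodeError:
        print("not valid UTF-8 -> must be ANSI")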
Best Regards,
guy038
-
@guy038 It worked for me. What is going on? I’ve always used ISO-8859-1 files; has it now become impossible without a hack?
-
Here is a short quiz: If I give you a file that contains nothing but the ASCII code for the letter “A” (65 or 0x41), is it an ANSI file or a UTF-8 file without a BOM?
The correct answer is “I don’t know”, and I would also accept “Either”.
Unless a UTF-8-without-BOM file contains a non-ANSI character, there is no way to tell it apart from an ANSI file. If you really want to be able to tell the difference no matter what the file contains, you need a BOM in your UTF-8 file.
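To see this concretely, here is a quick Python sketch. It takes Windows-1252 as the ANSI code page, which is an assumption (what ANSI means depends on the system locale):

    data = b"A"  # a "file" containing only the letter A (0x41)

    # Both decodings succeed and produce the same text, so the bytes
    # alone cannot tell you which encoding the author intended
    print(data.decode("cp1252"))  # A
    print(data.decode("utf-8"))   # A

    # A BOM removes the ambiguity: a UTF-8 file may start with EF BB BF
    with_bom = b"\xef\xbb\xbfA"
    print(with_bom.startswith(b"\xef\xbb\xbf"))  # True -> UTF-8 with BOM
    print(with_bom.decode("utf-8-sig"))          # A (the BOM is stripped)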
-
@Jim-Dailey Look, I don’t know about the technicalities; I only know what I’ve been doing for years. I’m pretty sure what you said is not how it worked before, or it’s been a damn bloody huge coincidence that I never got a wrong charset in about 10 years of using Notepad++ and ISO-8859-1 before 2016. In all the years before, I ALWAYS relied on the fact that Notepad++ knew the right charset I saved or opened the file with. How it knew, I have no idea.
If I had a file with only the letter A, or any ASCII characters for that matter, I wouldn’t care about charsets. Because I write in Portuguese (and not in Cyrillic or Japanese, for example), the catch is that accented characters (which are non-ASCII) are commonly used and are present in both UTF-8 and ISO-8859-1; if the app gets the charset wrong, the characters are all scrambled.
More to the point, if I TELL the app to use one, it shouldn’t err. You’re saying the only way to know is by having special characters, but then the files have to be essentially different in more than that; otherwise non-BOM UTF-8 and ISO-8859-1 files wouldn’t have different sizes.
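Here is what that scrambling looks like, reproduced in a few lines of Python with the Portuguese word “ação” as an example:

    word = "ação"
    utf8_bytes = word.encode("utf-8")         # b'a\xc3\xa7\xc3\xa3o'
    latin1_bytes = word.encode("iso-8859-1")  # b'a\xe7\xe3o'

    # Reading UTF-8 bytes with the wrong (ISO-8859-1) charset: mojibake
    print(utf8_bytes.decode("iso-8859-1"))    # aÃ§Ã£o

    # Reading ISO-8859-1 bytes as UTF-8 usually fails outright instead,
    # because the stray 0xE7 and 0xE3 bytes are incomplete UTF-8 sequences
    try:
        latin1_bytes.decode("utf-8")
    except UnicodeDecodeError:
        print("invalid UTF-8")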
-
I think part of what is at play here is that UTF-8 is the default encoding for an HTML5 file.
The characters that make up a “text” file are code points. A character encoding scheme maps the code points it understands into numbers that are stored in the file.
UTF-8 and ISO-8859-1 are two different character encoding schemes. A file can technically be encoded in one or the other (or some other scheme entirely) but not in both.
However, when a file contains only code points that are encoded identically by two or more encoding schemes, then unless there is some special metadata in the file to indicate which encoding scheme is being used, the “proper” scheme is not knowable from the file’s contents.
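A toy detector in that spirit might look like this in Python (illustrative only; this is not Notepad++’s actual detection algorithm, and it again assumes Windows-1252 for ANSI):

    def guess_encoding(data: bytes) -> str:
        # Illustrative heuristic, not Notepad++'s real detection logic
        if data.startswith(b"\xef\xbb\xbf"):
            return "UTF-8 with BOM"
        try:
            data.decode("utf-8")
        except UnicodeDecodeError:
            return "ANSI"
        # Valid UTF-8; if it is pure ASCII, the answer is genuinely ambiguous
        return "UTF-8 w/o BOM (or ANSI, if pure ASCII)"

    print(guess_encoding(b"A"))                  # ambiguous
    print(guess_encoding("€".encode("cp1252")))  # ANSI
    print(guess_encoding("€".encode("utf-8")))   # UTF-8 w/o BOM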
UTF-8 is a variable-length encoding (the number of bytes a code point maps to varies); ISO-8859-1 is a fixed-length encoding (every code point maps to exactly one byte).
Since ISO-8859-1 represents each code point in one byte, it can only encode 256 code points: the first 256 code points of the Unicode character set. UTF-8 can encode every code point of the Unicode character set.
Code points 0-127 are encoded identically by the UTF-8 and ISO-8859-1 schemes. Code points 128-255 differ: they become a two-byte sequence in UTF-8, whereas they are single bytes in ISO-8859-1.
So, if an ISO-8859-1-encoded file contains any code points from 128-255, it will be a different size (smaller, in fact) than a UTF-8-encoded file containing the same code points.
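For example, in Python, using a Portuguese word whose ç and ã fall in the 128-255 range:

    word = "coração"

    print(len(word.encode("iso-8859-1")))  # 7 bytes: one byte per code point
    print(len(word.encode("utf-8")))       # 9 bytes: ç and ã take two bytes each

    # Code points 0-127 encode to identical single bytes in both schemes
    assert "ASCII only".encode("utf-8") == "ASCII only".encode("iso-8859-1")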