NP++ 6.8.8 file encode type does not stay as UTF8

Hsiang Pan

I convert the file to UTF8 from the Encoding menu, but it only lasts for the current session.

After I close the file and re-open it, the file encoding changes back to ANSI.

Claudia Frank

Hello Hsiang-Pan,

menu Settings->Preferences->New Document

check UTF-8 radio box.

Cheers
Claudia

Hsiang Pan

Hello Claudia,

Thank you for replying. I think you misunderstood my question.

My issue is not related to opening a new document from the program itself.

I see an issue.

When I convert the document to UTF8 and save it, the next time I open the same file the encoding is still set to ANSI.

I am sure it was fine in version 5.x.

Claudia Frank

Hello Hsiang Pan,

I can only guess what npp does under the hood, but what I have discovered so far is that,
if the content of the file can be either ASCII or UTF-8 (both share the same characters from
0-127), npp seems to use the default from the settings. Therefore this setting
would allow you to open existing as well as new documents with UTF-8 encoding.

Cheers
Claudia

guy038

Hello Hsiang Pan,

Well, I understood what’s happened and Claudia guessed it, too !

Could you confirm that, in Settings - Preferences… - New Document - Encoding, your default choice, for a new document, is the ANSI option ?

In that specific case, if, in addition, your file does NOT contain any character with Unicode code-point >\x{7F}, an encoding action in UTF-8, for N++ versions > 6.8.0 ( or UTF-8 without BOM, for N++ versions < 6.8.1 ) seems to be ignored and your file keeps its previous ANSI encoding, on next opening of N++ !

Why that weird behaviour ? If you think about it, it’s quite logical !

When a file contains ONLY characters with code-point < 128 ( \x{80} ), each character needs a SINGLE byte and your file is EXACTLY the same, byte for byte, in the ANSI and UTF-8 ( without BOM ) encodings.

For details, just follow the link below :

https://en.wikipedia.org/wiki/ASCII

So, even with a hexadecimal editor, nobody could ever guess the exact encoding of such a file :-((
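
To make this concrete, here is a minimal Python sketch. It assumes that "ANSI" means the Windows-1252 code page (`cp1252`), which is only one possible ANSI code page; yours may differ. It shows that a text made only of code-points below 128 yields the same bytes in both encodings, whereas a single accented character is enough to make them diverge.

```python
# Assumption: "ANSI" is modelled as Windows-1252 (cp1252); other code pages exist.
text = "Hello, Notepad++ 6.8.8"  # only code-points below 128

ansi_bytes = text.encode("cp1252")
utf8_bytes = text.encode("utf-8")  # UTF-8 without BOM

# Byte-for-byte identical, so nothing in the file says which encoding was meant.
print(ansi_bytes == utf8_bytes)  # True

# With a character above \x{7F}, the two encodings produce different bytes.
accent_ansi = "é".encode("cp1252")
accent_utf8 = "é".encode("utf-8")
print(accent_ansi.hex(), accent_utf8.hex())
```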

If you had chosen to encode your file in UTF-8 BOM, UCS-2 BE BOM, or UCS-2 LE BOM, your file would have kept this new encoding, on next N++ opening. Indeed, due to the BOM ( Byte Order Mark ), 2 or 3 invisible bytes, at the very beginning of a file, these three encodings are clearly identified !
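
As an illustration of how a BOM identifies these three encodings, here is a small Python sketch. The function name `detect_bom` is hypothetical, and this is not Notepad++'s actual code; it only checks the leading bytes against the BOM constants from Python's standard `codecs` module.

```python
import codecs

# The three BOMs mentioned above, taken from Python's standard codecs module.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8 BOM"),         # EF BB BF (3 bytes)
    (codecs.BOM_UTF16_BE, "UCS-2 BE BOM"),  # FE FF    (2 bytes)
    (codecs.BOM_UTF16_LE, "UCS-2 LE BOM"),  # FF FE    (2 bytes)
]

def detect_bom(data: bytes):
    """Return the encoding name signalled by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(detect_bom("abc".encode("utf-8-sig")))                        # UTF-8 BOM
print(detect_bom(codecs.BOM_UTF16_LE + "abc".encode("utf-16-le")))  # UCS-2 LE BOM
print(detect_bom(b"abc"))                                           # None
```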

On the other hand, if your file had contained at least ONE character with code-point >\x{7F}, then, once your file is encoded in UTF-8, all the characters with code-point > 127 are coded with 2, 3 or 4 bytes, which helps to correctly identify a UTF-8 encoding, despite the absence of the BOM. So Notepad++ “accepts” this new UTF-8 encoding !

But, if all the code-points of the characters, of your file, are <\x{80}, Notepad++ doesn’t take your UTF-8 request into account and keeps the ANSI encoding, since it’s, also, your default encoding for a new document :-)
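
The idea can be sketched in Python as follows. The helper `looks_like_utf8` is a hypothetical name and only a rough model of such a detection, not the routine Notepad++ really uses; `cp1252` again stands in for ANSI as an assumption.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if data has at least one byte >= 0x80 and is valid strict UTF-8."""
    if all(b < 0x80 for b in data):
        return False  # pure ASCII: no evidence either way
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Characters above \x{7F} take 2, 3 or 4 bytes in UTF-8.
for ch in ("é", "€", "😀"):
    print(ch, len(ch.encode("utf-8")))

print(looks_like_utf8("café".encode("utf-8")))   # True
print(looks_like_utf8("café".encode("cp1252")))  # False: a lone E9 byte is not valid UTF-8
print(looks_like_utf8(b"cafe"))                  # False: nothing to go on
```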

Best Regards,

guy038

Hsiang Pan

Thank you, guys, for clearing up my issue.

Now I understand the logic: since my file does not contain any UTF8 text, it will stay as ANSI the next time I open it.

If it contains at least one UTF8 character, it will open as UTF8 next time.

I tried it, it works.

BTW I do set Settings - Preferences - New Document - Encoding to UTF8.

guy038

Hi, Hsiang Pan, Claudia and All,

These last days, I tried to go into these questions more deeply, and I tested numerous encoding cases !

Below, I summed up, in a table, the resulting encoding of a file, on re-opening Notepad++ ( without any encoding or converting action, before closing ), depending on :

The present file encoding ( 3 Rows )
The present default encoding, for a new document ( 2 Columns )

                                ===============================================
                                ║        DEFAULT NEW document ENCODING        ║
                                ║                      in                     ║
                                ║   Settings - Preferences... - New Document  ║
                                ║----------------------------•----------------║
                                ║  ANSI     UTF-8 without □  |                ║
                                ║                            |                ║
                                ║  UTF-8 BOM   UCS-2 BE BOM  |  UTF-8 with □  ║
                                ║                            |                ║
                                ║  UCS-2 LE BOM    Other     |                ║
    ============================║============================•================║
    ║  PRESENT   |  ANSI        ║            ANSI            |  UTF-8 / ANSI  ║
    ║            |              ║                            |                ║
    ║   File     |  UTF-8       ║        ANSI / UTF-8        |      UTF-8     ║
    ║            |              ║                            |                ║
    ║  Encoding  |  Encoding Z  ║        Encoding Z          |   Encoding Z   ║
    ===========================================================================

NOTES :

The mention with □, means that the option Apply to opened ANSI files is checked
The mention without □, means that the option Apply to opened ANSI files is UNchecked
The mention Other, represents ANY encoding, from the list Character sets / … / …
The syntax xxx / yyy means that, in re-opening N++, the resulting file encoding is :
- xxx, if ALL the characters, of a file, have an Unicode code-point < \x{80}
- yyy, if, at LEAST, ONE character, of the file, has an Unicode code-point > \x{7F}
The mention Encoding Z represents the present file encoding, which may be :
- The UTF-8 BOM encoding
- The UCS-2 BE BOM encoding
- The UCS-2 LE BOM encoding
- A specific encoding, from the list Character sets / … / …

So, from this table, we can conclude that, after re-opening N++, the ENCODING of a file has CHANGED, in TWO cases, only :

FIRST case :
- The previous file encoding was ANSI
- The new default encoding is UTF-8, with the Apply to opened ANSI files option checked
- ALL the characters, of the file, have an Unicode code-point < \x{80}
SECOND case :
- The previous file encoding was UTF-8
- The new default encoding is anything OTHER than UTF-8 with the Apply to opened ANSI files option checked
- ALL the characters, of the file, have an Unicode code-point < \x{80}
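
The first table can also be written down as a small lookup function. This Python sketch only transcribes the observed results above; the function name and its parameters are hypothetical, and it makes no claim about how Notepad++ is implemented internally.

```python
def encoding_on_reopen(file_encoding: str,
                       default_is_utf8_applied: bool,
                       ascii_only: bool) -> str:
    """Model of the first table (observed v6.8.8 behaviour, no action before closing).

    file_encoding: "ANSI", "UTF-8" (without BOM), or any other name ("Encoding Z").
    default_is_utf8_applied: True when the default new-document encoding is UTF-8
        with the "Apply to opened ANSI files" option checked.
    ascii_only: True when ALL characters have a code-point < \\x{80}.
    """
    if file_encoding not in ("ANSI", "UTF-8"):
        return file_encoding  # BOM and character-set encodings are kept
    if not ascii_only:
        return file_encoding  # a character > \x{7F} pins down the encoding
    # Pure ASCII content: the default new-document setting decides.
    return "UTF-8" if default_is_utf8_applied else "ANSI"

# The two cases where the encoding changes on re-opening:
print(encoding_on_reopen("ANSI", True, True))    # UTF-8
print(encoding_on_reopen("UTF-8", False, True))  # ANSI
```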

We can, also, deduce that the easiest solution, to preserve the current encoding of a file, is to use, for the DEFAULT NEW document encoding :

The ANSI encoding, if you want your file to have the ANSI encoding, in any case
The UTF-8 encoding, with the Apply to opened ANSI files option CHECKED, if you want your file to have the UTF-8 encoding, in any case
Any other encoding, if you want your file to have that corresponding encoding, in any case

Now, if we consider an encoding or converting action, before closing and re-starting Notepad++, we get the new table, below, with the resulting encoding, of a file, depending on :

The present default encoding for a new document ( Column 1 )
The present file encoding ( Column 2 )
The encoding action ( Columns 3, 4 or 5 ) or the converting action ( Columns 6, 7 or 8 )

    ==============================================================================================================================
    ║ DEFAULT NEW       ║   PRESENT    ║                 ENCODE in                  ║                 CONVERT to                 ║
    ║    Document       ║     File     ║--------------•--------------•--------------║--------------•--------------•--------------║
    ║       ENCODING    ║   ENCODING   ║     ANSI     |    UTF-8     |  Encoding X  ║     ANSI     |    UTF-8     | Converting Y ║
    ║===================║==============║==============•==============•==============║==============•==============•==============║
    ║      ANSI         ║     ANSI     ║     ANSI     |     ANSI     |              ║     ANSI     | ANSI / UTF-8 |              ║
    ║ UTF-8   without □ ║--------------║--------------•--------------|              ║--------------•--------------|              ║
    ║ UTF-8 BOM         ║ UTF-8        ║              |              |              ║              |              |              ║
    ║                   ║ UTF-8 BOM    ║              |              |              ║              |              |              ║
    ║ UCS-2 BE BOM      ║              ║ ANSI / UTF-8 | ANSI / UTF-8 |              ║     ANSI     | ANSI / UTF-8 |              ║
    ║ UCS-2 LE BOM      ║ UCS-2 BE BOM ║              |              |              ║              |              |              ║
    ║                   ║ UCS-2 LE BOM ║              |              |              ║              |              |              ║
    ║ OTHER Encoding    ║--------------║--------------•--------------|              ║--------------•--------------|              ║
    ║                   ║  Encoding Z  ║     ANSI     |     ANSI     |              ║  Encoding Z  | ANSI / UTF-8 |              ║
    ║===================║==============║=============================|  Encoding X  ║=============================| Converting Y ║
    ║                   ║     ANSI     ║ UTF-8 / ANSI | UTF-8 / ANSI |              ║ UTF-8 / ANSI |    UTF-8     |              ║
    ║                   ║--------------║--------------•--------------|              ║--------------•--------------|              ║
    ║                   ║ UTF-8        ║              |              |              ║              |              |              ║
    ║                   ║ UTF-8 BOM    ║              |              |              ║              |              |              ║
    ║  UTF-8    with □  ║              ║    UTF-8     |    UTF-8     |              ║ UTF-8 / ANSI |    UTF-8     |              ║
    ║                   ║ UCS-2 BE BOM ║              |              |              ║              |              |              ║
    ║                   ║ UCS-2 LE BOM ║              |              |              ║              |              |              ║
    ║                   ║--------------║--------------•--------------|              ║--------------•--------------|              ║
    ║                   ║  Encoding Z  ║ UTF-8 / ANSI | UTF-8 / ANSI |              ║  Encoding Z  |    UTF-8     |              ║
    ==============================================================================================================================

NOTES :

In column 1, the mention with □, means that the option Apply to opened ANSI files is checked
In column 1, the mention without □, means that the option Apply to opened ANSI files is UNchecked
In columns 2 and 6, the mention Encoding Z represents a specific encoding, from the list Character sets / … /…
In column 5, the mention Encoding X represents ONE of the encoding actions :
- Encode in UTF-8 BOM
- Encode in UCS-2 BE BOM
- Encode in UCS-2 LE BOM
- Character sets / … / …
In column 8, the mention Converting Y represents ONE of the converting actions :
- Convert to UTF-8 BOM
- Convert to UCS-2 BE BOM
- Convert to UCS-2 LE BOM
In columns 3, 4, 6 and 7, the syntax xxx / yyy means that the resulting file encoding is :
- xxx, if ALL the characters, of a file, have an Unicode code-point < \x{80}
- yyy, if, at LEAST, ONE character, of the file, has an Unicode code-point > \x{7F}

IMPORTANT :

These results are IDENTICAL, whether the option Settings - Preferences… - MISC - Autodetect character encoding is checked or UNchecked

Best Regards,

guy038

P.S. :

All tests done with the last v6.8.8 version of Notepad++

As a reminder, you’ll find, below, the main differences between an ENCODING and a CONVERTING action :

When you use the menu option Encoding - Encode in … or Encoding - Character sets - …, Notepad++ DOESN’T change the file, at all ! It just tries to re-interpret the present contents of the file, according to the new encoding

You’ll generally use that option, if some characters of the file look weird, or are replaced by a question mark ( ? ), a small square box ( □ ) or the UNICODE replacement character ( \x{FFFD} ). You’ll also use this option, if the file seems completely unreadable :-((

After an ENCODING action, remember that the data is NEVER changed, only the display is CHANGED !
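
In Python terms, an ENCODING action roughly corresponds to decoding the same bytes with a different codec. A minimal sketch, assuming once more that ANSI is `cp1252`:

```python
raw = "é".encode("utf-8")       # the bytes on disk: C3 A9

as_utf8 = raw.decode("utf-8")   # correct interpretation
as_ansi = raw.decode("cp1252")  # same bytes, read as ANSI (assumed cp1252)

# The bytes never change; only what is displayed differs.
print(raw.hex(), as_utf8, as_ansi)
```

Here the ANSI interpretation shows two "weird" characters instead of one accented letter, which is the typical symptom that calls for an Encode in … action.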

When you use the menu option Encoding - Convert to …, this time, Notepad++ DOES change the file, as it re-writes the SAME contents, according to the new encoding !

You’ll generally use that option, if the PRESENT file is quite readable but must be read with an OTHER editor, that does NOT support the original encoding.

After a CONVERTING action, the data is ALWAYS changed. In addition, all the characters, which can’t be represented with that new encoding, will be replaced by a question mark ( ? ), a small square box ( □ ) or the UNICODE replacement character ( \x{FFFD} ).

This situation is likely to occur, if the original encoding was a UNICODE encoding and the new encoding is the ANSI encoding !
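
A CONVERTING action, by contrast, is comparable to re-encoding the same text into different bytes. The Python sketch below assumes `cp1252` for ANSI and uses Python's own `errors="replace"` policy, which substitutes a question mark; the exact substitute that Notepad++ writes may differ.

```python
text = "naïve 日本"

utf8_bytes = text.encode("utf-8")
# Assumption: ANSI = cp1252. The two CJK characters have no cp1252 equivalent.
ansi_bytes = text.encode("cp1252", errors="replace")

print(utf8_bytes != ansi_bytes)      # True: the stored bytes really change
print(ansi_bytes.decode("cp1252"))   # naïve ??
```

Once the file is saved in that state, the replaced characters cannot be recovered by converting back.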

To end with that topic, don’t forget that, whatever the encoding chosen, the font, used to display the glyphs, may NOT contain some characters of the file, and will, then, display some substitution characters, instead !

Claudia Frank

@guy038

Thank you very much for the hard work, I really appreciate it.
Encoding is one of those topics which I don’t fully understand.
Now I have a reference for how it is handled and used in npp.

Thx
Cheers
Claudia