
    NP++ 6.8.8 file encode type does not stay as UTF8

    • Hsiang Pan

      I convert a file to UTF-8 from the Encoding menu, but the change only holds for the current session.

      After I close the file and re-open it, the encoding changes back to ANSI.

      • Claudia Frank @Hsiang Pan

        Hello Hsiang-Pan,

        In the menu Settings -> Preferences -> New Document, select the UTF-8 radio button.

        Cheers
        Claudia

        • Hsiang Pan

          Hello Claudia,

          Thank you for your reply, but I think you misunderstood my question.

          My issue is not related to creating a new document from the program itself.

          The issue I see is this: when I convert the document to UTF-8 and save it, the next time I open the same file its encoding is still shown as ANSI.

          I am sure this worked fine in version 5.x.

          • Claudia Frank

            Hello Hsiang Pan,

            I can only guess at what npp does under the hood, but what I have found so far is that,
            if the content of a file could be either ASCII or UTF-8 (both share the same characters
            0-127), npp seems to use the default from the settings. With this setting, existing
            documents as well as new ones would therefore open with UTF-8 encoding.

            Cheers
            Claudia

            • guy038

              Hello Hsiang Pan,

              Well, I understand what happened, and Claudia guessed it, too !

              Could you confirm that, in Settings - Preferences… - New Document - Encoding, your default choice, for a new document, is the ANSI option ?

              In that specific case, if, in addition, your file does NOT contain any character with a Unicode code-point > \x{7F}, a request to encode in UTF-8, for N++ versions > 6.8.0 ( or in UTF-8 without BOM, for N++ versions < 6.8.1 ), seems to be ignored, and your file keeps its previous ANSI encoding the next time it is opened in N++ !

              Why that weird behaviour ? If you think about it, it’s quite logical !

              When a file contains ONLY characters with code-point < 128 ( \x{80} ), each character is stored as a SINGLE byte, and your file is EXACTLY the same, byte for byte, in the ANSI and UTF-8 ( without BOM ) encodings.

              For details on these shared characters, just follow the link below :

              https://en.wikipedia.org/wiki/ASCII

              So, even with a hexadecimal editor, nobody could ever tell which of these two encodings such a file uses :-((
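
As a rough illustration of this point, the bytes of an ASCII-only text can be compared across encodings in Python. This is only a sketch: "ANSI" is modelled here as Windows-1252 (`cp1252`), which is an assumption, since the real ANSI code page depends on the Windows locale.

```python
# Assumption: "ANSI" is modelled as Windows-1252 (cp1252).
text = "Hello, Notepad++"  # every character has a code-point < 0x80

ansi_bytes = text.encode("cp1252")
utf8_bytes = text.encode("utf-8")

# For pure ASCII text both byte sequences are identical, so the bytes
# alone cannot reveal which of the two encodings was intended.
same_for_ascii = ansi_bytes == utf8_bytes

# As soon as one character is above 0x7F, the byte sequences differ.
accented = "café"
differs_for_non_ascii = accented.encode("cp1252") != accented.encode("utf-8")

print(same_for_ascii, differs_for_non_ascii)
```
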


              If you had chosen to encode your file in UTF-8 BOM, UCS-2 BE BOM, or UCS-2 LE BOM, your file would have kept this new encoding the next time N++ opened it. Indeed, thanks to the BOM ( Byte Order Mark ), 2 or 3 invisible bytes at the very beginning of the file, these three encodings are clearly identified !
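
A minimal sketch of how a BOM can be recognised, using the BOM constants from Python's standard `codecs` module. The function name `detect_bom` and the returned labels are hypothetical; this is not Notepad++'s own detection code.

```python
import codecs

# BOM signatures: 3 bytes for UTF-8, 2 bytes for UCS-2 / UTF-16.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8 BOM"),
    (codecs.BOM_UTF16_BE, "UCS-2 BE BOM"),
    (codecs.BOM_UTF16_LE, "UCS-2 LE BOM"),
]

def detect_bom(data: bytes):
    """Return the label of the encoding announced by a leading BOM, or None."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label
    return None

# "utf-8-sig" writes the 3-byte BOM; plain "utf-8" does not.
print(detect_bom("abc".encode("utf-8-sig")))
print(detect_bom("abc".encode("utf-8")))
```
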

              On the other hand, if your file had contained at least ONE character with code-point > \x{7F}, then, once the file is encoded in UTF-8, every character with code-point > 127 is coded with 2, 3 or 4 bytes, which helps to correctly identify a UTF-8 encoding, despite the absence of a BOM. So Notepad++ “accepts” this new UTF-8 encoding !

              But, if all the code-points of the characters of your file are < \x{80}, Notepad++ doesn’t take your UTF-8 request into account and keeps the ANSI encoding, since it is also your default encoding for a new document :-)
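
The behaviour described above, for a file without a BOM, could be sketched as follows. The function `guess_without_bom` and its `default` parameter are hypothetical names used for illustration; the logic only mirrors the observations in this post and is not taken from the Notepad++ source.

```python
def guess_without_bom(data: bytes, default: str = "ANSI") -> str:
    """Guess between ANSI and UTF-8 for a file that has no BOM."""
    if all(b < 0x80 for b in data):
        # Pure ASCII: the bytes are ambiguous, so fall back to the
        # default encoding chosen for new documents.
        return default
    try:
        # Strict decoding raises on invalid multi-byte sequences.
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        return "ANSI"

print(guess_without_bom(b"plain text"))
print(guess_without_bom("café".encode("utf-8")))
print(guess_without_bom("café".encode("cp1252")))
```

With the default left at ANSI, the first call falls back to the default, the second is recognised as UTF-8 thanks to its multi-byte sequence, and the third is not valid UTF-8 and is therefore treated as ANSI.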

              Best Regards,

              guy038

              • Hsiang Pan

                Thank you both for clearing up my issue.

                Now I understand the logic: since my file does not contain any non-ASCII text, it will stay as ANSI the next time I open it.

                If it contains at least one non-ASCII character, it will open as UTF-8 next time.

                I tried it, and it works.

                BTW, I do have Settings - Preferences - New Document - Encoding set to UTF8.

                • guy038

                  Hi, Hsiang Pan, Claudia and All,

                  Over the last few days, I looked into these questions more deeply and tested numerous encoding cases !

                  Below, I have summed up, in a table, the resulting encoding of a file on re-opening Notepad++ ( without any encoding or converting action before closing ), depending on :

                  • The present file encoding ( 3 Rows )

                  • The present default encoding, for a new document ( 2 Columns )

                                                  ===============================================
                                                  ║        DEFAULT NEW document ENCODING        ║
                                                  ║                      in                     ║
                                                  ║   Settings - Preferences... - New Document  ║
                                                  ║----------------------------•----------------║
                                                  ║  ANSI     UTF-8 without □  |                ║
                                                  ║                            |                ║
                                                  ║  UTF-8 BOM   UCS-2 BE BOM  |  UTF-8 with □  ║
                                                  ║                            |                ║
                                                  ║  UCS-2 LE BOM    Other     |                ║
                      ============================║============================•================║
                      ║  PRESENT   |  ANSI        ║            ANSI            |  UTF-8 / ANSI  ║
                      ║            |              ║                            |                ║
                      ║   File     |  UTF-8       ║        ANSI / UTF-8        |      UTF-8     ║
                      ║            |              ║                            |                ║
                      ║  Encoding  |  Encoding Z  ║        Encoding Z          |   Encoding Z   ║
                      ===========================================================================
                  
                  

                  NOTES :

                  • The mention with □, means that the option Apply to opened ANSI files is checked

                  • The mention without □, means that the option Apply to opened ANSI files is UNchecked

                  • The mention Other, represents ANY encoding, from the list Character sets / … / …

                  • The syntax xxx / yyy means that, in re-opening N++, the resulting file encoding is :

                    • xxx, if ALL the characters, of a file, have an Unicode code-point < \x{80}

                    • yyy, if, at LEAST, ONE character, of the file, has an Unicode code-point > \x{7F}

                  • The mention Encoding Z represents the present file encoding, which may be :

                    • The UTF-8 BOM encoding

                    • The UCS-2 BE BOM encoding

                    • The UCS-2 LE BOM encoding

                    • A specific encoding, from the list Character sets / … / …


                  So, from this table, we can conclude that, after re-opening N++, the ENCODING of a file has CHANGED, in TWO cases, only :

                  • FIRST case :

                    • The previous file encoding was ANSI

                    • The new default encoding is UTF-8, with the Apply to opened ANSI files option checked

                    • ALL the characters, of the file, have an Unicode code-point < \x{80}

                  • SECOND case :

                    • The previous file encoding was UTF-8

                    • The new default encoding is anything OTHER than UTF-8 with the Apply to opened ANSI files option checked

                    • ALL the characters, of the file, have an Unicode code-point < \x{80}
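
The first table can also be written as a small lookup function, which makes it easy to enumerate the cases where the encoding changes. This is only a model of the table above; the function and parameter names are hypothetical.

```python
def encoding_on_reopen(file_enc: str, utf8_applied: bool, pure_ascii: bool) -> str:
    """Model of the first table.

    file_enc:     "ANSI", "UTF-8" (no BOM), or any other label (Encoding Z).
    utf8_applied: True when the default is UTF-8 and the option
                  'Apply to opened ANSI files' is checked.
    pure_ascii:   True when every code-point in the file is < 0x80.
    """
    if file_enc not in ("ANSI", "UTF-8"):
        return file_enc  # Encoding Z is always preserved
    if not pure_ascii:
        return file_enc  # a character above 0x7F pins the encoding
    # Pure ASCII: the default setting decides.
    return "UTF-8" if utf8_applied else "ANSI"

# List every combination where the encoding differs after re-opening.
changed = [
    (enc, applied, ascii_only)
    for enc in ("ANSI", "UTF-8", "UCS-2 LE BOM")
    for applied in (False, True)
    for ascii_only in (False, True)
    if encoding_on_reopen(enc, applied, ascii_only) != enc
]
print(changed)
```

Only two combinations end up in `changed`, matching the two cases listed above.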


                  We can also deduce that the easiest way to preserve the current encoding of a file is to use, for the DEFAULT NEW document encoding :

                  • The ANSI encoding, if you want your file to keep the ANSI encoding, in any case

                  • The UTF-8 encoding, with the Apply to opened ANSI files option CHECKED, if you want your file to keep the UTF-8 encoding, in any case

                  • Any other encoding, if you want your file to keep that same encoding, in any case


                  Now, if we consider an encoding or converting action, before closing and re-starting Notepad++, we get the new table, below, with the resulting encoding, of a file, depending on :

                  • The present default encoding for a new document ( Column 1 )

                  • The present file encoding ( Column 2 )

                  • The encoding action ( Columns 3, 4 or 5 ) or the converting action ( Columns 6, 7 or 8 )

                      ==============================================================================================================================
                      ║ DEFAULT NEW       ║   PRESENT    ║                 ENCODE in                  ║                 CONVERT to                 ║
                      ║    Document       ║     File     ║--------------•--------------•--------------║--------------•--------------•--------------║
                      ║       ENCODING    ║   ENCODING   ║     ANSI     |    UTF-8     |  Encoding X  ║     ANSI     |    UTF-8     | Converting Y ║
                      ║===================║==============║==============•==============•==============║==============•==============•==============║
                      ║      ANSI         ║     ANSI     ║     ANSI     |     ANSI     |              ║     ANSI     | ANSI / UTF-8 |              ║
                      ║ UTF-8   without □ ║--------------║--------------•--------------|              ║--------------•--------------|              ║
                      ║ UTF-8 BOM         ║ UTF-8        ║              |              |              ║              |              |              ║
                      ║                   ║ UTF-8 BOM    ║              |              |              ║              |              |              ║
                      ║ UCS-2 BE BOM      ║              ║ ANSI / UTF-8 | ANSI / UTF-8 |              ║     ANSI     | ANSI / UTF-8 |              ║
                      ║ UCS-2 LE BOM      ║ UCS-2 BE BOM ║              |              |              ║              |              |              ║
                      ║                   ║ UCS-2 LE BOM ║              |              |              ║              |              |              ║
                      ║ OTHER Encoding    ║--------------║--------------•--------------|              ║--------------•--------------|              ║
                      ║                   ║  Encoding Z  ║     ANSI     |     ANSI     |              ║  Encoding Z  | ANSI / UTF-8 |              ║
                      ║===================║==============║=============================|  Encoding X  ║=============================| Converting Y ║
                      ║                   ║     ANSI     ║ UTF-8 / ANSI | UTF-8 / ANSI |              ║ UTF-8 / ANSI |    UTF-8     |              ║
                      ║                   ║--------------║--------------•--------------|              ║--------------•--------------|              ║
                      ║                   ║ UTF-8        ║              |              |              ║              |              |              ║
                      ║                   ║ UTF-8 BOM    ║              |              |              ║              |              |              ║
                      ║  UTF-8    with □  ║              ║    UTF-8     |    UTF-8     |              ║ UTF-8 / ANSI |    UTF-8     |              ║
                      ║                   ║ UCS-2 BE BOM ║              |              |              ║              |              |              ║
                      ║                   ║ UCS-2 LE BOM ║              |              |              ║              |              |              ║
                      ║                   ║--------------║--------------•--------------|              ║--------------•--------------|              ║
                      ║                   ║  Encoding Z  ║ UTF-8 / ANSI | UTF-8 / ANSI |              ║  Encoding Z  |    UTF-8     |              ║
                      ==============================================================================================================================
                  
                  

                  NOTES :

                  • In column 1, the mention with □, means that the option Apply to opened ANSI files is checked

                  • In column 1, the mention without □, means that the option Apply to opened ANSI files is UNchecked

                  • In columns 2 and 6, the mention Encoding Z represents a specific encoding, from the list Character sets / … / …

                  • In column 5, the mention Encoding X represents ONE of the encoding actions :

                    • Encode in UTF-8 BOM

                    • Encode in UCS-2 BE BOM

                    • Encode in UCS-2 LE BOM

                    • Character sets / … / …

                  • In column 8, the mention Converting Y represents ONE of the converting actions :

                    • Convert to UTF-8 BOM

                    • Convert to UCS-2 BE BOM

                    • Convert to UCS-2 LE BOM

                  • In columns 3, 4, 6 and 7, the syntax xxx / yyy means that the resulting file encoding is :

                    • xxx, if ALL the characters, of a file, have an Unicode code-point < \x{80}

                    • yyy, if, at LEAST, ONE character, of the file, has an Unicode code-point > \x{7F}


                  IMPORTANT :

                  These results are IDENTICAL, whether the option Settings - Preferences… - MISC - Autodetect character encoding is checked or UNchecked

                  Best Regards,

                  guy038

                  P.S. :

                  All tests were done with the latest version of Notepad++, v6.8.8

                  As a reminder, you’ll find, below, the main differences between an ENCODING and a CONVERTING action :


                  When you use the menu option Encoding - Encode in … or Encoding - Character sets - …, Notepad++ DOESN’T change the file, at all ! It just tries to re-interpret the present contents of the file, according to the new encoding

                  You’ll generally use that option, if some characters of the file look weird, or are replaced by a question mark ( ? ), a small square box ( □ ) or the UNICODE replacement character ( \x{FFFD} ). You’ll also use this option, if the file seems completely unreadable :-((

                  After an ENCODING action, remember that the data is NEVER changed, only the display is CHANGED !
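
To illustrate the idea of re-interpretation, here is a short Python sketch: the same bytes are read with two different encodings, and the bytes themselves never change. `cp1252` again stands in for "ANSI", which is an assumption.

```python
raw = "é".encode("utf-8")        # the bytes on disk: b'\xc3\xa9'

as_utf8 = raw.decode("utf-8")    # read as UTF-8  -> 'é'
as_ansi = raw.decode("cp1252")   # read as "ANSI" -> 'Ã©'

# Only the interpretation differs; `raw` is untouched in both cases.
print(as_utf8, as_ansi, raw)
```
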


                  When you use the menu option Encoding - Convert to …, this time, Notepad++ DOES change the file, as it re-writes the SAME contents, according to the new encoding !

                  You’ll generally use that option, if the PRESENT file is quite readable but must be read with an OTHER editor, that does NOT support the original encoding.

                  After a CONVERTING action, the data is re-written, so the bytes change whenever the two encodings represent the text differently. In addition, all the characters, which can’t be represented with that new encoding, will be replaced by a question mark ( ? ), a small square box ( □ ) or the UNICODE replacement character ( \x{FFFD} ).

                  This situation is likely to occur, if the original encoding was a UNICODE encoding and the new encoding is the ANSI encoding !
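
A conversion, by contrast, can be sketched as decoding the text and writing it again in the target encoding. The example below uses Python's `errors="replace"` handler to show the loss when the target is "ANSI" (modelled as `cp1252`, an assumption); Notepad++'s own substitution behaviour may differ in detail.

```python
text = "price: 10 €, naïve, 日本"

utf8_bytes = text.encode("utf-8")

# Converting to "ANSI": characters missing from cp1252 become '?'.
ansi_bytes = text.encode("cp1252", errors="replace")
round_trip = ansi_bytes.decode("cp1252")

# The stored bytes really change, and the two CJK characters are lost.
print(utf8_bytes != ansi_bytes, round_trip)
```

Here `€` and `ï` survive because cp1252 contains them, while the two Japanese characters are each replaced by a question mark.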

                  To end with that topic, don’t forget that, whatever the encoding chosen, the font, used to display the glyphs, may NOT contain some characters of the file, and will, then, display some substitution characters, instead !

                  • Claudia Frank @guy038

                    @guy038

                    thank you very much for the hard work, I really appreciate it.
                    Encoding is one of those points which I don’t really understand fully.
                    Now I have a reference how it is solved and used in npp.

                    Thx
                    Cheers
                    Claudia
