
UTF-8 doc becomes ANSI doc !

Help wanted · 24 Posts · 6 Posters · 24.8k Views
  • Mahabarata
    last edited by Aug 15, 2016, 10:51 PM

    Hi,
    I create a new document in UTF-8 without BOM.
    I write some words with “normal” ASCII characters : no é, no ï, for example.
    I save it as a .txt doc and close it.
    I open it again.
    Notepad++ tells me the doc is in ANSI !
    I didn’t see it the first time, so I didn’t change the encoding.
    I write some other words in it, one with an é for example.
    I save the doc and close it.
    I open the doc : now Notepad++ tells me the doc is UTF-8 and my é is the character xE9 !!!

    In the settings I’ve tried “detect automatically the encoding” (I don’t know if those are the right words, I use the French version : “détecter l’encodage automatiquement”) but it changes nothing.

    How can I get a doc saved as UTF-8 and opened as UTF-8, even when there are no special chars inside ?

    Debug info :
    Notepad++ v6.9.2
    Build time : May 18 2016 - 00:34:05
    Path : C:\Program Files (x86)\Notepad++\notepad++.exe
    Admin mode : OFF
    Local Conf mode : OFF
    OS : Windows 8.1
    Plugins : mimeTools.dll NppConverter.dll NppExport.dll NppFTP.dll PluginManager.dll
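The ambiguity behind this behaviour can be shown at the byte level with a few lines of Python (a sketch for illustration; cp1252 stands in for Windows “ANSI”):

```python
# Pure-ASCII text yields identical bytes under ANSI (cp1252), Latin-1 and
# UTF-8 without BOM, so there is literally nothing for an editor to detect.
text = "some normal words"
assert text.encode("cp1252") == text.encode("utf-8") == text.encode("ascii")

# Only a non-ASCII character makes the encodings diverge: \xE9 is the
# single cp1252 ("ANSI") byte for é, while UTF-8 needs two bytes.
assert "é".encode("cp1252") == b"\xe9"
assert "é".encode("utf-8") == b"\xc3\xa9"
```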

    • Claudia Frank @Mahabarata
      last edited by Aug 16, 2016, 11:46 PM

      @Mahabarata

      I often do the same and it works for me.
      Here is my setup :

      Settings->Preferences->New Document->Encoding
      UTF-8 and Apply to opened ANSI files

      Settings->Preferences->MISC.
      Autodetect character encoding : unchecked (not used)

      Cheers
      Claudia

      • Mahabarata
        last edited by Aug 17, 2016, 12:40 AM

        Lol.
        I’d spent all day finding where the problem was,
        and then I spent the rest of the day looking for another text editor because I couldn’t find it.

        Half an hour ago I found another great piece of software, so I came here to say I was about to put NPP in my trash !
        And then I read your good answer : it works ! Thanks !

        But now I don’t know what to do : stay with npp or switch to the new one…

        • Bahram Yaghobinia
          last edited by Aug 17, 2016, 1:51 AM

          @Claudia-Frank
          Hi, I have the same issue in Notepad++ 6.9.
          We have a job that reads some data, puts the result in XML format and saves it as a .txt file. If there are no special characters in the XML, the file is successfully saved as UTF-8.
          When there are special characters (£©®—€) in the XML, the file is saved as ANSI. If I manually go to Encoding and select Convert to UTF-8, the file is saved as UTF-8 (this is not acceptable, because our process is automated).

          Please advise how to save files with special characters as UTF-8 automatically.

          My settings:
          Settings --> Preferences --> New Document: UTF-8, with “Apply to opened ANSI files” checked.
          I took your advice and unchecked “Autodetect character encoding”.

          • Claudia Frank @Bahram Yaghobinia
            last edited by Aug 18, 2016, 12:19 AM

            @Bahram-Yaghobinia

            First let me clarify that, currently, there is no way to
            force npp to ALWAYS open/save documents as UTF-8
            (I’m talking about built-in functionality - not about 3rd-party solutions).
            The best we can do is minimize the number of “false” encoding hints.

            In regards to your question, sorry, I don’t get the point.
            If this is an automated process, how does npp come into play?

            If the automated process creates the file, then that process is
            responsible for making sure everything is encoded in UTF-8;
            if it isn’t, npp has to decide which encoding to assume.

            If you use my settings and create a file with the copyright and registered-trademark chars,
            save it, close it and reopen it - it should still be UTF-8 (this is what I get).
            Create a file with only the euro char, save it, close it and reopen it - still UTF-8.
            Close it again, but open it with a hex editor: you should see only three bytes.
            Within the hex editor, delete those bytes, put in the single byte 0x80, save, close and reopen
            with npp -> the file should now be opened as ANSI. You get my point?

            If I misunderstood your question, please explain your steps in detail.

            Cheers
            Claudia

            • Bahram Yaghobinia
              last edited by Aug 18, 2016, 1:17 AM

              Thank you for the reply, Claudia.
              Our process uses the default text editor on the server. Right now the default text editor is NPP 6.9.
              NPP saves all the text files as UTF-8 except the files with special characters (ALT ####).
              I was hoping that NPP could save everything as UTF-8. If not, are there any third-party solutions I could look into?
              Your help will be greatly appreciated.

              • gerdb42 @Bahram Yaghobinia
                last edited by Aug 18, 2016, 7:52 AM

                @Bahram-Yaghobinia

                Files in ANSI and UTF-8 without BOM cannot be distinguished if they do not contain any non-ASCII characters. So the settings Claudia mentioned tell NPP what to assume when it can’t decide on the encoding (this becomes important once you later add non-ASCII characters).

                On the other hand, if a file does contain non-ASCII characters, NPP can usually detect what the encoding is, and it will not perform an automatic conversion.

                However, I also don’t get how NPP fits into an automated process. How do you control its actions? Since you know you receive ANSI and need UTF-8, how about throwing together a few lines of C# or PowerShell?
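The scripted conversion gerdb42 suggests could look like this in Python (a sketch; it assumes the incoming files really are Windows-1252 “ANSI” and the function name is made up for the example):

```python
def ansi_to_utf8(src_path, dst_path):
    """Re-encode a Windows-1252 ('ANSI') text file as UTF-8.

    Assumes the cp1252 source codec rather than guessing; a wrong
    assumption raises instead of silently corrupting the text.
    """
    with open(src_path, "r", encoding="cp1252") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
```

Hooked into the job right after the file is written, this takes the editor out of the loop entirely.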

                • guy038
                  last edited by guy038 Aug 18, 2016, 5:36 PM

                  Hello, Bahram-Yaghobinia and All,

                  Encoding problems in general, coupled with the default Notepad++ behaviour, are difficult enough to understand :-((

                  In the last part of this post, I give you a possible solution to your problem !

                  But, before testing some situations, here are my usual encoding-related parameters, in the latest N++ version 6.9.2 :

                  • In Settings - Preferences… - New Document - Encoding => Options UTF-8 and Apply to opened ANSI files CHECKED

                  • In Settings - Preferences… - MISC => Option Autodetect character encoding UNCHECKED

                  In addition, I, generally, have the option Remember current session for next launch, in Settings - Preferences… - Backup, CHECKED


                  OK. Now, open Notepad++ and let’s make some simple tests :

                  • Open a new file ( CTRL + N ) => Information UTF-8, in the status bar

                  • Type an upper letter A

                  • Save the file, with name Test.txt => Still information UTF-8, in the status bar

                  • Close and restart N++

                  • Click, if necessary, on the Test.txt tab => Still information UTF-8, in the status bar

                  ( A hex editor would show a one-byte file, with the value 41 )

                  • Replace the letter A with the Euro character €, of Unicode value 20AC ( > \x7F )

                  • Save the changes of Test.txt => Still information UTF-8, in the status bar

                  • Close and restart N++

                  • Click, if necessary, on the Test.txt tab => Still information UTF-8, in the status bar

                  ( A hex editor would show a three-byte file, with the values E2 82 AC, which is the UTF-8 representation of the Unicode code-point 20AC of the Euro sign )

                  • Now, delete the Euro sign

                  • Save this empty file Test.txt => Still information UTF-8, in the status bar

                  • Close and restart N++

                  • Click, if necessary, on the Test.txt tab => This time, we get, for this empty file, the ANSI information in the status bar !!

                  ( A hex editor would, indeed, show a zero-byte file )

                  REMARK :

                  Although I can understand that N++ cannot decide on the right encoding ( as the file is just empty ), the logical behaviour would have been to choose the default user setting, which is UTF-8 !! Let’s go on :

                  • Type, again, an upper letter A

                  • Re-save the file Test.txt => Still information ANSI, in the status bar

                  • Close and restart N++

                  • Click, if necessary, on the Test.txt tab => We have, again, the information UTF-8, in the status bar

                  ( A hex editor would show a one-byte file, with the value 41 )

                  • For the last time, delete the A character

                  • Save, again, this empty file Test.txt => Still information UTF-8, in the status bar

                  • Close and restart N++

                  • Click, if necessary, on the Test.txt tab => Again, we get, for this empty file, the ANSI information in the status bar !!

                  ( A hex editor would, indeed, show a zero-byte file )

                  • Type, for the last time, the Euro sign €

                  • Save the changes of Test.txt => Still information ANSI, in the status bar

                  • Close and restart N++

                  • Click, if necessary, on the Test.txt tab => This time, the information remains ANSI, in the status bar !!

                  ( A hex editor would show a one-byte file, with the value 80, as the ANSI-1252 code of the Euro character is just … \x80 )

                  As you certainly want to obtain a UTF-8 encoding, you would then have to use the menu option Encoding - Convert to UTF-8 and re-save the file Test.txt, after which N++ writes it, again, as a three-byte file, with the values E2 82 AC !

                  Beware not to use the option Encoding - Encode in UTF-8, which only re-interprets the present contents of the file as UTF-8 => an unknown one-byte \x80 character, which is an invalid UTF-8 sequence !


                  So, from the above, Bahram-Yaghobinia, it’s obvious that the plain UTF-8 encoding ( formally named UTF-8 without BOM ) must NOT be used. I strongly advise you to adopt UTF-8 BOM, instead ! Why ?

                  Just because the invisible BOM ( Byte Order Mark ) identifies, without any ambiguity, the encoding of a file !

                  When you save a file from within Notepad++ with the UTF-8 BOM encoding, three invisible bytes ( EF BB BF ) are added at the very beginning of your file. These three bytes are simply the UTF-8 representation of the Byte Order Mark ( BOM ), the Unicode code-point FEFF.

                  Thus, each time N++ ( or any modern editor ! ) opens this file, it automatically understands that it’s a true UTF-8 file, due to these three invisible bytes EF BB BF, located at its very beginning !


                  If you reproduce all the tests above, on a new file, with the UTF-8 BOM encoding ( instead of UTF-8 ), this encoding will remain UTF-8 BOM, throughout all the tests, even when the test file is empty ( just note that, in this specific case, the file is not entirely empty, as it, still, contains the three bytes of the Byte Order Mark !! )
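The BOM bytes described above can be checked with a few lines of Python (a sketch using only the standard codecs module; the helper name is made up for the example):

```python
import codecs

# The UTF-8 "BOM" is the Unicode code point U+FEFF encoded in UTF-8: EF BB BF.
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"
assert "\ufeff".encode("utf-8") == codecs.BOM_UTF8

def has_utf8_bom(path):
    """True if the file starts with the three-byte UTF-8 signature."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8
```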

                  Further information on :

                  http://en.wikipedia.org/wiki/Byte_order_mark

                  http://en.wikipedia.org/wiki/Unicode_Specials

                  http://en.wikipedia.org/wiki/Endianness

                  Best regards

                  guy038

                  • Bahram Yaghobinia
                    last edited by Aug 18, 2016, 4:01 PM

                    Thank you for all the details. I tested your logic and all went well.
                    I changed my setup to Settings - Preferences… - New Document - Encoding => UTF-8 BOM.
                    I still have the same issue, but now all my files are saved with the ANSI encoding.
                    I think the only way to do this is to write a few lines of code to convert from ANSI to UTF-8.

                    • Claudia Frank @Bahram Yaghobinia
                      last edited by Aug 18, 2016, 5:20 PM

                      @Bahram-Yaghobinia and all,

                      Maybe I misunderstood the topic, but I thought the problem is that an xml file,
                      which has been created outside of npp, gets loaded, manipulated and saved as txt.
                      Is this the case? If not, why not explain the steps in detail? Otherwise we
                      are fishing in the dark trying to find a solution for you.

                      In addition, I totally agree with what guy wrote about npp detecting UTF-8 files,
                      but unfortunately there is also a reason not to use UTF-8 with BOM.
                      If your manipulated data gets loaded/processed by other applications,
                      e.g. databases, webservers, web frameworks etc., it may be that those apps
                      can’t handle the data correctly.
                      Unfortunately, there are still many such applications in active use
                      which don’t support UTF-8 BOM files.
                      I don’t know if this is the case for you - so, just for information.

                      Cheers
                      Claudia

                      • Bahram Yaghobinia
                        last edited by Aug 18, 2016, 6:34 PM

                        • Data is sitting in a queue.
                        • A job picks up the data, converts it to XML and saves the XML in a text file.
                        • It brings the file to the server and saves it on disk.
                        • Right now Notepad++ is the default text editor on the server.
                        • All text files are saved as UTF-8, except the files that have special characters (ALT ####); these files are saved as ANSI.
                        • Notepad++ setup:
                        o Settings --> Preferences --> New Document: UTF-8, with “Apply to opened ANSI files” checked.
                        o “Autodetect character encoding” unchecked.
                        I am trying to add a script that will do this for me, but I have no idea how it works. Something like notepad.runMenuCommand(“Encoding”, “Convert to UTF-8”).

                        • guy038
                          last edited by guy038 Aug 18, 2016, 7:21 PM

                          Hi, Bahram-Yaghobinia,

                          I was quite surprised, and really sorry, that your problem isn’t solved at all and even seems worse than before :-(( But, after some minutes, I realized that this behaviour is quite logical :

                          • As you changed the default encoding, for a new document, to UTF-8 BOM, which, of course, has no option relative to opened ANSI files, Notepad++ will never try to change an ANSI-style file it reads into a true UTF-8 file !

                          • The contents of your file, opened with your default editor N++ 6.9, seem to be one-byte characters only. So, like the characters with values < \x80, characters such as £ © ® or € are also written as one-byte sequences, between \x80 and \xFF. Therefore, N++ always saves the file with its present ANSI encoding, without any conversion !

                          ( See the list of all of them, for values > 127, with the N++ menu option Edit - Character Panel )

                          So, a solution would be to run a simple script, when starting N++, which :

                          • applies the menu option Encoding - Convert to UTF-8 BOM

                          • saves the new UTF-8 contents of your file

                          I think that a Python or NppExec script should do that job, easily !


                          For information, let’s suppose the exact text £ , © , ® , — or €, in a new file, with an ANSI encoding. This text would produce the following sequence of bytes :

                          • A3 20 2C 20 A9 20 2C 20 AE 20 2C 20 97 20 6F 72 20 80 ( 18 bytes )

                          Once this text is converted to the UTF-8 BOM encoding, it would give the sequence of bytes below :

                          • EF BB BF C2 A3 20 2C 20 C2 A9 20 2C 20 C2 AE 20 2C 20 E2 80 94 20 6F 72 20 E2 82 AC ( 28 bytes )

                          The differences are the values of the five characters which have a different representation in the ANSI and UTF-8 BOM encodings ( C2 A3, C2 A9, C2 AE, E2 80 94 and E2 82 AC ), as well as the three bytes of the BOM ( EF BB BF ), at the beginning of the second sequence.
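Both byte sequences can be reproduced in Python, where the utf-8-sig codec prepends the BOM automatically (a sketch; the characters are written as escapes for byte-safety):

```python
# The text "£ , © , ® , — or €" as escapes: A3/A9/AE are the Latin-1 points,
# 2014 is the em dash and 20AC the euro sign.
text = "\xa3 , \xa9 , \xae , \u2014 or \u20ac"

ansi = text.encode("cp1252")         # one byte per character: 18 bytes
utf8_bom = text.encode("utf-8-sig")  # EF BB BF + multi-byte chars: 28 bytes

assert ansi.hex(" ") == "a3 20 2c 20 a9 20 2c 20 ae 20 2c 20 97 20 6f 72 20 80"
assert utf8_bom[:3] == b"\xef\xbb\xbf" and len(utf8_bom) == 28
```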

                          Cheers,

                          guy038

                          • Claudia Frank @Bahram Yaghobinia
                              last edited by Aug 18, 2016, 7:23 PM

                              @Bahram-Yaghobinia

                              I agree with guy, but I would say that your process, which creates the xml, needs to take care
                              of this, as UTF-8 is the standard encoding for XML. If this process was designed for writing XML,
                              it should have an option to save as UTF-8 encoded. Did you double-check this?

                              Cheers
                              Claudia

                              • Jim Dailey
                                last edited by Aug 18, 2016, 8:12 PM

                                @Bahram-Yaghobinia

                                Do the XML files contain something like this as their first line:

                                <?xml version="1.0" encoding="???" ?>
                                

                                If so, can you provide that line to us?

                                If there is no such line, or if it does not include information about the encoding method, then UTF-8 is assumed.

                                I think that means that if the file contains a £ encoded as “A3” instead of “C2 A3”, it isn’t technically valid XML (because it isn’t encoded as UTF-8).

                                Guy, Claudia, or anyone else who has a better understanding please correct me if I am wrong.

                                • Bahram Yaghobinia
                                  last edited by Aug 19, 2016, 2:35 AM

                                  Jim, all the XMLs have
                                  <?xml version="1.0" encoding="UTF-8" ?>
                                  I am working on the source as well, to see if anything can be done to save as UTF-8. I would like to try the script option (Python), but at this time I have no idea how it is done.

                                  • Jim Dailey
                                    last edited by Aug 19, 2016, 11:41 AM

                                    @Bahram-Yaghobinia
                                    Sorry, I can’t help you with Python, but it seems like this part of the process:

                                    • A job picks up the data. Converts to xml and saves the xml in text file.

                                    is broken, because it does not (always) create valid XML. The XML is invalid any time it claims to be UTF-8 but includes characters encoded as a single byte (e.g. a “£” encoded as 0xA3) that require multiple bytes to be properly encoded in UTF-8 (e.g. “£” should be encoded as 0xC2 0xA3).
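The point above is easy to check: a strict UTF-8 decoder rejects the single-byte ANSI form (a short Python sketch):

```python
# In cp1252 ("ANSI"), £ is the single byte 0xA3; in UTF-8 it is two bytes.
assert "£".encode("cp1252") == b"\xa3"
assert "£".encode("utf-8") == b"\xc2\xa3"

# A lone 0xA3 is not a valid UTF-8 sequence, so an XML parser that trusts
# encoding="UTF-8" would fail on such a file.
try:
    b"\xa3".decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
assert rejected
```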

                                    • Claudia Frank @Bahram Yaghobinia
                                      last edited by Aug 19, 2016, 10:42 PM

                                      @Bahram-Yaghobinia

                                      I still think this is the wrong way to solve the problem, because you lose the automation by interacting
                                      with npp to run the script. But since you insist on the Python Script plugin solution, the lines in question are

                                      notepad.runMenuCommand("Encoding", "Convert to UTF-8")
                                      notepad.save()
                                      

                                      This only works if you have an English UI; if you use a different language, you have to replace
                                      “Encoding” and “Convert to UTF-8” with the entries from your language.

                                      Cheers
                                      Claudia

                                      • Mahabarata
                                        last edited by Aug 20, 2016, 4:47 PM

                                        For me the problem is a problem of npp rather than anything else.

                                        The encoding of a text with no special characters could be anything : iso-8859-1, iso-8859-2, iso-8859-15, windows-1252, utf-8, ASCII and, I imagine, a lot of others.
                                        The encoding of a file containing only normal characters can’t be known just by looking at it !!!

                                        In PHP, there is a function : mb_detect_encoding($txt, array(encoding1, encoding2)).
                                        When you use it, PHP tries to work out which encoding is used in the string $txt : it starts with encoding1 ; if it fits, the function answers encoding1 ; if it doesn’t, the function tries the second encoding, and so on.

                                        So if there is no special character in $txt, the function will tell you the encoding is encoding1. In your case, with encoding1 = utf-8, the function will tell you that $txt is utf-8, even though it’s impossible to know !
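That first-match strategy can be sketched in Python (a toy analogue of mb_detect_encoding, not what npp itself does):

```python
def detect_encoding(data, candidates):
    """Return the first candidate encoding that decodes the bytes cleanly."""
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

# Pure-ASCII bytes decode under every candidate, so the first one "wins"...
assert detect_encoding(b"hello", ["utf-8", "cp1252"]) == "utf-8"
# ...while a lone 0xE9 byte is invalid UTF-8 and falls through to cp1252.
assert detect_encoding(b"caf\xe9", ["utf-8", "cp1252"]) == "cp1252"
```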

                                        npp has no option to do the same (it is what I was looking for between my first post and my second one), except “Settings - Preferences… - New Document - Apply to opened ANSI files” (thanks again Claudia).
                                        You can’t get npp to treat your docs as iso-8859-1, iso-8859-2 or ASCII forever ! For me it’s a big problem, because a Polish user will probably use an iso-8859-2 encoding, and npp will say that some docs are in ANSI (that is, windows-1252) and others are in iso-8859-2 !

                                        So the problem is not only a utf-8/ANSI problem, but one involving a lot of encodings !

                                        I think it would be a good evolution of npp to add an option saying what to do when a doc is opened and it is impossible to know the encoding : only the user can tell ; npp, however clever it is, never can !

                                        • Claudia Frank @Mahabarata
                                          last edited by Aug 21, 2016, 11:27 PM

                                          @Mahabarata

                                          You are correct, there is no way to always guess the correct encoding.
                                          In regards to PHP’s mb_detect_encoding function: npp,
                                          when “Autodetect character encoding” is checked, uses Mozilla’s chardet library,
                                          so it has such functionality but, as you already found out, it cannot guess the
                                          correct encoding every time.

                                          I would also find it very useful if the setting
                                          New Document->Encoding: UTF-8 and Apply to opened ANSI files (or any other configured encoding)
                                          forced npp to treat all newly opened documents as the configured encoding when
                                          auto-detection of the encoding has been disabled.

                                          Cheers
                                          Claudia

                                          The Community of users of the Notepad++ text editor.