
UTF-8 doc becomes ANSI doc !

    Claudia Frank @Bahram Yaghobinia
    last edited by Aug 18, 2016, 5:20 PM

    @Bahram-Yaghobinia and all,

    maybe I misunderstood the topic, but I thought the problem is that an XML file,
    which has been created outside of npp, gets loaded, manipulated and saved as txt.
    Is this the case? If not, why not explain the steps in detail? Otherwise we
    are fishing in the dark trying to find a solution for you.

    In addition, I totally agree with what guy wrote about npp detecting UTF-8 files,
    but unfortunately there is also a reason not to use UTF-8 with BOM.
    If your manipulated data gets loaded/processed by other applications,
    e.g. databases, web servers, web frameworks etc., those applications might
    not handle the data correctly. Unfortunately, there are still many such
    applications in use which don’t support UTF-8 BOM files.
    I don’t know if this is the case for you - so, just for information.

    Cheers
    Claudia

      Bahram Yaghobinia
      last edited by Aug 18, 2016, 6:34 PM

      • Data is sitting in the queue.
      • A job picks up the data, converts it to XML and saves the XML in a text file.
      • Brings it to the server and saves it on disk.
      • Right now Notepad++ is the default text editor on the server.
      • All text files are saved as UTF-8 except the files that have special characters (ALT ####); these files are saved as ANSI.
      • Notepad++ set up:
      o Settings --> Preferences --> New Document: I have UTF-8 selected and “Apply to opened ANSI files” is checked.
      o Unchecked “Autodetect character encoding”.
      I am trying to add a script that will do this for me, but I have no idea how it works. Something like notepad.runMenuCommand("Encoding", "Convert to UTF-8").

        guy038
        last edited by guy038 Aug 18, 2016, 7:21 PM

        Hi, Bahram-Yaghobinia,

        I was quite surprised and really sorry that your problem isn’t solved at all and seems even worse than before :-(( But, after some minutes, I realized that this behaviour is quite logical:

        • As you changed the default encoding for a new document to UTF-8 BOM, which, of course, has no option relative to opened ANSI files, Notepad++ will never try to change an ANSI file it reads into a true UTF-8 file!

        • The contents of your file, opened with your default editor N++ 6.9, seem to be one-byte characters only. So, just like the characters with a value < \x80, characters such as £ © ® — or € are also written as a one-byte sequence, between \x80 and \xFF. Therefore, N++ always saves it with its present ANSI encoding, without any encoding conversion!

        ( See the list of all of them, for values > 127, using the N++ menu option Edit - Character Panel )

        So, a solution would be to run a simple script, when starting N++, which:

        • would apply the menu option Encoding - Convert to UTF-8 BOM

        • would save the new UTF-8 contents of your file

        I think that a Python or NppExec script should do that job easily!
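
        Such a script could look, roughly, like this (just a sketch, assuming the Python Script plugin is installed and an English UI; both menu strings are localized, and the exact label of the BOM variant depends on your N++ version, so check your own Encoding menu):

        # Sketch for the Python Script plugin (Plugins > Python Script > New Script).
        # Both menu strings below are localized - adjust them for non-English UIs.
        notepad.runMenuCommand("Encoding", "Convert to UTF-8")   # pick the BOM variant if your menu offers one
        notepad.save()                                           # write the converted buffer back to disk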


        For information, let’s suppose the exact text £ , © , ® , — or € in a new file, with an ANSI encoding. This text would produce the sequence of bytes:

        • A3 20 2C 20 A9 20 2C 20 AE 20 2C 20 97 20 6F 72 20 80 (18 bytes)

        Once this text is converted to the UTF-8 BOM encoding, it would give the sequence of bytes below:

        • EF BB BF C2 A3 20 2C 20 C2 A9 20 2C 20 C2 AE 20 2C 20 E2 80 94 20 6F 72 20 E2 82 AC (28 bytes)

        The five characters which have a different representation in the two encodings are £ (C2 A3), © (C2 A9), ® (C2 AE), — (E2 80 94) and € (E2 82 AC), and the EF BB BF at the beginning of the second sequence is the BOM itself.
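
        If you want to verify those two sequences yourself, here is a quick check in plain Python 3 (run outside of N++, assuming cp1252 as the ANSI code page):

        text = "£ , © , ® , — or €"
        ansi = text.encode("cp1252")         # "ANSI" (Windows-1252): one byte per character -> 18 bytes
        utf8_bom = text.encode("utf-8-sig")  # UTF-8 with the EF BB BF BOM prepended         -> 28 bytes
        print(len(ansi), ansi.hex())
        print(len(utf8_bom), utf8_bom.hex())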

        Cheers,

        guy038

            Claudia Frank @Bahram Yaghobinia
            last edited by Aug 18, 2016, 7:23 PM

            @Bahram-Yaghobinia

            I agree with guy, but I would say that the process which creates the XML needs to take
            care of this, as UTF-8 is the standard encoding for XML. If this process was designed for
            writing XML, it should have an option to save as UTF-8. Did you double-check this?

            Cheers
            Claudia

              Jim Dailey
              last edited by Aug 18, 2016, 8:12 PM

              @Bahram-Yaghobinia

              Do the XML files contain something like this as their first line:

              <?xml version="1.0" encoding="???" ?>
              

              If so, can you provide that line to us?

              If there is no such line, or if it does not include information about the encoding method, then UTF-8 is assumed.

              I think that means that if the file contains a £ encoded as “A3” instead of “C2 A3”, then it isn’t technically valid XML (because it isn’t encoded as UTF-8).

              Guy, Claudia, or anyone else who has a better understanding please correct me if I am wrong.
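
              (Any strict XML parser will refuse such a file; for example, checking with Python’s built-in parser on a made-up minimal document, where the single byte 0xA3 is the Windows-1252 encoding of “£”:)

              import xml.etree.ElementTree as ET

              # Declares UTF-8 but contains the raw single byte 0xA3 instead of C2 A3
              bad = b'<?xml version="1.0" encoding="UTF-8"?><price>\xa3 10</price>'

              try:
                  ET.fromstring(bad)
              except ET.ParseError as err:
                  print("rejected:", err)   # the parser flags the invalid UTF-8 byte sequence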

                Bahram Yaghobinia
                last edited by Aug 19, 2016, 2:35 AM

                Jim, all the XMLs have
                <?xml version="1.0" encoding="UTF-8" ?>
                I am working on the source as well, to see if anything can be done to save as UTF-8. I would like to pursue the script option (Python), but at this time I have no idea how it is done.

                  Jim Dailey
                  last edited by Aug 19, 2016, 11:41 AM

                  @Bahram-Yaghobinia
                  Sorry, I can’t help you with Python, but it seems like this part of the process:

                  • A job picks up the data, converts it to XML and saves the XML in a text file.

                  is broken because it does not (always) create valid XML. The XML is invalid any time it claims to be in UTF-8 but includes characters encoded in a single byte (e.g. a “£” encoded as 0xA3) that require multiple bytes to be properly encoded in UTF-8 (e.g. “£” should be encoded as 0xC2 0xA3).

                    Claudia Frank @Bahram Yaghobinia
                    last edited by Aug 19, 2016, 10:42 PM

                    @Bahram-Yaghobinia

                    I still think this is the wrong way to solve the problem, because you lose the automation
                    by having to interact with npp to run the script, but since you insist on the Python Script
                    plugin solution, the lines in question are

                    notepad.runMenuCommand("Encoding", "Convert to UTF-8")   # convert the active buffer
                    notepad.save()                                           # then write it back to disk

                    But this only works if you have an English UI; in case you use a different language, you have
                    to replace "Encoding" and "Convert to UTF-8" with the corresponding strings from your language.
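
                    If running it by hand is acceptable, the script can at least be extended to walk every file that
                    is currently open - a rough sketch, assuming the Python Script plugin’s notepad.getFiles() and
                    notepad.activateBufferID() helpers and, again, an English UI:

                    # Convert every open document to UTF-8 and re-save it.
                    # The menu strings are localized - adjust them for non-English UIs.
                    for filename, buffer_id, index, view in notepad.getFiles():
                        notepad.activateBufferID(buffer_id)                     # bring this buffer to the front
                        notepad.runMenuCommand("Encoding", "Convert to UTF-8")  # same call as above
                        notepad.save()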

                    Cheers
                    Claudia

                      Mahabarata
                      last edited by Aug 20, 2016, 4:47 PM

                      For me the problem is rather a problem with npp than anything else.

                      The encoding of a text with no special characters could be anything: iso-8859-1, iso-8859-2, iso-8859-15, windows-1252, utf-8, ASCII and, I imagine, a lot of others.
                      The encoding of a file with only ordinary characters simply can’t be known just by looking at it!

                      In PHP, there is a function: mb_detect_encoding($txt, array(encoding1, encoding2)).
                      When you use it, PHP tries to work out which encoding is used in the string $txt: it starts with encoding1; if it fits, the function answers encoding1; if it doesn’t, the function tries the second encoding, and so on.

                      So if there is no special character in $txt, the function will tell you the encoding is encoding1. In your case, with encoding1 = utf-8, the function will tell you that $txt is UTF-8 even though it is impossible to know!
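
                      (The same ambiguity is easy to see outside of PHP; for example, in plain Python 3:)

                      data = b"Nothing but plain ASCII here."

                      # Every decoding below succeeds and yields exactly the same text, so no tool
                      # can tell from the bytes alone which of these encodings was really intended.
                      for enc in ("ascii", "utf-8", "iso-8859-1", "iso-8859-2", "cp1252"):
                          print(enc, "->", data.decode(enc))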

                      npp has no option to do the same (that is what I was looking for between my first post and my second one) except “Settings - Preferences… - New Document - Apply to opened ANSI files” (thanks again Claudia).
                      You can’t get npp to treat your docs as always iso-8859-1, iso-8859-2 or ASCII! For me it’s a big problem, because a Polish guy will probably use an ISO-8859-2 encoding, and npp will say that some docs are in ANSI (that is, windows-1252) and others are in ISO-8859-2!

                      So the problem is not only a problem with UTF-8/ANSI but with a lot of encodings!

                      I think it would be a good evolution of npp to add an option telling it what to do when a doc is opened and the encoding is impossible to know: only the user can tell; npp, however clever it is, never will!

                        Claudia Frank @Mahabarata
                        last edited by Aug 21, 2016, 11:27 PM

                        @Mahabarata

                        you are correct, there is no way to always guess the correct encoding.
                        In regards to PHP’s mb_detect_encoding function: npp,
                        when “Autodetect character encoding” is checked, uses Mozilla’s chardet library,
                        so it does have such functionality but, as you already found out, cannot guess the
                        correct encoding every time.
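
                        (The same detector is available to Python as the chardet package, which makes it easy to
                        see why it can only ever guess - a small example, assuming chardet is installed:)

                        import chardet

                        sample = "naïve café".encode("cp1252")   # some single-byte Windows-1252 data
                        guess = chardet.detect(sample)
                        # For short or ambiguous input the detector returns its best guess together with a
                        # confidence value (typically well below 1.0) - it cannot know the original encoding
                        # for certain.
                        print(guess)   # e.g. {'encoding': ..., 'confidence': ..., 'language': ...}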

                        I would also find it very useful if the setting
                        New Document->Encoding: UTF-8 and Apply to opened ANSI files (or any other configured encoding)
                        would force npp to treat all newly opened documents as the configured encoding when
                        auto-detection of the encoding has been disabled.

                        Cheers
                        Claudia

                          gerdb42 @Claudia Frank
                          last edited by Aug 22, 2016, 9:49 AM

                          @Claudia-Frank

                          Let’s assume a file contains the byte sequence 20-A9-20 (in ANSI this would be space-copyright-space). This sequence is invalid in UTF-8, so NPP has no alternative other than assuming a single-byte encoding. And since it never makes changes to the file’s content on its own, it is left to treat such a file as ANSI (or whatever your favourite single-byte encoding is).

                          This is not a shortcoming of NPP but part of the single-byte heritage we still have to deal with today.
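
                          (This is straightforward to verify, e.g. in Python, with cp1252 standing in for “ANSI”:)

                          data = bytes([0x20, 0xA9, 0x20])   # space, copyright sign, space

                          print(data.decode("cp1252"))       # ' © ' - perfectly fine as a single-byte encoding
                          try:
                              data.decode("utf-8")
                          except UnicodeDecodeError as err:
                              print("not valid UTF-8:", err)  # 0xA9 is a lone continuation byte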

                            Claudia Frank @gerdb42
                            last edited by Aug 22, 2016, 5:25 PM

                            @gerdb42

                            I assume we have the same understanding, so I’m interested to know
                            what I have written that could be misunderstood.
                            Could you point me to my error?

                            Thank you and cheers
                            Claudia

                              gerdb42 @Claudia Frank
                              last edited by Aug 23, 2016, 7:36 AM

                              Not quite an error, but what @Claudia-Frank said:

                              I would also find it very useful if the setting
                              New Document->Encoding: UTF-8 and Apply to opened ANSI files (or any other configured encoding)
                              would force npp to treat all newly opened documents as the configured encoding when
                              auto-detection of the encoding has been disabled.

                              would require an implicit conversion to UTF-8. And besides breaking the principle of not making changes without user action, it would open up a whole bunch of other issues.

                                Claudia Frank @gerdb42
                                last edited by Aug 23, 2016, 10:49 PM

                                @gerdb42

                                I agree that this would break the principle, but on the other hand it could be beneficial as well.
                                But now, as I’m typing, I’m thinking: when this conversion takes place and you don’t know which
                                encoding the document came from, you might corrupt it without knowing how to fix it.
                                Yes - bad idea.

                                Cheers
                                Claudia
