    Encoding a text file

    Notepad++ & Plugin Development
    • Soeren2017

      Notepad++ v7.3.3 (32-bit) for Windows does not save the new character set of a simple text file after I convert the whole text with the »Encoding« function. Example: Convert any simple text file from ANSI to OEM 852 and save the file. No changes are saved! Is that a temporary feature or maybe an unknown error? A little bit strange …

      • Claudia Frank @Soeren2017

        @Soeren2017

        Is that a temporary feature or maybe an unknown error? A little bit strange …

        Or maybe a misunderstanding from the user’s point of view.
        Why do you think it failed?
        You are aware that different encodings share some of the same chars, aren’t you?
        What do you mean by ANSI to OEM 852? What is your system code page?

        Cheers
        Claudia

        • Soeren2017 @Claudia Frank

          @Claudia-Frank

          Of course they share some chars. In the German language you need characters like ä, ö, ü, ß. I created a music list with the DOS command “dir”. When I open this file with Notepad++, some characters in the text are different. But I think Notepad++ changes the differing characters of the file permanently and not just temporarily in the editor. An encoding function is pointless if the user can’t save the file with the changed characters.
          By the way: Look at the status bar of this app. When you start Notepad++ you’ll see on the right side only the word “ANSI”, although the settings under “New Document” are different.

          • dinkumoil

            @Soeren2017

            First I will try to give you an overview of some basics regarding code pages and character encoding. At a glance: It’s all about numbers.

            After that I will provide some possible solutions for your problem.

            When you create a plain text file on your hard disk the program you use for this writes NUMBERS to the file. These numbers are codes for the actual characters. When the file is loaded to display its content the software used for this uses an internal table to map the code numbers to characters.
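To make the "it's all about numbers" point concrete, here is a minimal Python sketch. The sample text "dir" and the choice of the ASCII table are only assumptions made for the illustration.

```python
# Writing: characters are turned into code numbers (bytes) before they reach the disk.
text = "dir"
codes = list(text.encode("ascii"))
print(codes)

# Loading: the same table is used in reverse to turn the numbers back into characters.
restored = bytes(codes).decode("ascii")
print(restored)
```

The file itself only ever holds the numbers; which characters you see depends on the table applied when reading.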

            You can imagine that there is a practically unlimited number of possible mappings of code numbers to actual characters. For that reason, various standardized encoding schemes have been introduced over the past decades to meet the challenge of a growing number of characters to encode (e.g. special characters in European languages, and Cyrillic and East Asian character sets). These encoding schemes are called “code pages”.

            Every software dealing with plain text processing uses its own default code page to encode characters. In Notepad++ you can configure the default code page under

            Menu “Settings” -> New Document -> Encoding

            The default code page of Windows console commands like “dir” depends on the language of the Windows user interface. On German Windows installations the default code page of console commands is called “OEM 850”. That means if the output of the dir command is redirected to a file (as in your case), this file is written with the OEM 850 character encoding.

            When this file is loaded into Notepad++, its content (code numbers) is mapped to characters using Notepad++'s default code page, which seems to be ANSI in your case. That’s why the German umlaut characters contained in the output of the dir command are displayed incorrectly: the code numbers for the characters äöüßÄÖÜ differ between ANSI and OEM 850.
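The mismatch can be reproduced outside Notepad++ with a short Python sketch. It assumes that "ANSI" means Windows-1252 (as on a German Windows installation) and uses Python's `cp850` and `cp1252` codecs; the word "Märchen" is just a made-up sample.

```python
import unicodedata

# Print the code number of each German special character in both code pages.
for ch in "äöüßÄÖÜ":
    oem = ch.encode("cp850")[0]     # code number in OEM 850
    ansi = ch.encode("cp1252")[0]   # code number in Windows-1252
    print(f"{unicodedata.name(ch)}: OEM 850 = 0x{oem:02X}, Windows-1252 = 0x{ansi:02X}")

# A word as a console program would write it in OEM 850 ...
oem_bytes = "Märchen".encode("cp850")
# ... read back with the wrong table (Windows-1252): the ä turns into a low quotation mark.
misread = oem_bytes.decode("cp1252")
# Read back with the matching table, the text is intact.
correct = oem_bytes.decode("cp850")
```

The bytes in `oem_bytes` never change in this sketch; only the table used to interpret them does.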

            To solve your problem there are two different approaches:

            1. Setting an appropriate character encoding when creating the file.
            2. Using an appropriate code page when displaying the file’s content.

            In detail you have four options:

            1. When you create the directory listing you can start a console with the command
              cmd /u
              In this case the output of all internal commands (like “dir”) is written using the Unicode encoding scheme, UTF-16 Little Endian to be precise. Files with this encoding can be displayed correctly in text editors with automatic detection of character encoding (like Notepad++) on most Windows installations worldwide. Unfortunately the files are not fully standard compliant because they lack the Byte Order Mark (BOM), a sequence of bytes at the beginning of the file that indicates the encoding. This may matter e.g. on East Asian Windows installations.

            2. Before executing the dir command you can change the output code page via the command
              chcp 1252
              On a German Windows installation this sets the output code page to the system’s ANSI code page (in countries other than Germany the system’s ANSI code page may be different, i.e. its number is other than 1252). Files written with this encoding are displayed correctly in every text editor on a German Windows installation.

            3. When displaying the file’s content in Notepad++ you can switch to code page OEM 850. This is done via
              Menu “Encoding” -> Character Set -> Western European -> OEM 850
              In this case Notepad++ decodes the file’s content with the same code page that the console dir command used to create the file, and German umlaut characters are displayed correctly.

            4. When you create the file you can use a special filename extension for the output file, e.g. “mlst”. An example command would be
              dir *.mp3 > MusicList.mlst
              If you use my AutoCodepage plugin you can make Notepad++ decode all mlst files automatically with code page OEM 850.
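The idea behind option 3 (decode with the code page the file was written in) can also be sketched in Python, outside of Notepad++. The helper name `reencode` and the sample file name are hypothetical; the sketch assumes the listing was written in OEM 850 and should end up as UTF-8.

```python
def reencode(data: bytes, source: str = "cp850", target: str = "utf-8") -> bytes:
    """Decode bytes with the code page they were written in, then encode them anew."""
    return data.decode(source).encode(target)

# One made-up line of a directory listing, as "dir" would write it under OEM 850.
listing = "Grüße.mp3".encode("cp850")

# Same characters afterwards, but different code numbers on disk.
utf8_listing = reencode(listing)
print(listing.hex(), "->", utf8_listing.hex())
```

For a real file the same two steps apply: read the bytes, decode them with the source code page, and write them back in the target encoding.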

            • guy038

              Hello, @dinkumoil,

              Very interesting and clear post, about encodings ;-))

              BTW, I did know the chcp DOS command to change the console encoding, but I was not aware that it is possible to have a CMD instance which outputs its results in Unicode ( UCS-2 Little Endian )

              So, I did a test : after opening a Windows console, I typed the command cmd /u to open a second instance and then just typed the simple command dir > My installation folder of N++\Test.txt. Then I typed the exit command twice and opened my beloved editor.

              And I can confirm that my Test.txt file is automatically recognized as UCS-2 Little Endian ( after a glance at the status bar ), even though it does not have, as you said, any Byte Order Mark ( BOM ), and even though my Autodetect character encoding option, in Settings > Preferences… > MISC, is not set !

              Remark : Rather funny to notice that NO encoding seems to be selected for this file when you just click on the Encoding menu ! But it’s quite logical, because its encoding is not the true UCS-2 Little Endian BOM encoding.

              So, to get a true Unicode file with an invisible BOM ( the two invisible bytes FF FE ), just use the Encoding > Convert to UCS-2 LE BOM option. This time, you should see the indication UCS-2 LE BOM in the status bar ! ( You may also use the UTF-8 encoding, with the Encoding > Convert to UTF-8 BOM option ).
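What the BOM changes at the byte level can be shown with a small Python sketch. It does not reproduce what Notepad++ does internally; it only uses the BOM constants from Python's standard `codecs` module, and the sample text "Test" is an assumption.

```python
import codecs

text = "Test"

# UTF-16 Little Endian without a BOM, as described above for "cmd /u" output.
no_bom = text.encode("utf-16-le")

# The same text with the two BOM bytes FF FE in front.
with_bom = codecs.BOM_UTF16_LE + no_bom
print(with_bom[:2].hex())

# Python's generic "utf-16" decoder reads the BOM and removes it from the text.
decoded = with_bom.decode("utf-16")

# The UTF-8 variant: "utf-8-sig" writes the three BOM bytes EF BB BF first.
utf8_with_bom = text.encode("utf-8-sig")
```

Apart from the leading BOM bytes, the file content is identical in both UTF-16 variants.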


              Although I don’t have many encoding problems, I’ll have a look at your AutoCodePage plugin soon ! Combined with a specific file extension, it could be useful to some of us ;-))

              Best Regards,

              guy038

              • dinkumoil

                @guy038

                despite the fact that my Autodetect character encoding, option, …, is not set!

                Seems that our “beloved editor” ;-) is even better than we thought. Maybe the zero byte contained in each code number triggers the automatic decoding as UCS-2 LE.
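The zero-byte guess can be illustrated with a rough Python heuristic. This is only a sketch of the idea, not Notepad++'s actual detection logic; the function name `looks_like_utf16_le` and the 90 % threshold are assumptions chosen for the example.

```python
def looks_like_utf16_le(data: bytes, threshold: float = 0.9) -> bool:
    """Guess UTF-16 LE when most bytes at odd offsets (the high bytes) are zero."""
    if len(data) < 2 or len(data) % 2 != 0:
        return False  # UTF-16 text always has an even number of bytes
    high_bytes = data[1::2]
    zero_share = high_bytes.count(0) / len(high_bytes)
    return zero_share >= threshold

utf16_sample = "Volume in drive C".encode("utf-16-le")
ansi_sample = "Volume in drive C!".encode("cp1252")

print(looks_like_utf16_le(utf16_sample))
print(looks_like_utf16_le(ansi_sample))
```

A heuristic like this works for mostly Western text, where every second byte is zero, and becomes unreliable for scripts whose UTF-16 code units rarely contain zero bytes.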

                Concerning my plugin: Only a few people really need it, it’s the moribund species of batch scripters.

                • David Bailey

                  @Soeren2017

                  Perhaps it is worth pointing out that, unlike say .DOC files, text files use a very primitive format in which almost every byte corresponds to a character on the screen. Other than the first two or three bytes (the BOM) that are sometimes used to indicate UTF-16 or UTF-8, there just isn’t anywhere in a text file to stuff extra (invisible) information!

                  • guy038

                    Hi All,

                    I found a second solution for outputting console text with a Unicode encoding !

                    Once a console window is open, simply execute the command chcp 65001. From then on, until you close that CMD instance, the output will use the Unicode UTF-8 encoding :-))
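As a small illustration of what UTF-8 output means for the umlauts, the following Python sketch encodes one character as UTF-8 and then reads the bytes with the wrong table. It assumes Windows-1252 as the ANSI code page and does not involve the console itself.

```python
# In UTF-8 an umlaut takes two bytes instead of one.
utf8_bytes = "ä".encode("utf-8")
print(utf8_bytes.hex())

# Read as UTF-8, the character comes back unchanged.
correct = utf8_bytes.decode("utf-8")

# Read as Windows-1252, each of the two bytes is shown as a separate character.
misread = utf8_bytes.decode("cp1252")
```

So a file written after chcp 65001 must also be opened as UTF-8, otherwise every umlaut shows up as two wrong characters.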

                    Refer to the complete Microsoft table of Code Page Identifiers, below :

                    https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx

                    Cheers,

                    guy038

                    • Soeren2017 @dinkumoil

                      @dinkumoil Thank you for your detailed explanation. It was great! But one question is still unclear to me. Why does Notepad++ not save the encoded file? I can encode any plain text file into any code page of the world for temporary viewing, but obviously I can’t save the encoded file. Why? Where is the problem? I think this is an error of this nice app, isn’t it?

                      The other guys: Thank you for your response!

                      • Claudia Frank @Soeren2017

                        @Soeren2017

                        if I may - the problem is that all these “ANSI code pages” treat the underlying
                        data (numbers) the same way. That is, each code page always assigns an 8-bit value to a glyph.
                        So there is no way for npp to find out whether the value it reads, e.g. 0xFC (252),
                        should be treated as ü (CP1252) or as superscript three (CP850).

                        I guess it could be said that the different “ANSI code pages” are just different views
                        of the same underlying data, and as long as there is no info in the file itself about which
                        encoding/code page was used to create the file, you have to know it
                        or, from npp’s point of view, it has to guess.
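The 0xFC example can be checked with a few lines of Python, using the `cp1252` and `cp850` codecs as stand-ins for the two code pages. The sketch says nothing about how Notepad++ is implemented; it only shows the difference between viewing a byte through another table and actually converting it.

```python
raw = bytes([0xFC])  # the single byte as stored in the file

# Two views of the same, unchanged byte.
as_cp1252 = raw.decode("cp1252")   # u with diaeresis
as_cp850 = raw.decode("cp850")     # superscript three

# A real conversion keeps the character and changes the byte instead.
converted = raw.decode("cp1252").encode("cp850")
print(raw.hex(), "->", converted.hex())
```

Switching the view leaves the file's bytes untouched, so there is nothing new to save; only a conversion produces different bytes on disk.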

                        Does this make sense to you?

                        Cheers
                        Claudia
