Encoding a text file

Soeren2017

Notepad++ v7.3.3 (32-bit) for Windows does not save the new character set of a simple text file after I convert the whole text with the »Encoding« function. Example: convert any simple text file from ANSI to OEM 852 and save the file. No changes are saved! Is that a temporary feature or maybe an unknown error? A little bit strange …

Claudia Frank

@Soeren2017

Is that a temporary feature or maybe an unknown error? A little bit strange …

Or maybe a misunderstanding from the user's point of view.
Why do you think it failed?
You are aware that different encodings share some of the same chars, aren’t you?
What do you mean by ANSI to OEM 852? What is your system code page?

Cheers
Claudia

Soeren2017

@Claudia-Frank

Of course, some chars. In the German language you need characters like ä, ö, ü, ß. I created a music list with the DOS command “dir”. Some characters in the text file look different when I open it with Notepad++. But I thought Notepad++ would change those characters in the file permanently, not just temporarily in the editor. An encoding function is pointless if the user can’t save the file with the changed characters.
By the way: look at the status bar of this app. If you start Notepad++, you’ll see on the right side only the word “ANSI”, although the settings under “New Document” are different.

dinkumoil

@Soeren2017

First I will try to give you an overview of some basics regarding code pages and character encoding. In short: it’s all about numbers.

After that I will provide some possible solutions for your problem.

When you create a plain text file on your hard disk, the program you use writes NUMBERS to the file. These numbers are codes for the actual characters. When the file is loaded to display its content, the software uses an internal table to map the code numbers back to characters.
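
As a minimal sketch of this idea (Python is used here purely for illustration and has nothing to do with Notepad++ itself), encoding a string yields exactly the numbers that would be written to disk. The sample text is an assumption chosen for the example.

```python
# Characters are stored as numbers; the code page decides which number is used.
text = "dir ä"

# The same text encoded with two different code pages.
ansi_codes = list(text.encode("cp1252"))  # Windows "ANSI" code page 1252
oem_codes = list(text.encode("cp850"))    # console code page OEM 850

print(ansi_codes)  # plain ASCII letters get the same numbers in both code pages
print(oem_codes)   # only the number for "ä" differs
```

The ASCII characters map to identical numbers in both code pages; only the umlaut gets a different code.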

You can imagine that there is a huge number of possible mappings of code numbers to actual characters. For that reason, various standardized encoding schemes have been introduced over the past decades to meet the challenge of a growing number of characters to encode (e.g. special characters in European languages, and Cyrillic and East Asian character sets). These encoding schemes are called “code pages”.

Every program that processes plain text uses its own default code page to encode characters. In Notepad++ you can configure the default code page under

Menu “Settings” -> Preferences -> New Document -> Encoding

The default code page of Windows console commands like “dir” depends on the language of the Windows user interface. On German Windows installations the default code page of console commands is called “OEM 850”. That means if the output of the dir command is redirected to a file (as in your case), this file is written with the OEM 850 character encoding.

When this file is loaded into Notepad++, its content (code numbers) is mapped to characters using Notepad++'s default code page, which seems to be ANSI in your case. That’s the reason why German umlaut characters contained in the output of the dir command are displayed incorrectly. The code numbers for the äöüßÄÖÜ characters are different in ANSI and OEM 850.
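
The effect can be reproduced with a short Python sketch. It assumes a hypothetical file name containing “ß” and treats “ANSI” as code page 1252, as on a German system.

```python
# A hypothetical file name as the console would write it (OEM 850).
filename = "Straße.mp3"
raw = filename.encode("cp850")

# Decoding the same bytes with two different tables.
shown_as_ansi = raw.decode("cp1252")  # what an ANSI (1252) view displays
shown_as_oem = raw.decode("cp850")    # what an OEM 850 view displays

print(shown_as_ansi)
print(shown_as_oem)
```

The bytes never change; only the table used to display them does. Note that Python's cp1252 table leaves a few byte values unassigned (0x81, the OEM 850 code for “ü”, is one of them), so decoding text containing “ü” this way would raise an error unless an `errors=` policy is passed.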

To solve your problem there are two different approaches:

1. Setting an appropriate character encoding when creating the file.
2. Using an appropriate code page when displaying the file’s content.

In detail you have four options:

1. When you create the directory listing, you can start a console with the command
   cmd /u
   In this case the output of all internal commands (like “dir”) is written using the Unicode encoding scheme, UTF-16 Little Endian to be precise. Files with this encoding can be displayed correctly in text editors with automatic detection of character encoding (like Notepad++) on most Windows installations worldwide. Unfortunately the files are not fully standard compliant because they lack the Byte Order Mark (BOM), a sequence of bytes at the beginning of the file that indicates the encoding. This can matter, e.g., on East Asian Windows installations.

2. Before executing the dir command, you can change the output code page via the command
   chcp 1252
   On a German Windows installation this sets the output code page to the system’s ANSI code page (in countries other than Germany the system’s ANSI code page may be different, i.e. its number is other than 1252). Files written with this encoding are displayed correctly in every text editor on a German Windows installation.

3. When displaying the file’s content in Notepad++, you can switch to code page OEM 850. This is done via
   Menu “Encoding” -> Character Set -> Western European -> OEM 850
   In this case Notepad++ decodes the file’s content with the same code page that the console dir command used to create the file, and German umlaut characters are displayed correctly.

4. When you create the file, you can use a special filename extension for the output file, e.g. “mlst”. An example command would be
   dir *.mp3 > MusicList.mlst
   If you use my AutoCodepage plugin, you can make Notepad++ decode all mlst files automatically with code page OEM 850.
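
If the file already exists and should be stored permanently in another encoding, it has to be decoded with the code page it was written in and then re-encoded. The following Python sketch shows that round trip outside of Notepad++; the function name `convert_file` and the file names are hypothetical, and OEM 850 as the source encoding is an assumption that fits a German console.

```python
import tempfile
from pathlib import Path

def convert_file(src, dst, src_encoding="cp850", dst_encoding="utf-8"):
    """Decode src with src_encoding and write the text to dst in dst_encoding."""
    raw = Path(src).read_bytes()                      # bytes exactly as stored on disk
    text = raw.decode(src_encoding)                   # numbers -> characters
    Path(dst).write_bytes(text.encode(dst_encoding))  # characters -> new numbers

# Small self-contained demo with hypothetical file names.
with tempfile.TemporaryDirectory() as folder:
    src = Path(folder) / "MusicList.mlst"
    dst = Path(folder) / "MusicList.utf8.txt"
    src.write_bytes("Straße.mp3\r\n".encode("cp850"))  # simulate redirected dir output
    convert_file(src, dst)
    converted = dst.read_bytes()

print(converted.decode("utf-8").strip())
```

Working on raw bytes instead of text-mode file handles keeps the original line endings untouched.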

guy038

Hello, @dinkumoil,

Very interesting and clear post, about encodings ;-))

BTW, I did know the chcp DOS command to change the console encoding, but was not aware of the possibility to have a CMD instance which outputs results in Unicode ( UCS-2 Little Endian )

So, I did a test : after opening a Windows console, I typed the command cmd /u to open a second instance and then just typed the simple command dir > My installation folder of N++\Test.txt. Then I typed the exit command twice and opened my beloved editor.

And I can confirm that my Test.txt file is automatically recognized as UCS-2 Little Endian ( after a glance at the status bar ), even though it does not have, as you said, any Byte Order Mark ( BOM ), and despite the fact that my Autodetect character encoding option, in Settings > Preferences… > MISC, is not set !

Remark : Rather funny to notice that NO encoding seems attributed to this file, when you just click on the Encoding menu ! But, it’s quite logical, because its encoding is not the true UCS-2 Little Endian BOM encoding.

So, to get a true Unicode file, with an invisible BOM ( the two invisible bytes FF FE ), just use the Encoding > Convert to UCS-2 LE BOM option. This time, you should see, in the status bar, the indication UCS-2 LE BOM ! ( You may also use the UTF-8 encoding, with the Encoding > Convert to UTF-8 BOM option ).
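
For readers who want to see those invisible bytes, here is a small Python sketch (an illustration only, not what Notepad++ does internally) that builds the same sample text with a UTF-16 LE BOM and with a UTF-8 BOM. The sample text is an assumption.

```python
import codecs

text = "Test"

# UTF-16 Little Endian with an explicit BOM (FF FE) in front.
utf16_le_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# "utf-8-sig" is Python's name for UTF-8 with a leading BOM (EF BB BF).
utf8_bom = text.encode("utf-8-sig")

print(utf16_le_bom[:2].hex(" "))  # ff fe
print(utf8_bom[:3].hex(" "))      # ef bb bf
```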

Although I don’t have many encoding problems, I’ll have a look at your AutoCodePage plugin soon ! Combined with a specific file extension, it could be useful to some of us ;-))

Best Regards,

guy038

dinkumoil

@guy038

despite the fact that my Autodetect character encoding, option, …, is not set!

Seems that our “beloved editor” ;-) is even better than we thought. Maybe the zero byte contained in each code number triggers the automatic decoding as UCS-2 LE.
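
How such a zero-byte guess could work is sketched below in Python. This is purely an assumption for illustration, not Notepad++'s actual detection logic; the function name `looks_like_utf16_le` and the threshold are made up for the example.

```python
def looks_like_utf16_le(raw: bytes, threshold: float = 0.4) -> bool:
    """Guess whether raw is BOM-less UTF-16 LE text consisting mostly of Latin characters.

    In UTF-16 LE, every character below U+0100 is stored as <low byte> <0x00>,
    so the bytes at odd positions are mostly zero and the even ones are not.
    """
    if len(raw) < 2 or len(raw) % 2 != 0:
        return False                      # UTF-16 always has an even byte count
    low_bytes = raw[0::2]
    high_bytes = raw[1::2]
    zero_ratio = high_bytes.count(0) / len(high_bytes)
    return zero_ratio >= threshold and low_bytes.count(0) == 0

sample = "Verzeichnis von C:\\Musik"
print(looks_like_utf16_le(sample.encode("utf-16-le")))
print(looks_like_utf16_le(sample.encode("cp850")))
```

A heuristic like this can be fooled by binary data or by text in scripts whose UTF-16 code units rarely contain a zero byte, which is one reason a BOM is the more reliable signal.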

Concerning my plugin: Only a few people really need it, it’s the moribund species of batch scripters.

David Bailey

@Soeren2017

Perhaps it is worth pointing out that, unlike say .DOC files, text files use a very primitive format in which almost every byte corresponds to a character on the screen. Other than the first two or three bytes (the BOM) that are sometimes used to indicate UTF-8 or another Unicode encoding, there just isn’t anywhere in a text file to stuff extra (invisible) information!
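
A minimal Python sketch of what a program can learn from those leading bytes, and nothing more, is shown here. The function name `sniff_bom` is hypothetical, and UTF-32 is left out for brevity (its little-endian BOM also begins with FF FE, so a complete check would have to test it first).

```python
import codecs

def sniff_bom(raw: bytes):
    """Return the encoding announced by a leading BOM, or None if there is none."""
    known_boms = (
        (codecs.BOM_UTF8, "UTF-8"),          # EF BB BF
        (codecs.BOM_UTF16_LE, "UTF-16 LE"),  # FF FE
        (codecs.BOM_UTF16_BE, "UTF-16 BE"),  # FE FF
    )
    for bom, name in known_boms:
        if raw.startswith(bom):
            return name
    return None  # no hint in the file: ANSI, OEM 850, ... all look the same here

print(sniff_bom(codecs.BOM_UTF8 + b"abc"))
print(sniff_bom("Straße".encode("cp850")))
```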

guy038

Hi All,

I found a second solution for outputting console text with a Unicode encoding !

Once a console window is open, simply execute the command chcp 65001. From now on, until you close that CMD instance, the output will use the Unicode UTF-8 encoding :-))

Refer to the complete Microsoft table of Code Page Identifiers, below :

https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx

Cheers,

guy038

Soeren2017

@dinkumoil Thank you for your detailed explanation. It was great! But one question is still unclear to me. Why does Notepad++ not save the encoded file? I can encode any plain text file into any code page of the world for temporary viewing, but obviously I can’t save the encoded file. Why? Where is the problem? I think this is an error in this nice app, isn’t it?

The other guys: Thank you for your response!

Claudia Frank

@Soeren2017

if I’m allowed - the problem is that these “ansi code pages” all treat the underlying
data (numbers) the same way. That means each code page always assigns an 8-bit value to a glyph.
So there is no way for npp to find out whether the value it reads, e.g. 0xFC (252),
should be treated as ü (CP1252) or as superscript three (CP850).

I guess it could be said that the different “ansi code pages” are just different views
of the same underlying data. As long as there is no info in the file itself about which
encoding/code page was used to create it, you have to know it,
or, from npp's point of view, it has to guess it.
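
The 0xFC example can be checked with a few lines of Python (used here only as a neutral illustration). It also shows the difference between merely viewing the byte through another code page and actually converting the character, which produces a different byte.

```python
raw = bytes([0xFC])  # the single byte stored in the file

# Two views of the same byte; nothing on disk changes.
as_ansi = raw.decode("cp1252")  # "ü"
as_oem = raw.decode("cp850")    # "³"

# A real conversion re-encodes the character and yields a different byte.
converted = as_ansi.encode("cp850")

print(as_ansi, as_oem, converted.hex())
```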

Does this make sense to you?

Cheers
Claudia