How to auto-convert text (Umlaute) when changing file encoding from ANSI to UTF-8 BOM?

Claudia Svenson

Assume I open a new, empty file with file encoding ANSI.

I type some German text which contains Umlaute like äöü.

Later I decide to switch the file encoding to UTF-8 by clicking on menu

Encoding—>UTF-8 BOM

Yes, the file encoding is now UTF-8 BOM.

But the Umlaute äöü appear now as
xE4 xF6 xFC (with black background)

How can I tell NP++ to automatically convert all Umlaute to the corresponding UTF-8 bytes when switching file encoding from ANSI to UTF-8?

If this is not possible automatically:
How can I mark the text and do it manually?

Coises

Use the bottom section of the Encoding menu, e.g., Convert to UTF-8-BOM, when you want to convert.

There are usually only two times you should use the top section:

when you have a completely empty new file and you want to change from the default encoding to something else before you start adding text;
when you have just opened a file and the encoding Notepad++ determined it to be is wrong, so you want to change it and have Notepad++ reread the file as a different encoding.
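The difference between the two halves of the menu can be illustrated outside Notepad++. This is only a minimal Python sketch, assuming the system ANSI code page is Windows-1252 (`cp1252`); the variable names are made up for the illustration.

```python
# Assumption: the ANSI code page is Windows-1252 (cp1252).
ansi_bytes = "äöü".encode("cp1252")   # what the ANSI document holds: E4 F6 FC

# Top section of the menu: keep the bytes, change only how they are interpreted.
# E4 F6 FC is not a valid UTF-8 sequence, so it cannot be shown as text.
try:
    ansi_bytes.decode("utf-8")
    reinterpret_ok = True
except UnicodeDecodeError:
    reinterpret_ok = False

# Bottom section ("Convert to ..."): keep the characters, rewrite the bytes.
utf8_bytes = ansi_bytes.decode("cp1252").encode("utf-8")   # C3 A4 C3 B6 C3 BC

print(ansi_bytes.hex(" "), reinterpret_ok, utf8_bytes.hex(" "))
```

The unconverted bytes are what Notepad++ displays as xE4 xF6 xFC on a black background.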

Claudia Svenson

@Coises

It works. Thanks.

Claudia Svenson

It works only partially.

Assume I started with an empty file and ANSI encoding.
I write some text.

Then (later) I copied some UTF-8 encoded text from a browser web page or from another document into this ANSI file.

Now this file contains two types of text:
One part is ANSI encoded, the other UTF-8 encoded.

No matter whether I switch the file encoding or convert the text,
a part of the file content does not match the encoding.

What I need is a smarter convert feature:

If I select a part of the text and click “Convert to UTF-8 BOM”, then NP++ should…

…check whether some text is marked. If yes, only the marked text should be converted. Otherwise the full text.

Can this be implemented in the next release?

Coises

@Claudia-Svenson said:
Assume I started with an empty file and ANSI encoding.
I write some text.

Then (later) I copied some UTF-8 encoded text from a browser web page or from another document into this ANSI file.

Now this file contains two types of text:
One part is ANSI encoded, the other UTF-8 encoded.

No matter whether I switch the file encoding or convert the text,
a part of the file content does not match the encoding.

Have you actually tried this? Can you show a minimal demonstration? I can’t reproduce it.

When you paste text from the Windows clipboard into a document, the text should be converted right then to match the current encoding Scintilla (the control used to display documents in Notepad++) is using. (That encoding is not always the same as the file encoding that will be saved; it will always be either ANSI, if the file encoding is ANSI, or else UTF-8; anything else is converted when reading or writing the file.) There cannot be two different encodings in the same document window in Notepad++.

Does the text appear wrong in Notepad++ when you paste it? Or are you saying that it looks good when you paste it, but when you reload the file the text you pasted is corrupted?

If the text appears wrong when you paste, it is probably a problem with the application from which you are copying the text. If it is a common application that some of us might have, please tell us and give an example of how to reproduce the problem; but I suspect it will be out of Notepad++’s control.

If the text appears good when you paste it but is corrupt when you reload, then you are probably pasting characters that are not in the codepage you are using. That can happen if you are using a named legacy codepage (not ANSI, but something like ISO-8859-15), because internally Notepad++ uses UTF-8 when you have anything other than ANSI. The pasted characters look fine, because they exist in UTF-8, but they can’t be converted to the codepage when you save if they aren’t in the codepage.
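Why such characters cannot survive the save can be sketched in Python. This assumes ISO-8859-15 as the legacy codepage and uses a made-up sample string; what Notepad++ itself writes for the unsupported characters is not shown here, only why the conversion cannot be lossless.

```python
text = "Grüße, Привет"   # German and Cyrillic characters together

# UTF-8 can represent every character, so the text displays correctly.
as_utf8 = text.encode("utf-8")

# ISO-8859-15 has no Cyrillic letters, so a strict conversion fails ...
try:
    text.encode("iso-8859-15")
    fits = True
except UnicodeEncodeError:
    fits = False

# ... and a lossy conversion replaces each unsupported character with "?".
lossy = text.encode("iso-8859-15", errors="replace").decode("iso-8859-15")

print(fits, lossy)
```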

Claudia Svenson

@Coises

You want a sample. OK, here it is.
Download the following simplified text file with UTF-8 BOM encoding.
I zipped it to prevent conversion by the web server.

https://mega.nz/file/RMQlzCTD#LhDRpJSoWAL4Vi6EP8-XlUyDeHpfp1-_aRFLlCMzICk

The first two lines contain English sentences with German Umlaute.
The last two lines contain Russian/Cyrillic text.

If I switch the encoding to ANSI, I can see the Umlaute, but the Russian text is scrambled.

How can I convert only a part (e.g. the first two lines) from ANSI to UTF-8 BOM?

Coises

1. Select the lines or characters that are incorrectly encoded (ANSI characters in a UTF-8-BOM file).
2. Switch to ANSI (Encoding|ANSI).
3. Copy the highlighted characters (which should now appear correctly) to the clipboard.
4. Switch to UTF-8-BOM (Encoding|UTF-8-BOM).
5. Paste.
6. You can now save the corrected file as UTF-8-BOM.

With your example file, the highlighted section persists perfectly when changing encoding. I wouldn’t trust that to be the case always: watch what is happening to be sure the right section is highlighted both when copying and when pasting.

(Earlier in this thread I said you usually shouldn’t use the top section of the Encoding menu except in the two particular cases that I listed. This is one of the rare other cases where you need to use the top section.)
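In terms of bytes, the copy-and-paste procedure above amounts to decoding the selected range with the ANSI code page and writing it back as UTF-8. A rough Python sketch, assuming Windows-1252 as the ANSI code page; `reencode_range` and the sample buffer are hypothetical and not part of Notepad++:

```python
def reencode_range(data: bytes, start: int, end: int, legacy: str = "cp1252") -> bytes:
    """Return data with data[start:end] re-encoded from a legacy code page to UTF-8."""
    fixed = data[start:end].decode(legacy).encode("utf-8")
    return data[:start] + fixed + data[end:]

# Hypothetical buffer: Cyrillic text stored as UTF-8, umlauts stored as cp1252 bytes.
prefix = "Привет ".encode("utf-8")
mixed = prefix + "äöü".encode("cp1252")

# The "selection" is the byte range that holds the wrongly encoded umlauts.
repaired = reencode_range(mixed, len(prefix), len(mixed))

print(repaired.decode("utf-8"))
```

As with the manual steps, the result is only correct if the chosen range covers exactly the wrongly encoded bytes.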

Claudia Svenson

@Coises

Thank you for your instructions. They work.
However, all these steps are somewhat cumbersome.

I would have expected from such a feature-rich tool as NP++ that I can mark/select just a part of the full text and convert it in place to a new encoding (by a new menu entry or via the smart convert described earlier), all without switching the encoding back and forth and using the clipboard.

Unfortunately, this seems not to be the case.
Thank you anyway.

FreeMeow

@Claudia-Svenson

If this is something you do many times, you can record it as a macro.

Start with the text selected.
Press Macro|Start Recording.
Do the sequence up to and including the paste (Coises’ steps 2-5).
Press Macro|Stop Recording.
Press Macro|Save Current Recorded Macro.

From this point you can just select the text and press the macro either from the macro menu or from your shortcut.

Edit:
After some testing, it seems the encoding change does not get saved in the macro as I expected. Maybe there is some other way to change the encoding so that the action is saved in the macro.

guy038

Hello, @claudia-svenson, @coises, @Freemeow and All,

@claudia-svenson, I downloaded your sample.zip archive and extracted your sample.txt file.

I tried to create a macro from scratch, using the Macro > Start Recording and Macro > Stop Recording commands.

Unfortunately, looking within the Shortcuts.xml file, I realized that the Encoding > ANSI and the Encoding > UTF-8-BOM commands had not been saved in the macro!

BTW, @peterjones, I’m looking for a list of all commands that can be recorded from scratch! I also searched the official documentation, without success!

Thus, the only way to get a valid macro is:

Open your active Shortcuts.xml
Insert, right above the last </Macros> line, the following text:

        <Macro name="Correct ANSI chars in UTF-8 file" Ctrl="no" Alt="no" Shift="no" Key="0">
            <Action type="2" message="0" wParam="45004" lParam="0" sParam="" />
            <Action type="0" message="2178" wParam="0" lParam="0" sParam="" />
            <Action type="2" message="0" wParam="45005" lParam="0" sParam="" />
            <Action type="0" message="2179" wParam="0" lParam="0" sParam="" />
            <Action type="2" message="0" wParam="41006" lParam="0" sParam="" />
        </Macro>

Save the current modifications
Close and restart Notepad++
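If the manual edit has to be repeated, the insertion step could be scripted. The following Python sketch works on the file content as a string; `insert_macro` and the shortened sample content are hypothetical, and reading, backing up and writing the real Shortcuts.xml are deliberately left out.

```python
import xml.etree.ElementTree as ET

def insert_macro(shortcuts_xml: str, macro_xml: str) -> str:
    """Insert macro_xml right above the last </Macros> line of shortcuts_xml."""
    tag_pos = shortcuts_xml.rfind("</Macros>")
    if tag_pos == -1:
        raise ValueError("no </Macros> element found")
    # Back up to the start of the line that holds the closing tag.
    line_start = shortcuts_xml.rfind("\n", 0, tag_pos) + 1
    return shortcuts_xml[:line_start] + macro_xml.rstrip("\n") + "\n" + shortcuts_xml[line_start:]

# Hypothetical, heavily shortened file content used only for this demonstration.
sample = "<NotepadPlus>\n    <Macros>\n    </Macros>\n</NotepadPlus>\n"
macro = '        <Macro name="Demo" Ctrl="no" Alt="no" Shift="no" Key="0">\n        </Macro>\n'

updated = insert_macro(sample, macro)
ET.fromstring(updated)   # raises ParseError if the result is not well-formed XML
print(updated)
```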

Now:

Open your sample.txt file in N++
Select the string of black chars xE4xF6xFC+xF6xFC
Just execute the Correct ANSI chars in UTF-8 file macro

You can also perform this same operation in two steps:

Select the string of black chars xE4xF6xFC
Execute the Correct ANSI chars in UTF-8 file macro

Then:

Select the string of black chars xF6xFC
Re-execute the Correct ANSI chars in UTF-8 file macro

Note that, if you want to test this macro with the original text of your sample.txt file copied into a new N++ file, follow this method:

Open your sample.txt file in N++
Select all your text ( Ctrl + A )
Run the Edit > Paste Special > Copy Binary Content command
Open a new N++ file ( Ctrl + N )
Run the Edit > Paste Special > Paste Binary Content command

And now:

Select the string of black chars xE4xF6xFC+xF6xFC
Execute the Correct ANSI chars in UTF-8 file macro

=> The Save dialog will appear

Save this new file under your desired name!

Best Regards,

guy038

Reminder :

If you installed N++ with an installer package, your active shortcuts.xml file should be within the folder %AppData%\Notepad++
If you installed N++ with a portable package, your active shortcuts.xml file should be along with the Notepad++.exe file, in your install folder

PeterJones

@guy038 said:

BTW, @peterjones, I’m looking for a list of all commands which can be recordable, from scratch ! I also searched in the official documentation without success !

As I say in the FAQ,

The FAQ author has not come across a reference list that enumerates every Notepad++ command and whether it can be recorded in a macro or just played back in a manually-created macro or cannot be used at all in a macro. If you know of such a list, or are willing to create it, feel free to contact the FAQ author, or put in a pull request to update the GitHub copy of this FAQ.

That said, per the source code, these menu commands are recordable, as are some (most?) of the Scintilla commands. But given how huge that full function is, there may be others that can be recorded that aren’t explicitly enumerated in that section. I’ve always been wary of just publishing that list, because there are so many unknowns to me as to other paths through which a menu entry or other action might be recorded or somehow limited, and I’ve never been able to find the equivalent list in the code for the recordability of scintilla commands.

mpheath

@Claudia-Svenson If you use the PythonScript 3 plugin, then this script can read the whole document and replace each invalid UTF-8 byte with its MBCS/ANSI-decoded character.

# about: Replace utf8 invalid bytes with mbcs decoding
# help: https://community.notepad-plus-plus.org/topic/23039
# name: ReplaceUtf8InvalidBytesWithMbcsDecoding
# require: Notepad++ with PythonScript 3 plugin

from Npp import editor, notepad

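def main():
    editor.beginUndoAction()
    i = 0

    for line in range(0, editor.getLineCount()):
        # Retry the same line until it decodes cleanly;
        # each pass repairs one invalid byte sequence.
        for _ in range(editor.lineLength(line)):
            try:
                text = editor.getLine(line)
            except UnicodeDecodeError as e:
                # e.start and e.end index into the line's bytes (e.object),
                # so shift them by the line's start position in the document.
                line_start = editor.positionFromLine(line)
                text = e.object[e.start:e.end].decode('mbcs')
                editor.setTargetRange(line_start + e.start, line_start + e.end)
                editor.replaceTarget(text)
                i += 1
            else:
                break

    editor.endUndoAction()
    notepad.messageBox('Decoded {} replacements with the mbcs codec'.format(i))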

main()
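The same idea can be tried outside Notepad++ with a custom decoding error handler. This is only a sketch under assumptions: the handler name `legacy_fallback` is made up, and `cp1252` stands in for the `mbcs` codec, which exists only on Windows.

```python
import codecs

def _legacy_fallback(error):
    """Decode the bytes that UTF-8 rejected with cp1252 (an assumed code page)."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad = error.object[error.start:error.end]
    # Return the replacement text and the position at which decoding resumes.
    return bad.decode("cp1252", errors="replace"), error.end

codecs.register_error("legacy_fallback", _legacy_fallback)

# Hypothetical mixed buffer: valid UTF-8 followed by cp1252 umlaut bytes.
mixed = "Привет ".encode("utf-8") + b"\xe4\xf6\xfc"
repaired = mixed.decode("utf-8", errors="legacy_fallback")

print(repaired)
```

Legacy bytes that happen to form a valid UTF-8 sequence are not detected by this approach, so the result still needs to be inspected.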

guy038

Hi, @peterjones and All,

Thanks, Peter, for pointing me to this link. That’s exactly what I was looking for !

And after reading this section, it turns out, as expected, that the commands:

IDM_FORMAT_ANSI
IDM_FORMAT_AS_UTF_8
IDM_FORMAT_UTF_8
IDM_FORMAT_UTF_16BE
IDM_FORMAT_UTF_16LE

IDM_FORMAT_CONV2_ANSI
IDM_FORMAT_CONV2_AS_UTF_8
IDM_FORMAT_CONV2_UTF_8
IDM_FORMAT_CONV2_UTF_16BE
IDM_FORMAT_CONV2_UTF_16LE

as well as all the other encodings within the Encoding > Character sets section, are absent and thus not recordable yet! Only the three commands IDM_FORMAT_TODOS, IDM_FORMAT_TOUNIX and IDM_FORMAT_TOMAC can be recorded.

BR

guy038

P.S. :

If this link does not appear in the official doc, it would be nice to add it, somewhere in the Macro section ;-))

Coises

@Claudia-Svenson said:

I would have expected from such a feature-rich tool as NP++ that I can mark/select just a part of the full text and convert it in place to a new encoding (by a new menu entry or via the smart convert described earlier), all without switching the encoding back and forth and using the clipboard.

Probably there is no such tool included because this is a situation that should very, very rarely happen. One file normally contains one encoding — that applies everywhere, not just in Notepad++. (I believe there are some legacy exceptions for East Asian languages, but in those cases the encodings themselves include bytes that describe the encoding changes. To the best of my knowledge, Scintilla, and therefore Notepad++, does not support them.)

If this is happening to you often enough to be a concern, I urge you to look earlier in your process. How are you winding up with files which contain characters in more than one encoding?

As you can see from this demonstration, copying text from one encoding and pasting it into a document with a different encoding ordinarily converts the text to the encoding into which you paste it. So just copying and pasting from some other source should not cause this problem.

I am not ruling out that there could be a bug, whether in Notepad++ or in some other program, that is causing you to get wrongly-encoded text in a file. If so, we should identify that bug. (Especially if it turns out to be in Notepad++, where possibly it could be fixed.)

It is also possible that there is a step in whatever procedures you are following that predictably causes this, like concatenating two files with some other tool without first making sure the files have the same encoding, or using Edit | Paste Special | Copy/Cut/Paste Binary Content inappropriately.

If someone else is supplying you with files that contain multiple encodings and you can’t prevail upon them to correct their procedures, then using a Python script, as @mpheath suggested, is probably the best solution.

mpheath

The Python script I posted is expected to handle single-byte errors from UTF-8 decoding. East Asian ANSI code pages can use two bytes per character, so the script may need to fetch the following byte as well to decode both bytes properly.

You probably will not find recovery tools and the like that fix mixed-encoding errors, as the solution might in some cases be worse than the problem, though in this case it tested OK. Once the file is saved, there might be no going back, so creating a backup before the operation would be wise. Inspect the results to confirm they are OK.
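The single-byte limitation mentioned above can be shown in a few lines of Python. In this sketch `cp932` (Shift-JIS) is assumed as an example of a double-byte ANSI code page, and `decodes_alone` is a hypothetical helper.

```python
# Assumption: cp932 (Shift-JIS) stands in for a double-byte ANSI code page.
two_byte = "あ".encode("cp932")   # one character, two bytes: 82 A0

def decodes_alone(chunk: bytes) -> bool:
    """Report whether chunk is a complete character sequence in cp932."""
    try:
        chunk.decode("cp932")
        return True
    except UnicodeDecodeError:
        return False

# The first byte alone cannot be decoded; both bytes together can.
print(len(two_byte), decodes_alone(two_byte[:1]), decodes_alone(two_byte))
```

A repair that decodes one rejected byte at a time would therefore fail on such text and would have to take the following byte into account.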