Replace in a text file with mixed code page characters (UTF-8 and ANSI)

Rainer Klingler

Hello CodePage Experts,

From a translation job I received a translation of VBA GUI texts in German and Spanish. I gave them a UTF-8 file with the German text in UTF-8 and received, I think, a mixture that has the Spanish text in ANSI. Notepad++ shows the Spanish texts correctly in the ANSI view, but then the German umlauts are displayed wrong. In the UTF-8 view, the Spanish special characters are displayed as xF3 and xED in a black background box.

How can I use the search and replace window to replace this “xF3” with “ó”?

Thanks for your help,
Rainer

PeterJones

@Rainer-Klingler ,

The problem is that the search dialog expects the file to be properly encoded, and it’s not made to easily search for invalid “characters” to convert them into valid ones.

What I might suggest doing is:

1. Save your file and keep a backup copy, so you can easily revert if something goes wrong.
2. Use Encoding > ANSI (not Convert to ANSI) so that the bytes are all interpreted as ANSI. Since every byte is a valid ANSI character (though not necessarily the right character), every character you see in the ANSI-interpreted version will be a character you can search for.
   - This will turn the xF3xED into óí, but it will also turn ä into Ã¤.
3. Now, in regular expression mode, searching for \xF3 or ó will find the ó, and similarly for the other accented characters.
4. Search for each ANSI character and replace it with the two-byte sequence from the table below. (I show the translations for the grave-accent, acute-accent, and n-tilde characters; you can find an online mapping of ANSI characters to their UTF-8 byte sequences if you have others that you need to fix.)
5. After all the replacements have been done to change the ANSI characters to UTF-8 byte sequences (so all the ANSI ones look “wrong” now), use Encoding > UTF-8 (again, not the Convert To version), and both your German ones and the Spanish ones should be correct in UTF-8 now.

ANSI	REPLACEMENT
`À`	`\xC3\x80`
`Á`	`\xC3\x81`
`È`	`\xC3\x88`
`É`	`\xC3\x89`
`Ì`	`\xC3\x8C`
`Í`	`\xC3\x8D`
`Ñ`	`\xC3\x91`
`Ò`	`\xC3\x92`
`Ó`	`\xC3\x93`
`Ù`	`\xC3\x99`
`Ú`	`\xC3\x9A`
`à`	`\xC3\xA0`
`á`	`\xC3\xA1`
`è`	`\xC3\xA8`
`é`	`\xC3\xA9`
`ì`	`\xC3\xAC`
`í`	`\xC3\xAD`
`ñ`	`\xC3\xB1`
`ò`	`\xC3\xB2`
`ó`	`\xC3\xB3`
`ù`	`\xC3\xB9`
`ú`	`\xC3\xBA`
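
If more characters than the ones in the table are needed, the replacement column can be generated instead of looked up. The following is a minimal Python sketch, assuming that "ANSI" on the machine in question means Windows-1252 (it depends on the system code page). The names `ANSI_CHARS`, `utf8_escape`, and `ansi_view` are made up for this illustration.

```python
# Assumption: "ANSI" means Windows-1252 (cp1252); adjust for other code pages.
# Extend this string with any other characters that need fixing.
ANSI_CHARS = "ÀÁÈÉÌÍÑÒÓÙÚàáèéìíñòóùú"


def utf8_escape(ch: str) -> str:
    """Return the UTF-8 bytes of ch written as backslash-x escapes."""
    return "".join(f"\\x{b:02X}" for b in ch.encode("utf-8"))


def ansi_view(ch: str) -> str:
    """Show how the UTF-8 bytes of ch look when interpreted as cp1252.

    Python's cp1252 codec leaves a few bytes (such as 0x81 and 0x8D)
    undefined, so those are shown as U+FFFD rather than raising an error.
    """
    return ch.encode("utf-8").decode("cp1252", errors="replace")


# Print the same two columns as the table: ANSI character, replacement.
for ch in ANSI_CHARS:
    print(f"{ch}\t{utf8_escape(ch)}")
```

For example, `utf8_escape("ó")` gives the `\xC3\xB3` shown in the table, and `ansi_view("ä")` gives the `Ã¤` that an already-correct German umlaut turns into after Encoding > ANSI.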

-----
update: I knew we had previously discussed how to search for invalid bytes. Look at my post from a couple years ago and guy’s reply. So it’s possible to search for invalid bytes when in UTF-8 mode, but the same search can match multiple characters, so it makes it hard to do unique replacements that way. So I think for your description, my multi-step sequence here might be safer for you.

Rainer Klingler

@PeterJones Thank you very much - it works! But you must be careful: after some replacements, new “wrong” characters exist in the ANSI view, and later replacements can match them unintentionally … so the Replace All button is dangerous…
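
One way to sidestep the cascading-replacement risk is to convert the whole file in a single pass, so that no replacement can be hit by a later one. The sketch below is a Python alternative, not part of the Notepad++ procedure. It assumes the stray bytes are Windows-1252, and the handler name `cp1252_fallback` and the function `repair_mixed` are invented for this example. Valid UTF-8 sequences are kept as they are; only bytes that UTF-8 rejects are decoded as Windows-1252.

```python
import codecs


def _cp1252_fallback(err):
    """Error handler: decode the bytes UTF-8 rejected as Windows-1252."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Return the replacement text and the position where decoding resumes.
    return bad.decode("cp1252", errors="replace"), err.end


codecs.register_error("cp1252_fallback", _cp1252_fallback)


def repair_mixed(data: bytes) -> str:
    """Decode bytes that mix valid UTF-8 with stray Windows-1252 bytes."""
    return data.decode("utf-8", errors="cp1252_fallback")


# German text in UTF-8 followed by Spanish text in Windows-1252.
mixed = "für ".encode("utf-8") + "niño".encode("cp1252")
print(repair_mixed(mixed))
```

The repaired string can then be written back with `encoding="utf-8"`. The heuristic is not perfect: a run of Windows-1252 bytes that happens to form a valid UTF-8 sequence would be kept as UTF-8, so the result is still worth checking against the backup copy.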