Old WhatsApp conversations with legacy encoded emojis - find and replace?

jakobeig

Hello. I am trying to export and archive some old WhatsApp conversations from 2012-2013.
However… some of these conversations use legacy encoded emojis from IOS 4.0, pre the Unicode 6.0 Standard. They display like this in the exported WhatsApp text file: 

For example  This is U+E058, which is disappointed face in the legacy standard - but the Unicode disappointed face is U+1F61E. If I create a test WhatsApp conversation put in a disappointed face and export the chat, the basic text file renders the disappointed face correctly (😞) as it uses the most recent Unicode.

In Notepad++, might it be possible to display the hex code points of my old chats, and then perhaps do some kind of find and replace on U+E058 to make it U+1F61E? How would I go about doing something like this? I am approaching the limits of my understanding.

Many thanks.

PeterJones

@jakobeig ,

For such things, I use a script for the PythonScript plugin WhatUniChar.py, which I then assign a keyboard shortcut to (see our FAQ: install and run a script in PythonScript for how to use that plugin to run the downloaded script)

If you want to replace it with a Unicode character that’s not in the BMP (ie, U+10000 or higher), you have to use surrogate codes (as mentioned here in the User Manual).

A site like fileformat.info can tell you what they are – for example, it’s entry for U+1F61E: look at the UTF-16 (0xD83D 0xDE1E) or “C/C++/Java source code” ("\uD83D\uDE1E") values. (Sorry for the ads there; I have my adblocker prevent its horrendous number of ads, so it doesn’t bother me, and I often forget to warn people about the ads when I post a link there; I remembered this time ;-).)

Once you know that you want to convert U+E058 into the surrogate pair U+D83D U+DE1E , then
FIND = \x{E058}
REPLACE = \x{D83D}\x{DE1E}

: =>

jakobeig

@PeterJones I did it - it worked! Thank you so much for going through the process in such detail. I knew it must be possible, and this whole process has taught me a great deal - not least an increased knowledge of Unicode characters!
Now I will look into how to semi-automate this process…

guy038

Hello, @jakobeig, @peterjones and All,

Here are two links to easily get the surrogate values :

https://russellcottrell.com/greek/utilities/SurrogatePairCalculator.htm

You may enter, either, from up to bottom :

The code point of the character
The surrogate pair of the character
The character itself

https://onlinetools.com/unicode/convert-unicode-to-utf16

You enter the character, itself, on the left panel
You get the surrogate pair on the right panel

And, from the same site, the link :

https://onlinetools.com/unicode/convert-unicode-to-code-points

Would return the code-point of the character

Now, for me, the best way is to use the compart.com site and add, either, the character, its code-point, its UTF-16 encoding or its UTF-8 encoding :

So, type in, in the Google search zone, either :

site:compart.com 🚂
site:compart.com 1F682
site:compart.com 0xd83d 0xde82
site:compart.com 0xf0 0x9f 0x9a 0x82

And click, generally, on the first or second proposed link !

The good news is that I also created a macro which calculates, from within N++, the surrogate pair of any char over the BMP, so with code-point >= \x{10000} !!

This macro changes any \x{[1]#####} string, of a stream selection, into its surrogate pair \x{####}\x{####} :

        <Macro name="Surrogates Pairs in Selection" Ctrl="no" Alt="yes" Shift="yes" Key="83">
            <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
            <Action type="3" message="1601" wParam="0" lParam="0" sParam="(?-i)\\x\{(10|[[:xdigit:]])[[:xdigit:]]{4}" />
            <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
            <Action type="3" message="1602" wParam="0" lParam="0" sParam="$0\x1F" />
            <Action type="3" message="1702" wParam="0" lParam="640" sParam="" />
            <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />

            <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
            <Action type="3" message="1601" wParam="0" lParam="0" sParam="(?i)(?:(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(A)|(B)|(C)|(D)|(E)|(F)|(10))(?=[[:xdigit:]]{4}\x1F\})|(?:(0)|(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(A)|(B)|(C)|(D)|(E)|(F))(?=[[:xdigit:]]{0,3}\x1F\})" />
            <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
            <Action type="3" message="1602" wParam="0" lParam="0" sParam="(?{1}0000)(?{2}0001)(?{3}0010)(?{4}0011)(?{5}0100)(?{6}0101)(?{7}0110)(?{8}0111)(?{9}1000)(?{10}1001)(?{11}1010)(?{12}1011)(?{13}1100)(?{14}1101)(?{15}1110)(?{16}1111)(?{17}0000)(?{18}0001)(?{19}0010)(?{20}0011)(?{21}0100)(?{22}0101)(?{23}0110)(?{24}0111)(?{25}1000)(?{26}1001)(?{27}1010)(?{28}1011)(?{29}1100)(?{30}1101)(?{31}1110)(?{32}1111)" />
            <Action type="3" message="1702" wParam="0" lParam="640" sParam="" />
            <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />

            <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
            <Action type="3" message="1601" wParam="0" lParam="0" sParam="([01]{10})([01]{10})(?=\x1F)" />
            <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
            <Action type="3" message="1602" wParam="0" lParam="0" sParam="110110\1\x1F}\\x{110111\2" />
            <Action type="3" message="1702" wParam="0" lParam="640" sParam="" />
            <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />

            <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
            <Action type="3" message="1601" wParam="0" lParam="0" sParam="(?:(0000)|(0001)|(0010)|(0011)|(0100)|(0101)|(0110)|(0111)|(1000)|(1001)|(1010)|(1011)|(1100)|(1101)|(1110)|(1111))(?=[[:xdigit:]]*\x1F\})|\x1F" />
            <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
            <Action type="3" message="1602" wParam="0" lParam="0" sParam="(?{1}0)(?{2}1)(?{3}2)(?{4}3)(?{5}4)(?{6}5)(?{7}6)(?{8}7)(?{9}8)(?{10}9)(?11A)(?12B)(?13C)(?14D)(?15E)(?16F)" />
            <Action type="3" message="1702" wParam="0" lParam="640" sParam="" />
            <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
        </Macro>

Place it within the <macros>..........</macros>. section of your active shortcuts.wml file
Save your shortcuts.wml file
Stop and restart Notepad++

Thus, for example, from the selected text, below, located in any file :

\x{10000}     #  The FIRST Unicode character of PLANE 1 ( SMP )
....
....
....

\x{1F682}     #  The 'STREAM LOCOMOTIVE' character
....
....
....
\x{10FFFF}    #  The LAST Unicode character of PLANE 16 ( SPUA-B)

After execution of this macro, you would get the expected text :

\x{D800}\x{DC00}     #  The FIRST Unicode character of PLANE 1 ( SMP )
....
....
....

\x{D83D}\x{DE82}     #  The 'STREAM LOCOMOTIVE' character
....
....
....
\x{DBFF}\x{DFFF}    #  The LAST Unicode character of PLANE 16 ( SPUA-B)

Remark : I, personally, chose the Alt + Shift + S shortcut, but, of course, you may use any other one by changing the first line of the macro !

Best Regards,

guy038

PeterJones

@PeterJones said in Old WhatsApp conversations with legacy encoded emojis - find and replace?:

For such things, I use a script for the PythonScript plugin WhatUniChar.py , which I then assign a keyboard shortcut to (see our FAQ: install and run a script in PythonScript for how to use that plugin to run the downloaded script)

… and I have another script that might not be helpful for your exact use case (because you have the actual character, not the text of the codepoint), but my pyscReplaceBackslashSequence.py, which I assign to Alt+\, which allows me to type something like U+1F61E or \x{1F61E} or 😞 then hit my Alt+\ shortcut, and it will convert it from the codepoint / entity into the actual character (and it handles the surrogate pair calculation inside the script) … so given the same “input” data that Guy showed in his example, instead of running Guy’s macro to show you the surrogate pair, you could use my script to actually convert that hex-notation into the actual character.

Hmm, and I’ve just added a “TODO” for me to update my WhatUniChar script to show surrogate pairs for those not in the BMP… I’ll post an update here once I’ve got that script updated.

UPDATE: WhatUniChar.py has been updated to show surrogate pair (it of course won’t help with your early-IOS-specific codepoints)

guy038

Hello @peterjones and All,

Peter, I’m afraid that your excellent and improved Python script WhatUniChar.py does not work at all for any NON-ASCII character found in an ANSI encoded file :-(

Luckily, everything is allright if we deal with Unicode encoded files ( UTF-8, UTF-8-BOM, UTF-16 BE and UTF-16 LE )

I suppose that it will not be easy to find out an acceptable solution, because :

The well-known Windows-125x ANSI encodings contains some NON-defined characters
For characters between U+0080 and U+009F, they usually match some unicode characters with code-point over U+00FF
For characters between U+00A0 and U+00FF, they can be searched with an unique syntaxe in ANSI files and with two syntaxes in an Unicode file !

Refer, for example, to the list of the Win-1252 characters, over U+007F, below :

          •---------------------•-------------------------•------------•
          |    SEARCH in an     |      SEARCH in an       | SEARCHED   |
          |  ANSI encoded file  |  UNICODE encoded file   | character  |
          •---------------------•----------•--------------•------------•
          |        \x80         |          :   \x{20AC}   |     €      |
          |        \x81         |   \x81   :   \x{0081}   |         |
          |        \x82         |          :   \x{201A}   |     ‚      |
          |        \x83         |          :   \x{0192}   |     ƒ      |
          |        \x84         |          :   \x{201E}   |     „      |
          |        \x85         |          :   \x{2026}   |     …      |
          |        \x86         |          :   \x{2020}   |     †      |
          |        \x87         |          :   \x{2021}   |     ‡      |
          |        \x88         |          :   \x{02C6}   |     ˆ      |
          |        \x89         |          :   \x{2030}   |     ‰      |
          |        \x8A         |          :   \x{0160}   |     Š      |
          |        \x8B         |          :   \x{2039}   |     ‹      |
          |        \x8C         |          :   \x{0152}   |     Œ      |
          |        \x8D         |   \x8D   :   \x{008D}   |          |
          |        \x8E         |          :   \x{017D}   |     Ž      |
          |        \x8F         |   \X8F   :   \x{008F}   |         |
          •---------------------•----------•--------------•------------•
          |        \x90         |   \x90   :   \x{0090}   |         |
          |        \x91         |          :   \x{2018}   |     ‘      |
          |        \x92         |          :   \x{2019}   |     ’      |
          |        \x93         |          :   \x{201C}   |     “      |
          |        \x94         |          :   \x{201D}   |     ”      |
          |        \x95         |          :   \x{2022}   |     •      |
          |        \x96         |          :   \x{2013}   |     –      |
          |        \x97         |          :   \x{2014}   |     —      |
          |        \x98         |          :   \x{02DC}   |     ˜      |
          |        \x99         |          :   \x{2122}   |     ™      |
          |        \x9A         |          :   \x{0161}   |     š      |
          |        \x9B         |          :   \x{203A}   |     ›      |
          |        \x9C         |          :   \x{0153}   |     œ      |
          |        \x9D         |   \x9D   :   \x{009D}   |         |
          |        \x9E         |          :   \x{017E}   |     ž      |
          |        \x9F         |          :   \x{0178}   |     Ÿ      |
          •---------------------•----------•--------------•------------•
          |        \xA0         |   \xA0   :   \x{00A0}   |            |
          |        \xA1         |   \xA1   :   \x{00A1}   |     ¡      |
          |        \xA2         |   \xA2   :   \x{00A2}   |     ¢      |
          |        \xA3         |   \xA3   :   \x{00A3}   |     £      |
          |        \xA4         |   \xA4   :   \x{00A4}   |     ¤      |
          |        \xA5         |   \xA5   :   \x{00A5}   |     ¥      |
          |        \xA6         |   \xA6   :   \x{00A6}   |     ¦      |
          |        \xA7         |   \xA7   :   \x{00A7}   |     §      |
          |        \xA8         |   \xA8   :   \x{00A8}   |     ¨      |
          |        \xA9         |   \xA9   :   \x{00A9}   |     ©      |
          |        \xAA         |   \xAA   :   \x{00AA}   |     ª      |
          |        \xAB         |   \xAB   :   \x{00AB}   |     «      |
          |        \xAC         |   \xAC   :   \x{00AC}   |     ¬      |
          |        \xAD         |   \xAD   :   \x{00AD}   |           |
          |        \xAE         |   \xAE   :   \x{00AE}   |     ®      |
          |        \xAF         |   \xAF   :   \x{00AF}   |     ¯      |
          •---------------------•----------•--------------•------------•
          |        \xB0         |   \xB0   :   \x{00B0}   |     °      |
          |        \xB1         |   \xB1   :   \x{00B1}   |     ±      |
          |        \xB2         |   \xB2   :   \x{00B2}   |     ²      |
          |        \xB3         |   \xB3   :   \x{00B3}   |     ³      |
          |        \xB4         |   \xB4   :   \x{00B4}   |     ´      |
          |        \xB5         |   \xB5   :   \x{00B5}   |     µ      |
          |        \xB6         |   \xB6   :   \x{00B6}   |     ¶      |
          |        \xB7         |   \xB7   :   \x{00B7}   |     ·      |
          |        \xB8         |   \xB8   :   \x{00B8}   |     ¸      |
          |        \xB9         |   \xB9   :   \x{00B9}   |     ¹      |
          |        \xBA         |   \xBA   :   \x{00BA}   |     º      |
          |        \xBB         |   \xBB   :   \x{00BB}   |     »      |
          |        \xBC         |   \xBC   :   \x{00BC}   |     ¼      |
          |        \xBD         |   \xBD   :   \x{00BD}   |     ½      |
          |        \xBE         |   \xBE   :   \x{00BE}   |     ¾      |
          |        \xBF         |   \xBF   :   \x{00BF}   |     ¿      |
          •---------------------•----------•--------------•------------•
          |        \xC0         |   \xC0   :   \x{00C0}   |     À      |
          |        \xC1         |   \xC1   :   \x{00C1}   |     Á      |
          |        \xC2         |   \xC2   :   \x{00C2}   |     Â      |
          |        \xC3         |   \xC3   :   \x{00C3}   |     Ã      |
          |        \xC4         |   \xC4   :   \x{00C4}   |     Ä      |
          |        \xC5         |   \xC5   :   \x{00C5}   |     Å      |
          |        \xC6         |   \xC6   :   \x{00C6}   |     Æ      |
          |        \xC7         |   \xC7   :   \x{00C7}   |     Ç      |
          |        \xC8         |   \xC8   :   \x{00C8}   |     È      |
          |        \xC9         |   \xC9   :   \x{00C9}   |     É      |
          |        \xCA         |   \xCA   :   \x{00CA}   |     Ê      |
          |        \xCB         |   \xCB   :   \x{00CB}   |     Ë      |
          |        \xCC         |   \xCC   :   \x{00CC}   |     Ì      |
          |        \xCD         |   \xCD   :   \x{00CD}   |     Í      |
          |        \xCE         |   \xCE   :   \x{00CE}   |     Î      |
          |        \xCF         |   \xCF   :   \x{00CF}   |     Ï      |
          •---------------------•----------•--------------•------------•
          |        \xD0         |   \xD0   :   \x{00D0}   |     Ð      |
          |        \xD1         |   \xD1   :   \x{00D1}   |     Ñ      |
          |        \xD2         |   \xD2   :   \x{00D2}   |     Ò      |
          |        \xD3         |   \xD3   :   \x{00D3}   |     Ó      |
          |        \xD4         |   \xD4   :   \x{00D4}   |     Ô      |
          |        \xD5         |   \xD5   :   \x{00D5}   |     Õ      |
          |        \xD6         |   \xD6   :   \x{00D6}   |     Ö      |
          |        \xD7         |   \xD7   :   \x{00D7}   |     ×      |
          |        \xD8         |   \xD8   :   \x{00D8}   |     Ø      |
          |        \xD9         |   \xD9   :   \x{00D9}   |     Ù      |
          |        \xDA         |   \xDA   :   \x{00DA}   |     Ú      |
          |        \xDB         |   \xDB   :   \x{00DB}   |     Û      |
          |        \xDC         |   \xDC   :   \x{00DC}   |     Ü      |
          |        \xDD         |   \xDD   :   \x{00DD}   |     Ý      |
          |        \xDE         |   \xDE   :   \x{00DE}   |     Þ      |
          |        \xDF         |   \xDF   :   \x{00DF}   |     ß      |
          •---------------------•----------•--------------•------------•
          |        \xE0         |   \xE0   :   \x{00E0}   |     à      |
          |        \xE1         |   \xE1   :   \x{00E1}   |     á      |
          |        \xE2         |   \xE2   :   \x{00E2}   |     â      |
          |        \xE3         |   \xE3   :   \x{00E3}   |     ã      |
          |        \xE4         |   \xE4   :   \x{00E4}   |     ä      |
          |        \xE5         |   \xE5   :   \x{00E5}   |     å      |
          |        \xE6         |   \xE6   :   \x{00E6}   |     æ      |
          |        \xE7         |   \xE7   :   \x{00E7}   |     ç      |
          |        \xE8         |   \xE8   :   \x{00E8}   |     è      |
          |        \xE9         |   \xE9   :   \x{00E9}   |     é      |
          |        \xEA         |   \xEA   :   \x{00EA}   |     ê      |
          |        \xEB         |   \xEB   :   \x{00EB}   |     ë      |
          |        \xEC         |   \xEC   :   \x{00EC}   |     ì      |
          |        \xED         |   \xED   :   \x{00ED}   |     í      |
          |        \xEE         |   \xEE   :   \x{00EE}   |     î      |
          |        \xEF         |   \xEF   :   \x{00EF}   |     ï      |
          •---------------------•----------•--------------•------------•
          |        \xF0         |   \xF0   :   \x{00F0}   |     ð      |
          |        \xF1         |   \xF1   :   \x{00F1}   |     ñ      |
          |        \xF2         |   \xF2   :   \x{00F2}   |     ò      |
          |        \xF3         |   \xF3   :   \x{00F3}   |     ó      |
          |        \xF4         |   \xF4   :   \x{00F4}   |     ô      |
          |        \xF5         |   \xF5   :   \x{00F5}   |     õ      |
          |        \xF6         |   \xF6   :   \x{00F6}   |     ö      |
          |        \xF7         |   \xF7   :   \x{00F7}   |     ÷      |
          |        \xF8         |   \xF8   :   \x{00F8}   |     ø      |
          |        \xF9         |   \xF9   :   \x{00F9}   |     ù      |
          |        \xFA         |   \xFA   :   \x{00FA}   |     ú      |
          |        \xFB         |   \xFB   :   \x{00FB}   |     û      |
          |        \xFC         |   \xFC   :   \x{00FC}   |     ü      |
          |        \xFD         |   \xFD   :   \x{00FD}   |     ý      |
          |        \xFE         |   \xFE   :   \x{00FE}   |     þ      |
          |        \xFF         |   \xFF   :   \x{00FF}   |     ÿ      |
          •---------------------•----------•--------------•------------•

Best Regards,

guy038

Alan Kilborn

@guy038 said in Old WhatsApp conversations with legacy encoded emojis - find and replace?:

WhatUniChar.py does not work at all for any NON-ASCII character found in an ANSI encoded file

The shocking thing is is that you seem to expect that to actually work. :-)
Aside from historical points of interest (perhaps this is one of those), isn’t it time to let go of non-Unicode encodings?

PeterJones

@guy038 said in Old WhatsApp conversations with legacy encoded emojis - find and replace?:

I’m afraid that your excellent and improved Python script WhatUniChar.py does not work at all for any NON-ASCII character found in an ANSI encoded file

Of course it doesn’t. That’s why I named it WhatUniChar, where Uni is short for Unicode. If the characters are not encoded as Unicode codepoints, that script is the wrong tool for the job.

The ANSI character sets were useful in the 80s, when there was no single, agreed-upon international encoding, but since Unicode was developed in the 90s and popularized in the 00’s, there’s been no convincing reason for me to continue to use those ancient charsets. (And the fact that Notepad++ still has major, daily-use features – such as UDL and auto-completions – that are not Unicode-enabled is a huge embarrassment to an otherwise-excellent application.)

guy038

Hello, @peterjones, @alan-kilborn and All,

From, your two last posts, Alan and Peter, I asked myself : which is the distribution of all my text files, regarding their encoding ?

I considered, as a text file, all the files with the main following extensions, by importance level :

txt, .py, html, htm, xml, ini, msg, csv, log as well as few other files with rare extension

Now, using the iconv.exe utility to get all the NON-UTF8 files and, then, the xxd.exe software to omit the UFT-16 encoded files, I was able, little by little, to restrict my list to 360 files, about, for which I possibly could change the encoding from ANSI to UTF-8 !

Of course, opening all the files, one at a time, in N++, changing their encoding and saving them seemed rather tedious. Thus, I used a simple python script to achieve this task easily :

'''
NAME   : Move_to_UTF8_encoding.py

REMARK : The fonction 'npp_get_statusbar' is an idea of @alan-kilborn


This script :

      - Opens a file which contains a list of ABSOLUTE file-paths

      - Read, successively, the file-paths from that list

      - Open EACH file in N++

      - Perform the 'Convert to UTD-8' action on the CURRENT opened ANSI file

      - Save and close EACH file, one at a time


NOTES :

      - The file, containing the list of ABSOLUTE file-paths to OPEN, is an UTF-8 encoded file, with 'Windows' EOL

      - This list must NOT contain EMPTY or BLANK lines

      - But, any line beginning with the '#' character is simply IGNORED ( So begin any EMPTY line or COMMENT line with a '#' char ! )

      - The PATHS are designated by a SIMPLE character ANTI-SLASH ( Ex : D:\Dir_1\Dir_2\Name.txt ). NO need to DOUBLE the ANTISLASH ( \\ )

      - On the same way, NO need to SURROUND the file-paths, containing SPACE characters, with DOUBLE-QUOTES

      - This list may contain some ACCENTUATED characters
'''

from Npp import *

import time

import ctypes

from ctypes.wintypes import BOOL, HWND, WPARAM, LPARAM, UINT

console.show()

console.clear()

with open('D:\\Verif.txt') as file:

    for file_path in file:

        file_path = file_path.strip('\n')

        if file_path[0] == "#":
            continue

        notepad.open(file_path)

        # ----------------------------------------------------------------------------------------------------------------------------------------------------
        #  From @alan-kilborn, in post https://community.notepad-plus-plus.org/topic/21733/pythonscript-different-behavior-in-script-vs-in-immediate-mode/4
        # ----------------------------------------------------------------------------------------------------------------------------------------------------

        def npp_get_statusbar(statusbar_item_number):

            WNDENUMPROC = ctypes.WINFUNCTYPE(BOOL, HWND, LPARAM)
            FindWindowW = ctypes.windll.user32.FindWindowW
            FindWindowExW = ctypes.windll.user32.FindWindowExW
            SendMessageW = ctypes.windll.user32.SendMessageW
            LRESULT = LPARAM
            SendMessageW.restype = LRESULT
            SendMessageW.argtypes = [ HWND, UINT, WPARAM, LPARAM ]
            EnumChildWindows = ctypes.windll.user32.EnumChildWindows
            GetClassNameW = ctypes.windll.user32.GetClassNameW
            create_unicode_buffer = ctypes.create_unicode_buffer

            SBT_OWNERDRAW = 0x1000
            WM_USER = 0x400; SB_GETTEXTLENGTHW = WM_USER + 12; SB_GETTEXTW = WM_USER + 13

            npp_get_statusbar.STATUSBAR_HANDLE = None

            def get_result_from_statusbar(statusbar_item_number):
                assert statusbar_item_number <= 5
                retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTLENGTHW, statusbar_item_number, 0)
                length = retcode & 0xFFFF
                type = (retcode >> 16) & 0xFFFF
                assert (type != SBT_OWNERDRAW)
                text_buffer = create_unicode_buffer(length)
                retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTW, statusbar_item_number, ctypes.addressof(text_buffer))
                retval = '{}'.format(text_buffer[:length])
                return retval

            def EnumCallback(hwnd, lparam):
                curr_class = create_unicode_buffer(256)
                GetClassNameW(hwnd, curr_class, 256)
                if curr_class.value.lower() == "msctls_statusbar32":
                    npp_get_statusbar.STATUSBAR_HANDLE = hwnd
                    return False  # stop the enumeration
                return True  # continue the enumeration

            npp_hwnd = FindWindowW(u"Notepad++", None)
            EnumChildWindows(npp_hwnd, WNDENUMPROC(EnumCallback), 0)
            if npp_get_statusbar.STATUSBAR_HANDLE: return get_result_from_statusbar(statusbar_item_number)
            assert False

        St_bar = npp_get_statusbar(4)  # Zone 4 ( STATUSBARSECTION.UNICODETYPE )

        if St_bar == 'ANSI':  #  =>  Conversion to 'UTF-8', without BOM, RECOMMENDED !

            time.sleep(0.5)

            notepad.runMenuCommand("Encoding", "Convert to UTF-8")
            notepad.save()

            time.sleep(0.5)

        notepad.close()

REMARK :

As I was a bit anxious about the needed time to get the encoding change and the save action, for each file, I preferred to use timers to properly ensure the entire process but, may be, these timers are not necessary !

So, after the various modifications, I got a list of 11,578 files whose distribution, according to their encoding, is as follows :

                                UTF-8 BOM      :        208               |
                                UTF-16 LE BOM  :         39               |
                                UTF-16 BE BOM  :          4               |
                                UTF-8          :        540  ( 0 byte )   |   =>  10,737 with UNICODE encoding ( 92,7 % )
                                UTF-8          :      9,946               |
                                ANSI           :        841
                                                   ----------
                                TOTAL                11,578

You certainly note that there still are a lot of ANSI files, but most of them are lang or configuration files for which the change of the encoding is rather forbidden or, at least, not welcome !

Best Regards,

guy038