
    Old WhatsApp conversations with legacy encoded emojis - find and replace?

    • J
      jakobeig

      Hello. I am trying to export and archive some old WhatsApp conversations from 2012-2013.
      However, some of these conversations use legacy-encoded emojis from iOS 4.0, which predate the Unicode 6.0 standard, and they do not render correctly in the exported WhatsApp text file.

      For example, one of them is U+E058, which is the disappointed face in the legacy encoding, whereas the Unicode disappointed face is U+1F61E. If I create a test WhatsApp conversation, put in a disappointed face and export the chat, the basic text file renders the disappointed face correctly (😞), as it uses the most recent Unicode.

      In Notepad++, might it be possible to display the hex code points of my old chats, and then perhaps do some kind of find and replace on U+E058 to make it U+1F61E? How would I go about doing something like this? I am approaching the limits of my understanding.

      Many thanks.

      • PeterJonesP
        PeterJones @jakobeig

        @jakobeig ,

        For such things, I use a script for the PythonScript plugin WhatUniChar.py, which I then assign a keyboard shortcut to (see our FAQ: install and run a script in PythonScript for how to use that plugin to run the downloaded script)

        If you want to replace it with a Unicode character that’s not in the BMP (i.e., U+10000 or higher), you have to use surrogate codes (as mentioned here in the User Manual).

        A site like fileformat.info can tell you what they are – for example, its entry for U+1F61E: look at the UTF-16 (0xD83D 0xDE1E) or “C/C++/Java source code” ("\uD83D\uDE1E") values. (Sorry for the ads there; I have my adblocker prevent its horrendous number of ads, so it doesn’t bother me, and I often forget to warn people about the ads when I post a link there; I remembered this time ;-).)

        Once you know that you want to convert U+E058 into the surrogate pair U+D83D U+DE1E, use Regular Expression search mode with:
        FIND = \x{E058}
        REPLACE = \x{D83D}\x{DE1E}
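
If you would rather compute the surrogate pair than look it up, the arithmetic is short. The following is a minimal Python sketch of the standard UTF-16 calculation; the function names `to_surrogate_pair` and `npp_replacement` are made up for this illustration and are not part of Notepad++ or of any script mentioned in this thread.

```python
def to_surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point must be in the range U+10000..U+10FFFF")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 | (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 | (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


def npp_replacement(code_point):
    """Format the pair in the \\x{....}\\x{....} form used in the REPLACE field above."""
    high, low = to_surrogate_pair(code_point)
    return "\\x{%04X}\\x{%04X}" % (high, low)


if __name__ == "__main__":
    print(npp_replacement(0x1F61E))      # \x{D83D}\x{DE1E}
```

The printed string can be pasted into the REPLACE field as-is.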


        • J
          jakobeig @PeterJones

          @PeterJones I did it - it worked! Thank you so much for going through the process in such detail. I knew it must be possible, and this whole process has taught me a great deal - not least an increased knowledge of Unicode characters!
          Now I will look into how to semi-automate this process…
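
One possible way to semi-automate this outside Notepad++ is a small standalone Python 3 script that applies a whole mapping table in one pass. This is only a sketch under assumptions: the names `LEGACY_TO_UNICODE`, `modernize_emojis` and `convert_file` are hypothetical, the export is assumed to be UTF-8, and the only mapping taken from this thread is U+E058 to U+1F61E; every other pair would have to be looked up and added by hand.

```python
# Legacy (pre-Unicode 6.0) private-use code point -> modern Unicode code point.
# Only the first entry comes from this thread; further pairs must be added manually.
LEGACY_TO_UNICODE = {
    0xE058: 0x1F61E,  # disappointed face
}


def modernize_emojis(text, table=LEGACY_TO_UNICODE):
    """Replace every legacy code point listed in `table` with its modern equivalent."""
    return text.translate(table)


def convert_file(src_path, dst_path):
    """Read a UTF-8 chat export, convert it, and write the result to a new file."""
    with open(src_path, encoding="utf-8") as src:
        converted = modernize_emojis(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(converted)


if __name__ == "__main__":
    sample = "Missed the train \ue058"
    print(modernize_emojis(sample))
```

Writing to a separate output file keeps the original export untouched, which seems prudent for an archive.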

          • guy038G
            guy038

            Hello, @jakobeig, @peterjones and All,

            Here are two links to easily get the surrogate values :

            https://russellcottrell.com/greek/utilities/SurrogatePairCalculator.htm

            You may enter any one of the following, from top to bottom :

            • The code point of the character

            • The surrogate pair of the character

            • The character itself

            https://onlinetools.com/unicode/convert-unicode-to-utf16

            • You enter the character, itself, on the left panel

            • You get the surrogate pair on the right panel

            And, from the same site, the link :

            https://onlinetools.com/unicode/convert-unicode-to-code-points

            returns the code point of the character


            Now, for me, the best way is to search the compart.com site for either the character, its code point, its UTF-16 encoding or its UTF-8 encoding :

            So, type one of the following in the Google search box :

            • site:compart.com 🚂

            • site:compart.com 1F682

            • site:compart.com 0xd83d 0xde82

            • site:compart.com 0xf0 0x9f 0x9a 0x82

            Then click, generally, on the first or second result !


            The good news is that I also created a macro which calculates, from within N++, the surrogate pair of any character outside the BMP, i.e. with a code point >= \x{10000} !!

            This macro changes any \x{#####} or \x{10####} string, within a stream selection, into its surrogate pair \x{####}\x{####} :

                    <Macro name="Surrogates Pairs in Selection" Ctrl="no" Alt="yes" Shift="yes" Key="83">
                        <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
                        <Action type="3" message="1601" wParam="0" lParam="0" sParam="(?-i)\\x\{(10|[[:xdigit:]])[[:xdigit:]]{4}" />
                        <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
                        <Action type="3" message="1602" wParam="0" lParam="0" sParam="$0\x1F" />
                        <Action type="3" message="1702" wParam="0" lParam="640" sParam="" />
                        <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
            
                        <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
                        <Action type="3" message="1601" wParam="0" lParam="0" sParam="(?i)(?:(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(A)|(B)|(C)|(D)|(E)|(F)|(10))(?=[[:xdigit:]]{4}\x1F\})|(?:(0)|(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(A)|(B)|(C)|(D)|(E)|(F))(?=[[:xdigit:]]{0,3}\x1F\})" />
                        <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
                        <Action type="3" message="1602" wParam="0" lParam="0" sParam="(?{1}0000)(?{2}0001)(?{3}0010)(?{4}0011)(?{5}0100)(?{6}0101)(?{7}0110)(?{8}0111)(?{9}1000)(?{10}1001)(?{11}1010)(?{12}1011)(?{13}1100)(?{14}1101)(?{15}1110)(?{16}1111)(?{17}0000)(?{18}0001)(?{19}0010)(?{20}0011)(?{21}0100)(?{22}0101)(?{23}0110)(?{24}0111)(?{25}1000)(?{26}1001)(?{27}1010)(?{28}1011)(?{29}1100)(?{30}1101)(?{31}1110)(?{32}1111)" />
                        <Action type="3" message="1702" wParam="0" lParam="640" sParam="" />
                        <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
            
                        <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
                        <Action type="3" message="1601" wParam="0" lParam="0" sParam="([01]{10})([01]{10})(?=\x1F)" />
                        <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
                        <Action type="3" message="1602" wParam="0" lParam="0" sParam="110110\1\x1F}\\x{110111\2" />
                        <Action type="3" message="1702" wParam="0" lParam="640" sParam="" />
                        <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
            
                        <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
                        <Action type="3" message="1601" wParam="0" lParam="0" sParam="(?:(0000)|(0001)|(0010)|(0011)|(0100)|(0101)|(0110)|(0111)|(1000)|(1001)|(1010)|(1011)|(1100)|(1101)|(1110)|(1111))(?=[[:xdigit:]]*\x1F\})|\x1F" />
                        <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
                        <Action type="3" message="1602" wParam="0" lParam="0" sParam="(?{1}0)(?{2}1)(?{3}2)(?{4}3)(?{5}4)(?{6}5)(?{7}6)(?{8}7)(?{9}8)(?{10}9)(?11A)(?12B)(?13C)(?14D)(?15E)(?16F)" />
                        <Action type="3" message="1702" wParam="0" lParam="640" sParam="" />
                        <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
                    </Macro>
            
            • Place it within the <Macros>..........</Macros> section of your active shortcuts.xml file

            • Save your shortcuts.xml file

            • Stop and restart Notepad++


            Thus, for example, from the selected text, below, located in any file :

            \x{10000}     #  The FIRST Unicode character of PLANE 1 ( SMP )
            ....
            ....
            ....
            
            \x{1F682}     #  The 'STEAM LOCOMOTIVE' character
            ....
            ....
            ....
            \x{10FFFF}    #  The LAST Unicode character of PLANE 16 ( SPUA-B)
            

            After execution of this macro, you would get the expected text :

            \x{D800}\x{DC00}     #  The FIRST Unicode character of PLANE 1 ( SMP )
            ....
            ....
            ....
            
            \x{D83D}\x{DE82}     #  The 'STEAM LOCOMOTIVE' character
            ....
            ....
            ....
            \x{DBFF}\x{DFFF}    #  The LAST Unicode character of PLANE 16 ( SPUA-B)
            

            Remark : I, personally, chose the Alt + Shift + S shortcut, but, of course, you may use any other one by changing the first line of the macro !
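
For comparison, the same text transformation can be sketched in a few lines of standalone Python 3. This is not a translation of the macro above, only an illustration of the result it produces; `PATTERN` and `replace_with_surrogates` are names invented for this example, and the pattern deliberately accepts only values from 10000 to 10FFFF.

```python
import re

# \x{HHHHH} with a non-zero leading digit, or \x{10HHHH}: code points U+10000..U+10FFFF
PATTERN = re.compile(r"\\x\{(10[0-9A-Fa-f]{4}|[1-9A-Fa-f][0-9A-Fa-f]{4})\}")


def replace_with_surrogates(text):
    """Rewrite each \\x{#####} notation in `text` as its \\x{####}\\x{####} surrogate pair."""
    def repl(match):
        offset = int(match.group(1), 16) - 0x10000
        high = 0xD800 | (offset >> 10)    # top 10 bits
        low = 0xDC00 | (offset & 0x3FF)   # bottom 10 bits
        return "\\x{%04X}\\x{%04X}" % (high, low)
    return PATTERN.sub(repl, text)


if __name__ == "__main__":
    for line in (r"\x{10000}", r"\x{1F682}", r"\x{10FFFF}"):
        print(line, "=>", replace_with_surrogates(line))
```

Notations with four hex digits or fewer, such as \x{E058}, are left unchanged because they already lie inside the BMP.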

            Best Regards,

            guy038

            • PeterJonesP
              PeterJones @PeterJones

              @PeterJones said in Old WhatsApp conversations with legacy encoded emojis - find and replace?:

              For such things, I use a script for the PythonScript plugin WhatUniChar.py , which I then assign a keyboard shortcut to (see our FAQ: install and run a script in PythonScript for how to use that plugin to run the downloaded script)

              … and I have another script that might not be helpful for your exact use case (because you have the actual character, not the text of the codepoint): pyscReplaceBackslashSequence.py, which I assign to Alt+\. It lets me type something like U+1F61E or \x{1F61E} or &#x1F61E; and then hit my Alt+\ shortcut to convert the codepoint / entity into the actual character (it handles the surrogate pair calculation inside the script). So, given the same “input” data that Guy showed in his example, instead of running Guy’s macro to show you the surrogate pair, you could use my script to convert that hex notation into the actual character.
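
To illustrate the idea of turning such notations into the actual character, here is a rough standalone Python 3 sketch. It is not the pyscReplaceBackslashSequence.py script and does not use the PythonScript API; `NOTATION` and `expand_code_points` are hypothetical names. In Python 3, `chr()` accepts code points above U+FFFF directly, so no explicit surrogate arithmetic is needed here.

```python
import re

# Three notations for a code point: U+1F61E, \x{1F61E} and &#x1F61E;
NOTATION = re.compile(
    r"U\+([0-9A-Fa-f]{4,6})"
    r"|\\x\{([0-9A-Fa-f]{1,6})\}"
    r"|&#x([0-9A-Fa-f]{1,6});"
)


def expand_code_points(text):
    """Replace each recognised notation in `text` with the character it names."""
    def repl(match):
        hex_digits = next(g for g in match.groups() if g is not None)
        value = int(hex_digits, 16)
        # Leave out-of-range values and lone surrogates untouched
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            return match.group(0)
        return chr(value)
    return NOTATION.sub(repl, text)


if __name__ == "__main__":
    print(expand_code_points(r"U+1F61E \x{1F682} &#x20AC;"))
```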

              Hmm, and I’ve just added a “TODO” for me to update my WhatUniChar script to show surrogate pairs for those not in the BMP… I’ll post an update here once I’ve got that script updated.

              UPDATE: WhatUniChar.py has been updated to show surrogate pairs (it of course won’t help with your early-iOS-specific codepoints)

              • guy038G
                guy038

                Hello @peterjones and All,

                Peter, I’m afraid that your excellent and improved Python script WhatUniChar.py does not work at all for any NON-ASCII character found in an ANSI encoded file :-(

                Luckily, everything is all right if we deal with Unicode encoded files ( UTF-8, UTF-8-BOM, UTF-16 BE and UTF-16 LE )


                I suppose that it will not be easy to find an acceptable solution, because :

                • The well-known Windows-125x ANSI encodings contain some NON-defined characters

                • The characters between \x80 and \x9F, in an ANSI file, usually match Unicode characters with a code point over U+00FF

                • The characters between U+00A0 and U+00FF can be searched with a single syntax in an ANSI file and with two syntaxes in a Unicode file !

                Refer, for example, to the list of the Win-1252 characters, over U+007F, below :

                          •---------------------•-------------------------•------------•
                          |    SEARCH in an     |       SEARCH in a       | SEARCHED   |
                          |  ANSI encoded file  |  UNICODE encoded file   | character  |
                          •---------------------•----------•--------------•------------•
                          |        \x80         |          :   \x{20AC}   |     €      |
                          |        \x81         |   \x81   :   \x{0081}   |         |
                          |        \x82         |          :   \x{201A}   |     ‚      |
                          |        \x83         |          :   \x{0192}   |     ƒ      |
                          |        \x84         |          :   \x{201E}   |     „      |
                          |        \x85         |          :   \x{2026}   |     …      |
                          |        \x86         |          :   \x{2020}   |     †      |
                          |        \x87         |          :   \x{2021}   |     ‡      |
                          |        \x88         |          :   \x{02C6}   |     ˆ      |
                          |        \x89         |          :   \x{2030}   |     ‰      |
                          |        \x8A         |          :   \x{0160}   |     Š      |
                          |        \x8B         |          :   \x{2039}   |     ‹      |
                          |        \x8C         |          :   \x{0152}   |     Œ      |
                          |        \x8D         |   \x8D   :   \x{008D}   |          |
                          |        \x8E         |          :   \x{017D}   |     Ž      |
                          |        \x8F         |   \x8F   :   \x{008F}   |            |
                          •---------------------•----------•--------------•------------•
                          |        \x90         |   \x90   :   \x{0090}   |         |
                          |        \x91         |          :   \x{2018}   |     ‘      |
                          |        \x92         |          :   \x{2019}   |     ’      |
                          |        \x93         |          :   \x{201C}   |     “      |
                          |        \x94         |          :   \x{201D}   |     ”      |
                          |        \x95         |          :   \x{2022}   |     •      |
                          |        \x96         |          :   \x{2013}   |     –      |
                          |        \x97         |          :   \x{2014}   |     —      |
                          |        \x98         |          :   \x{02DC}   |     ˜      |
                          |        \x99         |          :   \x{2122}   |     ™      |
                          |        \x9A         |          :   \x{0161}   |     š      |
                          |        \x9B         |          :   \x{203A}   |     ›      |
                          |        \x9C         |          :   \x{0153}   |     œ      |
                          |        \x9D         |   \x9D   :   \x{009D}   |         |
                          |        \x9E         |          :   \x{017E}   |     ž      |
                          |        \x9F         |          :   \x{0178}   |     Ÿ      |
                          •---------------------•----------•--------------•------------•
                          |        \xA0         |   \xA0   :   \x{00A0}   |            |
                          |        \xA1         |   \xA1   :   \x{00A1}   |     ¡      |
                          |        \xA2         |   \xA2   :   \x{00A2}   |     ¢      |
                          |        \xA3         |   \xA3   :   \x{00A3}   |     £      |
                          |        \xA4         |   \xA4   :   \x{00A4}   |     ¤      |
                          |        \xA5         |   \xA5   :   \x{00A5}   |     ¥      |
                          |        \xA6         |   \xA6   :   \x{00A6}   |     ¦      |
                          |        \xA7         |   \xA7   :   \x{00A7}   |     §      |
                          |        \xA8         |   \xA8   :   \x{00A8}   |     ¨      |
                          |        \xA9         |   \xA9   :   \x{00A9}   |     ©      |
                          |        \xAA         |   \xAA   :   \x{00AA}   |     ª      |
                          |        \xAB         |   \xAB   :   \x{00AB}   |     «      |
                          |        \xAC         |   \xAC   :   \x{00AC}   |     ¬      |
                          |        \xAD         |   \xAD   :   \x{00AD}   |     ­      |
                          |        \xAE         |   \xAE   :   \x{00AE}   |     ®      |
                          |        \xAF         |   \xAF   :   \x{00AF}   |     ¯      |
                          •---------------------•----------•--------------•------------•
                          |        \xB0         |   \xB0   :   \x{00B0}   |     °      |
                          |        \xB1         |   \xB1   :   \x{00B1}   |     ±      |
                          |        \xB2         |   \xB2   :   \x{00B2}   |     ²      |
                          |        \xB3         |   \xB3   :   \x{00B3}   |     ³      |
                          |        \xB4         |   \xB4   :   \x{00B4}   |     ´      |
                          |        \xB5         |   \xB5   :   \x{00B5}   |     µ      |
                          |        \xB6         |   \xB6   :   \x{00B6}   |     ¶      |
                          |        \xB7         |   \xB7   :   \x{00B7}   |     ·      |
                          |        \xB8         |   \xB8   :   \x{00B8}   |     ¸      |
                          |        \xB9         |   \xB9   :   \x{00B9}   |     ¹      |
                          |        \xBA         |   \xBA   :   \x{00BA}   |     º      |
                          |        \xBB         |   \xBB   :   \x{00BB}   |     »      |
                          |        \xBC         |   \xBC   :   \x{00BC}   |     ¼      |
                          |        \xBD         |   \xBD   :   \x{00BD}   |     ½      |
                          |        \xBE         |   \xBE   :   \x{00BE}   |     ¾      |
                          |        \xBF         |   \xBF   :   \x{00BF}   |     ¿      |
                          •---------------------•----------•--------------•------------•
                          |        \xC0         |   \xC0   :   \x{00C0}   |     À      |
                          |        \xC1         |   \xC1   :   \x{00C1}   |     Á      |
                          |        \xC2         |   \xC2   :   \x{00C2}   |     Â      |
                          |        \xC3         |   \xC3   :   \x{00C3}   |     Ã      |
                          |        \xC4         |   \xC4   :   \x{00C4}   |     Ä      |
                          |        \xC5         |   \xC5   :   \x{00C5}   |     Å      |
                          |        \xC6         |   \xC6   :   \x{00C6}   |     Æ      |
                          |        \xC7         |   \xC7   :   \x{00C7}   |     Ç      |
                          |        \xC8         |   \xC8   :   \x{00C8}   |     È      |
                          |        \xC9         |   \xC9   :   \x{00C9}   |     É      |
                          |        \xCA         |   \xCA   :   \x{00CA}   |     Ê      |
                          |        \xCB         |   \xCB   :   \x{00CB}   |     Ë      |
                          |        \xCC         |   \xCC   :   \x{00CC}   |     Ì      |
                          |        \xCD         |   \xCD   :   \x{00CD}   |     Í      |
                          |        \xCE         |   \xCE   :   \x{00CE}   |     Î      |
                          |        \xCF         |   \xCF   :   \x{00CF}   |     Ï      |
                          •---------------------•----------•--------------•------------•
                          |        \xD0         |   \xD0   :   \x{00D0}   |     Ð      |
                          |        \xD1         |   \xD1   :   \x{00D1}   |     Ñ      |
                          |        \xD2         |   \xD2   :   \x{00D2}   |     Ò      |
                          |        \xD3         |   \xD3   :   \x{00D3}   |     Ó      |
                          |        \xD4         |   \xD4   :   \x{00D4}   |     Ô      |
                          |        \xD5         |   \xD5   :   \x{00D5}   |     Õ      |
                          |        \xD6         |   \xD6   :   \x{00D6}   |     Ö      |
                          |        \xD7         |   \xD7   :   \x{00D7}   |     ×      |
                          |        \xD8         |   \xD8   :   \x{00D8}   |     Ø      |
                          |        \xD9         |   \xD9   :   \x{00D9}   |     Ù      |
                          |        \xDA         |   \xDA   :   \x{00DA}   |     Ú      |
                          |        \xDB         |   \xDB   :   \x{00DB}   |     Û      |
                          |        \xDC         |   \xDC   :   \x{00DC}   |     Ü      |
                          |        \xDD         |   \xDD   :   \x{00DD}   |     Ý      |
                          |        \xDE         |   \xDE   :   \x{00DE}   |     Þ      |
                          |        \xDF         |   \xDF   :   \x{00DF}   |     ß      |
                          •---------------------•----------•--------------•------------•
                          |        \xE0         |   \xE0   :   \x{00E0}   |     à      |
                          |        \xE1         |   \xE1   :   \x{00E1}   |     á      |
                          |        \xE2         |   \xE2   :   \x{00E2}   |     â      |
                          |        \xE3         |   \xE3   :   \x{00E3}   |     ã      |
                          |        \xE4         |   \xE4   :   \x{00E4}   |     ä      |
                          |        \xE5         |   \xE5   :   \x{00E5}   |     å      |
                          |        \xE6         |   \xE6   :   \x{00E6}   |     æ      |
                          |        \xE7         |   \xE7   :   \x{00E7}   |     ç      |
                          |        \xE8         |   \xE8   :   \x{00E8}   |     è      |
                          |        \xE9         |   \xE9   :   \x{00E9}   |     é      |
                          |        \xEA         |   \xEA   :   \x{00EA}   |     ê      |
                          |        \xEB         |   \xEB   :   \x{00EB}   |     ë      |
                          |        \xEC         |   \xEC   :   \x{00EC}   |     ì      |
                          |        \xED         |   \xED   :   \x{00ED}   |     í      |
                          |        \xEE         |   \xEE   :   \x{00EE}   |     î      |
                          |        \xEF         |   \xEF   :   \x{00EF}   |     ï      |
                          •---------------------•----------•--------------•------------•
                          |        \xF0         |   \xF0   :   \x{00F0}   |     ð      |
                          |        \xF1         |   \xF1   :   \x{00F1}   |     ñ      |
                          |        \xF2         |   \xF2   :   \x{00F2}   |     ò      |
                          |        \xF3         |   \xF3   :   \x{00F3}   |     ó      |
                          |        \xF4         |   \xF4   :   \x{00F4}   |     ô      |
                          |        \xF5         |   \xF5   :   \x{00F5}   |     õ      |
                          |        \xF6         |   \xF6   :   \x{00F6}   |     ö      |
                          |        \xF7         |   \xF7   :   \x{00F7}   |     ÷      |
                          |        \xF8         |   \xF8   :   \x{00F8}   |     ø      |
                          |        \xF9         |   \xF9   :   \x{00F9}   |     ù      |
                          |        \xFA         |   \xFA   :   \x{00FA}   |     ú      |
                          |        \xFB         |   \xFB   :   \x{00FB}   |     û      |
                          |        \xFC         |   \xFC   :   \x{00FC}   |     ü      |
                          |        \xFD         |   \xFD   :   \x{00FD}   |     ý      |
                          |        \xFE         |   \xFE   :   \x{00FE}   |     þ      |
                          |        \xFF         |   \xFF   :   \x{00FF}   |     ÿ      |
                          •---------------------•----------•--------------•------------•
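
The pattern in this table can be cross-checked with Python 3's built-in `cp1252` codec. The sketch below is only an illustration: `cp1252_table` is a name made up here, and Python's codec treats the five bytes \x81, \x8D, \x8F, \x90 and \x9D as undefined, whereas the table above lists them as searchable with \x{0081}, \x{008D} and so on in a Unicode file.

```python
def cp1252_table():
    """Map each byte 0x80..0xFF to its Unicode code point, or None if undefined."""
    table = {}
    for byte in range(0x80, 0x100):
        try:
            table[byte] = ord(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            table[byte] = None          # no character assigned by Python's codec
    return table


table = cp1252_table()
undefined = sorted(b for b, cp in table.items() if cp is None)
remapped = sorted(b for b, cp in table.items() if cp is not None and cp != b)

if __name__ == "__main__":
    print("undefined bytes      :", [hex(b) for b in undefined])
    print("remapped bytes       :", len(remapped))
    print("\\x80 decodes to U+%04X" % table[0x80])
```

Every byte from \xA0 to \xFF decodes to the code point of the same value, which is why those rows of the table show the same number in all three search syntaxes.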
                

                Best Regards,

                guy038

                • Alan KilbornA
                  Alan Kilborn @guy038

                  @guy038 said in Old WhatsApp conversations with legacy encoded emojis - find and replace?:

                  WhatUniChar.py does not work at all for any NON-ASCII character found in an ANSI encoded file

                  The shocking thing is that you seem to expect that to actually work. :-)
                  Aside from historical points of interest (perhaps this is one of those), isn’t it time to let go of non-Unicode encodings?

                  • PeterJonesP
                    PeterJones @guy038

                    @guy038 said in Old WhatsApp conversations with legacy encoded emojis - find and replace?:

                    I’m afraid that your excellent and improved Python script WhatUniChar.py does not work at all for any NON-ASCII character found in an ANSI encoded file

                    Of course it doesn’t. That’s why I named it WhatUniChar, where Uni is short for Unicode. If the characters are not encoded as Unicode codepoints, that script is the wrong tool for the job.

                    The ANSI character sets were useful in the 80s, when there was no single, agreed-upon international encoding, but since Unicode was developed in the 90s and popularized in the 00’s, there’s been no convincing reason for me to continue to use those ancient charsets. (And the fact that Notepad++ still has major, daily-use features – such as UDL and auto-completions – that are not Unicode-enabled is a huge embarrassment to an otherwise-excellent application.)

                    • guy038G
                      guy038

                      Hello, @peterjones, @alan-kilborn and All,

                      After your last two posts, Alan and Peter, I asked myself : what is the distribution of all my text files, by encoding ?

                      I considered, as a text file, any file with one of the following main extensions, in decreasing order of importance :

                      txt, py, html, htm, xml, ini, msg, csv, log, as well as a few other files with rare extensions


                      Now, using the iconv.exe utility to find all the NON-UTF-8 files and, then, the xxd.exe utility to exclude the UTF-16 encoded files, I was able, little by little, to narrow my list down to about 360 files whose encoding I could possibly change from ANSI to UTF-8 !

                      Of course, opening all the files, one at a time, in N++, changing their encoding and saving them seemed rather tedious. Thus, I used a simple Python script to achieve this task easily :
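
A comparable filtering step can be sketched in standalone Python 3, without iconv.exe or xxd.exe. This is a hedged illustration, not the procedure actually used above: `classify_encoding` and `classify_file` are hypothetical names, UTF-16 is recognised only through its BOM, and a file that is not valid UTF-8 is merely a candidate for being ANSI.

```python
import codecs


def classify_encoding(data):
    """Give a rough encoding label for the raw bytes of a text file."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8-BOM"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16"
    if all(byte < 0x80 for byte in data):
        return "ASCII"                      # identical in ANSI and in UTF-8
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "non-UTF-8 (possibly ANSI)"  # candidate for conversion
    return "UTF-8"


def classify_file(path):
    """Read a file in binary mode and classify its contents."""
    with open(path, "rb") as handle:
        return classify_encoding(handle.read())


if __name__ == "__main__":
    print(classify_encoding("café".encode("cp1252")))
    print(classify_encoding("café".encode("utf-8")))
```

Pure ASCII files are reported separately because converting them changes nothing on disk.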

                      r'''
                      NAME   : Move_to_UTF8_encoding.py
                      
                      REMARK : The function 'npp_get_statusbar' is an idea of @alan-kilborn
                      
                      
                      This script :
                      
                            - Opens a file which contains a list of ABSOLUTE file-paths

                            - Reads, successively, the file-paths from that list

                            - Opens EACH file in N++

                            - Performs the 'Convert to UTF-8' action on the CURRENT opened ANSI file

                            - Saves and closes EACH file, one at a time
                      
                      
                      NOTES :
                      
                            - The file, containing the list of ABSOLUTE file-paths to OPEN, is an UTF-8 encoded file, with 'Windows' EOL
                      
                            - This list must NOT contain EMPTY or BLANK lines
                      
                            - But, any line beginning with the '#' character is simply IGNORED ( So begin any EMPTY line or COMMENT line with a '#' char ! )
                      
                            - The PATHS are written with a SINGLE BACKSLASH character ( Ex : D:\Dir_1\Dir_2\Name.txt ). NO need to DOUBLE the BACKSLASH ( \\ )

                            - In the same way, NO need to SURROUND the file-paths, containing SPACE characters, with DOUBLE-QUOTES

                            - This list may contain some ACCENTED characters
                      '''
                      
                      from Npp import *
                      
                      import time
                      
                      import ctypes
                      
                      from ctypes.wintypes import BOOL, HWND, WPARAM, LPARAM, UINT
                      
                      console.show()
                      
                      console.clear()
                      
                      with open('D:\\Verif.txt') as file:
                      
                          for file_path in file:
                      
                              file_path = file_path.rstrip('\r\n')

                              # Skip comment lines and, defensively, any empty line
                              if not file_path or file_path.startswith('#'):
                                  continue
                      
                              notepad.open(file_path)
                      
                              # ----------------------------------------------------------------------------------------------------------------------------------------------------
                              #  From @alan-kilborn, in post https://community.notepad-plus-plus.org/topic/21733/pythonscript-different-behavior-in-script-vs-in-immediate-mode/4
                              # ----------------------------------------------------------------------------------------------------------------------------------------------------
                      
                              def npp_get_statusbar(statusbar_item_number):
                      
                                  WNDENUMPROC = ctypes.WINFUNCTYPE(BOOL, HWND, LPARAM)
                                  FindWindowW = ctypes.windll.user32.FindWindowW
                                  FindWindowExW = ctypes.windll.user32.FindWindowExW
                                  SendMessageW = ctypes.windll.user32.SendMessageW
                                  LRESULT = LPARAM
                                  SendMessageW.restype = LRESULT
                                  SendMessageW.argtypes = [ HWND, UINT, WPARAM, LPARAM ]
                                  EnumChildWindows = ctypes.windll.user32.EnumChildWindows
                                  GetClassNameW = ctypes.windll.user32.GetClassNameW
                                  create_unicode_buffer = ctypes.create_unicode_buffer
                      
                                  SBT_OWNERDRAW = 0x1000
                                  WM_USER = 0x400; SB_GETTEXTLENGTHW = WM_USER + 12; SB_GETTEXTW = WM_USER + 13
                      
                                  npp_get_statusbar.STATUSBAR_HANDLE = None
                      
                                  def get_result_from_statusbar(statusbar_item_number):
                                      assert statusbar_item_number <= 5
                                      retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTLENGTHW, statusbar_item_number, 0)
                                      length = retcode & 0xFFFF
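                                      item_type = (retcode >> 16) & 0xFFFF  # renamed from `type` to avoid shadowing the builtin
                                      assert (item_type != SBT_OWNERDRAW)
                                      text_buffer = create_unicode_buffer(length + 1)  # +1 for the terminating null written by SB_GETTEXTW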
                                      retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTW, statusbar_item_number, ctypes.addressof(text_buffer))
                                      retval = '{}'.format(text_buffer[:length])
                                      return retval
                      
                                  def EnumCallback(hwnd, lparam):
                                      curr_class = create_unicode_buffer(256)
                                      GetClassNameW(hwnd, curr_class, 256)
                                      if curr_class.value.lower() == "msctls_statusbar32":
                                          npp_get_statusbar.STATUSBAR_HANDLE = hwnd
                                          return False  # stop the enumeration
                                      return True  # continue the enumeration
                      
                                  npp_hwnd = FindWindowW(u"Notepad++", None)
                                  EnumChildWindows(npp_hwnd, WNDENUMPROC(EnumCallback), 0)
                                  if npp_get_statusbar.STATUSBAR_HANDLE: return get_result_from_statusbar(statusbar_item_number)
                                  assert False
                      
                              St_bar = npp_get_statusbar(4)  # Zone 4 ( STATUSBARSECTION.UNICODETYPE )
                      
                              if St_bar == 'ANSI':  #  =>  Conversion to 'UTF-8', without BOM, RECOMMENDED !
                      
                                  time.sleep(0.5)
                      
                                  notepad.runMenuCommand("Encoding", "Convert to UTF-8")
                                  notepad.save()
                      
                                  time.sleep(0.5)
                      
                              notepad.close()
                      

                      REMARK :

                      • Because I was unsure how long the encoding conversion and the save would take for each file, I added timers to make sure each step completes before the next one starts. These timers may well be unnecessary!
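
For comparison, the same ANSI to UTF-8 (without BOM) conversion can be done on raw bytes outside Notepad++, where nothing needs to be timed. This is only a minimal sketch and not part of the script above: the function name `ansi_bytes_to_utf8` is hypothetical, and it assumes that "ANSI" means Windows-1252, which really depends on the system code page.

```python
def ansi_bytes_to_utf8(raw, ansi_codepage="cp1252"):
    """Re-encode ANSI bytes as UTF-8 without a BOM.

    ASSUMPTION: the input uses Windows-1252; pass another code page
    name if the files were written on a differently configured system.
    """
    # Decoding raises UnicodeDecodeError for the few byte values that
    # cp1252 leaves undefined (for example 0x81), instead of guessing.
    text = raw.decode(ansi_codepage)
    # The "utf-8" codec never writes a BOM (that would be "utf-8-sig").
    return text.encode("utf-8")
```

For example, `ansi_bytes_to_utf8(b"caf\xe9")` turns the single byte of `é` into its two-byte UTF-8 form.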

                      So, after these modifications, I ended up with 11,578 files, distributed by encoding as follows:

                                                      UTF-8 BOM      :        208               |
                                                      UTF-16 LE BOM  :         39               |
                                                      UTF-16 BE BOM  :          4               |
                                                      UTF-8          :        540  ( 0 byte )   |   =>  10,737 with UNICODE encoding ( 92.7 % )
                                                      UTF-8          :      9,946               |
                                                      ANSI           :        841
                                                                         ----------
                                                      TOTAL                11,578
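
A tally like the one above could also be approximated from the raw bytes of each file. The sketch below is an assumption on my side, not how the counts above were obtained: the names `classify_encoding` and `tally` are hypothetical, and the rule "no BOM and not valid UTF-8 means ANSI" is only a heuristic. In particular, pure-ASCII files are counted as UTF-8 here, whereas Notepad++ may report them differently depending on its settings.

```python
import codecs
from collections import Counter

# BOM signatures to look for at the start of the data.
_BOMS = [
    (codecs.BOM_UTF8, "UTF-8 BOM"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE BOM"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE BOM"),
]


def classify_encoding(raw):
    """Return a label for the likely encoding of the bytes in `raw`."""
    if not raw:
        return "UTF-8 (0 byte)"  # empty files get their own row
    for bom, label in _BOMS:
        if raw.startswith(bom):
            return label
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "ANSI"  # heuristic: not valid UTF-8, so assume a code page
    return "UTF-8"


def tally(blobs):
    """Count how many byte strings fall under each encoding label."""
    return Counter(classify_encoding(blob) for blob in blobs)
```

To use it on real files, read each one in binary mode, for example `open(path, "rb").read()`, and pass the resulting byte strings to `tally`.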
                      

                      You will notice that many ANSI files remain, but most of them are lang or configuration files whose encoding must not, or at least should not, be changed.

                      Best Regards,

                      guy038
