
    How to identify true "ANSI" files in order to convert them to "UTF-8" files ?

      guy038

      Hello, All,

      When I began to “follow” Notepad++, in 2012, I was using N++ ANSI versions. At that time, N++'s forums were located on SourceForge.Net and, of course, my first replies to users were stored as ANSI files.

      Recently, I was wondering how many files, among the 1,800 or so posts stored on my laptop, have a true ANSI encoding, in order to re-encode them as UTF-8 with the two commands :

      • Encoding > Convert to UTF-8

      • File > Save

      I succeeded in finding a way to achieve it ! You just need two DOS tools : iconv.exe and xxd.exe


      Note that using the xxd.exe program alone, which is a hexdump utility, would not be enough, because it can only identify, without error :

      • True UTF-8-BOM files with the first three bytes EFBBBF

      • True UTF-16 BE BOM with the first two bytes FEFF

      • True UTF-16 LE BOM with the first two bytes FFFE

      But it cannot show, at first sight, any difference between an ANSI file and a UTF-8 file without a BOM, unless you scan the files completely !


      Thus,

      • Download the binaries, dependencies and documentation archives, relative to the iconv.exe program, from :

      https://gnuwin32.sourceforge.net/packages/libiconv.htm

      • Move to the directory where you extracted the iconv.exe program and its 3 dependencies : libcharset1.dll, libiconv2.dll and libintl3.dll

      • Now, in this same directory, download the xxd archive from :

      https://sourceforge.net/projects/xxd-for-windows/files/xxd-1.11_win32(static).zip/download

      • Extract the xxd.exe file

      • Open a DOS window

      • Move to the folder containing, both, iconv.exe and xxd.exe

      • Run the command iconv -f UTF-8 -t UTF-8 <Full_Directory_Path>\*.* 1> NUL 2>> results.txt

      Of course, you must replace the <Full_Directory_Path> placeholder with the location of the folder containing all the files to scan for encoding !

      => With this command, you are certain that the OUTPUT file ( results.txt ) contains only files which are not presently encoded in UTF-8 ( so they can be true ANSI files or UTF-16 files )

      • Now, open the results.txt file in N++

      • Open the Replace dialog ( Ctrl + H )

      • SEARCH (?-si)^Permission denied\R|^.*iconv:\x20|: cannot convert$

      • REPLACE Leave EMPTY

      • Tick the Wrap around option

      • Select the Regular expression search mode

      • Click on the Replace All button

      => We keep the full pathnames, only

      • Now, rename this file as results.bat

      • Save the results.bat file

      • Then, still in the results.bat file :

      • Open the Replace dialog ( Ctrl + H )

      • SEARCH (?-s)^.+

      • REPLACE \( echo | set /p="$0 " & xxd.exe -l2 -u "$0" \) >> results.txt

      • Tick the Wrap around option

      • Select the Regular expression search mode

      • Click on the Replace All button

      • Add the command @echo OFF at the very beginning of the results.bat file

      • Re-save the results.bat file

      • Run the results.bat file, in the DOS window

      • In notepad++, open the new results.txt file

      • Open the Replace dialog ( Ctrl + H )

      • SEARCH (?-s)^.+(FFFE|FEFF).+\R|\x2000000000:.+

      • REPLACE Leave EMPTY

      • Tick the Wrap around option

      • Select the Regular expression search mode

      • Click on the Replace All button

      • Save the results.txt file

      => You should get, only, the pure ANSI files that need a further UTF-8 or UTF-8-BOM encoding !

      • Finally, delete the results.bat file

      Best Regards,

      guy038

        guy038

        Hello, All,

        In my previous post, I forgot some important points ! So, please refer to the corrected version below and disregard the previous one !


        When I began to “follow” Notepad++, in 2012, I was using N++ ANSI versions. At that time, N++'s forums were located on SourceForge.Net and, of course, my first replies to users were stored as ANSI files.

        Recently, I was wondering how many files, among the 1,800 or so posts stored on my laptop, have a true ANSI encoding, in order to re-encode them as UTF-8 with the two commands :

        • Encoding > Convert to UTF-8

        • File > Save

        I succeeded in finding a way to achieve it ! You just need two DOS tools : iconv.exe and xxd.exe


        Notes :

        • Using the xxd.exe program alone, which is a hexdump utility, is not enough to determine the correct encoding. In fact, it can only identify, without error :

          • True UTF-8-BOM files with the first three bytes EFBBBF

          • True UTF-16 BE BOM with the first two bytes FEFF

          • True UTF-16 LE BOM with the first two bytes FFFE

          • But it cannot show, at first sight, any difference between an ANSI file and a UTF-8 file without a BOM, unless you scan the files completely !

        • Remember also that, if all bytes of a file have a value below \x80, Notepad++ considers that it’s a pseudo UTF-8 file, by default !
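To make these notes concrete, here is a minimal Python sketch of the same classification logic. It is only an illustration and not part of the iconv/xxd procedure: the function name `sniff_encoding` and the returned labels are hypothetical. It checks the three BOMs first, then treats pure 7-bit content separately, and only then falls back to a complete scan of the bytes, which is the step that a two-byte hexdump cannot do.

```python
import codecs


def sniff_encoding(data: bytes) -> str:
    """Classify raw file bytes by BOM first, then by UTF-8 validity."""
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "UTF-8-BOM"
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "UTF-16 BE BOM"
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "UTF-16 LE BOM"
    if all(byte < 0x80 for byte in data):
        # Every byte is below \x80: identical in ANSI and in UTF-8.
        return "ASCII only"
    try:
        # Strict decoding scans the whole content, not just the first bytes.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "ANSI candidate"
    return "UTF-8"
```

For instance, `sniff_encoding(b"caf\xe9")` falls into the last category, because the lone byte \xE9 is not a valid UTF-8 sequence.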


        Thus,

        • Download the binaries, dependencies and documentation archives, relative to the iconv.exe program, from :

        https://gnuwin32.sourceforge.net/packages/libiconv.htm

        • Move to the directory where you extracted the iconv.exe program and its 3 dependencies : libcharset1.dll, libiconv2.dll and libintl3.dll

        • Now, in this same directory, download the xxd archive from :

        https://sourceforge.net/projects/xxd-for-windows/files/xxd-1.11_win32(static).zip/download

        • Extract the xxd.exe file

        • Open a DOS window

        • Run, first, the command chcp 1252 ( in order to get an ANSI output for this DOS window ! )

        • Move to the folder containing, both, iconv.exe and xxd.exe

        • Then, run the command iconv -f UTF-8 -t UTF-8 <Full_Directory_Path>\*.* 1> NUL 2>> results.txt

        Of course, you must replace the <Full_Directory_Path> placeholder with the location of the folder containing all the files to scan for encoding !

        => With this command, you are certain that the OUTPUT file ( results.txt ) contains only files which are not presently encoded in UTF-8 ( so, in N++, they can be true ANSI files, with some bytes between \x80 and \xFF or true UTF-16 files with a BOM )
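As a cross-check of the idea behind this iconv command, the same test can be sketched in Python: a file is reported when its bytes cannot be decoded as strict UTF-8. This is an assumption-laden equivalent, not the iconv behaviour itself; the function name `find_non_utf8_files` is hypothetical, and only the files located directly in the folder are examined.

```python
from pathlib import Path


def find_non_utf8_files(folder):
    """Return the files directly inside `folder` that fail strict UTF-8 decoding."""
    rejected = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue  # sub-folders are skipped
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8: a true ANSI file or a UTF-16 file
            rejected.append(path)
    return rejected
```

Files containing only bytes below \x80, as well as UTF-8 files with or without a BOM, decode without error and are therefore not listed.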

        • Now, open the results.txt file in N++

        • Open the Replace dialog ( Ctrl + H )

        • SEARCH (?-si)^Permission denied\R|^.*iconv:\x20|: cannot convert$

        • REPLACE Leave EMPTY

        • Tick the Wrap around option

        • Select the Regular expression search mode

        • Click on the Replace All button

        => We keep the full pathnames, only

        • Save the results.txt file

        • Open, again, the Replace dialog ( Ctrl + H )

        • SEARCH (?-s)^.+

        • REPLACE \( echo | set /p="$0 " & xxd.exe -l2 -u "$0" \) >> results.txt

        • Tick the Wrap around option

        • Select the Regular expression search mode

        • Click on the Replace All button

        • Now, add the line @echo OFF at the very beginning of the file

        • And rename this file as results.bat

        • If the current encoding of this batch file is UTF-8, run the Encoding > Convert to ANSI option ( IMPORTANT, as the DOS window is ANSI too )

        • Save this new results.bat file

        • Run the results.bat file, in the DOS window

        • Back to Notepad++, open the new results.txt file

        • Open the Replace dialog ( Ctrl + H )

        • SEARCH (?-s)^.+(FFFE|FEFF).+\R|\x2000000000:.+

        • REPLACE Leave EMPTY

        • Tick the Wrap around option

        • Select the Regular expression search mode

        • Click on the Replace All button

        => We get rid of the UTF-16 files which have a BOM equal to FEFF or FFFE
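For readers who prefer to test this clean-up outside Notepad++, here is a rough Python translation of the regex above. It is a sketch under assumptions: Python's `re` module has no `\R`, so the line break is written `\r?\n`, and the names `XXD_CLEANUP` and `keep_ansi_paths` are hypothetical.

```python
import re

# First alternative : whole lines whose dump shows a UTF-16 BOM (FFFE or FEFF).
# Second alternative: the " 00000000: ..." dump that follows each remaining path.
XXD_CLEANUP = re.compile(
    r"^.+(?:FFFE|FEFF).+(?:\r?\n|\Z)| 00000000:.+",
    re.MULTILINE,
)


def keep_ansi_paths(report: str) -> str:
    """Drop the UTF-16 lines and strip the hexdump part of the other lines."""
    return XXD_CLEANUP.sub("", report)
```

As with the original regex, a pathname that itself contains the text FFFE or FEFF would be removed too, so such names need a manual check.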

        • Save the results.txt file

        => Thus, you should get, only, the true ANSI files, containing some bytes between \x80 and \xFF, that need a further UTF-8 or UTF-8-BOM encoding !

        • Delete the results.bat file

        • Finally, close the DOS window
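Once the list of true ANSI files is known, the Encoding > Convert to UTF-8 and File > Save commands can also be approximated in a script. The sketch below assumes that "ANSI" means the Windows-1252 code page, as suggested by the chcp 1252 command; the function name `convert_ansi_to_utf8` and its parameters are hypothetical, and the file is overwritten in place, so work on copies first.

```python
from pathlib import Path


def convert_ansi_to_utf8(path, ansi_codec="cp1252", with_bom=False):
    """Re-encode one file in place from an ANSI code page to UTF-8."""
    file_path = Path(path)
    # Decoding raises UnicodeDecodeError for bytes undefined in the code page.
    text = file_path.read_bytes().decode(ansi_codec)
    # "utf-8-sig" writes the EF BB BF signature, i.e. the UTF-8-BOM flavour.
    target = "utf-8-sig" if with_bom else "utf-8"
    file_path.write_bytes(text.encode(target))
```

Reading and writing bytes, rather than text, leaves the line endings of the file untouched.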

        Best Regards,

        guy038
