
    How to identify true "ANSI" files in order to convert them to "UTF-8" files ?

    • guy038

      Hello, All,

      When I began to “follow” Notepad++, in 2012, I was using N++ ANSI versions. At this time, N++'s forums were located on SourceForge.Net and, of course, my first replies to users were stored as ANSI files.

      Recently, I was wondering how many files, among my roughly 1,800 posts stored on my laptop, have a true ANSI encoding, in order to re-encode them as UTF-8 with the two commands :

      • Encoding > Convert to UTF-8

      • File > Save

      I succeeded in finding a way to achieve it ! You just need two DOS tools : iconv.exe and xxd.exe


      Note that using the xxd.exe program alone, which is a hexdump utility, is not enough. It can identify, without error :

      • True UTF-8-BOM files with the first three bytes EFBBBF

      • True UTF-16 BE BOM with the first two bytes FEFF

      • True UTF-16 LE BOM with the first two bytes FFFE

      But it cannot show, at first sight, any difference between an ANSI file and a UTF-8 file without a BOM, unless you scan the files completely !


      Thus,

      • Download the binaries, dependencies and documentation archives, relative to the iconv.exe program, from :

      https://gnuwin32.sourceforge.net/packages/libiconv.htm

      • Move to the directory where you extracted the iconv.exe program and its 3 dependencies : libcharset1.dll, libiconv2.dll and libintl3.dll

      • Now, in this same directory, download the xxd archive from :

      https://sourceforge.net/projects/xxd-for-windows/files/xxd-1.11_win32(static).zip/download

      • Extract the xxd.exe file

      • Open a DOS window

      • Move to the folder containing, both, iconv.exe and xxd.exe

      • Run the command iconv -f UTF-8 -t UTF-8 <Full_Directory_Path>\*.* 1> NUL 2>> results.txt

      Of course, you must replace the <Full_Directory_Path> placeholder with the location of the folder containing all the files to scan for encoding !

      => With this command, you are certain that the OUTPUT file ( results.txt ) contains only files which are not presently encoded in UTF-8 ( so they can be true ANSI files or UTF-16 files )

      • Now, open the results.txt file in N++

      • Open the Replace dialog ( Ctrl + H )

      • SEARCH (?-si)^Permission denied\R|^.*iconv:\x20|: cannot convert$

      • REPLACE Leave EMPTY

      • Tick the Wrap around option

      • Select the Regular expression search mode

      • Click on the Replace All button

      => We keep the full pathnames, only

      • Now, rename this file as results.bat

      • Save the results.bat file

      • Then, still in the results.bat file :

      • Open the Replace dialog ( Ctrl + H )

      • SEARCH (?-s)^.+

      • REPLACE \( echo | set /p="$0 " & xxd.exe -l2 -u "$0" \) >> results.txt

      • Tick the Wrap around option

      • Select the Regular expression search mode

      • Click on the Replace All button

      • Add the command @echo OFF at the very beginning of the results.bat file

      • Re-save the results.bat file

      • Run the results.bat file, in the DOS window

      • In notepad++, open the new results.txt file

      • Open the Replace dialog ( Ctrl + H )

      • SEARCH (?-s)^.+(FFFE|FEFF).+\R|\x2000000000:.+

      • REPLACE Leave EMPTY

      • Tick the Wrap around option

      • Select the Regular expression search mode

      • Click on the Replace All button

      • Save the results.txt file

      => You should get, only, the pure ANSI files that need a further UTF-8 or UTF-8-BOM encoding !

      • Finally, delete the results.bat file

      Best Regards,

      guy038

      • guy038

        Hello, All,

        In my last post, I forgot some important points ! So, refer to the corrected version below and forget the previous one !


        When I began to “follow” Notepad++, in 2012, I was using N++ ANSI versions. At this time, N++'s forums were located on SourceForge.Net and, of course, my first replies to users were stored as ANSI files.

        Recently, I was wondering how many files, among my roughly 1,800 posts stored on my laptop, have a true ANSI encoding, in order to re-encode them as UTF-8 with the two commands :

        • Encoding > Convert to UTF-8

        • File > Save

        I succeeded in finding a way to achieve it ! You just need two DOS tools : iconv.exe and xxd.exe


        Notes :

        • Using the xxd.exe program alone, which is a hexdump utility, is not enough to determine the correct encoding. In fact, it can identify, without error :

          • True UTF-8-BOM files with the first three bytes EFBBBF

          • True UTF-16 BE BOM with the first two bytes FEFF

          • True UTF-16 LE BOM with the first two bytes FFFE

          • But it cannot show, at first sight, any difference between an ANSI file and a UTF-8 file without a BOM, unless you scan the files completely !

        • Remember also that, if all bytes of a file have a value below \x80, Notepad++ considers that it’s a pseudo UTF-8 file, by default !
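
As a side illustration of the notes above (not part of the procedure itself), here is a minimal Python sketch of the BOM check. The function name `detect_bom` and the returned labels are assumptions chosen for this example. It shows that the first bytes reveal a BOM, but give no answer for an ANSI file or a UTF-8 file without a BOM.

```python
import codecs

# BOM signatures named in the notes above, as raw bytes.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8-BOM"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16 BE BOM"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16 LE BOM"),  # FF FE
]


def detect_bom(head: bytes):
    """Return a label if `head` starts with a known BOM, else None.

    `detect_bom` is a hypothetical helper, not a Notepad++ or xxd feature.
    """
    for signature, label in BOMS:
        if head.startswith(signature):
            return label
    return None


print(detect_bom(b"\xef\xbb\xbfabc"))         # UTF-8-BOM
print(detect_bom(b"\xff\xfea\x00"))           # UTF-16 LE BOM
print(detect_bom("café".encode("cp1252")))    # None (ANSI bytes)
print(detect_bom("café".encode("utf-8")))     # None (UTF-8 without BOM)
```

The last two calls both return None, which is exactly why a second tool is needed to tell ANSI from BOM-less UTF-8.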


        Thus,

        • Download the binaries, dependencies and documentation archives, relative to the iconv.exe program, from :

        https://gnuwin32.sourceforge.net/packages/libiconv.htm

        • Move to the directory where you extracted the iconv.exe program and its 3 dependencies : libcharset1.dll, libiconv2.dll and libintl3.dll

        • Now, in this same directory, download the xxd archive from :

        https://sourceforge.net/projects/xxd-for-windows/files/xxd-1.11_win32(static).zip/download

        • Extract the xxd.exe file

        • Open a DOS window

        • Run, first, the command chcp 1252 ( in order to get an ANSI output for this DOS window ! )

        • Move to the folder containing, both, iconv.exe and xxd.exe

        • Then, run the command iconv -f UTF-8 -t UTF-8 <Full_Directory_Path>\*.* 1> NUL 2>> results.txt

        Of course, you must replace the <Full_Directory_Path> placeholder with the location of the folder containing all the files to scan for encoding !

        => With this command, you are certain that the OUTPUT file ( results.txt ) contains only files which are not presently encoded in UTF-8 ( so, in N++, they can be true ANSI files, with some bytes between \x80 and \xFF or true UTF-16 files with a BOM )
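
The idea behind converting from UTF-8 to UTF-8 is that the conversion only fails on bytes that are not valid UTF-8. A rough Python equivalent of that test, assuming a strict decode behaves like the iconv check, could look like this (`is_valid_utf8` is a name made up for the example) :

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if `data` decodes as strict UTF-8 (pure ASCII also passes).

    Hypothetical helper: it only mimics the pass/fail outcome of the
    iconv round trip, it does not call iconv.
    """
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(b"plain ASCII"))                   # True
print(is_valid_utf8("café".encode("utf-8")))           # True
print(is_valid_utf8("café".encode("cp1252")))          # False, lone byte E9
print(is_valid_utf8("\ufeffabc".encode("utf-16-le")))  # False, starts with FF FE
```

One caveat : an ANSI file whose high bytes happen to form valid UTF-8 sequences (for instance the two characters Ã© in Windows-1252, stored as C3 A9) passes this kind of test, so such rare files are not reported.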

        • Now, open the results.txt file in N++

        • Open the Replace dialog ( Ctrl + H )

        • SEARCH (?-si)^Permission denied\R|^.*iconv:\x20|: cannot convert$

        • REPLACE Leave EMPTY

        • Tick the Wrap around option

        • Select the Regular expression search mode

        • Click on the Replace All button

        => We keep the full pathnames, only

        • Save the results.txt file

        • Open, again, the Replace dialog ( Ctrl + H )

        • SEARCH (?-s)^.+

        • REPLACE \( echo | set /p="$0 " & xxd.exe -l2 -u "$0" \) >> results.txt

        • Tick the Wrap around option

        • Select the Regular expression search mode

        • Click on the Replace All button

        • Now, add the line @echo OFF at the very beginning of the file

        • And rename this file as results.bat

        • If the current encoding of this batch file is UTF-8, run the Encoding > Convert to ANSI option ( IMPORTANT, as the DOS window is ANSI too )

        • Save this new results.bat file

        • Run the results.bat file, in the DOS window

        • Back to Notepad++, open the new results.txt file

        • Open the Replace dialog ( Ctrl + H )

        • SEARCH (?-s)^.+(FFFE|FEFF).+\R|\x2000000000:.+

        • REPLACE Leave EMPTY

        • Tick the Wrap around option

        • Select the Regular expression search mode

        • Click on the Replace All button

        => We get rid of the UTF-16 files which have a BOM equal to FEFF or FFFE
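
To see what this last replacement does, here is a small Python sketch of the same regex applied to a sample text. The paths are invented and the xxd lines are shortened (the real output pads the hex column), so treat the sample as an assumption. Python's `re` module has no `\R`, so the line break is written out explicitly, and `(?-s)` is simply Python's default behaviour.

```python
import re

# Python translation of: (?-s)^.+(FFFE|FEFF).+\R|\x2000000000:.+
PATTERN = re.compile(
    r"^.+(?:FFFE|FEFF).+(?:\r?\n|\Z)"  # whole line of a UTF-16 file
    r"|\x2000000000:.+",               # xxd part of the remaining lines
    re.MULTILINE,
)

# Hypothetical content of results.txt after running results.bat.
sample = (
    "C:\\posts\\old_reply.txt 00000000: 4865  He\n"
    "C:\\posts\\utf16_le.txt 00000000: FFFE  ..\n"
    "C:\\posts\\utf16_be.txt 00000000: FEFF  ..\n"
    "C:\\posts\\accents.txt 00000000: E974  .t\n"
)

cleaned = PATTERN.sub("", sample)
print(cleaned)
# C:\posts\old_reply.txt
# C:\posts\accents.txt
```

Note that a file whose path itself contains FFFE or FEFF would also be dropped by the first alternative, so it is worth checking for such names before trusting the final list.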

        • Save the results.txt file

        => Thus, you should get, only, the true ANSI files, containing some bytes between \x80 and \xFF, that need a further UTF-8 or UTF-8-BOM encoding !
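
For readers who want to see what the re-encoding amounts to at the byte level, here is a minimal Python sketch. It assumes that "ANSI" means Windows-1252, as suggested by the chcp 1252 step; `ansi_to_utf8` is a hypothetical helper, not something Notepad++ provides, and the Encoding > Convert to UTF-8 command remains the method described in this post.

```python
import codecs


def ansi_to_utf8(data: bytes, with_bom: bool = False) -> bytes:
    """Re-encode Windows-1252 bytes as UTF-8, optionally with a BOM.

    Assumption: the input really is Windows-1252. Python's cp1252 codec
    raises UnicodeDecodeError on the five bytes it leaves undefined
    (81, 8D, 8F, 90, 9D).
    """
    text = data.decode("cp1252")
    encoded = text.encode("utf-8")
    return codecs.BOM_UTF8 + encoded if with_bom else encoded


ansi = "déjà vu".encode("cp1252")   # 7 bytes, é = E9, à = E0
utf8 = ansi_to_utf8(ansi)           # 9 bytes, é = C3 A9, à = C3 A0
utf8_bom = ansi_to_utf8(ansi, with_bom=True)

print(len(ansi), len(utf8), len(utf8_bom))  # 7 9 12
```

Each accented character grows from one byte to two, and the BOM variant adds the three bytes EF BB BF in front.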

        • Delete the results.bat file

        • Finally, close the DOS window

        Best Regards,

        guy038
