How to identify true "ANSI" files in order to convert them to "UTF-8" files ?

guy038

Hello, All,

When I began to “follow” Notepad++, in 2012, I was using N++ ANSI versions. At that time, N++'s forums were located on SourceForge.Net and, of course, my first replies to users were stored as ANSI files.

Recently, I was wondering how many files, among my 1,800 or so posts stored on my laptop, have a true ANSI encoding, in order to re-encode them as UTF-8 with the two commands :

Encoding > Convert to UTF-8
File > Save

I succeeded in finding a way to achieve it ! You just need two DOS tools : iconv.exe and xxd.exe

Note that using the xxd.exe program alone, which is a hexdump utility, is not enough, because it can only identify, without error :

True UTF-8-BOM files with the first three bytes EFBBBF
True UTF-16 BE BOM with the first two bytes FEFF
True UTF-16 LE BOM with the first two bytes FFFE

But it cannot show, at first sight, any difference between an ANSI file and a UTF-8 file without a BOM, unless you scan the files completely !
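
To illustrate the point above, here is a minimal Python sketch of what a look at the first bytes can and cannot tell you. The function name sniff_bom is hypothetical (it is not part of xxd or Notepad++) and only the three BOMs listed above are checked :

```python
import codecs

# The three byte order marks listed above, with the encoding each one announces.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8-BOM"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16 BE BOM"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16 LE BOM"),  # FF FE
]


def sniff_bom(head: bytes):
    """Return the encoding announced by a BOM, or None when there is no BOM."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None


# An ANSI file and a BOM-less UTF-8 file both give None :
# the first bytes alone cannot tell them apart.
print(sniff_bom(b"caf\xe9"))                # the word "café" stored as ANSI
print(sniff_bom("café".encode("utf-8")))    # the same word as UTF-8 without BOM
```

Both calls return None, which is exactly why a second tool is needed for files without a BOM.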

Thus,

Download the binaries, dependencies and documentation archives of the iconv.exe program, from :

https://gnuwin32.sourceforge.net/packages/libiconv.htm

Move to the directory where you extracted the iconv.exe program and its 3 dependencies : libcharset1.dll, libiconv2.dll and libintl3.dll
Now, in this same directory, download the xxd archive from :

https://sourceforge.net/projects/xxd-for-windows/files/xxd-1.11_win32(static).zip/download

Extract the xxd.exe file
Open a DOS window
Move to the folder containing, both, iconv.exe and xxd.exe
Run the command iconv -f UTF-8 -t UTF-8 <Full_Directory_Path>\*.* 1> NUL 2>> results.txt

Of course, you must replace the <Full_Directory_Path> placeholder with the location of the folder containing all the files to scan for encoding !

=> With this command, you are certain that the OUTPUT file ( results.txt ) lists only the files which are not currently encoded in UTF-8 ( so they can be true ANSI files or UTF-16 files )
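
The idea behind this command ( try to read each file as UTF-8 and keep the ones which fail ) can be sketched in Python as follows. This is only an approximation of the test, not a description of how iconv works internally, and the name is_valid_utf8 is hypothetical :

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True when the bytes decode as strict UTF-8, False otherwise."""
    try:
        data.decode("utf-8")  # strict mode : any invalid sequence raises an error
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8("café".encode("utf-8")))   # a real UTF-8 file
print(is_valid_utf8(b"caf\xe9"))               # ANSI : the lone byte E9 is not valid UTF-8
print(is_valid_utf8("café".encode("utf-16")))  # UTF-16 with a BOM : FF is never valid in UTF-8
```

The files for which such a check fails are the ones that end up in results.txt.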

Now, open the results.txt file in N++
Open the Replace dialog ( Ctrl + H )
SEARCH (?-si)^Permission denied\R|^.*iconv:\x20|: cannot convert$
REPLACE Leave EMPTY
Tick the Wrap around option
Select the Regular expression search mode
Click on the Replace All button

=> We keep the full pathnames, only

Now, rename this file as results.bat
Save the results.bat file
Then, still in the results.bat file :
Open the Replace dialog ( Ctrl + H )
SEARCH (?-s)^.+
REPLACE \( echo | set /p="$0 " & xxd.exe -l2 -u "$0" \) >> results.txt
Tick the Wrap around option
Select the Regular expression search mode
Click on the Replace All button
Add the command @echo OFF at the very beginning of the results.bat file
Re-save the results.bat file
Run the results.bat file, in the DOS window
In Notepad++, open the new results.txt file
Open the Replace dialog ( Ctrl + H )
SEARCH (?-s)^.+(FFFE|FEFF).+\R|\x2000000000:.+
REPLACE Leave EMPTY
Tick the Wrap around option
Select the Regular expression search mode
Click on the Replace All button
Save the results.txt file

=> You should get only the true ANSI files that still need to be re-encoded as UTF-8 or UTF-8-BOM !

Finally, delete the results.bat file

Best Regards,

guy038

guy038

Hello, All,

In my last post, I forgot some important points ! So, refer to the corrected version below and forget the previous one !

When I began to “follow” Notepad++, in 2012, I was using N++ ANSI versions. At that time, N++'s forums were located on SourceForge.Net and, of course, my first replies to users were stored as ANSI files.

Recently, I was wondering how many files, among my 1,800 or so posts stored on my laptop, have a true ANSI encoding, in order to re-encode them as UTF-8 with the two commands :

Encoding > Convert to UTF-8
File > Save
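
For readers who prefer a script, the effect of these two menu commands can be approximated in Python. This is only a sketch : the function name ansi_to_utf8 is hypothetical, and it assumes that "ANSI" means Windows-1252 ( other systems may use a different code page ) :

```python
def ansi_to_utf8(data: bytes, ansi_codec: str = "cp1252", with_bom: bool = False) -> bytes:
    """Re-encode ANSI bytes as UTF-8, optionally with a leading BOM.

    Assumption : the ANSI code page is Windows-1252. A few byte values
    ( for example 0x81 ) are undefined in that code page and raise an error.
    """
    text = data.decode(ansi_codec)
    # "utf-8-sig" writes the bytes EF BB BF in front of the text
    return text.encode("utf-8-sig" if with_bom else "utf-8")


print(ansi_to_utf8(b"caf\xe9"))                  # UTF-8 without BOM
print(ansi_to_utf8(b"caf\xe9", with_bom=True))   # UTF-8-BOM
```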

I succeeded in finding a way to achieve it ! You just need two DOS tools : iconv.exe and xxd.exe

Notes :

Using the xxd.exe program alone, which is a hexdump utility, is not enough to determine the correct encoding. In fact, it can identify, without error :
- True UTF-8-BOM files with the first three bytes EFBBBF
- True UTF-16 BE BOM with the first two bytes FEFF
- True UTF-16 LE BOM with the first two bytes FFFE
- But it cannot show, at first sight, any difference between an ANSI file and a UTF-8 file without a BOM, unless you scan the files completely !
Remember also that, if all bytes of a file have a value below \x80, Notepad++ considers that it’s a pseudo UTF-8 file, by default !
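
The last remark can be checked with a few lines of Python. This is a sketch only ( the name is_pure_ascii is hypothetical ) which shows why such files need no conversion at all :

```python
def is_pure_ascii(data: bytes) -> bool:
    """Return True when every byte of the file has a value below 0x80."""
    return all(byte < 0x80 for byte in data)


sample = b"Hello, All,"
print(is_pure_ascii(sample))       # every byte is below \x80
print(is_pure_ascii(b"caf\xe9"))   # the byte E9 is above \x7F

# For a pure ASCII file, the ANSI and the UTF-8 readings give the same text
print(sample.decode("cp1252") == sample.decode("utf-8"))
```

Bytes below \x80 are stored identically in ANSI and in UTF-8, so only the files with bytes between \x80 and \xFF really matter here.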

Thus,

Download the binaries, dependencies and documentation archives of the iconv.exe program, from :

https://gnuwin32.sourceforge.net/packages/libiconv.htm

Move to the directory where you extracted the iconv.exe program and its 3 dependencies : libcharset1.dll, libiconv2.dll and libintl3.dll
Now, in this same directory, download the xxd archive from :

https://sourceforge.net/projects/xxd-for-windows/files/xxd-1.11_win32(static).zip/download

Extract the xxd.exe file
Open a DOS window
Run, first, the command chcp 1252 ( in order to get an ANSI output for this DOS window ! )
Move to the folder containing, both, iconv.exe and xxd.exe
Then, run the command iconv -f UTF-8 -t UTF-8 <Full_Directory_Path>\*.* 1> NUL 2>> results.txt

Of course, you must replace the <Full_Directory_Path> placeholder with the location of the folder containing all the files to scan for encoding !

=> With this command, you are certain that the OUTPUT file ( results.txt ) contains only files which are not presently encoded in UTF-8 ( so, in N++, they can be true ANSI files, with some bytes between \x80 and \xFF or true UTF-16 files with a BOM )

Now, open the results.txt file in N++
Open the Replace dialog ( Ctrl + H )
SEARCH (?-si)^Permission denied\R|^.*iconv:\x20|: cannot convert$
REPLACE Leave EMPTY
Tick the Wrap around option
Select the Regular expression search mode
Click on the Replace All button

=> We keep the full pathnames, only
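
As a cross-check, the same clean-up can be sketched in Python with the re module. The sample lines used below are an assumption, inferred from the regex itself and not copied from a real iconv run, and the name keep_paths is hypothetical :

```python
import re

# Same three alternatives as the N++ regex. In Python, "." does not match
# a line break and the search is case-sensitive by default, like (?-si).
PATTERN = re.compile(
    r"^Permission denied\n|^.*iconv: |: cannot convert$",
    re.MULTILINE,
)


def keep_paths(log_text: str) -> str:
    """Strip the iconv messages so that only the pathnames remain."""
    # Normalise Windows line endings first, so that "$" matches at each line end
    normalised = log_text.replace("\r\n", "\n")
    return PATTERN.sub("", normalised)


# Hypothetical content of results.txt ( assumed message layout )
sample = (
    "iconv: C:\\posts\\old.txt: cannot convert\r\n"
    "Permission denied\r\n"
    "iconv: C:\\posts\\new.txt: cannot convert\r\n"
)
print(keep_paths(sample))
```

With this assumed input, only the two pathnames remain, one per line.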

Save the results.txt file
Open, again, the Replace dialog ( Ctrl + H )
SEARCH (?-s)^.+
REPLACE \( echo | set /p="$0 " & xxd.exe -l2 -u "$0" \) >> results.txt
Tick the Wrap around option
Select the Regular expression search mode
Click on the Replace All button
Now, add the line @echo OFF at the very beginning of the file
And rename this file as results.bat
If the current encoding of this batch file is UTF-8, run the Encoding > Convert to ANSI option ( IMPORTANT, as the DOS window is ANSI too )
Save this new results.bat file
Run the results.bat file, in the DOS window
Back to Notepad++, open the new results.txt file
Open the Replace dialog ( Ctrl + H )
SEARCH (?-s)^.+(FFFE|FEFF).+\R|\x2000000000:.+
REPLACE Leave EMPTY
Tick the Wrap around option
Select the Regular expression search mode
Click on the Replace All button

=> We get rid of the UTF-16 files which have a BOM equal to FEFF or FFFE

Save the results.txt file

=> Thus, you should get only the true ANSI files, containing some bytes between \x80 and \xFF, that still need to be re-encoded as UTF-8 or UTF-8-BOM !
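
For comparison, the whole procedure can be approximated by one short Python script. It is a sketch under assumptions, not a replacement for the iconv + xxd method : the names classify and ansi_candidates are hypothetical, and "ANSI" simply means "no BOM, some bytes between \x80 and \xFF, and not valid UTF-8", as in the steps above :

```python
import codecs
from pathlib import Path


def classify(data: bytes) -> str:
    """Give a rough encoding label, following the same logic as the steps above."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8-BOM"
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "UTF-16"
    if all(byte < 0x80 for byte in data):
        return "ASCII"   # the pseudo UTF-8 case : nothing to convert
    try:
        data.decode("utf-8")
        return "UTF-8"   # valid UTF-8 without a BOM
    except UnicodeDecodeError:
        return "ANSI"    # bytes between \x80 and \xFF which are not valid UTF-8


def ansi_candidates(folder):
    """Return the files of a folder ( not its sub-folders ) classified as ANSI."""
    return sorted(
        path for path in Path(folder).iterdir()
        if path.is_file() and classify(path.read_bytes()) == "ANSI"
    )
```

For instance, ansi_candidates(r"C:\posts") ( a hypothetical folder ) would return the list of files that still need the Encoding > Convert to UTF-8 command.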

Delete the results.bat file
Finally, close the DOS window

Best Regards,

guy038