How to identify true "ANSI" files in order to convert them to "UTF-8" files ?
-
Hello, All,
When I began to “follow” Notepad++, in 2012, I was using N++ ANSI versions. At that time, N++'s forums were located on SourceForge.Net and, of course, my first replies to users were stored as ANSI files.
Recently, I was wondering how many files, among my about 1,800 posts stored on my laptop, have a true ANSI encoding, in order to re-encode them as UTF-8 with the two commands :
- Encoding > Convert to UTF-8
- File > Save
I succeeded in finding a way to achieve it ! You just need two DOS tools : iconv.exe and xxd.exe
Note that using the xxd.exe program alone, which is a hexdump utility, was not enough. It can identify, without error :
- True UTF-8-BOM files, with the first three bytes EFBBBF
- True UTF-16 BE BOM files, with the first two bytes FEFF
- True UTF-16 LE BOM files, with the first two bytes FFFE
But it cannot show, at first sight, any difference between an ANSI file and a UTF-8 file without a BOM, unless the files are scanned completely !
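The BOM check described above can be sketched in Python. This is only an illustration, not part of the original procedure : the function name sniff_bom is hypothetical, and the sketch only knows the three BOMs listed above.

```python
import codecs
from typing import Optional

# The three byte-order marks listed above, with a label for each.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8-BOM"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16 BE BOM"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16 LE BOM"),  # FF FE
]

def sniff_bom(head: bytes) -> Optional[str]:
    """Return a label for the BOM found at the start of 'head', or None.

    'sniff_bom' is a hypothetical helper name used for illustration.
    """
    for bom, label in BOMS:
        if head.startswith(bom):
            return label
    # ANSI files and BOM-less UTF-8 files both end up here:
    # the first bytes alone cannot tell them apart.
    return None
```

Both an ANSI file and a UTF-8 file without a BOM return None here, which is exactly the limitation of looking at the first bytes only.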
Thus,
- Download the binaries, dependencies and documentation archives of the iconv.exe program from :
https://gnuwin32.sourceforge.net/packages/libiconv.htm
- Move to the directory where you extracted the iconv.exe program and its 3 dependencies : libcharset1.dll, libiconv2.dll and libintl3.dll
- Now, in this same directory, download the xxd archive from :
https://sourceforge.net/projects/xxd-for-windows/files/xxd-1.11_win32(static).zip/download
- Extract the xxd.exe file
- Open a DOS window
- Move to the folder containing both iconv.exe and xxd.exe
- Run the command
iconv -f UTF-8 -t UTF-8 <Full_Directory_Path>\*.* 1> NUL 2>> results.txt
Of course, you must replace the <Full_Directory_Path> zone with the location of the folder containing all the files to scan for encoding !
=> With this command, you are certain that the OUTPUT file ( results.txt ) lists only files which are not presently encoded in UTF-8 ( so they can be true ANSI files or UTF-16 files )
- Now, open the results.txt file in N++
- Open the Replace dialog ( Ctrl + H )
- SEARCH (?-si)^Permission denied\R|^.*iconv:\x20|: cannot convert$
- REPLACE Leave EMPTY
- Tick the Wrap around option
- Select the Regular expression search mode
- Click on the Replace All button
=> We keep the full pathnames, only
- Now, rename this file as results.bat
- Save the results.bat file
- Then, still in the results.bat file :
- Open the Replace dialog ( Ctrl + H )
- SEARCH (?-s)^.+
- REPLACE \( echo | set /p="$0 " & xxd.exe -l2 -u "$0" \) >> results.txt
- Tick the Wrap around option
- Select the Regular expression search mode
- Click on the Replace All button
- Add the command @echo OFF at the very beginning of the results.bat file
- Re-save the results.bat file
- Run the results.bat file, in the DOS window
- In Notepad++, open the new results.txt file
- Open the Replace dialog ( Ctrl + H )
- SEARCH (?-s)^.+(FFFE|FEFF).+\R|\x2000000000:.+
- REPLACE Leave EMPTY
- Tick the Wrap around option
- Select the Regular expression search mode
- Click on the Replace All button
- Save the results.txt file
=> You should get only the pure ANSI files that need a further UTF-8 or UTF-8-BOM encoding !
- Finally, delete the results.bat file
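The iconv step of the procedure above relies on one idea : a file that is not valid UTF-8 cannot be converted from UTF-8. A minimal Python sketch of that same test follows. The name is_valid_utf8 is hypothetical, and Python's decoder may not agree with iconv.exe on every edge case, so take it as an approximation only.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if 'data' decodes as strict UTF-8, False otherwise.

    Hypothetical helper, used here as a rough stand-in for the
    'iconv -f UTF-8 -t UTF-8' test of the procedure above.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Same situation as a file reported by iconv in results.txt
        return False
    return True
```

For example, the ANSI bytes b"caf\xe9" fail this test, while the UTF-8 bytes b"caf\xc3\xa9" pass it. A UTF-16 file with a BOM fails it too, which is why the procedure needs the additional xxd step.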
Best Regards,
guy038
-
-
Hello, All,
In my last post, I forgot some important points ! So, refer to the corrected version below and forget the previous one !
When I began to “follow” Notepad++, in 2012, I was using N++ ANSI versions. At that time, N++'s forums were located on SourceForge.Net and, of course, my first replies to users were stored as ANSI files.
Recently, I was wondering how many files, among my about 1,800 posts stored on my laptop, have a true ANSI encoding, in order to re-encode them as UTF-8 with the two commands :
- Encoding > Convert to UTF-8
- File > Save
I succeeded in finding a way to achieve it ! You just need two DOS tools : iconv.exe and xxd.exe
Notes :
- Using the xxd.exe program alone, which is a hexdump utility, is not enough to determine the correct encoding. In fact, it can identify, without error :
  - True UTF-8-BOM files, with the first three bytes EFBBBF
  - True UTF-16 BE BOM files, with the first two bytes FEFF
  - True UTF-16 LE BOM files, with the first two bytes FFFE
  But it cannot show, at first sight, any difference between an ANSI file and a UTF-8 file without a BOM, unless the files are scanned completely !
- Remember also that, if all bytes of a file have a value below \x80, Notepad++ considers, by default, that it is a pseudo UTF-8 file !
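The second note can be illustrated with a short Python sketch. It assumes that "ANSI" means the Windows-1252 code page (cp1252 in Python), and the name is_pure_ascii is hypothetical. When every byte is below \x80, the file decodes to the same text in both encodings, so nothing distinguishes the two cases.

```python
def is_pure_ascii(data: bytes) -> bool:
    """Return True if every byte of 'data' has a value below 0x80.

    Hypothetical helper name, for illustration only.
    """
    return all(byte < 0x80 for byte in data)

sample = b"Hello, All,"

# Assumption: "ANSI" is taken to be Windows-1252 (cp1252) here.
# For pure ASCII bytes, both decodings give the same text.
same_text = sample.decode("cp1252") == sample.decode("utf-8")
```

Such files do not need any re-encoding : saving them as UTF-8 without a BOM leaves their bytes unchanged.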
Thus,
- Download the binaries, dependencies and documentation archives of the iconv.exe program from :
https://gnuwin32.sourceforge.net/packages/libiconv.htm
- Move to the directory where you extracted the iconv.exe program and its 3 dependencies : libcharset1.dll, libiconv2.dll and libintl3.dll
- Now, in this same directory, download the xxd archive from :
https://sourceforge.net/projects/xxd-for-windows/files/xxd-1.11_win32(static).zip/download
- Extract the xxd.exe file
- Open a DOS window
- Run, first, the command chcp 1252 ( in order to get an ANSI output for this DOS window ! )
- Move to the folder containing both iconv.exe and xxd.exe
- Then, run the command
iconv -f UTF-8 -t UTF-8 <Full_Directory_Path>\*.* 1> NUL 2>> results.txt
Of course, you must replace the <Full_Directory_Path> zone with the location of the folder containing all the files to scan for encoding !
=> With this command, you are certain that the OUTPUT file ( results.txt ) lists only files which are not presently encoded in UTF-8 ( so, in N++, they can be true ANSI files, with some bytes between \x80 and \xFF, or true UTF-16 files with a BOM )
- Now, open the results.txt file in N++
- Open the Replace dialog ( Ctrl + H )
- SEARCH (?-si)^Permission denied\R|^.*iconv:\x20|: cannot convert$
- REPLACE Leave EMPTY
- Tick the Wrap around option
- Select the Regular expression search mode
- Click on the Replace All button
=> We keep the full pathnames, only
- Save the results.txt file
- Open, again, the Replace dialog ( Ctrl + H )
- SEARCH (?-s)^.+
- REPLACE \( echo | set /p="$0 " & xxd.exe -l2 -u "$0" \) >> results.txt
- Tick the Wrap around option
- Select the Regular expression search mode
- Click on the Replace All button
- Now, add the line @echo OFF at the very beginning of the file
- And rename this file as results.bat
- If the current encoding of this batch file is UTF-8, run the Encoding > Convert to ANSI option ( IMPORTANT, as the DOS window is ANSI too )
- Save this new results.bat file
- Run the results.bat file, in the DOS window
- Back to Notepad++, open the new results.txt file
- Open the Replace dialog ( Ctrl + H )
- SEARCH (?-s)^.+(FFFE|FEFF).+\R|\x2000000000:.+
- REPLACE Leave EMPTY
- Tick the Wrap around option
- Select the Regular expression search mode
- Click on the Replace All button
=> We get rid of the UTF-16 files which have a BOM equal to FEFF or FFFE
- Save the results.txt file
=> Thus, you should get only the true ANSI files, containing some bytes between \x80 and \xFF, that need a further UTF-8 or UTF-8-BOM encoding !
- Delete the results.bat file
- Finally, close the DOS window
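Once the true ANSI files are listed, the re-encoding itself ( Encoding > Convert to UTF-8, then File > Save ) could also be sketched in Python, for those who prefer a script. This is an illustration under assumptions, not the method used above : "ANSI" is assumed to be Windows-1252, and the function name ansi_to_utf8 and its with_bom parameter are hypothetical.

```python
import codecs

def ansi_to_utf8(data: bytes, codepage: str = "cp1252",
                 with_bom: bool = False) -> bytes:
    """Re-encode ANSI bytes as UTF-8, optionally prefixed with a BOM.

    Hypothetical helper. The default code page cp1252 is an assumption;
    use the code page that really applies to your files.
    """
    text = data.decode(codepage)      # ANSI bytes -> text
    encoded = text.encode("utf-8")    # text -> UTF-8 bytes
    if with_bom:
        encoded = codecs.BOM_UTF8 + encoded  # EF BB BF prefix
    return encoded
```

For example, the ANSI bytes b"caf\xe9" become b"caf\xc3\xa9". Note that a few byte values, such as \x81, are not defined in Python's cp1252 table and raise a UnicodeDecodeError, so test on copies of your files first.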
Best Regards,
guy038
-