How to identify true "ANSI" files in order to convert them to "UTF-8" files?
Hello, All,
When I began to “follow” Notepad++, in 2012, I was using N++ ANSI versions. At that time, N++'s forums were located on SourceForge.Net and, of course, my first replies to users were stored as ANSI files.

Recently, I was wondering how many files, among my about 1,800 posts stored on my laptop, have a true ANSI encoding, in order to re-encode them as UTF-8 with the two commands:

- Encoding > Convert to UTF-8
- File > Save

I succeeded in finding a way to achieve it! You just need two DOS tools: iconv.exe and xxd.exe

Note that the xxd.exe program, which is a hexdump utility, would not be enough on its own. It can identify, without error:

- True UTF-8-BOM files, with the first three bytes EFBBBF
- True UTF-16 BE BOM files, with the first two bytes FEFF
- True UTF-16 LE BOM files, with the first two bytes FFFE

But it cannot show, at first sight, any difference between an ANSI file and a UTF-8 file without a BOM, unless you scan the files completely!
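
For readers who prefer a script to a hex dump, the BOM check described above can be sketched in Python. This is only an illustration and not part of the procedure itself: detect_bom is a made-up helper name, and it looks at the leading bytes only, which is all a short hex dump shows. UTF-32 BOMs are ignored in this sketch.

```python
import codecs

# Byte order marks mentioned in the post (UTF-32 BOMs are not handled here).
BOMS = [
    (codecs.BOM_UTF8, "UTF-8-BOM"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16 BE BOM"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16 LE BOM"),  # FF FE
]


def detect_bom(data: bytes):
    """Return a label for the BOM at the start of data, or None if there is none."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label
    return None


if __name__ == "__main__":
    print(detect_bom(b"\xef\xbb\xbfabc"))        # UTF-8 with a BOM
    print(detect_bom(b"\xff\xfea\x00"))          # UTF-16 LE with a BOM
    print(detect_bom(b"caf\xe9"))                # ANSI text, no BOM
    print(detect_bom("café".encode("utf-8")))    # UTF-8 text, no BOM
```

The last two calls both return None, which is exactly the limitation described above: without a BOM, the first bytes alone do not separate ANSI from UTF-8.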
Thus,

- Download the binaries, dependencies and documentation archives, relative to the iconv.exe program, from:
  https://gnuwin32.sourceforge.net/packages/libiconv.htm
- Move to the directory where you extracted the iconv.exe program and its 3 dependencies: libcharset1.dll, libiconv2.dll and libintl3.dll
- Now, in this same directory, download the xxd archive from:
  https://sourceforge.net/projects/xxd-for-windows/files/xxd-1.11_win32(static).zip/download
- Extract the xxd.exe file
- Open a DOS window
- Move to the folder containing both iconv.exe and xxd.exe
- Run the command:
  iconv -f UTF-8 -t UTF-8 <Full_Directory_Path>\*.* 1> NUL 2>> results.txt
  Of course, you must replace the <Full_Directory_Path> zone with the location of the folder containing all the files to scan for encoding!
  => With this command, you are certain that the output file (results.txt) lists only files which are not presently encoded in UTF-8 (so they can be true ANSI files or UTF-16 files)
- Now, open the results.txt file in N++
- Open the Replace dialog (Ctrl + H)
    - SEARCH (?-si)^Permission denied\R|^.*iconv:\x20|: cannot convert$
    - REPLACE Leave EMPTY
    - Tick the Wrap around option
    - Select the Regular expression search mode
    - Click on the Replace All button
  => We keep the full pathnames only
- Now, rename this file as results.bat
- Save the results.bat file
- Then, still in the results.bat file:
    - Open the Replace dialog (Ctrl + H)
    - SEARCH (?-s)^.+
    - REPLACE \( echo | set /p="$0 " & xxd.exe -l2 -u "$0" \) >> results.txt
    - Tick the Wrap around option
    - Select the Regular expression search mode
    - Click on the Replace All button
- Add the command @echo OFF at the very beginning of the results.bat file
- Re-save the results.bat file
- Run the results.bat file, in the DOS window
- In Notepad++, open the new results.txt file
- Open the Replace dialog (Ctrl + H)
    - SEARCH (?-s)^.+(FFFE|FEFF).+\R|\x2000000000:.+
    - REPLACE Leave EMPTY
    - Tick the Wrap around option
    - Select the Regular expression search mode
    - Click on the Replace All button
- Save the results.txt file
  => You should get only the pure ANSI files that need a further UTF-8 or UTF-8-BOM encoding!
- Finally, delete the results.bat file
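
The iconv -f UTF-8 -t UTF-8 step above boils down to one question per file: do these bytes decode as UTF-8 without error? A rough Python equivalent is sketched below. It is an assumption-laden illustration, not a description of how iconv works internally, and is_valid_utf8 is a hypothetical helper name.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8, False otherwise."""
    try:
        data.decode("utf-8")  # "strict" error handling is the default
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(is_valid_utf8("café".encode("utf-8")))  # UTF-8 bytes
    print(is_valid_utf8(b"caf\xe9"))              # same word stored as ANSI (cp1252)
    print(is_valid_utf8(b"\xff\xfea\x00"))        # UTF-16 LE with a BOM
    print(is_valid_utf8(b"plain ASCII text"))     # only bytes below \x80
```

Files for which such a check fails are the ones that end up listed in results.txt: true ANSI files and UTF-16 files.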
Best Regards,
guy038
Hello, All,
In my last post, I forgot some important points! So, refer to this corrected version below and forget the previous one!
When I began to “follow” Notepad++, in 2012, I was using N++ ANSI versions. At that time, N++'s forums were located on SourceForge.Net and, of course, my first replies to users were stored as ANSI files.

Recently, I was wondering how many files, among my about 1,800 posts stored on my laptop, have a true ANSI encoding, in order to re-encode them as UTF-8 with the two commands:

- Encoding > Convert to UTF-8
- File > Save

I succeeded in finding a way to achieve it! You just need two DOS tools: iconv.exe and xxd.exe

Notes:

- Using the xxd.exe program alone, which is a hexdump utility, is not enough to determine the correct encoding. In fact, it can identify, without error:
    - True UTF-8-BOM files, with the first three bytes EFBBBF
    - True UTF-16 BE BOM files, with the first two bytes FEFF
    - True UTF-16 LE BOM files, with the first two bytes FFFE
  But it cannot show, at first sight, any difference between an ANSI file and a UTF-8 file without a BOM, unless you scan the files completely!
- Remember also that, if all bytes of a file have a value below \x80, Notepad++ considers that it's a pseudo UTF-8 file, by default!
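
The last note can be checked with a few lines of Python. This is a minimal sketch, assuming that "ANSI" means the Windows-1252 code page (cp1252); is_pure_ascii is a hypothetical helper name.

```python
def is_pure_ascii(data: bytes) -> bool:
    """Return True if every byte has a value below \\x80."""
    return all(byte < 0x80 for byte in data)


if __name__ == "__main__":
    sample = b"Hello, All,"
    print(is_pure_ascii(sample))
    # For such a file, the ANSI and UTF-8 readings give the very same text,
    # so nothing in the bytes can favour one encoding over the other.
    print(sample.decode("cp1252") == sample.decode("utf-8"))
    # One single byte between \x80 and \xFF is enough to leave this category.
    print(is_pure_ascii(b"caf\xe9"))
```

Such files never need re-encoding: their bytes are identical in ANSI and in UTF-8 without a BOM.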
Thus,

- Download the binaries, dependencies and documentation archives, relative to the iconv.exe program, from:
  https://gnuwin32.sourceforge.net/packages/libiconv.htm
- Move to the directory where you extracted the iconv.exe program and its 3 dependencies: libcharset1.dll, libiconv2.dll and libintl3.dll
- Now, in this same directory, download the xxd archive from:
  https://sourceforge.net/projects/xxd-for-windows/files/xxd-1.11_win32(static).zip/download
- Extract the xxd.exe file
- Open a DOS window
- Run, first, the command chcp 1252 (in order to get an ANSI output for this DOS window!)
- Move to the folder containing both iconv.exe and xxd.exe
- Then, run the command:
  iconv -f UTF-8 -t UTF-8 <Full_Directory_Path>\*.* 1> NUL 2>> results.txt
  Of course, you must replace the <Full_Directory_Path> zone with the location of the folder containing all the files to scan for encoding!
  => With this command, you are certain that the output file (results.txt) lists only files which are not presently encoded in UTF-8 (so, in N++, they can be true ANSI files, with some bytes between \x80 and \xFF, or true UTF-16 files with a BOM)
- Now, open the results.txt file in N++
- Open the Replace dialog (Ctrl + H)
    - SEARCH (?-si)^Permission denied\R|^.*iconv:\x20|: cannot convert$
    - REPLACE Leave EMPTY
    - Tick the Wrap around option
    - Select the Regular expression search mode
    - Click on the Replace All button
  => We keep the full pathnames only
- Save the results.txt file
- Open, again, the Replace dialog (Ctrl + H)
    - SEARCH (?-s)^.+
    - REPLACE \( echo | set /p="$0 " & xxd.exe -l2 -u "$0" \) >> results.txt
    - Tick the Wrap around option
    - Select the Regular expression search mode
    - Click on the Replace All button
- Now, add the line @echo OFF at the very beginning of the file
- And rename this file as results.bat
- If the current encoding of this batch file is UTF-8, run the Encoding > Convert to ANSI option (IMPORTANT, as the DOS window is ANSI too)
- Save this new results.bat file
- Run the results.bat file, in the DOS window
- Back in Notepad++, open the new results.txt file
- Open the Replace dialog (Ctrl + H)
    - SEARCH (?-s)^.+(FFFE|FEFF).+\R|\x2000000000:.+
    - REPLACE Leave EMPTY
    - Tick the Wrap around option
    - Select the Regular expression search mode
    - Click on the Replace All button
  => We get rid of the UTF-16 files which have a BOM equal to FEFF or FFFE
- Save the results.txt file
  => Thus, you should get only the true ANSI files, containing some bytes between \x80 and \xFF, that need a further UTF-8 or UTF-8-BOM encoding!
- Delete the results.bat file
- Finally, close the DOS window
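
For comparison, the whole procedure above can be approximated by a short Python script. This is only a sketch under stated assumptions, not a replacement for the iconv/xxd method: the helper names find_ansi_candidates and reencode_to_utf8 are hypothetical, only the files directly inside the folder are scanned, and "ANSI" is assumed to mean the Windows-1252 code page (cp1252), in line with the chcp 1252 command used earlier.

```python
import codecs
from pathlib import Path

UTF16_BOMS = (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)  # FE FF and FF FE


def find_ansi_candidates(folder):
    """List files in folder that are not valid UTF-8 and carry no UTF-16 BOM."""
    candidates = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        data = path.read_bytes()
        if data.startswith(UTF16_BOMS):
            continue  # UTF-16 file with a BOM: not an ANSI file
        try:
            data.decode("utf-8")
        except UnicodeDecodeError:
            # At least one byte sequence is not UTF-8: probably a true ANSI file.
            candidates.append(path)
    return candidates


def reencode_to_utf8(path, source_encoding="cp1252"):
    """Rewrite one file as UTF-8 without a BOM, assuming its current encoding."""
    # Raises UnicodeDecodeError for the few bytes that cp1252 leaves undefined.
    text = Path(path).read_bytes().decode(source_encoding)
    Path(path).write_bytes(text.encode("utf-8"))
```

Calling find_ansi_candidates on the folder of posts returns the same kind of list as the final results.txt file. It is safer to work on a copy of the folder before calling reencode_to_utf8 on each returned path, as the function overwrites the file in place.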
Best Regards,
guy038