How to identify true "ANSI" files for further conversion to "UTF-8" files ?
-
Hello, All,
When I began to “follow” Notepad++, in 2012, I was using N++ ANSI versions. At that time, N++'s forums were located on SourceForge.Net and, of course, my first replies to users were stored as ANSI files.
Recently, I was wondering how many files, among my roughly 1,800 posts stored on my laptop, have a true ANSI encoding, in order to re-encode them as UTF-8 with the two commands :
- Encoding > Convert to UTF-8
- File > Save
I succeeded in finding a way to achieve it ! You just need two DOS tools : iconv.exe and xxd.exe
Note that the xxd.exe program alone, which is a hexdump utility, was not enough, because it can only identify, without error :
- True UTF-8-BOM files, with the first three bytes EFBBBF
- True UTF-16 BE BOM files, with the first two bytes FEFF
- True UTF-16 LE BOM files, with the first two bytes FFFE
But it cannot show, at first sight, any difference between an ANSI and a UTF-8 file without a BOM, unless scanning the files completely !
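As a side illustration of the note above, here is a minimal Python sketch. It is not part of the procedure itself, which relies on xxd.exe, and the name sniff_bom is hypothetical. It checks the first bytes of a file against the three BOMs listed and shows why a BOM check alone cannot separate an ANSI file from a BOM-less UTF-8 file : both give None.

```python
import codecs
from typing import Optional

# The three byte-order marks listed above.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8-BOM"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16 BE BOM"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16 LE BOM"),  # FF FE
]


def sniff_bom(first_bytes: bytes) -> Optional[str]:
    """Return the name of the BOM that first_bytes starts with, else None."""
    for bom, name in BOMS:
        if first_bytes.startswith(bom):
            return name
    return None


# An ANSI file (Windows-1252 assumed) and a BOM-less UTF-8 file
# are indistinguishable for this test: neither starts with a BOM.
ansi_bytes = "café".encode("cp1252")
utf8_bytes = "café".encode("utf-8")
print(sniff_bom(ansi_bytes), sniff_bom(utf8_bytes))
```

Only the first three bytes of a file are needed for this check, which is the same idea as the -l option of xxd.exe used later.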
Thus,
- Download the binaries, dependencies and documentation archives of the iconv.exe program, from :
https://gnuwin32.sourceforge.net/packages/libiconv.htm
- Move to the directory where you extracted the iconv.exe program and its 3 dependencies : libcharset1.dll, libiconv2.dll and libintl3.dll
- Now, in this same directory, download the xxd archive from :
https://sourceforge.net/projects/xxd-for-windows/files/xxd-1.11_win32(static).zip/download
- Extract the xxd.exe file
- Open a DOS window
- Move to the folder containing both iconv.exe and xxd.exe
- Run the command
iconv -f UTF-8 -t UTF-8 <Full_Directory_Path>\*.* 1> NUL 2>> results.txt
Of course, you must replace the <Full_Directory_Path> zone by the location of the folder containing all the files to scan for encoding !
=> With this command, you are certain that the OUTPUT file ( results.txt ) contains only files which are not presently encoded in UTF-8 ( so they can be true ANSI files or UTF-16 files )
- Now, open the results.txt file in N++
- Open the Replace dialog ( Ctrl + H )
    - SEARCH (?-si)^Permission denied\R|^.*iconv:\x20|: cannot convert$
    - REPLACE Leave EMPTY
    - Tick the Wrap around option
    - Select the Regular expression search mode
    - Click on the Replace All button
=> We keep the full pathnames, only
- Now, rename this file as results.bat
- Save the results.bat file
- Then, still in the results.bat file :
    - Open the Replace dialog ( Ctrl + H )
    - SEARCH (?-s)^.+
    - REPLACE \( echo | set /p="$0 " & xxd.exe -l2 -u "$0" \) >> results.txt
    - Tick the Wrap around option
    - Select the Regular expression search mode
    - Click on the Replace All button
- Add the command @echo OFF at the very beginning of the results.bat file
- Re-save the results.bat file
- Run the results.bat file, in the DOS window
- In Notepad++, open the new results.txt file
- Open the Replace dialog ( Ctrl + H )
    - SEARCH (?-s)^.+(FFFE|FEFF).+\R|\x2000000000:.+
    - REPLACE Leave EMPTY
    - Tick the Wrap around option
    - Select the Regular expression search mode
    - Click on the Replace All button
- Save the results.txt file
=> You should get, only, the pure ANSI files that still need to be re-encoded as UTF-8 or UTF-8-BOM !
- Finally, delete the results.bat file
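To show what the first Replace All of the procedure above does to the iconv messages, here is a small Python sketch. It is only a translation : Python's re module has neither \R nor a leading (?-si), so the expression is adapted. The sample lines and paths are hypothetical and merely follow the message shape that the SEARCH regex implies.

```python
import re

# Adapted from the SEARCH regex above: \R becomes \r?\n and \x20 a plain space.
# re.MULTILINE makes ^ and $ work line by line, as in Notepad++.
CLEANUP = re.compile(
    r"^Permission denied\r?\n|^.*iconv: |: cannot convert$",
    re.MULTILINE,
)


def keep_pathnames(stderr_text: str) -> str:
    """Strip the iconv message parts so that only full pathnames remain."""
    return CLEANUP.sub("", stderr_text)


# Hypothetical content of results.txt (invented paths, for illustration only).
sample = "\n".join([
    r"iconv: C:\posts\old_reply.txt: cannot convert",
    "Permission denied",
    r"iconv: C:\posts\another.txt: cannot convert",
    "",
])

print(keep_pathnames(sample))
```

Each remaining line is a bare pathname, which is what the second replacement then wraps into an echo / xxd.exe command.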
Best Regards,
guy038
-
-
Hello, All,
In my last post, I forgot some important points ! So, refer to this corrected version below and forget the previous one !
When I began to “follow” Notepad++, in 2012, I was using N++ ANSI versions. At that time, N++'s forums were located on SourceForge.Net and, of course, my first replies to users were stored as ANSI files.
Recently, I was wondering how many files, among my roughly 1,800 posts stored on my laptop, have a true ANSI encoding, in order to re-encode them as UTF-8 with the two commands :
- Encoding > Convert to UTF-8
- File > Save
I succeeded in finding a way to achieve it ! You just need two DOS tools : iconv.exe and xxd.exe
Notes :
- Using the xxd.exe program alone, which is a hexdump utility, is not enough to determine the correct encoding. In fact, it can identify, without error :
    - True UTF-8-BOM files, with the first three bytes EFBBBF
    - True UTF-16 BE BOM files, with the first two bytes FEFF
    - True UTF-16 LE BOM files, with the first two bytes FFFE
  But it cannot show, at first sight, any difference between an ANSI and a UTF-8 file without a BOM, unless scanning the files completely !
- Remember also that, if all bytes of a file have a value below \x80, Notepad++ considers that it’s a pseudo UTF-8 file, by default !
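The second note can be illustrated with a short Python sketch. This is only an illustration, with Windows-1252 assumed as the ANSI code page and is_pure_ascii a hypothetical name. When every byte is below \x80, the ANSI and UTF-8 readings of the file give exactly the same text, so there is nothing to convert.

```python
def is_pure_ascii(data: bytes) -> bool:
    """True when every byte of data has a value below 0x80."""
    return all(byte < 0x80 for byte in data)


ascii_sample = b"Hello, All,\r\n"

# For pure-ASCII bytes, decoding as ANSI (cp1252 assumed) or as UTF-8
# yields the same characters: the two encodings cannot be told apart.
same_text = ascii_sample.decode("cp1252") == ascii_sample.decode("utf-8")

print(is_pure_ascii(ascii_sample), same_text)
```

So only files containing at least one byte between \x80 and \xFF are worth checking further.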
Thus,
- Download the binaries, dependencies and documentation archives of the iconv.exe program, from :
https://gnuwin32.sourceforge.net/packages/libiconv.htm
- Move to the directory where you extracted the iconv.exe program and its 3 dependencies : libcharset1.dll, libiconv2.dll and libintl3.dll
- Now, in this same directory, download the xxd archive from :
https://sourceforge.net/projects/xxd-for-windows/files/xxd-1.11_win32(static).zip/download
- Extract the xxd.exe file
- Open a DOS window
- Run, first, the command chcp 1252 ( in order to get an ANSI output for this DOS window ! )
- Move to the folder containing both iconv.exe and xxd.exe
- Then, run the command
iconv -f UTF-8 -t UTF-8 <Full_Directory_Path>\*.* 1> NUL 2>> results.txt
Of course, you must replace the <Full_Directory_Path> zone by the location of the folder containing all the files to scan for encoding !
=> With this command, you are certain that the OUTPUT file ( results.txt ) contains only files which are not presently encoded in UTF-8 ( so, in N++, they can be true ANSI files, with some bytes between \x80 and \xFF, or true UTF-16 files with a BOM )
- Now, open the results.txt file in N++
- Open the Replace dialog ( Ctrl + H )
    - SEARCH (?-si)^Permission denied\R|^.*iconv:\x20|: cannot convert$
    - REPLACE Leave EMPTY
    - Tick the Wrap around option
    - Select the Regular expression search mode
    - Click on the Replace All button
=> We keep the full pathnames, only
- Save the results.txt file
- Open, again, the Replace dialog ( Ctrl + H )
    - SEARCH (?-s)^.+
    - REPLACE \( echo | set /p="$0 " & xxd.exe -l2 -u "$0" \) >> results.txt
    - Tick the Wrap around option
    - Select the Regular expression search mode
    - Click on the Replace All button
- Now, add the line @echo OFF at the very beginning of the file
- And rename this file as results.bat
- If the current encoding of this batch file is UTF-8, run the Encoding > Convert to ANSI option ( IMPORTANT, as the DOS window is ANSI too )
- Save this new results.bat file
- Run the results.bat file, in the DOS window
- Back to Notepad++, open the new results.txt file
- Open the Replace dialog ( Ctrl + H )
    - SEARCH (?-s)^.+(FFFE|FEFF).+\R|\x2000000000:.+
    - REPLACE Leave EMPTY
    - Tick the Wrap around option
    - Select the Regular expression search mode
    - Click on the Replace All button
=> We get rid of the UTF-16 files which have a BOM equal to FEFF or FFFE
- Save the results.txt file
=> Thus, you should get, only, the true ANSI files, containing some bytes between \x80 and \xFF, that still need to be re-encoded as UTF-8 or UTF-8-BOM !
- Delete the results.bat file
- Finally, close the DOS window
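For readers who have Python at hand, the logic of the whole procedure above can be sketched as follows. This is not the iconv.exe / xxd.exe method itself, only an approximation under stated assumptions : Python's strict UTF-8 decoder stands in for iconv -f UTF-8 -t UTF-8 (the two may differ on rare edge cases), and the names is_true_ansi and find_ansi_files are hypothetical.

```python
import codecs
from pathlib import Path
from typing import List

# The two UTF-16 BOMs that the last regex removes from the list: FEFF and FFFE.
UTF16_BOMS = (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)


def is_true_ansi(data: bytes) -> bool:
    """True if data is not valid UTF-8 and does not start with a UTF-16 BOM."""
    try:
        # Strict decoding plays the role of: iconv -f UTF-8 -t UTF-8
        data.decode("utf-8")
        return False  # valid UTF-8, which includes pure ASCII and UTF-8-BOM files
    except UnicodeDecodeError:
        pass
    # Not UTF-8: discard the files that carry a UTF-16 BOM.
    return not data.startswith(UTF16_BOMS)


def find_ansi_files(folder: str) -> List[str]:
    """List the files located directly in folder that look like true ANSI files."""
    return [
        str(path)
        for path in sorted(Path(folder).iterdir())
        if path.is_file() and is_true_ansi(path.read_bytes())
    ]
```

A call such as find_ansi_files(r"C:\posts"), where the folder name is just an example, would return the pathnames that still need the Encoding > Convert to UTF-8 command. Like the *.* pattern of the iconv command, it does not look into sub-folders.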
Best Regards,
guy038
-