Hello, @peterjones, @alan-kilborn and All,
Following your last two posts, Alan and Peter, I asked myself : what is the distribution of all my text files, regarding their encoding ?
I considered, as text files, all the files with the following main extensions, in order of importance :
txt, py, html, htm, xml, ini, msg, csv, log, as well as a few other files with rare extensions
Now, using the iconv.exe utility to get all the NON-UTF8 files and, then, the xxd.exe software to omit the UTF-16 encoded files, I was able, little by little, to restrict my list to about 360 files whose encoding I could possibly change from ANSI to UTF-8 !
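For readers without iconv.exe and xxd.exe at hand, the same kind of sorting can be roughly sketched in plain Python. This is not the method I used, just a minimal sketch : the names `classify_encoding` and `classify_file` are hypothetical, any file which is not valid UTF-8 and has no BOM is simply assumed to be ANSI, pure ASCII files are reported as UTF-8, and UTF-32 files are not considered at all.

```python
import codecs


def classify_encoding(data):
    """Return a rough encoding label for the raw bytes of a text file."""
    if len(data) == 0:
        return 'UTF-8 (0 byte)'
    # Check the Byte Order Marks first
    if data.startswith(codecs.BOM_UTF8):
        return 'UTF-8 BOM'
    if data.startswith(codecs.BOM_UTF16_LE):
        return 'UTF-16 LE BOM'
    if data.startswith(codecs.BOM_UTF16_BE):
        return 'UTF-16 BE BOM'
    # No BOM : the file is UTF-8 only if all its bytes decode as UTF-8
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return 'ANSI'    # ASSUMPTION : any other content is a single-byte code page
    return 'UTF-8'


def classify_file(path):
    """Read a file in binary mode and classify its content."""
    with open(path, 'rb') as f:
        return classify_encoding(f.read())
```

Calling `classify_file` on each candidate path and keeping the paths reported as 'ANSI' would give a list comparable to the one described above.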
Of course, opening all the files, one at a time, in N++, changing their encoding and saving them seemed rather tedious. Thus, I used a simple Python script to achieve this task easily :
'''
NAME   : Move_to_UTF8_encoding.py
REMARK : The function 'npp_get_statusbar' is an idea of @alan-kilborn
This script :
  - Opens a file which contains a list of ABSOLUTE file-paths
  - Reads, successively, the file-paths from that list
  - Opens EACH file in N++
  - Performs the 'Convert to UTF-8' action on the CURRENT opened ANSI file
  - Saves and closes EACH file, one at a time
NOTES :
  - The file, containing the list of ABSOLUTE file-paths to OPEN, is a UTF-8 encoded file, with 'Windows' EOL
  - This list must NOT contain EMPTY or BLANK lines
  - But, any line beginning with the '#' character is simply IGNORED ( So begin any EMPTY line or COMMENT line with a '#' char ! )
  - The PATHS are written with a SINGLE BACKSLASH character ( Ex : D:\Dir_1\Dir_2\Name.txt ). NO need to DOUBLE the BACKSLASH ( \\ )
  - In the same way, NO need to SURROUND the file-paths, containing SPACE characters, with DOUBLE-QUOTES
  - This list may contain some ACCENTED characters
'''
from Npp import *
import time
import ctypes
from ctypes.wintypes import BOOL, HWND, WPARAM, LPARAM, UINT

# ----------------------------------------------------------------------------------------------------------------------------------------------------
# From @alan-kilborn, in post https://community.notepad-plus-plus.org/topic/21733/pythonscript-different-behavior-in-script-vs-in-immediate-mode/4
# ----------------------------------------------------------------------------------------------------------------------------------------------------
# Defined ONCE, before the loop, instead of being re-defined for each file
def npp_get_statusbar(statusbar_item_number):

    WNDENUMPROC = ctypes.WINFUNCTYPE(BOOL, HWND, LPARAM)
    FindWindowW = ctypes.windll.user32.FindWindowW
    FindWindowExW = ctypes.windll.user32.FindWindowExW
    SendMessageW = ctypes.windll.user32.SendMessageW
    LRESULT = LPARAM
    SendMessageW.restype = LRESULT
    SendMessageW.argtypes = [ HWND, UINT, WPARAM, LPARAM ]
    EnumChildWindows = ctypes.windll.user32.EnumChildWindows
    GetClassNameW = ctypes.windll.user32.GetClassNameW
    create_unicode_buffer = ctypes.create_unicode_buffer

    SBT_OWNERDRAW = 0x1000
    WM_USER = 0x400; SB_GETTEXTLENGTHW = WM_USER + 12; SB_GETTEXTW = WM_USER + 13

    npp_get_statusbar.STATUSBAR_HANDLE = None

    def get_result_from_statusbar(statusbar_item_number):
        assert statusbar_item_number <= 5
        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTLENGTHW, statusbar_item_number, 0)
        length = retcode & 0xFFFF                # LOW word  : text length, in characters
        item_type = (retcode >> 16) & 0xFFFF     # HIGH word : drawing type of the item
        assert (item_type != SBT_OWNERDRAW)
        text_buffer = create_unicode_buffer(length + 1)    # + 1 for the terminating NUL char written by SB_GETTEXTW
        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTW, statusbar_item_number, ctypes.addressof(text_buffer))
        retval = '{}'.format(text_buffer[:length])
        return retval

    def EnumCallback(hwnd, lparam):
        curr_class = create_unicode_buffer(256)
        GetClassNameW(hwnd, curr_class, 256)
        if curr_class.value.lower() == "msctls_statusbar32":
            npp_get_statusbar.STATUSBAR_HANDLE = hwnd
            return False  # stop the enumeration
        return True  # continue the enumeration

    npp_hwnd = FindWindowW(u"Notepad++", None)
    EnumChildWindows(npp_hwnd, WNDENUMPROC(EnumCallback), 0)
    if npp_get_statusbar.STATUSBAR_HANDLE: return get_result_from_statusbar(statusbar_item_number)
    assert False

console.show()
console.clear()

with open('D:\\Verif.txt') as file:
    for file_path in file:
        file_path = file_path.strip('\n')
        if file_path[0] == "#":                # COMMENT line => IGNORED
            continue
        notepad.open(file_path)
        St_bar = npp_get_statusbar(4)          # Zone 4 ( STATUSBARSECTION.UNICODETYPE )
        if St_bar == 'ANSI':                   # => Conversion to 'UTF-8', without BOM, RECOMMENDED !
            time.sleep(0.5)
            notepad.runMenuCommand("Encoding", "Convert to UTF-8")
            notepad.save()
            time.sleep(0.5)
        notepad.close()                        # Close the file, converted or not
REMARK :
As I was a bit anxious about the time needed for the encoding change and the save action, for each file, I preferred to use timers to make sure that the entire process completes properly but, maybe, these timers are not necessary !
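If the fixed timers ever prove too short or too long, a possible alternative would be to poll a condition until it becomes true. The small helper below is only a sketch : `wait_until` is a hypothetical name, not part of the PythonScript API, and the commented usage line assumes that the status bar stops reporting 'ANSI' once the conversion is done, which I did not verify.

```python
import time


def wait_until(predicate, timeout=2.0, interval=0.05):
    """Call predicate() repeatedly until it returns True or 'timeout' seconds have elapsed."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()    # One last check, once the deadline is reached

# Hypothetical usage, inside the script, in place of a 'time.sleep(0.5)' :
#     wait_until(lambda: npp_get_statusbar(4) != 'ANSI')
```

With such a helper, each file would wait only as long as needed, up to the chosen timeout.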
So, after the various modifications, I got a list of 11,578 files whose distribution, according to their encoding, is as follows :
UTF-8 BOM      :     208               |
UTF-16 LE BOM  :      39               |
UTF-16 BE BOM  :       4               |
UTF-8          :     540  ( 0 byte )   |   =>  10,737 with UNICODE encoding ( 92.7 % )
UTF-8          :   9,946               |
ANSI           :     841
                  ------
TOTAL             11,578
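As a quick check of these figures, the totals and the percentage can be recomputed with a few lines of Python. The counts are the ones listed above; the variable names are just illustrative.

```python
# File counts per encoding, as listed in the table
counts = [
    ('UTF-8 BOM',       208),
    ('UTF-16 LE BOM',    39),
    ('UTF-16 BE BOM',     4),
    ('UTF-8 (0 byte)',  540),
    ('UTF-8',          9946),
    ('ANSI',            841),
]

total = sum(n for _, n in counts)
unicode_total = sum(n for name, n in counts if name != 'ANSI')
share = 100.0 * unicode_total / total    # Percentage of UNICODE encoded files

print('{} / {} = {:.1f} %'.format(unicode_total, total, share))
```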
You will certainly notice that there are still a lot of ANSI files, but most of them are language or configuration files for which changing the encoding is forbidden or, at least, not advisable !
Best Regards,
guy038