Hello, @peterjones, @alan-kilborn and All,
Reading your last two posts, Alan and Peter, I asked myself : what is the distribution of all my text files, regarding their encoding ?
I considered, as text files, all the files with the following main extensions, in order of importance :
.txt, .py, .html, .htm, .xml, .ini, .msg, .csv, .log, as well as a few other files with rare extensions
Now, using the iconv.exe utility to get all the NON-UTF8 files and, then, the xxd.exe software to omit the UTF-16 encoded files, I was able, little by little, to restrict my list to about 360 files whose encoding I could possibly change from ANSI to UTF-8 !
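For those who do not have iconv.exe and xxd.exe at hand, the same kind of sorting can be roughly sketched in plain Python 3, outside of N++. This is only an illustration and not the method I used : the names guess_encoding and classify_file are hypothetical, any file which is not valid UTF-8 and has no BOM is simply labelled 'ANSI', and UTF-16 files without a BOM are NOT detected.

```python
import codecs


def guess_encoding(data):
    """Return a coarse encoding label for the raw bytes of a file.

    Assumption : anything without a BOM that is not valid UTF-8 is
    reported as 'ANSI'. BOM-less UTF-16 is not recognised.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8 BOM"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16 LE BOM"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16 BE BOM"
    try:
        # Pure ASCII and empty content also decode fine as UTF-8
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "ANSI"
    return "UTF-8"


def classify_file(path):
    """Read a file in binary mode and return its coarse encoding label."""
    with open(path, "rb") as handle:
        return guess_encoding(handle.read())
```

Calling classify_file on each path of a list and keeping only the 'ANSI' results would give a candidate list comparable to the one described above, which should still be checked by hand.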
Of course, opening all the files, one at a time, in N++, changing their encoding and saving them seemed rather tedious. Thus, I used a simple python script to achieve this task easily :
```python
r'''
    NAME   : Move_to_UTF8_encoding.py

    REMARK : The function 'npp_get_statusbar' is an idea of @alan-kilborn

    This script :

      - Opens a file which contains a list of ABSOLUTE file-paths
      - Reads, successively, the file-paths from that list
      - Opens EACH file in N++
      - Performs the 'Convert to UTF-8' action on the CURRENT opened ANSI file
      - Saves and closes EACH file, one at a time

    NOTES :

      - The file, containing the list of ABSOLUTE file-paths to OPEN, is a UTF-8 encoded file, with 'Windows' EOL
      - This list must NOT contain EMPTY or BLANK lines
      - But, any line beginning with the '#' character is simply IGNORED
        ( So begin any EMPTY line or COMMENT line with a '#' char ! )
      - The PATHS are written with a SINGLE BACKSLASH character ( Ex : D:\Dir_1\Dir_2\Name.txt ). NO need to DOUBLE the BACKSLASH ( \\ )
      - In the same way, NO need to SURROUND the file-paths, containing SPACE characters, with DOUBLE-QUOTES
      - This list may contain some ACCENTED characters
'''

from Npp import *
import time
import ctypes
from ctypes.wintypes import BOOL, HWND, WPARAM, LPARAM, UINT

console.show()
console.clear()

# ----------------------------------------------------------------------------------------------------------------------------------------------------
#  From @alan-kilborn, in post https://community.notepad-plus-plus.org/topic/21733/pythonscript-different-behavior-in-script-vs-in-immediate-mode/4
# ----------------------------------------------------------------------------------------------------------------------------------------------------

def npp_get_statusbar(statusbar_item_number):

    WNDENUMPROC = ctypes.WINFUNCTYPE(BOOL, HWND, LPARAM)
    FindWindowW = ctypes.windll.user32.FindWindowW
    FindWindowExW = ctypes.windll.user32.FindWindowExW
    SendMessageW = ctypes.windll.user32.SendMessageW
    LRESULT = LPARAM
    SendMessageW.restype = LRESULT
    SendMessageW.argtypes = [ HWND, UINT, WPARAM, LPARAM ]
    EnumChildWindows = ctypes.windll.user32.EnumChildWindows
    GetClassNameW = ctypes.windll.user32.GetClassNameW
    create_unicode_buffer = ctypes.create_unicode_buffer

    SBT_OWNERDRAW = 0x1000
    WM_USER = 0x400; SB_GETTEXTLENGTHW = WM_USER + 12; SB_GETTEXTW = WM_USER + 13

    npp_get_statusbar.STATUSBAR_HANDLE = None

    def get_result_from_statusbar(statusbar_item_number):
        assert statusbar_item_number <= 5
        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTLENGTHW, statusbar_item_number, 0)
        length = retcode & 0xFFFF
        type = (retcode >> 16) & 0xFFFF
        assert (type != SBT_OWNERDRAW)
        text_buffer = create_unicode_buffer(length + 1)  # + 1 for the terminating NUL written by SB_GETTEXTW
        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTW, statusbar_item_number, ctypes.addressof(text_buffer))
        retval = '{}'.format(text_buffer[:length])
        return retval

    def EnumCallback(hwnd, lparam):
        curr_class = create_unicode_buffer(256)
        GetClassNameW(hwnd, curr_class, 256)
        if curr_class.value.lower() == "msctls_statusbar32":
            npp_get_statusbar.STATUSBAR_HANDLE = hwnd
            return False  # stop the enumeration
        return True  # continue the enumeration

    npp_hwnd = FindWindowW(u"Notepad++", None)
    EnumChildWindows(npp_hwnd, WNDENUMPROC(EnumCallback), 0)
    if npp_get_statusbar.STATUSBAR_HANDLE:
        return get_result_from_statusbar(statusbar_item_number)
    assert False


with open('D:\\Verif.txt') as file:

    for file_path in file:

        file_path = file_path.strip('\n')

        if file_path[0] == "#":  # Comment line => IGNORED
            continue

        notepad.open(file_path)

        St_bar = npp_get_statusbar(4)  # Zone 4 ( STATUSBARSECTION.UNICODETYPE )

        if St_bar == 'ANSI':  # => Conversion to 'UTF-8', without BOM, RECOMMENDED !
            time.sleep(0.5)
            notepad.runMenuCommand("Encoding", "Convert to UTF-8")
            notepad.save()

        time.sleep(0.5)
        notepad.close()
```

REMARK :
As I was a bit anxious about the time needed for the encoding change and the save action, for each file, I preferred to use timers to properly secure the entire process but, maybe, these timers are not necessary !

So, after the various modifications, I got a list of 11,578 files whose distribution, according to their encoding, is as follows :
UTF-8 BOM        :     208  |
UTF-16 LE BOM    :      39  |
UTF-16 BE BOM    :       4  |
UTF-8 ( 0 byte ) :     540  |  => 10,737 with UNICODE encoding ( 92.7 % )
UTF-8            :   9,946  |
ANSI             :     841
                 ----------
TOTAL               11,578

You will certainly note that there are still a lot of ANSI files, but most of them are lang or configuration files for which changing the encoding is rather forbidden or, at least, not welcome !
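As a quick cross-check of the counts above, here is a small Python sketch which simply re-adds the figures and recomputes the UNICODE share. The dictionary keys are just labels that I chose for this illustration.

```python
# Counts copied from the distribution above
counts = {
    "UTF-8 BOM": 208,
    "UTF-16 LE BOM": 39,
    "UTF-16 BE BOM": 4,
    "UTF-8 (0 byte)": 540,
    "UTF-8": 9946,
    "ANSI": 841,
}

total = sum(counts.values())                   # all the text files
unicode_total = total - counts["ANSI"]         # every file which is not ANSI
unicode_share = 100.0 * unicode_total / total  # percentage of UNICODE files

print(total, unicode_total, round(unicode_share, 1))  # 11578 10737 92.7
```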
Best Regards,
guy038