Python: Multiple files ANSI to utf-8 converter

Hellena Crainicu

hello, I want to use “Python Script” Plugin as to convert multiple files to UTF-8 (not UTF-8-BOM), on a particular folder.

Can be this done?

Mark Olson

@Hellena-Crainicu
This script (not a PythonScript plugin script, because that’s not really the most effective solution) should do what you want:

'''
This should be used as a script from the terminal.
Relevant documentation:
* https://docs.python.org/3/howto/unicode.html
* https://docs.python.org/3/library/glob.html
* https://docs.python.org/3/library/os.html#module-os
* https://docs.python.org/3/library/argparse.html
example usage:
>python -m encoding_conversion . utf-16 utf-8 *.txt
changing encoding of example2.txt from utf-16 to utf-8
changing encoding of example4.txt from utf-16 to utf-8
changing encoding of example3.txt from utf-16 to utf-8
changing encoding of example1.txt from utf-16 to utf-8
changing encoding of example5.txt from utf-16 to utf-8
>python -m encoding_conversion "example directory" utf-8 utf-16 *.md
changing encoding of new 2.md from utf-8 to utf-16
changing encoding of new 3.md from utf-8 to utf-16
changing encoding of new 1.md from utf-8 to utf-16
'''
import os

def change_encoding(fname, from_encoding, to_encoding='utf-8') -> None:
    '''
    Read the file at path fname with its original encoding (from_encoding)
    and rewrites it with to_encoding.
    '''
    with open(fname, encoding=from_encoding) as f:
        text = f.read()
    with open(fname, 'w', encoding=to_encoding) as f:
        f.write(text)


if __name__ == '__main__':
    import argparse
    import glob
    parser = argparse.ArgumentParser()
    parser.add_argument('dirname',
        help='name of directory in which you want to change file encodings')
    parser.add_argument('old_encoding',
        help='the previous encoding of files found')
    parser.add_argument('new_encoding', nargs='?', default='utf-8',
        help='the new encoding that you want to change to')
    parser.add_argument('include_files', nargs='*',
        help='filename patterns using glob syntax to choose')
    args = parser.parse_args()
    include_files = args.include_files
    if not include_files:
        include_files = ['*.*']
    fnames = set()
    curdir = os.getcwd()
    try:
        os.chdir(args.dirname)
        for glb in include_files:
            fnames.update(glob.glob(glb))
        for fname in fnames:
            print((f'changing encoding of {fname} from '
                  f'{args.old_encoding} to {args.new_encoding}'))
            change_encoding(fname, args.old_encoding, args.new_encoding)
    finally:
        os.chdir(curdir)

Note that this doesn’t use Notepad++ for anything, because it is simpler to get the job done with pure Python.

You could probably modify this script to try to guess the encoding of files, but I’ve tried using automatic encoding detection in Python and it’s pretty hit-or-miss. If you’re really determined to try guessing encoding, try looking at codecs.

Hellena Crainicu

@Mark-Olson said in Python: Multiple files ANSI to utf-8 converter:

Ok, I change the lines. If I understand well enough:

help='d:\\2022_12_02\\word 2\\1') # name of directory in which you want to change file encodings

help='ANSI') # the previous encoding of files found

help='*.txt') # filename patterns using glob syntax to choose

Ok, I run the code in Python directly. Nothing happens…

import os

def change_encoding(fname, from_encoding, to_encoding='utf-8') -> None:
    '''
    Read the file at path fname with its original encoding (from_encoding)
    and rewrites it with to_encoding.
    '''
    with open(fname, encoding=from_encoding) as f:
        text = f.read()
    with open(fname, 'w', encoding=to_encoding) as f:
        f.write(text)


if __name__ == '__main__':
    import argparse
    import glob
    parser = argparse.ArgumentParser()
    parser.add_argument('dirname',
        help='d:\\2022_12_02\\word 2\\1')  # name of directory in which you want to change file encodings
    parser.add_argument('old_encoding',
        help='ANSI') # the previous encoding of files found
    parser.add_argument('new_encoding', nargs='?', default='utf-8',
        help='UTF-8')
    parser.add_argument('include_files', nargs='*',
        help='*.txt')  # filename patterns using glob syntax to choose
    args = parser.parse_args()
    include_files = args.include_files
    if not include_files:
        include_files = ['*.*']
    fnames = set()
    curdir = os.getcwd()
    try:
        os.chdir(args.dirname)
        for glb in include_files:
            fnames.update(glob.glob(glb))
        for fname in fnames:
            print((f'changing encoding of {fname} from '
                  f'{args.old_encoding} to {args.new_encoding}'))
            change_encoding(fname, args.old_encoding, args.new_encoding)
    finally:
        os.chdir(curdir)

Mark Olson

@Hellena-Crainicu
Correct, the intended use of the script (and pretty much any Python script with the line import argparse in it) is not to be modified directly, but rather to be used from the command line with arguments. The changes you made don’t alter the functionality at all, but rather change the help message displayed.

I’ll just repeat the usage examples in my docstring at the beginning of the script.

>python -m encoding_conversion . utf-16 utf-8 *.txt
>python -m encoding_conversion "example directory" utf-8 utf-16 *.md

The former changes all utf-16 encoded text files to utf-8, the latter changes all utf-8 encoded markdown (.md) files to utf-16.

Hellena Crainicu

@Mark-Olson I still have a problem. I change everything on your code, as I post yesterday. I run again today, but I get thie error.

So, this is the code I run today, trying to change txt files from UTF-8-BOM to UTF-8. The error above. Doesn’t work the conversion. Why ? I put the dir name, the encoding, etc…

import os
import glob

def change_encoding(fname, from_encoding, to_encoding='utf-8') -> None:
    '''
    Read the file at path fname with its original encoding (from_encoding)
    and rewrites it with to_encoding.
    '''
    with open(fname, encoding=from_encoding) as f:
        text = f.read()
    with open(fname, 'w', encoding=to_encoding) as f:
        f.write(text)


if __name__ == '__main__':
    import argparse
    import glob  # pip install glob2
    parser = argparse.ArgumentParser()
    parser.add_argument('dirname',
        help='d:\\2022_12_02\\word 2\\1')  # name of directory in which you want to change file encodings
    parser.add_argument('old_encoding',
        help='UTF-8-BOM') # the previous encoding of files found ANSI
    parser.add_argument('new_encoding', nargs='?', default='utf-8',
        help='UTF-8')
    parser.add_argument('include_files', nargs='*',
        help='*')  # filename patterns using glob syntax to choose
    args = parser.parse_args()
    include_files = args.include_files
    if not include_files:
        include_files = ['*.*']
    fnames = set()
    curdir = os.getcwd()
    try:
        os.chdir(args.dirname)
        for glb in include_files:
            fnames.update(glob.glob(glb))
        for fname in fnames:
            print((f'changing encoding of {fname} from '
                  f'{args.old_encoding} to {args.new_encoding}'))
            change_encoding(fname, args.old_encoding, args.new_encoding)
    finally:
        os.chdir(curdir)

Hellena Crainicu

@Mark-Olson Mark Olson: please check my code, and the replacements I made, and tell me what is wrong.

Alan Kilborn

@Hellena-Crainicu

This topic has delved into off-topic land for Notepad++ discussion. I doubt anybody wants to debug your code, but on the offhand chance that Mark does, why don’t you two take this discussion off into a private chat?

Hellena Crainicu

@Alan-Kilborn yes, sure, I will use chat. thanks

PeterJones

@Mark-Olson ,

Note that this doesn’t use Notepad++ for anything,

I appreciate your willingness to help. However, we need to focus this Forum on Notepad++. If it’s something that can be done in PythonScript, and you are interested in providing the solution, please make it compatible with PythonScript. This forum isn’t for “generic” Python code-writing.