    Python: Multiple files ANSI to utf-8 converter

    • Hellena Crainicu

      Hello, I want to use the “Python Script” plugin to convert multiple files in a particular folder to UTF-8 (not UTF-8-BOM).

      Can this be done?

      • Mark Olson

        @Hellena-Crainicu
        This script (not a PythonScript plugin script, because that’s not really the most effective solution) should do what you want:

        '''
        This should be used as a script from the terminal.
        Relevant documentation:
        * https://docs.python.org/3/howto/unicode.html
        * https://docs.python.org/3/library/glob.html
        * https://docs.python.org/3/library/os.html#module-os
        * https://docs.python.org/3/library/argparse.html
        example usage:
        >python -m encoding_conversion . utf-16 utf-8 *.txt
        changing encoding of example2.txt from utf-16 to utf-8
        changing encoding of example4.txt from utf-16 to utf-8
        changing encoding of example3.txt from utf-16 to utf-8
        changing encoding of example1.txt from utf-16 to utf-8
        changing encoding of example5.txt from utf-16 to utf-8
        >python -m encoding_conversion "example directory" utf-8 utf-16 *.md
        changing encoding of new 2.md from utf-8 to utf-16
        changing encoding of new 3.md from utf-8 to utf-16
        changing encoding of new 1.md from utf-8 to utf-16
        '''
        import os
        
        def change_encoding(fname, from_encoding, to_encoding='utf-8') -> None:
            '''
            Read the file at path fname using its original encoding
            (from_encoding) and rewrite it in place using to_encoding.
            '''
            with open(fname, encoding=from_encoding) as f:
                text = f.read()
            with open(fname, 'w', encoding=to_encoding) as f:
                f.write(text)
        
        
        if __name__ == '__main__':
            import argparse
            import glob
            parser = argparse.ArgumentParser()
            parser.add_argument('dirname',
                help='name of directory in which you want to change file encodings')
            parser.add_argument('old_encoding',
                help='the previous encoding of files found')
            parser.add_argument('new_encoding', nargs='?', default='utf-8',
                help='the new encoding that you want to change to')
            parser.add_argument('include_files', nargs='*',
                help='filename patterns using glob syntax to choose')
            args = parser.parse_args()
            include_files = args.include_files
            if not include_files:
                include_files = ['*.*']
            fnames = set()
            curdir = os.getcwd()
            try:
                os.chdir(args.dirname)
                for glb in include_files:
                    fnames.update(glob.glob(glb))
                for fname in fnames:
                    print((f'changing encoding of {fname} from '
                          f'{args.old_encoding} to {args.new_encoding}'))
                    change_encoding(fname, args.old_encoding, args.new_encoding)
            finally:
                os.chdir(curdir)
        

        Note that this doesn’t use Notepad++ for anything, because it is simpler to get the job done with pure Python.
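
        For the original request (ANSI to UTF-8 without a BOM), here is a minimal sketch of calling the same function directly instead of from the command line. It assumes the ANSI files are Windows-1252 (cp1252 in Python), which may not match your code page, and the sample file names are made up for illustration. In Python the utf-8 codec writes no BOM, while utf-8-sig is the codec for the UTF-8-BOM flavor.

```python
import os
import tempfile


def change_encoding(fname, from_encoding, to_encoding='utf-8'):
    # Same logic as the script above: decode with the old encoding,
    # then overwrite the file using the new one.
    with open(fname, encoding=from_encoding) as f:
        text = f.read()
    with open(fname, 'w', encoding=to_encoding) as f:
        f.write(text)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample standing in for an "ANSI" file.
    # ASSUMPTION: the ANSI code page is cp1252 (Windows-1252).
    ansi_path = os.path.join(tmp, 'ansi_sample.txt')
    with open(ansi_path, 'w', encoding='cp1252') as f:
        f.write('caf\u00e9')
    change_encoding(ansi_path, 'cp1252', 'utf-8')
    with open(ansi_path, 'rb') as f:
        ansi_result = f.read()

    # Hypothetical sample standing in for a "UTF-8-BOM" file.
    # Reading with utf-8-sig drops the BOM; writing with utf-8 adds none.
    bom_path = os.path.join(tmp, 'bom_sample.txt')
    with open(bom_path, 'w', encoding='utf-8-sig') as f:
        f.write('caf\u00e9')
    change_encoding(bom_path, 'utf-8-sig', 'utf-8')
    with open(bom_path, 'rb') as f:
        bom_result = f.read()

print(ansi_result)
print(bom_result)
```

        Both files end up as the same plain UTF-8 bytes, with no BOM at the start.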

        You could probably modify this script to try to guess the encoding of files, but I’ve tried automatic encoding detection in Python and it’s pretty hit-or-miss. If you’re really determined to try guessing encodings, try looking at the codecs module.
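
        As a rough illustration of why guessing is hit-or-miss, here is a sketch that checks for a byte-order mark and then falls back to trial decoding. The codecs module does not detect encodings by itself; it only supplies the BOM constants used here. The function name guess_encoding and the default candidate list are hypothetical, and UTF-32 is deliberately not handled.

```python
import codecs

# Byte-order marks from the standard codecs module, paired with the
# Python codec name that can decode data starting with that mark.
BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def guess_encoding(raw, candidates=('utf-8', 'cp1252')):
    """Return a plausible codec name for the bytes in raw, or None.

    Hypothetical helper: a BOM is treated as decisive; otherwise the
    first candidate that decodes without an error wins. The order of
    candidates matters, because single-byte code pages such as cp1252
    accept almost any input.
    """
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


print(guess_encoding('caf\u00e9'.encode('utf-8')))
print(guess_encoding('caf\u00e9'.encode('cp1252')))
print(guess_encoding('hi'.encode('utf-8-sig')))
```

        A successful decode only shows that the bytes are valid in that encoding, not that the text comes out right, so the result is best treated as a hint.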

        • Hellena Crainicu

          @Mark-Olson said in Python: Multiple files ANSI to utf-8 converter:

          OK, I changed the lines. If I understand correctly:

          help='d:\\2022_12_02\\word 2\\1') # name of directory in which you want to change file encodings

          help='ANSI') # the previous encoding of files found

          help='*.txt') # filename patterns using glob syntax to choose

          Then I ran the code in Python directly. Nothing happens…

          import os
          
          def change_encoding(fname, from_encoding, to_encoding='utf-8') -> None:
              '''
              Read the file at path fname with its original encoding (from_encoding)
              and rewrites it with to_encoding.
              '''
              with open(fname, encoding=from_encoding) as f:
                  text = f.read()
              with open(fname, 'w', encoding=to_encoding) as f:
                  f.write(text)
          
          
          if __name__ == '__main__':
              import argparse
              import glob
              parser = argparse.ArgumentParser()
              parser.add_argument('dirname',
                  help='d:\\2022_12_02\\word 2\\1')  # name of directory in which you want to change file encodings
              parser.add_argument('old_encoding',
                  help='ANSI') # the previous encoding of files found
              parser.add_argument('new_encoding', nargs='?', default='utf-8',
                  help='UTF-8')
              parser.add_argument('include_files', nargs='*',
                  help='*.txt')  # filename patterns using glob syntax to choose
              args = parser.parse_args()
              include_files = args.include_files
              if not include_files:
                  include_files = ['*.*']
              fnames = set()
              curdir = os.getcwd()
              try:
                  os.chdir(args.dirname)
                  for glb in include_files:
                      fnames.update(glob.glob(glb))
                  for fname in fnames:
                      print((f'changing encoding of {fname} from '
                            f'{args.old_encoding} to {args.new_encoding}'))
                      change_encoding(fname, args.old_encoding, args.new_encoding)
              finally:
                  os.chdir(curdir)
          
          • Mark Olson

            @Hellena-Crainicu
            Correct, the intended use of the script (and pretty much any Python script with the line import argparse in it) is not to be modified directly, but rather to be used from the command line with arguments. The changes you made don’t alter the functionality at all, but rather change the help message displayed.

            I’ll just repeat the usage examples in my docstring at the beginning of the script.

            >python -m encoding_conversion . utf-16 utf-8 *.txt
            >python -m encoding_conversion "example directory" utf-8 utf-16 *.md
            

            The former changes all utf-16 encoded text files to utf-8, the latter changes all utf-8 encoded markdown (.md) files to utf-16.
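
            To illustrate the point about help text, here is a small sketch using the same four arguments as the script. The help strings are the edited ones from the post above; they only show up in the output of --help. The actual values come from the command line, or from a list passed to parse_args, which is used here to simulate one. The cp1252 value is just an example encoding name.

```python
import argparse


def build_parser():
    # Same argument layout as the script; help= is documentation only.
    parser = argparse.ArgumentParser(prog='encoding_conversion')
    parser.add_argument('dirname', help='d:\\2022_12_02\\word 2\\1')
    parser.add_argument('old_encoding', help='ANSI')
    parser.add_argument('new_encoding', nargs='?', default='utf-8')
    parser.add_argument('include_files', nargs='*')
    return parser


parser = build_parser()

# Simulates: python -m encoding_conversion . cp1252 utf-8 *.txt
args = parser.parse_args(['.', 'cp1252', 'utf-8', '*.txt'])
print(args.dirname, args.old_encoding, args.new_encoding, args.include_files)

# With no arguments at all, argparse prints a usage error and exits,
# no matter what the help strings say.
try:
    parser.parse_args([])
except SystemExit as exc:
    exit_code = exc.code
print(exit_code)
```

            Running the script with no arguments therefore ends in a usage error rather than a conversion, whatever text was placed in help=.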

            • Hellena Crainicu

              @Mark-Olson I still have a problem. I changed everything in your code, as I posted yesterday. I ran it again today, but I get this error.

              [Image: 77176334-bdea-4bd2-9579-1d1e7ec8f7b3-image.png]

              This is the code I ran today, trying to convert .txt files from UTF-8-BOM to UTF-8. I get the error above and the conversion doesn’t work. Why? I put in the directory name, the encoding, etc…

              import os
              import glob
              
              def change_encoding(fname, from_encoding, to_encoding='utf-8') -> None:
                  '''
                  Read the file at path fname with its original encoding (from_encoding)
                  and rewrites it with to_encoding.
                  '''
                  with open(fname, encoding=from_encoding) as f:
                      text = f.read()
                  with open(fname, 'w', encoding=to_encoding) as f:
                      f.write(text)
              
              
              if __name__ == '__main__':
                  import argparse
                  import glob  # pip install glob2
                  parser = argparse.ArgumentParser()
                  parser.add_argument('dirname',
                      help='d:\\2022_12_02\\word 2\\1')  # name of directory in which you want to change file encodings
                  parser.add_argument('old_encoding',
                      help='UTF-8-BOM') # the previous encoding of files found ANSI
                  parser.add_argument('new_encoding', nargs='?', default='utf-8',
                      help='UTF-8')
                  parser.add_argument('include_files', nargs='*',
                      help='*')  # filename patterns using glob syntax to choose
                  args = parser.parse_args()
                  include_files = args.include_files
                  if not include_files:
                      include_files = ['*.*']
                  fnames = set()
                  curdir = os.getcwd()
                  try:
                      os.chdir(args.dirname)
                      for glb in include_files:
                          fnames.update(glob.glob(glb))
                      for fname in fnames:
                          print((f'changing encoding of {fname} from '
                                f'{args.old_encoding} to {args.new_encoding}'))
                          change_encoding(fname, args.old_encoding, args.new_encoding)
                  finally:
                      os.chdir(curdir)
              
              • Hellena Crainicu

                @Mark-Olson Mark Olson: please check my code, and the replacements I made, and tell me what is wrong.

                • Alan Kilborn

                  @Hellena-Crainicu

                  This topic has drifted off-topic for a Notepad++ discussion. I doubt anybody wants to debug your code, but on the off chance that Mark does, why don’t you two take this discussion into a private chat?

                  • Hellena Crainicu

                    @Alan-Kilborn Yes, sure, I will use chat. Thanks.

                    • PeterJones

                      @Mark-Olson ,

                      Note that this doesn’t use Notepad++ for anything,

                      I appreciate your willingness to help. However, we need to focus this Forum on Notepad++. If it’s something that can be done in PythonScript, and you are interested in providing the solution, please make it compatible with PythonScript. This forum isn’t for “generic” Python code-writing.
