Community
    • Login

    Notepad++ encoding auto-detect potential problems

    Scheduled Pinned Locked Moved General Discussion
    7 Posts 4 Posters 3.8k Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • Alan KilbornA
      Alan Kilborn
      last edited by

      So there was an ISSUE that came up recently that I did a bit of trolling on. :-)

      What came out of it was that Notepad++, when loading files, can get a file’s encoding wrong. Of course I’d heard of this issue before, but I never gave it a deeper dive. The issue does seem relatively rare, but apparently detection of a file’s encoding is a difficult thing, with no clear roadmap of a fix in Notepad++.

      Basically what happens is that a user can create and save a UTF-8 file (BOM-less), and when this file is closed and later reloaded, its encoding is detected by Notepad++ as something other than UTF-8. This causes (silent) corruption of the file when further edits/saves happen. It is easy for this to go unnoticed if you are opening a file to make changes in an area of the file that is far away from the point of corruption, and that area contains data that is simply normal “A-z” type text. (I mention UTF-8, but I suppose it is possible for this poor encoding autodetect to happen with other encodings as well.)

      Being someone that primarily works with UTF-8 files, but occasionally has a need to work with files with older encodings, I like Notepad++'s auto-detection of file encoding. But…not so much if it is subject to getting it wrong.

      So I thought, what if I have Notepad++ inform me when it loads a file that is NOT UTF-8? Thus normally I won’t be “bothered”, but if a file encoding is detected as non-UTF-8, I’ll get a popup when the file is loaded. At that point I can ask myself, is this file supposed to be in the encoding it is indicating? If so, all is good, and I dismiss the popup and continue on. If not, well, then I can take appropriate measures to get my file and Notepad++ on the same “page” (“codepage” pun fully intended) concerning its encoding.

      I implemented this with a Pythonscript, I call it EncodingDifferentThanUtf8Alert.py:

      # -*- coding: utf-8 -*-
      from __future__ import print_function
      
      from Npp import *
      import inspect
      import os
      import NotepadGetStatusBar
      
      class EDTU8A(object):
      
          def __init__(self):
              self.already_warned_paths_list = []  # only warn once per file
              self.this_script_name = inspect.getframeinfo(inspect.currentframe()).filename.split(os.sep)[-1].rsplit('.', 1)[0]
              self.ngsb = NotepadGetStatusBar.NGSB()
              notepad.callback(self.bufferactivated_notify, [NOTIFICATION.BUFFERACTIVATED])
              notepad.callback(self.filebeforeclose_notify, [NOTIFICATION.FILEBEFORECLOSE])
      
          def bufferactivated_notify(self, args):
              f = notepad.getBufferFilename(args['bufferID'])
              # can't call notepad.getEncoding() here because it will return Npp.BUFFERENCODING.COOKIE for UTF-8 files and some non UTF-8 files!!
              #  for more info, see https://community.notepad-plus-plus.org/topic/20142/what-is-bufferencoding-cookie
              #  to solve this problem, read what N++ thinks the encoding is off of the status bar
              encoding_on_status_bar = self.ngsb.get_statusbar_by_section(STATUSBARSECTION.UNICODETYPE)
              if encoding_on_status_bar != 'UTF-8' and f not in self.already_warned_paths_list:
                  notepad.messageBox('The encoding for:\r\n\r\n{f}\r\n\r\nis:\r\n\r\n{e}'.format(f=f, e=encoding_on_status_bar),
                      self.this_script_name)
                  self.already_warned_paths_list.append(f)
      
          def filebeforeclose_notify(self, args):
              f = notepad.getBufferFilename(args['bufferID'])
              if f in self.already_warned_paths_list: self.already_warned_paths_list.remove(f)
      
      if __name__ == '__main__': EDTU8A()
      

      I linked the script to auto-load when Notepad++ starts by putting the following in my user startup.py file:

      import EncodingDifferentThanUtf8Alert
      edtu8a = EncodingDifferentThanUtf8Alert.EDTU8A()
      

      Note that this script uses another script, referenced HERE, so that script needs to be present as well.

      Now when I open a file, if its encoding is not detected as UTF-8, I’ll get an alert popup such as the following:

      79db44c6-7580-4c8c-8012-d611c5c788f2-image.png

      Michael VincentM gstaviG 2 Replies Last reply Reply Quote 3
      • Michael VincentM
        Michael Vincent @Alan Kilborn
        last edited by

        @alan-kilborn said in Notepad++ encoding auto-detect potential problems:

        Now when I open a file, if its encoding is not detected as UTF-8, I’ll get an alert popup such as the following:

        ABSOLUTELY BRILLIANT!

        Alan KilbornA 1 Reply Last reply Reply Quote 1
        • Alan KilbornA
          Alan Kilborn @Michael Vincent
          last edited by Alan Kilborn

          @michael-vincent said in Notepad++ encoding auto-detect potential problems:

          ABSOLUTELY BRILLIANT!

          I don’t know about that. :-)
          Desperate times call for desperate measures.
          I’d much rather be able to count on N++ always getting it right !! :-(

          1 Reply Last reply Reply Quote 0
          • gstaviG
            gstavi @Alan Kilborn
            last edited by

            @alan-kilborn said:

            but apparently detection of a file’s encoding is a difficult thing

            Impossible is more accurate. Especially for codepage (character sets) it is a pure guess. It works better for html and other internet files which may explicitly define the encoding within the content. Despite the obvious “chicken and egg” problem.

            As was said many times before, Notepad++ should have an option to be utf-8 biased. First try to decode a file as UTF-8 and only if it fails, because file contains byte sequences that are invalid for utf-8, move to auto detection. Perhaps the auto detection already works that way. I don’t know.

            The next thing should be for a user to define acceptable encodings that will limit auto detection space and warn for files that do not match anything in the set. This would be a generalization of @Alan-Kilborn neat python trick.

            It is unhealthy to work with multiple encodings.
            Working with utf-8 and some other 16 bit encoding when switching between Linux and Windows is quite common. Auto detection is unlikely to confuse 8 bit encoding with 16 bit encoding.
            Working with multiple languages (spoken not software) can unfortunately happen with multiple similar encodings although it shouldn’t.

            Alan KilbornA 2 Replies Last reply Reply Quote 3
            • Alan KilbornA Alan Kilborn referenced this topic on
            • Alan KilbornA Alan Kilborn referenced this topic on
            • Alan KilbornA
              Alan Kilborn @gstavi
              last edited by

              @gstavi said in Notepad++ encoding auto-detect potential problems:

              Auto detection is unlikely to confuse 8 bit encoding with 16 bit encoding.

              Yes, I don’t really have concerns about Notepad++ confusing these… the concern is confusion between BOMless (the usual case) UTF-8 and “older” character set encodings.

              1 Reply Last reply Reply Quote 1
              • Alan KilbornA
                Alan Kilborn @gstavi
                last edited by Alan Kilborn

                @gstavi said in Notepad++ encoding auto-detect potential problems:

                Notepad++ should have an option to be utf-8 biased. First try to decode a file as UTF-8 and only if it fails, because file contains byte sequences that are invalid for utf-8, move to auto detection.

                I like that. :-)

                Perhaps the auto detection already works that way. I don’t know.

                My guess is that it isn’t currently that way, or it is that way and there’s a bug with it.

                The next thing should be for a user to define acceptable encodings that will limit auto detection space and warn for files that do not match anything in the set.

                This is also a very good idea.

                1 Reply Last reply Reply Quote 1
                • Alan KilbornA Alan Kilborn referenced this topic on
                • Robert CarnegieR
                  Robert Carnegie
                  last edited by

                  I see at Github that Notepad++ can have a preference setting stored not to “auto detect encoding”. The effect seems to be that a file is treated as “UTF-8 without BOM”, including your “test.txt” file.

                  Also in https://npp-user-manual.org/docs/command-prompt/

                  launch settings now exist to choose a different location of stored settings (allowing alternative settings,), to launch a second or third Notepad++ window when one is open already, which may or may not have independent settings, and to set a “language” (programming language) and “localization” (whose effect I don’t know). So you could perhaps run “normal” Notepad++ alongside “only UTF-8” Notepad++.

                  Also, option: -qf=“D:\path to\file”
                  “Launch ghost typing to display a file content”
                  types the file into the open or “new 1” tab possibly treating any file as UTF-8 - it does not mutilate your “test.txt”. However, this is slow and it does not really solve any problem. The “Encoding” menu does not seem to help, either.

                  And none of this means that there isn’t a bug. On the other hand… I expect it’s possible to create a text file which really can’t have its encoding auto detected. Strictly, without assuming an encoding first, you can’t interpret a file.

                  1 Reply Last reply Reply Quote 0
                  • Robert CarnegieR Robert Carnegie referenced this topic on
                  • First post
                    Last post
                  The Community of users of the Notepad++ text editor.
                  Powered by NodeBB | Contributors