    Cannot change Encoding to correct encoding of UTF-8

    • Tom Sasson

      Hello fellow NPP’ers,
      I read many posts of this type, but still haven’t found a solution to my issue.
      I have a .py file, which is saved in PyCharm using UTF-8 encoding.
      The file has code + Hebrew within some string assignments, like:
      DOWNLOAD_FOLDER = r"C:\Users\MYUSER\Documents\עבודה\SOMFOLDER\תלושים"
      When I open this file in NPP, I get Chinese letters instead of the Hebrew letters, and the default encoding is GB2312 (Simplified).

      When I try to change the encoding with Encoding->UTF-8, nothing happens, and when I try Encoding->Character Sets->Hebrew->Windows-1255/OEM 862/ISO 8859-8, I get gibberish.

      Under Settings->Style Configurator->Default Style, my font is “Courier New”.

      Please assist me in solving this issue.

      Notepad++ v8.8.2 (32-bit)
      Build time : Jun 26 2025 - 01:11:16
      Scintilla/Lexilla included : 5.5.7/5.4.5
      Boost Regex included : 1_85
      Path : C:\Program Files (x86)\Notepad++\notepad++.exe
      Command Line : “C:\Users\MYUSER\Documents\עבודה\SOMFOLDER\my_file.py”
      Admin mode : OFF
      Local Conf mode : OFF
      Cloud Config : OFF
      Periodic Backup : ON
      Placeholders : OFF
      Scintilla Rendering Mode : SC_TECHNOLOGY_DEFAULT (0)
      Multi-instance Mode : monoInst
      File Status Auto-Detection : cdEnabledNew (for current file/tab only)
      Dark Mode : OFF
      Display Info :
      primary monitor: 1920x1080, scaling 100%
      visible monitors count: 2
      installed Display Class adapters:
      0001: Description - Intel® HD Graphics 520
      0001: DriverVersion - 27.20.100.8477
      OS Name : Windows 10 Enterprise (64-bit)
      OS Version : 22H2
      OS Build : 19045.6036
      Current ANSI codepage : 1252
      Plugins :
      DSpellCheck (1.4.6)
      mimeTools (3.1)
      NppConverter (4.6)
      NppExport (0.4)

      • Coises

        @Tom-Sasson:

        I am not sure this will work, but it’s easy enough to try:

        Open Settings | Preferences… | MISC. and look for Autodetect character encoding. If that box is checked, un-check it and then try opening the file.

        Also (in case that doesn’t do it), what is your selection at Settings | Preferences… | New Document: Encoding? In particular, if you have selected UTF-8 but have not checked Apply to opened ANSI files, try checking that.

        • Coises @Tom Sasson

          @Tom-Sasson said in Cannot change Encoding to correct encoding of UTF-8:

          When I open this file in NPP, I get chinese letters instead of the Hebrew letters, and the default encoding is GB2312 (Simplified).

          When I try to change encoding by: Encoding->UTF-8, nothing happens

          When I think about it… this doesn’t seem like it should be. Some questions:

          1. Are you selecting Encoding | UTF-8 or Encoding | Convert to UTF-8? You said the former, but I want to be sure, because there is a difference.

          2. When you attempt to change the encoding, I assume you mean the text display does not change. In the status bar near the bottom right, where it shows GB2312 after opening the file, does that change to UTF-8 when you try to change the encoding?

          3. Is it possible for you to upload a copy of the file somewhere? (Understood that might not be acceptable to you for privacy reasons, or might be inconvenient for you.) I could be confused, but it seems to me like a bug if the file is not being re-interpreted as UTF-8 when you select that option, even if it was mis-identified as GB2312. I’d like to attempt to reproduce it and maybe see if I can figure out why it’s not changing. (Someone else with better knowledge might jump in here and clarify, though.)

          • PeterJones @Tom Sasson

            @Tom-Sasson,

            DOWNLOAD_FOLDER = r"C:\Users\MYUSER\Documents\עבודה\SOMFOLDER\תלושים"
            

            Unfortunately, that alone isn’t enough to replicate the problem for us. If I save that line of code into a UTF-8-encoded file, and open it in Notepad++, it is properly recognized and interpreted as UTF-8.

            As @Coises suggested, if you could share enough code that we could replicate it, it would be great. (See below for a way to hexdump the file, to make sure we can create the same file as you, even if we don’t have PyCharm available.)

            If there is proprietary information in your code, could you pare it down to the minimum UTF-8 file that you can create in PyCharm that shows the same problem when loaded in Notepad++? If not, you will have to do more investigation on your side, with us guessing at how to direct you. As @Coises suggested, my first step would be to turn off the autodetect, as no encoding autodetection works 100% of the time. Once that’s off, you might also want to go to Settings > Preferences > New Document, choose UTF-8 in the Encoding box, and check ☑ Apply to opened ANSI files, to see whether that helps.

            Normally, with the symptoms you gave, I would guess that Notepad++ saw too many “invalid” UTF-8 sequences. UTF-8 allows only a limited set of high-order byte sequences: for example, after a normal ASCII character, a high-order byte will never be less than 0xC0, and there are a bunch more rules like that. The table in the Wikipedia UTF-8 article gives the rules for valid UTF-8 high-byte sequences.
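            To make those rules concrete, here is a minimal sketch that classifies a single byte by the role it can play in UTF-8. The function name utf8_byte_role and the role labels are made up for this illustration; the byte ranges follow the UTF-8 definition in RFC 3629, under which 0xC0, 0xC1, and 0xF5 and above never appear in well-formed text.

```python
def utf8_byte_role(byte: int) -> str:
    """Classify one byte value (0-255) by the role it can play in UTF-8.

    Hypothetical helper for illustration; ranges follow RFC 3629.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a byte value in 0..255")
    if byte <= 0x7F:
        return "ascii"          # single-byte character
    if byte <= 0xBF:
        return "continuation"   # may only follow a lead byte
    if byte in (0xC0, 0xC1) or byte >= 0xF5:
        return "invalid"        # never appears in well-formed UTF-8
    if byte <= 0xDF:
        return "lead2"          # starts a 2-byte sequence
    if byte <= 0xEF:
        return "lead3"          # starts a 3-byte sequence
    return "lead4"              # 0xF0-0xF4 starts a 4-byte sequence


# The three bytes that encode "\" followed by the Hebrew letter ayin.
for b in (0x5C, 0xD7, 0xA2):
    print(f"{b:02x} -> {utf8_byte_role(b)}")
```

            For the bytes 5c d7 a2 this reports ascii, lead2, continuation, which is the pattern a valid two-byte Hebrew letter after a backslash should show.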

            Whether you are providing us with a minimal file to replicate, or whether you are investigating proprietary files on your own, the next thing I would suggest is a way to hexdump the file, so you can see what bytes PyCharm is writing to disk versus how Notepad++ is interpreting those bytes. Since you mentioned PyCharm, I am assuming you have Python 3 installed on your computer, so save the following as hexdump.py:

            import sys

            # Read the whole file as raw bytes, exactly as stored on disk.
            with open(sys.argv[1], 'rb') as f:
                data = f.read()

            # Print 16 bytes per row: offset, hex values, printable-ASCII view.
            for offset in range(0, len(data), 16):
                chunk = data[offset : offset + 16]
                hex_values = ' '.join(f'{byte:02x}' for byte in chunk)
                ascii_values = ''.join(chr(byte) if 32 <= byte <= 126 else '.' for byte in chunk)
                print(f'{offset:08x} {hex_values:<48} |{ascii_values}|')
            

            Then, if your file created by PyCharm is called tom.py, run python hexdump.py tom.py:

            C:\usr\local\share\TempData\Npp>python hexdump.py tom.py
            00000000 44 4f 57 4e 4c 4f 41 44 5f 46 4f 4c 44 45 52 20  |DOWNLOAD_FOLDER |
            00000010 3d 20 72 22 43 3a 5c 55 73 65 72 73 5c 4d 59 55  |= r"C:\Users\MYU|
            00000020 53 45 52 5c 44 6f 63 75 6d 65 6e 74 73 5c d7 a2  |SER\Documents\..|
            00000030 d7 91 d7 95 d7 93 d7 94 5c 53 4f 4d 46 4f 4c 44  |........\SOMFOLD|
            00000040 45 52 5c d7 aa d7 9c d7 95 d7 a9 d7 99 d7 9d 22  |ER\............"|
            

            If this is the “minimal file” with no proprietary info, then paste that hexdump along with the raw text of the file when you reply.
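            Going the other way, a pasted hexdump can be turned back into the original bytes. The following is a rough sketch, assuming the dump uses exactly the layout printed by hexdump.py above (offset, hex bytes, then the ASCII column between | characters); bytes_from_hexdump is a hypothetical helper name, not part of any tool.

```python
def bytes_from_hexdump(dump: str) -> bytes:
    """Rebuild raw bytes from text in the 'offset hex... |ascii|' layout.

    Hypothetical helper; assumes the format printed by hexdump.py.
    """
    out = bytearray()
    for line in dump.splitlines():
        if not line.strip():
            continue  # skip blank lines in the pasted text
        # Everything before the first '|' is the offset plus the hex bytes.
        hex_part = line.split("|", 1)[0]
        tokens = hex_part.split()
        # tokens[0] is the offset; the rest are two-digit hex byte values.
        out.extend(int(tok, 16) for tok in tokens[1:])
    return bytes(out)


sample = "00000000 5c d7 a2 22                                      |\\..\"|"
data = bytes_from_hexdump(sample)
print(data)                  # the four raw bytes
print(data.decode("utf-8"))  # decodes cleanly if the bytes are valid UTF-8
```

            Writing the result to disk with open(name, 'wb') would give a byte-identical copy of the dumped file to open in Notepad++.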

            If you are just investigating on your own, then look for the first place where you have a Hebrew character in PyCharm but it shows up as Chinese (or whatever) in Notepad++. For example, near the end of line 3 in the dump is 5c (the \ backslash), followed by d7 a2, which is ע (mixed LTR and RTL is really confusing to me, so that surprised me at first). Anyway, you can look for “problematic” sequences, like any time a small byte (00-7F) is followed by a “medium” byte (80-BF), which isn’t allowed, or any time a high byte (C0-F7) is followed by anything other than a “medium” byte. Maybe you’ll find that PyCharm is saving invalid UTF-8 (though I doubt it).
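            Rather than scanning the dump by eye, Python's own strict UTF-8 decoder can point at the first offending byte. This is only a sketch: first_invalid_utf8 is a hypothetical name, and it reports just the first error, using the start and reason attributes that UnicodeDecodeError provides.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, reason) for the first decoding error, or None if valid.

    Hypothetical helper; relies on Python's strict UTF-8 decoder.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that could not be decoded.
        return exc.start, exc.reason
    return None


good = bytes.fromhex("5c d7 a2 d7 91")  # backslash + two valid Hebrew letters
bad = bytes.fromhex("5c a2 d7 91")      # "medium" byte right after ASCII
print(first_invalid_utf8(good))
print(first_invalid_utf8(bad))
```

            To check a real file, pass it the bytes read with open('tom.py', 'rb'); the returned offset can then be looked up in the hexdump.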
