Encoding says file is UTF8 but it's not

5 Posts 3 Posters 3.9k Views
  • A
    asbyonejj
    last edited by Apr 6, 2020, 6:17 AM

    I’m using Notepad++ 7.8.5 (32-bit). The encoding indicator always says the file is UTF-8.
    But if I check with the Linux file command, it isn’t.

    # file aaa.sh
    aaa.sh: Bourne-Again shell script, ASCII text executable
    

    It seems to be a bug.

    • A
      asbyonejj
      last edited by Apr 6, 2020, 6:39 AM

      Even after I do “Convert to ANSI”, it still shows UTF-8.
      Does anyone else see this problem with version 7.8.5?
      I want to know whether this happens only to me…

      • G
        gstavi
        last edited by Apr 6, 2020, 7:03 AM

        It is not clear what your goal is. I assume you want the file to be ANSI and Notepad++ to show it as such.

        You need some background on Unicode encoding.

        UTF-8 extends ASCII, the 0-127 range shared by all ANSI code pages, in a backward-compatible fashion.
        Every file that contains only those 7-bit characters is also valid UTF-8, but NOT vice versa.
        Rest assured that if your UTF-8 file only contains symbols in the range 0-127, then it is also ANSI.
        Just make sure it does not have a BOM in case it needs to be consumed by older applications that fail to skip it.
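The points above can be checked with a short Python sketch. It is only an illustration, not how Notepad++ works internally: the sample strings, the choice of `cp1252` as the "ANSI" code page, and the helper name `strip_bom` are all assumptions made for the example.

```python
import codecs

# Pure 7-bit content, similar to a small shell script.
ascii_bytes = "#!/bin/bash\necho hello\n".encode("ascii")

# Bytes in the 0-127 range decode to the same text under UTF-8
# and under a Windows code page (cp1252 is assumed here).
same_text = ascii_bytes.decode("utf-8") == ascii_bytes.decode("cp1252")

# A byte above 127 taken from a Windows code page is generally
# not a valid UTF-8 sequence on its own.
ansi_bytes = "caf\u00e9".encode("cp1252")  # b'caf\xe9'
try:
    ansi_bytes.decode("utf-8")
    ansi_is_valid_utf8 = True
except UnicodeDecodeError:
    ansi_is_valid_utf8 = False


def strip_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) if present (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Here `same_text` should end up true and `ansi_is_valid_utf8` false, which matches the one-way compatibility described above.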

        Basically, the term ANSI, in the context of text encoding, should stop being used. People should be educated that a requirement for “ANSI” is, in fact, a requirement for “UTF8”.

        • G
          guy038
          last edited by guy038 Apr 6, 2020, 6:46 PM Apr 6, 2020, 11:20 AM

          Hello, @asbyonejj, @gstavi and All,

          Reminders :

          • An ANSI encoded file is generally a file with a one-byte encoding, from Windows-1250 to Windows-1258, which encodes 256 characters, divided in two parts :

            • Characters with byte value between \x00 and \x7F ( from 0 to 127 ), coded with 1 byte, which belong to the old US-ASCII encoding

            • Characters with byte value between \x80 and \xFF ( from 128 to 255 ), which are specific to each one-byte encoding, like, for instance, the ISO-... encodings family, the OEM-.... encodings family and the Windows-.... encodings family, for the best-known
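To make the two halves of a one-byte encoding concrete, here is a small Python sketch. The helper name `decode_with` and the two code pages picked for the comparison are assumptions for illustration only.

```python
def decode_with(raw: bytes, codec_names):
    """Decode the same bytes under several code pages (hypothetical helper)."""
    return {name: raw.decode(name) for name in codec_names}


code_pages = ["cp1251", "cp1252"]  # Windows-1251 (Cyrillic), Windows-1252 (Western)

# A byte in the lower half (0-127) means the same character everywhere.
low = decode_with(b"A", code_pages)

# A byte in the upper half (128-255) depends on the code page in use.
high = decode_with(b"\xe9", code_pages)

for name in code_pages:
    print(name, repr(low[name]), repr(high[name]))
```

The byte `\xE9` comes out as a Latin letter under Windows-1252 and as a Cyrillic letter under Windows-1251, while `A` is unaffected.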

          For an almost exhaustive list of encodings, refer to the table, at :

          https://en.wikipedia.org/wiki/Windows-1252#External_links

          • A UTF-8 encoded file, thanks to the clever UTF-8 encoding ( Universal Character Set Transformation Format - 8 bits ), can encode any of the 1,114,112 possible Unicode code-points with :

            • 1 byte if the Unicode code-point of this character is between \x{0000} and \x{007F}

            • 2 bytes if the Unicode code-point of this character is between \x{0080} and \x{07FF}

            • 3 bytes if the Unicode code-point of this character is between \x{0800} and \x{FFFF}

            • 4 bytes if the Unicode code-point of this character is between \x{10000} and \x{10FFFF}
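The byte-length rule can be written down as a short Python sketch and cross-checked against Python's own UTF-8 encoder. The function name `utf8_length` and the sample code-points are assumptions for the example; the upper bound used is U+10FFFF, the last Unicode code-point.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for one code-point (hypothetical helper)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code-point")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


# Sample code-points: 'A', e-acute, euro sign, and an emoji.
samples = [0x41, 0xE9, 0x20AC, 0x1F600]
for cp in samples:
    encoded = chr(cp).encode("utf-8")
    print(f"U+{cp:04X}: rule says {utf8_length(cp)}, encoder gives {len(encoded)}")
```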

          However, note that the latest Unicode v13.0 release contains only 283,506 assigned characters, and there are still 830,606 unassigned code-points, reserved for future use !

          https://www.unicode.org/versions/stats/charcountv13_0.html


          You may have noticed that the encoding of the first 128 characters, between \x{0000} and \x{007F}, is identical in an ANSI and in a UTF-8 encoded file

          So, if your file contains only characters with Unicode code-point lower than \x{0080} and your file is not UTF-8-BOM encoded, it’s impossible for any editor, including Notepad++, to guess whether the user assumes an ANSI or a UTF-8 encoding. See :

          http://www.unicode.org/charts/PDF/U0000.pdf
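The ambiguity can be shown with a minimal Python sketch of a byte-level classifier. This is not Notepad++'s actual detection logic; the function name `classify` and its four labels are made up for the example.

```python
import codecs


def classify(data: bytes) -> str:
    """Roughly classify raw file bytes (hypothetical helper, not N++'s algorithm)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-bom"   # the BOM settles the question
    if all(b < 0x80 for b in data):
        return "ambiguous"   # only 0-127: valid as ANSI and as UTF-8
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf-8"   # probably some one-byte code page
    return "utf-8"           # contains valid multi-byte sequences


print(classify(b"#!/bin/bash\necho hello\n"))
print(classify("caf\u00e9".encode("utf-8")))
print(classify("caf\u00e9".encode("cp1252")))
```

A script like the original poster's `aaa.sh` falls into the "ambiguous" case, so the label an editor shows for it comes from a default setting rather than from the file's bytes.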


          Thus, in order to remove the ambiguity and help N++ decide on the right user-encoding, you have two possibilities :

          • Include in your file at least 1 character with code-point above \x{007F}

          • UN-tick the option Apply to opened ANSI files in Settings > Preferences... > New Document > Encoding, if the UTF-8 option is ticked

          This way, after saving your file with the ANSI encoding ( Encoding > Convert to ANSI ), it will keep this ANSI encoding for future N++ startups ;-))

          Best Regards,

          guy038

          • A
            asbyonejj @guy038
            last edited by Apr 6, 2020, 11:43 PM

            @guy038 said in Encoding says file is UTF8 but it's not:

            UN-tick the option Apply to opened ANSI files in Settings > Preferences… > New Document > Encoding, if the UTF-8 option is ticked

            This was my case. I didn’t know about this option. Thank you.

            The Community of users of the Notepad++ text editor.