    Encoding says file is UTF8 but it's not

    • asbyonejj

      I’m using 7.8.5 (32bit). The encoding info always says the file is UTF8.
      But if I check it with the Linux file command, it isn’t.

      # file aaa.sh
      aaa.sh: Bourne-Again shell script, ASCII text executable
      

      It seems to be a bug.

      • asbyonejj

        Even after I use “Convert to ANSI”, it still shows UTF8.
        Does anyone else see this problem with version 7.8.5?
        I want to know if it only happens to me…

        • gstavi

          It is not clear what your goal is. I assume you want a file to be ANSI and for Notepad++ to show it as such.

          You need some background about unicode encoding.

          UTF8 extends ASCII (the 0-127 range that every ANSI code page shares) in a backward compatible fashion.
          Every ASCII file is also UTF8 by definition but NOT vice versa.
          Rest assured that if your UTF8 file only contains symbols in the range of 0-127 then it is also ANSI.
          Just make sure it does not have a BOM in case it needs to be consumed by older applications that fail to skip it.

          Basically the term ANSI, in the context of text encoding, should stop being used. People should be educated that a requirement for “ANSI” is in fact a requirement for “UTF8”.
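
          A minimal Python sketch of the points above, assuming Windows-1252 (`cp1252`) stands in for "ANSI" and using a made-up two-line script as sample data. The helper `is_valid_utf8` is a hypothetical name for illustration. The sketch compares the bytes of ASCII-only text in both encodings, shows that a character above 127 breaks the equivalence, and shows why a BOM can trip up a consumer that expects `#!` as the first two bytes.

```python
import codecs

script = "#!/bin/bash\necho hello\n"  # sample content, ASCII only

# Characters in the 0-127 range produce the same bytes in both encodings.
utf8_bytes = script.encode("utf-8")
ansi_bytes = script.encode("cp1252")
same_bytes = utf8_bytes == ansi_bytes

# A character above 127 is encoded differently.
accented_ansi = "caf\u00e9".encode("cp1252")  # one byte for the accented letter
accented_utf8 = "caf\u00e9".encode("utf-8")   # two bytes for the accented letter


def is_valid_utf8(data: bytes) -> bool:
    """Return True when the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A UTF-8 BOM is three extra bytes (EF BB BF) in front of the content.
with_bom = codecs.BOM_UTF8 + utf8_bytes
starts_with_shebang = with_bom.startswith(b"#!")

# The "utf-8-sig" codec skips a leading BOM while decoding.
without_bom = with_bom.decode("utf-8-sig").encode("utf-8")

print(same_bytes, is_valid_utf8(accented_ansi), starts_with_shebang)
```

          The Windows-1252 bytes for the accented word are not valid UTF-8, which is why the equivalence only holds for the 0-127 range.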

          • guy038

            Hello, @asbyonejj, @gstavi and All,

            Reminders :

            • An ANSI encoded file is generally a file in one of the Windows code pages, from Windows-1250 to Windows-1258, which encodes 256 characters, divided in two parts :

              • Characters with Unicode code-point between \x00 and \x7F ( from 0 to 127 ), coded with 1 byte, which belong to the old US-ASCII encoding

              • Characters with byte value between \x80 and \xFF ( from 128 to 255 ), which are specific to each one-byte encoding, like, for instance, the ISO-... encodings family, the OEM-.... encodings family and the Windows-.... encodings family, for the best-known

            For an almost exhaustive list of encodings, refer to the table, at :

            https://en.wikipedia.org/wiki/Windows-1252#External_links

            • A UTF-8 encoded file, due to the clever UTF-8 encoding ( Universal Character Set Transformation Format - 8 bits ), can encode any of the 1,114,112 possible Unicode code-points with :

              • 1 byte if the Unicode code-point of this character is between \x{0000} and \x{007F}

              • 2 bytes if the Unicode code-point of this character is between \x{0080} and \x{07FF}

              • 3 bytes if the Unicode code-point of this character is between \x{0800} and \x{FFFF}

              • 4 bytes if the Unicode code-point of this character is between \x{10000} and \x{10FFFF}
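
            The byte counts in the list above can be checked against Python's own UTF-8 encoder. This is only an illustrative sketch: `utf8_length` is a hypothetical helper name, and the four sample characters are arbitrary picks, one per range.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 needs for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:  # U+10FFFF is the highest code point
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


# One sample per range: U+0041, U+00E9, U+20AC, U+1F600
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # The table-based length must agree with what the encoder produces.
    assert len(encoded) == utf8_length(ord(ch))
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")
```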

            However, note that the latest Unicode release, v13.0, contains only 283,506 assigned code-points, and there are still 830,606 unassigned ones, reserved for future use !

            https://www.unicode.org/versions/stats/charcountv13_0.html
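
            As a quick arithmetic check of these figures, here is a small Python sketch. It assumes the usual layout of the Unicode code space as 17 planes of 65,536 code-points each, and it takes the assigned count from the post itself. The variable names are made up for illustration.

```python
PLANES = 17                      # planes 0 to 16
CODE_POINTS_PER_PLANE = 0x10000  # 65,536 code-points per plane

total_code_points = PLANES * CODE_POINTS_PER_PLANE

assigned_v13 = 283_506           # figure quoted in the post for Unicode v13.0
unassigned_v13 = total_code_points - assigned_v13

print(f"{total_code_points:,} total, {unassigned_v13:,} unassigned")
```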


            You may have noticed that the encoding of the first 128 characters, between \x{0000} and \x{007F}, is identical in an ANSI and in a UTF-8 encoded file

            So, if your file contains only characters with Unicode code-point lower than \x{0080} and your file is not UTF-8-BOM encoded, it’s impossible for any editor, including Notepad++, to guess whether the user assumes an ANSI or a UTF-8 encoding. See :

            http://www.unicode.org/charts/PDF/U0000.pdf
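
            To illustrate the ambiguity, here is a rough Python sketch of an encoding guess based only on the file's bytes. It is NOT Notepad++'s actual detection code: `guess_encoding` and its return labels are hypothetical, and Windows-1252 is assumed as the ANSI code page for the sample data.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Guess an encoding label from raw bytes (illustrative only)."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8-BOM"               # the BOM removes any doubt
    if data.isascii():
        return "ambiguous (ASCII only)"  # same bytes in ANSI and UTF-8
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "ANSI"                    # bytes above 127 that are not valid UTF-8
    return "UTF-8"                       # valid multi-byte sequences found


print(guess_encoding(b"#!/bin/bash\necho hello\n"))
print(guess_encoding("caf\u00e9".encode("utf-8")))
print(guess_encoding("caf\u00e9".encode("cp1252")))
print(guess_encoding(codecs.BOM_UTF8 + b"echo hello\n"))
```

            For an ASCII-only file the bytes give no hint either way, so the label an editor shows depends on its default settings, not on the file content.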


            Thus, in order to remove the ambiguity and help N++ decide on the right user-encoding, you have two possibilities :

            • Include in your file, at least, 1 character, with code-point above \x{007F}

            • UN-tick the option Apply to opened ANSI files in Settings > Preferences... > New Document > Encoding, if the UTF-8 option is ticked

            This way, after saving your file with the ANSI encoding ( Encoding > Convert to ANSI ), it will now keep this ANSI encoding for future N++ startups ;-))

            Best Regards,

            guy038

            • asbyonejj @guy038

              @guy038 said in Encoding says file is UTF8 but it's not:

              UN-tick the option Apply to opened ANSI files in Settings > Preferences… > New Document > Encoding, if the UTF-8 option is ticked

              This was my case. I didn’t know this option. Thank you.
