    utf8mb4 characters e.g. 🆔 are not detected as utf8 unless you include a BOM

      Gary Rowswell

      If a file contains any utf8mb4 character, Notepad++ opens it with ANSI encoding.

      If it’s technically not possible to correctly detect the file as UTF-8 without using a BOM (which I can’t include), is it possible to define a default/assumed encoding?

        guy038

        Hello, @gary-rowswell, and All,

        To begin with, I would strongly advise anyone to use the UTF-8 BOM encoding, in all cases. Indeed, compared to the UTF-8 encoding, the file is just 3 bytes bigger: these invisible bytes ( EF BB BF ) are the UTF-8 representation of the Byte Order Mark, Unicode code point \x{FEFF}.
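To make the "3 bytes more" claim concrete, here is a minimal Python sketch. The helper name `has_utf8_bom` and the sample text are just illustrations, not anything taken from Notepad++ itself; the `utf-8-sig` codec and `codecs.BOM_UTF8` are standard Python.

```python
import codecs

# U+FEFF is the Byte Order Mark code point; encoded in UTF-8 it is three bytes.
bom_bytes = "\ufeff".encode("utf-8")
print(bom_bytes.hex())

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM."""
    return data.startswith(codecs.BOM_UTF8)

text = "id: \U0001F194"                # sample text containing the 🆔 character
with_bom = text.encode("utf-8-sig")    # "utf-8-sig" prepends the BOM when encoding
without_bom = text.encode("utf-8")     # plain UTF-8, no BOM
size_difference = len(with_bom) - len(without_bom)
print(size_difference, has_utf8_bom(with_bom), has_utf8_bom(without_bom))
```

The two encoded forms differ only by the leading BOM, so a tool that looks at the first three bytes can tell them apart without inspecting the rest of the file.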

        As any decent editor or browser recognizes the BOM, you can be sure that your UTF-8 encoded text will be displayed correctly, whatever the Unicode code-point of its characters, between 0 and 10FFFD ( except for the surrogates area ), assuming, of course, that the current font can handle all the characters of your text and display their glyphs properly !

        For additional information, refer to :

        https://en.wikipedia.org/wiki/Byte_order_mark


        Gary, what you call utf8mb4 is a MySQL character set name ( the mb4 means up to 4 bytes per character ). It is simply full UTF-8 and, as such, allows the use of Unicode characters located outside the BMP ( Basic Multilingual Plane ), that is to say with a code point > \x{FFFF}, which are encoded with four bytes !

        Your “🆔” character ( code point 1F194 ) is part of the Unicode block "Enclosed Alphanumeric Supplement", between 1F100 and 1F1FF. See the PDF file, below :

        http://www.unicode.org/charts/PDF/U1F100.pdf
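As a quick check of the points above, this small Python sketch inspects the “🆔” character: its code point, whether it lies outside the BMP, and how many bytes it takes in UTF-8. The variable names are only illustrative.

```python
import unicodedata

char = "\U0001F194"  # the "🆔" character from the question

code_point = ord(char)
utf8_bytes = char.encode("utf-8")
name = unicodedata.name(char)

outside_bmp = code_point > 0xFFFF              # beyond the Basic Multilingual Plane
in_block = 0x1F100 <= code_point <= 0x1F1FF    # Enclosed Alphanumeric Supplement range
print(f"U+{code_point:04X} {name}: {utf8_bytes.hex()} ({len(utf8_bytes)} bytes)")
```

Any character with a code point above FFFF needs four bytes in UTF-8, which is exactly the case that the MySQL name utf8mb4 refers to.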


        Now, if you insist on using plain UTF-8 ( so, without BOM ), here is a work-around :

        • Start Notepad++ ( I personally used the latest version, 7.5.8 )

        • Go to Settings > Preferences… > MISC. and check the Autodetect character encoding option

        • Open a new document ( Ctrl + N )

        • If its current encoding is different from UTF-8, choose the option Encoding > Convert to UTF-8

        • Then insert, preferably in a comment, at least 3 NON-ASCII characters, with code-point > \x{007F} ( or > 127 in decimal ). If you can’t type them easily with your keyboard, you may use the Edit > Character Panel dialog, in N++

        • Now, add your text containing characters, located outside the BMP, with code-point > \x{FFFF}

        • Save your UTF-8 encoded file

        • Close and restart N++

        => The UTF-8 encoding should have been kept ;-))

        Voilà !

        Notes :

        • During tests, I noticed that these 3 chars must be inserted BEFORE any character like yours ( “🆔” ) !

        • In theory, 2 NON-ASCII characters seem enough to get the right behaviour !
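If the file is generated by a script rather than typed in N++, the same layout can be produced programmatically. The following is only a sketch of the file layout described in the steps and notes above: the function name `write_utf8_without_bom`, the `#` comment syntax, the hint characters and the file name are all hypothetical, and whether Notepad++ then detects the file as UTF-8 still depends on its own auto-detection.

```python
import os
import tempfile

def write_utf8_without_bom(path: str, body: str) -> bytes:
    """Write `body` as BOM-less UTF-8, preceded by a comment line that
    contains three non-ASCII characters, and return the bytes written."""
    hint_line = "# encoding hint: é è à\n"      # hypothetical comment line
    data = (hint_line + body).encode("utf-8")   # the "utf-8" codec never adds a BOM
    with open(path, "wb") as handle:
        handle.write(data)
    return data

# Hypothetical demo file in the system temp directory.
path = os.path.join(tempfile.gettempdir(), "utf8_no_bom_demo.txt")
written = write_utf8_without_bom(path, "user \U0001F194 12345\n")
print(len(written), "bytes written to", path)
```

The hint line is placed first on purpose, so that the 2-byte characters come before any 4-byte character, matching the first note above.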

        Best Regards,

        guy038

        P.S. :

        I should have explained why we need to add some NON-ASCII characters to the text. This is because a character with code-point > \x{007F} is always encoded with 1 byte in ANSI, whereas it is encoded with 2, 3 or 4 bytes in UTF-8. So, this helps N++ to correctly detect the actual encoding, even without any BOM !
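The byte-count difference can be seen with a few lines of Python. This sketch assumes that "ANSI" means the Windows-1252 code page ( `cp1252` ), which is only one possible ANSI code page, and the `looks_like_utf8` heuristic is an illustration of the idea, not Notepad++'s actual detection algorithm.

```python
def byte_lengths(char: str) -> tuple:
    """Return (ansi_length, utf8_length); ansi_length is None when the
    character does not exist in Windows-1252 (assumed ANSI code page)."""
    try:
        ansi_length = len(char.encode("cp1252"))
    except UnicodeEncodeError:
        ansi_length = None
    return ansi_length, len(char.encode("utf-8"))

def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: the data contains at least one non-ASCII byte and
    decodes cleanly as strict UTF-8."""
    if all(byte < 0x80 for byte in data):
        return False  # pure ASCII: nothing distinguishes UTF-8 from ANSI
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

for sample in ("é", "€", "\U0001F194"):
    print(repr(sample), byte_lengths(sample))
```

A lone ANSI byte above 7F is almost never a valid UTF-8 sequence on its own, which is why a few multi-byte characters give a detector enough evidence to choose UTF-8.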
