
    utf8mb4 characters e.g. 🆔 are not detected as utf8 unless you include a BOM

      Gary Rowswell

      If any utf8mb4 character is present in a file, Notepad++ opens that file with ANSI encoding.

      If it’s technically not possible to correctly detect as UTF8 without using a BOM (which I can’t include), is it possible to define a default/assumed encoding?

        guy038

        Hello, @gary-rowswell, and All,

        To begin with, I would strongly advise anyone to use the UTF-8 BOM encoding, in all cases. Indeed, compared to the UTF-8 encoding, the file size is just 3 bytes larger. These 3 bytes are invisible and are the UTF-8 representation of the Byte Order Mark, Unicode code point \x{FEFF}.
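        To make these 3 bytes concrete, here is a small Python sketch. It is only an illustration: the sample string is made up, and `utf-8-sig` is simply Python's name for "UTF-8 with a BOM prepended".

```python
import codecs

# U+FEFF encoded as UTF-8 gives the three-byte BOM.
bom = "\ufeff".encode("utf-8")
print(bom)                      # b'\xef\xbb\xbf'
print(bom == codecs.BOM_UTF8)   # True

# Hypothetical sample text, used only to compare the two encodings.
text = "id: \U0001F194"
plain = text.encode("utf-8")          # UTF-8 without BOM
with_bom = text.encode("utf-8-sig")   # UTF-8 with the BOM prepended
print(len(with_bom) - len(plain))     # 3
```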

        As any decent editor or browser recognizes the BOM, you can be sure that your UTF-8 encoded text will be correctly displayed, whatever the Unicode code-point of the characters, between 0 and 10FFFD ( except for the surrogates area ), assuming, of course, that the current font can handle all the characters of your text and display their glyphs properly !

        For additional information, refer to :

        https://en.wikipedia.org/wiki/Byte_order_mark


        Gary, what you call utf8mb4 seems to be a MySQL encoding ( the mb4 probably means MultiBytes-4 ) which, like UTF-8, allows the use of Unicode characters located outside the BMP ( Basic Multilingual Plane ), that is to say with a code point > \x{FFFF}, encoded with four bytes !

        Your “🆔” character is part of the Unicode block "Enclosed Alphanumeric Supplement", between 1F100 and 1F1FF. See the PDF file, below :

        http://www.unicode.org/charts/PDF/U1F100.pdf
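        As a quick check of the points above, this Python sketch prints the code point of “🆔” and its UTF-8 bytes. The variable names are mine and purely illustrative.

```python
ch = "\U0001F194"  # the "🆔" character

code_point = ord(ch)
print(hex(code_point))                    # 0x1f194
print(code_point > 0xFFFF)                # True: outside the BMP
print(0x1F100 <= code_point <= 0x1F1FF)   # True: inside the block quoted above

encoded = ch.encode("utf-8")
print(encoded.hex(" "))                   # f0 9f 86 94
print(len(encoded))                       # 4 bytes
```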


        Now, if you insist on using the UTF-8 encoding ( so, without BOM ), here is a work-around :

        • Start Notepad++ ( I personally used the latest version, 7.5.8 )

        • Go to Settings > Preferences… > MISC. and check the Autodetect character encoding option

        • Open a new document ( Ctrl + N )

        • If its current encoding is different from UTF-8, choose the option Encoding > Convert to UTF-8

        • Then, insert, preferably in a comment, at least 3 NON-ASCII characters, with code-point > \x{007F} ( or > 127 in decimal ). For this purpose, if you can’t type them easily with your keyboard, you may use the Edit > Character Panel dialog, in N++

        • Now, add your text containing characters, located outside the BMP, with code-point > \x{FFFF}

        • Save your UTF-8 encoded file

        • Close and restart N++

        => The UTF-8 encoding should have been kept ;-))
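        If you generate such files from a script rather than typing them in N++, the same layout can be sketched in Python as below. This is only an assumption-laden illustration: the file name, the comment line and its three accented characters are hypothetical, and the script only checks the bytes on disk. It does not prove what Notepad++ will detect.

```python
import codecs
import os
import tempfile

# Hypothetical content: a comment line with 3 non-ASCII characters first,
# then text containing a character outside the BMP.
content = "# é è à\nid: \U0001F194\n"

path = os.path.join(tempfile.mkdtemp(), "sample.txt")  # hypothetical file name
# The "utf-8" codec writes no BOM ("utf-8-sig" would add one).
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write(content)

with open(path, "rb") as f:
    raw = f.read()

has_bom = raw.startswith(codecs.BOM_UTF8)
# Offset of the first non-ASCII byte and of the first 4-byte lead byte.
first_non_ascii = next(i for i, b in enumerate(raw) if b > 0x7F)
first_four_byte = next(i for i, b in enumerate(raw) if b >= 0xF0)

print(has_bom)                             # False
print(first_non_ascii < first_four_byte)   # True
```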

        Voilà !

        Notes :

        • During tests, I noticed that these 3 chars must be inserted BEFORE any character like yours ( “🆔” ) !

        • In theory, 2 NON-ASCII characters seem enough to get the right behaviour !

        Best Regards,

        guy038

        P.S. :

        I should have explained why we need to add some NON-pure ASCII characters to the current text. This is because a character with code-point > \x{007f} is always encoded with 1 byte in ANSI, whereas it is encoded with 2, 3 or 4 bytes in UTF-8. So, this helps N++ to correctly detect the present encoding, even without any BOM !
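        To illustrate the idea, here is a very naive Python sketch. It is NOT the algorithm that N++ really uses: `looks_like_utf8` is a hypothetical helper of mine, and I assume that "ANSI" means the Windows-1252 code page.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data has non-ASCII bytes and decodes cleanly as UTF-8."""
    if all(b < 0x80 for b in data):
        return False  # pure ASCII: ANSI and UTF-8 give identical bytes
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

s = "é"                      # code-point \x{00E9}, so > \x{007F}
ansi = s.encode("cp1252")    # assumption: "ANSI" = Windows-1252
utf8 = s.encode("utf-8")

print(len(ansi), len(utf8))             # 1 2
print(looks_like_utf8(utf8))            # True
print(looks_like_utf8(ansi))            # False
print(looks_like_utf8(b"plain ASCII"))  # False
```

        With pure ASCII content there is nothing to tell the two encodings apart, which is why a few NON-ASCII characters give a detector something to work with.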

        The Community of users of the Notepad++ text editor.