utf8mb4 characters e.g. 🆔 are not detected as utf8 unless you include a BOM



  • When any utf8mb4 character is present in a file, Notepad++ opens that file with ANSI encoding.

    If it’s technically not possible to correctly detect the file as UTF-8 without a BOM (which I can’t include), is it possible to define a default/assumed encoding?



  • Hello, @gary-rowswell, and All,

    To begin with, I would strongly advise everyone to use the UTF-8 BOM encoding in all cases. Compared to plain UTF-8, the file is just 3 bytes larger : these invisible bytes ( EF BB BF ) are the UTF-8 representation of the Byte Order Mark, Unicode code point \x{FEFF}.

    As any decent editor or browser recognizes the BOM, you can be sure that your UTF-8 encoded text will be correctly displayed, whatever the Unicode code point of its characters, from 0 to 10FFFD ( except for the surrogates area ), assuming, of course, that the current font can handle all the characters of your text and display their glyphs properly !
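To make the "3 bytes more" point concrete, here is a small Python sketch. It is only an illustration: the sample string is made up, and `utf-8-sig` is Python's name for "UTF-8 with BOM", not a Notepad++ setting.

```python
import codecs

# The BOM is the single code point U+FEFF; encoded in UTF-8 it becomes 3 bytes.
bom_bytes = "\ufeff".encode("utf-8")
print(bom_bytes.hex())

# Hypothetical sample text containing the character from the question (U+1F194).
text = "id: \U0001F194\n"

plain = text.encode("utf-8")         # UTF-8 without BOM
with_bom = text.encode("utf-8-sig")  # same bytes, with the BOM prepended

size_difference = len(with_bom) - len(plain)
print(size_difference)
```

Decoding with `utf-8-sig` strips the BOM again, so the text itself is unchanged.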

    For additional information, refer to :

    https://en.wikipedia.org/wiki/Byte_order_mark


    Gary, what you call utf8mb4 seems to be a MySQL encoding ( the mb4 probably means MultiBytes-4 ). Like standard UTF-8, it allows the use of Unicode characters located outside the BMP ( Basic Multilingual Plane ), that is to say with a code point > \x{FFFF}, which are encoded with four bytes !

    Your “🆔” character is part of the Unicode block "Enclosed Alphanumeric Supplement", between 1F100 and 1F1FF. See the PDF file, below :

    http://www.unicode.org/charts/PDF/U1F100.pdf
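As a quick check of the above, this Python sketch looks at the code point of “🆔” and its UTF-8 bytes. The character is written as the escape `\U0001F194` so that the snippet does not depend on how the script file itself is encoded.

```python
import unicodedata

char = "\U0001F194"  # the "🆔" character from the question

code_point = ord(char)
utf8_bytes = char.encode("utf-8")

# Block range quoted above: 1F100 to 1F1FF.
in_block = 0x1F100 <= code_point <= 0x1F1FF
# Anything above FFFF lies outside the BMP and needs 4 bytes in UTF-8.
outside_bmp = code_point > 0xFFFF

print(f"U+{code_point:04X}", unicodedata.name(char))
print(utf8_bytes.hex(), len(utf8_bytes))
```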


    Now, if you insist on using plain UTF-8 ( so, without BOM ), here is a workaround :

    • Start Notepad++ ( I personally used the latest version, 7.5.8 )

    • Go to Settings > Preferences… > MISC. and check the Autodetect character encoding option

    • Open a new document ( Ctrl + N )

    • If its current encoding is different from UTF-8, choose the option Encoding > Convert to UTF-8

    • Then insert, preferably in a comment, at least 3 NON-ASCII characters, with code-point > \x{007F} ( or > 127 in decimal ). For this purpose, if you can’t type them easily with your keyboard, you may use the Edit > Character Panel dialog, in N++

    • Now, add your text containing characters located outside the BMP, with code-point > \x{FFFF}

    • Save your UTF-8 encoded file

    • Close and restart N++

    => The UTF-8 encoding should have been kept ;-))

    Voilà !
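If the file is generated by a script rather than typed in N++, the same layout can be sketched in Python. Everything here is an assumption made for illustration : the helper name `write_bomless_utf8`, the `#` comment syntax, the three marker characters and the demo file name. Whether N++ then detects the file as UTF-8 still depends on its autodetection, as described in the steps above.

```python
from pathlib import Path
import tempfile

def write_bomless_utf8(path, body, marker="# \u00e9 \u00e8 \u00fc"):
    """Write a marker comment (3 non-ASCII characters), then the body,
    as UTF-8 without BOM. Hypothetical helper, not part of Notepad++."""
    content = marker + "\n" + body
    data = content.encode("utf-8")  # "utf-8" (not "utf-8-sig") adds no BOM
    Path(path).write_bytes(data)
    return data

# Hypothetical demo file in the system temp directory.
demo_path = Path(tempfile.gettempdir()) / "npp_utf8_demo.txt"
written = write_bomless_utf8(demo_path, "id: \U0001F194\n")
```

The marker line comes first so that the non-ASCII characters appear before any character located outside the BMP.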

    Notes :

    • During tests, I noticed that these 3 chars must be inserted BEFORE any character such as yours ( “🆔” ) !

    • In theory, 2 NON-ASCII characters seem enough to get the right behaviour !

    Best Regards,

    guy038

    P.S. :

    I should have explained why we need to add some NON-pure ASCII characters in the current text. This is because a character with code-point > \x{007f} is always encoded with 1 byte in ANSI, whereas it is encoded with 2, 3 or 4 bytes in UTF-8. So, these characters help N++ to correctly detect the present encoding, even without any BOM !
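The byte-length difference can be seen with a short Python sketch. It assumes that "ANSI" means the Windows-1252 code page ( `cp1252` in Python ), which is only one possible ANSI code page. The `looks_like_utf8` helper is a naive validity check written for this illustration, not the detection logic that N++ actually uses.

```python
# Sample characters: é (U+00E9), € (U+20AC) and 🆔 (U+1F194).
samples = ["\u00e9", "\u20ac", "\U0001F194"]

def byte_lengths(ch):
    """Return (ANSI length, UTF-8 length); ANSI is None if cp1252 lacks ch."""
    utf8_len = len(ch.encode("utf-8"))
    try:
        ansi_len = len(ch.encode("cp1252"))
    except UnicodeEncodeError:
        ansi_len = None  # character does not exist in this ANSI code page
    return ansi_len, utf8_len

def looks_like_utf8(data):
    """Naive check: do the bytes decode as valid UTF-8?"""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

for ch in samples:
    print(f"U+{ord(ch):04X}", byte_lengths(ch))
```

For example, `looks_like_utf8("\u00e9".encode("cp1252"))` is `False`, because the single ANSI byte E9 is not a complete UTF-8 sequence, while the UTF-8 encoding of the same character passes the check.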

