utf8mb4 characters e.g. 🆔 are not detected as utf8 unless you include a BOM

Gary Rowswell

When a file contains any utf8mb4 character, Notepad++ opens it with ANSI encoding.

If it’s technically not possible to correctly detect as UTF8 without using a BOM (which I can’t include), is it possible to define a default/assumed encoding?

guy038

Hello, @gary-rowswell, and All,

To begin with, I would strongly advise anyone to use the UTF-8 BOM encoding, in all cases. Indeed, compared to the UTF-8 encoding, the file is just 3 bytes larger. These 3 bytes are invisible and are the UTF-8 representation of the Byte Order Mark, of Unicode code point \x{FEFF}.
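
As a quick check of the "3 bytes more" point, here is a minimal Python sketch (Python is just a convenient choice for illustration, it is not part of the original discussion). It encodes the code point \x{FEFF} in UTF-8 and compares a text encoded with and without BOM. The sample string is an assumption.

```python
import codecs

# The Byte Order Mark is the single code point U+FEFF.
bom_bytes = "\ufeff".encode("utf-8")

# The standard library exposes the same three bytes as a constant.
assert bom_bytes == codecs.BOM_UTF8

text = "id: \U0001F194"               # sample text (assumption), contains the 🆔 character
plain = text.encode("utf-8")          # UTF-8 without BOM
with_bom = text.encode("utf-8-sig")   # UTF-8 with the BOM prepended

print(bom_bytes.hex())                # efbbbf
print(len(with_bom) - len(plain))     # 3
```

The "utf-8-sig" codec also strips the BOM again when decoding, so the text itself is unchanged.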

As any decent editor or browser recognizes the BOM, you can be sure that your UTF-8 encoded text will be correctly displayed, whatever the Unicode code-point of the characters, between 0 and 10FFFD ( except for the surrogates area ), assuming, of course, that the current font can handle all the characters of your text and displays their glyphs properly !

For additional information, refer to :

https://en.wikipedia.org/wiki/Byte_order_mark

Gary, what you call utf8mb4 seems to be a MySQL encoding ( The mb4 probably means MultiBytes-4 ) and, like UTF-8, it allows the use of Unicode characters located outside the BMP ( Basic Multilingual Plane ), that is to say with a code point > \x{FFFF}, encoded with four bytes !

Your “🆔” character is part of the Unicode block "Enclosed Alphanumeric Supplement", between 1F100 and 1F1FF. See the PDF file, below :

http://www.unicode.org/charts/PDF/U1F100.pdf
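
To make the "outside the BMP, encoded with four bytes" point concrete, here is a small Python sketch that inspects the “🆔” character. The variable names are only illustrative.

```python
import unicodedata

ch = "\U0001F194"                 # the 🆔 character
code_point = ord(ch)
utf8 = ch.encode("utf-8")

# Block limits taken from the post: 1F100 to 1F1FF.
in_block = 0x1F100 <= code_point <= 0x1F1FF
outside_bmp = code_point > 0xFFFF

print(f"U+{code_point:04X}")      # U+1F194
print(unicodedata.name(ch))       # official Unicode character name
print(in_block, outside_bmp)      # True True
print(len(utf8), utf8.hex())      # 4 f09f8694
```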

Now, if you insist on using the UTF-8 encoding ( so, without BOM ), here is a work-around :

Start Notepad++ ( I personally used the latest version, 7.5.8 )
Go to Settings > Preferences… > MISC. and check the Autodetect character encoding option
Open a new document ( Ctrl + N )
If its current encoding is different from UTF-8, choose the option Encoding > Convert to UTF-8
Then, insert, preferably in a comment, at least 3 NON-ASCII characters, with code-point > \x{007F} ( or > 127 in decimal ). If you can’t type them easily with your keyboard, you may use the Edit > Character Panel dialog, in N++
Now, add your text containing characters, located outside the BMP, with code-point > \x{FFFF}
Save your UTF-8 encoded file
Close and restart N++

=> The UTF-8 encoding should have been kept ;-))
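
If you produce such files from a script rather than by hand, the same layout can be sketched in Python. This is only an illustration of the steps above: the file name, the "#" comment marker and the three accented letters are assumptions of mine, and whether N++ then detects UTF-8 still depends on the Autodetect character encoding option.

```python
import codecs
import tempfile
from pathlib import Path


def build_sample() -> bytes:
    """Return BOM-less UTF-8 bytes with non-ASCII characters placed first."""
    # Three NON-ASCII characters in a comment line, BEFORE any non-BMP character.
    hint = "# encoding hint: é è à\n"
    # Text containing a character outside the BMP (the 🆔 character).
    body = "id: \U0001F194\n"
    return (hint + body).encode("utf-8")   # "utf-8" never writes a BOM


data = build_sample()

# Sanity check: the file must not start with the 3 BOM bytes.
assert not data.startswith(codecs.BOM_UTF8)

# Hypothetical output location, for illustration only.
out_path = Path(tempfile.gettempdir()) / "sample_utf8_no_bom.txt"
out_path.write_bytes(data)
print(out_path)
```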

Voilà !

Notes :

During tests, I noticed that these 3 chars must be inserted BEFORE any character like yours ( “🆔” ) !
In theory, 2 NON-ASCII characters seem enough to get the right behaviour !

Best Regards,

guy038

P.S. :

I should have explained why we need to add some NON-pure ASCII characters, in current text. This is because a character with code-point > \x{007f} is always encoded with 1 byte, in ANSI, whereas it is encoded with 2, 3 or 4 bytes, in UTF-8. So, this helps N++ to correctly detect the present encoding, even without any BOM !
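
The byte counts can be checked with a short Python sketch. Two assumptions here: "ANSI" is taken to be Windows-1252 ( the actual code page depends on the system ), and the strict validity test below is a generic heuristic, not the detection routine that N++ really uses.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# One byte per character in ANSI (Windows-1252 assumed), 2 or 3 bytes in UTF-8.
for ch in ("é", "€"):
    print(ch, len(ch.encode("cp1252")), len(ch.encode("utf-8")))

# A character outside the BMP takes 4 bytes in UTF-8 and has no ANSI form at all.
print(len("\U0001F194".encode("utf-8")))

# The same text saved in each encoding gives very different byte patterns.
ansi_bytes = "é è à".encode("cp1252")
utf8_bytes = "é è à".encode("utf-8")
print(is_valid_utf8(ansi_bytes))   # False: lone high bytes are not valid UTF-8
print(is_valid_utf8(utf8_bytes))   # True
```

A pure ASCII file is byte-for-byte identical in both encodings, which is why it gives no clue about the intended encoding.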