utf8mb4 characters e.g. 🆔 are not detected as utf8 unless you include a BOM
-
As soon as a file contains any utf8mb4 character, Notepad++ opens it with ANSI encoding.
If it’s technically not possible to correctly detect as UTF8 without using a BOM (which I can’t include), is it possible to define a default/assumed encoding?
-
Hello, @gary-rowswell, and All,
To begin with, I would strongly advise everyone to use the UTF-8 BOM encoding, in all cases. Indeed, compared to the UTF-8 encoding, the file size is just 3 bytes larger. These bytes are invisible and are the UTF-8 representation of the Byte Order Mark, Unicode code point \x{FEFF}.
As any decent editor or browser recognizes the BOM, you can be confident that your UTF-8 encoded text will be correctly displayed, whatever the Unicode code point of its characters, between 0 and 10FFFF (except for the surrogates area), assuming, of course, that the current font can handle all the characters of your text and display their glyphs properly!
For additional information, refer to:
https://en.wikipedia.org/wiki/Byte_order_mark
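To make the 3-byte overhead concrete, here is a minimal Python sketch. It is only an illustration and has nothing to do with Notepad++ itself: the sample string is made up, and `utf-8-sig` is Python's name for "UTF-8 with BOM".

```python
import codecs

# Hypothetical sample text containing a character outside the BMP.
text = "id: \U0001F194\n"

plain = text.encode("utf-8")          # UTF-8 without BOM
with_bom = text.encode("utf-8-sig")   # UTF-8 with the BOM prepended

# The BOM is simply U+FEFF encoded in UTF-8.
bom_bytes = "\ufeff".encode("utf-8")

print(bom_bytes.hex(" "))             # ef bb bf
print(bom_bytes == codecs.BOM_UTF8)   # True
print(len(with_bom) - len(plain))     # 3
```

Decoding with `utf-8-sig` strips the BOM again, so the text itself is unchanged.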
Gary, what you call utf8mb4 seems to be a MySQL encoding (the mb4 probably means MultiBytes-4). Like UTF-8, it allows the use of Unicode characters located outside the BMP (Basic Multilingual Plane), that is to say, with a code point > \x{FFFF}, encoded with four bytes!
Your “🆔” character is part of the Unicode block "Enclosed Alphanumeric Supplement", between 1F100 and 1F1FF. See the PDF file below:
http://www.unicode.org/charts/PDF/U1F100.pdf
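If you want to check these figures yourself, the following short Python sketch (an illustration only, with variable names chosen for this example) prints the code point of “🆔”, tests it against the BMP limit and the block range quoted above, and shows its four UTF-8 bytes.

```python
ch = "\U0001F194"                 # the "🆔" character
cp = ord(ch)
encoded = ch.encode("utf-8")

print(f"U+{cp:04X}")                     # U+1F194
print(cp > 0xFFFF)                       # True: outside the BMP
print(0x1F100 <= cp <= 0x1F1FF)          # True: inside the quoted block range
print(len(encoded), encoded.hex(" "))    # 4 f0 9f 86 94
```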
Now, if you insist on using UTF-8 (so, without a BOM), here is a work-around:
- Start Notepad++ (I personally used the latest version, 7.5.8)
- Go to Settings > Preferences… > MISC. and check the Autodetect character encoding option
- Open a new document (Ctrl + N)
- If its current encoding is different from UTF-8, choose the option Encoding > Convert to UTF-8
- Then insert, preferably in a comment, at least 3 NON-ASCII characters, with code point > \x{007F} (or > 127 in decimal). If you can’t type them easily with your keyboard, you may use the Edit > Character Panel dialog in N++
- Now, add your text containing characters located outside the BMP, with code point > \x{FFFF}
- Save your UTF-8 encoded file
- Close and restart N++
=> The UTF-8 encoding should have been kept ;-))
Voilà !
Notes:
- During tests, I noticed that these 3 chars must be inserted BEFORE any character such as yours (“🆔”)!
- In theory, 2 NON-ASCII characters seem enough to get the right behaviour!
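If your files are generated by a script rather than typed by hand, the work-around and the notes above could be sketched like this in Python. Everything here is an assumption made for illustration: the helper name `write_hinted_utf8`, the `#` comment syntax, the hint characters `éèà` and the temporary file name. It only guarantees that the file has no BOM and that the non-ASCII hint comes first; whether N++ then detects UTF-8 still has to be checked on your side.

```python
from pathlib import Path
import tempfile

def write_hinted_utf8(path, body, hint="# éèà"):
    """Write `body` as BOM-less UTF-8, preceded by a comment line
    holding a few non-ASCII characters (hypothetical helper)."""
    content = hint + "\n" + body
    Path(path).write_bytes(content.encode("utf-8"))
    return content

# Hypothetical target file in the system temporary directory.
target = Path(tempfile.gettempdir()) / "utf8_no_bom_demo.txt"
content = write_hinted_utf8(target, "id: \U0001F194\n")

raw = target.read_bytes()
print(raw.startswith(b"\xef\xbb\xbf"))                 # False: no BOM written
print(content.index("é") < content.index("\U0001F194"))  # True: hint comes first
```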
Best Regards,
guy038
P.S. :
I should have explained why we need to add some NON-ASCII characters to the current text. This is because a character with code point > \x{007F} is always encoded with 1 byte in ANSI, whereas it is encoded with 2, 3 or 4 bytes in UTF-8. So, this helps N++ to correctly detect the actual encoding, even without any BOM!
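The byte-count difference can be observed with a small Python sketch. Note the assumptions: `cp1252` stands in for "ANSI" (the usual Western Windows code page), and `looks_like_utf8` is a naive check written for this example, not the detection algorithm that N++ actually uses.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Naive check: True if the bytes form valid UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# One byte in ANSI (cp1252 assumed), several bytes in UTF-8.
for ch in ("é", "€"):
    print(ch, len(ch.encode("cp1252")), len(ch.encode("utf-8")))
# é 1 2
# € 1 3

ansi_bytes = "café".encode("cp1252")
utf8_bytes = "café".encode("utf-8")

print(looks_like_utf8(ansi_bytes))   # False: a lone 0xE9 byte is not valid UTF-8
print(looks_like_utf8(utf8_bytes))   # True
print(looks_like_utf8(b"cafe"))      # True, but pure ASCII is also valid ANSI
```

The last line shows why pure ASCII text gives a detector nothing to work with: the bytes are identical in both encodings, so only non-ASCII characters can tip the balance.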