utf8mb4 characters e.g. 🆔 are not detected as utf8 unless you include a BOM
-
If a file contains any utf8mb4 character, Notepad++ opens it with ANSI encoding.
If it’s technically not possible to correctly detect the file as UTF-8 without a BOM (which I can’t include), is it possible to define a default/assumed encoding?
-
Hello, @gary-rowswell, and All,
To begin with, I would strongly advise anyone to use the UTF-8 BOM encoding in all cases. Indeed, compared to the UTF-8 encoding, the file size is just 3 bytes larger. These bytes are invisible and are the UTF-8 representation of the Byte Order Mark, Unicode code point \x{FEFF}.
As any decent editor or browser recognizes the BOM, you can be sure that your UTF-8 encoded text will be displayed correctly, whatever the Unicode code point of its characters, from 0 to 10FFFD (except for the surrogates area), assuming, of course, that the current font can handle all the characters of your text and display their glyphs properly!
For additional information, refer to:
https://en.wikipedia.org/wiki/Byte_order_mark
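As a quick check of the 3-byte claim, here is a minimal Python sketch. It is not part of Notepad++ and the sample text is my own; it only relies on Python's standard `utf-8` and `utf-8-sig` codecs.

```python
import codecs

# U+FEFF encoded as UTF-8 gives the three BOM bytes EF BB BF.
bom_bytes = "\ufeff".encode("utf-8")

# Sample text (an assumption, chosen for illustration) with a non-BMP character.
text = "ID: \U0001F194"

plain = text.encode("utf-8")         # UTF-8 without BOM
with_bom = text.encode("utf-8-sig")  # UTF-8 with BOM prepended

# The BOM variant is exactly three bytes longer.
size_difference = len(with_bom) - len(plain)

# Decoding with "utf-8-sig" strips the BOM again.
decoded = with_bom.decode("utf-8-sig")

print(bom_bytes == codecs.BOM_UTF8, size_difference, decoded == text)
```

The `utf-8-sig` codec is simply Python's name for "UTF-8 with BOM", so the same text round-trips unchanged either way.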
Gary, what you call utf8mb4 seems to be a MySQL encoding (the mb4 probably means MultiBytes-4) and, like UTF-8, it allows the use of Unicode characters located outside the BMP (Basic Multilingual Plane), that is to say with a code point > \x{FFFF}, encoded with four bytes!
Your “🆔” character is part of the Unicode block "Enclosed Alphanumeric Supplement", between 1F100 and 1F1FF. See the PDF file below:
http://www.unicode.org/charts/PDF/U1F100.pdf
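To make the "outside the BMP, four bytes" point concrete, here is a small Python sketch that inspects the “🆔” character. The variable names are mine, and the character is written as an escape so the snippet does not depend on the encoding of the source file.

```python
# The character from the question, U+1F194, written as an escape sequence.
char = "\U0001F194"

code_point = ord(char)

# Outside the Basic Multilingual Plane means a code point above U+FFFF.
outside_bmp = code_point > 0xFFFF

# The block mentioned above spans U+1F100 to U+1F1FF.
in_block = 0x1F100 <= code_point <= 0x1F1FF

# In UTF-8, any code point above U+FFFF takes four bytes.
utf8_bytes = char.encode("utf-8")

print(f"U+{code_point:04X}", utf8_bytes.hex(), len(utf8_bytes))
```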
Now, if you insist on using the UTF-8 encoding (so, without BOM), here is a work-around:
- Start Notepad++ (I personally used the latest 7.5.8 version)
- Go to Settings > Preferences… > MISC. and check the Autodetect character encoding option
- Open a new document (Ctrl + N)
- If its current encoding is different from UTF-8, choose the option Encoding > Convert to UTF-8
- Then insert, preferably in a comment, at least 3 NON-ASCII characters, with code point > \x{007F} (or > 127 in decimal). If you can’t type them easily with your keyboard, you may use the Edit > Character Panel dialog in N++
- Now, add your text containing characters located outside the BMP, with code point > \x{FFFF}
- Save your UTF-8 encoded file
- Close and restart N++
=> The UTF-8 encoding should have been kept ;-))
Voilà!
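If the file is generated by a script rather than typed in N++, the same work-around can be sketched in Python. Everything here is an assumption made for illustration: the helper name `write_utf8_without_bom`, the marker characters, the `# ` comment prefix and the demo file name are hypothetical, and the script only produces the file layout described in the steps above. It does not guarantee how any particular Notepad++ version will detect it.

```python
from pathlib import Path
import tempfile


def write_utf8_without_bom(path, body, marker="éàü", comment_prefix="# "):
    """Write body as UTF-8 without BOM, preceded by a comment line of non-ASCII characters.

    Hypothetical helper: the marker and comment prefix are illustrative defaults.
    """
    # The work-around asks for at least 3 characters above \x{007F}.
    if sum(ord(c) > 0x7F for c in marker) < 3:
        raise ValueError("marker needs at least 3 non-ASCII characters")
    # The marker line goes first, before any non-BMP character in the body.
    content = f"{comment_prefix}{marker}\n{body}"
    # Plain "utf-8" (not "utf-8-sig") writes no BOM.
    Path(path).write_bytes(content.encode("utf-8"))
    return content


# Demo: write a small file to the temporary directory (file name is made up).
demo_path = Path(tempfile.gettempdir()) / "utf8_no_bom_demo.txt"
written = write_utf8_without_bom(demo_path, "Identifier: \U0001F194\n")
raw = demo_path.read_bytes()
print(raw[:12])
```

Writing the bytes directly with `write_bytes` avoids any platform-specific newline translation, so the file on disk is exactly the UTF-8 encoding of the text.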
Notes:
- During tests, I noticed that these 3 chars must be inserted BEFORE any character such as yours (“🆔”)!
- In theory, 2 NON-ASCII characters seem to be enough to get the right behaviour!
Best Regards,
guy038
P.S.:
I should have explained why we need to add some NON-pure ASCII characters to the current text. This is because a character with code point > \x{007F} is always encoded with 1 byte in ANSI, whereas it is encoded with 2, 3 or 4 bytes in UTF-8. So, this helps N++ to correctly detect the present encoding, even without any BOM!
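The byte-length difference can be seen with a short Python sketch. Two assumptions are mine: "ANSI" is taken to be Windows-1252 (`cp1252`), and the `looks_like_utf8` check is a deliberately naive strict-decode test, not the detection algorithm that Notepad++ actually uses.

```python
def byte_lengths(char):
    """Return (ansi_length, utf8_length); ansi_length is None if cp1252 cannot encode char."""
    utf8_len = len(char.encode("utf-8"))
    try:
        ansi_len = len(char.encode("cp1252"))  # assumption: ANSI means Windows-1252
    except UnicodeEncodeError:
        ansi_len = None
    return ansi_len, utf8_len


def looks_like_utf8(data):
    """Naive check: does the byte string decode as strict UTF-8?"""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# One sample per UTF-8 length: 2 bytes, 3 bytes, 4 bytes.
for char in ["é", "€", "\U0001F194"]:
    print(f"U+{ord(char):04X}", byte_lengths(char))

# The same character saved as ANSI is not valid UTF-8, while its UTF-8 form is.
print(looks_like_utf8("é".encode("cp1252")), looks_like_utf8("é".encode("utf-8")))
```

A lone ANSI byte above \x{007F} is almost never a valid UTF-8 sequence, which is why a few such characters early in the file give a detector strong evidence either way.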