<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[utf8mb4 characters e.g. 🆔 are not detected as utf8 unless you include a BOM]]></title><description><![CDATA[<p dir="auto">As soon as a file contains any utf8mb4 character, Notepad++ opens it with ANSI encoding.</p>
<p dir="auto">If it’s technically not possible to correctly detect as UTF8 without using a BOM (which I can’t include), is it possible to define a default/assumed encoding?</p>
]]></description><link>https://community.notepad-plus-plus.org/topic/16114/utf8mb4-characters-e-g-are-not-detected-as-utf8-unless-you-include-a-bom</link><generator>RSS for Node</generator><lastBuildDate>Wed, 20 May 2026 07:36:34 GMT</lastBuildDate><atom:link href="https://community.notepad-plus-plus.org/topic/16114.rss" rel="self" type="application/rss+xml"/><pubDate>Wed, 01 Aug 2018 16:52:56 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to utf8mb4 characters e.g. 🆔 are not detected as utf8 unless you include a BOM on Wed, 01 Aug 2018 21:32:35 GMT]]></title><description><![CDATA[<p dir="auto">Hello, <a class="plugin-mentions-user plugin-mentions-a" href="/user/gary-rowswell" aria-label="Profile: gary-rowswell">@<bdi>gary-rowswell</bdi></a>, and <strong>All</strong>,</p>
<p dir="auto">To begin with, I would <strong>strongly</strong> advise anyone to use the <strong><code>UTF-8 BOM</code></strong> encoding, in <strong>all</strong> cases. Indeed, compared to the <strong><code>UTF-8</code></strong> encoding, the file size is just <strong><code>3</code> bytes</strong> larger: these bytes are <strong>invisible</strong> and are the <strong><code>UTF-8</code></strong> representation of the <strong>Byte Order Mark</strong>, the <strong>Unicode</strong> code point <strong><code>\x{FEFF}</code></strong>.</p>
<p dir="auto">As any <strong>decent</strong> editor or browser recognizes the <strong><code>BOM</code></strong>, you are <strong>absolutely</strong> sure that your <strong><code>UTF-8</code></strong> encoded text will be <strong>correctly</strong> displayed, whatever the <strong>Unicode</strong> code-point of its characters, from <strong><code>0</code></strong> to <strong><code>10FFFD</code></strong> ( except for the <strong>surrogates</strong> area ), assuming, of course, that the <strong>current</strong> font can <strong>handle</strong> all the characters of your text and displays their <strong>glyphs</strong> properly !</p>
<p dir="auto">For <strong>additional</strong> information, refer to :</p>
<p dir="auto"><a href="https://en.wikipedia.org/wiki/Byte_order_mark" rel="nofollow ugc">https://en.wikipedia.org/wiki/Byte_order_mark</a></p>
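<p dir="auto">To make the <strong><code>3</code> bytes</strong> concrete, here is a small <strong>Python</strong> sketch ( Python is my own choice for illustration, it has nothing to do with N++ itself ). It assumes the standard <code>utf-8-sig</code> codec, which prepends the <strong><code>BOM</code></strong> when encoding, and the sample string is arbitrary :</p>

```python
import codecs

# The UTF-8 BOM is simply the code point U+FEFF encoded in UTF-8.
bom = "\ufeff".encode("utf-8")

text = "sample"  # arbitrary sample text, chosen for illustration
without_bom = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")  # "utf-8-sig" prepends the BOM when encoding

print(bom.hex())                          # efbbbf
print(bom == codecs.BOM_UTF8)             # True
print(len(with_bom) - len(without_bom))   # 3

# Decoding with "utf-8-sig" strips the BOM again, so the text is unchanged.
print(with_bom.decode("utf-8-sig") == text)  # True
```

<p dir="auto">So the only difference between the two encodings is the <code>EF BB BF</code> prefix at the very beginning of the file.</p>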
<hr />
<p dir="auto"><strong>Gary</strong>, what you call <strong><code>utf8mb4</code></strong> seems to be a <strong>MySQL</strong> encoding ( the <strong>mb4</strong> probably means <strong>MultiBytes-4</strong> ) and, like <strong><code>UTF-8</code></strong>, it allows the use of <strong>Unicode</strong> characters located outside the <strong><code>BMP</code></strong> ( <strong>Basic Multilingual Plane</strong> ), that is to say with a code point &gt; <strong><code>\x{FFFF}</code></strong>, encoded with <strong>four</strong> bytes !</p>
<p dir="auto">Your “🆔” character is part of the <strong>Unicode</strong> block "<strong>Enclosed Alphanumeric Supplement</strong>", between <strong><code>1F100</code></strong> and <strong><code>1F1FF</code></strong>. See the <strong>PDF</strong> file, below :</p>
<p dir="auto"><a href="http://www.unicode.org/charts/PDF/U1F100.pdf" rel="nofollow ugc">http://www.unicode.org/charts/PDF/U1F100.pdf</a></p>
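<p dir="auto">As a quick check of the <strong>four</strong> bytes, here is a minimal <strong>Python</strong> sketch ( again, Python is only my choice for illustration ). The character is written as the escape <code>\U0001F194</code>, which is the code point of “🆔”, and the euro sign is just an arbitrary <strong><code>BMP</code></strong> character used for comparison :</p>

```python
ch = "\U0001F194"  # the squared ID character from the question

encoded = ch.encode("utf-8")
print(hex(ord(ch)))       # 0x1f194
print(ord(ch) > 0xFFFF)   # True: the character lies outside the BMP
print(encoded.hex())      # f09f8694
print(len(encoded))       # 4

# For comparison, a character inside the BMP needs at most 3 bytes in UTF-8.
euro = "\u20ac"  # euro sign, U+20AC
print(len(euro.encode("utf-8")))  # 3
```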
<hr />
<p dir="auto">Now, if you insist on using the <strong><code>UTF-8</code></strong> encoding ( so, <strong>without</strong> <strong><code>BOM</code></strong> ), here is a <strong>work-around</strong> :</p>
<ul>
<li>
<p dir="auto">Start Notepad++ ( I personally used the <strong>latest</strong> version, <strong><code>7.5.8</code></strong> )</p>
</li>
<li>
<p dir="auto">Go to <strong>Settings &gt; Preferences… &gt; MISC.</strong> and <strong>check</strong> the <strong><code>Autodetect character encoding</code></strong> option</p>
</li>
<li>
<p dir="auto">Open a <strong>new</strong> document ( <strong><code>Ctrl + N</code></strong> )</p>
</li>
<li>
<p dir="auto">If its <strong>current</strong> encoding is <strong>different</strong> from <strong><code>UTF-8</code></strong>, choose the option <strong>Encoding &gt; Convert to UTF-8</strong></p>
</li>
<li>
<p dir="auto">Then, <strong>insert</strong>, preferably in a <strong>comment</strong>, at least <strong><code>3</code></strong> <strong>NON-ASCII</strong> characters, with code-point &gt; <strong><code>\x{007F}</code></strong> ( or &gt; <strong><code>127</code></strong> in <strong>decimal</strong> ). If you can’t type them easily with your <strong>keyboard</strong>, you may use the <strong>Edit &gt; Character Panel</strong> dialog, in N++</p>
</li>
<li>
<p dir="auto">Now, add <strong>your</strong> text containing characters, located <strong>outside</strong> the <strong><code>BMP</code></strong>, with code-point  &gt; <strong><code>\x{FFFF}</code></strong></p>
</li>
<li>
<p dir="auto"><strong>Save</strong> your <strong><code>UTF-8</code> encoded</strong> file</p>
</li>
<li>
<p dir="auto">Close and restart <strong>N++</strong></p>
</li>
</ul>
<p dir="auto">=&gt; The <strong>UTF-8</strong> encoding should have been <strong>kept</strong> ;-))</p>
<p dir="auto">Voilà !</p>
<p dir="auto"><strong>Notes</strong> :</p>
<ul>
<li>
<p dir="auto">During tests, I noticed that these <strong><code>3</code></strong> chars must be inserted <strong>BEFORE</strong> any character like <strong>yours</strong> ( “🆔” ) !</p>
</li>
<li>
<p dir="auto">In theory, <strong><code>2</code> NON-ASCII</strong> characters seem enough to get the <strong>right</strong> behaviour !</p>
</li>
</ul>
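<p dir="auto">If the file is generated by a script rather than typed in N++, the same layout can be produced directly. This is only a sketch in <strong>Python</strong>, under my own assumptions : the file name <code>sample.txt</code>, the <code>#</code> comment syntax and the three accented letters are all hypothetical choices, and the script says nothing about how N++ will actually detect the result :</p>

```python
import os
import tempfile

# Three non-ASCII characters in a leading comment, placed BEFORE any
# character from outside the BMP. The "#" comment syntax is an assumption.
hint = "# \u00e9\u00e8\u00e0\n"
body = "label = \U0001F194\n"

# Hypothetical output location, inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# The plain "utf-8" codec writes no BOM ("utf-8-sig" would add one).
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write(hint + body)

with open(path, "rb") as f:
    raw = f.read()

print(raw.startswith(b"\xef\xbb\xbf"))     # False: no BOM at the start
print(raw.decode("utf-8") == hint + body)  # True: the content round-trips
```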
<p dir="auto">Best Regards,</p>
<p dir="auto">guy038</p>
<p dir="auto"><strong>P.S.</strong> :</p>
<p dir="auto">I should have explained <strong>why</strong> we need to add some <strong>NON-pure ASCII</strong> characters to the <strong>current</strong> text. This is because a character with code-point &gt; <strong><code>\x{007f}</code></strong> is <strong>always encoded</strong> with <strong><code>1</code></strong> byte in <strong><code>ANSI</code></strong>, whereas it is <strong>encoded</strong> with <strong><code>2</code></strong>, <strong><code>3</code></strong> or <strong><code>4</code></strong> <strong>bytes</strong> in <strong><code>UTF-8</code></strong>. So, this helps <strong>N++</strong> to <strong>correctly</strong> detect the <strong>present</strong> encoding, even <strong>without</strong> any <strong><code>BOM</code></strong> !</p>
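<p dir="auto">To illustrate the idea, here is a rough <strong>Python</strong> sketch of such a check. It is <strong>not</strong> the algorithm that N++ really uses : the function name <code>looks_like_utf8</code> is hypothetical, and I assume that <strong><code>ANSI</code></strong> means the Windows-1252 code page ( <code>cp1252</code> ), which depends on your system :</p>

```python
def looks_like_utf8(raw):
    """Return True if raw holds non-ASCII bytes and decodes as strict UTF-8."""
    if raw.isascii():
        # Pure ASCII is byte-for-byte identical in ANSI and UTF-8,
        # so the bytes give no hint about the encoding.
        return False
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


e_acute = "\u00e9"  # an arbitrary non-ASCII character, U+00E9

print(len(e_acute.encode("cp1252")))  # 1 byte in ANSI (cp1252 assumed)
print(len(e_acute.encode("utf-8")))   # 2 bytes in UTF-8

print(looks_like_utf8(b"plain ascii"))             # False: nothing to detect
print(looks_like_utf8(e_acute.encode("cp1252")))   # False: invalid as UTF-8
print(looks_like_utf8(e_acute.encode("utf-8")))    # True
```

<p dir="auto">The single <code>E9</code> byte of the <strong><code>ANSI</code></strong> form is not a valid <strong><code>UTF-8</code></strong> sequence on its own, which is the kind of difference a detector can rely on once the text contains some <strong>NON-ASCII</strong> characters.</p>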
]]></description><link>https://community.notepad-plus-plus.org/post/33886</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/33886</guid><dc:creator><![CDATA[guy038]]></dc:creator><pubDate>Wed, 01 Aug 2018 21:32:35 GMT</pubDate></item></channel></rss>