<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Unicode Normalization]]></title><description><![CDATA[<p dir="auto">Hi,<br />
Which Unicode normalization form is used when creating a new UTF-8 document (or converting to UTF-8)? Can I change this with an INI or GUI option? I want the NFC form.<br />
Thanks,<br />
Javier</p>
]]></description><link>https://community.notepad-plus-plus.org/topic/21222/unicode-normalization</link><generator>RSS for Node</generator><lastBuildDate>Sun, 12 Apr 2026 19:43:50 GMT</lastBuildDate><atom:link href="https://community.notepad-plus-plus.org/topic/21222.rss" rel="self" type="application/rss+xml"/><pubDate>Thu, 27 May 2021 14:09:26 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to Unicode Normalization on Thu, 09 Jan 2025 17:50:46 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/195">@guy038</a> I’m reacting late to this thread, but I think that in many cases, when you want to normalise Unicode text, NFC isn’t enough because it doesn’t handle ligatures.<br />
So many people will need a longer table than the one here, one that includes all the NFKC decompositions.</p>
<p dir="auto">To make this more concrete: NFC won’t normalise the ffi ligature (U+FB03).<br />
So “A\uFB03n” stays “A\uFB03n” when normalized with NFC, but becomes “Affin” when normalized with NFKC, which is much more useful for many people.</p>
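<p dir="auto">For anyone who wants to check this outside BabelPad, here is a minimal sketch. It assumes only Python's standard <code>unicodedata</code> module; it is not something Notepad++ itself runs.</p>

```python
import unicodedata

# "A", LATIN SMALL LIGATURE FFI (U+FB03), "n"
text = "A\uFB03n"

# Canonical composition leaves the ligature untouched.
nfc = unicodedata.normalize("NFC", text)

# Compatibility composition expands the ligature to the letters f, f, i.
nfkc = unicodedata.normalize("NFKC", text)

print([f"U+{ord(c):04X}" for c in nfc])
print(nfkc)
```

<p dir="auto">The NFC result still contains U+FB03, while the NFKC result is the plain five-letter string.</p>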
]]></description><link>https://community.notepad-plus-plus.org/post/99057</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/99057</guid><dc:creator><![CDATA[jmdesp]]></dc:creator><pubDate>Thu, 09 Jan 2025 17:50:46 GMT</pubDate></item><item><title><![CDATA[Reply to Unicode Normalization on Sat, 03 Jul 2021 03:50:37 GMT]]></title><description><![CDATA[<p dir="auto">Hello, <a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/22097">@xaviermdq</a>,</p>
<p dir="auto">Here is the list of the characters that are <strong>transformed</strong> by the <strong><code>NFC</code></strong> <strong>normalisation</strong> form :</p>
<pre><code class="language-z">    48 characters, from LATIN, GREEK, CYRILLIC and ROMAN scripts, are REPLACED by OTHER character(s)

        when using the option "Convert &gt; Normalization Form &gt; To NFC" of the "BabelPad" software


    ( NOTE : C.P. means CODE-POINT, G.C. means GENERAL CATEGORY and D.M. means DECOMPOSITION MAPPING )


18 characters with NFC = NFKC = NFD = NFKD = D.M. :

•-----•---------•-----------------------------------•------•-------------•---------------------------------•
! Chr.|   C.P.  |          Character Name           | G.C. |  Dec. Type  |  NFC = NFKC = NFD = NFKD = D.M. |
•-----•---------•-----------------------------------•------•-------------•---------------------------------•
|   ̀  |   0340  |  COMBINING GRAVE TONE MARK        |  Mn  |  canonical  |  0300                           |
|   ́  |   0341  |  COMBINING ACUTE TONE MARK        |  Mn  |  canonical  |  0301                           |
|   ̓  |   0343  |  COMBINING GREEK KORONIS          |  Mn  |  canonical  |  0313                           |
|   ̈́  |   0344  |  COMBINING GREEK DIALYTIKA TONOS  |  Mn  |  canonical  |  0308 0301                      |
|  ʹ ‎ |   0374  |  GREEK NUMERAL SIGN               |  Lm  |  canonical  |  02B9                           |
|  ; ‎ |   037E  |  GREEK QUESTION MARK              |  Po  |  canonical  |  003B                           |
|  · ‎ |   0387  |  GREEK ANO TELEIA                 |  Po  |  canonical  |  00B7                           |
|  ι ‎ |   1FBE  |  GREEK PROSGEGRAMMENI             |  Ll  |  canonical  |  03B9                           |
|  `  |   1FEF  |  GREEK VARIA                      |  Sk  |  canonical  |  0060                           |
|  Ω  |   2126  |  OHM SIGN                         |  Lu  |  canonical  |  03A9                           |
|  K  |   212A  |  KELVIN SIGN                      |  Lu  |  canonical  |  004B                           |
|  〈  |   2329  |  LEFT-POINTING ANGLE BRACKET      |  Ps  |  canonical  |  3008                           |
|  〉  |   232A  |  RIGHT-POINTING ANGLE BRACKET     |  Pe  |  canonical  |  3009                           |
|  ⫝̸  |   2ADC  |  FORKING                          |  Sm  |  canonical  |  2ADD 0338                      |
|  𝅗𝅥  |  1D15E  |  MUSICAL SYMBOL HALF NOTE         |  So  |  canonical  |  1D157 1D165                    |
|  𝅘𝅥  |  1D15F  |  MUSICAL SYMBOL QUARTER NOTE      |  So  |  canonical  |  1D158 1D165                    |
|  𝆹𝅥  |  1D1BB  |  MUSICAL SYMBOL MINIMA            |  So  |  canonical  |  1D1B9 1D165                    |
|  𝆺𝅥  |  1D1BC  |  MUSICAL SYMBOL MINIMA BLACK      |  So  |  canonical  |  1D1BA 1D165                    |
•-----•---------•-----------------------------------•------•-------------•---------------------------------•



17 characters with NFC = NFKC = D.M. and NFD = NFKD :

•-----•---------•------------------------------------------------------•------•-------------•---------------------•------------------•
! Chr.|   C.P.  |                    Character Name                    | G.C. |  Dec. Type  |  NFC = NFKC = D.M.  |    NFD = NFKD    |
•-----•---------•------------------------------------------------------•------•-------------•---------------------•------------------•
|  ά  |   1F71  |  GREEK SMALL LETTER ALPHA WITH OXIA                  |  Ll  |  canonical  |  03AC               |  03B1 0301       |
|  έ  |   1F73  |  GREEK SMALL LETTER EPSILON WITH OXIA                |  Ll  |  canonical  |  03AD               |  03B5 0301       |
|  ή  |   1F75  |  GREEK SMALL LETTER ETA WITH OXIA                    |  Ll  |  canonical  |  03AE               |  03B7 0301       |
|  ί  |   1F77  |  GREEK SMALL LETTER IOTA WITH OXIA                   |  Ll  |  canonical  |  03AF               |  03B9 0301       |
|  ό  |   1F79  |  GREEK SMALL LETTER OMICRON WITH OXIA                |  Ll  |  canonical  |  03CC               |  03BF 0301       |
|  ύ  |   1F7B  |  GREEK SMALL LETTER UPSILON WITH OXIA                |  Ll  |  canonical  |  03CD               |  03C5 0301       |
|  ώ  |   1F7D  |  GREEK SMALL LETTER OMEGA WITH OXIA                  |  Ll  |  canonical  |  03CE               |  03C9 0301       |
|  Ά  |   1FBB  |  GREEK CAPITAL LETTER ALPHA WITH OXIA                |  Lu  |  canonical  |  0386               |  0391 0301       |
|  Έ  |   1FC9  |  GREEK CAPITAL LETTER EPSILON WITH OXIA              |  Lu  |  canonical  |  0388               |  0395 0301       |
|  Ή  |   1FCB  |  GREEK CAPITAL LETTER ETA WITH OXIA                  |  Lu  |  canonical  |  0389               |  0397 0301       |
|  ΐ  |   1FD3  |  GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA     |  Ll  |  canonical  |  0390               |  03B9 0308 0301  |
|  Ί  |   1FDB  |  GREEK CAPITAL LETTER IOTA WITH OXIA                 |  Lu  |  canonical  |  038A               |  0399 0301       |
|  ΰ  |   1FE3  |  GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA  |  Ll  |  canonical  |  03B0               |  03C5 0308 0301  |
|  Ύ  |   1FEB  |  GREEK CAPITAL LETTER UPSILON WITH OXIA              |  Lu  |  canonical  |  038E               |  03A5 0301       |
|  Ό  |   1FF9  |  GREEK CAPITAL LETTER OMICRON WITH OXIA              |  Lu  |  canonical  |  038C               |  039F 0301       |
|  Ώ  |   1FFB  |  GREEK CAPITAL LETTER OMEGA WITH OXIA                |  Lu  |  canonical  |  038F               |  03A9 0301       |
|  Å  |   212B  |  ANGSTROM SIGN                                       |  Lu  |  canonical  |  00C5               |  0041 030A       |
•-----•---------•------------------------------------------------------•------•-------------•---------------------•------------------•



9 characters with NFC = NFKC = NFD = NFKD and D.M. &lt;&gt; from OTHERS :

•-----•---------•-------------------------------------------------•------•-------------•---------------•---------------------------•
! Chr.|   C.P.  |                 Character Name                  | G.C. |  Dec. Type  |  Decomp. Map. |  NFC = NFKC = NFD = NFKD  |
•-----•---------•-------------------------------------------------•------•-------------•---------------•---------------------------•
|  𝅘𝅥𝅮 ‎ |  1D160  |  MUSICAL SYMBOL EIGHTH NOTE                     |  So  |  canonical  |  1D15F 1D16E  |  1D158 1D165 1D16E        |
|  𝅘𝅥𝅯 ‎ |  1D161  |  MUSICAL SYMBOL SIXTEENTH NOTE                  |  So  |  canonical  |  1D15F 1D16F  |  1D158 1D165 1D16F        |
|  𝅘𝅥𝅰 ‎ |  1D162  |  MUSICAL SYMBOL THIRTY-SECOND NOTE              |  So  |  canonical  |  1D15F 1D170  |  1D158 1D165 1D170        |
|  𝅘𝅥𝅱 ‎ |  1D163  |  MUSICAL SYMBOL SIXTY-FOURTH NOTE               |  So  |  canonical  |  1D15F 1D171  |  1D158 1D165 1D171        |
|  𝅘𝅥𝅲 ‎ |  1D164  |  MUSICAL SYMBOL ONE HUNDRED TWENTY-EIGHTH NOTE  |  So  |  canonical  |  1D15F 1D172  |  1D158 1D165 1D172        |
|  𝆹𝅥𝅮 ‎ |  1D1BD  |  MUSICAL SYMBOL SEMIMINIMA WHITE                |  So  |  canonical  |  1D1BB 1D16E  |  1D1B9 1D165 1D16E        |
|  𝆺𝅥𝅮 ‎ |  1D1BE  |  MUSICAL SYMBOL SEMIMINIMA BLACK                |  So  |  canonical  |  1D1BC 1D16E  |  1D1BA 1D165 1D16E        |
|  𝆹𝅥𝅯 ‎ |  1D1BF  |  MUSICAL SYMBOL FUSA WHITE                      |  So  |  canonical  |  1D1BB 1D16F  |  1D1B9 1D165 1D16F        |
|  𝆺𝅥𝅯 ‎ |  1D1C0  |  MUSICAL SYMBOL FUSA BLACK                      |  So  |  canonical  |  1D1BC 1D16F  |  1D1BA 1D165 1D16F        |
•-----•---------•-------------------------------------------------•------•-------------•---------------•---------------------------•



3 characters with NFC = NFD = D.M. and NFKC = NFKD :

•-----•---------•------------------•------•-------------•-------------------•---------------•
! Chr.|   C.P.  |  Character Name  | G.C. |  Dec. Type  |  NFC = NFD = D.M. |  NFKC = NFKD  |
•-----•---------•------------------•------•-------------•-------------------•---------------•
|  ´  |   1FFD  |  GREEK OXIA      |  Sk  |  canonical  |  00B4             |  0020 0301    |
|     |   2000  |  EN QUAD         |  Zs  |  canonical  |  2002             |  0020         |
|     |   2001  |  EM QUAD         |  Zs  |  canonical  |  2003             |  0020         |
•-----•---------•------------------•------•-------------•-------------------•---------------•



1 character with NFKC = NFKD and D.M. = NFC :

•-----•---------•----------------------------•------•-------------•-------------•-------------•------------------•
! Chr.|   C.P.  |       Character Name       | G.C. |  Dec. Type  |  NFC = D.M. |     NFD     |   NFKC = NFKD    |
•-----•---------•----------------------------•------•-------------•-------------•-------------•------------------•
|  ΅  |   1FEE  |  GREEK DIALYTIKA AND OXIA  |  Sk  |  canonical  |  0385       |  00A8 0301  |  0020 0308 0301  |
•-----•---------•----------------------------•------•-------------•-------------•-------------•------------------•
</code></pre>
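<p dir="auto">A few rows of the tables above can be spot-checked with a short sketch. It assumes Python's standard <code>unicodedata</code> module, whose Unicode data version may differ from the one BabelPad uses; the helper names <code>hexes</code> and <code>nfc_of</code> are made up for this illustration.</p>

```python
import unicodedata

def hexes(s):
    """Return the code points of s as space-separated hex, as in the tables."""
    return " ".join(f"{ord(c):04X}" for c in s)

def nfc_of(code_point):
    """NFC value of a single code point, formatted as hex."""
    return hexes(unicodedata.normalize("NFC", chr(code_point)))

# One or two characters taken from each of the tables above.
for cp in (0x0340, 0x2126, 0x212B, 0x1F71, 0x2000, 0x1FEE, 0x1D15E):
    print(f"{cp:04X} : {nfc_of(cp)}")
```

<p dir="auto">Each printed pair should agree with the NFC column of the corresponding row.</p>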
<hr />
<p dir="auto">Hope this <strong>helps</strong> you !</p>
<p dir="auto">Cheers,</p>
<p dir="auto">guy038</p>
]]></description><link>https://community.notepad-plus-plus.org/post/67653</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/67653</guid><dc:creator><![CDATA[guy038]]></dc:creator><pubDate>Sat, 03 Jul 2021 03:50:37 GMT</pubDate></item><item><title><![CDATA[Reply to Unicode Normalization on Sat, 03 Jul 2021 11:54:54 GMT]]></title><description><![CDATA[<p dir="auto">Hi, <a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/22097">@xaviermdq</a>,</p>
<p dir="auto">As you are only interested in the <strong><code>Latin</code></strong>, <strong><code>Greek</code></strong>, <strong><code>Cyrillic</code></strong> and <strong><code>Roman</code></strong> scripts, I filtered my <strong>previous</strong> list of <strong><code>16,908</code></strong> chars and obtained a <strong>smaller</strong> file, containing <strong><code>2,635</code></strong> characters !</p>
<p dir="auto">I was able to <strong>classify</strong> all these characters into <strong><code>12</code></strong> categories. Below, you’ll see the <strong>first</strong> character of <strong>each</strong> class :</p>
<pre><code class="language-z">     2,635 characters, from LATIN, GREEK, CYRILLIC and ROMAN scripts, are concerned with DECOMPOSITION MAPPING


        ( NOTE : G.C. means GENERAL CATEGORY, C.P. means CODE-POINT and D.M. means DECOMPOSITION MAPPING )


563 characters with NFC = NFKC = C.P. and NFD = NFKD = D.M. :

•---------•------------------------------------------------------------•------•-------------•--------------------•--------------------•
!   C.P.  |                       Character Name                       | G.C. |  Dec. Type  |  NFC = NFKC = C.P. |  NFD = NFKD = D.M. |
•---------•------------------------------------------------------------•------•-------------•--------------------•--------------------•
|   00C0  |  LATIN CAPITAL LETTER A WITH GRAVE                         |  Lu  |  canonical  |  00C0              |  0041 0300         |
•---------•------------------------------------------------------------•------•-------------•--------------------•--------------------•



18 characters with NFC = NFKC = NFD = NFKD = D.M. :

•---------•-----------------------------------•------•-------------•---------------------------------•
!   C.P.  |          Character Name           | G.C. |  Dec. Type  |  NFC = NFKC = NFD = NFKD = D.M. |
•---------•-----------------------------------•------•-------------•---------------------------------•
|   0340  |  COMBINING GRAVE TONE MARK        |  Mn  |  canonical  |  0300                           |
•---------•-----------------------------------•------•-------------•---------------------------------•



17 characters with NFC = NFKC = D.M. and NFD = NFKD :

•---------•------------------------------------------------------•------•-------------•---------------------•------------------•
!   C.P.  |                    Character Name                    | G.C. |  Dec. Type  |  NFC = NFKC = D.M.  |    NFD = NFKD    |
•---------•------------------------------------------------------•------•-------------•---------------------•------------------•
|   1F71  |  GREEK SMALL LETTER ALPHA WITH OXIA                  |  Ll  |  canonical  |  03AC               |  03B1 0301       |
•---------•------------------------------------------------------•------•-------------•---------------------•------------------•



250 characters with NFC = NFKC = C.P. and NFD = NFKD and D.M. &lt;&gt; from OTHERS :

•---------•----------------------------------------------------------------------------•------•-------------•-------------•--------------------•-----------------------•---------•
!   C.P.  |                               Character Name                               | G.C. |  Dec. Type  |  Dec. Map.  |  NFC = NFKC = C.P. |      NFD = NFKD       |  Code   |
•---------•----------------------------------------------------------------------------•------•-------------•-------------•--------------------•-----------------------•---------•
|   01D5  |  LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON                          |  Lu  |  canonical  |  00DC 0304  |  01D5              |  0055 0308 0304       |   01D5  |
•---------•----------------------------------------------------------------------------•------•-------------•-------------•--------------------•-----------------------•---------•



9 characters with NFC = NFKC = NFD = NFKD and D.M. &lt;&gt; from OTHERS :

•---------•-------------------------------------------------•------•-------------•---------------•---------------------------•
!   C.P.  |                 Character Name                  | G.C. |  Dec. Type  |  Decomp. Map. |  NFC = NFKC = NFD = NFKD  |
•---------•-------------------------------------------------•------•-------------•---------------•---------------------------•
|  1D160  |  MUSICAL SYMBOL EIGHTH NOTE                     |  So  |  canonical  |  1D15F 1D16E  |  1D158 1D165 1D16E        |
•---------•-------------------------------------------------•------•-------------•---------------•---------------------------•



1716 characters with NFC = NFD = C.P. and NFKC = NFKD = D.M. :

•---------•-----------------------------------------------------------------•------•--------------•-------------------•-----------------------•---------•
!   C.P.  |                         Character Name                          | G.C. |  Dec. Type   |  NFC = NFD = C.P. |  NFKC = NFKD = D.M.   |  Code   |
•---------•-----------------------------------------------------------------•------•--------------•-------------------•-----------------------•---------•
|   00A0  |  NO-BREAK SPACE                                                 |  Zs  |  &lt;noBreak&gt;   |  00A0             |  0020                 |   00A0  |
•---------•-----------------------------------------------------------------•------•--------------•-------------------•-----------------------•---------•



3 characters with NFC = NFD = D.M. and NFKC = NFKD :

•---------•------------------•------•-------------•-------------------•---------------•
!   C.P.  |  Character Name  | G.C. |  Dec. Type  |  NFC = NFD = D.M. |  NFKC = NFKD  |
•---------•------------------•------•-------------•-------------------•---------------•
|   1FFD  |  GREEK OXIA      |  Sk  |  canonical  |  00B4             |  0020 0301    |
•---------•------------------•------•-------------•-------------------•---------------•



43 characters with NFC = NFD = C.P. and NFKC = NFKD and D.M. &lt;&gt; from OTHERS :

•---------•------------------------------------------------------------•------•--------------•-------------•-------------------•------------------•---------•
!   C.P.  |                       Character Name                       | G.C. |  Dec. Type   |  Dec. Map.  |  NFC = NFD = C.P. |   NFKC = NFKD    |  Code   |
•---------•------------------------------------------------------------•------•--------------•-------------•-------------------•------------------•---------•
|   FB05  |  LATIN SMALL LIGATURE LONG S T                             |  Ll  |  &lt;compat&gt;    |  017F 0074  |  FB05             |  0073 0074       |   FB05  |
•---------•------------------------------------------------------------•------•--------------•-------------•-------------------•------------------•---------•



3 characters with NFC = NFD = C.P. and NFKC = D.M. :

•---------•---------------------------------------------------------•------•-------------•--------------------•--------------•------------------•---------•
!   C.P.  |                     Character Name                      | G.C. |  Dec. Type  |  NFC = NFD = C.P.  |  NFKC = D.M. |       NFKD       |  Code   |
•---------•---------------------------------------------------------•------•-------------•--------------------•--------------•------------------•---------•
|   01C4  |  LATIN CAPITAL LETTER DZ WITH CARON                     |  Lu  |  &lt;compat&gt;   |  01C4              |  0044 017D   |  0044 005A 030C  |   01C4  |
•---------•---------------------------------------------------------•------•-------------•--------------------•--------------•------------------•---------•



9 characters with NFC = C.P. and NFD = D.M. and NFKC = NFKD :

•---------•-----------------------------------•------•-------------•-------------•-------------•------------------•
!   C.P.  |          Character Name           | G.C. |  Dec. Type  |  NFC = C.P. |  NFD = D.M. |   NFKC = NFKD    |
•---------•-----------------------------------•------•-------------•-------------•-------------•------------------•
|   0385  |  GREEK DIALYTIKA TONOS            |  Sk  |  canonical  |  0385       |  00A8 0301  |  0020 0308 0301  |
•---------•-----------------------------------•------•-------------•-------------•-------------•------------------•



1 character with NFC = D.M. and NFKC = NFKD :

•---------•----------------------------•------•-------------•-------------•-------------•------------------•
!   C.P.  |       Character Name       | G.C. |  Dec. Type  |  NFC = D.M. |     NFD     |   NFKC = NFKD    |
•---------•----------------------------•------•-------------•-------------•-------------•------------------•
|   1FEE  |  GREEK DIALYTIKA AND OXIA  |  Sk  |  canonical  |  0385       |  00A8 0301  |  0020 0308 0301  |
•---------•----------------------------•------•-------------•-------------•-------------•------------------•



3 characters with NFC = C.P. and NFD = D.M. and ALL columns DIFFERENT :

•---------•-----------------------------------------------•------•-------------•-------------•-------------•--------•-------------•
!   C.P.  |                Character Name                 | G.C. |  Dec. Type  |  NFC = C.P. |  NFD = D.M. |  NFKC  |    NFKD     |
•---------•-----------------------------------------------•------•-------------•-------------•-------------•--------•-------------•
|   03D3  |  GREEK UPSILON WITH ACUTE AND HOOK SYMBOL     |  Lu  |  canonical  |  03D3       |  03D2 0301  |  038E  |  03A5 0301  |
•---------•-----------------------------------------------•------•-------------•-------------•-------------•--------•-------------•
</code></pre>
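<p dir="auto">The four columns of these tables can be recomputed for any single character. The sketch below assumes Python's standard <code>unicodedata</code> module; <code>all_forms</code> is a hypothetical helper name, and the code points are first rows taken from some of the classes above.</p>

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def all_forms(code_point):
    """Map each normalization form name to the hex code points it produces."""
    char = chr(code_point)
    return {
        form: " ".join(f"{ord(c):04X}" for c in unicodedata.normalize(form, char))
        for form in FORMS
    }

# First character of several of the classes listed above.
for cp in (0x00C0, 0x01D5, 0x00A0, 0xFB05, 0x01C4, 0x03D3):
    print(f"{cp:04X}", all_forms(cp))
```

<p dir="auto">For <code>03D3</code>, for example, all four values differ, which is why that class has a separate column for every form.</p>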
<hr />
<p dir="auto">Now, as you can see, quite a <strong>lot</strong> of categories have an <strong><code>NFC</code></strong> value strictly <strong>identical</strong> to the <strong><code>C.P.</code></strong> ( code-point ) of the character. If we <strong>omit</strong> all the characters of these categories, <strong>only</strong> <strong><code>48</code></strong> characters remain, which are <strong>changed</strong> when replaced by their <strong><code>NFC</code></strong> value !</p>
<p dir="auto">In the <strong>next</strong> post, you’ll get the list of these <strong><code>48</code></strong> characters which are <strong>modified</strong> when using the option <strong><code>Convert &gt; Normalization Form &gt; To NFC</code></strong> of the <strong>BabelPad</strong> software !</p>
<p dir="auto">Just tell me if you need the <strong>complete</strong> list ( <strong><code>2,635</code></strong> chars ) too. I could send it by <strong>e-mail</strong> !</p>
<p dir="auto">Best Regards,</p>
<p dir="auto">guy038</p>
]]></description><link>https://community.notepad-plus-plus.org/post/67652</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/67652</guid><dc:creator><![CDATA[guy038]]></dc:creator><pubDate>Sat, 03 Jul 2021 11:54:54 GMT</pubDate></item><item><title><![CDATA[Reply to Unicode Normalization on Tue, 29 Jun 2021 22:15:09 GMT]]></title><description><![CDATA[<p dir="auto">Hi guy038,<br />
Sorry for the delay (I missed the email notification). I am using “CYRILLIC - GREEK - LATIN - ROMAN”. But is that important to know? I ask because wouldn’t the normalization function that I would like Notepad++ to have be independent of the character set used? Anyway, I already did the normalization using BabelPad (menu Convert, Normalization form, To NFC). It corrected only 3 composition characters. In the future I’ll use BabelPad for normalization. Thank you very much for the explanations. They were revealing. As soon as I can, I am going to study this matter in more detail.</p>
]]></description><link>https://community.notepad-plus-plus.org/post/67536</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/67536</guid><dc:creator><![CDATA[xaviermdq]]></dc:creator><pubDate>Tue, 29 Jun 2021 22:15:09 GMT</pubDate></item><item><title><![CDATA[Reply to Unicode Normalization on Tue, 01 Jun 2021 14:16:09 GMT]]></title><description><![CDATA[<p dir="auto">Hello, <a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/22097">@xaviermdq</a>,</p>
<p dir="auto">Using the <strong>advanced</strong> search of the <strong><code>BabelMap</code></strong> software and the contents of the <strong><code>NormalizationTest.txt</code></strong> file, which you may download from <a href="https://www.unicode.org/Public/UCD/latest/ucd/" rel="nofollow ugc">here</a>, I’ve begun to build a <strong>complete</strong> list of <strong>Unicode</strong> characters with a <strong><code>Decomposition Mapping</code></strong> property, as well as their <strong><code>NFC</code></strong>, <strong><code>NFD</code></strong>, <strong><code>NFKC</code></strong> and <strong><code>NFKD</code></strong> values !</p>
<p dir="auto">I obtained a list of <strong><code>16,908</code></strong> characters, corresponding to <strong><code>@Part1 # Character by character test</code></strong> of the <strong><code>NormalizationTest.txt</code></strong> file.</p>
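<p dir="auto">As a rough sketch of how such a list can be read by a script: as far as I understand the header of that test file, each data line holds five semicolon-separated columns (source, NFC, NFD, NFKC, NFKD) followed by a comment. The code below assumes that layout and Python's standard <code>unicodedata</code> module; the function names and the sample line are written by hand for illustration and are not copied from the file.</p>

```python
import unicodedata

def parse_test_line(line):
    """Split one 'source;NFC;NFD;NFKC;NFKD; # comment' line into five strings."""
    data = line.split("#", 1)[0]          # drop the trailing comment
    fields = data.split(";")[:5]          # keep the five data columns
    return ["".join(chr(int(h, 16)) for h in field.split()) for field in fields]

def check_line(line):
    """True when unicodedata agrees with all four expected columns."""
    source, nfc, nfd, nfkc, nfkd = parse_test_line(line)
    expected = {"NFC": nfc, "NFD": nfd, "NFKC": nfkc, "NFKD": nfkd}
    return all(unicodedata.normalize(f, source) == v for f, v in expected.items())

# Hand-written sample in the assumed column layout.
sample = "00C0;00C0;0041 0300;00C0;0041 0300; # LATIN CAPITAL LETTER A WITH GRAVE"
print(check_line(sample))
```

<p dir="auto">Running <code>check_line</code> over every data line of the real file would be one way to confirm that a given tool and the file agree, provided both use the same Unicode version.</p>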
<p dir="auto">It would be <strong>sensible</strong> to restrict such a list to the <strong>Unicode script(s)</strong> that you currently use ! So, could you tell me which one(s), from the <strong>list</strong> of scripts below, you want to consider ?</p>
<pre><code class="language-z">CYRILLIC - GREEK - LATIN - ROMAN

ARABIC - ARMENIAN - HEBREW

CJK - HANGUL - HANGZHOU - KANGXI

HIRAGANA - KATAKANA

BALINESE

BENGALI - CHAKMA - DEVANAGARI - DIVEHI AKURU - GRANTHA - GURMUKHI - KAITHI
KANNADA - MALAYALAM - ORIYA - SIDDHAM - SINHALA - TAMIL - TELUGU - TIRHUTA

LAO - MYANMAR - THAI - TIBETAN

TIFINAGH
</code></pre>
<p dir="auto">Best Regards,</p>
<p dir="auto">guy038</p>
]]></description><link>https://community.notepad-plus-plus.org/post/66537</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/66537</guid><dc:creator><![CDATA[guy038]]></dc:creator><pubDate>Tue, 01 Jun 2021 14:16:09 GMT</pubDate></item><item><title><![CDATA[Reply to Unicode Normalization on Mon, 31 May 2021 15:19:42 GMT]]></title><description><![CDATA[<p dir="auto">Hi <a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/195">@guy038</a> :<br />
Sorry, I asked the wrong question because I didn’t understand what was happening. What I really want is a new option (for example “Convert to UTF-8 NFC”, or something like that, in the Encoding menu) that allows me to do canonical normalization. So that you understand what I want, I will show you an example:<br />
The code points:<br />
GREEK CAPITAL LETTER OMEGA , U+03A9 , UTF-8: 0xCE 0xA9<br />
OHM SIGN , U+2126 , UTF-8: 0xE2 0x84 0xA6<br />
refer to the same character, although some fonts (like MS Arial) represent it slightly differently.<br />
If you apply canonical normalization, U+2126 is transformed to U+03A9 (I tested it with BabelPad).</p>
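<p dir="auto">The same example can be reproduced with a small sketch, assuming Python's standard <code>unicodedata</code> module (<code>utf8_hex</code> is a helper name invented here):</p>

```python
import unicodedata

def utf8_hex(s):
    """UTF-8 bytes of s as space-separated upper-case hex."""
    return " ".join(f"{b:02X}" for b in s.encode("utf-8"))

omega = "\u03A9"  # GREEK CAPITAL LETTER OMEGA
ohm = "\u2126"    # OHM SIGN

print(utf8_hex(omega))
print(utf8_hex(ohm))

# Canonical normalization (NFC) replaces the OHM SIGN with the Greek letter.
normalized = unicodedata.normalize("NFC", ohm)
print(f"U+{ord(normalized):04X}")
```

<p dir="auto">The two characters have different bytes before normalization and identical bytes after it.</p>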
<p dir="auto">Thank you for your really comprehensive response.</p>
]]></description><link>https://community.notepad-plus-plus.org/post/66509</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/66509</guid><dc:creator><![CDATA[xaviermdq]]></dc:creator><pubDate>Mon, 31 May 2021 15:19:42 GMT</pubDate></item><item><title><![CDATA[Reply to Unicode Normalization on Wed, 30 Jun 2021 03:52:09 GMT]]></title><description><![CDATA[<p dir="auto">Hello, <a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/22097">@xaviermdq</a>,</p>
<p dir="auto">I don’t think that character <strong>decomposition</strong> and <strong>encodings</strong> are related, in any way !</p>
<p dir="auto">Whatever the <strong>Unicode</strong> encoding used, the <strong>encoding</strong> process simply writes the <strong>appropriate</strong> byte(s) in order to encode each <strong>individual</strong> character.</p>
<p dir="auto">By contrast, the <strong>Unicode Normalization</strong> forms deal with :</p>
<ul>
<li>
<p dir="auto"><strong>Composition</strong> of characters into some <strong>pre-composed</strong> characters</p>
</li>
<li>
<p dir="auto"><strong>Decomposition</strong> of characters into their <strong>base</strong> letter and some <strong>combining</strong> characters in a <strong>specific</strong> order</p>
</li>
</ul>
<hr />
<p dir="auto">For instance, let us consider the <em>LATIN SMALL LETTER E</em>  <strong><code>e</code></strong>, of code-point <strong><code>U+0065</code></strong>. Starting with this <strong>base</strong> letter, we may consider the <strong>related</strong> characters, below :</p>
<pre><code class="language-z">•----------•-----------------•------------------------------•------•------------•------•
|  String  | Char(s) Number  |        Decomposition         |   &gt;  |      e     |   &lt;  |
•----------•-----------------•------------------------------•------•------------•------•
|   &gt;e&lt;    |        3        |  U+003E    U+0065    U+003C  |  3E  |     65     |  3C  |
•----------•-----------------•------------------------------•------•------------•------•


•----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•
|  String  | Char(s) Number  |                    Decomposition                     |   &gt;  |      e     |     ́    |     ̂    |   &lt;  |
•----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•
|   &gt;é̂&lt;   |        5        |  U+003E  U+0065 (e)  U+0301 ( ́)  U+0302 ( ̂)  U+003C  |  3E  |     65     |  CC 81  |  CC 82  |  3C  |
•----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•


•----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•
|  String  | Char(s) Number  |                    Decomposition                     |   &gt;  |      e     |     ̂    |     ́    |   &lt;  |
•----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•
|   &gt;ế&lt;    |        5        |  U+003E  U+0065 (e)  U+0302 ( ̂)  U+0301 ( ́)  U+003C  |  3E  |     65     |  CC 82  |  CC 81  |  3C  |
•----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•


•----------•-----------------•--------------------------------------------•------•------------•---------•------•
|  String  | Char(s) Number  |               Decomposition                |   &gt;  |      ê     |     ́    |   &lt;  |
•----------•-----------------•--------------------------------------------•------•------------•---------•------•
|   &gt;ế&lt;   |        4        |  U+003E    U+00EA (ê)    U+0301    U+003C  |  3E  |     EA     |  CC 81  |  3C  |
•----------•-----------------•--------------------------------------------•------•------------•---------•------•

•----------•-----------------•--------------------------------------------•------•------------•---------•------•
|  String  | Char(s) Number  |               Decomposition                |   &gt;  |      é     |     ̂    |   &lt;  |
•----------•-----------------•--------------------------------------------•------•------------•---------•------•
|   &gt;é̂&lt;    |        4        |  U+003E    U+00E9 (é)    U+0302    U+003C  |  3E  |     E9     |  CC 82  |  3C  |
•----------•-----------------•--------------------------------------------•------•------------•---------•------•


•----------•-----------------•------------------------------•------•------------•------•
|  String  | Char(s) Number  |        Decomposition         |   &gt;  |      ế     |   &lt;  |
•----------•-----------------•------------------------------•------•------------•------•
|   &gt;ế&lt;    |        3        |  U+003E    U+1EBF    U+003C  |  3E  |  E1 BA BF  |  3C  |
•----------•-----------------•------------------------------•------•------------•------•
</code></pre>
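<p dir="auto">To see how normalization treats the five accented spellings above, here is a minimal sketch assuming Python's standard <code>unicodedata</code> module. Note that the two combining marks have the same combining class, so their order is significant and the five spellings fall into two groups, not one.</p>

```python
import unicodedata

variants = [
    "e\u0301\u0302",  # e + acute + circumflex
    "e\u0302\u0301",  # e + circumflex + acute
    "\u00EA\u0301",   # e-circumflex + acute
    "\u00E9\u0302",   # e-acute + circumflex
    "\u1EBF",         # precomposed e with circumflex and acute
]

# NFC value of each spelling.
nfc_forms = [unicodedata.normalize("NFC", v) for v in variants]

for original, composed in zip(variants, nfc_forms):
    before = " ".join(f"{ord(c):04X}" for c in original)
    after = " ".join(f"{ord(c):04X}" for c in composed)
    print(f"{before:16} : {after}")
```

<p dir="auto">Spellings where the circumflex comes first compose to the single code point <code>1EBF</code>; spellings where the acute comes first compose to <code>00E9 0302</code>, because no precomposed character exists for that order.</p>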
<p dir="auto">Note that I placed each string, composed of a <strong>base</strong> letter and <strong>possible</strong> diacritic signs, <strong>between</strong> the delimiters <strong><code>&gt;</code></strong> and <strong><code>&lt;</code></strong> for an <strong>exact</strong> search !</p>
<ul>
<li>
<p dir="auto"><strong>Paste</strong> the text, above, in a <strong>new</strong> N++ tab</p>
</li>
<li>
<p dir="auto">Open the <strong><code>Mark</code></strong> dialog</p>
</li>
<li>
<p dir="auto">SEARCH : <strong>Successively</strong> try the <strong>six</strong> regex syntaxes, below :</p>
<ul>
<li>
<p dir="auto">(<strong>A</strong>) <strong><code>(?-s)&gt;.&lt;</code></strong></p>
</li>
<li>
<p dir="auto">(<strong>B</strong>) <strong><code>(?-s)&gt;..&lt;</code></strong></p>
</li>
<li>
<p dir="auto">(<strong>C</strong>) <strong><code>(?-s)&gt;...&lt;</code></strong></p>
</li>
<li>
<p dir="auto">(<strong>D</strong>) <strong><code>&gt;&lsqb;&lsqb;=e=&rsqb;&rsqb;&lt;</code></strong></p>
</li>
<li>
<p dir="auto">(<strong>E</strong>) <strong><code>&gt;(?=e)\X&lt;</code></strong></p>
</li>
<li>
<p dir="auto">(<strong>F</strong>) <strong><code>&gt;(?=&lsqb;&lsqb;=e=&rsqb;&rsqb;)\X&lt;</code></strong></p>
</li>
</ul>
</li>
<li>
<p dir="auto">Tick the <strong><code>Purge for each search</code></strong> and <strong><code>Wrap around</code></strong> options</p>
</li>
<li>
<p dir="auto"><strong>Un</strong>-tick <strong>all</strong> other options</p>
</li>
<li>
<p dir="auto">Select the <strong><code>Regular expression</code></strong> search <strong>mode</strong></p>
</li>
<li>
<p dir="auto">Click on the <strong><code>Mark All</code></strong> button</p>
</li>
</ul>
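<p dir="auto">To cross-check the tables outside N++, here is a minimal Python sketch, purely illustrative and unrelated to how N++ works internally. It lists the number of code points, the code points themselves and the UTF-8 bytes of three of the forms tabulated above. The helper name <code>describe</code> and the choice of samples are assumptions made for this example.</p>
<pre><code class="language-python">def describe(text):
    """Return (number of code points, code points, UTF-8 bytes in hex)."""
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    utf8_hex = text.encode("utf-8").hex(" ").upper()
    return len(text), code_points, utf8_hex


# Escapes are used so that no editor or browser can silently
# renormalize the sample strings.
samples = [
    "\u00ea\u0301",  # e-circumflex + combining acute
    "\u00e9\u0302",  # e-acute + combining circumflex
    "\u1ebf",        # precomposed e with circumflex and acute
]

for sample in samples:
    count, code_points, utf8_hex = describe(sample)
    print(count, "|", code_points, "|", utf8_hex)
</code></pre>
<p dir="auto">The count is the number of chars found between the delimiters, which is why the regexes <strong>A</strong>, <strong>B</strong> and <strong>C</strong> each match different forms of a letter that looks the same on screen.</p>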
<hr />
<p dir="auto"><strong>Notes</strong> :</p>
<ul>
<li>
<p dir="auto">The regex <strong>A</strong> finds the strings containing <strong>one</strong> char, between the <strong>delimiters</strong> <strong><code>&gt;</code></strong> and <strong><code>&lt;</code></strong>, so <strong><code>3</code></strong> chars in <strong>total</strong>. It matches, of course, the string <strong><code>&gt;e&lt;</code></strong> and the string <strong><code>&gt;ế&lt;</code></strong>, containing the <strong>Vietnamese</strong> letter <strong><code>ế</code></strong></p>
</li>
<li>
<p dir="auto">The regex <strong>B</strong> finds the strings containing <strong>two</strong> chars, between the <strong>delimiters</strong> <strong><code>&gt;</code></strong> and <strong><code>&lt;</code></strong>, so <strong><code>4</code></strong> chars in <strong>total</strong>. It matches the strings <strong><code>&gt;ế&lt;</code></strong> and <strong><code>&gt;é̂&lt;</code></strong> which contain an <strong>accented</strong> char with an additional <strong>diacritic</strong> character</p>
</li>
<li>
<p dir="auto">The regex <strong>C</strong> finds the strings containing <strong>three</strong> chars, between the <strong>delimiters</strong> <strong><code>&gt;</code></strong> and <strong><code>&lt;</code></strong>, so <strong><code>5</code></strong> chars in <strong>total</strong>. It matches the strings <strong><code>&gt;é̂&lt;</code></strong> and <strong><code>&gt;ế&lt;</code></strong>, which contain the <strong>base</strong> letter <strong><code>e</code></strong> and <strong>two diacritic</strong> characters, in a <strong>different</strong> order</p>
</li>
<li>
<p dir="auto">The regex <strong>D</strong> finds all the <strong>individual</strong> characters <strong>equivalent</strong> to the <strong>base</strong> letter <strong><code>e</code></strong> between the <strong>delimiters</strong> <strong><code>&gt;</code></strong> and <strong><code>&lt;</code></strong>, so <strong><code>3</code></strong> chars in <strong>total</strong>. Like the regex <strong>A</strong>, it matches a <strong>single</strong> character, related to the <strong><code>e</code></strong> letter, plus the delimiters</p>
</li>
<li>
<p dir="auto">In the regex <strong>E</strong>, we use a specific syntax <strong><code>\X</code></strong> which matches any <strong>base</strong> character, followed by <strong>zero</strong> or <strong>more combining</strong> characters ( <em>diacritical marks</em> or others ). But as we just want to <strong>focus</strong> on the letter <strong><code>e</code></strong> we place, before <strong><code>\X</code></strong>, a <strong>look-ahead</strong> <strong><code>(?=e)</code></strong> which forces the regex engine to match this <strong>base</strong> letter <strong><code>e</code></strong> and any <strong>combining</strong> characters following it. So, it matches the <strong>first <code>3</code></strong> cases only !</p>
</li>
<li>
<p dir="auto">In the regex <strong>F</strong>, we again use the <strong><code>\X</code></strong> syntax which finds <strong>any</strong> char followed by <strong>possible combining</strong> characters. But, this time, we change the <strong>look-ahead</strong> to <strong><code>(?=&lsqb;&lsqb;=e=&rsqb;&rsqb;)</code></strong> which forces the regex engine to match any char <strong>equivalent</strong> to the letter <strong><code>e</code></strong>. Refer to the <strong>end</strong> of this <a href="https://community.notepad-plus-plus.org/post/66324">post</a> for <strong>further</strong> explanation. As you can see, this regex does find <em>ALL</em> the above cases:-))</p>
</li>
</ul>
<p dir="auto">This regex leads to the following <strong>generic</strong> regex :    <strong><code>(?=&lsqb;&lsqb;=</code>C<code>=&rsqb;&rsqb;)\X</code></strong></p>
<p dir="auto">which matches any character <strong><code>C</code></strong>, whatever its <strong>case</strong>, together with any <strong>combining diacritical</strong> marks that <strong>follow</strong> it</p>
<p dir="auto">For instance :</p>
<ul>
<li>
<p dir="auto">The regex <strong><code>(?=&lsqb;&lsqb;=3=&rsqb;&rsqb;)\X</code></strong> does match the character <strong><code>3̯̿</code></strong>, composed of the <strong>base</strong> digit <strong><code>3</code></strong> and <strong>two combining</strong> marks</p>
</li>
<li>
<p dir="auto">The regex <strong><code>(?=&lsqb;&lsqb;=$=&rsqb;&rsqb;)\X</code></strong> does match the character  <strong><code>$̶̳̚</code></strong> composed of the <strong>base</strong> symbol <strong><code>$</code></strong> and <strong>three combining</strong> marks</p>
</li>
</ul>
<p dir="auto">Test these <strong>two</strong> regexes against this text :</p>
<pre><code class="language-diff">3̯̿

$̶̳̚
</code></pre>
<p dir="auto">Most of the <strong>combining</strong> characters can be found in the <strong><code>Combining Diacritical Marks</code></strong> Unicode <strong>block</strong>, in the <strong>range</strong> <strong><code>[\x{0300}-\x{036F}]</code></strong>, below :</p>
<p dir="auto"><a href="https://www.unicode.org/charts/PDF/U0300.pdf" rel="nofollow ugc">https://www.unicode.org/charts/PDF/U0300.pdf</a></p>
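<p dir="auto">For readers who want to try the same idea outside N++, here is a rough Python sketch of the generic regex above. Python's built-in <code>re</code> module supports neither <code>\X</code> nor <code>&lsqb;&lsqb;=e=&rsqb;&rsqb;</code>, so the sketch only <strong>approximates</strong> them: a cluster is taken to be one char plus the combining marks ( general categories <code>Mn</code>, <code>Mc</code>, <code>Me</code> ) that follow it, and "equivalent to" is taken to mean "has the same first char after <code>NFD</code> decomposition, ignoring case". The function names <code>clusters</code>, <code>base_letter</code> and <code>find_like</code> are hypothetical, and real grapheme and collation rules are more involved than this.</p>
<pre><code class="language-python">import unicodedata


def clusters(text):
    """Split text into chunks: one char plus the combining marks after it."""
    result = []
    for ch in text:
        # General categories Mn, Mc and Me are the combining marks.
        if result and unicodedata.category(ch).startswith("M"):
            result[-1] += ch
        else:
            result.append(ch)
    return result


def base_letter(cluster):
    """First code point of the fully decomposed (NFD) form of the cluster."""
    return unicodedata.normalize("NFD", cluster)[0]


def find_like(text, letter):
    """Clusters whose decomposed base equals letter, ignoring case."""
    wanted = letter.casefold()
    return [c for c in clusters(text) if base_letter(c).casefold() == wanted]


# Digit 3 with two combining marks, then two forms of the Vietnamese letter.
sample = "3\u032f\u033f \u00ea\u0301 \u1ebf"
print(len(find_like(sample, "3")), len(find_like(sample, "e")))
</code></pre>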
<hr />
<p dir="auto">So, <a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/22097">@xaviermdq</a>, as you can see, we <strong>never</strong> worried about the <strong>exact</strong> bytes used by the <strong><code>UTF-8</code></strong> encoding !</p>
<p dir="auto">Apparently, you wish to replace each sequence of <strong>decomposed consecutive</strong> characters with its <strong>precomposed equivalent</strong> character, if there is one !? This goal could be achieved with <strong>regexes</strong> !</p>
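<p dir="auto">For comparison, this kind of replacement is what <strong><code>NFC</code></strong> normalization does. A minimal Python sketch, run outside N++ and assuming only the standard <code>unicodedata</code> module ( the helper name <code>to_nfc</code> is made up for the example ), could look like this :</p>
<pre><code class="language-python">import unicodedata


def to_nfc(text):
    """Compose base letters and combining marks wherever Unicode allows it."""
    return unicodedata.normalize("NFC", text)


samples = [
    "e\u0302\u0301",  # e + circumflex + acute
    "\u00ea\u0301",   # e-circumflex + acute
    "\u1ebf",         # already precomposed
    "e\u0301\u0302",  # e + acute + circumflex (marks in the other order)
]

for sample in samples:
    composed = to_nfc(sample)
    print(" ".join(f"U+{ord(ch):04X}" for ch in composed))
</code></pre>
<p dir="auto">The first three samples all become the single char <code>U+1EBF</code>. The last one only composes to <code>U+00E9 U+0302</code>, because the order of the two marks matters and Unicode has no precomposed char for that sequence.</p>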
<p dir="auto">Just tell me some more <strong>details</strong> about your needs, and also, your usual working Unicode <strong>script(s)</strong> : <strong><code>Latin, Cyrillic, Hebrew, Arabic, CJK, ...</code></strong> !</p>
<p dir="auto">Best Regards,</p>
<p dir="auto">guy038</p>
]]></description><link>https://community.notepad-plus-plus.org/post/66384</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/66384</guid><dc:creator><![CDATA[guy038]]></dc:creator><pubDate>Wed, 30 Jun 2021 03:52:09 GMT</pubDate></item><item><title><![CDATA[Reply to Unicode Normalization on Thu, 27 May 2021 14:37:37 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/22097">@xaviermdq</a> ,</p>
<p dir="auto">I don’t know what <a href="https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization" rel="nofollow ugc">unicode normalization</a> is used.</p>
<p dir="auto">Unfortunately, the primary developer and most of the other volunteer contributors don’t regularly read this forum, so I don’t know if they’ll ever see this question.  I don’t know if any of the regulars in this Forum have studied the guts of the Notepad++/Scintilla UTF-8 handling enough to know how to answer that question.</p>
<p dir="auto">I’d suggest waiting for another reply here, in case someone has studied it more than I’d previously gathered. But if you don’t get a reply after a reasonable wait, you might consider going to the <a href="https://github.com/notepad-plus-plus/notepad-plus-plus/issues" rel="nofollow ugc">github issues location</a> and asking this question there – because there is hopefully a developer who knows enough about the guts to answer over there.</p>
]]></description><link>https://community.notepad-plus-plus.org/post/66343</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/66343</guid><dc:creator><![CDATA[PeterJones]]></dc:creator><pubDate>Thu, 27 May 2021 14:37:37 GMT</pubDate></item></channel></rss>