Unicode Normalization

  • X
    xaviermdq
    last edited by May 27, 2021, 2:09 PM

    Hi,
    Which Unicode normalization form is used when creating a new UTF-8 document (or converting one to UTF-8)? Can I change this with an INI or GUI option? I want the NFC form.
    Thanks,
    Javier

    • P
      PeterJones @xaviermdq
      last edited by May 27, 2021, 2:37 PM

      @xaviermdq ,

      I don’t know what unicode normalization is used.

      Unfortunately, the primary developer and most of the other volunteer contributors don’t regularly read this forum, so I don’t know if they’ll ever see this question. I don’t know if any of the regulars in this Forum have studied the guts of the Notepad++/Scintilla UTF-8 handling enough to know how to answer that question.

      I’d suggest waiting for another reply here, in case someone has studied it more than I’d previously gathered. But if you don’t get a reply after a reasonable wait, you might consider going to the github issues location and asking this question there – because there is hopefully a developer who knows enough about the guts to answer over there.

      • G
        guy038
        last edited by guy038 Jun 30, 2021, 3:52 AM May 27, 2021, 10:06 PM

        Hello, @xaviermdq,

        I don’t think that character decomposition and encodings are related, in any way !

        Whatever the Unicode encoding used (UTF-8, UTF-16, ...), the encoding process simply writes the appropriate byte(s) for each individual code point, whether it is a base letter, a pre-composed character or a combining mark.

        By contrast, the Unicode Normalization forms rather deal with :

        • Composition of characters into some pre-composed characters

        • Decomposition of characters into their base letter and some combining characters in a specific order
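The two operations listed above can be tried outside Notepad++. The following is a minimal sketch, assuming Python and its standard `unicodedata` module (neither is used elsewhere in this thread); it composes and decomposes the letter é:

```python
import unicodedata

decomposed = "e\u0301"    # e + COMBINING ACUTE ACCENT
precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE

# Composition (NFC): base letter + combining mark -> one pre-composed character
composed = unicodedata.normalize("NFC", decomposed)

# Decomposition (NFD): pre-composed character -> base letter + combining mark
split = unicodedata.normalize("NFD", precomposed)

print([f"U+{ord(c):04X}" for c in composed])   # ['U+00E9']
print([f"U+{ord(c):04X}" for c in split])      # ['U+0065', 'U+0301']
```

Both strings display as the same letter, but they are different sequences of code points until one of the normalization forms is applied.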


        For instance, let us consider the LATIN SMALL LETTER E, of code-point U+0065. Starting with this base letter, we may consider the related characters, below :

        •----------•-----------------•------------------------------•------•------------•------•
        |  String  | Char(s) Number  |        Decomposition         |   >  |      e     |   <  |
        •----------•-----------------•------------------------------•------•------------•------•
        |   >e<    |        3        |  U+003E    U+0065    U+003C  |  3E  |     65     |  3C  |
        •----------•-----------------•------------------------------•------•------------•------•
        
        
        •----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•
        |  String  | Char(s) Number  |                    Decomposition                     |   >  |      e     |     ́    |     ̂    |   <  |
        •----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•
        |   >é̂<   |        5        |  U+003E  U+0065 (e)  U+0301 ( ́)  U+0302 ( ̂)  U+003C  |  3E  |     65     |  CC 81  |  CC 82  |  3C  |
        •----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•
        
        
        •----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•
        |  String  | Char(s) Number  |                    Decomposition                     |   >  |      e     |     ̂    |     ́    |   <  |
        •----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•
        |   >ế<    |        5        |  U+003E  U+0065 (e)  U+0302 ( ̂)  U+0301 ( ́)  U+003C  |  3E  |     65     |  CC 82  |  CC 81  |  3C  |
        •----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•
        
        
        •----------•-----------------•--------------------------------------------•------•------------•---------•------•
        |  String  | Char(s) Number  |               Decomposition                |   >  |      ê     |     ́    |   <  |
        •----------•-----------------•--------------------------------------------•------•------------•---------•------•
        |   >ế<   |        4        |  U+003E    U+00EA (ê)    U+0301    U+003C  |  3E  |   C3 AA    |  CC 81  |  3C  |
        •----------•-----------------•--------------------------------------------•------•------------•---------•------•
        
        •----------•-----------------•--------------------------------------------•------•------------•---------•------•
        |  String  | Char(s) Number  |               Decomposition                |   >  |      é     |     ̂    |   <  |
        •----------•-----------------•--------------------------------------------•------•------------•---------•------•
        |   >é̂<    |        4        |  U+003E    U+00E9 (é)    U+0302    U+003C  |  3E  |   C3 A9    |  CC 82  |  3C  |
        •----------•-----------------•--------------------------------------------•------•------------•---------•------•
        
        
        •----------•-----------------•------------------------------•------•------------•------•
        |  String  | Char(s) Number  |        Decomposition         |   >  |      ế     |   <  |
        •----------•-----------------•------------------------------•------•------------•------•
        |   >ế<    |        3        |  U+003E    U+1EBF    U+003C  |  3E  |  E1 BA BF  |  3C  |
        •----------•-----------------•------------------------------•------•------------•------•
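
The code points and UTF-8 bytes in the tables above can be reproduced with a short script. This is only a sketch, assuming Python 3.8 or later; the helper name `describe` is made up for illustration. It also shows the NFC form of each string:

```python
import unicodedata

# The six strings from the tables, without the > and < delimiters
samples = [
    "e",
    "e\u0301\u0302",   # e + acute + circumflex
    "e\u0302\u0301",   # e + circumflex + acute
    "\u00ea\u0301",    # ê + acute
    "\u00e9\u0302",    # é + circumflex
    "\u1ebf",          # pre-composed ế
]

def describe(text):
    """Return (code points, UTF-8 bytes in hex, NFC code points) for a string."""
    cps = " ".join(f"U+{ord(c):04X}" for c in text)
    utf8 = text.encode("utf-8").hex(" ").upper()
    nfc = " ".join(f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", text))
    return cps, utf8, nfc

for s in samples:
    print(describe(s))
```

With this sketch, e + U+0302 + U+0301, ê + U+0301 and ế all end up as U+1EBF in NFC, whereas e + U+0301 + U+0302 and é + U+0302 end up as U+00E9 U+0302, because the order of the two marks is significant.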
        

        Note that I placed each string, composed of a base letter and possible diacritic signs, between the delimiters > and < for an exact search !

        • Paste the text, above, in a new N++ tab

        • Open the Mark dialog

        • SEARCH : Successively try the six regex syntaxes, below :

          • (A) (?-s)>.<

          • (B) (?-s)>..<

          • (C) (?-s)>...<

          • (D) >[[=e=]]<

          • (E) >(?=e)\X<

          • (F) >(?=[[=e=]])\X<

        • Tick the Purge for each search and Wrap around options

        • Un-tick all other options

        • Select the Regular expression search mode

        • Click on the Mark All button


        Notes :

        • The regex A finds the strings containing one char between the delimiters > and <, so 3 chars in total. It matches, of course, the string >e< and the string >ế< (U+1EBF), containing the pre-composed Vietnamese letter ế

        • The regex B finds the strings containing two chars between the delimiters > and <, so 4 chars in total. It matches the strings >ế< (U+00EA U+0301) and >é̂< (U+00E9 U+0302), each of which contains an accented char followed by an additional diacritic character

        • The regex C finds the strings containing three chars between the delimiters > and <, so 5 chars in total. It matches the strings >é̂< (U+0065 U+0301 U+0302) and >ế< (U+0065 U+0302 U+0301), which contain the base letter e and two diacritic characters, in a different order

        • The regex D finds every individual character equivalent to the base letter e between the delimiters > and <, so 3 chars in total. Like the regex A, it matches a single character, related to the letter e, and the delimiters

        • In the regex E, we use the specific syntax \X, which matches any base character together with the combining characters ( diacritical marks or others ) that may follow it. But as we just want to focus on the letter e, we place, before \X, a look-ahead (?=e) which forces the regex engine to match this base letter e and the possible combining characters following it. So, it matches the first 3 cases only !

        • In the regex F, we use again the \X syntax, which finds any char followed by possible combining characters. But, this time, we change the look-ahead to (?=[[=e=]]), which forces the regex engine to match any char equivalent to the letter e. See the end of this post for further explanation. As you can see, this regex does find ALL the above cases :-))

        This regex leads to the following generic regex :    (?=[[=C=]])\X

        which matches any character C, whatever its case, followed with some combining diacritical marks

        For instance :

        • The regex (?=[[=3=]])\X does match the character 3̯̿, composed of the base digit 3 and two combining marks

        • The regex (?=[[=$=]])\X does match the character $̶̳̚ composed of the base symbol $ and three combining marks

        Test these two regexes against this text :

        3̯̿
        
        $̶̳̚
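
For readers who want to test the same idea outside Notepad++, here is a rough sketch in Python. It is only an approximation: Python's standard `re` module supports neither `\X` nor `[[=C=]]`, so the hypothetical helpers `split_clusters` and `clusters_like` emulate them with `unicodedata` (a base character plus the combining marks that follow it, compared through its NFD decomposition). The exact combining marks in the sample strings are also an assumption.

```python
import unicodedata

def split_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # non-zero combining class: attach to the base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

def clusters_like(text, base):
    """Keep the clusters whose decomposition starts with `base`, ignoring case."""
    wanted = base.casefold()
    return [c for c in split_clusters(text)
            if unicodedata.normalize("NFD", c)[0].casefold() == wanted]

sample = ">e\u0301\u0302< >\u1ebf< >a< 3\u032f\u033f $\u0336\u0333\u031a"
print(clusters_like(sample, "e"))   # the two e-based clusters
print(clusters_like(sample, "3"))   # the digit 3 with its two marks
print(clusters_like(sample, "$"))   # the dollar sign with its three marks
```

This simplified grouping is not the full grapheme-cluster definition that `\X` implements, and the equivalence test is narrower than `[[=C=]]`, so results may differ from Notepad++ on unusual input.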
        

        Most of the combining characters can be found in the Combining Diacritical Marks Unicode block, in the range [\x{0300}-\x{036F}], below :

        https://www.unicode.org/charts/PDF/U0300.pdf


        So, @xaviermdq, as you can see, we never worried about the exact bytes used by the UTF-8 encoding !

        Apparently, you wish to replace some decomposed consecutive characters by a precomposed equivalent character, if any !? This goal could be achieved with regexes !

        Just tell me some more details about your needs, and also, your usual working Unicode script(s) : Latin, Cyrillic, Hebrew, Arabic, CJK, ... !

        Best Regards,

        guy038

        • X
          xaviermdq
          last edited by May 31, 2021, 3:19 PM

          Hi @guy038 :
          Sorry, I asked the wrong question because I didn’t understand what was happening. What I really want is a new option (for example “Convert to UTF-8 NFC”, or something like that, in Encoding menu) that allows me to do canonical normalization. So that you understand what I want, I will show you an example:
          The code points:
          GREEK CAPITAL LETTER OMEGA , U+03A9 , UTF-8: 0xCE 0xA9
          OHM SIGN , U+2126 , UTF-8: 0xE2 0x84 0xA6
          refer to the same character, although some fonts (like MS Arial) represent it slightly differently.
          If you apply canonical normalization, U+2126 is transformed to U+03A9 (I tested it with BabelPad).
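
The same check can be done without BabelPad. A minimal sketch, assuming Python 3.8 or later and its standard `unicodedata` module (not something Notepad++ itself provides):

```python
import unicodedata

ohm = "\u2126"      # OHM SIGN
omega = "\u03a9"    # GREEK CAPITAL LETTER OMEGA

print(ohm.encode("utf-8").hex(" ").upper())     # E2 84 A6
print(omega.encode("utf-8").hex(" ").upper())   # CE A9

normalized = unicodedata.normalize("NFC", ohm)
print(normalized == omega)                      # True
print(unicodedata.is_normalized("NFC", ohm))    # False
```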

          Thank you for your really comprehensive response.

          • G
            guy038
            last edited by Jun 1, 2021, 2:16 PM

            Hello, @xaviermdq,

            I’ve begun, with the advanced search of the BabelMap software and the contents of the NormalizationTest.txt file, that you may download from here, to build a complete list of Unicode characters with a Decomposition Mapping property, as well as their NFC, NFD, NFKC and NFKD values !

            I obtained a list of 16,908 characters, corresponding to @Part1 # Character by character test of the NormalizationTest.txt file.
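
As an illustration of how such a list could be built or cross-checked, here is a hedged sketch in Python. It assumes that each data line of NormalizationTest.txt holds five semicolon-separated columns of hexadecimal code points (source, NFC, NFD, NFKC, NFKD) followed by a `#` comment; the function names and the sample line are hypothetical and written by hand in that layout, not copied from the file.

```python
import unicodedata

def parse_test_line(line):
    """Split one data line into its five columns (source, NFC, NFD, NFKC, NFKD)."""
    data = line.split("#", 1)[0]          # drop the trailing comment
    fields = data.split(";")[:5]
    return ["".join(chr(int(cp, 16)) for cp in f.split()) for f in fields]

def check_line(line):
    """Compare the expected forms of a line with what unicodedata produces."""
    source, nfc, nfd, nfkc, nfkd = parse_test_line(line)
    expected = {"NFC": nfc, "NFD": nfd, "NFKC": nfkc, "NFKD": nfkd}
    return {form: unicodedata.normalize(form, source) == value
            for form, value in expected.items()}

# Hand-written sample in the assumed layout, for the ANGSTROM SIGN
sample = "212B;00C5;0041 030A;00C5;0041 030A; # ANGSTROM SIGN"
print(check_line(sample))
```

When reading the real file, lines that start with `#` or `@` (comments and part headers) would have to be skipped before calling these helpers.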

            It would be sensible to restrict such a list to the Unicode script(s) that you currently use ! So, could you tell me which one(s), from the list of scripts below, you want to consider ?

            CYRILLIC - GREEK - LATIN - ROMAN
            
            ARABIC - ARMENIAN - HEBREW
            
            CJK - HANGUL - HANGZHOU - KANGXI
            
            HIRAGANA - KATAKANA
            
            BALINESE
            
            BENGALI - CHAKMA - DEVANAGARI - DIVEHI AKURU - GRANTHA - GURMUKHI - KAITHI
            KANNADA - MALAYALAM - ORIYA - SIDDHAM - SINHALA - TAMIL - TELUGU - TIRHUTA
            
            LAO - MYANMAR - THAI - TIBETAN
            
            TIFINAGH
            

            Best Regards,

            guy038

            • X
              xaviermdq
              last edited by Jun 29, 2021, 10:15 PM

              Hi guy038,
              Sorry for the delay (I missed the email notification). I am using “CYRILLIC - GREEK - LATIN - ROMAN”. But is it important to know? I ask because the normalization function that I would like Notepad++ to have would be independent of the script used, wouldn’t it? Anyway, I already did the normalization using BabelPad (menu Convert, Normalization form, To NFC). It only corrected 3 composition characters. In the future I’ll use BabelPad for normalization. Thank you very much for the explanations. They were revealing. As soon as I can, I am going to study this matter in more detail.
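
For comparison, a whole-file conversion to NFC, independent of the script used, can be sketched in a few lines of Python. This is not a Notepad++ feature and not what BabelPad does internally; the function name `normalize_file_to_nfc` and the file names are hypothetical.

```python
import unicodedata

def normalize_file_to_nfc(src, dst):
    """Read a UTF-8 file, apply NFC, write the result; return the number of changed lines."""
    # newline="" keeps the original line endings untouched
    with open(src, encoding="utf-8", newline="") as f:
        text = f.read()
    normalized = unicodedata.normalize("NFC", text)
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.write(normalized)
    # NFC never alters line breaks, so both versions have the same lines
    return sum(a != b for a, b in zip(text.splitlines(), normalized.splitlines()))

# Example call (hypothetical file names):
# changed = normalize_file_to_nfc("input.txt", "output_nfc.txt")
```

The returned count gives a rough idea of how much the file was affected, similar to noticing that only a few characters were corrected.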

              • G
                guy038
                last edited by guy038 Jul 3, 2021, 11:54 AM Jul 3, 2021, 3:48 AM

                Hi, @xaviermdq,

                As you are only interested in the Latin, Greek, Cyrillic and Roman scripts, I filtered my previous list of 16,908 chars and obtained a smaller file, containing 2,635 characters !

                I was able to classify all these characters into 12 categories. Below, you’ll see the first character of each category :

                     2,635 characters, from LATIN, GREEK, CYRILLIC and ROMAN scripts, are concerned with DECOMPOSITION MAPPING
                
                
                        ( NOTE : G.C. means GENERAL CATEGORY, C.P. means CODE-POINT and D.M. means DECOMPOSITION MAPPING )
                
                
                563 characters with NFC = NFKC = C.P. and NFD = NFKD = D.M. :
                
                •---------•------------------------------------------------------------•------•-------------•--------------------•--------------------•
                !   C.P.  |                       Character Name                       | G.C. |  Dec. Type  |  NFC = NFKC = C.P. |  NFD = NFKD = D.M. |
                •---------•------------------------------------------------------------•------•-------------•--------------------•--------------------•
                |   00C0  |  LATIN CAPITAL LETTER A WITH GRAVE                         |  Lu  |  canonical  |  00C0              |  0041 0300         |
                •---------•------------------------------------------------------------•------•-------------•--------------------•--------------------•
                
                
                
                18 characters with NFC = NFKC = NFD = NFKD = D.M. :
                
                •---------•-----------------------------------•------•-------------•---------------------------------•
                !   C.P.  |          Character Name           | G.C. |  Dec. Type  |  NFC = NFKC = NFD = NFKD = D.M. |
                •---------•-----------------------------------•------•-------------•---------------------------------•
                |   0340  |  COMBINING GRAVE TONE MARK        |  Mn  |  canonical  |  0300                           |
                •---------•-----------------------------------•------•-------------•---------------------------------•
                
                
                
                17 characters with NFC = NFKC = D.M. and NFD = NFKD :
                
                •---------•------------------------------------------------------•------•-------------•---------------------•------------------•
                !   C.P.  |                    Character Name                    | G.C. |  Dec. Type  |  NFC = NFKC = D.M.  |    NFD = NFKD    |
                •---------•------------------------------------------------------•------•-------------•---------------------•------------------•
                |   1F71  |  GREEK SMALL LETTER ALPHA WITH OXIA                  |  Ll  |  canonical  |  03AC               |  03B1 0301       |
                •---------•------------------------------------------------------•------•-------------•---------------------•------------------•
                
                
                
                250 characters with NFC = NFKC = C.P. and NFD = NFKD and D.M. <> from OTHERS :
                
                •---------•----------------------------------------------------------------------------•------•-------------•-------------•--------------------•-----------------------•---------•
                !   C.P.  |                               Character Name                               | G.C. |  Dec. Type  |  Dec. Map.  |  NFC = NFKC = C.P. |      NFD = NFKD       |  Code   |
                •---------•----------------------------------------------------------------------------•------•-------------•-------------•--------------------•-----------------------•---------•
                |   01D5  |  LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON                          |  Lu  |  canonical  |  00DC 0304  |  01D5              |  0055 0308 0304       |   01D5  |
                •---------•----------------------------------------------------------------------------•------•-------------•-------------•--------------------•-----------------------•---------•
                
                
                
                9 characters with NFC = NFKC = NFD = NFKD and D.M. <> from OTHERS :
                
                •---------•-------------------------------------------------•------•-------------•---------------•---------------------------•
                !   C.P.  |                 Character Name                  | G.C. |  Dec. Type  |  Decomp. Map. |  NFC = NFKC = NFD = NFKD  |
                •---------•-------------------------------------------------•------•-------------•---------------•---------------------------•
                |  1D160  |  MUSICAL SYMBOL EIGHTH NOTE                     |  So  |  canonical  |  1D15F 1D16E  |  1D158 1D165 1D16E        |
                •---------•-------------------------------------------------•------•-------------•---------------•---------------------------•
                
                
                
                1716 characters with NFC = NFD = C.P. and NFKC = NFKD = D.M. :
                
                •---------•-----------------------------------------------------------------•------•--------------•-------------------•-----------------------•---------•
                !   C.P.  |                         Character Name                          | G.C. |  Dec. Type   |  NFC = NFD = C.P. |  NFKC = NFKD = D.M.   |  Code   |
                •---------•-----------------------------------------------------------------•------•--------------•-------------------•-----------------------•---------•
                |   00A0  |  NO-BREAK SPACE                                                 |  Zs  |  <noBreak>   |  00A0             |  0020                 |   00A0  |
                •---------•-----------------------------------------------------------------•------•--------------•-------------------•-----------------------•---------•
                
                
                
                3 characters with NFC = NFD = D.M. and NFKC = NFKD :
                
                •---------•------------------•------•-------------•-------------------•---------------•
                !   C.P.  |  Character Name  | G.C. |  Dec. Type  |  NFC = NFD = D.M. |  NFKC = NFKD  |
                •---------•------------------•------•-------------•-------------------•---------------•
                |   1FFD  |  GREEK OXIA      |  Sk  |  canonical  |  00B4             |  0020 0301    |
                •---------•------------------•------•-------------•-------------------•---------------•
                
                
                
                43 characters with NFC = NFD = C.P. and NFKC = NFKD and D.M. <> from OTHERS :
                
                •---------•------------------------------------------------------------•------•--------------•-------------•-------------------•------------------•---------•
                !   C.P.  |                       Character Name                       | G.C. |  Dec. Type   |  Dec. Map.  |  NFC = NFD = C.P. |   NFKC = NFKD    |  Code   |
                •---------•------------------------------------------------------------•------•--------------•-------------•-------------------•------------------•---------•
                |   FB05  |  LATIN SMALL LIGATURE LONG S T                             |  Ll  |  <compat>    |  017F 0074  |  FB05             |  0073 0074       |   FB05  |
                •---------•------------------------------------------------------------•------•--------------•-------------•-------------------•------------------•---------•
                
                
                
                3 characters with NFC = NFD = C.P. and NFKC = D.M. :
                
                •---------•---------------------------------------------------------•------•-------------•--------------------•--------------•------------------•---------•
                !   C.P.  |                     Character Name                      | G.C. |  Dec. Type  |  NFC = NFD = C.P.  |  NFKC = D.M. |       NFKD       |  Code   |
                •---------•---------------------------------------------------------•------•-------------•--------------------•--------------•------------------•---------•
                |   01C4  |  LATIN CAPITAL LETTER DZ WITH CARON                     |  Lu  |  <compat>   |  01C4              |  0044 017D   |  0044 005A 030C  |   01C4  |
                •---------•---------------------------------------------------------•------•-------------•--------------------•--------------•------------------•---------•
                
                
                
                9 characters with NFC = C.P. and NFD = D.M. and NFKC = NFKD :
                
                •---------•-----------------------------------•------•-------------•-------------•-------------•------------------•
                !   C.P.  |          Character Name           | G.C. |  Dec. Type  |  NFC = C.P. |  NFD = D.M. |   NFKC = NFKD    |
                •---------•-----------------------------------•------•-------------•-------------•-------------•------------------•
                |   0385  |  GREEK DIALYTIKA TONOS            |  Sk  |  canonical  |  0385       |  00A8 0301  |  0020 0308 0301  |
                •---------•-----------------------------------•------•-------------•-------------•-------------•------------------•
                
                
                
                1 character with NFC = D.M. and NFKC = NFKD :
                
                •---------•----------------------------•------•-------------•-------------•-------------•------------------•
                !   C.P.  |       Character Name       | G.C. |  Dec. Type  |  NFC = D.M. |     NFD     |   NFKC = NFKD    |
                •---------•----------------------------•------•-------------•-------------•-------------•------------------•
                |   1FEE  |  GREEK DIALYTIKA AND OXIA  |  Sk  |  canonical  |  0385       |  00A8 0301  |  0020 0308 0301  |
                •---------•----------------------------•------•-------------•-------------•-------------•------------------•
                
                
                
                3 characters with NFC = C.P. and NFD = D.M. and ALL columns DIFFERENT :
                
                •---------•-----------------------------------------------•------•-------------•-------------•-------------•--------•-------------•
                !   C.P.  |                Character Name                 | G.C. |  Dec. Type  |  NFC = C.P. |  NFD = D.M. |  NFKC  |    NFKD     |
                •---------•-----------------------------------------------•------•-------------•-------------•-------------•--------•-------------•
                |   03D3  |  GREEK UPSILON WITH ACUTE AND HOOK SYMBOL     |  Lu  |  canonical  |  03D3       |  03D2 0301  |  038E  |  03A5 0301  |
                •---------•-----------------------------------------------•------•-------------•-------------•-------------•--------•-------------•
                

                Now, as you can see, quite a lot of categories have an NFC value strictly identical to the C.P. ( code-point ) of the character. If we omit all the characters of these categories, only 48 characters remain which are changed when replaced by their NFC value !
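
The filtering step described above, keeping only the characters whose NFC value differs from their own code-point, can be sketched in Python as follows. The function name `changed_by_nfc` is hypothetical, and the results depend on the Unicode version bundled with the Python interpreter, so the counts may not match the 48 characters exactly.

```python
import unicodedata

def changed_by_nfc(first, last):
    """Return the code points in [first, last] whose NFC form differs from themselves."""
    return [cp for cp in range(first, last + 1)
            if unicodedata.normalize("NFC", chr(cp)) != chr(cp)]

# Example: scan the Greek Extended block (U+1F00 to U+1FFF)
greek_extended = changed_by_nfc(0x1F00, 0x1FFF)
for cp in greek_extended:
    nfc = unicodedata.normalize("NFC", chr(cp))
    print(f"{cp:04X} -> {' '.join(f'{ord(c):04X}' for c in nfc)}")
```

Running the same function over the other blocks of interest (Latin, Cyrillic, letterlike symbols, and so on) gives the remaining candidates.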

                In the next post, you’ll get the list of these 48 characters, which are modified when using the option Convert > Normalization Form > To NFC of the BabelPad software !

                Just tell me if you need the complete list ( 2,635 chars ) too. I could send it by e-mail !

                Best Regards,

                guy038

                • G
                  guy038
                  last edited by guy038 Jul 3, 2021, 3:50 AM Jul 3, 2021, 3:49 AM

                  Hello, @xaviermdq,

                  Here is the list of the characters transformed by the NFC normalization form :

                      48 characters, from LATIN, GREEK, CYRILLIC and ROMAN scripts, are REPLACED by OTHER character(s)
                  
                          when using the option "Convert > Normalization Form > To NFC" of the "BabelPad" software
                  
                  
                      ( NOTE : C.P. means CODE-POINT, G.C. means GENERAL CATEGORY and D.M. means DECOMPOSITION MAPPING )
                  
                  
                  18 characters with NFC = NFKC = NFD = NFKD = D.M. :
                  
                  •-----•---------•-----------------------------------•------•-------------•---------------------------------•
                  ! Chr.|   C.P.  |          Character Name           | G.C. |  Dec. Type  |  NFC = NFKC = NFD = NFKD = D.M. |
                  •-----•---------•-----------------------------------•------•-------------•---------------------------------•
                  |   ̀  |   0340  |  COMBINING GRAVE TONE MARK        |  Mn  |  canonical  |  0300                           |
                  |   ́  |   0341  |  COMBINING ACUTE TONE MARK        |  Mn  |  canonical  |  0301                           |
                  |   ̓  |   0343  |  COMBINING GREEK KORONIS          |  Mn  |  canonical  |  0313                           |
                  |   ̈́  |   0344  |  COMBINING GREEK DIALYTIKA TONOS  |  Mn  |  canonical  |  0308 0301                      |
                  |  ʹ ‎ |   0374  |  GREEK NUMERAL SIGN               |  Lm  |  canonical  |  02B9                           |
                  |  ; ‎ |   037E  |  GREEK QUESTION MARK              |  Po  |  canonical  |  003B                           |
                  |  · ‎ |   0387  |  GREEK ANO TELEIA                 |  Po  |  canonical  |  00B7                           |
                  |  ι ‎ |   1FBE  |  GREEK PROSGEGRAMMENI             |  Ll  |  canonical  |  03B9                           |
                  |  `  |   1FEF  |  GREEK VARIA                      |  Sk  |  canonical  |  0060                           |
                  |  Ω  |   2126  |  OHM SIGN                         |  Lu  |  canonical  |  03A9                           |
                  |  K  |   212A  |  KELVIN SIGN                      |  Lu  |  canonical  |  004B                           |
                  |  〈  |   2329  |  LEFT-POINTING ANGLE BRACKET      |  Ps  |  canonical  |  3008                           |
                  |  〉  |   232A  |  RIGHT-POINTING ANGLE BRACKET     |  Pe  |  canonical  |  3009                           |
                  |  ⫝̸  |   2ADC  |  FORKING                          |  Sm  |  canonical  |  2ADD 0338                      |
                  |  𝅗𝅥  |  1D15E  |  MUSICAL SYMBOL HALF NOTE         |  So  |  canonical  |  1D157 1D165                    |
                  |  𝅘𝅥  |  1D15F  |  MUSICAL SYMBOL QUARTER NOTE      |  So  |  canonical  |  1D158 1D165                    |
                  |  𝆹𝅥  |  1D1BB  |  MUSICAL SYMBOL MINIMA            |  So  |  canonical  |  1D1B9 1D165                    |
                  |  𝆺𝅥  |  1D1BC  |  MUSICAL SYMBOL MINIMA BLACK      |  So  |  canonical  |  1D1BA 1D165                    |
                  •-----•---------•-----------------------------------•------•-------------•---------------------------------•
                  
                  
                  
                  17 characters with NFC = NFKC = D.M. and NFD = NFKD :
                  
                  •-----•---------•------------------------------------------------------•------•-------------•---------------------•------------------•
                  ! Chr.|   C.P.  |                    Character Name                    | G.C. |  Dec. Type  |  NFC = NFKC = D.M.  |    NFD = NFKD    |
                  •-----•---------•------------------------------------------------------•------•-------------•---------------------•------------------•
                  |  ά  |   1F71  |  GREEK SMALL LETTER ALPHA WITH OXIA                  |  Ll  |  canonical  |  03AC               |  03B1 0301       |
                  |  έ  |   1F73  |  GREEK SMALL LETTER EPSILON WITH OXIA                |  Ll  |  canonical  |  03AD               |  03B5 0301       |
                  |  ή  |   1F75  |  GREEK SMALL LETTER ETA WITH OXIA                    |  Ll  |  canonical  |  03AE               |  03B7 0301       |
                  |  ί  |   1F77  |  GREEK SMALL LETTER IOTA WITH OXIA                   |  Ll  |  canonical  |  03AF               |  03B9 0301       |
                  |  ό  |   1F79  |  GREEK SMALL LETTER OMICRON WITH OXIA                |  Ll  |  canonical  |  03CC               |  03BF 0301       |
                  |  ύ  |   1F7B  |  GREEK SMALL LETTER UPSILON WITH OXIA                |  Ll  |  canonical  |  03CD               |  03C5 0301       |
                  |  ώ  |   1F7D  |  GREEK SMALL LETTER OMEGA WITH OXIA                  |  Ll  |  canonical  |  03CE               |  03C9 0301       |
                  |  Ά  |   1FBB  |  GREEK CAPITAL LETTER ALPHA WITH OXIA                |  Lu  |  canonical  |  0386               |  0391 0301       |
                  |  Έ  |   1FC9  |  GREEK CAPITAL LETTER EPSILON WITH OXIA              |  Lu  |  canonical  |  0388               |  0395 0301       |
                  |  Ή  |   1FCB  |  GREEK CAPITAL LETTER ETA WITH OXIA                  |  Lu  |  canonical  |  0389               |  0397 0301       |
                  |  ΐ  |   1FD3  |  GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA     |  Ll  |  canonical  |  0390               |  03B9 0308 0301  |
                  |  Ί  |   1FDB  |  GREEK CAPITAL LETTER IOTA WITH OXIA                 |  Lu  |  canonical  |  038A               |  0399 0301       |
                  |  ΰ  |   1FE3  |  GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA  |  Ll  |  canonical  |  03B0               |  03C5 0308 0301  |
                  |  Ύ  |   1FEB  |  GREEK CAPITAL LETTER UPSILON WITH OXIA              |  Lu  |  canonical  |  038E               |  03A5 0301       |
                  |  Ό  |   1FF9  |  GREEK CAPITAL LETTER OMICRON WITH OXIA              |  Lu  |  canonical  |  038C               |  039F 0301       |
                  |  Ώ  |   1FFB  |  GREEK CAPITAL LETTER OMEGA WITH OXIA                |  Lu  |  canonical  |  038F               |  03A9 0301       |
                  |  Å  |   212B  |  ANGSTROM SIGN                                       |  Lu  |  canonical  |  00C5               |  0041 030A       |
                  •-----•---------•------------------------------------------------------•------•-------------•---------------------•------------------•
                  
                  
                  
                  9 characters with NFC = NFKC = NFD = NFKD and D.M. <> from OTHERS :
                  
                  •-----•---------•-------------------------------------------------•------•-------------•---------------•---------------------------•
                  ! Chr.|   C.P.  |                 Character Name                  | G.C. |  Dec. Type  |  Decomp. Map. |  NFC = NFKC = NFD = NFKD  |
                  •-----•---------•-------------------------------------------------•------•-------------•---------------•---------------------------•
                  |  𝅘𝅥𝅮 ‎ |  1D160  |  MUSICAL SYMBOL EIGHTH NOTE                     |  So  |  canonical  |  1D15F 1D16E  |  1D158 1D165 1D16E        |
                  |  𝅘𝅥𝅯 ‎ |  1D161  |  MUSICAL SYMBOL SIXTEENTH NOTE                  |  So  |  canonical  |  1D15F 1D16F  |  1D158 1D165 1D16F        |
                  |  𝅘𝅥𝅰 ‎ |  1D162  |  MUSICAL SYMBOL THIRTY-SECOND NOTE              |  So  |  canonical  |  1D15F 1D170  |  1D158 1D165 1D170        |
                  |  𝅘𝅥𝅱 ‎ |  1D163  |  MUSICAL SYMBOL SIXTY-FOURTH NOTE               |  So  |  canonical  |  1D15F 1D171  |  1D158 1D165 1D171        |
                  |  𝅘𝅥𝅲 ‎ |  1D164  |  MUSICAL SYMBOL ONE HUNDRED TWENTY-EIGHTH NOTE  |  So  |  canonical  |  1D15F 1D172  |  1D158 1D165 1D172        |
                  |  𝆹𝅥𝅮 ‎ |  1D1BD  |  MUSICAL SYMBOL SEMIMINIMA WHITE                |  So  |  canonical  |  1D1BB 1D16E  |  1D1B9 1D165 1D16E        |
                  |  𝆺𝅥𝅮 ‎ |  1D1BE  |  MUSICAL SYMBOL SEMIMINIMA BLACK                |  So  |  canonical  |  1D1BC 1D16E  |  1D1BA 1D165 1D16E        |
                  |  𝆹𝅥𝅯 ‎ |  1D1BF  |  MUSICAL SYMBOL FUSA WHITE                      |  So  |  canonical  |  1D1BB 1D16F  |  1D1B9 1D165 1D16F        |
                  |  𝆺𝅥𝅯 ‎ |  1D1C0  |  MUSICAL SYMBOL FUSA BLACK                      |  So  |  canonical  |  1D1BC 1D16F  |  1D1BA 1D165 1D16F        |
                  •-----•---------•-------------------------------------------------•------•-------------•---------------•---------------------------•
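
One row of the table above can be cross-checked with a minimal Python sketch that uses only the standard `unicodedata` module. The helper name `code_points` is made up for this illustration, and the result depends on the Unicode database shipped with the Python build in use. The point it shows: these musical symbols are excluded from composition, so NFC does not recompose them and all four forms end up fully decomposed, while the one-step decomposition mapping stays shorter.

```python
import unicodedata

# U+1D160 MUSICAL SYMBOL EIGHTH NOTE, first row of the table above.
eighth_note = "\U0001D160"


def code_points(text: str) -> str:
    """Return the code points of `text` as space-separated uppercase hex."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


# Apply each of the four normalization forms to the single character.
forms = {
    form: code_points(unicodedata.normalize(form, eighth_note))
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

# One-step decomposition mapping (the "Decomp. Map." column).
decomp_map = unicodedata.decomposition(eighth_note)

for form, value in forms.items():
    print(f"{form:<4} -> {value}")
print(f"D.M. -> {decomp_map}")
```

If the four printed forms are identical and differ from the D.M. line, the row is confirmed.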
                  
                  
                  
                  3 characters with NFC = NFD = D.M. and NFKC = NFKD :
                  
                  •-----•---------•------------------•------•-------------•-------------------•---------------•
                  ! Chr.|   C.P.  |  Character Name  | G.C. |  Dec. Type  |  NFC = NFD = D.M. |  NFKC = NFKD  |
                  •-----•---------•------------------•------•-------------•-------------------•---------------•
                  |  ´  |   1FFD  |  GREEK OXIA      |  Sk  |  canonical  |  00B4             |  0020 0301    |
                  |     |   2000  |  EN QUAD         |  Zs  |  canonical  |  2002             |  0020         |
                  |     |   2001  |  EM QUAD         |  Zs  |  canonical  |  2003             |  0020         |
                  •-----•---------•------------------•------•-------------•-------------------•---------------•
                  
                  
                  
                  1 character with NFKC = NFKD and D.M. = NFC :
                  
                  •-----•---------•----------------------------•------•-------------•-------------•-------------•------------------•
                  ! Chr.|   C.P.  |       Character Name       | G.C. |  Dec. Type  |  NFC = D.M. |     NFD     |   NFKC = NFKD    |
                  •-----•---------•----------------------------•------•-------------•-------------•-------------•------------------•
                  |  ΅  |   1FEE  |  GREEK DIALYTIKA AND OXIA  |  Sk  |  canonical  |  0385       |  00A8 0301  |  0020 0308 0301  |
                  •-----•---------•----------------------------•------•-------------•-------------•-------------•------------------•
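
The last two tables can be checked the same way. The following is only a sketch, assuming Python's standard `unicodedata` module; the helper names `code_points` and `normal_forms` are hypothetical and exist only for this example. It computes the four normalization forms for U+1FFD, U+2000 and U+1FEE so that they can be compared with the columns above.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def code_points(text: str) -> str:
    """Return the code points of `text` as space-separated uppercase hex."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


def normal_forms(char: str) -> dict:
    """Map each normalization form name to the normalized code points of `char`."""
    return {form: code_points(unicodedata.normalize(form, char)) for form in FORMS}


# Characters taken from the two tables above.
samples = {
    "GREEK OXIA": "\u1FFD",
    "EN QUAD": "\u2000",
    "GREEK DIALYTIKA AND OXIA": "\u1FEE",
}

results = {name: normal_forms(char) for name, char in samples.items()}

for name, forms in results.items():
    print(name)
    for form in FORMS:
        print(f"  {form:<4} -> {forms[form]}")
```

For U+1FEE, NFC and NFD should differ from each other, while NFKC and NFKD should agree, as the last table states.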
                  

                  Hope this helps you !

                  Cheers,

                  guy038

                  J 1 Reply Last reply Jan 9, 2025, 5:50 PM Reply Quote 1
                  • J
                    jmdesp @guy038
                    last edited by Jan 9, 2025, 5:50 PM

                    @guy038 Reacting late to this thread, but I think that in many cases when you want to normalise Unicode text, NFC isn’t enough, because it doesn’t handle ligatures.
                    So many people will need a longer table than the one here, one that also includes all the compatibility decompositions applied by NFKC.

                    To make this more concrete: NFC won’t normalise the ffi ligature (U+FB03, LATIN SMALL LIGATURE FFI).
                    So “A\uFB03n” stays “A\uFB03n” when normalised with NFC, but becomes “Affin” when normalised with NFKC, which is much more useful for many people.
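
The ligature example can be reproduced with a few lines of Python. This is a minimal sketch using the standard `unicodedata` module, not something Notepad++ does itself; the variable names are only illustrative.

```python
import unicodedata

# "A" + U+FB03 LATIN SMALL LIGATURE FFI + "n"
text = "A\uFB03n"

nfc = unicodedata.normalize("NFC", text)
nfkc = unicodedata.normalize("NFKC", text)

# NFC only applies canonical mappings, so the ligature survives.
print(nfc == text)   # True

# NFKC also applies compatibility mappings, so the ligature is expanded.
print(nfkc)          # Affin
print(len(text), len(nfkc))  # 3 5
```

Keep in mind that NFKC is lossy: it also folds other compatibility characters, for example the superscript two (U+00B2) becomes a plain "2", so it is best applied only where that loss is acceptable.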
