Unicode Normalization



  • Hi,
    Which Unicode normalization form is used when creating a new UTF-8 document (or converting an existing one to UTF-8)? Can I change this with an INI or GUI option? I want the NFC form.
    Thanks,
    Javier



  • @xaviermdq ,

    I don’t know what unicode normalization is used.

    Unfortunately, the primary developer and most of the other volunteer contributors don’t regularly read this forum, so I don’t know if they’ll ever see this question. I don’t know if any of the regulars in this Forum have studied the guts of the Notepad++/Scintilla UTF-8 handling enough to know how to answer that question.

    I’d suggest waiting for another reply here, in case someone has studied it more than I’d previously gathered. But if you don’t get a reply after a reasonable wait, you might consider going to the github issues location and asking this question there – because there is hopefully a developer who knows enough about the guts to answer over there.



  • Hello, @xaviermdq,

    I don’t think that character decomposition and encodings are related, in any way !

    Whatever the Unicode encoding used, the encoding process simply writes the appropriate byte(s) in order to encode each individual character

    By contrast, the Unicode Normalization forms rather deal with :

    • Composition of characters into some pre-composed characters

    • Decomposition of characters into their base letter and some combining characters in a specific order
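
    As a minimal sketch of these two operations outside Notepad++, Python's standard unicodedata module can compose and decompose a character. The variable names and the code_points helper are only illustrative:

```python
import unicodedata

composed = "\u00e9"      # é as a single precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# NFC composes where a precomposed character exists; NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)


def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


print(code_points(nfc))  # code points after composition
print(code_points(nfd))  # code points after decomposition
```

    The UTF-8 encoding step only happens afterwards, on whichever sequence of code points the text contains.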


    For instance, let's consider the LATIN SMALL LETTER E, of code-point U+0065. Starting with this base letter, we may consider the related characters, below :

    •----------•-----------------•------------------------------•------•------------•------•
    |  String  | Char(s) Number  |        Decomposition         |   >  |      e     |   <  |
    •----------•-----------------•------------------------------•------•------------•------•
    |   >e<    |        3        |  U+003E    U+0065    U+003C  |  3E  |     65     |  3C  |
    •----------•-----------------•------------------------------•------•------------•------•
    
    
    •----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•
    |  String  | Char(s) Number  |                    Decomposition                     |   >  |      e     |     ́    |     ̂    |   <  |
    •----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•
    |   >é̂<   |        5        |  U+003E  U+0065 (e)  U+0301 ( ́)  U+0302 ( ̂)  U+003C  |  3E  |     65     |  CC 81  |  CC 82  |  3C  |
    •----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•
    
    
    •----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•
    |  String  | Char(s) Number  |                    Decomposition                     |   >  |      e     |     ̂    |     ́    |   <  |
    •----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•
    |   >ế<    |        5        |  U+003E  U+0065 (e)  U+0302 ( ̂)  U+0301 ( ́)  U+003C  |  3E  |     65     |  CC 82  |  CC 81  |  3C  |
    •----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•
    
    
    •----------•-----------------•--------------------------------------------•------•------------•---------•------•
    |  String  | Char(s) Number  |               Decomposition                |   >  |      ê     |     ́    |   <  |
    •----------•-----------------•--------------------------------------------•------•------------•---------•------•
    |   >ế<   |        4        |  U+003E    U+00EA (ê)    U+0301    U+003C  |  3E  |   C3 AA    |  CC 81  |  3C  |
    •----------•-----------------•--------------------------------------------•------•------------•---------•------•
    
    •----------•-----------------•--------------------------------------------•------•------------•---------•------•
    |  String  | Char(s) Number  |               Decomposition                |   >  |      é     |     ̂    |   <  |
    •----------•-----------------•--------------------------------------------•------•------------•---------•------•
    |   >é̂<    |        4        |  U+003E    U+00E9 (é)    U+0302    U+003C  |  3E  |   C3 A9    |  CC 82  |  3C  |
    •----------•-----------------•--------------------------------------------•------•------------•---------•------•
    
    
    •----------•-----------------•------------------------------•------•------------•------•
    |  String  | Char(s) Number  |        Decomposition         |   >  |      ế     |   <  |
    •----------•-----------------•------------------------------•------•------------•------•
    |   >ế<    |        3        |  U+003E    U+1EBF    U+003C  |  3E  |  E1 BA BF  |  3C  |
    •----------•-----------------•------------------------------•------•------------•------•
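
    The five accented spellings in the tables above can also be checked programmatically. This is a rough sketch using Python's standard unicodedata module; the dictionary keys are just labels chosen for the example. It shows which spellings NFC folds into the single code point U+1EBF and which ones it does not:

```python
import unicodedata

# The accented variants from the tables, written with explicit escapes.
variants = {
    "e + acute + circumflex": "e\u0301\u0302",
    "e + circumflex + acute": "e\u0302\u0301",
    "ê + acute": "\u00ea\u0301",
    "é + circumflex": "\u00e9\u0302",
    "precomposed ế": "\u1ebf",
}


def describe(text):
    """Return (code points, UTF-8 bytes in hex) for a string."""
    points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    utf8 = " ".join(f"{b:02X}" for b in text.encode("utf-8"))
    return points, utf8


for label, text in variants.items():
    nfc = unicodedata.normalize("NFC", text)
    print(label, describe(text), "-> NFC", describe(nfc))
```

    The order of the two combining marks matters: circumflex then acute composes down to U+1EBF, while acute then circumflex can only compose as far as é (U+00E9) followed by U+0302.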
    

    Note that I placed each string, composed of a base letter and possible diacritic signs, between the delimiters > and < for an exact search !

    • Paste the text, above, in a new N++ tab

    • Open the Mark dialog

    • SEARCH : Successively try the six regex syntaxes, below :

      • (A) (?-s)>.<

      • (B) (?-s)>..<

      • (C) (?-s)>...<

      • (D) >[[=e=]]<

      • (E) >(?=e)\X<

      • (F) >(?=[[=e=]])\X<

    • Tick the Purge for each search and Wrap around options

    • Un-tick all other options

    • Select the Regular expression search mode

    • Click on the Mark All button


    Notes :

    • The regex A finds the strings containing one char, between the delimiters > and <, so 3 chars in totality. It matches, of course, the string >e< and the string >ế<, containing the Vietnamese letter ế

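    • The regex B finds the strings containing two chars, between the delimiters > and <, so 4 chars in totality. It matches the two strings made of a precomposed letter, ê (U+00EA) or é (U+00E9), followed by a single combining mark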

    • The regex C, finds the strings containing three chars, between the delimiters > and <, so 5 chars in totality. It matches the strings >é̂< and >ế<, which contain the base letter e and two diacritic characters, in a different order

    • The regex D finds all the individual characters equivalent to the base letter e, between the delimiters > and <, so 3 chars in totality. Like the regex A, it matches a single character, related to the letter e, and the two delimiters

    • In the regex E, we use a specific syntax \X which matches any base character, followed by one or several combining characters ( diacritical marks or else ). But as we just want to focus on the letter e, we place, before \X, a look-ahead (?=e) which forces the regex engine to match this base letter e and the possible combining characters following it. So, it matches the first 3 cases only !

    • In the regex F, we use the \X syntax again, which finds any char followed by possible combining characters. But, this time, we change the look-ahead to (?=[[=e=]]) which forces the regex engine to match any char equivalent to the letter e. Refer to the end of this post for further explanation. As you can see, this regex does find ALL the above cases :-))

    This regex leads to the following generic regex :    (?=[[=C=]])\X

    which matches any character equivalent to C, whatever its case, followed by possible combining diacritical marks

    For instance :

    • The regex (?=[[=3=]])\X does match the character 3̯̿, composed of the base digit 3 and two combining marks

    • The regex (?=[[=$=]])\X does match the character $̶̳̚ composed of the base symbol $ and three combining marks

    Test these two regexes against this text :

    3̯̿
    
    $̶̳̚
    

    Most of the combining characters can be found in the Combining Diacritical Marks Unicode block, in the range [\x{0300}-\x{036F}], described below :

    https://www.unicode.org/charts/PDF/U0300.pdf
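
    For readers who want to try the idea behind the generic regex (?=[[=C=]])\X outside N++, here is a rough Python approximation. Python's built-in re module has no \X, so this sketch groups a base character with the marks that follow it by looking at the Unicode general category. It is only an approximation of \X and of the equivalence classes used by the N++ regex engine, and the function names are made up for the example:

```python
import unicodedata


def split_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me are the combining marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def has_base(cluster, base):
    """True if the cluster decomposes to `base` (case-insensitive) plus marks."""
    decomposed = unicodedata.normalize("NFD", cluster)
    return decomposed[0].casefold() == base.casefold()


def find_with_base(text, base):
    """Return every cluster of text whose base character is `base`."""
    return [c for c in split_clusters(text) if has_base(c, base)]


sample = ">e< >\u1ebf< >e\u0301\u0302< >\u00ea\u0301<"
print(find_with_base(sample, "e"))
```

    Because the comparison is done on the NFD form, the precomposed ế and the decomposed spellings are all found with the same call.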


    So, @xaviermdq, as you can see, we never worried about the exact bytes used by the UTF-8 encoding !

    Apparently, you wish to replace some decomposed consecutive characters by a precomposed equivalent character, if any !? This goal could be achieved with regexes !

    Just tell me some more details about your needs, and also, your usual working Unicode script(s) : Latin, Cyrillic, Hebrew, Arabic, CJK, ... !

    Best Regards,

    guy038



  • Hi @guy038 :
    Sorry, I asked the wrong question because I didn’t understand what was happening. What I really want is a new option (for example “Convert to UTF-8 NFC”, or something like that, in Encoding menu) that allows me to do canonical normalization. So that you understand what I want, I will show you an example:
    The code points:
    GREEK CAPITAL LETTER OMEGA , U+03A9 , UTF-8: 0xCE 0xA9
    OHM SIGN , U+2126 , UTF-8: 0xE2 0x84 0xA6
    refer to the same character, although some fonts (like MS Arial) represent it slightly differently.
    If you apply canonical normalization, U+2126 is transformed to U+03A9 (I tested it with BabelPad).
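
    Until such an option exists, the same result can be obtained outside Notepad++. A minimal sketch with Python's standard unicodedata module follows; convert_file_to_nfc and its path arguments are hypothetical names for the example, not an existing tool:

```python
import unicodedata

OHM_SIGN = "\u2126"      # UTF-8: E2 84 A6
GREEK_OMEGA = "\u03a9"   # UTF-8: CE A9


def to_nfc(text):
    """Apply canonical composition (NFC) to a string."""
    return unicodedata.normalize("NFC", text)


def convert_file_to_nfc(src_path, dst_path):
    """Read a UTF-8 file and write an NFC-normalized UTF-8 copy."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, encoding="utf-8", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(to_nfc(text))


print(to_nfc(OHM_SIGN) == GREEK_OMEGA)
```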

    Thank you for your really comprehensive response.



  • Hello, @xaviermdq,

    I’ve begun, with the advanced search of the BabelMap software and the contents of the NormalizationTest.txt file, that you may download from here, to build a complete list of the Unicode characters with a Decomposition Mapping property, as well as their NFC, NFD, NFKC and NFKD values !

    I obtained a list of 16,908 characters, corresponding to the @Part1 # Character by character test section of the NormalizationTest.txt file.
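
    For anyone who prefers to script this, here is a small sketch that reads one line of that file. It assumes the layout as I understand it: five semicolon-separated columns (source, NFC, NFD, NFKC, NFKD), each a space-separated list of hexadecimal code points, a trailing # comment, and section headers starting with @. The function names are invented for the example, and Python's unicodedata may follow a different Unicode version than the file you download:

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def parse_fields(line):
    """Return [source, NFC, NFD, NFKC, NFKD] as strings, or None if no data."""
    data = line.split("#", 1)[0].strip()
    if not data or data.startswith("@"):
        return None  # blank line, comment line or @PartN header
    columns = data.split(";")[:5]
    return ["".join(chr(int(cp, 16)) for cp in col.split()) for col in columns]


def check_line(line):
    """True if unicodedata agrees with the four expected forms on this line."""
    fields = parse_fields(line)
    if fields is None:
        return True
    source, expected = fields[0], fields[1:]
    return all(
        unicodedata.normalize(form, source) == value
        for form, value in zip(FORMS, expected)
    )


sample = "2126;03A9;03A9;03A9;03A9; # OHM SIGN"
print(parse_fields(sample), check_line(sample))
```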

    It would be sensible to restrict such a list to the Unicode script(s) that you currently use ! So, could you tell me, from the list of scripts below, which one(s) you want to consider ?

    CYRILLIC - GREEK - LATIN - ROMAN
    
    ARABIC - ARMENIAN - HEBREW
    
    CJK - HANGUL - HANGZHOU - KANGXI
    
    HIRAGANA - KATAKANA
    
    BALINESE
    
    BENGALI - CHAKMA - DEVANAGARI - DIVEHI AKURU - GRANTHA - GURMUKHI - KAITHI
    KANNADA - MALAYALAM - ORIYA - SIDDHAM - SINHALA - TAMIL - TELUGU - TIRHUTA
    
    LAO - MYANMAR - THAI - TIBETAN
    
    TIFINAGH
    

    Best Regards,

    guy038

