Unicode Normalization
-
Hi,
Which Unicode normalization form is used when creating a new UTF-8 document (or converting one to UTF-8)? Can I change this with an INI or GUI option? I want the NFC standard.
Thanks,
Javier -
I don’t know what unicode normalization is used.
Unfortunately, the primary developer and most of the other volunteer contributors don’t regularly read this forum, so I don’t know if they’ll ever see this question. I don’t know if any of the regulars in this Forum have studied the guts of the Notepad++/Scintilla UTF-8 handling enough to know how to answer that question.
I’d suggest waiting for another reply here, in case someone has studied it more than I’d previously gathered. But if you don’t get a reply after a reasonable wait, you might consider going to the github issues location and asking this question there – because there is hopefully a developer who knows enough about the guts to answer over there.
-
Hello, @xaviermdq,
I don’t think that character decomposition and encodings are related in any way!
Whatever the Unicode encoding used, the encoding process simply writes the appropriate byte(s) in order to encode each individual character.
By contrast, the Unicode normalization forms deal with:
- Composition of characters into pre-composed characters
- Decomposition of characters into their base letter and some combining characters, in a specific order
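To make the distinction concrete, here is a minimal Python sketch using the standard `unicodedata` module. It is only an illustration outside Notepad++ (the variable names are mine, not from the thread): encoding serializes whatever code points it receives, while normalization is a separate step that changes the code points themselves.

```python
import unicodedata

# Two spellings of the same visible letter "é".
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT

for label, text in (("precomposed", precomposed), ("decomposed", decomposed)):
    # Encoding never composes or decomposes: each spelling gets its own bytes,
    # whichever Unicode encoding is chosen.
    utf8 = text.encode("utf-8").hex(" ")
    utf16 = text.encode("utf-16-le").hex(" ")
    print(f"{label:12} UTF-8: {utf8:10} UTF-16LE: {utf16}")

# Normalization is what maps one spelling onto the other.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```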
For instance, let us consider the LATIN SMALL LETTER E, e, of code point U+0065. Starting with this base letter, we may consider the related strings below (the right-hand columns give the UTF-8 bytes of each character):

•----------•-----------------•------------------------------•------•------------•------•
| String   | Char(s) Number  | Decomposition                | >    | e          | <    |
•----------•-----------------•------------------------------•------•------------•------•
| >e<      | 3               | U+003E U+0065 U+003C         | 3E   | 65         | 3C   |
•----------•-----------------•------------------------------•------•------------•------•

•----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•
| String   | Char(s) Number  | Decomposition                                        | >    | e          |  ́       |  ̂       | <    |
•----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•
| >é̂<      | 5               | U+003E U+0065 (e) U+0301 ( ́) U+0302 ( ̂) U+003C       | 3E   | 65         | CC 81   | CC 82   | 3C   |
•----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•

•----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•
| String   | Char(s) Number  | Decomposition                                        | >    | e          |  ̂       |  ́       | <    |
•----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•
| >ế<      | 5               | U+003E U+0065 (e) U+0302 ( ̂) U+0301 ( ́) U+003C       | 3E   | 65         | CC 82   | CC 81   | 3C   |
•----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•

•----------•-----------------•--------------------------------------------•------•------------•---------•------•
| String   | Char(s) Number  | Decomposition                              | >    | ê          |  ́       | <    |
•----------•-----------------•--------------------------------------------•------•------------•---------•------•
| >ế<      | 4               | U+003E U+00EA (ê) U+0301 U+003C            | 3E   | C3 AA      | CC 81   | 3C   |
•----------•-----------------•--------------------------------------------•------•------------•---------•------•

•----------•-----------------•--------------------------------------------•------•------------•---------•------•
| String   | Char(s) Number  | Decomposition                              | >    | é          |  ̂       | <    |
•----------•-----------------•--------------------------------------------•------•------------•---------•------•
| >é̂<      | 4               | U+003E U+00E9 (é) U+0302 U+003C            | 3E   | C3 A9      | CC 82   | 3C   |
•----------•-----------------•--------------------------------------------•------•------------•---------•------•

•----------•-----------------•------------------------------•------•------------•------•
| String   | Char(s) Number  | Decomposition                | >    | ế          | <    |
•----------•-----------------•------------------------------•------•------------•------•
| >ế<      | 3               | U+003E U+1EBF U+003C         | 3E   | E1 BA BF   | 3C   |
•----------•-----------------•------------------------------•------•------------•------•
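As a cross-check of the six strings above, here is a small Python sketch (standard `unicodedata` module; the helper name `code_points` is mine). It writes each string with explicit escapes, so that nothing gets silently recomposed by a clipboard or an editor, and prints its NFC and NFD forms.

```python
import unicodedata

def code_points(text):
    """Return the code points of text as space-separated 'U+XXXX' strings."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

# The six test strings, in the same order as the tables.
samples = [
    ">e<",
    ">e\u0301\u0302<",   # e + acute + circumflex
    ">e\u0302\u0301<",   # e + circumflex + acute
    ">\u00ea\u0301<",    # ê + acute
    ">\u00e9\u0302<",    # é + circumflex
    ">\u1ebf<",          # precomposed ế
]

for s in samples:
    nfc = unicodedata.normalize("NFC", s)
    nfd = unicodedata.normalize("NFD", s)
    print(f"{code_points(s)}\n    NFC: {code_points(nfc)}\n    NFD: {code_points(nfd)}")
```

The three spellings of ế all share one NFC form (U+1EBF), whereas the two spellings of é̂ end up as U+00E9 U+0302, because Unicode has no single precomposed code point for that combination.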
Note that I placed each string, composed of a base letter and possible diacritical marks, between the delimiters > and < for an exact search!
- Paste the text above in a new N++ tab
- Open the Mark dialog
- SEARCH: Successively try the six regex syntaxes below:
  - (A) (?-s)>.<
  - (B) (?-s)>..<
  - (C) (?-s)>...<
  - (D) >[[=e=]]<
  - (E) >(?=e)\X<
  - (F) >(?=[[=e=]])\X<
- Tick the Purge for each search and Wrap around options
- Un-tick all other options
- Select the Regular expression search mode
- Click on the Mark All button
Notes:
- Regex A finds the strings containing one char between the delimiters > and <, so 3 chars in total. It matches, of course, the string >e< and the string >ế<, which contains the precomposed Vietnamese letter ế (U+1EBF)
- Regex B finds the strings containing two chars between the delimiters > and <, so 4 chars in total. It matches the strings >ế< (ê + U+0301) and >é̂< (é + U+0302), which contain an accented char followed by an additional diacritical mark
- Regex C finds the strings containing three chars between the delimiters > and <, so 5 chars in total. It matches the strings >é̂< and >ế<, which contain the base letter e and two diacritical marks, in a different order
- Regex D finds every single character equivalent to the base letter e, between the delimiters > and <, so 3 chars in total. Like regex A, it matches a single character related to the letter e, plus the delimiters
- In regex E, we use the specific syntax \X, which matches any base character together with the combining characters (diacritical marks or others) that may follow it. But as we just want to focus on the letter e, we place, before \X, the look-ahead (?=e), which forces the regex engine to match this base letter e and the possible combining characters following it. So, it matches the first 3 cases only!
- In regex F, we again use the \X syntax, which finds any char followed by possible combining characters. But this time, we change the look-ahead to (?=[[=e=]]), which forces the regex engine to match any char equivalent to the letter e. Refer to the end of this post for further explanation. As you can see, this regex does find ALL the above cases :-))
This regex leads to the following generic regex: (?=[[=C=]])\X, which matches any character C, whatever its case, followed by some combining diacritical marks.
For instance:
- The regex (?=[[=3=]])\X does match the character 3̯̿, composed of the base digit 3 and two combining marks
- The regex (?=[[=$=]])\X does match the character $̶̳̚, composed of the base symbol $ and three combining marks
Test these two regexes against this text:
3̯̿ $̶̳̚
Most of the combining characters can be found in the Combining Diacritical Marks Unicode block, in the range [\x{0300}-\x{036F}], described here: https://www.unicode.org/charts/PDF/U0300.pdf
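For readers who want to experiment outside Notepad++, the idea behind regex F can be roughly approximated in Python with the standard `unicodedata` module. This is only a sketch under my own assumptions: the function name `is_base_with_marks` is hypothetical, it is not how the Boost regex engine implements `[[=e=]]` or `\X`, and the combining marks in the usage lines are picked for illustration (the exact marks in the sample text above may differ).

```python
import unicodedata

def is_base_with_marks(cluster, base):
    """Rough check: does `cluster` reduce to `base` plus combining marks?

    The cluster is fully decomposed (NFD). Its first code point must equal
    `base`, ignoring case, and every following code point must belong to a
    Mark general category (Mn, Mc or Me).
    """
    decomposed = unicodedata.normalize("NFD", cluster)
    if not decomposed:
        return False
    first, rest = decomposed[0], decomposed[1:]
    return first.casefold() == base.casefold() and all(
        unicodedata.category(ch).startswith("M") for ch in rest
    )

print(is_base_with_marks("\u1ebf", "e"))               # precomposed ế
print(is_base_with_marks("e\u0301\u0302", "e"))        # e + two marks
print(is_base_with_marks("3\u032f\u033f", "3"))        # digit + two marks
print(is_base_with_marks("$\u0336\u0333\u031a", "$"))  # symbol + three marks
print(is_base_with_marks("a\u0301", "e"))              # different base letter
```

The first four calls print True and the last one prints False. Decomposing first is what lets a precomposed character such as U+1EBF be recognised as "an e with marks".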
So, @xaviermdq, as you can see, we never worried about the exact bytes used by the UTF-8 encoding!
Apparently, you wish to replace some consecutive decomposed characters with a precomposed equivalent character, if any!? This goal could be achieved with regexes!
Just tell me some more details about your needs and, also, your usual working Unicode script(s): Latin, Cyrillic, Hebrew, Arabic, CJK, ...!
Best Regards,
guy038
-
-
Hi @guy038 :
Sorry, I asked the wrong question because I didn’t understand what was happening. What I really want is a new option (for example “Convert to UTF-8 NFC”, or something like that, in Encoding menu) that allows me to do canonical normalization. So that you understand what I want, I will show you an example:
The code points:
GREEK CAPITAL LETTER OMEGA , U+03A9 , UTF-8: 0xCE 0xA9
OHM SIGN , U+2126 , UTF-8: 0xE2 0x84 0xA6
refer to the same character, although some fonts (like MS Arial) represent it slightly differently.
If you apply canonical normalization, U+2126 transforms to U+03A9 (I tested it with BabelPad).
Thank you for your really comprehensive response.
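The OHM SIGN example can be reproduced with a few lines of Python, using the standard `unicodedata` module (a sketch for checking the behaviour, independent of both Notepad++ and BabelPad):

```python
import unicodedata

ohm = "\u2126"     # OHM SIGN
omega = "\u03a9"   # GREEK CAPITAL LETTER OMEGA

print(ohm.encode("utf-8").hex(" "))     # e2 84 a6
print(omega.encode("utf-8").hex(" "))   # ce a9

# NFC replaces the OHM SIGN with the Greek capital omega.
normalized = unicodedata.normalize("NFC", ohm)
print(f"U+{ord(normalized):04X}", unicodedata.name(normalized))

# is_normalized() (Python 3.8+) tells whether a string is already in a form.
print(unicodedata.is_normalized("NFC", ohm))     # False
print(unicodedata.is_normalized("NFC", omega))   # True
```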
-
Hello, @xaviermdq,
I’ve begun, with the advanced search of the BabelMap software and the contents of the NormalizationTest.txt file, that you may download from here, to build a complete list of Unicode characters with a Decomposition Mapping property, as well as their NFC, NFD, NFKC and NFKD values!
I obtained a list of 16,908 characters, corresponding to the @Part1 # Character by character test section of the NormalizationTest.txt file.
It would be sensible to restrict such a list to the Unicode script(s) that you currently use! So, could you tell me which one(s), from the list of scripts below, you want to consider?
CYRILLIC - GREEK - LATIN - ROMAN
ARABIC - ARMENIAN - HEBREW
CJK - HANGUL - HANGZHOU - KANGXI
HIRAGANA - KATAKANA
BALINESE
BENGALI - CHAKMA - DEVANAGARI - DIVEHI AKURU - GRANTHA - GURMUKHI - KAITHI
KANNADA - MALAYALAM - ORIYA - SIDDHAM - SINHALA - TAMIL - TELUGU - TIRHUTA
LAO - MYANMAR - THAI - TIBETAN
TIFINAGH
Best Regards,
guy038
-
Hi guy038,
Sorry for delay (I missed the email notification). I am using “CYRILLIC - GREEK - LATIN - ROMAN”. But is it important to know? I ask because the normalization function that I would like Notepad++ to have, wouldn’t it be independent of the character set used? Anyway, I already did the normalization using BabelPad (menu Convert, Normalization form, To NFC). Only corrected 3 composition characters. In the future I’ll use BabelPad for normalization. Thank you very much for the explanations. They were revealing. As soon as I can, I am going to study this matter in more detail. -
Hi, @xaviermdq,
As you are only interested in the Latin, Greek, Cyrillic and Roman scripts, I filtered my previous list of 16,908 chars and obtained a smaller file containing 2,635 characters!
I was able to sort all these characters into 12 categories. Below, you’ll see the first character of each class:

2,635 characters, from LATIN, GREEK, CYRILLIC and ROMAN scripts, are concerned with DECOMPOSITION MAPPING
( NOTE : G.C. means GENERAL CATEGORY, C.P. means CODE-POINT and D.M. means DECOMPOSITION MAPPING )

563 characters with NFC = NFKC = C.P. and NFD = NFKD = D.M. :
•-------•-----------------------------------•------•-----------•-------------------•-------------------•
| C.P.  | Character Name                    | G.C. | Dec. Type | NFC = NFKC = C.P. | NFD = NFKD = D.M. |
•-------•-----------------------------------•------•-----------•-------------------•-------------------•
| 00C0  | LATIN CAPITAL LETTER A WITH GRAVE | Lu   | canonical | 00C0              | 0041 0300         |
•-------•-----------------------------------•------•-----------•-------------------•-------------------•

18 characters with NFC = NFKC = NFD = NFKD = D.M. :
•-------•---------------------------•------•-----------•--------------------------------•
| C.P.  | Character Name            | G.C. | Dec. Type | NFC = NFKC = NFD = NFKD = D.M. |
•-------•---------------------------•------•-----------•--------------------------------•
| 0340  | COMBINING GRAVE TONE MARK | Mn   | canonical | 0300                           |
•-------•---------------------------•------•-----------•--------------------------------•

17 characters with NFC = NFKC = D.M. and NFD = NFKD :
•-------•------------------------------------•------•-----------•-------------------•------------•
| C.P.  | Character Name                     | G.C. | Dec. Type | NFC = NFKC = D.M. | NFD = NFKD |
•-------•------------------------------------•------•-----------•-------------------•------------•
| 1F71  | GREEK SMALL LETTER ALPHA WITH OXIA | Ll   | canonical | 03AC              | 03B1 0301  |
•-------•------------------------------------•------•-----------•-------------------•------------•

250 characters with NFC = NFKC = C.P. and NFD = NFKD and D.M. <> from OTHERS :
•-------•--------------------------------------------------•------•-----------•-----------•-------------------•----------------•------•
| C.P.  | Character Name                                   | G.C. | Dec. Type | Dec. Map. | NFC = NFKC = C.P. | NFD = NFKD     | Code |
•-------•--------------------------------------------------•------•-----------•-----------•-------------------•----------------•------•
| 01D5  | LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON | Lu   | canonical | 00DC 0304 | 01D5              | 0055 0308 0304 | 01D5 |
•-------•--------------------------------------------------•------•-----------•-----------•-------------------•----------------•------•

9 characters with NFC = NFKC = NFD = NFKD and D.M. <> from OTHERS :
•-------•----------------------------•------•-----------•--------------•-------------------------•
| C.P.  | Character Name             | G.C. | Dec. Type | Decomp. Map. | NFC = NFKC = NFD = NFKD |
•-------•----------------------------•------•-----------•--------------•-------------------------•
| 1D160 | MUSICAL SYMBOL EIGHTH NOTE | So   | canonical | 1D15F 1D16E  | 1D158 1D165 1D16E       |
•-------•----------------------------•------•-----------•--------------•-------------------------•

1,716 characters with NFC = NFD = C.P. and NFKC = NFKD = D.M. :
•-------•----------------•------•-----------•------------------•--------------------•------•
| C.P.  | Character Name | G.C. | Dec. Type | NFC = NFD = C.P. | NFKC = NFKD = D.M. | Code |
•-------•----------------•------•-----------•------------------•--------------------•------•
| 00A0  | NO-BREAK SPACE | Zs   | <noBreak> | 00A0             | 0020               | 00A0 |
•-------•----------------•------•-----------•------------------•--------------------•------•

3 characters with NFC = NFD = D.M. and NFKC = NFKD :
•-------•----------------•------•-----------•------------------•-------------•
| C.P.  | Character Name | G.C. | Dec. Type | NFC = NFD = D.M. | NFKC = NFKD |
•-------•----------------•------•-----------•------------------•-------------•
| 1FFD  | GREEK OXIA     | Sk   | canonical | 00B4             | 0020 0301   |
•-------•----------------•------•-----------•------------------•-------------•

43 characters with NFC = NFD = C.P. and NFKC = NFKD and D.M. <> from OTHERS :
•-------•-------------------------------•------•-----------•-----------•------------------•-------------•------•
| C.P.  | Character Name                | G.C. | Dec. Type | Dec. Map. | NFC = NFD = C.P. | NFKC = NFKD | Code |
•-------•-------------------------------•------•-----------•-----------•------------------•-------------•------•
| FB05  | LATIN SMALL LIGATURE LONG S T | Ll   | <compat>  | 017F 0074 | FB05             | 0073 0074   | FB05 |
•-------•-------------------------------•------•-----------•-----------•------------------•-------------•------•

3 characters with NFC = NFD = C.P. and NFKC = D.M. :
•-------•------------------------------------•------•-----------•------------------•-------------•----------------•------•
| C.P.  | Character Name                     | G.C. | Dec. Type | NFC = NFD = C.P. | NFKC = D.M. | NFKD           | Code |
•-------•------------------------------------•------•-----------•------------------•-------------•----------------•------•
| 01C4  | LATIN CAPITAL LETTER DZ WITH CARON | Lu   | <compat>  | 01C4             | 0044 017D   | 0044 005A 030C | 01C4 |
•-------•------------------------------------•------•-----------•------------------•-------------•----------------•------•

9 characters with NFC = C.P. and NFD = D.M. and NFKC = NFKD :
•-------•-----------------------•------•-----------•------------•------------•----------------•
| C.P.  | Character Name        | G.C. | Dec. Type | NFC = C.P. | NFD = D.M. | NFKC = NFKD    |
•-------•-----------------------•------•-----------•------------•------------•----------------•
| 0385  | GREEK DIALYTIKA TONOS | Sk   | canonical | 0385       | 00A8 0301  | 0020 0308 0301 |
•-------•-----------------------•------•-----------•------------•------------•----------------•

1 character with NFC = D.M. and NFKC = NFKD :
•-------•--------------------------•------•-----------•------------•-----------•----------------•
| C.P.  | Character Name           | G.C. | Dec. Type | NFC = D.M. | NFD       | NFKC = NFKD    |
•-------•--------------------------•------•-----------•------------•-----------•----------------•
| 1FEE  | GREEK DIALYTIKA AND OXIA | Sk   | canonical | 0385       | 00A8 0301 | 0020 0308 0301 |
•-------•--------------------------•------•-----------•------------•-----------•----------------•

3 characters with NFC = C.P. and NFD = D.M. and ALL columns DIFFERENT :
•-------•------------------------------------------•------•-----------•------------•------------•------•-----------•
| C.P.  | Character Name                           | G.C. | Dec. Type | NFC = C.P. | NFD = D.M. | NFKC | NFKD      |
•-------•------------------------------------------•------•-----------•------------•------------•------•-----------•
| 03D3  | GREEK UPSILON WITH ACUTE AND HOOK SYMBOL | Lu   | canonical | 03D3       | 03D2 0301  | 038E | 03A5 0301 |
•-------•------------------------------------------•------•-----------•------------•------------•------•-----------•
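The sample rows above can be recomputed with Python's standard `unicodedata` module. This is a sketch for cross-checking only: the helper names `hex_points` and `describe` are mine, and the values come from the Unicode data bundled with the Python build in use, which may be a different Unicode version from the one BabelMap relies on.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def hex_points(text):
    """Return the code points of text as space-separated hex values."""
    return " ".join(f"{ord(ch):04X}" for ch in text)

def describe(ch):
    """Return the decomposition mapping and the four normalized forms of ch."""
    row = {"C.P.": hex_points(ch), "D.M.": unicodedata.decomposition(ch)}
    for form in FORMS:
        row[form] = hex_points(unicodedata.normalize(form, ch))
    return row

# One representative per category, as in the tables.
for cp in (0x00C0, 0x0340, 0x1F71, 0x01D5, 0x1D160, 0x00A0,
           0x1FFD, 0xFB05, 0x01C4, 0x0385, 0x1FEE, 0x03D3):
    print(describe(chr(cp)))
```

Note that `unicodedata.decomposition()` returns only the direct mapping, with a tag such as `<compat>` or `<noBreak>` for compatibility mappings, whereas the normalized forms apply the mappings recursively. That difference is what the "D.M. <> from OTHERS" categories capture.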
Now, as you can see, quite a lot of categories have an NFC value strictly identical to the C.P. ( code-point ) of the characters. If we omit all the characters of these categories, only 48 characters remain that are changed when replaced with their NFC value!
In the next post, you’ll get the list of these 48 characters, which are modified when using the option Convert > Normalization Form > To NFC of the BabelPad software!
Just tell me if you need the complete list ( 2,635 chars ) too. I could send it by e-mail!
Best Regards,
guy038
-
Hello, @xaviermdq,
Here is the list of the characters transformed by the NFC normalization form:

48 characters, from LATIN, GREEK, CYRILLIC and ROMAN scripts, are REPLACED by OTHER character(s)
when using the option "Convert > Normalization Form > To NFC" of the "BabelPad" software
( NOTE : C.P. means CODE-POINT, G.C. means GENERAL CATEGORY and D.M. means DECOMPOSITION MAPPING )

18 characters with NFC = NFKC = NFD = NFKD = D.M. :
•------•-------•---------------------------------•------•-----------•--------------------------------•
| Chr. | C.P.  | Character Name                  | G.C. | Dec. Type | NFC = NFKC = NFD = NFKD = D.M. |
•------•-------•---------------------------------•------•-----------•--------------------------------•
|  ̀    | 0340  | COMBINING GRAVE TONE MARK       | Mn   | canonical | 0300                           |
|  ́    | 0341  | COMBINING ACUTE TONE MARK       | Mn   | canonical | 0301                           |
|  ̓    | 0343  | COMBINING GREEK KORONIS         | Mn   | canonical | 0313                           |
|  ̈́    | 0344  | COMBINING GREEK DIALYTIKA TONOS | Mn   | canonical | 0308 0301                      |
| ʹ    | 0374  | GREEK NUMERAL SIGN              | Lm   | canonical | 02B9                           |
| ;    | 037E  | GREEK QUESTION MARK             | Po   | canonical | 003B                           |
| ·    | 0387  | GREEK ANO TELEIA                | Po   | canonical | 00B7                           |
| ι    | 1FBE  | GREEK PROSGEGRAMMENI            | Ll   | canonical | 03B9                           |
| `    | 1FEF  | GREEK VARIA                     | Sk   | canonical | 0060                           |
| Ω    | 2126  | OHM SIGN                        | Lu   | canonical | 03A9                           |
| K    | 212A  | KELVIN SIGN                     | Lu   | canonical | 004B                           |
| 〈   | 2329  | LEFT-POINTING ANGLE BRACKET     | Ps   | canonical | 3008                           |
| 〉   | 232A  | RIGHT-POINTING ANGLE BRACKET    | Pe   | canonical | 3009                           |
| ⫝̸    | 2ADC  | FORKING                         | Sm   | canonical | 2ADD 0338                      |
| 𝅗𝅥    | 1D15E | MUSICAL SYMBOL HALF NOTE        | So   | canonical | 1D157 1D165                    |
| 𝅘𝅥    | 1D15F | MUSICAL SYMBOL QUARTER NOTE     | So   | canonical | 1D158 1D165                    |
| 𝆹𝅥    | 1D1BB | MUSICAL SYMBOL MINIMA           | So   | canonical | 1D1B9 1D165                    |
| 𝆺𝅥    | 1D1BC | MUSICAL SYMBOL MINIMA BLACK     | So   | canonical | 1D1BA 1D165                    |
•------•-------•---------------------------------•------•-----------•--------------------------------•

17 characters with NFC = NFKC = D.M. and NFD = NFKD :
•------•-------•----------------------------------------------------•------•-----------•-------------------•----------------•
| Chr. | C.P.  | Character Name                                     | G.C. | Dec. Type | NFC = NFKC = D.M. | NFD = NFKD     |
•------•-------•----------------------------------------------------•------•-----------•-------------------•----------------•
| ά    | 1F71  | GREEK SMALL LETTER ALPHA WITH OXIA                 | Ll   | canonical | 03AC              | 03B1 0301      |
| έ    | 1F73  | GREEK SMALL LETTER EPSILON WITH OXIA               | Ll   | canonical | 03AD              | 03B5 0301      |
| ή    | 1F75  | GREEK SMALL LETTER ETA WITH OXIA                   | Ll   | canonical | 03AE              | 03B7 0301      |
| ί    | 1F77  | GREEK SMALL LETTER IOTA WITH OXIA                  | Ll   | canonical | 03AF              | 03B9 0301      |
| ό    | 1F79  | GREEK SMALL LETTER OMICRON WITH OXIA               | Ll   | canonical | 03CC              | 03BF 0301      |
| ύ    | 1F7B  | GREEK SMALL LETTER UPSILON WITH OXIA               | Ll   | canonical | 03CD              | 03C5 0301      |
| ώ    | 1F7D  | GREEK SMALL LETTER OMEGA WITH OXIA                 | Ll   | canonical | 03CE              | 03C9 0301      |
| Ά    | 1FBB  | GREEK CAPITAL LETTER ALPHA WITH OXIA               | Lu   | canonical | 0386              | 0391 0301      |
| Έ    | 1FC9  | GREEK CAPITAL LETTER EPSILON WITH OXIA             | Lu   | canonical | 0388              | 0395 0301      |
| Ή    | 1FCB  | GREEK CAPITAL LETTER ETA WITH OXIA                 | Lu   | canonical | 0389              | 0397 0301      |
| ΐ    | 1FD3  | GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA    | Ll   | canonical | 0390              | 03B9 0308 0301 |
| Ί    | 1FDB  | GREEK CAPITAL LETTER IOTA WITH OXIA                | Lu   | canonical | 038A              | 0399 0301      |
| ΰ    | 1FE3  | GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA | Ll   | canonical | 03B0              | 03C5 0308 0301 |
| Ύ    | 1FEB  | GREEK CAPITAL LETTER UPSILON WITH OXIA             | Lu   | canonical | 038E              | 03A5 0301      |
| Ό    | 1FF9  | GREEK CAPITAL LETTER OMICRON WITH OXIA             | Lu   | canonical | 038C              | 039F 0301      |
| Ώ    | 1FFB  | GREEK CAPITAL LETTER OMEGA WITH OXIA               | Lu   | canonical | 038F              | 03A9 0301      |
| Å    | 212B  | ANGSTROM SIGN                                      | Lu   | canonical | 00C5              | 0041 030A      |
•------•-------•----------------------------------------------------•------•-----------•-------------------•----------------•

9 characters with NFC = NFKC = NFD = NFKD and D.M. <> from OTHERS :
•------•-------•----------------------------------------------•------•-----------•--------------•-------------------------•
| Chr. | C.P.  | Character Name                               | G.C. | Dec. Type | Decomp. Map. | NFC = NFKC = NFD = NFKD |
•------•-------•----------------------------------------------•------•-----------•--------------•-------------------------•
| 𝅘𝅥𝅮    | 1D160 | MUSICAL SYMBOL EIGHTH NOTE                   | So   | canonical | 1D15F 1D16E  | 1D158 1D165 1D16E       |
| 𝅘𝅥𝅯    | 1D161 | MUSICAL SYMBOL SIXTEENTH NOTE                | So   | canonical | 1D15F 1D16F  | 1D158 1D165 1D16F       |
| 𝅘𝅥𝅰    | 1D162 | MUSICAL SYMBOL THIRTY-SECOND NOTE            | So   | canonical | 1D15F 1D170  | 1D158 1D165 1D170       |
| 𝅘𝅥𝅱    | 1D163 | MUSICAL SYMBOL SIXTY-FOURTH NOTE             | So   | canonical | 1D15F 1D171  | 1D158 1D165 1D171       |
| 𝅘𝅥𝅲    | 1D164 | MUSICAL SYMBOL ONE HUNDRED TWENTY-EIGHTH NOTE | So   | canonical | 1D15F 1D172  | 1D158 1D165 1D172       |
| 𝆹𝅥𝅮    | 1D1BD | MUSICAL SYMBOL SEMIMINIMA WHITE              | So   | canonical | 1D1BB 1D16E  | 1D1B9 1D165 1D16E       |
| 𝆺𝅥𝅮    | 1D1BE | MUSICAL SYMBOL SEMIMINIMA BLACK              | So   | canonical | 1D1BC 1D16E  | 1D1BA 1D165 1D16E       |
| 𝆹𝅥𝅯    | 1D1BF | MUSICAL SYMBOL FUSA WHITE                    | So   | canonical | 1D1BB 1D16F  | 1D1B9 1D165 1D16F       |
| 𝆺𝅥𝅯    | 1D1C0 | MUSICAL SYMBOL FUSA BLACK                    | So   | canonical | 1D1BC 1D16F  | 1D1BA 1D165 1D16F       |
•------•-------•----------------------------------------------•------•-----------•--------------•-------------------------•

3 characters with NFC = NFD = D.M. and NFKC = NFKD :
•------•-------•----------------•------•-----------•------------------•-------------•
| Chr. | C.P.  | Character Name | G.C. | Dec. Type | NFC = NFD = D.M. | NFKC = NFKD |
•------•-------•----------------•------•-----------•------------------•-------------•
| ´    | 1FFD  | GREEK OXIA     | Sk   | canonical | 00B4             | 0020 0301   |
|      | 2000  | EN QUAD        | Zs   | canonical | 2002             | 0020        |
|      | 2001  | EM QUAD        | Zs   | canonical | 2003             | 0020        |
•------•-------•----------------•------•-----------•------------------•-------------•

1 character with NFC = D.M. and NFKC = NFKD :
•------•-------•--------------------------•------•-----------•------------•-----------•----------------•
| Chr. | C.P.  | Character Name           | G.C. | Dec. Type | NFC = D.M. | NFD       | NFKC = NFKD    |
•------•-------•--------------------------•------•-----------•------------•-----------•----------------•
| ΅    | 1FEE  | GREEK DIALYTIKA AND OXIA | Sk   | canonical | 0385       | 00A8 0301 | 0020 0308 0301 |
•------•-------•--------------------------•------•-----------•------------•-----------•----------------•
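A list of this kind can be regenerated, at least approximately, with a short Python scan based on the standard `unicodedata` module. This is a sketch under stated assumptions: the function name `changed_by_nfc` and the code-point ranges are my own choices, not the script filter used for the tables above, so the number of hits may differ and also depends on the Unicode version bundled with Python.

```python
import unicodedata

def changed_by_nfc(ranges):
    """Yield (char, nfc) for every single code point that NFC replaces."""
    for start, end in ranges:
        for cp in range(start, end + 1):
            ch = chr(cp)
            nfc = unicodedata.normalize("NFC", ch)
            if nfc != ch:
                yield ch, nfc

# Hypothetical ranges, roughly covering Latin, Greek, Cyrillic, general
# punctuation, letterlike and technical symbols, and musical symbols.
ranges = [(0x0000, 0x052F), (0x1D00, 0x2BFF), (0x1D100, 0x1D1FF)]

replaced = dict(changed_by_nfc(ranges))
for ch, nfc in replaced.items():
    targets = " ".join(f"{ord(c):04X}" for c in nfc)
    print(f"{ord(ch):04X}  {unicodedata.name(ch, '?'):52}  ->  {targets}")
```

Characters such as U+00C0, whose NFC form is the character itself, never show up in the output, which is why the list is so much shorter than the 2,635 characters with a decomposition mapping.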
Hope this helps you !
Cheers,
guy038
-
@guy038 Reacting late to this thread, but I think that in many cases, when you want to normalise a Unicode text, NFC isn’t enough because it doesn’t handle ligatures.
So for many people, a longer table than the one here, including all the decompositions in NFKC, will be needed.
To make this more concrete: NFC won’t normalise the ffi ligature (U+FB03).
So “A\uFB03n” will stay “A\uFB03n” if normalized with NFC, but will change to “Affin” if normalized with NFKC, which is much more useful for many people.
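The difference is easy to verify with Python's standard `unicodedata` module (a minimal sketch; the variable names are mine):

```python
import unicodedata

text = "A\ufb03n"   # "A" + LATIN SMALL LIGATURE FFI + "n"

nfc = unicodedata.normalize("NFC", text)
nfkc = unicodedata.normalize("NFKC", text)

print(nfc == text)   # True: NFC leaves the ligature untouched
print(nfkc)          # Affin
print(len(text), len(nfkc))   # 3 5
```

Keep in mind that NFKC is lossy: it also folds characters such as the NO-BREAK SPACE into a plain space, so it suits searching and comparison better than archival copies.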