Unicode Normalization

Reply to Unicode Normalization on Sat, 03 Jul 2021 03:50:37 GMT

guy038 — Sat, 03 Jul 2021 03:50:37 GMT

Here is the list of the characters transformed by the NFC normalization form :

    48 characters, from LATIN, GREEK, CYRILLIC and ROMAN scripts, are REPLACED by OTHER character(s)

        when using the option "Convert > Normalization Form > To NFC" of the "BabelPad" software


    ( NOTE : C.P. means CODE-POINT, G.C. means GENERAL CATEGORY and D.M. means DECOMPOSITION MAPPING )


18 characters with NFC = NFKC = NFD = NFKD = D.M. :

•-----•---------•-----------------------------------•------•-------------•---------------------------------•
! Chr.|   C.P.  |          Character Name           | G.C. |  Dec. Type  |  NFC = NFKC = NFD = NFKD = D.M. |
•-----•---------•-----------------------------------•------•-------------•---------------------------------•
|   ̀  |   0340  |  COMBINING GRAVE TONE MARK        |  Mn  |  canonical  |  0300                           |
|   ́  |   0341  |  COMBINING ACUTE TONE MARK        |  Mn  |  canonical  |  0301                           |
|   ̓  |   0343  |  COMBINING GREEK KORONIS          |  Mn  |  canonical  |  0313                           |
|   ̈́  |   0344  |  COMBINING GREEK DIALYTIKA TONOS  |  Mn  |  canonical  |  0308 0301                      |
|  ʹ ‎ |   0374  |  GREEK NUMERAL SIGN               |  Lm  |  canonical  |  02B9                           |
|  ; ‎ |   037E  |  GREEK QUESTION MARK              |  Po  |  canonical  |  003B                           |
|  · ‎ |   0387  |  GREEK ANO TELEIA                 |  Po  |  canonical  |  00B7                           |
|  ι ‎ |   1FBE  |  GREEK PROSGEGRAMMENI             |  Ll  |  canonical  |  03B9                           |
|  `  |   1FEF  |  GREEK VARIA                      |  Sk  |  canonical  |  0060                           |
|  Ω  |   2126  |  OHM SIGN                         |  Lu  |  canonical  |  03A9                           |
|  K  |   212A  |  KELVIN SIGN                      |  Lu  |  canonical  |  004B                           |
|  〈  |   2329  |  LEFT-POINTING ANGLE BRACKET      |  Ps  |  canonical  |  3008                           |
|  〉  |   232A  |  RIGHT-POINTING ANGLE BRACKET     |  Pe  |  canonical  |  3009                           |
|  ⫝̸  |   2ADC  |  FORKING                          |  Sm  |  canonical  |  2ADD 0338                      |
|  𝅗𝅥  |  1D15E  |  MUSICAL SYMBOL HALF NOTE         |  So  |  canonical  |  1D157 1D165                    |
|  𝅘𝅥  |  1D15F  |  MUSICAL SYMBOL QUARTER NOTE      |  So  |  canonical  |  1D158 1D165                    |
|  𝆹𝅥  |  1D1BB  |  MUSICAL SYMBOL MINIMA            |  So  |  canonical  |  1D1B9 1D165                    |
|  𝆺𝅥  |  1D1BC  |  MUSICAL SYMBOL MINIMA BLACK      |  So  |  canonical  |  1D1BA 1D165                    |
•-----•---------•-----------------------------------•------•-------------•---------------------------------•
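A few rows of this table can be cross-checked outside BabelPad. The sketch below assumes Python 3 and its standard unicodedata module, which nobody in this thread used; the helper names hex_points and all_forms are made up for the illustration, and the results depend on the Unicode version bundled with your Python.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def hex_points(text):
    """Return the code points of text as space-separated uppercase hex."""
    return " ".join(f"{ord(ch):04X}" for ch in text)

def all_forms(ch):
    """Map each normalization form name to the normalized result, as hex."""
    return {form: hex_points(unicodedata.normalize(form, ch)) for form in FORMS}

# Four rows taken from the table above.
for ch in ("\u0340", "\u0344", "\u2126", "\u212A"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {all_forms(ch)}")
```

For these characters the four forms should all show the same value, which is also the decomposition mapping listed in the last column.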



17 characters with NFC = NFKC = D.M. and NFD = NFKD :

•-----•---------•------------------------------------------------------•------•-------------•---------------------•------------------•
! Chr.|   C.P.  |                    Character Name                    | G.C. |  Dec. Type  |  NFC = NFKC = D.M.  |    NFD = NFKD    |
•-----•---------•------------------------------------------------------•------•-------------•---------------------•------------------•
|  ά  |   1F71  |  GREEK SMALL LETTER ALPHA WITH OXIA                  |  Ll  |  canonical  |  03AC               |  03B1 0301       |
|  έ  |   1F73  |  GREEK SMALL LETTER EPSILON WITH OXIA                |  Ll  |  canonical  |  03AD               |  03B5 0301       |
|  ή  |   1F75  |  GREEK SMALL LETTER ETA WITH OXIA                    |  Ll  |  canonical  |  03AE               |  03B7 0301       |
|  ί  |   1F77  |  GREEK SMALL LETTER IOTA WITH OXIA                   |  Ll  |  canonical  |  03AF               |  03B9 0301       |
|  ό  |   1F79  |  GREEK SMALL LETTER OMICRON WITH OXIA                |  Ll  |  canonical  |  03CC               |  03BF 0301       |
|  ύ  |   1F7B  |  GREEK SMALL LETTER UPSILON WITH OXIA                |  Ll  |  canonical  |  03CD               |  03C5 0301       |
|  ώ  |   1F7D  |  GREEK SMALL LETTER OMEGA WITH OXIA                  |  Ll  |  canonical  |  03CE               |  03C9 0301       |
|  Ά  |   1FBB  |  GREEK CAPITAL LETTER ALPHA WITH OXIA                |  Lu  |  canonical  |  0386               |  0391 0301       |
|  Έ  |   1FC9  |  GREEK CAPITAL LETTER EPSILON WITH OXIA              |  Lu  |  canonical  |  0388               |  0395 0301       |
|  Ή  |   1FCB  |  GREEK CAPITAL LETTER ETA WITH OXIA                  |  Lu  |  canonical  |  0389               |  0397 0301       |
|  ΐ  |   1FD3  |  GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA     |  Ll  |  canonical  |  0390               |  03B9 0308 0301  |
|  Ί  |   1FDB  |  GREEK CAPITAL LETTER IOTA WITH OXIA                 |  Lu  |  canonical  |  038A               |  0399 0301       |
|  ΰ  |   1FE3  |  GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA  |  Ll  |  canonical  |  03B0               |  03C5 0308 0301  |
|  Ύ  |   1FEB  |  GREEK CAPITAL LETTER UPSILON WITH OXIA              |  Lu  |  canonical  |  038E               |  03A5 0301       |
|  Ό  |   1FF9  |  GREEK CAPITAL LETTER OMICRON WITH OXIA              |  Lu  |  canonical  |  038C               |  039F 0301       |
|  Ώ  |   1FFB  |  GREEK CAPITAL LETTER OMEGA WITH OXIA                |  Lu  |  canonical  |  038F               |  03A9 0301       |
|  Å  |   212B  |  ANGSTROM SIGN                                       |  Lu  |  canonical  |  00C5               |  0041 030A       |
•-----•---------•------------------------------------------------------•------•-------------•---------------------•------------------•



9 characters with NFC = NFKC = NFD = NFKD and D.M. DIFFERENT from the OTHERS :

•-----•---------•-------------------------------------------------•------•-------------•---------------•---------------------------•
! Chr.|   C.P.  |                 Character Name                  | G.C. |  Dec. Type  |  Decomp. Map. |  NFC = NFKC = NFD = NFKD  |
•-----•---------•-------------------------------------------------•------•-------------•---------------•---------------------------•
|  𝅘𝅥𝅮 ‎ |  1D160  |  MUSICAL SYMBOL EIGHTH NOTE                     |  So  |  canonical  |  1D15F 1D16E  |  1D158 1D165 1D16E        |
|  𝅘𝅥𝅯 ‎ |  1D161  |  MUSICAL SYMBOL SIXTEENTH NOTE                  |  So  |  canonical  |  1D15F 1D16F  |  1D158 1D165 1D16F        |
|  𝅘𝅥𝅰 ‎ |  1D162  |  MUSICAL SYMBOL THIRTY-SECOND NOTE              |  So  |  canonical  |  1D15F 1D170  |  1D158 1D165 1D170        |
|  𝅘𝅥𝅱 ‎ |  1D163  |  MUSICAL SYMBOL SIXTY-FOURTH NOTE               |  So  |  canonical  |  1D15F 1D171  |  1D158 1D165 1D171        |
|  𝅘𝅥𝅲 ‎ |  1D164  |  MUSICAL SYMBOL ONE HUNDRED TWENTY-EIGHTH NOTE  |  So  |  canonical  |  1D15F 1D172  |  1D158 1D165 1D172        |
|  𝆹𝅥𝅮 ‎ |  1D1BD  |  MUSICAL SYMBOL SEMIMINIMA WHITE                |  So  |  canonical  |  1D1BB 1D16E  |  1D1B9 1D165 1D16E        |
|  𝆺𝅥𝅮 ‎ |  1D1BE  |  MUSICAL SYMBOL SEMIMINIMA BLACK                |  So  |  canonical  |  1D1BC 1D16E  |  1D1BA 1D165 1D16E        |
|  𝆹𝅥𝅯 ‎ |  1D1BF  |  MUSICAL SYMBOL FUSA WHITE                      |  So  |  canonical  |  1D1BB 1D16F  |  1D1B9 1D165 1D16F        |
|  𝆺𝅥𝅯 ‎ |  1D1C0  |  MUSICAL SYMBOL FUSA BLACK                      |  So  |  canonical  |  1D1BC 1D16F  |  1D1BA 1D165 1D16F        |
•-----•---------•-------------------------------------------------•------•-------------•---------------•---------------------------•
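The reason why the D.M. column differs from the four normalization forms here is that the decomposition mapping is a single step, whereas NFD applies it recursively. A minimal sketch, assuming Python 3 and its standard unicodedata module (hex_points is a hypothetical helper, not part of any tool mentioned in this thread):

```python
import unicodedata

def hex_points(text):
    """Return the code points of text as space-separated uppercase hex."""
    return " ".join(f"{ord(ch):04X}" for ch in text)

eighth_note = "\U0001D160"  # MUSICAL SYMBOL EIGHTH NOTE

# The raw Decomposition_Mapping property is one, non-recursive step.
one_step = unicodedata.decomposition(eighth_note)

# NFD keeps decomposing until nothing more can be decomposed.
nfd = hex_points(unicodedata.normalize("NFD", eighth_note))

# NFC decomposes first, then recomposes; these symbols are not recomposed.
nfc = hex_points(unicodedata.normalize("NFC", eighth_note))

print("D.M.:", one_step)
print("NFD :", nfd)
print("NFC :", nfc)
```

The first step gives 1D15F 1D16E, and 1D15F itself decomposes to 1D158 1D165, hence the three code points found in every normalization form.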



3 characters with NFC = NFD = D.M. and NFKC = NFKD :

•-----•---------•------------------•------•-------------•-------------------•---------------•
! Chr.|   C.P.  |  Character Name  | G.C. |  Dec. Type  |  NFC = NFD = D.M. |  NFKC = NFKD  |
•-----•---------•------------------•------•-------------•-------------------•---------------•
|  ´  |   1FFD  |  GREEK OXIA      |  Sk  |  canonical  |  00B4             |  0020 0301    |
|     |   2000  |  EN QUAD         |  Zs  |  canonical  |  2002             |  0020         |
|     |   2001  |  EM QUAD         |  Zs  |  canonical  |  2003             |  0020         |
•-----•---------•------------------•------•-------------•-------------------•---------------•



1 character with NFKC = NFKD and D.M. = NFC :

•-----•---------•----------------------------•------•-------------•-------------•-------------•------------------•
! Chr.|   C.P.  |       Character Name       | G.C. |  Dec. Type  |  NFC = D.M. |     NFD     |   NFKC = NFKD    |
•-----•---------•----------------------------•------•-------------•-------------•-------------•------------------•
|  ΅  |   1FEE  |  GREEK DIALYTIKA AND OXIA  |  Sk  |  canonical  |  0385       |  00A8 0301  |  0020 0308 0301  |
•-----•---------•----------------------------•------•-------------•-------------•-------------•------------------•

Hope this helps you !

Cheers,

guy038

Reply to Unicode Normalization on Sat, 03 Jul 2021 11:54:54 GMT

guy038 — Sat, 03 Jul 2021 11:54:54 GMT

Hi, @xaviermdq,

As you are interested in the Latin, Greek, Cyrillic and Roman scripts only, I filtered my previous list of 16,908 chars and obtained a smaller file, containing 2,635 characters !

I was able to classify all these characters into 12 categories. Below, you’ll see the first character of each category :

     2,635 characters, from LATIN, GREEK, CYRILLIC and ROMAN scripts, are concerned with DECOMPOSITION MAPPING


        ( NOTE : G.C. means GENERAL CATEGORY, C.P. means CODE-POINT and D.M. means DECOMPOSITION MAPPING )


563 characters with NFC = NFKC = C.P. and NFD = NFKD = D.M. :

•---------•------------------------------------------------------------•------•-------------•--------------------•--------------------•
!   C.P.  |                       Character Name                       | G.C. |  Dec. Type  |  NFC = NFKC = C.P. |  NFD = NFKD = D.M. |
•---------•------------------------------------------------------------•------•-------------•--------------------•--------------------•
|   00C0  |  LATIN CAPITAL LETTER A WITH GRAVE                         |  Lu  |  canonical  |  00C0              |  0041 0300         |
•---------•------------------------------------------------------------•------•-------------•--------------------•--------------------•



18 characters with NFC = NFKC = NFD = NFKD = D.M. :

•---------•-----------------------------------•------•-------------•---------------------------------•
!   C.P.  |          Character Name           | G.C. |  Dec. Type  |  NFC = NFKC = NFD = NFKD = D.M. |
•---------•-----------------------------------•------•-------------•---------------------------------•
|   0340  |  COMBINING GRAVE TONE MARK        |  Mn  |  canonical  |  0300                           |
•---------•-----------------------------------•------•-------------•---------------------------------•



17 characters with NFC = NFKC = D.M. and NFD = NFKD :

•---------•------------------------------------------------------•------•-------------•---------------------•------------------•
!   C.P.  |                    Character Name                    | G.C. |  Dec. Type  |  NFC = NFKC = D.M.  |    NFD = NFKD    |
•---------•------------------------------------------------------•------•-------------•---------------------•------------------•
|   1F71  |  GREEK SMALL LETTER ALPHA WITH OXIA                  |  Ll  |  canonical  |  03AC               |  03B1 0301       |
•---------•------------------------------------------------------•------•-------------•---------------------•------------------•



250 characters with NFC = NFKC = C.P. and NFD = NFKD and D.M. DIFFERENT from the OTHERS :

•---------•----------------------------------------------------------------------------•------•-------------•-------------•--------------------•-----------------------•---------•
!   C.P.  |                               Character Name                               | G.C. |  Dec. Type  |  Dec. Map.  |  NFC = NFKC = C.P. |      NFD = NFKD       |  Code   |
•---------•----------------------------------------------------------------------------•------•-------------•-------------•--------------------•-----------------------•---------•
|   01D5  |  LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON                          |  Lu  |  canonical  |  00DC 0304  |  01D5              |  0055 0308 0304       |   01D5  |
•---------•----------------------------------------------------------------------------•------•-------------•-------------•--------------------•-----------------------•---------•



9 characters with NFC = NFKC = NFD = NFKD and D.M. DIFFERENT from the OTHERS :

•---------•-------------------------------------------------•------•-------------•---------------•---------------------------•
!   C.P.  |                 Character Name                  | G.C. |  Dec. Type  |  Decomp. Map. |  NFC = NFKC = NFD = NFKD  |
•---------•-------------------------------------------------•------•-------------•---------------•---------------------------•
|  1D160  |  MUSICAL SYMBOL EIGHTH NOTE                     |  So  |  canonical  |  1D15F 1D16E  |  1D158 1D165 1D16E        |
•---------•-------------------------------------------------•------•-------------•---------------•---------------------------•



1716 characters with NFC = NFD = C.P. and NFKC = NFKD = D.M. :

•---------•-----------------------------------------------------------------•------•--------------•-------------------•-----------------------•---------•
!   C.P.  |                         Character Name                          | G.C. |  Dec. Type   |  NFC = NFD = C.P. |  NFKC = NFKD = D.M.   |  Code   |
•---------•-----------------------------------------------------------------•------•--------------•-------------------•-----------------------•---------•
|   00A0  |  NO-BREAK SPACE                                                 |  Zs  |  <noBreak>   |  00A0             |  0020                 |   00A0  |
•---------•-----------------------------------------------------------------•------•--------------•-------------------•-----------------------•---------•



3 characters with NFC = NFD = D.M. and NFKC = NFKD :

•---------•------------------•------•-------------•-------------------•---------------•
!   C.P.  |  Character Name  | G.C. |  Dec. Type  |  NFC = NFD = D.M. |  NFKC = NFKD  |
•---------•------------------•------•-------------•-------------------•---------------•
|   1FFD  |  GREEK OXIA      |  Sk  |  canonical  |  00B4             |  0020 0301    |
•---------•------------------•------•-------------•-------------------•---------------•



43 characters with NFC = NFD = C.P. and NFKC = NFKD and D.M. DIFFERENT from the OTHERS :

•---------•------------------------------------------------------------•------•--------------•-------------•-------------------•------------------•---------•
!   C.P.  |                       Character Name                       | G.C. |  Dec. Type   |  Dec. Map.  |  NFC = NFD = C.P. |   NFKC = NFKD    |  Code   |
•---------•------------------------------------------------------------•------•--------------•-------------•-------------------•------------------•---------•
|   FB05  |  LATIN SMALL LIGATURE LONG S T                             |  Ll  |  <compat>    |  017F 0074  |  FB05             |  0073 0074       |   FB05  |
•---------•------------------------------------------------------------•------•--------------•-------------•-------------------•------------------•---------•



3 characters with NFC = NFD = C.P. and NFKC = D.M. :

•---------•---------------------------------------------------------•------•-------------•--------------------•--------------•------------------•---------•
!   C.P.  |                     Character Name                      | G.C. |  Dec. Type  |  NFC = NFD = C.P.  |  NFKC = D.M. |       NFKD       |  Code   |
•---------•---------------------------------------------------------•------•-------------•--------------------•--------------•------------------•---------•
|   01C4  |  LATIN CAPITAL LETTER DZ WITH CARON                     |  Lu  |  <compat>   |  01C4              |  0044 017D   |  0044 005A 030C  |   01C4  |
•---------•---------------------------------------------------------•------•-------------•--------------------•--------------•------------------•---------•



9 characters with NFC = C.P. and NFD = D.M. and NFKC = NFKD :

•---------•-----------------------------------•------•-------------•-------------•-------------•------------------•
!   C.P.  |          Character Name           | G.C. |  Dec. Type  |  NFC = C.P. |  NFD = D.M. |   NFKC = NFKD    |
•---------•-----------------------------------•------•-------------•-------------•-------------•------------------•
|   0385  |  GREEK DIALYTIKA TONOS            |  Sk  |  canonical  |  0385       |  00A8 0301  |  0020 0308 0301  |
•---------•-----------------------------------•------•-------------•-------------•-------------•------------------•



1 character with NFC = D.M. and NFKC = NFKD :

•---------•----------------------------•------•-------------•-------------•-------------•------------------•
!   C.P.  |       Character Name       | G.C. |  Dec. Type  |  NFC = D.M. |     NFD     |   NFKC = NFKD    |
•---------•----------------------------•------•-------------•-------------•-------------•------------------•
|   1FEE  |  GREEK DIALYTIKA AND OXIA  |  Sk  |  canonical  |  0385       |  00A8 0301  |  0020 0308 0301  |
•---------•----------------------------•------•-------------•-------------•-------------•------------------•



3 characters with NFC = C.P. and NFD = D.M. and ALL columns DIFFERENT :

•---------•-----------------------------------------------•------•-------------•-------------•-------------•--------•-------------•
!   C.P.  |                Character Name                 | G.C. |  Dec. Type  |  NFC = C.P. |  NFD = D.M. |  NFKC  |    NFKD     |
•---------•-----------------------------------------------•------•-------------•-------------•-------------•--------•-------------•
|   03D3  |  GREEK UPSILON WITH ACUTE AND HOOK SYMBOL     |  Lu  |  canonical  |  03D3       |  03D2 0301  |  038E  |  03A5 0301  |
•---------•-----------------------------------------------•------•-------------•-------------•-------------•--------•-------------•

Now, as you can see, quite a lot of categories have an NFC value strictly identical to the C.P. ( code-point ) of the character. If we omit all the characters of these categories, only 48 characters remain which are changed when replaced by their NFC value !
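The same idea, keeping only the characters whose NFC value differs from their own code-point, can be sketched with Python 3 and its standard unicodedata module. This is an assumption-laden cross-check, not the method used for the lists in this thread: it scans every script, without the Latin / Greek / Cyrillic / Roman filter, so the count it prints will be larger than 48 and will vary with the Unicode version bundled with Python. The function name changed_by_nfc is hypothetical.

```python
import unicodedata

def changed_by_nfc():
    """Yield every code point whose NFC form differs from the character itself."""
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:  # surrogates are not characters, skip them
            continue
        ch = chr(cp)
        if unicodedata.normalize("NFC", ch) != ch:
            yield cp

changed = list(changed_by_nfc())
print(len(changed), "code points are changed by NFC in Unicode", unicodedata.unidata_version)

# Spot checks against the categories above.
print(0x2126 in changed)  # OHM SIGN, NFC = D.M.
print(0x00C0 in changed)  # LATIN CAPITAL LETTER A WITH GRAVE, NFC = C.P.
```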

In the next post, you’ll get the list of these 48 characters which are modified when using the option Convert > Normalization Form > To NFC of the BabelPad software !

Just tell me if you need the complete list ( 2,635 chars ) too. I could send it by e-mail !

Best Regards,

guy038

Reply to Unicode Normalization on Tue, 29 Jun 2021 22:15:09 GMT

xaviermdq — Tue, 29 Jun 2021 22:15:09 GMT

Hi guy038,
Sorry for the delay (I missed the email notification). I am using “CYRILLIC - GREEK - LATIN - ROMAN”. But is it important to know? I ask because the normalization function that I would like Notepad++ to have would be independent of the character set used, wouldn’t it? Anyway, I already did the normalization using BabelPad (menu Convert, Normalization form, To NFC). It corrected only 3 composition characters. In the future I’ll use BabelPad for normalization. Thank you very much for the explanations. They were revealing. As soon as I can, I am going to study this matter in more detail.

Reply to Unicode Normalization on Tue, 01 Jun 2021 14:16:09 GMT

guy038 — Tue, 01 Jun 2021 14:16:09 GMT

Hello, @xaviermdq,

I’ve begun, with the advanced search of the BabelMap software and the contents of the NormalizationTest.txt file, that you may download from here, to build a complete list of Unicode characters with a Decomposition Mapping property, as well as their NFC, NFD, NFKC and NFKD values !

I obtained a list of 16,908 characters, corresponding to the @Part1 # Character by character test section of the NormalizationTest.txt file.
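Each data line of NormalizationTest.txt holds five semicolon-separated fields (source, NFC, NFD, NFKC, NFKD) followed by a comment. A minimal parsing sketch, assuming Python 3; the function name parse_test_line is hypothetical and the sample line is written here in the file's format purely for illustration:

```python
import unicodedata

def parse_test_line(line):
    """Parse one line into (source, nfc, nfd, nfkc, nfkd) strings.

    Returns None for blank lines, comment lines and @Part headers.
    """
    data = line.split("#", 1)[0].strip()
    if not data or data.startswith("@"):
        return None
    fields = data.split(";")[:5]
    return tuple("".join(chr(int(cp, 16)) for cp in field.split()) for field in fields)

sample = "2126;03A9;03A9;03A9;03A9; # OHM SIGN"
source, nfc, nfd, nfkc, nfkd = parse_test_line(sample)

# Compare the columns of the file with what the local Unicode library computes.
print(unicodedata.normalize("NFC", source) == nfc)
print(unicodedata.normalize("NFKD", source) == nfkd)
print(parse_test_line("@Part1 # Character by character test"))
```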

It would be sensible to restrict such a list to the Unicode script(s) that you currently use ! So, could you tell me which one(s), from the list of scripts below, you want to consider ?

CYRILLIC - GREEK - LATIN - ROMAN

ARABIC - ARMENIAN - HEBREW

CJK - HANGUL - HANGZHOU - KANGXI

HIRAGANA - KATAKANA

BALINESE

BENGALI - CHAKMA - DEVANAGARI - DIVEHI AKURU - GRANTHA - GURMUKHI - KAITHI
KANNADA - MALAYALAM - ORIYA - SIDDHAM - SINHALA - TAMIL - TELUGU - TIRHUTA

LAO - MYANMAR - THAI - TIBETAN

TIFINAGH

Best Regards,

guy038

Reply to Unicode Normalization on Mon, 31 May 2021 15:19:42 GMT

xaviermdq — Mon, 31 May 2021 15:19:42 GMT

Hi @guy038 :
Sorry, I asked the wrong question because I didn’t understand what was happening. What I really want is a new option (for example “Convert to UTF-8 NFC”, or something like that, in the Encoding menu) that allows me to do canonical normalization. So that you understand what I want, I will show you an example:
The code points:
GREEK CAPITAL LETTER OMEGA , U+03A9 , UTF-8: 0xCE 0xA9
OHM SIGN , U+2126 , UTF-8: 0xE2 0x84 0xA6
refer to the same character, although some fonts (like MS Arial) represent it slightly differently.
If you apply canonical normalization, U+2126 is transformed to U+03A9 (I tested it with BabelPad).

Thank you for your really comprehensive response.
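The OMEGA / OHM SIGN example above can be reproduced with a few lines of Python 3, assuming its standard unicodedata module; this is only an illustrative sketch and not a Notepad++ feature. The helper utf8_hex is a made-up name.

```python
import unicodedata

def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))

omega = "\u03A9"  # GREEK CAPITAL LETTER OMEGA
ohm = "\u2126"    # OHM SIGN

print(utf8_hex(omega))  # CE A9
print(utf8_hex(ohm))    # E2 84 A6

# The two code points differ, but canonical normalization maps one to the other.
print(omega == ohm)                                # False
print(unicodedata.normalize("NFC", ohm) == omega)  # True
```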

Reply to Unicode Normalization on Wed, 30 Jun 2021 03:52:09 GMT

guy038 — Wed, 30 Jun 2021 03:52:09 GMT

Hello, @xaviermdq,

I don’t think that character decomposition and encodings are related, in any way !

Whatever the Unicode encoding used, the encoding process simply writes the appropriate byte(s) in order to encode each individual character

By contrast, the Unicode Normalization forms rather deal with :

Composition of characters into some pre-composed characters
Decomposition of characters into their base letter and some combining characters in a specific order

For instance, let us consider the LATIN SMALL LETTER E, of code-point U+0065. Starting with this base letter, we may consider the related strings, below :

•----------•-----------------•------------------------------•------•------------•------•
|  String  | Char(s) Number  |        Decomposition         |   >  |      e     |   <  |
•----------•-----------------•------------------------------•------•------------•------•
|   >e<    |        3        |  U+003E    U+0065    U+003C  |  3E  |     65     |  3C  |
•----------•-----------------•------------------------------•------•------------•------•


•----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•
|  String  | Char(s) Number  |                    Decomposition                     |   >  |      e     |     ́    |     ̂    |   <  |
•----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•
|   >é̂<   |        5        |  U+003E  U+0065 (e)  U+0301 ( ́)  U+0302 ( ̂)  U+003C  |  3E  |     65     |  CC 81  |  CC 82  |  3C  |
•----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•


•----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•
|  String  | Char(s) Number  |                    Decomposition                     |   >  |      e     |     ̂    |     ́    |   <  |
•----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•
|   >ế<    |        5        |  U+003E  U+0065 (e)  U+0302 ( ̂)  U+0301 ( ́)  U+003C  |  3E  |     65     |  CC 82  |  CC 81  |  3C  |
•----------•-----------------•------------------------------------------------------•------•------------•---------•---------•------•


•----------•-----------------•--------------------------------------------•------•------------•---------•------•
|  String  | Char(s) Number  |               Decomposition                |   >  |      ê     |     ́    |   <  |
•----------•-----------------•--------------------------------------------•------•------------•---------•------•
|   >ế<   |        4        |  U+003E    U+00EA (ê)    U+0301    U+003C  |  3E  |     EA     |  CC 81  |  3C  |
•----------•-----------------•--------------------------------------------•------•------------•---------•------•

•----------•-----------------•--------------------------------------------•------•------------•---------•------•
|  String  | Char(s) Number  |               Decomposition                |   >  |      é     |     ̂    |   <  |
•----------•-----------------•--------------------------------------------•------•------------•---------•------•
|   >é̂<    |        4        |  U+003E    U+00E9 (é)    U+0302    U+003C  |  3E  |     E9     |  CC 82  |  3C  |
•----------•-----------------•--------------------------------------------•------•------------•---------•------•


•----------•-----------------•------------------------------•------•------------•------•
|  String  | Char(s) Number  |        Decomposition         |   >  |      ế     |   <  |
•----------•-----------------•------------------------------•------•------------•------•
|   >ế<    |        3        |  U+003E    U+1EBF    U+003C  |  3E  |  E1 BA BF  |  3C  |
•----------•-----------------•------------------------------•------•------------•------•
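To see how these six spellings relate under normalization, here is a small sketch, assuming Python 3 and its standard unicodedata module (the names variants, groups and hex_points are invented for the example). It groups the strings by their NFC value: the two marks U+0301 and U+0302 have the same combining class, so their order is significant and the spellings fall into three groups rather than two.

```python
import unicodedata
from collections import defaultdict

def hex_points(text):
    """Return the code points of text in U+XXXX notation."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

# The six strings of the tables above, without the > and < delimiters.
variants = [
    "e",
    "e\u0301\u0302",
    "e\u0302\u0301",
    "\u00EA\u0301",
    "\u00E9\u0302",
    "\u1EBF",
]

groups = defaultdict(list)
for s in variants:
    groups[unicodedata.normalize("NFC", s)].append(s)

for nfc, members in groups.items():
    print(f"NFC {hex_points(nfc)}:")
    for s in members:
        print(f"    {hex_points(s)}  ({len(s)} code points, {len(s.encode('utf-8'))} UTF-8 bytes)")
```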

Note that I placed each string, composed of a base letter and possible diacritic signs, between the delimiters > and < for an exact search !

Paste the text, above, in a new N++ tab
Open the Mark dialog
SEARCH : Successively try the six regex syntaxes, below :
- (A) (?-s)>.<
- (B) (?-s)>..<
- (C) (?-s)>...<
- (D) >[[=e=]]<
- (E) >(?=e)\X<
- (F) >(?=[[=e=]])\X<
Tick the Purge for each search and Wrap around options
Un-tick all other options
Select the Regular expression search mode
Click on the Mark All button

Notes :

The regex A finds the strings containing one char between the delimiters > and <, so 3 chars in total. It matches, of course, the string >e< and the string >ế< ( U+1EBF ), containing the Vietnamese letter ế
The regex B finds the strings containing two chars between the delimiters > and <, so 4 chars in total. It matches the strings >ế< ( U+00EA U+0301 ) and >é̂< ( U+00E9 U+0302 ), which contain an accented char with an additional diacritic character
The regex C finds the strings containing three chars between the delimiters > and <, so 5 chars in total. It matches the strings >é̂< ( U+0065 U+0301 U+0302 ) and >ế< ( U+0065 U+0302 U+0301 ), which contain the base letter e and two diacritic characters, in a different order
The regex D finds all the individual characters equivalent to the base letter e, between the delimiters > and <, so 3 chars in total. Like the regex A, it matches a unique character, related to the e letter, and the delimiters
In the regex E, we use a specific syntax \X which matches any base character, followed by one or several combining characters ( diacritical marks or else ). But as we just want to focus on the letter e we place, before \X, a look-ahead (?=e) which forces the regex engine to match this base letter e and possible combining characters, following it. So, it matches the first 3 cases only !
In the regex F, we use again the \X syntax which finds any char followed by possible combining characters. But, this time, we change the look-ahead to (?=[[=e=]]) which forces the regex engine to match any char equivalent to the letter e. Refer to the end of this post for further explanation. As you can see, this regex does find ALL the above cases:-))

This regex leads to the following generic regex : (?=[[=C=]])\X

which matches any character C, whatever its case, followed by possible combining diacritical marks

For instance :

The regex (?=[[=3=]])\X does match the character 3̯̿, composed of the base digit 3 and two combining marks
The regex (?=[[=$=]])\X does match the character $̶̳̚ composed of the base symbol $ and three combining marks

Test these two regexes against this text :

3̯̿

$̶̳̚
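Outside Notepad++, Python's standard re module supports neither \X nor [[=C=]], so the generic regex above cannot be used as is. The sketch below is only a rough stand-in, assuming Python 3 and its unicodedata module: clusters groups a character with the marks of non-zero combining class that follow it, and has_base compares the first character, after decomposition and case folding, with the wanted base. Both function names and the particular combining marks in the test text are assumptions made for the illustration.

```python
import unicodedata

def clusters(text):
    """Split text into a base character plus the combining marks following it."""
    # Approximation of \X: only the canonical combining class is examined.
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def has_base(cluster, base):
    """True when the first character reduces to base, ignoring case and accents."""
    first = unicodedata.normalize("NFD", cluster[0])[0]
    return first.casefold() == base.casefold()

# Hypothetical test text: a digit and a symbol carrying marks, plus one precomposed letter.
text = "3\u032F\u033F $\u0336\u0333\u031A \u1EBF"
for cluster in clusters(text):
    for base in "3$e":
        if has_base(cluster, base):
            print(f"base {base!r}: {len(cluster)} code point(s)")
```

Marks whose combining class is zero are not attached to their base by this simple loop, which is one reason to treat it as an approximation only.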

Most of the combining characters can be found in the Combining Diacritical Marks Unicode block, in the range [\x{0300}-\x{036F}], below :

https://www.unicode.org/charts/PDF/U0300.pdf

So, @xaviermdq, as you can see, we never worried about the exact bytes used by the UTF-8 encoding !

Apparently, you wish to replace some decomposed consecutive characters by a precomposed equivalent character, if any !? This goal could be achieved with regexes !
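As an alternative to regexes, the sequences that a canonical composition would rewrite can be located with a short script. This is a sketch under stated assumptions: Python 3 with its standard unicodedata module, a hypothetical function name composable_sequences, and runs normalized one at a time (a character followed by its combining marks), which is adequate for Latin, Greek or Cyrillic text but ignores compositions that span two base characters, such as Hangul jamo.

```python
import unicodedata

def composable_sequences(text):
    """Return (offset, original, composed) for each run that NFC would rewrite."""
    results = []
    start = 0
    for i in range(1, len(text) + 1):
        # A run ends at the end of the text or just before the next non-combining character.
        if i == len(text) or unicodedata.combining(text[i]) == 0:
            run = text[start:i]
            composed = unicodedata.normalize("NFC", run)
            if composed != run:
                results.append((start, run, composed))
            start = i
    return results

for offset, original, composed in composable_sequences("cafe\u0301 50 \u2126"):
    before = " ".join(f"U+{ord(c):04X}" for c in original)
    after = " ".join(f"U+{ord(c):04X}" for c in composed)
    print(f"offset {offset}: {before} -> {after}")
```

Note that single characters with a canonical singleton mapping, like the OHM SIGN, are reported as well, since NFC replaces them too.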

Just tell me some more details about your needs, and also, your usual working Unicode script(s) : Latin, Cyrillic, Hebrew, Arabic, CJK, ... !

Best Regards,

guy038

Reply to Unicode Normalization on Thu, 27 May 2021 14:37:37 GMT

PeterJones — Thu, 27 May 2021 14:37:37 GMT

@xaviermdq ,

I don’t know what unicode normalization is used.

Unfortunately, the primary developer and most of the other volunteer contributors don’t regularly read this forum, so I don’t know if they’ll ever see this question. I don’t know if any of the regulars in this Forum have studied the guts of the Notepad++/Scintilla UTF-8 handling enough to know how to answer that question.

I’d suggest waiting for another reply here, in case someone has studied it more than I’d previously gathered. But if you don’t get a reply after a reasonable wait, you might consider going to the github issues location and asking this question there – because there is hopefully a developer who knows enough about the guts to answer over there.