Unicode Normalize: A simple plugin

Coises

In case it’s useful to anyone else, I’ve uploaded to GitHub a simple plugin, Unicode Normalize, that lets you convert selected text (or the entire file, if nothing is selected) to one of the four standard Unicode Normalization Forms.

guy038

Hello, @coises and All,

I tested your new Unicode Normalize plugin a bit. The https://unicode.org/reports/tr15/ article says :

Text exclusively containing ASCII characters [\x{0000}-\x{007F}] is left unaffected by any Normalization Form.

Text exclusively containing Latin-1 characters [\x{0000}-\x{00FF}] is left unaffected by NFC. Thus, all Latin-1 text is already normalized to NFC.

Although some characters of a particular script may not be in NFC form, I verified that every character belonging to one of the Windows encodings ( from Win-1250 to Win-1258 ) is already normalized to NFC.
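The statements above can be checked with a short script. This is only a sketch using Python's standard `unicodedata` module; the helper names `unchanged_by` and `code_page_exceptions` are made up for illustration, and the results depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def unchanged_by(form: str, first: int, last: int) -> bool:
    """True if every code point in [first, last] is its own normalization."""
    return all(
        unicodedata.normalize(form, chr(cp)) == chr(cp)
        for cp in range(first, last + 1)
    )

def code_page_exceptions(code_page: str) -> list[str]:
    """Characters of a single-byte code page that NFC would change."""
    changed = []
    for byte in range(256):
        try:
            ch = bytes([byte]).decode(code_page)
        except UnicodeDecodeError:
            continue  # this byte is unassigned in the code page
        if unicodedata.normalize("NFC", ch) != ch:
            changed.append(ch)
    return changed

if __name__ == "__main__":
    # ASCII under all four forms, then Latin-1 under NFC only.
    print(all(unchanged_by(form, 0x00, 0x7F) for form in FORMS))
    print(unchanged_by("NFC", 0x00, 0xFF))
    # Windows code pages 1250 to 1258: list any character NFC would alter.
    for number in range(1250, 1259):
        print(f"cp{number}", code_page_exceptions(f"cp{number}"))
```

Note that the Latin-1 guarantee holds for NFC only: NFD decomposes the accented letters, and NFKC rewrites characters such as the no-break space.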

So, I suppose that your plugin will be of particular interest for people using scripts different from Latin, Hebrew and Arabic !

The second answer to this https://stackoverflow.com/questions/7041013/unicode-normalization-in-windows question says :

There is generally a cultural preference for NFC in the Windows world and on the Web, and for NFD in the Apple world. But it’s not rigorously enforced and you should expect to cope with any mixture of combined and decomposed characters.

And also :

Kernel and filesystem don’t know anything about normalisation and will quite happily allow you to have three files with names ầ.txt, ầ.txt and ầ.txt in the same folder !

Just because :

	â	00E2	LATIN SMALL LETTER A WITH CIRCUMFLEX
	̀ 	0300	COMBINING GRAVE ACCENT

	ầ	1EA7	LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE

	a	0061	LATIN SMALL LETTER A
	̂	0302	COMBINING CIRCUMFLEX ACCENT
	̀	0300	COMBINING GRAVE ACCENT
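As a quick illustration of why those three names differ on disk yet look identical, here is a sketch in Python. The `spellings` dictionary and the `canonically_equivalent` helper are illustrative names only; the code points are taken from the listing above.

```python
import unicodedata

# The three spellings from the listing above, written as escapes so the
# code points are unambiguous.
spellings = {
    "00E2 0300": "\u00E2\u0300",
    "1EA7": "\u1EA7",
    "0061 0302 0300": "a\u0302\u0300",
}

def canonically_equivalent(left: str, right: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", left) == unicodedata.normalize("NFD", right)

if __name__ == "__main__":
    values = list(spellings.values())
    # Three distinct strings as far as a byte-wise comparison is concerned...
    print(len(set(values)))
    # ...but all canonically equivalent to each other.
    print(all(canonically_equivalent(values[0], v) for v in values))
    # NFC maps every spelling to the single code point U+1EA7.
    print({unicodedata.normalize("NFC", v) for v in values} == {"\u1EA7"})
```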

Below is a list of the Unicode characters whose NFC form differs from the character itself. Note that this list is not exhaustive, as I did not include the Han script !

•-----------------------•---------------•-------------------------------•
|    Script             |  Char  Hex    |  Char Hex  ( NFC )            |
•-----------------------•---------------•-------------------------------•
|    Latin              |   Å	212B    |    Å	00C5                    |
|    Latin              |   K	212A    |    K	004B                    |
•-----------------------•---------------•-------------------------------•
|    Greek              |   ά	1F71    |    ά	03AC                    |
|    Greek              |   Ά	1FBB    |    Ά	0386                    |
|    Greek              |   έ	1F73    |    έ	03AD                    |
|    Greek              |   Έ	1FC9    |    Έ	0388                    |
|    Greek              |   ή	1F75    |    ή	03AE                    |
|    Greek              |   Ή	1FCB    |    Ή	0389                    |
|    Greek              |   ι	1FBE    |    ι	03B9                    |
|    Greek              |   ί	1F77    |    ί	03AF                    |
|    Greek              |   Ί	1FDB    |    Ί	038A                    |
|    Greek              |   ΐ	1FD3    |    ΐ	0390                    |
|    Greek              |   ό	1F79    |    ό	03CC                    |
|    Greek              |   Ό	1FF9    |    Ό	038C                    |
|    Greek              |   ύ	1F7B    |    ύ	03CD                    |
|    Greek              |   Ύ	1FEB    |    Ύ	038E                    |
|    Greek              |   ΰ	1FE3    |    ΰ	03B0                    |
|    Greek              |   Ω	2126    |    Ω	03A9                    |
|    Greek              |   ώ	1F7D    |    ώ	03CE                    |
|    Greek              |   Ώ	1FFB    |    Ώ	038F                    |
•-----------------------•---------------•-------------------------------•
|    Hebrew             |   אַ	FB2E    |    אַ	05D0 05B7               |
|    Hebrew             |   אָ	FB2F    |    אָ	05D0 05B8               |
|    Hebrew             |   אּ	FB30    |    אּ	05D0 05BC               |
|    Hebrew             |   בּ	FB31    |    בּ	05D1 05BC               |
|    Hebrew             |   בֿ	FB4C    |    בֿ	05D1 05BF               |
|    Hebrew             |   גּ	FB32    |    גּ	05D2 05BC               |
|    Hebrew             |   דּ	FB33    |    דּ	05D3 05BC               |
|    Hebrew             |   הּ	FB34    |    הּ	05D4 05BC               |
|    Hebrew             |   וֹ	FB4B    |    וֹ	05D5 05B9               |
|    Hebrew             |   וּ	FB35    |    וּ	05D5 05BC               |
|    Hebrew             |   זּ	FB36    |    זּ	05D6 05BC               |
|    Hebrew             |   טּ	FB38    |    טּ	05D8 05BC               |
|    Hebrew             |   יִ	FB1D    |    יִ	05D9 05B4               |
|    Hebrew             |   יּ	FB39    |    יּ	05D9 05BC               |
|    Hebrew             |   ךּ	FB3A    |    ךּ	05DA 05BC               |
|    Hebrew             |   כּ	FB3B    |    כּ	05DB 05BC               |
|    Hebrew             |   כֿ	FB4D    |    כֿ	05DB 05BF               |
|    Hebrew             |   לּ	FB3C    |    לּ	05DC 05BC               |
|    Hebrew             |   מּ	FB3E    |    מּ	05DE 05BC               |
|    Hebrew             |   נּ	FB40    |    נּ	05E0 05BC               |
|    Hebrew             |   סּ	FB41    |    סּ	05E1 05BC               |
|    Hebrew             |   ףּ	FB43    |    ףּ	05E3 05BC               |
|    Hebrew             |   פּ	FB44    |    פּ	05E4 05BC               |
|    Hebrew             |   פֿ	FB4E    |    פֿ	05E4 05BF               |
|    Hebrew             |   צּ	FB46    |    צּ	05E6 05BC               |
|    Hebrew             |   קּ	FB47    |    קּ	05E7 05BC               |
|    Hebrew             |   רּ	FB48    |    רּ	05E8 05BC               |
|    Hebrew             |   שּ	FB49    |    שּ	05E9 05BC               |
|    Hebrew             |   שּׁ	FB2C    |    שּׁ	05E9 05BC 05C1          |
|    Hebrew             |   שּׂ	FB2D    |    שּׂ	05E9 05BC 05C2          |
|    Hebrew             |   שׁ	FB2A    |    שׁ	05E9 05C1               |
|    Hebrew             |   שׂ	FB2B    |    שׂ	05E9 05C2               |
|    Hebrew             |   תּ	FB4A    |    תּ	05EA 05BC               |
|    Hebrew             |   ײַ	FB1F    |    ײַ	05F2 05B7               |
•-----------------------•---------------•-------------------------------•
|    Devanagari         |   क़	0958    |    क़	0915 093C               |
|    Devanagari         |   ख़	0959    |    ख़	0916 093C               |
|    Devanagari         |   ग़	095A    |    ग़	0917 093C               |
|    Devanagari         |   ज़	095B    |    ज़	091C 093C               |
|    Devanagari         |   ड़	095C    |    ड़	0921 093C               |
|    Devanagari         |   ढ़	095D    |    ढ़	0922 093C               |
|    Devanagari         |   फ़	095E    |    फ़	092B 093C               |
|    Devanagari         |   य़	095F    |    य़	092F 093C               |
•-----------------------•---------------•-------------------------------•
|    Bengali            |   ড়	09DC    |    ড়	09A1 09BC               |
|    Bengali            |   ঢ়	09DD    |    ঢ়	09A2 09BC               |
|    Bengali            |   য়	09DF    |    য়	09AF 09BC               |
•-----------------------•---------------•-------------------------------•
|    Gurmukhi           |   ਖ਼	0A59    |    ਖ਼	0A16 0A3C               |
|    Gurmukhi           |   ਗ਼	0A5A    |    ਗ਼	0A17 0A3C               |
|    Gurmukhi           |   ਜ਼	0A5B    |    ਜ਼	0A1C 0A3C               |
|    Gurmukhi           |   ਫ਼	0A5E    |    ਫ਼	0A2B 0A3C               |
|    Gurmukhi           |   ਲ਼	0A33    |    ਲ਼	0A32 0A3C               |
|    Gurmukhi           |   ਸ਼	0A36    |    ਸ਼	0A38 0A3C               |
•-----------------------•---------------•-------------------------------•
|    Oriya              |   ଡ଼	0B5C    |    ଡ଼	0B21 0B3C               |
|    Oriya              |   ଢ଼	0B5D    |    ଢ଼	0B22 0B3C               |
•-----------------------•---------------•-------------------------------•
|    Tibetan            |   ཀྵ	0F69    |    ཀྵ	0F40 0FB5               |
|    Tibetan            |   གྷ	0F43    |    གྷ	0F42 0FB7               |
|    Tibetan            |   ཌྷ	0F4D    |    ཌྷ	0F4C 0FB7               |
|    Tibetan            |   དྷ	0F52    |    དྷ	0F51 0FB7               |
|    Tibetan            |   བྷ	0F57    |    བྷ	0F56 0FB7               |
|    Tibetan            |   ཛྷ	0F5C    |    ཛྷ	0F5B 0FB7               |
|    Tibetan            |   ཱི	0F73    |    ◌ཱི	0F71 0F72               |
|    Tibetan            |   ཱུ	0F75    |    ◌ཱུ	0F71 0F74               |
|    Tibetan            |   ཱྀ	0F81    |    ◌ཱྀ	0F71 0F80               |
|    Tibetan            |   ྐྵ	0FB9    |    ◌ྐྵ	0F90 0FB5               |
|    Tibetan            |   ྒྷ	0F93    |    ◌ྒྷ	0F92 0FB7               |
|    Tibetan            |   ྜྷ	0F9D    |    ◌ྜྷ	0F9C 0FB7               |
|    Tibetan            |   ྡྷ	0FA2    |    ◌ྡྷ	0FA1 0FB7               |
|    Tibetan            |   ྦྷ	0FA7    |    ◌ྦྷ	0FA6 0FB7               |
|    Tibetan            |   ྫྷ	0FAC    |    ◌ྫྷ	0FAB 0FB7               |
|    Tibetan            |   ྲྀ	0F76    |    ◌ྲྀ	0FB2 0F80               |
|    Tibetan            |   ླྀ	0F78    |    ◌ླྀ	0FB3 0F80               |
•-----------------------•---------------•-------------------------------•
|    Letter Modifier    |   ʹ	0374    |    ʹ	02B9                    |
•-----------------------•---------------•-------------------------------•
|    Mark-NonSpacing    |   ̀	0340    |    ̀	0300                    |
|    Mark-NonSpacing    |   ́	0341    |    ́	0301                    |
|    Mark-NonSpacing    |   ̈́	0344    | ̈   ́	0308 0301               |
•-----------------------•---------------•-------------------------------•
|    Separator Space    |    	2000    |     	2002                    |
|    Separator Space    |    	2001    |     	2003                    |
•-----------------------•---------------•-------------------------------•
|    Punctuation-Open   |   〈	2329    |    〈	3008                    |
•-----------------------•---------------•-------------------------------•
|    Punctuation-Close  |   〉	232A    |    〉	3009                    |
•-----------------------•---------------•-------------------------------•
|    Punctuation-Other  |   ;	037E    |    ;	003B                    |
|    Punctuation-Other  |   ·	0387    |    ·	00B7                    |
•-----------------------•---------------•-------------------------------•
|    Symbol-Math        |   ⫝̸	2ADC    |    ⫝̸	2ADD 0338               |
•-----------------------•---------------•-------------------------------•
|    Symbol-Modifier    |   ´	1FFD    |    ´	00B4                    |
|    Symbol-Modifier    |   ΅	1FEE    |    ΅	0385                    |
|    Symbol-Modifier    |   `	1FEF    |    `	0060                    |
•-----------------------•---------------•-------------------------------•
|    Symbol-Other       |   𝅗𝅥	1D15E   |    𝅗𝅥	1D157 1D165             |
|    Symbol-Other       |   𝅘𝅥	1D15F   |    𝅘𝅥	1D158 1D165             |
|    Symbol-Other       |   𝅘𝅥𝅮	1D160   |    𝅘𝅥𝅮	1D158 1D165 1D16E       |
|    Symbol-Other       |   𝅘𝅥𝅯	1D161   |    𝅘𝅥𝅯	1D158 1D165 1D16F       |
|    Symbol-Other       |   𝅘𝅥𝅰	1D162   |    𝅘𝅥𝅰	1D158 1D165 1D170       |
|    Symbol-Other       |   𝅘𝅥𝅱	1D163   |    𝅘𝅥𝅱	1D158 1D165 1D171       |
|    Symbol-Other       |   𝅘𝅥𝅲	1D164   |    𝅘𝅥𝅲	1D158 1D165 1D172       |
|    Symbol-Other       |   𝆹𝅥	1D1BB   |    𝆹𝅥	1D1B9 1D165             |
|    Symbol-Other       |   𝆹𝅥𝅮	1D1BD   |    𝆹𝅥𝅮	1D1B9 1D165 1D16E       |
|    Symbol-Other       |   𝆹𝅥𝅯	1D1BF   |    𝆹𝅥𝅯	1D1B9 1D165 1D16F       |
|    Symbol-Other       |   𝆺𝅥	1D1BC   |    𝆺𝅥	1D1BA 1D165             |
|    Symbol-Other       |   𝆺𝅥𝅮	1D1BE   |    𝆺𝅥𝅮	1D1BA 1D165 1D16E       |
|    Symbol-Other       |   𝆺𝅥𝅯	1D1C0   |    𝆺𝅥𝅯	1D1BA 1D165 1D16F       |
•-----------------------•---------------•-------------------------------•
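A table like this one can be regenerated with a few lines of Python. The following is a sketch, not the method actually used to build the list above; `nfc_changes` and `hex_points` are hypothetical helper names, and the output depends on the Unicode version shipped with the interpreter.

```python
import unicodedata

def hex_points(text: str) -> str:
    """Space-separated hexadecimal code points of a string."""
    return " ".join(f"{ord(ch):04X}" for ch in text)

def nfc_changes(first: int, last: int) -> dict[int, str]:
    """Map each code point in [first, last] that NFC alters to its NFC form."""
    changes = {}
    for cp in range(first, last + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points are not characters
        ch = chr(cp)
        nfc = unicodedata.normalize("NFC", ch)
        if nfc != ch:
            changes[cp] = nfc
    return changes

if __name__ == "__main__":
    # Letterlike Symbols block only; widen the range to scan more.
    for cp, nfc in nfc_changes(0x2100, 0x214F).items():
        print(f"{cp:04X} -> {hex_points(nfc)}")
```

Scanning the whole range `0x0000` to `0x10FFFF` would also report the CJK compatibility ideographs, which the table above deliberately leaves out.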

Now, @coises, may I ask you for a small improvement ?

When the selection contains fewer than, let’s say, about 100 characters before running your plugin, I think it would be nice to get the original text and, after a line break, the four normalization forms displayed on four successive lines ! For that, you would have to add a new entry, All normalisation forms.

For example :

Here is one example of text ; it contains Ⅳ normalisation forms.

Here is one example of text ; it contains Ⅳ normalisation forms.
Here is one example of text ; it contains Ⅳ normalisation forms.
Here is one example of text ; it contains IV normalisation forms.
Here is one example of text ; it contains IV normalisation forms.

Or :

The Latin characters Å and Ǽ as well as the Greek characters Ω and 𝟊 in each normalisation form :

The Latin characters Å and Ǽ as well as the Greek characters Ω and 𝟊 in each normalisation form :
The Latin characters Å and Ǽ as well as the Greek characters Ω and 𝟊 in each normalisation form :
The Latin characters Å and Ǽ as well as the Greek characters Ω and Ϝ in each normalisation form :
The Latin characters Å and Ǽ as well as the Greek characters Ω and Ϝ in each normalisation form :
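To make the request concrete, the four lines of such a display could be produced as in the sketch below. It is written in Python rather than in the plugin's own language, the name `all_forms` is made up, and the order NFC, NFD, NFKC, NFKD is simply the order assumed in the examples above.

```python
import unicodedata

# Order assumed for the four output lines.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def all_forms(text: str) -> list[str]:
    """Return the text normalized to each of the four forms, in FORMS order."""
    return [unicodedata.normalize(form, text) for form in FORMS]

if __name__ == "__main__":
    # U+2163 ROMAN NUMERAL FOUR only changes under the compatibility forms.
    sample = "it contains \u2163 normalisation forms."
    print(sample)
    print()
    for line in all_forms(sample):
        print(line)
```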

Finally, when a single character is selected, would it be possible, in addition to displaying the 4 normalizations, to display their code points as well, like below :

	ϔ	03D4
	ϔ	03D2 0308
	Ϋ	03AB
	Ϋ	03A5 0308

or :

	Ⅲ	2162
	Ⅲ	2162
	III	0049 0049 0049
	III	0049 0049 0049
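A display with code points could be sketched as follows, again in Python and with hypothetical names (`forms_with_code_points`). Each row holds the form name, the normalized text, and its code points in hexadecimal.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def forms_with_code_points(text: str) -> list[tuple[str, str, str]]:
    """Return (form, normalized text, hex code points) for each form."""
    rows = []
    for form in FORMS:
        normalized = unicodedata.normalize(form, text)
        points = " ".join(f"{ord(ch):04X}" for ch in normalized)
        rows.append((form, normalized, points))
    return rows

if __name__ == "__main__":
    # U+2162 ROMAN NUMERAL THREE
    for form, normalized, points in forms_with_code_points("\u2162"):
        print(f"{form}\t{normalized}\t{points}")
```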

Do you think these suggestions are sensible ?

Best Regards,

guy038

Coises

@guy038:

You’ve given me something to consider.

It had not occurred to me that converting to “normalization form composed” could decompose a fully pre-composed character. I see now that the specification does describe how that can happen. It just wasn’t intuitive that it could do that, and I hadn’t read closely enough. So there are pre-composed characters that exist, but cannot be synthesized from their components (at least not by these algorithms). Ugh. That’s going to mess with something else I’m trying to build.
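The behaviour described above is easy to reproduce. In this Python sketch (the helper name `survives_nfc` is illustrative), U+0958 DEVANAGARI LETTER QA is excluded from composition, so NFC leaves it decomposed, while U+00E9 round-trips as one would intuitively expect.

```python
import unicodedata

def survives_nfc(ch: str) -> bool:
    """True if NFC leaves the character as it is."""
    return unicodedata.normalize("NFC", ch) == ch

# U+0958 decomposes canonically to U+0915 U+093C...
qa = "\u0958"
qa_parts = unicodedata.normalize("NFD", qa)
# ...and NFC does not put the pieces back together.
qa_recomposed = unicodedata.normalize("NFC", qa_parts)

# U+00E9 (e with acute) decomposes and recomposes without loss.
e_acute = "\u00E9"
e_recomposed = unicodedata.normalize("NFC", unicodedata.normalize("NFD", e_acute))

if __name__ == "__main__":
    print(survives_nfc(qa), qa_recomposed == qa)            # False False
    print(survives_nfc(e_acute), e_recomposed == e_acute)   # True True
```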

I’m not sure how useful any of this is — I threw it together when I was investigating some oddities with Korean text display. I wanted a fast way to convert between decomposed and composed forms so I could try to figure out what was happening.
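For readers unfamiliar with the Korean case, here is a small Python sketch of the composed and decomposed forms of one Hangul syllable. The syllable chosen is only an example; it is not taken from the text that prompted the plugin.

```python
import unicodedata

# U+D55C HANGUL SYLLABLE HAN as a single precomposed code point.
syllable = "\uD55C"

# NFD splits it into its conjoining jamo: leading consonant, vowel, trailing consonant.
jamo = unicodedata.normalize("NFD", syllable)

# NFC puts the jamo back together into the precomposed syllable.
recomposed = unicodedata.normalize("NFC", jamo)

if __name__ == "__main__":
    print(" ".join(f"{ord(ch):04X}" for ch in jamo))  # 1112 1161 11AB
    print(recomposed == syllable)                     # True
```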

I will see if I can find a reasonable way to show all normalization forms. I entered your request as Issue #1, so hopefully I won’t forget about it.

Showing the Unicode code points, as well as the UTF-16 and UTF-8 code units, for a character or a selection of characters is another thing I would like to do. If I do it, I think I’m more likely to put that in a new, different plugin that would use a docking panel; I could perhaps include the encodings for the canonical forms in the same display.
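As a rough idea of what such a display would contain, the three views can be computed like this in Python. The function names are hypothetical and say nothing about how a docking-panel plugin would actually be written.

```python
def code_points(text: str) -> list[str]:
    """Unicode code points, e.g. 'U+1D15E'."""
    return [f"U+{ord(ch):04X}" for ch in text]

def utf16_units(text: str) -> list[str]:
    """UTF-16 code units as four hex digits each (surrogate pairs stay split)."""
    data = text.encode("utf-16-be")
    return [f"{int.from_bytes(data[i:i + 2], 'big'):04X}" for i in range(0, len(data), 2)]

def utf8_units(text: str) -> list[str]:
    """UTF-8 code units (bytes) as two hex digits each."""
    return [f"{byte:02X}" for byte in text.encode("utf-8")]

if __name__ == "__main__":
    # U+1D15E MUSICAL SYMBOL HALF NOTE lies outside the Basic Multilingual Plane.
    note = "\U0001D15E"
    print(code_points(note))  # ['U+1D15E']
    print(utf16_units(note))  # ['D834', 'DD5E']
    print(utf8_units(note))   # ['F0', '9D', '85', '9E']
```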

Coises

@guy038

I implemented your suggestions in version 1.1.

Hex values for the Unicode code points are shown if there are no more than eight code points in any of the normalized forms.
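The eight-code-point rule can be expressed as a small predicate. The Python sketch below only restates the sentence above; it is not the plugin's code, and both `show_hex` and the `limit` parameter are invented names.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def show_hex(text: str, limit: int = 8) -> bool:
    """True if no normalized form of the text exceeds `limit` code points."""
    return all(len(unicodedata.normalize(form, text)) <= limit for form in FORMS)

if __name__ == "__main__":
    print(show_hex("\u2162"))  # True: at most three code points in any form
    print(show_hex("\uFDFA"))  # False: the compatibility forms expand to 18 code points
```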