How to normalize fancy Unicode text back to regular text?

Alan Kilborn

@mkupper said in How to normalize fancy Unicode text back to regular text?:

please provide examples of what you mean by “fancy” and “normalize.”

Unless I misunderstand, this was already provided by way of example.
OP would like to convert text such as:

𝙃𝙚𝙡𝙡𝙤 𝙉𝙤𝙩𝙚𝙥𝙖𝙙 𝙥𝙡𝙪𝙨 𝙥𝙡𝙪𝙨 𝙘𝙤𝙢𝙢𝙪𝙣𝙞𝙩𝙮

to

Hello Notepad plus plus community

The “before” text is made of multi-byte Unicode characters (four bytes each in UTF-8); the “after” text is made of single-byte ones (plain ASCII).

Dean-Corso

Hey guys,

Short question: I was trying to submit my post, but I get some spam message from “aki something com” and can’t post it now. I also can’t post URL tags because I don’t have 1 reputation point. What now?

Dean-Corso

Hi guys,

Thanks for trying to help me so far. Alright, back to the problem I have with npp.

Notepad++ does not have a Unicode normalization function built in.

That’s a bit sad. I would like to have such a function to normalize that fancy symbol text style back to normal plain text. The problem is that these days people use that symbol text style really often, especially on YT and other sources. I don’t like that style because when you search for text phrases you will not find them. So I thought npp could change those symbols back to normal ASCII text, but unfortunately it can’t.

Also, about your 30-second search results: I think that’s not the right place to solve this problem. I already tried entering a specific Python command combo in the Python console, but the results I got out were just question marks using “NFKD”.

All in all, I’m looking for a quick solution to convert that fancy symbol text (whatever it’s called) back to normal ASCII text. I think modern text editors should support that feature, because those symbols are used as text too and people will use them more and more in the future, so the npp developers should think about it.

@mkupper

I already posted the two text lines in the fancy symbol style that I want to change back to normal plain text. Just have a look at the websites below. On the first one you can enter some normal text in the edit control on the left side, and on the right side you get the fancy text / symbols back. Just copy any of them, paste it into npp, and then try to change it into normal text.

“FancyTextGenerator”

Now on these websites below you can paste your fancy symbol text into the edit control and you will get the clean plain text back, which is what I want to do in npp itself without using the website (offline work).

“unicode text normalizer”
or
“Normalize Unicode Text Convert”

So as you can see, they work and can convert those symbol text styles to normal text pretty simply by paste and convert. That is what I’m missing in npp and would of course like to have: in the best case as a built-in function, or as a custom plugin to download, or as a last resort maybe as a Python script. If I were coding on the npp project, I would also add the entire symbol sets to handle them all (write and convert). I don’t think I’m the only one who would wish for such an option in npp. Otherwise we depend on other sources, like those websites that support that function, and that is a disadvantage for npp and npp users.

PS: Had to remove my URLs in text.

mkupper

@Alan-Kilborn I had seen the 𝘏𝘦𝘭𝘭𝘰 𝘕𝘰𝘵𝘦𝘱𝘢𝘥 𝘱𝘭𝘶𝘴 𝘱𝘭𝘶𝘴 𝘤𝘰𝘮𝘮𝘶𝘯𝘪𝘵𝘺 and 𝙃𝙚𝙡𝙡𝙤 𝙉𝙤𝙩𝙚𝙥𝙖𝙙 𝙥𝙡𝙪𝙨 𝙥𝙡𝙪𝙨 𝙘𝙤𝙢𝙢𝙪𝙣𝙞𝙩𝙮 but made the (incorrect) assumption that the OP was using the forum’s italics and bold.

Those will be a pain to convert as they are all extended Unicode plane characters. Someone else on the forums, I think @Coises, recently showed how to do a transform operator in Boost regexp. I believe it started out with a search of:

(?-i)(?:(𝘈|𝘼)|(𝘉|𝘽)|(𝘊|𝘾)|(𝘋|𝘿)|(𝘌|𝙀)|(𝘍|𝙁)|(𝘎|𝙂)|(𝘏|𝙃)|(𝘐|𝙄)|(𝘑|𝙅)|(𝘒|𝙆)|(𝘓|𝙇)|(𝘔|𝙈)|(𝘕|𝙉)|(𝘖|𝙊)|(𝘗|𝙋)|(𝘘|𝙌)|(𝘙|𝙍)|(𝘚|𝙎)|(𝘛|𝙏)|(𝘜|𝙐)|(𝘝|𝙑)|(𝘞|𝙒)|(𝘟|𝙓)|(𝘠|𝙔)|(𝘡|𝙕)|(𝘢|𝙖)|(𝘣|𝙗)|(𝘤|𝙘)|(𝘥|𝙙)|(𝘦|𝙚)|(𝘧|𝙛)|(𝘨|𝙜)|(𝘩|𝙝)|(𝘪|𝙞)|(𝘫|𝙟)|(𝘬|𝙠)|(𝘭|𝙡)|(𝘮|𝙢)|(𝘯|𝙣)|(𝘰|𝙤)|(𝘱|𝙥)|(𝘲|𝙦)|(𝘳|𝙧)|(𝘴|𝙨)|(𝘵|𝙩)|(𝘶|𝙪)|(𝘷|𝙫)|(𝘸|𝙬)|(𝘹|𝙭)|(𝘺|𝙮)|(𝘻|𝙯))|

but how do you then replace $1 to $52 with the ASCII A to Z and a to z?

Dean-Corso

@mkupper

Just keep in mind that the two symbol text lines are only examples; there are many more. Just have a look at that fancy website, enter some text, and you get many symbol styles out. So it makes no sense to translate / check only one of them. My goal is to change any and all of those symbol text styles back to plain text, so I cannot focus on just one symbol character set.

Coises

@Dean-Corso said in How to normalize fancy Unicode text back to regular text?:

Now on these websites below you can paste your fancy symbol text into the edit control and you will get the clean plain text back, which is what I want to do in npp itself without using the website (offline work).

“unicode text normalizer”
or
“Normalize Unicode Text Convert”

For what it’s worth, I went to Normalize Unicode Text, then disconnected my ethernet cable and typed in some characters on the left — the corresponding alphabetic characters appeared on the right. So it is doing the translation offline. (I thought so based on how quickly it was responding. I tried reading the web page source, but gave up after my eyes started to cross; so I did the pull-the-plug test to find out if my guess was correct.) With patience and determination, no doubt someone could figure out how to extract the normalization routine from the web page and the additional assets it downloads. I don’t have enough of either at the moment.

This really does sound like a job for a script or similar tool, rather than regular expressions. There are too many possible replacements — there’s a character limit to the size of regular expressions (I forget what it is), and you’d probably overrun that before you could capture everything.

mkupper

@Dean-Corso Understood but at least for the two examples you provided the search/replace is:
** Search: (?-i)(?:(𝘈|𝘼)|(𝘉|𝘽)|(𝘊|𝘾)|(𝘋|𝘿)|(𝘌|𝙀)|(𝘍|𝙁)|(𝘎|𝙂)|(𝘏|𝙃)|(𝘐|𝙄)|(𝘑|𝙅)|(𝘒|𝙆)|(𝘓|𝙇)|(𝘔|𝙈)|(𝘕|𝙉)|(𝘖|𝙊)|(𝘗|𝙋)|(𝘘|𝙌)|(𝘙|𝙍)|(𝘚|𝙎)|(𝘛|𝙏)|(𝘜|𝙐)|(𝘝|𝙑)|(𝘞|𝙒)|(𝘟|𝙓)|(𝘠|𝙔)|(𝘡|𝙕)|(𝘢|𝙖)|(𝘣|𝙗)|(𝘤|𝙘)|(𝘥|𝙙)|(𝘦|𝙚)|(𝘧|𝙛)|(𝘨|𝙜)|(𝘩|𝙝)|(𝘪|𝙞)|(𝘫|𝙟)|(𝘬|𝙠)|(𝘭|𝙡)|(𝘮|𝙢)|(𝘯|𝙣)|(𝘰|𝙤)|(𝘱|𝙥)|(𝘲|𝙦)|(𝘳|𝙧)|(𝘴|𝙨)|(𝘵|𝙩)|(𝘶|𝙪)|(𝘷|𝙫)|(𝘸|𝙬)|(𝘹|𝙭)|(𝘺|𝙮)|(𝘻|𝙯))
Replace: (?1(A))(?2(B))(?3(C))(?4(D))(?5(E))(?6(F))(?7(G))(?8(H))(?9(I))(?10(J))(?11(K))(?12(L))(?13(M))(?14(N))(?15(O))(?16(P))(?17(Q))(?18(R))(?19(S))(?20(T))(?21(U))(?22(V))(?23(W))(?24(X))(?25(Y))(?26(Z))(?27(a))(?28(b))(?29(c))(?30(d))(?31(e))(?32(f))(?33(g))(?34(h))(?35(i))(?36(j))(?37(k))(?38(l))(?39(m))(?40(n))(?41(o))(?42(p))(?43(q))(?44(r))(?45(s))(?46(t))(?47(u))(?48(v))(?49(w))(?50(x))(?51(y))(?52(z))

It looks messy because 𝘏𝘦𝘭𝘭𝘰 𝘕𝘰𝘵𝘦𝘱𝘢𝘥 𝘱𝘭𝘶𝘴 𝘱𝘭𝘶𝘴 𝘤𝘰𝘮𝘮𝘶𝘯𝘪𝘵𝘺 and 𝙃𝙚𝙡𝙡𝙤 𝙉𝙤𝙩𝙚𝙥𝙖𝙙 𝙥𝙡𝙪𝙨 𝙥𝙡𝙪𝙨 𝙘𝙤𝙢𝙢𝙪𝙣𝙞𝙩𝙮 use characters from the Unicode Mathematical Alphanumeric Symbols block, which ranges from U+1D400 to U+1D7FF. Because these are Unicode characters higher than U+FFFF, the Notepad++ editor has glitches with them. The search/replace I provided works around those glitches by creating 52 separate buckets: 26 for upper case A-to-Z and 26 more for lower case a-to-z. Each of the 52 buckets has a list of “fancy” Unicode characters that can get matched and filed into that bucket, and in the replacement part each of the 52 buckets replaces the bucket contents with a plain ASCII A-to-Z or a-to-z.

I’m not sure if Python can deal with characters above U+FFFF. If it can, then that is likely easier to work with and to maintain as most of that block looks like A-to-Z ranges that could each get translated to A-to-Z ASCII.
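For what it’s worth, Python 3 strings are indexed by code point, so characters above U+FFFF should not be a problem there. A minimal sketch of the translation-table idea, assuming Python 3.7 or later and its standard unicodedata module; the names build_table and FANCY_TO_ASCII are made up for this illustration and are not part of any plugin:

```python
import unicodedata

def build_table(first=0x1D400, last=0x1D7FF):
    """Map each code point in [first, last] whose NFKC form is a single
    ASCII letter or digit to that character (hypothetical helper)."""
    table = {}
    for cp in range(first, last + 1):
        plain = unicodedata.normalize('NFKC', chr(cp))
        # Greek symbols and unassigned code points do not become ASCII; skip them.
        if len(plain) == 1 and plain.isascii() and plain.isalnum():
            table[cp] = plain
    return table

FANCY_TO_ASCII = build_table()

# str.translate looks up each character's code point in the table.
print('𝙃𝙚𝙡𝙡𝙤 𝘕𝘰𝘵𝘦𝘱𝘢𝘥'.translate(FANCY_TO_ASCII))
```

Building the table from the normalization data avoids hard-coding the A-to-Z ranges, which matters because a few positions in the block are reserved.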

Ideally, Notepad++'s convert to ANSI code gets updated to be aware of those character ranges.

Coises

@mkupper said in How to normalize fancy Unicode text back to regular text?:

Ideally, Notepad++'s convert to ANSI code gets updated to be aware of those character ranges.

What is described in this topic is not, at all, what Convert to ANSI does.

For example, at Normalize Unicode Text, if you paste this in the left side:

ßášï©

you will see:

BasiC

on the right side.

If your default code page is CP1252 (common for Western Europe and the US) and you open a new tab in Notepad++, set it to UTF-8, paste in ßášï© and then convert to ANSI… nothing will change.

(However, something odd does happen on my system if I save it ANSI — when I open it again, Notepad++ detects it as Greek/ISO-8859-7; I have to select Western European / Windows-1252 to see it properly. I don’t know why Notepad++ thought it was more likely to be Greek than my default code page; all those symbols are in Windows-1252.)

mkupper

@Coises said in How to normalize fancy Unicode text back to regular text?:

Nothing will seem to change for ßášï© because all five characters in your example are available in ANSI.

ß is ANSI DF and Unicode U+00DF
á is ANSI E1 and Unicode U+00E1
š is ANSI 9A and Unicode U+0161. Unicode U+009A is used for SINGLE CHARACTER INTRODUCER.
ï is ANSI EF and Unicode U+00EF
© is ANSI A9 and Unicode U+00A9

When I put UTF-8 encoded ßášï© into a file, convert to ANSI, and save it, then I see that the file contains \xDF \xE1 \x9A \xEF \xA9 which is what I expect.

You are correct in that when I load it into Notepad++ then it’s identified as ISO-8859-7. I don’t think the misidentification is a concern as the file is very small. The code only has five bytes to work with and all five are unusual for ANSI text.
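The byte values listed above can be cross-checked outside Notepad++. A small sketch, assuming Python 3.8 or later and using Python’s cp1252 codec as a stand-in for the Windows-1252 “ANSI” code page (it says nothing about how Notepad++ itself performs the conversion):

```python
# The five characters from the example, written as escapes so the
# source file's own encoding cannot change them.
text = '\u00df\u00e1\u0161\u00ef\u00a9'   # ßášï©

ansi_bytes = text.encode('cp1252')   # one byte per character
utf8_bytes = text.encode('utf-8')    # two bytes per character here

print(ansi_bytes.hex(' '))
print(utf8_bytes.hex(' '))
```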

Dean-Corso

On that website I can see it’s running a JavaScript with all functions to do that operation what also includes all of these Unicode letters in a array variable called “var unicodeLetters”. Maybe it’s possible to adapt this code to make a python script etc. In the script is also a link to a UTF-8 encoder/decoder you can enter text to get encoded hex bytes out. In case of entering “𝗗” I get 4 hex bytes “\xF0\x9D\x97\x97” and in case of “ßášï©” I get “\xC3\x9F\xC3\xA1\xC5\xA1\xC3\xAF\xC2\xA9” out.

https://mothereff.in/utf-8
Here is the script the website is using.
|onlinetools.com/CACHE/js/unicode-normalize-unicode-text.js

Maybe it could help to adapt it and create something for npp.
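The hex bytes from that encoder page can also be reproduced offline. A minimal sketch, assuming Python 3; the helper name utf8_escaped is made up for this illustration:

```python
bold_d = '\U0001D5D7'                         # 𝗗, one code point above U+FFFF
accented = '\u00df\u00e1\u0161\u00ef\u00a9'   # ßášï©

def utf8_escaped(s):
    """Return the UTF-8 bytes of s written as \\xNN escapes (hypothetical helper)."""
    return ''.join('\\x%02X' % b for b in s.encode('utf-8'))

print(utf8_escaped(bold_d))     # four bytes for the single fancy letter
print(utf8_escaped(accented))   # two bytes for each accented character
```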

PS: Thank you for that one reputation point @mkupper. Now I could also post links.

PeterJones

@Dean-Corso said in How to normalize fancy Unicode text back to regular text?:

I already tried entering a specific Python command combo in the Python console, but the results I got out were just question marks using “NFKD”.

If you used the default PythonScript 2.0.0.0 from the Plugins Admin, then your unicode strings in your test would need to be marked as u'𝘏𝘦𝘭𝘭𝘰 𝘕𝘰𝘵𝘦𝘱𝘢𝘥 𝘱𝘭𝘶𝘴 𝘱𝘭𝘶𝘴 𝘤𝘰𝘮𝘮𝘶𝘯𝘪𝘵𝘺' in order for Python 2.7 to know it was unicode.

But when I tried even with that in the default PythonScript plugin, it wouldn’t normalize that string (though it could normalize some others).

When I switched to a copy of Notepad++ with PythonScript 3.0.17 (using Python 3.12, which doesn’t need fancy string syntax for unicode strings) and ran an example script from here, I got the expected output:

Python 3.12.1 (tags/v3.12.1:2305ca5, Dec  7 2023, 22:03:25) [MSC v.1937 64 bit (AMD64)]
Initialisation took 234ms
Ready.
>>> import unicodedata
>>> strings = [   '𝖙𝖍𝖚𝖌 𝖑𝖎𝖋𝖊',   '𝓽𝓱𝓾𝓰 𝓵𝓲𝓯𝓮',   '𝓉𝒽𝓊𝑔 𝓁𝒾𝒻𝑒',   '𝕥𝕙𝕦𝕘 𝕝𝕚𝕗𝕖',   'ｔｈｕｇ ｌｉｆｅ', '𝘏𝘦𝘭𝘭𝘰 𝘕𝘰𝘵𝘦𝘱𝘢𝘥 𝘱𝘭𝘶𝘴 𝘱𝘭𝘶𝘴 𝘤𝘰𝘮𝘮𝘶𝘯𝘪𝘵𝘺', '𝙃𝙚𝙡𝙡𝙤 𝙉𝙤𝙩𝙚𝙥𝙖𝙙 𝙥𝙡𝙪𝙨 𝙥𝙡𝙪𝙨 𝙘𝙤𝙢𝙢𝙪𝙣𝙞𝙩𝙮']
>>> for x in strings:
...   print(unicodedata.normalize( 'NFKC', x), x)
... 
thug life 𝖙𝖍𝖚𝖌 𝖑𝖎𝖋𝖊
thug life 𝓽𝓱𝓾𝓰 𝓵𝓲𝓯𝓮
thug life 𝓉𝒽𝓊𝑔 𝓁𝒾𝒻𝑒
thug life 𝕥𝕙𝕦𝕘 𝕝𝕚𝕗𝕖
thug life ｔｈｕｇ ｌｉｆｅ
Hello Notepad plus plus community 𝘏𝘦𝘭𝘭𝘰 𝘕𝘰𝘵𝘦𝘱𝘢𝘥 𝘱𝘭𝘶𝘴 𝘱𝘭𝘶𝘴 𝘤𝘰𝘮𝘮𝘶𝘯𝘪𝘵𝘺
Hello Notepad plus plus community 𝙃𝙚𝙡𝙡𝙤 𝙉𝙤𝙩𝙚𝙥𝙖𝙙 𝙥𝙡𝙪𝙨 𝙥𝙡𝙪𝙨 𝙘𝙤𝙢𝙢𝙪𝙣𝙞𝙩𝙮

So to be able to normalize your particular characters, you must use PythonScript 3.x, which you can download from here.

Coises

@mkupper said in How to normalize fancy Unicode text back to regular text?:

Nothing will seem to change for ßášï© because all five characters in your example are available in ANSI.

You’re correct, of course. I should have worded that more carefully. My point was that the characters themselves don’t change when you convert from Unicode to ANSI, only the internal representations; my example didn’t include characters not in the local code page, which just change to question marks.

The original poster desires to change the characters themselves, not just their internal representations; that is not the purpose of Convert to ANSI.

Coises

@Dean-Corso said in How to normalize fancy Unicode text back to regular text?:

Maybe it’s possible to adapt this code to make a python script etc.

There is a jN (JavaScript for Notepad++) plugin available through Plugins Admin. I was never quite able to grasp, from the documentation, how to use it, but it appears very powerful. If you know Javascript and have found the relevant scripts in the web page, you might be able to use it to do what you need.

mkupper

@Coises said in How to normalize fancy Unicode text back to regular text?:

My point was that the characters themselves don’t change when you convert from Unicode to ANSI, only the internal representations

Some characters do change. Copy/paste the following into a UTF-8 encoded tab or file. It should be the same as what you see here on the forums.

Copy/paste it into an ANSI encoded tab or file. You should see plain ASCII.

On the UTF-8 encoded tab do an Encoding / Convert to ANSI. The first two lines are plain ASCII but the rest are not ASCII and are characters that do not have a direct mapping into the same character/glyph in ANSI.

ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
ＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ
ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ

ĀāĂăĄąǍǎǞǟα
ƀℬ
ĆćĈĉĊċČčℂℭ
Ďďđδ
ĒēĔĕĖėĘęĚěεℇℯ
Φφℱ
ĜĝĞğĠġĢģǤǥǦǧɡΓℊ
ĤĥĦħℋℌℍℎ
ĨĩĪīĬĭĮįİıƗǏǐℐℑ
Ĵĵǰ
ĶķǨǩK
ĹĺĻļĽľŁłƚℒℓ
ℳ
ŃńŅņŇňⁿℕ
ŌōŎŏŐőƟƠơǑǒǪǫǬǭΩℴ
πℙ
ℚ
ŔŕŖŗŘřℛℜℝ
ŚśŜŝŞşΣσ
ŢţŤťŦŧƮƫΘτ
ŨũŪūŬŭŮůŰűŲųƯưǓǔǕǖǗǘǙǚǛǜ
√
Ŵŵ
Ŷŷ
ŹźŻżƶℤℨ

Alan Kilborn

@Coises said in How to normalize fancy Unicode text back to regular text?:

If your default code page is CP1252 (common for Western Europe and the US) and you open a new tab in Notepad++, set it to UTF-8, paste in ßášï© and then convert to ANSI… nothing will change.

Nothing visually will change, but the underlying bytes of the file will all change.
EDIT: Oops, this has already been said. :-)

if I save it ANSI — when I open it again, Notepad++ detects it as Greek/ISO-8859-7

Yes, the encoding autodetection of Notepad++ has long been a target of scorn because it can fail in such ways, but apparently autodetection is not always possible, or at least can be extremely difficult…

guy038

Hello, @dean-corso, @peterjones, @mkupper, @alan-kilborn, @coises and All,

In Unicode v15.1, the Mathematical Alphanumeric Symbols block contains 1024 characters, between U+1D400 and U+1D7FF, of which 28 are unassigned. Refer to :

https://www.unicode.org/charts/PDF/U1D400.pdf

Of course, we could generate from within N++, some regexes in order to transform these characters into standard ASCII characters !

If we restrict our goal to latin letters and digits of this block, we would use the ranges 1D400 - 1D6A3 and 1D7CE - 1D7FF

However, the N++ regex engine cannot directly handle characters over the BMP ( so chars with code over \x{FFFF} )

You need to use its surrogate pair characters to match a specific character

For example, the char :

𝑻

Cannot be found with the regex \x{1D47B}. Luckily, you can match it with the regex \x{D835}\x{DC7B}, using the Unicode Surrogate Area
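The surrogate pair for any character above the BMP follows from the standard UTF-16 arithmetic. A minimal sketch in Python 3; the function names surrogate_pair and npp_regex are made up for this illustration:

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is not outside the BMP')
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)      # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits of the offset
    return high, low

def npp_regex(code_point):
    """Format the pair in the \\x{....}\\x{....} notation used in this post."""
    return ''.join('\\x{%04X}' % unit for unit in surrogate_pair(code_point))

print(npp_regex(0x1D47B))
```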

So, we could build these 3 giant regex S/R, below, which, in theory, could replace any fancy character of this Unicode block by standard letters and digits

SEARCH :

(?x)
( [\x{D835}\x{DC00}\x{D835}\x{DC34}\x{D835}\x{DC68}\x{D835}\x{DC9C}\x{D835}\x{DCD0}\x{D835}\x{DD04}\x{D835}\x{DD38}\x{D835}\x{DD6C}\x{D835}\x{DDA0}\x{D835}\x{DDD4}\x{D835}\x{DE08}\x{D835}\x{DE3C}\x{D835}\x{DE70}] ) |  #  Letter A
( [\x{D835}\x{DC01}\x{D835}\x{DC35}\x{D835}\x{DC69}\x{D835}\x{DC9D}\x{D835}\x{DCD1}\x{D835}\x{DD05}\x{D835}\x{DD39}\x{D835}\x{DD6D}\x{D835}\x{DDA1}\x{D835}\x{DDD5}\x{D835}\x{DE09}\x{D835}\x{DE3D}\x{D835}\x{DE71}] ) |  #  Letter B
( [\x{D835}\x{DC02}\x{D835}\x{DC36}\x{D835}\x{DC6A}\x{D835}\x{DC9E}\x{D835}\x{DCD2}\x{D835}\x{DD06}\x{D835}\x{DD3A}\x{D835}\x{DD6E}\x{D835}\x{DDA2}\x{D835}\x{DDD6}\x{D835}\x{DE0A}\x{D835}\x{DE3E}\x{D835}\x{DE72}] ) |  #  Letter C
( [\x{D835}\x{DC03}\x{D835}\x{DC37}\x{D835}\x{DC6B}\x{D835}\x{DC9F}\x{D835}\x{DCD3}\x{D835}\x{DD07}\x{D835}\x{DD3B}\x{D835}\x{DD6F}\x{D835}\x{DDA3}\x{D835}\x{DDD7}\x{D835}\x{DE0B}\x{D835}\x{DE3F}\x{D835}\x{DE73}] ) |  #  Letter D
( [\x{D835}\x{DC04}\x{D835}\x{DC38}\x{D835}\x{DC6C}\x{D835}\x{DCA0}\x{D835}\x{DCD4}\x{D835}\x{DD08}\x{D835}\x{DD3C}\x{D835}\x{DD70}\x{D835}\x{DDA4}\x{D835}\x{DDD8}\x{D835}\x{DE0C}\x{D835}\x{DE40}\x{D835}\x{DE74}] ) |  #  Letter E
( [\x{D835}\x{DC05}\x{D835}\x{DC39}\x{D835}\x{DC6D}\x{D835}\x{DCA1}\x{D835}\x{DCD5}\x{D835}\x{DD09}\x{D835}\x{DD3D}\x{D835}\x{DD71}\x{D835}\x{DDA5}\x{D835}\x{DDD9}\x{D835}\x{DE0D}\x{D835}\x{DE41}\x{D835}\x{DE75}] ) |  #  Letter F
( [\x{D835}\x{DC06}\x{D835}\x{DC3A}\x{D835}\x{DC6E}\x{D835}\x{DCA2}\x{D835}\x{DCD6}\x{D835}\x{DD0A}\x{D835}\x{DD3E}\x{D835}\x{DD72}\x{D835}\x{DDA6}\x{D835}\x{DDDA}\x{D835}\x{DE0E}\x{D835}\x{DE42}\x{D835}\x{DE76}] ) |  #  Letter G
( [\x{D835}\x{DC07}\x{D835}\x{DC3B}\x{D835}\x{DC6F}\x{D835}\x{DCA3}\x{D835}\x{DCD7}\x{D835}\x{DD0B}\x{D835}\x{DD3F}\x{D835}\x{DD73}\x{D835}\x{DDA7}\x{D835}\x{DDDB}\x{D835}\x{DE0F}\x{D835}\x{DE43}\x{D835}\x{DE77}] ) |  #  Letter H
( [\x{D835}\x{DC08}\x{D835}\x{DC3C}\x{D835}\x{DC70}\x{D835}\x{DCA4}\x{D835}\x{DCD8}\x{D835}\x{DD0C}\x{D835}\x{DD40}\x{D835}\x{DD74}\x{D835}\x{DDA8}\x{D835}\x{DDDC}\x{D835}\x{DE10}\x{D835}\x{DE44}\x{D835}\x{DE78}] ) |  #  Letter I
( [\x{D835}\x{DC09}\x{D835}\x{DC3D}\x{D835}\x{DC71}\x{D835}\x{DCA5}\x{D835}\x{DCD9}\x{D835}\x{DD0D}\x{D835}\x{DD41}\x{D835}\x{DD75}\x{D835}\x{DDA9}\x{D835}\x{DDDD}\x{D835}\x{DE11}\x{D835}\x{DE45}\x{D835}\x{DE79}] ) |  #  Letter J
( [\x{D835}\x{DC0A}\x{D835}\x{DC3E}\x{D835}\x{DC72}\x{D835}\x{DCA6}\x{D835}\x{DCDA}\x{D835}\x{DD0E}\x{D835}\x{DD42}\x{D835}\x{DD76}\x{D835}\x{DDAA}\x{D835}\x{DDDE}\x{D835}\x{DE12}\x{D835}\x{DE46}\x{D835}\x{DE7A}] ) |  #  Letter K
( [\x{D835}\x{DC0B}\x{D835}\x{DC3F}\x{D835}\x{DC73}\x{D835}\x{DCA7}\x{D835}\x{DCDB}\x{D835}\x{DD0F}\x{D835}\x{DD43}\x{D835}\x{DD77}\x{D835}\x{DDAB}\x{D835}\x{DDDF}\x{D835}\x{DE13}\x{D835}\x{DE47}\x{D835}\x{DE7B}] ) |  #  Letter L
( [\x{D835}\x{DC0C}\x{D835}\x{DC40}\x{D835}\x{DC74}\x{D835}\x{DCA8}\x{D835}\x{DCDC}\x{D835}\x{DD10}\x{D835}\x{DD44}\x{D835}\x{DD78}\x{D835}\x{DDAC}\x{D835}\x{DDE0}\x{D835}\x{DE14}\x{D835}\x{DE48}\x{D835}\x{DE7C}] ) |  #  Letter M
( [\x{D835}\x{DC0D}\x{D835}\x{DC41}\x{D835}\x{DC75}\x{D835}\x{DCA9}\x{D835}\x{DCDD}\x{D835}\x{DD11}\x{D835}\x{DD45}\x{D835}\x{DD79}\x{D835}\x{DDAD}\x{D835}\x{DDE1}\x{D835}\x{DE15}\x{D835}\x{DE49}\x{D835}\x{DE7D}] ) |  #  Letter N
( [\x{D835}\x{DC0E}\x{D835}\x{DC42}\x{D835}\x{DC76}\x{D835}\x{DCAA}\x{D835}\x{DCDE}\x{D835}\x{DD12}\x{D835}\x{DD46}\x{D835}\x{DD7A}\x{D835}\x{DDAE}\x{D835}\x{DDE2}\x{D835}\x{DE16}\x{D835}\x{DE4A}\x{D835}\x{DE7E}] ) |  #  Letter O
( [\x{D835}\x{DC0F}\x{D835}\x{DC43}\x{D835}\x{DC77}\x{D835}\x{DCAB}\x{D835}\x{DCDF}\x{D835}\x{DD13}\x{D835}\x{DD47}\x{D835}\x{DD7B}\x{D835}\x{DDAF}\x{D835}\x{DDE3}\x{D835}\x{DE17}\x{D835}\x{DE4B}\x{D835}\x{DE7F}] ) |  #  Letter P
( [\x{D835}\x{DC10}\x{D835}\x{DC44}\x{D835}\x{DC78}\x{D835}\x{DCAC}\x{D835}\x{DCE0}\x{D835}\x{DD14}\x{D835}\x{DD48}\x{D835}\x{DD7C}\x{D835}\x{DDB0}\x{D835}\x{DDE4}\x{D835}\x{DE18}\x{D835}\x{DE4C}\x{D835}\x{DE80}] ) |  #  Letter Q
( [\x{D835}\x{DC11}\x{D835}\x{DC45}\x{D835}\x{DC79}\x{D835}\x{DCAD}\x{D835}\x{DCE1}\x{D835}\x{DD15}\x{D835}\x{DD49}\x{D835}\x{DD7D}\x{D835}\x{DDB1}\x{D835}\x{DDE5}\x{D835}\x{DE19}\x{D835}\x{DE4D}\x{D835}\x{DE81}] ) |  #  Letter R
( [\x{D835}\x{DC12}\x{D835}\x{DC46}\x{D835}\x{DC7A}\x{D835}\x{DCAE}\x{D835}\x{DCE2}\x{D835}\x{DD16}\x{D835}\x{DD4A}\x{D835}\x{DD7E}\x{D835}\x{DDB2}\x{D835}\x{DDE6}\x{D835}\x{DE1A}\x{D835}\x{DE4E}\x{D835}\x{DE82}] ) |  #  Letter S
( [\x{D835}\x{DC13}\x{D835}\x{DC47}\x{D835}\x{DC7B}\x{D835}\x{DCAF}\x{D835}\x{DCE3}\x{D835}\x{DD17}\x{D835}\x{DD4B}\x{D835}\x{DD7F}\x{D835}\x{DDB3}\x{D835}\x{DDE7}\x{D835}\x{DE1B}\x{D835}\x{DE4F}\x{D835}\x{DE83}] ) |  #  Letter T
( [\x{D835}\x{DC14}\x{D835}\x{DC48}\x{D835}\x{DC7C}\x{D835}\x{DCB0}\x{D835}\x{DCE4}\x{D835}\x{DD18}\x{D835}\x{DD4C}\x{D835}\x{DD80}\x{D835}\x{DDB4}\x{D835}\x{DDE8}\x{D835}\x{DE1C}\x{D835}\x{DE50}\x{D835}\x{DE84}] ) |  #  Letter U
( [\x{D835}\x{DC15}\x{D835}\x{DC49}\x{D835}\x{DC7D}\x{D835}\x{DCB1}\x{D835}\x{DCE5}\x{D835}\x{DD19}\x{D835}\x{DD4D}\x{D835}\x{DD81}\x{D835}\x{DDB5}\x{D835}\x{DDE9}\x{D835}\x{DE1D}\x{D835}\x{DE51}\x{D835}\x{DE85}] ) |  #  Letter V
( [\x{D835}\x{DC16}\x{D835}\x{DC4A}\x{D835}\x{DC7E}\x{D835}\x{DCB2}\x{D835}\x{DCE6}\x{D835}\x{DD1A}\x{D835}\x{DD4E}\x{D835}\x{DD82}\x{D835}\x{DDB6}\x{D835}\x{DDEA}\x{D835}\x{DE1E}\x{D835}\x{DE52}\x{D835}\x{DE86}] ) |  #  Letter W
( [\x{D835}\x{DC17}\x{D835}\x{DC4B}\x{D835}\x{DC7F}\x{D835}\x{DCB3}\x{D835}\x{DCE7}\x{D835}\x{DD1B}\x{D835}\x{DD4F}\x{D835}\x{DD83}\x{D835}\x{DDB7}\x{D835}\x{DDEB}\x{D835}\x{DE1F}\x{D835}\x{DE53}\x{D835}\x{DE87}] ) |  #  Letter X
( [\x{D835}\x{DC18}\x{D835}\x{DC4C}\x{D835}\x{DC80}\x{D835}\x{DCB4}\x{D835}\x{DCE8}\x{D835}\x{DD1C}\x{D835}\x{DD50}\x{D835}\x{DD84}\x{D835}\x{DDB8}\x{D835}\x{DDEC}\x{D835}\x{DE20}\x{D835}\x{DE54}\x{D835}\x{DE88}] ) |  #  Letter Y
( [\x{D835}\x{DC19}\x{D835}\x{DC4D}\x{D835}\x{DC81}\x{D835}\x{DCB5}\x{D835}\x{DCE9}\x{D835}\x{DD1D}\x{D835}\x{DD51}\x{D835}\x{DD85}\x{D835}\x{DDB9}\x{D835}\x{DDED}\x{D835}\x{DE21}\x{D835}\x{DE55}\x{D835}\x{DE89}] )    #  Letter Z

REPLACE :

(?{01}A)(?{02}B)(?{03}C)(?{04}D)(?{05}E)(?{06}F)(?{07}G)(?{08}H)(?{09}I)(?{10}J)(?{11}K)(?{12}L)(?{13}M)(?{14}N)(?{15}O)(?{16}P)(?{17}Q)(?{18}R)(?{19}S)(?{20}T)(?{21}U)(?{22}V)(?{23}W)(?{24}X)(?{25}Y)(?{26}Z)

SEARCH :

(?x)
( [\x{D835}\x{DC1A}\x{D835}\x{DC4E}\x{D835}\x{DC82}\x{D835}\x{DCB6}\x{D835}\x{DCEA}\x{D835}\x{DD1E}\x{D835}\x{DD52}\x{D835}\x{DD86}\x{D835}\x{DDBA}\x{D835}\x{DDEE}\x{D835}\x{DE22}\x{D835}\x{DE56}\x{D835}\x{DE8A}] ) |  # Letter a
( [\x{D835}\x{DC1B}\x{D835}\x{DC4F}\x{D835}\x{DC83}\x{D835}\x{DCB7}\x{D835}\x{DCEB}\x{D835}\x{DD1F}\x{D835}\x{DD53}\x{D835}\x{DD87}\x{D835}\x{DDBB}\x{D835}\x{DDEF}\x{D835}\x{DE23}\x{D835}\x{DE57}\x{D835}\x{DE8B}] ) |  # Letter b
( [\x{D835}\x{DC1C}\x{D835}\x{DC50}\x{D835}\x{DC84}\x{D835}\x{DCB8}\x{D835}\x{DCEC}\x{D835}\x{DD20}\x{D835}\x{DD54}\x{D835}\x{DD88}\x{D835}\x{DDBC}\x{D835}\x{DDF0}\x{D835}\x{DE24}\x{D835}\x{DE58}\x{D835}\x{DE8C}] ) |  # Letter c
( [\x{D835}\x{DC1D}\x{D835}\x{DC51}\x{D835}\x{DC85}\x{D835}\x{DCB9}\x{D835}\x{DCED}\x{D835}\x{DD21}\x{D835}\x{DD55}\x{D835}\x{DD89}\x{D835}\x{DDBD}\x{D835}\x{DDF1}\x{D835}\x{DE25}\x{D835}\x{DE59}\x{D835}\x{DE8D}] ) |  # Letter d
( [\x{D835}\x{DC1E}\x{D835}\x{DC52}\x{D835}\x{DC86}\x{D835}\x{DCBA}\x{D835}\x{DCEE}\x{D835}\x{DD22}\x{D835}\x{DD56}\x{D835}\x{DD8A}\x{D835}\x{DDBE}\x{D835}\x{DDF2}\x{D835}\x{DE26}\x{D835}\x{DE5A}\x{D835}\x{DE8E}] ) |  # Letter e
( [\x{D835}\x{DC1F}\x{D835}\x{DC53}\x{D835}\x{DC87}\x{D835}\x{DCBB}\x{D835}\x{DCEF}\x{D835}\x{DD23}\x{D835}\x{DD57}\x{D835}\x{DD8B}\x{D835}\x{DDBF}\x{D835}\x{DDF3}\x{D835}\x{DE27}\x{D835}\x{DE5B}\x{D835}\x{DE8F}] ) |  # Letter f
( [\x{D835}\x{DC20}\x{D835}\x{DC54}\x{D835}\x{DC88}\x{D835}\x{DCBC}\x{D835}\x{DCF0}\x{D835}\x{DD24}\x{D835}\x{DD58}\x{D835}\x{DD8C}\x{D835}\x{DDC0}\x{D835}\x{DDF4}\x{D835}\x{DE28}\x{D835}\x{DE5C}\x{D835}\x{DE90}] ) |  # Letter g
( [\x{D835}\x{DC21}\x{D835}\x{DC55}\x{D835}\x{DC89}\x{D835}\x{DCBD}\x{D835}\x{DCF1}\x{D835}\x{DD25}\x{D835}\x{DD59}\x{D835}\x{DD8D}\x{D835}\x{DDC1}\x{D835}\x{DDF5}\x{D835}\x{DE29}\x{D835}\x{DE5D}\x{D835}\x{DE91}] ) |  # Letter h
( [\x{D835}\x{DC22}\x{D835}\x{DC56}\x{D835}\x{DC8A}\x{D835}\x{DCBE}\x{D835}\x{DCF2}\x{D835}\x{DD26}\x{D835}\x{DD5A}\x{D835}\x{DD8E}\x{D835}\x{DDC2}\x{D835}\x{DDF6}\x{D835}\x{DE2A}\x{D835}\x{DE5E}\x{D835}\x{DE92}] ) |  # Letter i
( [\x{D835}\x{DC23}\x{D835}\x{DC57}\x{D835}\x{DC8B}\x{D835}\x{DCBF}\x{D835}\x{DCF3}\x{D835}\x{DD27}\x{D835}\x{DD5B}\x{D835}\x{DD8F}\x{D835}\x{DDC3}\x{D835}\x{DDF7}\x{D835}\x{DE2B}\x{D835}\x{DE5F}\x{D835}\x{DE93}] ) |  # Letter j
( [\x{D835}\x{DC24}\x{D835}\x{DC58}\x{D835}\x{DC8C}\x{D835}\x{DCC0}\x{D835}\x{DCF4}\x{D835}\x{DD28}\x{D835}\x{DD5C}\x{D835}\x{DD90}\x{D835}\x{DDC4}\x{D835}\x{DDF8}\x{D835}\x{DE2C}\x{D835}\x{DE60}\x{D835}\x{DE94}] ) |  # Letter k
( [\x{D835}\x{DC25}\x{D835}\x{DC59}\x{D835}\x{DC8D}\x{D835}\x{DCC1}\x{D835}\x{DCF5}\x{D835}\x{DD29}\x{D835}\x{DD5D}\x{D835}\x{DD91}\x{D835}\x{DDC5}\x{D835}\x{DDF9}\x{D835}\x{DE2D}\x{D835}\x{DE61}\x{D835}\x{DE95}] ) |  # Letter l
( [\x{D835}\x{DC26}\x{D835}\x{DC5A}\x{D835}\x{DC8E}\x{D835}\x{DCC2}\x{D835}\x{DCF6}\x{D835}\x{DD2A}\x{D835}\x{DD5E}\x{D835}\x{DD92}\x{D835}\x{DDC6}\x{D835}\x{DDFA}\x{D835}\x{DE2E}\x{D835}\x{DE62}\x{D835}\x{DE96}] ) |  # Letter m
( [\x{D835}\x{DC27}\x{D835}\x{DC5B}\x{D835}\x{DC8F}\x{D835}\x{DCC3}\x{D835}\x{DCF7}\x{D835}\x{DD2B}\x{D835}\x{DD5F}\x{D835}\x{DD93}\x{D835}\x{DDC7}\x{D835}\x{DDFB}\x{D835}\x{DE2F}\x{D835}\x{DE63}\x{D835}\x{DE97}] ) |  # Letter n
( [\x{D835}\x{DC28}\x{D835}\x{DC5C}\x{D835}\x{DC90}\x{D835}\x{DCC4}\x{D835}\x{DCF8}\x{D835}\x{DD2C}\x{D835}\x{DD60}\x{D835}\x{DD94}\x{D835}\x{DDC8}\x{D835}\x{DDFC}\x{D835}\x{DE30}\x{D835}\x{DE64}\x{D835}\x{DE98}] ) |  # Letter o
( [\x{D835}\x{DC29}\x{D835}\x{DC5D}\x{D835}\x{DC91}\x{D835}\x{DCC5}\x{D835}\x{DCF9}\x{D835}\x{DD2D}\x{D835}\x{DD61}\x{D835}\x{DD95}\x{D835}\x{DDC9}\x{D835}\x{DDFD}\x{D835}\x{DE31}\x{D835}\x{DE65}\x{D835}\x{DE99}] ) |  # Letter p
( [\x{D835}\x{DC2A}\x{D835}\x{DC5E}\x{D835}\x{DC92}\x{D835}\x{DCC6}\x{D835}\x{DCFA}\x{D835}\x{DD2E}\x{D835}\x{DD62}\x{D835}\x{DD96}\x{D835}\x{DDCA}\x{D835}\x{DDFE}\x{D835}\x{DE32}\x{D835}\x{DE66}\x{D835}\x{DE9A}] ) |  # Letter q
( [\x{D835}\x{DC2B}\x{D835}\x{DC5F}\x{D835}\x{DC93}\x{D835}\x{DCC7}\x{D835}\x{DCFB}\x{D835}\x{DD2F}\x{D835}\x{DD63}\x{D835}\x{DD97}\x{D835}\x{DDCB}\x{D835}\x{DDFF}\x{D835}\x{DE33}\x{D835}\x{DE67}\x{D835}\x{DE9B}] ) |  # Letter r
( [\x{D835}\x{DC2C}\x{D835}\x{DC60}\x{D835}\x{DC94}\x{D835}\x{DCC8}\x{D835}\x{DCFC}\x{D835}\x{DD30}\x{D835}\x{DD64}\x{D835}\x{DD98}\x{D835}\x{DDCC}\x{D835}\x{DE00}\x{D835}\x{DE34}\x{D835}\x{DE68}\x{D835}\x{DE9C}] ) |  # Letter s
( [\x{D835}\x{DC2D}\x{D835}\x{DC61}\x{D835}\x{DC95}\x{D835}\x{DCC9}\x{D835}\x{DCFD}\x{D835}\x{DD31}\x{D835}\x{DD65}\x{D835}\x{DD99}\x{D835}\x{DDCD}\x{D835}\x{DE01}\x{D835}\x{DE35}\x{D835}\x{DE69}\x{D835}\x{DE9D}] ) |  # Letter t
( [\x{D835}\x{DC2E}\x{D835}\x{DC62}\x{D835}\x{DC96}\x{D835}\x{DCCA}\x{D835}\x{DCFE}\x{D835}\x{DD32}\x{D835}\x{DD66}\x{D835}\x{DD9A}\x{D835}\x{DDCE}\x{D835}\x{DE02}\x{D835}\x{DE36}\x{D835}\x{DE6A}\x{D835}\x{DE9E}] ) |  # Letter u
( [\x{D835}\x{DC2F}\x{D835}\x{DC63}\x{D835}\x{DC97}\x{D835}\x{DCCB}\x{D835}\x{DCFF}\x{D835}\x{DD33}\x{D835}\x{DD67}\x{D835}\x{DD9B}\x{D835}\x{DDCF}\x{D835}\x{DE03}\x{D835}\x{DE37}\x{D835}\x{DE6B}\x{D835}\x{DE9F}] ) |  # Letter v
( [\x{D835}\x{DC30}\x{D835}\x{DC64}\x{D835}\x{DC98}\x{D835}\x{DCCC}\x{D835}\x{DD00}\x{D835}\x{DD34}\x{D835}\x{DD68}\x{D835}\x{DD9C}\x{D835}\x{DDD0}\x{D835}\x{DE04}\x{D835}\x{DE38}\x{D835}\x{DE6C}\x{D835}\x{DEA0}] ) |  # Letter w
( [\x{D835}\x{DC31}\x{D835}\x{DC65}\x{D835}\x{DC99}\x{D835}\x{DCCD}\x{D835}\x{DD01}\x{D835}\x{DD35}\x{D835}\x{DD69}\x{D835}\x{DD9D}\x{D835}\x{DDD1}\x{D835}\x{DE05}\x{D835}\x{DE39}\x{D835}\x{DE6D}\x{D835}\x{DEA1}] ) |  # Letter x
( [\x{D835}\x{DC32}\x{D835}\x{DC66}\x{D835}\x{DC9A}\x{D835}\x{DCCE}\x{D835}\x{DD02}\x{D835}\x{DD36}\x{D835}\x{DD6A}\x{D835}\x{DD9E}\x{D835}\x{DDD2}\x{D835}\x{DE06}\x{D835}\x{DE3A}\x{D835}\x{DE6E}\x{D835}\x{DEA2}] ) |  # Letter y
( [\x{D835}\x{DC33}\x{D835}\x{DC67}\x{D835}\x{DC9B}\x{D835}\x{DCCF}\x{D835}\x{DD03}\x{D835}\x{DD37}\x{D835}\x{DD6B}\x{D835}\x{DD9F}\x{D835}\x{DDD3}\x{D835}\x{DE07}\x{D835}\x{DE3B}\x{D835}\x{DE6F}\x{D835}\x{DEA3}] )    # Letter z

REPLACE :

(?{01}a)(?{02}b)(?{03}c)(?{04}d)(?{05}e)(?{06}f)(?{07}g)(?{08}h)(?{09}i)(?{10}j)(?{11}k)(?{12}l)(?{13}m)(?{14}n)(?{15}o)(?{16}p)(?{17}q)(?{18}r)(?{19}s)(?{20}t)(?{21}u)(?{22}v)(?{23}w)(?{24}x)(?{25}y)(?{26}z)

SEARCH :

(?x)
([\x{D835}\x{DFCE}\x{D835}\x{DFD8}\x{D835}\x{DFE2}\x{D835}\x{DFEC}\x{D835}\x{DFF6}]) |  #  Digit 0
([\x{D835}\x{DFCF}\x{D835}\x{DFD9}\x{D835}\x{DFE3}\x{D835}\x{DFED}\x{D835}\x{DFF7}]) |  #  Digit 1
([\x{D835}\x{DFD0}\x{D835}\x{DFDA}\x{D835}\x{DFE4}\x{D835}\x{DFEE}\x{D835}\x{DFF8}]) |  #  Digit 2
([\x{D835}\x{DFD1}\x{D835}\x{DFDB}\x{D835}\x{DFE5}\x{D835}\x{DFEF}\x{D835}\x{DFF9}]) |  #  Digit 3
([\x{D835}\x{DFD2}\x{D835}\x{DFDC}\x{D835}\x{DFE6}\x{D835}\x{DFF0}\x{D835}\x{DFFA}]) |  #  Digit 4
([\x{D835}\x{DFD3}\x{D835}\x{DFDD}\x{D835}\x{DFE7}\x{D835}\x{DFF1}\x{D835}\x{DFFB}]) |  #  Digit 5
([\x{D835}\x{DFD4}\x{D835}\x{DFDE}\x{D835}\x{DFE8}\x{D835}\x{DFF2}\x{D835}\x{DFFC}]) |  #  Digit 6
([\x{D835}\x{DFD5}\x{D835}\x{DFDF}\x{D835}\x{DFE9}\x{D835}\x{DFF3}\x{D835}\x{DFFD}]) |  #  Digit 7
([\x{D835}\x{DFD6}\x{D835}\x{DFE0}\x{D835}\x{DFEA}\x{D835}\x{DFF4}\x{D835}\x{DFFE}]) |  #  Digit 8
([\x{D835}\x{DFD7}\x{D835}\x{DFE1}\x{D835}\x{DFEB}\x{D835}\x{DFF5}\x{D835}\x{DFFF}])    #  Digit 9

REPLACE :

(?{01}0)(?{02}1)(?{03}2)(?{04}3)(?{05}4)(?{06}5)(?{07}6)(?{08}7)(?{09}8)(?{10}9)

However, regarding the regexes which concern the letters, this would NOT be possible because the search exceeds the N++ limit of 2,046 chars !

We could also divide this work into several macros, which would consecutively change all these chars, but it would be tedious work anyway !
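To get a feel for the size problem, the capital-letter search can be generated and measured with a script. A rough sketch in Python 3, assuming the 13 alphabets sit 52 code points apart from U+1D400 onward, as in the regexes above, and taking the 2,046 figure from this post without verifying it; all names are made up for this illustration:

```python
FIRST_A = 0x1D400      # MATHEMATICAL BOLD CAPITAL A
STRIDE = 52            # 26 capitals + 26 small letters per alphabet
ALPHABETS = 13         # covers U+1D400 .. U+1D6A3
NPP_LIMIT = 2046       # figure quoted in the post, not verified here

def pair(cp):
    """Write one code point as its two surrogates in \\x{....} notation."""
    offset = cp - 0x10000
    return '\\x{%04X}\\x{%04X}' % (0xD800 + (offset >> 10),
                                   0xDC00 + (offset & 0x3FF))

def letter_group(index):
    """One capture group for the index-th capital letter (0 = A)."""
    members = ''.join(pair(FIRST_A + n * STRIDE + index)
                      for n in range(ALPHABETS))
    return '([' + members + '])'

# Compact form: no free-spacing whitespace and no comments.
search = '|'.join(letter_group(i) for i in range(26))
print(len(search), len(search) > NPP_LIMIT)
```

Even without the whitespace and comments of the free-spacing version, the generated search comes out well over the quoted limit.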

So personally, I advise you to simply use this on-line tool :

https://onlinetools.com/unicode/normalize-unicode-text

To get normal text and this other one :

https://onlinetools.com/unicode/generate-unicode-text

To do the reverse operation, if necessary

Also have a look at an old post of mine :

https://community.notepad-plus-plus.org/topic/17581/how-to-correctly-use-characters-from-the-mathematical-alphanumeric-symbols-unicode-block

Best Regards,

guy038

Alan Kilborn

@guy038 said in How to normalize fancy Unicode text back to regular text?:

I advise you to simply use this on-line tool

OP has already said that such usage is undesirable; well, I think that is what he is saying with:

Otherwise we depend on other sources, like those websites that support that function, and that is a disadvantage for npp and npp users

Isn’t the true answer what @PeterJones has shown is possible, with a script?
Such a script could act on selected text when the script is run, and replace that text with the normalized text…pretty simple concept.

Coises

@mkupper said in How to normalize fancy Unicode text back to regular text?:

Some characters do change. Copy/paste the following into a UTF-8 encoded tab or file. It should be the same as what you see here on the forums.

I stand corrected. I did not at all expect that to happen. It’s my understanding of “convert to ANSI” that is confused. I apologize.

Very strange:

Open Notepad++, convert empty tab to UTF-8, copy your text, paste into tab, I see all the characters.

Copy your text, open Notepad++, convert empty tab to UTF-8, paste into tab, I see only ASCII characters.

I have no idea what is going on here.

Mark Olson

@Alan-Kilborn said in How to normalize fancy Unicode text back to regular text?:

Such a script could act on selected text when the script is run, and replace that text with the normalized text…pretty simple concept.

Might as well just make the script now, save others time.

As noted in the docstring of the code, the most obvious difference between NFKD and NFKC seems to be treatment of characters with combining diacritics or umlauts or what have you. Which form is better seems really context-dependent to me; if you’re sorting text, you probably want ö to be an o and then an umlaut (so that ö sorts after o and before p as expected), but if you’re doing regular expression search, you might prefer it to be a single character.

'''
requires PythonScript v3 or higher: https://github.com/bruderstein/PythonScript
ref: https://community.notepad-plus-plus.org/topic/25285/how-to-normalize-fancy-unicode-text-back-to-regular-text/17
docs: https://docs.python.org/3.10/library/unicodedata.html
'''
import unicodedata
from Npp import *

def normalize(text):
    '''
    NFKC stands for normalization form compatibility decomposition
        with subsequent canonical composition.
    NFKD works similarly AFAIK; it may be a bit faster, but it has some weird 
        behaviors like breaking ö into two characters: ASCII "o" and then ̈
        whereas NFKC combines those two into a single character.
    '''
    return unicodedata.normalize('NFKC', text)

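# Read the selection bounds; equal values mean nothing is selected.
selstart = editor.getSelectionStart()
selend = editor.getSelectionEnd()

if selstart == selend:
    # No selection: normalize the whole document.
    text = editor.getText()
    editor.setText(normalize(text))
else:
    # Normalize only the selected text, in place.
    text = editor.getSelText()
    editor.replaceSel(normalize(text))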
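The NFKD/NFKC difference mentioned in the docstring can be seen in plain Python 3, outside Notepad++. A small sketch using only the standard unicodedata module:

```python
import unicodedata

o_umlaut = '\u00f6'   # ö as one precomposed character

nfkd = unicodedata.normalize('NFKD', o_umlaut)
nfkc = unicodedata.normalize('NFKC', o_umlaut)

print([hex(ord(c)) for c in nfkd])   # base letter, then combining diaeresis U+0308
print([hex(ord(c)) for c in nfkc])   # still the single precomposed character

# For the fancy letters in this thread the two forms give the same result.
fancy = '\U0001D643\U0001D65A\U0001D661\U0001D661\U0001D664'   # 𝙃𝙚𝙡𝙡𝙤
print(unicodedata.normalize('NFKD', fancy) == unicodedata.normalize('NFKC', fancy))
```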

guy038

Hi, @alan-kilborn,

I completely agree with your last assumption, and that’s why I had already upvoted @peterjones’s post and now upvote @mark-olson’s solution too !

BR

guy038