How to normalize fancy Unicode text back to regular text?

@Coises said in How to normalize fancy Unicode text back to regular text?:
Nothing will seem to change for ßášï© because all six characters in your example are available in ANSI.
ß is ANSI DF and Unicode U+00DF
á is ANSI E1 and Unicode U+00E1
š is ANSI 9A and Unicode U+0161. Unicode U+009A is used for SINGLE CHARACTER INTRODUCER.
ï is ANSI EF and Unicode U+00EF
© is ANSI A9 and Unicode U+00A9
When I put UTF-8 encoded ßášï© into a file, convert to ANSI, and save it, I see that the file contains \xDF \xE1 \x9A \xEF \xA9, which is what I expect.
You are correct that when I load it into Notepad++ it is identified as ISO 8859-7. I don’t think the misidentification is a concern, as the file is very small. The code only has six bytes to work with and all six are unusual for ANSI text.
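The byte values quoted above can be checked outside Notepad++ with a few lines of plain Python 3. This is only a sketch: it assumes that "ANSI" means Windows-1252 (code page 1252) on the machine in question, and it uses Python's own `cp1252` and `iso8859_7` codecs rather than anything Notepad++ does internally.

```python
text = "ßášï©"

# Assumption: "ANSI" is Windows-1252 here; each of these characters is one byte.
ansi_bytes = text.encode("cp1252")
print(ansi_bytes.hex(" ").upper())

# The very same bytes are also valid ISO 8859-7 (Greek), but they decode to
# different characters, which is why a detector can guess wrong on tiny files.
greek_guess = ansi_bytes.decode("iso8859_7")
print(greek_guess != text)
```

With so few bytes and no ASCII context, both interpretations are plausible to an automatic detector.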

On that website I can see it’s running a JavaScript file with all the functions for that operation, which also includes all of these Unicode letters in an array variable called “var unicodeLetters”. Maybe it’s possible to adapt this code into a Python script, etc. The script also contains a link to a UTF-8 encoder/decoder where you can enter text and get the encoded hex bytes out. Entering “𝗗” gives 4 hex bytes, “\xF0\x9D\x97\x97”, and entering “ßášï©” gives “\xC3\x9F\xC3\xA1\xC5\xA1\xC3\xAF\xC2\xA9”.
https://mothereff.in/utf8
Here is the script the website is using.
onlinetools.com/CACHE/js/unicodenormalizeunicodetext.js
Maybe it could help to adapt it and create something for npp.
PS: Thank you for that one reputation point @mkupper. Now I can also post links.
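The hex bytes reported by the online encoder can be reproduced locally. The helper name `utf8_hex` below is made up for this illustration, and the sketch assumes that the fancy letter is U+1D5D7 (MATHEMATICAL SANS-SERIF BOLD CAPITAL D), which is what the four bytes quoted above decode to.

```python
def utf8_hex(s: str) -> str:
    """Return the UTF-8 bytes of s written as \\xNN escapes (hypothetical helper)."""
    return "".join(f"\\x{b:02X}" for b in s.encode("utf-8"))

# U+1D5D7 lies outside the BMP, so UTF-8 needs four bytes for it.
print(utf8_hex("\U0001D5D7"))

# These characters are below U+0800, so each takes two bytes.
print(utf8_hex("ßášï©"))
```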

@DeanCorso said in How to normalize fancy Unicode text back to regular text?:
I tried already to enter specific python command combo in python console but the results I got out was just question marks using the “NFKD”.
If you used the default PythonScript 2.0.0.0 from the Plugins Admin, then your unicode strings in your test would need to be marked as u'𝘏𝘦𝘭𝘭𝘰 𝘕𝘰𝘵𝘦𝘱𝘢𝘥 𝘱𝘭𝘶𝘴 𝘱𝘭𝘶𝘴 𝘤𝘰𝘮𝘮𝘶𝘯𝘪𝘵𝘺' in order for Python 2.7 to know it was unicode.
But when I tried even with that in the default PythonScript plugin, it wouldn’t normalize that string (though it could normalize some others).
When I switched to a copy of Notepad++ with PythonScript 3.0.17 (using Python 3.12, which doesn’t need fancy string syntax for unicode strings), and ran an example script from here, I got the expected output:
    Python 3.12.1 (tags/v3.12.1:2305ca5, Dec 7 2023, 22:03:25) [MSC v.1937 64 bit (AMD64)]
    Initialisation took 234ms
    Ready.
    >>> import unicodedata
    >>> strings = [ '𝖙𝖍𝖚𝖌 𝖑𝖎𝖋𝖊', '𝓽𝓱𝓾𝓰 𝓵𝓲𝓯𝓮', '𝓉𝒽𝓊𝑔 𝓁𝒾𝒻𝑒', '𝕥𝕙𝕦𝕘 𝕝𝕚𝕗𝕖', 'ｔｈｕｇ ｌｉｆｅ', '𝘏𝘦𝘭𝘭𝘰 𝘕𝘰𝘵𝘦𝘱𝘢𝘥 𝘱𝘭𝘶𝘴 𝘱𝘭𝘶𝘴 𝘤𝘰𝘮𝘮𝘶𝘯𝘪𝘵𝘺', '𝙃𝙚𝙡𝙡𝙤 𝙉𝙤𝙩𝙚𝙥𝙖𝙙 𝙥𝙡𝙪𝙨 𝙥𝙡𝙪𝙨 𝙘𝙤𝙢𝙢𝙪𝙣𝙞𝙩𝙮']
    >>> for x in strings:
    ...     print(unicodedata.normalize( 'NFKC', x), x)
    ...
    thug life 𝖙𝖍𝖚𝖌 𝖑𝖎𝖋𝖊
    thug life 𝓽𝓱𝓾𝓰 𝓵𝓲𝓯𝓮
    thug life 𝓉𝒽𝓊𝑔 𝓁𝒾𝒻𝑒
    thug life 𝕥𝕙𝕦𝕘 𝕝𝕚𝕗𝕖
    thug life ｔｈｕｇ ｌｉｆｅ
    Hello Notepad plus plus community 𝘏𝘦𝘭𝘭𝘰 𝘕𝘰𝘵𝘦𝘱𝘢𝘥 𝘱𝘭𝘶𝘴 𝘱𝘭𝘶𝘴 𝘤𝘰𝘮𝘮𝘶𝘯𝘪𝘵𝘺
    Hello Notepad plus plus community 𝙃𝙚𝙡𝙡𝙤 𝙉𝙤𝙩𝙚𝙥𝙖𝙙 𝙥𝙡𝙪𝙨 𝙥𝙡𝙪𝙨 𝙘𝙤𝙢𝙢𝙪𝙣𝙞𝙩𝙮
So to be able to normalize your particular characters, you must use PythonScript 3.x, which you can download from here.

@mkupper said in How to normalize fancy Unicode text back to regular text?:
Nothing will seem to change for ßášï© because all six characters in your example are available in ANSI.
You’re correct, of course. I should have worded that more carefully. My point was that the characters themselves don’t change when you convert from Unicode to ANSI, only the internal representations; my example didn’t include characters not in the local code page, which just change to question marks.
The original poster desires to change the characters themselves, not just their internal representations; that is not the purpose of Convert to ANSI.

@DeanCorso said in How to normalize fancy Unicode text back to regular text?:
Maybe it’s possible to adapt this code to make a python script etc.
There is a jN (JavaScript for Notepad++) plugin available through Plugins Admin. I was never quite able to grasp, from the documentation, how to use it, but it appears very powerful. If you know Javascript and have found the relevant scripts in the web page, you might be able to use it to do what you need.

@Coises said in How to normalize fancy Unicode text back to regular text?:
My point was that the characters themselves don’t change when you convert from Unicode to ANSI, only the internal representations
Some characters do change. Copy/paste the following into a UTF-8 encoded tab or file. It should look the same as what you see here on the forums.
Copy/paste it into an ANSI encoded tab or file. You should see plain ASCII.
On the UTF-8 encoded tab do an Encoding / Convert to ANSI. The first two lines are plain ASCII but the rest are not ASCII and are characters that do not have a direct mapping into the same character/glyph in ANSI.
ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz ＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ ａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ ĀāĂăĄąǍǎǞǟα ƀℬ ĆćĈĉĊċČčℂℭ Ďďđδ ĒēĔĕĖėĘęĚěεℇℯ Φφℱ ĜĝĞğĠġĢģǤǥǦǧɡΓℊ ĤĥĦħℋℌℍℎ ĨĩĪīĬĭĮįİıƗǏǐℐℑ Ĵĵǰ ĶķǨǩK ĹĺĻļĽľŁłƚℒℓ ℳ ŃńŅņŇňⁿℕ ŌōŎŏŐőƟƠơǑǒǪǫǬǭΩℴ πℙ ℚ ŔŕŖŗŘřℛℜℝ ŚśŜŝŞşΣσ ŢţŤťŦŧƮƫΘτ ŨũŪūŬŭŮůŰűŲųƯưǓǔǕǖǗǘǙǚǛǜ √ Ŵŵ Ŷŷ ŹźŻżƶℤℨ
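A similar folding can be approximated in a script. The sketch below is not what Convert to ANSI does internally (that conversion is left to Notepad++ and Windows); it only shows one way, using Python 3's `unicodedata`, to reduce many of the sample characters to plain letters. The function name `fold_to_plain` is made up for the example.

```python
import unicodedata

def fold_to_plain(text: str) -> str:
    """Decompose with NFKD, then drop combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_to_plain("ＡＢＣ ａｂｃ"))   # fullwidth forms have compatibility mappings
print(fold_to_plain("ĀāĂăĄą"))       # base letter plus a combining mark
print(fold_to_plain("ℂℍℕ"))          # letterlike symbols map to plain capitals
print(fold_to_plain("ŁłĐđ"))         # no decomposition exists, so these stay as they are
```

Characters such as Ł or đ carry no decomposition in the Unicode data, so a normalization-based approach leaves them alone; they would need an explicit mapping table.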

@Coises said in How to normalize fancy Unicode text back to regular text?:
If your default code page is CP1252 (common for Western Europe and the US) and you open a new tab in Notepad++, set it to UTF8, paste in ßášï© and then convert to ANSI… nothing will change.
Nothing visually will change, but the underlying bytes of the file will all change.
EDIT: Oops, this has already been said. :)
if I save it ANSI, when I open it again, Notepad++ detects it as Greek/ISO 8859-7
Yes, the encoding autodetection of Notepad++ has long been a target of scorn because it can fail in such ways, but apparently autodetection is not always possible, or at least can be extremely difficult…

Hello, @deancorso, @peterjones, @mkupper, @alankilborn, @coises and All,
In Unicode v15.1, the Mathematical Alphanumeric Symbols block contains 1024 characters, between U+1D400 and U+1D7FF, of which 28 are unassigned. Refer to: https://www.unicode.org/charts/PDF/U1D400.pdf
Of course, we could generate, from within N++, some regexes in order to transform these characters into standard ASCII characters!
If we restrict our goal to the Latin letters and digits of this block, we would use the ranges 1D400 - 1D6A3 and 1D7CE - 1D7FF.
However, the N++ regex engine cannot directly handle characters outside the BMP (so chars with a code over \x{FFFF}). You need to use the surrogate pair to match a specific character.
For example, the char 𝑻 cannot be found with the regex \x{1D47B}. Luckily, you can match it with the regex \x{D835}\x{DC7B}, using the Unicode Surrogate Area.
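The surrogate pair for any character above the BMP follows from the standard UTF-16 arithmetic. Here is a small Python 3 sketch; the function names `surrogate_pair` and `as_npp_regex` are invented for this illustration and are not part of Notepad++ or PythonScript.

```python
def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP have a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

def as_npp_regex(ch: str) -> str:
    """Write one character in the \\x{....}\\x{....} form used in this thread."""
    high, low = surrogate_pair(ord(ch))
    return f"\\x{{{high:04X}}}\\x{{{low:04X}}}"

print(as_npp_regex("\U0001D47B"))
```

For U+1D47B this prints `\x{D835}\x{DC7B}`, the same pair given above.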
So, we could build the 3 giant regex S/R below, which, in theory, could replace any fancy character of this Unicode block with standard letters and digits:
SEARCH : (?x)
    ( [\x{D835}\x{DC00}\x{D835}\x{DC34}\x{D835}\x{DC68}\x{D835}\x{DC9C}\x{D835}\x{DCD0}\x{D835}\x{DD04}\x{D835}\x{DD38}\x{D835}\x{DD6C}\x{D835}\x{DDA0}\x{D835}\x{DDD4}\x{D835}\x{DE08}\x{D835}\x{DE3C}\x{D835}\x{DE70}] ) |  # Letter A
    ( [\x{D835}\x{DC01}\x{D835}\x{DC35}\x{D835}\x{DC69}\x{D835}\x{DC9D}\x{D835}\x{DCD1}\x{D835}\x{DD05}\x{D835}\x{DD39}\x{D835}\x{DD6D}\x{D835}\x{DDA1}\x{D835}\x{DDD5}\x{D835}\x{DE09}\x{D835}\x{DE3D}\x{D835}\x{DE71}] ) |  # Letter B
    ( [\x{D835}\x{DC02}\x{D835}\x{DC36}\x{D835}\x{DC6A}\x{D835}\x{DC9E}\x{D835}\x{DCD2}\x{D835}\x{DD06}\x{D835}\x{DD3A}\x{D835}\x{DD6E}\x{D835}\x{DDA2}\x{D835}\x{DDD6}\x{D835}\x{DE0A}\x{D835}\x{DE3E}\x{D835}\x{DE72}] ) |  # Letter C
    ( [\x{D835}\x{DC03}\x{D835}\x{DC37}\x{D835}\x{DC6B}\x{D835}\x{DC9F}\x{D835}\x{DCD3}\x{D835}\x{DD07}\x{D835}\x{DD3B}\x{D835}\x{DD6F}\x{D835}\x{DDA3}\x{D835}\x{DDD7}\x{D835}\x{DE0B}\x{D835}\x{DE3F}\x{D835}\x{DE73}] ) |  # Letter D
    ( [\x{D835}\x{DC04}\x{D835}\x{DC38}\x{D835}\x{DC6C}\x{D835}\x{DCA0}\x{D835}\x{DCD4}\x{D835}\x{DD08}\x{D835}\x{DD3C}\x{D835}\x{DD70}\x{D835}\x{DDA4}\x{D835}\x{DDD8}\x{D835}\x{DE0C}\x{D835}\x{DE40}\x{D835}\x{DE74}] ) |  # Letter E
    ( [\x{D835}\x{DC05}\x{D835}\x{DC39}\x{D835}\x{DC6D}\x{D835}\x{DCA1}\x{D835}\x{DCD5}\x{D835}\x{DD09}\x{D835}\x{DD3D}\x{D835}\x{DD71}\x{D835}\x{DDA5}\x{D835}\x{DDD9}\x{D835}\x{DE0D}\x{D835}\x{DE41}\x{D835}\x{DE75}] ) |  # Letter F
    ( [\x{D835}\x{DC06}\x{D835}\x{DC3A}\x{D835}\x{DC6E}\x{D835}\x{DCA2}\x{D835}\x{DCD6}\x{D835}\x{DD0A}\x{D835}\x{DD3E}\x{D835}\x{DD72}\x{D835}\x{DDA6}\x{D835}\x{DDDA}\x{D835}\x{DE0E}\x{D835}\x{DE42}\x{D835}\x{DE76}] ) |  # Letter G
    ( [\x{D835}\x{DC07}\x{D835}\x{DC3B}\x{D835}\x{DC6F}\x{D835}\x{DCA3}\x{D835}\x{DCD7}\x{D835}\x{DD0B}\x{D835}\x{DD3F}\x{D835}\x{DD73}\x{D835}\x{DDA7}\x{D835}\x{DDDB}\x{D835}\x{DE0F}\x{D835}\x{DE43}\x{D835}\x{DE77}] ) |  # Letter H
    ( [\x{D835}\x{DC08}\x{D835}\x{DC3C}\x{D835}\x{DC70}\x{D835}\x{DCA4}\x{D835}\x{DCD8}\x{D835}\x{DD0C}\x{D835}\x{DD40}\x{D835}\x{DD74}\x{D835}\x{DDA8}\x{D835}\x{DDDC}\x{D835}\x{DE10}\x{D835}\x{DE44}\x{D835}\x{DE78}] ) |  # Letter I
    ( [\x{D835}\x{DC09}\x{D835}\x{DC3D}\x{D835}\x{DC71}\x{D835}\x{DCA5}\x{D835}\x{DCD9}\x{D835}\x{DD0D}\x{D835}\x{DD41}\x{D835}\x{DD75}\x{D835}\x{DDA9}\x{D835}\x{DDDD}\x{D835}\x{DE11}\x{D835}\x{DE45}\x{D835}\x{DE79}] ) |  # Letter J
    ( [\x{D835}\x{DC0A}\x{D835}\x{DC3E}\x{D835}\x{DC72}\x{D835}\x{DCA6}\x{D835}\x{DCDA}\x{D835}\x{DD0E}\x{D835}\x{DD42}\x{D835}\x{DD76}\x{D835}\x{DDAA}\x{D835}\x{DDDE}\x{D835}\x{DE12}\x{D835}\x{DE46}\x{D835}\x{DE7A}] ) |  # Letter K
    ( [\x{D835}\x{DC0B}\x{D835}\x{DC3F}\x{D835}\x{DC73}\x{D835}\x{DCA7}\x{D835}\x{DCDB}\x{D835}\x{DD0F}\x{D835}\x{DD43}\x{D835}\x{DD77}\x{D835}\x{DDAB}\x{D835}\x{DDDF}\x{D835}\x{DE13}\x{D835}\x{DE47}\x{D835}\x{DE7B}] ) |  # Letter L
    ( [\x{D835}\x{DC0C}\x{D835}\x{DC40}\x{D835}\x{DC74}\x{D835}\x{DCA8}\x{D835}\x{DCDC}\x{D835}\x{DD10}\x{D835}\x{DD44}\x{D835}\x{DD78}\x{D835}\x{DDAC}\x{D835}\x{DDE0}\x{D835}\x{DE14}\x{D835}\x{DE48}\x{D835}\x{DE7C}] ) |  # Letter M
    ( [\x{D835}\x{DC0D}\x{D835}\x{DC41}\x{D835}\x{DC75}\x{D835}\x{DCA9}\x{D835}\x{DCDD}\x{D835}\x{DD11}\x{D835}\x{DD45}\x{D835}\x{DD79}\x{D835}\x{DDAD}\x{D835}\x{DDE1}\x{D835}\x{DE15}\x{D835}\x{DE49}\x{D835}\x{DE7D}] ) |  # Letter N
    ( [\x{D835}\x{DC0E}\x{D835}\x{DC42}\x{D835}\x{DC76}\x{D835}\x{DCAA}\x{D835}\x{DCDE}\x{D835}\x{DD12}\x{D835}\x{DD46}\x{D835}\x{DD7A}\x{D835}\x{DDAE}\x{D835}\x{DDE2}\x{D835}\x{DE16}\x{D835}\x{DE4A}\x{D835}\x{DE7E}] ) |  # Letter O
    ( [\x{D835}\x{DC0F}\x{D835}\x{DC43}\x{D835}\x{DC77}\x{D835}\x{DCAB}\x{D835}\x{DCDF}\x{D835}\x{DD13}\x{D835}\x{DD47}\x{D835}\x{DD7B}\x{D835}\x{DDAF}\x{D835}\x{DDE3}\x{D835}\x{DE17}\x{D835}\x{DE4B}\x{D835}\x{DE7F}] ) |  # Letter P
    ( [\x{D835}\x{DC10}\x{D835}\x{DC44}\x{D835}\x{DC78}\x{D835}\x{DCAC}\x{D835}\x{DCE0}\x{D835}\x{DD14}\x{D835}\x{DD48}\x{D835}\x{DD7C}\x{D835}\x{DDB0}\x{D835}\x{DDE4}\x{D835}\x{DE18}\x{D835}\x{DE4C}\x{D835}\x{DE80}] ) |  # Letter Q
    ( [\x{D835}\x{DC11}\x{D835}\x{DC45}\x{D835}\x{DC79}\x{D835}\x{DCAD}\x{D835}\x{DCE1}\x{D835}\x{DD15}\x{D835}\x{DD49}\x{D835}\x{DD7D}\x{D835}\x{DDB1}\x{D835}\x{DDE5}\x{D835}\x{DE19}\x{D835}\x{DE4D}\x{D835}\x{DE81}] ) |  # Letter R
    ( [\x{D835}\x{DC12}\x{D835}\x{DC46}\x{D835}\x{DC7A}\x{D835}\x{DCAE}\x{D835}\x{DCE2}\x{D835}\x{DD16}\x{D835}\x{DD4A}\x{D835}\x{DD7E}\x{D835}\x{DDB2}\x{D835}\x{DDE6}\x{D835}\x{DE1A}\x{D835}\x{DE4E}\x{D835}\x{DE82}] ) |  # Letter S
    ( [\x{D835}\x{DC13}\x{D835}\x{DC47}\x{D835}\x{DC7B}\x{D835}\x{DCAF}\x{D835}\x{DCE3}\x{D835}\x{DD17}\x{D835}\x{DD4B}\x{D835}\x{DD7F}\x{D835}\x{DDB3}\x{D835}\x{DDE7}\x{D835}\x{DE1B}\x{D835}\x{DE4F}\x{D835}\x{DE83}] ) |  # Letter T
    ( [\x{D835}\x{DC14}\x{D835}\x{DC48}\x{D835}\x{DC7C}\x{D835}\x{DCB0}\x{D835}\x{DCE4}\x{D835}\x{DD18}\x{D835}\x{DD4C}\x{D835}\x{DD80}\x{D835}\x{DDB4}\x{D835}\x{DDE8}\x{D835}\x{DE1C}\x{D835}\x{DE50}\x{D835}\x{DE84}] ) |  # Letter U
    ( [\x{D835}\x{DC15}\x{D835}\x{DC49}\x{D835}\x{DC7D}\x{D835}\x{DCB1}\x{D835}\x{DCE5}\x{D835}\x{DD19}\x{D835}\x{DD4D}\x{D835}\x{DD81}\x{D835}\x{DDB5}\x{D835}\x{DDE9}\x{D835}\x{DE1D}\x{D835}\x{DE51}\x{D835}\x{DE85}] ) |  # Letter V
    ( [\x{D835}\x{DC16}\x{D835}\x{DC4A}\x{D835}\x{DC7E}\x{D835}\x{DCB2}\x{D835}\x{DCE6}\x{D835}\x{DD1A}\x{D835}\x{DD4E}\x{D835}\x{DD82}\x{D835}\x{DDB6}\x{D835}\x{DDEA}\x{D835}\x{DE1E}\x{D835}\x{DE52}\x{D835}\x{DE86}] ) |  # Letter W
    ( [\x{D835}\x{DC17}\x{D835}\x{DC4B}\x{D835}\x{DC7F}\x{D835}\x{DCB3}\x{D835}\x{DCE7}\x{D835}\x{DD1B}\x{D835}\x{DD4F}\x{D835}\x{DD83}\x{D835}\x{DDB7}\x{D835}\x{DDEB}\x{D835}\x{DE1F}\x{D835}\x{DE53}\x{D835}\x{DE87}] ) |  # Letter X
    ( [\x{D835}\x{DC18}\x{D835}\x{DC4C}\x{D835}\x{DC80}\x{D835}\x{DCB4}\x{D835}\x{DCE8}\x{D835}\x{DD1C}\x{D835}\x{DD50}\x{D835}\x{DD84}\x{D835}\x{DDB8}\x{D835}\x{DDEC}\x{D835}\x{DE20}\x{D835}\x{DE54}\x{D835}\x{DE88}] ) |  # Letter Y
    ( [\x{D835}\x{DC19}\x{D835}\x{DC4D}\x{D835}\x{DC81}\x{D835}\x{DCB5}\x{D835}\x{DCE9}\x{D835}\x{DD1D}\x{D835}\x{DD51}\x{D835}\x{DD85}\x{D835}\x{DDB9}\x{D835}\x{DDED}\x{D835}\x{DE21}\x{D835}\x{DE55}\x{D835}\x{DE89}] )    # Letter Z
REPLACE : (?{01}A)(?{02}B)(?{03}C)(?{04}D)(?{05}E)(?{06}F)(?{07}G)(?{08}H)(?{09}I)(?{10}J)(?{11}K)(?{12}L)(?{13}M)(?{14}N)(?{15}O)(?{16}P)(?{17}Q)(?{18}R)(?{19}S)(?{20}T)(?{21}U)(?{22}V)(?{23}W)(?{24}X)(?{25}Y)(?{26}Z)
SEARCH : (?x)
    ( [\x{D835}\x{DC1A}\x{D835}\x{DC4E}\x{D835}\x{DC82}\x{D835}\x{DCB6}\x{D835}\x{DCEA}\x{D835}\x{DD1E}\x{D835}\x{DD52}\x{D835}\x{DD86}\x{D835}\x{DDBA}\x{D835}\x{DDEE}\x{D835}\x{DE22}\x{D835}\x{DE56}\x{D835}\x{DE8A}] ) |  # Letter a
    ( [\x{D835}\x{DC1B}\x{D835}\x{DC4F}\x{D835}\x{DC83}\x{D835}\x{DCB7}\x{D835}\x{DCEB}\x{D835}\x{DD1F}\x{D835}\x{DD53}\x{D835}\x{DD87}\x{D835}\x{DDBB}\x{D835}\x{DDEF}\x{D835}\x{DE23}\x{D835}\x{DE57}\x{D835}\x{DE8B}] ) |  # Letter b
    ( [\x{D835}\x{DC1C}\x{D835}\x{DC50}\x{D835}\x{DC84}\x{D835}\x{DCB8}\x{D835}\x{DCEC}\x{D835}\x{DD20}\x{D835}\x{DD54}\x{D835}\x{DD88}\x{D835}\x{DDBC}\x{D835}\x{DDF0}\x{D835}\x{DE24}\x{D835}\x{DE58}\x{D835}\x{DE8C}] ) |  # Letter c
    ( [\x{D835}\x{DC1D}\x{D835}\x{DC51}\x{D835}\x{DC85}\x{D835}\x{DCB9}\x{D835}\x{DCED}\x{D835}\x{DD21}\x{D835}\x{DD55}\x{D835}\x{DD89}\x{D835}\x{DDBD}\x{D835}\x{DDF1}\x{D835}\x{DE25}\x{D835}\x{DE59}\x{D835}\x{DE8D}] ) |  # Letter d
    ( [\x{D835}\x{DC1E}\x{D835}\x{DC52}\x{D835}\x{DC86}\x{D835}\x{DCBA}\x{D835}\x{DCEE}\x{D835}\x{DD22}\x{D835}\x{DD56}\x{D835}\x{DD8A}\x{D835}\x{DDBE}\x{D835}\x{DDF2}\x{D835}\x{DE26}\x{D835}\x{DE5A}\x{D835}\x{DE8E}] ) |  # Letter e
    ( [\x{D835}\x{DC1F}\x{D835}\x{DC53}\x{D835}\x{DC87}\x{D835}\x{DCBB}\x{D835}\x{DCEF}\x{D835}\x{DD23}\x{D835}\x{DD57}\x{D835}\x{DD8B}\x{D835}\x{DDBF}\x{D835}\x{DDF3}\x{D835}\x{DE27}\x{D835}\x{DE5B}\x{D835}\x{DE8F}] ) |  # Letter f
    ( [\x{D835}\x{DC20}\x{D835}\x{DC54}\x{D835}\x{DC88}\x{D835}\x{DCBC}\x{D835}\x{DCF0}\x{D835}\x{DD24}\x{D835}\x{DD58}\x{D835}\x{DD8C}\x{D835}\x{DDC0}\x{D835}\x{DDF4}\x{D835}\x{DE28}\x{D835}\x{DE5C}\x{D835}\x{DE90}] ) |  # Letter g
    ( [\x{D835}\x{DC21}\x{D835}\x{DC55}\x{D835}\x{DC89}\x{D835}\x{DCBD}\x{D835}\x{DCF1}\x{D835}\x{DD25}\x{D835}\x{DD59}\x{D835}\x{DD8D}\x{D835}\x{DDC1}\x{D835}\x{DDF5}\x{D835}\x{DE29}\x{D835}\x{DE5D}\x{D835}\x{DE91}] ) |  # Letter h
    ( [\x{D835}\x{DC22}\x{D835}\x{DC56}\x{D835}\x{DC8A}\x{D835}\x{DCBE}\x{D835}\x{DCF2}\x{D835}\x{DD26}\x{D835}\x{DD5A}\x{D835}\x{DD8E}\x{D835}\x{DDC2}\x{D835}\x{DDF6}\x{D835}\x{DE2A}\x{D835}\x{DE5E}\x{D835}\x{DE92}] ) |  # Letter i
    ( [\x{D835}\x{DC23}\x{D835}\x{DC57}\x{D835}\x{DC8B}\x{D835}\x{DCBF}\x{D835}\x{DCF3}\x{D835}\x{DD27}\x{D835}\x{DD5B}\x{D835}\x{DD8F}\x{D835}\x{DDC3}\x{D835}\x{DDF7}\x{D835}\x{DE2B}\x{D835}\x{DE5F}\x{D835}\x{DE93}] ) |  # Letter j
    ( [\x{D835}\x{DC24}\x{D835}\x{DC58}\x{D835}\x{DC8C}\x{D835}\x{DCC0}\x{D835}\x{DCF4}\x{D835}\x{DD28}\x{D835}\x{DD5C}\x{D835}\x{DD90}\x{D835}\x{DDC4}\x{D835}\x{DDF8}\x{D835}\x{DE2C}\x{D835}\x{DE60}\x{D835}\x{DE94}] ) |  # Letter k
    ( [\x{D835}\x{DC25}\x{D835}\x{DC59}\x{D835}\x{DC8D}\x{D835}\x{DCC1}\x{D835}\x{DCF5}\x{D835}\x{DD29}\x{D835}\x{DD5D}\x{D835}\x{DD91}\x{D835}\x{DDC5}\x{D835}\x{DDF9}\x{D835}\x{DE2D}\x{D835}\x{DE61}\x{D835}\x{DE95}] ) |  # Letter l
    ( [\x{D835}\x{DC26}\x{D835}\x{DC5A}\x{D835}\x{DC8E}\x{D835}\x{DCC2}\x{D835}\x{DCF6}\x{D835}\x{DD2A}\x{D835}\x{DD5E}\x{D835}\x{DD92}\x{D835}\x{DDC6}\x{D835}\x{DDFA}\x{D835}\x{DE2E}\x{D835}\x{DE62}\x{D835}\x{DE96}] ) |  # Letter m
    ( [\x{D835}\x{DC27}\x{D835}\x{DC5B}\x{D835}\x{DC8F}\x{D835}\x{DCC3}\x{D835}\x{DCF7}\x{D835}\x{DD2B}\x{D835}\x{DD5F}\x{D835}\x{DD93}\x{D835}\x{DDC7}\x{D835}\x{DDFB}\x{D835}\x{DE2F}\x{D835}\x{DE63}\x{D835}\x{DE97}] ) |  # Letter n
    ( [\x{D835}\x{DC28}\x{D835}\x{DC5C}\x{D835}\x{DC90}\x{D835}\x{DCC4}\x{D835}\x{DCF8}\x{D835}\x{DD2C}\x{D835}\x{DD60}\x{D835}\x{DD94}\x{D835}\x{DDC8}\x{D835}\x{DDFC}\x{D835}\x{DE30}\x{D835}\x{DE64}\x{D835}\x{DE98}] ) |  # Letter o
    ( [\x{D835}\x{DC29}\x{D835}\x{DC5D}\x{D835}\x{DC91}\x{D835}\x{DCC5}\x{D835}\x{DCF9}\x{D835}\x{DD2D}\x{D835}\x{DD61}\x{D835}\x{DD95}\x{D835}\x{DDC9}\x{D835}\x{DDFD}\x{D835}\x{DE31}\x{D835}\x{DE65}\x{D835}\x{DE99}] ) |  # Letter p
    ( [\x{D835}\x{DC2A}\x{D835}\x{DC5E}\x{D835}\x{DC92}\x{D835}\x{DCC6}\x{D835}\x{DCFA}\x{D835}\x{DD2E}\x{D835}\x{DD62}\x{D835}\x{DD96}\x{D835}\x{DDCA}\x{D835}\x{DDFE}\x{D835}\x{DE32}\x{D835}\x{DE66}\x{D835}\x{DE9A}] ) |  # Letter q
    ( [\x{D835}\x{DC2B}\x{D835}\x{DC5F}\x{D835}\x{DC93}\x{D835}\x{DCC7}\x{D835}\x{DCFB}\x{D835}\x{DD2F}\x{D835}\x{DD63}\x{D835}\x{DD97}\x{D835}\x{DDCB}\x{D835}\x{DDFF}\x{D835}\x{DE33}\x{D835}\x{DE67}\x{D835}\x{DE9B}] ) |  # Letter r
    ( [\x{D835}\x{DC2C}\x{D835}\x{DC60}\x{D835}\x{DC94}\x{D835}\x{DCC8}\x{D835}\x{DCFC}\x{D835}\x{DD30}\x{D835}\x{DD64}\x{D835}\x{DD98}\x{D835}\x{DDCC}\x{D835}\x{DE00}\x{D835}\x{DE34}\x{D835}\x{DE68}\x{D835}\x{DE9C}] ) |  # Letter s
    ( [\x{D835}\x{DC2D}\x{D835}\x{DC61}\x{D835}\x{DC95}\x{D835}\x{DCC9}\x{D835}\x{DCFD}\x{D835}\x{DD31}\x{D835}\x{DD65}\x{D835}\x{DD99}\x{D835}\x{DDCD}\x{D835}\x{DE01}\x{D835}\x{DE35}\x{D835}\x{DE69}\x{D835}\x{DE9D}] ) |  # Letter t
    ( [\x{D835}\x{DC2E}\x{D835}\x{DC62}\x{D835}\x{DC96}\x{D835}\x{DCCA}\x{D835}\x{DCFE}\x{D835}\x{DD32}\x{D835}\x{DD66}\x{D835}\x{DD9A}\x{D835}\x{DDCE}\x{D835}\x{DE02}\x{D835}\x{DE36}\x{D835}\x{DE6A}\x{D835}\x{DE9E}] ) |  # Letter u
    ( [\x{D835}\x{DC2F}\x{D835}\x{DC63}\x{D835}\x{DC97}\x{D835}\x{DCCB}\x{D835}\x{DCFF}\x{D835}\x{DD33}\x{D835}\x{DD67}\x{D835}\x{DD9B}\x{D835}\x{DDCF}\x{D835}\x{DE03}\x{D835}\x{DE37}\x{D835}\x{DE6B}\x{D835}\x{DE9F}] ) |  # Letter v
    ( [\x{D835}\x{DC30}\x{D835}\x{DC64}\x{D835}\x{DC98}\x{D835}\x{DCCC}\x{D835}\x{DD00}\x{D835}\x{DD34}\x{D835}\x{DD68}\x{D835}\x{DD9C}\x{D835}\x{DDD0}\x{D835}\x{DE04}\x{D835}\x{DE38}\x{D835}\x{DE6C}\x{D835}\x{DEA0}] ) |  # Letter w
    ( [\x{D835}\x{DC31}\x{D835}\x{DC65}\x{D835}\x{DC99}\x{D835}\x{DCCD}\x{D835}\x{DD01}\x{D835}\x{DD35}\x{D835}\x{DD69}\x{D835}\x{DD9D}\x{D835}\x{DDD1}\x{D835}\x{DE05}\x{D835}\x{DE39}\x{D835}\x{DE6D}\x{D835}\x{DEA1}] ) |  # Letter x
    ( [\x{D835}\x{DC32}\x{D835}\x{DC66}\x{D835}\x{DC9A}\x{D835}\x{DCCE}\x{D835}\x{DD02}\x{D835}\x{DD36}\x{D835}\x{DD6A}\x{D835}\x{DD9E}\x{D835}\x{DDD2}\x{D835}\x{DE06}\x{D835}\x{DE3A}\x{D835}\x{DE6E}\x{D835}\x{DEA2}] ) |  # Letter y
    ( [\x{D835}\x{DC33}\x{D835}\x{DC67}\x{D835}\x{DC9B}\x{D835}\x{DCCF}\x{D835}\x{DD03}\x{D835}\x{DD37}\x{D835}\x{DD6B}\x{D835}\x{DD9F}\x{D835}\x{DDD3}\x{D835}\x{DE07}\x{D835}\x{DE3B}\x{D835}\x{DE6F}\x{D835}\x{DEA3}] )    # Letter z
REPLACE : (?{01}a)(?{02}b)(?{03}c)(?{04}d)(?{05}e)(?{06}f)(?{07}g)(?{08}h)(?{09}i)(?{10}j)(?{11}k)(?{12}l)(?{13}m)(?{14}n)(?{15}o)(?{16}p)(?{17}q)(?{18}r)(?{19}s)(?{20}t)(?{21}u)(?{22}v)(?{23}w)(?{24}x)(?{25}y)(?{26}z)
SEARCH : (?x)
    ([\x{D835}\x{DFCE}\x{D835}\x{DFD8}\x{D835}\x{DFE2}\x{D835}\x{DFEC}\x{D835}\x{DFF6}]) |  # Digit 0
    ([\x{D835}\x{DFCF}\x{D835}\x{DFD9}\x{D835}\x{DFE3}\x{D835}\x{DFED}\x{D835}\x{DFF7}]) |  # Digit 1
    ([\x{D835}\x{DFD0}\x{D835}\x{DFDA}\x{D835}\x{DFE4}\x{D835}\x{DFEE}\x{D835}\x{DFF8}]) |  # Digit 2
    ([\x{D835}\x{DFD1}\x{D835}\x{DFDB}\x{D835}\x{DFE5}\x{D835}\x{DFEF}\x{D835}\x{DFF9}]) |  # Digit 3
    ([\x{D835}\x{DFD2}\x{D835}\x{DFDC}\x{D835}\x{DFE6}\x{D835}\x{DFF0}\x{D835}\x{DFFA}]) |  # Digit 4
    ([\x{D835}\x{DFD3}\x{D835}\x{DFDD}\x{D835}\x{DFE7}\x{D835}\x{DFF1}\x{D835}\x{DFFB}]) |  # Digit 5
    ([\x{D835}\x{DFD4}\x{D835}\x{DFDE}\x{D835}\x{DFE8}\x{D835}\x{DFF2}\x{D835}\x{DFFC}]) |  # Digit 6
    ([\x{D835}\x{DFD5}\x{D835}\x{DFDF}\x{D835}\x{DFE9}\x{D835}\x{DFF3}\x{D835}\x{DFFD}]) |  # Digit 7
    ([\x{D835}\x{DFD6}\x{D835}\x{DFE0}\x{D835}\x{DFEA}\x{D835}\x{DFF4}\x{D835}\x{DFFE}]) |  # Digit 8
    ([\x{D835}\x{DFD7}\x{D835}\x{DFE1}\x{D835}\x{DFEB}\x{D835}\x{DFF5}\x{D835}\x{DFFF}])    # Digit 9
REPLACE : (?{01}0)(?{02}1)(?{03}2)(?{04}3)(?{05}4)(?{06}5)(?{07}6)(?{08}7)(?{09}8)(?{10}9)
However, for the regexes that handle the letters, this would NOT be possible because the search exceeds the N++ limit of 2,046 chars!
We could also divide this work into several macros, which would consecutively change all these chars, but it would be tedious work anyway!
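For comparison, the same letter and digit ranges can be turned into a lookup table in a script instead of a hand-built regex. This is only a sketch in plain Python 3: it assumes that NFKC normalization gives the wanted plain character for every assigned code point in the two ranges named above, and the names `RANGES`, `build_table` and `TABLE` are invented for the example.

```python
import unicodedata

# The two ranges discussed above: styled Latin letters and styled digits.
RANGES = [(0x1D400, 0x1D6A3), (0x1D7CE, 0x1D7FF)]

def build_table() -> dict[int, str]:
    """Map each fancy code point to its plain equivalent via NFKC."""
    table = {}
    for first, last in RANGES:
        for cp in range(first, last + 1):
            plain = unicodedata.normalize("NFKC", chr(cp))
            if plain != chr(cp):      # unassigned code points stay unchanged
                table[cp] = plain
    return table

TABLE = build_table()

# U+1D47B is the bold italic T from the example above; U+1D7CE is a bold zero.
print("\U0001D47B\U0001D7CE".translate(TABLE))
```

Because `str.translate` works on code points rather than UTF-16 units, no surrogate pairs and no length limit are involved here.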
So personally, I advise you to simply use this online tool:
https://onlinetools.com/unicode/normalizeunicodetext
to get normal text, and this other one:
https://onlinetools.com/unicode/generateunicodetext
to do the reverse operation, if necessary.
Also have a look at an old post of mine :
Best Regards,
guy038

@guy038 said in How to normalize fancy Unicode text back to regular text?:
I advise you to simply use this online tool
OP has already said that such usage is undesirable; well, I think that is what he is saying with:
Otherwise we are depending on other sources like those websites who support that function and that a disadvantage for npp and npp users
Isn’t the true answer what @PeterJones has shown is possible, with a script?
Such a script could act on selected text when the script is run, and replace that text with the normalized text…pretty simple concept. 
@mkupper said in How to normalize fancy Unicode text back to regular text?:
Some characters do change. Copy/paste the following into a UTF8 encoded tab or file. It should be the same as when you see here on the forums.
I stand corrected. I did not at all expect that to happen. It’s my understanding of “convert to ANSI” that is confused. I apologize.
Very strange:
Open Notepad++, convert empty tab to UTF8, copy your text, paste into tab, I see all the characters.
Copy your text, open Notepad++, convert empty tab to UTF8, paste into tab, I see only ASCII characters.
I have no idea what is going on here.

@AlanKilborn said in How to normalize fancy Unicode text back to regular text?:
Such a script could act on selected text when the script is run, and replace that text with the normalized text…pretty simple concept.
Might as well just make the script now, save others time.
As noted in the docstring of the code, the most obvious difference between NFKD and NFKC seems to be treatment of characters with combining diacritics or umlauts or what have you. Which form is better seems really context-dependent to me; if you’re sorting text, you probably want ö to be an o and then an umlaut (so that ö sorts after o and before p as expected), but if you’re doing regular expression search, you might prefer it to be a single character.

    '''
    requires PythonScript v3 or higher: https://github.com/bruderstein/PythonScript
    ref: https://community.notepadplusplus.org/topic/25285/howtonormalizefancyunicodetextbacktoregulartext/17
    docs: https://docs.python.org/3.10/library/unicodedata.html
    '''
    import unicodedata
    from Npp import *

    def normalize(text):
        '''
        NFKC stands for normalization form compatibility decomposition
        with subsequent canonical composition.
        NFKD works similarly AFAIK; it may be a bit faster, but it has some
        weird behaviors like breaking ö into two characters:
        ASCII "o" and then ̈ whereas NFKC combines those two into a single character.
        '''
        return unicodedata.normalize('NFKC', text)

    selstart = editor.getSelectionStart()
    selend = editor.getSelectionEnd()
    if selstart == selend:
        # nothing selected: normalize the whole document
        text = editor.getText()
        editor.setText(normalize(text))
    else:
        # normalize only the selected text
        text = editor.getSelText()
        editor.replaceSel(normalize(text))
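The NFKC/NFKD difference described here is easy to see with ö alone. A short sketch in plain Python 3 (no Notepad++ needed); the sample list `words` is made up for the illustration.

```python
import unicodedata

composed = unicodedata.normalize("NFKC", "o\u0308")    # o + combining diaeresis
decomposed = unicodedata.normalize("NFKD", "\u00F6")   # precomposed ö

print(len(composed), len(decomposed))
print([hex(ord(c)) for c in decomposed])

# Plain code point order puts ö (U+00F6) after p; sorting on the NFKD form
# places it between o and p instead.
words = ["p", "\u00F6", "o"]
by_code_point = sorted(words)
by_nfkd = sorted(words, key=lambda w: unicodedata.normalize("NFKD", w))
print(by_code_point, by_nfkd)
```

NFKC yields one code point and NFKD yields two, which is the trade-off mentioned above: the decomposed form sorts more naturally, while the composed form is easier to match as a single character.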


Hi, @alankilborn,
I completely agree with your last assumption, and that is why I had already upvoted @peterjones’s post, and I now upvote @markolson’s solution too!
BR
guy038

Hi guys,
Thanks again for your help. Really nice of you all.
Thanks for the hint about the PythonScript versions. I downloaded the latest pre-release version as you did, but could not follow the same steps to enter your example lines. I got some errors trying to execute the print command (an expand error on the for statement, etc.). I entered the same as you did; maybe it was a spacing issue, I’m not sure. But it’s good to know that I need a Python 3.x version, as I was still using the older 2.x version.
Thank you for that example script. I tried it and it seems to work. Great! The results are very good for me, and it works for some of those different symbol styles (not all), whether the text is entirely symbol text or plain text mixed with symbol text. The script works the same as the few websites I found for normalizing symbol text to plain text. That’s very good: I don’t need to use those websites anymore, which was one of my goals. It would be good if npp could get a built-in function for this in a future release, if possible.

@DeanCorso said in How to normalize fancy Unicode text back to regular text?:
Got some errors trying to exec the print command (getting expand error on for statement etc). Just did enter same as you. Maybe some space issue or something not sure.
If you copy/pasted the PythonScript console results (including the version information) like I did above, I bet someone could tell you what happened

Ok I tried again and now I get this out…
    Python 3.12.1 (tags/v3.12.1:2305ca5, Dec 7 2023, 22:03:25) [MSC v.1937 64 bit (AMD64)]
    Initialisation took 204ms
    Ready.
    >>> import unicodedata
    >>> strings = [ '𝖙𝖍𝖚𝖌 𝖑𝖎𝖋𝖊', '𝓽𝓱𝓾𝓰 𝓵𝓲𝓯𝓮', '𝓉𝒽𝓊𝑔 𝓁𝒾𝒻𝑒', '𝕥𝕙𝕦𝕘 𝕝𝕚𝕗𝕖', 'ｔｈｕｇ ｌｉｆｅ', '𝘏𝘦𝘭𝘭𝘰 𝘕𝘰𝘵𝘦𝘱𝘢𝘥 𝘱𝘭𝘶𝘴 𝘱𝘭𝘶𝘴 𝘤𝘰𝘮𝘮𝘶𝘯𝘪𝘵𝘺', '𝙃𝙚𝙡𝙡𝙤 𝙉𝙤𝙩𝙚𝙥𝙖𝙙 𝙥𝙡𝙪𝙨 𝙥𝙡𝙪𝙨 𝙘𝙤𝙢𝙢𝙪𝙣𝙞𝙩𝙮']
    >>> for x in strings:
    ...     print(unicodedata.normalize( 'NFKC', x), x)
…but don’t see the printed output like you have. Did I miss anything to enter in this case?
PS: About that error before, I see I forgot to enter another white space before last print command.

If your PythonScript console prompt is still ... instead of >>>, you will need to enter a blank line (no whitespace) to tell the console to end the loop. It won’t run the loop until you do.