Interpret an Unicode value as real character in Notepad++
-
@PeterJones actually, now that I re-read your post, I would be interested in your script for some other Unicode stuff (from time to time), but when I click the link it gives me a 404 error (page does not exist)
-
@vaso-blg said in Interpret an Unicode value as real character in Notepad++:
@PeterJones actually, now as I re-read your post I would be interested in your script, but when I click the link it gives me 404 error (page does not exist)
Oh, right, I copied an old link, but I had moved it to a subdirectory after the last time I posted a link to it. It’s now at https://github.com/pryrt/nppStuff/blob/main/pythonScripts/useful/pyscReplaceBackslashSequence.py
-
@PeterJones yes, I already found it myself by exploring that page and already tested it, great, works as you said it would - very good addition to my solution (Snippets), thank you!
-
@PeterJones BTW don’t you have a reversed script, that would convert all the selected text into the Unicode values? That could be very interesting option to have indeed!!!
-
@vaso-blg said in Interpret an Unicode value as real character in Notepad++:
don’t you have a reversed script, that would convert all the selected text into the Unicode values?
Not quite, but similar:
https://github.com/pryrt/nppStuff/blob/main/pythonScripts/useful/WhatUniChar.py
It will update the status bar (until Notepad++'s next screen refresh) to show the codepoint of the single character at the typing caret. (It doesn’t do the whole selection, just a single character.)
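For anyone who does want the whole-selection variant asked about above, here is a minimal sketch in plain Python 3. It is not @PeterJones's script and it does not touch the Notepad++ or PythonScript API; the function name `to_code_points` and the `prefix` parameter are made up for illustration, and the text is passed in as an ordinary string rather than read from the editor.

```python
def to_code_points(text, prefix="U+"):
    """Return one prefixed hex code point per character of `text`.

    Hypothetical helper for illustration only. In Python 3 each element
    of a str is a full code point, so characters above U+FFFF come out
    as a single value rather than as a surrogate pair.
    """
    return " ".join("{}{:04X}".format(prefix, ord(ch)) for ch in text)


if __name__ == "__main__":
    # U+1D400 is the character from the original question.
    print(to_code_points("A\U0001D400"))  # U+0041 U+1D400
```

Wiring this up to the current selection would be the remaining step, and how to do that depends on the scripting plugin in use.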
-
@PeterJones perfect + believe it or not, but once again I had already downloaded exactly this script as the 2nd one, expecting it to do something like that, although I could not figure out how it operates and thought it was not functioning or something (so luckily now I know that I have to look at the status bar and set the caret at the beginning instead of the end - one of the mistakes I was making before you explained how it operates, haha - thank you!).
-
For future visitors of this thread, note that @PeterJones’s alternative suggestion of surrogate pairs does in fact work with HTML Tag, since version 1.4 at least:
Conversions are reversible, i.e., literal Unicode pasted from the Web with code points above U+D800 will also be encoded as surrogates.
You can convert to and from the commonly used U+0000 format once you have configured the prefix in the settings.
There is currently a hard limit of U+DBFF for convertible code points, and only the first 4 digits are read. So, for example, U+1D400 will become \u1D40 -> ᵀ with the last 0 remaining as is, as reported above.
The official bug tracker is on GitHub.
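To make the two behaviours above concrete, here is a small Python 3 sketch. It is not the plugin's code (HTML Tag is written in Pascal); the function names are invented for illustration. The first part applies the standard UTF-16 arithmetic to get the surrogate pair for U+1D400, and the second part merely imitates the reported "only the first 4 digits are read" result.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogates")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def escape_as_surrogates(code_point, prefix="\\u"):
    """Format the surrogate pair as two prefixed 4-digit escapes."""
    high, low = to_surrogate_pair(code_point)
    return "{0}{1:04X}{0}{2:04X}".format(prefix, high, low)


def truncated_read(hex_digits):
    """Imitate the reported limitation: decode 4 digits, keep the rest."""
    return chr(int(hex_digits[:4], 16)), hex_digits[4:]


if __name__ == "__main__":
    print(escape_as_surrogates(0x1D400))  # \uD835\uDC00
    first, rest = truncated_read("1D400")
    print("U+{:04X}".format(ord(first)), "followed by", repr(rest))
    # U+1D40 followed by '0'
```

So the workaround is to type the pair \uD835\uDC00 instead of the five-digit value.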
-
The free and open source software WinCompose is an option.
This program converts one of your keys (by default the right Alt key) into a Compose key, which you press and release, then enter a short mnemonic sequence to choose a character. For example, Right-Alt a " to generate ä or Right-Alt o c to generate ©.
You can use Right-Alt u xxx Enter to generate a Unicode character. It works for the original poster’s example: Right-Alt u 1 d 4 0 0 Enter gives 𝐀.
I don’t use it on a regular basis, so I can’t comment on its stability; I used the portable version to write the text above, and also verified that it works in Notepad++.
(Note: I replaced my earlier version of this comment because I discovered the website for this software links to an out-of-date version. The GitHub version appears to be the most recent.)
-
FYI
The HTML Tag wiki now includes a more accurate description of its decoding limitations. Not sure yet if this can be improved while the source code is targeting the Free Pascal runtime, which tends to favour UTF-8 as being more compatible with the many different platforms it supports. Perhaps for the sake of legacy Delphi code, the WideString type is an exception:
WideStrings consist of COM compatible UTF16 encoded bytes on Windows machines (UCS2 on Windows 2000), and they are encoded as plain UTF16 on Linux, Mac OS X and iOS.
The PE header of a recent plugin DLL shows “4.0” as the minimum required OS version, which is even older than Windows 2000, so it’s possible that Unicode text is actually encoded as UCS2 (!).
The original developer seems to have assumed that a signed 16-bit SmallInt would be enough for all potential code points. They’ve been stored as 32-bit unsigned integers for a long time now, so there’s really no excuse for not extending the logic to decompose ordinals north of U+010000 into surrogate pairs and feeding them back into the decoder.
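As a rough illustration of why 16 signed bits fall short, and of the recombination step a decoder needs, here is a Python 3 sketch. It is only an analogy: `ctypes.c_int16` stands in for Pascal's SmallInt, and the function names are hypothetical, not taken from the plugin source.

```python
import ctypes


def as_smallint(value):
    """Wrap an integer into a signed 16-bit value.

    ctypes integer types do no overflow checking, so this shows what is
    left of a code point when it is squeezed into 16 signed bits.
    """
    return ctypes.c_int16(value).value


def combine_surrogates(high, low):
    """Rebuild a code point from a UTF-16 high/low surrogate pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    print(as_smallint(0x7FFF))   # 32767, the largest value that fits
    print(as_smallint(0x1D400))  # -11264, the code point is lost
    print("U+{:04X}".format(combine_surrogates(0xD835, 0xDC00)))  # U+1D400
```

A signed 16-bit integer tops out at 32767 (0x7FFF), so even values inside the BMP above U+7FFF, including the surrogates themselves, only survive if the bit pattern is reinterpreted as unsigned.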