Interpret an Unicode value as real character in Notepad++

vaso blg

If I copy a Unicode character in its rendered form (for example, U+1D400 from Wikipedia’s Mathematical_Alphanumeric_Symbols page) and paste it into Notepad++, it really does show up as a “bold” character.

But when I type it manually in Notepad++ as a Unicode value (typed as \u1D400) and use the plugin command HTML Tag > Decode JS (as advised in the post called “replace-unicode-values-with-characters-e-g-u00e9-with-%C3%A9”), it does not render as the “bold” character. What’s even worse, it converts it into two different characters!

Why is that, and how can I make Notepad++ convert the value into the same character it shows when I simply copy and paste it, already rendered, from a webpage?

Am I doing something wrong? Can anyone show me the proper way to type the actual Unicode value into Notepad++ and have it converted to the correct character?

mkupper

@vaso-blg - Unfortunately, the usual tips and tricks you will find on the Internet only work for U+0000 to U+FFFF. They usually fail for Unicode characters in the range U+10000 to U+10FFFF, which includes U+1D400. I have always done what you do, which is to copy and paste characters such as 𝐀 into Notepad++.

Hopefully, someone else here will know of a trick that works within Notepad++ and allows for more direct entry of 1D400 and getting a 𝐀.

PeterJones

@mkupper said in Interpret an Unicode value as real character in Notepad++:

Hopefully, someone else here will know of a trick that works within Notepad++ and allows for more direct entry of 1D400 and getting a 𝐀.

I know of at least two ways:

- If you know the surrogate code:
  - Example: for 𝐀, the page https://www.fileformat.info/info/unicode/char/1d400/index.htm lists the C/C++/Java source code as "\uD835\uDC00", so the surrogate code is U+D835 U+DC00.
  - Hold down Alt with one hand and, with the other, type +D835 (*); release Alt, then hold down Alt again and type +DC00 (*).
    (*: the + and all digits must be on the numeric keypad; the D and any other non-digit hex characters can be on the normal keyboard)
  - Once you have typed both hex sequences, 𝐀 will appear.
- Use my pyscReplaceBackslashSequence.py script in the PythonScript plugin: type \u1D400, then, with the cursor just after it, run my script; it will convert it into 𝐀.

The first requires knowing the surrogate code, having enabled the right registry key to allow Unicode Alt codes, having a keyboard that still has a numeric keypad, and getting good at typing those sequences. The second requires knowing the full codepoint and having the PythonScript plugin and my script (and is made easier if you map my script to a keyboard shortcut). Neither is reasonable if you have a large set of special characters that you want to be able to insert but have not memorized them all.
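For anyone who wants to work out the surrogate code without looking it up, here is a minimal sketch of the standard UTF-16 arithmetic in plain Python 3. It is a standalone illustration, not part of my script, and the function name `to_surrogate_pair` is just something made up for this example.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into its UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as a surrogate pair")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> low (trail) surrogate
    return high, low


high, low = to_surrogate_pair(0x1D400)
print("U+{:04X} U+{:04X}".format(high, low))  # U+D835 U+DC00
```

The two values printed are the ones you would type with the Alt sequences described above.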

I usually just search that fileformat.info site for the Unicode characters I want, or launch charmap.exe (which I have in my Run command menu). But if you had an “emoji keyboard” app or some such, it might make finding the right emoji easier (similar to smartphone keyboard emoji inputs). (Caveat: I have adblocker on fileformat.info, so its gazillion ads don’t bother me; I tend to forget that it’s ad-intensive when I recommend other people use it.)

vaso blg

@PeterJones thank you. In fact, as the final solution I just used the Snippets plugin: I simply added all the characters there, and now I can insert them as needed. Simple and useful solution + I do not need to remember the codes :-D

vaso blg

@PeterJones actually, now as I re-read your post I would be interested in your script, for some other Unicode stuff (from time-to-time) but when I click the link it gives me 404 error (page does not exist)

PeterJones

@vaso-blg said in Interpret an Unicode value as real character in Notepad++:

@PeterJones actually, now as I re-read your post I would be interested in your script, but when I click the link it gives me 404 error (page does not exist)

Oh, right, I copied an old link, but I had moved it to a subdirectory after the last time I posted a link to it. It’s now at https://github.com/pryrt/nppStuff/blob/main/pythonScripts/useful/pyscReplaceBackslashSequence.py

vaso blg

@PeterJones yes, I already found it myself by exploring that page and already tested it, great, works as you said it would - very good addition to my solution (Snippets), thank you!

vaso blg

@PeterJones BTW don’t you have a reversed script, that would convert all the selected text into the Unicode values? That could be very interesting option to have indeed!!!

PeterJones

@vaso-blg said in Interpret an Unicode value as real character in Notepad++:

don’t you have a reversed script, that would convert all the selected text into the Unicode values?

Not quite, but similar:
https://github.com/pryrt/nppStuff/blob/main/pythonScripts/useful/WhatUniChar.py

It will update the status bar (until Notepad++'s next screen refresh) to show the codepoint of the single character at the typing caret. (It doesn’t do the whole selection, just a single character).
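If you do want the values for a whole piece of text, the idea is simple enough to sketch in plain Python 3. This is a standalone illustration, not WhatUniChar.py and not wired into Notepad++; the function name `to_code_points` is hypothetical, and hooking it up to the editor selection through PythonScript is left out.

```python
def to_code_points(text):
    """Return the code point of every character in text as U+XXXX notation."""
    # In Python 3, iterating a str yields whole code points,
    # so a character like U+1D400 is seen as one item, not two surrogates.
    return " ".join("U+{:04X}".format(ord(ch)) for ch in text)


# Escapes are used here so the example does not depend on the file encoding.
print(to_code_points("\U0001D400\u00e9"))  # U+1D400 U+00E9
```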

vaso blg

@PeterJones perfect. Believe it or not, I had already downloaded exactly this script as the second one, expecting it to do something like that, but I could not figure out how it operates and thought it was not working. Luckily, now I know that I have to look at the status bar and place the caret at the beginning of the character instead of the end (one of the mistakes I was making before you explained how it operates, haha). Thank you!

rdipardo

For future visitors of this thread, note that @PeterJones’s alternative suggestion of surrogate pairs does in fact work with HTML Tag, since version 1.4 at least:

[Image: htmltag-U+1D400-decode]

Conversions are reversible, i.e., literal Unicode pasted from the Web with code points above U+D800 will also be encoded as surrogates.
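As a rough illustration of what decoding a surrogate pair involves, here is a sketch in plain Python 3. It is not HTML Tag's code (the plugin is written in Pascal); the name `decode_surrogate_escapes` and the regular expression are assumptions made for this example, and it only handles well-formed `\uXXXX\uXXXX` pairs.

```python
import re

# A high surrogate (D800..DBFF) immediately followed by a low surrogate (DC00..DFFF).
PAIR = re.compile(r"\\u(D[89AB][0-9A-F]{2})\\u(D[C-F][0-9A-F]{2})", re.IGNORECASE)


def decode_surrogate_escapes(text):
    """Replace each escaped surrogate pair with the single character it encodes."""
    def join(match):
        high = int(match.group(1), 16)
        low = int(match.group(2), 16)
        # Reverse of the UTF-16 split: 10 bits from each half, plus 0x10000.
        return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))
    return PAIR.sub(join, text)


decoded = decode_surrogate_escapes(r"\uD835\uDC00")
print("U+{:04X}".format(ord(decoded)))  # U+1D400
```

Anything that is not a complete pair is left untouched by this sketch.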

You can convert to and from the commonly used U+0000 format once you have configured the prefix in the settings:

[Image: htmltag-U+1D400-encode]

There is currently a hard limit of U+DBFF for convertible code points, and only the first 4 digits are read. So, for example, U+1D400 will become \u1D40 -> ᵀ with the last 0 remaining as is, as reported above.
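To make the four-digit behaviour concrete, here is a small model of it in plain Python 3. This only imitates the behaviour described above; it is not the plugin's implementation, and the function name `decode_four_digits` is invented for the example.

```python
import re


def decode_four_digits(text):
    # Model of the described limit: exactly 4 hex digits after the
    # backslash-u prefix are consumed; anything after them stays as typed.
    return re.sub(
        r"\\u([0-9A-Fa-f]{4})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )


result = decode_four_digits(r"\u1D400")
# Two characters come out: U+1D40 and the leftover digit.
print(["U+{:04X}".format(ord(ch)) for ch in result])  # ['U+1D40', 'U+0030']
```

That matches the two characters the original poster saw instead of 𝐀.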

The official bug tracker is on GitHub.

Coises

The free and open source software WinCompose is an option.

This program converts one of your keys (by default the right Alt key) into a Compose key, which you press and release, then enter a short mnemonic sequence to choose a character. For example, Right-Alt a " to generate ä or Right-Alt o c to generate ©.

You can use Right-Alt u xxx Enter to generate a Unicode character. It works for the original poster’s example: Right-Alt u 1 d 4 0 0 Enter gives 𝐀.

I don’t use it on a regular basis, so I can’t comment on its stability; I used the portable version to write the text above, and also verified that it works in Notepad++.

(Note: I replaced my earlier version of this comment because I discovered the website for this software links to an out-of-date version. The GitHub version appears to be the most recent.)

rdipardo

FYI

The HTML Tag wiki now includes a more accurate description of its decoding limitations. Not sure yet if this can be improved while the source code is targeting the Free Pascal runtime, which tends to favour UTF-8 as being more compatible with the many different platforms it supports. Perhaps for the sake of legacy Delphi code, the WideString type is an exception:

WideStrings consist of COM compatible UTF16 encoded bytes on Windows machines (UCS2 on Windows 2000), and they are encoded as plain UTF16 on Linux, Mac OS X and iOS.

The P.E. header of a recent plugin DLL shows “4.0” as the minimum required OS version, even older than Windows 2K, so it’s possible that Unicode text is actually encoded as UCS2 (!).

The original developer seems to have assumed that a signed 16-bit SmallInt would be enough for all potential code points. They’ve been stored as 32-bit unsigned integers for a long time now, so there’s really no excuse for not extending the logic to decompose ordinals at or above U+10000 into surrogate pairs and feeding them back into the decoder.