Hello, @haleba-hotmail, @peterjones and All,
First, in your post, you’re speaking about 2 characters : one inside the Basic Multilingual Plane ( BMP ) and the other outside the BMP. These are :
The KATAKANA letter TU ( = TSU ) ツ ( \x{30C4} ), from the Unicode block Katakana, in range 30A0–30FF
The SHRUG 🤷 portrait symbol ( \x{1F937} ) from the Unicode block Supplemental Symbols and Pictographs, in range 1F900–1F9FF
The main characteristics of these two chars are :
Character ツ
Character name KATAKANA LETTER TU
Hex code point 30C4
Decimal code point 12484
Hex UTF-8 bytes E3 83 84
Octal UTF-8 bytes 343 203 204
UTF-8 bytes as Latin-1 characters bytes ã <83> <84>
and
Character 🤷
Character name SHRUG
Hex code point 1F937
Decimal code point 129335
Hex UTF-8 bytes F0 9F A4 B7
Octal UTF-8 bytes 360 237 244 267
UTF-8 bytes as Latin-1 characters bytes ð <9F> ¤ ·
Hex UTF-16 Surrogates D83E DD37
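For those who prefer to check these values locally rather than with the on-line tool, here is a minimal Python 3 sketch that recomputes the same figures. The helper name `describe` is just an illustration of mine, not part of any tool mentioned here :

```python
import unicodedata


def describe(ch):
    """Return the main encoding facts for a single character."""
    cp = ord(ch)
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")
    # Split the UTF-16 bytes into 16-bit code units (2 units = surrogate pair)
    units = [utf16[i:i + 2].hex().upper() for i in range(0, len(utf16), 2)]
    return {
        "name": unicodedata.name(ch),
        "hex_code_point": f"{cp:X}",
        "decimal_code_point": cp,
        "utf8_hex": " ".join(f"{b:02X}" for b in utf8),
        "utf8_octal": " ".join(f"{b:o}" for b in utf8),
        "utf16_units": units,
        "in_bmp": cp <= 0xFFFF,
    }


# Escapes are used so the script does not depend on the file's own encoding
for ch in ("\u30C4", "\U0001F937"):
    for key, value in describe(ch).items():
        print(f"{key:20} {value}")
    print()
```

A character outside the BMP shows up with two UTF-16 code units ( the surrogate pair ) and `in_bmp` set to `False`.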
I got information on these characters from a useful on-line UTF-8 tool, described in the last section of the post below :
https://community.notepad-plus-plus.org/post/50983
I must say that I did not pay attention, until now, to the Converter plugin, of @don-ho !!
Seemingly, if you select one or several consecutive characters and use the option Plugins > Converter > ASCII -> HEX, it correctly writes the hexadecimal byte(s) needed to encode the character(s) in UTF-8, or in ANSI for the 256-character block that this encoding allows !
IMPORTANT : Even if your current encoding is UCS-2 BE BOM or UCS-2 LE BOM, it still shows the hexadecimal bytes used to encode the character(s) in a UTF-8 or UTF-8 BOM file :-( In any case, it’s best to avoid these two encodings because they cannot handle characters which are outside the BMP, like your SHRUG symbol !
For instance, in a UTF-8 file, selecting the string 🤷Aツé and then using the option Plugins > Converter > ASCII -> HEX gives the result F09FA4B741E38384C3A9, because :
The 🤷 character is coded with the 4-byte UTF-8 sequence F09FA4B7
The A character is coded with the 1-byte UTF-8 sequence 41
The ツ character is coded with the 3-byte UTF-8 sequence E38384
The é character is coded with the 2-byte UTF-8 sequence C3A9
And, in an ANSI file, the selection of the string Aé, with the option Plugins > Converter > ASCII -> HEX gives the result 41E9 because :
The A character is coded with the 1-byte ANSI sequence 41
The é character is coded with the 1-byte ANSI sequence E9
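The two results above can be reproduced outside Notepad++ with a short Python 3 sketch. Two assumptions here : `to_hex` is a hypothetical helper name of mine, and "ANSI" is taken to mean the Windows-1252 code page ( `cp1252` ), which may differ on your system :

```python
def to_hex(text, encoding):
    """Encode text with the given encoding and return its bytes as upper-case hex."""
    return text.encode(encoding).hex().upper()


# The string is written with escapes : SHRUG, A, KATAKANA LETTER TU, e-acute
sample = "\U0001F937A\u30C4\u00E9"

print(to_hex(sample, "utf-8"))        # bytes of the 4 characters in a UTF-8 file
print(to_hex("A\u00E9", "cp1252"))    # bytes of the 2 characters, assuming ANSI = Windows-1252
```

Note that the é must be the single precomposed character \x{00E9} : a plain e followed by a combining acute accent would give different bytes.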
In the same way, if you select one or several consecutive hexadecimal bytes and use the option Plugins > Converter > HEX -> ASCII, it correctly writes the corresponding character(s), displayed with the glyphs of the current font, in a UTF-8 or ANSI file. For instance, selecting the sequence F09FA4B741E38384C3A9 does give back our 4-character string 🤷Aツé
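The reverse direction can be sketched in Python 3 as well. Again, `from_hex` is only an illustrative name of mine, and `cp1252` is my assumption for what "ANSI" means on a Western Windows system :

```python
def from_hex(hex_string, encoding):
    """Turn a string of hexadecimal bytes back into text, using the given encoding."""
    return bytes.fromhex(hex_string).decode(encoding)


text = from_hex("F09FA4B741E38384C3A9", "utf-8")
print(text, len(text))             # the decoded string and its number of characters
print(from_hex("41E9", "cp1252"))  # same idea for an ANSI ( Windows-1252 ) file
```

Decoding with the wrong encoding either raises an error or produces different characters, which is why the file's encoding matters for HEX -> ASCII.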
Now, regarding the different Windows input methods, I strongly advise you to read this post first, where I recapitulate all of them :
https://community.notepad-plus-plus.org/topic/18903/regex-misidentifying-foreign-characters/6
And, in its last section, look for the reference to a nice monospaced font, which correctly displays the vast majority of Unicode characters, even those which are outside the BMP
As said in that post, after modifying the registry ( be careful ! ), you may directly insert, for instance, the KATAKANA letter TU, following these steps :
Hold down the Alt key and, successively :
Hit the + key, on the numeric keypad
Hit the 3 key, on the numeric keypad
Hit the 0 key, on the numeric keypad
Hit the C key, on the main keyboard
Hit the 4 key, on the numeric keypad
Release the Alt key
=> Immediately, the ツ character should be inserted at cursor location ;-))
However, note that the Shrug symbol cannot be inserted, even with this powerful input method, because its code point 1F937 is greater than \x{FFFF} ! In that case, you’ll have to use an on-line tool, such as the UTF-8 tool described above, to get characters in the range \x{10000} - \x{10FFFF} from their Unicode code point !
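As a possible off-line alternative, assuming you have a Python 3 interpreter at hand, the character can also be built from its hexadecimal code point and then copied from the console into Notepad++. This is only a sketch, and `char_from_code_point` is a hypothetical helper name :

```python
def char_from_code_point(hex_code_point):
    """Return the character for a hexadecimal Unicode code point such as '1F937'."""
    cp = int(hex_code_point, 16)
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range 0 - 10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not characters on their own")
    return chr(cp)


# Whether the glyph is displayed depends on the console and its font
print(char_from_code_point("1F937"))
print(char_from_code_point("30C4"))
```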
Best Regards,
guy038
P.S. : I started writing this post before the @peterjones reply. So, some parts may be redundant ;-))