Seeking Clarification on Entering Alt Keypad Characters

haleba-hotmail

I was trying to find the keystrokes needed to create the “shrug” emoji. The middle Unicode character 0xE38384 ( ツ - Japanese Katakana Tu or Tsu) seems to require holding Alt, pressing the numpad +, and then typing the full hex sequence, but when I use Alt with the numpad plus there is no response.

Alt with the numpad digits works fine, but any other key used with Alt triggers menu items.

My workaround was to use the Converter plugin’s HEX -> ASCII command, which is OK, but I’m curious to confirm whether Alt plus the numpad plus sign just won’t work.

Thanks in advance.

PeterJones

@haleba-hotmail said in Seeking Clarification on Entering Alt Keypad Characters:

middle Unicode character 0xE38384 ( ツ - Japanese Katakana Tu or Tsu)

You have one definite misunderstanding, and maybe a second.

First, 0xE38384 is not the Unicode code point for ツ (¹). U+30C4 is the code point. 0xE38384 is the three-byte sequence that encodes the U+30C4 character, ツ, in UTF-8. You should be using the ALT +30C4 sequence to enter that character in any Windows application, not just Notepad++.
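To make the distinction concrete, here is a minimal Python sketch (Python is just my choice for illustration; it is not part of Notepad++ or the Converter plugin) that prints both the code point and the UTF-8 bytes for the same character:

```python
# Compare the Unicode code point of a character with its UTF-8 byte sequence.
ch = "ツ"

code_point = ord(ch)             # the number Unicode assigns to the character
utf8_bytes = ch.encode("utf-8")  # how that number is serialized in UTF-8

print(f"U+{code_point:04X}")     # U+30C4
print(utf8_bytes.hex().upper())  # E38384
```

The code point is what the ALT + method expects; the byte sequence is what ends up in a UTF-8 file.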

The second, which I’m not sure whether or not you understand, is that the + in this case isn’t just saying “hold down the alt key while typing the rest of the sequence”. The plus is actually part of the sequence: ALT +30C4 means “hold down ALT key, then type + on the numeric keypad, then type 30C4, where the 3 and 0 and 4 must all also be on the numeric keypad”.

That said, there are a couple more caveats for the ALT +30C4 sequence.

As detailed at http://www.fileformat.info/tip/microsoft/enter_unicode.htm, in the “Method 1: Universal” section:

Alas, this appears to require a registry setting. It was already set on my computer, but some readers report that this method didn’t work for them, and this is probably why. If you don’t know what the registry is, please don’t try this. Under HKEY_Current_User/Control Panel/Input Method, set EnableHexNumpad to “1”. If you have to add it, set the type to be REG_SZ.
Sometimes, it has to do with timing: it can be difficult to get all 5 of those characters typed before Windows gives up and starts interpreting them as individual keystrokes again, instead of as the Unicode-entry escape combo.

FOOTNOTE 1: The first misunderstanding was probably compounded by the way that HEX->ASCII works.
The reason Plugins > Converter > Hex->ASCII has you enter E38384 is that it works with bytes, not characters, and it internally uses UTF-8. After it converts the 6 hex nibbles into 3 bytes, it recognizes those three bytes as a single character; when it pastes back into the Notepad++ editor pane, it converts that character to the appropriate encoding for the active editor. For example, I have a file containing E38384 saved as UCS-2 LE BOM:

C:\Users\peter.jones\Downloads\TempData\nppCommunity>xxd ucs2le.txt
00000000: fffe 4500 3300 3800 3300 3800 3400       ..E.3.8.3.8.4.

When I select those six characters and run the HEX->ASCII command, it enters the ツ character. After I save, I have this on disk:

C:\Users\peter.jones\Downloads\TempData\nppCommunity>xxd ucs2le.txt
00000000: fffe c430                                ...0

which is the little-endian encoding of the BOM followed by U+30C4.
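The two dumps above can be reproduced with a short Python sketch. This only mimics the byte-level behaviour described in this footnote; it is not the plugin's actual code, and the variable names are mine:

```python
import codecs

selected_text = "E38384"  # the six hex digits selected in the editor

# Before conversion: six ASCII characters stored as UTF-16 LE with a BOM.
before = codecs.BOM_UTF16_LE + selected_text.encode("utf-16-le")

# Hex digits -> raw bytes -> one character (treating the bytes as UTF-8).
ch = bytes.fromhex(selected_text).decode("utf-8")

# After conversion: the single character stored as UTF-16 LE with a BOM.
after = codecs.BOM_UTF16_LE + ch.encode("utf-16-le")

print(before.hex())  # fffe450033003800330038003400
print(after.hex())   # fffec430
```

Both hex strings match the xxd output shown above, byte for byte.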

guy038

Hello, @haleba-hotmail, @peterjones and All,

First, in your post, you’re speaking about 2 characters: one that is part of the Basic Multilingual Plane ( BMP ) and another that is outside the BMP. These are :

The KATAKANA letter TU ( = TSU ) ツ ( \x{30C4} ), from the Unicode block Katakana, in range 30A0–30FF
The SHRUG 🤷 portrait symbol ( \x{1F937} ) from the Unicode block Supplemental Symbols and Pictographs, in range 1F900–1F9FF

The main characteristics of these two chars are :

Character                                   ツ
Character name                              KATAKANA LETTER TU
Hex code point                              30C4
Decimal code point                          12484
Hex UTF-8 bytes                             E3 83 84
Octal UTF-8 bytes                           343 203 204
UTF-8 bytes as Latin-1 characters bytes     ã <83> <84>

and

Character                                   🤷
Character name                              SHRUG
Hex code point                              1F937
Decimal code point                          129335
Hex UTF-8 bytes                             F0 9F A4 B7
Octal UTF-8 bytes                           360 237 244 267
UTF-8 bytes as Latin-1 characters bytes     ð <9F> ¤ ·
Hex UTF-16 Surrogates                       D83E DD37
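For anyone who wants to check these values without the on-line tool, here is a small Python sketch that derives the same rows. The function name `describe` and the dictionary keys are my own, purely for illustration:

```python
def describe(ch):
    """Return code point and encoding details for a single character."""
    cp = ord(ch)
    utf8 = ch.encode("utf-8")
    info = {
        "hex_code_point": f"{cp:X}",
        "decimal_code_point": cp,
        "utf8_hex": " ".join(f"{b:02X}" for b in utf8),
        "utf8_octal": " ".join(f"{b:o}" for b in utf8),
    }
    if cp > 0xFFFF:
        # Outside the BMP: UTF-16 needs a surrogate pair.
        offset = cp - 0x10000
        high = 0xD800 + (offset >> 10)   # top 10 bits of the offset
        low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits of the offset
        info["utf16_surrogates"] = f"{high:04X} {low:04X}"
    return info

print(describe("ツ"))
print(describe("🤷"))
```

The surrogate pair only appears for the second character, because only code points above FFFF need one.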

I got the information on these characters from a useful on-line UTF-8 tool, described in the last section of the post below :

https://community.notepad-plus-plus.org/post/50983

I must say that I did not pay attention, until now, to the Converter plugin, of @don-ho !!

Seemingly, if you select one or several consecutive character(s) and use the option Plugins > Converter > ASCII -> HEX, it correctly writes the hexadecimal byte(s) needed to encode this/these character(s) in UTF-8, or in ANSI for the 256 characters of the current code page !
IMPORTANT : Even if your current encoding is UCS-2 BE BOM or UCS-2 LE BOM, it still shows the hexadecimal bytes used in an UTF-8 or an UTF-8 BOM file to encode this/these characters :-( In any case, it’s best to avoid these two encodings because they cannot handle characters which are beyond the BMP, like your SHRUG symbol !

For instance, in an UTF-8 file, the selection of the string 🤷Aツé and then the option Plugins > Converter > ASCII -> HEX gives the result F09FA4B741E38384C3A9, because :

The 🤷 character is coded with the 4-byte UTF-8 sequence F09FA4B7
The A character is coded with the 1-byte UTF-8 sequence 41
The ツ character is coded with the 3-byte UTF-8 sequence E38384
The é character is coded with the 2-byte UTF-8 sequence C3A9

And, in an ANSI file, the selection of the string Aé, with the option Plugins > Converter > ASCII -> HEX gives the result 41E9 because :

The A character is coded with the 1-byte ANSI sequence 41
The é character is coded with the 1-byte ANSI sequence E9

In the same way, if you select one or several consecutive hexadecimal bytes and use the option Plugins > Converter > HEX -> ASCII, it correctly writes the corresponding glyphs of this/these character(s), produced by the current font, in an UTF-8 or ANSI file. For instance, selecting the sequence F09FA4B741E38384C3A9 does give back our 4-character string 🤷Aツé
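Both directions can be imitated in a few lines of Python. This is only a sketch of the observed behaviour, not the plugin's implementation; the function names are hypothetical, and I am assuming that ANSI means the Windows-1252 code page ( cp1252 ), which may differ on your system :

```python
def ascii_to_hex(text, encoding="utf-8"):
    """Encode text and return its bytes as upper-case hex digits."""
    return text.encode(encoding).hex().upper()

def hex_to_ascii(hex_digits, encoding="utf-8"):
    """Turn a string of hex digits back into text."""
    return bytes.fromhex(hex_digits).decode(encoding)

print(ascii_to_hex("🤷Aツé"))                # F09FA4B741E38384C3A9
print(ascii_to_hex("Aé", "cp1252"))          # 41E9 (ANSI assumed to be cp1252)
print(hex_to_ascii("F09FA4B741E38384C3A9"))  # 🤷Aツé
```

Note that the é here is the single precomposed character U+00E9; a decomposed e followed by a combining accent would give different bytes.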

Now, regarding the different Windows input methods, I strongly advise you to read this post first, where I summarize all of them :

https://community.notepad-plus-plus.org/topic/18903/regex-misidentifying-foreign-characters/6

And, in its last section, look for the reference to a nice monospaced font, which correctly displays the vast majority of the Unicode characters, even those which are outside the BMP

As said in that post, after modifying the registry ( be careful ! ), you may directly insert, for instance, the KATAKANA letter TU, following these steps :

Hold down the Alt key and, successively :
Hit the + key, on the numeric keypad
Hit the 3 key, on the numeric keypad
Hit the 0 key, on the numeric keypad
Hit the C key, on the main keyboard
Hit the 4 key, on the numeric keypad
Release the Alt key

=> Immediately, the ツ character should be inserted at cursor location ;-))

However, note that the Shrug symbol cannot be inserted, even using this powerful input method, because its code-point 1F937 is greater than \x{FFFF} ! In that case, you’ll have to use another way to get characters in the range \x{10000} - \x{10FFFF} from their Unicode code-point, for instance the on-line UTF-8 tool described above !
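If you have Python at hand, a small script can also produce such a character from its code-point, ready to be copied and pasted. This is only a sketch : the function names are mine, and the FFFF limit is simply the one stated above for the Alt + method, not something the script verifies against Windows :

```python
def char_from_code_point(code_point):
    """Return the character for a Unicode code point (0 to 10FFFF)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return chr(code_point)

def fits_alt_plus_hex(code_point):
    """Assumption from this post: Alt + hex only reaches code points up to FFFF."""
    return code_point <= 0xFFFF

for cp in (0x30C4, 0x1F937):
    print(f"U+{cp:04X}", char_from_code_point(cp), fits_alt_plus_hex(cp))
```

The first line of output reports True for ツ and the second reports False for the SHRUG symbol.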

Best Regards,

guy038

P.S. : I started writing this post before @peterjones replied, so some parts may be redundant ;-))