@PeterJones said in Korean script letters do not group in Notepad and Notepad++ (Help Wanted):
I tried pasting 하ᄂ구ᄀ into LibreOffice Writer, and it combines some of the pieces but not all:
[screenshot] It also uses the NSimSum font when I paste those characters. And though I pasted them in as six characters next to each other, once they are in, Writer treats them as 3 characters.
I’ve been trying to figure out what is going on here. I know nothing about the Korean language, but I did come across the Hangul jamo while implementing the Unicode rules for determining grapheme cluster boundaries as part of my Unicode-based search.
I don’t recommend trying to read that unless you have a lot of free time and brain power. The bottom line for this purpose is that Hangul syllables are written as separate characters, each of which can be composed of multiple Unicode code points. The most common sequence is LeadingConsonant+Vowel+TrailingConsonant, though there are extensions and complications.
Like many other Unicode characters, Korean characters come in composed and decomposed forms. I think all the modern Korean syllables have a single-code-point composed form. They're usually typed as individual jamo and assembled into composed form by an Input Method Editor, although the decomposed form is also valid Unicode.
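As an illustration of how the composed form relates to the jamo, here is a minimal Python sketch of the composition arithmetic as I understand it from the Unicode Standard's Hangul section. The function name `compose_syllable` is my own, and the sketch only covers the modern jamo ranges, not the extensions.

```python
# Constants from the Unicode Standard's Hangul syllable composition arithmetic.
S_BASE = 0xAC00   # first precomposed syllable
L_BASE = 0x1100   # first leading consonant
V_BASE = 0x1161   # first vowel
T_BASE = 0x11A7   # one below the first trailing consonant (index 0 = no trailing)
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def compose_syllable(lead, vowel, trail=None):
    """Combine modern jamo into one precomposed syllable (hypothetical helper)."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(trail) - T_BASE if trail is not None else 0
    if not 0 <= l_index < L_COUNT:
        raise ValueError("not a modern leading consonant")
    if not 0 <= v_index < V_COUNT:
        raise ValueError("not a modern vowel")
    if trail is not None and not 1 <= t_index < T_COUNT:
        raise ValueError("not a modern trailing consonant")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


# U+1112 + U+1161 + U+11AB should give U+D55C.
print(hex(ord(compose_syllable("\u1112", "\u1161", "\u11ab"))))
```

Passing a leading-form consonant such as U+1102 as the third argument raises `ValueError`, because it lies outside the trailing range.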
The reason pasting that combination into LibreOffice Writer produces three characters rather than two (which, as you might notice, is also true in the web browser rendering) is that the original poster’s decomposition is wrong. The canonical decomposition of 한국 (\ud55c\uad6d) is ᄒ ᅡ ᆫ ᄀ ᅮ ᆨ (\u1112\u1161\u11ab\u1100\u116e\u11a8), not ᄒ ᅡ ᄂ ᄀ ᅮ ᄀ (\u1112\u1161\u1102\u1100\u116e\u1100). The latter version uses the leading forms for all the consonants, but the third and sixth code points should be the trailing forms.
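For anyone who wants to check this rather than take my word for it, here is a small Python sketch using the standard `unicodedata` module. The variable names are mine; the code points are the ones quoted above.

```python
import unicodedata

composed = "\ud55c\uad6d"                          # the two precomposed syllables
correct = "\u1112\u1161\u11ab\u1100\u116e\u11a8"   # leading, vowel, trailing (twice)
wrong = "\u1112\u1161\u1102\u1100\u116e\u1100"     # leading forms in the final slots

# Canonical decomposition (NFD) of the composed text.
nfd = unicodedata.normalize("NFD", composed)
print(nfd == correct)

# Canonical composition (NFC) of each six-code-point sequence.
recomposed_correct = unicodedata.normalize("NFC", correct)
recomposed_wrong = unicodedata.normalize("NFC", wrong)
print([f"U+{ord(ch):04X}" for ch in recomposed_correct])
print([f"U+{ord(ch):04X}" for ch in recomposed_wrong])
```

The correct sequence recomposes to two code points. The wrong one only gets as far as leading+vowel pairs, leaving the stray leading consonants on their own, which is consistent with software showing more than two characters.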
However, the correct decomposition still doesn’t compose visually in any font I can find in Notepad++.
I downloaded the current version of SciTE and set the font to NSimSum by using Options | Open User Options File, typing:
font.base=font:NSimSum,size:12
there, and saving. I opened an empty tab and chose File | Encoding | UTF-8. Then I pasted:
한국 á ç
(You can’t see it here, but that is the canonical decomposition of all those characters.) Just as in Notepad++, the Latin letters with diacritics display as composed, but the Korean syllables do not.
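If you want to reproduce the test without trusting the clipboard or the forum software, a string like that can be generated instead of copied. This is only a sketch; the name `sample` is mine, and I am assuming you then write it to a UTF-8 file and open that in the editor.

```python
import unicodedata

# Start from the composed text, written as escapes so nothing can re-normalize it.
sample = unicodedata.normalize("NFD", "\ud55c\uad6d \u00e1 \u00e7")

# Six jamo, a space, a + combining acute, a space, c + combining cedilla.
print(len(sample))
print(" ".join(f"{ord(ch):04X}" for ch in sample))
```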
Therefore, I think this is a Scintilla limitation. Why it doesn’t compose visually, when LibreOffice Writer and web browsers do, I have not yet been able to determine.
All this might not be as related to the problem @SalviaSage described as it appears to be. Notice that the original post says:
For example, I want to get (한국).
Those two characters are composed. They don’t consist of individual jamo. You can determine that by pasting them into Notepad++. You can also determine that neither the forum nor the web browser is performing the conversion to composed form by pasting this: 한국 into Notepad++. It looks the same here, but it’s six separate characters in Notepad++.
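Another way to tell the two forms apart, outside any editor, is to list the code points. A minimal Python sketch follows; `describe` is a name I made up, and the names it prints come from Python's bundled Unicode database.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


# Composed form: one code point per syllable.
for line in describe("\ud55c\uad6d"):
    print(line)

# Decomposed form: three jamo per syllable.
for line in describe("\u1112\u1161\u11ab\u1100\u116e\u11a8"):
    print(line)
```

Composed text reports names beginning with HANGUL SYLLABLE, while jamo report CHOSEONG (leading), JUNGSEONG (vowel), or JONGSEONG (trailing).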
So I think the original poster’s problem has more to do with using, or not using, or misusing, an Input Method Editor. Unfortunately, I don’t know anything about those. Notepad++ will display the composed Korean characters just fine; for some reason, the individual jamo are going directly from the keyboard to Scintilla without composition. I’m thinking perhaps an inline IME is a default for a web browser, but not for Notepad++? I have to leave pursuit of that possibility to someone else.