@guy038 said in Unicode problem !:
But, nowadays, with the Unicode standard and the universal UTF-8 encoding, I suppose that any computer language should handle these Unicode chars properly !
TLDR: Unicode is hard.
You know the joke that goes, “There’s a sort of person who sees a problem and thinks, ‘I know! I’ll use regular expressions!’ Now that person has two problems.”? One could say the same of Unicode.
Modern Windows is natively Unicode… but it’s UTF-16. Most other Unicode-aware software (including Scintilla) works in UTF-8.
And most programming languages — as well as most regular expression implementations! — think of “characters” as fixed-length entities in memory. By choosing UTF-16, Windows at least makes handling the Basic Multilingual Plane fairly straightforward, so long as you don’t mind converting nearly everyone else’s version of Unicode to Windows’ version. OK for short strings… not so good when you have megabytes or gigabytes of file data.
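To make the "fixed-length" problem concrete, here is a small Python sketch (Python only because it is easy to run; the helper name encoded_sizes is made up for this illustration) that prints how many bytes a few single code points take in each encoding. Everything in the Basic Multilingual Plane is one 16-bit unit in UTF-16, but between one and three bytes in UTF-8; outside the BMP both need four bytes.

```python
def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for a string.

    Hypothetical helper for illustration only. The "-le" codec is used
    so that no byte order mark is added to the UTF-16 output.
    """
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


# One ASCII letter, two other BMP characters, and one character outside the BMP.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

The last sample is the awkward one for UTF-16: it no longer fits in a single 16-bit unit, so the "one unit per character" assumption fails.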
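C++ had some limited Unicode conversion facilities in its standard library (std::wstring_convert and the <codecvt> header), but deprecated even those in C++17 without providing a standard replacement, so in practice one is left to use an appropriate library.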
Even if you’re willing to convert to UTF-32 (which means quadrupling the storage requirements of ASCII characters compared with UTF-8), so that every Unicode code point is the same size in memory, not everything a reader would call a single character is a single Unicode code point: an accented letter, for instance, can be a base letter followed by a combining mark. The complexity just keeps increasing, and increasing… the hope that Unicode would “simplify” using all the world’s scripts has not exactly been realized.
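A quick Python sketch of both points, using only the standard library (the variable names are just for this example). Python strings are sequences of code points, so len() here counts code points, much as a UTF-32 array would.

```python
import unicodedata

# ASCII text takes four times the space in UTF-32 that it takes in UTF-8.
ascii_text = "hello"
utf8_size = len(ascii_text.encode("utf-8"))
utf32_size = len(ascii_text.encode("utf-32-le"))  # "-le" avoids a byte order mark
print(utf8_size, utf32_size)

# The same visible letter, written two ways.
composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT, two code points

print(len(composed), len(decomposed))
print(composed == decomposed)  # a plain comparison treats them as different

# Normalizing to NFC composes the pair, after which the strings compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

So even with fixed-size code points, "one array element per character" only holds if you normalize first, and only for characters that happen to have a precomposed form.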
When I wanted to incorporate Boost.Regex directly into my Columns++ plugin, I first thought to use its Unicode facilities. Then I saw that while Boost.Regex is a header-only library (if you don’t know C++, read that as “easy to include in a project without changing the way the project is structured”), its Unicode facilities require ICU4C, and if there’s any simple, straightforward way to incorporate that monstrosity into an otherwise simple, straightforward project, I couldn’t find it. I settled for the same “hack” that Notepad++ itself uses: convert UTF-8 to UTF-16 on the fly, use Windows’ support of UTF-16 where needed, and let the chips fall where they may for anyone bold enough to use combining characters or characters outside the Basic Multilingual Plane.
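To show where the chips fall, here is a Python sketch of what any engine that works on 16-bit units gets to see. This is not the code Notepad++ or Columns++ uses; utf16_units and is_surrogate are hypothetical helpers written only for this illustration.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


def is_surrogate(unit):
    """True if a 16-bit unit is half of a surrogate pair, not a whole character."""
    return 0xD800 <= unit <= 0xDFFF


# "a", one character outside the BMP (U+1F600), then "b".
text = "a\U0001F600b"
units = utf16_units(text)

print(len(text), len(units))                # 3 code points, but 4 code units
print([hex(u) for u in units])
print([is_surrogate(u) for u in units])

# A combining sequence is also two units, although it displays as one letter.
print([hex(u) for u in utf16_units("e\u0301")])
```

Anything that treats one 16-bit unit as one character (a "match any single character" step, a character class, a column count) will see the middle character as two separate things, and the accented letter as a letter plus something else.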
Unicode is so complex, so full of details (What characters are “the same” when you ignore case? Well, that depends on the current locale…), that support is a project in itself, and attempts to be reasonably complete (ICU) are gigantic and constantly being updated.
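A few case-insensitivity surprises, sketched in Python. One assumption to keep in mind: Python's built-in string methods apply the default, locale-independent Unicode mappings, so this shows what you get without locale tailoring; a library such as ICU is what you would need for the Turkish rules. The helper same_ignoring_case is made up for this example.

```python
def same_ignoring_case(a, b):
    """Compare two strings using default (locale-independent) case folding."""
    return a.casefold() == b.casefold()


# Case mapping can change the length of a string: sharp s folds to "ss".
print(same_ignoring_case("Stra\u00dfe", "STRASSE"))

# Capital I with dot above lowercases to "i" plus a combining dot: two code points.
dotted_capital_i = "\u0130"
print(len(dotted_capital_i), len(dotted_capital_i.lower()))

# In Turkish, the lowercase of "I" is dotless "\u0131", not "i".
# The default mapping knows nothing about that, so it gives "i".
print("I".lower() == "i")
print(same_ignoring_case("I", "\u0131"))  # not equal under default folding
print("\u0131".upper() == "I")            # yet dotless i uppercases to plain I
```

So "ignore case" is not a single well-defined operation: the answer for the same pair of strings can differ depending on which language's rules you apply.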