Hello, @pilagit, @peterjones, @coises, @mpheath and All,
Not really off-topic but yes, sorting problems are always a nightmare :-(( Just some general remarks about the alphabetical sorting order :
https://en.wikipedia.org/wiki/alphabetical_order
In particular, read the section below :
https://en.wikipedia.org/wiki/Alphabetical_order#Language-specific_conventions
Here is just one example, which nevertheless concerns two neighbouring countries :
(for example, the correct lexicographic order is baa, baá, báa, báá, bab, báb, bac, bác, bač, báč [in Czech] and baa, baá, baä, báa, báá, báä, bäa, bäá, bää, bab, báb, bäb, bac, bác, bäc, bač, báč, bäč [in Slovak])
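To see why a plain code-point sort cannot follow such language-specific conventions, here is a minimal Python sketch. It only assumes that Python's built-in `sorted()` compares strings code point by code point; the variable names are mine and the words are the Czech ones quoted above.

```python
import unicodedata

# Czech order quoted above (Wikipedia, "Language-specific conventions").
czech_order = ["baa", "baá", "báa", "báá", "bab", "báb", "bac", "bác", "bač", "báč"]

# Normalize to precomposed characters (NFC) so that each accented letter
# is a single code point, whatever form the source text was typed in.
czech_order = [unicodedata.normalize("NFC", w) for w in czech_order]

# sorted() on str compares code points, so this is a pure code-point order:
# a (U+0061) < b (U+0062) < c (U+0063) < á (U+00E1) < č (U+010D)
code_point_order = sorted(czech_order)

print(code_point_order)
print(code_point_order == czech_order)   # False: the two orders differ
```

With a code-point sort, `bab` and `bac` come before `baá`, because `b` and `c` have lower code points than `á`, whereas the Czech convention keeps `baá` right after `baa`.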
On the other hand, on the Unicode Consortium site, take the time to read this interesting technical report in full :
https://www.unicode.org/reports/tr10/
The beginning of this article says :
Collation is the general term for the process and function of determining the sorting order of strings of characters.
It is a key function in computer systems; whenever a list of strings is presented to users, they are likely to want it
in a sorted order so that they can easily and reliably find individual strings. Thus it is widely used in user interfaces.
It is also crucial for databases, both in sorting records and in selecting sets of records with fields within given bounds.
You’ll be stunned by the complexity of the problem !
So, in practice, I suppose that anyone trying to write a sort algorithm must make a lot of simplifications. Indeed, taking into account all the specifications needed to sort every language correctly seems to be a superhuman task !
Luckily, there’s nothing mysterious about the sort algorithm used by Notepad++ :
If you use the Edit > Line Operations > Sort lines Lexicographically Ascending option, any character is simply sorted by its Unicode code-point.
However, all the code-points beyond the BMP ( Basic Multilingual Plane ), i.e. those over U+FFFF, sort as if they were located within the surrogates section ( U+D800 to U+DFFF ), which is where their UTF-16 surrogate pairs lie. So, they come :
After U+D7FF and all the previous code-points of the BMP
Before U+E000 and all the following code-points of the BMP
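The behaviour described above matches a comparison done on UTF-16 code units rather than on whole code points. Here is a small Python sketch that reproduces it. I am assuming, not asserting, that this is what happens inside Notepad++; the `utf16_key` helper and the sample characters are mine.

```python
def utf16_key(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")          # big-endian, no BOM
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# A few sample characters on both sides of the surrogates section.
samples = ["\uFFFD", "\U0010FFFD", "\uE000", "\U00010000", "\uD7FB", "A"]

by_code_point = sorted(samples)                 # Python's default: code-point order
by_utf16 = sorted(samples, key=utf16_key)       # order described in this post

def show(chars):
    return [f"U+{ord(c):04X}" for c in chars]

print(show(by_code_point))
print(show(by_utf16))
```

In the second list, U+10000 and U+10FFFD land between U+D7FB and U+E000, because their first code units ( D800 and DBFF ) lie in the surrogates section.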
Thus, a general N++ sorted list, with Unicode v16.0, is always of this form :
U+0000    NULL character                                \
...                                                     |
...                                                     |
...                                                     |  Plane 0 : Basic Multilingual Plane ( BMP )
...                                                     |  ¯¯¯¯¯¯¯¯
...                                                     |
U+D7FB    HANGUL JONGSEONG PHIEUPH-THIEUTH character    /
U+10000   LINEAR B SYLLABLE B008 A                      \
...                                                     |
...                                                     |
...                                                     |  Plane 1 : Supplementary Multilingual Plane ( SMP )
...                                                     |
...                                                     |
U+1FBF9   SEGMENTED DIGIT NINE                          /
U+20000   CJK Ideograph Extension B GKX-0075.06         \
...                                                     |
...                                                     |
...                                                     |  Plane 2 : Supplementary Ideographic Plane ( SIP )
...                                                     |
...                                                     |
U+2FA1D   CJK COMPATIBILITY IDEOGRAPH-2FA1D             /
U+30000   CJK Unified Ideographs Extension G UK-02764   \
...                                                     |
...                                                     |
...                                                     |  Plane 3 : Tertiary Ideographic Plane ( TIP )
...                                                     |
...                                                     |
U+323AF   CJK Unified Ideographs Extension H T13-3D2C   /
U+E0001   BEGIN LANGUAGE TAG                            \
...                                                     |
...                                                     |
...                                                     |  Plane 14 : Supplementary Special-purpose Plane ( SSP )
...                                                     |
...                                                     |
U+E01EF   VARIATION SELECTOR-256                        /
U+F0000   Private Use-A                                 \
...                                                     |
...                                                     |
...                                                     |  Plane 15 : Supplementary Private Use Area A ( SPUA-A )
...                                                     |
...                                                     |
U+FFFFD   Private Use-A                                 /
U+100000  Private Use-B                                 \
...                                                     |
...                                                     |
...                                                     |  Plane 16 : Supplementary Private Use Area B ( SPUA-B )
...                                                     |
...                                                     |
U+10FFFD  Private Use-B                                 /
U+E000    Private Use Area                              \
...                                                     |
...                                                     |
...                                                     |  Plane 0 : Basic Multilingual Plane ( BMP )
...                                                     |  ¯¯¯¯¯¯¯¯
...                                                     |
U+FFFD    REPLACEMENT character                         /
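As a cross-check of the list above, here is a short Python sketch of the standard UTF-16 surrogate arithmetic. The function names `plane_of` and `utf16_units` are mine, and the boundary values are simply those of the list; sorting them by their UTF-16 code units should give back exactly the same sequence.

```python
def plane_of(cp):
    """Plane number of a code point (0 = BMP, 1 = SMP, ...)."""
    return cp >> 16

def utf16_units(cp):
    """UTF-16 code unit(s) of a code point, as a tuple of integers."""
    if cp < 0x10000:
        return (cp,)                       # BMP: a single code unit
    v = cp - 0x10000                       # 20-bit offset above the BMP
    return (0xD800 + (v >> 10),            # high surrogate: top 10 bits
            0xDC00 + (v & 0x3FF))          # low surrogate: bottom 10 bits

# First and last code point of each block, in the order of the list above.
boundaries = [
    0x0000, 0xD7FB,
    0x10000, 0x1FBF9, 0x20000, 0x2FA1D, 0x30000, 0x323AF,
    0xE0001, 0xE01EF, 0xF0000, 0xFFFFD, 0x100000, 0x10FFFD,
    0xE000, 0xFFFD,
]

resorted = sorted(boundaries, key=utf16_units)
for cp in resorted:
    units = " ".join(f"{u:04X}" for u in utf16_units(cp))
    print(f"U+{cp:04X}  plane {plane_of(cp):2d}  UTF-16: {units}")

print(resorted == boundaries)
```

For instance, U+1FBF9 gives 0x1FBF9 - 0x10000 = 0xFBF9, hence the pair D83E DFF9, which sorts after D7FB and before E000.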
Best Regards,
guy038