Does Sort Lines Lexicographically have a bug with diacritics?
-
Here’s a test list of words:
naval
neither
nèither
never
nëver
nothing
Select all lines, then choose Edit > Line Operations > Sort Lines Lexicographically Ascending. The list is changed to:
naval
neither
never
nothing
nèither
nëver
I expected no change, since the list is already sorted correctly. I realise this is a little complicated because (to take an example) “è” has character code 232 (which lies outside ASCII) while “e” has ASCII code 101.
My (possibly incorrect) assumption is that most users of languages with diacritics would expect the order of letters to be:
…deèéêëf…
rather than:
…def…èéêë…
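The reordering reported above is what a plain code-point comparison produces. Here is a minimal Python sketch, not Notepad++'s actual implementation: the name `sort_by_code_point` is made up for illustration, and the only assumption is that the editor compares lines character by character using their code values, which is also how Python compares strings.

```python
def sort_by_code_point(lines):
    """Sort strings the way Python compares them: code point by code point."""
    return sorted(lines)


words = ["naval", "neither", "nèither", "never", "nëver", "nothing"]

# "è" (232) and "ë" (235) both come after "o" (111), so the accented words
# end up after "nothing" instead of next to "neither" and "never".
for word in sort_by_code_point(words):
    print(word)
```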
My suggestion would be to either:
- add an option like “Sort Lines Lexicographically Respecting Diacritics Ascending” (someone can surely come up with a catchier title!) or
- add an option into Settings > Preferences to define the lexicographic ordering or
- fix the algorithm to respect diacritic ordering
Any comments?
-
Thanks, have raised issue 8481:
https://github.com/notepad-plus-plus/notepad-plus-plus/issues/8481
-
Hello, @abdekker, @ekopalypse and All,
The present N++ alphabetic sort simply rearranges the characters according to the value of their Unicode code-point. Refer to the list below to get all the existing code-points, from the latest Unicode version : http://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt
And to this article which explains the format of that list
So, for instance, this list of characters :
Char    Code
~~~~~~~~~~~~
(tab)   0009
(space) 0020
"       0022
“       201C
”       201D
-       002D
—       2014
_       005F
0       0030
Ø       00D8
9       0039
A       0041
a       0061
Ă       0102
B       0042
b       0062
þ       00FE
β       03B2
Б       0411
E       0045
e       0065
Ě       011A
€       20AC
∑       2211
℮       212E
ﬁ       FB01
ℓ       2113
O       004F
o       006F
ö       00F6
Œ       0152
θ       03B8
Ѳ       0472
T       0054
t       0074
τ       03C4
ŧ       0167
‡       2021
Ỳ       1EF2
‰       2030
∆       2206
∞       221E
is alphabetically sorted as :
Char    Code
~~~~~~~~~~~~
(tab)   0009
(space) 0020
"       0022
-       002D
0       0030
9       0039
A       0041
B       0042
E       0045
O       004F
T       0054
_       005F
a       0061
b       0062
e       0065
o       006F
t       0074
Ø       00D8
ö       00F6
þ       00FE
Ă       0102
Ě       011A
Œ       0152
ŧ       0167
β       03B2
θ       03B8
τ       03C4
Б       0411
Ѳ       0472
Ỳ       1EF2
—       2014
“       201C
”       201D
‡       2021
‰       2030
€       20AC
ℓ       2113
℮       212E
∆       2206
∑       2211
∞       221E
ﬁ       FB01
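A table like the one above can be rebuilt with a few lines of Python. This is only an illustrative sketch (the helper name `code_point_table` is hypothetical): it orders single characters by `ord()` and prints each code point as four hexadecimal digits.

```python
def code_point_table(chars):
    """Return (character, 4-digit hex code point) pairs in code-point order."""
    return [(c, f"{ord(c):04X}") for c in sorted(chars, key=ord)]


# A handful of characters taken from the list above.
for char, code in code_point_table("eEèŧβ€_"):
    print(char, code)
```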
You may think that it would be good to have an option to use, for instance, the Unicode collation mechanism, as below :
http://www.unicode.org/charts/collation/
However, sorting conventions differ from one country to another, even among countries using the same Latin alphabet. Refer to this article :
https://en.wikipedia.org/wiki/Alphabetical_order#Language-specific_conventions
So, in the end, the simple task of sorting letters is quite a puzzle if you want to take into account the specificities of each language and country !
In the meantime, here is a solution, involving regular expressions :
From this initial list, containing 45 French words :
foret Forer forêt Cote Côte côtelée côté Côtière cotée là La prairie A à près pré Premier gare Gîte giter Gorge Où ou Règne renne Aigüe aïeul Ôté ôter flûte Batir bâton Canoë Été île Reflet régner Ilote ångström Âpreté à-propos Escale étude offense Odyssée
After running the Edit > Line Operations > Sort Lines Lexicographically Ascending option, we get :
A Aigüe Batir Canoë Cote Côte Côtière Escale Forer Gorge Gîte Ilote La Odyssée Où Premier Reflet Règne aïeul bâton cotée côtelée côté flûte foret forêt gare giter là offense ou prairie près pré renne régner Âpreté Été Ôté à à-propos ångström étude île ôter
Obviously, to a Frenchman, this ordering looks rather awful and makes searching for a word really difficult !
Now, let’s take our initial list again and duplicate the word on each line, with this simple regex S/R :
SEARCH
(?-s).+
REPLACE
$0\t\t\t\t\t$0
So, the list is changed into :
foret       foret
Forer       Forer
forêt       forêt
Cote        Cote
Côte        Côte
côtelée     côtelée
côté        côté
Côtière     Côtière
cotée       cotée
là          là
La          La
prairie     prairie
A           A
à           à
près        près
pré         pré
Premier     Premier
gare        gare
Gîte        Gîte
giter       giter
Gorge       Gorge
Où          Où
ou          ou
Règne       Règne
renne       renne
Aigüe       Aigüe
aïeul       aïeul
Ôté         Ôté
ôter        ôter
flûte       flûte
Batir       Batir
bâton       bâton
Canoë       Canoë
Été         Été
île         île
Reflet      Reflet
régner      régner
Ilote       Ilote
ångström    ångström
Âpreté      Âpreté
à-propos    à-propos
Escale      Escale
étude       étude
offense     offense
Odyssée     Odyssée
Now, with the next regex S/R, we change, only in the first column :
- Any accented vowel to its corresponding unaccented lowercase vowel
- Any uppercase letter to its corresponding lowercase letter
SEARCH
(?-i)(?:([[=A=]])|([[=E=]])|([[=I=]])|([[=O=]])|([[=U=]])|([[=Y=]])|([A-Z]))(?=.*\t)
REPLACE
(?1a)(?2e)(?3i)(?4o)(?5u)(?6y)(?7\l\7)
foret       foret
forer       Forer
foret       forêt
cote        Cote
cote        Côte
cotelee     côtelée
cote        côté
cotiere     Côtière
cotee       cotée
la          là
la          La
prairie     prairie
a           A
a           à
pres        près
pre         pré
premier     Premier
gare        gare
gite        Gîte
giter       giter
gorge       Gorge
ou          Où
ou          ou
regne       Règne
renne       renne
aigue       Aigüe
aieul       aïeul
ote         Ôté
oter        ôter
flute       flûte
batir       Batir
baton       bâton
canoe       Canoë
ete         Été
ile         île
reflet      Reflet
regner      régner
ilote       Ilote
angstrom    ångström
aprete      Âpreté
a-propos    à-propos
escale      Escale
etude       étude
offense     offense
odyssee     Odyssée
Now, after a new Edit > Line Operations > Sort Lines Lexicographically Ascending action, we get this list :
a           A
a           à
a-propos    à-propos
aieul       aïeul
aigue       Aigüe
angstrom    ångström
aprete      Âpreté
batir       Batir
baton       bâton
canoe       Canoë
cote        Cote
cote        Côte
cote        côté
cotee       cotée
cotelee     côtelée
cotiere     Côtière
escale      Escale
ete         Été
etude       étude
flute       flûte
forer       Forer
foret       foret
foret       forêt
gare        gare
gite        Gîte
giter       giter
gorge       Gorge
ile         île
ilote       Ilote
la          La
la          là
odyssee     Odyssée
offense     offense
ote         Ôté
oter        ôter
ou          Où
ou          ou
prairie     prairie
pre         pré
premier     Premier
pres        près
reflet      Reflet
regne       Règne
regner      régner
renne       renne
Finally, this simple regex S/R removes the first column and leaves the expected, well-sorted list of words :
SEARCH
^.+\t
REPLACE
Leave EMPTY
A à à-propos aïeul Aigüe ångström Âpreté Batir bâton Canoë Cote Côte côté cotée côtelée Côtière Escale Été étude flûte Forer foret forêt gare Gîte giter Gorge île Ilote La là Odyssée offense Ôté ôter Où ou prairie pré Premier près Reflet Règne régner renne
Of course, in this first approach, I did not include the accented consonants, nor the Æ, Œ, æ and œ characters, but I suppose that you get the general idea !
The nice side of this method is that if the first column is identical across several rows, the sort still acts on the second field ;-)). For instance :
cote        Cote
cote        Côte
cote        côté
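For readers who prefer a script, the same duplicate / fold / sort / strip idea can be sketched in Python. This is an illustrative equivalent, not the regex method itself: the names `fold` and `sort_ignoring_diacritics` are made up, and Unicode NFD decomposition is swapped in for the `[[=A=]]` equivalence classes. Like the regexes above, it leaves Æ, Œ, æ and œ unfolded, since those characters have no decomposition.

```python
import unicodedata


def fold(word):
    """Strip combining accents and lowercase: plays the role of the first column."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def sort_ignoring_diacritics(words):
    """Sort on (folded word, original word), then keep only the original."""
    # The original word is the tie-breaker, like the second column above.
    return sorted(words, key=lambda w: (fold(w), w))


print(sort_ignoring_diacritics(["côté", "Côte", "cotée", "Cote"]))
```

The tuple key makes the duplicated column unnecessary: the folded form decides the main order and the untouched word only matters when two folded forms are equal.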
Best Regards,
guy038
-