Does Sort Lines Lexicographically have a bug with diacritics?

abdekker

Here’s a test list of words:

naval
neither
nèither
never
nëver
nothing

Select all lines then Edit > Line Operations > Sort Lines Lexicographically Ascending. The list is changed to:

naval
neither
never
nothing
nèither
nëver

I expected no change since the list is already sorted correctly. I realise this is a little complicated because (to take an example) “è” has character code 232 (outside the ASCII range) while “e” has ASCII code 101.

My (possibly incorrect) assumption is that most users of languages using diacritics would expect the order of letters to be:

…deèéêëf…

rather than:

…def…èéêë…

My suggestion would be to either:

add an option like “Sort Lines Lexicographically Respecting Diacritics Ascending” (someone can surely come up with a catchier title!) or
add an option into Settings > Preferences to define the lexicographic ordering or
fix the algorithm to respect diacritic ordering

Any comments?

Ekopalypse

@abdekker

I’d say you analyzed it just right. How a feature request can be made is described here.

abdekker

Thanks, have raised issue 8481:
https://github.com/notepad-plus-plus/notepad-plus-plus/issues/8481

guy038

Hello, @abdekker, @ekopalypse and All,

The present N++ alphabetic sort simply rearranges the characters according to the value of their Unicode code-point. Refer to the list, below, to get all the existing code-points, from the latest Unicode version :

http://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt

And to this article which explains the format of that list

So, for instance, this list of characters :

Char  Code
~~~~~~~~~~~~
	  0009
      0020
"     0022
“     201C
”     201D
-     002D
—     2014
_     005F
0     0030
Ø     00D8
9     0039
A     0041
a     0061
Ă     0102
B     0042
b     0062
þ     00FE
β     03B2
Б     0411
E     0045
e     0065
Ě     011A
€     20AC
∑     2211
℮     212E
ﬁ     FB01
ℓ     2113
O     004F
o     006F
ö     00F6
Œ     0152
θ     03B8
Ѳ     0472
T     0054
t     0074
τ     03C4
ŧ     0167
‡     2021
Ỳ     1EF2
‰     2030
∆     2206
∞     221E

is alphabetically sorted as :

Char  Code
~~~~~~~~~~~~~
	  0009
      0020
"     0022
-     002D
0     0030
9     0039
A     0041
B     0042
E     0045
O     004F
T     0054
_     005F
a     0061
b     0062
e     0065
o     006F
t     0074
Ø     00D8
ö     00F6
þ     00FE
Ă     0102
Ě     011A
Œ     0152
ŧ     0167
β     03B2
θ     03B8
τ     03C4
Б     0411
Ѳ     0472
Ỳ     1EF2
—     2014
“     201C
”     201D
‡     2021
‰     2030
€     20AC
ℓ     2113
℮     212E
∆     2206
∑     2211
∞     221E
ﬁ     FB01
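
For anyone who wants to check this ordering outside N++, here is a minimal Python sketch. It assumes that the N++ sort compares raw code-points, as described above; Python's built-in sorted() compares strings the same way. The chars list is only a subset of the table, chosen for brevity :

```python
# A subset of the characters from the table above.
chars = ["“", "-", "_", "0", "Ø", "A", "a", "Ă", "β", "€", "ﬁ"]

# Python compares strings code point by code point, so a plain sorted()
# gives the same kind of ordering as the one listed above.
by_code_point = sorted(chars)

# Print each character next to its code point, in the table's format.
for ch in by_code_point:
    print(f"{ch}     {ord(ch):04X}")
```

All the plain ASCII characters come out first, then the accented Latin letters, then Greek, then the punctuation and symbols from the higher blocks.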

You may think that it would be good to have an option to get, for instance, the Unicode collation mechanism, as below :

http://www.unicode.org/charts/collation/

However, countries use different sorting conventions, even among countries that use the same Latin alphabet. Refer to this article :

https://en.wikipedia.org/wiki/Alphabetical_order#Language-specific_conventions

So, in the end, the simple task of sorting letters is quite a puzzle if you want to take into account the specificities of each language and country !

In the meantime, here is a solution, involving regular expressions :

From this initial list, containing 45 French words :

foret
Forer
forêt
Cote
Côte
côtelée
côté
Côtière
cotée
là
La
prairie
A
à
près
pré
Premier
gare
Gîte
giter
Gorge
Où
ou
Règne
renne
Aigüe
aïeul
Ôté
ôter
flûte
Batir
bâton
Canoë
Été
île
Reflet
régner
Ilote
ångström
Âpreté
à-propos
Escale
étude
offense
Odyssée

After running the Edit > Line Operations > Sort Lines Lexicographically Ascending option, we get :

A
Aigüe
Batir
Canoë
Cote
Côte
Côtière
Escale
Forer
Gorge
Gîte
Ilote
La
Odyssée
Où
Premier
Reflet
Règne
aïeul
bâton
cotée
côtelée
côté
flûte
foret
forêt
gare
giter
là
offense
ou
prairie
près
pré
renne
régner
Âpreté
Été
Ôté
à
à-propos
ångström
étude
île
ôter

Obviously, to a Frenchman, this ordering looks rather awful and makes searching for a word really difficult !

Now, let’s take our initial list again and duplicate the word on each line, with the simple regex S/R :

SEARCH (?-s).+

REPLACE $0\t\t\t\t\t$0

So, the list is changed into :

foret					foret
Forer					Forer
forêt					forêt
Cote					Cote
Côte					Côte
côtelée					côtelée
côté					côté
Côtière					Côtière
cotée					cotée
là					là
La					La
prairie					prairie
A					A
à					à
près					près
pré					pré
Premier					Premier
gare					gare
Gîte					Gîte
giter					giter
Gorge					Gorge
Où					Où
ou					ou
Règne					Règne
renne					renne
Aigüe					Aigüe
aïeul					aïeul
Ôté					Ôté
ôter					ôter
flûte					flûte
Batir					Batir
bâton					bâton
Canoë					Canoë
Été					Été
île					île
Reflet					Reflet
régner					régner
Ilote					Ilote
ångström					ångström
Âpreté					Âpreté
à-propos					à-propos
Escale					Escale
étude					étude
offense					offense
Odyssée					Odyssée

Now, with the next regex S/R, we change, only in the first column :

Any vowel, accented or uppercase, to its corresponding unaccented lowercase vowel
Any remaining uppercase letter to its corresponding lowercase letter

SEARCH (?-is)(?:([[=A=]])|([[=E=]])|([[=I=]])|([[=O=]])|([[=U=]])|([[=Y=]])|([A-Z]))(?=.*\t)

REPLACE (?1a)(?2e)(?3i)(?4o)(?5u)(?6y)(?7\l\7)

foret					foret
forer					Forer
foret					forêt
cote					Cote
cote					Côte
cotelee					côtelée
cote					côté
cotiere					Côtière
cotee					cotée
la					là
la					La
prairie					prairie
a					A
a					à
pres					près
pre					pré
premier					Premier
gare					gare
gite					Gîte
giter					giter
gorge					Gorge
ou					Où
ou					ou
regne					Règne
renne					renne
aigue					Aigüe
aieul					aïeul
ote					Ôté
oter					ôter
flute					flûte
batir					Batir
baton					bâton
canoe					Canoë
ete					Été
ile					île
reflet					Reflet
regner					régner
ilote					Ilote
angstrom					ångström
aprete					Âpreté
a-propos					à-propos
escale					Escale
etude					étude
offense					offense
odyssee					Odyssée

Now, after a new Edit > Line Operations > Sort Lines Lexicographically Ascending action, we get this list :

a					A
a					à
a-propos					à-propos
aieul					aïeul
aigue					Aigüe
angstrom					ångström
aprete					Âpreté
batir					Batir
baton					bâton
canoe					Canoë
cote					Cote
cote					Côte
cote					côté
cotee					cotée
cotelee					côtelée
cotiere					Côtière
escale					Escale
ete					Été
etude					étude
flute					flûte
forer					Forer
foret					foret
foret					forêt
gare					gare
gite					Gîte
giter					giter
gorge					Gorge
ile					île
ilote					Ilote
la					La
la					là
odyssee					Odyssée
offense					offense
ote					Ôté
oter					ôter
ou					Où
ou					ou
prairie					prairie
pre					pré
premier					Premier
pres					près
reflet					Reflet
regne					Règne
regner					régner
renne					renne

Finally, we get the expected, well-sorted list of words by deleting the first column with the simple regex S/R :

SEARCH (?-s)^.+\t

REPLACE Leave EMPTY

A
à
à-propos
aïeul
Aigüe
ångström
Âpreté
Batir
bâton
Canoë
Cote
Côte
côté
cotée
côtelée
Côtière
Escale
Été
étude
flûte
Forer
foret
forêt
gare
Gîte
giter
Gorge
île
Ilote
La
là
Odyssée
offense
Ôté
ôter
Où
ou
prairie
pré
Premier
près
Reflet
Règne
régner
renne
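
Why does this last search remove the whole first column and all five tabs in one go ? The .+ part is greedy, so the match runs up to the last tab of the line. Here is a small Python sketch of that behaviour. It assumes that the dot does not match line breaks (Python's default), and the two sample lines are simply copied from the sorted list above :

```python
import re

# Two lines in the "folded<TAB x5>original" layout used above.
text = "cote\t\t\t\t\tCôte\nforet\t\t\t\t\tforêt"

# (?m) lets ^ match at the start of every line. `.+` is greedy and stays
# on its own line, so `^.+\t` consumes everything up to and including the
# LAST tab of each line, leaving only the original word.
restored = re.sub(r"(?m)^.+\t", "", text)

print(restored)
```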

Of course, in this first approach, I did not include the accented consonants as well as the Æ, Œ, æ, œ characters but I suppose that you’ll get the general idea !

The nice side of this method is that if the first column is identical between several rows, the sort still acts on the second field ;-)). For instance :

cote					Cote
cote					Côte
cote					côté
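
Outside N++, the same two-column idea can be written as a sort key. The Python sketch below is only an illustration, not what N++ does internally : fold() and sort_ignoring_accents() are hypothetical helper names, and the folding relies on Unicode NFD decomposition instead of the equivalence classes used in the regex above. So it also strips accents from consonants such as ç, but it still leaves Æ, Œ, æ, œ and Ø untouched, as they have no decomposition :

```python
import unicodedata


def fold(word: str) -> str:
    """Hypothetical helper: lowercase a word and drop its combining accents."""
    decomposed = unicodedata.normalize("NFD", word)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.lower()


def sort_ignoring_accents(words):
    # Tuple key: the folded form plays the role of the first column and the
    # original spelling is the tie-breaker, like the second column above.
    return sorted(words, key=lambda w: (fold(w), w))


sample = ["côté", "Côte", "Cote", "forêt", "foret", "Été", "étude", "escale"]
result = sort_ignoring_accents(sample)
print(result)
```

The three cote words come out as Cote, Côte, côté, in the same order as in the example just above, because ties on the folded form fall back to a code-point comparison of the original words.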

Best Regards,

guy038