Does Sort Lines Lexicographically have a bug with diacritics?



  • Here’s a test list of words:

    naval
    neither
    nèither
    never
    nëver
    nothing

    Select all lines, then Edit > Line Operations > Sort Lines Lexicographically Ascending. The list is changed to:

    naval
    neither
    never
    nothing
    nèither
    nëver

    I expected no change since the list is already sorted correctly. I realise this is a little complicated because (to take an example) “è” has code point 232, which lies outside ASCII, while “e” has ASCII code 101.

    My (possibly incorrect) assumption is that most users of languages using diacritics would expect the order of letters to be:
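
    The reordering can be reproduced outside Notepad++. The following is a minimal Python sketch, not Notepad++'s actual code, and it assumes the sort simply compares raw code points, which is what Python's built-in `sorted` does for strings:

```python
import unicodedata

words = ["naval", "neither", "nèither", "never", "nëver", "nothing"]
# Normalise to NFC so each accented letter is a single code point.
words = [unicodedata.normalize("NFC", w) for w in words]

# Python compares strings code point by code point.
by_code_point = sorted(words)
print(by_code_point)

# "e" is 101 while "è" (U+00E8) is 232, so "è" sorts after every
# unaccented lowercase letter, including "o" and "v".
e_plain, e_grave = ord("e"), ord("\u00e8")
print(e_plain, e_grave)
```

    The accented words end up after “nothing” because their second letter compares greater than any plain a-z letter.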

    …deèéêëf…

    rather than:

    …def…èéêë…

    My suggestion would be to either:

    • add an option like “Sort Lines Lexicographically Respecting Diacritics Ascending” (someone can surely come up with a catchier title!) or
    • add an option into Settings > Preferences to define the lexicographic ordering or
    • fix the algorithm to respect diacritic ordering

    Any comments?



  • @abdekker

    I’d say you analyzed it just right. How a feature request can be made is described here.





  • Hello, @abdekker, @ekopalypse and All,

    The current N++ lexicographic sort simply orders the characters according to the value of their Unicode code point. Refer to the list below for all the existing code points in the latest Unicode version :

    http://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt

    And to this article which explains the format of that list


    So, for instance, this list of characters :

    Char  Code
    ------------
    	  0009
          0020
    "     0022
    “     201C
    ”     201D
    -     002D
    —     2014
    _     005F
    0     0030
    Ø     00D8
    9     0039
    A     0041
    a     0061
    Ă     0102
    B     0042
    b     0062
    þ     00FE
    β     03B2
    Б     0411
    E     0045
    e     0065
    Ě     011A
    €     20AC
    ∑     2211
    ℮     212E
    ﬁ     FB01
    ℓ     2113
    O     004F
    o     006F
    ö     00F6
    Π    0152
    θ     03B8
    Ѳ     0472
    T     0054
    t     0074
    τ     03C4
    ŧ     0167
    ‡     2021
    Ỳ     1EF2
    ‰     2030
    ∆     2206
    ∞     221E
    

    is alphabetically sorted as :

    Char  Code
    ------------
    	  0009
          0020
    "     0022
    -     002D
    0     0030
    9     0039
    A     0041
    B     0042
    E     0045
    O     004F
    T     0054
    _     005F
    a     0061
    b     0062
    e     0065
    o     006F
    t     0074
    Ø     00D8
    ö     00F6
    þ     00FE
    Ă     0102
    Ě     011A
    Π    0152
    ŧ     0167
    β     03B2
    θ     03B8
    τ     03C4
    Б     0411
    Ѳ     0472
    Ỳ     1EF2
    —     2014
    “     201C
    ”     201D
    ‡     2021
    ‰     2030
    €     20AC
    ℓ     2113
    ℮     212E
    ∆     2206
    ∑     2211
    ∞     221E
    ﬁ     FB01
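
    The “Code” column can be recomputed for any character. The sketch below is only an illustration in Python, using a handful of characters from the table above; the names `sample`, `code` and `table` are made up for the example:

```python
# A few characters from the table, written as escapes where the glyph
# is easy to confuse: U+201C, U+2014, U+00D8, U+0102, U+20AC, U+FB01.
sample = ['"', "\u201c", "-", "\u2014", "_", "A", "a",
          "\u00d8", "\u0102", "\u20ac", "\ufb01"]

def code(ch: str) -> str:
    """Return the code point of a single character as 4 hex digits."""
    return f"{ord(ch):04X}"

# Sorting single characters orders them by code point, like the second table.
table = [(ch, code(ch)) for ch in sorted(sample)]
for ch, hex_code in table:
    print(f"{ch}     {hex_code}")
```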
    

    You may think that it would be good to have an option to use, for instance, the Unicode collation mechanism, as below :

    http://www.unicode.org/charts/collation/

    However, sorting conventions differ from country to country, even among countries that use the same Latin alphabet. Refer to this article :

    https://en.wikipedia.org/wiki/Alphabetical_order#Language-specific_conventions

    So, in the end, the simple task of sorting letters is quite a puzzle if you want to take into account the specificities of each language and country !


    In the meantime, here is a workaround involving regular expressions :

    From this initial list, containing 45 French words :

    foret
    Forer
    forêt
    Cote
    Côte
    côtelée
    côté
    Côtière
    cotée
    là
    La
    prairie
    A
    à
    près
    pré
    Premier
    gare
    Gîte
    giter
    Gorge
    Où
    ou
    Règne
    renne
    Aigüe
    aïeul
    Ôté
    ôter
    flûte
    Batir
    bâton
    Canoë
    Été
    île
    Reflet
    régner
    Ilote
    ångström
    Âpreté
    à-propos
    Escale
    étude
    offense
    Odyssée
    

    After running the Edit > Line Operations > Sort Lines Lexicographically Ascending option, we get :

    A
    Aigüe
    Batir
    Canoë
    Cote
    Côte
    Côtière
    Escale
    Forer
    Gorge
    Gîte
    Ilote
    La
    Odyssée
    Où
    Premier
    Reflet
    Règne
    aïeul
    bâton
    cotée
    côtelée
    côté
    flûte
    foret
    forêt
    gare
    giter
    là
    offense
    ou
    prairie
    près
    pré
    renne
    régner
    Âpreté
    Été
    Ôté
    à
    à-propos
    ångström
    étude
    île
    ôter
    

    Obviously, to a Frenchman, this ordering looks rather awful and makes searching for a word really difficult !


    Now, let’s take our initial list again and duplicate the word on each line, with this simple regex S/R :

    SEARCH (?-s).+

    REPLACE $0\t\t\t\t\t$0

    So, the list is changed into :

    foret					foret
    Forer					Forer
    forêt					forêt
    Cote					Cote
    Côte					Côte
    côtelée					côtelée
    côté					côté
    Côtière					Côtière
    cotée					cotée
    là					là
    La					La
    prairie					prairie
    A					A
    à					à
    près					près
    pré					pré
    Premier					Premier
    gare					gare
    Gîte					Gîte
    giter					giter
    Gorge					Gorge
    Où					Où
    ou					ou
    Règne					Règne
    renne					renne
    Aigüe					Aigüe
    aïeul					aïeul
    Ôté					Ôté
    ôter					ôter
    flûte					flûte
    Batir					Batir
    bâton					bâton
    Canoë					Canoë
    Été					Été
    île					île
    Reflet					Reflet
    régner					régner
    Ilote					Ilote
    ångström					ångström
    Âpreté					Âpreté
    à-propos					à-propos
    Escale					Escale
    étude					étude
    offense					offense
    Odyssée					Odyssée
    

    Now, with the next regex S/R, we change, in the first column only :

    • Any vowel, accented or uppercase, to its corresponding unaccented lowercase vowel

    • Any uppercase consonant to its corresponding lowercase letter

    SEARCH (?-i)(?:([[=A=]])|([[=E=]])|([[=I=]])|([[=O=]])|([[=U=]])|([[=Y=]])|([A-Z]))(?=.*\t)

    REPLACE (?1a)(?2e)(?3i)(?4o)(?5u)(?6y)(?7\l\7)

    foret					foret
    forer					Forer
    foret					forêt
    cote					Cote
    cote					Côte
    cotelee					côtelée
    cote					côté
    cotiere					Côtière
    cotee					cotée
    la					là
    la					La
    prairie					prairie
    a					A
    a					à
    pres					près
    pre					pré
    premier					Premier
    gare					gare
    gite					Gîte
    giter					giter
    gorge					Gorge
    ou					Où
    ou					ou
    regne					Règne
    renne					renne
    aigue					Aigüe
    aieul					aïeul
    ote					Ôté
    oter					ôter
    flute					flûte
    batir					Batir
    baton					bâton
    canoe					Canoë
    ete					Été
    ile					île
    reflet					Reflet
    regner					régner
    ilote					Ilote
    angstrom					ångström
    aprete					Âpreté
    a-propos					à-propos
    escale					Escale
    etude					étude
    offense					offense
    odyssee					Odyssée
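
    For readers who prefer a script, the same first-column key can be approximated in Python. This is a sketch under assumptions, not a port of the regex: it strips combining marks after Unicode NFD decomposition instead of using the `[[=A=]]` equivalence classes (which Python's `re` module does not support), so it also folds accented consonants such as “ç”. Letters without a decomposition (Ø, Æ, Œ) are only lowercased. The names `fold` and `decorate` are hypothetical:

```python
import unicodedata

def fold(word: str) -> str:
    """Lowercase a word and drop its diacritics (Côtière -> cotiere)."""
    decomposed = unicodedata.normalize("NFD", word)
    # Combining marks (accents) have a non-zero combining class.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

def decorate(word: str) -> str:
    """Build the two-column line: folded key, five tabs, original word."""
    return f"{fold(word)}\t\t\t\t\t{word}"

for w in ["Côtière", "ångström", "Aigüe", "à-propos"]:
    print(decorate(w))
```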
    

    Now, after a new Edit > Line Operations > Sort Lines Lexicographically Ascending action, we get this list :

    a					A
    a					à
    a-propos					à-propos
    aieul					aïeul
    aigue					Aigüe
    angstrom					ångström
    aprete					Âpreté
    batir					Batir
    baton					bâton
    canoe					Canoë
    cote					Cote
    cote					Côte
    cote					côté
    cotee					cotée
    cotelee					côtelée
    cotiere					Côtière
    escale					Escale
    ete					Été
    etude					étude
    flute					flûte
    forer					Forer
    foret					foret
    foret					forêt
    gare					gare
    gite					Gîte
    giter					giter
    gorge					Gorge
    ile					île
    ilote					Ilote
    la					La
    la					là
    odyssee					Odyssée
    offense					offense
    ote					Ôté
    oter					ôter
    ou					Où
    ou					ou
    prairie					prairie
    pre					pré
    premier					Premier
    pres					près
    reflet					Reflet
    regne					Règne
    regner					régner
    renne					renne
    

    Finally, we get the expected, well-sorted list of words by removing the first column, with this simple regex S/R :

    SEARCH ^.+\t

    REPLACE Leave EMPTY

    A
    à
    à-propos
    aïeul
    Aigüe
    ångström
    Âpreté
    Batir
    bâton
    Canoë
    Cote
    Côte
    côté
    cotée
    côtelée
    Côtière
    Escale
    Été
    étude
    flûte
    Forer
    foret
    forêt
    gare
    Gîte
    giter
    Gorge
    île
    Ilote
    La
    là
    Odyssée
    offense
    Ôté
    ôter
    Où
    ou
    prairie
    pré
    Premier
    près
    Reflet
    Règne
    régner
    renne
    

    Of course, in this first approach, I did not include the accented consonants or the Æ, Œ, æ, œ characters, but I suppose that you’ll get the general idea !

    The nice side of this method is that if the first column is identical across several rows, the sort still acts on the second field ;-)). For instance :

    cote					Cote
    cote					Côte
    cote					côté
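
    The same tie-break can be expressed without a helper column by sorting on a pair (folded word, original word). This is a hedged Python sketch rather than anything Notepad++ does internally; `sort_key` is a made-up name and the folding again relies on NFD decomposition, so it is an approximation of the regex above:

```python
import unicodedata

def sort_key(word: str) -> tuple[str, str]:
    """Primary key: lowercase word without diacritics. Tie-break: the word itself."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c)).lower()
    # NFC keeps each accented letter as one code point for the tie-break.
    return (base, unicodedata.normalize("NFC", word))

words = ["côté", "cotée", "Côte", "Cote", "là", "La", "ou", "Où"]
result = sorted(words, key=sort_key)
print(result)
```

    Words sharing the same folded key, such as Cote, Côte and côté, stay grouped together and are ordered among themselves by code point, exactly as in the three rows above.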
    

    Best Regards,

    guy038

