    Does Sort Lines Lexicographically have a bug with diacritics?

    • abdekker
      abdekker last edited by

      Here’s a test list of words:

      naval
      neither
      nèither
      never
      nëver
      nothing

      Select all lines, then use Edit > Line Operations > Sort Lines Lexicographically Ascending. The list is changed to:

      naval
      neither
      never
      nothing
      nèither
      nëver

      I expected no change since the list is already sorted correctly. I realise this is a little complicated because (to take an example) “è” has code point 232 (U+00E8, outside the ASCII range) while “e” has ASCII code 101.

      My (possibly incorrect) assumption is that most users of languages using diacritics would expect the order of letters to be:
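The reordering can be reproduced outside Notepad++. Here is a minimal sketch in Python, assuming the sort compares raw code points (which is what Python's built-in string comparison does). It illustrates the behaviour only and is not Notepad++'s actual implementation:

```python
import unicodedata

# Test words from this post, normalized to NFC so that each accented
# letter is a single precomposed code point.
words = [unicodedata.normalize("NFC", w)
         for w in ["naval", "neither", "nèither", "never", "nëver", "nothing"]]

# Python compares strings code point by code point.
by_code_point = sorted(words)

print(ord("e"), ord("\u00e8"), ord("\u00eb"))  # 101 232 235
print(by_code_point)
```

Every word containing "è" or "ë" ends up after "nothing", because both letters have a higher code point than any unaccented lowercase letter.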

      …deèéêëf…

      rather than:

      …def…èéêë…

      My suggestion would be to either:

      • add an option like “Sort Lines Lexicographically Respecting Diacritics Ascending” (someone can surely come up with a catchier title!) or
      • add an option into Settings > Preferences to define the lexicographic ordering or
      • fix the algorithm to respect diacritic ordering

      Any comments?

      • Ekopalypse
        Ekopalypse @abdekker last edited by

        @abdekker

        I’d say you analyzed it just right. How a feature request can be made is described here.

        • abdekker
          abdekker last edited by

          Thanks, have raised issue 8481:
          https://github.com/notepad-plus-plus/notepad-plus-plus/issues/8481

          • guy038
            guy038 last edited by guy038

            Hello, @abdekker, @ekopalypse and All,

            The current N++ lexicographic sort simply orders the text according to the Unicode code point of each character. Refer to the list below for all the existing code points in the latest Unicode version :

            http://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt

            And to this article which explains the format of that list
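To check which code point a given character has, a short Python sketch can print characters in the order a plain code-point sort produces. The sample characters are taken from the character list in this post, and the helper name `code_point_table` is made up for illustration:

```python
def code_point_table(chars):
    """Return (char, 'XXXX') pairs ordered by Unicode code point."""
    return [(c, f"{ord(c):04X}") for c in sorted(chars, key=ord)]

# Sample: Ă b A € a B Ø 0 (written as escapes where non-ASCII)
sample = ["\u0102", "b", "A", "\u20ac", "a", "B", "\u00d8", "0"]

for char, code in code_point_table(sample):
    print(char, code)
```

Digits come first, then uppercase ASCII, then lowercase ASCII, and only then the accented letters and symbols.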


            So, for instance, this list of characters :

            Char  Code
            ~~~~~~~~~~~~
            	  0009
                  0020
            "     0022
            “     201C
            ”     201D
            -     002D
            —     2014
            _     005F
            0     0030
            Ø     00D8
            9     0039
            A     0041
            a     0061
            Ă     0102
            B     0042
            b     0062
            þ     00FE
            β     03B2
            Б     0411
            E     0045
            e     0065
            Ě     011A
            €     20AC
            ∑     2211
            ℮     212E
            fi     FB01
            ℓ     2113
            O     004F
            o     006F
            ö     00F6
            Œ     0152
            θ     03B8
            Ѳ     0472
            T     0054
            t     0074
            τ     03C4
            ŧ     0167
            ‡     2021
            Ỳ     1EF2
            ‰     2030
            ∆     2206
            ∞     221E
            

            is alphabetically sorted as :

            Char  Code
            ~~~~~~~~~~~~~
            	  0009
                  0020
            "     0022
            -     002D
            0     0030
            9     0039
            A     0041
            B     0042
            E     0045
            O     004F
            T     0054
            _     005F
            a     0061
            b     0062
            e     0065
            o     006F
            t     0074
            Ø     00D8
            ö     00F6
            þ     00FE
            Ă     0102
            Ě     011A
            Œ     0152
            ŧ     0167
            β     03B2
            θ     03B8
            τ     03C4
            Б     0411
            Ѳ     0472
            Ỳ     1EF2
            —     2014
            “     201C
            ”     201D
            ‡     2021
            ‰     2030
            €     20AC
            ℓ     2113
            ℮     212E
            ∆     2206
            ∑     2211
            ∞     221E
            fi     FB01
            

            You may think that it would be good to have an option to use, for instance, the Unicode collation mechanism, as below :

            http://www.unicode.org/charts/collation/

            However, sorting conventions differ between countries, even among those using the same Latin alphabet. Refer to this article :

            https://en.wikipedia.org/wiki/Alphabetical_order#Language-specific_conventions

            So, in the end, the simple task of sorting letters is quite a puzzle if you want to take into account the specificities of each language and country !
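As an illustration of why one rule cannot suit everyone, here is a hedged Python sketch comparing two sort keys on four Swedish words. It assumes the usual Swedish convention that å, ä and ö are separate letters placed after z. The function names and the word list are invented for this example, and neither key is a full collation algorithm:

```python
import unicodedata

def nfc(s):
    return unicodedata.normalize("NFC", s)

def strip_accents_key(word):
    """Key that ignores diacritics: decompose, drop combining marks, lowercase."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

# Assumption: Swedish alphabet order, with å, ä, ö after z.
SWEDISH_ALPHABET = nfc("abcdefghijklmnopqrstuvwxyzåäö")
RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}

def swedish_key(word):
    """Key that ranks letters by their position in the Swedish alphabet."""
    # Characters outside the alphabet sort after all letters, by code point.
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in nfc(word).lower()]

words = [nfc(w) for w in ["zebra", "ål", "apa", "äpple"]]
print(sorted(words, key=strip_accents_key))  # å and ä are treated like a
print(sorted(words, key=swedish_key))        # å and ä come after z
```

The first ordering is what a French reader might expect, the second what a Swedish reader would expect, which is why a single "respect diacritics" option is hard to define.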


            In the meantime, here is a workaround using regular expressions :

            From this initial list, containing 45 French words :

            foret
            Forer
            forêt
            Cote
            Côte
            côtelée
            côté
            Côtière
            cotée
            là
            La
            prairie
            A
            à
            près
            pré
            Premier
            gare
            Gîte
            giter
            Gorge
            Où
            ou
            Règne
            renne
            Aigüe
            aïeul
            Ôté
            ôter
            flûte
            Batir
            bâton
            Canoë
            Été
            île
            Reflet
            régner
            Ilote
            ångström
            Âpreté
            à-propos
            Escale
            étude
            offense
            Odyssée
            

            After running the Edit > Line Operations > Sort Lines Lexicographically Ascending option, we get :

            A
            Aigüe
            Batir
            Canoë
            Cote
            Côte
            Côtière
            Escale
            Forer
            Gorge
            Gîte
            Ilote
            La
            Odyssée
            Où
            Premier
            Reflet
            Règne
            aïeul
            bâton
            cotée
            côtelée
            côté
            flûte
            foret
            forêt
            gare
            giter
            là
            offense
            ou
            prairie
            près
            pré
            renne
            régner
            Âpreté
            Été
            Ôté
            à
            à-propos
            ångström
            étude
            île
            ôter
            

            As a Frenchman, I find this ordering rather awful to look at, and it makes looking up a word really difficult !


            Now, let’s go back to our initial list and duplicate the word on each line, with this simple regex S/R :

            SEARCH (?-s).+

            REPLACE $0\t\t\t\t\t$0

            So, the list is changed into :

            foret					foret
            Forer					Forer
            forêt					forêt
            Cote					Cote
            Côte					Côte
            côtelée					côtelée
            côté					côté
            Côtière					Côtière
            cotée					cotée
            là					là
            La					La
            prairie					prairie
            A					A
            à					à
            près					près
            pré					pré
            Premier					Premier
            gare					gare
            Gîte					Gîte
            giter					giter
            Gorge					Gorge
            Où					Où
            ou					ou
            Règne					Règne
            renne					renne
            Aigüe					Aigüe
            aïeul					aïeul
            Ôté					Ôté
            ôter					ôter
            flûte					flûte
            Batir					Batir
            bâton					bâton
            Canoë					Canoë
            Été					Été
            île					île
            Reflet					Reflet
            régner					régner
            Ilote					Ilote
            ångström					ångström
            Âpreté					Âpreté
            à-propos					à-propos
            Escale					Escale
            étude					étude
            offense					offense
            Odyssée					Odyssée
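For readers who prefer a script, the duplication step can be sketched in Python. This is an equivalent under Python's `re` module, not the Notepad++ (Boost) regex itself. In Python the whole match is written `\g<0>` instead of `$0`, and `.` already excludes newlines by default, which corresponds to `(?-s)`:

```python
import re

text = "foret\nForer\nforêt"

# Whole match, five tabs, whole match again.
replacement = r"\g<0>" + "\t" * 5 + r"\g<0>"

# "." does not match "\n" by default, so ".+" matches one line at a time.
duplicated = re.sub(r".+", replacement, text)
print(duplicated)
```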
            

            Now, with the next regex S/R, we change, only in the first column :

            • Any accented vowel to its corresponding unaccented lowercase vowel

            • Any uppercase consonant to its corresponding lowercase letter

            SEARCH (?-i)(?:([[=A=]])|([[=E=]])|([[=I=]])|([[=O=]])|([[=U=]])|([[=Y=]])|([A-Z]))(?=.*\t)

            REPLACE (?1a)(?2e)(?3i)(?4o)(?5u)(?6y)(?7\l\7)

            foret					foret
            forer					Forer
            foret					forêt
            cote					Cote
            cote					Côte
            cotelee					côtelée
            cote					côté
            cotiere					Côtière
            cotee					cotée
            la					là
            la					La
            prairie					prairie
            a					A
            a					à
            pres					près
            pre					pré
            premier					Premier
            gare					gare
            gite					Gîte
            giter					giter
            gorge					Gorge
            ou					Où
            ou					ou
            regne					Règne
            renne					renne
            aigue					Aigüe
            aieul					aïeul
            ote					Ôté
            oter					ôter
            flute					flûte
            batir					Batir
            baton					bâton
            canoe					Canoë
            ete					Été
            ile					île
            reflet					Reflet
            regner					régner
            ilote					Ilote
            angstrom					ångström
            aprete					Âpreté
            a-propos					à-propos
            escale					Escale
            etude					étude
            offense					offense
            odyssee					Odyssée
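Python's `re` module has no equivalence classes like `[[=A=]]`, so a script version of this folding step needs a different technique. The sketch below swaps in Unicode decomposition: it splits each accented letter into a base letter plus combining marks, drops the marks, and lowercases the result. The function name `fold` is made up for illustration. Unlike the regex above, it also handles accented consonants such as ç, but letters without a canonical decomposition (Æ, Œ, Ø) are left as they are:

```python
import unicodedata

def fold(word):
    """Strip diacritics and lowercase: 'Côtière' -> 'cotiere'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

# Build the two-column lines: folded key, five tabs, original word.
for word in ["Côtière", "ångström", "Été", "à-propos"]:
    print(fold(word) + "\t" * 5 + word)
```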
            

            Now, after a new Edit > Line Operations > Sort Lines Lexicographically Ascending action, we get this list :

            a					A
            a					à
            a-propos					à-propos
            aieul					aïeul
            aigue					Aigüe
            angstrom					ångström
            aprete					Âpreté
            batir					Batir
            baton					bâton
            canoe					Canoë
            cote					Cote
            cote					Côte
            cote					côté
            cotee					cotée
            cotelee					côtelée
            cotiere					Côtière
            escale					Escale
            ete					Été
            etude					étude
            flute					flûte
            forer					Forer
            foret					foret
            foret					forêt
            gare					gare
            gite					Gîte
            giter					giter
            gorge					Gorge
            ile					île
            ilote					Ilote
            la					La
            la					là
            odyssee					Odyssée
            offense					offense
            ote					Ôté
            oter					ôter
            ou					Où
            ou					ou
            prairie					prairie
            pre					pré
            premier					Premier
            pres					près
            reflet					Reflet
            regne					Règne
            regner					régner
            renne					renne
            

            Finally, we delete the first column with this simple regex S/R, which gives the expected, well-sorted list of words :

            SEARCH ^.+\t

            REPLACE Leave EMPTY

            A
            à
            à-propos
            aïeul
            Aigüe
            ångström
            Âpreté
            Batir
            bâton
            Canoë
            Cote
            Côte
            côté
            cotée
            côtelée
            Côtière
            Escale
            Été
            étude
            flûte
            Forer
            foret
            forêt
            gare
            Gîte
            giter
            Gorge
            île
            Ilote
            La
            là
            Odyssée
            offense
            Ôté
            ôter
            Où
            ou
            prairie
            pré
            Premier
            près
            Reflet
            Règne
            régner
            renne
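The clean-up step can be sketched the same way in Python. Again this is an equivalent under Python's `re` module, not the Notepad++ regex engine: `(?m)` is needed so that `^` matches at the start of every line, and the greedy `.+` runs up to the last tab of each line:

```python
import re

decorated = "a\t\t\t\t\tA\na\t\t\t\t\tà\na-propos\t\t\t\t\tà-propos"

# Remove everything from the line start up to and including the last tab.
restored = re.sub(r"(?m)^.+\t", "", decorated)
print(restored)
```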
            

            Of course, in this first approach, I included neither the accented consonants nor the Æ, Œ, æ, œ characters, but I suppose that you’ll get the general idea !

            The nice side of this method is that if the first column is identical across several rows, the sort still acts on the second field ;-)). For instance :

            cote					Cote
            cote					Côte
            cote					côté
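The whole decorate, sort, undecorate sequence, including this tie-breaking, can be condensed into one sort with a two-part key. The following is a minimal Python sketch of the same idea, not a replacement for proper locale-aware collation. The names `fold` and `sort_ignoring_diacritics` are invented here, and the folding uses Unicode decomposition rather than the regex shown earlier:

```python
import unicodedata

def fold(word):
    """Strip diacritics and lowercase (the 'first column')."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def sort_ignoring_diacritics(words):
    """Sort by folded key; ties fall back to the original spelling by code point."""
    return sorted(words, key=lambda w: (fold(w), w))

words = [unicodedata.normalize("NFC", w)
         for w in ["côté", "Côte", "cotée", "Cote", "forêt", "foret"]]
print(sort_ignoring_diacritics(words))
```

The tuple `(fold(w), w)` plays the role of the two tab-separated columns: the folded key decides the main order and the original word decides between rows whose keys are equal.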
            

            Best Regards,

            guy038
