File sorting



  • I’m new here so please be gentle!!!
    Is it possible to sort a file by line length??
    That’s all!
    Thanks
    Dave



  • @Dave-Pruce

    As gently as possible…quiet now…here it comes: no (sorry), not with Notepad++ itself.

    A lot of other ways, though: think “programming”.



  • Hello, @dave-pruce, @alan-kilborn and All,

    @dave-pruce :

    Still, as gently as possible, I can whisper to you : There is a possible work-around, which only uses native N++ features ;-))

    As it’s about 1:40 a.m. in France right now, I hope to be able to post my solution tomorrow. So just be patient for a while !

    Best Regards,

    guy038



  • @guy038

    I’ll have to see the number of steps involved to see if it invalidates my original “no”. :)



  • While we’re waiting for @guy038: although it is not native to Notepad++, a Pythonscript one-liner can do the job:

    editor.setText(['\r\n','\r','\n'][editor.getEOLMode()].join(sorted(editor.getText().splitlines(),key=len)))
    

    And since it is a one-liner, one doesn’t even have to create a file for it. Just open a Pythonscript Console window (Plugins > Pythonscript > Show Console) and then find the little box that has >>> to its left in the console window and paste the above there. Press Enter to execute it for the active Notepad++ file.
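
    For readers who want to see what the one-liner does step by step, here is a minimal stand-alone sketch in plain Python 3. It does not use the Pythonscript editor object; the function name sort_text_by_line_length and the eol_mode parameter are made up for this illustration, and the EOL list simply copies the one indexed by editor.getEOLMode() in the one-liner.

```python
# Same order as the list indexed by editor.getEOLMode() in the one-liner:
# 0 = CRLF, 1 = CR, 2 = LF.
EOLS = ["\r\n", "\r", "\n"]


def sort_text_by_line_length(text, eol_mode):
    """Return text with its lines sorted by length (hypothetical helper)."""
    lines = text.splitlines()
    # sorted() is stable: lines of equal length keep their original order.
    ordered = sorted(lines, key=len)
    return EOLS[eol_mode].join(ordered)


out = sort_text_by_line_length("wxyz\ndefghijklm\nno\nabcd\npqrstuv", 2)
print(out)
```

    Because the sort is stable and only looks at the length, lines of equal length are not reordered among themselves.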

    Sadly, the bigger hurdle would be getting Pythonscript installed. :-(



  • Hi, @dave-pruce, @alan-kilborn and All,

    The work-around comes from a simple idea. Imagine these 5 lines, below !

    wxyz
    defghijklm
    no
    abcd
    pqrstuv
    

    To begin with, right justify these 5 lines. So you get :

          wxyz
    defghijklm
            no
          abcd
       pqrstuv
    

    Now, run a simple ascending alphabetic sort

            no
          abcd
          wxyz
       pqrstuv
    defghijklm
    

    Nice ! We have, automatically, all the lines sorted by line length.

    To end, you just have to get rid of the leading spaces, giving the expected text :

    no
    abcd
    wxyz
    pqrstuv
    defghijklm
    

    In addition, notice that lines of the same length are also sorted alphabetically ;-))
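
    The same idea can be sketched outside Notepad++ in a few lines of Python 3. This is only an illustration of the principle, not part of the native work-around; the function name sort_by_length_via_padding is made up, and the sketch assumes that no line starts with a space of its own.

```python
def sort_by_length_via_padding(lines):
    """Sort lines by length with the right-justify / sort / strip trick."""
    width = max((len(line) for line in lines), default=0)
    # Step 1: right justify every line to the width of the longest one.
    padded = [line.rjust(width) for line in lines]
    # Step 2: plain ascending sort. A space sorts before any letter, so
    # lines with more leading spaces (the shorter ones) come first.
    padded.sort()
    # Step 3: remove the leading spaces again.
    return [line.lstrip(" ") for line in padded]


sample = ["wxyz", "defghijklm", "no", "abcd", "pqrstuv"]
result = sort_by_length_via_padding(sample)
print(result)
```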


    OK ! Let’s use a real list. From the link below :

    https://en.wikipedia.org/wiki/List_of_rivers_by_length

    After some reformatting, I got, for instance, the English list of 243 world rivers below, pasted into a new N++ tab :

    Nile
    White Nile
    Kagera
    Nyabarongo
    Mwogo
    Rukarara
    Amazon
    Ucayali
    Tambo
    Ene
    Mantaro
    Yangtze
    Mississippi
    Missouri
    Jefferson
    Beaverhead
    Red Rock
    Hell Roaring
    Yenisei
    Angara
    Selenge
    Ider
    Yellow River
    Ob
    Irtysh
    Río de la Plata
    Paraná
    Congo
    Chambeshi
    Amur
    Argun
    Kherlen
    Lena
    Mekong
    Mackenzie
    Slave
    Peace
    Finlay
    Niger
    Brahmaputra
    Tsangpo
    Murray
    Darling
    Culgoa
    Balonne
    Condamine
    Tocantins
    Araguaia
    Volga
    Indus
    Sênggê Zangbo
    Shatt al-Arab
    Euphrates
    Murat
    Madeira
    Mamoré
    Caine
    Rocha
    Purús
    Yukon
    São Francisco
    Syr Darya
    Naryn
    Salween
    Saint Lawrence
    Niagara
    Detroit
    Saint Clair
    Saint Marys
    Saint Louis
    North
    Nizhnyaya Tunguska
    Danube
    Breg
    Zambezi
    Vilyuy
    Araguaia
    Ganges
    Hooghly
    Padma
    Amu Darya
    Panj
    Japurá
    Nelson
    Saskatchewan
    Paraguay
    Kolyma
    Pilcomayo
    Biya
    Katun
    Ishim
    Juruá
    Ural
    Arkansas
    Colorado
    Olenyok
    Dnieper
    Aldan
    Ubangi
    Uele
    Negro
    Columbia
    Zhujiang
    Red
    Ayeyarwady
    Kasai
    Ohio
    Allegheny
    Orinoco
    Tarim
    Xingu
    Orange
    Salado
    Vitim
    Tigris
    Songhua
    Tapajós
    Don
    Podkamennaya Tunguska
    Pechora
    Kama
    Limpopo
    Chulym
    Guaporé
    Indigirka
    Snake
    Senegal
    Uruguay
    Blue Nile
    Churchill
    Khatanga
    Okavango
    Volta
    Beni
    Platte
    Tobol
    Alazeya
    Jubba
    Shebelle
    Içá
    Magdalena
    Han
    Kura
    Oka
    Murray
    Guaviare
    Pecos
    Murrumbidgee
    Yenisei
    Godavari
    Colorado
    Río Grande
    Belaya
    Cooper
    Barcoo
    Marañón
    Dniester
    Benue
    Ili
    Warburton
    Georgina
    Sutlej
    Yamuna
    Vyatka
    Fraser
    Brazos
    Liao
    Lachlan
    Yalong
    Iguaçu
    Olyokma
    Northern Dvina
    Sukhona
    Krishna
    Iriri
    Narmada
    Lomami
    Ottawa
    Lerma
    Grande de Santiago
    Elbe
    Vltava
    Zeya
    Juruena
    Rhine
    Athabasca
    Canadian
    North Saskatchewan
    Vistula
    Bug
    Vaal
    Shire
    Ogooué
    Nen
    Kızılırmak
    Markha
    Green
    Milk
    Chindwin
    Sankuru
    Wu
    Red
    James
    Kapuas
    Desna
    Helmand
    Madre de Dios
    Tietê
    Vychegda
    Sepik
    Cimarron
    Anadyr
    Paraíba do Sul
    Jialing
    Liard
    Cumberland
    White
    Huallaga
    Kwango
    Draa
    Gambia
    Tyung
    Chenab
    Yellowstone
    Ghaghara
    Huai
    Aras
    Chu
    Seversky Donets
    Bermejo
    Fly
    Kuskokwim
    Tennessee
    Oder
    Warta
    Aruwimi
    Daugava
    Gila
    Loire
    Essequibo
    Khoper
    Tagus
    Flinders
    

    Ironically, we’re going to sort them according to the length of their names, not according to the length of the rivers themselves ;-))


    First, we’ll roughly estimate the maximum length of the listed names, with the generic regex (?-s)^.{N,}, which matches any line of at least N characters :

    • Open the Replace window ( Ctrl + H )

    • Select the Regular expression search mode

      • (?-s)^.{30,} and a click on the Count button => 0 matches

      • (?-s)^.{25,} and a click on the Count button => 0 matches

      • (?-s)^.{20,} and a click on the Count button => 1 match

    => The maximum length is between 20 and 25. So, we’ll rely on the upper boundary, 25, in the subsequent regexes.
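
    For comparison, the Count test can be imitated with Python 3's re module. This is a rough sketch under assumptions: the helper name count_lines_at_least is invented, the sample holds only four of the names, and lines are separated by "\n". Python's dot already excludes the newline by default, which plays the role of (?-s).

```python
import re


def count_lines_at_least(text, n):
    """Count the lines holding n or more characters (hypothetical helper)."""
    pattern = re.compile(r"^.{%d,}" % n, re.MULTILINE)
    return len(pattern.findall(text))


names = ["Nile", "Podkamennaya Tunguska", "Ob", "North Saskatchewan"]
text = "\n".join(names)
for n in (30, 25, 20):
    print(n, count_lines_at_least(text, n))

# In a script, the exact maximum is also directly available.
longest = max(len(name) for name in names)
print(longest)
```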


    For all the subsequent regex S/R :

    • Tick the Wrap around option

    • Click on the Replace All button, exclusively, to process each S/R

    We’ll begin by adding 25 space chars at the end of each line of the list :

    SEARCH (?-s)^.+

    REPLACE $0 ( and type in 25 space characters, right after $0, in the Replace zone )

    Note : If you need more space chars at the end of lines for another list, just re-run this S/R to get 50, 75, 100 spaces, and so on !


    Then, use the following regex S/R, in order to delete any character located after the 25th column :

    SEARCH (?-s)^.{25}\K.+

    REPLACE Leave EMPTY


    Now, we’re going to right justify all these names, with the regex S/R :

    SEARCH (?-s)^(.+?)(\x20{2,})$

    REPLACE \2\1

    You should get the following text ( I only show the beginning and end of the list, in order to limit my post length ! ) :

                         Nile
                   White Nile
                       Kagera
                   Nyabarongo
                        Mwogo
                     Rukarara
                       Amazon
                      Ucayali
                        Tambo
                          Ene
    .........................
    .........................
    .........................
                         Oder
                        Warta
                      Aruwimi
                      Daugava
                         Gila
                        Loire
                    Essequibo
                       Khoper
                        Tagus
                     Flinders
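
    The three S/R steps that produce this right-justified text can be replayed with Python 3's re module, as a rough sketch. The names WIDTH and right_justify are made up for the illustration. One assumption to flag: Python's re has no \K, so the truncation step captures the first 25 characters and writes them back instead.

```python
import re

WIDTH = 25  # upper bound on the line length, as estimated earlier


def right_justify(text, width=WIDTH):
    """Pad, truncate and right justify each line (hypothetical helper)."""
    # Step 1: append `width` spaces to every non-empty line.
    text = re.sub(r"(?m)^.+", lambda m: m.group(0) + " " * width, text)
    # Step 2: keep only the first `width` characters of each line.
    # No \K in Python's re, so the kept part is captured and re-inserted.
    text = re.sub(r"(?m)^(.{%d}).+" % width, r"\1", text)
    # Step 3: move the trailing run of spaces in front of the name.
    text = re.sub(r"(?m)^(.+?)(\x20{2,})$", r"\2\1", text)
    return text


justified = right_justify("Nile\nWhite Nile\nPodkamennaya Tunguska")
print(justified)
```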
    

    Now, we perform the usual alphabetic sort ( Edit > Line Operations > Sort Lines Lexicographically Ascending ) and we get :

                           Ob
                           Wu
                          Bug
                          Chu
                          Don
                          Ene
                          Fly
                          Han
                          Ili
                          Içá
                          Nen
                          Oka
                          Red
                          Red
                         Amur
                         Aras
                         Beni
    .........................
    .........................
    .........................
                 Hell Roaring
                 Murrumbidgee
                 Saskatchewan
                 Yellow River
                Madre de Dios
                Shatt al-Arab
                São Francisco
                Sênggê Zangbo
               Northern Dvina
               Paraíba do Sul
               Saint Lawrence
              Río de la Plata
              Seversky Donets
           Grande de Santiago
           Nizhnyaya Tunguska
           North Saskatchewan
        Podkamennaya Tunguska
    

    To end, we get rid of all the leading spaces, with :

    SEARCH ^\x20+

    REPLACE Leave EMPTY

    and we get our expected list :

    Ob
    Wu
    Bug
    Chu
    Don
    Ene
    Fly
    Han
    Ili
    Içá
    Nen
    Oka
    Red
    Red
    Amur
    Aras
    Beni
    Biya
    Breg
    Draa
    ..............
    ..............
    ..............
    Saint Louis
    Saint Marys
    Yellowstone
    Hell Roaring
    Murrumbidgee
    Saskatchewan
    Yellow River
    Madre de Dios
    Shatt al-Arab
    São Francisco
    Sênggê Zangbo
    Northern Dvina
    Paraíba do Sul
    Saint Lawrence
    Río de la Plata
    Seversky Donets
    Grande de Santiago
    Nizhnyaya Tunguska
    North Saskatchewan
    Podkamennaya Tunguska
    

    Note that this kind of text manipulation could certainly be programmed more elegantly with a Python or Lua script ;-)) Unfortunately, my skills in that matter are quite poor :-((

    However, I’m sure that some gurus, such as @alan-kilborn, @ekopalypse, @peterjones or dail, will probably be able to give you a script solution that, of course, will require you to install the Python or Lua interpreter !

    Hey, guys, it’s not a competition, OK !

    Best Regards,

    guy038



  • @guy038 said:

    Hey, guys, it’s not a competition, OK !

    Haha. No, definitely not. A support forum is about giving posters options for solving problems where there is not a very clear answer. It seems we’ve done that so far in this thread! :)

    BTW, that was what I anticipated: A lot of manual steps. :)



  • Hi, @dave-pruce, @alan-kilborn and All,

    My previous list of rivers contained 5 duplicate names :

    Red, Murray, Yenisei, Araguaia and Colorado

    But this is not important, regarding our problem, anyway !

    As you can see, @dave-pruce, the Python solution, from Alan, is neater ! Isn’t it ?


    Now, Alan, I’ve just tested your one-line script and, to my mind, there are two problems :

    • Inside a section of river names of the same length, the names are not sorted alphabetically !

    • Secondly, some names, containing accentuated characters, such as the Içá river, are located outside their section, as shown below :

    Snake
    Volta
    Tobol
    Jubba
    Içá
    Pecos
    Benue
    Iriri
    Lerma
    

    Cheers,

    guy038



  • @guy038 said:

    names are not sorted alphabetically

    This is outside the scope of the originally stated problem! :)

    containing accentuated characters…are located outside their section

    The Python len function is apparently simple-minded in this case (for these UTF-8 strings it returns a byte count, so each multibyte character counts as two or more).
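
    A small Python 3 sketch of that difference, using the Içá name from the list. It assumes the name is stored with precomposed characters (U+00E7 and U+00E1) and encoded as UTF-8.

```python
# "Içá" written with explicit code points: I, U+00E7 (ç), U+00E1 (á).
name = "I\u00e7\u00e1"

# Number of characters (code points) in the name.
char_count = len(name)

# Number of bytes once the name is encoded as UTF-8: ç and á take two
# bytes each, which is what a byte-oriented len reports.
byte_count = len(name.encode("utf-8"))

print(char_count, byte_count)
```

    A byte count of 5 is consistent with Içá landing among the five-letter names such as Snake and Volta.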



  • @Alan-Kilborn said:

    The Python len function is apparently simple-minded in this case

    Perhaps this new one-liner is better, for the case where the OP has Unicode data:

    editor.setText(['\r\n','\r','\n'][editor.getEOLMode()].join(sorted(editor.getText().splitlines(),key=lambda x:len(unicode(x,'utf-8')))))
    

    Of course, it is still a big assumption that the OP is using (or is willing to use) Pythonscript! ;)
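
    If the alphabetical order inside each length group also matters, one possible variation is to sort on a (length, text) pair. The sketch below is plain Python 3, where strings are already Unicode, so it is not a drop-in replacement for the Pythonscript one-liner; the function name sort_by_length_then_alpha is invented, and ties are broken by code-point order rather than by any locale-aware collation.

```python
def sort_by_length_then_alpha(lines):
    """Sort by character count, then by code-point order (hypothetical)."""
    return sorted(lines, key=lambda line: (len(line), line))


# "I\u00e7\u00e1" is the Içá river, written with explicit code points.
rivers = ["Snake", "Volta", "I\u00e7\u00e1", "Ob", "Pecos", "Han", "Wu"]
ordered = sort_by_length_then_alpha(rivers)
print(ordered)
```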



  • @dave-pruce, @alan-kilborn,

    Yes, your new attempt, Alan, is the solution, when working with UTF-8 encoded files, which may contain multi-byte encoded chars !

    As for me, I was thinking about the opposite solution : to convert UTF-8 files to ANSI. However, when using this solution, some characters may result in question marks or may be changed into an approximate character, because they do not belong to the corresponding ANSI table of 256 characters !

    For instance, in my previous list of rivers, the Turkish Kızılırmak river, containing the Latin lowercase dotless letter ı ( of code-point \x{0131} ), is changed into the approximate name Kizilirmak, after conversion to ANSI !
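
    The loss can be reproduced in Python 3 as a rough sketch, assuming that "ANSI" means the Windows-1252 code page (the real code page depends on the Windows locale). Python's errors="replace" handler produces question marks; the approximate i reported above is a different, best-fit style substitution that this sketch does not reproduce.

```python
# Kızılırmak, written with three dotless i characters (U+0131).
name = "K\u0131z\u0131l\u0131rmak"

# cp1252 has no slot for U+0131, so a strict encode fails ...
try:
    name.encode("cp1252")
    fits = True
except UnicodeEncodeError:
    fits = False

# ... and a lossy encode substitutes a question mark for each one.
lossy = name.encode("cp1252", errors="replace").decode("cp1252")
print(fits, lossy)
```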

    Anyway, we just did our best to solve the OP’s problem ;-))

    BR

    guy038

