File sorting
-
I’m new here so please be gentle!!!
Is it possible to sort a file by line length??
That’s all!
Thanks
Dave
-
As gently as possible…quiet now…here it comes: no (sorry), not with Notepad++ itself.
A lot of other ways, though: think “programming”.
-
Hello, @dave-pruce, @alan-kilborn and All,
Still, as gently as possible, I can whisper to you: there is a possible work-around, which only uses native N++ features ;-))
As it’s about 1.40 a.m. in France right now, I hope to be able to post my solution tomorrow. So just be patient for a while!
Best Regards,
guy038
-
I’ll have to see the number of steps involved to see if it invalidates my original “no”. :)
-
While we’re waiting for @guy038, while not native to Notepad++, a Pythonscript one-liner can do the job:
editor.setText(['\r\n','\r','\n'][editor.getEOLMode()].join(sorted(editor.getText().splitlines(),key=len)))
And since it is a one-liner, one doesn’t even have to create a file for it. Just open a Pythonscript Console window (Plugins > Pythonscript > Show Console), find the little box that has >>> to its left in the console window, and paste the above there. Press Enter to execute it for the active Notepad++ file.
Sadly, the bigger hurdle would be getting Pythonscript installed. :-(
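For readers who want to see what the one-liner does step by step, here is a minimal sketch in plain Python 3, run outside Notepad++. The function name `sort_lines_by_length` and its `eol` parameter are made up for illustration; in the one-liner, the text comes from `editor.getText()` and the line ending is picked with `editor.getEOLMode()`.

```python
def sort_lines_by_length(text, eol="\n"):
    """Return text with its lines reordered from shortest to longest."""
    lines = text.splitlines()         # split on any line ending, dropping it
    ordered = sorted(lines, key=len)  # stable sort: equal lengths keep their order
    return eol.join(ordered)          # a trailing line ending is not re-added


sample = "wxyz\ndefghijklm\nno\nabcd\npqrstuv"
print(sort_lines_by_length(sample))
# no
# wxyz
# abcd
# pqrstuv
# defghijklm
```

Because `sorted` is stable, lines of equal length stay in their original relative order (here `wxyz` stays ahead of `abcd`).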
-
Hi, @dave-pruce, @alan-kilborn and All,
The work-around comes from a simple idea. Imagine these 5 lines, below:
wxyz
defghijklm
no
abcd
pqrstuv
To begin with, right justify these 5 lines. So you get:
      wxyz
defghijklm
        no
      abcd
   pqrstuv
Now, run a simple ascending alphabetic sort:
        no
      abcd
      wxyz
   pqrstuv
defghijklm
Nice! We have, automatically, all the lines sorted by line length.
To end, you just have to get rid of the leading spaces, giving the expected text:
no
abcd
wxyz
pqrstuv
defghijklm
In addition, notice that lines of the same length are also sorted alphabetically ;-))
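The idea above can be checked with a few lines of plain Python 3. This is only a sketch: the function name `sort_by_length_via_padding` is made up, and Python's default sort (by code point) stands in for the editor's lexicographic sort.

```python
def sort_by_length_via_padding(lines):
    """Sort lines by length using the right-justify, sort, strip trick."""
    width = max(len(line) for line in lines)
    padded = [line.rjust(width) for line in lines]  # right justify with spaces
    padded.sort()                                   # a space sorts before any letter
    return [line.lstrip(" ") for line in padded]    # drop the leading spaces again


result = sort_by_length_via_padding(["wxyz", "defghijklm", "no", "abcd", "pqrstuv"])
print(result)  # ['no', 'abcd', 'wxyz', 'pqrstuv', 'defghijklm']
```

One caveat of the trick itself: a line that already starts with spaces would lose them in the last step.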
OK! Let’s use a real list. From the link below:
https://en.wikipedia.org/wiki/List_of_rivers_by_length
I got, for instance, after some reformatting, an English world list of 243
rivers, below, pasted in a new N++ tab :Nile White Nile Kagera Nyabarongo Mwogo Rukarara Amazon Ucayali Tambo Ene Mantaro Yangtze Mississippi Missouri Jefferson Beaverhead Red Rock Hell Roaring Yenisei Angara Selenge Ider Yellow River Ob Irtysh Río de la Plata Paraná Congo Chambeshi Amur Argun Kherlen Lena Mekong Mackenzie Slave Peace Finlay Niger Brahmaputra Tsangpo Murray Darling Culgoa Balonne Condamine Tocantins Araguaia Volga Indus Sênggê Zangbo Shatt al-Arab Euphrates Murat Madeira Mamoré Caine Rocha Purús Yukon São Francisco Syr Darya Naryn Salween Saint Lawrence Niagara Detroit Saint Clair Saint Marys Saint Louis North Nizhnyaya Tunguska Danube Breg Zambezi Vilyuy Araguaia Ganges Hooghly Padma Amu Darya Panj Japurá Nelson Saskatchewan Paraguay Kolyma Pilcomayo Biya Katun Ishim Juruá Ural Arkansas Colorado Olenyok Dnieper Aldan Ubangi Uele Negro Columbia Zhujiang Red Ayeyarwady Kasai Ohio Allegheny Orinoco Tarim Xingu Orange Salado Vitim Tigris Songhua Tapajós Don Podkamennaya Tunguska Pechora Kama Limpopo Chulym Guaporé Indigirka Snake Senegal Uruguay Blue Nile Churchill Khatanga Okavango Volta Beni Platte Tobol Alazeya Jubba Shebelle Içá Magdalena Han Kura Oka Murray Guaviare Pecos Murrumbidgee Yenisei Godavari Colorado Río Grande Belaya Cooper Barcoo Marañón Dniester Benue Ili Warburton Georgina Sutlej Yamuna Vyatka Fraser Brazos Liao Lachlan Yalong Iguaçu Olyokma Northern Dvina Sukhona Krishna Iriri Narmada Lomami Ottawa Lerma Grande de Santiago Elbe Vltava Zeya Juruena Rhine Athabasca Canadian North Saskatchewan Vistula Bug Vaal Shire Ogooué Nen Kızılırmak Markha Green Milk Chindwin Sankuru Wu Red James Kapuas Desna Helmand Madre de Dios Tietê Vychegda Sepik Cimarron Anadyr Paraíba do Sul Jialing Liard Cumberland White Huallaga Kwango Draa Gambia Tyung Chenab Yellowstone Ghaghara Huai Aras Chu Seversky Donets Bermejo Fly Kuskokwim Tennessee Oder Warta Aruwimi Daugava Gila Loire Essequibo Khoper Tagus Flinders
Ironically, we’re going to classify them, according to the length of their name and not according to their length ;-))
First, we’ll roughly estimate the maximum length of the listed names, with the generic regex (?-s)^.{N,}
- Open the Replace window (Ctrl + H)
- Select the Regular expression search mode
- (?-s)^.{30,} and a click on the Count button => 0 matches
- (?-s)^.{25,} and a click on the Count button => 0 matches
- (?-s)^.{20,} and a click on the Count button => 1 match
=> The maximum length is between 20 and 25. So, we’ll rely on the upper boundary 25 in the subsequent regexes.
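The counting step can be imitated in plain Python 3 with the `re` module. This is a sketch, not part of the Notepad++ procedure: `count_lines_at_least` is a made-up helper, and the sample is only a handful of names from the list. In Python, `.` does not match a newline by default, which plays the role of `(?-s)`; the sample uses `\n` line endings only.

```python
import re


def count_lines_at_least(text, n):
    """Count the lines holding at least n characters, like the Count button."""
    pattern = re.compile(r"^.{%d,}" % n, re.MULTILINE)
    return len(pattern.findall(text))


sample = "\n".join(
    ["Nile", "White Nile", "Ob", "North Saskatchewan", "Podkamennaya Tunguska"]
)
for n in (30, 25, 20):
    print(n, count_lines_at_least(sample, n))
# 30 0
# 25 0
# 20 1
```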
For all the subsequent regex S/R:
- Tick the Wrap around option
- Click on the Replace All button, exclusively, to process each S/R
We’ll begin by adding 25 space chars at the end of each line of the list:
SEARCH (?-s)^.+
REPLACE $0 (and type in 25 space characters, right after $0, in the Replace zone)
Note: In case you need, for another list, additional space chars at the end of lines, just re-run this S/R to get 50, 75, 100 spaces, and so on!
Then, use the following regex S/R, in order to delete any character located after column 25:
SEARCH (?-s)^.{25}\K.+
REPLACE Leave EMPTY
Now, we’re going to right justify all these names, with the regex S/R:
SEARCH (?-s)^(.+?)(\x20{2,})$
REPLACE \2\1
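The three S/R above (pad, truncate, right justify) can be replayed in plain Python 3 to see what each one does. This is a sketch under assumptions: `right_justify` and `WIDTH` are made-up names, `(?-s)` is left out because it is already the default behaviour of Python's `re`, and since `re` has no `\K`, the truncation step uses a capture group instead.

```python
import re

WIDTH = 25  # upper boundary found with the Count button


def right_justify(text, width=WIDTH):
    """Replay the pad / truncate / swap steps on every non-empty line."""
    # Step 1: append `width` spaces to each non-empty line.
    text = re.sub(r"^.+", lambda m: m.group(0) + " " * width, text, flags=re.M)
    # Step 2: keep the first `width` characters (capture group instead of \K).
    text = re.sub(r"^(.{%d}).+" % width, r"\1", text, flags=re.M)
    # Step 3: move the trailing run of 2+ spaces in front of the name.
    text = re.sub(r"^(.+?)(\x20{2,})$", r"\2\1", text, flags=re.M)
    return text


justified = right_justify("Nile\nWhite Nile\nPodkamennaya Tunguska")
for line in justified.split("\n"):
    print(repr(line))
```

As with the original regex, a line left with fewer than two trailing spaces after truncation is not touched by the third step.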
You should get the following text (I simply put the beginning and end of the list, in order to limit my post length!):
                     Nile
               White Nile
                   Kagera
               Nyabarongo
                    Mwogo
                 Rukarara
                   Amazon
                  Ucayali
                    Tambo
                      Ene
.........................
.........................
.........................
                     Oder
                    Warta
                  Aruwimi
                  Daugava
                     Gila
                    Loire
                Essequibo
                   Khoper
                    Tagus
                 Flinders
Now, we perform the usual alphabetic sort (Edit > Line Operations > Sort Lines Lexicographically Ascending) and we get:
                       Ob
                       Wu
                      Bug
                      Chu
                      Don
                      Ene
                      Fly
                      Han
                      Ili
                      Içá
                      Nen
                      Oka
                      Red
                      Red
                     Amur
                     Aras
                     Beni
.........................
.........................
.........................
             Hell Roaring
             Murrumbidgee
             Saskatchewan
             Yellow River
            Madre de Dios
            Shatt al-Arab
            São Francisco
            Sênggê Zangbo
           Northern Dvina
           Paraíba do Sul
           Saint Lawrence
          Río de la Plata
          Seversky Donets
       Grande de Santiago
       Nizhnyaya Tunguska
       North Saskatchewan
    Podkamennaya Tunguska
To end, we get rid of all the leading spaces, with:
SEARCH ^\x20+
REPLACE Leave EMPTY
and we get our expected list :
Ob
Wu
Bug
Chu
Don
Ene
Fly
Han
Ili
Içá
Nen
Oka
Red
Red
Amur
Aras
Beni
Biya
Breg
Draa
..............
..............
..............
Saint Louis
Saint Marys
Yellowstone
Hell Roaring
Murrumbidgee
Saskatchewan
Yellow River
Madre de Dios
Shatt al-Arab
São Francisco
Sênggê Zangbo
Northern Dvina
Paraíba do Sul
Saint Lawrence
Río de la Plata
Seversky Donets
Grande de Santiago
Nizhnyaya Tunguska
North Saskatchewan
Podkamennaya Tunguska
Note that this kind of text manipulation could certainly be programmed, in a more elegant way, with a Python or Lua script ;-)) Unfortunately, my skills in that matter are quite poor :-((
However, I’m sure that some gurus, such as @alan-kilborn, @ekopalypse, @peterjones or dail, will probably be able to give you a script solution that, of course, will require you to install the Python or Lua interpreter!
Hey, guys, it’s not a competition, OK !
Best Regards,
guy038
-
-
@guy038 said:
Hey, guys, it’s not a competition, OK !
Haha. No, definitely not. A support forum is about giving posters options for solving problems where there is not a very clear answer. It seems we’ve done that so far in this thread! :)
BTW, that was what I anticipated: A lot of manual steps. :)
-
Hi, @dave-pruce, @alan-kilborn and All,
My previous list of rivers contained 5 duplicate names: Red, Murray, Yenisei, Araguaia and Colorado.
But this is not important, regarding our problem, anyway!
As you can see, @dave-pruce, the Python solution, from Alan, is neater! Isn’t it?
Now, Alan, I’ve just tested your one-line script and, to my mind, there are two problems:
- Inside a section of river names of the same length, the names are not sorted alphabetically!
- Secondly, some names containing accented characters, for instance the Içá river, are located outside their section, as shown below:
Snake
Volta
Tobol
Jubba
Içá
Pecos
Benue
Iriri
Lerma
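On the first point, one possible way to get names of the same length sorted alphabetically is a two-part sort key. This is a plain Python 3 sketch, not the script from the thread; `sort_by_length_then_alpha` is a made-up name, and "alphabetically" here means plain code point order, not a locale-aware collation.

```python
def sort_by_length_then_alpha(lines):
    """Sort by length first, then by the text itself within each length."""
    # Tuples compare element by element: length first, then the string.
    return sorted(lines, key=lambda line: (len(line), line))


names = ["Snake", "Volta", "Tobol", "Jubba", "Pecos", "Benue", "Ob", "Wu", "Don", "Bug"]
print(sort_by_length_then_alpha(names))
# ['Ob', 'Wu', 'Bug', 'Don', 'Benue', 'Jubba', 'Pecos', 'Snake', 'Tobol', 'Volta']
```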
Cheers,
guy038
-
-
@guy038 said:
names are not sorted alphabetically
This is outside the scope of the originally stated problem! :)
containing accented characters…are located outside their section
The Python len function is apparently simple-minded in this case (using a simple byte count for the length of these strings containing multibyte characters).
-
@Alan-Kilborn said:
The Python len function is apparently simple-minded in this case
Perhaps this new one-liner is better, for the case where the OP has Unicode data:
editor.setText(['\r\n','\r','\n'][editor.getEOLMode()].join(sorted(editor.getText().splitlines(),key=lambda x:len(unicode(x,'utf-8')))))
Of course, still a big assumption that the OP is using (or is willing to use) Pythonscript! ;)
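The difference between the two one-liners comes down to counting bytes versus counting characters. Here is a small plain Python 3 sketch of that difference (Python 3 is an assumption here; the one-liners above are Python 2, where `unicode(x,'utf-8')` does the decoding). The river names are taken from the list, written with escapes so the code points are unambiguous.

```python
names = ["Snake", "Ob", "I\u00e7\u00e1", "Amur"]  # "I\u00e7\u00e1" is Içá
raw = [n.encode("utf-8") for n in names]          # UTF-8 byte strings

print(len("I\u00e7\u00e1"))                  # 3 characters
print(len("I\u00e7\u00e1".encode("utf-8")))  # 5 bytes: ç and á take 2 bytes each

# Sorting on the byte count puts Içá among the 5-letter names ...
by_bytes = [b.decode("utf-8") for b in sorted(raw, key=len)]
# ... while decoding first sorts it with the 3-letter names.
by_chars = [b.decode("utf-8") for b in sorted(raw, key=lambda b: len(b.decode("utf-8")))]

print(by_bytes)  # ['Ob', 'Amur', 'Snake', 'Içá']
print(by_chars)  # ['Ob', 'Içá', 'Amur', 'Snake']
```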
-
Yes, your new attempt, Alan, is the solution when working with UTF-8 encoded files, which may contain multi-byte encoded chars!
As for me, I was thinking about the opposite solution: converting UTF-8 files to ANSI. However, when using this solution, some characters may turn into question marks or be replaced by an approximate character, because they do not belong to the corresponding ANSI table of 256 characters!
For instance, in my previous list of rivers, the Turkish Kızılırmak river, containing the Latin lowercase dotless letter ı (code point \x{0131}), is changed into the approximate name Kizilirmak after conversion to ANSI!
Anyway, we just did our best to solve the OP’s problem ;-))
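To illustrate the ANSI caveat, here is a plain Python 3 sketch. It assumes that "ANSI" means the Windows-1252 code page (the real one depends on the system locale), and `to_ansi` is a made-up helper. Python's codec has no approximate mapping, so it shows question marks where the conversion described above gave a plain i.

```python
def to_ansi(text, codepage="cp1252"):
    """Encode to a single-byte code page; unknown characters become '?'."""
    return text.encode(codepage, errors="replace")


river = "K\u0131z\u0131l\u0131rmak"  # Kızılırmak, with U+0131 (dotless i)
print(to_ansi(river))                # b'K?z?l?rmak'

# Characters that do exist in the code page take one byte each,
# so byte length and character length agree again.
print(len(to_ansi("I\u00e7\u00e1")))  # 3
```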
BR
guy038