sort file removing duplicates possible?
-
I can envision the following specification for a general purpose script (goes beyond what @patrickdrd has asked for):
- sort lines case sensitive, keep duplicate lines
- sort lines case sensitive, remove duplicate lines
- sort lines case sensitive, keep duplicate lines, reverse order on the sort
- sort lines case sensitive, remove duplicate lines, reverse order on the sort
- sort lines case insensitive, keep duplicate lines
- sort lines case insensitive, remove duplicate lines
- sort lines case insensitive, keep duplicate lines, reverse order on the sort
- sort lines case insensitive, remove duplicate lines, reverse order on the sort
- no sort, remove duplicate lines (case sensitive)
- no sort, remove duplicate lines (case insensitive)
A selection active when invoking should define:
- lines to be affected (only those touched by selection vertically); act on all lines if no selection
- columns to be used as the sort key (if rectangular selection use the selected columns as the sort key; if zero-width rect selection consider the key to start at the selection column out to the end of the variable-length lines)
Probably I’ve forgotten something important to this “sort” of thing…
:-D
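A rough sketch of how those option combinations could map onto one function, in plain Python. The name `process_lines` and its parameters are made up for illustration only, and the selection / column-key handling from the list above is left out on purpose:

```python
def process_lines(lines, sort=True, ignore_case=False,
                  remove_duplicates=False, reverse=False):
    """Return a new list of lines (hypothetical helper, not the real script)."""
    # The comparison key decides what "equal" means: exact text or folded case.
    key = (lambda s: s.lower()) if ignore_case else (lambda s: s)
    result = list(lines)
    if remove_duplicates:
        seen = set()
        unique = []
        for line in result:
            k = key(line)
            if k not in seen:          # keep the first occurrence only
                seen.add(k)
                unique.append(line)
        result = unique
    if sort:
        result = sorted(result, key=key, reverse=reverse)
    return result


lines = ["b", "A", "a", "B", "a"]
print(process_lines(lines))                                    # sort, keep duplicates
print(process_lines(lines, remove_duplicates=True))            # sort, remove duplicates
print(process_lines(lines, sort=False, ignore_case=True,
                    remove_duplicates=True))                   # no sort, case insensitive
```

Duplicates are dropped before sorting here, so the "no sort" variants fall out of the same code path.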
-
@Claudia-Frank said:
@Patrick
sorry, don’t know the term “accent insensitive”, what does it mean?
For example that è is the same as e?
Can you provide example data (just need a couple of lines) to see if it is working correctly?
The speed test I will do with the easylist text.
Cheers
Claudia
yes, exactly that
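To make "accent insensitive" concrete, here is a minimal sketch that treats è like e when comparing lines. It uses the standard `unicodedata` module as a stand-in, so it is only an assumption about how such a key could look; the helper name `strip_accents` is made up:

```python
import unicodedata


def strip_accents(text):
    """Return text with combining accent marks removed (hypothetical helper)."""
    # NFKD splits an accented letter into the base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return u"".join(ch for ch in decomposed if not unicodedata.combining(ch))


# u"\u00e9" is e with acute accent, written as an escape to avoid encoding issues.
words = [u"eb", u"\u00e9a"]
print(sorted(words))                     # plain code-point sort
print(sorted(words, key=strip_accents))  # accent insensitive sort
```

With the plain sort the accented word lands after "eb", with the key it is compared as "ea" and comes first.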
-
OK - let’s see what we can do.
Cheers
Claudia
-
…the following spec…
Hey Scott! You did forget some things! How about when removing duplicates, we need the options to:
- keep one occurrence of a duplicated line (when sorting)
- keep no occurrences of a duplicated line (when sorting or not sorting)
- keep LAST occurrence of a duplicated line (when not sorting)
- keep FIRST occurrence of a duplicated line (when not sorting)
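Just to pin down what these four options would mean, a small sketch in plain Python. The function name `remove_duplicates` and the `keep` values are made up for illustration; "keep one occurrence when sorting" is the same as `keep="first"` followed by a sort:

```python
from collections import Counter


def remove_duplicates(lines, keep="first"):
    """Drop duplicated lines, keeping the first, the last or no occurrence."""
    if keep not in ("first", "last", "none"):
        raise ValueError("keep must be 'first', 'last' or 'none'")
    if keep == "none":
        counts = Counter(lines)
        # Only lines that occur exactly once survive.
        return [line for line in lines if counts[line] == 1]
    # For "last", walk the lines backwards and restore the order at the end.
    source = lines if keep == "first" else list(reversed(lines))
    seen = set()
    result = []
    for line in source:
        if line not in seen:
            seen.add(line)
            result.append(line)
    if keep == "last":
        result.reverse()
    return result


lines = ["a", "b", "a", "c", "b"]
print(remove_duplicates(lines, keep="first"))  # ['a', 'b', 'c']
print(remove_duplicates(lines, keep="last"))   # ['a', 'c', 'b']
print(remove_duplicates(lines, keep="none"))   # ['c']
```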
-
ahh - sorry, too late - specs are already defined for version 1, you need to open a feature request for version 2 :-D
Cheers
Claudia
-
You can find the 1st version of the script here.
In order to make it run there are two requirements which need to be fulfilled, apart from the obvious one that you need to have the python script plugin installed.
1.) be sure you have either installed the full package or downloaded and unzipped the TclTk package into the NPP_INSTALL_DIR. (Latest releases)
2.) in order to make the “accent insensitive” feature work you need to install a python library called unidecode.
Unzip the .whl package into NPP_INSTALL_DIR\plugins\lib\
To check both requirements, open the python script console and run the following commands
import Tkinter
import unidecode
If you don’t see any errors - done.
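If you prefer one call that reports what is missing instead of reading tracebacks, something like this could be pasted into the console. It is only a sketch: `check_modules` is a made-up helper, and the module names assume the Python 2.7 that ships with the plugin (in Python 3 the GUI module is called `tkinter`):

```python
import importlib


def check_modules(names):
    """Return the module names that cannot be imported (hypothetical helper)."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Module names taken from the two requirements above.
print(check_modules(["Tkinter", "unidecode"]))
```

An empty list means both requirements are met.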
Usage is simple - run the script and check the different options.
What should work is
- sort/delete duplicates on whole text (aka nothing is selected)
- sort/delete duplicates on vertically selected text
not supported yet:
- sort/delete duplicates on rectangular selection
Cheers
Claudia
Btw. I spent most of the time creating this ugly window - so if someone wants to create a nicer gui - please go for it. I’m not really good at designing UIs.
-
tkinter doesn’t work:
Python 2.7.6-notepad++ r2 (default, Apr 21 2014, 19:26:54) [MSC v.1600 32 bit (Intel)]
Initialisation took 219ms
Ready.
import Tkinter
Traceback (most recent call last):
  File "<console>", line 1, in <module>
ImportError: No module named Tkinter
Traceback (most recent call last):
  File "D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\scripts\Sorter.py", line 5, in <module>
    import Tkinter as tk
ImportError: No module named Tkinter
-
I found out something else, about that easylist file,
textfx’s case insensitive sort results in 69234 entries,
which is the same as ue’s result!
-
Hello, @patrickdrd, and All,
Well, I must complete my previous post !
- Firstly, I realized that your list, below, is constantly updated ( Last modified: 02 Jun 2018 08:09 UTC )
https://easylist.to/easylist/easylist.txt
So, today, this list contains 69917 lines
- Secondly, when performing the regex S/R, we must consider, both, sensitive and insensitive search => The two search regexes :
Regex A :
(?-is)(^.+\R)\1+
Regex B :
(?i-s)(^.+\R)\1+
give, after sorting and removing duplicates with the regex, a file containing :
A : 69852 lines ( so, 65 lines deleted, in 56 matches )
B : 69817 lines ( so, 100 lines deleted, in 88 matches )
- Thirdly, we, also, must take into account the possibility that the sort, itself, is run in a sensitive or insensitive way !
Natively, Notepad++ sorts text, according to the Unicode value ( code-point ) of characters ( a kind of sensitive sort ! ) whereas some other text editors may consider these two case options, leading to different results !
For instance, using the RJ TextEd software, here are the differences with a simple list of three-character strings ( 1 x ‘ABC’, 2 x ‘AbC’, 3 x ‘Abc’, 1 x ‘DEF’, 3 x ‘DEf’, 2 x ‘aBC’, 3 x ‘aBc’, 3 x ‘dEF’ and 3 x ‘def’ )
                •-----------------------•---------------------------•
                |    with Notepad++     |      with RJ TextEd       |
•---------------•-----------------------•---------------------------•
|  Before Sort  |  After UNICODE Sort   |   After SENSITIVE Sort    |
•---------------•-----------------------•---------------------------•
|      Abc      |          ABC          |            AbC            |
|      Abc      |          AbC          |            aBC            |
|      DEf      |          AbC          |            aBc            |
|      AbC      |          Abc          |            Abc            |
|      aBc      |          Abc          |            ABC            |
|      dEF      |          Abc          |            aBC            |
|      aBc      |          DEF          |            Abc            |
|      DEf      |          DEf          |            AbC            |
|      aBC      |          DEf          |            aBc            |
|      def      |          DEf          |            aBc            |
|      AbC      |          aBC          |            Abc            |
|      def      |          aBC          |            DEf            |
|      def      |          aBc          |            dEF            |
|      ABC      |          aBc          |            dEF            |
|      Abc      |          aBc          |            DEf            |
|      DEF      |          dEF          |            dEF            |
|      aBc      |          dEF          |            DEf            |
|      dEF      |          dEF          |            def            |
|      dEF      |          def          |            DEF            |
|      DEf      |          def          |            def            |
|      aBC      |          def          |            def            |
•---------------•-----------------------•---------------------------•
So, it’s easy to understand that removing consecutive duplicates, after the sort, with the regexes above, will, necessarily, give totally different results, depending on the software used :-(
- Fourthly, sort may give different results, after being run several times, one after another. For instance, with RJ TextEd, running 6 times the insensitive sort on the 3-character list above, I was left with 3 sets of data ( 4 times, identical to the sensitive sort, and two other lists ) !! Luckily, as for Notepad++, its Unicode sort always gives identical results :-))
That’s why, @patrickdrd, it’s very difficult, finally, to compare results between different software, as each piece has its own behavior !
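The interaction between the sort order and the two regexes can be reproduced outside any editor with a few lines of Python. This is only an illustration: `itertools.groupby` stands in for the regexes A and B ( it keeps the first line of each run of consecutive equal lines ), and the helper name `drop_consecutive_duplicates` is made up. It does not model what RJ TextEd does internally.

```python
from itertools import groupby


def drop_consecutive_duplicates(lines, ignore_case=False):
    """Keep the first line of every run of consecutive equal lines."""
    key = (lambda s: s.lower()) if ignore_case else None
    return [next(group) for _, group in groupby(lines, key=key)]


# The 21 three-character strings used in the tables of this post.
sample = (["ABC"] + ["AbC"] * 2 + ["Abc"] * 3 + ["DEF"] + ["DEf"] * 3 +
          ["aBC"] * 2 + ["aBc"] * 3 + ["dEF"] * 3 + ["def"] * 3)

by_code_point = sorted(sample)                          # like the Unicode sort
by_folded_case = sorted(sample, key=lambda s: s.lower())  # a case insensitive sort

print(drop_consecutive_duplicates(by_code_point))                     # 9 lines
print(drop_consecutive_duplicates(by_code_point, ignore_case=True))   # 4 lines
print(drop_consecutive_duplicates(by_folded_case, ignore_case=True))  # 2 lines
```

So the same insensitive duplicate removal leaves 4 lines after a code-point sort, but only 2 lines after a case insensitive sort.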
Cheers,
guy038
P.S. :
Here are the results of my tests :
1) With Notepad++ and RJ TextEd, using sensitive sort :
•---------------------------------------------------------------------------------------------------•
|                                      with Notepad++ Features                                      |
•---------------•-------------------------•---------------------------•-----------------------------•
|  Before Sort  |  After Sensitive Sort   |  After Sensitive Regex +  |  After INsensitive Regex +  |
|               |                         |  Suppression Duplicates   |   Suppression Duplicates    |
•---------------•-------------------------•---------------------------•-----------------------------•
|      Abc      |           ABC           |            ABC            |             ABC             |
|      Abc      |           AbC           |            AbC            |             DEF             |
|      DEf      |           AbC           |            Abc            |             aBC             |
|      AbC      |           Abc           |            DEF            |             dEF             |
|      aBc      |           Abc           |            DEf            |                             |
|      dEF      |           Abc           |            aBC            |                             |
|      aBc      |           DEF           |            aBc            |                             |
|      DEf      |           DEf           |            dEF            |                             |
|      aBC      |           DEf           |            def            |                             |
|      def      |           DEf           |                           |                             |
|      AbC      |           aBC           |                           |                             |
|      def      |           aBC           |                           |                             |
|      def      |           aBc           |                           |                             |
|      ABC      |           aBc           |                           |                             |
|      Abc      |           aBc           |                           |                             |
|      DEF      |           dEF           |                           |                             |
|      aBc      |           dEF           |                           |                             |
|      dEF      |           dEF           |                           |                             |
|      dEF      |           def           |                           |                             |
|      DEf      |           def           |                           |                             |
|      aBC      |           def           |                           |                             |
•---------------•-------------------------•---------------------------•-----------------------------•
•---------------------------------------------------------------------------------------------------•
|                                      with RJ TextEd Features                                      |
•---------------•-------------------------•---------------------------•-----------------------------•
|  Before Sort  |  After Sensitive Sort   |  After Sensitive Regex +  |  After INsensitive Regex +  |
|               |                         |  Suppression Duplicates   |   Suppression Duplicates    |
•---------------•-------------------------•---------------------------•-----------------------------•
|      Abc      |           AbC           |            AbC            |             AbC             |
|      Abc      |           aBC           |            aBC            |             DEf             |
|      DEf      |           aBc           |            aBc            |                             |
|      AbC      |           Abc           |            Abc            |                             |
|      aBc      |           ABC           |            ABC            |                             |
|      dEF      |           aBC           |            aBC            |                             |
|      aBc      |           Abc           |            Abc            |                             |
|      DEf      |           AbC           |            AbC            |                             |
|      aBC      |           aBc           |            aBc            |                             |
|      def      |           aBc           |            Abc            |                             |
|      AbC      |           Abc           |            DEf            |                             |
|      def      |           DEf           |            dEF            |                             |
|      def      |           dEF           |            DEf            |                             |
|      ABC      |           dEF           |            dEF            |                             |
|      Abc      |           DEf           |            DEf            |                             |
|      DEF      |           dEF           |            def            |                             |
|      aBc      |           DEf           |            DEF            |                             |
|      dEF      |           def           |            def            |                             |
|      dEF      |           DEF           |                           |                             |
|      DEf      |           def           |                           |                             |
|      aBC      |           def           |                           |                             |
•---------------•-------------------------•---------------------------•-----------------------------•
2) When running, several times, an insensitive sort, with RJ TextEd, I obtained 3 different lists :
- The first one was identical to the table just above, which uses a sensitive sort
- The two others are listed below !
•---------------------------------------------------------------------------------------------------•
|                                      with RJ TextEd Features                                      |
•---------------•-------------------------•---------------------------•-----------------------------•
|  Before Sort  | After INsensitive Sort  |  After Sensitive Regex +  |  After INsensitive Regex +  |
|               |                         |  Suppression Duplicates   |   Suppression Duplicates    |
•---------------•-------------------------•---------------------------•-----------------------------•
|      Abc      |           ABC           |            ABC            |             ABC             |
|      Abc      |           AbC           |            AbC            |             dEF             |
|      DEf      |           aBC           |            aBC            |                             |
|      AbC      |           aBC           |            aBc            |                             |
|      aBc      |           aBc           |            Abc            |                             |
|      dEF      |           Abc           |            AbC            |                             |
|      aBc      |           AbC           |            aBc            |                             |
|      DEf      |           aBc           |            Abc            |                             |
|      aBC      |           Abc           |            aBc            |                             |
|      def      |           Abc           |            dEF            |                             |
|      AbC      |           aBc           |            DEf            |                             |
|      def      |           dEF           |            def            |                             |
|      def      |           dEF           |            dEF            |                             |
|      ABC      |           DEf           |            DEF            |                             |
|      Abc      |           DEf           |            def            |                             |
|      DEF      |           DEf           |                           |                             |
|      aBc      |           def           |                           |                             |
|      dEF      |           def           |                           |                             |
|      dEF      |           dEF           |                           |                             |
|      DEf      |           DEF           |                           |                             |
|      aBC      |           def           |                           |                             |
•---------------•-------------------------•---------------------------•-----------------------------•
•---------------------------------------------------------------------------------------------------•
|                                      with RJ TextEd Features                                      |
•---------------•-------------------------•---------------------------•-----------------------------•
|  Before Sort  | After INsensitive Sort  |  After Sensitive Regex +  |  After INsensitive Regex +  |
|               |                         |  Suppression Duplicates   |   Suppression Duplicates    |
•---------------•-------------------------•---------------------------•-----------------------------•
|      Abc      |           ABC           |            ABC            |             ABC             |
|      Abc      |           AbC           |            AbC            |             dEF             |
|      DEf      |           aBC           |            aBC            |                             |
|      AbC      |           aBC           |            aBc            |                             |
|      aBc      |           aBc           |            Abc            |                             |
|      dEF      |           Abc           |            aBc            |                             |
|      aBc      |           aBc           |            AbC            |                             |
|      DEf      |           AbC           |            Abc            |                             |
|      aBC      |           Abc           |            aBc            |                             |
|      def      |           Abc           |            dEF            |                             |
|      AbC      |           aBc           |            DEf            |                             |
|      def      |           dEF           |            def            |                             |
|      def      |           dEF           |            DEF            |                             |
|      ABC      |           dEF           |            DEf            |                             |
|      Abc      |           DEf           |                           |                             |
|      DEF      |           DEf           |                           |                             |
|      aBc      |           def           |                           |                             |
|      dEF      |           def           |                           |                             |
|      dEF      |           def           |                           |                             |
|      DEf      |           DEF           |                           |                             |
|      aBC      |           DEf           |                           |                             |
•---------------•-------------------------•---------------------------•-----------------------------•
-
Patrick, did you download and unzip the TclTk package into the
NPP_INSTALL_DIR ? (in your case into D:\Utilities\PortableApps\Notepad++)
If so, can you run the following in the python script console
import sys; print '\n'.join(sys.path)
and post the output?
Did the unidecode library installation work?
Cheers
Claudia
-
yes, unidecode works fine, import command works
-
D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\lib
D:\Utilities\PortableApps\Notepad++\plugins\Config\PythonScript\lib
D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\scripts
D:\Utilities\PortableApps\Notepad++\plugins\Config\PythonScript\scripts
D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\lib\lib-tk
D:\Utilities\PortableApps\Notepad++\python27.zip
D:\Utilities\PortableApps\Notepad++\DLLs
D:\Utilities\PortableApps\Notepad++\lib
D:\Utilities\PortableApps\Notepad++\lib\plat-win
D:\Utilities\PortableApps\Notepad++\lib\lib-tk
D:\Utilities\PortableApps\Notepad++
-
The correct one is
D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\lib\lib-tk
but those
D:\Utilities\PortableApps\Notepad++\lib
D:\Utilities\PortableApps\Notepad++\lib\plat-win
D:\Utilities\PortableApps\Notepad++\lib\lib-tk
are strange, could it be that you unzipped only part of the tk packages into
D:\Utilities\PortableApps\Notepad++\ ?
Can you check if you have the following files under D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\lib\lib-tk
Canvas.py
Dialog.py
FileDialog.py
FixTk.py
ScrolledText.py
SimpleDialog.py
Tix.py
tkColorChooser.py
tkCommonDialog.py
Tkconstants.py
Tkdnd.py
tkFileDialog.py
tkFont.py
Tkinter.py
tkMessageBox.py
tkSimpleDialog.py
ttk.py
turtle.py
You might see additional files with extension pyc - that’s ok.
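Instead of comparing the folder by eye, the check can also be done from the python script console. This is only a sketch: `missing_tk_files` is a made-up helper, the file names are a few taken from the list above, and the path is the one quoted in this thread, so adjust it to your own installation:

```python
import os

# A few of the file names from the list above; extend as needed.
EXPECTED_TK_FILES = ["Tkinter.py", "Tkconstants.py", "tkFont.py", "ttk.py"]


def missing_tk_files(directory, expected=EXPECTED_TK_FILES):
    """Return the expected file names that are not present in directory."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(directory, name))]


# Path as quoted in this thread (assumption: yours may differ).
lib_tk = r"D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\lib\lib-tk"
print(missing_tk_files(lib_tk))
```

An empty list means all the checked files are where the plugin expects them.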
If you do have the files, delete the D:\Utilities\PortableApps\Notepad++\lib directory.
If you don’t have the files under D:\Utilities\PortableApps\Notepad++\lib\lib-tk but
within D:\Utilities\PortableApps\Notepad++\lib then cut D:\Utilities\PortableApps\Notepad++\lib and paste it into
D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\
Cheers
Claudia
-
I can’t find either D:\Utilities\PortableApps\Notepad++\plugins\PythonScript\lib\lib-tk or D:\Utilities\PortableApps\Notepad++\lib folder in explorer!
-
so how did you install Tcl/Tk libraries?
Cheers
Claudia
-
I extracted the zip of course, the folder you say is in:
d:\Utilities\PortableApps\Notepad++\plugins\PythonScript\lib\tcl\lib-tk\
both in zip file and my explorer!
-
I’ve just read guy038’s post and I’m more confused :S
I downloaded the file again and now it’s Last modified: 02 Jun 2018 16:00 UTC
and 69930 results,
sorting with insensitive (ue and textfx) yields 69284 and the output should be similar,
so I should be satisfied by that consensus I guess?
-
The easylist file is an adblocker file, so it will change constantly.
Regarding the Tcl/Tk installation - you should have unzipped it into the
D:\Utilities\PortableApps\Notepad++\ directory.
The zip contains the complete folder hierarchy - as you see on the left side (archive tree).
If you did this, you normally get a message saying that the plugins folder already exists and
asking if you want to overwrite it -> you should have answered this with yes, didn’t you?
Cheers
Claudia