Change menu text for "Remove Consecutive Duplicate Lines" ?
-
Nice.
Which took 5.8 seconds in my environment.
Nicer.
The script would remove ANY duplicate, not only the ones which are consecutive.
Perhaps nicest.
:)
I was just generalizing in my earlier reply; I didn’t know a script was going to come out of it. :)
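Since the distinction matters for the thread title: here is a minimal sketch, my own illustration rather than code from this thread, of what a strictly consecutive removal would look like with the same PythonScript editor object:

```python
# Illustrative sketch only: removes a line only when it is identical to the
# line immediately above it, mimicking "Remove Consecutive Duplicate Lines".
def remove_consecutive_duplicates():
    previous = None
    duplicates = []
    for line_num, line in enumerate(editor.getCharacterPointer().splitlines()):
        if line == previous:
            duplicates.append(line_num)
        previous = line
    # delete bottom-up so the collected line numbers stay valid
    for line_num in reversed(duplicates):
        editor.deleteLine(line_num)

remove_consecutive_duplicates()
```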
-
I was just generalizing in my earlier reply; I didn’t know a script was going to come out of it. :)
I had it already but never tested it with really big data and this thread just gave me the trigger to do the test :-)
-
Hello, @ekopalypse,
I’ve just tried out your script, about removing duplicate lines, with a local N++ v7.6.3, 32-bit release, and nothing occurred :-((
My PythonScript plugin version is 1.3.0.0 and NO error message is displayed on the console !
My Python interpreter seems OK, as other scripts just work as expected !
I used this simple sample text below :
```
abcde
fgh
abcde
jk
opq
abcde
fgh
jk
fgh
abcde
```
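( For reference, an inference of mine rather than output posted in the thread: removing every duplicate, consecutive or not, from this sample should keep only the first occurrences, i.e. the four lines abcde, fgh, jk, opq. )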
I also tried to sort it first, to select a line, a block of lines or all of the text => no result :-(( I also suppressed the line numbering, just in case…
Here is my debug info :
```
Notepad++ v7.6.3   (32-bit)
Build time : Jan 27 2019 - 17:20:30
Path : D:\@@\763\notepad++.exe
Admin mode : OFF
Local Conf mode : ON
OS : Windows XP (32-bit)
Plugins : BetterMultiSelection.dll ComparePlugin.dll DSpellCheck.dll ElasticTabstops.dll mimeTools.dll NppConverter.dll NppExport.dll PythonScript.dll TabIndentSpaceAlign.dll
```
Note that this v7.6.3 version is my latest version, the one where I installed the PythonScript plugin, and that my Win XP laptop contains numerous portable N++ versions, with various plugins in each ;-))
So, am I missing something obvious ?!
BR
guy038
-
sorry, yes, I only posted the function itself - it must be called, of course :-)
```python
def remove_duplicates():
    unique_lines = set()
    duplicates = []
    for line_num, line in enumerate(editor.getCharacterPointer().splitlines()):
        if line not in unique_lines:
            unique_lines.add(line)
        else:
            duplicates.append(line_num)
    # delete from the bottom up, so the collected line numbers stay valid
    for line_num in reversed(duplicates):
        editor.deleteLine(line_num)

remove_duplicates()
```
-
Somewhat equivalently, one could remove the def remove_duplicates(): line (and now also the remove_duplicates() line), and outdent the remaining lines, and it will also work fine. :)
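A minimal sketch of that outdented variant, my reconstruction of the suggestion rather than code posted in the thread:

```python
# Reconstruction of the suggested module-level variant: same logic as
# remove_duplicates(), but without the function definition and call.
unique_lines = set()
duplicates = []
for line_num, line in enumerate(editor.getCharacterPointer().splitlines()):
    if line not in unique_lines:
        unique_lines.add(line)
    else:
        duplicates.append(line_num)
# delete bottom-up so the collected line numbers stay valid
for line_num in reversed(duplicates):
    editor.deleteLine(line_num)
```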
-
I just tried it out. With the call, it works for me on @guy038’s data.
The one thing I would suggest would be to wrap it in an editor.beginUndoAction() / editor.endUndoAction() pair. If I’m doing a bulk delete, I want to be able to bulk undo, too. :-)
-
depending on how many duplicates it found, yes, it could become quite cumbersome if one tried to undo it :-)

```python
def remove_duplicates():
    unique_lines = set()
    duplicates = []
    for line_num, line in enumerate(editor.getCharacterPointer().splitlines()):
        if line not in unique_lines:
            unique_lines.add(line)
        else:
            duplicates.append(line_num)
    for line_num in reversed(duplicates):
        editor.deleteLine(line_num)

# group the whole bulk delete into a single undo step
editor.beginUndoAction()
remove_duplicates()
editor.endUndoAction()
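As an aside, the undo pairing can be made reusable; a minimal sketch, my own illustration rather than something posted in the thread, using a context manager so the undo action is ended even if the script raises:

```python
# Illustrative sketch only: groups any bulk edit into a single undo step and
# guarantees endUndoAction() runs even when the wrapped code raises.
from contextlib import contextmanager

@contextmanager
def single_undo_action():
    editor.beginUndoAction()
    try:
        yield
    finally:
        editor.endUndoAction()

with single_undo_action():
    remove_duplicates()
```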
-
Hi, @Ekopalypse, @alan-kilborn, @peterjones and all,
Oh… my bad ! I’m feeling really silly, right now :-(( So elementary !
Now, as the native
Remove consecutive duplicate lines
N++ option does not take any selection into account, @ekopalypse, would it be easy enough to just consider the current main selection ? If so, it could be an interesting enhancement of this native N++ command ;-))
Cheers,
guy038
-
yes, but what should happen with the selection afterwards?
Should it simply disappear, or should it select the remaining unique lines?
-
To my mind, I don’t think that it’s necessary to keep the selection. Indeed, it would just be a means to define the part of the file to be processed, afterwards !
What’s your feeling about it ?
Cheers,
guy038
-
not sure, I guess providing a flag which can be set is good enough. In case one wants it,
turn it on; if not, turn it off. If no one else jumps in, then I will follow up tomorrow, as it is already past midnight, but I know you know this, as you are from France, as I remember.
'til tomorrow.
-
Hi @guy038
as promised, here is a version with a selection option:

```python
def remove_duplicates():
    unselect_after_removable = False
    unique_lines = set()
    duplicates = []
    if editor.getSelectionEmpty():
        # no selection: process the whole buffer
        for line_num, line in enumerate(editor.getCharacterPointer().splitlines()):
            if line not in unique_lines:
                unique_lines.add(line)
            else:
                duplicates.append(line_num)
    else:
        # a selection exists: process only the selected line range
        start, end = editor.getUserLineSelection()
        for line_num in range(start, end + 1):
            line = editor.getLine(line_num)
            if line not in unique_lines:
                unique_lines.add(line)
            else:
                duplicates.append(line_num)
    for line_num in reversed(duplicates):
        editor.deleteLine(line_num)
    if unselect_after_removable:
        editor.clearSelections()
```
-
Hi, @ekopalypse and All,
This time, I was warned ;-)) So, adding the part below to your script allowed me to appreciate your last version :

```python
editor.beginUndoAction()
remove_duplicates()
editor.endUndoAction()
```
If no main selection is present, all of the file contents are processed. Otherwise, only the selection range is concerned. Nice, indeed ;-))
I built a sample file containing roughly 497,000 lines, all different, and I added a block of 15 lines, 128 times, each block being separated from the next one by between 800 and 7,500 lines, which, finally, gave me a file of almost 500,000 lines. On my outdated laptop ( Win XP, 1 GB of RAM ! ), no problem. It took about 31 s to be processed !
BR
guy038
P.S. :
Yes, I know ! Why can’t he buy a recent laptop, with a 250 GB SSD for Windows 10, 8 GB of SDRAM, a 2 TB SATA HD and a 2 GB NVIDIA GeForce, like everybody else ? Well, I think I’m about to reach the tipping point ;-))
Note that I do not insist on these laptop characteristics, as I’m not quite certain they are all accurate !!
-
hehe :-)
31 s is a long time - it would be interesting to see your results using this little test.
I assume the RAM might be the bottleneck; would you mind making some tests with 50 000 instead of 500 000 lines as well?

```python
import time
from random import randint

def create_unique_lines(num_of_lines):
    lines = ['sample data {0} on line {0}\r\n'.format(x) for x in range(num_of_lines)]
    editor.setText(''.join(lines))

def create_duplicates(num_of_duplicates):
    max_lines = editor.getLineCount()
    duplicated_lines = []
    for i in range(num_of_duplicates):
        duplicate_line = randint(1, max_lines)
        duplicated_lines.append(editor.getLine(duplicate_line))
    editor.appendText(''.join(duplicated_lines))
    editor.scrollToEnd()

def remove_duplicates():
    unselect_after_removable = False
    unique_lines = set()  # here much faster than lists
    duplicates = []
    if editor.getSelectionEmpty():
        for line_num, line in enumerate(editor.getCharacterPointer().splitlines()):
            if line not in unique_lines:
                unique_lines.add(line)
            else:
                duplicates.append(line_num)
    else:
        start, end = editor.getUserLineSelection()
        for line_num in range(start, end + 1):
            line = editor.getLine(line_num)
            if line not in unique_lines:
                unique_lines.add(line)
            else:
                duplicates.append(line_num)
    for line_num in reversed(duplicates):
        editor.deleteLine(line_num)
    if unselect_after_removable:
        editor.clearSelections()

def main():
    keep_going = '0'
    while keep_going:
        keep_going = notepad.prompt('1 = Create data 2 = Create duplicates 3 = Remove duplicates',
                                    'Choose action',
                                    '1' if keep_going == '3' else str(int(keep_going) + 1))
        if keep_going == '1':
            create_unique_lines(500000)
        elif keep_going == '2':
            create_duplicates(10)
        elif keep_going == '3':
            editor.beginUndoAction()
            s = time.time()
            remove_duplicates()
            print(time.time() - s)
            editor.endUndoAction()
            break
        else:
            break

main()
```
The last action, number 3, or pressing the Cancel button, breaks the loop.
Btw. I get 0.33 seconds removing 10 duplicates from 500 000 lines on my machine.
First gen i5 2500, but with plenty of RAM - 16 GB :-)
-
Hello @ekopalypse and All,
My original file consisted of about 500,000 lines, of between 40 and 270 characters each. So, this could explain the long execution time noticed on my weak configuration !
To give you an idea, running your second script several times, yesterday and this morning, took between 0.937 s and 1.328 s on my old laptop !
Now, after running your script once more, I cancelled it right before the “remove duplicate lines” action. Then :
- I deleted the first line sample data 0 on line 0
- I added the line sample data 500000 on line 500000, at the end of the file
- I moved the random block of the 10 duplicate lines, below, to the very beginning of the file :

```
sample data 215497 on line 215497
sample data 444992 on line 444992
sample data 413618 on line 413618
sample data 117035 on line 117035
sample data 185573 on line 185573
sample data 25978 on line 25978
sample data 275256 on line 275256
sample data 251521 on line 251521
sample data 328003 on line 328003
sample data 342755 on line 342755
```
- Afterwards, I used a regex S/R in order to add this 10-line block right after lines 5,000, 10,000, 15,000, and so on … till 495,000 and finally 500,000 (an equivalent PythonScript approach is sketched after this list) :
  - SEARCH [50]000\R\K
  - REPLACE sample data 215497 on line 215497\r\nsample data 444992 on line 444992\r\nsample data 413618 on line 413618\r\nsample data 117035 on line 117035\r\nsample data 185573 on line 185573\r\nsample data 25978 on line 25978\r\nsample data 275256 on line 275256\r\nsample data 251521 on line 251521\r\nsample data 328003 on line 328003\r\nsample data 342755 on line 342755\r\n

  ( the SEARCH pattern matches the line break of any line number ending in 5000 or 0000, that is, of every multiple of 5,000, and the \K makes the replacement a pure insertion )
- So, I obtained a 501,000-line file ( = 500,000 initial lines + 100 blocks of 10 duplicate lines )
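For completeness, the same test data could also be built directly from PythonScript; a hedged sketch assuming the 10-line block shown above, not a claim about how it was actually done:

```python
# Illustrative sketch only: inserts a fixed block of duplicate lines after
# every 5,000th line of the current buffer, like the regex S/R above.
block = [
    'sample data 215497 on line 215497',
    'sample data 444992 on line 444992',
    'sample data 413618 on line 413618',
    'sample data 117035 on line 117035',
    'sample data 185573 on line 185573',
    'sample data 25978 on line 25978',
    'sample data 275256 on line 275256',
    'sample data 251521 on line 251521',
    'sample data 328003 on line 328003',
    'sample data 342755 on line 342755',
]
out = []
for num, line in enumerate(editor.getCharacterPointer().splitlines(), 1):
    out.append(line)
    if num % 5000 == 0:
        out.extend(block)
editor.setText('\r\n'.join(out) + '\r\n')
```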
To end, in your initial Remove duplicates script, I added the timer. Running the script on the 501,000-line file took between 9.86 s and 11.94 s, depending on the location of the caret before execution. Everything was OK ! I did get, again, each time, a file containing 500,000 lines ;-))
For information, the block of the 10 duplicate lines kept was the one located between lines 5,000 and 5,001 !
Important remark : your script is correct as long as the word wrap feature is NOT set ! Otherwise, results are quite incoherent :-(( This seems quite natural, as I often noticed that navigation throughout a huge file is very, very slow when the Word wrap feature is enabled, even when there is no highlighting !
Cheers,
guy038
-
thank you for your tests. ~10 s for your aged laptop seems to be OK, I guess.
What do you mean by
Your script is correct as long as the word wrap feature is NOT set
?
While testing, I didn’t encounter a situation where duplicates still existed after running the script. Word wrap and long lines are performance killers, that’s for sure, but the script should still work correctly.
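A quick way to settle the word wrap question is to count whatever duplicates remain after the script runs; a small verification snippet, my own addition rather than something posted in the thread:

```python
# Illustrative check only: counts the duplicate lines remaining in the
# current buffer, e.g. after running remove_duplicates().
from collections import Counter

counts = Counter(editor.getCharacterPointer().splitlines())
remaining = sum(n - 1 for n in counts.values() if n > 1)
print('remaining duplicate lines: {}'.format(remaining))
```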