Remove duplicate lines not possible?
-
@Cletos said in Remove duplicate lines not possible?:
There once was such option to remove spread duplicates, if I remember it right.
No, only a way to do it via regular expressions discussed here on the Community – that’s probably what you remember.
So it is not possible with Notepad++ at the moment.
Well, you can try it with the regular expression technique; search the Community site and you’ll rediscover the links with instructions.
-
Alright, thank you very much!
-
Hi, @cletos, @alan-kilborn and All,
Alan, as you know, I’ve certainly answered this question, many times ! But, I’m a bit lazy and, instead of finding the different links, for the OP, I prefer to “re-invent the wheel” ;-))
So @cletos, here is the magic regular expression S/R, which deletes all duplicate lines without changing the order of the remaining lines:
- SEARCH (?-s)^(.+\R)(?=(?s:.*)^\1)
- REPLACE Leave EMPTY
- Tick the Match case option, if you prefer case-sensitive matching
- Tick the Wrap around option, preferably
- Select the Regular expression search mode
- Click on the Replace All button ( or use the Replace button, step by step, to verify how the regex works ! )
Remark :
Let’s suppose that your initial text is :
aaa
bbb
ccc
ddd
bbb
bbb
eee
fff
bbb
ggg
bbb
hhh
iii
Then this regex S/R will delete :
- The bbb line between lines aaa and ccc
- The bbb line between lines ddd and bbb
- The bbb line between lines bbb and eee
- The bbb line between lines fff and ggg
And it keeps only the bbb line located between lines ggg and hhh
So, to sum up, this regex S/R keeps only the last occurrence of each duplicated line found in the input text !
So your final text becomes :
aaa
ccc
ddd
eee
fff
ggg
bbb
hhh
iii
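For readers who want to check the behaviour outside Notepad++, here is a minimal Python sketch of the same idea. It is an adaptation, not the exact Boost regex: Python's `re` module has no `\R`, so `\n` stands in for the line break, and the leading `(?-s)` is dropped because a dot does not match a newline by default. The names `KEEP_LAST` and `remove_all_but_last` are made up for this illustration.

```python
import re

# Adapted from the thread's (?-s)^(.+\R)(?=(?s:.*)^\1).
# Assumption: every line ends with "\n" (no \R support in Python's re).
KEEP_LAST = re.compile(r"^(.+\n)(?=(?s:.*)^\1)", re.MULTILINE)


def remove_all_but_last(text: str) -> str:
    """Delete each line that reappears, identically, further down."""
    return KEEP_LAST.sub("", text)


sample = "aaa\nbbb\nccc\nddd\nbbb\nbbb\neee\nfff\nbbb\nggg\nbbb\nhhh\niii\n"
result = remove_all_but_last(sample)
print(result.split())
# ['aaa', 'ccc', 'ddd', 'eee', 'fff', 'ggg', 'bbb', 'hhh', 'iii']
```

The lookahead only checks that an identical line exists somewhere later, so every occurrence except the last one is matched and removed.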
I cannot get another layout with a correct regex S/R ! ( For instance, keeping the bbb line between lines aaa and ccc and deleting all subsequent bbb lines ) Sorry for this limitation !
IMPORTANT :
- The last line of your list must always be followed by a line-break
- Be aware that the behaviour of this regex S/R is rather weird ! It works nicely with small or medium-sized texts. But :
  - If your file is big, over about 10 MB, even when it contains no duplicate lines, OR
  - If 2 duplicate lines are separated by, let’s say, more than 10,000 lines
- It may happen that this S/R goes completely wrong, with an extra occurrence matching all the file contents :-(( It mainly depends on our Boost regex engine and, probably, on the amount of your system memory !
As always, give it a try, with your real files, to see how this regex S/R acts !?
Two possible solutions, if any problem occurs :
- Use the Replace button repeatedly ( or the Alt + R shortcut ) and stop when a particular replacement wrongly wipes out all the file contents !
- Split your text into smaller parts and run this regex S/R on each part first. Then merge all the pieces and run the regex S/R again on the whole set !
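The second workaround relies on the fact that de-duplicating each part and then the merged result gives the same output as one pass over the whole text. The sketch below checks this on the example above. It uses a plain set-based loop in place of the regex, and `keep_last`, `keep_last_in_chunks` and `chunk_size` are hypothetical names chosen for the illustration.

```python
def keep_last(lines):
    """Keep only the last occurrence of each line, preserving order."""
    seen = set()
    kept = []
    for line in reversed(lines):  # walk from the bottom up
        if line not in seen:
            seen.add(line)
            kept.append(line)
    kept.reverse()  # restore top-to-bottom order
    return kept


def keep_last_in_chunks(lines, chunk_size):
    """De-duplicate each chunk first, then the merged result."""
    merged = []
    for start in range(0, len(lines), chunk_size):
        merged.extend(keep_last(lines[start:start + chunk_size]))
    return keep_last(merged)


lines = "aaa bbb ccc ddd bbb bbb eee fff bbb ggg bbb hhh iii".split()
whole = keep_last(lines)
chunked = keep_last_in_chunks(lines, chunk_size=4)
print(whole == chunked)  # True
```

The first pass only shrinks each part, so the final pass over the merged text has less work to do.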
Best Regards,
guy038
-
Hello guy038,
Thank you very much!
I cannot get an other layout, with a correct regex S/R ! ( For instance, keeping the line bbb between lines aaa and ccc and deleting all subsequent bbb lines ) Sorry for this limitation !
No, no, it works great!
The last line of your list must always be followed with a line-break
So one has to just press ENTER at the end of that last line in the txt file.
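Why the final line-break matters can be seen in a small Python sketch. It uses an adapted pattern, with `\n` in place of `\R` because Python's `re` does not support `\R`, so it is an approximation of the Notepad++ regex rather than the regex itself. The captured group includes the line break, so a last line without one can never be found by the lookahead.

```python
import re

# Adapted pattern: "\n" replaces \R (an assumption made for Python's re).
KEEP_LAST = re.compile(r"^(.+\n)(?=(?s:.*)^\1)", re.MULTILINE)

without_break = "bbb\naaa\nbbb"    # last line has no line-break
with_break = "bbb\naaa\nbbb\n"     # ENTER pressed after the last line

unchanged = KEEP_LAST.sub("", without_break)
deduped = KEEP_LAST.sub("", with_break)

print(repr(unchanged))  # 'bbb\naaa\nbbb'  (duplicate not detected)
print(repr(deduped))    # 'aaa\nbbb\n'
```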
If your file has a big size, over 10 Mb about, even not concerned with duplicates lines, OR
So I could try splitting the processing on the first half of the txt file and the last half or even smaller and hope there are many lines removed and the file gets smaller.
Be aware that the behaviour of this regex S/R is rather weird ! It works nice with small or middle-size text to process. But :
Works great after some testing.
Two possible solutions, if any problem occurs :
- Use, the Replace button repeatedly ( or the Alt + R shortcut ) and stop when a particular replacement wipe out, wrongly, all file contents !
- Split your text in smaller parts, processing this regex S/R on each part, first. Then, merge all the pieces and process, again, the regex S/R on the whole set !
I will try it like that.
Thank you very much, again!
-
@guy038 said in Remove duplicate lines not possible?:
I cannot get an other layout, with a correct regex S/R ! ( For instance, keeping the line bbb between lines aaa and ccc and deleting all subsequent bbb lines ) Sorry for this limitation !
Hi guy038, Cletos, All:
Not a regex solution, but if you reverse the list (for example, by means of the Reverse Lines plugin) and run the nice regex you provided, you will keep the first “bbb” and delete all the later duplicates. Once you are finished, reverse the list again to get the original order of lines.
Hope you find this, my first post here, useful.
Best Regards.
-
Hello Sofistanpp,
OK, sounds very good! Many thanks!
-
@Cletos Glad to be of help.
-
Maybe explain how reversing the lines helps?
-
@Alan-Kilborn Sure. It overcomes a limitation pointed out by guy038, who wrote that the regex he posted removes all the duplicates except the last one, while it seems that he wanted to keep the first one. So if you reverse the order of the lines and run the regex, you will remove, of course, all the instances except the last duplicate. Now reverse the list back to the original order and you will have actually kept the first instance of the line: the “bbb” between “aaa” and “ccc” in the example.
Hope it is clear now (English is not my first language).
Best Regards.
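The reasoning can be checked with a short Python sketch. It uses a plain loop as a stand-in for the regex (same keep-the-last behaviour), and the function names are invented for the illustration.

```python
def keep_last(lines):
    """Stand-in for the regex S/R: keep the last occurrence of each line."""
    seen = set()
    kept = []
    for line in reversed(lines):
        if line not in seen:
            seen.add(line)
            kept.append(line)
    kept.reverse()
    return kept


def keep_first_via_reverse(lines):
    """Reverse, keep the last occurrence, then reverse back."""
    reversed_lines = lines[::-1]          # step 1: reverse the list
    deduped = keep_last(reversed_lines)   # step 2: run the keep-last pass
    return deduped[::-1]                  # step 3: restore the original order


lines = "aaa bbb ccc ddd bbb bbb eee fff bbb ggg bbb hhh iii".split()
first_kept = keep_first_via_reverse(lines)
print(first_kept)
# ['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff', 'ggg', 'hhh', 'iii']
```

In the reversed list, the last occurrence of a line is the first occurrence in the original order, which is why the surviving “bbb” is the one between “aaa” and “ccc”.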
-
Ah, okay, I missed the point about wanting to keep the first rather than the last. Thanks for the clarification.
-
Hi, @cletos, @sofistanpp, @alan-kilborn and All,
@sofistanpp, I didn’t want to favour either solution but, indeed, it’s good to be able to choose, thanks to your clever idea of using the Reverse Lines plugin, between these two solutions :
- Keep the first duplicate line and delete all subsequent duplicate lines
- Delete all the duplicates but keep the last duplicate line
Now, thinking about it, I found a solution which can be carried out within N++ only, without using any external tool
If we go back to my previous example, open the Column editor ( Edit > Column Editor... ) and, moving the caret to the first column of the first line of your text, create a new list of numbers ( Don’t forget to tick the Leading zeros option ! )
Then, after adding 1 or several blank character(s) after each number, with the column mode selection, you should get :
01 aaa
02 bbb
03 ccc
04 ddd
05 bbb
06 bbb
07 eee
08 fff
09 bbb
10 ggg
11 bbb
12 hhh
13 iii
Now, sort the lines with the option Edit > Line Operations > Sort Lines Lexicographically Descending, giving :
13 iii
12 hhh
11 bbb
10 ggg
09 bbb
08 fff
07 eee
06 bbb
05 bbb
04 ddd
03 ccc
02 bbb
01 aaa
Then, after running this new version of my previous regex S/R :
- SEARCH (?-s)^\d+\h+(.+\R)(?=(?s:.*)^\d+\h+\1)
- REPLACE Leave EMPTY
You’re left with :
13 iii
12 hhh
10 ggg
08 fff
07 eee
04 ddd
03 ccc
02 bbb
01 aaa
Finally, after a second sort in the opposite order, with Edit > Line Operations > Sort Lines Lexicographically Ascending, we have the following output text :
01 aaa
02 bbb
03 ccc
04 ddd
07 eee
08 fff
10 ggg
12 hhh
13 iii
As expected, only the bbb line between lines aaa and ccc remains ;-))
Best Regards,
guy038
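The number, sort, delete, sort sequence above can be sketched in Python as follows. This is an approximation under stated assumptions: Python's `re` has neither `\h` nor `\R`, so `[ \t]` and `\n` are used instead, and the names `DEDUPE_NUMBERED` and `keep_first` are hypothetical.

```python
import re

# Adapted from (?-s)^\d+\h+(.+\R)(?=(?s:.*)^\d+\h+\1).
# Assumptions: [ \t] replaces \h and "\n" replaces \R.
DEDUPE_NUMBERED = re.compile(
    r"^\d+[ \t]+(.+\n)(?=(?s:.*)^\d+[ \t]+\1)", re.MULTILINE
)


def keep_first(lines):
    """Number the lines, sort descending, delete, then sort ascending."""
    width = len(str(len(lines)))  # leading zeros, as in the Column editor
    numbered = [f"{i:0{width}d} {line}" for i, line in enumerate(lines, start=1)]
    descending = sorted(numbered, reverse=True)
    text = "".join(item + "\n" for item in descending)
    remaining = DEDUPE_NUMBERED.sub("", text).splitlines()
    return sorted(remaining)


lines = "aaa bbb ccc ddd bbb bbb eee fff bbb ggg bbb hhh iii".split()
numbered_result = keep_first(lines)
print(numbered_result)
# ['01 aaa', '02 bbb', '03 ccc', '04 ddd', '07 eee', '08 fff', '10 ggg', '12 hhh', '13 iii']
```

The leading zeros matter because the sort is lexicographic: without them, a line numbered 10 would sort before a line numbered 2. This sketch also assumes the original lines do not themselves start with digits followed by blanks.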
-
Hi guy038, All:
Well done. I’m glad my post somehow inspired you to develop a more comprehensive solution to the current issue. As I learned reading archived posts, ancillary lists are a frequently used resource of your toolbox.
On my side, reversing lines wasn’t my first thought. What would happen, I asked myself, if I run that regex in backward direction from the last line? Would I get, by symmetry, the first “bbb”? Enabled the Backward direction button via an AutoHotkey script and clicked on Replace All, but no joy. You will get exactly the same outcome as if you run the regex in normal direction.
I suspect that lookarounds are the culprits (simpler regexes do the expected job), but haven’t thoroughly tested it.
Maybe you or someone else can elaborate on this issue.
Best Regards.
-
Hello guy038,
Thank you very much for the new method!
-
run that regex in backward direction from the last line
Searching backwards with regex is “discouraged” and is partially disabled in Notepad++.
The reason, I think, is that thru a given text, if you search backwards versus forwards, you won’t get the same hits. Sometimes (simpler regexes, as you noted) you will, but not always (depends upon the regex and maybe the data).
Enabled the Backward direction button via an AutoHotkey script
In general, enabling disabled controls and then performing an operation and expecting good results is a dubious premise.
-
@Cletos Yes this feature is buggy, I see it fairly often. Usually I can click “Remove duplicate lines” and it removes them all, regardless of order, but sometimes it doesn’t remove any of them. Something wrong with the software, but I can’t pinpoint what’s wrong. It depends on the text? Or I have to create a new blank document and then it works there, and then copy it back into the original?
-
@endolith said in Remove duplicate lines not possible?:
It depends on the text?
Could be a line-ending problem?
If line-endings are different on otherwise duplicate lines, they won’t be considered true duplicates.
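As a small illustration of that suggestion (a sketch of the general idea only, not of how Notepad++ compares lines internally), two lines with the same visible text but different line-endings are different strings until the endings are normalized:

```python
text = "bbb\r\naaa\nbbb\n"  # first bbb ends with CRLF, the second with LF

# Keep the line-endings attached, as a byte-exact comparison would see them.
raw_lines = text.splitlines(keepends=True)
distinct_raw = set(raw_lines)

# Strip the line-endings before comparing.
normalized = [line.rstrip("\r\n") for line in raw_lines]
distinct_normalized = set(normalized)

print(len(distinct_raw))         # 3: the two bbb lines look different
print(len(distinct_normalized))  # 2: the two bbb lines now compare equal
```

Turning on View > Show Symbol > Show End of Line in Notepad++ is one way to see whether a file mixes line-endings.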