Delete duplicate lines?
Is it possible to find and remove duplicate lines in a document? If so, how?
Do you want ONE occurrence of a duplicated line to remain? If so, does it matter which one (first? last?)? Or, if a line is duplicated, do you want ALL occurrences of it wiped out? Really, you must be more explicit in what you want to get the best help! :)
I want one to remain; which one doesn’t matter.
Okay, well in that case there is a really cool regex search and replace you can do which will delete duplicate lines, leaving one copy (the LAST one that occurs in the file):
Find what: (?s)^(.*?)$\s+?^(?=.*^\1$)
Replace with: <------make sure this box contains nothing
Make sure you specify that this is to be a regular expression search.
Execute the replace.
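For anyone who would rather script this than use the Replace dialog, here is a minimal Python sketch of the same goal (remove duplicate lines, keeping the LAST copy of each). It is not a description of how the regex works internally; the function name keep_last_occurrence is made up for illustration, and the sample is the "aaa, bbb, …" test list used elsewhere in this thread.

```python
def keep_last_occurrence(lines):
    """Return lines with duplicates removed, keeping the LAST copy of each."""
    seen = set()
    kept_reversed = []
    # Walk from the bottom up: the first time a line is met in this
    # direction is its last occurrence in the original order.
    for line in reversed(lines):
        if line not in seen:
            seen.add(line)
            kept_reversed.append(line)
    kept_reversed.reverse()
    return kept_reversed


sample = "aaa ccc bbb ddd aaa eee ddd aaa fff ddd ggg hhh iii hhh iii".split()
print(keep_last_occurrence(sample))
# ['ccc', 'bbb', 'eee', 'aaa', 'fff', 'ddd', 'ggg', 'hhh', 'iii']
```

Because this compares whole lines as plain strings, characters that are special in a regex have no effect on the result.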
It replaces only one occurrence each time I run the search.
It doesn’t matter if I choose Replace All.
Actually, it doesn’t replace anything; it just says that it does…
guy038 last edited by guy038
Hello Björn and Scott,
On 11/13/16, I updated this post. Indeed, I realized that my regex may fail when dealing with large files :-(( I suppose that it was due to the global in-line modifier (?s) at the beginning of the regex? I don’t have time to pin down exactly what goes wrong in some cases!
So, to keep all the unique lines AND the last item of all the duplicate lines, you should rather use the following safer S/R :
Of course, I also updated the old post’s contents, below.
Ah, Scott, very nice regex, indeed !
Scott and Björn, last week I replied to sophey in the post below :
where I tried to fully discuss TWO general methods of keeping :
Only unique lines
All the duplicate lines
Only the first line, from all the duplicate ones
But Scott, I didn’t think about that fourth case : To keep all the unique lines AND the last item of all the duplicate lines !
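As a rough illustration of the first three cases outside Notepad++, here is a short Python sketch. The function names and the input list are made up, and reading the third case as "drop the unique lines and keep only the first copy of each duplicated line" is my own assumption about what was meant.

```python
from collections import Counter


def only_unique(lines):
    """Keep lines that occur exactly once; drop every copy of a duplicate."""
    counts = Counter(lines)
    return [line for line in lines if counts[line] == 1]


def only_duplicates(lines):
    """Keep every copy of the lines that occur more than once."""
    counts = Counter(lines)
    return [line for line in lines if counts[line] > 1]


def first_of_each_duplicate(lines):
    """Keep only the first copy of each line that occurs more than once."""
    counts = Counter(lines)
    seen = set()
    kept = []
    for line in lines:
        if counts[line] > 1 and line not in seen:
            seen.add(line)
            kept.append(line)
    return kept


lines = ["a", "b", "a", "c", "b", "a"]  # hypothetical input
print(only_unique(lines))               # ['c']
print(only_duplicates(lines))           # ['a', 'b', 'a', 'b', 'a']
print(first_of_each_duplicate(lines))   # ['a', 'b']
```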
So, with your regex (?s)^(.*?)$\s+?^(?=.*^\1$), here is, below, an example showing the lines kept and the contents of the file before and after the S/R :
File          Lines         File
BEFORE        KEPT          AFTER
-------------------------------------------
aaa                         ccc
ccc     -->   ccc           bbb
bbb     -->   bbb           eee
ddd                         aaa
aaa                         fff
eee     -->   eee           ddd
ddd                         ggg
aaa     -->   aaa           hhh
fff     -->   fff           iii
ddd     -->   ddd
ggg     -->   ggg
hhh
iii
hhh     -->   hhh
iii     -->   iii
-------------------------------------------
Thinking about it, I found another syntax, which can achieve the same modifications :
(?-s)^(.+\R)(?s)(?=(.+\R)?\1)|^\R
As with yours, the replacement zone must be EMPTY.
However, my regex needs one condition : the last line ( the string “iii”, in the above example ) must be followed by its EOL character(s) !
The first part, (?-s)^(.+\R), uses the modifier (?-s), which ensures that the dot matches standard characters only ( not EOL characters ), so it matches any complete non-empty line together with its EOL characters.
In the second part, (?s)(?=(.+\R)?\1), the modifier (?s) means that the dot matches absolutely any character ( standard or EOL characters ). The syntax (.+\R)?\1 then represents the largest optional range of characters, extending up to an EOL character that is itself followed by the contents of group 1 ( the current line ).
Therefore, the part (?=(.+\R)?\1), which is a positive look-ahead, imposes a condition for an overall match : an identical complete line must exist further on, possibly right after the current one ! If so, the complete current line is deleted on replacement.
Finally, the third part, ^\R, after the alternation symbol |, matches any truly blank line, which is deleted on replacement, too.
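To experiment with this pattern outside Notepad++, here is a hedged adaptation for Python's re module. It is an approximation and not the same regex engine: Python's re has no \R, so the sketch assumes \n line endings, and the mid-pattern (?s) switch is written as a scoped (?s:...) group. The names DUPLICATE_LINE and remove_earlier_duplicates are made up for illustration.

```python
import re

# Adaptation of (?-s)^(.+\R)(?s)(?=(.+\R)?\1)|^\R for Python's re module:
#   \R       -> \n         (assumes LF line endings; re has no \R)
#   (?s)     -> (?s:...)   (scoped flag instead of a mid-pattern switch)
#   ^        -> needs re.MULTILINE to match at every line start
DUPLICATE_LINE = re.compile(r"^(.+\n)(?=(?s:.+\n)?\1)|^\n", re.MULTILINE)


def remove_earlier_duplicates(text):
    """Delete every line that reappears further on, plus blank lines."""
    return DUPLICATE_LINE.sub("", text)


sample = "\n".join(
    "aaa ccc bbb ddd aaa eee ddd aaa fff ddd ggg hhh iii hhh iii".split()
) + "\n"  # the final line ending matters for this pattern

print(remove_earlier_duplicates(sample).split())
# ['ccc', 'bbb', 'eee', 'aaa', 'fff', 'ddd', 'ggg', 'hhh', 'iii']
```

Under these assumptions, if the final line lacks its line ending, the look-ahead cannot match that line, so an earlier copy of it is left in place. That is consistent with the restriction on the last line mentioned above.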
I’m glad you like my regex. There is a 99% chance that you were the original author and I obtained it from you via this community over the last 1.5 years I’ve been reading it!
Did you get it working in your file(s), using either my or guy038’s methods?
Neither works; I get the same result with both.
I get a message that one occurrence was replaced, but that doesn’t seem to be the case.
My file has 1797 lines.
Maybe I can post it somewhere for you to try?
I just reverified my own regex as well as guy038’s regex on the “aaa, bbb, …” data guy038 provided. These regexes work to transform that data as described, so I’m not really able to tell you what is going wrong in your case. :(
Could it be failing because all lines start with -
- Odling av andra fleråriga växter
Or because of some Swedish characters?
guy038 last edited by
Very strange, indeed ! I first thought it could be because of the Wrap around option, but the regexes worked well, whether this option is checked or not !
I also verified that if Wrap around is checked and the caret is somewhere inside the list, the resulting text is correct, at the end !
I also tried with your Swedish text, building the original text below :
Odling av andra fleråriga växter
Odling av andra fleråriga växter
aaaa
Odling av andra fleråriga växter
bbbb
Odling av andra fleråriga växter
aaaa
Odling av andra fleråriga växter
After clicking on the Replace All button, I got, as expected, the changed text below :
bbbb
aaaa
Odling av andra fleråriga växter
So the best thing is to begin… at the beginning !
First of all, using the simple test text from my previous post :
aaa
ccc
bbb
ddd
aaa
eee
ddd
aaa
fff
ddd
ggg
hhh
iii
hhh
iii
do you obtain, after replacing, the text below :
ccc
bbb
eee
aaa
fff
ddd
ggg
hhh
iii
Moreover, just to verify, after clicking on the Show All Characters button ( or the menu option View - Show Symbol - Show All Characters ), what do the EOL characters of your file look like ?
CR? Are they all identical ?
Remember, if you’re using my regex, just take care that the last item of your list is followed by EOL character(s) ! It’s the only minor restriction !
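If you prefer to check the line endings with a script rather than by eye, a small Python sketch along these lines may help. It only counts CRLF, LF and CR sequences in the raw bytes and reports whether the data ends with one; the function name eol_report is made up for illustration.

```python
import re


def eol_report(data: bytes):
    """Count each line-ending style and check for a final EOL."""
    counts = {"CRLF": 0, "LF": 0, "CR": 0}
    # CRLF is listed first so it is not counted as a separate CR and LF.
    for match in re.finditer(rb"\r\n|\n|\r", data):
        eol = match.group()
        if eol == b"\r\n":
            counts["CRLF"] += 1
        elif eol == b"\n":
            counts["LF"] += 1
        else:
            counts["CR"] += 1
    ends_with_eol = data.endswith((b"\n", b"\r"))
    return counts, ends_with_eol


print(eol_report(b"aaa\r\nbbb\nccc"))
# ({'CRLF': 1, 'LF': 1, 'CR': 0}, False)
```

To run it on a real file, read the file in binary mode (for example open(path, "rb").read(), where path is whatever file you want to inspect) so that Python does not translate the line endings first.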
See you later,
Vasile Caraus last edited by
Yes, but if I have special characters, such as "-- Mother is home –", none of your regexes works completely.
guy038 last edited by
My updated regex ( See the second post, above ) works perfectly well, even if I insert your expression – Mother is home –, in a list !?. For instance, the original text, below :
aaa
ccc
bbb
ddd
aaa
-- Mother is home –
eee
ddd
aaa
fff
-- Mother is home –
-- Mother is home –
ddd
ggg
hhh
-- Mother is home –
iii
hhh
iii
with the S/R :
will be changed into :
ccc
bbb
eee
aaa
fff
ddd
ggg
-- Mother is home –
iii
hhh
iii
=> It did keep all the unique lines AND the last item of all the duplicate lines, including your string – Mother is home – !