remove duplicated line

pinuzzu99

I have a very long txt file like the one below, and some of its lines are duplicated. I tried this command with “Replace” in regex mode:
find: ^(.*)(\r?\n\1)+$
replace: $1

but it does not work in my specific case. I also tried:
find: ^(.*\r?\n)\1+
replace: empty

but this does not work in my case either. How can I remove the duplicate lines?

dangsjceamkales@gsnail.com:c6718e7c
Tom34f@sogbug.com:y7vk5z9292
zesorex@gmail.com:ploksfasd
j096875244@gmail.com:st608g410000
doniel.ctz@homail.com:Cotvxbza22523286
levjaamel@hetmail.com:camxmel2004
Andrewhsjfmesjones00@yahoo.com:Winpfgston99001
szaborefeupert666@gail.com:Rupejffgano666
jodgsjny0531@cofx.net:Draskakgon357
zesorex@gmail.com:ploksfasd
wse_adgel_one@hogmail.com:6947903024
j096875244@gmail.com:st608g410000
jringahdhsque@hotmail.com:nadfjddkalgo
Andrewhsjfmesjones00@yahoo.com:Winpfgston99001

Ekopalypse

@pinuzzu99

If you don't need to keep the original order, you can do
Edit->Line Operations->Sort Lines …
Edit->Line Operations->Remove Consecutive Duplicate Lines
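
For anyone who wants to see what those two menu steps amount to, here is a minimal Python sketch of the same idea. It is only an illustration under assumptions: the function name is made up, Python's default string ordering may differ from Notepad++'s sort, and the sample lines are invented.

```python
from itertools import groupby


def sort_and_drop_adjacent_duplicates(lines):
    """Sort the lines, then keep one line from each run of identical neighbours."""
    ordered = sorted(lines)  # sorting brings identical lines next to each other
    # groupby yields one key per run of equal consecutive items
    return [line for line, _run in groupby(ordered)]


# Invented sample data, not taken from the original file
sample = ["b@x.com:2", "a@x.com:1", "b@x.com:2", "c@x.com:3", "a@x.com:1"]
print(sort_and_drop_adjacent_duplicates(sample))
# → ['a@x.com:1', 'b@x.com:2', 'c@x.com:3']
```

The original order is lost, which is exactly the trade-off mentioned above.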

pinuzzu99

Oh great, thanks.
Anyway, I need a regex that deletes my duplicate lines without changing the order…

rinku singh

@pinuzzu99
Use the Remove Duplicate Lines plugin.

PeterJones

@pinuzzu99 ,

If you are willing to hit Replace All multiple times, until all duplicates are removed, this worked for me with your example:

FIND = (?s)((^.*?$)\R.*)\R*\2(\R|\Z)
REPLACE = $1
MODE = regular expression

After three runs, it had become:

dangsjceamkales@gsnail.com:c6718e7c
Tom34f@sogbug.com:y7vk5z9292
zesorex@gmail.com:ploksfasd
j096875244@gmail.com:st608g410000
doniel.ctz@homail.com:Cotvxbza22523286
levjaamel@hetmail.com:camxmel2004
Andrewhsjfmesjones00@yahoo.com:Winpfgston99001
szaborefeupert666@gail.com:Rupejffgano666
jodgsjny0531@cofx.net:Draskakgon357
wse_adgel_one@hogmail.com:6947903024
jringahdhsque@hotmail.com:nadfjddkalgo

… which I think is what you wanted.
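
The "hit Replace All until nothing changes" procedure can be imitated outside Notepad++. The following is a rough Python port, not the Notepad++ engine: Python's `re` module has no `\R`, so `\n` line endings are assumed, and `(?m)` is added because Python's `^` and `$` only match at line boundaries in multiline mode. The function name and the `max_passes` safety limit are made up for this sketch.

```python
import re

# Port of FIND = (?s)((^.*?$)\R.*)\R*\2(\R|\Z), assuming "\n" line endings.
PATTERN = re.compile(r"(?ms)((^.*?$)\n.*)\n*\2(\n|\Z)")


def replace_all_until_stable(text, max_passes=10_000):
    """Repeat the "Replace All" step until a pass replaces nothing.

    Returns the final text and the number of passes that changed something.
    """
    passes = 0
    while passes < max_passes:
        # REPLACE = $1 keeps the first copy plus everything up to the later copy
        text, replaced = PATTERN.subn(r"\1", text)
        if replaced == 0:
            break
        passes += 1
    return text, passes


# Invented sample data: "a" and "b" each appear twice
cleaned, passes = replace_all_until_stable("a\nb\na\nc\nb\n")
print(cleaned.splitlines(), passes)
```

Because the `.*` in the middle is greedy, one match tends to stretch from the first copy of a line to its last copy, which seems to be why each pass removes only a few duplicates and several passes are needed.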

But yes, @gurikbal-singh’s Remove Duplicate Lines plugin should do what you want, too. Just go to Plugins > Plugins Admin to install it.

pinuzzu99

Oh yes PeterJones, it works well! Thanks.
But I have to click each time to delete one row at a time… and what if I had 5000 duplicated rows???
Isn’t there a single command to bulk-remove everything in one go?

And thanks for the advice about the “Remove Duplicate Lines” plug-in. I didn’t know it existed; I’ll try it now. Thank you.

PeterJones

@pinuzzu99 said in remove duplicated line:

but I have to click each time to delete 1 row at a time … and if I had 5000 double rows ???
isn’t there a single command to bulk remove everything in one go?

Regex aren’t infinitely powerful. You can do a lot with them, but if you want to do super-complicated things, sometimes it’s better to use a full-blown programming language (which is what the plugin does, obviously).

For example, in Perl, running from the command line, it could be done with a readable 3-line script, or with the condensed one-liner: perl -pi.bak -e "chomp($k=$_);$_=''if$h{$k};++$h{$k}" filename, which would save the original to filename.bak and delete the duplicate lines when regenerating filename, assuming there’s enough memory to create the hash (map) which checks for duplicates. If memory became a concern, you could sacrifice speed for memory and generate a shorter key (maybe using crc32 or a similar algorithm) to get a nearly 1:1 mapping of line-of-text to key, with keys short enough that they don’t overflow your memory, though any such checksum can occasionally collide – but this isn’t a general programming-help forum, so I won’t go any farther than that.
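
For readers who prefer Python, the same "remember what you have seen" idea could be sketched as follows. This is an illustration, not the plugin's or the one-liner's actual code; the function name and the `compact_keys` option are made up. The option swaps each line for a fixed-size SHA-1 digest (used here in place of crc32), which only saves memory when lines are longer than the 20-byte digest and, in principle, could drop a unique line if two different lines ever collided.

```python
import hashlib


def drop_later_duplicates(lines, compact_keys=False):
    """Keep the first occurrence of each line and preserve the original order."""
    seen = set()
    kept = []
    for line in lines:
        if compact_keys:
            # Assumption for this sketch: store a 20-byte digest instead of the line
            key = hashlib.sha1(line.encode("utf-8")).digest()
        else:
            key = line
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


# Invented sample data
sample = ["b@x.com:2", "a@x.com:1", "b@x.com:2", "c@x.com:3", "a@x.com:1"]
print(drop_later_duplicates(sample))
# → ['b@x.com:2', 'a@x.com:1', 'c@x.com:3']
```

To apply it to a file, one could read the lines with `open(path).read().splitlines()` and write the result back joined with newlines, keeping a backup copy as the `-i.bak` switch does.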

pinuzzu99

OK, understood. You have been very clear.
At this point I will use regex for simple things, and the plug-in for more complicated txt files. Thank you for your support.

pinuzzu99

Hey guy038, don’t you have a valid recipe to do it all in one shot?
I don’t mean a regex like (?s)((^.*?$)\R.*)\R*\2(\R|\Z)
REPLACE = $1
which works on only one value at a time…
The duplicate-line plug-in works fine, but isn’t it possible to refine the regex?

Alan Kilborn

@pinuzzu99

It is possible that regex could work, but it is possible to overwhelm the regex engine with such an execution. You will know you have done this because the entire document will become selected. Better to do it in a non-regex way.

guy038

Hello @pinuzzu99, @ekopalypse, @gurikbal-singh, @peterjones, @alan-kilborn and All,

Sorry for my late answer : I was on a 3-day ski trip to Les Arcs 1800, a French resort. We were a group of 14 people. Unfortunately, the sun was not there for the first two days and, on the last day, there was no skiing due to snow showers !

Luckily, a one-go regex S/R is possible ;-))

So, assuming the input text, below :

Andrewhsjfmesjones00@yahoo.com:Winpfgston99001
dangsjceamkales@gsnail.com:c6718e7c
Tom34f@sogbug.com:y7vk5z9292
zesorex@gmail.com:ploksfasd
j096875244@gmail.com:st608g410000
doniel.ctz@homail.com:Cotvxbza22523286
zesorex@gmail.com:ploksfasd
levjaamel@hetmail.com:camxmel2004
Andrewhsjfmesjones00@yahoo.com:Winpfgston99001
szaborefeupert666@gail.com:Rupejffgano666
jodgsjny0531@cofx.net:Draskakgon357
zesorex@gmail.com:ploksfasd
wse_adgel_one@hogmail.com:6947903024
j096875244@gmail.com:st608g410000
j096875244@gmail.com:st608g410000
jringahdhsque@hotmail.com:nadfjddkalgo
Andrewhsjfmesjones00@yahoo.com:Winpfgston99001

Use the following regex S/R :

SEARCH (?-is)^(.+)\R(?=(?s).*^\1)

REPLACE Leave EMPTY

And you’ll get the output text

dangsjceamkales@gsnail.com:c6718e7c
Tom34f@sogbug.com:y7vk5z9292
doniel.ctz@homail.com:Cotvxbza22523286
levjaamel@hetmail.com:camxmel2004
szaborefeupert666@gail.com:Rupejffgano666
jodgsjny0531@cofx.net:Draskakgon357
zesorex@gmail.com:ploksfasd
wse_adgel_one@hogmail.com:6947903024
j096875244@gmail.com:st608g410000
jringahdhsque@hotmail.com:nadfjddkalgo
Andrewhsjfmesjones00@yahoo.com:Winpfgston99001

Notes :

This regex searches for any non-empty line which is followed, further down, by an identical line, case included, with any range of characters between them, possibly empty and/or spanning several lines. Thus, it deletes every earlier copy of a line and keeps only its last occurrence
The first part (?-is) is the usual set of in-line modifiers ( so the dot matches a single standard char and case is taken into account )
Then, the part ^(.+)\R matches the contents of any non-empty line, from its beginning, stored as group 1 and followed by its line-break \R
The last part (?=(?s).*^\1) is a positive look-ahead structure, (?=........), that is to say a condition which must be true, in order to validate the overall match, but which is never part of the overall match !
- The part (?s).* represents any range, even empty, of any kind of characters ( standard or EOL chars ), due to the (?s) modifier
- The part ^\1 matches the same range of characters \1, at the beginning of a line
As the replacement zone is empty, any line which is repeated further down is deleted, together with its line-break
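
The effect described in these notes, keeping only the last copy of each repeated line, can also be written without a regex. The sketch below is only an illustration in Python with a made-up function name. It compares whole lines, whereas `^\1` as written would presumably also accept a longer line that merely starts with the same text, and it leaves empty lines alone, as `(.+)` does.

```python
def keep_last_occurrence(lines):
    """Drop every line that occurs again further down; keep its last copy."""
    seen = set()
    kept_reversed = []
    # Walk from the bottom up, so the first time a line is met is its last copy
    for line in reversed(lines):
        if line == "" or line not in seen:
            seen.add(line)
            kept_reversed.append(line)
    kept_reversed.reverse()  # restore top-to-bottom order
    return kept_reversed


# Invented sample data: "a" and "b" each appear twice
print(keep_last_occurrence(["a", "b", "a", "c", "b"]))
# → ['a', 'c', 'b']
```

Since each line is looked up in a set instead of being searched for across the rest of the file, this approach should not be affected by the distance between two identical lines.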

Remark :

In a huge file, if two identical lines are separated by a lot of text/lines, this regex S/R may fail and wrongly match the entire file contents. For instance :

Two lines, separated by 1600 all different lines, of 32 characters each, give a correct result of 1 occurrence ( The line with a duplicate )
Two lines, separated by 1700 all different lines, of 32 characters each, give an incorrect result of 2 occurrences ( The line with a duplicate and all file contents )

Best Regards,

guy038

pinuzzu99

Thanks, guy038.
I’m glad you went skiing, even if the weather was not perfect… every now and then it is good to get away from the PC!
Thanks for your reply, and not just for the answer itself, but for the spirit you put into it…
Thank you so much for your very much appreciated answers.