remove duplicated line



  • I have a very long txt file like the one below. Some lines in this file are duplicated. I tried Replace in regular expression mode with this command:
    find: ^(.*)(\r?\n\1)+$
    replace: $1

    but it does not work in my specific case. I also tried:
    find: ^(.*\r?\n)\1+
    replace: empty

    but this does not work in my case either. How can I remove the duplicate lines?

    dangsjceamkales@gsnail.com:c6718e7c
    Tom34f@sogbug.com:y7vk5z9292
    zesorex@gmail.com:ploksfasd
    j096875244@gmail.com:st608g410000
    doniel.ctz@homail.com:Cotvxbza22523286
    levjaamel@hetmail.com:camxmel2004
    Andrewhsjfmesjones00@yahoo.com:Winpfgston99001
    szaborefeupert666@gail.com:Rupejffgano666
    jodgsjny0531@cofx.net:Draskakgon357
    zesorex@gmail.com:ploksfasd
    wse_adgel_one@hogmail.com:6947903024
    j096875244@gmail.com:st608g410000
    jringahdhsque@hotmail.com:nadfjddkalgo
    Andrewhsjfmesjones00@yahoo.com:Winpfgston99001
    


  • @pinuzzu99

    If you do not need to keep the original ordering, you can use:
    Edit->Line Operations->Sort Lines …
    Edit->Line Operations->Remove Consecutive Duplicate Lines
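
For readers who want to see what this sort-then-remove-consecutive-duplicates approach does to the text, here is a minimal Python sketch. It is not Notepad++'s own implementation: it assumes a plain lexicographic sort, and the function name sort_and_dedupe is made up for illustration.

```python
from itertools import groupby


def sort_and_dedupe(lines):
    """Sort the lines, then collapse each run of identical neighbours to one line."""
    # After sorting, identical lines sit next to each other, so removing
    # consecutive duplicates removes every duplicate. The original order is lost.
    return [line for line, _run in groupby(sorted(lines))]


sample = ["b@x.com:2", "a@x.com:1", "b@x.com:2", "c@x.com:3", "a@x.com:1"]
print(sort_and_dedupe(sample))  # → ['a@x.com:1', 'b@x.com:2', 'c@x.com:3']
```

The sort is what makes the second step sufficient: on unsorted text, removing only consecutive duplicates would leave the scattered ones in place.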



  • Oh great, thanks.
    Anyway, I need a regex that deletes my duplicate lines without changing the order…



  • @pinuzzu99
    Use the Remove Duplicate Lines plugin.



  • @pinuzzu99 ,

    If you are willing to hit Replace All multiple times, until all duplicates are removed, this worked for me with your example:

    • FIND = (?s)((^.*?$)\R.*)\R*\2(\R|\Z)
    • REPLACE = $1
    • MODE = regular expression

    After three runs, it had become:

    dangsjceamkales@gsnail.com:c6718e7c
    Tom34f@sogbug.com:y7vk5z9292
    zesorex@gmail.com:ploksfasd
    j096875244@gmail.com:st608g410000
    doniel.ctz@homail.com:Cotvxbza22523286
    levjaamel@hetmail.com:camxmel2004
    Andrewhsjfmesjones00@yahoo.com:Winpfgston99001
    szaborefeupert666@gail.com:Rupejffgano666
    jodgsjny0531@cofx.net:Draskakgon357
    wse_adgel_one@hogmail.com:6947903024
    jringahdhsque@hotmail.com:nadfjddkalgo
    
    

    … which I think is what you wanted.

    But yes, @gurikbal-singh’s Remove Duplicate Lines plugin should do what you want, too. Just go to Plugins > Plugins Admin to install it.



  • Oh yes, PeterJones, it works well! Thanks.
    But I have to click each time to delete one row at a time … what if I had 5000 duplicate rows?
    Isn’t there a single command to bulk remove everything in one go?

    And thanks for the advice about the “Remove Duplicate Lines” plugin. I didn’t know it existed; I will try it now. Thank you.



  • @pinuzzu99 said in remove duplicated line:

    But I have to click each time to delete one row at a time … what if I had 5000 duplicate rows?
    Isn’t there a single command to bulk remove everything in one go?

    Regexes aren’t infinitely powerful. You can do a lot with them, but if you want to do super-complicated things, sometimes it’s better to use a full-blown programming language (which is what the plugin does, obviously).

    For example, in Perl, run from the command line, it could be done with a readable 3-line script, or with the condensed one-liner perl -pi.bak -e "chomp($k=$_);$_=''if$h{$k};++$h{$k}" filename, which saves the original to filename.bak and deletes the duplicate lines when regenerating filename, assuming there is enough memory to hold the hash (map) that tracks the lines already seen. If memory became a concern, you could sacrifice speed for memory and generate a shorter key per line (maybe using crc32 or a similar algorithm), so that the keys are short enough not to overflow your memory, at the cost that two different lines could occasionally collide on the same key. But this isn’t a general programming-help forum, so I won’t go any further than that.
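
The same seen-lines idea can be sketched in Python for illustration: remember every line already seen and emit a line only the first time it appears. This is not a translation of any particular script; the function name dedupe_keep_first and the optional digest switch are hypothetical. The digest variant stores a 16-byte BLAKE2 digest per line instead of the line itself, which is a stand-in for the "shorter key" suggestion (it makes collisions very unlikely, not impossible).

```python
import hashlib


def dedupe_keep_first(lines, digest=False):
    """Yield each distinct line once, in order of first appearance."""
    seen = set()
    for line in lines:
        text = line.rstrip("\r\n")  # compare without the line break, like chomp
        if digest:
            # Store a fixed-size digest instead of the full line text.
            key = hashlib.blake2b(text.encode("utf-8"), digest_size=16).digest()
        else:
            key = text
        if key not in seen:
            seen.add(key)
            yield line


sample = ["a\n", "b\n", "a\n", "c\n", "b\n"]
print(list(dedupe_keep_first(sample)))  # → ['a\n', 'b\n', 'c\n']
```

Because it is a generator, it can be fed an open file object line by line, so only the set of keys has to fit in memory, not the whole file.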



  • OK, understood. You have been very clear.
    At this point I will use the regex for simple things and the plugin for the more complicated text files. Thank you for your support.



  • Hey guy038, don’t you have a valid recipe to do it all in one shot?
    I do not mean something like the string (?s)((^.*?$)\R.*)\R*\2(\R|\Z)
    REPLACE = $1
    which works on only one value at a time…
    The Remove Duplicate Lines plugin works fine, but isn’t it possible to refine the regex?



  • @pinuzzu99

    A regex could work, but such a search can overwhelm the regex engine. You will know this has happened because the entire document will become selected. It is better to do it in a non-regex way.



  • Hello @pinuzzu99, @ekopalypse, @gurikbal-singh, @peterjones, @alan-kilborn and All,

    Sorry for my late answer : I was on a 3-day ski trip to Les Arcs 1800, a French resort. We were a group of 14 people. Unfortunately, the sun was not out on the first two days, and on the last day there was no skiing due to snow showers !


    Luckily, a one-go regex S/R is possible ;-))

    So, assuming the input text, below :

    Andrewhsjfmesjones00@yahoo.com:Winpfgston99001
    dangsjceamkales@gsnail.com:c6718e7c
    Tom34f@sogbug.com:y7vk5z9292
    zesorex@gmail.com:ploksfasd
    j096875244@gmail.com:st608g410000
    doniel.ctz@homail.com:Cotvxbza22523286
    zesorex@gmail.com:ploksfasd
    levjaamel@hetmail.com:camxmel2004
    Andrewhsjfmesjones00@yahoo.com:Winpfgston99001
    szaborefeupert666@gail.com:Rupejffgano666
    jodgsjny0531@cofx.net:Draskakgon357
    zesorex@gmail.com:ploksfasd
    wse_adgel_one@hogmail.com:6947903024
    j096875244@gmail.com:st608g410000
    j096875244@gmail.com:st608g410000
    jringahdhsque@hotmail.com:nadfjddkalgo
    Andrewhsjfmesjones00@yahoo.com:Winpfgston99001
    

    Use the following regex S/R :

    SEARCH (?-is)^(.+)\R(?=(?s).*^\1)

    REPLACE Leave EMPTY

    And you’ll get the output text

    dangsjceamkales@gsnail.com:c6718e7c
    Tom34f@sogbug.com:y7vk5z9292
    doniel.ctz@homail.com:Cotvxbza22523286
    levjaamel@hetmail.com:camxmel2004
    szaborefeupert666@gail.com:Rupejffgano666
    jodgsjny0531@cofx.net:Draskakgon357
    zesorex@gmail.com:ploksfasd
    wse_adgel_one@hogmail.com:6947903024
    j096875244@gmail.com:st608g410000
    jringahdhsque@hotmail.com:nadfjddkalgo
    Andrewhsjfmesjones00@yahoo.com:Winpfgston99001
    

    Notes :

    • This regex searches for any non-empty line that is followed, further down, by an identical line ( case included ), the two being separated by any range of characters, possibly empty and/or spanning multiple lines. Thus, it deletes every duplicate of a line that is located before the last occurrence of that line

    • The first part (?-is) sets the usual in-line modifiers ( so the dot matches a single standard character, not an EOL character, and the search is case-sensitive )

    • Then, the part ^(.+)\R searches for the contents of any non-empty line, from its beginning, stored as group 1 and followed by its line-break \R

    • The last part (?=(?s).*^\1) is a positive look-ahead structure, (?=........), that is to say a condition which must be true, in order to validate the overall match, but which is never part of the overall match !

      • The part (?s).* represents any range, even empty, of any kind of characters ( standard or EOL chars ), due to the (?s) modifier

      • The part ^\1 matches the same text as group 1, \1, at the beginning of a line

    • As the replacement zone is empty, any line which is repeated further down is deleted, together with its line-break
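
To experiment with this look-ahead outside Notepad++, here is a rough Python translation. It is an approximation under stated assumptions, not the Boost regex engine that Notepad++ uses: Python's re module has no \R, so the sketch assumes \n line endings; re.MULTILINE stands in for the line-anchored ^; and (?s:...) limits the dot-matches-newline behaviour to the look-ahead. The trailing $ is an addition that is not in the regex above, so that a line is removed only when a complete identical line follows, not one that merely starts with the same text. The names KEEP_LAST and remove_earlier_duplicates are made up for illustration.

```python
import re

# ^(.+)\n          a non-empty line and its line break (group 1 = the line text)
# (?=(?s:.*)^\1$)  ...only if an identical whole line occurs somewhere later
KEEP_LAST = re.compile(r"^(.+)\n(?=(?s:.*)^\1$)", re.MULTILINE)


def remove_earlier_duplicates(text):
    """Delete every line that reappears further down, keeping the last copy."""
    return KEEP_LAST.sub("", text)


# The first "a" and the first "b" are removed; output lines are a, c, b.
print(remove_earlier_duplicates("a\nb\na\nc\nb\n"), end="")
```

Note that the surviving copy is the last occurrence of each line, so the result preserves the order of last appearances rather than first appearances.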

    Remark :

    In a huge file, if two identical lines are separated by a lot of text/lines, this regex S/R may fail and wrongly match the entire file contents. For instance :

    • Two lines, separated by 1600 all-different lines of 32 characters each, give a correct result of 1 occurrence ( the line with a duplicate )

    • Two lines, separated by 1700 all-different lines of 32 characters each, give an incorrect result of 2 occurrences ( the line with a duplicate and the entire file contents )

    Best Regards,

    guy038



  • Thanks, guy038.
    I’m glad you went skiing, even if the weather was not perfect… every now and then it is good to get away from the PC!
    Thanks for your reply, not just for the answer itself but also for the spirit you put into it…
    Thank you so much for your very much appreciated answers.

