Find duplicate lines that share the same string after /film/



  • Hi everybody,
    This is hard to explain, so I want to start with an example. Please look at the bold text.

    I have a very long list, like the following one:

    http://dl3/film/Daylights.End.3*******************
    http://dl4/film/The.Phenom.2016*******************
    http://dl45/film/The.Wild.Life.720***************
    http://dl58/film/Pele.Birth.Of.A*******************
    http://dl4/film/Sultan.2016.720*******************
    http://dl3pw/film/The.Guvnor.2016*******************
    http://dl3.pw/film/The.Wild.Life.2*******************
    http://dl3.f/film/An.Almost.Perfe*******************
    http://dl3.ftk.pw/film/Scooby.Doo.And.*******************
    http://d2/film/A.Conspiracy.Of*******************
    http://dl45/film/Daylights.End.2*******************

    I want to find all duplicate lines that contain the same string after /film/. I don’t want to compare the lines to the end: the search should only look at the first 10 or 20 characters after /film/.
    The strings after /film/ are different.

    The method should search a list of about 5 thousand lines, find all the duplicate lines and, once they are found, delete or mark one or two of them.

    Thanks in advance



  • Hello @kaveh-202 and All,

    The first step in solving your problem is to know exactly which lines you consider to be duplicates.

    Let’s simplify the problem and consider, for instance, the input text, below :

    /film/abcdefghij
    /film/abcdefghijklmnopqrstuvwxyz
    /film/abcdefghijklm
    /film/abcd123
    /film/abcdefghijklmnopqrst123
    /film/abcdefg
    /film/abcdefghijklmn1234
    /film/abcdefghijklmnop
    /film/abcdefghijklmn1234567890
    /film/abcdefghijklmnopqrst
    /film/abcdefghijklmn
    /film/abcd1234567890
    /film/abcdefghij1234567890
    /film/abcdefghijklmnopqrst1234567890
    /film/abcdefghij123
    /film/abcd
    

    After sorting the lines in lexicographically descending order, we get the text :

    /film/abcdefghijklmnopqrstuvwxyz
    /film/abcdefghijklmnopqrst1234567890
    /film/abcdefghijklmnopqrst123
    /film/abcdefghijklmnopqrst
    /film/abcdefghijklmnop
    /film/abcdefghijklmn1234567890
    /film/abcdefghijklmn1234
    /film/abcdefghijklmn
    /film/abcdefghijklm
    /film/abcdefghij1234567890
    /film/abcdefghij123
    /film/abcdefghij
    /film/abcdefg
    /film/abcd1234567890
    /film/abcd123
    /film/abcd
    

    As you said :

    the search should look for duplicate strings just like 10 or 20 characters after / film /

    Then :

    • A) Do you consider that the 4 lines below, all containing the string abcdefghijklmnopqrst ( 20 chars ) are duplicates ?

    /film/abcdefghijklmnopqrstuvwxyz
    /film/abcdefghijklmnopqrst1234567890
    /film/abcdefghijklmnopqrst123
    /film/abcdefghijklmnopqrst

    • B) Do you consider that the 4 lines below, all containing the string abcdefghijklmn ( 14 chars ) are duplicates ?

    /film/abcdefghijklmnop
    /film/abcdefghijklmn1234567890
    /film/abcdefghijklmn1234
    /film/abcdefghijklmn

    • C) Do you consider that the 4 lines below, all containing the string abcdefghij ( 10 chars ) are duplicates ?

    /film/abcdefghijklm
    /film/abcdefghij1234567890
    /film/abcdefghij123
    /film/abcdefghij

    • D) Finally, do you consider that the 4 lines below, all containing the string abcd ( 4 chars ) are duplicates or not ( because the identical part is smaller than 10 chars ) ?

    /film/abcdefg
    /film/abcd1234567890
    /film/abcd123
    /film/abcd


    See you later,

    Best Regards,

    guy038

    P. S. :

    Note that sorting is not sufficient to classify the data according to the longest identical leading part. For instance, given the input text :

    /film/zyxdefghij123
    /film/abcdefghijklmnopqrstuvwxyz
    /film/zyxdefghijklm
    /film/abcdefghijklmnopqrst1234567890
    /film/zyxdefghij1234567890
    /film/abcdefghijklmnopqrst123
    /film/zyxdefghij
    /film/abcdefghijklmnopqrst
    

    After sorting, we get :

    /film/zyxdefghijklm
    /film/zyxdefghij1234567890
    /film/zyxdefghij123
    /film/zyxdefghij
    /film/abcdefghijklmnopqrstuvwxyz
    /film/abcdefghijklmnopqrst1234567890
    /film/abcdefghijklmnopqrst123
    /film/abcdefghijklmnopqrst
    

    As you see, the first four lines share an identical part of only 10 characters ( zyxdefghij ), which is less than the last four lines, which share an identical part of 20 characters ( abcdefghijklmnopqrst ) !
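
    The lengths quoted above can be checked with a short script. This is only a minimal Python sketch, outside Notepad++; the helper name shared_prefix_length is hypothetical, and it assumes that every line contains the marker /film/.

```python
import os

def shared_prefix_length(lines, marker="/film/"):
    """Length of the identical leading part that follows `marker` in every line.

    Assumption: each line contains `marker` at least once.
    """
    # Keep only what comes after the first occurrence of the marker.
    tails = [line.split(marker, 1)[1] for line in lines]
    # os.path.commonprefix compares the strings character by character.
    return len(os.path.commonprefix(tails))

sorted_lines = [
    "/film/zyxdefghijklm",
    "/film/zyxdefghij1234567890",
    "/film/zyxdefghij123",
    "/film/zyxdefghij",
    "/film/abcdefghijklmnopqrstuvwxyz",
    "/film/abcdefghijklmnopqrst1234567890",
    "/film/abcdefghijklmnopqrst123",
    "/film/abcdefghijklmnopqrst",
]

first_four = shared_prefix_length(sorted_lines[:4])
last_four = shared_prefix_length(sorted_lines[4:])
print(first_four, last_four)  # 10 20
```

    The sort order alone says nothing about these lengths, so they have to be measured group by group.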



  • I want to use a regular expression in Notepad++; I can’t search the whole list manually.



  • @kaveh-202

    @guy038 gave you FOUR questions to answer, conveniently labeled A, B, C and D.

    Please answer them if you choose to continue this thread.



  • I want to check at least 10 characters after /film/.
    The answer to A and B is no; to C and D it is yes.

    If it helps, here is a Linux bash command with which I can find the duplicate names:

    cat long-list | grep -Eo '/film/.{15}' | sort | uniq --repeated
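
    For readers without a shell at hand, the same idea can be sketched in Python. This is an approximation of the pipeline above, not a replacement for it: the function name repeated_keys and the width parameter are hypothetical, and the sample data is the list from the first post.

```python
import re
from collections import Counter

def repeated_keys(lines, width=15):
    """Return every '/film/' + `width` characters key seen more than once, sorted."""
    pattern = re.compile(r"/film/.{%d}" % width)
    counts = Counter()
    for line in lines:
        # Like grep -o, collect each match found in the line.
        counts.update(pattern.findall(line))
    return sorted(key for key, n in counts.items() if n > 1)

sample = [
    "http://dl3/film/Daylights.End.3*******************",
    "http://dl4/film/The.Phenom.2016*******************",
    "http://dl45/film/The.Wild.Life.720***************",
    "http://dl58/film/Pele.Birth.Of.A*******************",
    "http://dl4/film/Sultan.2016.720*******************",
    "http://dl3pw/film/The.Guvnor.2016*******************",
    "http://dl3.pw/film/The.Wild.Life.2*******************",
    "http://dl3.f/film/An.Almost.Perfe*******************",
    "http://dl3.ftk.pw/film/Scooby.Doo.And.*******************",
    "http://d2/film/A.Conspiracy.Of*******************",
    "http://dl45/film/Daylights.End.2*******************",
]

print(repeated_keys(sample, width=10))
# ['/film/Daylights.', '/film/The.Wild.L']
```

    With a width of 15, as in the command, this sample yields no repeated key, because the two Daylights lines and the two The.Wild.Life lines first differ at the 15th character.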



  • Hi, @kaveh-202 @alan-kilborn and All,

    Let’s suppose, to begin with, that we focus on the first 10 characters after the string /film/

    Then, from your example :

    http://dl3/film/Daylights.End.3*******************
    http://dl4/film/The.Phenom.2016*******************
    http://dl45/film/The.Wild.Life.720***************
    http://dl58/film/Pele.Birth.Of.A*******************
    http://dl4/film/Sultan.2016.720*******************
    http://dl3pw/film/The.Guvnor.2016*******************
    http://dl3.pw/film/The.Wild.Life.2*******************
    http://dl3.f/film/An.Almost.Perfe*******************
    http://dl3.ftk.pw/film/Scooby.Doo.And.*******************
    http://d2/film/A.Conspiracy.Of*******************
    http://dl45/film/Daylights.End.2*******************
    

    With the simple regex S/R, below :

    SEARCH (?-s)^.+/film/(.{10})

    REPLACE \1\t$0

    We get the text :

    Daylights.	http://dl3/film/Daylights.End.3*******************
    The.Phenom	http://dl4/film/The.Phenom.2016*******************
    The.Wild.L	http://dl45/film/The.Wild.Life.720***************
    Pele.Birth	http://dl58/film/Pele.Birth.Of.A*******************
    Sultan.201	http://dl4/film/Sultan.2016.720*******************
    The.Guvnor	http://dl3pw/film/The.Guvnor.2016*******************
    The.Wild.L	http://dl3.pw/film/The.Wild.Life.2*******************
    An.Almost.	http://dl3.f/film/An.Almost.Perfe*******************
    Scooby.Doo	http://dl3.ftk.pw/film/Scooby.Doo.And.*******************
    A.Conspira	http://d2/film/A.Conspiracy.Of*******************
    Daylights.	http://dl45/film/Daylights.End.2*******************
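
    The same tagging step can be reproduced outside Notepad++. The following is a minimal Python sketch, not the Notepad++ regex itself: the (?-s) prefix is dropped because, in Python's re module, the dot does not match a newline by default, and the Notepad++ replacement $0 is written \g<0>. The function name tag_lines is hypothetical.

```python
import re

def tag_lines(text, width=10):
    """Prefix each line with the `width` characters that follow /film/, then a tab."""
    # re.MULTILINE lets ^ match at the start of every line.
    pattern = re.compile(r"^.+/film/(.{%d})" % width, re.MULTILINE)
    # \1 is the captured key, \t a tab, \g<0> the whole match.
    return pattern.sub(r"\1\t\g<0>", text)

text = (
    "http://dl3/film/Daylights.End.3*******************\n"
    "http://d2/film/A.Conspiracy.Of*******************\n"
    "http://dl45/film/Daylights.End.2*******************"
)
tagged = tag_lines(text)
print(tagged)
```

    Lines that do not contain /film/ followed by at least `width` characters are left unchanged.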
    

    Then, it’s obvious that the first and last lines are duplicates ( Daylights. ), as well as lines 3 and 7 ( The.Wild.L )

    Now, what do you want to do regarding lines 1 and 11 and lines 3 and 7 ?

    Presently, it’s quite easy to build a regex which would delete all duplicate lines, keeping only the last one found !
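
    To illustrate what "keeping only the last one found" means, here is a minimal Python sketch of that behaviour. It is not the regex mentioned above; the function name keep_last_duplicate is hypothetical, and it assumes that /film/ occurs at most once per line.

```python
import re

def keep_last_duplicate(lines, width=10):
    """Drop every line whose key reappears in a later line; keep the original order."""
    pattern = re.compile(r"/film/(.{%d})" % width)

    def key_of(line):
        match = pattern.search(line)
        return match.group(1) if match else None

    # First pass: remember the index of the last line carrying each key.
    last_index = {}
    for i, line in enumerate(lines):
        key = key_of(line)
        if key is not None:
            last_index[key] = i

    # Second pass: keep lines without a key, and the last line of each key.
    return [
        line
        for i, line in enumerate(lines)
        if key_of(line) is None or last_index[key_of(line)] == i
    ]

lines = [
    "http://dl3/film/Daylights.End.3*******************",
    "http://dl45/film/The.Wild.Life.720***************",
    "http://dl4/film/Sultan.2016.720*******************",
    "http://dl3.pw/film/The.Wild.Life.2*******************",
    "http://dl45/film/Daylights.End.2*******************",
]
kept = keep_last_duplicate(lines)
print("\n".join(kept))
```

    No sorting is involved here, so the surviving lines stay in their initial order, and a key may have any number of duplicates.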

    Two other questions :

    • Do you mind if a sort process is used, which, of course, would alter the initial order of lines ?

    • How many duplicates may a given line have ? Only one, or more ?

    BR

    guy038

