    Duplicate lines that include the same string after /film/

    • kaveh 202

      Hi everybody,
      This is hard to explain, so I will start with an example. Please look at the bold text.

      I have a very long list like the following:

      http://dl3/film/Daylights.End.3*******************
      http://dl4/film/The.Phenom.2016*******************
      http://dl45/film/The.Wild.Life.720***************
      http://dl58/film/Pele.Birth.Of.A*******************
      http://dl4/film/Sultan.2016.720*******************
      http://dl3pw/film/The.Guvnor.2016*******************
      http://dl3.pw/film/The.Wild.Life.2*******************
      http://dl3.f/film/An.Almost.Perfe*******************
      http://dl3.ftk.pw/film/Scooby.Doo.And.*******************
      http://d2/film/A.Conspiracy.Of*******************
      http://dl45/film/Daylights.End.2*******************

      I want to find all duplicate lines that contain the same string after /film/. I don’t want to compare each line to its end; the search should only compare about the first 10 or 20 characters after /film/.
      The strings after /film/ are different.

      The method should handle a list of about 5 thousand lines, find all duplicate lines and, once they are found, maybe delete or mark one or two of them.

      Thanks in advance

      • guy038

        Hello @kaveh-202 and All,

        The first step in solving your problem is to know exactly which lines you consider to be duplicates !

        Let’s simplify the problem and consider, for instance, the input text, below :

        /film/abcdefghij
        /film/abcdefghijklmnopqrstuvwxyz
        /film/abcdefghijklm
        /film/abcd123
        /film/abcdefghijklmnopqrst123
        /film/abcdefg
        /film/abcdefghijklmn1234
        /film/abcdefghijklmnop
        /film/abcdefghijklmn1234567890
        /film/abcdefghijklmnopqrst
        /film/abcdefghijklmn
        /film/abcd1234567890
        /film/abcdefghij1234567890
        /film/abcdefghijklmnopqrst1234567890
        /film/abcdefghij123
        /film/abcd
        

        After sorting the lines in lexicographically descending order, we get the text :

        /film/abcdefghijklmnopqrstuvwxyz
        /film/abcdefghijklmnopqrst1234567890
        /film/abcdefghijklmnopqrst123
        /film/abcdefghijklmnopqrst
        /film/abcdefghijklmnop
        /film/abcdefghijklmn1234567890
        /film/abcdefghijklmn1234
        /film/abcdefghijklmn
        /film/abcdefghijklm
        /film/abcdefghij1234567890
        /film/abcdefghij123
        /film/abcdefghij
        /film/abcdefg
        /film/abcd1234567890
        /film/abcd123
        /film/abcd
        

        As you said :

        the search should look for duplicate strings just like 10 or 20 characters after / film /

        Then :

        • A) Do you consider that the 4 lines below, all containing the string abcdefghijklmnopqrst ( 20 chars ) are duplicates ?

        /film/abcdefghijklmnopqrstuvwxyz
        /film/abcdefghijklmnopqrst1234567890
        /film/abcdefghijklmnopqrst123
        /film/abcdefghijklmnopqrst

        • B) Do you consider that the 4 lines below, all containing the string abcdefghijklmn ( 14 chars ) are duplicates ?

        /film/abcdefghijklmnop
        /film/abcdefghijklmn1234567890
        /film/abcdefghijklmn1234
        /film/abcdefghijklmn

        • C) Do you consider that the 4 lines below, all containing the string abcdefghij ( 10 chars ) are duplicates ?

        /film/abcdefghijklm
        /film/abcdefghij1234567890
        /film/abcdefghij123
        /film/abcdefghij

        • D) Finally, do you consider that the 4 lines below, all containing the string abcd ( 4 chars ) are duplicates or not ( because the identical part is smaller than 10 chars ) ?

        /film/abcdefg
        /film/abcd1234567890
        /film/abcd123
        /film/abcd
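Questions A to D can be made concrete with a minimal Python sketch. It is only an illustration, not a Notepad++ feature: the function name group_by_prefix and the shortened sample list are hypothetical. It groups lines by the first n characters after /film/ and assumes that lines with fewer than n characters after /film/ are simply skipped, which is just one possible answer to question D.

```python
import re
from collections import defaultdict

def group_by_prefix(lines, n):
    """Group lines by the first n characters that follow '/film/'."""
    pattern = re.compile(r"/film/(.{%d})" % n)
    groups = defaultdict(list)
    for line in lines:
        m = pattern.search(line)
        if m:  # assumption: lines shorter than n chars after /film/ are ignored
            groups[m.group(1)].append(line)
    # keep only the keys shared by at least two lines
    return {key: found for key, found in groups.items() if len(found) > 1}

sample = [
    "/film/abcdefghij",
    "/film/abcdefghijklm",
    "/film/abcd123",
    "/film/abcdefg",
    "/film/abcd",
]

print(group_by_prefix(sample, 10))  # only the two lines starting with abcdefghij
print(group_by_prefix(sample, 4))   # all five lines share abcd
```

With n = 10 the lines abcd123, abcdefg and abcd never form a group, whereas with n = 4 every line of the sample counts as a duplicate. This is why the exact answer to each question matters.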


        See you later,

        Best Regards,

        guy038

        P. S. :

        Note that sorting is not sufficient to classify data, according to longest identical leading part. For instance, given the input text :

        /film/zyxdefghij123
        /film/abcdefghijklmnopqrstuvwxyz
        /film/zyxdefghijklm
        /film/abcdefghijklmnopqrst1234567890
        /film/zyxdefghij1234567890
        /film/abcdefghijklmnopqrst123
        /film/zyxdefghij
        /film/abcdefghijklmnopqrst
        

        After sorting, we get :

        /film/zyxdefghijklm
        /film/zyxdefghij1234567890
        /film/zyxdefghij123
        /film/zyxdefghij
        /film/abcdefghijklmnopqrstuvwxyz
        /film/abcdefghijklmnopqrst1234567890
        /film/abcdefghijklmnopqrst123
        /film/abcdefghijklmnopqrst
        

        As you can see, the first four lines share a leading part of only 10 characters ( zyxdefghij ), fewer than the last four lines, which share a leading part of 20 characters ( abcdefghijklmnopqrst ) !
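The lengths quoted in this P. S. can be checked with a short Python sketch. The helper name shared_prefix_length is hypothetical, and the sketch assumes every line contains /film/ exactly as in the sample. It relies on os.path.commonprefix, which compares strings character by character.

```python
import os

def shared_prefix_length(lines, marker="/film/"):
    """Length of the longest leading part shared by all lines after the marker."""
    # assumption: every line contains the marker; otherwise this raises IndexError
    tails = [line.split(marker, 1)[1] for line in lines]
    return len(os.path.commonprefix(tails))

zyx_group = [
    "/film/zyxdefghijklm",
    "/film/zyxdefghij1234567890",
    "/film/zyxdefghij123",
    "/film/zyxdefghij",
]
abc_group = [
    "/film/abcdefghijklmnopqrstuvwxyz",
    "/film/abcdefghijklmnopqrst1234567890",
    "/film/abcdefghijklmnopqrst123",
    "/film/abcdefghijklmnopqrst",
]

print(shared_prefix_length(zyx_group))  # 10
print(shared_prefix_length(abc_group))  # 20
```

The group that sorts first is not the one with the longest shared part, so the sort order alone says nothing about how strongly the lines of a group resemble each other.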

        • kaveh 202

          I want to use a regular expression in Notepad++; I can’t search the whole list manually.

          • Alan Kilborn

            @kaveh-202

            @guy038 gave you FOUR questions to answer, conveniently labeled A, B, C and D.

            Please answer them if you choose to continue this thread.

            • kaveh 202

              I want to check at least 10 characters after /film/.
              The answer for A and B is no; for C and D it is yes.

              If it helps, here is a Linux bash command with which I can find the duplicate names:

              cat long-list | grep -Eo '/film/.{15}' | sort | uniq --repeated
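For anyone without a Linux shell at hand, the pipeline above can be mirrored in a few lines of Python. This is a rough sketch, not an exact replacement: the function name repeated_keys is hypothetical, the list is a subset of the example from the first post, and Python sorts by code point whereas sort follows the current locale.

```python
import re
from collections import Counter

def repeated_keys(lines, width=15):
    """Return each '/film/' + width-character key that occurs more than once."""
    pattern = re.compile(r"/film/.{%d}" % width)
    counts = Counter()
    for line in lines:
        # like grep -o: collect every match on the line, not the whole line
        counts.update(pattern.findall(line))
    # like sort | uniq --repeated: one sorted entry per repeated key
    return sorted(key for key, n in counts.items() if n > 1)

urls = [
    "http://dl3/film/Daylights.End.3*******************",
    "http://dl45/film/The.Wild.Life.720***************",
    "http://dl4/film/Sultan.2016.720*******************",
    "http://dl3.pw/film/The.Wild.Life.2*******************",
    "http://dl45/film/Daylights.End.2*******************",
]

print(repeated_keys(urls, 10))
print(repeated_keys(urls, 15))
```

On these sample lines a width of 10 reports both Daylights. and The.Wild.L, while a width of 15 reports nothing, because the 15th character after /film/ already differs ( End.3 versus End.2, Life.7 versus Life.2 ). The width therefore has to be chosen to suit the real data.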

              • guy038

                Hi, @kaveh-202 @alan-kilborn and All,

                Let’s suppose, to begin with, that we focus on the first 10 chars after the string /film/

                Then, from your example :

                http://dl3/film/Daylights.End.3*******************
                http://dl4/film/The.Phenom.2016*******************
                http://dl45/film/The.Wild.Life.720***************
                http://dl58/film/Pele.Birth.Of.A*******************
                http://dl4/film/Sultan.2016.720*******************
                http://dl3pw/film/The.Guvnor.2016*******************
                http://dl3.pw/film/The.Wild.Life.2*******************
                http://dl3.f/film/An.Almost.Perfe*******************
                http://dl3.ftk.pw/film/Scooby.Doo.And.*******************
                http://d2/film/A.Conspiracy.Of*******************
                http://dl45/film/Daylights.End.2*******************
                

                With the simple regex S/R, below :

                SEARCH (?-s)^.+/film/(.{10})

                REPLACE \1\t$0

                We get the text :

                Daylights.	http://dl3/film/Daylights.End.3*******************
                The.Phenom	http://dl4/film/The.Phenom.2016*******************
                The.Wild.L	http://dl45/film/The.Wild.Life.720***************
                Pele.Birth	http://dl58/film/Pele.Birth.Of.A*******************
                Sultan.201	http://dl4/film/Sultan.2016.720*******************
                The.Guvnor	http://dl3pw/film/The.Guvnor.2016*******************
                The.Wild.L	http://dl3.pw/film/The.Wild.Life.2*******************
                An.Almost.	http://dl3.f/film/An.Almost.Perfe*******************
                Scooby.Doo	http://dl3.ftk.pw/film/Scooby.Doo.And.*******************
                A.Conspira	http://d2/film/A.Conspiracy.Of*******************
                Daylights.	http://dl45/film/Daylights.End.2*******************
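The same tagging step can be tried outside Notepad++ with Python's re module, as a rough sketch on three of the lines. Two differences in syntax are worth noting: Python does not need the leading (?-s), because its dot excludes newlines unless re.DOTALL is requested, and the whole match is written \g<0> instead of $0. The variable names text and tagged are hypothetical.

```python
import re

text = "\n".join([
    "http://dl3/film/Daylights.End.3*******************",
    "http://dl4/film/The.Phenom.2016*******************",
    "http://dl45/film/Daylights.End.2*******************",
])

# re.MULTILINE makes ^ match at the start of every line, as in Notepad++.
# The replacement puts the 10-character key and a tab in front of the match;
# the rest of each line is left untouched.
tagged = re.sub(r"^.+/film/(.{10})", r"\1\t\g<0>", text, flags=re.MULTILINE)

print(tagged)
```

Each output line starts with the 10-character key followed by a tab, so the keys line up in a first column that is easy to sort or compare.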
                

                Then, it’s obvious that the first and last lines are duplicates ( Daylights. ), as well as lines 3 and 7 ( The.Wild.L )

                Now, what do you want to do regarding lines 1 and 11 and lines 3 and 7 ?

                At this point, it’s quite easy to build a regex which would delete all duplicate lines, keeping only the last one found !

                Two other questions :

                • Do you mind if a sort process is used, which, of course, would alter the initial order of lines ?

                • How many duplicates may a line have ? Only one, or more ?
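As a point of comparison with the regex approach, here is a minimal Python sketch of the "keep only the last one found" behaviour. It does not sort, so the original order is preserved, and it copes with any number of duplicates per key. The names KEY and keep_last_duplicate are hypothetical, and keeping lines that have no /film/ key at all is an assumption.

```python
import re

# key = the first 10 characters after /film/
KEY = re.compile(r"/film/(.{10})")

def keep_last_duplicate(lines):
    """Drop every line whose key reappears later; keep the last occurrence."""
    last_index = {}
    for i, line in enumerate(lines):
        m = KEY.search(line)
        if m:
            last_index[m.group(1)] = i  # later lines overwrite earlier ones
    kept = []
    for i, line in enumerate(lines):
        m = KEY.search(line)
        # assumption: lines without a key are kept unchanged
        if m is None or last_index[m.group(1)] == i:
            kept.append(line)
    return kept

sample = [
    "http://dl3/film/Daylights.End.3",
    "http://dl4/film/The.Phenom.2016",
    "http://dl45/film/Daylights.End.2",
]

for line in keep_last_duplicate(sample):
    print(line)
```

Here the first Daylights line is dropped and the other two lines are printed in their original order.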

                BR

                guy038
