Duplicate lines that include the same string after /film/
-
Hi everybody,
It is hard to explain, so I want to start with an example. Please look at the bold text. I have a very long list like the following:
http://dl3/film/Daylights.End.3*******************
http://dl4/film/The.Phenom.2016*******************
http://dl45/film/The.Wild.Life.720***************
http://dl58/film/Pele.Birth.Of.A*******************
http://dl4/film/Sultan.2016.720*******************
http://dl3pw/film/The.Guvnor.2016*******************
http://dl3.pw/film/The.Wild.Life.2*******************
http://dl3.f/film/An.Almost.Perfe*******************
http://dl3.ftk.pw/film/Scooby.Doo.And.*******************
http://d2/film/A.Conspiracy.Of*******************
http://dl45/film/Daylights.End.2*******************
I want to find all duplicate lines that include the same string after /film/, but I don’t want to compare the lines to the end: the search should look for duplicate strings only in the first 10 or 20 characters after /film/.
The strings after /film/ are different. The method should search a list of about 5 thousand lines, find all duplicate lines, and then perhaps delete or mark one or two of them.
Thanks in advance
-
Hello @kaveh-202 and All,
The first step in solving your problem is to know exactly which lines you consider to be duplicates.
Let’s simplify the problem and consider, for instance, the input text, below :
/film/abcdefghij
/film/abcdefghijklmnopqrstuvwxyz
/film/abcdefghijklm
/film/abcd123
/film/abcdefghijklmnopqrst123
/film/abcdefg
/film/abcdefghijklmn1234
/film/abcdefghijklmnop
/film/abcdefghijklmn1234567890
/film/abcdefghijklmnopqrst
/film/abcdefghijklmn
/film/abcd1234567890
/film/abcdefghij1234567890
/film/abcdefghijklmnopqrst1234567890
/film/abcdefghij123
/film/abcd
After sorting the lines in lexicographically descending order, we get the text :
/film/abcdefghijklmnopqrstuvwxyz
/film/abcdefghijklmnopqrst1234567890
/film/abcdefghijklmnopqrst123
/film/abcdefghijklmnopqrst
/film/abcdefghijklmnop
/film/abcdefghijklmn1234567890
/film/abcdefghijklmn1234
/film/abcdefghijklmn
/film/abcdefghijklm
/film/abcdefghij1234567890
/film/abcdefghij123
/film/abcdefghij
/film/abcdefg
/film/abcd1234567890
/film/abcd123
/film/abcd
As you said :
the search should look for duplicate strings just like 10 or 20 characters after / film /
Then :
- A) Do you consider that the 4 lines below, all containing the string abcdefghijklmnopqrst (20 chars), are duplicates ?
/film/abcdefghijklmnopqrstuvwxyz
/film/abcdefghijklmnopqrst1234567890
/film/abcdefghijklmnopqrst123
/film/abcdefghijklmnopqrst
- B) Do you consider that the 4 lines below, all containing the string abcdefghijklmn (14 chars), are duplicates ?
/film/abcdefghijklmnop
/film/abcdefghijklmn1234567890
/film/abcdefghijklmn1234
/film/abcdefghijklmn
- C) Do you consider that the 4 lines below, all containing the string abcdefghij (10 chars), are duplicates ?
/film/abcdefghijklm
/film/abcdefghij1234567890
/film/abcdefghij123
/film/abcdefghij
- D) Finally, do you consider that the 4 lines below, all containing the string abcd (4 chars), are duplicates or not ( because the identical part is smaller than 10 chars ) ?
/film/abcdefg
/film/abcd1234567890
/film/abcd123
/film/abcd
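To make these four cases concrete, here is a minimal Python sketch (Python is only an illustration choice, it is not part of the Notepad++ solution) that groups lines by the first n characters after /film/. The function name group_by_prefix is hypothetical, and the rule that lines with fewer than n characters after /film/ are skipped is an assumption, since that case is exactly what question D asks about.

```python
from collections import defaultdict

def group_by_prefix(lines, n, marker="/film/"):
    """Group lines by the first n characters that follow the marker.

    Assumption: lines without the marker, or with fewer than n
    characters after it, are skipped rather than compared.
    """
    groups = defaultdict(list)
    for line in lines:
        pos = line.find(marker)
        if pos == -1:
            continue
        tail = line[pos + len(marker):]
        if len(tail) >= n:
            groups[tail[:n]].append(line)
    # Keep only the keys shared by at least two lines
    return {key: found for key, found in groups.items() if len(found) > 1}

sample = [
    "/film/abcdefghijklm",
    "/film/abcdefghij1234567890",
    "/film/abcdefghij123",
    "/film/abcdefghij",
    "/film/abcd123",
    "/film/abcd",
]

duplicates = group_by_prefix(sample, 10)
```

With n = 10, the four lines of case C form one group under the key abcdefghij, and the two shorter abcd lines are left out. With n = 4, all six lines fall into a single group.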
See you later,
Best Regards,
guy038
P. S. :
Note that sorting alone is not sufficient to classify the data according to the longest identical leading part. For instance, given the input text :
/film/zyxdefghij123
/film/abcdefghijklmnopqrstuvwxyz
/film/zyxdefghijklm
/film/abcdefghijklmnopqrst1234567890
/film/zyxdefghij1234567890
/film/abcdefghijklmnopqrst123
/film/zyxdefghij
/film/abcdefghijklmnopqrst
After sorting, we get :
/film/zyxdefghijklm
/film/zyxdefghij1234567890
/film/zyxdefghij123
/film/zyxdefghij
/film/abcdefghijklmnopqrstuvwxyz
/film/abcdefghijklmnopqrst1234567890
/film/abcdefghijklmnopqrst123
/film/abcdefghijklmnopqrst
As you can see, the first four lines have an identical part of 10 characters ( zyxdefghij ). That is shorter than for the last four lines, which have an identical part of 20 characters ( abcdefghijklmnopqrst ) !
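The point can be checked with a short Python sketch that sorts the lines in descending order and measures the identical leading part of each adjacent pair. The function name shared_prefix_lengths is hypothetical, and the sketch assumes that every line contains /film/.

```python
import os

def shared_prefix_lengths(lines, marker="/film/"):
    """Return the common-prefix length of each adjacent pair after sorting.

    Assumption: every line contains the marker; only the text after
    the marker is compared.
    """
    tails = sorted((line.split(marker, 1)[1] for line in lines), reverse=True)
    lengths = []
    for a, b in zip(tails, tails[1:]):
        # os.path.commonprefix compares character by character
        lengths.append(len(os.path.commonprefix([a, b])))
    return lengths

sample_lines = [
    "/film/zyxdefghij123",
    "/film/abcdefghijklmnopqrstuvwxyz",
    "/film/zyxdefghijklm",
    "/film/abcdefghijklmnopqrst1234567890",
    "/film/zyxdefghij1234567890",
    "/film/abcdefghijklmnopqrst123",
    "/film/zyxdefghij",
    "/film/abcdefghijklmnopqrst",
]

lengths = shared_prefix_lengths(sample_lines)
```

For this sample the lengths are 10, 13, 10, 0, 20, 23, 20. The 0 marks the boundary between the two groups, and the smallest value inside each run ( 10 for the zyx lines, 20 for the abc lines ) is the part shared by the whole group, which the sorted order by itself does not reveal.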
-
I want to use a regular expression in Notepad++; I can’t search the whole list manually.
-
@guy038 gave you FOUR questions to answer, conveniently labeled A, B, C and D.
Please answer them if you choose to continue this thread.
-
I want to check at least 10 characters after /film/.
The answer for A and B is no; for C and D it is yes. If it helps, here is a Linux bash command that I can find the duplicate names with:
cat long-list | grep -Eo '/film/.{15}' | sort | uniq --repeated
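For readers without a Linux shell, a rough Python equivalent of that pipeline might look like the sketch below. The function name repeated_keys is hypothetical, and the ordering follows Python's default string comparison, which may differ from what sort produces under some locales.

```python
import re
from collections import Counter

def repeated_keys(lines, n=15):
    """Rough equivalent of: grep -Eo '/film/.{n}' | sort | uniq --repeated"""
    pattern = re.compile(r"/film/.{%d}" % n)
    counts = Counter()
    for line in lines:
        # grep -o prints every match found on a line, so count them all
        counts.update(pattern.findall(line))
    # uniq --repeated prints each key that occurs more than once, one time
    return sorted(key for key, count in counts.items() if count > 1)

urls = [
    "http://dl3/film/Daylights.End.3*******************",
    "http://dl45/film/The.Wild.Life.720***************",
    "http://dl4/film/Sultan.2016.720*******************",
    "http://dl3.pw/film/The.Wild.Life.2*******************",
    "http://dl45/film/Daylights.End.2*******************",
]

found_10 = repeated_keys(urls, 10)
found_15 = repeated_keys(urls, 15)
```

On these five sample lines, a width of 10 reports /film/Daylights. and /film/The.Wild.L, while the width of 15 used in the command reports nothing, because the titles differ at exactly the 15th character. To run it on a real file, the open file object can be passed in place of the list.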
-
Hi, @kaveh-202 @alan-kilborn and All,
Let’s suppose, to begin with, that we focus on the first 10 chars after the string /film/.
Then, from your example :
http://dl3/film/Daylights.End.3*******************
http://dl4/film/The.Phenom.2016*******************
http://dl45/film/The.Wild.Life.720***************
http://dl58/film/Pele.Birth.Of.A*******************
http://dl4/film/Sultan.2016.720*******************
http://dl3pw/film/The.Guvnor.2016*******************
http://dl3.pw/film/The.Wild.Life.2*******************
http://dl3.f/film/An.Almost.Perfe*******************
http://dl3.ftk.pw/film/Scooby.Doo.And.*******************
http://d2/film/A.Conspiracy.Of*******************
http://dl45/film/Daylights.End.2*******************
With the simple regex S/R, below :
SEARCH (?-s)^.+/film/(.{10})
REPLACE \1\t$0
We get the text :
Daylights.	http://dl3/film/Daylights.End.3*******************
The.Phenom	http://dl4/film/The.Phenom.2016*******************
The.Wild.L	http://dl45/film/The.Wild.Life.720***************
Pele.Birth	http://dl58/film/Pele.Birth.Of.A*******************
Sultan.201	http://dl4/film/Sultan.2016.720*******************
The.Guvnor	http://dl3pw/film/The.Guvnor.2016*******************
The.Wild.L	http://dl3.pw/film/The.Wild.Life.2*******************
An.Almost.	http://dl3.f/film/An.Almost.Perfe*******************
Scooby.Doo	http://dl3.ftk.pw/film/Scooby.Doo.And.*******************
A.Conspira	http://d2/film/A.Conspiracy.Of*******************
Daylights.	http://dl45/film/Daylights.End.2*******************
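The same search and replace can be reproduced outside Notepad++. Here is a minimal Python sketch of it; the function name prefix_key is hypothetical, and Python's re module stands in for the Boost regex engine that Notepad++ uses, so only this simple pattern is assumed to behave identically.

```python
import re

def prefix_key(text, n=10):
    """Prepend the first n characters after /film/, plus a tab, to each line."""
    # re.MULTILINE makes ^ match at every line start; '.' never crosses
    # a line break here, which mirrors the (?-s) modifier.
    pattern = re.compile(r"^.+/film/(.{%d})" % n, flags=re.MULTILINE)
    # \g<0> is the whole match, like $0 in the Notepad++ replacement
    return pattern.sub(r"\1\t\g<0>", text)

text = (
    "http://dl3/film/Daylights.End.3*******************\n"
    "http://dl4/film/Sultan.2016.720*******************"
)
keyed = prefix_key(text)
```

Each line of keyed now starts with its 10-character key and a tab, so the lines can be sorted or compared on that first column.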
Then, it’s obvious that the first and last lines are duplicates ( Daylights. ), as well as lines 3 and 7 ( The.Wild.L ).
Now, what do you want to do regarding lines 1 and 11 and lines 3 and 7 ?
Presently, it’s quite easy to build a regex which would delete all duplicate lines, keeping only the last one found !
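As a point of comparison only, and not the regex itself, the "keep only the last one found" behaviour can be sketched in Python. The function name keep_last is hypothetical, and lines without /film/ or with fewer than n characters after it are assumed to be kept untouched.

```python
def keep_last(lines, n=10, marker="/film/"):
    """Drop every line whose key (n chars after the marker) reappears later."""
    def key_of(line):
        pos = line.find(marker)
        if pos == -1:
            return None
        tail = line[pos + len(marker):]
        return tail[:n] if len(tail) >= n else None

    # Remember the index of the last line seen for each key
    last_seen = {}
    for index, line in enumerate(lines):
        key = key_of(line)
        if key is not None:
            last_seen[key] = index

    # Keep lines without a key, and the last line of each key
    return [
        line
        for index, line in enumerate(lines)
        if key_of(line) is None or last_seen[key_of(line)] == index
    ]

lines = [
    "http://dl3/film/Daylights.End.3*******************",
    "http://dl45/film/The.Wild.Life.720***************",
    "http://dl4/film/Sultan.2016.720*******************",
    "http://dl3.pw/film/The.Wild.Life.2*******************",
    "http://dl45/film/Daylights.End.2*******************",
]
kept = keep_last(lines)
```

This approach needs no sort, so the remaining lines keep their initial order, and it works however many duplicates a line has.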
Two other questions :
- Do you mind if a sort process is used, which, of course, would alter the initial order of lines ?
- How many duplicates may a line have ? Only 1 duplicate or more ?
BR
guy038