remove duplicate urls
-
Hello, can someone help with this please?
input:
http://www.abc.com/123
http://www.abc.com/456
http://www.def.com/223
http://www.def.com/556
http://www.def.com/602
http://www.ghi.com/700
http://www.ghi.com/731
http://www.qwe.com/667
http://www.qwe.com/667
http://www.qwe.com/667

Output:
http://www.abc.com/123
http://www.def.com/223
http://www.ghi.com/700
http://www.qwe.com/667

I found this, but it doesn’t work in Notepad++:
^(http://[^/]+/)(.*$\n?)((\1)(?2))+
replace with $1$2
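In case a quick cross-check outside Notepad++ helps, the wanted behaviour (keep the first URL for each host, drop the rest) can be sketched in plain Python; the function name here is just for illustration:

```python
from urllib.parse import urlsplit

def first_per_host(lines):
    """Keep only the first URL seen for each host, in original order."""
    seen = set()
    out = []
    for line in lines:
        host = urlsplit(line).netloc  # e.g. "www.abc.com"
        if host not in seen:
            seen.add(host)
            out.append(line)
    return out
```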
-
@guy038 can you help please sir ? <3
-
@El-FAROUZ said in remove duplicate urls:
Hello, can someone help with this please?
If it were me I would do the following:
- Insert line numbers and order the lines descending (backwards)
- Use a regex to remove the current line if the next line contains the same address
- Re-order in line ascending order and then remove the line numbers.
So:
- Have the cursor in the very first position of the file. Use the Column Editor to first insert a , (comma), then insert a number starting with 1, increasing by 1 and with “leading zero” ticked. Then use the Line Operations function to order lines in Integer Descending.
- Using the Replace function we have:
Find What: (?-s)^\d+,http://([^/]+)/.+\R(?=[^/]+?//\1)
Replace With: leave this field empty so it erases the line.
As this is a regex, the “Search Mode” must be “Regular expression”. Click on the “Replace All” button.
- Re-order the lines as Integer Ascending. Then use the Replace function again with:
Find What: ^\d+,
Replace With: leave this field empty so it removes the line numbers and comma.
At this point you should have your required results.
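If a sanity check is wanted, the reverse/remove/restore idea above can be mirrored in plain Python (a sketch, not part of the Notepad++ workflow; the function name is made up):

```python
from urllib.parse import urlsplit

def dedupe_keep_first(lines):
    """Reverse the list, drop a line when the next line has the same
    host (the regex step), then reverse back -- keeping, per host,
    the first line of the original order."""
    rev = lines[::-1]                      # order Integer Descending
    kept = []
    for i, line in enumerate(rev):
        nxt = rev[i + 1] if i + 1 < len(rev) else None
        # erase the current line if the next line shares its host
        if nxt and urlsplit(line).netloc == urlsplit(nxt).netloc:
            continue
        kept.append(line)
    return kept[::-1]                      # re-order Integer Ascending
```

Like the lookahead regex, this only compares adjacent lines, so same-host URLs must already sit together, as they do in the sample input.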
Terry
-
@Terry-R said in remove duplicate urls:
Step 1 might be a bit unclear for the novice user, because it packs a lot in. Terry, if you’ll allow, I’d specify it like this:
1a. Have the cursor in the very first position of the file. Use the Column editor to insert a ,(comma) via Text to Insert; the caret will remain in the very first position of the file after the insertion.
1b. Use the Column editor’s Number to Insert option to insert a number starting with 1, increasing by 1 and with “leading zero” ticked to add incrementing numbers to the start of every line. Then use the Line Operation function to order lines in Integer Descending.
Overall, a nice solution!
-
Hello @el-farouz, @terry-r, @alan-kilborn and All,
Terry, I don’t see the necessity of inserting line numbers !?
For instance, given the @el-farouz’s list, not sorted at all, as below :
http://www.def.com/602
http://www.abc.com/123
http://www.qwe.com/667
http://www.ghi.com/700
http://www.def.com/556
http://www.abc.com/456
http://www.ghi.com/731
http://www.qwe.com/667
http://www.qwe.com/667
http://www.def.com/223
We select this block of addresses and perform an ascending sort (Edit > Line Operations > Sort Lines Lexicographically Ascending):

http://www.abc.com/123
http://www.abc.com/456
http://www.def.com/223
http://www.def.com/556
http://www.def.com/602
http://www.ghi.com/700
http://www.ghi.com/731
http://www.qwe.com/667
http://www.qwe.com/667
http://www.qwe.com/667
And, with the following regex S/R :
SEARCH
^(http://(.+?)/.+\R)(?:http://\2.+\R)+
REPLACE
\1
We directly get our expected list :
http://www.abc.com/123
http://www.def.com/223
http://www.ghi.com/700
http://www.qwe.com/667
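For anyone wanting to replay this sort-then-collapse S/R outside Notepad++, here is a Python sketch (an assumption on my part: Python’s re stands in for the Boost engine, with \n in place of \R and re.M giving ^ its per-line meaning):

```python
import re

def sort_and_collapse(lines):
    """Sort, then collapse each run of same-host lines to its first line."""
    text = "\n".join(sorted(lines)) + "\n"
    deduped = re.sub(r"^(http://(.+?)/.+\n)(?:http://\2.+\n)+",
                     r"\1", text, flags=re.M)
    return deduped.splitlines()
```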
Am I missing something obvious ?
Best Regards,
guy038
-
Perhaps Terry is just trying to cover the more general case, where the lines are not in any kind of pre-sorted order, and one wants to keep the original order while removing the duplicate URLs.
-
@guy038 said in remove duplicate urls:
Am I missing something obvious ?
I made no assumptions about the list; I just wanted to keep the order that did exist, in reverse. The OP had upvoted my solution, suggesting it worked for them.
Terry