DELETE DUPLICATE URLS WITH SAME DOMAIN



  • https://www.bizjournals.com/houston/news/2021/07/27/buc-ees-travel-center-calhoun-georgia-opening-date.html
    https://www.bizjournals.com/orlando/news/2020/10/19/3-ways-to-use-twitter-fleets-for-business.html
    I have URLs like these,
    but I need to keep only one per domain. Is there any plugin or method to delete duplicate lines that point to the same domain?

    Thanks in advance for any help!



  • @Varun-Teja ,

    If it doesn’t matter which one you keep (that is, if it’s okay to keep only the last instance of a specific domain), then I would suggest doing it this way:

    • FIND = ((?-s)^.*?(https?://[^/]*/).*?$(\R|\Z))(?=(?s).*\2)
    • REPLACE = leave empty
    • SEARCH MODE = regular expression
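    For anyone who would rather script it, here is a minimal Python sketch of the same idea, not a replacement for the Notepad++ steps. It assumes each line holds at most one URL and reuses the scheme-plus-host pattern from group 2 of the FIND expression. The function name `keep_last_per_domain` is just an illustrative choice.

```python
import re

# Same pattern as group 2 of the FIND expression: scheme, host, first slash.
DOMAIN = re.compile(r"https?://[^/]*/")


def keep_last_per_domain(lines):
    """Drop a line when a later line starts its URL with the same scheme+host."""
    seen = set()
    kept = []
    # Walk from the bottom up so the last instance of each domain is seen first.
    for line in reversed(lines):
        match = DOMAIN.search(line)
        if match:
            if match.group(0) in seen:
                continue  # a later line already covers this domain
            seen.add(match.group(0))
        kept.append(line)  # lines without a URL are always kept
    kept.reverse()  # restore top-to-bottom order
    return kept


urls = [
    "https://www.bizjournals.com/houston/news/2021/07/27/buc-ees-travel-center-calhoun-georgia-opening-date.html",
    "https://www.bizjournals.com/orlando/news/2020/10/19/3-ways-to-use-twitter-fleets-for-business.html",
]
print("\n".join(keep_last_per_domain(urls)))
```

    With the two URLs from the question, only the second (orlando) line is left, because it is the last instance of that domain. As in the regex, http and https versions of the same host count as different domains.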

    If it doesn’t matter what order they are in, then you could sort first (Edit > Line Operations > Sort Lines Lexicographically Ascending) and then use that replacement. (edit: Though sorting first is pointless: if the order doesn’t matter, then running the replacement alone already gives an acceptable result.)

    If it does matter what order they are in, then you could use column-select (alt+click+drag) to select the zeroth column in the file, then use Edit > Column Editor > Number to Insert to insert numbers:
    (You might want to do a second column select and then also insert a space between the numbers and the lines by selecting the zero-width column after the numbers and then typing a space)

    1 https://www.fourthdomain.example/misc
    2 https://www.fifthdomain.example/elsewhat
    3 https://www.seconddomain.example/elsewhat
    4 https://www.firstdomain.example/blah
    5 https://www.seconddomain.example/blah
    6 https://www.fourthdomain.example/blah
    7 https://www.thirddomain.example/blah
    8 https://www.firstdomain.example/elsewhat
    

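    Then, after that, sort descending (so in my example, it would be 8 down to 1). Then do the replacement I showed above; because it keeps the last instance in the current order, it keeps the first instance from the original order. Then sort ascending again. Finally, remove the leading numbers (another column select followed by cut or backspace, or a regular-expression search-and-replace with FIND = ^\d+\x20* and REPLACE left empty). If the file has more than nine lines, pad the numbers with leading zeros when inserting them (the Column Editor has an option for this), so that a lexicographic sort puts them in numeric order.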
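    The numbering, sorting, and replacing steps above end up keeping the first occurrence of each domain while preserving the original line order. As a rough cross-check, here is a small Python sketch that does the same thing directly. It assumes at most one URL per line, and `keep_first_per_domain` is a hypothetical helper name, not anything built into Notepad++.

```python
import re

# Scheme, host, and first slash: the same "domain" key the FIND expression captures.
DOMAIN = re.compile(r"https?://[^/]*/")


def keep_first_per_domain(lines):
    """Keep only the first line for each scheme+host, in the original order."""
    seen = set()
    kept = []
    for line in lines:
        match = DOMAIN.search(line)
        if match:
            if match.group(0) in seen:
                continue  # an earlier line already covers this domain
            seen.add(match.group(0))
        kept.append(line)  # lines without a URL are always kept
    return kept


example = [
    "https://www.fourthdomain.example/misc",
    "https://www.fifthdomain.example/elsewhat",
    "https://www.seconddomain.example/elsewhat",
    "https://www.firstdomain.example/blah",
    "https://www.seconddomain.example/blah",
    "https://www.fourthdomain.example/blah",
    "https://www.thirddomain.example/blah",
    "https://www.firstdomain.example/elsewhat",
]
print("\n".join(keep_first_per_domain(example)))
```

    On the eight example lines, this keeps what were lines 1, 2, 3, 4, and 7, which is what the sort-descending, replace, sort-ascending sequence should leave behind.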

