    remove duplicate urls

    Scheduled Pinned Locked Moved General Discussion
    7 Posts 4 Posters 1.7k Views
    • El FAROUZ
      last edited by

      Hello, can someone help with this please?

      input:

      http://www.abc.com/123
      http://www.abc.com/456
      http://www.def.com/223
      http://www.def.com/556
      http://www.def.com/602
      http://www.ghi.com/700
      http://www.ghi.com/731
      http://www.qwe.com/667
      http://www.qwe.com/667
      http://www.qwe.com/667

      Output:

      http://www.abc.com/123
      http://www.def.com/223
      http://www.ghi.com/700
      http://www.qwe.com/667

      I found this, but it doesn’t work in Notepad++:

      ^(http://[^/]+/)(.*$\n?)((\1)(?2))+

      replace with $1$2
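      The transformation being asked for — keep only the first URL of each consecutive same-domain block — can be sketched outside Notepad++ in Python. The pattern below is a simplified variant of the one quoted above, with the (?2) subroutine call replaced by a plain backreference-prefixed line match (a sketch for checking the logic, not a Notepad++ recipe):

```python
import re

# The example input from this thread, grouped by domain.
text = """http://www.abc.com/123
http://www.abc.com/456
http://www.def.com/223
http://www.def.com/556
http://www.def.com/602
http://www.ghi.com/700
http://www.ghi.com/731
http://www.qwe.com/667
http://www.qwe.com/667
http://www.qwe.com/667
"""

# Group 1: the "http://domain/" prefix; group 2: the rest of the first line.
# The (?:...)+ part swallows every following line that repeats the prefix.
pattern = re.compile(r"(?m)^(http://[^/]+/)(.*\n)(?:\1.*\n?)+")
result = pattern.sub(r"\1\2", text)
print(result)
```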

      • El FAROUZ
        last edited by

        @guy038 can you help please, sir? <3

        • Terry R
          last edited by

          @El-FAROUZ said in remove duplicate urls:

          Hello, can someone help with this please?

          If it were me I would do the following:

          1. Insert line numbers and order the lines descending (backwards)
          2. Use a regex to remove the current line if the next line contains the same address
          3. Re-order in line ascending order and then remove the line numbers.

          So:

          1. Have the cursor in the very first position of the file. Use the Column editor to first insert a comma (,), then insert numbers starting at 1, increasing by 1, with “leading zero” ticked. Then use the Line Operations function to order the lines in Integer Descending.
          2. Using the Replace function we have
            Find What: (?-s)^\d+,http://([^/]+)/.+\R(?=[^/]+?//\1)
            Replace With: empty field here, so it erases the line.
            As this is a regex, the “search mode” must be “Regular expression”. Click the “Replace All” button.
          3. Re-order the lines as Integer Ascending. Then use the Replace function again with:
            Find What: ^\d+,
            Replace With: empty field here, so it removes the line numbers and comma.

          At this point you should have your required results.
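          For anyone who wants to verify the three steps outside Notepad++, here is a rough Python sketch of the same sequence (number, reverse, delete via lookahead, re-sort, strip), using the example list from this thread; \R is written as \n for Python:

```python
import re

text = """http://www.abc.com/123
http://www.abc.com/456
http://www.def.com/223
http://www.def.com/556
http://www.def.com/602
http://www.ghi.com/700
http://www.ghi.com/731
http://www.qwe.com/667
http://www.qwe.com/667
http://www.qwe.com/667
"""

# Step 1: number the lines (zero-padded, like "leading zero" ticked in the
# Column editor) and reverse them (Integer Descending).
numbered = [f"{i:04d},{line}" for i, line in enumerate(text.splitlines(), 1)]
numbered.reverse()
joined = "\n".join(numbered) + "\n"

# Step 2: the regex from above. It deletes a numbered line whenever the
# lookahead sees the same domain on the next line (the lookahead itself
# consumes nothing, so chains of duplicates collapse to their first line).
joined = re.sub(r"(?m)^\d+,http://([^/]+)/.+\n(?=[^/]+?//\1)", "", joined)

# Step 3: restore ascending order (the zero-padding makes a plain string
# sort behave like Integer Ascending), then strip the "number," prefixes.
result = [re.sub(r"^\d+,", "", line) for line in sorted(joined.splitlines())]
```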

          Terry

          • Alan Kilborn @Terry R
            last edited by

            @Terry-R said in remove duplicate urls:

            Step 1 might be a bit unclear for the novice user, because it packs a lot in. Terry, if you’ll allow, I’d specify it like this:

            1a. Have the cursor in the very first position of the file. Use the Column editor to insert a comma (,) via Text to Insert; the caret will remain in the very first position of the file after the insertion.

            1b. Use the Column editor’s Number to Insert option to insert numbers starting at 1, increasing by 1, with “leading zero” ticked, adding incrementing numbers to the start of every line. Then use the Line Operations function to order the lines in Integer Descending.

            Overall, a nice solution!

            • guy038
              last edited by

              Hello @el-farouz, @terry-r, @alan-kilborn and All,

              Terry, I don’t see the necessity of inserting line numbers!?

              For instance, given @el-farouz’s list, not sorted at all, as below:

              http://www.def.com/602
              http://www.abc.com/123
              http://www.qwe.com/667
              http://www.ghi.com/700
              http://www.def.com/556
              http://www.abc.com/456
              http://www.ghi.com/731
              http://www.qwe.com/667
              http://www.qwe.com/667
              http://www.def.com/223
              

              We select this block of addresses and perform an ascending sort (Edit > Line Operations > Sort Lines Lexicographically Ascending):

              http://www.abc.com/123
              http://www.abc.com/456
              http://www.def.com/223
              http://www.def.com/556
              http://www.def.com/602
              http://www.ghi.com/700
              http://www.ghi.com/731
              http://www.qwe.com/667
              http://www.qwe.com/667
              http://www.qwe.com/667
              

              And, with the following regex S/R:

              SEARCH ^(http://(.+?)/.+\R)(?:http://\2.+\R)+

              REPLACE \1

              We directly get our expected list:

              http://www.abc.com/123
              http://www.def.com/223
              http://www.ghi.com/700
              http://www.qwe.com/667
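              This S/R can be checked outside Notepad++ with a short Python sketch (a loose equivalent only: \R becomes \n, and the (?m) flag makes ^ match at each line start, as it does by default in Notepad++’s regex mode):

```python
import re

# guy038's unsorted example list, sorted so same-domain lines are adjacent.
text = "\n".join(sorted([
    "http://www.def.com/602",
    "http://www.abc.com/123",
    "http://www.qwe.com/667",
    "http://www.ghi.com/700",
    "http://www.def.com/556",
    "http://www.abc.com/456",
    "http://www.ghi.com/731",
    "http://www.qwe.com/667",
    "http://www.qwe.com/667",
    "http://www.def.com/223",
])) + "\n"

# Group 1 keeps the whole first line of a domain block; group 2 holds the
# domain; the (?:...)+ part swallows every following same-domain line.
# (The backreference \2 matches the captured domain text literally.)
pattern = re.compile(r"(?m)^(http://(.+?)/.+\n)(?:http://\2.+\n)+")
deduped = pattern.sub(r"\1", text)
```

Note that a domain appearing only once never matches (the trailing + needs at least one duplicate line), so unique URLs pass through untouched.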
              

              Am I missing something obvious?

              Best Regards,

              guy038

              • Alan Kilborn @guy038
                last edited by

                @guy038

                Perhaps Terry is just trying to cover the more general case, where the lines are not in any kind of pre-sorted order, and one wants to keep the original order while removing the duplicate URLs.
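                For that general case — unsorted input, with the original order kept — a single regex can get awkward; outside Notepad++ a small script does it directly. A Python sketch over guy038’s unsorted example list:

```python
from urllib.parse import urlparse

urls = [
    "http://www.def.com/602",
    "http://www.abc.com/123",
    "http://www.qwe.com/667",
    "http://www.ghi.com/700",
    "http://www.def.com/556",
    "http://www.abc.com/456",
    "http://www.ghi.com/731",
    "http://www.qwe.com/667",
    "http://www.qwe.com/667",
    "http://www.def.com/223",
]

seen = set()
kept = []
for url in urls:
    domain = urlparse(url).netloc   # e.g. "www.def.com"
    if domain not in seen:          # first occurrence of this domain?
        seen.add(domain)
        kept.append(url)
# Original order is preserved; only the first URL per domain survives.
```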

                • Terry R @guy038
                  last edited by

                  @guy038 said in remove duplicate urls:

                  Am I missing something obvious?

                  I made no assumptions about the list; I just wanted to preserve whatever order already existed, by working through it in reverse. The OP had upvoted my solution, suggesting it worked for them.

                  Terry

                  The Community of users of the Notepad++ text editor.
                  Powered by NodeBB | Contributors