    how to remove all lines that contains duplicates.

    • Andrea Mark Shaw
      Andrea Mark Shaw last edited by PeterJones

      Hi guys, I have a text file with 8k lines, and for some reason 800 of them are duplicated, but only the first 10 characters (the phone number) are duplicated. For example, the line starting with 5415359225 appears 2 times, but the second line continues with a different name, so the lines are not identical as a whole. See the example:

      9144506800 ; Dear James A Brantley.
      9144506800 ; Dear James A.

      Somehow I need to delete the lines that repeat (duplicate) only the first 10 characters. Sorry for my English, hope you understand!
      I already tried this:

      SEARCH (?-s)((.{15}).*\R)(?:\2.*\R)+
      REPLACE \1

      Nothing happened, so I am doing something wrong… idk what. I guess I need a step-by-step guide… Thanks!

      –moderator added forum formatting tags to regex to unhide the * in the regex

      • PeterJones
        PeterJones @Andrea Mark Shaw last edited by

        @andrea-mark-shaw said in how to remove all lines that contains duplicates.:

        (?-s)((.{15}).*\R)(?:\2.*\R)+

        I’m not sure why you think it didn’t do anything. If I start with

        5415259225 ; Dear Steven W Haptonstall
        6154191258 ; Dear Someone
        9144506800 ; Dear James A Brantley.
        9144506800 ; Dear James A.
        

        then run that regex, I get

        5415259225 ; Dear Steven W Haptonstall
        6154191258 ; Dear Someone
        9144506800 ; Dear James A Brantley.
        

        But I’m also not sure why you say “first 10 characters only”, but then use {15} in your regex: 10 ≠ 15.
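
The result above can be reproduced outside the editor with a minimal Python sketch. Two things here are assumptions of the sketch, not part of the thread: Python's `re` module has no `\R`, so an explicit newline alternation (named `NL`) stands in for it, and `dedupe_adjacent` is a hypothetical helper name. Python's `.` does not match `\n` by default, which plays the role of `(?-s)`.

```python
import re

# Python's re module has no \R, so NL stands in for it (an assumption of this sketch).
NL = r"(?:\r\n|\n|\r)"


def dedupe_adjacent(text: str, key_len: int) -> str:
    """Keep the first line of each run of adjacent lines sharing the first key_len characters."""
    # Group 1 is the whole first line (with its line ending); group 2 is the key.
    pattern = re.compile(rf"((.{{{key_len}}}).*{NL})(?:\2.*{NL})+")
    return pattern.sub(r"\1", text)


sample = (
    "5415259225 ; Dear Steven W Haptonstall\n"
    "6154191258 ; Dear Someone\n"
    "9144506800 ; Dear James A Brantley.\n"
    "9144506800 ; Dear James A.\n"
)

# On this sample, a key length of 15 and a key length of 10 give the same result.
print(dedupe_adjacent(sample, 15), end="")
print(dedupe_adjacent(sample, 10), end="")
```

On this sample both key lengths remove the second 9144506800 line, because the two lines agree on more than their first 15 characters. On real data the two lengths can behave differently.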

        • Andrea Mark Shaw
          Andrea Mark Shaw last edited by

          Wow, that was a fast answer, thank you very much. I did with 10 but idk what should exactly to do… is that right?

          • PeterJones
            PeterJones @Andrea Mark Shaw last edited by PeterJones

            @andrea-mark-shaw said in how to remove all lines that contains duplicates.:

            I did with 10 but idk what should exactly to do… is that right?

            “right” by what definition? Given the example data that I showed, (?-s)((.{10}).*\R)(?:\2.*\R)+ will also remove the duplicate, and it means it’s really only matching on 10 characters (which is what you said you wanted), instead of matching on the 15 characters from your original regex. Whether that gets rid of all the “duplicates” that you want it to can only be known by you – and that’s the only meaningful definition of “right” in such tasks.

            (And note that either regex shown will have a problem if the duplicate is on the last two lines of the file and there is no newline after the last line, because then the final \R doesn’t match, and so the final duplicate isn’t removed. Just something to watch out for. If you want it to work even if the last line of the file doesn’t have a newline, then use (?-s)((.{10}).*\R)(?:\2.*(\R|\Z))+, so it can match newline or end-of-file.)
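
The end-of-file caveat can be checked with a short Python sketch. As assumptions of this sketch: `NL` again stands in for `\R` (which Python's `re` lacks), the names `strict` and `tolerant` are made up for illustration, and a non-capturing group is used where the regex above has `(\R|\Z)`.

```python
import re

NL = r"(?:\r\n|\n|\r)"  # stand-in for \R, which Python's re lacks

# Requires a line ending after every duplicate line.
strict = re.compile(rf"((.{{10}}).*{NL})(?:\2.*{NL})+")
# Also accepts end-of-text after the last duplicate line.
tolerant = re.compile(rf"((.{{10}}).*{NL})(?:\2.*(?:{NL}|\Z))+")

# Note: there is no newline after the final line.
text = (
    "6154191258 ; Dear Someone\n"
    "9144506800 ; Dear James A Brantley.\n"
    "9144506800 ; Dear James A."
)

strict_result = strict.sub(r"\1", text)
tolerant_result = tolerant.sub(r"\1", text)

print(strict_result == text)    # True: the trailing duplicate is left in place
print(repr(tolerant_result))    # the last line is gone
```

With the tolerant pattern the kept line retains its own line ending, so the text ends with a newline after the replacement.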

            • Neil Schipper
              Neil Schipper @Andrea Mark Shaw last edited by

              @andrea-mark-shaw

              Clarifications are needed here.

              True or False: the duplicates will always be adjacent (= consecutive = one immediately following the other). You have sent mixed signals on this.

              True or False: when a duplicate is identified, the first of the pair is always to be preserved (kept, retained), and the second of the pair is always to be removed (deleted, discarded). This is implied but should be clearly stated.
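
Both questions matter because the regexes in this thread only catch duplicates that sit on consecutive lines. If the duplicates turned out not to be adjacent, and the first occurrence is the one to keep, one possible approach outside the editor is sketched below in Python. The function name `dedupe_by_prefix` and the sample data ordering are hypothetical, chosen only for illustration.

```python
def dedupe_by_prefix(lines, key_len=10):
    """Keep the first line for each distinct key_len-character prefix, in original order."""
    seen = set()
    kept = []
    for line in lines:
        key = line[:key_len]  # lines shorter than key_len use the whole line as the key
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


# The two 9144506800 lines are deliberately not adjacent here.
lines = [
    "9144506800 ; Dear James A Brantley.",
    "6154191258 ; Dear Someone",
    "9144506800 ; Dear James A.",
]

for line in dedupe_by_prefix(lines):
    print(line)
```

Note that this treats every line shorter than the key length, including blank lines, by its full content, so repeated blank lines would also be collapsed to one.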
