    possible to delete almost duplicate lines?

    • tenchyUK

      I use the line operations to delete duplicate lines in a comma-delimited text file, but I am left with a lot of almost-duplicate lines, where I want to keep only the longest line.
      Is this possible easily enough?
      The shorter lines have a double comma at the end, in case that is not immediately visible. The longer lines (usually) have 2 characters between those commas.
      Example (I just want to keep the 2nd line):
      G7ODA,IO93WS,
      G7ODA,IO93WS,PE,

      • PeterJones @tenchyUK

        @tenchyUK,

        Does order of the lines matter in the final results?
        Can there ever be 3 or more lines that you want to compress into one (i.e., could there ever be three or more of the G7ODA lines, or will it always be a single short line and a single long line)?

        Assuming order doesn’t matter, and assuming there is never more than a pair of almost-duplicate lines, given this data:

        P01AZ,IO55WS,XY,
        P01AZ,IO55WS,,
        G7ODA,IO93WS,
        G7ODA,IO93WS,PE,
        
        1. Edit > Line Operations > Sort Lines Lexicographically Ascending
        2. Search > Replace
          FIND WHAT = ^(.*?,.*?,),*\R\1
          REPLACE WITH = $1
          SEARCH MODE = regular expression
          REPLACE ALL

        End Result:

        G7ODA,IO93WS,PE,
        P01AZ,IO55WS,XY,
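For anyone who wants to try the same idea outside Notepad++, here is a minimal Python sketch of the sort-then-replace steps above. It rests on a few assumptions: Python's `re` module has no `\R`, so `(?:\r\n|\n|\r)` stands in for it; Python's `sorted` is used as an approximation of the lexicographic sort; and `collapse_pairs` is just a name made up for this illustration.

```python
import re

# Same pattern as FIND WHAT, with \R spelled out because Python's re lacks it.
# re.MULTILINE makes ^ match at the start of every line, not just the text.
PATTERN = re.compile(r"^(.*?,.*?,),*(?:\r\n|\n|\r)\1", re.MULTILINE)


def collapse_pairs(text: str) -> str:
    """Sort the lines, then merge each short/long pair into the long line."""
    lines = sorted(line for line in text.splitlines() if line)
    # \1 in the replacement plays the role of $1 in REPLACE WITH.
    return PATTERN.sub(r"\1", "\n".join(lines))


sample = "P01AZ,IO55WS,XY,\nP01AZ,IO55WS,,\nG7ODA,IO93WS,\nG7ODA,IO93WS,PE,\n"
result = collapse_pairs(sample)
print(result)
```

As in the editor, this only works because the sort puts each short line directly before its longer partner.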
        

        If one or both of my assumptions are wrong, provide enough example data to counter my assumptions (use the </> button on the toolbar and put the text between the ``` lines it creates), showing both the original data, and how you want it to look at the end…

        (It’s possible to restore the order, by adding/removing numbers in extra steps… but that gets complicated, and I didn’t want to overwhelm you if the final order of data doesn’t matter. Similarly, the FIND WHAT regex can be made more complex to handle removing one-or-more short lines, but if your data is as simple as my example, then this should be sufficient.)
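If the original order does matter, or there can be more than two near-duplicates, one alternative outside Notepad++ is a small script. The sketch below is not the regex approach; it is a dictionary-based variant that assumes the first two comma-separated fields identify a record and that the "longest" line is the one with the most text once trailing commas are ignored. The name `keep_longest` is hypothetical.

```python
def keep_longest(lines):
    """Keep one line per (field 1, field 2) key, in first-seen order."""
    best = {}  # dicts preserve insertion order in Python 3.7+
    for line in lines:
        key = tuple(line.split(",")[:2])
        # Compare without trailing commas so "X,Y,," does not beat "X,Y,Z,".
        if key not in best or len(line.rstrip(",")) > len(best[key].rstrip(",")):
            best[key] = line  # updating a key keeps its original position
    return list(best.values())


sample_lines = [
    "P01AZ,IO55WS,XY,",
    "P01AZ,IO55WS,,",
    "G7ODA,IO93WS,",
    "G7ODA,IO93WS,PE,",
    "G7ODA,IO93WS,",
]
kept = keep_longest(sample_lines)
print("\n".join(kept))
```

This trades the single search-and-replace for a pass over the file, but it needs no sorting and tolerates any number of short copies.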

        • tenchyUK @PeterJones

          @PeterJones

          Hi Peter,
          No, there are only ever the 2 forms of the lines. I usually apply a lex sort, then remove duplicate lines.
          So I would end up with:

          G7ODA,IO93WS,
          G7ODA,IO93WS,PE,
          P01AZ,IO55WS,
          P01AZ,IO55WS,XY,

          I can sort again afterwards, as that takes a split second.

          Thanks for the suggestion, I shall try that.

          • tenchyUK @PeterJones

            @PeterJones

            Hi Peter,

            Just created a sample test file and ran this and it works perfectly, many thanks!
            I tried to break down how that works but have given up LOL.

            Now saved that as my first macro, thanks!

            • PeterJones @tenchyUK

              @tenchyUK said:

              I tried to break down how that works but have given up LOL.

              Given:

              ^(.*?,.*?,),*\R\1
              
              • ^ = start match at beginning of line
              • (...) = put what is found inside the parentheses in capture group#1
              • .*?, = find 0 or more of any character, non-greedy, until it hits a comma (non-greedy means it won’t try to match multiple commas)
              • since there’s two of that set inside the parentheses, it will match everything through the second comma, and put it all in group#1
              • ,* = match 0 or more commas – so if your line ends with just a single comma, that will be part of group#1, but if it ends with two or more, the extra commas will be thrown away
              • \R = match a newline (whether CR, LF, or CRLF)
              • \1 = match exactly the same thing that was matched in group#1 – this is what checks for the “duplicate” up through the second comma of a line

              And the REPLACE WITH being $1 means the replacement will just be the contents of group#1. Since the earlier lexicographical sort put the lines in alphabetical order, with longer lines coming after shorter ones, the match stops partway through the second line: whatever follows the shared prefix on that line is untouched by the regular expression, so it ends up right after the contents of group#1.
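To see those pieces on a single pair, here is a small Python sketch. It assumes Python's `re` module, where `\R` has to be written as `(?:\r\n|\n|\r)`; the variable names are only for illustration.

```python
import re

pair = "P01AZ,IO55WS,,\nP01AZ,IO55WS,XY,"

m = re.search(r"^(.*?,.*?,),*(?:\r\n|\n|\r)\1", pair, re.MULTILINE)

whole = m.group(0)         # everything consumed: short line, newline, shared prefix
prefix = m.group(1)        # group#1: everything through the second comma
leftover = pair[m.end():]  # the part of the long line the regex never touched

# Replacing the whole match with group#1 leaves prefix + leftover.
rebuilt = prefix + leftover
print(repr(whole))
print(repr(prefix))
print(repr(leftover))
print(rebuilt)
```

The extra comma on the short line is matched by `,*` outside the parentheses, which is why it is in `whole` but not in `prefix`.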

              • tenchyUK @PeterJones

                @PeterJones

                Thanks again. I think I’d need to do that myself to learn it fully, bit like you can’t learn to drive from a book…

                I wish I’d asked about this years ago! I periodically create these files, which may start with 2,000 or 3,000 lines and end up with around 1,500 lines after exact dupes are removed.
                I then use CTRL ALT C with the compare plugin to compare against the master file, which is about 16K lines.
                I then manually add any completely new lines to the master file.
                And for any lines that have the 2 letters in the new file but not in the master file, I add those two letters to the master.
                Having these lines in the new file:
                G7ODA,IO93WS,
                G7ODA,IO93WS,PE,

                Does tend to confuse the compare plugin so this will make life easier for me!

                thanks again

                The Community of users of the Notepad++ text editor.