possible to delete almost duplicate lines?

tenchyUK

I use the line operations to delete duplicate lines in a comma-delimited text file. But I am left with a lot of these almost-duplicate lines, where I only want to keep the longest line.
Is this possible easily enough?
The shorter lines have a double comma at the end, in case that's not immediately visible. The longer lines (usually) have 2 characters between those commas.
example:
I just want to keep the 2nd line
G7ODA,IO93WS,
G7ODA,IO93WS,PE,

PeterJones

@tenchyUK,

Does order of the lines matter in the final results?
Can there ever be 3 or more lines that you want to compress into one (ie, could there ever be three or more of the G7ODA lines, or will it always only be a single short and a single long?)

Assuming order doesn’t matter, assuming never more than a pair of almost-duplicate lines:

P01AZ,IO55WS,XY,
P01AZ,IO55WS,,
G7ODA,IO93WS,
G7ODA,IO93WS,PE,

Edit > Line Operations > Sort Lines Lexicographically Ascending
Search > Replace
FIND WHAT = ^(.*?,.*?,),*\R\1
REPLACE WITH = $1
SEARCH MODE = regular expression
REPLACE ALL

End Result:

G7ODA,IO93WS,PE,
P01AZ,IO55WS,XY,
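
For anyone who wants to try the same idea outside Notepad++, here is a minimal Python sketch of the sort-then-replace steps above. It is an approximation, not what Notepad++ does internally: the function name `drop_short_duplicates` is made up for illustration, Python's `sorted` is assumed to order these ASCII lines the same way as the lexicographic sort, and `\r?\n` stands in for `\R`, which Python's `re` module does not support.

```python
import re

def drop_short_duplicates(text):
    """Sort the lines, then merge each short line into the longer line after it."""
    # Stand-in for "Sort Lines Lexicographically Ascending" (assumes plain ASCII data).
    sorted_text = "\n".join(sorted(text.splitlines()))
    # Same shape as the FIND WHAT regex; re.MULTILINE makes ^ match at every line start.
    pattern = re.compile(r"^(.*?,.*?,),*\r?\n\1", re.MULTILINE)
    # Equivalent of REPLACE WITH = $1 (Python spells the backreference \1).
    return pattern.sub(r"\1", sorted_text)

sample = "P01AZ,IO55WS,XY,\nP01AZ,IO55WS,,\nG7ODA,IO93WS,\nG7ODA,IO93WS,PE,"
print(drop_short_duplicates(sample))
```

Like the regex steps, this assumes there is never more than one short line per long line.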

If one or both of my assumptions are wrong, provide enough example data to counter my assumptions (use the </> button on the toolbar and put the text between the ``` lines it creates), showing both the original data, and how you want it to look at the end…

(It’s possible to restore the order, by adding/removing numbers in extra steps… but that gets complicated, and I didn’t want to overwhelm you if the final order of data doesn’t matter. Similarly, the FIND WHAT regex can be made more complex to handle removing one-or-more short lines, but if your data is as simple as my example, then this should be sufficient.)
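
If the original order does matter, or there can be more than one short line per long line, a small script may be easier than a more complex regex. The following is a minimal Python sketch of a different technique (a dictionary keyed on the first two fields, not the regex above). The function name `keep_longest_per_key` is hypothetical, and it assumes the first two comma-separated fields identify a record.

```python
def keep_longest_per_key(lines):
    """Keep one line per (field 1, field 2) pair, preferring the longest line."""
    best = {}  # dicts preserve insertion order, so first-seen order is kept
    for line in lines:
        key = tuple(line.split(",")[:2])
        # Compare lengths without trailing commas, so "X,Y,," does not beat "X,Y,".
        if key not in best or len(line.rstrip(",")) > len(best[key].rstrip(",")):
            best[key] = line
    return list(best.values())

lines = [
    "P01AZ,IO55WS,XY,",
    "P01AZ,IO55WS,,",
    "G7ODA,IO93WS,",
    "G7ODA,IO93WS,",
    "G7ODA,IO93WS,PE,",
]
print("\n".join(keep_longest_per_key(lines)))
```

Each record stays at the position where its key first appeared, so no re-sorting or line numbering is needed.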

tenchyUK

@PeterJones

Hi Peter,
No, there are only ever the 2 forms of the lines. I usually apply a lex sort, then remove duplicate lines.
So I would end up with:

G7ODA,IO93WS,
G7ODA,IO93WS,PE,
P01AZ,IO55WS,
P01AZ,IO55WS,XY,

I can sort again afterwards, as that takes a split second.

Thanks for the suggestion, I shall try that.

tenchyUK

@PeterJones

Hi Peter,

Just created a sample test file and ran this and it works perfectly, many thanks!
I tried to break down how that works but have given up LOL.

Now saved that as my first macro, thanks!

PeterJones

@tenchyUK said:

I tried to break down how that works but have given up LOL.

Given:

^(.*?,.*?,),*\R\1

^ = start match at beginning of line
(...) = put what is found inside the parentheses in capture group#1
.*?, = find 0 or more of any character, non-greedy, until it hits a comma (non-greedy means it won’t try to match multiple commas)
since there’s two of that set inside the parentheses, it will match everything through the second comma, and put it all in group#1
,* = match 0 or more commas – so if your line ends with just a single comma, that will be part of group#1, but if it ends with two or more, the extra commas will be thrown away
\R = match a newline (whether CR, LF, or CRLF)
\1 = match exactly the same thing that was matched in group#1 – this is what checks for the “duplicate” up through the second comma of a line

And the REPLACE WITH being $1 means the replacement will just be the contents of group#1. Since the earlier lexicographical sort put the lines in alphabetical order, with longer lines coming after shorter ones, the match only covers the short line, the newline, and the start of the second line. Whatever extra text the second line has after that is left in place, untouched by the regular expression.
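
To see the pieces in action, here is a small Python sketch that spells out the same pattern with one comment per part and prints what group#1 and the whole match contain. It is only an illustration: Python's `re` module is used in place of Notepad++'s regex engine, and `\r?\n` is used as a stand-in for `\R`, which Python does not support.

```python
import re

# re.VERBOSE lets the pattern be split over lines with comments.
pattern = re.compile(
    r"""
    ^          # start of a line
    (          # begin capture group 1
      .*?,     #   shortest run of characters up to the first comma
      .*?,     #   shortest run of characters up to the second comma
    )          # end capture group 1
    ,*         # zero or more extra commas at the end of the short line
    \r?\n      # the line break
    \1         # the same text as group 1, at the start of the next line
    """,
    re.MULTILINE | re.VERBOSE,
)

pair = "G7ODA,IO93WS,\nG7ODA,IO93WS,PE,"
match = pattern.search(pair)
print(repr(match.group(1)))      # text captured up through the second comma
print(repr(match.group(0)))      # everything the regex matched and will replace
print(pattern.sub(r"\1", pair))  # what is left after replacing with group 1
```

The whole match stops right after the repeated prefix on the second line, which is why the trailing `PE,` survives the replacement.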

tenchyUK

@PeterJones

Thanks again. I think I’d need to do that myself to learn it fully, bit like you can’t learn to drive from a book…

I wish I’d asked about this years ago! I periodically create these files, which may start with 2000 or 3000 lines and end up with around 1500 lines after exact dupes are removed.
I then compare it, using Ctrl+Alt+C with the Compare plugin, to the master file, which is about 16K lines.
I then manually add completely new lines if found to the master file.
And for any lines that have the 2 letters in the new file but not in the master file, I add those two letters to the master.
Having these lines in the new file:
G7ODA,IO93WS,
G7ODA,IO93WS,PE,

Does tend to confuse the compare plugin so this will make life easier for me!

thanks again