Remove duplicates if only part of the string matches

Viktor Serdiuk

Thank you so much!

Viktor Serdiuk

The only thing is that if the right part of a row (the part after the colon) is missing, that row is not checked. With this input:
part of row #1:part of row #20
part of row #1:
part of row #3:part of row #41.
part of row #3:
part of row #3:part of row #43
part of row #5:part of row #60

I get the result :
part of row #1:part of row #20
part of row #1:
part of row #3:
part of row #3:part of row #41.
part of row #5:part of row #60

guy038

Hi, @viktor-serdiuk, @alan-kilborn and All,

Ah, yes…, we forgot this case ! No problem, just two modifications to allow the part after the colon to be empty :

SEARCH (?-s)^((.+?):.*\R)(?s:.*)^\2.*\R?

REPLACE ${1}

Thus, given your INPUT text :

part of row #1:part of row #20
part of row #1:
part of row #3:part of row #41.
part of row #3:
part of row #3:part of row #43
part of row #5:part of row #60

You should get the following OUTPUT text :

part of row #1:part of row #20
part of row #3:part of row #41.
part of row #5:part of row #60

Notes :

When using :.+\R, the regex expects at least one standard character between : and the line-break \R, as the + quantifier is a shortcut for the {1,} quantifier
When using :.*\R, the regex allows the part between : and the line-break \R to be empty (that is, absent), as the * quantifier is a shortcut for the {0,} quantifier
The same reasoning applies to the part between \2 and \R
However, note that the zone between the beginning of line ^ and the : must contain at least one standard character, because of the +, and normally stops at the first colon, because the additional ? makes the quantifier lazy. This gives the (.+?) sub-regex, whose match is re-used later as group 2 with the \2 syntax
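
For anyone who wants to try the idea outside Notepad++, here is a minimal Python sketch. It is a translation, not the Notepad++ regex itself: Python's re module has no \R, so the sketch assumes LF (\n) line endings, and the leading (?-s) is dropped because "." does not match a newline in Python unless DOTALL is set. The names PATTERN, SAMPLE and dedupe are invented for this illustration.

```python
import re

# Approximate Python translation of the SEARCH regex above:
#   \R    -> \n  (assumption: the text uses LF line endings)
#   (?-s) -> omitted; "." already excludes "\n" by default in Python
#   re.MULTILINE makes "^" match at the start of every line
PATTERN = re.compile(r"^((.+?):.*\n)(?s:.*)^\2.*\n?", re.MULTILINE)

SAMPLE = (
    "part of row #1:part of row #20\n"
    "part of row #1:\n"
    "part of row #3:part of row #41.\n"
    "part of row #3:\n"
    "part of row #3:part of row #43\n"
    "part of row #5:part of row #60\n"
)

def dedupe(text: str) -> str:
    """Replace each span running from the first to the last line that
    starts with a given prefix by the first line of that span."""
    return PATTERN.sub(r"\1", text)

print(dedupe(SAMPLE), end="")

# ":.+\n" needs at least one character after the colon; ":.*\n" does not.
EMPTY_RIGHT = "part of row #3:\n"
print(re.search(r"^(.+?):.+\n", EMPTY_RIGHT, re.MULTILINE))              # None
print(re.search(r"^(.+?):.*\n", EMPTY_RIGHT, re.MULTILINE) is not None)  # True
```

With the sample INPUT text, this sketch prints the same three lines as the OUTPUT text shown above.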

Best Regards,

guy038

Alan Kilborn

@guy038 said in Remove duplicates if only part of the string matches:

we forgot this case

No, this case was not cited by the OP originally.
We aren’t mind readers here.

Viktor Serdiuk

@Alan-Kilborn said in Remove duplicates if only part of the string matches:

No, this case was not cited by the OP originally.
We aren’t mind readers here.

Agree!
THANK YOU again!

Alan Kilborn

@guy038 said :

By putting the ?s modifier inside its own group, this means that, by default, the whole regex considers the ?-s modifier. Thus, no need to repeat the ?-s, near the end

Minor quibble about this: For the given data, I think doing it this way might make it more confusing – the syntax requires an extra :, which could be confused with the literal : used in the problem statement, and earlier in the expression.

Viktor Serdiuk

@guy038 said in Remove duplicates if only part of the string matches:

Hello, @viktor-serdiuk, @alan-kilborn and All,

Alan, just a small variant of your search regex :

SEARCH (?-s)^((.+?):.+\R)(?s:.*)^\2.+\R?

REPLACE ${1}

Notes :

By putting the ?s modifier inside its own group, this means that, by default, the whole regex considers the ?-s modifier. Thus, no need to repeat the ?-s, near the end

By adding the question mark at the end of the regex, it covers the case of a last line of current file without any line-break !

BTW, it would be nice that the two N++ options Remove Duplicate Lines and Remove Consecutive Duplicate Lines would ask us about the contiguous zone to consider when removing duplicates :-) For example, from column 5 to 20 or from column 30 to end of line and, of course, the entire line, by default, if nothing is typed !

Best Regards,

guy038

It doesn’t work if it starts with numbers, for example :

1part of row #1:part of row #20
1part of row #1:part of row #21
3part of row #3:part of row #41
3part of row #3:part of row #42
3part of row #3:part of row #43
1part of row #1:part of row #60

PeterJones

@Viktor-Serdiuk said in Remove duplicates if only part of the string matches:

It doesn’t work if it starts with numbers, for example :

It’s not because it starts with a number. It’s because it’s non-contiguous. The sixth line cannot be merged with the rest of the first group because there are other lines between them. You can check this yourself by noting that

1part of row #1:part of row #20
1part of row #1:part of row #21
1part of row #1:part of row #60
3part of row #3:part of row #41
3part of row #3:part of row #42
3part of row #3:part of row #43

does work.

None of your examples showed that you wanted to be able to split and have the prefixes out of order, so Guy didn’t develop the regex to be able to handle that edge case. Unfortunately, I cannot think of an easy way to change his regex to meet your new requirements. Hopefully for you, he’ll have an idea when he comes back.
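
To make the contiguity point concrete, here is a small Python sketch. It assumes LF line endings and uses a Python translation of the search regex (\R becomes \n, and re.MULTILINE stands in for the Notepad++ defaults), so it only approximates what Notepad++ does. PATTERN, GROUPED, INTERLEAVED and collapse are names made up for this example.

```python
import re

# Python approximation of: (?-s)^((.+?):.+\R)(?s:.*)^\2.+\R?
# Assumption: LF line endings, so \R is written as \n.
PATTERN = re.compile(r"^((.+?):.+\n)(?s:.*)^\2.+\n?", re.MULTILINE)

# Lines with the same prefix sit next to each other.
GROUPED = (
    "1part of row #1:part of row #20\n"
    "1part of row #1:part of row #21\n"
    "1part of row #1:part of row #60\n"
    "3part of row #3:part of row #41\n"
    "3part of row #3:part of row #42\n"
    "3part of row #3:part of row #43\n"
)

# The last "1part" line comes after the "3part" block.
INTERLEAVED = (
    "1part of row #1:part of row #20\n"
    "1part of row #1:part of row #21\n"
    "3part of row #3:part of row #41\n"
    "3part of row #3:part of row #42\n"
    "3part of row #3:part of row #43\n"
    "1part of row #1:part of row #60\n"
)

def collapse(text: str) -> str:
    """Apply the search/replace once over the whole text."""
    return PATTERN.sub(r"\1", text)

print(collapse(GROUPED), end="")      # one line per prefix
print(collapse(INTERLEAVED), end="")  # the "3part" lines are lost
```

In this translation, the greedy (?s:.*) stretches from the first 1part line to the last one, so with the interleaved input the 3part lines in between are consumed as well. With the grouped input, each prefix keeps its first line.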

While waiting, I suggest you avail yourself of the following advice, which someone should have pointed you to previously:

Jim Dailey

@Viktor-Serdiuk: I would like to add to @PeterJones's suggestions that you also consider some scripting language to perform tasks like this.

While @guy038 and others in this forum are able to accomplish some amazing feats using regular expressions in Notepad++, I think many times the tasks could be accomplished more easily using Python, Perl, or my scripting language of choice, GAWK.

If I understand your desire correctly, I believe this GAWK script would do the trick. The code in the END{} is not needed, but shows how easily you can display statistics about the processed data:

{
    split($0, /:/, Parts)       # Parts[1] <- text before the ":"
    if (Parts[1] in Prefixes) { # If we've seen this prefix ...
        next                    # ... skip this line.
    }
    print Parts[1]              # Print the prefix
    Prefixes[Parts[1]]++        # Add it to the Prefixes[] array
}
END {
    for (p in Prefixes) {       # Print # of times we saw each
        printf("Prefix %s appeared %n times.\n", p, Prefixes[p]
    }
}

Not to start a language war or get too far off topic, but …

I prefer AWK (GNU’s version being GAWK) to newer scripting languages due to its smaller installation footprint and its relative simplicity. Admittedly, its simplicity causes traditional AWK scripts to look quite different from ones written in most other languages (you don’t see any code above that reads the input file because AWK does it for you), but once one understands how AWK reads the input on his/her behalf, writing simple scripts becomes extremely easy.

Jim Dailey

I made some silly mistakes (several syntax errors) in the AWK script above. Also, the additional code in the END block won’t print the total number of times each prefix appeared as I intended. This script does, however:

{
    split($0, Parts, /:/)          # Parts[1] <- text before the ":".
    if (!(Parts[1] in Prefixes)) { # If we've NOT seen this prefix ...
        print Parts[1]             # ... print it.
    }
    Prefixes[Parts[1]]++           # Count this prefix.
}
END {
    for (p in Prefixes) {          # Print # of times we saw each one.
        printf("Prefix '%s' appeared %d times.\n", p, Prefixes[p])
    }
}
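
For comparison, here is a rough Python counterpart of the same approach. Unlike the AWK script, this sketch keeps the whole first line for each prefix rather than only the prefix, which is an assumption about what is wanted. It does not depend on lines with the same prefix being contiguous. The function name first_line_per_prefix and the variable names are invented for this example.

```python
from collections import Counter

def first_line_per_prefix(lines):
    """Return (kept, counts): the first line seen for each prefix,
    in input order, and how many times each prefix appeared."""
    counts = Counter()
    kept = []
    for line in lines:
        prefix = line.partition(":")[0]  # text before the first ":"
        if prefix not in counts:         # first time we see this prefix
            kept.append(line)
        counts[prefix] += 1              # count every occurrence
    return kept, counts

rows = [
    "1part of row #1:part of row #20",
    "1part of row #1:part of row #21",
    "3part of row #3:part of row #41",
    "3part of row #3:part of row #42",
    "3part of row #3:part of row #43",
    "1part of row #1:part of row #60",
]

kept, counts = first_line_per_prefix(rows)
for line in kept:
    print(line)
for prefix, n in counts.items():
    print(f"Prefix '{prefix}' appeared {n} times.")
```

A line without any colon is treated as its own prefix here, since partition returns the whole line in that case.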