Remove duplicates if only part of the string matches
-
Thank you so much!
-
The only thing is that if the right part of the row is missing, it is not checked.
part of row #1:part of row #20
part of row #1:
part of row #3:part of row #41.
part of row #3:
part of row #3:part of row #43
part of row #5:part of row #60
I get the result :
part of row #1:part of row #20
part of row #1:
part of row #3:
part of row #3:part of row #41.
part of row #5:part of row #60 -
Hi, @viktor-serdiuk, @alan-kilborn and All,
Ah, yes…, we forgot this case ! No problem, just two modifications to allow the part after the colon to be empty :
SEARCH
(?-s)^((.+?):.*\R)(?s:.*)^\2.*\R?
REPLACE
${1}
Thus, given your INPUT text :
part of row #1:part of row #20
part of row #1:
part of row #3:part of row #41.
part of row #3:
part of row #3:part of row #43
part of row #5:part of row #60
You should get the following OUTPUT text :
part of row #1:part of row #20
part of row #3:part of row #41.
part of row #5:part of row #60
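For anyone who wants to check this outside Notepad++, here is a rough Python sketch of the same search/replace. It is only an approximation under a few assumptions: Python's re module has no \R, so \n stands in for the line-break, re.MULTILINE is used so that ^ matches at each line start, and the leading (?-s) is dropped because the dot does not match line-breaks by default in Python. The names PATTERN and keep_first_per_prefix are made up for this illustration.

```python
import re

# Assumed Python port of the search regex: "\n" replaces \R, and
# re.MULTILINE lets ^ match at the start of every line.
PATTERN = re.compile(r"^((.+?):.*\n)(?s:.*)^\2.*\n?", re.MULTILINE)


def keep_first_per_prefix(text: str) -> str:
    """Replace every matched run of lines with group 1, its first line."""
    return PATTERN.sub(r"\1", text)


INPUT = (
    "part of row #1:part of row #20\n"
    "part of row #1:\n"
    "part of row #3:part of row #41.\n"
    "part of row #3:\n"
    "part of row #3:part of row #43\n"
    "part of row #5:part of row #60\n"
)

result = keep_first_per_prefix(INPUT)
print(result, end="")
```

With the INPUT text above, this prints the three lines of the OUTPUT text.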
Notes :
- When using :.+\R , the regex expects at least one standard character between : and the line-break \R , as the + quantifier is a shortcut for the {1,} quantifier
- When using :.*\R , the regex allows the part between : and the line-break \R to be empty (so to not exist), as the * quantifier is a shortcut for the {0,} quantifier
- The same reasoning applies to the part between \2 and \R
- However, note that the zone between the beginning of line ^ and the : must contain at least one standard char because of the + and, also, no other colon because of the additional ? char, giving the (.+?) sub-regex, re-used later as group 2 with the \2 syntax
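The quantifier notes above can be checked in isolation with a small Python sketch. It assumes \n in place of \R (Python's re module has no \R), and the sample strings are invented for the illustration.

```python
import re

# A line whose part after the colon is empty.
empty_right = "part of row #1:\n"

# ":.+\n" needs at least one character between ':' and the line-break,
# so it finds nothing in this line.
plus_match = re.search(r":.+\n", empty_right)

# ":.*\n" also accepts an empty part between ':' and the line-break.
star_match = re.search(r":.*\n", empty_right)

# The lazy (.+?) takes as little as possible, so on its own it stops
# right before the first colon.
prefix = re.match(r"(.+?):", "a:b:c").group(1)

print(plus_match is None, star_match is not None, prefix)
```

Here plus_match is None, star_match covers the colon and the line-break, and prefix is the text before the first colon.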
Best Regards,
guy038
-
-
@guy038 said in Remove duplicates if only part of the string matches:
we forgot this case
No, this case was not cited by the OP originally.
We aren’t mind readers here. -
@Alan-Kilborn said in Remove duplicates if only part of the string matches:
No, this case was not cited by the OP originally.
We aren’t mind readers here.
Agree!
THANK YOU again! -
@guy038 said :
By putting the ?s modifier inside its own group, this means that, by default, the whole regex considers the ?-s modifier. Thus, no need to repeat the ?-s, near the end
Minor quibble about this: For the given data, I think doing it this way might make it more confusing – the syntax requires an extra : , which could be confused with the literal : used in the problem statement, and earlier in the expression.
-
@guy038 said in Remove duplicates if only part of the string matches:
Hello, @viktor-serdiuk, @alan-kilborn and All,
Alan, just a small variant of your search regex :
SEARCH
(?-s)^((.+?):.+\R)(?s:.*)^\2.+\R?
REPLACE
${1}
Notes :
- By putting the ?s modifier inside its own group, this means that, by default, the whole regex considers the ?-s modifier. Thus, no need to repeat the ?-s near the end
- By adding the question mark at the end of the regex, it covers the case of a last line of current file without any line-break !
BTW, it would be nice that the two N++ options Remove Duplicate Lines and Remove Consecutive Duplicate Lines would ask us about the contiguous zone to consider when removing duplicates :-) For example, from column 5 to 20 or from column 30 to end of line and, of course, the entire line, by default, if nothing is typed !
Best Regards,
guy038
It doesn’t work if it starts with numbers, for example :
1part of row #1:part of row #20
1part of row #1:part of row #21
3part of row #3:part of row #41
3part of row #3:part of row #42
3part of row #3:part of row #43
1part of row #1:part of row #60 -
-
@Viktor-Serdiuk said in Remove duplicates if only part of the string matches:
It doesn’t work if it starts with numbers, for example :
It’s not because it starts with a number. It’s because it’s non-contiguous. The sixth line cannot be merged with the rest of the first group because there are other lines between. You could see this yourself by seeing that
1part of row #1:part of row #20
1part of row #1:part of row #21
1part of row #1:part of row #60
3part of row #3:part of row #41
3part of row #3:part of row #42
3part of row #3:part of row #43
does work.
None of your examples showed that you wanted to be able to split and have the prefixes out of order, so Guy didn’t develop the regex to be able to handle that edge case. Unfortunately, I cannot think of an easy way to change his regex to meet your new requirements. Hopefully for you, he’ll have an idea when he comes back.
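To see the contiguity effect for yourself, here is a rough Python sketch. It assumes a Python port of Guy's search regex (\n instead of \R, re.MULTILINE so that ^ matches at line starts), so it only approximates what Notepad++ does. The names PATTERN, dedupe, SCATTERED and GROUPED are invented for the illustration.

```python
import re

# Assumed Python port of the search regex ("\n" replaces \R).
PATTERN = re.compile(r"^((.+?):.*\n)(?s:.*)^\2.*\n?", re.MULTILINE)


def dedupe(text: str) -> str:
    """Replace every matched run of lines with its first line."""
    return PATTERN.sub(r"\1", text)


# The prefix "1part of row #1" shows up again after the "3part" lines.
SCATTERED = (
    "1part of row #1:part of row #20\n"
    "1part of row #1:part of row #21\n"
    "3part of row #3:part of row #41\n"
    "3part of row #3:part of row #42\n"
    "3part of row #3:part of row #43\n"
    "1part of row #1:part of row #60\n"
)

# Same lines, with each prefix kept in one contiguous block.
GROUPED = (
    "1part of row #1:part of row #20\n"
    "1part of row #1:part of row #21\n"
    "1part of row #1:part of row #60\n"
    "3part of row #3:part of row #41\n"
    "3part of row #3:part of row #42\n"
    "3part of row #3:part of row #43\n"
)

print(dedupe(SCATTERED), end="")
print("---")
print(dedupe(GROUPED), end="")
```

In this port, the greedy (?s:.*) runs up to the last line that starts with the captured prefix. With SCATTERED, a single match therefore covers all six lines and the "3part" lines disappear along with the duplicates. With GROUPED, one line per prefix is left.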
While waiting, I suggest you avail yourself of the following advice, which someone should have pointed you to previously:
-
@Viktor-Serdiuk: I would like to add to @PeterJones suggestions that you also consider some scripting language to perform tasks like this.
While @guy038 and others in this forum are able to accomplish some amazing feats using regular expressions in Notepad++, I think many times the tasks could be accomplished more easily using Python, Perl, or my scripting language of choice, GAWK.
If I understand your desire correctly, I believe this GAWK script would do the trick. The code in the END{} is not needed, but shows how easily you can display statistics about the processed data:
{
    split($0, /:/, Parts)            # Parts[1] <- text before the ":"
    if (Parts[1] in Prefixes) {      # If we've seen this prefix ...
        next                         # ... skip this line.
    }
    print Parts[1]                   # Print the prefix
    Prefixes[Parts[1]]++             # Add it to the Prefixes[] array
}
END {
    for (p in Prefixes) {            # Print # of times we saw each
        printf("Prefix %s appeared %n times.\n", p, Prefixes[p]
    }
}
Not to start a language war or get too far off topic, but …
I prefer AWK (GNU’s version being GAWK) to newer scripting languages due to its smaller installation footprint and its relative simplicity. Admittedly, its simplicity causes traditional AWK scripts to look quite different from ones written in most other languages (you don’t see any code above that reads the input file because AWK does it for you), but once one understands how AWK reads the input on his/her behalf, writing simple scripts becomes extremely easy.
-
I made some silly mistakes (several syntax errors) in the AWK script above. Also, the additional code in the END block won’t print the total number of times each prefix appeared as I intended. This script does, however:
{
    split($0, Parts, /:/)            # Parts[1] <- text before the ":".
    if (!(Parts[1] in Prefixes)) {   # If we've NOT seen this prefix ...
        print Parts[1]               # ... print it.
    }
    Prefixes[Parts[1]]++             # Count this prefix.
}
END {
    for (p in Prefixes) {            # Print # of times we saw each one.
        printf("Prefix '%s' appeared %d times.\n", p, Prefixes[p])
    }
}
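Since Python was mentioned as an alternative, here is a comparable sketch in Python. It is not a line-for-line translation of the GAWK script: as an assumption about what the OP wants, it keeps the whole first line seen for each prefix rather than printing only the prefix, and it does not require the duplicates to be contiguous. The function name first_line_per_prefix and the sample data are made up for the illustration.

```python
from collections import Counter


def first_line_per_prefix(lines):
    """Keep the first line seen for each prefix (text before the first ':')."""
    seen = set()
    kept = []
    for line in lines:
        prefix = line.split(":", 1)[0]
        if prefix not in seen:      # first time we meet this prefix
            seen.add(prefix)
            kept.append(line)
    return kept


lines = [
    "1part of row #1:part of row #20",
    "1part of row #1:part of row #21",
    "3part of row #3:part of row #41",
    "3part of row #3:part of row #42",
    "3part of row #3:part of row #43",
    "1part of row #1:part of row #60",
]

kept = first_line_per_prefix(lines)
# Count how often each prefix appeared, like the END block above.
counts = Counter(line.split(":", 1)[0] for line in lines)

for line in kept:
    print(line)
for prefix, n in counts.items():
    print(f"Prefix '{prefix}' appeared {n} times.")
```

On this sample, two lines are kept (the first "1part" line and the first "3part" line), and each prefix is counted three times.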