Erase content from duplicate lines, but keeping the first unchanged
-
Hello. So, what I want to do is to turn this:
31
31
31
31
32
32
32
33
33
33
33
34
35
35
35
into this:
31



32


33



34
35


So I want to eliminate the content from all duplicate lines (while keeping them) after the first line. Only the first line keeps its value: the content of all the others is erased.
Thanks in advance!
-
@Luís-Gonçalves
So it turns out that it’s actually much easier to remove all but the last occurrence of each number. Hopefully that is sufficient for your needs.
- Select the Mark tab on the find/replace form.
- Enter the regex (^\d+$)(?=.+?^\1$) into the Find what: box.
- How this regex works:
  - (^\d+$) finds a line containing only digits.
  - (?=.+?^\1$) lets that line match only if at least one line containing exactly the same digits appears later in the document.
- Make sure that Bookmark line, Purge for each search, Regular expression, and . matches newline are all checked.
- Hit the Mark all button. Now every line that has an identical line after it will be marked.
- Copy a single tab or space character to the clipboard.
- Select Search->Bookmark->Paste to (Replace) bookmarked lines from the main menu.
- If you want to clear all space from the now-blank lines, just use Edit->Blank Operations->Trim Trailing Space, or use the find/replace form to replace ^[ \t]+$ with nothing.
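For anyone who wants to check which lines that Mark regex picks up before touching a real file, here is a minimal Python sketch. It is not part of the Notepad++ procedure: the function names are made up for illustration, and re.DOTALL is assumed to be a fair stand-in for the ". matches newline" checkbox.

```python
import re

# Same pattern as in the post. MULTILINE makes ^ and $ work per line;
# DOTALL stands in for Notepad++'s ". matches newline" option (an assumption).
MARK_RE = re.compile(r"(^\d+$)(?=.+?^\1$)", re.MULTILINE | re.DOTALL)

def marked_line_numbers(text):
    """Return the 1-based numbers of the lines the pattern matches."""
    return [text.count("\n", 0, m.start()) + 1 for m in MARK_RE.finditer(text)]

def blank_marked_lines(text):
    """Empty every matched line, i.e. every occurrence except the last one."""
    return MARK_RE.sub("", text)

sample = "31\n31\n32\n31\n32\n33"
print(marked_line_numbers(sample))            # [1, 2, 3]
print(blank_marked_lines(sample).split("\n"))  # ['', '', '', '31', '32', '33']
```

Note that the last 31 and the last 32 survive, not the first ones, which matches the "all but the last occurrence" behaviour described above.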
I also wrote another regex to replace all but the first occurrence, but it’s much slower to execute (it takes time proportional to N^2 log(N), where N is the max number of repeats) and requires you to hit the replace button several times.
To do that:
- Use ^(\d+)$(.+?)^\1$ in the Find box and \1\2 in the Replace with box, on the Replace tab of the find/replace form. Make sure Regular expression is checked (and leave . matches newline checked as well, since (.+?) has to reach across line breaks).
- As noted above, you will have to keep hitting the replace button until the little indicator at the bottom says 0 things were replaced.
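To see why several passes are needed, here is a rough Python sketch that repeats the same find/replace until nothing changes. The helper name is hypothetical, and re.DOTALL is assumed to behave like ". matches newline".

```python
import re

# Pattern and replacement as given in the post. DOTALL plays the role of
# ". matches newline" so that (.+?) can run across line breaks.
FIND_RE = re.compile(r"^(\d+)$(.+?)^\1$", re.MULTILINE | re.DOTALL)
REPLACE_WITH = r"\1\2"

def blank_all_but_first(text):
    """Repeat the replacement until a pass changes nothing.

    Returns the final text and the number of passes that replaced something.
    """
    passes = 0
    while True:
        text, replaced = FIND_RE.subn(REPLACE_WITH, text)
        if replaced == 0:
            return text, passes
        passes += 1

result, passes = blank_all_but_first("31\n31\n31\n31\n32\n32\n32")
print(result.split("\n"), passes)
```

Each match consumes both the kept line and the duplicate it erases, so a single pass can only clear some of the repeats in a long run. On the sample above, two passes do real work and a third one reports nothing left to replace.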
-
@Mark-Olson
If you want to find and replace identical lines (not just numbers like in this toy example), just replace ^\d+$ with ^regex-that-matches-an-entire-line$ wherever you saw me write ^\d+$.
For example:
- ^[abc]{3,5}$ would match a line containing any combination of the letters a, b, and c with total length 3 to 5.
- ^[^\r\n]*$ would match any line (even an empty line).
-
@Mark-Olson said:
I also wrote another regex to replace all but the first occurrence, but it’s much slower to execute
I’m glad you provided this, even if it is slower, because if you had just provided the first part of your solution, you wouldn’t have solved the problem, as it doesn’t give the OP what they wanted.
I presume they have good reason for wanting the replace output the way they specified!
-
@Mark-Olson’s second method could get tedious if there are 50 duplicate lines in a row instead of just 3-5 in a row.
I’d do it in multiple steps:
- FIND WHAT: (?-s)(^\d+$)(\R\1)*
  REPLACE WITH: ☺$0
  SEARCH MODE = Regular Expression
  REPLACE ALL
  - i.e., look for a line (in this case, all digits) that has 0 or more copies immediately following, and prefix it with a smiley
- FIND WHAT: (?-s)(^\d+$)
  REPLACE WITH: <nothing/empty field>
  SEARCH MODE = Regular Expression
  REPLACE ALL
  - any line that didn’t get transformed, but matches the “all digits” requirement, must’ve been a duplicate, so it should be cleared
- FIND WHAT: ^☺(?=\d+$)
  REPLACE WITH: <nothing/empty field>
  SEARCH MODE = Regular Expression
  REPLACE ALL
  - any line that did get transformed should have the smiley removed
(Like Mark’s attempts, mine assumes the lines you want to transform are just one or more digits each, with no spaces or non-digit characters either before or after.)
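The three passes above can be sketched in Python for a quick dry run. This is an adaptation, not the exact Notepad++ expressions: Python's re module has no \R, so \n is used instead; (?-s) is dropped because the dot does not match newlines by default there (and no dot is used anyway); and the extra $ after \1 is an addition in this sketch so that 31 followed by 310 is not treated as a run. The function name is hypothetical.

```python
import re

MARKER = "☺"  # assumed never to occur in the real data

def blank_consecutive_duplicates(text):
    # Step 1: prefix the first line of every run of identical digit lines.
    text = re.sub(r"(^\d+$)(\n\1$)*", MARKER + r"\g<0>", text, flags=re.MULTILINE)
    # Step 2: digit-only lines without a marker are duplicates, so empty them.
    text = re.sub(r"^\d+$", "", text, flags=re.MULTILINE)
    # Step 3: strip the marker from the lines that kept their value.
    return re.sub("^" + re.escape(MARKER) + r"(?=\d+$)", "", text, flags=re.MULTILINE)

sample = "31\n31\n31\n31\n32\n32\n32\n33\n33\n33\n33\n34\n35\n35\n35"
print(blank_consecutive_duplicates(sample).split("\n"))
```

Unlike the repeated-replace method, each of the three passes runs exactly once, no matter how long the runs of duplicates are.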
-
@PeterJones
This approach is much better than mine in the case where all the duplicate lines are consecutive (that is, there are no numbers other than 11 between the first occurrence of 11 and the last occurrence of 11).
While my approach is far worse for this specific use case, it does not have this limitation.
-
You are right. When I looked at the OP data, it only had consecutive duplicates. If it has to handle duplicates with other lines in between, then mine is not sufficient. The OP doesn’t state whether or not all the duplicates are consecutive, so we’re both working from a reasonable but different assumption/interpretation of the example data.
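To make the two readings of the example data concrete, here is a small Python sketch, independent of Notepad++ and of either regex. Both function names are made up for illustration, and unlike the regexes in this thread the sketch compares whole lines rather than digit-only lines.

```python
def blank_consecutive_repeats(lines):
    """Blank a line only when it equals the line directly before it."""
    out, previous = [], None
    for line in lines:
        out.append("" if line == previous else line)
        previous = line
    return out

def blank_later_repeats(lines):
    """Blank a line whenever the same content appeared anywhere earlier."""
    seen, out = set(), []
    for line in lines:
        out.append("" if line in seen else line)
        seen.add(line)
    return out

data = ["11", "11", "12", "11"]
print(blank_consecutive_repeats(data))  # ['11', '', '12', '11']
print(blank_later_repeats(data))        # ['11', '', '12', '']
```

On the OP's sample both functions give the same result, because every duplicate there is consecutive. They only differ once another value sits between two copies.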
-
@Mark-Olson your solution worked perfectly, and it did exactly what I wanted. Thank you very much! =)
Thanks to all the other people who gave their help as well.
You’re the best!
-
Hello, @luís-gonçalves, @mark-olson, @alan-kilborn, @peterjones and All,
Here is a quick way to mark all consecutive equal lines but the first!
- First, add a final line-break at the end of your list of numbers! (IMPORTANT)
- MARK (?x) ^ ( \d+ \R ) \K ( \1 )+
- Bookmark line, Purge for each search and Regular expression checked
Then, you can follow @mark-olson’s instructions! So:
- Put a single space char in the clipboard with Ctrl + C
- Run the Search > Bookmark > Paste to (Replace) Bookmarked Lines option
- Finally, run the simple S/R:
  - SEARCH ^\x20$
  - REPLACE Leave EMPTY
- Or use the Edit > Blank Operations > Trim Trailing Space option
Best Regards
guy038
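As a rough way to preview which lines this mark would hit, here is a Python sketch. It is an adaptation under stated assumptions: Python's re module supports neither \K nor \R, so the first line of each run is captured in group 1, the repeats in group 2, and \n replaces \R. The function name is hypothetical.

```python
import re

# Group 1 is the first line of a run (with its line break); group 2 holds the
# repeats. Taking the span of group 2 imitates what \K does in the original.
RUN_RE = re.compile(r"^(\d+\n)((?:\1)+)", re.MULTILINE)

def lines_to_bookmark(text):
    """Return 1-based numbers of consecutive duplicate lines, excluding the first of each run."""
    marked = []
    for m in RUN_RE.finditer(text):
        first = text.count("\n", 0, m.start(2)) + 1
        count = m.group(2).count("\n")
        marked.extend(range(first, first + count))
    return marked

print(lines_to_bookmark("31\n31\n31\n32\n33\n33\n"))  # [2, 3, 6]
print(lines_to_bookmark("31\n31"))                    # []
```

The second call shows why the final line-break matters: without it, the last duplicate has no line break to match against the captured one, so it is not picked up.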
-