Skip duplicate words at beginning of the line
-
Hi,
I have a huge list which is sorted in alphabetical order. The list contains duplicate words, but only at the beginning of lines, and these duplicates have to be removed. For example:
Line1: House[Tab]following text
Line2: House[Tab]following text
Line3: House[Tab]following text
Line4: Garden[Tab]following text
Line5: Garden[Tab]following text
Line6: Green [Tab]following text
Target result:
Line1: House[Tab]following text
Line2: [Tab]following text
Line3: [Tab]following text
Line4: Garden[Tab]following text
Line5: [Tab]following text
Line6: Green [Tab]following text
Any ideas how to manage it in Notepad++?
Thank you very much in advance.
Michael
-
Hello kreien,
I was pretty sure that your sorted list could be modified with a regex S/R. Unfortunately, I was unable to do what you want in one go :-(( Luckily, with two successive S/R, it works fine !
-
First of all, add a dummy line, just before the first line of your list ( xxxx[TAB]xxxx )
-
We’ll also need a dummy character, not used yet, in your file, to identify specific lines. I chose the
#
symbol but any other symbol may be used. Just escape it if this symbol is a special regex character !
Two hypotheses :
-
I assumed that the lines of your sorted list are NOT preceded by blank characters, which could differ between two consecutive lines !
-
I assumed that you don’t care about the case of the text before the first tab character
So, we start, for instance, with the sorted example text, below :
xxxx[TAB]xxxx
Garden[TAB]following text
garden[TAB]following text
Garden[TAB]following text
Garden[TAB]following text
Green[TAB]following text
House[TAB]following text
House[TAB]following text
house[TAB]following text
street[TAB]following text
Street[TAB]following text
Wall[TAB]following text
As you can see, the lines, beginning with House, are located after those beginning with the word Green. Better for a sorted list, isn’t it ?
The first regex S/R, below, will add a
#
symbol at the end of either any single line or the last line of a group of lines beginning with the same text :
SEARCH
(?i-s)^(.+?)\t.+\K\R(?!\1)
REPLACE
#$0
NOTES :
-
The part
(?i-s)
forces the regex engine to treat the dot character, . , as matching a single standard character only (never a line-break character), and makes the whole search case-insensitive !
-
Then, the part
^(.+?)\t
represents, from the beginning of the line, the shortest range of standard characters followed by a tab character. This range is stored as group 1, because of the surrounding parentheses
-
The part
.+
, matches all the remaining standard characters of the line, after the first tab character
-
The final part
\R(?!\1)
represents the End of Line character(s) of the current line, followed by a negative look-ahead, that is to say a condition which must hold for the regex engine to accept the overall match. So, the next line must NOT begin with the text stored in group 1 ( \1 ), which is the beginning of the current line
Finally, the syntax
\K
forces the regex engine to forget all the text matched before \K . So, this search regex just matches the End of Line character(s) of the current line, if the next line does NOT begin with the same string as the current one
-
So, in replacement, these End of Line character(s) ( the whole match,
$0
) are rewritten, preceded by a
#
symbol
And we obtain the changed text, below :
xxxx[TAB]xxxx#
Garden[TAB]following text
garden[TAB]following text
Garden[TAB]following text
Garden[TAB]following text#
Green[TAB]following text#
House[TAB]following text
House[TAB]following text
house[TAB]following text#
street[TAB]following text
Street[TAB]following text#
Wall[TAB]following text#
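For anyone who wants to check this first step outside Notepad++, here is a minimal Python sketch. It is not the Notepad++ regex itself: Python's re module supports neither \K nor \R, so the sketch captures the whole line in a group and writes it back, and it assumes plain \n line endings. The names MARK_GROUP_END and mark_group_ends are made up for this illustration.

```python
import re

# Python's re has no \K and no \R, so the whole line is captured (group 1)
# and written back, and "\n" line endings are assumed.
# Group 2 is the text before the first tab (group 1 in the Notepad++ regex).
MARK_GROUP_END = re.compile(r"(?im)^((.+?)\t.+)\n(?!\2)")

def mark_group_ends(text: str) -> str:
    """Append '#' to each line whose next line does not begin with the same text."""
    return MARK_GROUP_END.sub(r"\1#\n", text)

sample = (
    "xxxx\txxxx\n"
    "Garden\tfollowing text\n"
    "garden\tfollowing text\n"
    "Garden\tfollowing text\n"
    "Garden\tfollowing text\n"
    "Green\tfollowing text\n"
    "House\tfollowing text\n"
    "House\tfollowing text\n"
    "house\tfollowing text\n"
    "street\tfollowing text\n"
    "Street\tfollowing text\n"
    "Wall\tfollowing text\n"
)

marked = mark_group_ends(sample)
print(marked)
```

As in the Notepad++ version, the last line only receives its # symbol if it ends with a line break.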
The second regex S/R, below, deletes any
#
symbol, as well as any text up to the first tab character, in every line whose previous line does NOT end with a # symbol :
SEARCH
(?-s)#|[^#\r\n]\R\K.+?(?=\t)
REPLACE
EMPTY
NOTES :
-
See above for the
(?-s)
syntax
-
The first branch of the alternation,
#
, matches a possible # symbol at the end of a line
-
The second part of the alternative,
[^#\r\n]\R
, looks for a last standard character, different from a # symbol, followed by the End of Line character(s)
-
Then the
\K
syntax, again, resets the start of the match to the beginning of the next line
-
Finally, the part
.+?(?=\t)
just matches the shortest range of characters followed by the first tab character of that next line
-
In replacement, either the
#
symbol OR all the characters before the first tab character, when the previous line does NOT end with a # symbol, are simply deleted
So, we get the final text :
xxxx[TAB]xxxx
Garden[TAB]following text
[TAB]following text
[TAB]following text
[TAB]following text
Green[TAB]following text
House[TAB]following text
[TAB]following text
[TAB]following text
street[TAB]following text
[TAB]following text
Wall[TAB]following text
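The second step can be checked in Python in the same spirit. Again this is only a sketch under assumptions: \n line endings, a # symbol that appears nowhere else in the file, and a capture group standing in for \K (the group is written back, so only the leading word is dropped). STRIP_REPEATS and strip_repeats are hypothetical names.

```python
import re

# Either a lone '#', or (a non-'#' character + newline) followed by the text
# before the first tab of the next line. Group 1 plays the role of \K:
# it is put back by the replacement function, so only the word is removed.
STRIP_REPEATS = re.compile(r"#|([^#\n]\n).+?(?=\t)")

def strip_repeats(marked_text: str) -> str:
    """Delete the '#' markers and the repeated leading words."""
    return STRIP_REPEATS.sub(lambda m: m.group(1) or "", marked_text)

marked_sample = (
    "xxxx\txxxx#\n"
    "Garden\tfollowing text\n"
    "garden\tfollowing text\n"
    "Garden\tfollowing text\n"
    "Garden\tfollowing text#\n"
    "Green\tfollowing text#\n"
    "House\tfollowing text\n"
    "House\tfollowing text\n"
    "house\tfollowing text#\n"
    "street\tfollowing text\n"
    "Street\tfollowing text#\n"
    "Wall\tfollowing text#\n"
)

final_text = strip_repeats(marked_sample)
print(final_text)
```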
To end with, delete the dummy first line. Et voilà !
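If the list can be processed outside Notepad++, the same result can also be produced without regexes. The sketch below is a different technique from the two S/R above, not a translation of them: it walks the lines once and compares the whole text before the first tab, ignoring case, with that of the previous line, so no dummy line or marker symbol is needed. The look-ahead (?!\1) in the first S/R only checks that the next line begins with the same characters, so the two approaches can differ when one word is a prefix of the next. The function name blank_repeated_keys is hypothetical.

```python
from typing import List, Optional

def blank_repeated_keys(lines: List[str]) -> List[str]:
    """Blank the text before the first tab when it repeats the previous line's."""
    result: List[str] = []
    previous_key: Optional[str] = None
    for line in lines:
        key, tab, rest = line.partition("\t")
        if tab and previous_key is not None and key.lower() == previous_key:
            # Same leading word as the previous line: keep only tab + rest.
            result.append(tab + rest)
        else:
            result.append(line)
        # Lines without a tab break the run of identical leading words.
        previous_key = key.lower() if tab else None
    return result

example = [
    "House\tfollowing text",
    "House\tfollowing text",
    "House\tfollowing text",
    "Garden\tfollowing text",
    "Garden\tfollowing text",
    "Green\tfollowing text",
]
for line in blank_repeated_keys(example):
    print(line)
```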
IMPORTANT :
As we use the
\K
syntax in both S/R, you must use the Replace All button, exclusively ! Don’t use the Replace button ( step-by-step replacement ) for these S/R !
Best Regards,
guy038
-