Skip duplicate words at beginning of the line

kreien

Hi,

I have a huge list which is sorted in alphabetical order. In this list, consecutive lines often begin with the same word, and these repeated words at the beginning of the line have to be removed. For example:
Line1: House[Tab]following text
Line2: House[Tab]following text
Line3: House[Tab]following text
Line4: Garden[Tab]following text
Line5: Garden[Tab]following text
Line6: Green [Tab]following text

Target result:
Line1: House[Tab]following text
Line2: [Tab]following text
Line3: [Tab]following text
Line4: Garden[Tab]following text
Line5: [Tab]following text
Line6: Green [Tab]following text

Any ideas how to manage it in Notepad++?

Thank you very much in advance.
Michael

guy038

Hello kreien,

I was pretty sure that your sorted list could be modified with a regex S/R ( Search/Replace ). Unfortunately, I was unable to do what you want in one go :-(( Luckily, with two successive S/R, it’s quite OK !

First of all, add a dummy line, just before the first line of your list ( xxxx[TAB]xxxx )
We’ll also need a dummy character, not used yet, in your file, to identify specific lines. I chose the # symbol but any other symbol may be used. Just escape it if this symbol is a special regex character !

Two hypotheses :

I assumed that the lines of your sorted list are NOT preceded by blank characters, which could differ between two consecutive lines !
I assumed that you don’t care about the case of the text before the first tabulation character

So, we start, for instance, with the sorted example text, below :

xxxx	xxxx
Garden	following	text
garden	following	text
Garden	following	text
Garden	following	text
Green 	following	text
House	following	text
House	following	text
house	following	text
street	following	text
Street	following	text
Wall	following	text

As you can see, the lines, beginning with House, are located after those beginning with the word Green. Better for a sorted list, isn’t it ?

The first regex S/R, below, will add a # symbol at the end of any line which is either a single line ( its first word is not repeated ) OR the last line of a group of lines beginning with the same word

SEARCH (?i-s)^(.+?)\t.+\K\R(?!\1)

REPLACE #$0

NOTES :

The part (?i-s) means that the dot character, ., matches a single standard character only ( never an End of Line character ) and that the whole search is performed in a case-insensitive way !
Then, the part ^(.+?)\t represents, from beginning of line, the shortest range of standard characters, followed by a tabulation character. This range is stored as group 1, due to the surrounding round brackets
The part .+, matches all the remaining standard characters, of the line, after the first tabulation
The final part \R(?!\1) represents the End of Line character(s) of the current line, followed by a negative look-ahead, that is to say a condition which must be true for the regex engine to accept the overall match. So, the beginning of the next line must be different from the beginning of the current one ( \1 )

Finally, the syntax \K forces the regex engine to forget all the text matched before \K. So, this search regex just matches the End of Line character(s) of the current line, if the next line does NOT begin with the same string as the current one

So, in replacement, these End of Line character(s) ( the whole match, $0 ) are re-written, preceded by a # symbol

And we obtain the changed text, below :

xxxx	xxxx#
Garden	following	text
garden	following	text
Garden	following	text
Garden	following	text#
Green 	following	text#
House	following	text
House	following	text
house	following	text#
street	following	text
Street	following	text#
Wall	following	text#
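
For readers who want to check this first S/R outside Notepad++, here is a minimal Python sketch of the same idea. It is only an approximation: Python’s re module has no \K or \R, so capture groups and a plain \n stand in for them, and the text is assumed to use \n line endings. The names mark_group_ends and marker are made up for this illustration.

```python
import re

# Port of (?i-s)^(.+?)\t.+\K\R(?!\1) under two assumptions:
#  - line endings are "\n" (Python's re has no \R)
#  - capture groups replace \K (not available in Python's re)
# (?i) = case-insensitive, (?m) = ^ matches at every line start.
# Group 1 = the whole line, group 2 = the text before the first tab.
FIRST_PASS = re.compile(r"(?im)^((.+?)\t.+)\n(?!\2)")


def mark_group_ends(text: str, marker: str = "#") -> str:
    """Append `marker` to each line whose next line starts with a different key."""
    return FIRST_PASS.sub(lambda m: m.group(1) + marker + "\n", text)


sample = (
    "xxxx\txxxx\n"
    "Garden\tfollowing\ttext\n"
    "garden\tfollowing\ttext\n"
    "Green \tfollowing\ttext\n"
)
print(mark_group_ends(sample))
```

The last line only gets its marker if it ends with a line break, just as in the Notepad++ version.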

The second regex S/R, below, deletes any # symbol, as well as any text up to the first tabulation character, in every line whose previous line does NOT end with a # symbol

SEARCH (?-s)#|[^#\r\n]\R\K.+?(?=\t)

REPLACE Leave EMPTY

NOTES :

Refer above, for the (?-s) syntax
The first part of the alternative, #, matches a possible # symbol, at the end of a line
The second part of the alternative begins with [^#\r\n]\R, which looks for a last standard character, different from a # symbol, followed by the End of Line character(s)
Then the \K syntax, again, resets the regex engine search location, at the beginning of the next line
Finally, the part .+?(?=\t) just matches the shortest range of characters which is followed by the first tabulation character of the next line
In replacement, either the # symbol OR all the characters before the first tabulation, when the previous line does NOT end with a # symbol, are simply deleted

So, we get the final text :

xxxx	xxxx
Garden	following	text
	following	text
	following	text
	following	text
Green 	following	text
House	following	text
	following	text
	following	text
street	following	text
	following	text
Wall	following	text
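
The second S/R can be approximated in Python in the same spirit. Again this is a sketch under assumptions, not the Notepad++ regex itself: \n replaces \R, and a capture group that is written back replaces \K. The function name strip_markers_and_repeats is hypothetical.

```python
import re

# Port of (?-s)#|[^#\r\n]\R\K.+?(?=\t) assuming "\n" line endings.
# Alternative 1: a "#" marker, which is simply dropped.
# Alternative 2: group 1 holds the last character of a line that does not end
# with "#" plus the line break; it is written back, and the key that follows
# (everything up to the first tab of the next line) is dropped.
SECOND_PASS = re.compile(r"#|([^#\r\n]\n).+?(?=\t)")


def strip_markers_and_repeats(marked: str) -> str:
    """Remove the markers and the repeated keys from text produced by the first pass."""
    return SECOND_PASS.sub(lambda m: m.group(1) or "", marked)


marked_sample = (
    "xxxx\txxxx#\n"
    "Garden\tfollowing\ttext\n"
    "garden\tfollowing\ttext#\n"
    "Green \tfollowing\ttext#\n"
)
print(strip_markers_and_repeats(marked_sample))
```

As in Notepad++, any # already present in the data would also be removed, which is why the marker must be a character that does not occur in the file.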

To end with, delete the dummy first line. Et voilà !

IMPORTANT :

As we use the \K syntax, in the two S/R, you must click on the Replace All button, exclusively ! Don’t use the Replace button, ( step by step replacement ) for these S/R !
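
If the list can be processed outside Notepad++, the same result can be obtained in one pass, without the dummy line or the # marker. The following Python sketch is an alternative to the two S/R above, not a translation of them; blank_repeated_keys is a name invented here. It assumes that the key is the text before the first tab and that case does not matter ( second hypothesis ). It compares whole keys, so Green and Greenhouse stay distinct, whereas the look-ahead (?!\1) appears to check only that the next line starts with the same characters.

```python
from typing import Iterable, List


def blank_repeated_keys(lines: Iterable[str]) -> List[str]:
    """Drop the text before the first tab when it repeats the previous line's key."""
    result: List[str] = []
    previous_key = None  # case-folded key of the previous line, if it had one
    for line in lines:
        key, tab, rest = line.partition("\t")
        if not tab:
            # Line without a tab: keep it unchanged and start a new group.
            result.append(line)
            previous_key = None
            continue
        folded = key.casefold()  # case-insensitive comparison
        if folded == previous_key:
            result.append(tab + rest)  # same key as the line above: keep only "\t..."
        else:
            result.append(line)  # first line of a new group: keep the key
        previous_key = folded
    return result


example = [
    "House\tfollowing text",
    "House\tfollowing text",
    "Garden\tfollowing text",
    "Green \tfollowing text",
]
print("\n".join(blank_repeated_keys(example)))
```

Working on a list of lines avoids any dependence on the End of Line characters, for instance after text.splitlines().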

Best Regards,

guy038