Skip duplicate words at beginning of the line



  • Hi,

    I have a huge list which is sorted in alphabetical order. The list contains duplicate words, only at the beginning of lines, which have to be removed. For example:
    Line1: House[Tab]following text
    Line2: House[Tab]following text
    Line3: House[Tab]following text
    Line4: Garden[Tab]following text
    Line5: Garden[Tab]following text
    Line6: Green [Tab]following text

    Target result:
    Line1: House[Tab]following text
    Line2: [Tab]following text
    Line3: [Tab]following text
    Line4: Garden[Tab]following text
    Line5: [Tab]following text
    Line6: Green [Tab]following text

    Any ideas how to manage this in Notepad++?

    Thank you very much in advance.
    Michael



  • Hello kreien,

    I was pretty sure that your sorted list could be modified with a regex S/R. Unfortunately, I was unable to do what you want in one go :-(( Luckily, with two successive S/R, it works quite well !

    • First of all, add a dummy line, just before the first line of your list ( xxxx[TAB]xxxx )

    • We’ll also need a dummy character, not yet used in your file, to identify specific lines. I chose the # symbol, but any other symbol may be used. Just escape it if this symbol is a special regex character !

    Two hypotheses :

    • I assumed that the lines of your sorted list are NOT preceded by blank characters, which could differ between two consecutive lines !

    • I assumed that you don’t care about the case of the text before the first tabulation character

    So, we start, for instance, with the sorted example text, below :

    xxxx	xxxx
    Garden	following	text
    garden	following	text
    Garden	following	text
    Garden	following	text
    Green 	following	text
    House	following	text
    House	following	text
    house	following	text
    street	following	text
    Street	following	text
    Wall	following	text
    

    As you can see, the lines, beginning with House, are located after those beginning with the word Green. Better for a sorted list, isn’t it ?


    The first regex S/R, below, adds a # symbol at the end of either any single line OR the last line of a group of lines beginning with the same word

    SEARCH (?i-s)^(.+?)\t.+\K\R(?!\1)

    REPLACE #$0

    NOTES :

    • The part (?i-s) means that the dot character, ., matches a single standard character only ( never an End of Line character ) and that the whole search is performed in a case-insensitive way !

    • Then, the part ^(.+?)\t represents, from the beginning of the line, the shortest range of standard characters followed by a tabulation character. This range is stored as group 1, due to the surrounding round brackets

    • The part .+ matches all the remaining standard characters of the line, after the first tabulation

    • The final part \R(?!\1) represents the End of Line character(s) of the current line, followed by a negative look-ahead, that is to say a condition which must be true for the regex engine to accept the overall match. So, the beginning of the next line must be different from the beginning of the current one ( \1 )

    • Finally, the syntax \K forces the regex engine to forget all the text matched before \K. So, this search regex just matches the End of Line character(s) of the current line, if the next line does NOT begin with the same string as the current one

    • So, in replacement, these End of Line character(s) ( the whole match, $0 ) are re-written, preceded by a # symbol

    And we obtain the changed text, below :

    xxxx	xxxx#
    Garden	following	text
    garden	following	text
    Garden	following	text
    Garden	following	text#
    Green 	following	text#
    House	following	text
    House	following	text
    house	following	text#
    street	following	text
    Street	following	text#
    Wall	following	text#
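
    For anyone who wants to try this first step outside Notepad++, here is a minimal Python sketch of the same idea. It is not the regex above: Python's re module supports neither \K nor \R, so the line is captured and written back, and \r?\n stands in for \R. The name mark_group_ends is only an illustration. One assumed deviation: the look-ahead also includes the tabulation, so that a key like Green is not considered equal to the beginning of Greenhouse.

```python
import re

# Python's re module has no \K or \R, so the line is captured and written back.
# Group 1: the whole line, group 2: the key before the first tab,
# group 3: the line break.
# Assumed deviation from the Notepad++ regex: the look-ahead includes the tab,
# so the key "Green" does not count as equal to the start of "Greenhouse".
LAST_OF_GROUP = re.compile(r"(?im)^(([^\t\r\n]+)\t[^\r\n]+)(\r?\n)(?!\2\t)")


def mark_group_ends(text, marker="#"):
    """Append `marker` to each line whose next line starts with a different key."""
    return LAST_OF_GROUP.sub(lambda m: m.group(1) + marker + m.group(3), text)


sample = (
    "xxxx\txxxx\n"
    "Garden\tfollowing\ttext\n"
    "garden\tfollowing\ttext\n"
    "Green \tfollowing\ttext\n"
    "House\tfollowing\ttext\n"
    "house\tfollowing\ttext\n"
)
marked = mark_group_ends(sample)
print(marked)
```

    Like the S/R, the sketch expects every line, including the last one, to end with a line break.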
    

    The second regex S/R, below, deletes every # symbol, as well as all the text before the first tabulation character, in every line whose previous line does NOT end with a # symbol

    SEARCH (?-s)#|[^#\r\n]\R\K.+?(?=\t)

    REPLACE EMPTY

    NOTES :

    • Refer above, for the (?-s) syntax

    • The first part of the alternation, #, matches a # symbol, at the end of a line

    • The second part of the alternation begins with [^#\r\n]\R, which looks for a last standard character, different from a # symbol, followed by the End of Line character(s)

    • Then the \K syntax, again, resets the start of the match to the beginning of the next line

    • Finally, the part .+?(?=\t) just matches the shortest range of characters which is followed by the first tabulation character of the next line

    • In replacement, either the # symbol OR all the characters before the first tabulation, when the previous line does NOT end with a # symbol, are simply deleted

    So, we get the final text :

    xxxx	xxxx
    Garden	following	text
    	following	text
    	following	text
    	following	text
    Green 	following	text
    House	following	text
    	following	text
    	following	text
    street	following	text
    	following	text
    Wall	following	text
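
    The second step can be sketched in Python in the same spirit. Again, this is an approximation and not the Notepad++ regex itself: since \K is not available, the last character and line break of the previous line are captured in group 1 and written back, and \r?\n replaces \R. The name strip_repeated_keys and the hard-coded # marker are assumptions made for the illustration.

```python
import re

# First branch: the marker itself, which is simply deleted.
# Second branch: a line that does not end with the marker (group 1 keeps its
# last character and line break), followed by the key of the next line.
# Only the key is dropped, because group 1 is written back.
STRIP = re.compile(r"#|([^#\r\n]\r?\n)[^\t\r\n]+(?=\t)")


def strip_repeated_keys(marked_text):
    """Remove every '#' and every key whose previous line has no '#' marker."""
    return STRIP.sub(lambda m: m.group(1) or "", marked_text)


marked_sample = (
    "xxxx\txxxx#\n"
    "Garden\tfollowing\ttext\n"
    "garden\tfollowing\ttext#\n"
    "Green \tfollowing\ttext#\n"
    "House\tfollowing\ttext\n"
    "house\tfollowing\ttext#\n"
)
cleaned = strip_repeated_keys(marked_sample)
print(cleaned)
```

    As with the S/R, any # already present in the text would be deleted too, hence the need for a marker that is not used in the file.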
    

    To end with, delete the dummy first line. Et voilà !

    IMPORTANT :

    As we use the \K syntax in the two S/R, you must click on the Replace All button, exclusively ! Don’t use the Replace button ( step by step replacement ) for these S/R !
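
    If you prefer to cross-check the result with a small script, the same target output can also be produced without any regex, by comparing each key with the key of the previous line. This is only a sketch under the same two hypotheses ( no leading blanks, case ignored ), and the name blank_repeated_keys is made up for the example. No dummy line or marker is needed here.

```python
def blank_repeated_keys(lines):
    """Drop the key (text before the first tab) when it repeats the previous key."""
    result = []
    previous_key = None
    for line in lines:
        key, tab, rest = line.partition("\t")
        # Compare case-insensitively; lines without a tab are left untouched.
        if tab and previous_key is not None and key.casefold() == previous_key:
            result.append(tab + rest)
        else:
            result.append(line)
        previous_key = key.casefold() if tab else None
    return result


example = [
    "House\tfollowing text",
    "House\tfollowing text",
    "Garden\tfollowing text",
    "garden\tfollowing text",
    "Green \tfollowing text",
]
for line in blank_repeated_keys(example):
    print(line)
```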

    Best Regards,

    guy038

