    Skip duplicate words at beginning of the line

    • kreien

      Hi,

      I have a huge list which is sorted in alphabetical order. Many consecutive lines begin with the same word, and these repeated words at the beginning of the line have to be removed. For example:
      Line1: House[Tab]following text
      Line2: House[Tab]following text
      Line3: House[Tab]following text
      Line4: Garden[Tab]following text
      Line5: Garden[Tab]following text
      Line6: Green [Tab]following text

      Target result:
      Line1: House[Tab]following text
      Line2: [Tab]following text
      Line3: [Tab]following text
      Line4: Garden[Tab]following text
      Line5: [Tab]following text
      Line6: Green [Tab]following text

      Any ideas how to manage it in Notepad++?

      Thank you very much in advance.
      Michael

      • guy038

        Hello kreien,

        I was pretty sure that your sorted list could be modified with a regex search/replace ( S/R ). Unfortunately, I was unable to do what you want in one go :-(( Luckily, with two successive S/R, it works fine !

        • First of all, add a dummy line, just before the first line of your list ( xxxx[TAB]xxxx )

        • We’ll also need a dummy character, not used yet, in your file, to identify specific lines. I chose the # symbol but any other symbol may be used. Just escape it if this symbol is a special regex character !

        Two assumptions :

        • I assume that the lines of your sorted list do NOT begin with blank characters, which could differ between two consecutive lines !

        • I assume that you don’t care about the case of the text before the first tabulation character

        So, we start, for instance, with the sorted example text, below :

        xxxx	xxxx
        Garden	following	text
        garden	following	text
        Garden	following	text
        Garden	following	text
        Green 	following	text
        House	following	text
        House	following	text
        house	following	text
        street	following	text
        Street	following	text
        Wall	following	text
        

        As you can see, the lines, beginning with House, are located after those beginning with the word Green. Better for a sorted list, isn’t it ?


        The first regex S/R, below, adds a # symbol at the end of either a single line ( one which does not belong to a group ) OR the last line of a group

        SEARCH (?i-s)^(.+?)\t.+\K\R(?!\1)

        REPLACE #$0

        NOTES :

        • The part (?i-s) makes the whole search case-insensitive ( i ) and prevents the dot character, ., from matching line-break characters ( -s ), so a dot matches a single standard character only !

        • Then, the part ^(.+?)\t represents, from beginning of line, the shortest range of standard characters, followed by a tabulation character. This range is stored as group 1, due to the surrounding round brackets

        • The part .+, matches all the remaining standard characters, of the line, after the first tabulation

        • The final part \R(?!\1) represents the End of Line character(s) of the current line, followed by a negative look-ahead, that is to say a condition which must be true for the regex engine to accept the overall match. Here, the next line must NOT begin with the text stored in group 1 ( \1 ), the beginning of the current line

        Finally, the syntax \K forces the regex engine to forget all text matched before \K. So, this search regex just matches the End of Line character(s) of the current line, if the next line does NOT begin with the same string as the current one

        • So, in replacement, these End of Line character(s) ( the whole regex $0 ) are re-written, preceded by a # symbol

        And we obtain the changed text, below :

        xxxx	xxxx#
        Garden	following	text
        garden	following	text
        Garden	following	text
        Garden	following	text#
        Green 	following	text#
        House	following	text
        House	following	text
        house	following	text#
        street	following	text
        Street	following	text#
        Wall	following	text#
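
For readers who want to check this first step outside Notepad++, here is a minimal Python sketch of the same marking pass. It is an adaptation, not the regex above: Python's re module has no \K or \R, so the line is captured and written back, and \r?\n stands in for \R. Two further changes are my own assumptions: the leading text is matched with [^\t\r\n]+ rather than .+?, and the look-ahead is (?!\2\t), so that backtracking cannot stretch the captured text past the first tab. The names MARK_GROUP_END and mark_group_ends are hypothetical.

```python
import re

# Sample list from the post, with the dummy first line and a final line break.
text = (
    "xxxx\txxxx\n"
    "Garden\tfollowing\ttext\n"
    "garden\tfollowing\ttext\n"
    "Garden\tfollowing\ttext\n"
    "Garden\tfollowing\ttext\n"
    "Green \tfollowing\ttext\n"
    "House\tfollowing\ttext\n"
    "House\tfollowing\ttext\n"
    "house\tfollowing\ttext\n"
    "street\tfollowing\ttext\n"
    "Street\tfollowing\ttext\n"
    "Wall\tfollowing\ttext\n"
)

# Group 1 = the whole line, group 2 = the text before the first tab,
# group 3 = the line break. The negative look-ahead rejects the match
# when the next line starts with the same leading text plus a tab.
MARK_GROUP_END = re.compile(
    r"^(([^\t\r\n]+)\t[^\r\n]+)(\r?\n)(?!\2\t)",
    re.IGNORECASE | re.MULTILINE,
)

def mark_group_ends(s: str, marker: str = "#") -> str:
    """Append `marker` to every single line and to the last line of each group."""
    return MARK_GROUP_END.sub(lambda m: m.group(1) + marker + m.group(3), s)

marked = mark_group_ends(text)
print(marked)
```

On the sample list this should mark the same six lines as the Notepad++ S/R. Like the original, it only marks a line that ends with a line break, so the last line of the list needs one.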
        

        The second regex S/R, below, deletes any # symbol, as well as the text before the first tabulation character, in every line whose previous line does NOT end with a # symbol

        SEARCH (?-s)#|[^#\r\n]\R\K.+?(?=\t)

        REPLACE EMPTY

        NOTES :

        • Refer above, for the (?-s) syntax

        • The first branch of the alternation ( the part before the | symbol ) matches a # symbol, at the end of a line

        • The second branch begins with [^#\r\n]\R, which looks for a last standard character, different from a # symbol, followed by the End of Line character(s)

        • Then the \K syntax, again, resets the regex engine search location, at the beginning of the next line

        • Finally, the part .+?(?=\t) just matches the shortest range of characters, which is followed by the first tabulation character, of the next line

        • In replacement, the match is simply deleted : either the # symbol OR all the characters before the first tabulation, when the previous line does NOT end with a # symbol

        So, we get the final text :

        xxxx	xxxx
        Garden	following	text
        	following	text
        	following	text
        	following	text
        Green 	following	text
        House	following	text
        	following	text
        	following	text
        street	following	text
        	following	text
        Wall	following	text
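
The second step can be checked outside Notepad++ in the same way. This is a hedged Python sketch, not the regex above: since Python's re module lacks \K and \R, the end of the previous line is captured in group 1 and written back, and \r?\n stands in for \R. The input is the marked text from the first S/R, typed in as a literal. The names STRIP_REPEATS and strip_repeats are hypothetical.

```python
import re

# Text as it looks after the first S/R ("#" closes every group).
marked = (
    "xxxx\txxxx#\n"
    "Garden\tfollowing\ttext\n"
    "garden\tfollowing\ttext\n"
    "Garden\tfollowing\ttext\n"
    "Garden\tfollowing\ttext#\n"
    "Green \tfollowing\ttext#\n"
    "House\tfollowing\ttext\n"
    "House\tfollowing\ttext\n"
    "house\tfollowing\ttext#\n"
    "street\tfollowing\ttext\n"
    "Street\tfollowing\ttext#\n"
    "Wall\tfollowing\ttext#\n"
)

# Branch 1: a "#" marker, which is deleted.
# Branch 2: the last character of a line that is NOT "#", its line break
# (both kept through group 1), then the next line's text before the first tab.
STRIP_REPEATS = re.compile(r"#|([^#\r\n]\r?\n)[^\t\r\n]+(?=\t)")

def strip_repeats(s: str) -> str:
    """Remove the markers and blank the repeated leading text."""
    # group(1) is None for branch 1, so the marker is replaced by nothing.
    return STRIP_REPEATS.sub(lambda m: m.group(1) or "", s)

final = strip_repeats(marked)
print(final)
```

As in the Notepad++ version, this assumes the # symbol appears nowhere else in the file.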
        

        To end with, delete the dummy first line. Et voilà !
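
As a cross-check on the expected result, the same output can also be produced outside Notepad++ in a single pass, with no dummy line and no marker symbol. The sketch below is a different technique from the two S/R above: plain Python that compares each line's leading text with that of the previous line, ignoring case. The function name blank_repeated_keys is hypothetical, and the input is the example from the question.

```python
def blank_repeated_keys(lines):
    """Drop the text before the first tab when it repeats the previous line's."""
    out = []
    prev_key = None
    for line in lines:
        key, tab, rest = line.partition("\t")
        if tab and prev_key is not None and key.casefold() == prev_key:
            out.append(tab + rest)  # keep only the tab and what follows it
        else:
            out.append(line)        # first line of a group, or a line without a tab
        # Remember the original leading text; lines without a tab reset the group.
        prev_key = key.casefold() if tab else None
    return out

sample = [
    "House\tfollowing text",
    "House\tfollowing text",
    "House\tfollowing text",
    "Garden\tfollowing text",
    "Garden\tfollowing text",
    "Green \tfollowing text",
]

result = blank_repeated_keys(sample)
for line in result:
    print(line)
```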

        IMPORTANT :

        As we use the \K syntax, in the two S/R, you must click on the Replace All button, exclusively ! Don’t use the Replace button, ( step by step replacement ) for these S/R !

        Best Regards,

        guy038

        The Community of users of the Notepad++ text editor.