    Erase content from duplicate lines, but keeping the first unchanged

    • Luís Gonçalves

      Hello. So, what I want to do is to turn this:

      31
      31
      31
      31
      32
      32
      32
      33
      33
      33
      33
      34
      35
      35
      35
      

      into this:

      31
      
      
      
      32
      
      
      33
      
      
      
      34
      35
      
      
      

      So I want to erase the content of every duplicate line after the first one, while keeping the (now empty) lines themselves. Only the first line of each group keeps its value: the content of all the others is erased.

      Thanks in advance!

      • Mark Olson @Luís Gonçalves

        @Luís-Gonçalves
        So it turns out that it’s actually much easier to remove all but the last occurrence of each number you found. Hopefully that is sufficient for your needs.

        1. Select the Mark tab on the find/replace form.
        2. Enter the regex (^\d+$)(?=.+?^\1$) into the Find what: field.
          • How this regex works:
          • Find a line containing only digits ((^\d+$)).
          • This line can only be matched if there is at least one line containing exactly the same digits later in the document ((?=.+?^\1$) is a lookahead)
        3. Make sure that Bookmark line, Purge for each search, Regular expression, and . matches newline are all checked.
        4. Hit the Mark all button. Now every line that has an identical line after it will be marked.
        5. Copy a single tab or space character to the clipboard.
        6. Select Search->Bookmark->Paste to (Replace) bookmarked lines from the main menu.
        7. If you want to clear all space from the empty lines, just use Edit->Blank Operations->Trim Trailing Space or use the find/replace form to replace ^[ \t]+$ with nothing.
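
For anyone who wants to check which lines this first method affects outside Notepad++, here is a minimal Python sketch of the same lookahead regex. It is only an illustration: it assumes LF (\n) line endings, the function name blank_all_but_last is made up for the example, and Python's re engine stands in for the Boost engine that Notepad++ uses.

```python
import re

# Python stand-in for (^\d+$)(?=.+?^\1$) with ". matches newline" checked.
# re.M lets ^ and $ anchor at line boundaries; re.S lets . cross line breaks.
DUP_WITH_LATER_COPY = re.compile(r"(^\d+$)(?=.+?^\1$)", re.M | re.S)


def blank_all_but_last(text: str) -> str:
    """Blank every digit-only line that has an identical line later on.

    Hypothetical helper: assumes the text uses "\n" line endings.
    """
    lines = text.split("\n")
    for match in DUP_WITH_LATER_COPY.finditer(text):
        # The line index is the number of line breaks before the match.
        lines[text.count("\n", 0, match.start())] = ""
    return "\n".join(lines)


if __name__ == "__main__":
    print(blank_all_but_last("31\n31\n32\n31\n33"))
```

Because the lookahead only looks forward, the copy that survives is the last one of each number, not the first.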

        I also wrote another regex to replace all but the first occurrence, but it’s much slower to execute (takes time proportional to N^2 log(N), where N is the max number of repeats) and requires you to hit the replace button several times.
        To do that:

        1. Use ^(\d+)$(.+?)^\1$ in the Find what box and \1\2 in the Replace with box on the Replace tab of the find/replace form. Make sure Regular expression and . matches newline are both checked (without the latter, (.+?) cannot span the lines between two copies).
        2. As noted above, you will have to keep hitting the Replace All button until the little indicator at the bottom says 0 things were replaced.
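
The repeated "replace until nothing changes" loop of this second method can be sketched in Python as follows. This is a rough illustration under assumptions, not the Notepad++ behaviour itself: LF line endings are assumed, blank_all_but_first is a hypothetical name, and re.S plays the role of ". matches newline".

```python
import re

# Python stand-in for ^(\d+)$(.+?)^\1$ with ". matches newline" checked.
PAIR = re.compile(r"^(\d+)$(.+?)^\1$", re.M | re.S)


def blank_all_but_first(text: str) -> str:
    """Blank every repeat of a digit-only line, keeping the first copy.

    Hypothetical helper: assumes the text uses "\n" line endings.
    """
    while True:
        # One pass corresponds to one press of the replace button:
        # each match keeps the first copy (\1) and the text in between (\2)
        # and drops the later copy, leaving its line empty.
        text, replaced = PAIR.subn(r"\1\2", text)
        if replaced == 0:
            return text


if __name__ == "__main__":
    print(blank_all_but_first("31\n31\n31\n31\n32\n32"))
```

Each pass only removes one later copy per match, which is why several passes are needed when a number repeats many times.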
        • Mark Olson @Mark Olson

          @Mark-Olson
          If you want to find and replace identical lines (not just numbers like in this toy example), just replace ^\d+$ with ^regex-that-matches-an-entire-line$ wherever you saw me write ^\d+$.
          For example:

          • ^[abc]{3,5}$ would match a line containing any combination of the letters a, b, and c with total length 3 to 5.
          • ^[^\r\n]*$ would match any line (even an empty line)
          • Alan Kilborn @Mark Olson

            @Mark-Olson said:

            I also wrote another regex to replace all but the first occurrence, but it’s much slower to execute

            I’m glad you provided this, even if it is slower, because the first part of your solution alone wouldn’t have solved the problem: it doesn’t give the OP what they wanted.

            I presume they have good reason for wanting the replace output the way they specified!

            • PeterJones @Mark Olson

              @Mark-Olson’s second method could get tedious if there are 50 duplicate lines in a row instead of just 3-5 in a row.

              I’d do it in multiple steps:

              1. FIND WHAT: (?-s)(^\d+$)(\R\1$)*
                REPLACE WITH: ☺$0
                SEARCH MODE = Regular Expression
                REPLACE ALL
                • i.e., look for a line (in this case, all digits) that has 0 or more copies immediately following, and prefix it with a smiley
              2. FIND WHAT: (?-s)(^\d+$)
                REPLACE WITH: <nothing/empty field>
                SEARCH MODE = Regular Expression
                REPLACE ALL
                • any line that didn’t get transformed, but matches the “all digits” requirement, must’ve been a duplicate, so it should be cleared
              3. FIND WHAT: ^☺(?=\d+$)
                REPLACE WITH: <nothing/empty field>
                SEARCH MODE = Regular Expression
                REPLACE ALL
                • any line that did get transformed should have the smiley removed

              (Like Mark’s attempts, mine assumes the lines you want to transform are just one or more digits each, with no spaces or non-digit characters either before or after.)
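
The three Replace All steps above can be sketched as three substitutions in Python. This is an illustration under assumptions rather than an exact port: Python's re has no \R, so LF line endings are assumed, and blank_consecutive_duplicates is a made-up name. The sketch anchors the back-reference with $ so that a line such as 31 directly after a line 3 is not treated as a repeat.

```python
import re


def blank_consecutive_duplicates(text: str) -> str:
    """Blank consecutive repeats of digit-only lines, keeping the first.

    Hypothetical helper: assumes "\n" line endings and that the marker
    character does not already occur in the text.
    """
    # Step 1: prefix the first line of every run of identical lines.
    text = re.sub(r"(^\d+$)(?:\n\1$)*", r"☺\g<0>", text, flags=re.M)
    # Step 2: any digit-only line without the marker is a repeat; clear it.
    text = re.sub(r"^\d+$", "", text, flags=re.M)
    # Step 3: remove the marker from the lines that were kept.
    return re.sub(r"^☺(?=\d+$)", "", text, flags=re.M)


if __name__ == "__main__":
    print(blank_consecutive_duplicates("31\n31\n31\n32\n33\n33"))
```

As with the Notepad++ version, only repeats that directly follow each other are handled; a number that reappears after other lines starts a new run and is kept.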

              • Mark Olson @PeterJones

                @PeterJones
                This approach is much better than mine in the case where all the duplicate lines are consecutive (that is, there are no numbers other than 11 between the first occurrence of 11 and the last occurrence of 11).
                While my approach is far worse for this specific use case, it does not have this limitation.

                • PeterJones @Mark Olson

                  @Mark-Olson ,

                  You are right. When I looked at the OP data, it only had consecutive duplicates. If it has to handle duplicates with other lines in between, then mine is not sufficient. The OP doesn’t state whether or not all the duplicates are consecutive, so we’re both working from a reasonable but different assumption/interpretation of the example data.

                  • Luís Gonçalves @Mark Olson

                    @Mark-Olson your solution worked perfectly, and it did exactly what I wanted. Thank you very much! =)

                    Thanks to all the other people who gave their help as well.
                    You’re the best!

                    • guy038

                      Hello, @luís-gonçalves, @mark-olson, @alan-kilborn, @peterjones and All,

                      Here is a quick way to mark all consecutive equal lines but the first !

                      • First, add a final line-break at the end of your list of numbers ! ( IMPORTANT )

                      • MARK (?x) ^ ( \d+ \R ) \K ( \1 )+

                        • Bookmark line, Purge for each search and Regular expression checked

                      Then, you can follow @mark-olson’s instructions ! So :

                      • Put a single space char in the clipboard with Ctrl + C

                      • Run the Search > Bookmark > Paste to (Replace) Bookmarked Lines option

                      • Finally, run the simple S/R :

                        • SEARCH ^\x20$

                        • REPLACE Leave EMPTY

                      Or use the Edit > Blank Operations > Trim Trailing Space option
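
As a rough cross-check of this approach, here is a small Python sketch that does the marking and blanking in one substitution. It is only an approximation: Python's re has neither \K nor \R, so the first line is captured and written back instead, LF line endings are assumed, and blank_repeats_after_first is a hypothetical name.

```python
import re


def blank_repeats_after_first(text: str) -> str:
    """Blank consecutive repeats of a digit-only line, keeping the first.

    Hypothetical helper: assumes "\n" line endings. Like the regex above,
    it compares whole lines including their line break, so the last line
    of the text needs a final line break to be recognised as a repeat.
    """

    def keep_first(match: re.Match) -> str:
        first, repeats = match.group(1), match.group(2)
        # Write back the first line, then one empty line per repeat.
        return first + "\n" * (len(repeats) // len(first))

    return re.sub(r"^(\d+\n)((?:\1)+)", keep_first, text, flags=re.M)


if __name__ == "__main__":
    print(blank_repeats_after_first("31\n31\n31\n32\n"))
```

Calling it on text whose last line has no line break leaves that last repeat untouched, which is the reason for the IMPORTANT note above.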

                      Best Regards

                      guy038

                      The Community of users of the Notepad++ text editor.