    How to remove ALL the duplicates WITHOUT deleting the empty lines?

    11 Posts 6 Posters 17.4k Views
    • Marcos Miguel

      I’ve tried several formulas to remove duplicates and they either change the total number of lines in the file or don’t remove all the duplicates.

      Does anyone know a formula that removes ALL the duplicates WITHOUT deleting the empty lines?

      THANK YOU.

      • Thomas Knoefel @Marcos Miguel

        @Marcos-Miguel
        Can you provide an example of what your file structure looks like and how it should look at the end? This will make it easier to understand.

        • Alan Kilborn

          I’d guess that OP wants to not remove duplicate lines, but rather erase their contents, e.g. make them have nothing but a line-ending on them?

          But really, yea, OP needs to provide more detail on the needed task.

          • Marcos Miguel @Thomas Knoefel

            @Thomas-Knoefel

            let’s say I have the following sequence
            1
            1
            1
            2
            2
            3

            If I run the following formula to remove the duplicates

            (.+)$\R+\K\1

            then this is what I get

            1

            1
            2

            3
            1

            What I need is a formula that keeps just the first number 1 and deletes all the others.

                • Alan Kilborn @Marcos Miguel

                PLEASE just try to ask a decent question.

                Supposing your BEFORE data is:

                1
                1
                1
                2
                2
                3
                

                and your AFTER data that you want looks like this:

                1
                
                
                2
                
                3
                

                ???

                  • Marcos Miguel @Alan Kilborn

                  @Alan-Kilborn Yes and thank you for your kind words. Very polite of you.

                    • Alan Kilborn @Marcos Miguel

                    @Marcos-Miguel said in How to remove ALL the duplicates WITHOUT deleting the empty lines?:

                    Very polite of you.

                    I was polite in my first response.
                    I probably should have redirected you to the FAQ section (which you should have read anyway before posting on any site new to you), where there are detailed instructions on how to ask a question such as yours.

                    And someone else was polite to you, but you ignored their request (“Can you provide an example of what your file structure looks like and how it should look at the end?”)

                      • Thomas Knoefel @Marcos Miguel

                      @Marcos-Miguel said in How to remove ALL the duplicates WITHOUT deleting the empty lines?:

                      What I need is a formula that keeps just the first number 1 and deletes all the others.

                      Hi Miguel, I don’t have a regex solution that is flexible enough to delete all the following duplicates. But maybe other regex experts do.

                      But if you are able to install plugins, you can do it with the MultiReplace plugin:

                      1. Activate Checkbox “Use Variables”

                      2. Add into the list ->

                      Find what: 1
                      Replace with: cond(CNT > 1, '')

                      Find what: 2
                      Replace with: cond(CNT > 1, '')

                      Find what: 3
                      Replace with: cond(CNT > 1, '')

                      3. Start Replace All with the “Use List” checkbox activated

                      PS: The next MultiReplace version 2.2.0.9 is in development, which will offer more flexibility in “Use Variables”, even for automatic duplicate detection. This solution can then be described within a single-line expression, apart from the expected duplicate values.
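
For readers who want to see what the list entries above amount to, here is a minimal Python sketch of the same idea. It assumes that CNT is the running count of matches for the current entry and that each value occupies a whole line. `blank_repeats` and its arguments are illustrative names, not part of MultiReplace.

```python
import re
from collections import Counter


def blank_repeats(text, values):
    """Empty every whole-line occurrence of each value after its first one."""
    counts = Counter()

    def replace(match):
        value = match.group(0)
        counts[value] += 1
        # Mirrors cond(CNT > 1, ''): keep the first hit, empty later ones.
        return value if counts[value] == 1 else ""

    # Assumption: values are matched as complete lines, not as substrings.
    alternatives = "|".join(re.escape(v) for v in values)
    pattern = rf"^(?:{alternatives})$"
    return re.sub(pattern, replace, text, flags=re.MULTILINE)
```

Calling `blank_repeats("1\n1\n1\n2\n2\n3\n", ["1", "2", "3"])` keeps the first `1`, `2` and `3` and leaves the other lines empty, so the line count does not change.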

                        • guy038

                        Hello, @marcos-miguel, @thomas-knoefel, @alan-kilborn, @coises and All,

                        Let’s use a real example ! So, we start with this INPUT text, containing a list of English/American first-names :

                        Ted
                        Mary
                        Alice
                        John
                        Mary
                        Alice
                        Peter
                        Alice
                        John
                        Andrew
                        Ted
                        Mary
                        John
                        Ted
                        Elisabeth
                        Andrew
                        Peter
                        Peter
                        Susan
                        

                        And, @marcos-miguel, you would like this expected OUTPUT text, wouldn’t you ?

                        Ted
                        Mary
                        Alice
                        John
                        
                        
                        Peter
                        
                        
                        Andrew
                        
                        
                        
                        
                        Elisabeth
                        
                        
                        
                        Susan
                        

                        If so, here are the steps for a regex solution, which does NOT need any plugin and does NOT need previously sorted data either!

                        • Open your file within N++

                        • Open the Replace dialog ( Ctrl + H )

                        • First, untick all box options

                        • SEARCH (?-is)^((.+)\R(?s:.+?))^\2

                        • REPLACE \1

                        • Check the Wrap around option ( IMPORTANT )

                        • Select the Regular expression search mode

                        • Click, repeatedly, on the Replace All button OR on the Replace button, until you get :

                        The message Replace: no occurrence was found OR the message Replace All: 0 occurrences were replaced in entire file
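
The repeated Replace All of this first solution can be tried outside Notepad++ with a rough Python sketch. Python's `re` module has no `\R`, so the sketch assumes `\n` line endings. It uses `.*?` so that a duplicate directly below its original is also caught, and a trailing `$` so that only whole lines count as duplicates. `blank_duplicates` is an illustrative name.

```python
import re

# A line, anything up to a later line with the same text, then that later line.
# Assumption: lines end with "\n" (Python's re has no \R).
DUPLICATE = re.compile(r"^((.+)\n(?s:.*?))^\2$", re.MULTILINE)


def blank_duplicates(text):
    """Repeat the replacement until no line repeats an earlier one."""
    while True:
        # Keep group 1 (everything before the later copy) and drop the copy.
        text, replaced = DUPLICATE.subn(r"\1", text)
        if replaced == 0:
            return text
```

Each pass clears at most one later copy per matched span, which is why the loop runs until a pass replaces nothing, just like clicking Replace All repeatedly.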


                        A second and equivalent solution would be :

                        • Open your file within N++

                        • Open the Replace dialog ( Ctrl + H )

                        • First, untick all box options

                        • SEARCH (?-is)^(.+)\R(?s:.+?)\K^\1

                        • REPLACE Leave EMPTY

                        • Check the Wrap around option ( IMPORTANT )

                        • Select the Regular expression search mode

                        • Click repeatedly on the Replace All button ( Do NOT use the Replace button ), until you get the message Replace All: 0 occurrences were replaced in entire file


                        Notes :

                        • If you use the first solution on a file containing a huge number of lines, where two duplicate lines may be separated by a great many other lines, the search regex may not work properly. In this case, I advise you to prefer the second solution, which should work nicely in most cases!

                        • When using the second solution, remember to NEVER use the Replace button but ONLY the Replace All one, because of the \K syntax in the search regex !
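
If the file is so large that repeated regex passes become impractical, the same result can be computed in a single pass outside Notepad++. This is a minimal Python sketch, not something from the regex solutions above: it assumes `\n` line endings, and `blank_repeated_lines` is an illustrative name.

```python
def blank_repeated_lines(text):
    """Empty every non-empty line that already appeared earlier in the text."""
    seen = set()
    result = []
    for line in text.split("\n"):
        if line and line in seen:
            # A later copy: keep the line itself but remove its contents.
            result.append("")
        else:
            seen.add(line)
            result.append(line)
    # Joining on "\n" again keeps the number of lines unchanged.
    return "\n".join(result)
```

Because every line is looked up in a set exactly once, the work grows linearly with the number of lines, however far apart the duplicates are.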

                        Best Regards

                        guy038

                          • Mark Olson

                          @guy038 proposed a very simple solution to this, but I have a multi-step solution using no plugins, without having to perform the same find-replace multiple times in a row:

                          Make sure Regular expressions are ON in the find/replace form before starting this.

                          The document starts looking like this:

                          foo
                          bar
                          
                          baz
                          foo
                          foo
                          
                          bar
                          bar
                          baz
                          quz
                          
                          foo
                          
                          1. Use the find/replace form to replace ^ with \x07. This adds a BEL character (convenient because it does not show up naturally in most text documents) to the beginning of each line.
                          2. Make a column selection in the first column of every line of the file, then use the column editor to insert a number (with leading zeros to facilitate sorting). This numbers the rows, so they can later be put back in order.
                          3. Use the find/replace form to replace (?-s)^(\d+)(\x07)(.*) with ${3}${2}${1}. This puts the line numbers after everything else.
                          4. Use the menu command Edit->Line Operations->Sort Lines Lex. Ascending Ignoring Case. Now all the lines with the same text are grouped together, and a single regex-replace can get rid of all but the first line with given text.
                          5. Find/replace (?-s)^(.+)\x07\d+(?:\R\1\x07\d+)* with ~~~~${0}. This marks the first instance of each line, so that it won’t be deleted later. Note that the ~~~~ in this example should be replaced with some other text that occurs nowhere in your document.
                          6. Find/replace (?-s)^(?!~~~~).*(\x07\d+) with ${1}. This clears the text but not the line number of any line that does not have the first instance of its text.
                          7. Find/replace ^~~~~ with nothing. This removes the starting marker.
                          8. Find/replace (?-s)^(.*)(\x07)(\d+)$ with ${3}${2}${1}. This brings the line numbers back to the front so that the lines can go back in order.
                          9. Use the menu command Edit->Line Operations->Sort Lines Lex. Ascending Ignoring Case. Now all the lines are back in their original order.
                          10. Find/replace ^\d+\x07 with nothing. This removes the line numbers and separator.

                          Finally, you are left with the original document with the non-first instances of each line’s text replaced with nothing! A lot of steps, but every step is highly scalable and won’t exhibit bad performance on very large files.

                          foo
                          bar
                          
                          baz
                          
                          
                          
                          
                          
                          
                          quz
                          
                          
                          
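
The ten steps above follow a number, sort, clear, sort back pattern, which can be sketched in Python as follows. This is only an illustration of the idea, not a replacement for the Notepad++ steps: `blank_repeats_by_sorting` is a made-up name, and the sort here is case-sensitive, unlike the "Ignoring Case" menu command.

```python
def blank_repeats_by_sorting(lines):
    """Clear every line whose text already occurred on an earlier line."""
    # Steps 1-3: pair each line's text with its original row number.
    tagged = [(text, row) for row, text in enumerate(lines)]
    # Step 4: sort so equal texts sit together, earliest row first.
    tagged.sort()
    # Steps 5-7: keep the first line of each group and clear the others.
    cleared = []
    previous = None
    for text, row in tagged:
        cleared.append((row, "" if text == previous else text))
        previous = text
    # Steps 8-10: sort by row number to restore the order, then drop it.
    cleared.sort()
    return [text for _, text in cleared]
```

With the sample document above, `blank_repeats_by_sorting(["foo", "bar", "", "baz", "foo", "foo", "", "bar", "bar", "baz", "quz", "", "foo"])` keeps only the first `foo`, `bar`, `baz` and `quz`, and returns the same number of lines it was given.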
                          The Community of users of the Notepad++ text editor.