    Remove duplicate lines not possible?

    • Alan KilbornA
      Alan Kilborn @Cletos
      last edited by

      @Cletos said in Remove duplicate lines not possible?:

      I tried to do it with this option in “Line Operations” but the duplicate lines are not removed

      Click HERE.

      Are the duplicates you are intending to remove on adjacent lines, or are they spread throughout the file in different places?

      Really probably best for you to show some data.

      • CletosC
        Cletos
        last edited by Cletos

        Images need to be inserted like this

        Thank you!

        [QUOTE]Are the duplicates you are intending to remove on adjacent lines, or are they spread throughout the file in different places?[/QUOTE]
        They are spread, not adjacent. Might that be the issue?

        Thank you for the link.

        [QUOTE]Really probably best for you to show some data.[/QUOTE]
        Here is an excerpt. These lines, and many more, are in a single document:

        -5ACEvl6GUewBvGL5g9ODQuVRQH5-QMh_1-QKKwuVHE.rar
        -5FQ3vhHpAlcys-0Kv1p2DYF4PHegpa9Ti2t7AtSjRI.rar
        -7FpdTFaWB_zzli9CFTMZhWM87NYCtljcLMc0dAmZJ0.rar
        -7hFndR54-pcmca80NdNcWq4YDV0uC9LIBHX4A7we7E.rar
        -bEvyhRuLhu3GEPA6sonovXx4hmBJ2txe-H8suqhlHg.rar
        -bVwECLahX-YpGVwVkWbhQB9p8lWlaIXOC5R00PAXE8.rar
        -d8tYCQypGR10_Qu-_uWa5Nheq0JFFD_8AHGrhaPkyQ.rar
        -EDKsDUs7SyOiaT-w0BL-BKTF_gu7Oy2HdSsACkZmrY.rar
        -eQIr4OXvJ-tLwdXy7ZmpsAmkvjjefc0P4KKCcw7opA.rar
        -eQnBbfIq38_pCcytxS45AB4q-2YE1hYFgIYiSC4Fyo.rar
        -FLAC-RAVANAN-2010-ORGINAL-ACD-RIP-DOLBY-DIGITAL-HIGH-DEFINITION-AUDIO.rar
        -fYPQF0hZqwkQNtqHP-H8igCHc-CMYzxtbDvF7btANU.rar
        -KZZU1K8lA-NLvgpdF6bn6fAV8tDhQ8-PTZhap3f69k.rar
        -lG2INJS9kD-i4FvVMhKP1VEXGq1rfvwCtqA5ibLhqI.rar
        -Mirror_Sister(www.mp3vip.org).mp3
        -myd-22.rar
        -myLoO6I9MqIBqbJzcHgxE2-8_bHH6yTmhoNjM1ke_M.rar
        -oeuErdiTYGG7Oj9AhGn4DJi9Sr4zqILTZJIrChFQ6M.rar
        -Original Funk Soul Sister - The Best Of Ann Peebles.zip
        -Rahsaan_Patterson-After_Hours-Retail-2004-WHOA.zip
        -Tell_Me_how_it_feels_extended_12_inch(myfreemp3.eu).mp3
        -U7IRxMzcv6WorFna2j-oNrUExWug0MMK5wmg0f7Nr4.rar
        -x4gnCvNCIsAWfQu5otO7AOWuI3kBLmDJ2tIruxjJGQ.rar

        In between are thousands of such lines. The lines below are duplicates, but they are not removed:

        -5ACEvl6GUewBvGL5g9ODQuVRQH5-QMh_1-QKKwuVHE.rar
        -5FQ3vhHpAlcys-0Kv1p2DYF4PHegpa9Ti2t7AtSjRI.rar
        -7FpdTFaWB_zzli9CFTMZhWM87NYCtljcLMc0dAmZJ0.rar
        -7hFndR54-pcmca80NdNcWq4YDV0uC9LIBHX4A7we7E.rar
        -bEvyhRuLhu3GEPA6sonovXx4hmBJ2txe-H8suqhlHg.rar
        -bVwECLahX-YpGVwVkWbhQB9p8lWlaIXOC5R00PAXE8.rar
        -d8tYCQypGR10_Qu-_uWa5Nheq0JFFD_8AHGrhaPkyQ.rar
        -EDKsDUs7SyOiaT-w0BL-BKTF_gu7Oy2HdSsACkZmrY.rar
        -eQIr4OXvJ-tLwdXy7ZmpsAmkvjjefc0P4KKCcw7opA.rar
        -eQnBbfIq38_pCcytxS45AB4q-2YE1hYFgIYiSC4Fyo.rar
        -FLAC-RAVANAN-2010-ORGINAL-ACD-RIP-DOLBY-DIGITAL-HIGH-DEFINITION-AUDIO.rar
        -fYPQF0hZqwkQNtqHP-H8igCHc-CMYzxtbDvF7btANU.rar
        -KZZU1K8lA-NLvgpdF6bn6fAV8tDhQ8-PTZhap3f69k.rar
        -lG2INJS9kD-i4FvVMhKP1VEXGq1rfvwCtqA5ibLhqI.rar

        • Alan KilbornA
          Alan Kilborn @Cletos
          last edited by Alan Kilborn

          @Cletos said in Remove duplicate lines not possible?:

          They are spread, not adjacent. Might that be the issue?

          Well, it’s the issue if you are attempting to remove duplicates with a command called “Remove Consecutive Duplicate Lines”.

          It means that text like this:

          aaa
          aaa
          bbb
          bbb
          bbb
          ccc
          ccc
          ddd
          

          will be transformed to:

          aaa
          bbb
          ccc
          ddd
          

          but text like this:

          aaa
          bbb
          ccc
          aaa
          bbb
          ccc
          ddd
          

          will remain unaltered.
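
To make the distinction concrete, here is a minimal sketch in Python (not Notepad++ itself) of what a consecutive-duplicate pass does, assuming lines are compared exactly. The function name `remove_consecutive_duplicates` is made up for illustration.

```python
from itertools import groupby


def remove_consecutive_duplicates(lines):
    """Collapse each run of identical adjacent lines into a single line."""
    # groupby starts a new group whenever the value changes, so only
    # neighbouring duplicates are merged; spread duplicates are untouched.
    return [key for key, _run in groupby(lines)]


adjacent = ["aaa", "aaa", "bbb", "bbb", "bbb", "ccc", "ccc", "ddd"]
spread = ["aaa", "bbb", "ccc", "aaa", "bbb", "ccc", "ddd"]

print(remove_consecutive_duplicates(adjacent))          # ['aaa', 'bbb', 'ccc', 'ddd']
print(remove_consecutive_duplicates(spread) == spread)  # True
```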

          • CletosC
            Cletos
            last edited by

            So how could I remove spread duplicate lines?

            And what sense does “Remove Consecutive Duplicate Lines” have?

            • Alan KilbornA
              Alan Kilborn @Cletos
              last edited by

              @Cletos said in Remove duplicate lines not possible?:

              So how could I remove spread duplicate lines?

              It is a tough problem to solve with Notepad++ alone…sometimes the techniques to do it work, sometimes they don’t. It is data-dependent.

              And what sense does “Remove Consecutive Duplicate Lines” have?

              Well, if you don’t mind sorting your data as a first-step, the duplicates will get grouped together and then you can use that command to, well, remove consecutive duplicate lines. But sometimes data loses some of its meaning if you sort it, so this technique is not always applicable.
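
The sort-first approach can be sketched in Python as follows. This is only an illustration under the assumption of plain string ordering, which may differ from the editor's sort; `sort_then_dedupe` is a hypothetical name.

```python
from itertools import groupby


def sort_then_dedupe(lines):
    """Sort so duplicates become adjacent, then drop adjacent duplicates."""
    # Sorting groups equal lines together; the original order is lost.
    return [key for key, _run in groupby(sorted(lines))]


spread = ["ccc", "aaa", "bbb", "aaa", "ccc", "ddd"]
print(sort_then_dedupe(spread))  # ['aaa', 'bbb', 'ccc', 'ddd']
```

Note that the output no longer starts with "ccc": this is the loss of meaning that sorting can cause.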

              • CletosC
                Cletos
                last edited by

                @Alan-Kilborn said in Remove duplicate lines not possible?:

                It is a tough problem to solve with Notepad++ alone

                There once was such an option to remove spread duplicates, if I remember it right.

                sometimes the techniques to do it work, sometimes they don’t. It is data-dependent

                Yes, I understand. But why replace an option doing both (spread and following lines) with one doing only one of them?

                So it is not possible with Notepad++ at the moment.

                Many thanks!

                • Alan KilbornA
                  Alan Kilborn @Cletos
                  last edited by

                  @Cletos said in Remove duplicate lines not possible?:

                  There once was such an option to remove spread duplicates, if I remember it right.

                  No, only a way to do it via regular expressions discussed here on the Community – that’s probably what you remember.

                  So it is not possible with Notepad++ at the moment.

                  Well, you can try it with the regular expression technique; search the Community site and you’ll rediscover the links with instructions.

                  • CletosC
                    Cletos
                    last edited by

                    Alright, thank you very much!

                    • guy038G
                      guy038
                      last edited by guy038

                      Hi, @cletos, @alan-kilborn and All,

                      Alan, as you know, I’ve certainly answered this question many times! But I’m a bit lazy and, instead of finding the different links for the OP, I prefer to “re-invent the wheel” ;-))

                      So @cletos, here is the magic regular expression S/R, which deletes all duplicate lines without changing the order of the remaining lines:

                      • SEARCH (?-s)^(.+\R)(?=(?s:.*)^\1)

                      • REPLACE Leave EMPTY

                      • Tick the Match case option if you want case-sensitive matching

                      • Tick the Wrap around option, preferably

                      • Select the Regular expression search mode

                      • Click on the Replace All button ( or use the “step by step”  Replace button to verify how the regex works ! )

                      Remark :

                      Let’s suppose that your initial text is :

                      aaa
                      bbb
                      ccc
                      ddd
                      bbb
                      bbb
                      eee
                      fff
                      bbb
                      ggg
                      bbb
                      hhh
                      iii
                      

                      Then this regex S/R will delete :

                      • The bbb line between lines aaa and ccc
                      • The bbb line between lines ddd and bbb
                      • The bbb line between lines bbb and eee
                      • The bbb line between lines fff and ggg

                      And it keeps only the bbb line located between lines ggg and hhh.

                      So, to sum up, this regex S/R keeps only the last occurrence of each duplicated line in the input text!

                      So your final text becomes:

                      aaa
                      ccc
                      ddd
                      eee
                      fff
                      ggg
                      bbb
                      hhh
                      iii
                      

                      I cannot get another layout with a correct regex S/R! (For instance, keeping the bbb line between lines aaa and ccc and deleting all subsequent bbb lines.) Sorry for this limitation!
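
For readers who want to try the same idea outside the editor, here is a rough translation to Python's `re` module. It is a sketch, not the Notepad++ (Boost) behaviour: Python's `re` has no `\R`, so the pattern assumes `\n` line endings, and matching is case-sensitive. The name `KEEP_LAST` is made up for illustration.

```python
import re

# (?m) makes ^ match at the start of every line.
# (.+\n) captures one whole line including its line break.
# (?=(?s:.*)^\1) requires an identical whole line somewhere later;
# (?s:.*) lets only that part of the lookahead span several lines.
KEEP_LAST = re.compile(r"(?m)^(.+\n)(?=(?s:.*)^\1)")

text = "aaa\nbbb\nccc\nddd\nbbb\nbbb\neee\nfff\nbbb\nggg\nbbb\nhhh\niii\n"
result = KEEP_LAST.sub("", text)

print(result.split())
# ['aaa', 'ccc', 'ddd', 'eee', 'fff', 'ggg', 'bbb', 'hhh', 'iii']
```

Every bbb line that still has an identical line after it is removed, so only the last one survives, as in the example above.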


                      IMPORTANT :

                      • The last line of your list must always be followed by a line-break

                      • Be aware that the behaviour of this regex S/R is rather weird! It works nicely on small or medium-sized texts. But:

                        • If your file is large, more than about 10 MB, even if it contains no duplicate lines, OR

                        • If 2 duplicate lines are separated by, let’s say, more than 10,000 lines

                      this S/R may go completely wrong and produce one extra match that covers the entire file contents :-(( It mainly depends on our Boost regex engine and, probably, on the amount of your system memory!

                      As always, give it a try with your real files to see how this regex S/R behaves!


                      Two possible solutions, if any problem occurs :

                      • Use the Replace button repeatedly (or the Alt + R shortcut) and stop when a particular replacement wrongly wipes out all the file contents!

                      • Split your text into smaller parts and run this regex S/R on each part first. Then merge all the pieces and run the regex S/R again on the whole set!
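
If the file is too large for the regex to behave, one alternative is to do the same "keep the last occurrence" pass outside Notepad++. The following Python sketch swaps the regex for a set-based scan (a different technique from the one in this post); `keep_last_occurrence` is a hypothetical name, and lines are compared exactly.

```python
def keep_last_occurrence(lines):
    """Remove duplicate lines, keeping only the last occurrence of each."""
    seen = set()
    kept_reversed = []
    # Walk from the end: the first time a line is met here is its last
    # occurrence in the original order.
    for line in reversed(lines):
        if line not in seen:
            seen.add(line)
            kept_reversed.append(line)
    kept_reversed.reverse()
    return kept_reversed


sample = "aaa bbb ccc ddd bbb bbb eee fff bbb ggg bbb hhh iii".split()
print(keep_last_occurrence(sample))
# ['aaa', 'ccc', 'ddd', 'eee', 'fff', 'ggg', 'bbb', 'hhh', 'iii']
```

Each line is looked at once, so the distance between two duplicates does not matter here.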

                      Best Regards,

                      guy038

                      • CletosC
                        Cletos
                        last edited by

                        Hello guy038,

                        Thank you very much!

                        I cannot get an other layout, with a correct regex S/R ! ( For instance, keeping the line bbb between lines aaa and ccc and deleting all subsequent bbb lines ) Sorry for this limitation !
                        No, no, it works great!

                        The last line of your list must always be followed with a line-break

                        So one has to just press ENTER at the end of that last line in the txt file.

                        If your file has a big size, over 10 Mb about, even not concerned with duplicates lines, OR

                        So I could try splitting the processing on the first half of the txt file and the last half or even smaller and hope there are many lines removed and the file gets smaller.

                        Be aware that the behaviour of this regex S/R is rather weird ! It works nice with small or middle-size text to process. But :

                        Works great after some testing.

                        Two possible solutions, if any problem occurs :

                        Use, the Replace button repeatedly ( or the Alt + R shortcut ) and stop when a particular replacement wipe out, wrongly, all file contents !
                        
                        Split your text in smaller parts, processing this regex S/R on each part, first. Then, merge all the pieces and process, again, the regex S/R on the whole set !
                        

                        I will try it like that.

                        Thank you very much, again!

                        • SofistanppS
                          Sofistanpp @guy038
                          last edited by

                          @guy038 said in Remove duplicate lines not possible?:

                          I cannot get an other layout, with a correct regex S/R ! ( For instance, keeping the line bbb between lines aaa and ccc and deleting all subsequent bbb lines ) Sorry for this limitation !

                          Hi guy038, Cletos, All:

                          Not a regex solution, but if you reverse the list —for example, by means of the Reverse Lines plugin— and run the nice regex you provided, you will get the first “bbb” with all duplicates being deleted. Once you are finished, reverse the list again to get the original order of lines.

                          Hope you find this, my first post here, useful.

                          Best Regards.

                          • CletosC
                            Cletos
                            last edited by

                            Hello Sofistanpp,

                            OK, sounds very good! Many thanks!

                            • SofistanppS
                              Sofistanpp @Cletos
                              last edited by

                              @Cletos Glad to be of help.

                              • Alan KilbornA
                                Alan Kilborn @Sofistanpp
                                last edited by

                                @Sofistanpp

                                Maybe explain how reversing the lines helps?

                                • SofistanppS
                                  Sofistanpp @Alan Kilborn
                                  last edited by

                                  @Alan-Kilborn Sure. It is meant to overcome a limitation pointed out by guy038, who wrote that the regex he posted removes all the duplicates except the last one, whereas it seems he wanted to keep the first one. So if you reverse the order of the lines and run the regex, you will remove, of course, all the instances except the last duplicate. Now reverse the list back to the original order and you will have actually kept the first instance of the line (the “bbb” between “aaa” and “ccc” in the example).
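
The reverse, dedupe, reverse idea can be sketched in Python like this. It reuses a `re` translation of the regex from this thread (assuming `\n` line endings, since Python's `re` has no `\R`) and assumes every line, including the last, ends with a line break. The names `KEEP_LAST`, `reverse_lines` and `keep_first` are made up for illustration.

```python
import re

# Keeps only the last occurrence of each duplicated line.
KEEP_LAST = re.compile(r"(?m)^(.+\n)(?=(?s:.*)^\1)")


def reverse_lines(text):
    """Reverse the order of the lines; each line keeps its own line break."""
    return "".join(reversed(text.splitlines(keepends=True)))


def keep_first(text):
    """Reverse, keep last occurrences, then reverse back: first ones survive."""
    return reverse_lines(KEEP_LAST.sub("", reverse_lines(text)))


text = "aaa\nbbb\nccc\nddd\nbbb\nbbb\neee\nfff\nbbb\nggg\nbbb\nhhh\niii\n"
print(keep_first(text).split())
# ['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff', 'ggg', 'hhh', 'iii']
```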

                                  Hope it is clear now (English is not my first language).

                                  Best Regards.

                                  • Alan KilbornA
                                    Alan Kilborn @Sofistanpp
                                    last edited by

                                    @Sofistanpp

                                    Ah, okay, I missed the point about wanting to keep the first rather than the last. Thanks for the clarification.

                                    • guy038G
                                      guy038
                                      last edited by guy038

                                      Hi, @cletos, @sofistanpp, @alan-kilborn and All,

                                      @sofistanpp, I didn’t want to favor any solution but, indeed, it’s good to be able to choose, thanks to your clever idea of using the Reverse Lines plugin, between these two solutions:

                                      • Keep the first duplicate line and delete all subsequent duplicate lines

                                      • Delete any duplicate but just keep the last duplicate line

                                      Now, thinking about it, I found a solution which can be carried out within N++ only, without needing any external tool


                                      If we go back to my previous example, move the caret to the first column of the first line of your text, open the Column editor ( Edit > Column Editor... ) and insert a list of numbers ( Don’t forget to tick the Leading zeros option ! )

                                      Then, after adding one or more space characters after each number, using a column-mode selection, you should get :

                                      
                                      01 aaa
                                      02 bbb
                                      03 ccc
                                      04 ddd
                                      05 bbb
                                      06 bbb
                                      07 eee
                                      08 fff
                                      09 bbb
                                      10 ggg
                                      11 bbb
                                      12 hhh
                                      13 iii
                                      

                                      Now, sort the lines with the option Edit > Line Operations > Sort Lines Lexicographically Descending, giving :

                                      13 iii
                                      12 hhh
                                      11 bbb
                                      10 ggg
                                      09 bbb
                                      08 fff
                                      07 eee
                                      06 bbb
                                      05 bbb
                                      04 ddd
                                      03 ccc
                                      02 bbb
                                      01 aaa
                                      

                                      Next, run this new version of my previous regex S/R :

                                      • SEARCH (?-s)^\d+\h+(.+\R)(?=(?s:.*)^\d+\h+\1)

                                      • REPLACE Leave EMPTY

                                      You’re left with :

                                      13 iii
                                      12 hhh
                                      10 ggg
                                      08 fff
                                      07 eee
                                      04 ddd
                                      03 ccc
                                      02 bbb
                                      01 aaa
                                      

                                      Finally, after a second sort with Edit > Line Operations > Sort Lines Lexicographically Ascending, which restores the original order, we have the following output text :

                                      01 aaa
                                      02 bbb
                                      03 ccc
                                      04 ddd
                                      07 eee
                                      08 fff
                                      10 ggg
                                      12 hhh
                                      13 iii
                                      

                                      As expected, only the bbb line between lines aaa and ccc remains ;-))
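
The whole number, sort, dedupe, sort pipeline can be mimicked in Python as a sanity check. This is a sketch under assumptions: `\n` line endings, `[ \t]` standing in for `\h` (which Python's `re` lacks), and plain string sorting standing in for the editor's lexicographic sort. The names `NUMBERED_KEEP_LAST` and `keep_first_via_numbering` are hypothetical.

```python
import re

# Same idea as the regex above: skip the number prefix, capture the rest of
# the line, and delete the line if the same content appears again later.
NUMBERED_KEEP_LAST = re.compile(
    r"(?m)^\d+[ \t]+(.+\n)(?=(?s:.*)^\d+[ \t]+\1)"
)


def keep_first_via_numbering(lines):
    """Number the lines, sort descending, dedupe, then sort ascending."""
    width = len(str(len(lines)))  # zero-pad so string order equals numeric order
    numbered = [f"{i:0{width}d} {line}\n" for i, line in enumerate(lines, start=1)]
    descending = "".join(sorted(numbered, reverse=True))
    deduped = NUMBERED_KEEP_LAST.sub("", descending)
    return sorted(deduped.splitlines())


sample = "aaa bbb ccc ddd bbb bbb eee fff bbb ggg bbb hhh iii".split()
for line in keep_first_via_numbering(sample):
    print(line)
```

As in the output above, the number prefixes are still present at the end; they record each surviving line's original position.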

                                      Best Regards,

                                      guy038

                                      • SofistanppS
                                        Sofistanpp
                                        last edited by

                                        Hi guy038, All:

                                        Well done. I’m glad my post somehow inspired you to develop a more comprehensive solution to the current issue. As I learned reading archived posts, ancillary lists are a frequently used resource of your toolbox.

                                        On my side, reversing lines wasn’t my first thought. What would happen, I asked myself, if I run that regex in backward direction from the last line? Would I get, by symmetry, the first “bbb”? I enabled the Backward direction button via an AutoHotkey script and clicked on Replace All, but no joy: you get exactly the same outcome as if you run the regex in the normal direction.

                                        I suspect that lookarounds are the culprits (simpler regexes do the expected job), but haven’t thoroughly tested it.

                                        Maybe you or someone else can elaborate on this issue.

                                        Best Regards.

                                        • CletosC
                                          Cletos
                                          last edited by

                                          Hello guy038,

                                          Thank you very much for the new method!

                                          • Alan KilbornA
                                            Alan Kilborn @Sofistanpp
                                            last edited by Alan Kilborn

                                            @Sofistanpp

                                            run that regex in backward direction from the last line

                                            Searching backwards with regex is “discouraged” and is partially disabled in Notepad++.
                                            The reason, I think, is that searching backwards through a given text does not always produce the same hits as searching forwards. Sometimes (simpler regexes, as you noted) it will, but not always (it depends upon the regex and maybe the data).

                                            Enabled the Backward direction button via an AutoHotkey script

                                            In general, enabling disabled controls and then performing an operation and expecting good results is a dubious premise.
