    Delete duplicate lines ?

    • Björn B-son

      It replaces only one occurrence each time I run the search.
      It doesn’t matter if I choose Replace All.

      • Björn B-son

        Actually it doesn’t; it just says that it does…

        • guy038

          Hello Björn and Scott,

          =============================================================================================================

          UPDATE :

          On 11/13/16, I updated this post. Indeed, I realized that my regex may fail when dealing with large files :-(( I suppose that it was due to the global in-line modifier (?s), at the beginning of the regex ? I don’t have time to pin down what goes wrong, in some cases !

          So, to keep all the unique lines AND the last item of all the duplicate lines, you should rather use the following, safer S/R :

          SEARCH : (?-s)^(.+\R)(?s)(?=(.+\R)?\1)|^\R

          REPLACE : EMPTY

          Of course, I also updated the old post’s contents, below.

          guy038

          =============================================================================================================

          Ah, Scott, very nice regex, indeed !

          Scott and Björn, last week, I replied to sophey in the post below :

          https://notepad-plus-plus.org/community/topic/12490/i-want-to-keep-only-unique-lines/2

          where I tried to fully discuss TWO general methods of keeping :

          • Only unique lines

          • All the duplicate lines

          • Only the first line, from all the duplicate ones


          But Scott, I hadn’t thought about this fourth case : keeping all the unique lines AND the last item of all the duplicate lines !

          So, with your regex, (?s)^(.*?)$\s+?^(?=.*^\1$), here is an example showing which lines are kept, and the contents of the file before and after the S/R :

              File            Lines         File
             BEFORE           KEPT          AFTER
          -------------------------------------------                 
              aaa                            ccc
              ccc     -->      ccc           bbb
              bbb     -->      bbb           eee
              ddd                            aaa
              aaa                            fff
              eee     -->      eee           ddd
              ddd                            ggg
              aaa     -->      aaa           hhh
              fff     -->      fff           iii
              ddd     -->      ddd
              ggg     -->      ggg
              hhh            
              iii            
              hhh     -->      hhh
              iii     -->      iii
          -------------------------------------------                 
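The BEFORE and AFTER columns above can be cross-checked without any regex. The following is only a minimal sketch in Python, not part of the thread: the function name `keep_last_occurrences` is made up for illustration, and it simply keeps each distinct line at the position of its last occurrence.

```python
def keep_last_occurrences(lines):
    """Keep each distinct line once, at the position of its LAST occurrence."""
    seen = set()
    kept_reversed = []
    # Walk from the bottom: the first time a line is met is its last occurrence.
    for line in reversed(lines):
        if line not in seen:
            seen.add(line)
            kept_reversed.append(line)
    # Restore top-to-bottom order.
    return kept_reversed[::-1]


# The BEFORE column of the table, one item per line.
before = "aaa ccc bbb ddd aaa eee ddd aaa fff ddd ggg hhh iii hhh iii".split()
after = keep_last_occurrences(before)
print("\n".join(after))
```

The printed list should correspond to the AFTER column: ccc, bbb, eee, aaa, fff, ddd, ggg, hhh, iii.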
          

          Thinking about it, I found another syntax, which achieves the same modifications : (?-s)^(.+\R)(?s)(?=(.+\R)?\1)|^\R. As with yours, the replacement zone must be EMPTY

          However, my regex needs one condition : the last line ( the string “iii”, in the above example ) must be followed by its EOL character(s) !

          Notes :

          • The first part (?-s)^(.+\R), with the modifier (?-s), which ensures that the dot matches standard characters only ( not EOL characters ), matches any complete non-empty line, with its EOL characters

          • In the second part (?s)(?=(.+\R)?\1), the modifier (?s) means that the dot matches absolutely any character ( standard or EOL characters ). The syntax (.+\R)?\1, then, represents the largest optional range of characters, ending with an EOL character, followed by the contents of group 1 ( the current line )

          • Therefore, the part (?=(.+\R)?\1), which is a positive look-ahead, imposes a condition for an overall match : that there exists, further on ( possibly immediately after ), a complete line identical to the current one ! If so, the complete current line is deleted by the replacement

          • Finally, the third part ^\R, after the alternation symbol |, matches any truly empty line, which is deleted by the replacement, too
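For readers who want to experiment with these three parts outside Notepad++, here is a rough Python port. It is a sketch under assumptions, not the regex from the thread: Python's `re` module has no `\R`, so it assumes LF-only line endings and uses `\n`, and it replaces the mid-pattern `(?-s)` / `(?s)` switches with a scoped `(?s:...)` group. The name `DEDUP` is invented for this example.

```python
import re

# LF-only port: \R becomes \n, and the dot-matches-newline behaviour is
# limited to the look-ahead by the scoped flag group (?s:...).
DEDUP = re.compile(
    r"^(.+\n)"              # part 1: one complete non-empty line plus its EOL
    r"(?=(?s:(.+\n)?)\1)"   # part 2: look-ahead, any text up to an EOL, then group 1 again
    r"|^\n",                # part 3: a truly empty line
    re.MULTILINE,
)

text = "aaa\nbbb\n\naaa\nccc\nbbb\n"
# The replacement zone is EMPTY, as in the thread.
print(DEDUP.sub("", text))
```

On this small sample the first `aaa`, the first `bbb` and the empty line are removed, which leaves `aaa`, `ccc`, `bbb`.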

          Best Regards,

          guy038

          • Scott Sumner @guy038

            @guy038

            I’m glad you like my regex. There is a 99% chance that you were the original author and I obtained it from you via this community over the last 1.5 years I’ve been reading it!

            • Scott Sumner @Björn B-son

              @Björn-B-son

              Did you get it working in your file(s), using either my or guy038’s methods?

              • Björn B-son

                Neither works; I get the same result with both.
                I get a message that one occurrence was replaced, but that doesn’t seem to be the case.
                Screenshot http://prntscr.com/cwkn1e

                My file has 1797 lines.
                Maybe I can post it somewhere for you to try ?

                • Scott Sumner @Björn B-son

                  @Björn-B-son

                  I just reverified my own regex as well as guy038’s regex on the “aaa, bbb, …” data guy038 provided. These regexes work to transform that data as described, so I’m not really able to tell you what is going wrong in your case. :(

                  • Björn B-son

                    Could it be failing because all lines start with -

                    Like

                    - Odling av andra fleråriga växter

                    Or because there are some Swedish characters ?

                    • guy038

                      Hi Björn,

                      Very strange, indeed ! I first thought it could be because of the Wrap around option, but the regexes worked well, whether this option is checked or not !

                      I also verified that, if Wrap around is checked and the caret is somewhere inside the list, the resulting text is correct, at the end !

                      I also tried with your Swedish text, building the original text below :

                      Odling av andra fleråriga växter
                      Odling av andra fleråriga växter
                      aaaa
                      Odling av andra fleråriga växter
                      bbbb
                      Odling av andra fleråriga växter
                      aaaa
                      Odling av andra fleråriga växter
                      

                      After clicking on the Replace All button, I got the expected changed text, below :

                      bbbb
                      aaaa
                      Odling av andra fleråriga växter
                      

                      So the best thing is to begin… at the beginning !

                      First of all, using the simple test text from my previous post :

                      aaa
                      ccc
                      bbb
                      ddd
                      aaa
                      eee
                      ddd
                      aaa
                      fff
                      ddd
                      ggg
                      hhh
                      iii
                      hhh
                      iii
                      

                      do you obtain, after the replacement, the text below :

                      ccc
                      bbb
                      eee
                      aaa
                      fff
                      ddd
                      ggg
                      hhh
                      iii
                      

                      Moreover, just to verify, after clicking on the Show All Characters button ( or the menu option View - Show Symbol - Show All Characters ), what do the EOL characters of your file look like ? CR LF, LF or CR ? Are they all identical ?
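The same EOL check can also be done outside the editor. The snippet below is a small sketch, not something from the thread: `count_eols` is a hypothetical helper, and it assumes the file uses a byte-oriented encoding such as ANSI or UTF-8 (it would miscount UTF-16 files).

```python
import re
from collections import Counter


def count_eols(data: bytes) -> Counter:
    """Count each EOL style (CRLF, LF, CR) found in raw file bytes."""
    names = {b"\r\n": "CRLF", b"\n": "LF", b"\r": "CR"}
    # \r\n is tried first so a Windows EOL is not counted as a CR plus an LF.
    return Counter(names[eol] for eol in re.findall(rb"\r\n|\r|\n", data))


# Deliberately mixed sample: two CRLF, one LF, one lone CR.
sample = b"aaa\r\nbbb\nccc\r\nddd\r"
print(count_eols(sample))
```

For a real file, the bytes could be read with `open(path, "rb").read()`, where `path` stands for your own file. More than one key in the result means the EOLs are not all identical.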

                      Remember, if you’re using my regex, just take care that the last item of your list is followed by EOL character(s) ! It’s the only minor restriction !
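To illustrate this restriction, here is a hedged Python sketch. It uses an LF-only port of the regex (Python's `re` has no `\R`, so `\n` is assumed), and `dedup_keep_last` is a hypothetical helper that appends the missing final EOL before running the replacement.

```python
import re

# LF-only port of (?-s)^(.+\R)(?s)(?=(.+\R)?\1)|^\R ; an assumption, since
# Python's re module does not support \R.
DEDUP = re.compile(r"^(.+\n)(?=(?s:(.+\n)?)\1)|^\n", re.MULTILINE)


def dedup_keep_last(text: str) -> str:
    """Add a final EOL if it is missing, then delete the earlier duplicates."""
    if text and not text.endswith("\n"):
        text += "\n"
    return DEDUP.sub("", text)


no_final_eol = "hhh\niii\nhhh\niii"          # the last "iii" has no EOL
print(repr(DEDUP.sub("", no_final_eol)))     # the earlier "iii" is not removed
print(repr(dedup_keep_last(no_final_eol)))   # both duplicates are removed
```

Without the final EOL, the look-ahead cannot find a complete `iii` line further on, so the first `iii` survives. Once the EOL is added, only the last `hhh` and the last `iii` remain.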

                      See you later,

                      Cheers,

                      guy038

                      • Vasile Caraus

                        Yes, but if I have special characters, such as "-- Mother is home – ", none of your regexes works completely.

                        • guy038

                          Hello, Vasile,

                          My updated regex ( see the second post, above ) works perfectly well, even if I insert your expression -- Mother is home – in a list ! For instance, the original text, below :

                          aaa
                          ccc
                          bbb
                          ddd
                          aaa
                          -- Mother is home – 
                          eee
                          ddd
                          aaa
                          fff
                          -- Mother is home – 
                          -- Mother is home – 
                          ddd
                          ggg
                          hhh
                          -- Mother is home – 
                          iii
                          hhh
                          iii
                          

                          with the S/R :

                          SEARCH : (?-s)^(.+\R)(?s)(?=(.+\R)?\1)|^\R

                          REPLACE : EMPTY

                          will be changed into :

                          ccc
                          bbb
                          eee
                          aaa
                          fff
                          ddd
                          ggg
                          -- Mother is home – 
                          iii
                          hhh
                          iii
                          

                          => It did keep all the unique lines AND the last item of all the duplicate lines, including your string -- Mother is home – !
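As a rough cross-check of this example outside Notepad++, the sketch below runs an LF-only Python port of the regex (an assumption, since Python's `re` lacks `\R`) on the same list. The back-reference compares each line literally, so the dashes need no escaping. The result listed above shows iii twice; one possible explanation, which is a guess and not stated in the thread, is that the final iii of the test text had no EOL, and the second run below reproduces that listing.

```python
import re

# LF-only port of the thread's regex; \R is replaced by \n.
DEDUP = re.compile(r"^(.+\n)(?=(?s:(.+\n)?)\1)|^\n", re.MULTILINE)

M = "-- Mother is home – "  # copied from the post, trailing space included
lines = ["aaa", "ccc", "bbb", "ddd", "aaa", M, "eee", "ddd", "aaa", "fff",
         M, M, "ddd", "ggg", "hhh", M, "iii", "hhh", "iii"]

# Run 1: every line, including the last one, ends with an EOL.
with_eol = DEDUP.sub("", "\n".join(lines) + "\n").splitlines()
# Run 2: the last line has no EOL (assumed, to explain the doubled "iii").
without_eol = DEDUP.sub("", "\n".join(lines)).splitlines()

print(with_eol)
print(without_eol)
```

In both runs the special-character line is kept exactly once, at its last position.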

                          Cheers,

                          guy038
