    Change menu text for "Remove Consecutive Duplicate Lines" ?

    Help wanted · 28 Posts · 7 Posters · 4.6k Views
    • PeterJones

      @Brian-Schweitzer said:

      TextFX does it perfectly, and I can use that with a 32 bit Notepad++, but I’d like to keep the 64 bit if possible.

      The two are not mutually exclusive. You could leave 64-bit as your installed Notepad++, but download a portable (zip-edition) 32-bit Notepad++ and unzip it into some other directory (not in the Program Files (x86) hierarchy; I take inspiration from the Linux world and put my outside-of-Program-Files programs in c:\usr\local\apps\____). You could then use the 64-bit version for normal, everyday usage, but when you want to remove the duplicates, just launch your 32-bit instance instead.
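
      For instance, a tiny launcher along these lines would start the 32-bit instance on demand (a sketch only; both paths are hypothetical placeholders for wherever you unzipped the portable copy and whatever file you want to process):

      # Sketch: launch a portable 32-bit Notepad++ alongside the installed 64-bit one.
      # Both paths below are hypothetical placeholders, not real defaults.
      import subprocess

      subprocess.Popen([r'c:\usr\local\apps\npp32\notepad++.exe',   # portable 32-bit copy
                        r'c:\temp\file-with-duplicates.txt'])       # file to process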

      • Alan Kilborn @Brian Schweitzer

        @Brian-Schweitzer

        it will remove ALL records after the last duplicate that it finds

        Sadly, there are some limitations where the regular expression engine is concerned…but you’ve already discovered this so I’m adding nothing new…

        the built in “Remove Consecutive Duplicates” does exactly the same thing

        This built-in command uses a regular-expression replacement operation as well (albeit one coded in C++, not a user-supplied one), so the same outcome makes sense.
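
        For illustration, here is a minimal sketch (my own guess at the kind of pattern involved, not the actual Notepad++ source) of a regex replacement that collapses consecutive duplicate lines:

        import re

        # Capture one full line (including its newline) and collapse any immediate
        # repetitions of it into one copy; MULTILINE anchors ^ at every line start.
        text = "abcde\nabcde\nabcde\nfgh\nfgh\njk\n"
        deduped = re.sub(r'^(.*\n)\1+', r'\1', text, flags=re.MULTILINE)
        print(deduped)   # prints abcde, fgh, jk - one of each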

        Is there a way to allow larger file sizes to process successfully?

        If I were doing it, I’d turn to an external tool. Since nothing existing that does exactly this comes to mind, I’d likely roll my own. I’d probably try Python first, and if that wasn’t fast enough, turn to C. Maybe in your case, sticking with TextFX is the best option.

        Sorry I don’t have a more optimistic response – maybe someone else?
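
        For what it’s worth, here is the kind of roll-your-own tool I have in mind (a sketch under minimal assumptions; the script name is hypothetical). It streams the input line by line, so memory use stays constant no matter how large the file is:

        # dedup.py - remove CONSECUTIVE duplicate lines from the file given on the
        # command line, writing the result to stdout.
        import sys

        def dedup_consecutive(path):
            prev = None
            with open(path) as f:
                for line in f:
                    if line != prev:        # keep only the first line of each run
                        sys.stdout.write(line)
                    prev = line

        if __name__ == '__main__':
            dedup_consecutive(sys.argv[1])  # usage: python dedup.py input.txt > output.txt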

        • Ekopalypse @Alan Kilborn

          @Alan-Kilborn

          Did a quick test: creating 6_000_100 lines takes much longer than removing their duplicates.

          def remove_duplicates():
              # PythonScript (Notepad++): `editor` is the predefined Scintilla object.
              unique_lines = set()   # set membership tests are O(1)
              duplicates = []        # line numbers to delete, in document order
              for line_num, line in enumerate(editor.getCharacterPointer().splitlines()):
                  if line not in unique_lines:
                      unique_lines.add(line)
                  else:
                      duplicates.append(line_num)

              # Delete bottom-up so the remaining line numbers stay valid.
              for line_num in reversed(duplicates):
                  editor.deleteLine(line_num)

          Which took 5.8 seconds in my environment. :-)
          Note, this script would remove ANY duplicate, not only the ones which are consecutive.
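
          For reference, a consecutive-only sketch of the same idea (assuming the same PythonScript environment) would just compare each line with its predecessor:

          def remove_consecutive_duplicates():
              duplicates = []
              prev = None
              for line_num, line in enumerate(editor.getCharacterPointer().splitlines()):
                  if line == prev:           # duplicate only if it repeats the previous line
                      duplicates.append(line_num)
                  prev = line
              for line_num in reversed(duplicates):
                  editor.deleteLine(line_num)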

          • Alan Kilborn @Ekopalypse

            @Ekopalypse

            Nice.

            Which took 5.8 seconds in my environment

            Nicer.

            script would remove ANY duplicate, not only the ones which are consecutive.

            Perhaps nicest.

            :)

            I was just generalizing in my earlier reply; I didn’t know a script was going to come out of it. :)

            • Ekopalypse @Alan Kilborn

              @Alan-Kilborn

              I was just generalizing in my earlier reply; I didn’t know a script was going to come out of it. :)

              I already had it, but I had never tested it with really big data; this thread just gave me the trigger to do the test :-)

              • guy038

                Hello, @ekopalypse,

                I’ve just tried out your script for removing duplicate lines, with a local N++ v7.6.3, 32-bit release, and nothing occurred :-((

                My PythonScript version is 1.3.0.0 and NO error message is displayed in the console !

                My Python interpreter seems OK, as other scripts just work as expected !

                I used this simple sample text below :

                abcde
                fgh
                abcde
                jk
                opq
                abcde
                fgh
                jk
                fgh
                abcde
                

                I also tried sorting it first, and selecting a line, a block of lines, or all the text => no result :-(( I also suppressed the line numbering, just in case…

                Here is my debug info :

                Notepad++ v7.6.3   (32-bit)
                Build time : Jan 27 2019 - 17:20:30
                Path : D:\@@\763\notepad++.exe
                Admin mode : OFF
                Local Conf mode : ON
                OS : Windows XP (32-bit)
                Plugins : BetterMultiSelection.dll ComparePlugin.dll DSpellCheck.dll ElasticTabstops.dll mimeTools.dll NppConverter.dll NppExport.dll PythonScript.dll TabIndentSpaceAlign.dll 
                

                Note that v7.6.3 is my most recent version, the one where I installed the PythonScript plugin, and that my Win XP laptop contains numerous portable N++ versions, with various plugins in each ;-))

                So, am I missing something obvious ?!

                BR

                guy038

                • Ekopalypse @guy038

                  @guy038

                  sorry, yes, I only posted the function itself - it must be called, of course :-)

                  def remove_duplicates():
                      unique_lines = set()
                      duplicates = []
                      for line_num, line in enumerate(editor.getCharacterPointer().splitlines()):
                          if line not in unique_lines:
                              unique_lines.add(line)
                          else:
                              duplicates.append(line_num)
                  
                      for line_num in reversed(duplicates):
                          editor.deleteLine(line_num)
                  
                  remove_duplicates()
                  
                  • Alan Kilborn

                    Somewhat equivalently, one could remove the def remove_duplicates(): line (and now also the remove_duplicates() call) and outdent the remaining lines, and it will also work fine. :)
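
                    That is, a sketch of the module-level variant:

                    unique_lines = set()
                    duplicates = []
                    for line_num, line in enumerate(editor.getCharacterPointer().splitlines()):
                        if line not in unique_lines:
                            unique_lines.add(line)
                        else:
                            duplicates.append(line_num)

                    for line_num in reversed(duplicates):
                        editor.deleteLine(line_num)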

                    • PeterJones

                      @Ekopalypse,

                      I just tried it out. With the call, it works for me on @guy038’s data.

                      The one thing I would suggest would be to wrap it in an editor.beginUndoAction() / editor.endUndoAction() pair. If I’m doing a bulk delete, I want to be able to bulk undo, too. :-)

                      • Ekopalypse @PeterJones

                        @PeterJones

                        Depending on how many duplicates it found, yes, it could become quite cumbersome
                        if one tried to undo it :-)

                        def remove_duplicates():
                            unique_lines = set()
                            duplicates = []
                            for line_num, line in enumerate(editor.getCharacterPointer().splitlines()):
                                if line not in unique_lines:
                                    unique_lines.add(line)
                                else:
                                    duplicates.append(line_num)
                        
                            for line_num in reversed(duplicates):
                                editor.deleteLine(line_num)
                        
                        editor.beginUndoAction() 
                        remove_duplicates()
                        editor.endUndoAction()
                        • guy038

                          Hi, @Ekopalypse, @alan-kilborn, @peterjones and all,

                          Oh… my bad ! I’m feeling really silly, right now :-(( So elementary !


                          Now, as the native Remove consecutive duplicate lines N++ option does not take any selection into account, @ekopalypse, would it be easy enough to consider just the current main selection ? If so, it could be an interesting enhancement of this native N++ command ;-))

                          Cheers,

                          guy038

                          • Ekopalypse @guy038

                            @guy038

                            Yes, but what should happen with the selection afterwards ?
                            Should it simply disappear, or should it select the remaining unique lines ?

                            • guy038

                              @ekopalypse,

                              To my mind, I don’t think it’s necessary to keep the selection. Indeed, it would just be a means of defining the part of the file to be processed !

                              What’s your feeling about it ?

                              Cheers,

                              guy038

                              • Ekopalypse @guy038

                                @guy038

                                Not sure; I guess providing a flag which can be set is good enough. If one wants it,
                                turn it on; if not, turn it off.

                                If no one else jumps in, I will follow up tomorrow, as it is already past midnight - but I know you know this, since you are from France, as I remember.

                                'til tomorrow.

                                • Ekopalypse @guy038

                                  Hi @guy038
                                  as promised, here is a version with a selection option.

                                  def remove_duplicates():
                                      # Set to True to clear the selection once the duplicates have been removed.
                                      unselect_after_removable = False
                                      unique_lines = set()
                                      duplicates = []
                                      if editor.getSelectionEmpty():
                                          # No selection: scan the whole document.
                                          for line_num, line in enumerate(editor.getCharacterPointer().splitlines()):
                                              if line not in unique_lines:
                                                  unique_lines.add(line)
                                              else:
                                                  duplicates.append(line_num)
                                      else:
                                          # Selection present: scan only the selected line range.
                                          start, end = editor.getUserLineSelection()
                                          for line_num in range(start, end+1):
                                              line = editor.getLine(line_num)   # includes the EOL, consistently within this branch
                                              if line not in unique_lines:
                                                  unique_lines.add(line)
                                              else:
                                                  duplicates.append(line_num)

                                      # Delete bottom-up so the collected line numbers remain valid.
                                      for line_num in reversed(duplicates):
                                          editor.deleteLine(line_num)

                                      if unselect_after_removable:
                                          editor.clearSelections()
                                  
                                  • guy038

                                    Hi, @ekopalypse and All,

                                    This time, I was warned ;-)) So, adding the part below to your script allowed me to appreciate your last version :

                                    editor.beginUndoAction() 
                                    remove_duplicates()
                                    editor.endUndoAction()
                                    

                                    If no main selection is present, the whole file content is processed. Otherwise, only the selection range is concerned. Nice, indeed ;-))

                                    I built a sample file containing roughly 497,000 lines, all different, and I added a block of 15 lines, 128 times, each block being separated from the next one by between 800 and 7,500 lines, which finally gave me a file of almost 500,000 lines. On my out-dated laptop ( Win XP, 1 GB of RAM ! ), no problem: it took about 31 s to process !

                                    BR

                                    guy038

                                    P.S. :

                                    Yes, I know ! Why can’t he buy a recent laptop, with a 250 GB SSD for Windows 10, 8 GB of SDRAM, a 2 TB SATA HD and a 2 GB NVIDIA GeForce, like everybody ? Well, I think I’m about to reach the tipping point ;-))

                                    Note that I did not emphasize these laptop characteristics, as I’m not quite certain they are all accurate !!

                                    • Ekopalypse @guy038

                                      @guy038

                                      hehe :-)
                                      31 s is a long time - it would be interesting to see your results using this little test.
                                      I assume the RAM might be the bottleneck; would you mind running some tests
                                      with 50,000 instead of 500,000 lines as well?

                                      import time
                                      from random import randint


                                      def create_unique_lines(num_of_lines):
                                          # Fill the document with num_of_lines distinct lines.
                                          lines = ['sample data {0} on line {0}\r\n'.format(x) for x in range(num_of_lines)]
                                          editor.setText(''.join(lines))


                                      def create_duplicates(num_of_duplicates):
                                          # Append copies of randomly chosen existing lines at the end of the document.
                                          max_lines = editor.getLineCount()
                                          duplicated_lines = []
                                          for i in range(num_of_duplicates):
                                              # Line numbers are 0-based and line max_lines - 1 is the empty
                                              # trailing line, so pick from 0 .. max_lines - 2.
                                              duplicate_line = randint(0, max_lines - 2)
                                              duplicated_lines.append(editor.getLine(duplicate_line))
                                          editor.appendText(''.join(duplicated_lines))
                                          editor.scrollToEnd()


                                      def remove_duplicates():
                                          # Set to True to clear the selection once the duplicates have been removed.
                                          unselect_after_removable = False
                                          unique_lines = set() # here much faster than lists
                                          duplicates = []
                                          if editor.getSelectionEmpty():
                                              # No selection: scan the whole document.
                                              for line_num, line in enumerate(editor.getCharacterPointer().splitlines()):
                                                  if line not in unique_lines:
                                                      unique_lines.add(line)
                                                  else:
                                                      duplicates.append(line_num)
                                          else:
                                              # Selection present: scan only the selected line range.
                                              start, end = editor.getUserLineSelection()
                                              for line_num in range(start, end+1):
                                                  line = editor.getLine(line_num)
                                                  if line not in unique_lines:
                                                      unique_lines.add(line)
                                                  else:
                                                      duplicates.append(line_num)

                                          # Delete bottom-up so the collected line numbers remain valid.
                                          for line_num in reversed(duplicates):
                                              editor.deleteLine(line_num)

                                          if unselect_after_removable:
                                              editor.clearSelections()


                                      def main():
                                          # Walk through the three actions; the prompt's default advances 1 -> 2 -> 3.
                                          keep_going = '0'
                                          while keep_going:
                                              keep_going = notepad.prompt('1 = Create data    2 = Create duplicates    3 = Remove duplicates',
                                                                          'Choose action',
                                                                          '1' if keep_going == '3' else str(int(keep_going) + 1))

                                              if keep_going == '1':
                                                  create_unique_lines(500000)
                                              elif keep_going == '2':
                                                  create_duplicates(10)
                                              elif keep_going == '3':
                                                  # Time the removal and make it a single undoable action.
                                                  editor.beginUndoAction()
                                                  s = time.time()
                                                  remove_duplicates()
                                                  print(time.time() - s)
                                                  editor.endUndoAction()
                                                  break
                                              else:
                                                  # Cancel or any other input ends the loop.
                                                  break


                                      main()

                                      Action number 3, or pressing the cancel button, breaks the loop.
                                      Btw. I get 0.33 seconds removing 10 duplicates from 500,000 lines on my machine.
                                      First-gen i5 2500, but with plenty of RAM - 16 GB :-)

                                      • guy038

                                        Hello @ekopalypse and All,

                                        My original file consisted of about 500,000 lines of between 40 and 270 characters each. So, this could explain the long execution time I noticed on my weak configuration !

                                        To give you an idea, running your second script several times, yesterday and this morning, took between 0.937 s and 1.328 s on my old laptop !


                                        Now, after running your script once more, I cancelled it right before the “remove duplicate lines” action. Then :

                                        • I deleted the first line sample data 0 on line 0

                                        • I added the line sample data 500000 on line 500000, at the end of the file

                                        • I moved the random block of 10 duplicate lines, below, to the very beginning of the file :

                                        sample data 215497 on line 215497
                                        sample data 444992 on line 444992
                                        sample data 413618 on line 413618
                                        sample data 117035 on line 117035
                                        sample data 185573 on line 185573
                                        sample data 25978 on line 25978
                                        sample data 275256 on line 275256
                                        sample data 251521 on line 251521
                                        sample data 328003 on line 328003
                                        sample data 342755 on line 342755
                                        
                                        • Afterwards, I used a regex S/R to add this 10-line block right after lines 5,000, 10,000, 15,000, and so on, up to 495,000 and finally 500,000 ( the \K resets the match start, so the replacement block is simply inserted right after the matched line break ) :

                                          • SEARCH [50]000\R\K

                                          • REPLACE sample data 215497 on line 215497\r\nsample data 444992 on line 444992\r\nsample data 413618 on line 413618\r\nsample data 117035 on line 117035\r\nsample data 185573 on line 185573\r\nsample data 25978 on line 25978\r\nsample data 275256 on line 275256\r\nsample data 251521 on line 251521\r\nsample data 328003 on line 328003\r\nsample data 342755 on line 342755\r\n

                                        So, I obtained a 501,000-line file ( = 500,000 initial lines + 100 blocks of 10 duplicate lines )


                                        To end, I added the timer to your initial Remove duplicates script. Running the script on the 501,000-line file took between 9.86 s and 11.94 s, depending on the location of the caret before execution. Everything was OK ! Each time, I again got a file containing 500,000 lines ;-))

                                        For information, the block of 10 duplicate lines that was kept was the one located between lines 5,000 and 5,001 !

                                        Important remark : your script is correct as long as the word wrap feature is NOT set ! Otherwise, results are quite incoherent :-(( This seems quite natural, as I have often noticed that navigation throughout a huge file is very, very slow when the word wrap feature is enabled, even when there’s no highlighting !

                                        Cheers,

                                        guy038

                                        • Ekopalypse @guy038

                                          @guy038

                                          thank you for your tests. ~10 s on your aged laptop seems OK, I guess.

                                          What do you mean by

                                          Your script is correct as long as the word wrap feature is NOT set

                                          ?
                                          While testing, I didn’t encounter a situation where duplicates still existed after running the script. Word wrap and long lines are performance killers, that’s for sure, but the script should still work correctly.
