Wanted function: Remove Duplicated Lines

    Dmitry Bond
    last edited by Feb 23, 2018, 9:09 AM

    Hi.

    When preparing master data it is often necessary to remove duplicated lines.
    Currently I have to use the command line for this operation: paste into NPP, sort the lines, remove empty lines, save to a text file, run a command-line tool on the file to remove duplicates, reload it into NPP, and continue editing.
    It would be nice to have such a function in NPP, somewhere under menu -> Edit -> Line Operations -> Remove Duplicated Lines (maybe with an “ignore case” option).
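
For comparison, the round trip described above (sort, drop empty lines, remove duplicates) could look roughly like this outside the editor. This is only a sketch in Python: `clean_lines` and its `ignore_case` flag are hypothetical names, and treating whitespace-only lines as empty is an assumption.

```python
def clean_lines(text, ignore_case=False):
    """Sort the lines, drop empty ones, and keep one copy of each line."""
    # Assumption: a line containing only whitespace counts as empty.
    lines = sorted(line for line in text.splitlines() if line.strip())
    seen = set()
    result = []
    for line in lines:
        # With ignore_case, lines differing only in case share one key;
        # the copy that sorts first is the one that is kept.
        key = line.lower() if ignore_case else line
        if key not in seen:
            seen.add(key)
            result.append(line)
    return result


if __name__ == "__main__":
    sample = "b\n\nA\na\nb\n  \na\n"
    print(clean_lines(sample))
    print(clean_lines(sample, ignore_case=True))
```
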

    Thank you.

      chcg
      last edited by Feb 23, 2018, 3:50 PM

      see https://stackoverflow.com/questions/3958350/removing-duplicate-rows-in-notepad
      or for just words https://notepad-plus-plus.org/community/topic/15247/replacing-duped-words-across-a-block-block-of-text-respecting

        guy038
        last edited by guy038 Feb 23, 2018, 7:26 PM Feb 23, 2018, 7:23 PM

        Hello, @dmitry-Bond and All,

        First of all, we must agree on what the phrase Remove Duplicated Lines means !

        So, assuming the initial example text :

            bbbbbbb
            hhhhhhhhhhh
            fffffffffffffff
            bbbbbbb
            aaaaa
            bbbbbbb
            jj
            eeeeeeeeeeeeeeeeeeeeeeeeeee
            AAAaa
            ccccccccccccccccccccccccccccccccccccccccccccccc
            AAAAA
            ddd
            iiiiiiiiiiiiiiiii
            aaaaa
            hhhhhhhhhhh
            gggggggggggggggggggggggggggggggggggg
            AAAAA
            bbbbbbb
        
        • IF the search is insensitive to case, 3 lines are duplicated :

          • The line aaaaa ( 5 items )

          • The line bbbbbbb ( 4 items )

          • The line hhhhhhhhhhh ( 2 items )

        • IF the search is sensitive to case, 4 lines are duplicated :

          • The line aaaaa ( 2 items )

          • The line AAAAA ( 2 items )

          • The line bbbbbbb ( 4 items )

          • The line hhhhhhhhhhh ( 2 items )
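
These counts can be double-checked with a few lines of Python. A minimal sketch, assuming the example text above is pasted into `SAMPLE`; `duplicate_counts` is a hypothetical helper name, and lower-casing is used here as a simple stand-in for an insensitive comparison.

```python
from collections import Counter

SAMPLE = """bbbbbbb
hhhhhhhhhhh
fffffffffffffff
bbbbbbb
aaaaa
bbbbbbb
jj
eeeeeeeeeeeeeeeeeeeeeeeeeee
AAAaa
ccccccccccccccccccccccccccccccccccccccccccccccc
AAAAA
ddd
iiiiiiiiiiiiiiiii
aaaaa
hhhhhhhhhhh
gggggggggggggggggggggggggggggggggggg
AAAAA
bbbbbbb
"""


def duplicate_counts(text, ignore_case=False):
    """Return {line: count} for every line that occurs more than once."""
    lines = text.splitlines()
    if ignore_case:
        # Lines that differ only in case are counted under one key.
        lines = [line.lower() for line in lines]
    return {line: n for line, n in Counter(lines).items() if n > 1}


if __name__ == "__main__":
    print(duplicate_counts(SAMPLE, ignore_case=True))   # 3 duplicated lines
    print(duplicate_counts(SAMPLE, ignore_case=False))  # 4 duplicated lines
```
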

        I suppose that you probably want to get a final text with, at least, a single item of all these duplicated lines :-)) If you really want to delete ALL the lines, which are duplicated, just tell me !

        This task can be easily done with a search/replacement, using Regular expressions ! So :


        • Open your file in N++

        • Open the Replace dialog ( Ctrl + H )

        • IF your file is already SORTED :

          • Type in (?-is)^(.+\R)\1+ ( search sensitive to case ) OR (?i-s)^(.+\R)\1+ ( search insensitive to case )

          • Type in \1 in the Replace with: zone

        • IF your file is an UNSORTED file :

          • Type in (?s-i)^(.+?\R)(?=(?:.+\R)?\1) ( search sensitive to case ) OR (?si)^(.+?\R)(?=(?:.+\R)?\1) ( search insensitive to case )

          • Leave the Replace with: zone EMPTY

        • Tick the Wrap around option and the Regular expression search mode

        • Click on the Replace button, several times, or click once only on the Replace All button

        Et voilà !

        Remark: When processing an unsorted file, all copies of a duplicated line, except the last one, are deleted !


        So, from the initial example text ( see above ), let’s perform a pre-sort operation ( Edit > Line operations > Sort lines lexicographically Ascending ), we get the sorted text, below :

            AAAAA
            AAAAA
            AAAaa
            aaaaa
            aaaaa
            bbbbbbb
            bbbbbbb
            bbbbbbb
            bbbbbbb
            ccccccccccccccccccccccccccccccccccccccccccccccc
            ddd
            eeeeeeeeeeeeeeeeeeeeeeeeeee
            fffffffffffffff
            gggggggggggggggggggggggggggggggggggg
            hhhhhhhhhhh
            hhhhhhhhhhh
            iiiiiiiiiiiiiiiii
            jj
        

        Then, the search = (?-is)^(.+\R)\1+ and replacement = \1 would change text into :

            AAAAA
            AAAaa
            aaaaa
            bbbbbbb
            ccccccccccccccccccccccccccccccccccccccccccccccc
            ddd
            eeeeeeeeeeeeeeeeeeeeeeeeeee
            fffffffffffffff
            gggggggggggggggggggggggggggggggggggg
            hhhhhhhhhhh
            iiiiiiiiiiiiiiiii
            jj
        

        Whereas the search = (?i-s)^(.+\R)\1+ and replacement = \1 would change text into :

            AAAAA
            bbbbbbb
            ccccccccccccccccccccccccccccccccccccccccccccccc
            ddd
            eeeeeeeeeeeeeeeeeeeeeeeeeee
            fffffffffffffff
            gggggggggggggggggggggggggggggggggggg
            hhhhhhhhhhh
            iiiiiiiiiiiiiiiii
            jj
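
The two sorted-file regexes can also be tried outside the editor. A minimal sketch in Python, under a few assumptions: Python's `re` module has no `\R`, so `\n` stands in for it (LF line endings, text ending with a newline); `dedupe_sorted` is a hypothetical helper name; the sample is an abbreviated version of the sorted text above.

```python
import re


def dedupe_sorted(text, ignore_case=False):
    """Collapse runs of identical adjacent lines, keeping the first of each run."""
    # re.MULTILINE lets ^ match at every line start; "." does not match a
    # newline by default in Python, which plays the role of (?-s).
    flags = re.MULTILINE
    if ignore_case:
        flags |= re.IGNORECASE  # the backreference \1 then also ignores case
    return re.sub(r"^(.+\n)\1+", r"\1", text, flags=flags)


if __name__ == "__main__":
    sorted_text = "AAAAA\nAAAAA\nAAAaa\naaaaa\naaaaa\nbbbbbbb\nbbbbbbb\nddd\n"
    print(dedupe_sorted(sorted_text), end="")
    print("---")
    print(dedupe_sorted(sorted_text, ignore_case=True), end="")
```
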
        

        Now, If we use the initial text, without any sort operation :

        The search = (?s-i)^(.+?\R)(?=(?:.+\R)?\1) and replacement = EMPTY would give :

            fffffffffffffff
            jj
            eeeeeeeeeeeeeeeeeeeeeeeeeee
            AAAaa
            ccccccccccccccccccccccccccccccccccccccccccccccc
            ddd
            iiiiiiiiiiiiiiiii
            aaaaa
            hhhhhhhhhhh
            gggggggggggggggggggggggggggggggggggg
            AAAAA
            bbbbbbb
        

        Whereas the search = (?si)^(.+?\R)(?=(?:.+\R)?\1) and replacement = EMPTY would give :

            fffffffffffffff
            jj
            eeeeeeeeeeeeeeeeeeeeeeeeeee
            ccccccccccccccccccccccccccccccccccccccccccccccc
            ddd
            iiiiiiiiiiiiiiiii
            hhhhhhhhhhh
            gggggggggggggggggggggggggggggggggggg
            AAAAA
            bbbbbbb
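
The keep-the-last-copy behaviour on unsorted text can be reproduced without a regex as well. The sketch below is not the regex above: it swaps in a single backwards pass with a set of lines already seen. `dedupe_keep_last` is a hypothetical name and the sample is a shortened version of the initial text.

```python
def dedupe_keep_last(lines, ignore_case=False):
    """Keep only the last occurrence of each line, preserving relative order."""
    seen = set()
    kept = []
    # Walk backwards so the first copy we meet is the last one in the text.
    for line in reversed(lines):
        key = line.lower() if ignore_case else line
        if key not in seen:
            seen.add(key)
            kept.append(line)
    kept.reverse()  # restore the original top-to-bottom order
    return kept


if __name__ == "__main__":
    sample = ["bbbbbbb", "aaaaa", "bbbbbbb", "AAAaa", "AAAAA",
              "ddd", "aaaaa", "AAAAA", "bbbbbbb"]
    print(dedupe_keep_last(sample))
    print(dedupe_keep_last(sample, ignore_case=True))
```
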
        

        Remarks :

        • In a previously sorted file, the regexes keep the first duplicate found

        • In an unsorted file, the regexes keep the last duplicate found

        • If you want to delete the pure blank lines, as well :

          • Add |^\R at the end of any search regex

          • Change the non-empty replacement \1 to the conditional syntax ?1\1, so that group 1 is rewritten only when it took part in the match
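
The blank-line variant can be mimicked in Python, with the caveat that Python's `re` has neither `\R` nor the `?1\1` conditional replacement. In this sketch a replacement function plays that role, `\n` stands in for `\R`, and `strip_dupes_and_blanks` is a hypothetical name.

```python
import re

# Either a run of identical adjacent lines (group 1 = one copy),
# or a completely empty line (group 1 does not participate).
PATTERN = re.compile(r"^(.+\n)\1+|^\n", re.MULTILINE)


def strip_dupes_and_blanks(text):
    """Collapse adjacent duplicate lines and delete empty lines in one pass."""
    # Same idea as ?1\1: emit group 1 if it matched, otherwise emit nothing.
    return PATTERN.sub(lambda m: m.group(1) or "", text)


if __name__ == "__main__":
    print(strip_dupes_and_blanks("a\na\n\n\nb\n"), end="")
```
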

        I could give you some explanations about these regexes, or other topics, next time, if you want to :-))

        Best Regards,

        guy038

          Juan Miguel Martínez
          last edited by Mar 1, 2018, 4:19 PM

          I want to add my grain of salt about this: the regex solution works, but it’s extremely slow for anything beyond a hundred lines or so. I added one of the regex solutions (not sure if it was this one) as a macro, and I regularly use it on a ~6k-line text, where it takes long enough to be annoying, even on a 4 GHz CPU. Right now I had to do it on a 120k-line text and it’s going on its tenth minute. TextFX did stuff like this almost instantly (or took a reasonable amount of time for large texts anyway), but there’s no 64bit build of it.

          I have also read that the devs aren’t too keen to add functions when a regex can do the work, but in this case ‘can do the work’ is very open to interpretation in the context of a text editor that is capable of working with very large files. The running time of the regex solution (or, better called, workaround) doesn’t increase linearly with the number of lines; it seems exponential.

          Right now my options are:

          • Reinstall 32bit Notepad++, or
          • Keep using the really slow regex “solution”.

          Which IMHO is a really bad pair of choices when this function, which shouldn’t be that bad to code, could be in a simple plugin, or as added functionality in the Edit-> Operations with lines section.

          So, will someone attend to our pleas, please?

          Oh, my regex hasn’t ended yet…

            Jim Dailey
            last edited by Mar 2, 2018, 1:26 PM

            @Juan-Miguel-Martínez You have other options:

            • Learn a scripting language to perform tasks like this.
            • Use a different editor that is more to your liking.
            • Find the TextFX source code and build a 64-bit version.

            IMHO, people far too often expect an editor to do things that are done much more easily by tools that have been around for many years. For example, this simple AWK program will print the unique lines in a file (no sorting needed):

            { if (L[$0]++ == 0) print }
            

            And, I expect it would be very efficient, even on your 120K line input file.

            Now, consider a couple of variations. Suppose you wanted only the duplicated lines (each one printed once); this would do the trick:

            { if (L[$0]++ == 1) print }
            

            And, finally, if you wanted to see all of the duplicated lines (every repeated occurrence after the first, including duplicates of the duplicates), this would work:

            { if (L[$0]++ >= 1) print }
            

            I’m sure the Python and PERL experts can demonstrate similar capabilities.
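
As one possible Python counterpart, here is a rough sketch of the three AWK variants above. `awk_filter` is a hypothetical helper: it counts how often each line has been seen so far, like `L[$0]++`, and keeps a line when the supplied test on that prior count is true.

```python
from collections import defaultdict


def awk_filter(lines, keep):
    """Keep each line for which keep(times_seen_before) is true."""
    counts = defaultdict(int)
    out = []
    for line in lines:
        seen_before = counts[line]  # value of L[$0] before the ++
        counts[line] += 1
        if keep(seen_before):
            out.append(line)
    return out


if __name__ == "__main__":
    lines = ["b", "a", "b", "c", "a", "b"]
    print(awk_filter(lines, lambda n: n == 0))  # first copy of every line
    print(awk_filter(lines, lambda n: n == 1))  # each duplicated line once
    print(awk_filter(lines, lambda n: n >= 1))  # every repeat after the first
```
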

            Please consider learning a bit of AWK, Python, PERL, or some other such scripting language if you need to manipulate files in various ways that seem a bit difficult or contorted using the editor’s search and replace capabilities.

              PeterJones
              last edited by PeterJones Mar 2, 2018, 2:59 PM Mar 2, 2018, 2:57 PM

              perl -ne "print unless $seen{$_}++" dups.txt

              (I recommend easy-to-install strawberry perl for windows perl usage)

                Juan Miguel Martínez
                last edited by Mar 2, 2018, 7:36 PM

                You forgot to mention I can also write my own text editor with the functions I want cough
                I’m not demanding anything. I’m requesting something that seems useful, that other people seem to want as well, that I don’t think is hard to implement (especially if the lines are already sorted), and that is certainly not out of context, seeing the other line operations Notepad++ already offers. No need to go all smartass about it.

                  PeterJones
                  last edited by PeterJones Mar 2, 2018, 9:03 PM Mar 2, 2018, 9:02 PM

                  If you are using the 32bit Notepad++ and have PythonScript installed (or you grab and run the installer), you can run this PythonScript to delete adjacent duplicate rows:

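                  # remove duplicate lines (assumes lines already sorted, so only compares to previous line)
                  # uses the PythonScript plugin's editor and console objects

                  console.clear()
                  console.show()

                  # sentinel that should never equal a real first line
                  prev = "should not match previous"
                  lineNumber = 0

                  # getLineCount() is re-read every pass because deleting shrinks the document
                  while lineNumber < editor.getLineCount():
                      editor.gotoLine(lineNumber)
                      contents = editor.getLine(lineNumber)

                      # progress log: line number, length and text of the current line
                      console.write( "#" + str(lineNumber) + "/" + str(editor.getLineCount()) )
                      console.write( "#[" + str(len(contents)) + "]\t" + contents)

                      if contents == prev:
                          # same as the previous line: delete it and stay on this line number
                          console.write( "\tdeleting\n" )
                          editor.deleteLine(lineNumber)
                      else:
                          console.write( "\tno match\n" )
                          lineNumber = lineNumber + 1

                      prev = contents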
                  

                  running it on the following

                  # line 1
                  # this matches
                  # this matches
                  # ends up as line 3
                  # this matches
                  # ends up as line 5
                  

                  will result in

                  # line 1
                  # this matches
                  # ends up as line 3
                  # this matches
                  # ends up as line 5
                  

                  So even though the 5th input line matches the 2nd/3rd input lines, it won’t be deleted. This matches your requirement/specification of “it’s already sorted”
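
The same adjacent-only behaviour can be checked outside Notepad++ with plain Python. A minimal sketch using `itertools.groupby`, which groups consecutive equal items; `dedupe_adjacent` is a hypothetical name and nothing here touches the PythonScript API.

```python
from itertools import groupby


def dedupe_adjacent(lines):
    """Drop a line only when it equals the line directly before it."""
    # groupby yields one (key, group) pair per run of consecutive equal lines.
    return [key for key, _run in groupby(lines)]


if __name__ == "__main__":
    sample = [
        "# line 1",
        "# this matches",
        "# this matches",
        "# ends up as line 3",
        "# this matches",
        "# ends up as line 5",
    ]
    for line in dedupe_adjacent(sample):
        print(line)
```
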

                  oh, sorry. I just noticed again while re-reading that it’s because you’ve switched to NPP 64bit that you asked for this to begin with. Unfortunately, PythonScript isn’t there (yet) for 64bit. I know Claudia is working on converting the PythonScript plugin to 64bit, but she’s not there yet.

                  Someone (not me) might be able to convert my PythonScript solution into something that works with LuaScript, which runs under either 32bit or 64bit NPP. (I don’t know Lua at all, sorry.)

                  An alternative solution: since you know the functionality you need exists in the TextFX plugin in 32bit, grab a portable installation of the 32bit version and load that version of NPP in situations where you want to use TextFX, but use 64bit by default in most situations.

                    Scott Sumner @Juan Miguel Martínez
                    last edited by Scott Sumner Mar 4, 2018, 2:54 AM Mar 4, 2018, 2:52 AM

                    I took a look at TextFX and I don’t see a “remove duplicate lines” functionality. There is an ability to keep only unique lines upon doing a sort, but this is not the same thing…not everyone wants a sort along with duplicate-line removal. So recompiling TextFX for 64-bit isn’t going to do anything for providing what the OP is asking for. Maybe I’m wrong and I just missed seeing it in TextFX?

                    @Juan-Miguel-Martínez said:

                    I can…write my own text editor with the functions I want cough

                    I must admit that caused me to laugh out loud–guess it was the cough part on the end. :^)

                    …No need to go all smartass about it.

                    This comment caused me to stop laughing. This forum is all about alternative solutions to problems, and Jim/Guy/Peter were simply trying to illustrate some of those. They weren’t going “all smartass” as I interpret what they said. There is usually value in all contributions here, and I for one will be checking out Strawberry Perl (used Perl before, but not the fruit-flavored variety).

                    That all being said, if you already know alternatives exist and are just stating that you’d like to see a duplicate lines removal feature built right into Notepad++, then point taken, request noted. As to whether or not you will ever see that happen, I have no idea.
