Wanted function: Remove Duplicated Lines



  • Hi.

    When preparing master data it is often required to remove duplicated lines.
    Currently I have to use the command line for this operation: paste into NPP, sort the lines, remove empty lines, save to a txt file, run a command-line tool on the text file to remove duplicates, reload it into NPP, and continue editing.
    It would be nice to have such a function in NPP, somewhere under menu -> Edit -> Line Operations -> Remove Duplicated Lines (maybe with an “ignore case” option).

    Thank you.





  • Hello, @dmitry-Bond and All,

    First of all, we must agree on what Remove Duplicated Lines means !

    So, assuming the initial example text :

        bbbbbbb
        hhhhhhhhhhh
        fffffffffffffff
        bbbbbbb
        aaaaa
        bbbbbbb
        jj
        eeeeeeeeeeeeeeeeeeeeeeeeeee
        AAAaa
        ccccccccccccccccccccccccccccccccccccccccccccccc
        AAAAA
        ddd
        iiiiiiiiiiiiiiiii
        aaaaa
        hhhhhhhhhhh
        gggggggggggggggggggggggggggggggggggg
        AAAAA
        bbbbbbb
    
    • IF the search is insensitive to case, 3 lines are duplicated :

      • The line aaaaa ( 5 items )

      • The line bbbbbbb ( 4 items )

      • The line hhhhhhhhhhh ( 2 items )

    • IF the search is sensitive to case, 4 lines are duplicated :

      • The line aaaaa ( 2 items )

      • The line AAAAA ( 2 items )

      • The line bbbbbbb ( 4 items )

      • The line hhhhhhhhhhh ( 2 items )
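
    To double-check these counts outside the editor, here is a minimal Python sketch. It is not part of Notepad++, and the names SAMPLE and duplicated are made up for illustration. It tallies the example text with collections.Counter, once as-is and once after lowercasing every line.

```python
from collections import Counter

# The example text from this post, one list entry per line.
SAMPLE = """bbbbbbb
hhhhhhhhhhh
fffffffffffffff
bbbbbbb
aaaaa
bbbbbbb
jj
eeeeeeeeeeeeeeeeeeeeeeeeeee
AAAaa
ccccccccccccccccccccccccccccccccccccccccccccccc
AAAAA
ddd
iiiiiiiiiiiiiiiii
aaaaa
hhhhhhhhhhh
gggggggggggggggggggggggggggggggggggg
AAAAA
bbbbbbb""".splitlines()


def duplicated(lines, ignore_case=False):
    """Return {line: count} for every line that occurs more than once."""
    # Lowercasing is a simple stand-in for a case-insensitive comparison.
    keys = [line.lower() if ignore_case else line for line in lines]
    return {key: n for key, n in Counter(keys).items() if n > 1}


print(duplicated(SAMPLE))                    # case-sensitive tally
print(duplicated(SAMPLE, ignore_case=True))  # case-insensitive tally
```

    On the example text this reports four duplicated lines in the sensitive case and three in the insensitive one, with the item counts listed above.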

    I suppose that you probably want to get a final text with, at least, a single item of all these duplicated lines :-)) If you really want to delete ALL the lines, which are duplicated, just tell me !

    This task can be easily done with a search/replacement, using Regular expressions ! So :


    • Open your file in N++

    • Open the Replace dialog ( Ctrl + H )

    • IF your file is already SORTED :

      • Type in (?-is)^(.+\R)\1+ ( search sensitive to case ) OR (?i-s)^(.+\R)\1+ ( search insensitive to case )

      • Type in \1 in the Replace with: zone

    • IF your file is an UNSORTED file :

      • Type in (?s-i)^(.+?\R)(?=(?:.+\R)?\1) ( search sensitive to case ) OR (?si)^(.+?\R)(?=(?:.+\R)?\1) ( search insensitive to case )

      • Leave the Replace with: zone EMPTY

    • Tick the Wrap around option and the Regular expression search mode

    • Click on the Replace button several times, or once only on the Replace All button

    Et voilà !

    Remark: When processing an unsorted file, all duplicated lines but the last are deleted !


    So, from the initial example text ( see above ), let’s perform a pre-sort operation ( Edit > Line operations > Sort lines lexicographically Ascending ), we get the sorted text, below :

        AAAAA
        AAAAA
        AAAaa
        aaaaa
        aaaaa
        bbbbbbb
        bbbbbbb
        bbbbbbb
        bbbbbbb
        ccccccccccccccccccccccccccccccccccccccccccccccc
        ddd
        eeeeeeeeeeeeeeeeeeeeeeeeeee
        fffffffffffffff
        gggggggggggggggggggggggggggggggggggg
        hhhhhhhhhhh
        hhhhhhhhhhh
        iiiiiiiiiiiiiiiii
        jj
    

    Then, the search = (?-is)^(.+\R)\1+ and replacement = \1 would change text into :

        AAAAA
        AAAaa
        aaaaa
        bbbbbbb
        ccccccccccccccccccccccccccccccccccccccccccccccc
        ddd
        eeeeeeeeeeeeeeeeeeeeeeeeeee
        fffffffffffffff
        gggggggggggggggggggggggggggggggggggg
        hhhhhhhhhhh
        iiiiiiiiiiiiiiiii
        jj
    

    Whereas the search = (?i-s)^(.+\R)\1+ and replacement = \1 would change text into :

        AAAAA
        bbbbbbb
        ccccccccccccccccccccccccccccccccccccccccccccccc
        ddd
        eeeeeeeeeeeeeeeeeeeeeeeeeee
        fffffffffffffff
        gggggggggggggggggggggggggggggggggggg
        hhhhhhhhhhh
        iiiiiiiiiiiiiiiii
        jj
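
    The sorted-file regex can also be tried in a script. The sketch below is an approximation in Python, not the Notepad++ engine: Python's re module has no \R, so \r?\n is used instead, which assumes LF or CRLF line endings and a line break after the last line. The name dedupe_sorted is hypothetical.

```python
import re

# Same idea as ^(.+\R)\1+ : a line followed by one or more exact copies of itself.
# \r?\n replaces \R, which Python's `re` does not support (assumption: LF/CRLF only).
ADJACENT_DUPLICATES = r"^(.+\r?\n)\1+"


def dedupe_sorted(text, ignore_case=False):
    """Collapse each run of identical adjacent lines to its first line."""
    flags = re.MULTILINE
    if ignore_case:
        flags |= re.IGNORECASE  # the backreference \1 then also ignores case
    return re.sub(ADJACENT_DUPLICATES, r"\1", text, flags=flags)


# A shortened version of the sorted example text.
sorted_text = "AAAAA\nAAAAA\nAAAaa\naaaaa\naaaaa\nbbbbbbb\nbbbbbbb\nddd\n"
print(dedupe_sorted(sorted_text), end="")
print("---")
print(dedupe_sorted(sorted_text, ignore_case=True), end="")
```

    As in the editor, the first line of each run is the one that survives, which is why the insensitive variant keeps AAAAA.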
    

    Now, if we use the initial text, without any sort operation :

    The search = (?s-i)^(.+?\R)(?=(?:.+\R)?\1) and replacement = EMPTY would give :

        fffffffffffffff
        jj
        eeeeeeeeeeeeeeeeeeeeeeeeeee
        AAAaa
        ccccccccccccccccccccccccccccccccccccccccccccccc
        ddd
        iiiiiiiiiiiiiiiii
        aaaaa
        hhhhhhhhhhh
        gggggggggggggggggggggggggggggggggggg
        AAAAA
        bbbbbbb
    

    Whereas the search = (?si)^(.+?\R)(?=(?:.+\R)?\1) and replacement = EMPTY would give :

        fffffffffffffff
        jj
        eeeeeeeeeeeeeeeeeeeeeeeeeee
        ccccccccccccccccccccccccccccccccccccccccccccccc
        ddd
        iiiiiiiiiiiiiiiii
        hhhhhhhhhhh
        gggggggggggggggggggggggggggggggggggg
        AAAAA
        bbbbbbb
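
    For comparison, the "keep the last occurrence" result on unsorted text can be reproduced without a regex. The following Python sketch swaps in a different technique, a single backwards pass with a set, so it only illustrates the outcome and not how the regex engine gets there. keep_last is a made-up name.

```python
def keep_last(lines, ignore_case=False):
    """Drop every line that reappears later; survivors keep their order."""
    seen = set()
    kept = []
    # Walk from the bottom so the first time a line is met is its LAST occurrence.
    for line in reversed(lines):
        key = line.lower() if ignore_case else line
        if key not in seen:
            seen.add(key)
            kept.append(line)
    kept.reverse()  # restore top-to-bottom order
    return kept


# A condensed version of the unsorted example text.
lines = ["bbb", "hhh", "fff", "bbb", "aaa", "AAa", "AAA", "aaa", "hhh", "AAA", "bbb"]
print(keep_last(lines))
print(keep_last(lines, ignore_case=True))
```

    Each line is looked up in the set once, so the work grows roughly in proportion to the number of lines.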
    

    Remarks :

    • In a previously sorted file, the regexes keep the first duplicate found

    • In an unsorted file, the regexes keep the last duplicate found

    • If you want to delete the pure blank lines, as well :

      • Add |^\R at the end of any search regex

      • Replace the non-empty replacement \1 with the syntax ?1\1
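
    A rough Python equivalent of this blank-line variant, for the sorted case, might look as follows. Python's re has neither \R nor conditional replacement strings, so this sketch assumes \r?\n line endings and uses a replacement function in place of ?1\1. The function name is hypothetical.

```python
import re

# First alternative: a run of adjacent duplicate lines (group 1 = the line to keep).
# Second alternative: an empty line. \r?\n stands in for \R (assumption: LF/CRLF).
PATTERN = re.compile(r"^(.+\r?\n)\1+|^\r?\n", re.MULTILINE)


def dedupe_sorted_and_strip_blank(text):
    """Collapse adjacent duplicate lines and delete empty lines in one pass."""
    # Mimics ?1\1: emit group 1 when it took part in the match, else nothing.
    return PATTERN.sub(lambda m: m.group(1) or "", text)


print(dedupe_sorted_and_strip_blank("a\na\n\nb\n\n\nc\n"), end="")
```

    When the empty-line alternative matches, group 1 is unset, so the replacement is empty and the blank line disappears.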

    I could give you some explanations about these regexes, or other topics, next time, if you want to :-))

    Best Regards,

    guy038



  • I want to add my grain of salt about this: the regex solution works, but it’s painfully slow for anything beyond a hundred lines. I added one of the regex solutions (not sure which one) as a macro, and I regularly use it on a ~6k-line text, where it takes long enough to be annoying, even on a 4 GHz CPU. Right now I had to run it on a 120k-line text and it’s going on its tenth minute. TextFX did stuff like this almost instantly (or in a reasonable amount of time for large texts, anyway), but there’s no 64-bit build of it.

    I have also read that the devs aren’t too keen on adding functions when a regex can do the work, but in this case “can do the work” is very open to interpretation in the context of a text editor that is capable of working with very large files. The running time of the regex solution (or, better called, workaround) doesn’t increase linearly with the number of lines; it seems exponential.

    Right now my options are:

    • Reinstall 32bit Notepad++, or
    • Keep using the really slow regex “solution”.

    Which IMHO is a really bad pair of choices when this function, which shouldn’t be that bad to code, could be in a simple plugin, or as added functionality in the Edit-> Operations with lines section.

    So, will someone attend to our pleas, please?

    Oh, my regex hasn’t ended yet…



  • @Juan-Miguel-Martínez You have other options:

    • Learn a scripting language to perform tasks like this.
    • Use a different editor that is more to your liking.
    • Find the TextFX source code and build a 64-bit version.

    IMHO, people far too often expect an editor to do things that are done much more easily by tools that have been around for many years. For example, this simple AWK program will print the unique lines in a file (no sorting needed):

    { if (L[$0]++ == 0) print }
    

    And, I expect it would be very efficient, even on your 120K line input file.

    Now, consider a couple of variations. Suppose you wanted only the duplicated lines, each printed once; this would do the trick:

    { if (L[$0]++ == 1) print }
    

    And, finally, if you wanted to see all of the duplicated lines (including duplicates of the duplicates), this would work:

    { if (L[$0]++ >= 1) print }
    

    I’m sure the Python and PERL experts can demonstrate similar capabilities.
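
    As one possible Python counterpart to the three AWK programs, here is a small sketch. The helper name awk_style_filter and the sample list are invented for illustration. It keeps a running count per line and prints a line depending on how many times it has been seen before, just as L[$0]++ does.

```python
from collections import Counter


def awk_style_filter(lines, should_print):
    """Yield each line whose earlier-occurrence count satisfies should_print."""
    seen = Counter()
    for line in lines:
        if should_print(seen[line]):  # value BEFORE the increment, like L[$0]++
            yield line
        seen[line] += 1


lines = ["x", "y", "x", "z", "x", "y"]

unique_lines = list(awk_style_filter(lines, lambda n: n == 0))   # == 0 variant
first_repeats = list(awk_style_filter(lines, lambda n: n == 1))  # == 1 variant
all_repeats = list(awk_style_filter(lines, lambda n: n >= 1))    # >= 1 variant

print(unique_lines)
print(first_repeats)
print(all_repeats)
```

    Only the test on the count changes between the three variants, which mirrors how the AWK one-liners differ.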

    Please consider learning a bit of AWK, Python, PERL, or some other such scripting language if you need to manipulate files in various ways that seem a bit difficult or contorted using the editor’s search and replace capabilities.



  • perl -ne "print unless $seen{$_}++" dups.txt

    (I recommend easy-to-install strawberry perl for windows perl usage)



  • You forgot to mention I can also write my own text editor with the functions I want cough
    I’m not demanding anything, I’m requesting something that seems useful, other people seem to want it as well, and I don’t think it’s hard to implement (especially if the lines are already sorted), and it’s certainly not out of place, seeing the other line operations Notepad++ already offers. No need to go all smartass about it.



  • If you are using the 32bit Notepad++ and have PythonScript installed (or you grab and run the installer), you can run this PythonScript to delete adjacent duplicate rows

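    # remove duplicate lines (assumes lines already sorted, so only compares to previous line)
    # PythonScript usually predefines editor and console; the import just makes that explicit
    from Npp import editor, console
    
    console.clear()
    console.show()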
    
    prev = "should not match previous"
    lineNumber = 0
    
    while lineNumber < editor.getLineCount():
        editor.gotoLine(lineNumber)
        contents = editor.getLine(lineNumber)
    
        console.write( "#" + str(lineNumber) + "/" + str(editor.getLineCount()) )
        console.write( "#[" + str(len(contents)) + "]\t" + contents)
    
        if contents == prev:
            console.write( "\tdeleting\n" )
            editor.deleteLine(lineNumber)
        else:
            console.write( "\tno match\n" )
            lineNumber = lineNumber + 1
    
        prev = contents
    

    running it on the following

    # line 1
    # this matches
    # this matches
    # ends up as line 3
    # this matches
    # ends up as line 5
    

    will result in

    # line 1
    # this matches
    # ends up as line 3
    # this matches
    # ends up as line 5
    

    So even though the 5th input line matches the 2nd/3rd input lines, it won’t be deleted. This matches your requirement/specification of “it’s already sorted”
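
    The same adjacent-only comparison can be checked outside Notepad++. This is a plain-Python sketch using itertools.groupby rather than the PythonScript editor API, and the function name is made up. It runs on the six sample lines above.

```python
from itertools import groupby


def drop_adjacent_duplicates(lines):
    """Keep the first line of every run of identical consecutive lines."""
    # groupby only groups neighbours, so a repeat further down is untouched.
    return [line for line, _run in groupby(lines)]


sample = [
    "# line 1",
    "# this matches",
    "# this matches",
    "# ends up as line 3",
    "# this matches",
    "# ends up as line 5",
]

for line in drop_adjacent_duplicates(sample):
    print(line)
```

    The later "# this matches" survives here too, because it is not next to the earlier pair.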

    Oh, sorry. I just noticed while re-reading that you asked for this precisely because you’ve switched to 64-bit NPP. Unfortunately, PythonScript isn’t there (yet) for 64-bit. I know Claudia is working on converting the PythonScript plugin to 64-bit, but she’s not there yet.

    Someone (not me) might be able to convert my PythonScript solution into something that works with LuaScript, which runs under either 32bit or 64bit NPP. (I don’t know Lua at all, sorry.)

    An alternative solution: since you know the functionality you need exists in the TextFX plugin for 32-bit, grab a portable installation of the 32-bit version and load that version of NPP in situations where you want to use TextFX, but use 64-bit by default in most situations.



  • I took a look at TextFX and I don’t see a “remove duplicate lines” functionality. There is an ability to keep only unique lines upon doing a sort, but this is not the same thing…not everyone wants a sort along with duplicate-line removal. So recompiling TextFX for 64-bit isn’t going to do anything for providing what the OP is asking for. Maybe I’m wrong and I just missed seeing it in TextFX?

    @Juan-Miguel-Martínez said:

    I can…write my own text editor with the functions I want cough

    I must admit that caused me to laugh out loud–guess it was the cough part on the end. :^)

    …No need to go all smartass about it.

    This comment caused me to stop laughing. This forum is all about alternative solutions to problems, and Jim/Guy/Peter were simply trying to illustrate some of those. They weren’t going “all smartass” as I interpret what they said. There is usually value in all contributions here, and I for one will be checking out Strawberry Perl (used Perl before, but not the fruit-flavored variety).

    That all being said, if you already know alternatives exist and are just stating that you’d like to see a duplicate lines removal feature built right into Notepad++, then point taken, request noted. As to whether or not you will ever see that happen, I have no idea.

