Remove duplicate lines from an unsorted file, keeping the first occurrence



  • I’ve seen some techniques here for using regular expressions to remove duplicate lines from an unsorted file, but these all seem to show how to keep the LAST occurrence of the duplicated line. I need to do this but keep the FIRST occurrence. Does anyone know how that can be done?



  • @Alan-Kilborn

    So the Regex Cookbook (buy it!) gives a couple of regular-expression replacement scenarios for this:

    Find-what zone: (?s)^([^\r\n]*)$(.*?)(?:(?:\r?\n|\r)\1$)+
    OR
    Find-what zone: (?-s)^(.*)$([\s\S]*?)(?:(?:\r?\n|\r)\1$)+
    For either choice the Replace-with zone is set to \1\2

    Some changes may be made to these expressions for more typical use in Notepad++:

    Find-what zone: (?-s)^(.*)$(?s)(.*?)(?:\R\1$)+
    Replace-with zone: \1\2
    Wrap around checkbox: ticked
    Action: Press the Replace All button REPEATEDLY until the status bar indicates: “Replace All: 0 occurrences were replaced.”

    But…there are some interesting things to note in the Cookbook regexes:

    • [^\r\n] may be used in place of "(?-s) with a later occurring ." to mean “any character but not including (or across) line-ending characters”
    • [\s\S] may be used in place of "(?s) with a later occurring ." to mean “any character at all (including line-ending characters)”

    I like these as they put all the functionality in one place in the regex.
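
    For anyone who wants to try this outside Notepad++, here is a minimal Python sketch of the "Replace All until zero occurrences" loop, using the first Cookbook expression. It assumes LF-only line endings, because Python's $ in multiline mode only recognizes \n, and the function name remove_duplicate_lines is made up for illustration.

```python
import re

# First Cookbook expression, unchanged. re.MULTILINE makes ^ and $ work per
# line, which Notepad++ does by default.
DUPLICATE_LINE = re.compile(
    r"(?s)^([^\r\n]*)$(.*?)(?:(?:\r?\n|\r)\1$)+", re.MULTILINE
)

def remove_duplicate_lines(text):
    """Repeat the substitution until a pass replaces nothing."""
    while True:
        # Group 1 is the first occurrence, group 2 the text between it and
        # the next run of identical lines; that run itself is dropped.
        text, count = DUPLICATE_LINE.subn(r"\1\2", text)
        if count == 0:
            return text

sample = "123\n456\n123\n789\n789\n000\n789\n456\nabc\n"
result = remove_duplicate_lines(sample)
```

    Each pass only removes the nearest run of duplicates for a given line, which is why a single Replace All is usually not enough.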



  • Excellent. :)

    Is it important that the last line have a line-ending on it if it is otherwise a duplicate of exactly one line that comes before it? I guess I will find out.

    Thanks.



  • Hi, @alan-kilborn, @scott-sumner and All,

    Another formulation of this regex could be:

    SEARCH (?-s)^(.+\R)(?s).*?\K\1+

    REPLACE EMPTY

    OPTIONS Wrap around and Regular expression set

    ACTION: Press the ALT + A shortcut (Replace All) repeatedly, until no more occurrences are replaced

    ( Easy to memorize : \K as Kilborn ! )


    So, for instance, starting from the text below, which has a line break after the last item 123:

    123
    456
    123
    789
    789
    000
    789
    456
    abc
    123
    123
    456
    456
    456
    789
    999
    123
    
    

    We get :

    123
    456
    789
    000
    abc
    999
    

    Scott, the regex that represents a single standard character is (?-s). (dot included). It can also be replaced by the exact negative class [^\r\n\f\x85\x{2028}\x{2029}]. So the regex [^\r\n] is just an easy approximation :-)

    And the regex that represents any character at all is (?s). (dot included). It can also be replaced by any of the regexes [\s\S], [\d\D], [\l\L], [\u\U], [\w\W], [\h\H] or [\v\V]
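
    If you carry these classes over to Python, a quick check like the one below may help. It is only a sketch for Python's re module, not for Notepad++: re writes the Unicode separators as \u2028 and \u2029 rather than \x{2028}, its plain dot rejects only \n, and the sample string is made up for the test.

```python
import re

# One ordinary letter at each end, every line-break-like character between.
sample = "a\r\n\f\x85\u2028\u2029z"

# Each complementary pair accepts every character, line breaks included.
any_char_classes = [r"[\s\S]", r"[\d\D]", r"[\w\W]"]
counts = [len(re.findall(cls, sample)) for cls in any_char_classes]

# Python's plain dot (no DOTALL flag) rejects only "\n".
plain_dot = re.findall(r".", sample)

# The explicit negated class also rejects the other line-break characters.
strict = re.findall(r"[^\r\n\f\x85\u2028\u2029]", sample)
```

    So in Python the explicit negated class is the closer match to what the dot does in Notepad++.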


    Scott, you said, too :

    I like these as they put all the functionality in one place in the regex

    But if you place the (?s) syntax inside round parentheses, along with a regex, (?s) acts, ONLY, inside the part, within parentheses :-))

    For instance, let’s consider the text :

    blablah
    A simple
    abc
    123
    456
    789
    xyz
    Test
    blablah
    

    And imagine the regex (?-s).+\R((?s)abc.+xyz\R).+

    • In the first and, above all, the last part of the regex, the dot . means a standard character

    • In the middle part, surrounded by parentheses, the dot means any character !
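
    The same idea can be tried in Python, with one difference in spelling: Python's re module writes a scoped flag as (?s:...) and, in recent versions, rejects a bare (?s) in the middle of a pattern. The sketch below is therefore an adaptation of the regex above, not a copy, and it uses \n in place of \R.

```python
import re

text = "blablah\nA simple\nabc\n123\n456\n789\nxyz\nTest\nblablah\n"

# Outside the (?s:...) group the dot stops at "\n";
# inside it the dot also matches line breaks.
pattern = re.compile(r".+\n((?s:abc.+xyz\n)).+")

match = pattern.search(text)
whole = match.group(0)   # one line before, the block, one line after
block = match.group(1)   # the multi-line part matched by the scoped dot
```

    The first and last .+ each stay on a single line (A simple and Test), while the middle .+ runs across four line breaks.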

    Best Regards,

    guy038



  • @guy038 said:

    (?-s)^(.+\R)(?s).*?\K\1+

    One of the things that I’m faced with is the need to move regular expressions between Notepad++ and Python code. As Python doesn’t support \K I lean toward the regexes for this that don’t contain it. Note that Python doesn’t support the \R syntax either, but I used it earlier. But as \R is just a simple abbreviation and \K can have bigger logic implications, I’m allowed a little license with the \R. :-D Of course, the longer “Cookbook” regexes have wider applicability than even N++ and Python.

    exact negative class [^\r\n\f\x85\x{2028}\x{2029}]

    Well, I don’t care about any of those beyond the \n…but maybe somebody does! :-D

    …ONLY, inside the part, within parentheses

    Now THAT I didn’t know. Nice. But I still tend to like [\s\S] and [^\r\n] (and their close relatives that you pointed out).
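
    For the record, here is one way the \K version might be carried over to Python. This is a sketch under a few assumptions: LF line endings, \R written as \n, and the text that \K would skip captured as group 2 and written back. The ^ in front of the back-reference is an addition that is not in the original regex; without it, a captured line such as 23 could match the tail of a later line 123. The names PORTED and one_pass are made up.

```python
import re

# Original:  (?-s)^(.+\R)(?s).*?\K\1+      replace with: (empty)
# Port:      group 2 captures what \K would have skipped, and the
#            replacement writes groups 1 and 2 back.
PORTED = re.compile(r"^(.+\n)((?s:.*?))(?:^\1)+", re.MULTILINE)

def one_pass(text):
    """One 'Replace All' pass; repeat it until the text stops changing."""
    return PORTED.sub(r"\1\2", text)

deduped = one_pass("a\nb\na\nc\n")
untouched = one_pass("23\n123\n")
```

    As in Notepad++, one pass is not always enough, so in practice the call would sit in a loop.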



  • http://rgho.st/6GD58rS8H
    See the sections AutoIt and Scripting.Dictionary.
    If that is useful, I will write it up in English
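
    The linked page is not reproduced here, so what follows is only a guess at the general idea of a dictionary-based approach, written in Python rather than AutoIt: remember every line already seen and keep a line only the first time it shows up. It is a swapped-in technique, not a regex, and the function name dedupe_keep_first is hypothetical.

```python
def dedupe_keep_first(text):
    """Keep the first occurrence of each line, preserving the original order."""
    seen = set()
    kept = []
    for line in text.splitlines(keepends=True):
        # Compare lines without their line ending, so a final line that
        # lacks one still counts as a duplicate of an earlier line.
        key = line.rstrip("\r\n")
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return "".join(kept)

data = "value1\r\nvalue2\r\nvalue2\r\nvalue4\r\nvalue3\r\nvalue3\r\nvalue2\r\nvalue4"
result = dedupe_keep_first(data)
```

    This runs in a single pass and does not depend on how far apart the duplicates are, which may matter for large files. Note that it also collapses repeated empty lines into one.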





  • @Alan-Kilborn said:

    Is it important that the last line have a line-ending on it if it is otherwise a duplicate…?

    Mostly no…but maybe?

    Say we start with this data:

    value1[CR][LF]
    value2[CR][LF]
    value2[CR][LF]
    value4[CR][LF]
    value3[CR][LF]
    value3[CR][LF]
    value2[CR][LF]
    value4
    

    Notice that the last value4 does NOT have a line-ending on it.

    Then, using the technique above, and after multiple Replace All actions, we are left with this:

    value1[CR][LF]
    value2[CR][LF]
    value4[CR][LF]
    value3
    

    Thus, the final value4 was detected as a duplicate and removed, even though the line itself wasn’t an exact duplicate in the original data.

    Note, however, that the new last line (containing value3) now does not have a line-ending…when all the value3 lines in the original file did.
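
    A small Python experiment suggests the answer depends on where the line break sits in the regex. This is a sketch with LF-only data (Python's $ does not treat \r as a line end) and it tests Python ports, not Notepad++ itself. The second pattern is a \K-free adaptation of the (.+\R) style regex, with an added ^ before the back-reference, and the helper name replace_until_stable is made up.

```python
import re

# Cookbook style: the line break sits in front of the back-reference.
BREAK_BEFORE = re.compile(
    r"(?s)^([^\r\n]*)$(.*?)(?:(?:\r?\n|\r)\1$)+", re.MULTILINE
)
# (.+\R) style without \K: the line break is part of group 1.
BREAK_INSIDE = re.compile(r"^(.+\n)((?s:.*?))(?:^\1)+", re.MULTILINE)

def replace_until_stable(pattern, text):
    """Repeat the substitution until a pass replaces nothing."""
    count = 1
    while count:
        text, count = pattern.subn(r"\1\2", text)
    return text

# Same data as above, with the last value4 lacking a line ending.
data = "value1\nvalue2\nvalue2\nvalue4\nvalue3\nvalue3\nvalue2\nvalue4"

cookbook_result = replace_until_stable(BREAK_BEFORE, data)
port_result = replace_until_stable(BREAK_INSIDE, data)
```

    With the line break in front of the back-reference, the final value4 is removed and value3 ends up without a line ending, as described above. With the line break inside group 1, the final value4 is not an exact copy of value4 plus line break, so it stays.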



  • @guy038,
    Why does (?-s)^(.+\R)(?s).*?\K\1+ sometimes make Notepad++ think the whole text is duplicated and needs to be deleted?



  • Hello, @sepehr-e,

    I must admit that regexes which have to span a great amount of text may sometimes, wrongly, produce a single match that covers the whole file contents :-(( This may also happen with regexes that contain recursive patterns!

    I can’t clearly explain this behaviour. Maybe it’s related to a matched range of characters that exceeds some limit. It could also depend on the amount of RAM, or on specific N++ features, like the periodic backup!


    In practice, you could use the regex below, which adds another condition: the \1+ block of lines to be deleted must, most of the time, follow some end-of-line characters!

    (?-s)^(.+\R)(?s).*?\R?\K\1+

    You didn’t say anything about your working file, but it could help.

    Note that the \R must be optional ( => the form \R? ) to handle a block of consecutive identical lines, as for instance:

    456
    789
    123
    123
    123
    123
    000
    

    Best Regards,

    guy038

