Remove duplicate lines from unsorted file, keeping first

Alan Kilborn

I’ve seen some techniques here for using regular expressions to remove duplicate lines from an unsorted file, but these all seem to show how to keep the LAST occurrence of the duplicated line. I need to do this but keep the FIRST occurrence. Does anyone know how that can be done?

Scott Sumner

@Alan-Kilborn

So the Regex Cookbook (buy it!) gives a couple of regular-expression replacement scenarios for this:

Find-what zone: (?s)^([^\r\n]*)$(.*?)(?:(?:\r?\n|\r)\1$)+
OR
Find-what zone: (?-s)^(.*)$([\s\S]*?)(?:(?:\r?\n|\r)\1$)+
For either choice the Replace-with zone is set to \1\2

Some changes may be made to these expressions for more typical use in Notepad++:

Find-what zone: (?-s)^(.*)$(?s)(.*?)(?:\R\1$)+
Replace-with zone: \1\2
Wrap around checkbox: ticked
Action: Press Replace All button REPEATEDLY until status bar indicates: “Replace All: 0 occurrences were replaced.”
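For anyone wanting to try the same "replace until nothing is replaced" idea outside Notepad++, here is a minimal Python sketch. It is only an adaptation: it assumes "\n" line endings, uses the Cookbook-style character classes because Python's re module has no \R, and the names DUPLICATE_LINE and remove_duplicate_lines are made up for illustration.

```python
import re

# Cookbook-style pattern adapted for Python (assumes "\n" line endings):
#   group 1 = one whole line
#   group 2 = everything between that line and its next duplicate(s)
DUPLICATE_LINE = re.compile(r"^([^\n]*)$([\s\S]*?)(?:\n\1$)+", re.MULTILINE)


def remove_duplicate_lines(text):
    """Repeat the replacement until a pass replaces nothing.

    Returns the cleaned text and the number of passes that changed something.
    """
    passes = 0
    while True:
        text, count = DUPLICATE_LINE.subn(r"\1\2", text)
        if count == 0:
            return text, passes
        passes += 1


sample = "a\nb\na\nc\nb\na\n"
deduped, passes = remove_duplicate_lines(sample)
print(repr(deduped), passes)
```

The loop plays the role of pressing Replace All repeatedly: a single pass only removes the duplicates nearest to each first occurrence, so several passes are usually needed.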

But…there are some interesting things to note in the Cookbook regexes:

[^\r\n] may be used in place of “(?-s) with a later occurring .” to mean “any character but not including (or across) line-ending characters”
[\s\S] may be used in place of “(?s) with a later occurring .” to mean “any character at all (including line-ending characters)”

I like these as they put all the functionality in one place in the regex.
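A small Python sketch of the two equivalences, in case it helps. One assumption to be aware of: Python's dot (without DOTALL) only stops at "\n", so on CR LF data it still swallows the "\r", while [^\r\n] does not. The variable names are just for illustration.

```python
import re

lf_text = "ab\ncd"
crlf_text = "ab\r\ncd"

# "Any character, but not across line endings"
within_line_class = re.findall(r"[^\r\n]+", crlf_text)
within_line_dot = re.findall(r".+", crlf_text)  # Python's dot still matches "\r"

# "Any character at all, including line endings"
across_lines_class = re.findall(r"[\s\S]+", lf_text)
across_lines_dot = re.findall(r".+", lf_text, re.DOTALL)

print(within_line_class)   # ['ab', 'cd']
print(within_line_dot)     # ['ab\r', 'cd']
print(across_lines_class)  # ['ab\ncd']
print(across_lines_dot)    # ['ab\ncd']
```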

Alan Kilborn

Excellent. :)

Is it important that the last line have a line-ending on it if it is otherwise a duplicate of exactly one line that comes before it? I guess I will find out.

Thanks.

guy038

Hi, @alan-kilborn, @scott-sumner and All,

Another formulation of this regex could be:

SEARCH (?-s)^(.+\R)(?s).*?\K\1+

REPLACE EMPTY

OPTIONS Wrap around and Regular expression set

ACTION : Press the ALT + A shortcut ( Replace All ) repeatedly, until no more replacements occur

( Easy to memorize : \K as Kilborn ! )

So, for instance, from the initial text, below, with a line break after the last item 123 :

We get :

Scott, the regex which represents a single standard character is (?-s). and it can also be replaced by the exact negative class [^\r\n\f\x85\x{2028}\x{2029}]. So the regex [^\r\n] is just an easy approximation :-)

And the regex which represents any character at all is (?s). and it can also be replaced by any of the regexes [\s\S] , [\d\D] , [\l\L] , [\u\U] , [\w\W] , [\h\H] or [\v\V]

Scott, you said, too :

I like these as they put all the functionality in one place in the regex

But if you place the (?s) syntax inside round parentheses, along with a regex expression, then (?s) acts ONLY inside the part within those parentheses :-))

For instance, let’s consider the text :

blablah
A simple
abc
123
456
789
xyz
Test
blablah

And imagine the regex (?-s).+\R((?s)abc.+xyz\R).+

In the first and, above all, the last part of the regex, the dot . means a standard character
In the middle part, surrounded by parentheses, the dot means any character !
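The same scoping can be tried in Python, with one caveat: Python's re module does not accept a bare (?s) in the middle of a pattern, so this sketch swaps in the scoped form (?s:...) (Python 3.6+) and uses \n instead of \R. It is an approximation of the Notepad++ regex above, not a literal copy.

```python
import re

text = "blablah\nA simple\nabc\n123\n456\n789\nxyz\nTest\nblablah"

# Outside the (?s:...) group the dot stops at "\n";
# inside it, the dot also matches line endings.
pattern = re.compile(r".+\n((?s:abc.+xyz\n)).+")

match = pattern.search(text)
print(repr(match.group(0)))  # from "A simple" through "Test"
print(repr(match.group(1)))  # the multi-line block from "abc" to "xyz"
```

The first and last .+ each stay within a single line ("A simple" and "Test"), while the middle .+ runs across the lines 123, 456 and 789.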

Best Regards,

guy038

Scott Sumner

@guy038 said:

(?-s)^(.+\R)(?s).*?\K\1+

One of the things that I’m faced with is the need to move regular expressions between Notepad++ and Python code. As Python doesn’t support \K I lean toward the regexes for this that don’t contain it. Note that Python doesn’t support the \R syntax either, but I used it earlier. But as \R is just a simple abbreviation and \K can have bigger logic implications, I’m allowed a little license with the \R. :-D Of course, the longer “Cookbook” regexes have wider applicability than even N++ and Python.
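As a rough illustration of working around the missing \K in Python: whatever \K would have discarded from the match can be captured and put back in the replacement instead. The sketch below is an adaptation of the \K regex, assuming "\n" line endings and using a scoped (?s:...) group; KEEP_FIRST is a made-up name.

```python
import re

# Without \K: capture the first occurrence (group 1) and the text in
# between (group 2), then write both back so only the repeats are dropped.
KEEP_FIRST = re.compile(r"^(.+\n)((?s:.*?))\1+", re.MULTILINE)

text = "x\ny\nx\nx\nz\n"
result, count = KEEP_FIRST.subn(r"\1\2", text)
print(repr(result), count)
```

As in Notepad++, one pass is not always enough, so in real use the substitution would be repeated until count is 0.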

exact negative class [^\r\n\f\x85\x{2028}\x{2029}]

Well, I don’t care about any of those beyond the \n…but maybe somebody does! :-D

…ONLY, inside the part, within parentheses

Now THAT I didn’t know. Nice. But I still tend to like [\s\S] and [^\r\n] (and their close relatives that you pointed out).

AZJIO AZJIO

http://rgho.st/6GD58rS8H
See the sections AutoIt and Scripting.Dictionary.
If satisfied, I will do it in English

AZJIO AZJIO

English
http://rgho.st/8C8jFTw4b

Scott Sumner

@Alan-Kilborn said:

Is it important that the last line have a line-ending on it if it is otherwise a duplicate…?

Mostly No…but maybe ?

Say we start with this data:

value1[CR][LF]
value2[CR][LF]
value2[CR][LF]
value4[CR][LF]
value3[CR][LF]
value3[CR][LF]
value2[CR][LF]
value4

Notice that the last value4 does NOT have a line-ending on it.

Then, using the technique above, and after multiple Replace All actions, we are left with this:

value1[CR][LF]
value2[CR][LF]
value4[CR][LF]
value3

Thus, the final value4 was detected as a duplicate and removed, even though the line itself wasn’t an exact duplicate in the original data.

Note, however, that the new last line (containing value3) now does not have a line-ending…when all the value3 lines in the original file did.
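The same effect can be reproduced with a short Python sketch. Assumptions: "\n" stands in for [CR][LF] to keep the pattern simple, and the pattern is the Cookbook-style one adapted for Python's re module rather than the exact Notepad++ expression.

```python
import re

DUPLICATE_LINE = re.compile(r"^([^\n]*)$([\s\S]*?)(?:\n\1$)+", re.MULTILINE)

# Same data as above, with "\n" in place of [CR][LF];
# the final value4 has no line ending.
data = "value1\nvalue2\nvalue2\nvalue4\nvalue3\nvalue3\nvalue2\nvalue4"

count = 1
while count:  # equivalent to pressing Replace All until 0 occurrences
    data, count = DUPLICATE_LINE.subn(r"\1\2", data)

print(repr(data))
print(data.endswith("\n"))  # False: the new last line has no line ending
```

The regex removes the line ending that precedes each duplicate, not the one that follows it, which is why the missing line ending of the original last line ends up on the new last line.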

Sepehr E

@guy038,
Why does the (?-s)^(.+\R)(?s).*?\K\1+ regex sometimes make Notepad++ think the whole text is duplicated and needs to be deleted?

guy038

Hello, @sepehr-e,

I must admit that sometimes regexes involving a great amount of text may wrongly return a single match which spans the entire file contents :-(( This may also happen with regexes that contain recursive patterns !

I can’t clearly explain this behaviour. Maybe it’s related to a matched range of characters that exceeds a limit. It could also depend on the amount of RAM or on specific N++ features, like the periodic backup !

In practice, you could use the regex below, which adds another condition : the \1+ block of lines to be deleted must, most of the time, follow some end-of-line characters !

(?-s)^(.+\R)(?s).*?\R?\K\1+

You didn’t say anything about your working file, but it could help ?!

Note that the \R syntax must be optional ( => the form \R? ) to handle a block of consecutive identical lines, as for instance :

Best Regards,

guy038