Remove duplicate lines from unsorted, keeping first
-
I’ve seen some techniques here for using regular expressions to remove duplicate lines from an unsorted file, but these all seem to show how to keep the LAST occurrence of the duplicated line. I need to do this but keep the FIRST occurrence. Does anyone know how that can be done?
-
So the Regex Cookbook (buy it!) gives a couple of regular-expression replacement scenarios for this:
Find-what zone: (?s)^([^\r\n]*)$(.*?)(?:(?:\r?\n|\r)\1$)+
OR
Find-what zone: (?-s)^(.*)$([\s\S]*?)(?:(?:\r?\n|\r)\1$)+
For either choice the Replace-with zone is set to \1\2
Some changes may be made to these expressions for more typical use in Notepad++:
Find-what zone: (?-s)^(.*)$(?s)(.*?)(?:\R\1$)+
Replace-with zone: \1\2
Wrap around checkbox: ticked
Action: Press Replace All button REPEATEDLY until status bar indicates: “Replace All: 0 occurrences were replaced.”
But…there are some interesting things to note in the Cookbook regexes:
[^\r\n] may be used in place of “(?-s) with a later occurring .” to mean “any character but not including (or across) line-ending characters”
[\s\S] may be used in place of “(?s) with a later occurring .” to mean “any character at all (including line-ending characters)”
I like these as they put all the functionality in one place in the regex.
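As a rough illustration outside Notepad++, the repeated Replace All can be sketched in Python. This is only a sketch under assumptions: the text uses \n line endings (Python's re has no \R), and the names DUP_LINE and dedupe_keep_first are made up for the example.

```python
import re

# Python rendering of the Notepad++ variant above, written with the Cookbook's
# [^\r\n] and [\s\S] so that no (?s)/(?-s) switching is needed.
# Assumption: lines end with "\n", which stands in for \R here.
DUP_LINE = re.compile(r"^([^\r\n]*)$([\s\S]*?)(?:\n\1$)+", re.MULTILINE)

def dedupe_keep_first(text):
    """Repeat the replacement until nothing is replaced,
    like pressing Replace All until it reports 0 occurrences."""
    while True:
        text, count = DUP_LINE.subn(r"\1\2", text)
        if count == 0:
            return text

sample = "b\na\nb\nc\na\n"
result = dedupe_keep_first(sample)
print(repr(result))  # 'b\na\nc\n'
```

The loop is needed for the same reason as in Notepad++: one pass only removes the duplicates of whichever line each match happens to start on.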
-
Excellent. :)
Is it important that the last line have a line-ending on it if it is otherwise a duplicate of exactly one line that comes before it? I guess I will find out.
Thanks.
-
Hi, @alan-kilborn, @scott-sumner and All,
Another formulation of this regex could be :
SEARCH (?-s)^(.+\R)(?s).*?\K\1+
REPLACE EMPTY
OPTIONS Wrap around and Regular expression set
ACTION : Click, repeatedly, on the ALT + A shortcut ( Replace All )
( Easy to memorize : \K as Kilborn ! )
So, for instance, from the initial text, below, with a line break after the last item 123 :
123
456
123
789
789
000
789
456
abc
123
123
456
456
456
789
999
123
We get :
123
456
789
000
abc
999
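The same example can be tried in Python, with some care: Python's re module has neither \K nor \R, so the sketch below captures the skipped text in a second group and puts it back, and uses \n in place of \R. The names DUP_BLOCK and remove_later_duplicates are hypothetical, and the ^ in front of the backreference is an addition that is not in the regex above.

```python
import re

# Group 1 is the first copy of a line (with its line break), group 2 is the
# text that \K would have skipped, and ^\1+ is the later run of copies.
# Assumptions: "\n" line endings; the ^ before \1 is added so that a copy
# can only match as a whole line, not as the tail of a longer line.
DUP_BLOCK = re.compile(r"^(.+\n)([\s\S]*?)^\1+", re.MULTILINE)

def remove_later_duplicates(text):
    # Repeat, like clicking Replace All until nothing changes.
    while True:
        text, count = DUP_BLOCK.subn(r"\1\2", text)
        if count == 0:
            return text

lines = ["123", "456", "123", "789", "789", "000", "789", "456", "abc",
         "123", "123", "456", "456", "456", "789", "999", "123"]
text = "".join(line + "\n" for line in lines)  # line break after the last item
cleaned = remove_later_duplicates(text)
print(cleaned.split())  # ['123', '456', '789', '000', 'abc', '999']
```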
Scott, the regex (?-s). represents a single standard character. It can, also, be replaced by the exact negative class [^\r\n\f\x85\x{2028}\x{2029}] . So, the regex [^\r\n] is just an easy approximation :-)
And the regex (?s). represents any character, at all. It can, also, be replaced by any of the regexes [\s\S] , [\d\D] , [\l\L] , [\u\U] , [\w\W] , [\h\H] or [\v\V]
Scott, you said, too :
I like these as they put all the functionality in one place in the regex
But, if you place the (?s) syntax inside round parentheses, along with a regex expression, (?s) acts, ONLY, inside the part within parentheses :-))
For instance, let’s consider the text :
blablah A simple abc 123 456 789 xyz Test blablah
And imagine the regex (?-s).+\R((?s)abc.+xyz\R).+
- In the first and, above all, the last part of the regex, the dot . means a standard character
- In the middle part, surrounded by parentheses, the dot means any character !
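For comparison, here is a small Python sketch of the same idea. Two assumptions are baked in: the sample text is taken to be one item per line (the layout is a guess), and \n stands in for \R. Python's re does not scope a bare (?s) inside parentheses (recent versions reject it unless it sits at the very start of the pattern), so the scoped form (?s:...) is used instead.

```python
import re

# Assumed layout of the sample text: one item per line, "\n" line endings.
text = "blablah\nA simple\nabc\n123\n456\n789\nxyz\nTest\nblablah\n"

# Outside the group the dot stops at line breaks (default behaviour);
# inside (?s:...) the dot also matches line breaks.
pattern = re.compile(r".+\n((?s:abc.+xyz\n)).+")

m = pattern.search(text)
print(repr(m.group(0)))  # 'A simple\nabc\n123\n456\n789\nxyz\nTest'
print(repr(m.group(1)))  # 'abc\n123\n456\n789\nxyz\n'
```

The first and last .+ each stay on one line, while the middle .+ runs across several lines to reach xyz.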
Best Regards,
guy038
-
@guy038 said:
(?-s)^(.+\R)(?s).*?\K\1+
One of the things that I’m faced with is the need to move regular expressions between Notepad++ and Python code. As Python doesn’t support \K I lean toward the regexes for this that don’t contain it. Note that Python doesn’t support the \R syntax either, but I used it earlier. But as \R is just a simple abbreviation and \K can have bigger logic implications, I’m allowed a little license with the \R. :-D Of course, the longer “Cookbook” regexes have wider applicability than even N++ and Python.
@guy038 said:
exact negative class [^\r\n\f\x85\x{2028}\x{2029}]
Well, I don’t care about any of those beyond the \n…but maybe somebody does! :-D
@guy038 said:
…ONLY, inside the part, within parentheses
Now THAT I didn’t know. Nice. But I still tend to like [\s\S] and [^\r\n] (and their close relatives that you pointed out).
-
http://rgho.st/6GD58rS8H
See the sections AutoIt and Scripting.Dictionary.
If satisfied, I will do it in English
-
English
http://rgho.st/8C8jFTw4b
-
@Alan-Kilborn said:
Is it important that the last line have a line-ending on it if it is otherwise a duplicate…?
Mostly No…but maybe ?
Say we start with this data:
value1[CR][LF]
value2[CR][LF]
value2[CR][LF]
value4[CR][LF]
value3[CR][LF]
value3[CR][LF]
value2[CR][LF]
value4
Notice that the last value4 does NOT have a line-ending on it.
Then, using the technique above, and after multiple Replace All actions, we are left with this:
value1[CR][LF]
value2[CR][LF]
value4[CR][LF]
value3
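This before/after can be reproduced with a short Python sketch. It is an approximation of the Notepad++ regex, not the same engine: Python's $ (with re.MULTILINE) only matches just before "\n", so the end of a line is written out as a lookahead that accepts "\r\n", "\n" or the end of the text. The names EOL, DUP_LINE_CRLF and dedupe_keep_first_crlf are made up for the example.

```python
import re

# End of line: just before an optional "\r" plus "\n", or at the end of the text.
EOL = r"(?=\r?\n|\Z)"

# Group 1: a whole line; group 2: the text in between;
# then one or more later copies of that line, each with its preceding line break.
DUP_LINE_CRLF = re.compile(
    r"^([^\r\n]*)" + EOL + r"([\s\S]*?)(?:\r?\n\1" + EOL + r")+",
    re.MULTILINE,
)

def dedupe_keep_first_crlf(text):
    # Repeat until nothing is replaced, like multiple Replace All actions.
    while True:
        text, count = DUP_LINE_CRLF.subn(r"\1\2", text)
        if count == 0:
            return text

before = ("value1\r\nvalue2\r\nvalue2\r\nvalue4\r\n"
          "value3\r\nvalue3\r\nvalue2\r\nvalue4")  # no line-ending on the last line
after = dedupe_keep_first_crlf(before)
print(repr(after))  # 'value1\r\nvalue2\r\nvalue4\r\nvalue3'
```

Each removed copy takes the line break in front of it along, which is why the text that remains ends without one.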
Thus, the final value4 was detected as a duplicate and removed, even though the line itself wasn’t an exact duplicate in the original data.
Note, however, that the new last line (containing value3) now does not have a line-ending…when all the value3 lines in the original file did.
-
@guy038,
Why does the regex (?-s)^(.+\R)(?s).*?\K\1+ sometimes make Notepad++ think the whole text is duplicated and needs to be deleted?
-
Hello, @sepehr-e,
I must admit that sometimes regexes involving a great amount of text may, wrongly, get a unique match which represents all the file contents :-(( This case may also happen with regexes that contain recursive patterns !
I can’t clearly explain this behaviour. Maybe it’s related to a matched range of characters that exceeds a limit. It could also depend on the amount of RAM or on specific N++ features, like the periodic backup !
Practically, you could use the regex below, which implies another condition : the \1+ block of lines, which is to be deleted, must follow, most of the time, some End of line characters !
(?-s)^(.+\R)(?s).*?\R?\K\1+
You didn’t say anything about your working file, but it could help ?!
Note that the syntax \R must be optional ( => the form \R? ) in case of a block of consecutive identical lines, as for instance :
456
789
123
123
123
123
000
Best Regards,
guy038