delete both duplicates regexp macro?
-
this regexp is doing the job fine:
(?-s)^(.+)\R(?s)(?=.*\R\1\R?)
but as guy said, it doesn’t work with large (or even fairly large) files.
My file is 1 to 1.5 MB (a bit over 50k records) and it doesn’t work. Anyway, I did it with the Excel VLOOKUP function instead.
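For files that are too big for the lookahead regex, one option outside Notepad++ is a short script. The following is only a minimal sketch in Python, assuming the whole file fits in memory as a list of lines; `drop_all_duplicated` is a hypothetical helper name, not something from this thread.

```python
from collections import Counter

def drop_all_duplicated(lines):
    """Keep only the lines that occur exactly once, in their original order."""
    counts = Counter(lines)
    return [line for line in lines if counts[line] == 1]

# Small illustrative input, not real data from the thread.
sample = ["567890", "1234", "45", "1234", "xyz", "567890"]
unique_only = drop_all_duplicated(sample)
print(unique_only)
```

Counting first and filtering afterwards means both copies of a duplicated line are dropped, and the cost grows linearly with the number of lines.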
-
Hi, All
Unfortunately, I verified, again, that my previous method only works if the file contents and/or the number of lines processed are not too large :-(( In most cases, the regex engine ends up wrongly matching all file contents. Too bad !
So, if you wish to keep the initial order of your file, here is a new method to adopt, which covers all cases ( I hope so ! ), in order to keep/delete duplicate lines AND/OR all non-duplicate lines of a file, whatever its size !
Please, run any test, even on large files, to verify that this method is robust and does not fail ! I’ll be glad to get your feedback :-))
So, let’s start with this sample text :
567890
1234
45
1234
xyz
567890
567890
000000000
567890
45
abcdef
1234
1234
45
hijk
45
45
567890
1234
999
1234
- Move the cursor to the beginning of the first item, 567890
- Open the Column editor ( Edit > Column Editor... )
- Insert a decimal sequence of numbers, ticking the Leading zeros option
- Delete the last isolated number 22
=>
01567890
021234
0345
041234
05xyz
06567890
07567890
08000000000
09567890
1045
11abcdef
121234
131234
1445
15hijk
1645
1745
18567890
191234
20999
211234
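For readers who prefer to script the numbering step, here is a rough Python equivalent of what the Column editor does above. It is a sketch only: `number_lines` is a hypothetical name, and the width of the counter is derived from the number of lines, which is an assumption about how many leading zeros are wanted.

```python
def number_lines(lines):
    """Prefix each line with a 1-based counter, zero-padded to a fixed width."""
    width = len(str(len(lines)))
    return [f"{i:0{width}d}{line}" for i, line in enumerate(lines, start=1)]

sample = ("567890 1234 45 1234 xyz 567890 567890 000000000 567890 45 abcdef "
          "1234 1234 45 hijk 45 45 567890 1234 999 1234").split()
numbered = number_lines(sample)
print(numbered[:3])
```

Because the counter has a fixed width N, the later swap regex can peel off exactly N digits even when the data itself starts with digits.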
- Now, use the regex S/R below to swap the positions of data and numbers, where N is the number of digits of the previous numbering, and to insert a separation character ( I chose the # character, but any single char may suit, provided it’s not used in your data. Prefer a character which is not a meta-character used in regexes ! )
SEARCH (?-s)^(\d{N})(.+)
REPLACE \2#\1
- As, in our example, N = 2, it leads to the text :
567890#01
1234#02
45#03
1234#04
xyz#05
567890#06
567890#07
000000000#08
567890#09
45#10
abcdef#11
1234#12
1234#13
45#14
hijk#15
45#16
45#17
567890#18
1234#19
999#20
1234#21
- Then, execute a sort with the menu option Edit > Line Operations > Sort Lines Lexicographically Ascending
=>
000000000#08
1234#02
1234#04
1234#12
1234#13
1234#19
1234#21
45#03
45#10
45#14
45#16
45#17
567890#01
567890#06
567890#07
567890#09
567890#18
999#20
abcdef#11
hijk#15
xyz#05
Important : Until the end of this post, this sorted text becomes the new sample text !
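The swap and the sort can also be reproduced outside the editor. The sketch below is in Python and rests on two assumptions: `swap_and_sort` is a hypothetical helper, and Python's default string ordering is taken to agree with the editor's lexicographic sort for plain ASCII data like this sample.

```python
import re

def swap_and_sort(numbered, n_digits, sep="#"):
    """Move the leading N-digit counter behind the data, then sort the lines."""
    pattern = re.compile(rf"^(\d{{{n_digits}}})(.+)$")
    swapped = [pattern.sub(rf"\2{sep}\1", line) for line in numbered]
    return sorted(swapped)

numbered = ("01567890 021234 0345 041234 05xyz 06567890 07567890 08000000000 "
            "09567890 1045 11abcdef 121234 131234 1445 15hijk 1645 1745 "
            "18567890 191234 20999 211234").split()
sorted_lines = swap_and_sort(numbered, n_digits=2)
print(sorted_lines[:4])
```

After sorting, identical data values sit on consecutive lines, and within each block the counters keep the original relative order.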
Now, here are the six regex S/R that cover all possible cases :
- Regex A : SEARCH (?-s)^(.+#).*\R(?:\1.*\R)+ and REPLACE Leave EMPTY
- Regex B : SEARCH (?-s)^((.+#).*\R)(?:\2.*\R)+ and REPLACE \1
- Regex C : SEARCH (?-s)^(.+#).*\R(\1.*\R)+ and REPLACE \2
- Regex D : SEARCH (?-s)^(.+#).*\R(?:\1.*\R)+|.+\R and REPLACE ?1$0
- Regex E : SEARCH (?-s)^((.+#).*\R)(?:\2.*\R)+|.+\R and REPLACE \1
- Regex F : SEARCH (?-s)^(.+#).*\R(\1.*\R)+|.+\R and REPLACE \2
So, in a previously sorted file ( I insist ! ) and whatever the numbering after the # symbol :
- If you want to delete all duplicate lines, only, use the regex A
- If you want to keep isolated lines AND the first line of each block of duplicate lines, only, use the regex B
- If you want to keep isolated lines AND the last line of each block of duplicate lines, only, use the regex C
- If you want to delete isolated lines, only, use the regex D
- If you want to keep the first line of each block of duplicate lines, only, use the regex E
- If you want to keep the last line of each block of duplicate lines, only, use the regex F
Here are, below, the results of these six regex S/R, against the sample text :
•----------------•----------------•----------------•----------------•----------------•----------------•
|    Regex A     |    Regex B     |    Regex C     |    Regex D     |    Regex E     |    Regex F     |
•----------------•----------------•----------------•----------------•----------------•----------------•
| 000000000#08   | 000000000#08   | 000000000#08   | 1234#02        | 1234#02        | 1234#21        |
| 999#20         | 1234#02        | 1234#21        | 1234#04        | 45#03          | 45#17          |
| abcdef#11      | 45#03          | 45#17          | 1234#12        | 567890#01      | 567890#18      |
| hijk#15        | 567890#01      | 567890#18      | 1234#13        |                |                |
| xyz#05         | 999#20         | 999#20         | 1234#19        |                |                |
|                | abcdef#11      | abcdef#11      | 1234#21        |                |                |
|                | hijk#15        | hijk#15        | 45#03          |                |                |
|                | xyz#05         | xyz#05         | 45#10          |                |                |
|                |                |                | 45#14          |                |                |
|                |                |                | 45#16          |                |                |
|                |                |                | 45#17          |                |                |
|                |                |                | 567890#01      |                |                |
|                |                |                | 567890#06      |                |                |
|                |                |                | 567890#07      |                |                |
|                |                |                | 567890#09      |                |                |
|                |                |                | 567890#18      |                |                |
•----------------•----------------•----------------•----------------•----------------•----------------•
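To experiment with the six S/R outside the editor, they can be approximated in Python. This is a sketch under stated assumptions, not the Boost engine that Notepad++ uses: Python's `re` module has no `\R` and no `?1$0` conditional replacement, so `\R` is written as `\n`, the leading `(?-s)` is dropped (the dot already stops at line ends by default), and the replacements of D, E and F are emulated with small functions. `REGEXES` and `apply` are hypothetical names.

```python
import re

SORTED = "\n".join(
    ("000000000#08 1234#02 1234#04 1234#12 1234#13 1234#19 1234#21 "
     "45#03 45#10 45#14 45#16 45#17 567890#01 567890#06 567890#07 "
     "567890#09 567890#18 999#20 abcdef#11 hijk#15 xyz#05").split()
) + "\n"

# name -> (pattern, replacement); replacements for D/E/F are callables
REGEXES = {
    "A": (r"^(.+#).*\n(?:\1.*\n)+", ""),
    "B": (r"^((.+#).*\n)(?:\2.*\n)+", r"\1"),
    "C": (r"^(.+#).*\n(\1.*\n)+", r"\2"),
    # keep the whole block if group 1 took part in the match, else delete
    "D": (r"^(.+#).*\n(?:\1.*\n)+|.+\n", lambda m: m.group(0) if m.group(1) else ""),
    "E": (r"^((.+#).*\n)(?:\2.*\n)+|.+\n", lambda m: m.group(1) or ""),
    "F": (r"^(.+#).*\n(\1.*\n)+|.+\n", lambda m: m.group(2) or ""),
}

def apply(name, text=SORTED):
    """Run one of the six search/replace pairs and return the remaining lines."""
    pattern, repl = REGEXES[name]
    return re.sub(pattern, repl, text, flags=re.M).split()

for name in REGEXES:
    print(name, apply(name))
```

Note that this says nothing about the size limits discussed later in the thread, since those depend on the regex engine inside the editor.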
- Now, considering any of these 6 results, just above, let’s swap, with the regex S/R below, the two blocks of data on either side of the # character
SEARCH (?-s)^(.+)#(.+)
REPLACE \2#\1
- We get the different cases below :
•----------------•----------------•----------------•----------------•----------------•----------------•
|    Regex A     |    Regex B     |    Regex C     |    Regex D     |    Regex E     |    Regex F     |
•----------------•----------------•----------------•----------------•----------------•----------------•
| 08#000000000   | 08#000000000   | 08#000000000   | 02#1234        | 02#1234        | 21#1234        |
| 20#999         | 02#1234        | 21#1234        | 04#1234        | 03#45          | 17#45          |
| 11#abcdef      | 03#45          | 17#45          | 12#1234        | 01#567890      | 18#567890      |
| 15#hijk        | 01#567890      | 18#567890      | 13#1234        |                |                |
| 05#xyz         | 20#999         | 20#999         | 19#1234        |                |                |
|                | 11#abcdef      | 11#abcdef      | 21#1234        |                |                |
|                | 15#hijk        | 15#hijk        | 03#45          |                |                |
|                | 05#xyz         | 05#xyz         | 10#45          |                |                |
|                |                |                | 14#45          |                |                |
|                |                |                | 16#45          |                |                |
|                |                |                | 17#45          |                |                |
|                |                |                | 01#567890      |                |                |
|                |                |                | 06#567890      |                |                |
|                |                |                | 07#567890      |                |                |
|                |                |                | 09#567890      |                |                |
|                |                |                | 18#567890      |                |                |
•----------------•----------------•----------------•----------------•----------------•----------------•
- Considering any of these 6 results, just above, perform, again, a sort, with the option Edit > Line Operations > Sort Lines Lexicographically Ascending
=>
•----------------•----------------•----------------•----------------•----------------•----------------•
|    Regex A     |    Regex B     |    Regex C     |    Regex D     |    Regex E     |    Regex F     |
•----------------•----------------•----------------•----------------•----------------•----------------•
| 05#xyz         | 01#567890      | 05#xyz         | 01#567890      | 01#567890      | 17#45          |
| 08#000000000   | 02#1234        | 08#000000000   | 02#1234        | 02#1234        | 18#567890      |
| 11#abcdef      | 03#45          | 11#abcdef      | 03#45          | 03#45          | 21#1234        |
| 15#hijk        | 05#xyz         | 15#hijk        | 04#1234        |                |                |
| 20#999         | 08#000000000   | 17#45          | 06#567890      |                |                |
|                | 11#abcdef      | 18#567890      | 07#567890      |                |                |
|                | 15#hijk        | 20#999         | 09#567890      |                |                |
|                | 20#999         | 21#1234        | 10#45          |                |                |
|                |                |                | 12#1234        |                |                |
|                |                |                | 13#1234        |                |                |
|                |                |                | 14#45          |                |                |
|                |                |                | 16#45          |                |                |
|                |                |                | 17#45          |                |                |
|                |                |                | 18#567890      |                |                |
|                |                |                | 19#1234        |                |                |
|                |                |                | 21#1234        |                |                |
•----------------•----------------•----------------•----------------•----------------•----------------•
- Finally, let’s use this last regex S/R to get rid of all the counting marks
SEARCH (?-s)^.+#
REPLACE Leave EMPTY
- We obtain the 6 final results, from the original text :
•----------------•----------------•----------------•----------------•----------------•----------------•
|    Regex A     |    Regex B     |    Regex C     |    Regex D     |    Regex E     |    Regex F     |
•----------------•----------------•----------------•----------------•----------------•----------------•
| xyz            | 567890         | xyz            | 567890         | 567890         | 45             |
| 000000000      | 1234           | 000000000      | 1234           | 1234           | 567890         |
| abcdef         | 45             | abcdef         | 45             | 45             | 1234           |
| hijk           | xyz            | hijk           | 1234           |                |                |
| 999            | 000000000      | 45             | 567890         |                |                |
|                | abcdef         | 567890         | 567890         |                |                |
|                | hijk           | 999            | 567890         |                |                |
|                | 999            | 1234           | 45             |                |                |
|                |                |                | 1234           |                |                |
|                |                |                | 1234           |                |                |
|                |                |                | 45             |                |                |
|                |                |                | 45             |                |                |
|                |                |                | 45             |                |                |
|                |                |                | 567890         |                |                |
|                |                |                | 1234           |                |                |
|                |                |                | 1234           |                |                |
•----------------•----------------•----------------•----------------•----------------•----------------•
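The last three steps (swap back, sort on the counter, strip the counter) can be sketched in Python as well. `restore_order` is a hypothetical helper, and the sketch assumes the separator never occurs in the data itself, exactly as the method requires.

```python
import re

def restore_order(kept, sep="#"):
    """Swap data and counter back, sort by counter, then drop the counter."""
    esc = re.escape(sep)
    swapped = [re.sub(rf"^(.+){esc}(.+)$", rf"\2{sep}\1", line) for line in kept]
    swapped.sort()  # counters have a fixed width, so string order equals numeric order
    return [re.sub(rf"^.+{esc}", "", line) for line in swapped]

# Lines left over after regex A and after regex F on the sample text.
after_a = ["000000000#08", "999#20", "abcdef#11", "hijk#15", "xyz#05"]
after_f = ["1234#21", "45#17", "567890#18"]
print(restore_order(after_a))
print(restore_order(after_f))
```

Sorting on the zero-padded counter is what brings the surviving lines back into the order they had in the original file.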
Remark : This method needs numerous steps, but is quite safe, because all the modifications produced by the different S/R concern one line at a time ( or a consecutive block of lines, in regexes A to F ! )
Of course, on huge files, execution time may be long, but you should get the expected results at the end ;-))
Cheers,
guy038
-
-
thanks a lot for your effort, but too much fuss, isn’t it?
vlookup in excel is easier to do I think
-
@patrickdrd said:
thanks a lot for your effort, but too much fuss, isn’t it?
NOTHING is too much fuss for @guy038 ! :-D
-
@guy038 said:
the regex engine ends up , matching, wrongly, all file contents
As mentioned in this thread, this is in all likelihood caused by this problem.
-
(?-s)^(.+)\R(?s)(?=.*\R\1\R?) doesn’t match the whole line,
e.g. it tells me that adobe.com exists, but I only have lines that end in adobe.com, e.g. get.adobe.com
-
Hi, All,
Sorry for the delay, but I was busy with some garden work (hedge trimming !) and, of course, I also tested the 6 regexes, from A to F, of my previous post !
I used the following test file :
a#9999999999
a#9999999999
abcdefghij#9999999999
.........................
.........................
..21524 IDENTICAL lines ( in totality ! )
.........................
.........................
abcdefghij#9999999999
z#9999999999
z#9999999999
As you can see :
- It begins with 2 identical lines a#9999999999
- Then come 21524 identical lines abcdefghij#9999999999
- And it finishes with 2 identical lines z#9999999999, followed by a final line-break
So, I ran the regex C of my previous post, (?-s)^(.+#).*\R(\1.*\R)+, against this test file
=> It correctly matched the 2 lines at the beginning of the file, then the 21524 identical lines ( => a selection of 495,103 characters ) and, finally, the 2 lines at the end of the file
Then, I simply added ONE additional line abcdefghij#9999999999 to that file and ran the regex again. This time, it matched the 2 lines at the beginning of the file, but wrongly grabbed all the remaining text ( so the 21525 lines AND the 2 last lines ) !?
To verify whether the results depended on the size of the selection, I changed the test file to use lines of 140 chars, as below :
a#9999999999
a#9999999999
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz#9999999999999999999999999999999999999999999999999999999999999999999999999999999999999
.........................
.........................
..21524 IDENTICAL lines ( in totality ! )
.........................
.........................
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz#9999999999999999999999999999999999999999999999999999999999999999999999999999999999999
z#9999999999
z#9999999999
I was very surprised to see that the results were exactly the same ( OK for 21524 identical lines and KO for 21525 identical lines !!?? ) And yet, this time, the selection contained 3,013,360 chars !
Of course, I did this test with all the other regexes. For example, with regex A, the limit is a bit higher : 25120 lines. But again, after adding one more line, the regex A failed :-((
So, guys, if you don’t mind, I would like you to test the regex C with the first test file, above, in order to verify whether it is “laptop-dependent”. I mean, maybe the results are not relevant with my weak Windows XP configuration !?
In the meantime, it seems we can conclude that, in a previously sorted file, a regular expression can handle, roughly, not more than 21,000 identical lines at a time ! I’d be glad to receive your feedback in order to confirm or invalidate this fact :-))
Of course, I came to this temporary conclusion after testing my 6 regexes, from A to F, against real text. I decided to take the whole contents of a novel from the Gutenberg site. And…, as I’m French, my choice was, naturally, the novel “The count of Monte-Cristo” by Alexandre Dumas, that you may download from the link below :
http://www.gutenberg.org/files/1184/1184-0.txt ( Choose the link “Raw text UTF-8” )
When I first tried to build a suitable sorted working file, in order to test my regexes, unfortunately, all failed :-(( But I also noticed, in that sorted file, that there were numerous lines the#..... Indeed, if you download the novel, just count the occurrences of the regex \bthe\b => 28628 occurrences of the article “the”. So I deleted all these consecutive occurrences of the word “the”. This time all my regexes worked as expected :-))
However, note that, during my tests, I found out that my regexes D to F were, initially, erroneous. So I changed them, and I already updated my previous post with the correct regexes !
With the help of the page below, on the most common words in English :
https://en.wikipedia.org/wiki/Most_common_words_in_English
I verified, with the regex \bWord\b, that, in this novel, the 10 most common words used in the initial text are :
the   28,628  ( ABSENT in the SORTED file )
to    12,897
of    12,916
and   12,570
a      9,473
I      8,393
you    8,288
he     6,945
in     6,625
his    5,909
So, we are sure that the 6 regexes can, at least, manage files containing up to 13,000 consecutive identical lines !
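The same kind of whole-word count can be sketched in Python. `count_words` is a hypothetical helper, the sample sentence is made up for illustration, and the search is assumed to be case-sensitive (the post does not say which option was used in the editor).

```python
import re

def count_words(text, words):
    """Count whole-word, case-sensitive occurrences of each word in text."""
    return {w: len(re.findall(rf"\b{re.escape(w)}\b", text)) for w in words}

# Made-up sample sentence, not taken from the novel.
text = "the count of Monte-Cristo and the abbot; I saw the sea, then the others"
counts = count_words(text, ["the", "of", "I"])
print(counts)
```

The word boundaries keep "then" and "others" from being counted as "the".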
Now, if anyone is interested in the different steps that I used to build a decent working file, for testing regexes A to F, just have a glance at the table below :
•------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------•
| SEARCH                             | REPLACE       | EXPLANATIONS                                                                                        | Occurrences |
•------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------•
|                                    |               | We delete, manually, from BEGINNING of file to the END of the CONTENTS part                         |             |
|                                    |               | We delete, manually, from AFTER the FOOTNOTES part till the VERY END of file                        |             |
| ,(?=\d)                            | EMPTY         | We delete any COMMA separator in NUMBERS                                                            |         264 |
| [,;.]                              | \x20          | We change any punctuation END of a (part of) SENTENCE with a SPACE character                        |      72,423 |
| (?i)o’(?=clock)                    | of\x20the\x20 | We replace the "o’" CONTRACTIVE form with the COMPLETE form "of the "                               |         164 |
| (?i)’s|(?<!\w)’|’(?!\w)            | EMPTY         | We delete the "’s" string and any "’" sign NOT SURROUNDED with WORD chars                           |       2,754 |
| (?i)(d|l)’                         | \1e\x20       | We change the "d’" and "l’" French CONTRACTIVE forms to, RESPECTIVELY, "de " and "le "              |         311 |
| —|-                                | \x20          | We change any HYPHEN-MINUS character as well as the EM DASH char, with a SPACE character            |       4,933 |
| [^\w’\r\n ]                        | \x20          | We ONLY keep WORD, SPACE, and EOL characters and the ’ sign( PRESENT in English CONTRACTIVE forms ) |      38,795 |
| (?-i)(?<=\s)(?=\w)[^aAIVX\d](?=\s) | \x20          | As ONE-char STRING, we ONLY keep article "A", "a", pronoun "I", DIGITS and ROMAN letters "V" , "X"  |       1,151 |
| ^\h*\R|^\h+|\h+$|\h+(?=\h)         | EMPTY         | We delete PURE BLANK lines, TRIM spaces at START and END, and REDUCE to a ONE SPACE gap             |     107,108 |
| \x20 ( > 1 mn ! )                  | \r\n          | Finally, we change any SINGLE SPACE character with a LINE BREAK                                     |     419,769 |
| COLUMN editor, with LEADING zeros  |               | At LINE 1, COLUMN 1                                                                                 |             |
| (?-s)^(\d{6})(.+)                  | \2#\1         | We SWAP each WORD and its REFERENCE number                                                          |     464,233 |
| (?i)^the#                          |               | We BOOKMARK all the LINES, containing the article "the", whatever its CASE                          |      28,529 |
| Bookmark > Cut Bookmarked Lines    |               | We BACKUP all these lines in an OTHER file, for FURTHER processing                                  |             |
| Sort lines Lexico... ASCENDING     |               | => A work SORTED file, encoded UTF-8 with BOM, of 5,861,424 BYTES, with 435,704 WORDS, ONE per line |             |
•------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------•
Now, applying the regexes A to F against the sorted file obtained, I got, after about 10s for each, the coherent results below :
•-------•---------------------------------------•----------•-------------•--------------•
| Regex | SEARCH                                | REPLACE  | Occurrences | LINES Number |
•-------•---------------------------------------•----------•-------------•--------------•
|       | Work SORTED file, obtained, AFTER all the steps above :        |      435,704 |
•-------•---------------------------------------•----------•-------------•--------------•
|   A   | (?-s)^(.+#).*\R(?:\1.*\R)+            | EMPTY    |      10,818 |        6,861 |
|   B   | (?-s)^((.+#).*\R)(?:\2.*\R)+          | \1       |      10,818 |       17,679 |
|   C   | (?-s)^(.+#).*\R(\1.*\R)+              | \2       |      10,818 |       17,679 |
|   D   | (?-s)^(.+#).*\R(?:\1.*\R)+|.+\R       | ?1$0     |      17,679 |      428,843 |
|   E   | (?-s)^((.+#).*\R)(?:\2.*\R)+|.+\R     | \1       |      17,679 |       10,818 |
|   F   | (?-s)^(.+#).*\R(\1.*\R)+|.+\R         | \2       |      17,679 |       10,818 |
•-------•---------------------------------------•----------•-------------•--------------•
It’s easy to verify that :
- 6,861 lines, after regex A, + 428,843 lines, after regex D = 435,704 ( Total of the file )
- 6,861 lines, after regex A, + 10,818 lines, after regex E = 17,679 lines, after regex B
- 6,861 lines, after regex A, + 10,818 lines, after regex F = 17,679 lines, after regex C
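These three identities can be checked mechanically. The sketch below only restates the line counts reported in the table above; the variable names are illustrative.

```python
# Line counts left in the file after each regex, as reported in the post.
total_lines = 435_704
after = {"A": 6_861, "B": 17_679, "C": 17_679,
         "D": 428_843, "E": 10_818, "F": 10_818}

checks = [
    after["A"] + after["D"] == total_lines,  # unique lines + all duplicate lines
    after["A"] + after["E"] == after["B"],   # unique lines + first of each block
    after["A"] + after["F"] == after["C"],   # unique lines + last of each block
]
print(checks)
```

Each identity holds because regex A keeps only the unique lines, while D, E and F keep only lines taken from duplicate blocks.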
On the other hand :
- The 10,818 occurrences of regexes A, B and C correspond to all the first/last duplicate lines, as after regexes E or F
- The 17,679 occurrences of regexes D, E and F correspond to all first/last duplicate lines AND all the unique lines, too, as after regexes B or C
Note also that :
- With the 3 regexes A, B and C, the unique lines, which must be kept, are simply not processed by the regexes
- With the 3 regexes D, E and F, the unique lines, which must be deleted, are matched by the second alternative .+\R of the regexes
So, guys, as I said above, I’m looking for the results of your own tests, relative to the biggest block of consecutive identical lines correctly handled by the six regexes A to F, above, with the test file below :
a#9999999999
a#9999999999
abcdefghij#9999999999  )
.....................  )
.....................  )  HOW MANY lines ? ( THANKS for testing !!)
.....................  )
abcdefghij#9999999999  )
z#9999999999
z#9999999999
Best Regards,
guy038
-
-
@guy038
off topic regarding garden work:
if your garden is as detailed and thorough as everything else you do, i’d gladly invite you to help me out in mine … the amount of daily magnolia leaves to collect is currently killing me this year and i’ve not been able to control my rakes and brooms with an adequate, repeatable regex ;-)
-
(?-s)^(.+#).*\R(\1.*\R)+ doesn’t work for my case,
it doesn’t find any occurrences
-
@guy038 said:
I’m looking for the results of your own tests, relative to the biggest block of consecutive identical lines, correctly handled by the six regexes A to F, above, and the test file, below
So it might be worth pointing out a good method for creating an arbitrary (i.e., large!) number of the abcdefghij#9999999999 lines in your request.
Here’s what I would do:
- put caret on that line in a tab created for the purpose of testing this
- start macro recording
- press ctrl+d (to execute the Duplicate Current Line function)
- stop macro recording
- go to the Macro menu and choose Run a Macro Multiple Times…
- fill in the prompt box entries and press Run (to create the desired number of lines)
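As an alternative to the macro steps above, the test file could be generated by a script. This is a hedged Python sketch: `build_test_lines` and `write_test_file` are hypothetical helpers, the file name passed to `write_test_file` is up to you, and Windows (CRLF) line endings are an assumption.

```python
def build_test_lines(n_middle, middle="abcdefghij#9999999999"):
    """Return 2 'a' lines, n_middle identical middle lines, then 2 'z' lines."""
    return ["a#9999999999"] * 2 + [middle] * n_middle + ["z#9999999999"] * 2

def write_test_file(path, n_middle):
    """Write the test file with CRLF line endings and a final line-break."""
    content = "\r\n".join(build_test_lines(n_middle)) + "\r\n"
    # newline="" stops Python from translating the explicit CRLF pairs
    with open(path, "w", newline="") as handle:
        handle.write(content)

lines = build_test_lines(21524)
print(len(lines))
```

Changing the argument by one (21524 versus 21525) gives the two files needed to look for the limit reported in the thread.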
To see how many lines of this type you’ve currently got, simply do a literal Count search for abcdefghij#9999999999.
-
@guy038 said:
So, guys, if you don’t mind, I would like you to test the regex C , with the first test file, above, in order to verify if it is “laptop-dependent”. I means, may be, results are not pertinent with my weak Windows XP configuration !?
I did this and obtained exactly the same results as you did, @guy038. Specifically, OK with 21524 identical lines, and NOT OK with 21525 identical lines. I tried both the shorter and longer versions of those “middle” lines in the file. All this using Notepad++ 7.2.2, 32-bit. I doubt that any other (reasonable) version of Notepad++ will show different results.
-
Do you have more to say on this topic? I’m interested…