delete both duplicates regexp macro?
-
thanks a lot for your effort, but too much fuss, isn’t it?
vlookup in excel is easier to do I think
-
@patrickdrd said:
thanks a lot for your effort, but too much fuss, isn’t it?
NOTHING is too much fuss for @guy038 ! :-D
-
@guy038 said:
the regex engine ends up , matching, wrongly, all file contents
As mentioned in this thread, this is in all likelihood caused by this problem.
-
(?-s)^(.+)\R(?s)(?=.*\R\1\R?)
doesn’t match the whole line,
e.g. it tells me that adobe.com exists, but I only have lines that end in adobe.com, e.g. get.adobe.com
-
Hi, All,
Sorry for the delay, but I was busy with some garden work ( hedge trimming ! ) and, of course, I also tested the 6 regexes, from A to F, of my previous post !
I used the following test file :
    a#9999999999
    a#9999999999
    abcdefghij#9999999999
    .........................
    .........................
    ..21524 IDENTICAL lines ( in totality ! )
    .........................
    .........................
    abcdefghij#9999999999
    z#9999999999
    z#9999999999
As you can see :
- It begins with the 2 identical lines a#9999999999
- Then, it continues with the 21524 identical lines abcdefghij#9999999999
- And it finishes with the 2 identical lines z#9999999999, followed by a final line-break
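For anyone who prefers to generate this test file rather than build it by hand, here is a minimal Python sketch. The function names and the output file name are hypothetical, and the Windows line endings ( CRLF ) and UTF-8 encoding are assumptions, as the post does not state them for this file.

```python
from pathlib import Path

EOL = "\r\n"  # assumption: Windows line endings, as in a typical Notepad++ file


def build_test_text(middle_count, middle_line="abcdefghij#9999999999"):
    """Return the test file contents: 2 'a' lines, N middle lines, 2 'z' lines."""
    lines = ["a#9999999999"] * 2
    lines += [middle_line] * middle_count
    lines += ["z#9999999999"] * 2
    # The file ends with a final line-break, as described in the post
    return EOL.join(lines) + EOL


def write_test_file(path, middle_count):
    """Write the test file as bytes so Python does not translate the line endings."""
    Path(path).write_bytes(build_test_text(middle_count).encode("utf-8"))


# Example ( hypothetical file name ):
# write_test_file("regex_limit_test.txt", 21524)
```

Changing the count from 21524 to 21525 gives the second variant of the file, with one more identical line.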
So, I ran the regex C of my previous post, (?-s)^(.+#).*\R(\1.*\R)+ , against this test file
=> It correctly matched the 2 lines at the beginning of the file, then the 21524 identical lines ( => a selection of 495,103 characters ) and, finally, the 2 lines at the end of the file
Then, I simply added ONE additional line abcdefghij#9999999999 to that file and ran the regex again. This time, it matched the 2 lines at the beginning of the file, but wrongly grabbed all the remaining text ( so the 21525 lines AND the 2 last lines ) !?
To verify whether the results depended on the size of the selection, I changed the test file to use lines of 140 chars, as below :
    a#9999999999
    a#9999999999
    abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz#9999999999999999999999999999999999999999999999999999999999999999999999999999999999999
    .........................
    .........................
    ..21524 IDENTICAL lines ( in totality ! )
    .........................
    .........................
    abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz#9999999999999999999999999999999999999999999999999999999999999999999999999999999999999
    z#9999999999
    z#9999999999
I was very surprised to see that the results were exactly the same ( OK for 21524 identical lines and KO for 21525 identical lines !!?? ). And yet, this time, the selection contained 3,013,360 chars !
Of course, I did this test with all the other regexes. For example, with regex A, the limit is a bit higher : 25120 lines. But again, after adding one more line, the regex A failed :-((
So, guys, if you don’t mind, I would like you to test the regex C, with the first test file above, in order to verify whether the result is “laptop-dependent”. I mean, maybe my results are not relevant, given my weak Windows XP configuration !?
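To see what regex C is meant to do on a tiny sample, here is a hedged Python sketch. Python's `re` module is a different engine from the one in Notepad++ and has no `\R`, so `\R` is replaced by `\n` and the multi-line flag is set explicitly. This only illustrates the intended behaviour ( keeping the last line of each block ) and is not expected to reproduce the 21,525-line failure.

```python
import re

# Regex C from the post, adapted for Python: \R becomes \n, and re.MULTILINE
# makes ^ match at the start of every line ( the dot already excludes \n ).
REGEX_C = re.compile(r"^(.+#).*\n(\1.*\n)+", re.MULTILINE)


def keep_last_of_each_block(text):
    """Replace each block of lines sharing the same 'key#' prefix with \\2,
    which holds the LAST line matched by the repeated group."""
    return REGEX_C.sub(r"\2", text)


sample = (
    "a#0001\n"
    "a#0002\n"
    "b#0003\n"
    "c#0004\n"
    "c#0005\n"
    "c#0006\n"
)
result = keep_last_of_each_block(sample)
print(result)  # a#0002, b#0003 and c#0006, one per line
```

The unique line `b#0003` is never matched, so it is left untouched, while each block of duplicates collapses to its last line.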
In the meantime, it seems we can conclude that, in a previously sorted file, a regular expression can handle roughly no more than 21,000 identical lines at a time ! I’d be glad to receive your feedback in order to confirm or invalidate this fact :-))
Of course, I came to this temporary conclusion after testing my 6 regexes, from A to F, against real text. I decided to take the whole contents of a novel from the Gutenberg site. And…, as I’m French, my choice was, naturally, the novel “The Count of Monte Cristo” by Alexandre Dumas, that you may download from the link below :
http://www.gutenberg.org/files/1184/1184-0.txt ( Choose the link “Raw text UTF-8” )
When I first tried to build a suitable sorted working file, in order to test my regexes, unfortunately, all of them failed :-(( But I also noticed that this sorted file contained numerous lines the#..... . Indeed, if you download the novel, just count the occurrences of the regex \bthe\b => 28628 occurrences of the article “the”. So I deleted all these consecutive lines containing the word “the”. This time, all my regexes worked as expected :-))
However, note that, during my tests, I found out that my regexes D to F were, initially, erroneous. So I changed them, and I have already updated my previous post with the correct regexes !
With the help of the page below, on the most common words in English :
https://en.wikipedia.org/wiki/Most_common_words_in_English
I verified, with the regex \bWord\b, that, in this novel, the 10 most common words used in the initial text are :
    the   28,628   ( ABSENT in the SORTED file )
    to    12,897
    of    12,916
    and   12,570
    a      9,473
    I      8,393
    you    8,288
    he     6,945
    in     6,625
    his    5,909
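The same kind of count can be sketched in Python, as a cross-check outside Notepad++. The helper names are hypothetical, and whether the original search was case-sensitive is not stated in the post, so the `ignore_case` switch is an assumption ; the totals may therefore differ slightly from the figures above.

```python
import re


def count_word(text, word, ignore_case=True):
    """Count whole-word occurrences of `word`, i.e. matches of \\bWord\\b."""
    flags = re.IGNORECASE if ignore_case else 0
    pattern = rf"\b{re.escape(word)}\b"
    return len(re.findall(pattern, text, flags))


def count_words(text, words, ignore_case=True):
    """Return a {word: count} dictionary for several words at once."""
    return {w: count_word(text, w, ignore_case) for w in words}


sample = "The cat and the hat. Then other, the-end."
print(count_words(sample, ["the", "and", "then"]))
```

On the sample, `the` is counted 3 times : “Then” and “other” are rejected by the word boundaries, whereas “the-end” counts, because the hyphen is not a word character.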
So, we are sure that the 6 regexes can, at least, manage files containing up to 13,000 consecutive identical lines !
Now, if anyone is interested in the different steps that I used to build a decent working file for testing regexes A to F, just have a look at the table below :
•------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------•
| SEARCH                             | REPLACE       | EXPLANATIONS                                                                                        | Occurrences |
•------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------•
|                                    |               | We delete, manually, from BEGINNING of file to the END of the CONTENTS part                         |             |
|                                    |               | We delete, manually, from AFTER the FOOTNOTES part till the VERY END of file                        |             |
| ,(?=\d)                            | EMPTY         | We delete any COMMA separator in NUMBERS                                                            |         264 |
| [,;.]                              | \x20          | We change any punctuation END of a (part of) SENTENCE with a SPACE character                        |      72,423 |
| (?i)o’(?=clock)                    | of\x20the\x20 | We replace the "o’" CONTRACTIVE form with the COMPLETE form "of the "                               |         164 |
| (?i)’s|(?<!\w)’|’(?!\w)            | EMPTY         | We delete the "’s" string and any "’" sign NOT SURROUNDED with WORD chars                           |       2,754 |
| (?i)(d|l)’                         | \1e\x20       | We change the "d’" and "l’" French CONTRACTIVE forms to, RESPECTIVELY, "de " and "le "              |         311 |
| —|-                                | \x20          | We change any HYPHEN-MINUS character as well as the EM DASH char, with a SPACE character            |       4,933 |
| [^\w’\r\n ]                        | \x20          | We ONLY keep WORD, SPACE, and EOL characters and the ’ sign( PRESENT in English CONTRACTIVE forms ) |      38,795 |
| (?-i)(?<=\s)(?=\w)[^aAIVX\d](?=\s) | \x20          | As ONE-char STRING, we ONLY keep article "A", "a", pronoun "I", DIGITS and ROMAN letters "V" , "X"  |       1,151 |
| ^\h*\R|^\h+|\h+$|\h+(?=\h)         | EMPTY         | We delete PURE BLANK lines, TRIM spaces at START and END, and REDUCE to a ONE SPACE gap             |     107,108 |
| \x20 ( > 1 min ! )                 | \r\n          | Finally, we change any SINGLE SPACE character with a LINE BREAK                                     |     419,769 |
| COLUMN editor, with LEADING zeros  |               | At LINE 1, COLUMN 1                                                                                 |             |
| (?-s)^(\d{6})(.+)                  | \2#\1         | We SWAP each WORD and its REFERENCE number                                                          |     464,233 |
| (?i)^the#                          |               | We BOOKMARK all the LINES, containing the article "the", whatever its CASE                          |      28,529 |
| Bookmark > Cut Bookmarked Lines    |               | We BACKUP all these lines in an OTHER file, for FURTHER processing                                  |             |
| Sort lines Lexico... ASCENDING     |               | => A work SORTED file, encoded UTF-8 with BOM, of 5,861,424 BYTES, with 435,704 WORDS, ONE per line |             |
•------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------•
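The last steps of the table ( numbering, swapping, setting aside the “the” lines, then sorting ) can be sketched in Python as below. This is only an illustration under assumptions : the function name is hypothetical, the numbering is assumed to start at 1, and Python's default sort ( by code point ) may not order lines exactly like the Notepad++ lexicographic sort.

```python
import re

SWAP = re.compile(r"^(\d{6})(.+)")          # same idea as (?-s)^(\d{6})(.+)
THE_LINE = re.compile(r"^the#", re.IGNORECASE)  # same idea as (?i)^the#


def build_sorted_work_lines(words):
    """Return (sorted work lines, set-aside 'the' lines) from a list of words."""
    # Column editor step: prefix each word with a 6-digit number, leading zeros
    numbered = [f"{i:06d}{word}" for i, word in enumerate(words, start=1)]
    # Swap each word and its reference number: 000001The -> The#000001
    swapped = [SWAP.sub(r"\2#\1", line) for line in numbered]
    # Set aside every line for the article "the", whatever its case
    the_lines = [line for line in swapped if THE_LINE.match(line)]
    kept = [line for line in swapped if not THE_LINE.match(line)]
    # Sort the remaining lines in ascending order
    return sorted(kept), the_lines


work, backup = build_sorted_work_lines(["The", "count", "of", "the", "count"])
print(work)    # sorted working lines, without the "the" lines
print(backup)  # the lines set aside for further processing
```

Because the reference number follows the `#`, identical words end up on consecutive lines after sorting, which is exactly what the regexes A to F rely on.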
Now, applying the regexes A to F against the sorted file obtained, I got, after about 10 s for each, the consistent results below :
•-------•---------------------------------------•----------•-------------•--------------•
| Regex | SEARCH                                | REPLACE  | Occurrences | LINES Number |
•-------•---------------------------------------•----------•-------------•--------------•
|       | Work SORTED file, obtained, AFTER all the steps above :        |      435,704 |
•-------•---------------------------------------•----------•-------------•--------------•
|   A   | (?-s)^(.+#).*\R(?:\1.*\R)+            | EMPTY    |      10,818 |        6,861 |
|   B   | (?-s)^((.+#).*\R)(?:\2.*\R)+          | \1       |      10,818 |       17,679 |
|   C   | (?-s)^(.+#).*\R(\1.*\R)+              | \2       |      10,818 |       17,679 |
|   D   | (?-s)^(.+#).*\R(?:\1.*\R)+|.+\R       | ?1$0     |      17,679 |      428,843 |
|   E   | (?-s)^((.+#).*\R)(?:\2.*\R)+|.+\R     | \1       |      17,679 |       10,818 |
|   F   | (?-s)^(.+#).*\R(\1.*\R)+|.+\R         | \2       |      17,679 |       10,818 |
•-------•---------------------------------------•----------•-------------•--------------•
It’s easy to verify that :
- 6,861 lines, after regex A, + 428,843 lines, after regex D, = 435,704 ( total of the file )
- 6,861 lines, after regex A, + 10,818 lines, after regex E, = 17,679 lines, after regex B
- 6,861 lines, after regex A, + 10,818 lines, after regex F, = 17,679 lines, after regex C
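These three sums can be checked mechanically. The short Python sketch below only re-uses the line counts reported in the results table ; the variable names are hypothetical.

```python
# Line counts reported in the results table
total_lines = 435_704
after_a = 6_861               # unique lines only
after_b = after_c = 17_679    # unique lines + one line per duplicate block
after_d = 428_843             # all lines belonging to duplicate blocks
after_e = after_f = 10_818    # one line per duplicate block

checks = {
    "A + D == total": after_a + after_d == total_lines,
    "A + E == B": after_a + after_e == after_b,
    "A + F == C": after_a + after_f == after_c,
}
for name, ok in checks.items():
    print(name, "OK" if ok else "KO")
```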
On the other hand :
- The 10,818 occurrences of regexes A, B and C correspond to all the first/last duplicate lines, as obtained after regexes E or F
- The 17,679 occurrences of regexes D, E and F correspond to all the first/last duplicate lines AND all the unique lines, too, as obtained after regexes B or C
Note also that :
- With the 3 regexes A, B and C, the unique lines, which must be kept, are simply not processed by the regexes
- With the 3 regexes D, E and F, the unique lines, which must be deleted, are matched by the second alternative .+\R of the regexes
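As a regex-free cross-check of what each of the six search/replace pairs should leave in a sorted file, here is a minimal Python sketch based on `itertools.groupby`. It is not the method used in this thread : the function name is hypothetical, and it assumes that each line contains a single `#`, so that the text before it plays the role of the captured `(.+#)` group.

```python
from itertools import groupby


def expected_results(sorted_lines):
    """Return, for A to F, the lines expected to remain after the replacement."""
    # Group consecutive lines sharing the same text before the '#'
    groups = [list(g) for _, g in
              groupby(sorted_lines, key=lambda line: line.rpartition("#")[0])]
    dup_blocks = [g for g in groups if len(g) > 1]
    return {
        "A": [g[0] for g in groups if len(g) == 1],   # only the unique lines
        "B": [g[0] for g in groups],                  # uniques + FIRST of each block
        "C": [g[-1] for g in groups],                 # uniques + LAST of each block
        "D": [line for g in dup_blocks for line in g],  # every duplicated line
        "E": [g[0] for g in dup_blocks],              # FIRST line of each block
        "F": [g[-1] for g in dup_blocks],             # LAST line of each block
    }


lines = ["a#1", "a#2", "b#3", "c#4", "c#5", "c#6"]
for name, kept in expected_results(lines).items():
    print(name, kept)
```

Comparing these lists with the contents left by each regex gives a quick way to spot the case where the regex engine wrongly grabs the rest of the file.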
So, guys, as I said above, I’m looking for the results of your own tests, relative to the biggest block of consecutive identical lines correctly handled by the six regexes A to F, above, with the test file below :
    a#9999999999
    a#9999999999
    abcdefghij#9999999999  )
    .....................  )
    .....................  )  HOW MANY lines ? ( THANKS for testing !!)
    .....................  )
    abcdefghij#9999999999  )
    z#9999999999
    z#9999999999
Best Regards,
guy038
-
-
@guy038
off topic regarding garden work:
if your garden is as detailed and thorough as everything else you do, i’d gladly invite you to help me out in mine … the amount of daily magnolia leaves to collect is currently killing me this year and i’ve not been able to control my rakes and brooms with an adequate, repeatable regex ;-)
-
(?-s)^(.+#).\R(\1.\R)+ doesn’t work for my case,
it doesn’t find any occurrences
-
@guy038 said:
I’m looking for the results of your own tests, relative to the biggest block of consecutive identical lines, correctly handled by the six regexes A to F, above, and the test file, below
So it might be worth pointing out a good method for creating an arbitrary (i.e., large!) number of the abcdefghij#9999999999 lines in your request.
Here’s what I would do:
- put caret on that line in a tab created for the purpose of testing this
- start macro recording
- press ctrl+d (to execute the Duplicate Current Line function)
- stop macro recording
- go to the Macro menu and choose Run a Macro Multiple Times…
- fill in the prompt box entries and press Run (to create the desired number of lines)
To see how many lines of this type you’ve currently got, simply do a literal Count search for abcdefghij#9999999999.
-
@guy038 said:
So, guys, if you don’t mind, I would like you to test the regex C , with the first test file, above, in order to verify if it is “laptop-dependent”. I means, may be, results are not pertinent with my weak Windows XP configuration !?
I did this and obtained exactly the same results as you did, @guy038. Specifically, OK with 21524 identical lines, and NOT OK with 21525 identical lines. I tried both the shorter and longer versions of those “middle” lines in the file. All this using Notepad++ 7.2.2, 32-bit. I doubt that any other (reasonable) version of Notepad++ will show different results.
-
Do you have more to say on this topic? I’m interested…