delete both duplicates regexp macro?
-
thanks a lot for your effort, but too much fuss, isn’t it?
vlookup in excel is easier to do I think
-
@patrickdrd said:
thanks a lot for your effort, but too much fuss, isn’t it?
NOTHING is too much fuss for @guy038 ! :-D
-
@guy038 said:
the regex engine ends up , matching, wrongly, all file contents
As mentioned in this thread, this is in all likelihood caused by this problem.
-
(?-s)^(.+)\R(?s)(?=.*\R\1\R?)
doesn’t match the whole line,
e.g. it tells me that adobe.com exists, but I only have lines that end in adobe.com, e.g. get.adobe.com
-
Hi, All,
Sorry for the delay, but I was busy with some garden work (hedge trimming!) and, of course, I also tested the 6 regexes, from A to F, of my previous post!
I used the following test file:

    a#9999999999
    a#9999999999
    abcdefghij#9999999999
    .........................
    .........................
    ..21524 IDENTICAL lines ( in totality ! )
    .........................
    .........................
    abcdefghij#9999999999
    z#9999999999
    z#9999999999

As you can see:
- It begins with the 2 identical lines a#9999999999
- Then come the 21524 identical lines abcdefghij#9999999999
- And it finishes with the 2 identical lines z#9999999999, followed by a final line break
So, I ran regex C of my previous post, (?-s)^(.+#).*\R(\1.*\R)+, against this test file.
=> It correctly matched the 2 lines at the beginning of the file, then the 21524 identical lines ( => a selection of 495,103 characters ) and, finally, the 2 lines at the end of the file.
Then, I simply added ONE additional line abcdefghij#9999999999 to that file and ran the regex again. This time, it matched the 2 lines at the beginning of the file, but wrongly grabbed all the remaining text ( so the 21525 lines AND the 2 last lines )!?
To verify whether the results depended on the size of the selection, I changed the test file, using lines of 140 chars, as below:

    a#9999999999
    a#9999999999
    abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz#9999999999999999999999999999999999999999999999999999999999999999999999999999999999999
    .........................
    .........................
    ..21524 IDENTICAL lines ( in totality ! )
    .........................
    .........................
    abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz#9999999999999999999999999999999999999999999999999999999999999999999999999999999999999
    z#9999999999
    z#9999999999

I was very surprised to see that the results were exactly the same ( OK for 21524 identical lines and KO for 21525 identical lines !!?? ). And yet, this time, the selection contained 3,013,360 chars!
Of course, I did this test with all the other regexes. For example, with regex A, the limit is a bit higher: 25120 lines. But again, after adding one more line, regex A failed :-((
So, guys, if you don’t mind, I would like you to test regex C, with the first test file, above, in order to verify whether it is “laptop-dependent”. I mean, maybe, my results are not pertinent with my weak Windows XP configuration!?
In the meantime, seemingly, we can conclude that, in a previously sorted file, a Notepad++ regular expression can handle, roughly, no more than 21,000 identical lines at a time! I’d be glad to receive your feedback in order to confirm or invalidate this fact :-))
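For anyone who wants a reference result to compare against, here is a minimal sketch in Python. It only shows which blocks regex C is expected to select on the test file. Python's re module is a different engine from the one inside Notepad++, so it says nothing about the limit itself. The port is an assumption on my side: \R is replaced by \r?\n, and the helper names are hypothetical.

```python
import re

def build_test_file(n_middle: int, eol: str = "\r\n") -> str:
    """Return the test file: 2 'a' lines, n_middle identical lines, 2 'z' lines."""
    lines = (["a#9999999999"] * 2
             + ["abcdefghij#9999999999"] * n_middle
             + ["z#9999999999"] * 2)
    return eol.join(lines) + eol

# Port of regex C. "." never matches "\n" in Python by default, which plays
# the role of (?-s); "\r?\n" stands in for \R (assumption: CRLF or LF files).
REGEX_C = re.compile(r"^(.+#).*\r?\n(\1.*\r?\n)+", re.MULTILINE)

def block_sizes(text: str) -> list[int]:
    """Number of lines covered by each match of the ported regex C."""
    return [m.group(0).count("\n") for m in REGEX_C.finditer(text)]

print(block_sizes(build_test_file(21525)))
```

Each number printed is the size, in lines, of one selected block, so a block that swallows the two z lines as well would show up immediately.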
Of course, I came to this temporary conclusion after testing my 6 regexes, from A to F, against real text. I decided to take the whole contents of a novel from the Gutenberg site. And…, as I’m French, my choice was, naturally, the novel “The Count of Monte Cristo” by Alexandre Dumas, that you may download from the link below:
http://www.gutenberg.org/files/1184/1184-0.txt ( Choose the link “Raw text UTF-8” )
When I first tried to build a suitable sorted working file, in order to test my regexes, unfortunately, all failed :-(( But I also noticed, in that sorted file, that there were numerous lines the#...... Indeed, if you download the novel, just count the occurrences of the regex \bthe\b => 28628 occurrences of the article “the”. So I deleted all these consecutive occurrences of the word “the”. This time, all my regexes worked as expected :-))
However, note that, during my tests, I found out that my regexes D to F were, initially, erroneous. So I changed them, and I already updated my previous post with the correct regexes!
With the help of the page below, on the most common words in English:
https://en.wikipedia.org/wiki/Most_common_words_in_English
I verified, with the regex \bWord\b, that, in this novel, the 10 most common words used, in the initial text, are:

    the   28,628   ( ABSENT in the SORTED file )
    to    12,897
    of    12,916
    and   12,570
    a      9,473
    I      8,393
    you    8,288
    he     6,945
    in     6,625
    his    5,909

So, we are sure that the 6 regexes can, at least, manage files containing up to roughly 13,000 consecutive identical lines!
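The \bWord\b counts above can be reproduced outside Notepad++ with a few lines of Python. This is only a sketch: whether the original counts were case-insensitive is an assumption, so the flag is left as a parameter, and the sample sentence is made up for illustration.

```python
import re

def count_word(text: str, word: str, ignore_case: bool = True) -> int:
    """Count whole-word occurrences of `word`, like a search for \\bword\\b."""
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags)
    return sum(1 for _ in pattern.finditer(text))

# Made-up sample; "then", "other" and "there" must NOT be counted.
sample = "The Count saw the abbé; then the other man said: there, the end."
print(count_word(sample, "the"))                     # prints 4
print(count_word(sample, "the", ignore_case=False))  # prints 3
```

To run it on the novel, read the downloaded file as UTF-8 and pass its contents as `text`.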
Now, if some people are interested in the different steps that I used to build a decent working file for testing regexes A to F, just have a glance at the table below:

•------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------•
| SEARCH                             | REPLACE       | EXPLANATIONS                                                                                        | Occurrences |
•------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------•
|                                    |               | We delete, manually, from BEGINNING of file to the END of the CONTENTS part                         |             |
|                                    |               | We delete, manually, from AFTER the FOOTNOTES part till the VERY END of file                        |             |
| ,(?=\d)                            | EMPTY         | We delete any COMMA separator in NUMBERS                                                            |         264 |
| [,;.]                              | \x20          | We change any punctuation END of a (part of) SENTENCE with a SPACE character                        |      72,423 |
| (?i)o’(?=clock)                    | of\x20the\x20 | We replace the "o’" CONTRACTIVE form with the COMPLETE form "of the "                               |         164 |
| (?i)’s|(?<!\w)’|’(?!\w)            | EMPTY         | We delete the "’s" string and any "’" sign NOT SURROUNDED with WORD chars                           |       2,754 |
| (?i)(d|l)’                         | \1e\x20       | We change the "d’" and "l’" French CONTRACTIVE forms to, RESPECTIVELY, "de " and "le "              |         311 |
| —|-                                | \x20          | We change any HYPHEN-MINUS character as well as the EM DASH char, with a SPACE character            |       4,933 |
| [^\w’\r\n ]                        | \x20          | We ONLY keep WORD, SPACE, and EOL characters and the ’ sign( PRESENT in English CONTRACTIVE forms ) |      38,795 |
| (?-i)(?<=\s)(?=\w)[^aAIVX\d](?=\s) | \x20          | As ONE-char STRING, we ONLY keep article "A", "a", pronoun "I", DIGITS and ROMAN letters "V" , "X"  |       1,151 |
| ^\h*\R|^\h+|\h+$|\h+(?=\h)         | EMPTY         | We delete PURE BLANK lines, TRIM spaces at START and END, and REDUCE to a ONE SPACE gap             |     107,108 |
| \x20 ( > 1 mn ! )                  | \r\n          | Finally, we change any SINGLE SPACE character with a LINE BREAK                                     |     419,769 |
| COLUMN editor, with LEADING zeros  |               | At LINE 1, COLUMN 1                                                                                 |             |
| (?-s)^(\d{6})(.+)                  | \2#\1         | We SWAP each WORD and its REFERENCE number                                                          |     464,233 |
| (?i)^the#                          |               | We BOOKMARK all the LINES, containing the article "the", whatever its CASE                          |      28,529 |
| Bookmark > Cut Bookmarked Lines    |               | We BACKUP all these lines in an OTHER file, for FURTHER processing                                  |             |
| Sort lines Lexico... ASCENDING     |               | => A work SORTED file, encoded UTF-8 with BOM, of 5,861,424 BYTES, with 435,704 WORDS, ONE per line |             |
•------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------•
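The last rows of the table (numbering, swap, removal of “the”, sort) can be approximated outside Notepad++. Here is a rough Python sketch, under several assumptions of mine: the text has already been cleaned so that splitting on whitespace gives the words, the numbering starts at 1, and Python's default string sort is close enough to the Notepad++ lexicographic sort. The function name is hypothetical.

```python
def build_sorted_lines(text: str, drop: str = "the") -> list[str]:
    """Turn cleaned text into sorted 'word#NNNNNN' lines, without `drop`."""
    words = text.split()
    # Same effect as numbering with the column editor (6 digits, leading
    # zeros) and then swapping with (?-s)^(\d{6})(.+) -> \2#\1
    lines = [f"{word}#{index:06d}" for index, word in enumerate(words, start=1)]
    # Equivalent of bookmarking (?i)^the# and cutting the bookmarked lines
    prefix = drop.lower() + "#"
    kept = [line for line in lines if not line.lower().startswith(prefix)]
    # Plain code-point sort; assumed to be comparable to "Sort lines ... ASCENDING"
    return sorted(kept)

print(build_sorted_lines("the cat and The dog"))
# prints ['and#000003', 'cat#000002', 'dog#000005']
```

Because the reference number comes after the #, all the lines of a same word end up consecutive after the sort, which is exactly what regexes A to F rely on.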
Now, applying the regexes A to F against the sorted file obtained, I got, after about 10s for each, the coherent results below:

•-------•---------------------------------------•----------•-------------•--------------•
| Regex | SEARCH                                | REPLACE  | Occurrences | LINES Number |
•-------•---------------------------------------•----------•-------------•--------------•
|       | Work SORTED file, obtained, AFTER all the steps above :        |      435,704 |
•-------•---------------------------------------•----------•-------------•--------------•
| A     | (?-s)^(.+#).*\R(?:\1.*\R)+            | EMPTY    |      10,818 |        6,861 |
| B     | (?-s)^((.+#).*\R)(?:\2.*\R)+          | \1       |      10,818 |       17,679 |
| C     | (?-s)^(.+#).*\R(\1.*\R)+              | \2       |      10,818 |       17,679 |
| D     | (?-s)^(.+#).*\R(?:\1.*\R)+|.+\R       | ?1$0     |      17,679 |      428,843 |
| E     | (?-s)^((.+#).*\R)(?:\2.*\R)+|.+\R     | \1       |      17,679 |       10,818 |
| F     | (?-s)^(.+#).*\R(\1.*\R)+|.+\R         | \2       |      17,679 |       10,818 |
•-------•---------------------------------------•----------•-------------•--------------•

It’s easy to verify that:
- 6,861 lines, after regex A, + 428,843 lines, after regex D = 435,704 ( Total of the file )
- 6,861 lines, after regex A, + 10,818 lines, after regex E = 17,679 lines, after regex B
- 6,861 lines, after regex A, + 10,818 lines, after regex F = 17,679 lines, after regex C
On the other hand:
- The 10,818 occurrences of regexes A, B and C correspond to all the first/last duplicate lines, as after regexes E or F
- The 17,679 occurrences of regexes D, E and F correspond to all the first/last duplicate lines AND all the unique lines, too, as after regexes B or C
Note also that:
- With the 3 regexes A, B and C, the unique lines, which must be kept, are simply not processed by the regexes
- With the 3 regexes D, E and F, the unique lines, which must be deleted, are matched by the second alternative .+\R of the regexes
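The relations between the six results can be checked on a toy file. Below is a small Python sketch that mimics what remains after each replacement, using itertools.groupby instead of the regexes themselves. It assumes a sorted list where every line contains exactly one #. The key function and the sample lines are mine, for illustration only.

```python
from itertools import groupby

def key(line: str) -> str:
    """Text up to and including '#', i.e. what the group (.+#) captures."""
    head, sep, _ = line.rpartition("#")
    return head + sep

def remaining_lines(sorted_lines: list[str]) -> dict[str, list[str]]:
    """Lines left in the file after each of the replacements A to F."""
    blocks = [list(group) for _, group in groupby(sorted_lines, key=key)]
    dups = [block for block in blocks if len(block) > 1]
    return {
        "A": [b[0] for b in blocks if len(b) == 1],  # unique lines only
        "B": [b[0] for b in blocks],                 # uniques + FIRST of each block
        "C": [b[-1] for b in blocks],                # uniques + LAST of each block
        "D": [line for b in dups for line in b],     # duplicate lines only
        "E": [b[0] for b in dups],                   # FIRST of each duplicate block
        "F": [b[-1] for b in dups],                  # LAST of each duplicate block
    }

lines = ["a#1", "a#2", "b#3", "c#4", "c#5", "c#6", "d#7"]
after = remaining_lines(lines)
print({name: len(kept) for name, kept in after.items()})
# prints {'A': 2, 'B': 4, 'C': 4, 'D': 5, 'E': 2, 'F': 2}
```

On this toy file, A + D gives the 7 lines of the file and A + E gives B, the same relations as in the list above.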
So, guys, as I said above, I’m looking for the results of your own tests, relative to the biggest block of consecutive identical lines, correctly handled by the six regexes A to F, above, and the test file, below:

    a#9999999999
    a#9999999999
    abcdefghij#9999999999   )
    .....................   )
    .....................   )  HOW MANY lines ? ( THANKS for testing !!)
    .....................   )
    abcdefghij#9999999999   ]
    z#9999999999
    z#9999999999

Best Regards,
guy038
-
-
@guy038
off topic regarding garden work:
if your garden is as detailed and thorough as everything else you do, i’d gladly invite you to help me out in mine … the amount of daily magnolia leaves to collect is currently killing me this year and i’ve not been able to control my rakes and brooms with an adequate, repeatable regex ;-)
-
(?-s)^(.+#).*\R(\1.*\R)+ doesn’t work for my case,
it doesn’t find any occurrences
-
@guy038 said:
I’m looking for the results of your own tests, relative to the biggest block of consecutive identical lines, correctly handled by the six regexes A to F, above, and the test file, below
So it might be worth pointing out a good method for creating an arbitrary (i.e., large!) number of the abcdefghij#9999999999 lines in your request.
Here’s what I would do:
- put caret on that line in a tab created for the purpose of testing this
- start macro recording
- press ctrl+d (to execute the Duplicate Current Line function)
- stop macro recording
- go to the Macro menu and choose Run a Macro Multiple Times…
- fill in the prompt box entries and press Run (to create the desired number of lines)
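The macro steps above can also be scripted outside Notepad++, if that is more convenient. Here is a minimal Python sketch. The file name is hypothetical and the CRLF line ending is an assumption about the original Windows test file.

```python
import tempfile
from pathlib import Path

MIDDLE_LINE = "abcdefghij#9999999999"

def write_test_file(path: Path, n_middle: int, eol: str = "\r\n") -> int:
    """Write the 2 + n_middle + 2 lines test file; return the count of middle lines."""
    lines = ["a#9999999999"] * 2 + [MIDDLE_LINE] * n_middle + ["z#9999999999"] * 2
    # write_bytes avoids any newline translation by the platform
    path.write_bytes((eol.join(lines) + eol).encode("ascii"))
    # Same idea as a literal Count search for the middle line
    return path.read_bytes().decode("ascii").count(MIDDLE_LINE)

target = Path(tempfile.gettempdir()) / "regex_limit_test.txt"  # hypothetical name
print(write_test_file(target, 21525))  # prints 21525
```

Changing the second argument by one line at a time makes it easy to look for the exact point where a regex starts to fail.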
To see how many lines of this type you’ve currently got, simply do a literal Count search for abcdefghij#9999999999.
-
@guy038 said:
So, guys, if you don’t mind, I would like you to test regex C, with the first test file, above, in order to verify whether it is “laptop-dependent”. I mean, maybe, my results are not pertinent with my weak Windows XP configuration!?
I did this and obtained exactly the same results as you did, @guy038. Specifically, OK with 21524 identical lines, and NOT OK with 21525 identical lines. I tried both the shorter and longer versions of those “middle” lines in the file. All this using Notepad++ 7.2.2, 32-bit. I doubt that any other (reasonable) version of Notepad++ will show different results.
-
Do you have more to say on this topic? I’m interested…