delete both duplicates regexp macro?
-
thanks a lot for your effort, but too much fuss, isn’t it?
vlookup in excel is easier to do I think
-
@patrickdrd said:
thanks a lot for your effort, but too much fuss, isn’t it?
NOTHING is too much fuss for @guy038 ! :-D
-
@guy038 said:
the regex engine ends up , matching, wrongly, all file contents
As mentioned in this thread, this is in all likelihood caused by this problem.
-
(?-s)^(.+)\R(?s)(?=.*\R\1\R?)
doesn’t match the whole line,
e.g. it tells me that adobe.com exists, but I only have lines that end in adobe.com, such as get.adobe.com
-
Hi, All,
Sorry for the delay, but I was busy with some garden work (hedge trimming!) and, of course, I also tested the 6 regexes, from A to F, of my previous post!
I used the following test file:
    a#9999999999
    a#9999999999
    abcdefghij#9999999999
    .........................
    .........................
    ..21524 IDENTICAL lines ( in totality ! )
    .........................
    .........................
    abcdefghij#9999999999
    z#9999999999
    z#9999999999
As you can see:
- It begins with the 2 identical lines a#9999999999
- It continues with the 21524 identical lines abcdefghij#9999999999
- And it finishes with the 2 identical lines z#9999999999, followed by a final line-break
So, I ran the regex C of my previous post, (?-s)^(.+#).*\R(\1.*\R)+, against this test file.
=> It correctly matched the 2 lines at the beginning of the file, then the 21524 identical lines ( => a selection of 495,103 characters ) and, finally, the 2 lines at the end of the file.
Then, I simply added ONE additional line abcdefghij#9999999999 to that file and ran the regex again. This time, it matched the 2 lines at the beginning of the file, but wrongly grabbed all the remaining text ( so the 21525 lines AND the 2 last lines ) !?
To verify whether the results depended on the size of the selection, I changed the test file to use lines of 140 chars, as below:
    a#9999999999
    a#9999999999
    abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz#9999999999999999999999999999999999999999999999999999999999999999999999999999999999999
    .........................
    .........................
    ..21524 IDENTICAL lines ( in totality ! )
    .........................
    .........................
    abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz#9999999999999999999999999999999999999999999999999999999999999999999999999999999999999
    z#9999999999
    z#9999999999
I was very surprised to see that the results were exactly the same ( OK for 21524 identical lines and KO for 21525 identical lines !!?? ). And yet, this time, the selection contained 3,013,360 chars!
Of course, I did this test with all the other regexes. For example, with regex A, the limit is a bit higher: 25120 lines. But again, after adding one more line, the regex A failed :-((
So, guys, if you don’t mind, I would like you to test the regex C, with the first test file, above, in order to verify whether it is “laptop-dependent”. I mean, maybe the results are not pertinent with my weak Windows XP configuration !?
In the meantime, it seems we can conclude that, in a previously sorted file, a regular expression cannot handle more than roughly 21,000 consecutive identical lines at a time! I’d be glad to receive your feedback, in order to confirm or invalidate this observation :-))
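For anyone who wants to cross-check the expected matches outside Notepad++, here is a minimal Python sketch. It is only an approximation under stated assumptions: Python's `re` module is a different engine from the one inside Notepad++, it has no `\R`, so `\n` line endings are assumed, and the names `REGEX_C`, `build_test_text` and `matched_block_sizes` are hypothetical helpers of mine. It does not try to reproduce the failure; it only shows which blocks regex C is supposed to match.

```python
import re

# Regex C adapted to Python's re module (an assumption, not the Notepad++ syntax):
# (?-s) is dropped because the dot already excludes newlines by default,
# \R becomes \n, and ^ needs the MULTILINE flag to match at each line start.
REGEX_C = re.compile(r"^(.+#).*\n(\1.*\n)+", re.MULTILINE)


def build_test_text(middle_count: int) -> str:
    """Build the test text: 2 'a' lines, middle_count identical lines, 2 'z' lines."""
    lines = (
        ["a#9999999999"] * 2
        + ["abcdefghij#9999999999"] * middle_count
        + ["z#9999999999"] * 2
    )
    return "\n".join(lines) + "\n"  # final line-break, as in the test file


def matched_block_sizes(text: str) -> list[int]:
    """Return the number of lines covered by each match of regex C."""
    return [m.group(0).count("\n") for m in REGEX_C.finditer(text)]


if __name__ == "__main__":
    # A correct engine should report three blocks: 2 lines, N lines, 2 lines.
    print(matched_block_sizes(build_test_text(21525)))
```

A result other than three blocks would point to the kind of over-matching described above, but a clean result in Python says nothing about the limit in Notepad++ itself.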
Of course, I came to this temporary conclusion after testing my 6 regexes, from A to F, against real text. I decided to take the whole contents of a novel from the Gutenberg site. And, as I’m French, my choice was, naturally, the novel “The Count of Monte Cristo” by Alexandre Dumas, which you may download from the link below:
http://www.gutenberg.org/files/1184/1184-0.txt ( Choose the link “Raw text UTF-8” )
When I first tried to build a suitable sorted working file, in order to test my regexes, unfortunately, all of them failed :-(( But I also noticed that this sorted file contained numerous lines the#...... Indeed, if you download the novel, just count the occurrences of the regex \bthe\b => 28628 occurrences of the article “the”. So I deleted all these consecutive lines containing the word “the”. This time, all my regexes worked as expected :-))
However, note that, during my tests, I found out that my regexes D to F were, initially, erroneous. So I changed them, and I have already updated my previous post with the correct regexes!
With the help of the page below, on the most common words in English:
https://en.wikipedia.org/wiki/Most_common_words_in_English
I verified, with the regex \bWord\b, that the 10 most common words used in the initial text of this novel are:
    the   28,628   ( ABSENT in the SORTED file )
    to    12,897
    of    12,916
    and   12,570
    a      9,473
    I      8,393
    you    8,288
    he     6,945
    in     6,625
    his    5,909
So, we are sure that the 6 regexes can, at least, manage files containing up to 13,000 consecutive identical lines!
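The counting step above can be sketched in Python as follows. This is an illustration only: the function name `count_whole_words` is hypothetical, and whether the original counts were taken with or without the "Match case" option is not stated, so case handling is exposed as a parameter.

```python
import re


def count_whole_words(text: str, words: list[str], ignore_case: bool = False) -> dict[str, int]:
    """Count whole-word occurrences of each word, in the spirit of the regex \\bWord\\b."""
    flags = re.IGNORECASE if ignore_case else 0
    counts = {}
    for word in words:
        # re.escape guards against words that contain regex metacharacters.
        pattern = rf"\b{re.escape(word)}\b"
        counts[word] = len(re.findall(pattern, text, flags))
    return counts


if __name__ == "__main__":
    sample = "The cat and the dog; then the end."
    print(count_whole_words(sample, ["the", "and"]))
    print(count_whole_words(sample, ["the", "and"], ignore_case=True))
```

To apply it to the novel, read the downloaded file into a string first and pass the list of words to check.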
Now, if some people are interested in the different steps that I used to build a decent working file for testing regexes A to F, just have a glance at the table below:
•------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------•
| SEARCH                             | REPLACE       | EXPLANATIONS                                                                                        | Occurrences |
•------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------•
|                                    |               | We delete, manually, from BEGINNING of file to the END of the CONTENTS part                         |             |
|                                    |               | We delete, manually, from AFTER the FOOTNOTES part till the VERY END of file                        |             |
| ,(?=\d)                            | EMPTY         | We delete any COMMA separator in NUMBERS                                                            |         264 |
| [,;.]                              | \x20          | We change any punctuation END of a (part of) SENTENCE with a SPACE character                        |      72,423 |
| (?i)o’(?=clock)                    | of\x20the\x20 | We replace the "o’" CONTRACTIVE form with the COMPLETE form "of the "                               |         164 |
| (?i)’s|(?<!\w)’|’(?!\w)            | EMPTY         | We delete the "’s" string and any "’" sign NOT SURROUNDED with WORD chars                           |       2,754 |
| (?i)(d|l)’                         | \1e\x20       | We change the "d’" and "l’" French CONTRACTIVE forms to, RESPECTIVELY, "de " and "le "              |         311 |
| —|-                                | \x20          | We change any HYPHEN-MINUS character as well as the EM DASH char, with a SPACE character            |       4,933 |
| [^\w’\r\n ]                        | \x20          | We ONLY keep WORD, SPACE, and EOL characters and the ’ sign( PRESENT in English CONTRACTIVE forms ) |      38,795 |
| (?-i)(?<=\s)(?=\w)[^aAIVX\d](?=\s) | \x20          | As ONE-char STRING, we ONLY keep article "A", "a", pronoun "I", DIGITS and ROMAN letters "V" , "X"  |       1,151 |
| ^\h*\R|^\h+|\h+$|\h+(?=\h)         | EMPTY         | We delete PURE BLANK lines, TRIM spaces at START and END, and REDUCE to a ONE SPACE gap             |     107,108 |
| \x20 ( > 1 mn ! )                  | \r\n          | Finally, we change any SINGLE SPACE character with a LINE BREAK                                     |     419,769 |
| COLUMN editor, with LEADING zeros  |               | At LINE 1, COLUMN 1                                                                                 |             |
| (?-s)^(\d{6})(.+)                  | \2#\1         | We SWAP each WORD and its REFERENCE number                                                          |     464,233 |
| (?i)^the#                          |               | We BOOKMARK all the LINES, containing the article "the", whatever its CASE                          |      28,529 |
| Bookmark > Cut Bookmarked Lines    |               | We BACKUP all these lines in an OTHER file, for FURTHER processing                                  |             |
| Sort lines Lexico... ASCENDING     |               | => A work SORTED file, encoded UTF-8 with BOM, of 5,861,424 BYTES, with 435,704 WORDS, ONE per line |             |
•------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------•
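The last steps of the table (numbering, swap, removal of the "the" lines, sort) can be sketched in Python. This is a simplified illustration, not the exact Notepad++ procedure: the function name `build_sorted_work_lines` is hypothetical, the numbering is assumed to start at 1 with 6 digits, and Python's default string sort is assumed to be close enough to the lexicographic sort of Notepad++.

```python
import re


def build_sorted_work_lines(words: list[str]) -> tuple[list[str], list[str]]:
    """Return (sorted work lines, backed-up 'the' lines) from a list of words."""
    # Column editor step: prefix each word with a 6-digit reference number
    # (assumed to start at 1, with leading zeros).
    numbered = [f"{index:06d}{word}" for index, word in enumerate(words, start=1)]

    # Swap step, mirroring SEARCH ^(\d{6})(.+) and REPLACE \2#\1.
    swapped = [re.sub(r"^(\d{6})(.+)$", r"\2#\1", line) for line in numbered]

    # Bookmark and cut step, mirroring (?i)^the# : set aside every "the" line.
    is_the = re.compile(r"^the#", re.IGNORECASE)
    the_lines = [line for line in swapped if is_the.match(line)]
    kept = [line for line in swapped if not is_the.match(line)]

    # Sort step: plain ascending sort by code point (an assumption).
    kept.sort()
    return kept, the_lines


if __name__ == "__main__":
    kept, the_lines = build_sorted_work_lines(["the", "count", "The", "a", "count"])
    print(kept)
    print(the_lines)
```

Because identical words share the same prefix before the `#`, the sort groups them into consecutive blocks, which is what the regexes A to F rely on.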
Now, applying the regexes A to F against the sorted file obtained, I got, after about 10 s for each, the coherent results below:
•-------•---------------------------------------•----------•-------------•--------------•
| Regex | SEARCH                                | REPLACE  | Occurrences | LINES Number |
•-------•---------------------------------------•----------•-------------•--------------•
|       | Work SORTED file, obtained, AFTER all the steps above :        |      435,704 |
•-------•---------------------------------------•----------•-------------•--------------•
|   A   | (?-s)^(.+#).*\R(?:\1.*\R)+            | EMPTY    |      10,818 |        6,861 |
|   B   | (?-s)^((.+#).*\R)(?:\2.*\R)+          | \1       |      10,818 |       17,679 |
|   C   | (?-s)^(.+#).*\R(\1.*\R)+              | \2       |      10,818 |       17,679 |
|   D   | (?-s)^(.+#).*\R(?:\1.*\R)+|.+\R       | ?1$0     |      17,679 |      428,843 |
|   E   | (?-s)^((.+#).*\R)(?:\2.*\R)+|.+\R     | \1       |      17,679 |       10,818 |
|   F   | (?-s)^(.+#).*\R(\1.*\R)+|.+\R         | \2       |      17,679 |       10,818 |
•-------•---------------------------------------•----------•-------------•--------------•
It’s easy to verify that:
- 6,861 lines, after regex A, + 428,843 lines, after regex D = 435,704 ( Total of the file )
- 6,861 lines, after regex A, + 10,818 lines, after regex E = 17,679 lines, after regex B
- 6,861 lines, after regex A, + 10,818 lines, after regex F = 17,679 lines, after regex C
On the other hand :
- The 10,818 occurrences of regexes A, B and C correspond to all the first/last duplicate lines, as after regexes E or F
- The 17,679 occurrences of regexes D, E and F correspond to all the first/last duplicate lines AND all the unique lines, too, as after regexes B or C
Note also that :
- With the 3 regexes A, B and C, the unique lines, which must be kept, are simply not processed by the regexes
- With the 3 regexes D, E and F, the unique lines, which must be deleted, are matched by the second alternative .+\R of the regexes
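The relationships above can be checked on a small sample with a Python sketch that imitates what each regex leaves in the file, without using any regex at all. It is a model under assumptions, not the regexes themselves: it assumes a sorted file where every line has exactly one `#` ( word#number ), and the names `word_of` and `simulate_regexes` are hypothetical.

```python
from itertools import groupby


def word_of(line: str) -> str:
    """Text before the '#'. Assumes exactly one '#' per line, as in the work file."""
    return line.partition("#")[0]


def simulate_regexes(sorted_lines: list[str]) -> dict[str, list[str]]:
    """Return the lines remaining after each of the replacements A to F."""
    # Consecutive lines sharing the same word form one block.
    blocks = [list(group) for _, group in groupby(sorted_lines, key=word_of)]
    result: dict[str, list[str]] = {name: [] for name in "ABCDEF"}
    for block in blocks:
        if len(block) == 1:
            # Unique line: untouched by A, B, C and deleted by D, E, F.
            for name in "ABC":
                result[name].append(block[0])
        else:
            result["B"].append(block[0])   # B keeps the first duplicate
            result["C"].append(block[-1])  # C keeps the last duplicate
            result["D"].extend(block)      # D keeps every duplicate line
            result["E"].append(block[0])   # E keeps only the first duplicate
            result["F"].append(block[-1])  # F keeps only the last duplicate
            # A deletes the whole block, so nothing is appended for A.
    return result


if __name__ == "__main__":
    sample = ["a#1", "a#2", "b#3", "c#4", "c#5", "c#6"]
    remaining = simulate_regexes(sample)
    for name, lines in remaining.items():
        print(name, lines)
    # Same identities as in the lists above, on the sample data.
    print(len(remaining["A"]) + len(remaining["D"]) == len(sample))
    print(len(remaining["A"]) + len(remaining["E"]) == len(remaining["B"]))
    print(len(remaining["A"]) + len(remaining["F"]) == len(remaining["C"]))
```

Since this model has no matching limit, comparing its line counts with those obtained in Notepad++ is one way to spot a block that the regex engine handled wrongly.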
So, guys, as I said above, I’m looking for the results of your own tests, relative to the biggest block of consecutive identical lines correctly handled by the six regexes A to F, above, with the test file below:
    a#9999999999
    a#9999999999
    abcdefghij#9999999999   )
    .....................   )
    .....................   )  HOW MANY lines ? ( THANKS for testing !!)
    .....................   )
    abcdefghij#9999999999   ]
    z#9999999999
    z#9999999999
Best Regards,
guy038
-
-
@guy038
off topic regarding garden work:
if your garden is as detailed and thorough as everything else you do, i’d gladly invite you to help me out in mine … the amount of daily magnolia leaves to collect is currently killing me this year and i’ve not been able to control my rakes and brooms with an adequate, repeatable regex ;-)
-
(?-s)^(.+#).*\R(\1.*\R)+ doesn’t work for my case,
it doesn’t find any occurrences
-
@guy038 said:
I’m looking for the results of your own tests, relative to the biggest block of consecutive identical lines, correctly handled by the six regexes A to F, above, and the test file, below
So it might be worth pointing out a good method for creating an arbitrary (i.e., large!) number of the abcdefghij#9999999999 lines in your request.
Here’s what I would do:
- put caret on that line in a tab created for the purpose of testing this
- start macro recording
- press ctrl+d (to execute the Duplicate Current Line function)
- stop macro recording
- go to the Macro menu and choose Run a Macro Multiple Times…
- fill in the prompt box entries and press Run (to create the desired number of lines)
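As an alternative to the macro steps above, the whole test file could be generated by a small script. This is only a sketch: the function name `write_test_file` and the output file name are hypothetical, and Windows CRLF line endings are assumed ( pass `eol="\n"` for Unix-style endings ).

```python
from pathlib import Path


def write_test_file(path: Path, middle_count: int, eol: str = "\r\n") -> int:
    """Write 2 'a' lines, middle_count identical lines and 2 'z' lines; return the line count."""
    lines = (
        ["a#9999999999"] * 2
        + ["abcdefghij#9999999999"] * middle_count
        + ["z#9999999999"] * 2
    )
    # Every line, including the last one, is followed by a line-break.
    data = eol.join(lines) + eol
    # write_bytes avoids any automatic newline translation by Python.
    path.write_bytes(data.encode("ascii"))
    return len(lines)


if __name__ == "__main__":
    # Hypothetical output name; change middle_count to probe the limit.
    total = write_test_file(Path("regex_limit_test.txt"), middle_count=21525)
    print(f"{total} lines written")
```

Raising or lowering `middle_count` by one and re-running the regex in Notepad++ is then a quick way to bracket the largest block that is still matched correctly.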
To see how many lines of this type you’ve currently got, simply do a literal Count search for abcdefghij#9999999999.
-
@guy038 said:
So, guys, if you don’t mind, I would like you to test the regex C, with the first test file, above, in order to verify whether it is “laptop-dependent”. I mean, maybe the results are not pertinent with my weak Windows XP configuration !?
I did this and obtained exactly the same results as you did, @guy038. Specifically, OK with 21524 identical lines, and NOT OK with 21525 identical lines. I tried both the shorter and longer versions of those “middle” lines in the file. All this using Notepad++ 7.2.2, 32-bit. I doubt that any other (reasonable) version of Notepad++ will show different results.