delete both duplicates regexp macro?



  • thanks a lot for your effort, but too much fuss, isn’t it?

    vlookup in excel is easier to do I think



  • @patrickdrd said:

    thanks a lot for your effort, but too much fuss, isn’t it?

    NOTHING is too much fuss for @guy038 ! :-D



  • @guy038 said:

    the regex engine ends up matching, wrongly, all file contents

    As mentioned in this thread, this is in all likelihood caused by this problem.



  • (?-s)^(.+)\R(?s)(?=.*\R\1\R?)

    doesn’t match the whole line,
    e.g. it tells me that adobe.com exists, but I only have lines that end in adobe.com, e.g. get.adobe.com



  • Hi, All,

    Sorry for the delay, but I was busy with some garden work (hedge trimming !) and, of course, I also tested the 6 regexes, from A to F, of my previous post !

    I used the following test file :

    a#9999999999
    a#9999999999
    abcdefghij#9999999999
    .........................
    .........................
    ..21524 IDENTICAL lines ( in total ! )
    .........................  
    .........................  
    abcdefghij#9999999999
    z#9999999999
    z#9999999999
    

    As you can see :

    • It begins with the 2 identical lines a#9999999999

    • Then come 21524 identical lines abcdefghij#9999999999

    • And it finishes with the 2 identical lines z#9999999999, followed by a final line-break
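
    For anyone who prefers to build this test file outside the editor, here is a minimal Python sketch. The function names and the output file name are hypothetical, and the Windows line-break ( \r\n ) is an assumption, as the post does not say which EOL the file uses :

```python
def build_test_text(n_middle: int = 21524, eol: str = "\r\n") -> str:
    """Return the test text: 2 'a' lines, n_middle identical lines, 2 'z' lines."""
    lines = (
        ["a#9999999999"] * 2
        + ["abcdefghij#9999999999"] * n_middle
        + ["z#9999999999"] * 2
    )
    # Every line, including the very last one, ends with a line-break
    return eol.join(lines) + eol


def write_test_file(path: str, n_middle: int = 21524) -> None:
    """Write the test text to `path` (hypothetical helper)."""
    # newline="" keeps the "\r\n" sequences exactly as they were built
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(build_test_text(n_middle))
```

    Calling write_test_file("regex_test.txt", 21525) would produce the variant with ONE additional middle line.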


    So, I ran the regex C of my previous post, ( (?-s)^(.+#).*\R(\1.*\R)+ ), against this test file

    => It correctly matched the 2 lines at the beginning of the file, then the 21524 identical lines ( => a selection of 495,103 characters ) and, finally, the 2 lines at the end of the file

    Then, I simply added ONE additional line abcdefghij#9999999999 to that file and ran the regex again. This time, it matched the 2 lines at the beginning of the file, but wrongly grabbed all the remaining text ( so the 21525 lines AND the 2 last lines ) !?
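
    As a reference for the EXPECTED behaviour ( three separate matches ), here is a small Python sketch. It is only a stand-in : Python's re module is a different engine from the one in Notepad++, it has no \R, so \n is used instead, and the helper names are hypothetical. It therefore shows what regex C should match, and is not expected to reproduce the limit described above :

```python
import re

# Stand-in for regex C, (?-s)^(.+#).*\R(\1.*\R)+
# "." already excludes "\n" by default in Python, so (?-s) is not needed.
REGEX_C = re.compile(r"^(.+#).*\n(\1.*\n)+", re.MULTILINE)


def build_text(n_middle: int) -> str:
    """Test text with "\n" line-breaks (assumption made for this sketch)."""
    lines = (
        ["a#9999999999"] * 2
        + ["abcdefghij#9999999999"] * n_middle
        + ["z#9999999999"] * 2
    )
    return "\n".join(lines) + "\n"


def block_sizes(text: str) -> list[int]:
    """Number of lines covered by each match of regex C, in order."""
    return [m.group(0).count("\n") for m in REGEX_C.finditer(text)]


if __name__ == "__main__":
    # One entry per block of consecutive lines sharing the same "word#" prefix
    print(block_sizes(build_text(1000)))
```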

    To verify whether the results depended on the size of the selection, I changed the test file to use lines of 140 chars, as below :

    a#9999999999
    a#9999999999
    abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz#9999999999999999999999999999999999999999999999999999999999999999999999999999999999999
    .........................
    .........................
    ..21524 IDENTICAL lines ( in total ! )
    .........................  
    .........................  
    abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz#9999999999999999999999999999999999999999999999999999999999999999999999999999999999999
    z#9999999999
    z#9999999999
    

    I was very surprised to see that the results were exactly the same ( OK for 21524 identical lines and KO for 21525 identical lines !!?? ) And yet, this time, the selection contained 3,013,360 chars !

    Of course, I did this test with all the other regexes. For example, with regex A, the limit is a bit higher : 25120 lines. But again, after adding one more line, the regex A failed :-((

    So, guys, if you don’t mind, I would like you to test the regex C, with the first test file above, in order to verify whether it is “laptop-dependent”. I mean, maybe my results are not relevant because of my weak Windows XP configuration !?

    In the meantime, it seems we can conclude that, in a previously sorted file, a regular expression can handle, roughly, no more than 21,000 identical lines at a time ! I’d be glad to receive your feedback in order to confirm or invalidate this observation :-))


    Of course, I came to this temporary conclusion after testing my 6 regexes, from A to F, against real text. I decided to take the whole contents of a novel from the Gutenberg site. And…, as I’m French, my choice was, naturally, the novel “The Count of Monte Cristo” by Alexandre Dumas, that you may download from the link below:

    http://www.gutenberg.org/files/1184/1184-0.txt ( Choose the link “Raw text UTF-8” )

    When I first tried to build a suitable sorted working file, in order to test my regexes, unfortunately, all failed :-(( But I also noticed, in that sorted file, that there were numerous lines the#...... Indeed, if you download the novel, just count the occurrences of the regex \bthe\b => 28628 occurrences of the article “the”. So I deleted all these consecutive occurrences of the word “the”. This time all my regexes worked as expected :-))

    However, note that, during my tests, I found out that my regexes D to F were, initially, erroneous. So I changed them, and I already updated my previous post with the correct regexes !


    With the help of that page, below, on the most common words in English :

    https://en.wikipedia.org/wiki/Most_common_words_in_English

    I verified, with the regex \bWord\b, that, in this novel, the 10 most common words used, in the initial text, are :

    the          28,628  ( ABSENT in the SORTED file )
    to           12,897
    of           12,916
    and          12,570
    a             9,473
    I             8,393
    you           8,288
    he            6,945
    in            6,625
    his           5,909
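
    The same kind of whole-word count can be sketched in Python, as below. This is only an illustration on a made-up sample sentence, not on the novel itself. The function name and the ignore_case parameter are assumptions, as the post does not say whether the "Match case" option was used for the figures above :

```python
import re


def count_word(text: str, word: str, ignore_case: bool = False) -> int:
    """Count whole-word occurrences of `word`, like a search for \\bword\\b."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape protects any regex metacharacter present in the word
    return len(re.findall(rf"\b{re.escape(word)}\b", text, flags))


if __name__ == "__main__":
    sample = "The count and the sailor met; then the other one said: I agree."
    for w in ("the", "and", "I"):
        print(w, count_word(sample, w))
```

    Note that words such as "then" or "other" are not counted, because of the two \b word boundaries.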
    

    So, we are sure that the 6 regexes can, at least, manage files containing up to 13,000 consecutive identical lines !


    Now, if anyone is interested in the different steps that I used to build a decent working file for testing regexes A to F, just have a glance at the table below :

    •------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------• 
    |            SEARCH                  |    REPLACE    |                                       EXPLANATIONS                                                  | Occurrences |
    •------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------•
    |                                    |               | We delete, manually, from BEGINNING of file to the END of the CONTENTS part                         |             |
    |                                    |               |                                                                                                     |             |
    |                                    |               | We delete, manually, from AFTER the FOOTNOTES part till the VERY END of file                        |             |
    |                                    |               |                                                                                                     |             |
    | ,(?=\d)                            |    EMPTY      | We delete any COMMA separator in NUMBERS                                                            |       264   |
    |                                    |               |                                                                                                     |             |
    | [,;.]                              | \x20          | We change any punctuation END of a (part of) SENTENCE with a SPACE character                        |    72,423   |
    |                                    |               |                                                                                                     |             |
    | (?i)o’(?=clock)                    | of\x20the\x20 | We replace the "o’" CONTRACTIVE form with the COMPLETE form "of the "                               |       164   |
    |                                    |               |                                                                                                     |             |
    | (?i)’s|(?<!\w)’|’(?!\w)            |    EMPTY      | We delete the "’s" string and any "’" sign NOT SURROUNDED with WORD chars                           |     2,754   |
    |                                    |               |                                                                                                     |             |
    | (?i)(d|l)’                         | \1e\x20       | We change the "d’" and "l’" French CONTRACTIVE forms to, RESPECTIVELY, "de " and "le "              |       311   |
    |                                    |               |                                                                                                     |             |
    | —|-                                | \x20          | We change any HYPHEN-MINUS character as well as the EM DASH char, with a SPACE character            |     4,933   |
    |                                    |               |                                                                                                     |             |
    | [^\w’\r\n ]                        | \x20          | We ONLY keep WORD, SPACE, and EOL characters and the ’ sign( PRESENT in English CONTRACTIVE forms ) |    38,795   |
    |                                    |               |                                                                                                     |             |
    | (?-i)(?<=\s)(?=\w)[^aAIVX\d](?=\s) | \x20          | As ONE-char STRING, we ONLY keep article "A", "a", pronoun "I", DIGITS and ROMAN letters "V" , "X"  |     1,151   |
    |                                    |               |                                                                                                     |             |
    | ^\h*\R|^\h+|\h+$|\h+(?=\h)         |    EMPTY      | We delete PURE BLANK lines, TRIM spaces at START and END, and REDUCE to a ONE SPACE gap             |   107,108   |
    |                                    |               |                                                                                                     |             |
    | \x20          ( > 1 mn ! )         | \r\n          | Finally, we change any SINGLE SPACE character with a LINE BREAK                                     |   419,769   |
    |                                    |               |                                                                                                     |             |
    | COLUMN editor, with LEADING zeros  |               | At LINE 1, COLUMN 1                                                                                 |             |
    |                                    |               |                                                                                                     |             |
    | (?-s)^(\d{6})(.+)                  | \2#\1         | We SWAP each WORD and its REFERENCE number                                                          |   464,233   |
    |                                    |               |                                                                                                     |             |
    | (?i)^the#                          |               | We BOOKMARK all the LINES, containing the article "the", whatever its CASE                          |    28,529   |
    |                                    |               |                                                                                                     |             |
    | Bookmark > Cut Bookmarked Lines    |               | We BACKUP all these lines in an OTHER file, for FURTHER processing                                  |             |
    |                                    |               |                                                                                                     |             |
    | Sort lines Lexico... ASCENDING     |               | => A work SORTED file, encoded UTF-8 with BOM, of 5,861,424 BYTES, with 435,704 WORDS, ONE per line |             |
    |                                    |               |                                                                                                     |             |
    •------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------•
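
    The last steps of this table ( numbering, swap, removal of the "the#" lines and sort ) can be sketched in Python as follows. The function name is hypothetical, the start value 1 for the numbering is an assumption, and Python's sorted() compares code points, which may not give exactly the same order as the editor's lexicographic sort :

```python
import re


def build_work_lines(words: list[str]) -> tuple[list[str], list[str]]:
    """Return (sorted work lines, backed-up "the#" lines) from a list of words."""
    # Column editor step: 6-digit number with leading zeros, at column 1
    numbered = [f"{i:06d}{w}" for i, w in enumerate(words, start=1)]

    # Swap step: SEARCH (?-s)^(\d{6})(.+)   REPLACE \2#\1
    swapped = [re.sub(r"^(\d{6})(.+)", r"\2#\1", line) for line in numbered]

    # Bookmark step: SEARCH (?i)^the#  => these lines are cut and backed up
    is_the = re.compile(r"^the#", re.IGNORECASE)
    the_lines = [line for line in swapped if is_the.match(line)]
    kept = [line for line in swapped if not is_the.match(line)]

    # Sort step: ascending order, so that identical words become consecutive
    return sorted(kept), the_lines


if __name__ == "__main__":
    work, backup = build_work_lines(["the", "count", "of", "the", "count", "The"])
    print(work)
    print(backup)
```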
    

    Now, applying the regexes A to F against the sorted file obtained, I got, after about 10 s for each, the consistent results below :

    
    •-------•---------------------------------------•----------•-------------•--------------•
    | Regex |                SEARCH                 |  REPLACE | Occurrences | LINES Number |
    •-------•---------------------------------------•----------•-------------•--------------•
    |       |    Work SORTED file, obtained, AFTER all the steps above :     |    435,704   |
    •-------•---------------------------------------•----------•-------------•--------------•
    |       |                                       |          |             |              |
    |   A   |  (?-s)^(.+#).*\R(?:\1.*\R)+           |  EMPTY   |    10,818   |      6,861   |
    |       |                                       |          |             |              |
    |   B   |  (?-s)^((.+#).*\R)(?:\2.*\R)+         |  \1      |    10,818   |     17,679   |
    |       |                                       |          |             |              |
    |   C   |  (?-s)^(.+#).*\R(\1.*\R)+             |  \2      |    10,818   |     17,679   |
    |       |                                       |          |             |              |
    |   D   |  (?-s)^(.+#).*\R(?:\1.*\R)+|.+\R      |  ?1$0    |    17,679   |    428,843   |
    |       |                                       |          |             |              |
    |   E   |  (?-s)^((.+#).*\R)(?:\2.*\R)+|.+\R    |  \1      |    17,679   |     10,818   |
    |       |                                       |          |             |              |
    |   F   |  (?-s)^(.+#).*\R(\1.*\R)+|.+\R        |  \2      |    17,679   |     10,818   |
    |       |                                       |          |             |              |
    •-------•---------------------------------------•----------•-------------•--------------•
    

    It’s easy to verify that :

    • 6,861 lines, after regex A + 428,843 lines, after regex D = 435,704 ( Total of the file )

    • 6,861 lines, after regex A, + 10,818 lines, after regex E = 17,679 lines, after regex B

    • 6,861 lines, after regex A, + 10,818 lines, after regex F = 17,679 lines, after regex C

    On the other hand :

    • The 10,818 occurrences of regexes A, B and C correspond to all the first/last duplicate lines, as after regexes E or F

    • The 17,679 occurrences of regexes D, E and F correspond to all the first/last duplicate lines AND all the unique lines, too, as after regexes B or C

    Note also that :

    • With the 3 regexes A, B and C, the unique lines, which must be kept, are simply not processed by the regexes

    • With the 3 regexes D, E and F, the unique lines, which must be deleted, are matched by the second alternative .+\R of the regexes
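
    The relations above can be checked on a tiny sorted sample with the Python sketch below. It is a transposition, under assumptions, of the six regexes : \R becomes \n, (?-s) is dropped because "." excludes \n by default in Python, and the conditional replacement ?1$0 of regex D is emulated with a small function. The names REGEXES, keep_if_group1 and run are hypothetical :

```python
import re


def keep_if_group1(m: re.Match) -> str:
    """Emulates the replacement ?1$0 : keep the whole match only if group 1 exists."""
    return m.group(0) if m.group(1) is not None else ""


# name -> (search, replacement); an unmatched group is replaced with nothing
REGEXES = {
    "A": (r"^(.+#).*\n(?:\1.*\n)+", ""),
    "B": (r"^((.+#).*\n)(?:\2.*\n)+", r"\1"),
    "C": (r"^(.+#).*\n(\1.*\n)+", r"\2"),
    "D": (r"^(.+#).*\n(?:\1.*\n)+|.+\n", keep_if_group1),
    "E": (r"^((.+#).*\n)(?:\2.*\n)+|.+\n", r"\1"),
    "F": (r"^(.+#).*\n(\1.*\n)+|.+\n", r"\2"),
}


def run(name: str, text: str) -> tuple[int, int, str]:
    """Return (occurrences, remaining lines, new text) for one regex."""
    search, replacement = REGEXES[name]
    new_text, occurrences = re.subn(search, replacement, text, flags=re.MULTILINE)
    return occurrences, new_text.count("\n"), new_text


if __name__ == "__main__":
    sample = "a#000001\na#000005\nb#000002\nc#000003\nc#000004\nc#000006\n"
    for name in REGEXES:
        occurrences, lines, _ = run(name, sample)
        print(name, occurrences, lines)
```

    On this 6-line sample, with 2 blocks of duplicates and 1 unique line, the same relations hold : lines after A + lines after D = total, and lines after A + lines after E ( or F ) = lines after B ( or C ).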


    So, guys, as I said, above, I’m looking for the results of your own tests, relative to the biggest block of consecutive identical lines, correctly handled by the six regexes A to F, above, and the test file, below :

    a#9999999999
    a#9999999999
    abcdefghij#9999999999         )
    .....................         )
    .....................         )   HOW MANY lines ? ( THANKS for testing !!)
    .....................         )
    abcdefghij#9999999999         )
    z#9999999999
    z#9999999999
    

    Best Regards,

    guy038



  • @guy038
    off topic regarding garden work:
    if your garden is as detailed and thorough as everything else you do, i’d gladly invite you to help me out in mine … the amount of daily magnolia leaves to collect is currently killing me this year and i’ve not been able to control my rakes and brooms with an adequate, repeatable regex ;-)



  • (?-s)^(.+#).*\R(\1.*\R)+ doesn’t work for my case,
    it doesn’t find any occurrences



  • @guy038 said:

    I’m looking for the results of your own tests, relative to the biggest block of consecutive identical lines, correctly handled by the six regexes A to F, above, and the test file, below

    So it might be worth pointing out a good method for creating an arbitrary (i.e., large!) number of the abcdefghij#9999999999 lines in your request.

    Here’s what I would do:

    • put caret on that line in a tab created for the purpose of testing this
    • start macro recording
    • press ctrl+d (to execute the Duplicate Current Line function)
    • stop macro recording
    • go to the Macro menu and choose Run a Macro Multiple Times…
    • fill in the prompt box entries and press Run (to create the desired number of lines)

    To see how many lines of this type you’ve currently got, simply do a literal Count search for abcdefghij#9999999999.



  • @guy038 said:

    So, guys, if you don’t mind, I would like you to test the regex C, with the first test file above, in order to verify whether it is “laptop-dependent”. I mean, maybe my results are not relevant because of my weak Windows XP configuration !?

    I did this and obtained exactly the same results as you did, @guy038. Specifically, OK with 21524 identical lines, and NOT OK with 21525 identical lines. I tried both the shorter and longer versions of those “middle” lines in the file. All this using Notepad++ 7.2.2, 32-bit. I doubt that any other (reasonable) version of Notepad++ will show different results.



  • @guy038,

    Do you have more to say on this topic? I’m interested…

