delete both duplicates regexp macro?



  • Hi, All

    Unfortunately, I verified, once again, that my previous method works only if the file contents and/or the number of lines processed are not too large :-(( In most cases, the regex engine ends up matching, wrongly, all file contents. Too bad !

    So, if you wish to keep the initial order of your file, here is a new method to adopt, which covers all cases ( I hope so ! ), in order to keep/delete duplicate lines AND/OR all non-duplicate lines of a file, whatever its size !

    Please, do any test, even on large files, to verify that this method is robust and does not fail ! I’ll be glad to get your feedback :-))


    So, let’s start with that sample text :

    567890
    1234
    45
    1234
    xyz
    567890
    567890
    000000000
    567890
    45
    abcdef
    1234
    1234
    45
    hijk
    45
    45
    567890
    1234
    999
    1234
    
    • Move the cursor to the beginning of the first item 567890

    • Open the Column editor ( Edit > Column Editor... )

    • Insert a decimal sequence of numbers, ticking the Leading zeros option

    • Delete the last isolated number 22

    =>

    01567890
    021234
    0345
    041234
    05xyz
    06567890
    07567890
    08000000000
    09567890
    1045
    11abcdef
    121234
    131234
    1445
    15hijk
    1645
    1745
    18567890
    191234
    20999
    211234
    
    • Now, use the regex S/R, below, to swap the positions of data and numbers, where N is the number of digits of the previous numbering, and to insert a separator character ( I chose the # character, but any single char may suit, provided that it’s not used in your data. Prefer a character which is not a regex meta-character ! )

      • SEARCH (?-s)^(\d{N})(.+)

      • REPLACE \2#\1

    As, in our example, N = 2, it leads to the text :

    567890#01
    1234#02
    45#03
    1234#04
    xyz#05
    567890#06
    567890#07
    000000000#08
    567890#09
    45#10
    abcdef#11
    1234#12
    1234#13
    45#14
    hijk#15
    45#16
    45#17
    567890#18
    1234#19
    999#20
    1234#21
    
    • Then, execute a sort with the menu option Edit > Line Operations > Sort Lines Lexicographically Ascending =>
    000000000#08
    1234#02
    1234#04
    1234#12
    1234#13
    1234#19
    1234#21
    45#03
    45#10
    45#14
    45#16
    45#17
    567890#01
    567890#06
    567890#07
    567890#09
    567890#18
    999#20
    abcdef#11
    hijk#15
    xyz#05
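
    For those who would rather script these first three steps ( numbering, swap and sort ), here is a minimal Python sketch. It is not part of the Notepad++ method : the name decorate_and_sort is made up for the illustration, and Python’s sorted() compares by code point, which is assumed to agree with the lexicographic sort of Notepad++ for plain ASCII data like the sample.

```python
# The 21 lines of the sample text, in their initial order
SAMPLE = """567890 1234 45 1234 xyz 567890 567890 000000000 567890 45 abcdef
1234 1234 45 hijk 45 45 567890 1234 999 1234""".split()


def decorate_and_sort(lines, sep="#"):
    """Append a zero-padded sequence number to each line, then sort."""
    width = len(str(len(lines)))  # N, the number of digits of the numbering
    tagged = [f"{text}{sep}{i:0{width}d}" for i, text in enumerate(lines, start=1)]
    return sorted(tagged)


for line in decorate_and_sort(SAMPLE):
    print(line)
```

    The zero padding matters : without it, the later sort on the numbers would put 10 before 2.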
    

    Important : Till the end of that post, this sorted text becomes the new sample text !


    Now, here are the six regex S/R that cover all possible cases :

    • Regex A : SEARCH (?-s)^(.+#).*\R(?:\1.*\R)+ and REPLACE Leave EMPTY

    • Regex B : SEARCH (?-s)^((.+#).*\R)(?:\2.*\R)+ and REPLACE \1

    • Regex C : SEARCH (?-s)^(.+#).*\R(\1.*\R)+ and REPLACE \2

    • Regex D : SEARCH (?-s)^(.+#).*\R(?:\1.*\R)+|.+\R and REPLACE ?1$0

    • Regex E : SEARCH (?-s)^((.+#).*\R)(?:\2.*\R)+|.+\R and REPLACE \1

    • Regex F : SEARCH (?-s)^(.+#).*\R(\1.*\R)+|.+\R and REPLACE \2


    So, in a previously sorted file ( I insist ! ) and whatever the numbering after the # symbol :

    • If you want to delete every line which has a duplicate, i.e. keep the isolated lines, only, use the regex A

    • If you want to keep isolated lines AND the first line of each block of duplicate lines, only, use the regex B

    • If you want to keep isolated lines AND the last line of each block of duplicate lines, only, use the regex C

    • If you want to delete isolated lines, only, i.e. keep all the lines of each block of duplicate lines, use the regex D

    • If you want to keep the first line of each block of duplicate lines, only, use the regex E

    • If you want to keep the last line of each block of duplicate lines, only, use the regex F

    Here are, below, the results of these six regex S/R, against the sample text :

    •----------------•----------------•----------------•----------------•----------------•----------------•
    |    Regex A     |    Regex B     |    Regex C     |    Regex D     |    Regex E     |    Regex F     |
    •----------------•----------------•----------------•----------------•----------------•----------------•
    |  000000000#08  |  000000000#08  |  000000000#08  |  1234#02       |  1234#02       |  1234#21       |
    |  999#20        |  1234#02       |  1234#21       |  1234#04       |  45#03         |  45#17         |
    |  abcdef#11     |  45#03         |  45#17         |  1234#12       |  567890#01     |  567890#18     |
    |  hijk#15       |  567890#01     |  567890#18     |  1234#13       |                |                |
    |  xyz#05        |  999#20        |  999#20        |  1234#19       |                |                |
    |                |  abcdef#11     |  abcdef#11     |  1234#21       |                |                |
    |                |  hijk#15       |  hijk#15       |  45#03         |                |                |
    |                |  xyz#05        |  xyz#05        |  45#10         |                |                |
    |                |                |                |  45#14         |                |                |
    |                |                |                |  45#16         |                |                |
    |                |                |                |  45#17         |                |                |
    |                |                |                |  567890#01     |                |                |
    |                |                |                |  567890#06     |                |                |
    |                |                |                |  567890#07     |                |                |
    |                |                |                |  567890#09     |                |                |
    |                |                |                |  567890#18     |                |                |
    •----------------•----------------•----------------•----------------•----------------•----------------•
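
    To double-check the six regexes outside Notepad++, they can be replayed with Python’s re module. This is only a sketch, under a few assumptions : \R is written \n ( the test text uses plain line feeds ), (?-s) is dropped because it is Python’s default behaviour, and the conditional replacement ?1$0, which does not exist in Python’s re, is emulated with small functions. The names RULES and apply_rule are hypothetical.

```python
import re

SORTED_LINES = """000000000#08 1234#02 1234#04 1234#12 1234#13 1234#19 1234#21
45#03 45#10 45#14 45#16 45#17 567890#01 567890#06 567890#07 567890#09
567890#18 999#20 abcdef#11 hijk#15 xyz#05""".split()
SORTED = "".join(line + "\n" for line in SORTED_LINES)  # every line ends with \n

# name -> (search, replacement); a function replaces the Boost-style ?1$0 / empty group
RULES = {
    "A": (r"^(.+#).*\n(?:\1.*\n)+", ""),
    "B": (r"^((.+#).*\n)(?:\2.*\n)+", r"\1"),
    "C": (r"^(.+#).*\n(\1.*\n)+", r"\2"),
    "D": (r"^(.+#).*\n(?:\1.*\n)+|.+\n", lambda m: m.group(0) if m.group(1) else ""),
    "E": (r"^((.+#).*\n)(?:\2.*\n)+|.+\n", lambda m: m.group(1) or ""),
    "F": (r"^(.+#).*\n(\1.*\n)+|.+\n", lambda m: m.group(2) or ""),
}


def apply_rule(name, text):
    """Run one of the six search/replace pairs and return the remaining lines."""
    pattern, replacement = RULES[name]
    return re.sub(pattern, replacement, text, flags=re.MULTILINE).splitlines()


for name in RULES:
    print(name, apply_rule(name, SORTED))
```

    Note that, with this sample, regex D keeps the 16 lines which belong to a block of duplicates ( 6 + 5 + 5 ).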
    

    • Now, considering any of these 6 results, just above, let’s swap, with the regex S/R, below, the two blocks of data, on either side of the # character

      • SEARCH (?-s)^(.+)#(.+)

      • REPLACE \2#\1

    We get the different cases, below :

    •----------------•----------------•----------------•----------------•----------------•----------------•
    |    Regex A     |    Regex B     |    Regex C     |    Regex D     |    Regex E     |    Regex F     |
    •----------------•----------------•----------------•----------------•----------------•----------------•
    |  08#000000000  |  08#000000000  |  08#000000000  |  02#1234       |  02#1234       |  21#1234       |
    |  20#999        |  02#1234       |  21#1234       |  04#1234       |  03#45         |  17#45         |
    |  11#abcdef     |  03#45         |  17#45         |  12#1234       |  01#567890     |  18#567890     |
    |  15#hijk       |  01#567890     |  18#567890     |  13#1234       |                |                |
    |  05#xyz        |  20#999        |  20#999        |  19#1234       |                |                |
    |                |  11#abcdef     |  11#abcdef     |  21#1234       |                |                |
    |                |  15#hijk       |  15#hijk       |  03#45         |                |                |
    |                |  05#xyz        |  05#xyz        |  10#45         |                |                |
    |                |                |                |  14#45         |                |                |
    |                |                |                |  16#45         |                |                |
    |                |                |                |  17#45         |                |                |
    |                |                |                |  01#567890     |                |                |
    |                |                |                |  06#567890     |                |                |
    |                |                |                |  07#567890     |                |                |
    |                |                |                |  09#567890     |                |                |
    |                |                |                |  18#567890     |                |                |
    •----------------•----------------•----------------•----------------•----------------•----------------•
    
    • Considering any of these 6 results, just above, perform, again, a sort, with the option Edit > Line Operations > Sort Lines Lexicographically Ascending =>
    •----------------•----------------•----------------•----------------•----------------•----------------•
    |    Regex A     |    Regex B     |    Regex C     |    Regex D     |    Regex E     |    Regex F     |
    •----------------•----------------•----------------•----------------•----------------•----------------•
    |  05#xyz        |  01#567890     |  05#xyz        |  01#567890     |  01#567890     |  17#45         |
    |  08#000000000  |  02#1234       |  08#000000000  |  02#1234       |  02#1234       |  18#567890     |
    |  11#abcdef     |  03#45         |  11#abcdef     |  03#45         |  03#45         |  21#1234       |
    |  15#hijk       |  05#xyz        |  15#hijk       |  04#1234       |                |                |
    |  20#999        |  08#000000000  |  17#45         |  06#567890     |                |                |
    |                |  11#abcdef     |  18#567890     |  07#567890     |                |                |
    |                |  15#hijk       |  20#999        |  09#567890     |                |                |
    |                |  20#999        |  21#1234       |  10#45         |                |                |
    |                |                |                |  12#1234       |                |                |
    |                |                |                |  13#1234       |                |                |
    |                |                |                |  14#45         |                |                |
    |                |                |                |  16#45         |                |                |
    |                |                |                |  17#45         |                |                |
    |                |                |                |  18#567890     |                |                |
    |                |                |                |  19#1234       |                |                |
    |                |                |                |  21#1234       |                |                |
    •----------------•----------------•----------------•----------------•----------------•----------------•
    
    • Finally, let’s use this last regex S/R to get rid of all the counting marks

      • SEARCH (?-s)^.+#

      • REPLACE Leave Empty

    We obtain the 6 final results, from the original text :

    •----------------•----------------•----------------•----------------•----------------•----------------•
    |    Regex A     |    Regex B     |    Regex C     |    Regex D     |    Regex E     |    Regex F     |
    •----------------•----------------•----------------•----------------•----------------•----------------•
    |    xyz         |    567890      |    xyz         |    567890      |    567890      |    45          |
    |    000000000   |    1234        |    000000000   |    1234        |    1234        |    567890      |
    |    abcdef      |    45          |    abcdef      |    45          |    45          |    1234        |
    |    hijk        |    xyz         |    hijk        |    1234        |                |                |
    |    999         |    000000000   |    45          |    567890      |                |                |
    |                |    abcdef      |    567890      |    567890      |                |                |
    |                |    hijk        |    999         |    567890      |                |                |
    |                |    999         |    1234        |    45          |                |                |
    |                |                |                |    1234        |                |                |
    |                |                |                |    1234        |                |                |
    |                |                |                |    45          |                |                |
    |                |                |                |    45          |                |                |
    |                |                |                |    45          |                |                |
    |                |                |                |    567890      |                |                |
    |                |                |                |    1234        |                |                |
    |                |                |                |    1234        |                |                |
    •----------------•----------------•----------------•----------------•----------------•----------------•
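
    The last three steps ( swap back, sort on the numbers, remove the counting marks ) can also be sketched in Python. The function name restore_order is made up for the illustration, and the sketch assumes, as the method does, that the separator never occurs in the data itself.

```python
def restore_order(tagged, sep="#"):
    """Undo the decoration: swap, sort on the number, drop the number."""
    swapped = []
    for line in tagged:
        text, _, number = line.rpartition(sep)  # split on the LAST separator
        swapped.append(f"{number}{sep}{text}")
    swapped.sort()  # zero-padded numbers sort in the initial file order
    return [line.partition(sep)[2] for line in swapped]


# Lines kept by regex F on the sample text
print(restore_order(["1234#21", "45#17", "567890#18"]))
```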
    

    Remark : This method needs numerous steps, but is quite safe, because all the modifications, produced by the different S/R, concern one line at a time ( or a consecutive block of lines, in regexes A to F ! )

    Of course, on huge files, execution time may be long, but you should get the expected results, at the end ;-))

    Cheers,

    guy038



  • thanks a lot for your effort, but too much fuss, isn’t it?

    vlookup in excel is easier to do I think



  • @patrickdrd said:

    thanks a lot for your effort, but too much fuss, isn’t it?

    NOTHING is too much fuss for @guy038 ! :-D



  • @guy038 said:

    the regex engine ends up matching, wrongly, all file contents

    As mentioned in this thread, this is in all likelihood caused by this problem.



  • (?-s)^(.+)\R(?s)(?=.*\R\1\R?)

    doesn’t match the whole line,
    e.g. it tells me that adobe.com exists, but I only have lines that end in adobe.com, e.g. get.adobe.com



  • Hi, All,

    Sorry for the delay, but I was busy with some garden work (hedge trimming !) and, of course, I also tested the 6 regexes, from A to F, of my previous post !

    I used the following test file :

    a#9999999999
    a#9999999999
    abcdefghij#9999999999
    .........................
    .........................
    ..21524 IDENTICAL lines ( in totality ! )
    .........................  
    .........................  
    abcdefghij#9999999999
    z#9999999999
    z#9999999999
    

    As you can see :

    • It begins with the 2 identical lines a#9999999999

    • It continues with 21524 identical lines abcdefghij#9999999999

    • And it finishes with the 2 identical lines z#9999999999, followed with a final line-break
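
    If you prefer to generate such a test file with a script, here is a small Python sketch. The function names and the file name in the usage comment are hypothetical, and Windows line-breaks ( \r\n ) are assumed.

```python
def build_test_lines(n_middle, middle="abcdefghij#9999999999"):
    """2 'a' lines, n_middle identical middle lines, then 2 'z' lines."""
    return ["a#9999999999"] * 2 + [middle] * n_middle + ["z#9999999999"] * 2


def write_test_file(path, n_middle):
    """Write the test lines to path, each one ending with a Windows line-break."""
    with open(path, "w", newline="") as handle:
        handle.write("".join(line + "\r\n" for line in build_test_lines(n_middle)))


# Usage ( hypothetical file name ) : write_test_file("dup_test.txt", 21524)
```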


    So, I ran the regex C of my previous post, ( (?-s)^(.+#).*\R(\1.*\R)+ ), against this test file

    => It correctly matched the 2 lines, at beginning of file, then the 21524 identical lines ( => a selection of 495,103 characters ) and, the 2 lines at the end of the file

    Then, I simply added ONE additional line abcdefghij#9999999999 to that file and ran the regex again. This time, it matched the 2 lines, at beginning of file, but wrongly grabbed all remaining text ( So the 21525 lines AND the 2 last lines ) !?

    To verify whether the results depended on the size of the selection, I changed the test file, with lines of 140 chars, as below :

    a#9999999999
    a#9999999999
    abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz#9999999999999999999999999999999999999999999999999999999999999999999999999999999999999
    .........................
    .........................
    ..21524 IDENTICAL lines ( in totality ! )
    .........................  
    .........................  
    abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz#9999999999999999999999999999999999999999999999999999999999999999999999999999999999999
    z#9999999999
    z#9999999999
    

    I was very surprised to see that results were exactly the same ( OK for 21524 identical lines and KO for 21525 identical lines !!?? ) And yet, this time, the selection contained 3,013,360 chars !

    Of course, I did this test with all the other regexes. For example, with regex A, the limit is a bit higher : 25120 lines. But again, after adding one more line, the regex A failed :-((

    So, guys, if you don’t mind, I would like you to test the regex C, with the first test file, above, in order to verify if it is “laptop-dependent”. I mean, maybe, the results are not relevant with my weak Windows XP configuration !?

    In the meantime, seemingly, we can conclude that, in a previously sorted file, a regular expression can handle, roughly, no more than 21,000 identical lines, at a time ! I’d be glad to receive your feedback in order to confirm or invalidate this fact :-))
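
    When a block of identical lines is too big for the regex engine, one possible workaround, outside Notepad++, is a plain line-by-line scan of the sorted file. The Python sketch below is a different technique from the regexes A to F ( it relies on itertools.groupby ), although its six modes are meant to mirror them. The name filter_sorted is hypothetical.

```python
from itertools import groupby


def filter_sorted(lines, mode, sep="#"):
    """Filter a sorted list of 'data#number' lines, in the spirit of regexes A to F."""
    kept = []
    # Consecutive lines sharing the same text before the last separator form a block
    for _, group in groupby(lines, key=lambda line: line.rpartition(sep)[0]):
        block = list(group)
        duplicated = len(block) > 1
        if mode == "A":      # isolated lines only
            kept += [] if duplicated else block
        elif mode == "B":    # isolated lines + first line of each block
            kept += block[:1]
        elif mode == "C":    # isolated lines + last line of each block
            kept += block[-1:]
        elif mode == "D":    # every line of each block of duplicates
            kept += block if duplicated else []
        elif mode == "E":    # first line of each block of duplicates
            kept += block[:1] if duplicated else []
        elif mode == "F":    # last line of each block of duplicates
            kept += block[-1:] if duplicated else []
        else:
            raise ValueError(f"unknown mode: {mode}")
    return kept


big = ["a#9999999999"] * 2 + ["abcdefghij#9999999999"] * 30000 + ["z#9999999999"] * 2
print(filter_sorted(big, "C"))
```

    As the scan only compares each line with its neighbours, the size of a block should not matter.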


    Of course, I came to this temporary conclusion, after testing my 6 regexes, from A to F, against real text. I decided to take all contents of a novel, on the Gutenberg site. And…, as I’m French, my choice was, naturally, the novel “The Count of Monte Cristo” by Alexandre Dumas, that you may download from the link below:

    http://www.gutenberg.org/files/1184/1184-0.txt ( Choose the link “Raw text UTF-8” )

    When I first tried to build a suitable sorted working file, in order to test my regexes, unfortunately, all failed :-(( But I also noticed, in that sorted file, that there were numerous lines the#...... Indeed, if you download the novel, just count the occurrences of the regex \bthe\b => 28628 occurrences of the article “the”. So I deleted all these consecutive occurrences of the word “the”. This time all my regexes worked as expected :-))

    However, note that, during my tests, I found out that my regexes D to F were, initially, erroneous. So I changed them, and I already updated my previous post with the correct regexes !


    With the help of that page, below, on the most common words in English :

    https://en.wikipedia.org/wiki/Most_common_words_in_English

    I verified, with the regex \bWord\b, that, in this novel, the 10 most common words used, in the initial text, are :

    the          28,628  ( ABSENT in the SORTED file )
    to           12,897
    of           12,916
    and          12,570
    a             9,473
    I             8,393
    you           8,288
    he            6,945
    in            6,625
    his           5,909
    

    So, we are sure that the 6 regexes can, at least, manage files containing up to 13,000 consecutive identical lines !


    Now, if anyone is interested in the different steps that I used to build a decent working file, for testing regexes A to F, just have a glance at the table, below :

    •------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------• 
    |            SEARCH                  |    REPLACE    |                                       EXPLANATIONS                                                  | Occurrences |
    •------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------•
    |                                    |               | We delete, manually, from BEGINNING of file to the END of the CONTENTS part                         |             |
    |                                    |               |                                                                                                     |             |
    |                                    |               | We delete, manually, from AFTER the FOOTNOTES part till the VERY END of file                        |             |
    |                                    |               |                                                                                                     |             |
    | ,(?=\d)                            |    EMPTY      | We delete any COMMA separator in NUMBERS                                                            |       264   |
    |                                    |               |                                                                                                     |             |
    | [,;.]                              | \x20          | We change any punctuation END of a (part of) SENTENCE with a SPACE character                        |    72,423   |
    |                                    |               |                                                                                                     |             |
    | (?i)o’(?=clock)                    | of\x20the\x20 | We replace the "o’" CONTRACTIVE form with the COMPLETE form "of the "                               |       164   |
    |                                    |               |                                                                                                     |             |
    | (?i)’s|(?<!\w)’|’(?!\w)            |    EMPTY      | We delete the "’s" string and any "’" sign NOT SURROUNDED with WORD chars                           |     2,754   |
    |                                    |               |                                                                                                     |             |
    | (?i)(d|l)’                         | \1e\x20       | We change the "d’" and "l’" French CONTRACTIVE forms to, RESPECTIVELY, "de " and "le "              |       311   |
    |                                    |               |                                                                                                     |             |
    | —|-                                | \x20          | We change any HYPHEN-MINUS character as well as the EM DASH char, with a SPACE character            |     4,933   |
    |                                    |               |                                                                                                     |             |
    | [^\w’\r\n ]                        | \x20          | We ONLY keep WORD, SPACE, and EOL characters and the ’ sign( PRESENT in English CONTRACTIVE forms ) |    38,795   |
    |                                    |               |                                                                                                     |             |
    | (?-i)(?<=\s)(?=\w)[^aAIVX\d](?=\s) | \x20          | As ONE-char STRING, we ONLY keep article "A", "a", pronoun "I", DIGITS and ROMAN letters "V" , "X"  |     1,151   |
    |                                    |               |                                                                                                     |             |
    | ^\h*\R|^\h+|\h+$|\h+(?=\h)         |    EMPTY      | We delete PURE BLANK lines, TRIM spaces at START and END, and REDUCE to a ONE SPACE gap             |   107,108   |
    |                                    |               |                                                                                                     |             |
    | \x20          ( > 1 mn ! )         | \r\n          | Finally, we change any SINGLE SPACE character with a LINE BREAK                                     |   419,769   |
    |                                    |               |                                                                                                     |             |
    | COLUMN editor, with LEADING zeros  |               | At LINE 1, COLUMN 1                                                                                 |             |
    |                                    |               |                                                                                                     |             |
    | (?-s)^(\d{6})(.+)                  | \2#\1         | We SWAP each WORD and its REFERENCE number                                                          |   464,233   |
    |                                    |               |                                                                                                     |             |
    | (?i)^the#                          |               | We BOOKMARK all the LINES, containing the article "the", whatever its CASE                          |    28,529   |
    |                                    |               |                                                                                                     |             |
    | Bookmark > Cut Bookmarked Lines    |               | We BACKUP all these lines in an OTHER file, for FURTHER processing                                  |             |
    |                                    |               |                                                                                                     |             |
    | Sort lines Lexico... ASCENDING     |               | => A work SORTED file, encoded UTF-8 with BOM, of 5,861,424 BYTES, with 435,704 WORDS, ONE per line |             |
    |                                    |               |                                                                                                     |             |
    •------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------•
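
    The regex steps of this table ( from the comma deletion to the final split ) can be approximated in Python, as a rough sketch only. Assumptions : \h is written [ \t], \R is written \r?\n, the text uses plain \n line-breaks ( so the last step inserts \n rather than \r\n ), and Python’s \w is assumed close enough to the Boost one for this text. The names STEPS and one_word_per_line are hypothetical, and the manual deletions, the numbering and the sort are not covered.

```python
import re

# (search, replacement) pairs, in the order of the table
STEPS = [
    (r",(?=\d)", ""),                      # comma separators in numbers
    (r"[,;.]", " "),                       # end-of-sentence punctuation
    (r"(?i)o’(?=clock)", "of the "),       # o’clock -> of the clock
    (r"(?i)’s|(?<!\w)’|’(?!\w)", ""),      # ’s and ’ signs not surrounded with word chars
    (r"(?i)(d|l)’", r"\1e "),              # d’ -> de , l’ -> le
    (r"—|-", " "),                         # em dash and hyphen-minus
    (r"[^\w’\r\n ]", " "),                 # keep word, space, EOL chars and ’
    (r"(?<=\s)(?=\w)[^aAIVX\d](?=\s)", " "),  # one-char strings other than a A I V X or digits
    (r"(?m)^[ \t]*\r?\n|^[ \t]+|[ \t]+$|[ \t]+(?=[ \t])", ""),  # blank lines, trim, one-space gaps
    (r" ", "\n"),                          # one word per line
]


def one_word_per_line(text):
    """Apply every step in order and return the resulting list of words."""
    for pattern, replacement in STEPS:
        text = re.sub(pattern, replacement, text)
    return text.split("\n")


demo = "At ten o’clock, the count’s man-servant arrived.\n\n  He paid 1,500 francs; b was there."
print(one_word_per_line(demo))
```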
    

    Now, applying the regexes A to F against the sorted file obtained, I got, after about 10 s for each, the coherent results, below :

    
    •-------•---------------------------------------•----------•-------------•--------------•
    | Regex |                SEARCH                 |  REPLACE | Occurrences | LINES Number |
    •-------•---------------------------------------•----------•-------------•--------------•
    |       |    Work SORTED file, obtained, AFTER all the steps above :     |    435,704   |
    •-------•---------------------------------------•----------•-------------•--------------•
    |       |                                       |          |             |              |
    |   A   |  (?-s)^(.+#).*\R(?:\1.*\R)+           |  EMPTY   |    10,818   |      6,861   |
    |       |                                       |          |             |              |
    |   B   |  (?-s)^((.+#).*\R)(?:\2.*\R)+         |  \1      |    10,818   |     17,679   |
    |       |                                       |          |             |              |
    |   C   |  (?-s)^(.+#).*\R(\1.*\R)+             |  \2      |    10,818   |     17,679   |
    |       |                                       |          |             |              |
    |   D   |  (?-s)^(.+#).*\R(?:\1.*\R)+|.+\R      |  ?1$0    |    17,679   |    428,843   |
    |       |                                       |          |             |              |
    |   E   |  (?-s)^((.+#).*\R)(?:\2.*\R)+|.+\R    |  \1      |    17,679   |     10,818   |
    |       |                                       |          |             |              |
    |   F   |  (?-s)^(.+#).*\R(\1.*\R)+|.+\R        |  \2      |    17,679   |     10,818   |
    |       |                                       |          |             |              |
    •-------•---------------------------------------•----------•-------------•--------------•
    

    It’s easy to verify that :

    • 6,861 lines, after regex A + 428,843 lines, after regex D = 435,704 ( Total of the file )

    • 6,861 lines, after regex A, + 10,818 lines, after regex E = 17,679 lines, after regex B

    • 6,861 lines, after regex A, + 10,818 lines, after regex F = 17,679 lines, after regex C
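
    These three sums, as well as the size of the sorted file ( numbered lines minus the bookmarked lines the# ), can be checked in a few lines of Python. The figures are the ones reported in the tables, above; the variable names are only illustrative.

```python
total_lines = 435_704        # lines of the sorted working file
numbered_lines = 464_233     # occurrences of the swap regex, before removing "the"
the_lines = 28_529           # lines bookmarked with (?i)^the#

# Number of lines remaining after each regex
after = {"A": 6_861, "B": 17_679, "C": 17_679, "D": 428_843, "E": 10_818, "F": 10_818}

assert numbered_lines - the_lines == total_lines
assert after["A"] + after["D"] == total_lines
assert after["A"] + after["E"] == after["B"]
assert after["A"] + after["F"] == after["C"]
print("All the counts are consistent")
```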

    On the other hand :

    • The 10,818 occurrences of regexes A, B and C correspond to all the first/last duplicate lines, as after regexes E or F

    • The 17,679 occurrences of regexes D, E and F correspond to all first/last duplicate lines AND all the unique lines, too, as after regexes B or C

    Note also that :

    • With the 3 regexes A, B and C, the unique lines, which must be kept, are, simply, not processed by the regexes

    • With the 3 regexes D, E and F, the unique lines, which must be deleted, are matched by the second alternative .+\R of the regexes


    So, guys, as I said, above, I’m looking for the results of your own tests, relative to the biggest block of consecutive identical lines, correctly handled by the six regexes A to F, above, and the test file, below :

    a#9999999999
    a#9999999999
    abcdefghij#9999999999         )
    .....................         )
    .....................         )   HOW MANY lines ? ( THANKS for testing !!)
    .....................         )
    abcdefghij#9999999999         )
    z#9999999999
    z#9999999999
    

    Best Regards,

    guy038



  • @guy038
    off topic regarding garden work:
    if your garden is as detailed and thorough as everything else you do, i’d gladly invite you to help me out in mine … the amount of daily magnolia leafs to collect is currently killing me this year and i’ve not been able to control my rakes and brooms with an adequate, repeatable regex ;-)



  • (?-s)^(.+#).*\R(\1.*\R)+ doesn’t work for my case,
    it doesn’t find any occurrences



  • @guy038 said:

    I’m looking for the results of your own tests, relative to the biggest block of consecutive identical lines, correctly handled by the six regexes A to F, above, and the test file, below

    So it might be worth pointing out a good method for creating an arbitrary (i.e., large!) number of the abcdefghij#9999999999 lines in your request.

    Here’s what I would do:

    • put caret on that line in a tab created for the purpose of testing this
    • start macro recording
    • press ctrl+d (to execute the Duplicate Current Line function)
    • stop macro recording
    • go to the Macro menu and choose Run a Macro Multiple Times…
    • fill in the prompt box entries and press Run (to create the desired number of lines)

    To see how many lines of this type you’ve currently got, simply do a literal Count search for abcdefghij#9999999999.



  • @guy038 said:

    So, guys, if you don’t mind, I would like you to test the regex C, with the first test file, above, in order to verify if it is “laptop-dependent”. I mean, maybe, the results are not relevant with my weak Windows XP configuration !?

    I did this and obtained exactly the same results as you did, @guy038. Specifically, OK with 21524 identical lines, and NOT OK with 21525 identical lines. I tried both the shorter and longer versions of those “middle” lines in the file. All this using Notepad++ 7.2.2, 32-bit. I doubt that any other (reasonable) version of Notepad++ will show different results.



  • @guy038,

    Do you have more to say on this topic? I’m interested…

