    delete both duplicates regexp macro?

    • patrickdrdP
      patrickdrd
      last edited by

      thanks a lot for your effort, but too much fuss, isn’t it?

      vlookup in excel is easier to do I think

      • Scott SumnerS
        Scott Sumner @patrickdrd
        last edited by

        @patrickdrd said:

        thanks a lot for your effort, but too much fuss, isn’t it?

        NOTHING is too much fuss for @guy038 ! :-D

        • Scott SumnerS
          Scott Sumner
          last edited by

          @guy038 said:

          the regex engine ends up matching, wrongly, all file contents

          As mentioned in this thread, this is in all likelihood caused by this problem.

          • patrickdrdP
            patrickdrd
            last edited by

            (?-s)^(.+)\R(?s)(?=.*\R\1\R?)

            doesn’t match the whole line,
            e.g. it tells me that adobe.com exists, but I only have lines that end in adobe.com, e.g. get.adobe.com

            • guy038G
              guy038
              last edited by guy038

              Hi, All,

              Sorry for the delay, but I was busy with some garden work (hedge trimming !) and, of course, I also tested the 6 regexes, A to F, from my previous post !

              I used the following test file :

              a#9999999999
              a#9999999999
              abcdefghij#9999999999
              .........................
              .........................
              ..21524 IDENTICAL lines ( in totality ! )
              .........................  
              .........................  
              abcdefghij#9999999999
              z#9999999999
              z#9999999999
              

              As you can see :

              • It begins with the 2 identical lines a#9999999999

              • These are followed by 21524 identical lines abcdefghij#9999999999

              • And it finishes with the 2 identical lines z#9999999999, followed by a final line-break


              So, I ran the regex C of my previous post, ( (?-s)^(.+#).*\R(\1.*\R)+ ), against this test file

              => It correctly matched the 2 lines, at beginning of file, then the 21524 identical lines ( => a selection of 495,103 characters ) and, the 2 lines at the end of the file

              Then, I simply added ONE additional line abcdefghij#9999999999 to that file and ran the regex again. This time, it matched the 2 lines, at beginning of file, but wrongly grabbed all remaining text ( So the 21525 lines AND the 2 last lines ) !?

              To check whether the results depended on the size of the selection, I changed the test file to use lines of 140 chars, as below :

              a#9999999999
              a#9999999999
              abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz#9999999999999999999999999999999999999999999999999999999999999999999999999999999999999
              .........................
              .........................
              ..21524 IDENTICAL lines ( in totality ! )
              .........................  
              .........................  
              abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz#9999999999999999999999999999999999999999999999999999999999999999999999999999999999999
              z#9999999999
              z#9999999999
              

              I was very surprised to see that results were exactly the same ( OK for 21524 identical lines and KO for 21525 identical lines !!?? ) And yet, this time, the selection contained 3,013,360 chars !

              Of course, I did this test with all the other regexes. For example, with regex A, the limit is a bit higher : 25120 lines. But again, after adding one more line, the regex A failed :-((

              So, guys, if you don’t mind, I would like you to test regex C with the first test file, above, in order to verify whether the result is “laptop-dependent”. I mean, maybe the results are not relevant because of my weak Windows XP configuration !?

              In the meantime, it seems we can conclude that, in a previously sorted file, a regular expression can handle, roughly, no more than 21,000 identical lines at a time ! I’d be glad to receive your feedback in order to confirm or invalidate this finding :-))


              Of course, I came to this temporary conclusion after testing my 6 regexes, A to F, against real text. I decided to take the full contents of a novel from the Gutenberg site. And…, as I’m French, my choice was, naturally, the novel “The Count of Monte Cristo” by Alexandre Dumas, which you may download from the link below:

              http://www.gutenberg.org/files/1184/1184-0.txt ( Choose the link “Raw text UTF-8” )

              When I first tried to build a suitable sorted working file, in order to test my regexes, unfortunately, all failed :-(( But I also noticed, in that sorted file, that there were numerous lines the#...... Indeed, if you download the novel, just count the occurrences of the regex \bthe\b => 28628 occurrences of the article “the”. So I deleted all these consecutive occurrences of the word “the”. This time all my regexes worked as expected :-))

              However, note that, during my tests, I found out that my regexes D to F were, initially, erroneous. So I changed them, and I already updated my previous post with the correct regexes !


              With the help of that page, below, on the most common words in English :

              https://en.wikipedia.org/wiki/Most_common_words_in_English

              I verified, with the regex \bWord\b, that, in this novel, the 10 most common words used, in the initial text, are :

              the          28,628  ( ABSENT in the SORTED file )
              to           12,897
              of           12,916
              and          12,570
              a             9,473
              I             8,393
              you           8,288
              he            6,945
              in            6,625
              his           5,909
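
The whole-word counts above come from the \bWord\b search. A minimal Python sketch of the same kind of count follows, for anyone who wants to cross-check outside Notepad++. The helper name `count_whole_word` and the sample sentence are hypothetical, and whether the original counts were case-sensitive is not stated, so the sketch exposes that as a parameter.

```python
import re

def count_whole_word(word: str, text: str, ignore_case: bool = False) -> int:
    """Count occurrences of `word` as a whole word, like the \\bWord\\b search."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape keeps the word literal; \b anchors it at word boundaries.
    return len(re.findall(rf"\b{re.escape(word)}\b", text, flags))

# Hypothetical sample: "theatre", "then", "other" and "bathed" must not be counted.
sample = "The count saw the theatre; then the other man bathed."

case_sensitive_hits = count_whole_word("the", sample)
any_case_hits = count_whole_word("the", sample, ignore_case=True)
```

To run it on the novel, read the downloaded file into a string and pass that string as `text`.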
              

              So, we are sure that the 6 regexes can, at least, manage files containing up to 13,000 consecutive identical lines !


              Now, if anyone is interested in the different steps that I used to build a decent working file for testing regexes A to F, just have a glance at the table below :

              •------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------• 
              |            SEARCH                  |    REPLACE    |                                       EXPLANATIONS                                                  | Occurrences |
              •------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------•
              |                                    |               | We delete, manually, from BEGINNING of file to the END of the CONTENTS part                         |             |
              |                                    |               |                                                                                                     |             |
              |                                    |               | We delete, manually, from AFTER the FOOTNOTES part till the VERY END of file                        |             |
              |                                    |               |                                                                                                     |             |
              | ,(?=\d)                            |    EMPTY      | We delete any COMMA separator in NUMBERS                                                            |       264   |
              |                                    |               |                                                                                                     |             |
              | [,;.]                              | \x20          | We change any punctuation END of a (part of) SENTENCE with a SPACE character                        |    72,423   |
              |                                    |               |                                                                                                     |             |
              | (?i)o’(?=clock)                    | of\x20the\x20 | We replace the "o’" CONTRACTIVE form with the COMPLETE form "of the "                               |       164   |
              |                                    |               |                                                                                                     |             |
              | (?i)’s|(?<!\w)’|’(?!\w)            |    EMPTY      | We delete the "’s" string and any "’" sign NOT SURROUNDED with WORD chars                           |     2,754   |
              |                                    |               |                                                                                                     |             |
              | (?i)(d|l)’                         | \1e\x20       | We change the "d’" and "l’" French CONTRACTIVE forms to, RESPECTIVELY, "de " and "le "              |       311   |
              |                                    |               |                                                                                                     |             |
              | —|-                                | \x20          | We change any HYPHEN-MINUS character as well as the EM DASH char, with a SPACE character            |     4,933   |
              |                                    |               |                                                                                                     |             |
              | [^\w’\r\n ]                        | \x20          | We ONLY keep WORD, SPACE, and EOL characters and the ’ sign( PRESENT in English CONTRACTIVE forms ) |    38,795   |
              |                                    |               |                                                                                                     |             |
              | (?-i)(?<=\s)(?=\w)[^aAIVX\d](?=\s) | \x20          | As ONE-char STRING, we ONLY keep article "A", "a", pronoun "I", DIGITS and ROMAN letters "V" , "X"  |     1,151   |
              |                                    |               |                                                                                                     |             |
              | ^\h*\R|^\h+|\h+$|\h+(?=\h)         |    EMPTY      | We delete PURE BLANK lines, TRIM spaces at START and END, and REDUCE to a ONE SPACE gap             |   107,108   |
              |                                    |               |                                                                                                     |             |
              | \x20          ( > 1 mn ! )         | \r\n          | Finally, we change any SINGLE SPACE character with a LINE BREAK                                     |   419,769   |
              |                                    |               |                                                                                                     |             |
              | COLUMN editor, with LEADING zeros  |               | At LINE 1, COLUMN 1                                                                                 |             |
              |                                    |               |                                                                                                     |             |
              | (?-s)^(\d{6})(.+)                  | \2#\1         | We SWAP each WORD and its REFERENCE number                                                          |   464,233   |
              |                                    |               |                                                                                                     |             |
              | (?i)^the#                          |               | We BOOKMARK all the LINES, containing the article "the", whatever its CASE                          |    28,529   |
              |                                    |               |                                                                                                     |             |
              | Bookmark > Cut Bookmarked Lines    |               | We BACKUP all these lines in an OTHER file, for FURTHER processing                                  |             |
              |                                    |               |                                                                                                     |             |
              | Sort lines Lexico... ASCENDING     |               | => A work SORTED file, encoded UTF-8 with BOM, of 5,861,424 BYTES, with 435,704 WORDS, ONE per line |             |
              |                                    |               |                                                                                                     |             |
              •------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------•
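
For readers who prefer a script, here is a rough Python sketch of a few rows of the table above: the punctuation clean-up, the one-word-per-line split, the 6-digit numbering, the word#reference swap, the removal of the "the" lines and the final sort. It is not a full reproduction: the apostrophe and one-letter-word rows are left out, the function name `build_sorted_word_lines` is hypothetical, and Python's sort order is only assumed to be close to the editor's lexicographic sort.

```python
import re

def build_sorted_word_lines(text: str) -> list[str]:
    """Return sorted 'word#NNNNNN' lines, one per word, without the 'the' lines."""
    text = re.sub(r",(?=\d)", "", text)        # delete comma separators inside numbers
    text = re.sub(r"[,;.]", " ", text)         # sentence punctuation becomes a space
    text = re.sub(r"—|-", " ", text)           # em dash and hyphen-minus become a space
    text = re.sub(r"[^\w’\r\n ]", " ", text)   # keep word chars, spaces, EOLs and the ’ sign
    words = text.split()                       # trims, collapses gaps, one word per item
    # Number every word with leading zeros, then swap to the word#reference layout.
    lines = [f"{word}#{ref:06d}" for ref, word in enumerate(words, start=1)]
    # Drop the article "the", whatever its case (kept in another file in the table above).
    lines = [line for line in lines if not line.lower().startswith("the#")]
    return sorted(lines)

# Hypothetical sample input.
sample_lines = build_sorted_word_lines("The cat, the dog. A cat; 1,000 cats")
```

Joining the returned list with line breaks gives a small file in the same word#reference layout that the regexes A to F expect.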
              

              Now, applying the regexes A to F against the sorted file obtained, I got, after about 10s for each, the consistent results below :

              
              •-------•---------------------------------------•----------•-------------•--------------•
              | Regex |                SEARCH                 |  REPLACE | Occurrences | LINES Number |
              •-------•---------------------------------------•----------•-------------•--------------•
              |       |    Work SORTED file, obtained, AFTER all the steps above :     |    435,704   |
              •-------•---------------------------------------•----------•-------------•--------------•
              |       |                                       |          |             |              |
              |   A   |  (?-s)^(.+#).*\R(?:\1.*\R)+           |  EMPTY   |    10,818   |      6,861   |
              |       |                                       |          |             |              |
              |   B   |  (?-s)^((.+#).*\R)(?:\2.*\R)+         |  \1      |    10,818   |     17,679   |
              |       |                                       |          |             |              |
              |   C   |  (?-s)^(.+#).*\R(\1.*\R)+             |  \2      |    10,818   |     17,679   |
              |       |                                       |          |             |              |
              |   D   |  (?-s)^(.+#).*\R(?:\1.*\R)+|.+\R      |  ?1$0    |    17,679   |    428,843   |
              |       |                                       |          |             |              |
              |   E   |  (?-s)^((.+#).*\R)(?:\2.*\R)+|.+\R    |  \1      |    17,679   |     10,818   |
              |       |                                       |          |             |              |
              |   F   |  (?-s)^(.+#).*\R(\1.*\R)+|.+\R        |  \2      |    17,679   |     10,818   |
              |       |                                       |          |             |              |
              •-------•---------------------------------------•----------•-------------•--------------•
              

              It’s easy to verify that :

              • 6,861 lines, after regex A + 428,843 lines, after regex D = 435,704 ( Total of the file )

              • 6,861 lines, after regex A, + 10,818 lines, after regex E = 17,679 lines, after regex B

              • 6,861 lines, after regex A, + 10,818 lines, after regex F = 17,679 lines, after regex C

              On the other hand :

              • The 10,818 occurrences of regexes A, B and C correspond to all the first/last duplicate lines, as after regexes E or F

              • The 17,679 occurrences of regexes D, E and F correspond to all first/last duplicate lines AND all the unique lines, too, as after regexes B or C

              Note also that :

              • With the 3 regexes A, B and C, the unique lines, which must be kept, are simply not processed by the regexes

              • With the 3 regexes D, E and F, the unique lines, which must be deleted, are matched by the second alternative .+\R of the regexes
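
The behaviour of the six regexes can be tried on a tiny sample outside the editor. The sketch below is only an approximation under stated assumptions: Python's re module has no \R, so \n stands in for it, the `?1$0` replacement of regex D is emulated with a small function, and the sample data is hypothetical. Python uses a different regex engine than Notepad++, so this shows what each regex keeps, not the line limit discussed above.

```python
import re

# Hypothetical sorted sample in the word#reference layout; "b" is the only unique key.
text = (
    "a#000001\n"
    "a#000007\n"
    "b#000003\n"
    "c#000002\n"
    "c#000004\n"
    "c#000009\n"
)

M = re.MULTILINE  # lets ^ match at the start of every line
A = re.compile(r"^(.+#).*\n(?:\1.*\n)+", M)
B = re.compile(r"^((.+#).*\n)(?:\2.*\n)+", M)
C = re.compile(r"^(.+#).*\n(\1.*\n)+", M)
D = re.compile(r"^(.+#).*\n(?:\1.*\n)+|.+\n", M)
E = re.compile(r"^((.+#).*\n)(?:\2.*\n)+|.+\n", M)
F = re.compile(r"^(.+#).*\n(\1.*\n)+|.+\n", M)

after_a = A.sub("", text)       # duplicate blocks deleted, unique lines kept
after_b = B.sub(r"\1", text)    # first line of each block + unique lines
after_c = C.sub(r"\2", text)    # last line of each block + unique lines
# D: keep the whole match only when group 1 took part (a duplicate block).
after_d = D.sub(lambda m: m.group(0) if m.group(1) else "", text)
after_e = E.sub(lambda m: m.group(1) or "", text)  # first line of each block only
after_f = F.sub(lambda m: m.group(2) or "", text)  # last line of each block only

def n_lines(s: str) -> int:
    """Number of lines in a text where every line ends with a line break."""
    return s.count("\n")
```

On this sample the same relations hold as in the table: the line counts after A and after D add up to the total, and the counts after A and after E add up to the count after B.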


              So, guys, as I said, above, I’m looking for the results of your own tests, relative to the biggest block of consecutive identical lines, correctly handled by the six regexes A to F, above, and the test file, below :

              a#9999999999
              a#9999999999
              abcdefghij#9999999999         )
              .....................         )
              .....................         )   HOW MANY lines ? ( THANKS for testing !!)
              .....................         )
              abcdefghij#9999999999         )
              z#9999999999
              z#9999999999
              

              Best Regards,

              guy038

              • Meta ChuhM
                Meta Chuh moderator @guy038
                last edited by

                @guy038
                off topic regarding garden work:
                if your garden is as detailed and thorough as everything else you do, i’d gladly invite you to help me out in mine … the amount of daily magnolia leaves to collect is currently killing me this year and i’ve not been able to control my rakes and brooms with an adequate, repeatable regex ;-)

                • patrickdrdP
                  patrickdrd
                  last edited by

                  (?-s)^(.+#).*\R(\1.*\R)+ doesn’t work for my case,
                  it doesn’t find any occurrences

                  • Scott SumnerS
                    Scott Sumner
                    last edited by

                    @guy038 said:

                    I’m looking for the results of your own tests, relative to the biggest block of consecutive identical lines, correctly handled by the six regexes A to F, above, and the test file, below

                    So it might be worth pointing out a good method for creating an arbitrary (i.e., large!) number of the abcdefghij#9999999999 lines in your request.

                    Here’s what I would do:

                    • put caret on that line in a tab created for the purpose of testing this
                    • start macro recording
                    • press ctrl+d (to execute the Duplicate Current Line function)
                    • stop macro recording
                    • go to the Macro menu and choose Run a Macro Multiple Times…
                    • fill in the prompt box entries and press Run (to create the desired number of lines)

                    To see how many lines of this type you’ve currently got, simply do a literal Count search for abcdefghij#9999999999.
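
As an alternative to the macro, the test file could also be generated by a short script. This is a minimal sketch, assuming Windows line breaks and the layout shown in the request; the function names and the output file name in the usage note are hypothetical.

```python
def make_test_text(n_identical: int, eol: str = "\r\n") -> str:
    """Build the test text: 2 'a' lines, n_identical middle lines, 2 'z' lines."""
    lines = (
        ["a#9999999999"] * 2
        + ["abcdefghij#9999999999"] * n_identical
        + ["z#9999999999"] * 2
    )
    # Every line, including the last one, ends with a line break.
    return eol.join(lines) + eol

def write_test_file(path: str, n_identical: int) -> None:
    """Write the test text to `path` without translating the line breaks."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        handle.write(make_test_text(n_identical))

# In-memory check of the size that was reported to work.
ok_text = make_test_text(21524)
```

Calling `write_test_file("regex_limit_test.txt", 21525)` would then produce the variant with one more identical line, ready to open in Notepad++.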

                    • Scott SumnerS
                      Scott Sumner @guy038
                      last edited by

                      @guy038 said:

                      So, guys, if you don’t mind, I would like you to test regex C with the first test file, above, in order to verify whether the result is “laptop-dependent”. I mean, maybe the results are not relevant because of my weak Windows XP configuration !?

                      I did this and obtained exactly the same results as you did, @guy038. Specifically, OK with 21524 identical lines, and NOT OK with 21525 identical lines. I tried both the shorter and longer versions of those “middle” lines in the file. All this using Notepad++ 7.2.2, 32-bit. I doubt that any other (reasonable) version of Notepad++ will show different results.

                      • Scott SumnerS
                        Scott Sumner
                        last edited by

                        @guy038,

                        Do you have more to say on this topic? I’m interested…
