
    delete both duplicates regexp macro?

    • patrickdrdP Offline
      patrickdrd
      last edited by

      thanks a lot for your effort, but too much fuss, isn’t it?

      vlookup in excel is easier to do I think

      • Scott SumnerS Offline
        Scott Sumner @patrickdrd
        last edited by

        @patrickdrd said:

        thanks a lot for your effort, but too much fuss, isn’t it?

        NOTHING is too much fuss for @guy038 ! :-D

        • Scott SumnerS Offline
          Scott Sumner
          last edited by

          @guy038 said:

          the regex engine ends up , matching, wrongly, all file contents

          As mentioned in this thread, this is in all likelihood caused by this problem.

          • patrickdrdP Offline
            patrickdrd
            last edited by

            (?-s)^(.+)\R(?s)(?=.*\R\1\R?)

            doesn’t match the whole line,
            e.g. it tells me that adobe.com exists, but I only have lines that end in adobe.com, e.g. get.adobe.com

            • guy038G Online
              guy038
              last edited by guy038

              Hi, All,

              Sorry for the delay, but I was busy with some garden work (hedge trimming !) and, of course, I also tested the 6 regexes, from A to F, of my previous post !

              I used the following test file :

              a#9999999999
              a#9999999999
              abcdefghij#9999999999
              .........................
              .........................
              ..21524 IDENTICAL lines ( in totality ! )
              .........................  
              .........................  
              abcdefghij#9999999999
              z#9999999999
              z#9999999999
              

              As you can see :

              • It begins with the 2 identical lines a#9999999999

              • Then come 21524 identical lines abcdefghij#9999999999

              • And it finishes with the 2 identical lines z#9999999999, followed by a final line-break


              So, I ran the regex C of my previous post, ( (?-s)^(.+#).*\R(\1.*\R)+ ), against this test file

              => It correctly matched the 2 lines, at beginning of file, then the 21524 identical lines ( => a selection of 495,103 characters ) and, the 2 lines at the end of the file

              Then, I simply added ONE additional line abcdefghij#9999999999 to that file and ran the regex again. This time, it matched the 2 lines, at beginning of file, but wrongly grabbed all remaining text ( So the 21525 lines AND the 2 last lines ) !?
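To make the intended behaviour of regex C concrete, here is a minimal Python sketch. It is a translation under assumptions, not the Notepad++ search itself: Python's `re` module has neither `\R` nor a global `(?-s)`, so `\n` stands in for the line break and the default mode (dot does not match a newline) replaces `(?-s)`. It also runs on a different regex engine, so it says nothing about the 21524-line limit; it only shows which blocks the pattern is meant to match. The helper name `build_test_text` is made up for illustration.

```python
import re

# Translation of regex C, (?-s)^(.+#).*\R(\1.*\R)+ , for Python's re module:
# "\n" replaces \R, and re.MULTILINE lets ^ match at the start of each line.
REGEX_C = re.compile(r"^(.+#).*\n(\1.*\n)+", re.MULTILINE)

def build_test_text(middle_count):
    """Return the test text: 2 'a' lines, N identical lines, 2 'z' lines."""
    lines = ["a#9999999999"] * 2
    lines += ["abcdefghij#9999999999"] * middle_count
    lines += ["z#9999999999"] * 2
    return "\n".join(lines) + "\n"

text = build_test_text(5)
blocks = [m.group(0) for m in REGEX_C.finditer(text)]
for block in blocks:
    # Each match is one block of consecutive lines sharing the same "word#" prefix.
    print(block.count("\n"), "lines starting with", block.split("#")[0])
```

With a correct match, each of the three blocks (the `a` lines, the middle lines, the `z` lines) is reported separately, which is the behaviour that breaks in Notepad++ once the middle block gets too long.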

              To check whether the results depended on the size of the selection, I changed the test file to use lines of 140 chars, as below :

              a#9999999999
              a#9999999999
              abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz#9999999999999999999999999999999999999999999999999999999999999999999999999999999999999
              .........................
              .........................
              ..21524 IDENTICAL lines ( in totality ! )
              .........................  
              .........................  
              abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz#9999999999999999999999999999999999999999999999999999999999999999999999999999999999999
              z#9999999999
              z#9999999999
              

              I was very surprised to see that results were exactly the same ( OK for 21524 identical lines and KO for 21525 identical lines !!?? ) And yet, this time, the selection contained 3,013,360 chars !

              Of course, I did this test with all the other regexes. For example, with regex A, the limit is a bit higher : 25120 lines. But again, after adding one more line, the regex A failed :-((

              So, guys, if you don’t mind, I would like you to test the regex C, with the first test file, above, in order to verify whether it is “laptop-dependent”. I mean, maybe the results are not relevant with my weak Windows XP configuration !?

              In the meantime, it seems we can conclude that, in a previously sorted file, a regular expression can handle, roughly, no more than 21,000 identical lines at a time ! I’d be glad to receive your feedback in order to confirm or invalidate this fact :-))


              Of course, I came to this temporary conclusion after testing my 6 regexes, from A to F, against real text. I decided to take the whole contents of a novel from the Gutenberg site. And…, as I’m French, my choice was, naturally, the novel “The Count of Monte Cristo” by Alexandre Dumas, that you may download from the link below:

              http://www.gutenberg.org/files/1184/1184-0.txt ( Choose the link “Raw text UTF-8” )

              When I first tried to build a suitable sorted working file, in order to test my regexes, unfortunately, all failed :-(( But I also noticed, in that sorted file, that there were numerous lines the#...... Indeed, if you download the novel, just count the occurrences of the regex \bthe\b => 28628 occurrences of the article “the”. So I deleted all these consecutive occurrences of the word “the”. This time all my regexes worked as expected :-))

              However, note that, during my tests, I found out that my regexes D to F were, initially, erroneous. So I changed them, and I already updated my previous post with the correct regexes !


              With the help of that page, below, on the most common words in English :

              https://en.wikipedia.org/wiki/Most_common_words_in_English

              I verified, with the regex \bWord\b, that, in this novel, the 10 most common words used, in the initial text, are :

              the          28,628  ( ABSENT in the SORTED file )
              to           12,897
              of           12,916
              and          12,570
              a             9,473
              I             8,393
              you           8,288
              he            6,945
              in            6,625
              his           5,909
              

              So, we are sure that the 6 regexes can, at least, manage files containing up to 13,000 consecutive identical lines !
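For anyone who wants to reproduce such whole-word counts outside Notepad++, a small Python sketch follows. It is only an illustration: the function name `count_whole_word` and the sample sentence are made up, and the search here is case-sensitive, which may or may not match the settings used for the counts above.

```python
import re

def count_whole_word(text, word):
    """Count case-sensitive whole-word occurrences of `word` in `text`."""
    # re.escape guards against words containing regex metacharacters.
    return len(re.findall(r"\b" + re.escape(word) + r"\b", text))

# Hypothetical sample text; for a real count, read the novel's text file instead.
sample = "The count and the abbé; then the other theme, to the end."
print(count_whole_word(sample, "the"))
```

Words such as "then", "other" and "theme" are not counted, because `\b` requires a word boundary on both sides of the match.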


              Now, if anyone is interested in the different steps that I used to build a decent working file, for testing regexes A to F, just have a glance at the table, below :

              •------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------• 
              |            SEARCH                  |    REPLACE    |                                       EXPLANATIONS                                                  | Occurrences |
              •------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------•
              |                                    |               | We delete, manually, from BEGINNING of file to the END of the CONTENTS part                         |             |
              |                                    |               |                                                                                                     |             |
              |                                    |               | We delete, manually, from AFTER the FOOTNOTES part till the VERY END of file                        |             |
              |                                    |               |                                                                                                     |             |
              | ,(?=\d)                            |    EMPTY      | We delete any COMMA separator in NUMBERS                                                            |       264   |
              |                                    |               |                                                                                                     |             |
              | [,;.]                              | \x20          | We change any punctuation END of a (part of) SENTENCE with a SPACE character                        |    72,423   |
              |                                    |               |                                                                                                     |             |
              | (?i)o’(?=clock)                    | of\x20the\x20 | We replace the "o’" CONTRACTIVE form with the COMPLETE form "of the "                               |       164   |
              |                                    |               |                                                                                                     |             |
              | (?i)’s|(?<!\w)’|’(?!\w)            |    EMPTY      | We delete the "’s" string and any "’" sign NOT SURROUNDED with WORD chars                           |     2,754   |
              |                                    |               |                                                                                                     |             |
              | (?i)(d|l)’                         | \1e\x20       | We change the "d’" and "l’" French CONTRACTIVE forms to, RESPECTIVELY, "de " and "le "              |       311   |
              |                                    |               |                                                                                                     |             |
              | —|-                                | \x20          | We change any HYPHEN-MINUS character as well as the EM DASH char, with a SPACE character            |     4,933   |
              |                                    |               |                                                                                                     |             |
              | [^\w’\r\n ]                        | \x20          | We ONLY keep WORD, SPACE, and EOL characters and the ’ sign( PRESENT in English CONTRACTIVE forms ) |    38,795   |
              |                                    |               |                                                                                                     |             |
              | (?-i)(?<=\s)(?=\w)[^aAIVX\d](?=\s) | \x20          | As ONE-char STRING, we ONLY keep article "A", "a", pronoun "I", DIGITS and ROMAN letters "V" , "X"  |     1,151   |
              |                                    |               |                                                                                                     |             |
              | ^\h*\R|^\h+|\h+$|\h+(?=\h)         |    EMPTY      | We delete PURE BLANK lines, TRIM spaces at START and END, and REDUCE to a ONE SPACE gap             |   107,108   |
              |                                    |               |                                                                                                     |             |
              | \x20          ( > 1 mn ! )         | \r\n          | Finally, we change any SINGLE SPACE character with a LINE BREAK                                     |   419,769   |
              |                                    |               |                                                                                                     |             |
              | COLUMN editor, with LEADING zeros  |               | At LINE 1, COLUMN 1                                                                                 |             |
              |                                    |               |                                                                                                     |             |
              | (?-s)^(\d{6})(.+)                  | \2#\1         | We SWAP each WORD and its REFERENCE number                                                          |   464,233   |
              |                                    |               |                                                                                                     |             |
              | (?i)^the#                          |               | We BOOKMARK all the LINES, containing the article "the", whatever its CASE                          |    28,529   |
              |                                    |               |                                                                                                     |             |
              | Bookmark > Cut Bookmarked Lines    |               | We BACKUP all these lines in an OTHER file, for FURTHER processing                                  |             |
              |                                    |               |                                                                                                     |             |
              | Sort lines Lexico... ASCENDING     |               | => A work SORTED file, encoded UTF-8 with BOM, of 5,861,424 BYTES, with 435,704 WORDS, ONE per line |             |
              |                                    |               |                                                                                                     |             |
              •------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------•
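The last rows of the table (numbering, swap, removal of the "the" lines, sort) can be sketched in Python as below. This is a rough approximation, not the exact Notepad++ procedure: the function name `build_sorted_work_lines` is invented, the numbering is assumed to start at 1, and Python's `sorted` (code-point order) is assumed to be close enough to the lexicographic ascending sort.

```python
import re

def build_sorted_work_lines(words):
    """Turn a list of words (one per original line) into sorted 'word#ref' lines."""
    # Column editor step: prefix each word with a 6-digit, zero-padded number.
    numbered = [f"{i:06d}{w}" for i, w in enumerate(words, start=1)]
    # Swap step, as in the table: (?-s)^(\d{6})(.+)  ->  \2#\1
    swapped = [re.sub(r"^(\d{6})(.+)", r"\2#\1", line) for line in numbered]
    # Bookmark-and-cut step: drop the lines for the article "the", whatever its case.
    kept = [line for line in swapped if not re.match(r"(?i)the#", line)]
    # Final step: ascending sort, so identical words end up on consecutive lines.
    return sorted(kept)

work = build_sorted_work_lines(["the", "count", "of", "The", "count"])
print(work)
```

After sorting, identical words sit on consecutive lines and differ only in the reference number after `#`, which is exactly the layout that regexes A to F rely on.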
              

              Now, applying the regexes A to F against the resulting sorted file, I got, after about 10 s for each, the consistent results below :

              
              •-------•---------------------------------------•----------•-------------•--------------•
              | Regex |                SEARCH                 |  REPLACE | Occurrences | LINES Number |
              •-------•---------------------------------------•----------•-------------•--------------•
              |       |    Work SORTED file, obtained, AFTER all the steps above :     |    435,704   |
              •-------•---------------------------------------•----------•-------------•--------------•
              |       |                                       |          |             |              |
              |   A   |  (?-s)^(.+#).*\R(?:\1.*\R)+           |  EMPTY   |    10,818   |      6,861   |
              |       |                                       |          |             |              |
              |   B   |  (?-s)^((.+#).*\R)(?:\2.*\R)+         |  \1      |    10,818   |     17,679   |
              |       |                                       |          |             |              |
              |   C   |  (?-s)^(.+#).*\R(\1.*\R)+             |  \2      |    10,818   |     17,679   |
              |       |                                       |          |             |              |
              |   D   |  (?-s)^(.+#).*\R(?:\1.*\R)+|.+\R      |  ?1$0    |    17,679   |    428,843   |
              |       |                                       |          |             |              |
              |   E   |  (?-s)^((.+#).*\R)(?:\2.*\R)+|.+\R    |  \1      |    17,679   |     10,818   |
              |       |                                       |          |             |              |
              |   F   |  (?-s)^(.+#).*\R(\1.*\R)+|.+\R        |  \2      |    17,679   |     10,818   |
              |       |                                       |          |             |              |
              •-------•---------------------------------------•----------•-------------•--------------•
              

              It’s easy to verify that :

              • 6,861 lines, after regex A + 428,843 lines, after regex D = 435,704 ( Total of the file )

              • 6,861 lines, after regex A, + 10,818 lines, after regex E = 17,679 lines, after regex B

              • 6,861 lines, after regex A, + 10,818 lines, after regex F = 17,679 lines, after regex C

              On the other hand :

              • The 10,818 occurrences of regexes A, B and C correspond to all the first/last duplicate lines, as after regexes E or F

              • The 17,679 occurrences of regexes D, E and F correspond to all first/last duplicate lines AND all the unique lines, too, as after regexes B or C

              Note also that :

              • With the 3 regexes A, B and C, the unique lines, which must be kept, are simply not processed by the regexes

              • With the 3 regexes D, E and F, the unique lines, which must be deleted, are matched by the second alternative .+\R of the regexes
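As a cross-check of what each regex leaves in the file, the six search/replace pairs can be emulated without any regex, using `itertools.groupby` on a sorted list. This is only a sketch of my reading of the table: the function `apply_regex` and the sample lines are hypothetical, and it assumes exactly one `#` per line.

```python
from itertools import groupby

def apply_regex(lines, name):
    """Emulate the result of regex A..F on sorted 'word#ref' lines."""
    # Consecutive lines with the same word before '#' form one group.
    groups = [list(g) for _, g in groupby(lines, key=lambda s: s.split("#")[0])]
    dups = [g for g in groups if len(g) > 1]
    if name == "A":   # duplicates deleted entirely, unique lines untouched
        return [g[0] for g in groups if len(g) == 1]
    if name == "B":   # first line of each duplicate block kept, plus unique lines
        return [g[0] for g in groups]
    if name == "C":   # last line of each duplicate block kept, plus unique lines
        return [g[-1] for g in groups]
    if name == "D":   # unique lines deleted, every duplicate line kept
        return [line for g in dups for line in g]
    if name == "E":   # unique lines deleted, first line of each block kept
        return [g[0] for g in dups]
    if name == "F":   # unique lines deleted, last line of each block kept
        return [g[-1] for g in dups]
    raise ValueError(name)

lines = ["a#1", "a#4", "b#2", "c#3", "c#5", "c#6"]
result = {name: apply_regex(lines, name) for name in "ABCDEF"}
for name, kept in result.items():
    print(name, kept)
```

On this tiny sample, the same three identities hold as for the novel: lines after A plus lines after D give the file total, and lines after A plus lines after E (or F) give the lines after B (or C).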


              So, guys, as I said, above, I’m looking for the results of your own tests, relative to the biggest block of consecutive identical lines, correctly handled by the six regexes A to F, above, and the test file, below :

              a#9999999999
              a#9999999999
              abcdefghij#9999999999         )
              .....................         )
              .....................         )   HOW MANY lines ? ( THANKS for testing !!)
              .....................         )
              abcdefghij#9999999999         )
              z#9999999999
              z#9999999999
              

              Best Regards,

              guy038

              • Meta ChuhM Offline
                Meta Chuh moderator @guy038
                last edited by

                @guy038
                off topic regarding garden work:
                if your garden is as detailed and thorough as everything else you do, i’d gladly invite you to help me out in mine … the amount of daily magnolia leaves to collect is currently killing me this year and i’ve not been able to control my rakes and brooms with an adequate, repeatable regex ;-)

                • patrickdrdP Offline
                  patrickdrd
                  last edited by

                  (?-s)^(.+#).*\R(\1.*\R)+ doesn’t work for my case,
                  it doesn’t find any occurrences

                  • Scott SumnerS Offline
                    Scott Sumner
                    last edited by

                    @guy038 said:

                    I’m looking for the results of your own tests, relative to the biggest block of consecutive identical lines, correctly handled by the six regexes A to F, above, and the test file, below

                    So it might be worth pointing out a good method for creating an arbitrary (i.e., large!) number of the abcdefghij#9999999999 lines in your request.

                    Here’s what I would do:

                    • put caret on that line in a tab created for the purpose of testing this
                    • start macro recording
                    • press ctrl+d (to execute the Duplicate Current Line function)
                    • stop macro recording
                    • go to the Macro menu and choose Run a Macro Multiple Times…
                    • fill in the prompt box entries and press Run (to create the desired number of lines)

                    To see how many lines of this type you’ve currently got, simply do a literal Count search for abcdefghij#9999999999.
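If a script is more convenient than a macro, the test file could also be generated outside Notepad++. The Python sketch below is one possible way, under assumptions: the function name `write_test_file`, the file name `regex_limit_test.txt` and the Windows (CR+LF) line endings are my own choices, not something stated in this thread.

```python
import tempfile
from pathlib import Path

MIDDLE = "abcdefghij#9999999999"

def write_test_file(path, middle_count, eol="\r\n"):
    """Write 2 'a' lines, `middle_count` identical lines and 2 'z' lines."""
    lines = ["a#9999999999"] * 2 + [MIDDLE] * middle_count + ["z#9999999999"] * 2
    # newline="" keeps the chosen line ending untouched on every platform.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(eol.join(lines) + eol)

# Hypothetical location: a file in the system temp directory.
target = Path(tempfile.gettempdir()) / "regex_limit_test.txt"
write_test_file(target, 21525)

# Equivalent of the literal Count search for the middle line.
content = target.read_bytes().decode("utf-8")
print(content.count(MIDDLE))
```

Changing the second argument of `write_test_file` gives a file with any desired number of middle lines, ready to be opened in Notepad++ for the regex test.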

                    • Scott SumnerS Offline
                      Scott Sumner @guy038
                      last edited by

                      @guy038 said:

                      So, guys, if you don’t mind, I would like you to test the regex C, with the first test file, above, in order to verify whether it is “laptop-dependent”. I mean, maybe the results are not relevant with my weak Windows XP configuration !?

                      I did this and obtained exactly the same results as you did, @guy038. Specifically, OK with 21524 identical lines, and NOT OK with 21525 identical lines. I tried both the shorter and longer versions of those “middle” lines in the file. All this using Notepad++ 7.2.2, 32-bit. I doubt that any other (reasonable) version of Notepad++ will show different results.

                      • Scott SumnerS Offline
                        Scott Sumner
                        last edited by

                        @guy038,

                        Do you have more to say on this topic? I’m interested…

