delete both duplicates regexp macro?

25 Posts 5 Posters 7.9k Views
    patrickdrd
    last edited by Oct 18, 2018, 7:54 PM

    how could I do it?

      patrickdrd
      last edited by Oct 18, 2018, 8:03 PM

      I want to delete both lines if found duplicates,

        Terry R
        last edited by Terry R Oct 18, 2018, 8:07 PM Oct 18, 2018, 8:06 PM

        @patrickdrd
        You will need to clarify a lot more if you expect decent/any answers.

        Read the FAQ, in particular
        https://notepad-plus-plus.org/community/topic/15739/faq-desk-request-for-help-without-sufficient-information-to-help-you

        I see the next post elaborated slightly but it’s still unclear.

        Say for example I have

        12345
        12346
        12347
        2345
        12345
        234
        12345
        

        Do you want ALL 12345 deleted or just the 2nd and 3rd copies.
        OR do you want 1st and 2nd copies deleted and the 3rd copy remains.

        You see it all depends on you providing enough information for us to give you help. You help us to help you.

        Terry

          patrickdrd
          last edited by Oct 18, 2018, 8:08 PM

          yes, thanks, I want all copies deleted

            Terry R
            last edited by Oct 18, 2018, 8:13 PM

            @patrickdrd

            I think you need to read the link I provided. Whilst you have given some additional information some examples might be worthwhile (as stated in that post).

            So far we have ALL lines that are copies of each other to be deleted.

            I can think of a possible solution but 1 question currently is, can the lines be sorted? If so it would make it much easier using a regex (regular expression) to do so. When they are together they can be easily removed.

            As I said read the link I provided and if you are able to, give us some idea of the type of data you need fixing.

            Terry

              patrickdrd
              last edited by Oct 18, 2018, 8:27 PM

              good idea, I’ll try with sorting, thanks

                Terry R
                last edited by Terry R Oct 18, 2018, 9:09 PM Oct 18, 2018, 9:06 PM

                @patrickdrd
                I also found (searching the old posts) courtesy of @Scott_Sumner
                https://notepad-plus-plus.org/community/topic/14835/remove-duplicate-lines-from-unsorted-keeping-first/2
                where the first regex may well be what you need. In that case the request was to keep one occurrence. If you have the Replace With field left empty I think it removes ALL copies. This will work when the data is sorted so the copies will sit together. You will only need to run it once, I’d suggest starting from the first line of the data to be sure it gets ALL the data you want to remove. Also make sure the last line has a CR/LF at the end, in effect the new last line will be empty.

                Terry

                  guy038
                  last edited by guy038 Oct 18, 2018, 11:23 PM Oct 18, 2018, 11:17 PM

                  Hi, @patrickdrd, @terry-r and All,

                  So, if you do not need to keep the initial order of your data, it’s quite easy !

                  Imagine, for instance, the initial text :

                  789
                  123
                  456
                  123
                  xyz
                  789
                  789
                  000
                  789
                  456
                  abc
                  123
                  123
                  456
                  hij
                  456
                  456
                  789
                  123
                  999
                  123
                  

                  Which gives, after lexicographically ascending sort :

                  000
                  123
                  123
                  123
                  123
                  123
                  123
                  456
                  456
                  456
                  456
                  456
                  789
                  789
                  789
                  789
                  789
                  999
                  abc
                  hij
                  xyz
                  
                  • Now, just verify that, AFTER sort, the last line of your data is followed with a pure blank line

                  • Then, use the regex S/R, below, which will delete any duplicated line, leaving isolated lines, only :

                    • SEARCH (?-s)^(.+\R)\1+

                    • REPLACE Leave EMPTY

                  => You’ll get, as expected, the sorted list of the isolated lines :

                  000
                  999
                  abc
                  hij
                  xyz
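For anyone who would rather script this step outside Notepad++, here is a minimal Python sketch of the same sort-then-delete idea. It is only an illustration: the function name is made up, `\R` is replaced by `\n`, and the input is assumed to be a list of lines without line endings.

```python
import re

def drop_all_duplicates_sorted(lines):
    """Sort the lines, then delete every run of two or more identical lines."""
    # Every line gets a trailing "\n", which mirrors the advice to make sure
    # the last line of the data is followed by a blank line.
    text = "".join(line + "\n" for line in sorted(lines))
    # Python counterpart of (?-s)^(.+\R)\1+ : "." does not match "\n" by
    # default, and re.MULTILINE lets "^" match at the start of every line.
    text = re.sub(r"^(.+\n)\1+", "", text, flags=re.MULTILINE)
    return text.splitlines()

sample = ("789 123 456 123 xyz 789 789 000 789 456 abc "
          "123 123 456 hij 456 456 789 123 999 123").split()
print(drop_all_duplicates_sorted(sample))
# ['000', '999', 'abc', 'hij', 'xyz']
```

Python's `sorted` compares strings by code point, which for this sample gives the same order as the lexicographic sort shown above.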
                  

                  Now, if you prefer to keep the initial order of your data list, I surely found correct regexes, in the past, to achieve such a task, but, I didn’t want to bother looking for where ;-))

                  So, thinking about a possible solution and having the results of the previous regex in front of me, everything became clear ! Don’t you see…?

                  • First, copy in a new tab, your original list

                  • Then, add a line with, let’s say, some equal signs =

                  • Finally add the previous results ( that is to say, all the isolated lines, after sort and regex replacement ) !

                  So, we have the following text :

                  789
                  123
                  456
                  123
                  xyz
                  789
                  789
                  000
                  789
                  456
                  abc
                  123
                  123
                  456
                  hij
                  456
                  456
                  789
                  123
                  999
                  123
                  ======
                  000
                  999
                  abc
                  hij
                  xyz
                  
                  • Now, use the regex S/R, below, which will keep the isolated lines, only, in their initial order ;-)

                    • SEARCH (?-s)^(.*\R)(?s)(?!.*^=+.*^\1)

                    • REPLACE Leave EMPTY

                  giving :

                  xyz
                  000
                  abc
                  hij
                  999
                  

                  As you can see, these unique lines are listed, according to their initial location. Nice !

                  Notes :

                  • The regex matches any line which cannot be found, further on, after the line ======

                  • Note that once the regex engine reaches the ====== line, neither that line nor any of the lines after it can be followed, further on, by another ====== line => The negative look-ahead succeeds for each of them. Consequently, the ====== line and all the sorted isolated lines, obtained in the first part of this post, are, also, deleted ;-))
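The notes above can be checked with a small Python sketch. It is an approximation, not the Notepad++ engine: `\R` becomes `\n`, and because recent Python versions reject a bare `(?s)` in the middle of a pattern, the look-ahead is wrapped in a scoped `(?s:...)` group instead. The function name is hypothetical, and the data is assumed to contain no line starting with `=`.

```python
import re

def keep_isolated_in_order(original_lines, isolated_lines):
    """Original list + '======' separator + isolated lines, then one regex pass."""
    text = "".join(line + "\n" for line in original_lines)
    text += "======\n"
    text += "".join(line + "\n" for line in isolated_lines)
    # A line is matched (and deleted) when it is NOT found again, as a whole
    # line, somewhere after a line made of "=" characters.
    pattern = re.compile(r"^(.*\n)(?s:(?!.*^=+.*^\1))", re.MULTILINE)
    return pattern.sub("", text).splitlines()

original = ("789 123 456 123 xyz 789 789 000 789 456 abc "
            "123 123 456 hij 456 456 789 123 999 123").split()
isolated = ["000", "999", "abc", "hij", "xyz"]
print(keep_isolated_in_order(original, isolated))
# ['xyz', '000', 'abc', 'hij', '999']
```

On a small list this finishes instantly. The look-ahead rescans the rest of the text for every line, so the cost grows quickly with file size.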

                  Best Regards,

                  guy038

                    patrickdrd
                    last edited by Oct 19, 2018, 8:24 AM

                    thanks for help guys,
                    I managed to complete my original task,
                    but now I’ve got another one,
                    which should be a bit more difficult:

                    basically the files are two, I join them,
                    anyway, what I would like to do now is to find out
                    which lines of the first file exist in the second,
                    I tried with examdiff I use (trying to compare two sorted files),
                    but it didn’t return proper results

                    imagine that the first file acts as a “whitelist” against the second file

                      Terry R
                      last edited by Oct 19, 2018, 8:45 AM

                      @patrickdrd
                      @guy038 had a post (in response to a similar question). See:
                      https://notepad-plus-plus.org/community/topic/15436/subtract-document-b-from-a/4

                      In this one the object was to remove duplicate lines. However you could use the same regex to “Mark” the line. Actually in the Mark function you would also tick “bookmark line”. This should identify all the lines in the first file (which is above the — line stated in that post) that also occur in the second file. The second file would be below that line. Because you aren’t removing any lines the first file needs to have ONLY unique lines for the regex to correctly identify duplicates across the 2 files.

                      Have a go using that. Come back if issues arise. Please note that Notepad++ has issues with large files, so possibly read the remainder of that thread first.
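As a side note for readers who hit the file-size limits, the same "which lines of the first file exist in the second" question can be sketched in Python with a set lookup. This swaps the regex/bookmark approach for a different technique. The function name and the sample host names are hypothetical, and lines are compared as whole strings, so a stray space or tab makes two lines differ.

```python
def lines_also_in_second(first_lines, second_lines):
    """Return the lines of the first list that occur, as whole lines, in the second."""
    second = set(second_lines)  # set membership tests are cheap, even for large files
    return [line for line in first_lines if line in second]

# Hypothetical sample data, not taken from the thread.
whitelist = ["adobe.com", "example.org", "notepad-plus-plus.org"]
other_file = ["get.adobe.com", "example.org", "tracker.example.net"]
print(lines_also_in_second(whitelist, other_file))
# ['example.org']
```

Because the comparison is on complete lines, `adobe.com` is not reported just because `get.adobe.com` is present.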

                      Terry

                        patrickdrd
                        last edited by Oct 19, 2018, 9:04 AM

                        so, which regexp should I use?

                        I tried with this one:

                        (?-s)^(.+)\R(?s)(?=.*\R\1\R?)

                        and it gives me 3 results only (I expected 7) and
                        how/where do I set bookmark colors in notepad++?

                          patrickdrd
                          last edited by Oct 19, 2018, 9:11 AM

                          ok I’ve found about the color here:
                          https://notepad-plus-plus.org/community/topic/12631/bookmark-line-color/2

                            Terry R
                            last edited by Oct 19, 2018, 9:15 AM

                            @patrickdrd
                            that’s the regex I meant. Well if it’s found 3 then at least it works. If you are able to figure out the other 4 lines (you expected to get), then I’d suggest making a new file, copying those 4 lines, and their ‘duplicates’ from the second file and comparing them. To make it easier to compare, use the Show Symbol (under main menu View option). You can select all, or possibly just some of the options. Put 1 line from the first file directly above its ‘duplicate’ from the second. I bet you will find a difference. It may only be a space, or possibly a tab in one vs a number of spaces in the other line, but there will be a difference.

                            As for different bookmark colors, I know it can be done as @Scott_Sumner mentioned it recently, or rather he mentioned a different icon, so presumably a different colour is also possible. I suggest have a look through his posts. This can be done by selecting a poster (their name in blue), then in their profile page the right hand side lists posts going backwards in time, last at the top.

                            It’s bed time for me. Likely someone else will respond overnight if you are still having issues.

                            Good luck

                            Terry

                              patrickdrd
                              last edited by Oct 19, 2018, 10:58 AM

                              this regexp is doing the job fine:

                              (?-s)^(.+)\R(?s)(?=.*\R\1\R?)

                              but as guy said, it doesn’t work with large (or kind of) files,
                              my file is 1-1,5mb (a bit over 50k records) and it doesn’t work

                              anyway, I did it with excel vlookup function

                                guy038
                                last edited by guy038 Oct 20, 2018, 2:42 AM Oct 19, 2018, 1:25 PM

                                Hi, All

                                Unfortunately, again, I verified that my previous method works only if the file contents and/or the number of lines processed are not too large :-(( In most cases, the regex engine ends up wrongly matching all the file contents. Too bad !

                                So, if you wish to keep the initial order of your file, here is, a new method to adopt, which covers all cases ( I hope so ! ), in order to keep/delete duplicate lines AND/OR all non-duplicate lines of a file, whatever its size !

                                Please, do any test, even on important files, to verify that this method is robust and does not fail ! I’ll be glad to get your feedback :-))


                                So, let’s start with that sample text :

                                567890
                                1234
                                45
                                1234
                                xyz
                                567890
                                567890
                                000000000
                                567890
                                45
                                abcdef
                                1234
                                1234
                                45
                                hijk
                                45
                                45
                                567890
                                1234
                                999
                                1234
                                
                                • Move the cursor at the beginning of the first item 567890

                                • Open the Column editor ( Edit > Column Editor... )

                                • Insert a decimal sequence of numbers, ticking the Leading zeros option

                                • Delete the last isolated number 22

                                =>

                                01567890
                                021234
                                0345
                                041234
                                05xyz
                                06567890
                                07567890
                                08000000000
                                09567890
                                1045
                                11abcdef
                                121234
                                131234
                                1445
                                15hijk
                                1645
                                1745
                                18567890
                                191234
                                20999
                                211234
                                
                                • Now, use the regex S/R, below, to swap the positions of data and numbers, where N is the number of digits of the previous numbering, and to insert a separation character ( I chose the # character, but any single char may suit, provided it’s not used in your data. Prefer a character which is not a meta-character used in regexes ! )

                                  • SEARCH ^(?-s)^(\d{N})(.+)

                                  • REPLACE \2#\1

                                As, in our example, N = 2, it leads to the text :

                                567890#01
                                1234#02
                                45#03
                                1234#04
                                xyz#05
                                567890#06
                                567890#07
                                000000000#08
                                567890#09
                                45#10
                                abcdef#11
                                1234#12
                                1234#13
                                45#14
                                hijk#15
                                45#16
                                45#17
                                567890#18
                                1234#19
                                999#20
                                1234#21
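The numbering and swap steps can be sketched in Python as follows. This is only an illustration under a few assumptions: the function name is made up, the width N is derived from the number of lines, and the separator is assumed not to occur in the data.

```python
import re

def tag_with_line_numbers(lines, sep="#"):
    """Append a zero-padded line number to each line, as 'data#NN'."""
    width = len(str(len(lines)))  # N, the number of digits of the numbering
    # What the Column Editor step does: prefix a counter with leading zeros.
    numbered = [f"{i:0{width}d}{line}" for i, line in enumerate(lines, start=1)]
    # Same swap as SEARCH (?-s)^(\d{N})(.+) with REPLACE \2#\1
    swap = re.compile(rf"^(\d{{{width}}})(.+)")
    return [swap.sub(rf"\2{sep}\1", item) for item in numbered]

sample = ("567890 1234 45 1234 xyz 567890 567890 000000000 567890 45 abcdef "
          "1234 1234 45 hijk 45 45 567890 1234 999 1234").split()
tagged = tag_with_line_numbers(sample)
print(tagged[:3])
# ['567890#01', '1234#02', '45#03']
```

Using a fixed width matters: with `\d{N}` the regex takes exactly the counter digits, even when the data itself starts with digits, as in `0345`.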
                                
                                • Then, execute a sort with the menu option Edit > Line Operations > Sort Lines Lexicographically Ascending =>
                                000000000#08
                                1234#02
                                1234#04
                                1234#12
                                1234#13
                                1234#19
                                1234#21
                                45#03
                                45#10
                                45#14
                                45#16
                                45#17
                                567890#01
                                567890#06
                                567890#07
                                567890#09
                                567890#18
                                999#20
                                abcdef#11
                                hijk#15
                                xyz#05
                                

                                Important : Till the end of that post, this sorted text becomes the new sample text !


                                Now, here are the six regex S/R that cover all possible cases :

                                • Regex A : SEARCH (?-s)^(.+#).*\R(?:\1.*\R)+ and REPLACE Leave EMPTY

                                • Regex B : SEARCH (?-s)^((.+#).*\R)(?:\2.*\R)+ and REPLACE \1

                                • Regex C : SEARCH (?-s)^(.+#).*\R(\1.*\R)+ and REPLACE \2

                                • Regex D : SEARCH (?-s)^(.+#).*\R(?:\1.*\R)+|.+\R and REPLACE ?1$0

                                • Regex E : SEARCH (?-s)^((.+#).*\R)(?:\2.*\R)+|.+\R and REPLACE \1

                                • Regex F : SEARCH (?-s)^(.+#).*\R(\1.*\R)+|.+\R and REPLACE \2


                                So, in a previously sorted file ( I insist ! ) and whatever the numbering after the # symbol :

                                • If you want to delete all duplicate lines, only, use the regex A

                                • If you want to keep isolated lines AND the first line of each block of duplicate lines, only, use the regex B

                                • If you want to keep isolated lines AND the last line of each block of duplicate lines, only, use the regex C

                                • If you want to delete isolated lines, only, use the regex D

                                • If you want to keep the first line of each block of duplicate lines, only, use the regex E

                                • If you want to keep the last line of each block of duplicate lines, only, use the regex F
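To experiment with the six cases outside Notepad++, the regexes can be approximated in Python. This is a sketch under assumptions: `\R` becomes `\n`, `re.MULTILINE` stands in for the line-anchored `^`, and the conditional replacement `?1$0` of regex D, which Python's `re` does not have, is emulated with a small helper function (a hypothetical name).

```python
import re

SORTED_LINES = [
    "000000000#08", "1234#02", "1234#04", "1234#12", "1234#13", "1234#19",
    "1234#21", "45#03", "45#10", "45#14", "45#16", "45#17", "567890#01",
    "567890#06", "567890#07", "567890#09", "567890#18", "999#20",
    "abcdef#11", "hijk#15", "xyz#05",
]
TEXT = "".join(line + "\n" for line in SORTED_LINES)

def keep_block_only(match):
    # Emulates REPLACE ?1$0 : keep the whole match when group 1 took part in
    # it (a block of duplicates), otherwise replace with nothing.
    return match.group(0) if match.group(1) is not None else ""

RULES = {
    "A": (r"^(.+#).*\n(?:\1.*\n)+", ""),
    "B": (r"^((.+#).*\n)(?:\2.*\n)+", r"\1"),
    "C": (r"^(.+#).*\n(\1.*\n)+", r"\2"),
    "D": (r"^(.+#).*\n(?:\1.*\n)+|.+\n", keep_block_only),
    "E": (r"^((.+#).*\n)(?:\2.*\n)+|.+\n", r"\1"),
    "F": (r"^(.+#).*\n(\1.*\n)+|.+\n", r"\2"),
}

def apply_rule(name, text=TEXT):
    pattern, replacement = RULES[name]
    return re.sub(pattern, replacement, text, flags=re.MULTILINE).splitlines()

for name in RULES:
    print(name, apply_rule(name))
```

In regexes C and F the repeated group `(\1.*\n)+` ends up holding its last repetition, which is why `\2` yields the last line of each block. For E and F, an isolated line is matched by the `.+\n` alternative, the referenced group is unset, and the replacement is empty.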

                                Here are, below, the results of these six regex S/R, against the sample text :

                                •----------------•----------------•----------------•----------------•----------------•----------------•
                                |    Regex A     |    Regex B     |    Regex C     |    Regex D     |    Regex E     |    Regex F     |
                                •----------------•----------------•----------------•----------------•----------------•----------------•
                                |  000000000#08  |  000000000#08  |  000000000#08  |  1234#02       |  1234#02       |  1234#21       |
                                |  999#20        |  1234#02       |  1234#21       |  1234#04       |  45#03         |  45#17         |
                                |  abcdef#11     |  45#03         |  45#17         |  1234#12       |  567890#01     |  567890#18     |
                                |  hijk#15       |  567890#01     |  567890#18     |  1234#13       |                |                |
                                |  xyz#05        |  999#20        |  999#20        |  1234#19       |                |                |
                                |                |  abcdef#11     |  abcdef#11     |  1234#21       |                |                |
                                |                |  hijk#15       |  hijk#15       |  45#03         |                |                |
                                |                |  xyz#05        |  xyz#05        |  45#10         |                |                |
                                |                |                |                |  45#14         |                |                |
                                |                |                |                |  45#16         |                |                |
                                |                |                |                |  45#17         |                |                |
                                |                |                |                |  567890#01     |                |                |
                                |                |                |                |  567890#06     |                |                |
                                |                |                |                |  567890#07     |                |                |
                                |                |                |                |  567890#09     |                |                |
                                |                |                |                |  567890#18     |                |                |
                                •----------------•----------------•----------------•----------------•----------------•----------------•
                                

                                • Now, considering any of these 6 results, just above, let’s swap, with the regex S/R, below, the two blocks of data, on either side of the # character

                                  • SEARCH ^(?-s)^(.+)#(.+)

                                  • REPLACE \2#\1

                                We get the different cases, below :

                                •----------------•----------------•----------------•----------------•----------------•----------------•
                                |    Regex A     |    Regex B     |    Regex C     |    Regex D     |    Regex E     |    Regex F     |
                                •----------------•----------------•----------------•----------------•----------------•----------------•
                                |  08#000000000  |  08#000000000  |  08#000000000  |  02#1234       |  02#1234       |  21#1234       |
                                |  20#999        |  02#1234       |  21#1234       |  04#1234       |  03#45         |  17#45         |
                                |  11#abcdef     |  03#45         |  17#45         |  12#1234       |  01#567890     |  18#567890     |
                                |  15#hijk       |  01#567890     |  18#567890     |  13#1234       |                |                |
                                |  05#xyz        |  20#999        |  20#999        |  19#1234       |                |                |
                                |                |  11#abcdef     |  11#abcdef     |  21#1234       |                |                |
                                |                |  15#hijk       |  15#hijk       |  03#45         |                |                |
                                |                |  05#xyz        |  05#xyz        |  10#45         |                |                |
                                |                |                |                |  14#45         |                |                |
                                |                |                |                |  16#45         |                |                |
                                |                |                |                |  17#45         |                |                |
                                |                |                |                |  01#567890     |                |                |
                                |                |                |                |  06#567890     |                |                |
                                |                |                |                |  07#567890     |                |                |
                                |                |                |                |  09#567890     |                |                |
                                |                |                |                |  18#567890     |                |                |
                                •----------------•----------------•----------------•----------------•----------------•----------------•
                                
                                • Considering any of these 6 results, just above, perform, again, a sort, with the option Edit > Line Operations > Sort Lines Lexicographically Ascending =>
                                •----------------•----------------•----------------•----------------•----------------•----------------•
                                |    Regex A     |    Regex B     |    Regex C     |    Regex D     |    Regex E     |    Regex F     |
                                •----------------•----------------•----------------•----------------•----------------•----------------•
                                |  05#xyz        |  01#567890     |  05#xyz        |  01#567890     |  01#567890     |  17#45         |
                                |  08#000000000  |  02#1234       |  08#000000000  |  02#1234       |  02#1234       |  18#567890     |
                                |  11#abcdef     |  03#45         |  11#abcdef     |  03#45         |  03#45         |  21#1234       |
                                |  15#hijk       |  05#xyz        |  15#hijk       |  04#1234       |                |                |
                                |  20#999        |  08#000000000  |  17#45         |  06#567890     |                |                |
                                |                |  11#abcdef     |  18#567890     |  07#567890     |                |                |
                                |                |  15#hijk       |  20#999        |  09#567890     |                |                |
                                |                |  20#999        |  21#1234       |  10#45         |                |                |
                                |                |                |                |  12#1234       |                |                |
                                |                |                |                |  13#1234       |                |                |
                                |                |                |                |  14#45         |                |                |
                                |                |                |                |  16#45         |                |                |
                                |                |                |                |  17#45         |                |                |
                                |                |                |                |  18#567890     |                |                |
                                |                |                |                |  19#1234       |                |                |
                                |                |                |                |  21#1234       |                |                |
                                •----------------•----------------•----------------•----------------•----------------•----------------•
                                
                                • Finally, let’s use this last regex S/R to get rid of all the counting marks

                                  • SEARCH (?-s)^.+#

                                  • REPLACE Leave Empty

                                We obtain the 6 final results, from the original text :

                                •----------------•----------------•----------------•----------------•----------------•----------------•
                                |    Regex A     |    Regex B     |    Regex C     |    Regex D     |    Regex E     |    Regex F     |
                                •----------------•----------------•----------------•----------------•----------------•----------------•
                                |    xyz         |    567890      |    xyz         |    567890      |    567890      |    45          |
                                |    000000000   |    1234        |    000000000   |    1234        |    1234        |    567890      |
                                |    abcdef      |    45          |    abcdef      |    45          |    45          |    1234        |
                                |    hijk        |    xyz         |    hijk        |    1234        |                |                |
                                |    999         |    000000000   |    45          |    567890      |                |                |
                                |                |    abcdef      |    567890      |    567890      |                |                |
                                |                |    hijk        |    999         |    567890      |                |                |
                                |                |    999         |    1234        |    45          |                |                |
                                |                |                |                |    1234        |                |                |
                                |                |                |                |    1234        |                |                |
                                |                |                |                |    45          |                |                |
                                |                |                |                |    45          |                |                |
                                |                |                |                |    45          |                |                |
                                |                |                |                |    567890      |                |                |
                                |                |                |                |    1234        |                |                |
                                |                |                |                |    1234        |                |                |
                                •----------------•----------------•----------------•----------------•----------------•----------------•
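The last three steps (swap around the # character, sort again, strip the counting marks) can be sketched in Python like this. The function name is hypothetical, and the sketch assumes a single # per line, i.e. the separator never occurs in the data itself.

```python
import re

def restore_original_order(tagged_lines):
    """Turn 'data#NN' lines back into plain data, in their initial order."""
    # Same swap as SEARCH (?-s)^(.+)#(.+) with REPLACE \2#\1
    swapped = [re.sub(r"^(.+)#(.+)", r"\2#\1", line) for line in tagged_lines]
    # The zero-padded numbers now lead each line, so a plain sort restores
    # the original order.
    swapped.sort()
    # Same as SEARCH (?-s)^.+# with an empty replacement.
    return [re.sub(r"^.+#", "", line) for line in swapped]

regex_a_output = ["000000000#08", "999#20", "abcdef#11", "hijk#15", "xyz#05"]
print(restore_original_order(regex_a_output))
# ['xyz', '000000000', 'abcdef', 'hijk', '999']
```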
                                

                                Remark : This method needs numerous steps, but is quite safe, because all the modifications, produced by the different S/R, concern one line at a time ( or a consecutive block of lines, in regexes A to F ! )

                                Of course, on huge files, execution time may be long, but you should get the expected results, at the end ;-))

                                Cheers,

                                guy038

                                  patrickdrd
                                  last edited by Oct 19, 2018, 1:49 PM

                                  thanks a lot for your effort, but too much fuss, isn’t it?

                                  vlookup in excel is easier to do I think

                                    Scott Sumner @patrickdrd
                                    last edited by Oct 19, 2018, 2:51 PM

                                    @patrickdrd said:

                                    thanks a lot for your effort, but too much fuss, isn’t it?

                                    NOTHING is too much fuss for @guy038 ! :-D

                                      Scott Sumner
                                      last edited by Oct 19, 2018, 6:12 PM

                                      @guy038 said:

                                      the regex engine ends up, wrongly, matching all file contents

                                      As mentioned in this thread, this is in all likelihood caused by this problem.

                                      • P
                                        patrickdrd
                                        last edited by Oct 22, 2018, 5:54 AM

                                        (?-s)^(.+)\R(?s)(?=.*\R\1\R?)

                                        doesn’t match the whole line,
                                        e.g. it tells me that adobe.com exists, but I only have lines that end in adobe.com, e.g. get.adobe.com
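One way this lookahead can over-match, offered as an assumption rather than a diagnosis: the trailing `\R?` is optional, so `\1` may match only the beginning of a longer later line. The Python sketch below writes `\R` as `\n` (Python's `re` has no `\R`) and compares the posted lookahead with a hypothetical variant that requires a line break or the end of the text right after `\1`. Python's engine is not the Boost engine used by Notepad++, so this only illustrates the pattern logic, and it does not reproduce the suffix case (get.adobe.com) described above.

```python
import re

# Posted regex with \R written as \n; the trailing \n? is optional.
LOOSE = re.compile(r"(?m)^(.+)\n(?=(?s:.*)\n\1\n?)")
# Hypothetical variant (not from the thread): \1 must be a whole line.
STRICT = re.compile(r"(?m)^(.+)\n(?=(?s:.*)\n\1(?:\n|\Z))")


def lines_with_later_copy(pattern, text):
    """Return the lines that the pattern reports as having a later copy."""
    return [m.group(1) for m in pattern.finditer(text)]


# "adobe.com" appears once; a later line merely starts with it.
sample = "adobe.com\nexample.org\nadobe.com.au\n"
loose_hits = lines_with_later_copy(LOOSE, sample)    # reports adobe.com
strict_hits = lines_with_later_copy(STRICT, sample)  # reports nothing
```

In Notepad++ syntax, the equivalent idea would be to end the lookahead with a mandatory line break or end of file instead of `\R?`; that translation is untested here.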

                                        • G
                                          guy038
                                          last edited by guy038 Oct 29, 2018, 7:21 PM Oct 23, 2018, 10:37 PM

                                          Hi, All,

                                          Sorry for the delay, but I was busy with some garden work (hedge trimming !) and, of course, I also tested the 6 regexes, from A to F, of my previous post !

                                          I used the following test file :

                                          a#9999999999
                                          a#9999999999
                                          abcdefghij#9999999999
                                          .........................
                                          .........................
                                          ..21524 IDENTICAL lines ( in totality ! )
                                          .........................  
                                          .........................  
                                          abcdefghij#9999999999
                                          z#9999999999
                                          z#9999999999
                                          

                                          As you can see :

                                          • It begins with 2 identical lines a#9999999999

                                          • These are followed by 21524 identical lines abcdefghij#9999999999

                                          • And it finishes with 2 identical lines z#9999999999, followed by a final line-break
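For anyone who wants to rebuild this test file with a chosen number of identical lines, here is a minimal Python sketch. The function names, the output path and the Windows `\r\n` line ending are assumptions made for illustration; only the three line contents come from the post.

```python
def build_test_file(n_identical, eol="\r\n"):
    """Return the test text: 2 'a' lines, n identical lines, 2 'z' lines."""
    lines = ["a#9999999999"] * 2
    lines += ["abcdefghij#9999999999"] * n_identical
    lines += ["z#9999999999"] * 2
    # Every line, including the last one, ends with a line break.
    return eol.join(lines) + eol


def write_test_file(path, n_identical):
    """Write the test text as-is (newline='' disables EOL translation)."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        handle.write(build_test_file(n_identical))


content = build_test_file(21524)
```

Calling `write_test_file("test.txt", 21525)` (a hypothetical file name) would produce the variant with one additional identical line.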


                                          So, I ran regex C of my previous post, (?-s)^(.+#).*\R(\1.*\R)+ , against this test file

                                          => It correctly matched the 2 lines at the beginning of the file, then the 21524 identical lines ( => a selection of 495,103 characters ) and, finally, the 2 lines at the end of the file

                                          Then, I simply added ONE additional line abcdefghij#9999999999 to that file and ran the regex again. This time, it matched the 2 lines at the beginning of the file, but wrongly grabbed all the remaining text (so the 21525 lines AND the 2 last lines) !?

                                          To check whether the results depended on the size of the selection, I changed the test file to use lines of 140 chars, as below :

                                          a#9999999999
                                          a#9999999999
                                          abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz#9999999999999999999999999999999999999999999999999999999999999999999999999999999999999
                                          .........................
                                          .........................
                                          ..21524 IDENTICAL lines ( in totality ! )
                                          .........................  
                                          .........................  
                                          abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz#9999999999999999999999999999999999999999999999999999999999999999999999999999999999999
                                          z#9999999999
                                          z#9999999999
                                          

                                          I was very surprised to see that the results were exactly the same (OK for 21524 identical lines and KO for 21525 identical lines !!??). And yet, this time, the selection contained 3,013,360 chars !

                                          Of course, I did this test with all the other regexes. For example, with regex A, the limit is a bit higher : 25120 lines. But again, after adding one more line, regex A failed :-((

                                          So, guys, if you don’t mind, I would like you to test regex C with the first test file, above, in order to verify whether the limit is “laptop-dependent”. I mean, maybe the results are not representative, given my weak Windows XP configuration !?

                                          In the meantime, we can seemingly conclude that, in a previously sorted file, a regular expression cannot handle much more than roughly 21,000 identical lines at a time ! I’d be glad to receive your feedback in order to confirm or invalidate this :-))


                                          Of course, I came to this temporary conclusion after testing my 6 regexes, from A to F, against real text. I decided to take the whole contents of a novel from the Gutenberg site. And, as I’m French, my choice was, naturally, the novel “The Count of Monte Cristo” by Alexandre Dumas, which you may download from the link below :

                                          http://www.gutenberg.org/files/1184/1184-0.txt ( Choose the link “Raw text UTF-8” )

                                          When I first tried to build a suitable sorted working file in order to test my regexes, unfortunately, they all failed :-(( But I also noticed that, in that sorted file, there were numerous lines the#...... Indeed, if you download the novel, just count the occurrences of the regex \bthe\b => 28628 occurrences of the article “the”. So I deleted all these consecutive lines containing the word “the”. This time, all my regexes worked as expected :-))

                                          However, note that, during my tests, I found out that my regexes D to F were, initially, erroneous. So I changed them, and I have already updated my previous post with the correct regexes !


                                          With the help of the page below, on the most common words in English :

                                          https://en.wikipedia.org/wiki/Most_common_words_in_English

                                          I verified, with the regex \bWord\b, that, in this novel, the 10 most common words used in the initial text are :

                                          the          28,628  ( ABSENT in the SORTED file )
                                          to           12,897
                                          of           12,916
                                          and          12,570
                                          a             9,473
                                          I             8,393
                                          you           8,288
                                          he            6,945
                                          in            6,625
                                          his           5,909
                                          

                                          So, we are sure that the 6 regexes can, at least, manage files containing up to 13,000 consecutive identical lines !
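The whole-word counts above can be cross-checked outside Notepad++. This is a minimal Python sketch of the `\bWord\b` count; the function name and the sample sentence are made up for illustration, and whether the counts in the thread were case-sensitive depends on the "Match case" option, which the post does not state.

```python
import re


def count_whole_word(text, word, ignore_case=False):
    """Count occurrences of `word` delimited by word boundaries (\\b)."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape keeps the word literal even if it contains regex symbols.
    return len(re.findall(rf"\b{re.escape(word)}\b", text, flags))


sample = "The count saw the other man; then the count left."
exact = count_whole_word(sample, "the")
any_case = count_whole_word(sample, "the", ignore_case=True)
```

On this sample, "other" and "then" are not counted, because the boundary test rejects matches inside a longer word. To count in the novel, read the downloaded file into `text` first.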


                                          Now, if anyone is interested in the different steps that I used to build a decent working file for testing regexes A to F, just have a glance at the table below :

                                          •------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------• 
                                          |            SEARCH                  |    REPLACE    |                                       EXPLANATIONS                                                  | Occurrences |
                                          •------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------•
                                          |                                    |               | We delete, manually, from BEGINNING of file to the END of the CONTENTS part                         |             |
                                          |                                    |               |                                                                                                     |             |
                                          |                                    |               | We delete, manually, from AFTER the FOOTNOTES part till the VERY END of file                        |             |
                                          |                                    |               |                                                                                                     |             |
                                          | ,(?=\d)                            |    EMPTY      | We delete any COMMA separator in NUMBERS                                                            |       264   |
                                          |                                    |               |                                                                                                     |             |
                                          | [,;.]                              | \x20          | We change any punctuation END of a (part of) SENTENCE with a SPACE character                        |    72,423   |
                                          |                                    |               |                                                                                                     |             |
                                          | (?i)o’(?=clock)                    | of\x20the\x20 | We replace the "o’" CONTRACTIVE form with the COMPLETE form "of the "                               |       164   |
                                          |                                    |               |                                                                                                     |             |
                                          | (?i)’s|(?<!\w)’|’(?!\w)            |    EMPTY      | We delete the "’s" string and any "’" sign NOT SURROUNDED with WORD chars                           |     2,754   |
                                          |                                    |               |                                                                                                     |             |
                                          | (?i)(d|l)’                         | \1e\x20       | We change the "d’" and "l’" French CONTRACTIVE forms to, RESPECTIVELY, "de " and "le "              |       311   |
                                          |                                    |               |                                                                                                     |             |
                                          | —|-                                | \x20          | We change any HYPHEN-MINUS character as well as the EM DASH char, with a SPACE character            |     4,933   |
                                          |                                    |               |                                                                                                     |             |
                                          | [^\w’\r\n ]                        | \x20          | We ONLY keep WORD, SPACE, and EOL characters and the ’ sign( PRESENT in English CONTRACTIVE forms ) |    38,795   |
                                          |                                    |               |                                                                                                     |             |
                                          | (?-i)(?<=\s)(?=\w)[^aAIVX\d](?=\s) | \x20          | As ONE-char STRING, we ONLY keep article "A", "a", pronoun "I", DIGITS and ROMAN letters "V" , "X"  |     1,151   |
                                          |                                    |               |                                                                                                     |             |
                                          | ^\h*\R|^\h+|\h+$|\h+(?=\h)         |    EMPTY      | We delete PURE BLANK lines, TRIM spaces at START and END, and REDUCE to a ONE SPACE gap             |   107,108   |
                                          |                                    |               |                                                                                                     |             |
                                          | \x20          ( > 1 mn ! )         | \r\n          | Finally, we change any SINGLE SPACE character with a LINE BREAK                                     |   419,769   |
                                          |                                    |               |                                                                                                     |             |
                                          | COLUMN editor, with LEADING zeros  |               | At LINE 1, COLUMN 1                                                                                 |             |
                                          |                                    |               |                                                                                                     |             |
                                          | (?-s)^(\d{6})(.+)                  | \2#\1         | We SWAP each WORD and its REFERENCE number                                                          |   464,233   |
                                          |                                    |               |                                                                                                     |             |
                                          | (?i)^the#                          |               | We BOOKMARK all the LINES, containing the article "the", whatever its CASE                          |    28,529   |
                                          |                                    |               |                                                                                                     |             |
                                          | Bookmark > Cut Bookmarked Lines    |               | We BACKUP all these lines in an OTHER file, for FURTHER processing                                  |             |
                                          |                                    |               |                                                                                                     |             |
                                          | Sort lines Lexico... ASCENDING     |               | => A work SORTED file, encoded UTF-8 with BOM, of 5,861,424 BYTES, with 435,704 WORDS, ONE per line |             |
                                          |                                    |               |                                                                                                     |             |
                                          •------------------------------------•---------------•-----------------------------------------------------------------------------------------------------•-------------•
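The overall shape of these steps (clean the text, put one word per line, attach a 6-digit reference number, set aside the "the" lines, sort) can be condensed into a short Python sketch. This is a simplification under stated assumptions, not a port of the table: Python's `re` has no `\h` or `\R`, the contraction-specific replacements are skipped, the reference number is appended directly instead of using the column editor followed by a swap, and `sorted` orders by code point, which may differ from the Notepad++ lexicographic sort. The function name is hypothetical.

```python
import re


def build_sorted_word_lines(text):
    """Return sorted 'word#NNNNNN' lines, without the lines for 'the'."""
    # Keep word characters, the ’ sign and whitespace; blank out the rest.
    cleaned = re.sub(r"[^\w’\s]", " ", text)
    # One word per entry; split() also trims and collapses whitespace.
    words = cleaned.split()
    # Attach a 6-digit reference number reflecting the original word order.
    lines = [f"{word}#{index:06d}" for index, word in enumerate(words, start=1)]
    # Set aside every line for the article "the", whatever its case.
    lines = [line for line in lines if not re.match(r"(?i)the#", line)]
    return sorted(lines)


result = build_sorted_word_lines("The cat, the dog; a cat.")
```

The reference number keeps every line distinct, so identical words end up as consecutive lines that share the same `word#` prefix, which is what regexes A to F rely on.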
                                          

                                          Now, applying the regexes A to F against the sorted file obtained, I got, after about 10 s for each, the coherent results below :

                                          
                                          •-------•---------------------------------------•----------•-------------•--------------•
                                          | Regex |                SEARCH                 |  REPLACE | Occurrences | LINES Number |
                                          •-------•---------------------------------------•----------•-------------•--------------•
                                          |       |    Work SORTED file, obtained, AFTER all the steps above :     |    435,704   |
                                          •-------•---------------------------------------•----------•-------------•--------------•
                                          |       |                                       |          |             |              |
                                          |   A   |  (?-s)^(.+#).*\R(?:\1.*\R)+           |  EMPTY   |    10,818   |      6,861   |
                                          |       |                                       |          |             |              |
                                          |   B   |  (?-s)^((.+#).*\R)(?:\2.*\R)+         |  \1      |    10,818   |     17,679   |
                                          |       |                                       |          |             |              |
                                          |   C   |  (?-s)^(.+#).*\R(\1.*\R)+             |  \2      |    10,818   |     17,679   |
                                          |       |                                       |          |             |              |
                                          |   D   |  (?-s)^(.+#).*\R(?:\1.*\R)+|.+\R      |  ?1$0    |    17,679   |    428,843   |
                                          |       |                                       |          |             |              |
                                          |   E   |  (?-s)^((.+#).*\R)(?:\2.*\R)+|.+\R    |  \1      |    17,679   |     10,818   |
                                          |       |                                       |          |             |              |
                                          |   F   |  (?-s)^(.+#).*\R(\1.*\R)+|.+\R        |  \2      |    17,679   |     10,818   |
                                          |       |                                       |          |             |              |
                                          •-------•---------------------------------------•----------•-------------•--------------•
                                          

                                          It’s easy to verify that :

                                          • 6,861 lines, after regex A + 428,843 lines, after regex D = 435,704 ( Total of the file )

                                          • 6,861 lines, after regex A, + 10,818 lines, after regex E = 17,679 lines, after regex B

                                          • 6,861 lines, after regex A, + 10,818 lines, after regex F = 17,679 lines, after regex C

                                          On the other hand :

                                          • The 10,818 occurrences of regexes A, B and C correspond to all the first/last duplicate lines, as left after regexes E or F

                                          • The 17,679 occurrences of regexes D, E and F correspond to all the first/last duplicate lines AND all the unique lines, too, as left after regexes B or C

                                          Note also that :

                                          • With the 3 regexes A, B and C, the unique lines, which must be kept, are simply not processed by the regexes

                                          • With the 3 regexes D, E and F, the unique lines, which must be deleted, are matched by the second alternative .+\R of the regexes
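The relations between the six results can be replayed on a tiny sorted sample. The Python sketch below is an illustration under assumptions: `\R` is written as `\n`, the `(?-s)` prefix is dropped because `.` does not match a line break by default in Python, and the conditional replacement `?1$0` of regex D is emulated with a function, since Python's `re` has no such syntax. The sample data and helper names are made up, and Python's engine is not the Boost engine of Notepad++, so nothing here says anything about the size limit discussed above.

```python
import re

DUP = r"^(.+#).*\n(?:\1.*\n)+"      # whole block of duplicates (A, D)
FIRST = r"^((.+#).*\n)(?:\2.*\n)+"  # group 1 = first line of the block (B, E)
LAST = r"^(.+#).*\n(\1.*\n)+"       # group 2 = last line of the block (C, F)
UNIQUE = r"|.+\n"                   # second alternative used by D, E and F


def run(pattern, keep, text):
    """Replace each match with keep(match); unmatched groups become ''."""
    return re.sub(pattern, lambda m: keep(m) or "", text, flags=re.M)


def apply_all(text):
    """Return the text left by each of the six search/replace operations."""
    return {
        "A": run(DUP, lambda m: "", text),
        "B": run(FIRST, lambda m: m.group(1), text),
        "C": run(LAST, lambda m: m.group(2), text),
        # Emulates ?1$0: keep the match only when group 1 took part in it.
        "D": run(DUP + UNIQUE, lambda m: m.group(0) if m.group(1) else "", text),
        "E": run(FIRST + UNIQUE, lambda m: m.group(1), text),
        "F": run(LAST + UNIQUE, lambda m: m.group(2), text),
    }


# Six sorted lines: two duplicate blocks (a, c) and one unique line (b).
sample = "a#000001\na#000005\nb#000002\nc#000003\nc#000004\nc#000006\n"
counts = {name: out.count("\n") for name, out in apply_all(sample).items()}
```

On this sample the remaining line counts are A = 1, B = 3, C = 3, D = 5, E = 2 and F = 2, so A + D equals the 6 lines of the file, A + E equals B, and A + F equals C, mirroring the three checks listed above.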


                                          So, guys, as I said above, I’m looking forward to the results of your own tests, regarding the biggest block of consecutive identical lines correctly handled by the six regexes A to F, above, with the test file below :

                                          a#9999999999
                                          a#9999999999
                                          abcdefghij#9999999999         )
                                          .....................         )
                                          .....................         )   HOW MANY lines ? ( THANKS for testing !!)
                                          .....................         )
                                          abcdefghij#9999999999         )
                                          z#9999999999
                                          z#9999999999
                                          

                                          Best Regards,

                                          guy038
