
    Deleting lines that repeat the first 15 characters

    25 Posts 6 Posters 13.4k Views
    • mangoguy

      I found very helpful the solution to eliminate duplicate lines found at

      https://notepad-plus-plus.org/community/topic/13147/eliminating-duplicate-identical-lines

      How can the regex in the search field (?-s)(^.+\R)\1+ be modified so that a line is deleted if the first 15 characters of a line match the first 15 characters of the preceding line?

      Thank you,
      Doug

      • guy038

        Hello @mangoguy and All,

        No problem ! You may use the following regex S/R :

        SEARCH (?-s)((.{15}).*\R)(?:\2.*\R)+

        REPLACE \1

        So, let’s imagine the sorted example text, below :

        --------------- a test
        --------------- is
        --------------- just
        --------------- this
        000001111122222 qrstu
        12345 abcde
        12345 abcde
        123456789012345
        123456789012345 abcde
        123456789012345 fghij
        123456789012345 klmop
        123456789012345 qrstu
        99999 abcde
        abcde 12345
        abcde 12345
        abcdefghijklmno 01
        abcdefghijklmno 11111
        abcdefghijklmno 22222
        abcdefghijklmno 33333
        abcdefghijklmno 56789
        end of the test
        

        After performing the regex, above, you should obtain :

        --------------- a test
        000001111122222 qrstu
        12345 abcde
        12345 abcde
        123456789012345
        99999 abcde
        abcde 12345
        abcde 12345
        abcdefghijklmno 01
        end of the test
        

        Notes :

        • The (?-s) in-line modifier ensures that the regex dot character matches standard characters only, never End of Line characters, even if you previously ticked the . matches newline option !

        • The part ((.{15}).*\R) represents the first line of the current matched block of identical lines, stored as group 1

          • The subpattern .{15} stands for the first fifteen characters of the line, stored as group 2

          • The part .*\R looks for the rest of the line, possibly empty, followed by its End of Line character(s)

        • Finally, the part (?:\2.*\R)+ is a repeated non-capturing group of :

          • \2, which represents the first fifteen characters, of the first line

          • followed by any range of character(s) and the EOL character(s) of the subsequent lines

        • In Replacement , the group 1 ( \1) , first line of the block, is rewritten, only

        • Note that the special symbol ^, after (?-s), is not necessary, anyway, as group 2 must occur after the EOL characters ( \R ) of the first line of the block !
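For readers who want to test the idea outside Notepad++, here is a minimal Python sketch of the first regex. It is a translation under stated assumptions, not the regex above verbatim: Python's `re` module has no `\R`, so the sketch assumes plain `\n` line endings and a newline after the last line, and it anchors each match with `^` so that the 15-character key always starts at the beginning of a line. The function name `collapse_prefix_runs` is made up for this illustration.

```python
import re

def collapse_prefix_runs(text: str, width: int = 15) -> str:
    """Keep only the first line of each run of adjacent lines
    that share the same `width` leading characters."""
    # Group 1 = first line of the run, group 2 = its key.
    # (?:\2.*\n)+ = the following lines that start with the same key.
    pattern = re.compile(r"^((.{%d}).*\n)(?:\2.*\n)+" % width, re.MULTILINE)
    return pattern.sub(r"\1", text)

sample = (
    "123456789012345\n"
    "123456789012345 abcde\n"
    "123456789012345 fghij\n"
    "99999 abcde\n"
    "abcde 12345\n"
    "abcde 12345\n"
    "end of the test\n"
)
print(collapse_prefix_runs(sample), end="")
```

As in the example above, lines shorter than 15 characters (such as the two `abcde 12345` lines) are never treated as duplicates, because `.{15}` cannot match them.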


        Another syntax could be :

        SEARCH (?-s)(.{15}).*\R\K(?:\1.*\R)+

        REPLACE EMPTY

        Notes :

        • This time, after matching the first line of a block of identical lines ( regarding the first 15 characters, only ! ), the \K syntax resets the regex engine position and everything already matched is forgotten !

        • So, the final match is the range of all the duplicate lines, AFTER the first one, which are, simply, deleted, because of the empty replacement zone :-)

        • IMPORTANT : see Scott’s advice, in the next post !

        Best Regards,

        guy038

        • Scott Sumner

          @guy038

          Probably worth pointing out (again) that if you use the regex with the \K syntax, interactive replaces don’t work correctly–you have to use Replace All

          • mangoguy

            Works great! Thank you!

            However, bizarre behavior was noted with a file longer than 143,353 lines. The entire file was selected, 1 replacement was executed, and 15 characters remained in the file.

            Is this problem fixable?

            Thank you,
            Doug

            • guy038

              Hello, @mangoguy,

              Doug, maybe you’re using the periodic backup functionality ? If so, it would be sensible to turn it off, by unticking the option Settings > Preferences… > Backup > Enable session snapshot and periodic backup, while performing the regex S/R on huge files !

              Secondly, the second regex syntax (?-s)(.{15}).*\R\K(?:\1.*\R)+, of my previous post, may produce better results :-)

              Cheers,

              guy038

              • mangoguy

                Thank you.

                Stopping the periodic backup did not resolve the issue.

                It appears that the size of the file is not the problem but rather how many lines are searched between the initial location of the cursor and the first match.

                About 100,000 lines appears to be the limit before the bizarre behavior described recurs.

                Maybe a buffer - memory limitation of Notepad++ ? Just guessing. I can work around this limitation.

                I am very thankful and impressed with how well your regex syntax cleaned this data in Notepad++.

                Thank you again!

                Doug

                • Scott Sumner @mangoguy

                  @mangoguy aka “Doug” :

                  If your data is not sensitive and you could post it somewhere (example, textuploader.com although not sure of its limits on size…), someone will try to duplicate your findings.

                  • mangoguy

                    The data file should be accessible here:

                    https://mangoguy.sharefile.com/d-s55a1b6f522c41eea

                    Thank you,
                    Douglas

                    • Scott Sumner

                      @mangoguy said:

                      https://mangoguy.sharefile.com/d-s55a1b6f522c41eea

                      FWIW, I can confirm the findings of the entire file being selected (I used [red]marking) with either of @guy038’s regexes. However, chopping up the file into 100,000 line subfiles allowed both regexes to work fine. Note that the only duplicates were found in the LAST subfile chunk–which had only ~61,000 lines, not 100,000.

                      I don’t have an explanation–“big” data can cause “big” problems–the definition of “big” being one thing where YMMV…

                      In such a case I myself would turn to a scrap bit of Python to find these duplicate records:

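                      prev = ''
                      with open('data.txt') as f:
                          # Print the 1-based number of every line whose first 15
                          # characters repeat those of the line just before it.
                          for n, line in enumerate(f, start=1):
                              if line[:15] == prev:
                                  print(n)
                              prev = line[:15]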
                      

                      That doesn’t address the issue, but gets the job done.

                      :(
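A possible variation on the script above, sketched here as an assumption rather than something posted in the thread: instead of printing the line numbers, write out only the lines that are kept. The names `drop_repeated_prefix` and `dedupe_file` and the file paths are hypothetical. Unlike the slice comparison above, this version only treats a line as a repeat when its key is a full 15 characters long, which mirrors the `.{15}` of the regexes.

```python
from typing import Iterable, Iterator

def drop_repeated_prefix(lines: Iterable[str], width: int = 15) -> Iterator[str]:
    """Yield each line unless its first `width` characters repeat
    those of the line immediately before it."""
    prev_key = None
    for line in lines:
        key = line.rstrip("\r\n")[:width]
        # Short lines never count as repeats, like .{15} in the regex.
        is_repeat = len(key) == width and key == prev_key
        prev_key = key
        if not is_repeat:
            yield line

def dedupe_file(src_path: str, dst_path: str, width: int = 15) -> None:
    """Stream src_path to dst_path, dropping repeated-prefix lines."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(drop_repeated_prefix(src, width))
```

Because the file is streamed one line at a time, the original order is preserved and the size of the file should not matter.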

                      • guy038

                        Hi, @mangoguy, @scott-sumner and All,

                        I understand what happened :-)) Just have a look, for instance, at lines 27 and 28 of your DATA.txt file, shown below :

                        08,28,1212,3959,0.458,0.458,0.504,0.492,0
                        08,28,1212,1000,0.492,0.364,0.495,0.365,0
                        

                        Obviously, these two lines are NOT sorted as string “3959” is greater than string “1000” !

                        So, I simply performed, FIRST, the classical sort : Edit > Line Operations > Sort Lines Lexicographically Ascending ( 12s )

                        And then…, everything went fine :-D I tried my two regexes, which, both, worked, as expected !

                        BTW, My second regex, with the \K syntax is slightly quicker than the first one ! On my laptop, I got 49s instead of 53s :-)


                        Some statistics about your DATA.txt file and about the regex S/R :

                                0 line  with 4 DUPLICATES or MORE  =>        0 REPLACEMENT   and       0 line  DELETED
                              489 lines with 3 DUPLICATES          =>      489 REPLACEMENTS  and   1,467 lines DELETED  ( 3 x 489 )
                               38 lines with 2 DUPLICATES          =>       38 REPLACEMENTS  and      76 lines DELETED  ( 2 x  38 )
                           28,836 lines with 1 DUPLICATE           =>   28,836 REPLACEMENTS  and  28,836 lines DELETED
                                                           TOTAL   :    29,363 REPLACEMENTS  and  30,379 lines DELETED

                            ORIGINAL file NUMBER of lines    : 460,725
                            After SUPPRESSION of DUPLICATES  : 430,346
                            Difference                       :  30,379
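The totals above can be cross-checked with a few lines of Python. This is only a sketch: `duplicate_histogram` is a hypothetical helper showing how such counts could be gathered from an already sorted list of lines, and the figures fed into the arithmetic are the ones quoted in the statistics, not recomputed from DATA.txt.

```python
from collections import Counter
from itertools import groupby

def duplicate_histogram(sorted_lines, width=15):
    """Map 'number of extra copies' to 'number of keys with that many',
    counting runs of adjacent lines that share the same key."""
    hist = Counter()
    for _key, run in groupby(sorted_lines, key=lambda s: s[:width]):
        extra = sum(1 for _ in run) - 1
        if extra:
            hist[extra] += 1
    return hist

# Figures quoted in the statistics above.
quoted = Counter({3: 489, 2: 38, 1: 28836})
replacements = sum(quoted.values())                       # one per run
deleted = sum(extra * count for extra, count in quoted.items())
remaining = 460725 - deleted
print(replacements, deleted, remaining)
```

Each run of duplicates counts as one replacement, while every extra copy counts as one deleted line, which is why the two totals differ.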
                        

                        Cheers,

                        guy038

                        • Scott Sumner @guy038

                          @guy038

                          I guess I’m confused. The OP asked for “a line is deleted if the first 15 characters of a line match the first 15 characters of the preceding line”–doesn’t that preclude doing a sort, because it removes the impact of the “preceding line” part? Well, no matter…if it helps the OP out that is all that matters. However, I’m still confused as to why the application of the regexes on the unsorted file caused the entire file to be selected for the OP and to be redmarked for me–can you help with the explanation of that?

                          • mangoguy

                            Thank you for the reply.

                            The file is not intended to be sorted. Sorting the file would corrupt the necessary order of the data.

                            Nevertheless, lines that duplicate the first 15 characters of the preceding line must be eliminated.

                            Thank you again,

                            Douglas

                            • guy038

                              Hi, @mangoguy, @scott-sumner and All,

                              In my initial thread, below, which Doug mentioned :

                              https://notepad-plus-plus.org/community/topic/13147/eliminating-duplicate-identical-lines

                              I said :

                              The suppression of all the duplicate lines, in a pre-sorted file, can be easily obtained with a Search/Replacement, in Regular expression mode !

                              Open your file, containing the sorted list of items

                              Open the Replace dialog ( CTRL + H )
                              …

                              In that text, it’s the word ‘sorted’ which is important ! Indeed, imagine this initial text, NOT sorted :

                              pqrst
                              pqrst
                              pqrst
                              uvwxy
                              fghij
                              fghij
                              fghij
                              pqrst
                              pqrst
                              pqrst
                              abcde
                              abcde
                              abcde
                              abcde
                              fghij
                              klmno
                              klmno
                              klmno
                              fghij
                              fghij
                              

                              and my initial regex :

                              SEARCH (?-s)(^.+\R)\1+

                              REPLACE \1

                              Even with a step by step replacement, with the Replace button, it would give :

                              pqrst
                              uvwxy
                              fghij
                              pqrst
                              abcde
                              fghij
                              klmno
                              fghij
                              

                              => You can see that 3 fghij lines still remain, scattered through the text, because the regex only merges adjacent lines. In a sense, what remains is a single line with its two duplicates !

                              Now, let’s sort the initial text, first :

                              abcde
                              abcde
                              abcde
                              abcde
                              fghij
                              fghij
                              fghij
                              fghij
                              fghij
                              fghij
                              klmno
                              klmno
                              klmno
                              pqrst
                              pqrst
                              pqrst
                              pqrst
                              pqrst
                              pqrst
                              uvwxy
                              

                              After performing the same regex S/R, we, now, get the text :

                              abcde
                              fghij
                              klmno
                              pqrst
                              uvwxy
                              

                              And, as expected, it does not contain any duplicate line. So, with an initial sort, this regex, and the derivative regexes, work just fine !

                              However, the main drawback is that the original order, of the file, is lost, because of the sort process :-((
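The difference between the unsorted and the sorted case can be reproduced with a short Python sketch. It assumes `\n` line endings in place of `\R`, and `collapse_adjacent` is a name invented for this illustration.

```python
import re

# Python counterpart of (?-s)(^.+\R)\1+ , assuming "\n" line endings.
ADJACENT_DUPLICATES = re.compile(r"^(.+\n)\1+", re.MULTILINE)

def collapse_adjacent(lines):
    """Collapse each run of identical adjacent lines into one line."""
    text = "".join(line + "\n" for line in lines)
    return ADJACENT_DUPLICATES.sub(r"\1", text).splitlines()

unsorted_lines = (
    ["pqrst"] * 3 + ["uvwxy"] + ["fghij"] * 3 + ["pqrst"] * 3
    + ["abcde"] * 4 + ["fghij"] + ["klmno"] * 3 + ["fghij"] * 2
)

print(collapse_adjacent(unsorted_lines))          # non-adjacent repeats survive
print(collapse_adjacent(sorted(unsorted_lines)))  # one line per value
```

The regex can only see a duplicate that directly follows its original, so the result depends entirely on how the lines are ordered beforehand.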


                              So, against your file, I tried other regexes, which involve a look-ahead and which do not break the file’s order :

                              • (?-s)(?:^(.{15}).*\R)(?=.*\1) and EMPTY replacement => The Count process gives us 1022 occurrences, which is wrong

                              • (?-s)(?:^(.{15}).*\R)(?=(?:.+\R)*\1) => Catastrophic break-down, with only one match ( the entire file contents ! )

                              So, what about this generic one :

                              • (?-s)(?:^(.{15}).*\R)(?=(?:.+\R){0,n}\1)

                              Well, but, for instance, 4 lines, in your file, begin with the string 01,12,1215,1012 ( Lines 209988, 208996, 210928 and 210936 ). And the gap between lines 208996 and 210928, for instance, is 1932 !

                              So, the number n, in that regex, should be, at least 2000, hence the regex :

                              (?-s)(?:^(.{15}).*\R)(?=(?:.+\R){0,2000}\1) => Again, a catastrophic break-down occurred :-((

                              Obviously, it is not worth going on, in that direction ! So, Scott, I presume that when a large number of lines and/or large capturing group areas are involved, we are likely to get unpredictable results ! We have to rethink the approach entirely :-D
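To see how far apart repeated keys really are, without trial and error on the value of n, one could measure the gaps directly. The following is a rough Python sketch; `widest_gap` is a hypothetical helper, and it reports the largest distance, in lines, between two consecutive occurrences of the same 15-character key.

```python
def widest_gap(lines, width=15):
    """Largest distance, in lines, between two consecutive
    occurrences of the same `width`-character key (0 if none repeat)."""
    last_seen = {}
    widest = 0
    for number, line in enumerate(lines, start=1):
        key = line[:width]
        if key in last_seen:
            widest = max(widest, number - last_seen[key])
        last_seen[key] = number
    return widest

demo = ["aa1", "bb1", "aa2", "cc1", "cc2", "bb2", "x", "y", "aa3"]
print(widest_gap(demo, width=2))
```

A look-ahead of the form (?:.+\R){0,n} has to span roughly that many lines for every candidate line, which gives an idea of why the regex engine gives up on a file of this size.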


                              Finally, I found out the right way to get the job done :-)))

                              Note that, in Doug’s file, the key string is the first 15 characters of a line. So, here is the procedure :

                              • First, we get rid of possible blank lines, with the regex S/R, below :

                              SEARCH ^\h*\R

                              REPLACE EMPTY

                              => 2 lines should be deleted

                              • Then, we add a blank area, six characters long, right after the 15th character, with the regex S/R, below ( ~23s ) :

                              SEARCH (?-s)^.{15}\K

                              REPLACE \x20\x20\x20\x20\x20\x20

                              • Place the caret at column 19 of the line 1 (IMPORTANT )

                              • Now, choose the menu command Edit > Column Editor… ( Alt + C )

                              • Select the Number to Insert option

                              • Type in 1, as Initial number

                              • Type in 1, in the Increase by : option

                              • Verify that the chosen format is Dec

                              • Check the Leading zeros option ( IMPORTANT )

                              • Finally, click on the OK button ( ~36s )

                              => After a while, you should get a six-digit column, at position 19, from 000001 to 460726

                              • Delete the last line 460726 and the EOL characters of the line 460725

                              • Sort the file contents, choosing the menu command Edit > Line Operations > Sort Lines Lexicographically Ascending ( ~17s )

                              • Now, perform the main regex, below, clicking on the Replace All button, exclusively ( ~1m 11s )

                              SEARCH (?-s)(.{15}).*\R\K(?:\1.*\R)+

                              REPLACE EMPTY

                              => 29363 replacements occurred and the file, from now on, contains 430346 lines, only !

                              • Then, move the middle column number, at beginning of each line, with the regex S/R, below ( ~31s ) :

                              SEARCH (?-s)^(.+?)\x20+(\d+)

                              REPLACE \2\x20\x20\x20\1

                              • Now, execute a last sort operation Edit > Line Operations > Sort Lines Lexicographically Ascending ( ~15s )

                              • Finally get rid of the line numbers, at beginning of lines, along with the space characters, with the regex S/R, below ( ~15s ) :

                              SEARCH ^\d+\x20+

                              REPLACE EMPTY

                              Et voilà !

                              To sum up, this procedure, while going downwards through the file contents, keeps only the first occurrence of each group of lines sharing the same first 15 characters, as well as any single line, of course !
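The procedure above is essentially a "decorate, sort, filter, sort back, undecorate" pattern. Here is a compact Python sketch of the same sequence, given as an illustration under assumptions rather than as a replacement for the Notepad++ steps: the function name is hypothetical, the lines are held in memory as a list, and the line number is kept in a tuple instead of being written into the text.

```python
def dedupe_keep_order(lines, width=15):
    """Keep the first occurrence of each `width`-character key,
    preserving the original order of the surviving lines."""
    # Decorate: remember the key and the original line number.
    decorated = [(line[:width], number, line)
                 for number, line in enumerate(lines, start=1)]
    # First sort: by key, then by original line number.
    decorated.sort(key=lambda item: (item[0], item[1]))
    # Filter: keep only the first line of each run of equal keys.
    kept = []
    prev_key = object()  # sentinel that equals no real key
    for key, number, line in decorated:
        if key != prev_key:
            kept.append((number, line))
        prev_key = key
    # Second sort: back to the original order, then undecorate.
    kept.sort()
    return [line for _number, line in kept]

demo = ["pqrstSmith", "pqrstJones", "uvwxyBrown",
        "fghijWilliams", "pqrstDavies", "fghijRoberts"]
print(dedupe_keep_order(demo, width=5))
```

Note that, like the procedure, this removes every later occurrence of a key, wherever it sits in the file, and not only the occurrences that directly follow the first one.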

                              Cheers,

                              guy038

                              P.S. :

                              To easily understand my method, above, copy/paste this short initial text, below, in a new tab :

                              pqrstSmith
                              pqrstJones
                              pqrstTaylor
                              uvwxyBrown
                              fghijWilliams
                              fghijWilson
                              fghijJohnson
                              pqrstDavies
                              pqrstRobinson
                              pqrstWright
                              abcdeThompson
                              abcdeEvans
                              abcdeWalker
                              abcdeWhite
                              fghijRoberts
                              klmnoGreen
                              klmnoHall
                              klmnoWood
                              fghijJackson
                              fghijClarke
                              

                              Then, I decided that two lines are identical if their first 5 characters are equal !

                              So, we add, first, a six blank characters column, after the 5th character of each line

                              SEARCH (?-s)^.{5}\K

                              REPLACE \x20\x20\x20\x20\x20\x20

                              pqrst      Smith
                              pqrst      Jones
                              pqrst      Taylor
                              uvwxy      Brown
                              fghij      Williams
                              fghij      Wilson
                              fghij      Johnson
                              pqrst      Davies
                              pqrst      Robinson
                              pqrst      Wright
                              abcde      Thompson
                              abcde      Evans
                              abcde      Walker
                              abcde      White
                              fghij      Roberts
                              klmno      Green
                              klmno      Hall
                              klmno      Wood
                              fghij      Jackson
                              fghij      Clarke
                              

                              After the Column Editor operation, at column 9, we get :

                              pqrst   01   Smith
                              pqrst   02   Jones
                              pqrst   03   Taylor
                              uvwxy   04   Brown
                              fghij   05   Williams
                              fghij   06   Wilson
                              fghij   07   Johnson
                              pqrst   08   Davies
                              pqrst   09   Robinson
                              pqrst   10   Wright
                              abcde   11   Thompson
                              abcde   12   Evans
                              abcde   13   Walker
                              abcde   14   White
                              fghij   15   Roberts
                              klmno   16   Green
                              klmno   17   Hall
                              klmno   18   Wood
                              fghij   19   Jackson
                              fghij   20   Clarke
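For comparison, the combined effect of the blank-area regex and the Column Editor step can be sketched in Python. The function `insert_line_numbers` and its parameters are hypothetical; `width=5, digits=2, pad=3` matches this short example, while Doug's file would correspond to `width=15, digits=6, pad=3`.

```python
def insert_line_numbers(lines, width=5, digits=2, pad=3):
    """Insert a zero-padded line number, surrounded by `pad` spaces,
    right after the first `width` characters of every line."""
    out = []
    for number, line in enumerate(lines, start=1):
        key, rest = line[:width], line[width:]
        out.append(f"{key}{' ' * pad}{number:0{digits}d}{' ' * pad}{rest}")
    return out

print(insert_line_numbers(["pqrstSmith", "pqrstJones", "pqrstTaylor"]))
```

The leading zeros matter: without them, a lexicographic sort would place line 10 before line 2.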
                              

                              And, after an ascending sort :

                              abcde   11   Thompson
                              abcde   12   Evans
                              abcde   13   Walker
                              abcde   14   White
                              fghij   05   Williams
                              fghij   06   Wilson
                              fghij   07   Johnson
                              fghij   15   Roberts
                              fghij   19   Jackson
                              fghij   20   Clarke
                              klmno   16   Green
                              klmno   17   Hall
                              klmno   18   Wood
                              pqrst   01   Smith
                              pqrst   02   Jones
                              pqrst   03   Taylor
                              pqrst   08   Davies
                              pqrst   09   Robinson
                              pqrst   10   Wright
                              uvwxy   04   Brown
                              

                              Then, the regex S/R, where we just replace the number 15 with the number 5, removes all the duplicate lines, keeping only the first line of each group :

                              SEARCH (?-s)(.{5}).*\R\K(?:\1.*\R)+

                              REPLACE EMPTY

                              And gives us :

                              abcde   11   Thompson
                              fghij   05   Williams
                              klmno   16   Green
                              pqrst   01   Smith
                              uvwxy   04   Brown
                              

                              Now, the following regex S/R swaps the line numbers area and the key area ( the first 5 characters )

                              SEARCH ^(?-s)(.+?)\x20+(\d+)

                              REPLACE \2\x20\x20\x20\1

                              11   abcde   Thompson
                              05   fghij   Williams
                              16   klmno   Green
                              01   pqrst   Smith
                              04   uvwxy   Brown
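The swap regex carries over to Python almost unchanged, as this small sketch suggests (`move_number_to_front` is an invented name, and `\n` line endings are assumed).

```python
import re

# Lazy key, then the run of spaces, then the digits of the line number.
SWAP = re.compile(r"^(.+?)\x20+(\d+)", re.MULTILINE)

def move_number_to_front(text):
    """Rewrite 'key   NN   rest' as 'NN   key   rest' on every line."""
    return SWAP.sub(r"\2   \1", text)

print(move_number_to_front("abcde   11   Thompson\nfghij   05   Williams\n"), end="")
```

One caveat worth keeping in mind: because `(.+?)` is lazy, a key that itself contains a space followed by a digit would be split too early, so this step relies on the keys having no such sequence.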
                              

                              And, after a last ascending sort operation, we get :

                              01   pqrst   Smith
                              04   uvwxy   Brown
                              05   fghij   Williams
                              11   abcde   Thompson
                              16   klmno   Green
                              

                              Finally, a last regex S/R, below, deletes all the line numbers, at beginning of lines :

                              SEARCH ^\d+\x20+

                              REPLACE EMPTY

                              pqrst   Smith
                              uvwxy   Brown
                              fghij   Williams
                              abcde   Thompson
                              klmno   Green
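As a cross-check of the walkthrough, the same set of surviving lines can be obtained in a single pass with a set of keys already seen. This is a different technique from the sort-based procedure, sketched here with a hypothetical function name, and it returns the original lines without the three leftover spaces.

```python
def keep_first_per_key(lines, width=5):
    """Keep the first line for each distinct `width`-character key,
    in order of first appearance."""
    seen = set()
    kept = []
    for line in lines:
        key = line[:width]
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept

names = [
    "pqrstSmith", "pqrstJones", "pqrstTaylor", "uvwxyBrown",
    "fghijWilliams", "fghijWilson", "fghijJohnson", "pqrstDavies",
    "pqrstRobinson", "pqrstWright", "abcdeThompson", "abcdeEvans",
    "abcdeWalker", "abcdeWhite", "fghijRoberts", "klmnoGreen",
    "klmnoHall", "klmnoWood", "fghijJackson", "fghijClarke",
]
print(keep_first_per_key(names))
```

The five names that come out (Smith, Brown, Williams, Thompson, Green) are the same ones left at the end of the procedure.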
                              
                              • mangoguy

                                Thank you again.

                                My bad for thinking a nonsorted file would not be a problem.

                                Despite multiple attempts, every time I try to use the column editor, despite putting the caret “^” at column 19 on row 1, the incremental numerical column appears at position 1 and not position 19.

                                Any thoughts as to what I am doing wrong?

                                Thank you,
                                Doug

                                • Scott Sumner @mangoguy

                                  @mangoguy

                                  What do you mean by ^ for the caret? I know that sometimes ^ is referred to as the caret character, but it is not in Notepad++ so I’m confused.

                                  Anyway, Here’s what I do and it works to insert at col 19:

                                  • move caret to line 1 col 19
                                  • press Alt+c to get Column Editor window
                                  • tick Number to Insert
                                  • specify Initial number of 0
                                  • specify Increase by of 1
                                  • specify an empty field for Repeat
                                  • tick Leading zeros
                                  • specify Dec for Format
                                  • click OK

                                  Notepad++ places incrementing numbers in col 19 (and beyond) throughout the length of my document.

                                  Are you doing something very different from this?

                                    • mangoguy

                                      Thank you for the clarification. It worked perfectly with the file exactly as instructed. Thank you!

                                      With another pre-sorted file which has no duplicates or blank lines, when I perform the main regex to find and remove duplicate lines
                                      (?-s)(.{15}).\R\K(?:\1.\R)+

                                      the replace box returns: “Replace All:1 occurrence was replaced” no matter how many times I repeat the replace. If there are no duplicates I would expect a report of 0 occurrences found.

                                      The file is found at
                                      https://mangoguy.sharefile.com/d-s7b2d2a8b3fb459cb

                                      Thank you,
                                      Doug

                                      • Scott Sumner

                                        @mangoguy said:

                                        Replace All:1 occurrence was replaced

                                        Formatting note: Your regular expression was stated as (?-s)(.{15}).\R\K(?:\1.\R)+ but I think you really meant (?-s)(.{15}).*\R\K(?:\1.*\R)+ as per one of @guy038 's regexes above. In the future, wrap any exact text you want to post here in ` (backticks) to hopefully avoid any confusion. For example, if you type in `hello` it should appear here as hello without any special characters having trouble. You can also start a new line with four spaces and then your text to provide some data that won’t be specially interpreted.

                                        I see the same behavior as you when trying this regex replacement on your newest data file. Note that the file is NOT modified by this replacement (disk icon on its tab remains blue after the “replacement” occurs…starting point was a freshly loaded DATA2.txt file). I’m at a loss to explain this (why it is saying “1 replacement”). This thread has brought out some really odd things!

                                        Note that it IS possible to see non-zero replacements listed and have a file NOT be modified (try a Find-what of ^ and a Replace-with of $0, also Reg exp search mode), but this is very different from your replacement action.
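The "replacements counted, text unchanged" case is easy to reproduce outside Notepad++. In this Python sketch (an analogy only, since Notepad++ uses a different regex engine), every line start is "replaced" by the empty text that was matched there.

```python
import re

text = "first line\nsecond line"
# Replace each line start by whatever was matched there (nothing).
new_text, count = re.subn(r"^", lambda m: m.group(0), text, flags=re.MULTILINE)
print(count, new_text == text)
```

The count is non-zero although the text is identical afterwards, which is the harmless variant of the behavior described above.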

                                        • guy038

                                          Hi, @mangoguy, @scott-sumner and All,

                                          To begin with, Doug, I was a bit surprised that both the numbers, at column 19, and the first 15 characters look equally sorted, in your Data2.txt file ! So I hope that you understood that the first sort must be performed after the use of the Column Editor. Indeed, these numbers are just added in order to get the original order back, after the suppression of all the duplicate lines ! Just a remark :-))

                                          Now, mangoguy and others, keep in mind that, when a rather complicated regex is applied against a large file, a complete failure may occur, with only 1 match which represents, simply, the selection of all the file contents :-((

                                          So, I began to investigate this problem more deeply ! First of all, I verified that the first 15 characters, of the lines of your Data2.txt file, had absolutely no duplicate. And, like Scott and you, I noticed that the regex (?-s)(.{15}).*\R\K(?:\1.*\R)+ wrongly selects the whole file, after a while, instead of finding 0 result


                                          At this point, I simply thought about reducing the file to find the upper limit beyond which we get into trouble. It turned out that, with my old Win XP laptop, the limit is about 67,000 lines. With that many lines, you get the correct result : no match. But with 67,100 lines, for instance, we get the incorrect single match !

                                          Note that with the similar regex (?-s)(.{15}).*\R\K(?:\1.*\R), without the + sign at its end, this limit increases to about 68,830 lines !
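For anyone who wants to repeat this measurement on their own machine, the trial and error can be organized as a bisection. This is a minimal sketch, not the procedure actually used above: `passes` is a hypothetical callback standing for "truncate the file to n lines, run the search, and check that the result is correct", and it assumes that once the search fails at some size it keeps failing at every larger size.

```python
def largest_passing(upper, passes):
    """Return the largest n in [0, upper] for which passes(n) is True.

    Assumes passes(0) is True and that passes() never becomes True
    again after its first failure.
    """
    low, high = 0, upper
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always shrinks
        if passes(mid):
            low = mid        # mid lines still work: look higher
        else:
            high = mid - 1   # mid lines fail: look lower
    return low


# Stand-in predicate for demonstration only; the real check would run
# the regex search on a file cut down to n lines.
pretend_limit = 67000
print(largest_passing(458404, lambda n: n <= pretend_limit))
```

With a file of about 458,000 lines this needs roughly 19 trials instead of stepping through sizes by hand.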


                                          So I was wondering : could it be that the absence of matches, combined with the need to scan a great amount of data, causes that false positive ? So, strangely enough, I decided to add artificial duplicates ( false positives, so to speak ) about every 65,000 lines, as below :

                                          ---------------
                                          ---------------
                                          

                                          So, I added these two lines of 15 dashes, at lines 65,000, 130,000, 195,000, 260,000, 325,000, 390,000 and 455,000. In addition, I duplicated the first line as well as the last line of the file.
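Inserting such marker pairs by hand is tedious on a large file, so here is a minimal sketch of how it could be scripted. It is an assumption about one way to do it, not how the lines were actually added above, and the function name `add_markers` is made up; the exact line numbers it produces may differ slightly from those in the listing that follows.

```python
def add_markers(lines, interval=65000, marker="-" * 15 + "\n"):
    """Yield the input lines, emitting two identical marker lines
    after every `interval` input lines.

    Each input line is expected to end with a newline.
    """
    for number, line in enumerate(lines, start=1):
        yield line
        if number % interval == 0:
            # Two identical lines: the second one is a guaranteed match
            # for a "same first 15 characters as the previous line" search.
            yield marker
            yield marker


# Tiny demonstration with an interval of 2 instead of 65,000.
for out in add_markers(["a\n", "b\n", "c\n", "d\n", "e\n"], interval=2):
    print(out, end="")
```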

                                          If my intuition was correct, the regex would match, of course, the second line of each pair of dashes ( the false positives ) but also the first duplicate, at line 2, and the second duplicate, at the end of the file. This would prove that the search process can proceed normally throughout a large file ! I ran a Find All in Current Document search and… Bingo ! I obtained the Find Result panel below, with the expected results :

                                          Search "(?-s)(.{15}).*\R\K(?:\1.*\R)+" (9 hits in 1 file)
                                            new 1 (9 hits)
                                          	Line 2: 01,02,2013,1000   000001   ,22.107,22.513,20.976,21.151,0
                                          	Line 65003: ---------------
                                          	Line 130002: ---------------
                                          	Line 195002: ---------------
                                          	Line 260002: ---------------
                                          	Line 325002: ---------------
                                          	Line 390002: ---------------
                                          	Line 455002: ---------------
                                          	Line 458420: 12,31,2015,2559   458404   ,3.270,3.270,3.538,3.527,0
                                          

                                          Therefore, it seems that too large a gap between two successive matches causes the complete failure of the regex search process !? I just hope that, for most users, this gap of about 65,000 lines ( perhaps we’d better speak of bytes ! ), measured on my outdated laptop, is actually much larger :-))


                                          Instead of adding some false positives to huge files, we could also search for a string that occurs every x lines ! For instance, starting from the Data2.txt file, I built a file made of five copies of Data2.txt : I just changed the first character of each line, using successively 2 and 3, then 4 and 5,… instead of 0 and 1, in order to keep a list of lines without any duplicate :-)
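A test file like this can also be generated with a few lines of script. The sketch below is an assumption about how one might do it, not the method actually used here; the name `shifted_copies` is made up, and the step of 2 per copy is inferred from the Find Result listing further down, where the copies begin with 22, 42, 62 and 82. It only works when every line starts with the digit 0 or 1, as the month field does in this data.

```python
def shifted_copies(lines, copies=5, step=2):
    """Yield `copies` passes over `lines`; in pass k the leading digit
    of every line is increased by k * step, so no line is repeated."""
    lines = list(lines)  # allow several passes over the same data
    for k in range(copies):
        for line in lines:
            first = int(line[0]) + k * step
            yield str(first) + line[1:]


# Tiny demonstration with two made-up lines and three copies.
for out in shifted_copies(["01,02,2013\n", "12,31,2015\n"], copies=3):
    print(out, end="")
```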

                                          This file contained 126,274,854 bytes and 2,292,022 lines. So I decided that, in addition to detecting duplicates with the regex (?-s)(.{15}).*\R\K(?:\1.*\R)+, I would search for lines 50,000, 100,000, and so on, with the regex (5|0)0000\x20. For that purpose, I simply relied on the list of numbers at column 19, which is now repeated five times !

                                          So the final regex is simply the alternation of the two : (?-s)(.{15}).*\R\K(?:\1.*\R)+|(5|0)0000\x20. Again, I clicked on the Find All in Current Document button and… after 6m 49s ( Wow ! ), the Find Result panel displayed, at last :

                                          Search "(?-s)(.{15}).*\R\K(?:\1.*\R)+|(5|0)0000\x20" (47 hits in 1 file)
                                            new 1 (47 hits)
                                          	Line 2: 01,02,2013,1000   000001   ,22.107,22.513,20.976,21.151,0
                                          	Line 50001: 02,11,2014,2536   050000   ,0.357,0.380,0.270,0.310,0
                                          	Line 100001: 03,24,2014,1115   100000   ,5.494,5.191,5.494,5.299,0
                                          	Line 150001: 05,05,2017,1346   150000   ,0.301,0.301,0.270,0.289,0
                                          	Line 200001: 06,13,2013,1107   200000   ,0.519,0.588,0.516,0.588,0
                                          	Line 250001: 07,23,2013,1437   250000   ,0.070,0.064,0.073,0.071,0
                                          	Line 300001: 09,04,2013,1158   300000   ,2.314,2.368,2.314,2.362,0
                                          	Line 350001: 10,06,2017,1031   350000   ,0.201,0.138,0.201,0.151,0
                                          	Line 400001: 11,08,2012,1254   400000   ,1.263,1.253,1.284,1.284,0
                                          	Line 450001: 12,21,2012,1043   450000   ,3.838,3.815,3.858,3.823,0
                                          	Line 508405: 22,11,2014,2536   050000   ,0.357,0.380,0.270,0.310,0
                                          	Line 558405: 23,24,2014,1115   100000   ,5.494,5.191,5.494,5.299,0
                                          	Line 608405: 25,05,2017,1346   150000   ,0.301,0.301,0.270,0.289,0
                                          	Line 658405: 26,13,2013,1107   200000   ,0.519,0.588,0.516,0.588,0
                                          	Line 708405: 27,23,2013,1437   250000   ,0.070,0.064,0.073,0.071,0
                                          	Line 758405: 29,04,2013,1158   300000   ,2.314,2.368,2.314,2.362,0
                                          	Line 808405: 30,06,2017,1031   350000   ,0.201,0.138,0.201,0.151,0
                                          	Line 858405: 31,08,2012,1254   400000   ,1.263,1.253,1.284,1.284,0
                                          	Line 908405: 32,21,2012,1043   450000   ,3.838,3.815,3.858,3.823,0
                                          	Line 966809: 42,11,2014,2536   050000   ,0.357,0.380,0.270,0.310,0
                                          	Line 1016809: 43,24,2014,1115   100000   ,5.494,5.191,5.494,5.299,0
                                          	Line 1066809: 45,05,2017,1346   150000   ,0.301,0.301,0.270,0.289,0
                                          	Line 1116809: 46,13,2013,1107   200000   ,0.519,0.588,0.516,0.588,0
                                          	Line 1166809: 47,23,2013,1437   250000   ,0.070,0.064,0.073,0.071,0
                                          	Line 1216809: 49,04,2013,1158   300000   ,2.314,2.368,2.314,2.362,0
                                          	Line 1266809: 50,06,2017,1031   350000   ,0.201,0.138,0.201,0.151,0
                                          	Line 1316809: 51,08,2012,1254   400000   ,1.263,1.253,1.284,1.284,0
                                          	Line 1366809: 52,21,2012,1043   450000   ,3.838,3.815,3.858,3.823,0
                                          	Line 1425213: 62,11,2014,2536   050000   ,0.357,0.380,0.270,0.310,0
                                          	Line 1475213: 63,24,2014,1115   100000   ,5.494,5.191,5.494,5.299,0
                                          	Line 1525213: 65,05,2017,1346   150000   ,0.301,0.301,0.270,0.289,0
                                          	Line 1575213: 66,13,2013,1107   200000   ,0.519,0.588,0.516,0.588,0
                                          	Line 1625213: 67,23,2013,1437   250000   ,0.070,0.064,0.073,0.071,0
                                          	Line 1675213: 69,04,2013,1158   300000   ,2.314,2.368,2.314,2.362,0
                                          	Line 1725213: 70,06,2017,1031   350000   ,0.201,0.138,0.201,0.151,0
                                          	Line 1775213: 71,08,2012,1254   400000   ,1.263,1.253,1.284,1.284,0
                                          	Line 1825213: 72,21,2012,1043   450000   ,3.838,3.815,3.858,3.823,0
                                          	Line 1883617: 82,11,2014,2536   050000   ,0.357,0.380,0.270,0.310,0
                                          	Line 1933617: 83,24,2014,1115   100000   ,5.494,5.191,5.494,5.299,0
                                          	Line 1983617: 85,05,2017,1346   150000   ,0.301,0.301,0.270,0.289,0
                                          	Line 2033617: 86,13,2013,1107   200000   ,0.519,0.588,0.516,0.588,0
                                          	Line 2083617: 87,23,2013,1437   250000   ,0.070,0.064,0.073,0.071,0
                                          	Line 2133617: 89,04,2013,1158   300000   ,2.314,2.368,2.314,2.362,0
                                          	Line 2183617: 90,06,2017,1031   350000   ,0.201,0.138,0.201,0.151,0
                                          	Line 2233617: 91,08,2012,1254   400000   ,1.263,1.253,1.284,1.284,0
                                          	Line 2283617: 92,21,2012,1043   450000   ,3.838,3.815,3.858,3.823,0
                                          	Line 2292022: 92,31,2015,2559   458404   ,3.270,3.270,3.538,3.527,0
                                          

                                          As you can see, the duplicate at line 2 and the second duplicate, at line 2,292,022, were correctly found and reported !


                                          Conclusion :

                                          Apparently, when too large an amount of text separates two consecutive matches of the regex, the normal search process breaks down and wrongly returns a single selection of the entire file contents !? So, mangoguy, as no duplicate exists in your Data2.txt file, it’s clear that we run into trouble as soon as your file exceeds a certain size limit !

                                          In other words, if, in a huge file, matches occur regularly throughout the file contents, this should help the search process to finish the job correctly :-))

                                          Best Regards,

                                          guy038

                                          • Scott SumnerS
                                            Scott Sumner
                                            last edited by

                                            So @guy038’s results and conclusions are interesting. I decided to see what would happen if a Pythonscript-based search was conducted. To that end I came up with:

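                                            # PythonScript (Python 2) for Notepad++; `editor` is supplied by the plugin.
                                            matches = []

                                            def match_found(m):
                                                # Record the (start, end) span of the whole match.
                                                matches.append(m.span(0))

                                            editor.research(r'(?-s)(.{15}).*\R\K(?:\1.*\R)+', match_found)
                                            for (start, _) in matches:
                                                print editor.lineFromPosition(start) + 1  # 1-based line number
                                            print 'done'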
                                            

                                            With that script and the DATA2.txt file, I found that with 67025 lines in the file I would see “done” printed in the PS console window, but with one more line, 67026, I would get this:

                                            Traceback:
                                                editor.research(r'(?-s)(.{15}).*\R\K(?:\1.*\R)+', match_found)
                                            <type 'exceptions.RuntimeError'>:  The complexity of matching the regular expression exceeded predefined bounds.  Try refactoring the regular expression to make each choice made by the state machine unambiguous.  This exception is thrown to prevent "eternal" matches that take an indefinite period time to locate.
                                            

                                            This seems consistent with @guy038’s findings that somewhere between 67000 and 67100 lines there is a “problem”.

                                            So I think the meaning of all this is that Notepad++ is not a great tool for the OP’s task. :-(

                                            No one wants to be trying to solve one problem, only to encounter problems with the method they are using to solve that problem. Thus, I’d advise, if this is a recurring need, to have a serious look at the short bit of standard Python (or rewrite in your language of choice) that I provided much earlier in this thread. :-D
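That earlier script is not repeated here. As a rough idea of the approach, and not necessarily the same code, here is a minimal sketch in standard Python that streams through the lines and drops any line whose first 15 characters match those of the line before it. Like the regex, it leaves lines shorter than 15 characters alone. The function name and the sample data are made up for illustration.

```python
def drop_prefix_duplicates(lines, width=15):
    """Yield lines, skipping any line whose first `width` characters
    match the first `width` characters of the preceding line.

    Lines shorter than `width` are always kept and never used as a key.
    """
    previous_key = None
    for line in lines:
        body = line.rstrip("\r\n")
        key = body[:width] if len(body) >= width else None
        if key is not None and key == previous_key:
            continue  # same prefix as the line before: drop it
        previous_key = key
        yield line


# Small in-memory demonstration; with a real file, pass the open
# file object instead of a list and write the results to a new file.
sample = [
    "abcdefghijklmno 01\n",
    "abcdefghijklmno 11111\n",
    "12345 abcde\n",
    "12345 abcde\n",
    "end of the test\n",
]
for kept in drop_prefix_duplicates(sample):
    print(kept, end="")
```

Because it only remembers one key at a time, memory use stays flat no matter how large the file is, and there is no regex engine limit to run into.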
