Deleting lines that repeat the first 15 characters
-
@mangoguy said:
FWIW, I can confirm the findings of the entire file being selected (I used [red]marking) with either of @guy038’s regexes. However, chopping up the file into 100,000 line subfiles allowed both regexes to work fine. Note that the only duplicates were found in the LAST subfile chunk–which had only ~61,000 lines, not 100,000.
I don’t have an explanation–“big” data can cause “big” problems–the definition of “big” being one thing where YMMV…
In such a case I myself would turn to a scrap bit of Python to find these duplicate records:
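    prev = ''
    with open('data.txt') as f:
        for (n, line) in enumerate(f):
            # report the 1-based number of each line whose first 15
            # characters repeat those of the line just before it
            if line[:15] == prev:
                print(n + 1)
            prev = line[:15]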
That doesn’t address the issue, but gets the job done.
:(
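The snippet above only reports line numbers. As a minimal sketch of the deletion the OP asked for (the function name and the sample rows are made up for illustration, they do not come from the OP's file), the same comparison can be used to keep or drop each line:

```python
def drop_adjacent_key_duplicates(lines, key_len=15):
    """Return the lines, dropping each one whose first key_len characters
    equal the first key_len characters of the line just before it."""
    kept = []
    prev_key = None  # None never equals a string, so the first line is always kept
    for line in lines:
        key = line[:key_len]
        if key != prev_key:
            kept.append(line)
        prev_key = key
    return kept


# Hypothetical sample rows: rows 1-2 share a key, and so do rows 3-4.
sample = [
    "08,28,1212,1000,0.492\n",
    "08,28,1212,1000,0.364\n",
    "08,28,1212,3959,0.458\n",
    "08,28,1212,3959,0.504\n",
    "08,28,1212,1000,0.495\n",
]
for line in drop_adjacent_key_duplicates(sample):
    print(line, end="")
```

Because only neighbouring lines are compared, the file order is untouched and no sort is needed; the last sample row survives even though its key already appeared at the top.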
-
Hi, @mangoguy, @scott-sumner and All,
I understood what happened :-)) Just have a look, for instance, at lines 27 and 28 of your DATA.txt file, shown below:
08,28,1212,3959,0.458,0.458,0.504,0.492,0
08,28,1212,1000,0.492,0.364,0.495,0.365,0
Obviously, these two lines are NOT sorted, as the string “3959” is greater than the string “1000”!
So, I simply performed, FIRST, the classical sort: Edit > Line Operations > Sort Lines Lexicographically Ascending (12s)
And then…, everything went fine :-D I tried my two regexes, which both worked as expected!
BTW, my second regex, with the \K syntax, is slightly quicker than the first one! On my laptop, I got 49s instead of 53s :-)
Some statistics about your DATA.txt file and about the regex S/R :
     0 line  with 4 DUPLICATES or MORE  =>       0 REPLACEMENT   and       0 line  DELETED
   489 lines with 3 DUPLICATES          =>     489 REPLACEMENTS  and   1,467 lines DELETED  ( 3 x 489 )
    38 lines with 2 DUPLICATES          =>      38 REPLACEMENTS  and      76 lines DELETED  ( 2 x 38 )
28,836 lines with 1 DUPLICATE           =>  28,836 REPLACEMENTS  and  28,836 lines DELETED
TOTAL :                                     29,363 REPLACEMENTS  and  30,379 lines DELETED
ORIGINAL file NUMBER of lines :   460,725
After SUPPRESSION of DUPLICATES : 430,346
Difference :                       30,379
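The totals in these statistics can be re-derived with a few lines of Python. This is only an arithmetic cross-check using the figures quoted above; the variable names are illustrative.

```python
# {number of duplicates following the first line of a run: number of such runs}
runs_by_duplicates = {3: 489, 2: 38, 1: 28836}

# Each run is removed by one replacement, which deletes all of its duplicates.
replacements = sum(runs_by_duplicates.values())
deleted = sum(dups * runs for dups, runs in runs_by_duplicates.items())

original_lines = 460725
remaining = original_lines - deleted

print(replacements, deleted, remaining)
```

This prints 29363 30379 430346, in agreement with the TOTAL and "After SUPPRESSION" figures.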
Cheers,
guy038
-
I guess I’m confused. The OP asked for “a line is deleted if the first 15 characters of a line match the first 15 characters of the preceding line”–doesn’t that preclude doing a sort, because it removes the impact of the “preceding line” part? Well, no matter…if it helps the OP out that is all that matters. However, I’m still confused as to why the application of the regexes on the unsorted file caused the entire file to be selected for the OP and to be redmarked for me–can you help with the explanation of that?
-
Thank you for the reply.
The file is not intended to be sorted. Sorting the file would corrupt the necessary order of the data.
Nevertheless, lines that duplicate the first 15 characters of the preceding line must be eliminated.
Thank you again,
Douglas
-
Hi, @mangoguy, @scott-sumner and All,
In my earlier thread, below, which Doug referred to:
https://notepad-plus-plus.org/community/topic/13147/eliminating-duplicate-identical-lines
I said :
The suppression of all the duplicate lines, in a pre-sorted file, can be easily obtained with a Search/Replacement, in Regular expression mode !
Open your file, containing the sorted list of items
Open the Replace dialog ( CTRL + H )
…In that text, it’s the word ‘sorted’ which is important ! Indeed, imagine this initial text, NOT sorted :
pqrst
pqrst
pqrst
uvwxy
fghij
fghij
fghij
pqrst
pqrst
pqrst
abcde
abcde
abcde
abcde
fghij
klmno
klmno
klmno
fghij
fghij
and my initial regex :
SEARCH
(?-s)(^.+\R)\1+
REPLACE
\1
Even with a step-by-step replacement, using the Replace button, it would give:
pqrst
uvwxy
fghij
pqrst
abcde
fghij
klmno
fghij
=> You can see that 3 fghij lines still remain, scattered across different places. In a sense, a single line remains along with its two duplicates!
Now, let’s sort the initial text first:
abcde
abcde
abcde
abcde
fghij
fghij
fghij
fghij
fghij
fghij
klmno
klmno
klmno
pqrst
pqrst
pqrst
pqrst
pqrst
pqrst
uvwxy
After performing the same regex S/R, we now get the text:
abcde
fghij
klmno
pqrst
uvwxy
And, as expected, it does not contain any duplicate line. So, with an initial sort, this regex, and the derivative regexes, work just fine !
However, the main drawback is that the original order, of the file, is lost, because of the sort process :-((
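The difference between the unsorted and the sorted case can be reproduced outside Notepad++. The sketch below does not run the regex itself; it uses `itertools.groupby` as a stand-in, since that function also collapses only runs of identical consecutive items, which is what `(^.+\R)\1+` replaced by `\1` does line by line.

```python
from itertools import groupby


def collapse_runs(lines):
    """Keep one line from every run of identical consecutive lines."""
    return [line for line, _run in groupby(lines)]


# The unsorted sample text from the post, one list item per line.
unsorted_lines = (
    ["pqrst"] * 3 + ["uvwxy"] + ["fghij"] * 3 + ["pqrst"] * 3
    + ["abcde"] * 4 + ["fghij"] + ["klmno"] * 3 + ["fghij"] * 2
)

print(collapse_runs(unsorted_lines))          # fghij and pqrst still appear more than once
print(collapse_runs(sorted(unsorted_lines)))  # every value appears exactly once
```

On the unsorted list this leaves eight lines, including three separate fghij entries; after sorting, five unique lines remain, at the cost of the original order.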
So, against your file, I tried other regexes, which involve a look-ahead and which do not break the file’s order :
- (?-s)(?:^(.{15}).*\R)(?=.*\1) with an EMPTY replacement => the Count process gives us 1022 occurrences, which is wrong
- (?-s)(?:^(.{15}).*\R)(?=(?:.+\R)*\1) => Catastrophic break-down, with only one match ( the entire file contents ! )
So, what about this generic one:
(?-s)(?:^(.{15}).*\R)(?=(?:.+\R){0,n}\1)
Well, but, for instance, 4 lines in your file begin with the string 01,12,1215,1012 (lines 209988, 208996, 210928 and 210936). And the gap between lines 208996 and 210928, for instance, is 1932!
So, the number n, in that regex, should be at least 2000, hence the regex:
(?-s)(?:^(.{15}).*\R)(?=(?:.+\R){0,2000}\1)
=> Again, a catastrophic break-down occurred :-((
Obviously, it is not worth going on in that direction! So, Scott, I presume that when a large number of lines and/or large capturing groups are involved, we are likely to get unpredictable results! We have to change our approach completely :-D
Finally, I found out the right way to get the job done :-)))
Note that, in Doug’s file, the key string is the first 15 characters of a line. So, here is the procedure:
- First, we get rid of possible blank lines, with the regex S/R, below :
SEARCH
^\h*\R
REPLACE
EMPTY
=> 2 lines should be deleted
- Then, we add a blank area, six characters long, right after the 15th character, with the regex S/R below (~23s):
SEARCH
(?-s)^.{15}\K
REPLACE
\x20\x20\x20\x20\x20\x20
- Place the caret at column 19 of line 1 (IMPORTANT)
- Now, choose the menu command Edit > Column Editor… (Alt + C)
- Select the Number to Insert option
- Type in 1, as Initial number
- Type in 1, in the Increase by : option
- Verify that the chosen format is Dec
- Check the Leading zeros option (IMPORTANT)
- Finally, click on the OK button (~36s)
=> After a while, you should get a six-digit column, at position 19, from 000001 to 460726
- Delete the last line 460726 and the EOL characters of line 460725
- Sort the file contents, choosing the menu command Edit > Line Operations > Sort Lines Lexicographically Ascending (~17s)
- Now, perform the main regex, below, clicking on the Replace All button, exclusively (~1m 11s)
SEARCH
(?-s)(.{15}).*\R\K(?:\1.*\R)+
REPLACE
EMPTY
=> 29363 replacements occurred and the file, from now on, contains only 430346 lines!
- Then, move the middle column number to the beginning of each line, with the regex S/R below (~31s):
SEARCH
(?-s)^(.+?)\x20+(\d+)
REPLACE
\2\x20\x20\x20\1
- Now, execute a last sort operation Edit > Line Operations > Sort Lines Lexicographically Ascending (~15s)
- Finally, get rid of the line numbers at the beginning of lines, along with the space characters, with the regex S/R below (~15s):
SEARCH
^\d+\x20+
REPLACE
EMPTY
Et voilà !
To sum up, this procedure, going downwards through the file contents, keeps only the first occurrence of any group of lines with identical first 15 characters, as well as any single line, of course!
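For readers who prefer to see the idea as code, here is a rough Python equivalent of the procedure: tag each line with its position, sort by key, drop the repeats, then sort back by position. It is a sketch of the same logic, not of what Notepad++ does internally, and `keep_first_by_key` is a hypothetical name.

```python
def keep_first_by_key(lines, key_len=15):
    """Keep the first line for each distinct key_len-character prefix,
    preserving the original order, by mirroring the sort-based procedure."""
    # Number the lines (the Column Editor stage).
    numbered = [(line[:key_len], pos, line) for pos, line in enumerate(lines)]
    # First sort: by key, then by original position.
    numbered.sort(key=lambda item: (item[0], item[1]))
    # Main regex stage: within each run of equal keys, keep only the first entry.
    survivors = []
    prev_key = None
    for key, pos, line in numbered:
        if key != prev_key:
            survivors.append((pos, line))
        prev_key = key
    # Second sort and final clean-up: restore the order, drop the numbers.
    survivors.sort()
    return [line for _pos, line in survivors]


print(keep_first_by_key(["aaX", "bbY", "aaZ", "bbW", "ccV"], key_len=2))
```

Note that this removes repeats anywhere in the file, not only on adjacent lines, which is also what the sorted regex procedure does.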
Cheers,
guy038
P.S. :
To easily understand my method, above, copy/paste this short initial text, below, in a new tab :
pqrstSmith
pqrstJones
pqrstTaylor
uvwxyBrown
fghijWilliams
fghijWilson
fghijJohnson
pqrstDavies
pqrstRobinson
pqrstWright
abcdeThompson
abcdeEvans
abcdeWalker
abcdeWhite
fghijRoberts
klmnoGreen
klmnoHall
klmnoWood
fghijJackson
fghijClarke
Then, let’s decide that two lines are identical if their first 5 characters are equal!
So, first, we add a column of six blank characters after the 5th character of each line:
SEARCH
(?-s)^.{5}\K
REPLACE
\x20\x20\x20\x20\x20\x20
pqrst Smith
pqrst Jones
pqrst Taylor
uvwxy Brown
fghij Williams
fghij Wilson
fghij Johnson
pqrst Davies
pqrst Robinson
pqrst Wright
abcde Thompson
abcde Evans
abcde Walker
abcde White
fghij Roberts
klmno Green
klmno Hall
klmno Wood
fghij Jackson
fghij Clarke
After the Column Editor operation, at column 9, we get:
pqrst 01 Smith
pqrst 02 Jones
pqrst 03 Taylor
uvwxy 04 Brown
fghij 05 Williams
fghij 06 Wilson
fghij 07 Johnson
pqrst 08 Davies
pqrst 09 Robinson
pqrst 10 Wright
abcde 11 Thompson
abcde 12 Evans
abcde 13 Walker
abcde 14 White
fghij 15 Roberts
klmno 16 Green
klmno 17 Hall
klmno 18 Wood
fghij 19 Jackson
fghij 20 Clarke
And, after an ascending sort :
abcde 11 Thompson
abcde 12 Evans
abcde 13 Walker
abcde 14 White
fghij 05 Williams
fghij 06 Wilson
fghij 07 Johnson
fghij 15 Roberts
fghij 19 Jackson
fghij 20 Clarke
klmno 16 Green
klmno 17 Hall
klmno 18 Wood
pqrst 01 Smith
pqrst 02 Jones
pqrst 03 Taylor
pqrst 08 Davies
pqrst 09 Robinson
pqrst 10 Wright
uvwxy 04 Brown
Then, the regex S/R, where we just replace the number 15 with the number 5, suppresses all the duplicate lines but one:
SEARCH
(?-s)(.{5}).*\R\K(?:\1.*\R)+
REPLACE
EMPTY
And it gives us:
abcde 11 Thompson
fghij 05 Williams
klmno 16 Green
pqrst 01 Smith
uvwxy 04 Brown
Now, the following regex S/R swaps the line number area and the key area ( the first 5 characters ):
SEARCH
^(?-s)(.+?)\x20+(\d+)
REPLACE
\2\x20\x20\x20\1
11 abcde Thompson
05 fghij Williams
16 klmno Green
01 pqrst Smith
04 uvwxy Brown
And, after a last ascending sort operation, we get :
01 pqrst Smith
04 uvwxy Brown
05 fghij Williams
11 abcde Thompson
16 klmno Green
Finally, a last regex S/R, below, deletes all the line numbers at the beginning of lines:
SEARCH
^\d+\x20+
REPLACE
EMPTY
pqrst Smith
uvwxy Brown
fghij Williams
abcde Thompson
klmno Green
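As a cross-check of this worked example, the same "first occurrence wins" result can be obtained in a single pass by remembering which keys have already been seen. This swaps the sort-based technique for a set lookup; `keep_first_seen` is an illustrative name, not part of any tool mentioned in the thread.

```python
def keep_first_seen(lines, key_len=5):
    """Keep a line only if its first key_len characters were not seen before."""
    seen = set()
    kept = []
    for line in lines:
        key = line[:key_len]
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


# The 20 sample lines of the P.S., before any blank column is added.
sample = (
    "pqrstSmith pqrstJones pqrstTaylor uvwxyBrown fghijWilliams fghijWilson "
    "fghijJohnson pqrstDavies pqrstRobinson pqrstWright abcdeThompson abcdeEvans "
    "abcdeWalker abcdeWhite fghijRoberts klmnoGreen klmnoHall klmnoWood "
    "fghijJackson fghijClarke"
).split()

print("\n".join(keep_first_seen(sample)))
```

The five surviving lines (Smith, Brown, Williams, Thompson, Green) come out in the same order as in the final listing above, with no intermediate numbering or sorting.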
-
-
Thank you again.
My bad for thinking a nonsorted file would not be a problem.
Despite multiple attempts, every time I try to use the column editor, despite putting the caret “^” at column 19 on row 1, the incremental numerical column appears at position 1 and not position 19.
Any thoughts as to what I am doing wrong?
Thank you,
Doug
-
What do you mean by ^ for the caret? I know that sometimes ^ is referred to as the caret character, but it is not in Notepad++, so I’m confused.
Anyway, here’s what I do and it works to insert at col 19:
- move caret to line 1 col 19
- press Alt+c to get Column Editor window
- tick Number to Insert
- specify Initial number of 0
- specify Increase by of 1
- specify an empty field for Repeat
- tick Leading zeros
- specify Dec for Format
- click OK
Notepad++ places incrementing numbers in col 19 (and beyond) throughout the length of my document.
Are you doing something very different from this?
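If the Column Editor keeps misbehaving, the numbering step can also be done outside the editor. The helper below is a hypothetical approximation of "Number to Insert" with Leading zeros ticked; in particular, padding to the width of the largest number is an assumption about the editor's behaviour, not something confirmed in this thread.

```python
def insert_numbers(lines, column=19, start=1, step=1):
    """Insert a zero-padded counter into every line so that it begins at the
    given 1-based column. Lines are expected without their EOL characters."""
    last = start + step * (len(lines) - 1)
    width = len(str(last))  # assumption: pad to the width of the largest number
    numbered = []
    for i, line in enumerate(lines):
        number = str(start + i * step).zfill(width)
        # Pad short lines with spaces so that the insertion point exists.
        body = line.ljust(column - 1)
        numbered.append(body[:column - 1] + number + body[column - 1:])
    return numbered


# Ten made-up lines: a 5-character key, six blanks, then the rest of the line.
rows = ["kkkkk" + " " * 6 + "v"] * 10
for row in insert_numbers(rows, column=9):
    print(row)
```

With `column=9` the counter lands in the middle of the six-blank area, leaving three spaces on each side of it.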
-
Thank you for the clarification. It worked perfectly with the file exactly as instructed. Thank you!
With another pre-sorted file which has no duplicates or blank lines, when I perform the main regex to find and remove duplicate lines
(?-s)(.{15}).\R\K(?:\1.\R)+
the replace box returns: “Replace All:1 occurrence was replaced” no matter how many times I repeat the replace. If there are no duplicates I would expect a report of 0 occurrences found.
The file is found at
https://mangoguy.sharefile.com/d-s7b2d2a8b3fb459cb
Thank you,
Doug
-
@mangoguy said:
Replace All:1 occurrence was replaced
Formatting note: Your regular expression was stated as
(?-s)(.{15}).\R\K(?:\1.\R)+
but I think you really meant
(?-s)(.{15}).*\R\K(?:\1.*\R)+
as per one of @guy038’s regexes above. In the future, wrap any exact text you want to post here in ` (backticks) to hopefully avoid any confusion. For example, if you type in `hello` it should appear here as
hello
without any special characters having trouble. You can also start a new line with four spaces and then your text to provide some data that won’t be specially interpreted.
I see the same behavior as you when trying this regex replacement on your newest data file. Note that the file is NOT modified by this replacement (disk icon on its tab remains blue after the “replacement” occurs…starting point was a freshly loaded DATA2.txt file). I’m at a loss to explain this (why it is saying “1 replacement”). This thread has brought out some really odd things!
Note that it IS possible to see non-zero replacements listed and have a file NOT be modified (try a Find-what of ^ and a Replace-with of $0, also Reg exp search mode), but this is very different from your replacement action.
-
Hi, @mangoguy, @scott-sumner and All,
To begin with, Doug, I was a bit surprised that both the numbers at column 19 and the first 15 characters look equally sorted in your Data2.txt file! So I hope you understood that the first sort must be performed after the use of the Column Editor. Indeed, these numbers are just added in order to get the original order back after the suppression of all the duplicate lines! Just a remark :-))
Now, mangoguy and others, keep in mind that, when a rather complicated regex is applied against a large file, a complete failure may occur, with only 1 match, which simply represents the selection of all the file contents :-((
So, I began to investigate this problem more deeply! First of all, I verified that the first 15 characters of the lines of your Data2.txt file had absolutely no duplicate. And, like Scott and you, I noticed that the regex (?-s)(.{15}).*\R\K(?:\1.*\R)+ wrongly selects the whole file, after a while, instead of finding 0 results.
At this point, I simply thought about reducing the file to find the upper limit beyond which we get into trouble. It happened that, with my old Win XP laptop, the limit is about 67,000 lines. For this value, you get the correct result: no match. But, for instance, with 67,100 lines, we get the incorrect single match!
Note that using the similar regex (?-s)(.{15}).*\R\K(?:\1.*\R), without the + sign at its end, this limit increases to about 68,830 lines!
So I was wondering: could it be that the lack of matches, together with the need to scan a great amount of data, causes that false positive? So, strangely, I decided to add false positives about every 65,000 lines, as below:
---------------
---------------
So, I added these two lines of 15 dashes at lines 65,000, 130,000, 195,000, 260,000, 325,000, 390,000 and 455,000. In addition, I duplicated the first line as well as the last line of the file.
If my intuition was correct, the regex would match, of course, all the second lines of dashes ( false positives ) but also the first duplicate, in line 2, and the second duplicate, at the end of the file. This would prove that the search process can go on normally throughout a large file! I ran a Find All in Current Document process and… Bingo! I obtained the Find Result panel below, with the expected results:
Search "(?-s)(.{15}).*\R\K(?:\1.*\R)+" (9 hits in 1 file)
  new 1 (9 hits)
    Line 2: 01,02,2013,1000 000001 ,22.107,22.513,20.976,21.151,0
    Line 65003: ---------------
    Line 130002: ---------------
    Line 195002: ---------------
    Line 260002: ---------------
    Line 325002: ---------------
    Line 390002: ---------------
    Line 455002: ---------------
    Line 458420: 12,31,2015,2559 458404 ,3.270,3.270,3.538,3.527,0
Therefore, it seems that too large a gap between two successive matches causes the complete failure of the regex search process!? I just hope that, for most users, this gap of about 65000 lines ( perhaps we’d better speak about bytes ! ), noted with my outdated laptop, can really be greater :-))
Instead of adding some false positives in huge files, we could also search for a string which occurs every x lines! For instance, starting with the Data2.txt file, I built a file made of five copies of Data2.txt: I just changed the first character of each line, taking, successively, 2 and 3, then 4 and 5, … instead of 0 and 1, in order to keep a list of lines without any duplicate :-)
This file contained 126,274,854 bytes and 2,292,022 lines. So, I decided that, in addition to the detection of duplicates with the regex (?-s)(.{15}).*\R\K(?:\1.*\R)+, I would search for lines 50,000, 100,000, and so on…, with the regex (5|0)0000\x20. For that purpose, I just used the list of numbers at column 19, copied five times!
So the final regex is simply the two alternatives: (?-s)(.{15}).*\R\K(?:\1.*\R)+|(5|0)0000\x20. Again, I clicked on the Find All in Current Document button and, … after 6m 49s ( Waoooou ! ), the Find Result displayed, at last:
Search "(?-s)(.{15}).*\R\K(?:\1.*\R)+|(5|0)0000\x20" (47 hits in 1 file)
  new 1 (47 hits)
    Line 2: 01,02,2013,1000 000001 ,22.107,22.513,20.976,21.151,0
    Line 50001: 02,11,2014,2536 050000 ,0.357,0.380,0.270,0.310,0
    Line 100001: 03,24,2014,1115 100000 ,5.494,5.191,5.494,5.299,0
    Line 150001: 05,05,2017,1346 150000 ,0.301,0.301,0.270,0.289,0
    Line 200001: 06,13,2013,1107 200000 ,0.519,0.588,0.516,0.588,0
    Line 250001: 07,23,2013,1437 250000 ,0.070,0.064,0.073,0.071,0
    Line 300001: 09,04,2013,1158 300000 ,2.314,2.368,2.314,2.362,0
    Line 350001: 10,06,2017,1031 350000 ,0.201,0.138,0.201,0.151,0
    Line 400001: 11,08,2012,1254 400000 ,1.263,1.253,1.284,1.284,0
    Line 450001: 12,21,2012,1043 450000 ,3.838,3.815,3.858,3.823,0
    Line 508405: 22,11,2014,2536 050000 ,0.357,0.380,0.270,0.310,0
    Line 558405: 23,24,2014,1115 100000 ,5.494,5.191,5.494,5.299,0
    Line 608405: 25,05,2017,1346 150000 ,0.301,0.301,0.270,0.289,0
    Line 658405: 26,13,2013,1107 200000 ,0.519,0.588,0.516,0.588,0
    Line 708405: 27,23,2013,1437 250000 ,0.070,0.064,0.073,0.071,0
    Line 758405: 29,04,2013,1158 300000 ,2.314,2.368,2.314,2.362,0
    Line 808405: 30,06,2017,1031 350000 ,0.201,0.138,0.201,0.151,0
    Line 858405: 31,08,2012,1254 400000 ,1.263,1.253,1.284,1.284,0
    Line 908405: 32,21,2012,1043 450000 ,3.838,3.815,3.858,3.823,0
    Line 966809: 42,11,2014,2536 050000 ,0.357,0.380,0.270,0.310,0
    Line 1016809: 43,24,2014,1115 100000 ,5.494,5.191,5.494,5.299,0
    Line 1066809: 45,05,2017,1346 150000 ,0.301,0.301,0.270,0.289,0
    Line 1116809: 46,13,2013,1107 200000 ,0.519,0.588,0.516,0.588,0
    Line 1166809: 47,23,2013,1437 250000 ,0.070,0.064,0.073,0.071,0
    Line 1216809: 49,04,2013,1158 300000 ,2.314,2.368,2.314,2.362,0
    Line 1266809: 50,06,2017,1031 350000 ,0.201,0.138,0.201,0.151,0
    Line 1316809: 51,08,2012,1254 400000 ,1.263,1.253,1.284,1.284,0
    Line 1366809: 52,21,2012,1043 450000 ,3.838,3.815,3.858,3.823,0
    Line 1425213: 62,11,2014,2536 050000 ,0.357,0.380,0.270,0.310,0
    Line 1475213: 63,24,2014,1115 100000 ,5.494,5.191,5.494,5.299,0
    Line 1525213: 65,05,2017,1346 150000 ,0.301,0.301,0.270,0.289,0
    Line 1575213: 66,13,2013,1107 200000 ,0.519,0.588,0.516,0.588,0
    Line 1625213: 67,23,2013,1437 250000 ,0.070,0.064,0.073,0.071,0
    Line 1675213: 69,04,2013,1158 300000 ,2.314,2.368,2.314,2.362,0
    Line 1725213: 70,06,2017,1031 350000 ,0.201,0.138,0.201,0.151,0
    Line 1775213: 71,08,2012,1254 400000 ,1.263,1.253,1.284,1.284,0
    Line 1825213: 72,21,2012,1043 450000 ,3.838,3.815,3.858,3.823,0
    Line 1883617: 82,11,2014,2536 050000 ,0.357,0.380,0.270,0.310,0
    Line 1933617: 83,24,2014,1115 100000 ,5.494,5.191,5.494,5.299,0
    Line 1983617: 85,05,2017,1346 150000 ,0.301,0.301,0.270,0.289,0
    Line 2033617: 86,13,2013,1107 200000 ,0.519,0.588,0.516,0.588,0
    Line 2083617: 87,23,2013,1437 250000 ,0.070,0.064,0.073,0.071,0
    Line 2133617: 89,04,2013,1158 300000 ,2.314,2.368,2.314,2.362,0
    Line 2183617: 90,06,2017,1031 350000 ,0.201,0.138,0.201,0.151,0
    Line 2233617: 91,08,2012,1254 400000 ,1.263,1.253,1.284,1.284,0
    Line 2283617: 92,21,2012,1043 450000 ,3.838,3.815,3.858,3.823,0
    Line 2292022: 92,31,2015,2559 458404 ,3.270,3.270,3.538,3.527,0
As you can see, the duplicate at line 2 and the second duplicate, at line 2,292,022, were correctly found and reported!
Conclusion :
Apparently, when too large an amount of text separates two consecutive occurrences of the regex search, it breaks the normal process and wrongly yields a single selection of all the file contents!? So, Mangoguy, as no duplicate exists in your data2.txt file, it’s obvious that we run into trouble as soon as your file exceeds a certain size limit!
In other words, if, in huge files, you get a lot of occurrences throughout the file contents, this should help the search process to finish the job correctly :-))
Best Regards,
guy038
-
So @guy038’s results and conclusions are interesting. I decided to see what would happen if a Pythonscript-based search was conducted. To that end I came up with:
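    matches = []

    def match_found(m):
        # remember the (start, end) span of every match
        matches.append(m.span(0))

    editor.research(r'(?-s)(.{15}).*\R\K(?:\1.*\R)+', match_found)

    for (start, _) in matches:
        # report the 1-based line number where each match begins
        print(editor.lineFromPosition(start) + 1)
    print('done')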
With that script and the DATA2.txt file, I found that with 67025 lines in the file I would see “done” printed in the PS console window, but with one more line, 67026, I would get this:
    Traceback:
      editor.research(r'(?-s)(.{15}).*\R\K(?:\1.*\R)+', match_found)
    <type 'exceptions.RuntimeError'>: The complexity of matching the regular expression exceeded predefined bounds. Try refactoring the regular expression to make each choice made by the state machine unambiguous. This exception is thrown to prevent "eternal" matches that take an indefinite period time to locate.
This seems consistent with @guy038’s findings that somewhere between 67000 and 67100 lines there is a “problem”.
So I think the meaning of all this is that Notepad++ is not a great tool for the OP’s task. :-(
No one wants to be trying to solve one problem, only to encounter problems with the method they are using to solve that problem. Thus, I’d advise, if this is a recurring need, to have a serious look at the short bit of standard Python (or rewrite in your language of choice) that I provided much earlier in this thread. :-D
-
Hello, @mangoguy, @scott-sumner and All,
I’m extremely embarrassed, indeed! I made a serious beginner’s mistake in my previous regex, which I was testing intensively :-(( My God, of course! The RIGHT regex is
(?-s)^(.{15}).*\R\K(?:\1.*\R)+
and NOT the regex
(?-s)(.{15}).*\R\K(?:\1.*\R)+
:-))
Do you see the difference? Well, it’s just the anchor ^, after the modifier (?-s)!
Indeed, let’s try the wrong regex again.
Assuming the test list below:
91,02,2013,1000   000001   ,22.107,22.513,20.976,21.151,0
13,1000   000002   ,20.976,21.724,20.620,21.336,0
13,1000   000003   ,21.344,22.116,21.336,21.918,0
13,1000   000004   ,21.918,21.918,20.797,20.797,0
So, first, the caret is right before the digit 9 of the first line, and the fifteen characters 91,02,2013,1000 cannot be found elsewhere. Then, as no anchor ^ ( beginning of line ) exists, the regex engine moves ahead one position, between the digits 9 and 1 of the first line. Again, as the fifteen characters 1,02,2013,1000b do not exist further on, the regex engine moves ahead one position, now examining the string ,02,2013,1000bb …
… till the fifteen characters 13,1000bbb00000, which can be found, this time, at the beginning of lines 2, 3 and 4! Just imagine the work to accomplish for the 458,404 lines of the Data2.txt file :-((
( Note : the lowercase letter b, above, stands for a space character )
To easily see the problem, just get rid of the \K syntax, forming the regex (?-s)(.{15}).*\R(?:\1.*\R)+. If you click on the Find Next button, it selects, after testing positions 1, 2, … and 8, from the last two digits of the year 2013 till the end of the text. But, if you use the regex (?-s)^(.{15}).*\R(?:\1.*\R)+, with the anchor ^, it correctly gets lines 2, 3 and 4, which are identical with regard to their first 15 characters!
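The effect of the missing anchor can be reproduced with Python's `re` module. Two substitutions are made here because `re` supports neither `\R` nor `\K`: `\n` stands in for `\R` and `\K` is simply dropped, as in the Find Next experiment above. The spacing of the test list (three spaces on each side of the number) is reconstructed from the `bbb` notation and is an assumption.

```python
import re

text = (
    "91,02,2013,1000   000001   ,22.107,22.513,20.976,21.151,0\n"
    "13,1000   000002   ,20.976,21.724,20.620,21.336,0\n"
    "13,1000   000003   ,21.344,22.116,21.336,21.918,0\n"
    "13,1000   000004   ,21.918,21.918,20.797,20.797,0\n"
)

# Without the anchor, the 15-character key may start anywhere inside a line.
unanchored = re.compile(r"(.{15}).*\n(?:\1.*\n)+")
# With the anchor (and MULTILINE), the key must start at the beginning of a line.
anchored = re.compile(r"^(.{15}).*\n(?:\1.*\n)+", re.MULTILINE)

m1 = unanchored.search(text)
m2 = anchored.search(text)

print("unanchored match starts at offset", m1.start())
print("anchored match starts at offset", m2.start())
print("anchored match covers lines 2-4:", m2.group(0) == text.split("\n", 1)[1])
```

The unanchored pattern matches from offset 8, in the middle of line 1, to the end of the text; the anchored one starts at the beginning of line 2. On a real file the unanchored search repeats this trial at every character position, which is where the extra work comes from.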
So, Doug, to sum up, using the right regex (?-s)^(.{15}).*\R\K(?:\1.*\R)+ against your Data2.txt file does not find any occurrence (~5s), which is the expected result, as we know, by construction, that the 458,404 lines of this file are all different :-)
Best Regards,
guy038
-
Yea, wow, I totally didn’t see the missing ^ as well. Of course, as our local regex guru I don’t normally question @guy038’s regexes, but there is no excuse for a second pair of eyes (mine) not noticing/questioning this. Looking back over my posts in this thread, I really added nothing of value and totally wish I hadn’t participated at all. :-(
-
@Scott-Sumner , about that python code:
    prev = ''
    with open('data.txt') as f:
        for (n, line) in enumerate(f):
            if line[:15] == prev:
                print(n + 1)
            prev = line[:15]
How can we delete duplicate lines if first 40 words (or lets say, first 200 characters including spaces) are same? I have changed 15 to 200, I am afraid the code did not work.
Thank you
-
@Saya-Jujur said in Deleting lines that repeat the first 15 characters:
How can we delete duplicate lines if first 40 words (or lets say, first 200 characters including spaces) are same? I have changed 15 to 200, I am afraid the code did not work.
It would have been better to have started a new thread since this one was last posted to 4 years ago. By all means reference it but a new one I think is warranted.
You don’t give much detail on your need. Are the duplicate lines adjacent to each other? That is what this thread was all about.
So start a new post, outline your need, give examples. Read the post at the top (of the Help Wanted section) titled “Please read before posting” as it will help you provide examples in a format that we can trust haven’t been altered by the posting window and we can copy to help us in tests before we provide a solution to you.
Terry
PS your request to Scott Sumner directly will likely go unanswered (by him), he hasn’t been active on this forum for a long time.
-
Untested, because I am on my phone, but maybe try
    prev = ''
    with open('data.txt') as f:
        for (n, line) in enumerate(f):
            # compare only the first 200 characters of neighbouring lines
            if line[:200] == prev[:200]:
                print(n + 1)
            prev = line[:200]
(You said you changed to 200 already, but maybe you missed an instance, or maybe comparing just the left of prev is enough)
If that doesn’t work, then follow @Terry-R’s advice