Deleting lines that repeat the first 15 characters
-
Hello, @mangoguy, @scott-sumner and All,
I’m extremely confused, indeed! I made an important beginner’s mistake in my previous regex, which I was testing intensively :-(( My God, of course! The RIGHT regex is

`(?-s)^(.{15}).*\R\K(?:\1.*\R)+`

and NOT the regex `(?-s)(.{15}).*\R\K(?:\1.*\R)+` :-))

Do you see the difference? Well, it’s just the anchor `^`, after the modifier `(?-s)`!

Indeed, let’s try the wrong regex again, assuming the test list below:
```
91,02,2013,1000   000001 ,22.107,22.513,20.976,21.151,0
13,1000   000002 ,20.976,21.724,20.620,21.336,0
13,1000   000003 ,21.344,22.116,21.336,21.918,0
13,1000   000004 ,21.918,21.918,20.797,20.797,0
```
So, first, the caret is located right before the digit 9 of the first line, and the fifteen characters `91,02,2013,1000` cannot be found elsewhere. Then, as no anchor `^` ( beginning of line ) exists, the regex engine moves ahead one position, between the digits 9 and 1 of the first line. Again, as the fifteen characters `1,02,2013,1000b` do not occur further on, the regex engine moves ahead one more position, now examining the string `,02,2013,1000bb` …… till it reaches the fifteen characters `13,1000bbb00000`, which can be found, this time, at the beginning of lines 2, 3 and 4! Just imagine the work to accomplish for the 458,404 lines of the Data2.txt file :-(((

( Note: the lowercase letter `b`, above, stands for a space character )

To easily see the problem, just get rid of the `\K` syntax, forming the regex `(?-s)(.{15}).*\R(?:\1.*\R)+`. If you click on the Find Next button, it selects, after tests on positions 1, 2, … and 8, everything from the two last digits of the year 2013 till the end of the text. But if you use the regex `(?-s)^(.{15}).*\R(?:\1.*\R)+`, with the anchor `^`, it correctly selects the identical lines 2, 3 and 4, with regard to their first 15 characters!
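The difference between the two `\K`-free regexes can be reproduced outside Notepad++. Below is a small sketch using Python’s `re` module, which supports neither `\K` nor `\R`, so `\R` is replaced by `\n` and the `^` semantics come from `re.MULTILINE`; the spacing of the test lines is reconstructed from the `b`-for-space notation above:

```python
import re

# The four test lines from above, with the b-for-space spacing restored
# so that lines 2-4 share their first 15 characters: "13,1000   00000".
text = ("91,02,2013,1000   000001 ,22.107,22.513,20.976,21.151,0\n"
        "13,1000   000002 ,20.976,21.724,20.620,21.336,0\n"
        "13,1000   000003 ,21.344,22.116,21.336,21.918,0\n"
        "13,1000   000004 ,21.918,21.918,20.797,20.797,0\n")

# \K-free variants of the two regexes, with \R replaced by \n;
# re.MULTILINE gives ^ its beginning-of-line meaning.
unanchored = re.compile(r"(.{15}).*\n(?:\1.*\n)+")
anchored = re.compile(r"^(.{15}).*\n(?:\1.*\n)+", re.MULTILINE)

m1 = unanchored.search(text)
m2 = anchored.search(text)

print(m1.start())  # 8  -> match begins at the "13" of "2013", mid-line 1
print(m2.start())  # 56 -> match begins exactly at the start of line 2
print(m1.group(1) == m2.group(1))  # True: both capture "13,1000   00000"
```

The unanchored pattern starts its match at offset 8 of the first line ( the two last digits of 2013 ), after trying and rejecting every earlier position, while the anchored one only ever attempts a capture at the beginning of a line.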
So, Doug, to sum up: using the right regex `(?-s)^(.{15}).*\R\K(?:\1.*\R)+` against your Data2.txt file does not find any occurrence ( in about 5 s ), which is the expected result, as we know, by construction, that the 458,404 lines of this file are all different :-)

Best Regards,

guy038
-
Yeah, wow, I totally didn’t see the missing `^` as well. Of course, as our local regex guru, I don’t normally question @guy038’s regexes, but there is no excuse for a second pair of eyes (mine) not noticing/questioning this. Looking back over my posts in this thread, I really added nothing of value and totally wish I hadn’t participated at all. :-(
-
@Scott-Sumner , about that python code:
```python
prev = ''
with open('data.txt') as f:
    for (n, line) in enumerate(f):
        if line[:15] == prev:
            print n+1
        prev = line[:15]
```
How can we delete duplicate lines if the first 40 words (or let’s say, the first 200 characters, including spaces) are the same? I have changed 15 to 200, but I’m afraid the code did not work.
Thank you
-
@Saya-Jujur said in Deleting lines that repeat the first 15 characters:

> How can we delete duplicate lines if the first 40 words (or let’s say, the first 200 characters, including spaces) are the same? I have changed 15 to 200, but I’m afraid the code did not work.
It would have been better to start a new thread, since this one was last posted to 4 years ago. By all means reference it, but I think a new one is warranted.
You don’t give much detail on your need: are the lines together? That is what this thread was all about.
So start a new post, outline your need, and give examples. Read the post at the top (of the Help Wanted section) titled “Please read before posting”, as it will help you provide examples in a format that we can trust hasn’t been altered by the posting window, and that we can copy to help us in tests before we provide a solution to you.
Terry
PS: your request to Scott Sumner directly will likely go unanswered (by him); he hasn’t been active on this forum for a long time.
-
Untested, because I am on my phone, but maybe try
```python
prev = ''
with open('data.txt') as f:
    for (n, line) in enumerate(f):
        if line[:200] == prev[:200]:
            print n+1
        prev = line[:200]
```
(You said you changed to 200 already, but maybe you missed an instance, or maybe comparing just the left of prev is enough)
If that doesn’t work, then follow @Terry-R’s advice
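For anyone landing here on a modern Python ( the snippets above are Python 2, where `print` is a statement ), here is a sketch of the same idea in Python 3, extended so it also writes a copy of the file with the duplicate lines removed. The filenames `data.txt` / `data_out.txt` and the sample content are placeholders, not data from the original thread:

```python
# Python 3 sketch: report and drop each line whose first N characters
# repeat those of the previous line, mirroring the Python 2 snippets
# above. Filenames and sample data are placeholders.
N = 15  # number of leading characters to compare (use 200 for the question above)

with open('data.txt', 'w') as f:  # create placeholder input data
    f.write("abcdefghijklmno SAME PREFIX 1\n"
            "abcdefghijklmno SAME PREFIX 2\n"
            "zzzzzzzzzzzzzzz DIFFERENT\n")

prev = ''
kept = []
with open('data.txt') as src:
    for n, line in enumerate(src, start=1):
        if line[:N] == prev:
            print(n)  # 1-based number of each deleted duplicate line
        else:
            kept.append(line)
        prev = line[:N]

with open('data_out.txt', 'w') as dst:  # write the deduplicated copy
    dst.writelines(kept)
```

With the placeholder data this prints `2` ( the second line repeats the first 15 characters of the first ), and `data_out.txt` keeps lines 1 and 3.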