Deleting lines that repeat the first 15 characters
-
Hello, @mangoguy, @scott-sumner and All,
I’m extremely confused, indeed! I made an important beginner’s mistake in my previous regex, which I was testing intensively :-(( My God, of course! The RIGHT regex is

`(?-s)^(.{15}).*\R\K(?:\1.*\R)+`

and NOT the regex `(?-s)(.{15}).*\R\K(?:\1.*\R)+` :-))

Do you see the difference? Well, it’s just the anchor `^`, after the modifier `(?-s)`!

Indeed, let’s try the wrong regex again, assuming the test list below:
```
91,02,2013,1000   000001 ,22.107,22.513,20.976,21.151,0
13,1000   000002 ,20.976,21.724,20.620,21.336,0
13,1000   000003 ,21.344,22.116,21.336,21.918,0
13,1000   000004 ,21.918,21.918,20.797,20.797,0
```
So, first, the caret is located right before the digit 9 of the first line, and the fifteen characters `91,02,2013,1000` cannot be found elsewhere. Then, as no anchor `^` ( beginning of line ) exists, the regex engine moves ahead one position, between the digits 9 and 1 of the first line. Again, as the fifteen characters `1,02,2013,1000b` do not occur further on, the regex engine moves ahead one more position, now examining the string `,02,2013,1000bb` …… till it reaches the fifteen characters `13,1000bbb00000`, which can be found, this time, at the beginning of lines 2, 3 and 4! Just imagine the work to accomplish for the 458,404 lines of the Data2.txt file :-(((

( Note: the lowercase letter `b`, above, stands for a space character )

To easily see the problem, just get rid of the `\K` syntax, forming the regex `(?-s)(.{15}).*\R(?:\1.*\R)+`. If you click on the Find Next button, it selects, after tests on positions 1, 2, … and 8, everything from the two last digits of the year 2013 till the end of the text. But if you use the regex `(?-s)^(.{15}).*\R(?:\1.*\R)+`, with the anchor `^`, it correctly selects the identical lines 2, 3 and 4, with regard to their first 15 characters!
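The difference between the two `\K`-free regexes can be reproduced outside Notepad++. Below is a small sketch using Python’s `re` module, which supports neither `\K` nor `\R`, so `\R` is replaced by `\n` and the `^` semantics come from `re.MULTILINE`; the spacing of the test lines is reconstructed from the `b`-for-space notation above:

```python
import re

# The four test lines from above, with the b-for-space spacing restored
# so that lines 2-4 share their first 15 characters: "13,1000   00000".
text = ("91,02,2013,1000   000001 ,22.107,22.513,20.976,21.151,0\n"
        "13,1000   000002 ,20.976,21.724,20.620,21.336,0\n"
        "13,1000   000003 ,21.344,22.116,21.336,21.918,0\n"
        "13,1000   000004 ,21.918,21.918,20.797,20.797,0\n")

# \K-free variants of the two regexes, with \R replaced by \n;
# re.MULTILINE gives ^ its beginning-of-line meaning.
unanchored = re.compile(r"(.{15}).*\n(?:\1.*\n)+")
anchored = re.compile(r"^(.{15}).*\n(?:\1.*\n)+", re.MULTILINE)

m1 = unanchored.search(text)
m2 = anchored.search(text)

print(m1.start())  # 8  -> match begins at the "13" of "2013", mid-line 1
print(m2.start())  # 56 -> match begins exactly at the start of line 2
print(m1.group(1) == m2.group(1))  # True: both capture "13,1000   00000"
```

The unanchored pattern starts its match at offset 8 of the first line ( the two last digits of 2013 ), after trying and rejecting every earlier position, while the anchored one only ever attempts a capture at the beginning of a line.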
So, Doug, to sum up: using the right regex `(?-s)^(.{15}).*\R\K(?:\1.*\R)+` against your Data2.txt file does not find any occurrence ( in about 5 s ), which is the expected result, as we know, by construction, that the 458,404 lines of this file are all different :-)

Best Regards,

guy038
-
Yeah, wow, I totally didn’t see the missing `^` as well. Of course, as our local regex guru, I don’t normally question @guy038’s regexes, but there is no excuse for a second pair of eyes (mine) not noticing/questioning this. Looking back over my posts in this thread, I really added nothing of value and totally wish I hadn’t participated at all. :-(
-
@Scott-Sumner , about that python code:
```python
prev = ''
with open('data.txt') as f:
    for (n, line) in enumerate(f):
        if line[:15] == prev:
            print n+1
        prev = line[:15]
```
How can we delete duplicate lines if the first 40 words (or let’s say, the first 200 characters, including spaces) are the same? I have changed 15 to 200, but I’m afraid the code did not work.
Thank you
-
@Saya-Jujur said in Deleting lines that repeat the first 15 characters:

> How can we delete duplicate lines if the first 40 words (or let’s say, the first 200 characters, including spaces) are the same? I have changed 15 to 200, but I’m afraid the code did not work.
It would have been better to start a new thread, since this one was last posted to 4 years ago. By all means reference it, but I think a new one is warranted.
You don’t give much detail on your need: are the lines together? That is what this thread was all about.
So start a new post, outline your need, and give examples. Read the post at the top (of the Help Wanted section) titled “Please read before posting”, as it will help you provide examples in a format that we can trust hasn’t been altered by the posting window, and that we can copy to help us in tests before we provide a solution to you.
Terry
PS: your request to Scott Sumner directly will likely go unanswered (by him); he hasn’t been active on this forum for a long time.
-
Untested, because I am on my phone, but maybe try
```python
prev = ''
with open('data.txt') as f:
    for (n, line) in enumerate(f):
        if line[:200] == prev[:200]:
            print n+1
        prev = line[:200]
```
(You said you changed to 200 already, but maybe you missed an instance, or maybe comparing just the left of prev is enough)
If that doesn’t work, then follow @Terry-R’s advice
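For anyone landing here on a modern Python ( the snippets above are Python 2, where `print` is a statement ), here is a sketch of the same idea in Python 3, extended so it also writes a copy of the file with the duplicate lines removed. The filenames `data.txt` / `data_out.txt` and the sample content are placeholders, not data from the original thread:

```python
# Python 3 sketch: report and drop each line whose first N characters
# repeat those of the previous line, mirroring the Python 2 snippets
# above. Filenames and sample data are placeholders.
N = 15  # number of leading characters to compare (use 200 for the question above)

with open('data.txt', 'w') as f:  # create placeholder input data
    f.write("abcdefghijklmno SAME PREFIX 1\n"
            "abcdefghijklmno SAME PREFIX 2\n"
            "zzzzzzzzzzzzzzz DIFFERENT\n")

prev = ''
kept = []
with open('data.txt') as src:
    for n, line in enumerate(src, start=1):
        if line[:N] == prev:
            print(n)  # 1-based number of each deleted duplicate line
        else:
            kept.append(line)
        prev = line[:N]

with open('data_out.txt', 'w') as dst:  # write the deduplicated copy
    dst.writelines(kept)
```

With the placeholder data this prints `2` ( the second line repeats the first 15 characters of the first ), and `data_out.txt` keeps lines 1 and 3.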