
    Deleting lines that repeat the first 15 characters

    25 Posts 6 Posters 13.4k Views
    • guy038
      last edited by guy038

      Hello, @mangoguy, @scott-sumner and All,

      I’m extremely embarrassed, indeed! I made a serious beginner’s mistake in my previous regex, which I was testing intensively :-(( My God, of course! The RIGHT regex is (?-s)^(.{15}).*\R\K(?:\1.*\R)+ and NOT the regex (?-s)(.{15}).*\R\K(?:\1.*\R)+ :-))

      Do you see the difference ? Well, it’s just the anchor ^, after the modifier (?-s) !

      Indeed, let’s try the wrong regex again:

      Assuming the test list, below :

      91,02,2013,1000   000001   ,22.107,22.513,20.976,21.151,0
      13,1000   000002   ,20.976,21.724,20.620,21.336,0
      13,1000   000003   ,21.344,22.116,21.336,21.918,0
      13,1000   000004   ,21.918,21.918,20.797,20.797,0
      

      So, first, the regex engine starts right before the digit 9 of the first line, and the fifteen characters 91,02,2013,1000 are not found at the beginning of the next line. Then, as no anchor ^ (beginning of line) exists, the regex engine moves ahead one position, between the digits 9 and 1 of the first line. Again, as the fifteen characters 1,02,2013,1000b do not begin the next line, the regex engine moves ahead one position, now examining the string ,02,2013,1000bb …

      … until it reaches the fifteen characters 13,1000bbb00000, which, this time, can be found at the beginning of lines 2, 3 and 4! Just imagine the work to accomplish for the 458,404 lines of the Data2.txt file :-((

      ( Note : the lowercase letter b, above, stands for a space character )
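The walk described above can be reproduced with a small sketch in Python. It is only an approximation of what Notepad++ does: Python's `re` module has no `\R`, so `\n` stands in for it, and the `\K` is left out because `re` does not support it either. The variable names are made up for illustration.

```python
import re

# Test list from the post; "\n" stands in for \R (an assumption of this sketch).
text = (
    "91,02,2013,1000   000001   ,22.107,22.513,20.976,21.151,0\n"
    "13,1000   000002   ,20.976,21.724,20.620,21.336,0\n"
    "13,1000   000003   ,21.344,22.116,21.336,21.918,0\n"
    "13,1000   000004   ,21.918,21.918,20.797,20.797,0\n"
)

# The WRONG regex (no ^). In `re`, "." excludes newlines by default, like (?-s).
unanchored = re.compile(r"(.{15}).*\n(?:\1.*\n)+")

# Try the pattern at every character offset, as an unanchored search would.
first_offset = None
prefix = None
for pos in range(len(text)):
    m = unanchored.match(text, pos)
    if m:
        first_offset = pos
        prefix = m.group(1)
        break

# Offsets 0..7 fail; offset 8 (the 9th character, the "13" of 2013) succeeds.
print(first_offset, repr(prefix))
```

Every failed offset still costs a 15-character capture plus a scan to the end of the line, which is why the unanchored version is so slow on a big file.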

      To easily see the problem, just get rid of the \K syntax, forming the regex (?-s)(.{15}).*\R(?:\1.*\R)+. If you click on the Find Next button, then, after failed attempts at positions 1, 2, … and 8, it selects everything from the last two digits of the year 2013 to the end of the text. But if you use the regex (?-s)^(.{15}).*\R(?:\1.*\R)+, with the anchor ^, it correctly selects lines 2, 3 and 4, whose first 15 characters are identical!
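To compare the two selections outside Notepad++, here is a minimal sketch using Python's `re` module. It assumes `\n` line endings in place of `\R` and uses `re.MULTILINE` so that `^` matches at each line start; the variable names are illustrative only.

```python
import re

# The four test lines from the post, each ending with "\n".
lines = [
    "91,02,2013,1000   000001   ,22.107,22.513,20.976,21.151,0\n",
    "13,1000   000002   ,20.976,21.724,20.620,21.336,0\n",
    "13,1000   000003   ,21.344,22.116,21.336,21.918,0\n",
    "13,1000   000004   ,21.918,21.918,20.797,20.797,0\n",
]
text = "".join(lines)

# Both regexes without \K, so the whole selection is visible.
wrong = re.compile(r"(.{15}).*\n(?:\1.*\n)+")
right = re.compile(r"^(.{15}).*\n(?:\1.*\n)+", re.MULTILINE)

m_wrong = wrong.search(text)
m_right = right.search(text)

# Wrong regex: starts in the middle of line 1 and runs to the end of the text.
print(repr(m_wrong.group(0)[:16]))
# Right regex: exactly lines 2, 3 and 4.
print(m_right.group(0) == "".join(lines[1:]))
```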


      So, Doug, to sum up, running the right regex (?-s)^(.{15}).*\R\K(?:\1.*\R)+ against your Data2.txt file does not find any occurrence (~5s). That is the expected result as we know, by construction, that the 458,404 lines of this file are all different :-)
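For anyone who wants to try the same deletion in a script, the effect of `\K` with an empty replacement can be approximated in Python by capturing the first line and putting it back. This is a sketch under assumptions, not the Notepad++ behaviour itself: `\n` replaces `\R`, the function name `drop_prefix_repeats` is hypothetical, and the second sample is invented data standing in for a file whose lines are all different.

```python
import re

def drop_prefix_repeats(text, width=15):
    """Delete lines that repeat the first `width` characters of the line above.

    Group 1 is the whole first line (kept), group 2 its first `width` characters.
    Only lines terminated by "\n" take part in a match.
    """
    pattern = re.compile(r"^((.{%d}).*\n)(?:\2.*\n)+" % width, re.MULTILINE)
    return pattern.sub(r"\1", text)

sample = (
    "91,02,2013,1000   000001   ,22.107,22.513,20.976,21.151,0\n"
    "13,1000   000002   ,20.976,21.724,20.620,21.336,0\n"
    "13,1000   000003   ,21.344,22.116,21.336,21.918,0\n"
    "13,1000   000004   ,21.918,21.918,20.797,20.797,0\n"
)
# Invented data: every line has a different 15-character prefix.
all_different = "AAAAAAAAAAAAAAA-1\nBBBBBBBBBBBBBBB-2\nCCCCCCCCCCCCCCC-3\n"

deduped = drop_prefix_repeats(sample)
untouched = drop_prefix_repeats(all_different)
print(deduped, end="")
```

On the test list this keeps lines 1 and 2 and drops lines 3 and 4; on the all-different input nothing matches, so the text comes back unchanged.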

      Best Regards,

      guy038

      • Scott Sumner @guy038
        last edited by

        @guy038

        Yea, wow, I totally didn’t see the missing ^ as well. Of course, as our local regex guru I don’t normally question @guy038’s regexes, but there is no excuse for a second pair of eyes (mine) not noticing/questioning this. Looking back over my posts in this thread, I really added nothing of value and totally wish I hadn’t participated at all. :-(

        • Saya Jujur @Scott Sumner
          last edited by Saya Jujur

          @Scott-Sumner , about that python code:

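          # Report the line number of every line whose first 15 characters
          # repeat those of the line just before it.
          prev = ''
          with open('data.txt') as f:
              for (n, line) in enumerate(f):
                  if line[:15] == prev:
                      print(n + 1)  # 1-based line number; works in Python 2 and 3
                  prev = line[:15]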
          

          How can we delete duplicate lines if the first 40 words (or, let’s say, the first 200 characters including spaces) are the same? I changed 15 to 200, but I am afraid the code did not work.

          Thank you

          • Terry R
            last edited by Terry R

            @Saya-Jujur said in Deleting lines that repeat the first 15 characters:

            How can we delete duplicate lines if the first 40 words (or, let’s say, the first 200 characters including spaces) are the same? I changed 15 to 200, but I am afraid the code did not work.

            It would have been better to have started a new thread since this one was last posted to 4 years ago. By all means reference it but a new one I think is warranted.

            You don’t give much detail on your need. Are the duplicate lines next to each other? That is what this thread was all about.

            So start a new post, outline your need, give examples. Read the post at the top (of the Help Wanted section) titled “Please read before posting” as it will help you provide examples in a format that we can trust haven’t been altered by the posting window and we can copy to help us in tests before we provide a solution to you.

            Terry

            PS your request to Scott Sumner directly will likely go unanswered (by him), he hasn’t been active on this forum for a long time.

            • PeterJones @Saya Jujur
              last edited by PeterJones

              @Saya-Jujur ,

              Untested, because I am on my phone, but maybe try

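              # Report the line number of every line whose first 200 characters
              # repeat those of the line just before it.
              prev = ''
              with open('data.txt') as f:
                  for (n, line) in enumerate(f):
                      if line[:200] == prev[:200]:
                          print(n + 1)  # 1-based line number; works in Python 2 and 3
                      prev = line[:200]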
              

              (You said you changed to 200 already, but maybe you missed an instance, or maybe comparing just the left of prev is enough)

              If that doesn’t work, then follow @Terry-R’s advice
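Since the question was about deleting the lines rather than listing their numbers, here is a hedged Python 3 sketch of the same adjacent-line comparison that keeps the first line of each run and skips the repeats. It assumes the duplicates sit directly under each other, as in the rest of this thread; the function name `drop_adjacent_prefix_repeats` and the sample data are hypothetical.

```python
def drop_adjacent_prefix_repeats(lines, width=200):
    """Yield lines, skipping any line whose first `width` characters
    equal those of the line immediately before it."""
    prev = None  # None (not '') so an empty first line is never skipped
    for line in lines:
        key = line[:width]
        if key == prev:
            continue  # repeat of the previous line's prefix: drop it
        prev = key
        yield line

# Small in-memory example with width=3 to keep it readable.
sample = ["aaa1\n", "aaa2\n", "bbb1\n", "aaa3\n"]
kept = list(drop_adjacent_prefix_repeats(sample, width=3))
print(kept)
```

To use it on a file, pass the open file object as `lines` and write the yielded lines to a new file. Note that a line shorter than `width` is compared in full, line ending included, so two identical short lines next to each other also count as a repeat.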

              The Community of users of the Notepad++ text editor.