    Deleting lines that repeat the first 15 characters

    • guy038

      Hello, @mangoguy, @scott-sumner and All,

      I’m extremely embarrassed, indeed! I made a serious beginner’s mistake in my previous regex, which I had been testing intensively :-(( My God, of course! The RIGHT regex is (?-s)^(.{15}).*\R\K(?:\1.*\R)+ and NOT the regex (?-s)(.{15}).*\R\K(?:\1.*\R)+ :-))

      Do you see the difference? Well, it’s just the anchor ^, right after the modifier (?-s)!

      Indeed, let’s try the wrong regex again, against the test list below:

      91,02,2013,1000   000001   ,22.107,22.513,20.976,21.151,0
      13,1000   000002   ,20.976,21.724,20.620,21.336,0
      13,1000   000003   ,21.344,22.116,21.336,21.918,0
      13,1000   000004   ,21.918,21.918,20.797,20.797,0
      

      So, first, the caret is right before the digit 9 of the first line, and the fifteen characters 91,02,2013,1000 are not repeated at the beginning of the next line. Then, as there is no anchor ^ (beginning of line), the regex engine moves ahead one position, between the digits 9 and 1 of the first line. Again, as the fifteen characters 1,02,2013,1000b do not begin the next line either, the regex engine moves ahead one more position, now examining the string ,02,2013,1000bb …

      … until it reaches the fifteen characters 13,1000bbb00000, which, this time, are found at the beginning of lines 2, 3 and 4! Just imagine the work involved for the 458,404 lines of the Data2.txt file :-((

      (Note: the lowercase letter b, above, stands for a space character)

      To easily see the problem, just remove the \K syntax, which gives the regex (?-s)(.{15}).*\R(?:\1.*\R)+. If you click on the Find Next button, then, after failing at positions 1, 2, … and 8, it selects everything from the last two digits of the year 2013 to the end of the text. But if you use the regex (?-s)^(.{15}).*\R(?:\1.*\R)+, with the anchor ^, it correctly matches lines 2, 3 and 4, which share the same first 15 characters!
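
The effect of the missing anchor can be reproduced outside Notepad++. The following is only a sketch in Python, not the Notepad++ engine: Python's re module has no \R or \K, so \n stands in for \R, \K is left out, (?-s) is omitted because the dot does not match newlines by default, and ^ needs re.MULTILINE. The variable names are illustrative.

```python
import re

# Test list from the post (three spaces on each side of the second field).
text = (
    "91,02,2013,1000   000001   ,22.107,22.513,20.976,21.151,0\n"
    "13,1000   000002   ,20.976,21.724,20.620,21.336,0\n"
    "13,1000   000003   ,21.344,22.116,21.336,21.918,0\n"
    "13,1000   000004   ,21.918,21.918,20.797,20.797,0\n"
)
lines = text.splitlines(keepends=True)

# Wrong regex (no anchor) and right regex (anchored at line start).
unanchored = re.compile(r"(.{15}).*\n(?:\1.*\n)+")
anchored = re.compile(r"^(.{15}).*\n(?:\1.*\n)+", re.MULTILINE)

m_wrong = unanchored.search(text)
m_right = anchored.search(text)

# The unanchored match begins in the middle of line 1, at the "13" of 2013.
print(m_wrong.start())                          # 8 (0-based offset)
# The anchored match is exactly lines 2, 3 and 4.
print(m_right.group(0) == "".join(lines[1:]))   # True
```

Without the anchor, the engine has to try every offset of every line before giving up, which is where the extra work on a large file comes from.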


      So, Doug, to sum up, using the right regex (?-s)^(.{15}).*\R\K(?:\1.*\R)+ against your Data2.txt file does not find any occurrence (~5 s). That is the expected result since we know, by construction, that the 458,404 lines of this file are all different :-)
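
For readers who want to try the deletion itself, here is a rough Python equivalent of replacing the matches of the right regex with nothing. Since Python's re module lacks \K, the sketch captures the first line of each run in group 1 and puts it back; drop_repeats and its width parameter are hypothetical names, and \n is assumed as the line ending.

```python
import re

def drop_repeats(text, width=15):
    """Delete lines that repeat the first `width` characters of the line above."""
    # Group 1: the first line of a run; group 2: its first `width` characters.
    pattern = re.compile(r"^((.{%d}).*\n)(?:\2.*\n)+" % width, re.MULTILINE)
    # Keep the first line of each run and drop the repeats that follow it.
    return pattern.sub(r"\1", text)

sample = (
    "91,02,2013,1000   000001   ,22.107,22.513,20.976,21.151,0\n"
    "13,1000   000002   ,20.976,21.724,20.620,21.336,0\n"
    "13,1000   000003   ,21.344,22.116,21.336,21.918,0\n"
    "13,1000   000004   ,21.918,21.918,20.797,20.797,0\n"
)
print(drop_repeats(sample), end="")
```

On text whose lines all differ in their first 15 characters, the function returns the text unchanged, which mirrors the "no occurrence" result above. A final line without a trailing newline is never treated as a repeat in this sketch.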

      Best Regards,

      guy038

      • Scott Sumner @guy038

        @guy038

        Yea, wow, I totally didn’t see the missing ^ as well. Of course, as our local regex guru I don’t normally question @guy038’s regexes, but there is no excuse for a second pair of eyes (mine) not noticing/questioning this. Looking back over my posts in this thread, I really added nothing of value and totally wish I hadn’t participated at all. :-(

        • Saya Jujur @Scott Sumner

          @Scott-Sumner , about that python code:

          prev = ''
          with open('data.txt') as f:
              for (n, line) in enumerate(f):
                  # Same first 15 characters as the previous line?
                  if line[:15] == prev:
                      print(n + 1)  # report the 1-based line number
                  prev = line[:15]
          

          How can we delete duplicate lines when their first 40 words (or, let’s say, their first 200 characters, including spaces) are the same? I changed 15 to 200, but I am afraid the code did not work.

          Thank you

          • Terry R

            @Saya-Jujur said in Deleting lines that repeat the first 15 characters:

            How can we delete duplicate lines when their first 40 words (or, let’s say, their first 200 characters, including spaces) are the same? I changed 15 to 200, but I am afraid the code did not work.

            It would have been better to start a new thread, since this one was last posted to 4 years ago. By all means reference it, but I think a new one is warranted.

            You don’t give much detail about your need. Are the duplicate lines next to each other? That is what this thread was all about.

            So start a new post, outline your need and give examples. Read the post at the top of the Help Wanted section titled “Please read before posting”. It will help you provide examples in a format that we can trust has not been altered by the posting window, and that we can copy for our tests before we offer you a solution.

            Terry

            PS: Your request to Scott Sumner directly will likely go unanswered (by him); he hasn’t been active on this forum for a long time.

            • PeterJones @Saya Jujur

              @Saya-Jujur ,

              Untested, because I am on my phone, but maybe try:

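              prev = ''
              with open('data.txt') as f:
                  for (n, line) in enumerate(f):
                      # Compare the first 200 characters with those of the previous line.
                      if line[:200] == prev[:200]:
                          print(n + 1)  # report the 1-based line number
                      prev = line[:200]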
              

              (You said you had already changed it to 200, but maybe you missed an instance, or maybe comparing just the left part of prev is enough.)

              If that doesn’t work, then follow @Terry-R’s advice.
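
Note that the snippets above only print the numbers of the repeated lines; they do not delete anything. One possible way to actually drop them is sketched below, under the same assumption as the rest of this thread: only adjacent lines are compared. The function name drop_adjacent_prefix_repeats and its width parameter are hypothetical.

```python
def drop_adjacent_prefix_repeats(lines, width=200):
    """Yield each line unless its first `width` characters repeat the previous line's."""
    prev = None  # prefix of the previous input line; None before the first line
    for line in lines:
        key = line[:width]
        if key != prev:
            yield line
        prev = key

# Small in-memory demonstration with width=3 instead of 200.
demo = ["abc1\n", "abc2\n", "abd3\n", "abc4\n"]
kept = list(drop_adjacent_prefix_repeats(demo, width=3))
print(kept)  # ['abc1\n', 'abd3\n', 'abc4\n']
```

The generator accepts any iterable of lines, so an open file object could be passed in and the result written to a second file with writelines. Lines shorter than the chosen width are compared in full, line ending included, so they are dropped only when they are identical to the line above.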

              The Community of users of the Notepad++ text editor.