Problems with certain functionality on extremely large files

  • A
    Aliyah Coplan
    last edited by Aliyah Coplan Apr 26, 2019, 3:21 AM Apr 26, 2019, 3:20 AM

    I have a .txt file that is 5 million lines / 120 million characters / 130 MB. So, huge. I’ve tried to isolate the lines in the file that match a certain regex, plus a few lines after each match, and remove all other lines. The mark tool worked fine, allowing me to mark the lines I wanted. But when I then used the “remove unmarked lines” command, Notepad++ stopped responding. I then tried to accomplish this with some Python scripts in the Python Script plugin, and while Notepad++ didn’t stop responding, the script never completed and appeared to stop after a certain number of lines when it shouldn’t have.

    Then I had another small file (800 lines for 800 words) which I wanted to compare to the larger file to see which words from the smaller file appeared in the larger file. I used the compare plugin, and Notepad++ stopped responding. And it isn’t just unresponsive while still working on the assigned task; I left my computer running for hours and nothing happened (the compare plugin’s progress froze early on but I left it running just in case, and while I couldn’t check the “remove unmarked lines” progress, I don’t think it would take hours when marking the lines took seconds).

    Does anyone have any tips, whether that’s 1) some kind of change I can make to Notepad++ to allow it to work better with large files, 2) an easy way to split this file into smaller pieces quickly so it’s more manageable for Notepad++, or 3) another application with functionality similar to what I described above that can handle larger files?

    • J
      Jim Dailey
      Apr 26, 2019, 11:49 AM

      @Aliyah-Coplan

      If you know Python (or Perl or AWK), then why not write a simple script to print the lines you want? If that’s acceptable, then I think it is a perfect job for an extremely simple state machine. The basic logic is this, where N is the number of lines to print after a line matching the regex is seen:

      State = 0
      for (L = each line of the file) {
          if (L matches the regex) {
              Print L
              State = 1
              Count = 0
          }
          else if (State == 1) {
              if (Count < N) {
                  Print L
                  Count = Count + 1
              }
              else {
                  State = 0
              }
          }
      }
      
      • A
        Alan Kilborn @Jim Dailey
        Apr 26, 2019, 12:26 PM

        Yea. “Big” data can be a problem. I’d go with Jim’s idea. In fact, I have done this exact thing in the past. Here’s a “cleaned up” version of some standalone Python I have, on the chance that it will help you:

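        import re

        myfile = r'C:\foo\bar.txt'
        myregex = r''  # put your regex here; as written, the empty pattern matches every line
        lines_after_count = 2  # number of lines to print after each matching line

        countdown = 0  # lines still to print after the most recent match
        with open(myfile) as f:
            # iterate line by line so the whole file is never held in memory
            for line in f:
                line = line.rstrip()
                if re.search(myregex, line):
                    print(line)
                    countdown = lines_after_count  # restart the count on every match
                elif countdown > 0:
                    print(line)
                    countdown -= 1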
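
The same line-by-line streaming idea should also cover the second task in the original question: finding which words from the small file appear in the large one. Below is a minimal sketch, not a tested drop-in tool. It assumes the small file holds one word per line, that matching is case-sensitive, and that only whole words count. The function name `words_found` is made up for illustration.

```python
import re


def words_found(words, lines):
    """Return the set of `words` that occur as whole words in `lines`.

    `lines` can be any iterable of strings, e.g. an open file object,
    so the large file is read once and never loaded fully into memory.
    """
    remaining = set(words)  # words not yet seen in the large file
    found = set()
    for line in lines:
        if not remaining:
            break  # every word has been seen; stop reading early
        # split the line into word-like tokens and keep those still wanted
        hits = remaining.intersection(re.findall(r"\w+", line))
        found |= hits
        remaining -= hits
    return found
```

To use it, read the small file into a list (stripping the newline from each word), open the large file, and pass both to `words_found(words, f)`. Because the words are held in a set, each line costs roughly one pass over its tokens, however many words are in the list.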
        