
    Problems with certain functionality on extremely large files

    • Aliyah Coplan

      I have a .txt file that is 5 million lines / 120 million characters / 130 MB. So, huge. I’ve tried to isolate the lines in the file that match a certain regex, plus a few lines after each match, and remove all other lines. The mark tool worked fine, allowing me to mark the lines I wanted. But when I used the “remove unmarked lines” command, Notepad++ stopped responding. I then tried to accomplish this with some Python scripts in the Python Script plugin; Notepad++ didn’t stop responding, but the script never completed and appeared to stop after a certain number of lines when it shouldn’t have.

      I also had a second, small file (800 lines, one word per line) that I wanted to compare to the larger file to see which words from the smaller file appear in the larger one. I used the compare plugin, and Notepad++ stopped responding.

      And it’s not a case of Notepad++ being unresponsive while still working on the task; I left my computer running for hours and nothing happened. (The compare plugin’s progress froze early on, but I left it running just in case. I couldn’t check the progress of “remove unmarked lines”, but I don’t think it would take hours when marking the lines took seconds.)

      Does anyone have any tips, whether that’s 1) some kind of change I can make to Notepad++ to allow it to work better with large files, 2) an easy way to split this file into smaller pieces quickly so it’s more manageable for Notepad++, or 3) another application with functionality similar to what I described above that can handle larger files?

      • Jim Dailey

        @Aliyah-Coplan

        If you know Python (or Perl or AWK), then why not write a simple script to print the lines you want? If that’s acceptable, then I think it is a perfect job for an extremely simple state machine. The basic logic is this, where N is the number of lines to print after each line that matches the regex:

        State = 0
        for (L = each line of the file) {
            if (L matches the regex) {
                Print L
                State = 1
                Count = 0
            }
            else if (State == 1) {
                if (Count < N) {
                    Print L
                    Count = Count + 1
                }
                else {
                    State = 0
                }
            }
        }
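
For anyone who wants to try the logic above on a small sample before pointing it at the big file, here is a minimal Python sketch of the same state machine. The function name `matches_with_trailing` and the sample data are made up for illustration; the state and count variables mirror the pseudocode.

```python
import re


def matches_with_trailing(lines, pattern, n):
    """Yield every line matching `pattern`, plus up to `n` lines after it."""
    regex = re.compile(pattern)
    state = 0  # 0 = idle, 1 = emitting lines that follow a match
    count = 0  # lines emitted since the most recent match
    for line in lines:
        if regex.search(line):
            yield line
            state = 1
            count = 0
        elif state == 1:
            if count < n:
                yield line
                count += 1
            else:
                state = 0


# Hypothetical sample data standing in for the real file.
sample = ["alpha", "ERROR one", "beta", "gamma", "delta", "ERROR two", "epsilon"]
result = list(matches_with_trailing(sample, r"ERROR", 1))
print(result)
```

Because the function accepts any iterable of lines, it should also work with an open file object, which Python reads one line at a time. Writing each yielded line to a second file would then give the reduced file without holding all 5 million lines in memory.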
        
        • Alan Kilborn @Jim Dailey

          Yeah. “Big” data can be a problem. I’d go with Jim’s idea. In fact, I have done this exact thing in the past. Here’s a “cleaned up” version of some standalone Python I have, on the chance that it will help you:

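          import re
          
          myfile = r'C:\foo\bar.txt'  # path to the large file
          myregex = r''               # put your regex here (an empty pattern matches every line)
          lines_after_count = 2       # how many lines to print after each match
          
          pattern = re.compile(myregex)  # compile once instead of on every line
          countdown = 0
          # Iterating over the file object reads one line at a time,
          # so the whole file is never held in memory.
          with open(myfile) as f:
              for line in f:
                  line = line.rstrip('\n')  # drop only the newline, keep other trailing whitespace
                  if pattern.search(line):
                      print(line)
                      countdown = lines_after_count
                  elif countdown > 0:
                      print(line)
                      countdown -= 1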
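
The original question also asked which words from the small 800-word file appear in the large file. The same line-by-line approach might cover that too. Below is a rough sketch, not a tested solution for the real data: the function name `words_found` is made up, the lists stand in for the two files, and it assumes that a "word" is whatever `\w+` matches and that the comparison is case-sensitive.

```python
import re


def words_found(word_lines, big_lines):
    """Return the set of words from word_lines that occur in big_lines."""
    # One word per line is assumed for the small file; blank lines are ignored.
    wanted = {w.strip() for w in word_lines if w.strip()}
    found = set()
    for line in big_lines:
        # Assumption: a "word" is a run of letters, digits, or underscores.
        for token in re.findall(r"\w+", line):
            if token in wanted:
                found.add(token)
        if len(found) == len(wanted):
            break  # every word has been seen, no need to read further
    return found


# Hypothetical stand-ins for the small and large files.
small = ["apple\n", "kiwi\n", "plum\n"]
big = ["an apple a day\n", "plum pudding, plums\n", "nothing here\n"]
hits = words_found(small, big)
print(sorted(hits))
```

With real files, both arguments could be open file objects, so the large file is still read one line at a time. The set lookup keeps each check cheap even with 800 words and millions of lines.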
          
          The Community of users of the Notepad++ text editor.