Problems with certain functionality on extremely large files



  • I have a .txt file that is 5 million lines / 120 million characters / 130 MB. So, huge. I’ve tried to isolate the lines in the file that match a certain regex, plus a few lines after each match, and remove all other lines. The Mark tool worked fine, allowing me to mark the lines I wanted. But when I ran the “remove unmarked lines” command, Notepad++ stopped responding. I then tried to accomplish this with some Python scripts in the Python Script plugin, and while Notepad++ didn’t stop responding, the script never completed and appeared to stop after a certain number of lines when it shouldn’t have.

    Then I had another small file (800 lines, 800 words) which I wanted to compare to the larger file to see which words from the smaller file appeared in the larger file. I used the Compare plugin, and Notepad++ stopped responding. And it’s not a case of being unresponsive while still working on the assigned task; I left my computer running for hours and nothing happened (the Compare plugin’s progress froze early on but I left it running just in case, and while I couldn’t check the “remove unmarked lines” progress, I don’t think it would take hours when marking the lines took seconds).

    Does anyone have any tips, whether that’s 1) some kind of change I can make to Notepad++ to allow it to work better with large files, 2) an easy way to split this file into smaller pieces quickly so it’s more manageable for Notepad++, or 3) another application with functionality similar to what I described above that can handle larger files?



  • @Aliyah-Coplan

    If you know Python (or Perl or AWK), then why not write a simple script to print the lines you want? If that’s acceptable, then I think it is a perfect job for an extremely simple state machine. The basic logic is this, where N is the number of lines to print after a line matching the regex is seen:

    State = 0
    for (L = each line of the file) {
        if (L matches the regex) {
            Print L
            State = 1
            Count = 0
        }
        else if (State == 1) {
            if (Count < N) {
                Print L
                Count = Count + 1
            }
            else {
                State = 0
            }
        }
    }


  • Yeah. “Big” data can be a problem. I’d go with Jim’s idea. In fact, I have done this exact thing in the past. Here’s a “cleaned up” version of some standalone Python I have, on the chance that it will help you:

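    import re
    
    myfile = r'C:\foo\bar.txt'
    myregex = r''  # placeholder: put your pattern here (an empty pattern matches every line)
    lines_after_count = 2  # how many lines to print after each matching line
    
    pattern = re.compile(myregex)  # compile once, reuse for every line
    countdown = 0  # lines still to print after the most recent match
    # iterate line by line so the whole file is never held in memory
    with open(myfile) as f:
        for line in f:
            line = line.rstrip()  # drop the newline (and any trailing whitespace)
            if pattern.search(line):
                print(line)
                countdown = lines_after_count  # restart the count on every match
            elif countdown > 0:
                print(line)
                countdown -= 1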
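
For the second task in the original question (checking which words from the 800-line file appear in the big file), the same line-by-line idea should work. This is only a minimal sketch and not something posted in this thread: the function name `words_found_in_file`, the UTF-8 encoding, and the whole-word, case-sensitive matching are all assumptions, so adjust them to fit your data.

```python
import re


def words_found_in_file(words, big_file_path, encoding="utf-8"):
    """Return the set of `words` that occur as whole words in the big file.

    Hypothetical helper: matching is case-sensitive, and a "word" is
    assumed to be a run of letters, digits, or underscores (regex \\w+).
    """
    remaining = set(words)          # words not seen yet
    found = set()                   # words seen at least once
    token = re.compile(r"\w+")      # splits each line into word tokens

    # Stream the file one line at a time so memory use stays small
    with open(big_file_path, encoding=encoding, errors="replace") as f:
        for line in f:
            if not remaining:       # every word already found: stop early
                break
            hits = remaining.intersection(token.findall(line))
            found |= hits
            remaining -= hits
    return found
```

Pass it the list of words from the small file and the path of the big one. Because it keeps only one line in memory at a time and stops reading once every word has turned up, the size of the big file should matter much less than it does inside an editor.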
