# Problems with certain functionality on extremely large files

• I have a .txt file that is 5 million lines / 120 million characters / 130 MB. So, huge. I’ve tried to isolate the lines in the file that match a certain regex, plus a few lines after each match, and remove all other lines. The mark tool worked fine, allowing me to mark the lines I wanted. But when I then used the “remove unmarked lines” command, Notepad++ stopped responding. I then tried to accomplish this with some Python scripts in the Python Script plugin, and while Notepad++ didn’t stop responding, the script never completed and appeared to stop after a certain number of lines when it shouldn’t have.

Then I had another small file (800 lines for 800 words) which I wanted to compare to the larger file to see which words from the smaller file appear in the larger file. I used the Compare plugin, and Notepad++ stopped responding.

And it’s not a case of being unresponsive while still working on the assigned task; I left my computer running for hours and nothing happened (the Compare plugin’s progress froze early on but I left it running just in case, and while I couldn’t check the “remove unmarked lines” progress, I don’t think it’d take hours when marking the lines took seconds).

Does anyone have any tips, whether that’s 1) some kind of change I can make to Notepad++ to allow it to work better with large files, 2) an easy way to split this file into smaller pieces quickly so it’s more manageable for Notepad++, or 3) another application with functionality similar to what I described above that can handle larger files?

• @Aliyah-Coplan

If you know Python (or Perl or AWK), then why not write a simple script to print the lines you want? If that’s acceptable, then I think it is a perfect job for an extremely simple state machine. The basic logic is this, where N is the number of lines to print after the regex is seen:

State = 0
for (L = each line of the file) {
    if (L matches the regex) {
        Print L
        State = 1
        Count = 0
    }
    else if (State == 1) {
        if (Count < N) {
            Print L
            Count = Count + 1
        }
        else {
            State = 0
        }
    }
}

• Yeah. “Big” data can be a problem. I’d go with Jim’s idea. In fact, I have done this exact thing in the past. Here’s a “cleaned up” version of some standalone Python I have, on the chance that it will help you:

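import re

myfile = r'C:\foo\bar.txt'
myregex = r''  # put your regex here; an empty pattern matches every line
lines_after_count = 2  # number of lines to print after each match

countdown = 0  # lines still to print after the most recent match
with open(myfile) as f:
    for line in f:
        line = line.rstrip()
        if re.search(myregex, line):
            print(line)
            countdown = lines_after_count
        elif countdown > 0:
            print(line)
            countdown -= 1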
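
The second task from the original question (seeing which of the 800 words show up in the big file) can probably be handled with the same line-by-line streaming. What follows is only a rough sketch: the function name `words_found`, the one-word-per-line layout of the small file, the case-insensitive matching, the UTF-8 encoding, and the `\w+` way of splitting lines into words are all assumptions, so adjust them to fit your data.

```python
import re

def words_found(words_path, big_path, encoding="utf-8"):
    """Return the set of words from words_path that occur in big_path."""
    # Small file: assumed to hold one word per line, so load it all into a set.
    with open(words_path, encoding=encoding) as f:
        wanted = {line.strip().lower() for line in f if line.strip()}

    found = set()
    # Large file: read one line at a time so memory use stays small.
    with open(big_path, encoding=encoding, errors="replace") as f:
        for line in f:
            # Splitting on \w+ is an assumption about what counts as a "word".
            found.update(wanted.intersection(re.findall(r"\w+", line.lower())))
            if len(found) == len(wanted):
                break  # every word has been seen, so stop reading early
    return found
```

Calling something like `words_found(r'C:\foo\words.txt', r'C:\foo\bar.txt')` (the first path is made up for illustration) returns the matching words in lowercase. Because the small file becomes a set, each line of the big file needs only cheap lookups, which should be far lighter than a diff-style compare.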