Problems with certain functionality on extremely large files
-
I have a .txt file that is 5 million lines / 120 million characters / 130 MB. So, huge. I’ve tried to isolate the lines in the file that match a certain regex, plus a few lines after each match, and remove all other lines. The mark tool worked fine, allowing me to mark the lines I wanted. But when I ran the “remove unmarked lines” command, Notepad++ stopped responding. I then tried to accomplish this with some Python scripts in the Python Script plugin, and while Notepad++ didn’t stop responding, the script never completed and appeared to stop after a certain number of lines when it shouldn’t have.

I also had another, small file (800 lines for 800 words) which I wanted to compare to the larger file to see which words from the smaller file appeared in the larger file. I used the compare plugin, and Notepad++ stopped responding. And it isn’t a case of being unresponsive while still working on the assigned task: I left my computer running for hours and nothing happened (the compare plugin’s progress froze early on but I left it running just in case, and while I couldn’t check the “remove unmarked lines” progress, I don’t think it’d take hours when marking the lines took seconds).

Does anyone have any tips, whether that’s 1) some kind of change I can make to Notepad++ to allow it to work better with large files, 2) an easy way to split this file into smaller pieces quickly so it’s more manageable for Notepad++, or 3) another application with functionality similar to what I described above that can handle larger files?
-
If you know Python (or Perl or AWK), why not write a simple script to print the lines you want? If that’s acceptable, I think it is a perfect job for an extremely simple state machine. The basic logic is this, where N is the number of lines to print after the regex is seen:
    State = 0
    for (L = each line of the file) {
        if (L matches the regex) {
            Print L
            State = 1
            Count = 0
        } else if (State == 1) {
            if (Count < N) {
                Print L
                Count = Count + 1
            } else {
                State = 0
            }
        }
    }
-
Yeah. “Big” data can be a problem. I’d go with Jim’s idea. In fact, I have done this exact thing in the past. Here’s a “cleaned up” version of some standalone Python I have, on the chance that it will help you:
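    import re

    myfile = r'C:\foo\bar.txt'
    myregex = r''  # your regex goes here (an empty pattern matches every line)
    lines_after_count = 2  # how many lines to print after each match
    countdown = 0  # trailing lines still owed after the latest match

    with open(myfile) as f:
        for line in f:  # reads one line at a time, never the whole file
            line = line.rstrip()
            if re.search(myregex, line):
                print(line)
                countdown = lines_after_count  # restart the window on every match
            elif countdown > 0:
                print(line)
                countdown -= 1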
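The second task from the original question (finding which of the 800 words occur in the big file) can probably be handled with the same line-by-line streaming idea, again outside Notepad++. What follows is only a minimal sketch under several assumptions that are not stated in the thread: the small file holds one word per line, a match means a case-sensitive whole-word match, and both files are UTF-8. The function name `words_found` and its parameters are made up for illustration.

```python
import re

# Assumption: a "word" is a run of letters, digits, or underscores.
WORD_RE = re.compile(r"\w+")


def words_found(small_path, big_path, encoding="utf-8"):
    """Return the set of words from small_path that occur in big_path."""
    # The small file (assumed one word per line) easily fits in memory.
    with open(small_path, encoding=encoding) as f:
        wanted = {line.strip() for line in f if line.strip()}

    found = set()
    # Stream the large file so only one line is held in memory at a time.
    with open(big_path, encoding=encoding, errors="replace") as f:
        for line in f:
            # Whole-word, case-sensitive matches on this line.
            found.update(wanted.intersection(WORD_RE.findall(line)))
            if len(found) == len(wanted):
                break  # every word has been seen; no need to read further
    return found
```

Calling `sorted(words_found(small, big))` with the two file paths would then list the matching words. If substring or case-insensitive matching is what you actually need, the tokenizing step would have to change accordingly.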