Little script is very slow (depends on file size)
-
Hello,
I have a little PythonScript that is super important for my line of work, but it can be very slow parsing large files; it seems to get worse the bigger the file is. It’s very simple: you highlight a portion of text, and it will bookmark all the lines containing that portion of text. Then you can copy the bookmarked lines, cut them, etc.
```python
MARK_BOOKMARK = 20

def match_found(m):
    targetStart = m.span(0)[0]
    lineNumber = editor.lineFromPosition(targetStart)
    editor.markerAdd(lineNumber, MARK_BOOKMARK)

pattern = editor.getSelText()
if pattern != '':
    editor.search(pattern, match_found)
```
I suspect editor.lineFromPosition(targetStart) is where the problem is. As it starts bookmarking lower and lower down the file, it gets increasingly slow.
Would be nice to speed this up a bit. Not quite sure what to do. I suspect the search result ‘m’ should have the line number (m.lineNumber, … that kind of thing) when it finds a pattern somewhere in the file, so there would be no need for the clumsy conversion.
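One possible restructuring, as a sketch only: pull the whole buffer into Python once (e.g. via editor.getText(), which I’m assuming is acceptable for files this size), find matches with re.finditer, and count newlines only between consecutive matches, so the document is scanned a single time instead of re-counted from the top for every hit. The helper name bookmark_lines is made up for illustration:

```python
import re

def bookmark_lines(text, pattern):
    """Return the 0-based line numbers containing `pattern`.

    Matches from re.finditer arrive in document order, so we count
    newlines only between consecutive matches instead of rescanning
    from the start of the buffer for every hit.
    """
    lines = []
    last_pos = 0   # position where the previous newline count stopped
    last_line = 0  # line number at last_pos
    for m in re.finditer(re.escape(pattern), text):
        last_line += text.count('\n', last_pos, m.start())
        last_pos = m.start()
        if not lines or lines[-1] != last_line:  # avoid duplicate bookmarks
            lines.append(last_line)
    return lines

# In PythonScript this might be driven by something like:
#   for ln in bookmark_lines(editor.getText(), editor.getSelText()):
#       editor.markerAdd(ln, MARK_BOOKMARK)

demo = "alpha\nbeta target\ngamma\ntarget target\ndelta"
print(bookmark_lines(demo, "target"))  # -> [1, 3]
```

Whether this beats calling editor.lineFromPosition per match depends on how that call scales, so it would need timing on real data.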
-
@0BZEN We’re talking potentially tens of megabytes, if not hundreds. It’s not as bad as it used to be, but I suspect it could be made much more efficient.
-
@0BZEN ,
Regarding the behavior: do you really need it as a script? Because selecting the text and hitting Ctrl+M will bring up the Mark dialog with the selected text in the Find what field (assuming the normal Settings > Preferences > Searching > ☑ Fill Find Field with Selected Text). Ensuring ☑ Bookmark Line is turned on in the Mark dialog will then bookmark all those same lines, just using Notepad++’s native search, rather than having to go through the plugin. I would think that would be faster. (Though any searches and actions over hundreds of MB will be slow.)
Regarding script optimization: I don’t know of any ways that would for-sure optimize that… If it were my script, I might look into whether I could cache a mapping that would help, to avoid making PythonScript re-count the line number on every call to the method… I might give it a little thought to flesh out the idea some, though I won’t guarantee it will actually be faster than your current behavior.
Update:
I took the lines
```
one one one one one one one one one one one one one one one one one one one one
two two two two two two two two two two two two two two two two two two two two
thr thr thr thr thr thr thr thr thr thr thr thr thr thr thr thr thr thr thr thr
fou fou fou fou fou fou fou fou fou fou fou fou fou fou fou fou fou fou fou fou
fiv fiv fiv fiv fiv fiv fiv fiv fiv fiv fiv fiv fiv fiv fiv fiv fiv fiv fiv fiv
six six six six six six six six six !HERE! six six six six six six six six six
sev sev sev sev sev sev sev sev sev sev sev sev sev sev sev sev sev sev sev sev
eig eig eig eig eig eig eig eig eig eig eig eig eig eig eig eig eig eig eig eig
nin nin nin nin nin nin nin nin nin nin nin nin nin nin nin nin nin nin nin nin
ten ten ten ten ten ten ten ten ten ten ten ten ten ten ten ten ten ten ten ten
```
and Ctrl+A Ctrl+D until there were ~1.3M lines with ~107MB. In this file, 10% of the lines have a match for !HERE!.
Selecting !HERE! and using the Mark dialog took a couple of seconds. Running my script (below) took >15 sec.
```python
# encoding=utf-8
"""in response to https://community.notepad-plus-plus.org/topic/25902/"""
from Npp import notepad, console, editor1
# editor = None  # update: comment this out

def _doit():
    #console.clear()
    #console.show()
    linemap = {}
    for l in range(editor1.getLineCount()):
        p = editor1.positionFromLine(l)
        linemap[p] = l
    # console.write(str(linemap)+"\n\n")

    def match_found(m):
        s = m.start()
        p = s
        l = None
        # console.write("match start={} p={} line={} [before p search]\n".format(s,p,l))
        while p > 0:
            # console.write("{} => {}\n".format(p, p in linemap))
            if p in linemap:
                l = linemap[p]
                break
            else:
                p = p - 1
        # console.write("match start={} p={} line={}\n".format(s,p,l))
        editor1.markerAdd(l, 20)

    editor1.search("!HERE!", match_found)

_doit()
del(_doit)
```
I don’t know whether the caching really helped speed things up or not – if you’re interested, you may try to compare your algorithm to mine on your own data. But what I am sure about is that the Mark dialog was significantly faster than the script equivalent for my example data.
(note: the _doit() function is the algorithm itself; I wrap it in a function and then delete that function at the end to avoid cluttering my PythonScript environment with variables held over from previous script runs, etc.)
update: comment out the line referenced by @Alan-Kilborn, below
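A side note on the script above: its backward position-by-position scan walks one character at a time until it lands on a cached line start, which is slow for long lines. Since the cached line-start positions are sorted, a binary search gives the same answer in O(log n). A self-contained sketch of the idea, not PythonScript-specific (build_line_starts stands in for the values the script collects from editor1.positionFromLine):

```python
import bisect

def build_line_starts(text):
    # Positions at which each line begins; line 0 starts at position 0.
    starts = [0]
    for i, ch in enumerate(text):
        if ch == '\n':
            starts.append(i + 1)
    return starts

def line_from_position(line_starts, pos):
    # Index of the rightmost line start <= pos, found by binary search.
    return bisect.bisect_right(line_starts, pos) - 1

text = "one\ntwo\nthree !HERE!\nfour"
starts = build_line_starts(text)
print(line_from_position(starts, text.find("!HERE!")))  # -> 2
```

In the script above, the equivalent change would be to keep the line starts in a sorted list instead of a dict and replace the while loop with one bisect call; I haven’t benchmarked whether that matters on real data.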
-
Isn’t scripting with bookmarks involved historically slow for some reason? I seem to recall this being discussed a few times here, with no real root cause determined, or solution found to speed things up. :-(
-
@PeterJones said in Little script is very slow (depends on file size):
editor = None
Doesn’t this kill usage of the all-important editor in subsequently-run scripts??
-
@Alan-Kilborn said in Little script is very slow (depends on file size):
Doesn’t this kill usage of the all-important editor in subsequently-run scripts??
So, funny story: when I first started writing/debugging the script, the script was in editor2 and the test file in editor1, and I wouldn’t always remember to click in editor1 before running the script… so when I found I had mixed some editor. and some editor1. during debug, I eventually None’d the editor so that it would flag me if I made that mistake again. And then, by the time the script was fully working, I had forgotten I’d done that…
So when I went to clean up before publishing, I couldn’t figure out why it wasn’t working when I tried to switch back to editor.… Instead of digging into it more, I just left it as-is with the editor1 code. And hence, a stupid bug.
Doesn’t this kill usage of the all-important editor in subsequently-run scripts??
Only if your subsequently-run scripts assume that they come after from Npp import editor or equivalent in your startup; if they always have their own from Npp import editor line, then they will always correctly define editor for their own usage. :-)
-
@PeterJones said in Little script is very slow (depends on file size):
@0BZEN ,
Regarding the behavior: do you really need it as a script?
Not necessarily, no, it’s just very convenient, especially with shortcuts.
Regarding script optimization: I don’t know of any ways that would for-sure optimize that…
Yeah, the search results don’t contain metadata, like line numbers. It might be possible to count the number of ‘\r’ characters since the last hit, using the slow function only for the first search result (or count the ‘\r’ from the beginning of the file).
Not sure if that would equate to the number of lines between search results? Maybe.
Something like:
```cpp
int LineFromPosition(int position, int startpos = 0, int startline = 0)
{
    return startline + CountLineReturns(startpos, position);
}
```
Something like that, but more Python-ey, with maybe a +1/-1 extra line somewhere.
If position < startpos, we’ve gone back toward the top of the file, so we’ll need to do something a bit more clever, but no biggie.
```cpp
int LineFromPosition(int position, int startpos = 0, int startline = 0)
{
    if (position < startpos) {
        // gone back to the top of the file: restart the count
        return LineFromPosition(position, 0, 0);
    } else {
        return startline + CountLineReturns(startpos, position);
    }
}
```
Something like that anyway.
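In Python terms, the checkpoint idea might look like the following sketch. The function name and its (startpos, startline) arguments are hypothetical, and it counts ‘\n’ rather than the ‘\r’ mentioned above, i.e. it assumes Unix line endings; a real version would need to handle CRLF:

```python
def line_from_position(text, position, startpos=0, startline=0):
    """Line number at `position`, counting newlines only from a known
    (startpos, startline) checkpoint instead of from the top of the
    file. Falls back to a full count if we moved backwards."""
    if position < startpos:
        # gone back toward the top of the file: restart from position 0
        return line_from_position(text, position, 0, 0)
    return startline + text.count('\n', startpos, position)

text = "aaa\nbbb\nccc\nddd\n"
l1 = line_from_position(text, 9)          # cold: count from the top
l2 = line_from_position(text, 13, 9, l1)  # warm: count only since last hit
print(l1, l2)  # -> 2 3
```

The search callback would then just remember the previous match’s position and line number between calls.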
if it were my script, I might look into whether I could cache a mapping that would help simplify, to avoid making PythonScript re-count line number from every call to the method…
Possibly, although that will get invalidated when the file content changes.
Selecting !HERE! and using the Mark dialog took a couple seconds. Running my script (below) took >15 sec.
Interesting. I wonder what they do in that function. Not surprised, though.
I don’t know whether the caching really helped speed things up or not – if you’re interested, you may try to compare your algorithm to mine on your own data.
It’s OK, I don’t mind the script being reasonably slow. It used to be far worse; I think the old Notepad++ would lock up. At least now it’s doing the search in a background process (I think).
But what I am sure about is that the Mark dialog was significantly faster than the script equivalent for my example data.
Yup, I may give it a go if it gets very, very slow. I don’t want to spend much time on optimising a tool. As an exercise, though, it could be interesting to develop a method that does that bookmarking fast.
-
@Alan-Kilborn Could it be linked to the same issue I encounter? Using some sort of file-position-to-line-number conversion, which means having to run through from the beginning of the file every time?
Anyway, no big deal. Thanks for your input everyone!
-
@0BZEN said in Little script is very slow (depends on file size):
Not necessarily, no, it’s just very convenient, especially with shortcuts.
I’m just saying that select, Ctrl+M, visually confirm checkboxes, and click Mark All isn’t that onerous… and if your script is “very slow”, it’s sure to be faster than the >15 sec for the script version.
Possibly, although that will get invalidated when the file content changes
Mine doesn’t cache that information from run-to-run. It just precomputes the mapping of the positions-from-lines in a way that only requires going through the whole document once (as far as I can tell), rather than counting from the beginning every time.
Yeah, the search results don’t contain metadata, like line numbers. It might be possible to count the number of ‘\r’ characters since the last hit, using the slow function only for the first search result (or count the ‘\r’ from the beginning of the file).
I’m pretty sure that the extra effort of counting between matches (which isn’t implemented already, so it’d have to be done manually) would be more time-consuming than the current approach.