    Little script is very slow (depends on file size)

    Notepad++ & Plugin Development
    • 0BZEN0
      0BZEN
      last edited by

      Hello,
      I have a little PythonScript that is super important for my line of work, but can be very slow parsing large files. Seems that it gets worse the bigger the file is.

      It’s very simple: you highlight a portion of text, and it will bookmark all the lines containing that portion of text. Then you can copy the bookmarked lines, cut them, etc.

      MARK_BOOKMARK = 20
      
      def match_found(m):
          targetStart = m.span(0)[0]
          lineNumber = editor.lineFromPosition(targetStart)
          editor.markerAdd(lineNumber, MARK_BOOKMARK)
         
      pattern = editor.getSelText()
      if pattern != '':
          editor.search(pattern, match_found) 
      

      I suspect editor.lineFromPosition(targetStart) is where the problem is. As it starts bookmarking lower and lower down the file, it gets increasingly slow.

      Would be nice to speed this up a bit. Not quite sure what to do. I suspect the search result ‘m’ should have the line number (m.lineNumber, … that kind of thing) when it finds a pattern somewhere in the file, so there would be no need for the clumsy conversion.

      • 0BZEN0
        0BZEN @0BZEN
        last edited by

        @0BZEN We’re talking potentially tens of megabytes, if not hundreds. It’s not as bad as it used to be, but I suspect it could be made much more efficient.

        • PeterJonesP
          PeterJones @0BZEN
          last edited by PeterJones

          @0BZEN ,

          Regarding the behavior: do you really need it as a script? Because selecting the text and hitting Ctrl+M will bring up the Mark dialog with the selected text in the Find what field (assuming normal Settings > Preferences > Searching > ☑ Fill Find Field with Selected Text). Ensuring ☑ Bookmark Line is turned on in the Mark dialog will then bookmark all those same lines, just using Notepad++'s native search, rather than having to go through the plugin. I would think that would be faster. (Though any searches and actions over hundreds of MB will be slow.)

          Regarding script optimization: I don’t know of any ways that would for-sure optimize that… if it were my script, I might look into whether I could cache a mapping that would help simplify, to avoid making PythonScript re-count line numbers on every call to the method… I might give it a little thought to flesh out the idea some, though I won’t guarantee it will actually be faster than your current behavior.

          Update:

          I took the lines

          one one one one one one one one one one one one one one one one one one one one 
          two two two two two two two two two two two two two two two two two two two two 
          thr thr thr thr thr thr thr thr thr thr thr thr thr thr thr thr thr thr thr thr 
          fou fou fou fou fou fou fou fou fou fou fou fou fou fou fou fou fou fou fou fou 
          fiv fiv fiv fiv fiv fiv fiv fiv fiv fiv fiv fiv fiv fiv fiv fiv fiv fiv fiv fiv 
          six six six six six six six six six !HERE!  six six six six six six six six six 
          sev sev sev sev sev sev sev sev sev sev sev sev sev sev sev sev sev sev sev sev 
          eig eig eig eig eig eig eig eig eig eig eig eig eig eig eig eig eig eig eig eig 
          nin nin nin nin nin nin nin nin nin nin nin nin nin nin nin nin nin nin nin nin 
          ten ten ten ten ten ten ten ten ten ten ten ten ten ten ten ten ten ten ten ten 
          

          and repeated Ctrl+A, Ctrl+D until the file had ~1.3M lines at ~107MB. In this file, 10% of the lines have a match for !HERE!.

          Selecting !HERE! and using the Mark dialog took a couple seconds. Running my script (below) took >15sec.

          # encoding=utf-8
          """in response to https://community.notepad-plus-plus.org/topic/25902/"""
          from Npp import notepad,console,editor1
          # editor = None # update: comment this out
          
          def _doit():
              #console.clear()
              #console.show()
              linemap = {}
              for l in range(editor1.getLineCount()):
                  p = editor1.positionFromLine(l)
                  linemap[p] = l
          
              # console.write(str(linemap)+"\n\n")
          
              def match_found(m):
                  s = m.start()
                  p = s
                  l = None
          
                  # console.write("match start={} p={} line={} [before p search]\n".format(s,p,l))
          
                  while p >= 0:  # >= so a match on the first line can reach position 0 (line 0)
                      # console.write("{} => {}\n".format(p, p in linemap))
                      if p in linemap:
                          l = linemap[p]
                          break
                      else:
                          p = p - 1
          
                  # console.write("match start={} p={} line={}\n".format(s,p,l))
                  editor1.markerAdd(l,20)
          
              editor1.search("!HERE!", match_found)
          
          _doit()
          del(_doit)
          

          I don’t know whether the caching really helped speed things up or not – if you’re interested, you may try to compare your algorithm to mine on your own data. But what I am sure about is that the Mark dialog was significantly faster than the script equivalent for my example data.

          (note: the _doit() function is the algorithm itself; I wrap it in a function and then delete that function at the end to avoid cluttering my PythonScript with variables held from previous script runs, etc.)
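
          A minimal, plain-Python sketch of that wrap-and-delete pattern, outside of Notepad++ (the names scratch and result are only for illustration): anything assigned inside the function stays local to it, and deleting the function afterwards removes the last leftover name.

```python
def _doit():
    # 'scratch' is local to _doit, so it is gone once the function returns
    scratch = {"lines": 3}
    return scratch["lines"]

result = _doit()   # only the value we explicitly keep survives
del _doit          # remove the helper itself from the shared namespace
```

          After this runs, neither scratch nor _doit is defined at the top level; only result remains.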

          update: comment out the line referenced by @Alan-Kilborn , below

          • Alan KilbornA
            Alan Kilborn
            last edited by

            Isn’t scripting that involves bookmarks historically slow for some reason? I seem to recall this being discussed a few times here, with no real root cause determined, or solution found to speed things up. :-(

            • Alan KilbornA
              Alan Kilborn @PeterJones
              last edited by

              @PeterJones said in Little script is very slow (depends on file size):

              editor = None

              Doesn’t this kill usage of the all-important editor in subsequently-run scripts??

              • PeterJonesP
                PeterJones @Alan Kilborn
                last edited by

                @Alan-Kilborn said in Little script is very slow (depends on file size):

                Doesn’t this kill usage of the all-important editor in subsequently-run scripts??

                So, funny story: when I first started writing/debugging the script, the script was in editor2 and the test file in editor1, and I wouldn’t always remember to click in editor1 before running the script… so when I found I had mixed some editor. and some editor1. during debug, I eventually Noned the editor so that it would flag me if I made that mistake again.

                And then, by the time the script was fully working, I had forgotten I’d done that…

                So when I went to clean up before publishing, I couldn’t figure out why it wasn’t working when I tried to switch back to editor. … Instead of digging into it more, I just left it as-is with the editor1 code. And hence, a stupid bug.

                Doesn’t this kill usage of the all-important editor in subsequently-run scripts??

                Only if your subsequently-run scripts assume that they come after from Npp import editor or equivalent in your startup; if they always have their own from Npp import editor line, then they will always correctly define editor for their own usage. :-)

                • 0BZEN0
                  0BZEN @0BZEN
                  last edited by 0BZEN

                  @PeterJones said in Little script is very slow (depends on file size):

                  @0BZEN ,

                  Regarding the behavior: do you really need it as a script?

                  Not necessarily, no, it’s just very convenient, especially with shortcuts.

                  Regarding script optimization: I don’t know of any ways that would for-sure optimize that…

                  Yeah, the search results don’t contain metadata such as line numbers. It might be possible to use the slow function only for the first search result (or count from the beginning of the file), and after that count the number of '\r' characters since the previous hit.

                  Not sure if that would equate to the number of lines between search results? Maybe.

                  something like

                  int LineFromPosition(int position, int startpos=0, int startline=0)
                  {
                      return startline + CountLineReturns(startpos, position);
                  }
                  

                  Something like that, more python-ey, with maybe a +1 / -1 extra line somewhere.

                  If position < startpos, we’ve gone back towards the top of the file, so it will need to do something a bit more clever, but no biggie.

                  int LineFromPosition(int position, int startpos=0, int startline=0)
                  {
                      // moved backwards: recount from the top of the file
                      if (position < startpos)
                      {
                          return LineFromPosition(position, 0, 0);
                      }
                      else
                      {
                          return startline + CountLineReturns(startpos, position);
                      }
                  }
                  

                  Something like that anyway.
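
                  A more python-ey sketch of the same idea, under some assumptions: it works on a plain string rather than the editor buffer, it counts '\n' (so it copes with Windows and Unix line endings but not bare '\r'), and the class and method names are made up for illustration.

```python
class IncrementalLineCounter(object):
    """Line number of a position, counting newlines only since the last query."""

    def __init__(self, text):
        self.text = text
        self.last_pos = 0    # position of the previous query
        self.last_line = 0   # zero-based line number at last_pos

    def line_from_position(self, position):
        if position < self.last_pos:
            # Went backwards: restart the count from the top of the text.
            self.last_pos, self.last_line = 0, 0
        # Every '\n' in [last_pos, position) ends one line before position.
        self.last_line += self.text.count("\n", self.last_pos, position)
        self.last_pos = position
        return self.last_line


text = "aa\nbb HIT\ncc\ndd HIT\n"
counter = IncrementalLineCounter(text)
first = text.find("HIT")              # on the second line
second = text.find("HIT", first + 1)  # on the fourth line
lines = [counter.line_from_position(first),
         counter.line_from_position(second)]
```

                  As long as the hits arrive in increasing position order, each stretch of text is scanned only once; whether that beats the built-in conversion inside PythonScript would need measuring.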

                  if it were my script, I might look into whether I could cache a mapping that would help simplify, to avoid making PythonScript re-count line numbers on every call to the method…

                  Possibly, although that will get invalidated when the file content changes.

                  Selecting !HERE! and using the Mark dialog took a couple seconds. Running my script (below) took >15sec.

                  Interesting. I wonder what they do in that function. Not surprised though.

                  I don’t know whether the caching really helped speed things up or not – if you’re interested, you may try to compare your algorithm to mine on your own data.

                  It’s OK, I don’t mind the script being reasonably slow. It used to be far worse. I think older Notepad++ versions would lock up; at least now it’s doing the search in a background process (I think).

                  But what I am sure about is that the Mark dialog was significantly faster than the script equivalent for my example data.

                  Yup, I may give it a go if it gets very, very slow. I don’t want to spend much time on optimising a tool. As an exercise, it could be interesting to develop a method that does the bookmarking fast.

                  • 0BZEN0
                    0BZEN @Alan Kilborn
                    last edited by

                    @Alan-Kilborn Could it be linked to the same issue I encounter? Using some sort of file-position-to-line-number conversion, which means having to run through the file from the beginning every time?

                    Anyway, no big deal. Thanks for your input everyone!

                    • PeterJonesP
                      PeterJones @0BZEN
                      last edited by

                      @0BZEN said in Little script is very slow (depends on file size):

                      Not necessarily, no, it’s just very convenient, especially with shortcuts.

                      I’m just saying, select, Ctrl+M, visually confirm checkboxes, and click Mark All isn’t that onerous… and if your script is “very slow”, the Mark dialog is sure to be faster than the >15sec that the script version took.

                      Possibly, although that will get invalidated when the file content changes

                      Mine doesn’t cache that information from run-to-run. It just precomputes the mapping of the positions-from-lines in a way that only requires going through the whole document once (as far as I can tell), rather than counting from the beginning every time.
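
                      For what it’s worth, here is a rough sketch of a variation on that precomputation, not what my script does: keep the line-start positions in a sorted list and binary-search it with the standard bisect module, instead of stepping backwards one position at a time. It runs on a plain string so it can be tried outside Notepad++; inside PythonScript, I would assume the list could be filled from editor.positionFromLine() the same way my linemap is. The function names are just for illustration.

```python
import bisect

def build_line_starts(text):
    """Return the sorted start offset of every line, in one pass over text."""
    starts = [0]
    pos = text.find("\n")
    while pos != -1:
        starts.append(pos + 1)          # the next line begins after the newline
        pos = text.find("\n", pos + 1)
    return starts

def line_from_position(starts, position):
    """Zero-based line containing position, found by binary search."""
    return bisect.bisect_right(starts, position) - 1


text = "aa\nbb HIT\ncc\ndd HIT\n"
starts = build_line_starts(text)
hit_lines = []
pos = text.find("HIT")
while pos != -1:
    hit_lines.append(line_from_position(starts, pos))
    pos = text.find("HIT", pos + 1)
```

                      Each lookup is then logarithmic in the number of lines rather than proportional to the distance back to the line start, though I haven’t measured whether that makes a noticeable difference next to the cost of the bookmarking itself.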

                      Yeah, the search results don’t contain metadata such as line numbers. It might be possible to use the slow function only for the first search result (or count from the beginning of the file), and after that count the number of '\r' characters since the previous hit.

                      I’m pretty sure that the extra effort of counting between matches (which isn’t implemented already, so it’d have to be done manually) would be more time-consuming than the current approach.
