Poor performance removing blank lines



  • With a 600,000 line .I preprocessor file:

    Edit→Line Operations→Remove Empty Lines? >3 minutes

    Ctrl+A→TextFX→TextFX Edit→Delete Blank Lines? 1 second

    :/



  • @endolith

    3 minutes does seem excessive.
    Maybe still beats the time it would take to do it by hand, though. :-)

    What’s your timing for a regex replacement operation with that data?

    find: ^\R
    repl: nothing
    search mode: Regular expression



  • @endolith ,

    For me, with a 1M-line file (about 30% blank lines and 70% 300-character lines) in Notepad++ v7.9.5 32-bit, the built-in action took no more than 30 seconds, whereas TextFX took considerably longer (multiple minutes).

    The exact durations may depend on the density of text and maybe other factors.

    My guess is that if there’s a difference in time for you, and if Notepad++ really is slower, it’s because ::removeEmptyLine() code invokes the regex engine, rather than looking at the lines manually.

    But again, my experiment showed that TextFX took considerably longer.



  • @PeterJones said in Poor performance removing blank lines:

    it’s because ::removeEmptyLine() code invokes the regex engine, rather than looking at the lines manually.

    You raise an interesting point here.
    Which choice (of those two) would be faster/slower, on a fixed data set?

    It is also interesting that Notepad++ uses the regex ^$(\\r\\n|\\r|\\n) which seems like it would be more “effort” than ^\R, but that could be misleading as well.

    (Note that I don’t care one iota about the obsolete TextFX)
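
One rough way to put numbers on the "regex engine vs. looking at the lines manually" question is a small standalone timing sketch. This is only an illustration in plain Python, not the Notepad++ code: it uses Python's `re` module rather than Boost, the function names (`remove_empty_regex`, `remove_empty_scan`, `time_both`) are made up for the example, and it assumes the text uses only `\n` or `\r\n` line endings.

```python
import re
import timeit

# Python's re has no \R, so the line-ending alternatives are spelled out.
# Assumption: only \n and \r\n endings occur (no lone \r).
EMPTY_LINE = re.compile(r"^(?:\r\n|\n)", re.MULTILINE)


def remove_empty_regex(text):
    """Delete lines that consist only of a line ending, using the regex engine."""
    return EMPTY_LINE.sub("", text)


def remove_empty_scan(text):
    """Delete the same lines by walking over them one by one, without a regex."""
    return "".join(
        line for line in text.splitlines(True) if line not in ("\r\n", "\n")
    )


def time_both(text, repeat=3):
    """Return the best-of-`repeat` wall time in seconds for each strategy."""
    return {
        "regex": min(timeit.repeat(lambda: remove_empty_regex(text), number=1, repeat=repeat)),
        "scan": min(timeit.repeat(lambda: remove_empty_scan(text), number=1, repeat=repeat)),
    }


if __name__ == "__main__":
    # Fixed data set: 300-character lines, each followed by one empty line.
    sample = ("x" * 300 + "\r\n" + "\r\n") * 20000
    print(time_both(sample))
```

Whatever this prints only says something about the two strategies in Python on that particular data; it may not carry over to the Boost engine inside Notepad++.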



  • @Alan-Kilborn said in Poor performance removing blank lines:

    (Note that I don’t care one iota about the obsolete TextFX)

    The reason I cared enough to install it on a 32-bit NPP was that, if I had confirmed that TextFX took 1/180th of the time of the built-in command, I was going to suggest to the developers that they look into the algorithm that TextFX used and see if they could borrow from it. But since it was slower in my experiments, there isn’t anything “magical” about their algorithm.

    I also tried

    def delTrulyEmpty(contents, lineNumber, totalLines):
        if contents.strip('\r\n') == "":
            editor.deleteLine(lineNumber)
            # the next line has moved up into this slot, so advance zero lines
            return 0
    
    editor.beginUndoAction()
    editor.forEachLine(delTrulyEmpty)
    editor.endUndoAction()
    

    … but that was the slowest so far.

    ^$(\\r\\n|\\r|\\n) which seems like it would be more “effort” than ^\R

    It probably depends on how \R is defined under the Boost regex engine’s hood.



  • This is quite fast

    editor.setText(''.join(x for x in editor.getText().splitlines(True) if x.strip() != ''))
    


  • @Ekopalypse said in Poor performance removing blank lines:

    This is quite fast

    It almost seems like we need a sample file and then some benchmarking, for all the solutions proposed. :-)

    Eko’s one-liner removes empty lines AND lines containing only whitespace. Since N++ makes a distinction (by having two separate menu commands) for those, maybe a one-liner for removing only empty lines is in order?

    I don’t know if it is totally correct, but I came up with this one:

    editor.setText('\r\n'.join(x for x in editor.getText().splitlines() if x != ''))
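
One thing to watch with the join-based version: it rewrites every line ending as `\r\n` and drops the line ending after the last line. A possible variant that keeps whatever endings the file already has is sketched below as a plain function on a string. The name `remove_empty_lines` is hypothetical; in PythonScript it would presumably be wired up as `editor.setText(remove_empty_lines(editor.getText()))`.

```python
def remove_empty_lines(text):
    """Drop lines that hold nothing but a line ending.

    Whitespace-only lines are kept, and every surviving line keeps
    its original line ending (\\r\\n, \\n or \\r).
    """
    # splitlines(True) keeps the line ending attached to each line,
    # so joining with '' restores the surviving lines unchanged.
    kept = (line for line in text.splitlines(True) if line.strip("\r\n") != "")
    return "".join(kept)
```

For example, `remove_empty_lines("a\r\n\r\n  \r\nb\n\nc")` gives `"a\r\n  \r\nb\nc"`: the two truly empty lines are gone, the line containing only spaces stays.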
    


  • @Alan-Kilborn

    how about

    editor.setText('A line with some content\n\n'*1000000)
    

    ?



  • @Ekopalypse

    Well, I guess.
    Sometimes there’s an art to data creation, though.
    For instance, perhaps really long lines impact how fast something will run.
    Perhaps the ratio of empty to non-empty lines makes a difference.
    Perhaps…perhaps…perhaps…

    I suppose it would have been best to have the OP’s data file, since it was the one that had the original complaint…
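
Lacking the OP's file, a parameterised generator would at least let the line length and the blank-line ratio be varied independently. The following is a minimal sketch in plain Python; `make_sample` and its parameters are invented for this example, and the defaults simply mirror the 30% blank / 300-character test described earlier in the thread.

```python
import random


def make_sample(total_lines, blank_ratio=0.3, line_length=300, eol="\r\n", seed=0):
    """Build test text with a controllable share of empty lines.

    total_lines  -- number of lines to generate
    blank_ratio  -- probability (0.0 to 1.0) that a given line is empty
    line_length  -- length of every non-empty line
    eol          -- line ending appended to every line
    seed         -- fixed seed so the same data can be regenerated
    """
    rng = random.Random(seed)
    content = "x" * line_length
    lines = ["" if rng.random() < blank_ratio else content for _ in range(total_lines)]
    return eol.join(lines) + eol
```

In PythonScript the result could then be loaded with something like `editor.setText(make_sample(1000000))` before timing each of the proposed approaches.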

