Community
    • Login

    FunctionList: v8.8.7 Perl parser is apparently inefficient

    Scheduled Pinned Locked Moved Help wanted · · · – – – · · ·
    functionlistregex
    3 Posts 2 Posters 58 Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • PeterJonesP
      PeterJones
      last edited by PeterJones

      Based on older conversations here, I had been using an advanced version of the Perl parser for the Notepad++ FunctionList, which handles Classes (including, in Perl nomenclature, “Packages”) and Functions, compared to the simple FunctionList definition which had been shipped by default with Notepad++.

      When Issue #17043 was submitted, I was prompted to submit my advanced Perl parser to be distributed, and it was released with v8.8.7.

      However, soon after the release, Issue #17152 was created, claiming that even for simple Perl code, it can take multiple seconds when they switch tabs into Perl code – even for the unitTest file. I obviously hadn’t seen that over my 5 years of actively using my advanced Perl FL, but I don’t doubt that there might be inefficiencies in the regex used…

      The FunctionList definition shipped with v8.8.7/8.8.8 is here.

      I know that @guy038 enjoys trying to make regex more efficient, and that @MAPJe71 is much better at FL definitions than I am.

      So if anyone wants to take a look at provide some efficiency advice on this FL, that would be great.

      The expectations are that it would need to have the same unitTest results in the automated process, which looks like the following FL in the GUI:
      65afebab-025e-469b-8fbb-9a979d3c7430-image.png

      I am not great at determining regex efficiency, so I don’t know what might be able to be done to the regex to help the multiple users in the new issue, without giving up and reverting to the less-functional parser that was distributed in v8.8.6-and-earlier

      CoisesC 1 Reply Last reply Reply Quote 0
      • CoisesC
        Coises @PeterJones
        last edited by

        @PeterJones:

        I’m not familiar with Perl or with Notepad++ function lists, but just on general regex principles, this bit at line 60 and again at line 84 looks suspicious:

        (?:\s*\([^()]*\))?
        (?:\s*\:\s*[^{]+)? 
        \s*\{
        

        especially because [^{] includes \s; so a lot of pointless retrying could occur. I think this:

        \s*+(?:\([^()]*+\)\s*+)?+
        (?:\:[^{]+)?+
        \{
        

        says the same thing without as much retrying. (You would think that not all those plus signs should be needed, and some might not be; but at least with the boost:regex engine in Notepad++, I’ve seen evidence that sometimes what seem like brain-dead obvious optimizations you would expect the engine to make are not in fact applied.)

        However, since \s*[^{]+ matches the same strings as [^{]+ (and makes the following \s* redundant when it matches), I suspect that might not be what you meant to write. What you have says, after a colon there has to be a string of one or more characters, which can be anything other than a left brace (including just one blank, line feed, etc.), then a left brace. If you add a possessive qualifier to the space — \s*+[^{]+ — then there has to be at least one non-space character other than a left brace present after the colon and before the left brace (because if they’re all space characters the possessive \s*+ will “eat them up” and there won’t be anything left to match [^{]+). So the three lines would be:

        \s*+(?:\([^()]*+\)\s*+)?+
        (?:\:\s*+[^{]+)?+
        \{
        

        if you meant to require at least one non-space character between the colon and the left brace.

        I am not great at determining regex efficiency

        The main principle is to think through the matching from left to right and try to write it so that at each character, there is only one choice: match, or don’t. In the three lines we started with, suppose the text to be matched starts with a blank. Is it the optional space at the start of the first line, or the one at the start of the second line, or the one at the start of the third line? Each possibility has to be tried to see if it works out.

        In my rewrites, the first thing that happens is that any leading spaces are matched by \s*+; there’s no retracing our steps later. Now, either the next character is a left parenthesis or it isn’t, so we immediately know whether we do or don’t use the first optional group. At the end of the first optional group, if we use it, we match all space characters again, so when we get to the second optional group, either there is a colon or there isn’t. And so on. Use possessive groups whenever possible. And assume the regex engine is dumb as dirt… just because you can “see” that some retry is pointless doesn’t mean it won’t plod along trying matches that can’t possibly work anyway.

        PeterJonesP 1 Reply Last reply Reply Quote 0
        • PeterJonesP
          PeterJones @Coises
          last edited by

          @Coises ,

          Thanks for the suggestions.

          When I tried updating those, with the possessives for the spaces, I could still notice the delay (if anything, it moved from about half a second to a full second when I would switch tabs).

          But I decided at that point to go back to the v<=8.8.6 version, and start adding back in “features” of the updated version until it slowed down (because even in mine, the v8.8.6 was near instantaneous for the unitTest, whereas the v8.8.8 was noticeable, but usually less than a second). And I was surprised that it slowed down on the first thing I tried: the final <className><nameExpr ...> was what was slowing it down.

          So looking at the v8.8.8 nameExpr at line 99, \s\K((::)?\w+)+(?=::), the (::)?\w+ falls prey to the uncertainty, and probably backtracked a lot. So I changed to sub\s+\K\w+(::\w+)*(?=::\w), and now the ambiguity causing backtracking goes away, and it’s back to nearly instantaneous

          So with your possessives for the space, and reworking the className nameExpr, I think it’s probably much better. I will try to share that one with the users from the issue, and see if it improves things for them, as well.

          1 Reply Last reply Reply Quote 0
          • First post
            Last post
          The Community of users of the Notepad++ text editor.
          Powered by NodeBB | Contributors