Community
    • Login

    FunctionList: v8.8.7 Perl parser is apparently inefficient

    Scheduled Pinned Locked Moved Help wanted · · · – – – · · ·
    functionlistregex
    6 Posts 3 Posters 114 Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • PeterJonesP
      PeterJones
      last edited by PeterJones

      Based on older conversations here, I had been using an advanced version of the Perl parser for the Notepad++ FunctionList, which handles Classes (including, in Perl nomenclature, “Packages”) and Functions, compared to the simple FunctionList definition which had been shipped by default with Notepad++.

      When Issue #17043 was submitted, I was prompted to submit my advanced Perl parser to be distributed, and it was released with v8.8.7.

      However, soon after the release, Issue #17152 was created, claiming that even for simple Perl code, it can take multiple seconds when they switch tabs into Perl code – even for the unitTest file. I obviously hadn’t seen that over my 5 years of actively using my advanced Perl FL, but I don’t doubt that there might be inefficiencies in the regex used…

      The FunctionList definition shipped with v8.8.7/8.8.8 is here.

      I know that @guy038 enjoys trying to make regex more efficient, and that @MAPJe71 is much better at FL definitions than I am.

      So if anyone wants to take a look at provide some efficiency advice on this FL, that would be great.

      The expectations are that it would need to have the same unitTest results in the automated process, which looks like the following FL in the GUI:
      65afebab-025e-469b-8fbb-9a979d3c7430-image.png

      I am not great at determining regex efficiency, so I don’t know what might be able to be done to the regex to help the multiple users in the new issue, without giving up and reverting to the less-functional parser that was distributed in v8.8.6-and-earlier

      CoisesC 1 Reply Last reply Reply Quote 0
      • CoisesC
        Coises @PeterJones
        last edited by

        @PeterJones:

        I’m not familiar with Perl or with Notepad++ function lists, but just on general regex principles, this bit at line 60 and again at line 84 looks suspicious:

        (?:\s*\([^()]*\))?
        (?:\s*\:\s*[^{]+)? 
        \s*\{
        

        especially because [^{] includes \s; so a lot of pointless retrying could occur. I think this:

        \s*+(?:\([^()]*+\)\s*+)?+
        (?:\:[^{]+)?+
        \{
        

        says the same thing without as much retrying. (You would think that not all those plus signs should be needed, and some might not be; but at least with the boost:regex engine in Notepad++, I’ve seen evidence that sometimes what seem like brain-dead obvious optimizations you would expect the engine to make are not in fact applied.)

        However, since \s*[^{]+ matches the same strings as [^{]+ (and makes the following \s* redundant when it matches), I suspect that might not be what you meant to write. What you have says, after a colon there has to be a string of one or more characters, which can be anything other than a left brace (including just one blank, line feed, etc.), then a left brace. If you add a possessive qualifier to the space — \s*+[^{]+ — then there has to be at least one non-space character other than a left brace present after the colon and before the left brace (because if they’re all space characters the possessive \s*+ will “eat them up” and there won’t be anything left to match [^{]+). So the three lines would be:

        \s*+(?:\([^()]*+\)\s*+)?+
        (?:\:\s*+[^{]+)?+
        \{
        

        if you meant to require at least one non-space character between the colon and the left brace.

        I am not great at determining regex efficiency

        The main principle is to think through the matching from left to right and try to write it so that at each character, there is only one choice: match, or don’t. In the three lines we started with, suppose the text to be matched starts with a blank. Is it the optional space at the start of the first line, or the one at the start of the second line, or the one at the start of the third line? Each possibility has to be tried to see if it works out.

        In my rewrites, the first thing that happens is that any leading spaces are matched by \s*+; there’s no retracing our steps later. Now, either the next character is a left parenthesis or it isn’t, so we immediately know whether we do or don’t use the first optional group. At the end of the first optional group, if we use it, we match all space characters again, so when we get to the second optional group, either there is a colon or there isn’t. And so on. Use possessive groups whenever possible. And assume the regex engine is dumb as dirt… just because you can “see” that some retry is pointless doesn’t mean it won’t plod along trying matches that can’t possibly work anyway.

        PeterJonesP 1 Reply Last reply Reply Quote 1
        • PeterJonesP
          PeterJones @Coises
          last edited by

          @Coises ,

          Thanks for the suggestions.

          When I tried updating those, with the possessives for the spaces, I could still notice the delay (if anything, it moved from about half a second to a full second when I would switch tabs).

          But I decided at that point to go back to the v<=8.8.6 version, and start adding back in “features” of the updated version until it slowed down (because even in mine, the v8.8.6 was near instantaneous for the unitTest, whereas the v8.8.8 was noticeable, but usually less than a second). And I was surprised that it slowed down on the first thing I tried: the final <className><nameExpr ...> was what was slowing it down.

          So looking at the v8.8.8 nameExpr at line 99, \s\K((::)?\w+)+(?=::), the (::)?\w+ falls prey to the uncertainty, and probably backtracked a lot. So I changed to sub\s+\K\w+(::\w+)*(?=::\w), and now the ambiguity causing backtracking goes away, and it’s back to nearly instantaneous

          So with your possessives for the space, and reworking the className nameExpr, I think it’s probably much better. I will try to share that one with the users from the issue, and see if it improves things for them, as well.

          PeterJonesP 1 Reply Last reply Reply Quote 0
          • PeterJonesP
            PeterJones @PeterJones
            last edited by

            @PeterJones said in FunctionList: v8.8.7 Perl parser is apparently inefficient:

            I will try to share that one with the users from the issue, and see if it improves things for them, as well.

            Got two confirmations for people who had the problem: the optimized regex improve performance dramatically. So I’ve submitted the updated parser to the codebase: hopefully, it will be released in the next version.

            @Coises , thanks for the suggestion.

            1 Reply Last reply Reply Quote 2
            • guy038G
              guy038
              last edited by guy038

              Hello, @peterjones, @coises and All,

              Hum… Peter, I did not understand how the unitTest.expected.result is supposed to work !

              Sorry for being such a beginner, in this regard. I tried to rename this file as unitTest.pl. without any improvement ! The Function List panel is desperately empty, after the unitTest.pl name :-((

              … And I cannot get your snapshot showing a functional Function List panel, with name UnitTest. What obvious thing am I forgetting ?

              Best Regards,

              guy038

              PeterJonesP 1 Reply Last reply Reply Quote 0
              • PeterJonesP
                PeterJones @guy038
                last edited by

                Hum… Peter, I did not understand how the unitTest.expected.result is supposed to work !

                unitTest.expected.result is the JSON file that Notepad++ uses during its automated tests to check the result (User Manual > Function List > Unit Test File Is Absent describes how to manually run).

                But the unitTest.expected.result is not the file you should use to test the Perl lexer. The unitTest file is Perl source code, and is the file that Notepad++ needs to load.

                If you were trying to get a FunctionList on the unitTest.expected.result, that’s not going to happen, which probably explains the rest of your questions.

                For the unitTest file, which is Perl code, it has a perl shebang line, so Notepad++ can auto-recognize it. But if, for some strange reason it doesn’t, renaming unitTest to unitTest.pl or manually choosing Language > P > Perl should be sufficient to get FunctionList to display (maybe with clicking the REFRESH button on the FL panel)

                But with this perl.xml in the pull request, the users who were having performance issues with the v8.8.7/8 parser have reported that performance issues have gone away. So I think we’ve solved it, at this point.

                1 Reply Last reply Reply Quote 0
                • First post
                  Last post
                The Community of users of the Notepad++ text editor.
                Powered by NodeBB | Contributors