functionList not ignoring comments

mpheath

Thanks for the files. This is the output of the zones…

commentZone: 51, 209
commentZone: 213, 409
classZone: 0, 203
classZone: 301, 460
funcZone: 462, 487

This is the visual (Disregard green at end as it is missing a closing span tag). The Class match is from the start of the document (pos 0) and ends at endclass (pos 203) inside the comment zone. Code within comments are invalid as a match. This class match is not within a comment zone so is a valid match as far as the parser is concerned. Making it invalid because it is partly into a comment zone would IMO lose the Class from being displayed in the functionlist panel and would be a worse result.

Looking at the pascal.xml function list file, I do not see patterns trying to locate the end keyword of a Class, Function or Procedure. So in testd.xml , is it necessary to match to endclass to acknowledge a Class definition?

Class definition:

class ccs_Object(vInstanceId) of Timer() custom

I do not know the fine details of recognizing class functions from ordinary functions with the functionlist patterns so @dinkumoil may be able to help with that.

Greediness in the patterns seems to be a problem with the functionlist parsing so suggest to avoid using greedy patterns. The larger the match, the more vunerable it is to run into an issue like this one. Try to aim for the smallest match to achieve a successful result. I consider that @dinkumoil has achieved that with pascal.xml.

Lycan Thrope

@mpheath ,
What am I looking at there? Is that the character/caret positions?

To answer your question, yes, we needed to have the endclass found to close the class. That regex was giving me trouble, as you point out, trying to find only the least amount. @PeterJones , @guy038 , and a few other users here helped formulate the needed regex in that functionList parser file. We needed that to make the regex run the entire length down to the endclass so we could find and close the Class definition.

The dBASE Plus OOP isn’t like C++, where typically, one file = one class. We can have multiple classes defined per file, and thanks to @PeterJones , we also were able to have our objects display like functions inside a Class definition, because they are basically UI objects inside a Class…they technically are other class objects being defined inside a class. So yes, it needs to find the endclass to close one class, and find another.

We can’t use the OpenSymbole/CloseSymbole delimiters like C++ can. We tried putting the terms class/endclass inside those, but I suspect they are only meant for single characters as we never were able to get them to be used for identifying the open/close it needed to frame the Class definitions.

But, again…the endclass should be ignored, inside comments. Period. :)

Edit: Yes, they are character/caret positions, so I see and you see what I’m saying. :) The numbers are off, I suspect for some kind of whitespace offset or newline offset. I finally realized you had a link for me to follow to see that graphic. Yes…how did you produce that?

mpheath

@Lycan-Thrope

Edit: Yes, they are character/caret positions, so I see and you see what I’m saying. :) The numbers are off, I suspect for some kind of whitespace offset or newline offset. I finally realized you had a link for me to follow to see that graphic. Yes…how did you produce that?

They are positions of start and end pairs of a match. I consider the start pos as off by 1. Produced by stdout with some debugging code that the Python script captures and processes it into html. It gives a reasonable visual of the matches, though not perfect as the missing unclosed span tag probably due to overlapping zones.

But, again…the endclass should be ignored, inside comments. Period. :)

You can state that, though the reality is that you may need to work with the parser as it is. I could state it is the pattern that needs fixing as it trys to handle everything from class to endclass. Demands have no relevence without reason.

To answer your question, yes, we needed to have the endclass found to close the class.

Not sure why. The pattern searches to the endclass and yet the pattern does not properly handle code in comments. Seems like an issue with the pattern as it wants to do this rather large match. To blame the parser for the greedy pattern seems unreasonable.

The subject is not about the C++ functionlist so is opening a new argument. I mentioned the pascal.xml which Pascal uses begin and end, though as I mentioned, the Pascal patterns do not look for the end keyword.

The way I see it is that the patterns need to work with the parser. The parser will probably not be changed to make up for patterns that fail to work due to being large. The maintainer IMO will not risk all the functionlist patterns in use to make you or your developer happy even if it were possible. There is a bash functionlist file issue at least 2 years old and risking the other functionlist files could be a bad option. Suggest you fix the pattern.

Lycan Thrope

@mpheath said in functionList not ignoring comments:

You can state that, though the reality is that you may need to work with the parser as it is. I could state it is the pattern that needs fixing as it trys to handle everything from class to endclass. Demands have no relevence without reason.

I will have to respectfully, disagree.
The pattern works. It has been working. It continues to work.

The problem is the parser, and in particular, the commentZones code, in this instance.

If developers don’t comment their code, the parser is safe, but that’s not realistic, either. The simple fact is, the parser doesn’t recognize comments inside a class. It recognizes it before, or after, but not inside…particularly when those comments have actual keywords. Perhaps, you could say, it’s the Zone that is the problem, as it doesn’t recognize when something (a class in this instance) has not ended, but has comments started inside it. Multi-line comments, in this case, and it is the parsers weakness that it doesn’t realize it’s still inside a class. That’s why the classrange element has these:

openSymbole ="\{"
closeSymbole="\}"

And the parser makes calls like this:

bool FunctionParsersManager::getZonePaserParameters(TiXmlNode *classRangeParser, generic_string &mainExprStr, generic_string &openSymboleStr, generic_string &closeSymboleStr, std::vector<generic_string> &classNameExprArray, generic_string &functionExprStr, std::vector<generic_string> &functionNameExprArray)

It can’t possibly parse a zone, if it doesn’t know the beginning and end of it.

It’s because the parser doesn’t allow keywords to be used in the open and close Symbole, that having to find endclass became necessary.

Another way to work around this is not to put block comments inside a class, because the parser doesn’t work like a real parser. It’s a quasi-parser. That kind of encourages not commenting code, which seems like it is encouraging a bad habit that is already rampant. Either way, you can’t reasonably expect the parser to work properly if it doesn’t allow the parserLanguage.xml to define it’s open and close pattern, and that’s what this parser is doing, and so this pattern was improvised to make the parser function.

So please, don’t disparage the pattern. I’d like to think some really talented people created it, sans myself, right @PeterJones ? :)

mpheath

@Lycan-Thrope Try changing line 91 in testd.xml

						  (?si:.*?^\h*endclass)           #  must match all the way to 'endclass'

to

						  (?si:.*?^\h*(?!//\h*)endclass(?!\s*\*/))           #  must match all the way to 'endclass'

if endclass is preceded by // or followed by */ then it is probably commented.

commentZone: 51, 209
commentZone: 213, 409
funcZone: 411, 432
classZone: 0, 460
funcZone: 462, 487

Instead of 2 classZones as before, now there is only 1 classZone.

PeterJones

@Lycan-Thrope said in functionList not ignoring comments:

So please, don’t disparage the pattern. I’d like to think some really talented people created it, sans myself, right @PeterJones ? :)

But those talented people aren’t infallible (at least, I know I’m not), and might not have considered all edge cases.

You seem to be expecting the parser to be running two regex simultaneously – the comment regex and the class regex. Because you seem to expect that it can see the beginning of a class, then, while parsing the single class regex to grab the whole class from class to endclass, that it simultaneously sees a comment using another regex while the class-regex is still active, and can recognize that the endclass is commented out, to prevent the class-parsing regex from seeing it. Running two regex simultaneously would be a pretty awesome design if you could make it work… but I doubt that’s what was implemented.

Alternately, one could develop a system where the entire source file is parsed using the comment regex, and the comments are all deleted from the in-memory version of the file. Then that shrunken/de-commented in-memory file is run through the class parser, which then couldn’t see anything that used to be in the comments. If I were to have all the experience I have now with Notepad++'s Function List parser, and were to be writing a Function List-like parser from scratch, I think this is the direction I would go (if I could remember to do so). But based on my previous experience with FunctionList, and your descriptions above, that’s obviously not how Notepad++'s parser is written.

Thus, we’re left with the situation as implemented, where it seems (without my having studied N++'s FunctionList source code) that the comment regex is used to avoid starting a new class (or, I believe, standalone function), but that once it starts the class’s (or function’s) regex, it is up to the regex expression to make sure that comments inside that block are ignored while trying to determine the end of the block. (I think it’s not as much of a big deal for functions, because usually the expressions for those are looking for just the function name, not the whole function block). Hence, @mpheath has encouraged you to edit the regex in such a way as to ignore comments when looking for endclass, and even shown an example of how to do that.

Lycan Thrope

@mpheath said in functionList not ignoring comments:

(?si:.?^\h(?!//\h*)endclass(?!\s**/))

1.) Thank you. It works.
2.) I humbly, thank you and now, see what you were referring to with regard handling the comments inside the regex…

I was again trying last night/this morning before I retired, to try different things like you did, but obviously don’t have the chops that I thought I had. :)

@PeterJones , in his follow-up message almost points out how I thought the parser worked.

My impression, was that when it was inside the class zone, it would would read until the endclass and if a commmentZone was encountered, it would ignore reading anything inside, regardless what was in it until it found the end of the commentZone, and then would continue on with it’s previous Zone since it did not find it’s end. In that regard, it would be like a switch, with it’s recursion, like how a function in code works (It seemed like the logical explanation) where it would stop basic operation there until it finished the commentZone, and then return to it’s previous task. Maybe, that is the way it works, but the weak point was, we’re not able to use a keyword, only a character symbol, so I can see, now, why you meant it needed to deal with comments in the regex pattern itself. My thoughts were that was why we defind the comment regex prior to the pattern searching.

So I accept that this was, again, user error. Mine. :)

Thank you again.

Lycan Thrope

@PeterJones ,

Never expected you to be perfect, Peter, but you’re pretty close. :)

As in my reply to @mpheath , you’re mostly right about how I thought the parser worked. The exception was that I didn’t necessarily think it was running two regex simultaneously, but that it would know where the commentZone was, perhaps based on the commentZone character positions it uses, and turn off the reading/consumption mode of the class regex while inside a marked commentZone, basically ignore it until it came to the */ mark, and then continue reading the class.

I guess I was just confused about how I was supposed to alter the regex to deal with comments, as I wasn’t sure what he meant, since he seemed to be saying I needed to stop the pattern from seeking the endclass, which we couldn’t, because otherwise the class pattern would never be recognized. I know, I tried it last night, as that’s what I thought he was referring to, and all the functions inside and out showed up, but not the classes, so I knew that endclass keyword had to stay. I’ll have to chalk this misunderstanding to my lack of experience, which, thanks to his above solution, I can be a little wiser…not much…but a little. :)

Lycan Thrope

@mpheath ,
Just a futher follow up. After the initial fix worked for the example file presented, we have again found the issue rearing it’s head elsewhere, so while it temporarily fixed the issue, I feel the only real fix for this issue, will need to be the code allowing for keywords like class/endclass for the open and close demarcations that the classrange element expects for locating those zones.

I’m trying to get myself up to speed somewhat in C++ so I can at least read the code with a little more knowledge and see if I can’t figure out how to find and change what I think needs to be changed and I guess do a pull request and muck it up. :)

Thanks anyway, and I think this subject here will at least let people know if they find this issue in their functionList implementation of their language, if it has a mixed parser, this is a possible explanation for the problem.

Lycan Thrope

@mpheath ,

After going over a lot of stuff, including reexamining the FunctionList FAQ, I have to agree with you, that at present, I may need to work with the parser, as is. I must have missed that ‘embedded comments’ section, or glossed over it, or didn’t relate it to the kind of comments we do as the example seemed to be an example of excessive use of inline ‘block’ comment style for an inline comment. That may be what threw me as regards that example in the FAQ.

Moving the opening class declaration line after any block comments immediately following it, fixes the problem completely, with my recent testing. It appears, however, that as I discovered before, line comments inside a class taking up a whole line, or after code, inside a class/endclass block does indeed work as it should.

So I guess until I get up to speed with C++ and take a crack at a fix for the block comments giving problems inside a class/endclass code block, block comments will have to be avoided.

I do, however, feel that the simple fix for this, should be just allowing a longer string other than \{ and \} symbol characters would be a proper fix, since being able to use class and endclass in the open/close symbole elements, it would be part of the actual parser demarcation ability to know where a class starts and ends, rather than having to use regex to find all the way to the end like this regex had to do to be able to outline the structure of the class. For now, the crisis is averted until I can revisit this issue, hopefully with some C++ skills at a later date. :)