Trouble with defining a Function List entry

Lokathor

I’m trying to write a function list entry for Haskell. Function declarations are words at the start of a line, then :: then some type definition after it. Such as

updateWindow :: GlobalRecord -> SDL.Renderer -> IO ()
main :: IO ()

and so on. Comments are things within {- and -} pairs, or until the end of the line after a --

I found the functionList.xml file in the Notepad++ install directory, and I deleted the functionList.xml from the AppData directory.

I put an association entry into the file

		<association langID="45" id="haskell_function"/>

And I also put a parser into the file.

        <parser id="haskell_function" displayName="Haskell Function" commentExpr="(({-.*-})|(--.*$))">
            <function
                mainExpr="^\w+ :: .*$"
                displayMode="$functionName">
            </function>
        </parser>

However, when I restart notepad++ it doesn’t show any entries in the function list. It shows the Main.hs file, but then no functions within the file. What am I doing wrong?

guy038

Hello, @Lokathor,

I understood the main reason why you could NOT see any entry in the FunctionList panel :-)) Although it’s not easy to find out !

The functionList regex engine does NOT act as the Notepad++ one. By default, it considers that the dot meta-character ( . ) matches, absolutely, any character ( standard or EOL characters )

But, as the Notepad++ regex engine does, it does consider that the ^ and $ assertions stand for the classical beginning and the end of any line

So, when you want to match, from any location, all the possible remaining characters of a line, TWO syntaxes are possible :

.*?$ , which matches any range, even null, of standard characters, till the nearest EOL location, due to the lazy quantifier *?

OR

(?-s).* , which matches any range, even null, of standard characters only, due to the in-line modifier (?-s), which forces the regex engine to consider that the dot, special character, matches standard characters, only !

Of course, if the range of characters must be a non-zero length string, prefer the syntaxes .+?$ OR (?-s).+
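The difference between these forms can be tried outside Notepad++. The sketch below uses Python's re module, which is NOT the engine Notepad++ uses; it assumes that re.DOTALL | re.MULTILINE is a fair stand-in for the defaults described above ( dot matches EOL characters, ^ and $ work per line ). Python only accepts the scoped form (?-s:...), so that form replaces (?-s) here.

```python
import re

# Assumption: these flags imitate the function list defaults described above.
FLAGS = re.DOTALL | re.MULTILINE

text = "main :: IO ()\nhelper :: Int -> Int"

# Greedy .* runs to the LAST position where $ can match: the end of the text.
greedy = re.search(r"main.*$", text, FLAGS).group()

# Lazy .*? stops at the NEAREST $, i.e. the end of the current line.
lazy = re.search(r"main.*?$", text, FLAGS).group()

# With dot-matches-newline switched off, .* cannot leave the current line.
no_dotall = re.search(r"main(?-s:.*)", text, FLAGS).group()

print(repr(greedy))     # spans both lines
print(repr(lazy))       # first line only
print(repr(no_dotall))  # first line only
```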

Secondly, your regex, relative to comments, was invalid : The two characters { and } must be escaped ( \{ and \} ) because they have a special meaning in regexes. Moreover, I simplified the expression a bit and, taking into account what I said above, we end up, for the comments regex, with ONE of the two syntaxes :

\{-.*-\}|--.*?$

OR

\{-.*-\}|(?-s)--.*

Thirdly, although, the minimum form of your parser, with the two corrections, above, would work and display the entire line where function declarations are, it would be better to add the functionName node, in order to get the function name, ONLY, that is to display !

For instance, from the original Haskell text, below :

Normal text--Comment text

Normal text{-Comment text-}

updateWindow :: GlobalRecord -> SDL.Renderer -> IO ()

main :: IO ()

Test :: bla bla bla

Your parser would display, in the functionList panel, the three entries :

updateWindow :: GlobalRecord -> SDL.Renderer -> IO ()

main :: IO ()

Test :: bla bla bla

With that new parser, below, ( The TWO syntaxes are equivalent ! ) :

<parser id="haskell_function" displayName="Haskell Function" commentExpr="\{-.*-\}|--.*?$">
    <function
        mainExpr="^\w+ :: .*?$"
        displayMode="$functionName">
        <functionName>
            <nameExpr expr="^\w+" />
        </functionName>
    </function>
</parser>

OR

<parser id="haskell_function" displayName="Haskell Function" commentExpr="\{-.*-\}|(?-s)--.*">
    <function
        mainExpr="(?-s)^\w+ :: .*"
        displayMode="$functionName">
        <functionName>
            <nameExpr expr="^\w+" />
        </functionName>
    </function>
</parser>

we obtain, in the functionList panel, the three correct entries :

updateWindow

main

Test
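How mainExpr and nameExpr cooperate can be sketched in Python. This is only a rough model under assumptions: the helper list_functions is hypothetical, Python's re module is not the Notepad++ engine, and re.DOTALL | re.MULTILINE is assumed to imitate the function list defaults. The idea shown is that mainExpr selects each whole declaration and nameExpr is then applied inside that selection to keep the name only.

```python
import re

# Assumption: these flags imitate the function list defaults.
FLAGS = re.DOTALL | re.MULTILINE

MAIN_EXPR = r"^\w+ :: .*?$"  # mainExpr from the first parser above
NAME_EXPR = r"^\w+"          # nameExpr from the parser above

source = "\n".join([
    "Normal text--Comment text",
    "Normal text{-Comment text-}",
    "updateWindow :: GlobalRecord -> SDL.Renderer -> IO ()",
    "main :: IO ()",
    "Test :: bla bla bla",
])


def list_functions(text):
    """Hypothetical helper: find each declaration, then keep only its name."""
    names = []
    for declaration in re.finditer(MAIN_EXPR, text, FLAGS):
        name = re.search(NAME_EXPR, declaration.group(), FLAGS)
        if name:
            names.append(name.group())
    return names


print(list_functions(source))
```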

Best Regards,

guy038

P.S. :

I would like to point out, for instance, the main difference between your comments regex and mine !

With your syntax \{-.*-\}|--.*$, as soon as the functionList regex engine meets the comment form --, it matches the greatest range of characters of the current file, till the End of Line assertion ( $ ), due to greedy quantifier *. So, it catches all the remaining characters of the file and, of course, it cannot match, afterwards, any function name !
With the syntax \{-.*-\}|--.*?$, as soon as the functionList regex engine meets the comment form --, it matches the smallest range of characters of the current file, till the next End of Line assertion ( $ ), due to lazy quantifier *?. So, it, only, catches all the remaining characters of the current line. Therefore, the future detection of functions, with the appropriate regex, can occur !
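The effect of the greedy comments regex on function detection can be imitated with a simplified model in Python. Assumptions : the helper visible_functions is hypothetical, comment zones are simply deleted before searching ( the real function list may handle them differently ), and re.DOTALL | re.MULTILINE stands in for the function list defaults.

```python
import re

# Assumption: these flags imitate the function list defaults.
FLAGS = re.DOTALL | re.MULTILINE

source = "\n".join([
    "-- entry point",
    "main :: IO ()",
    "updateWindow :: GlobalRecord -> SDL.Renderer -> IO ()",
])


def visible_functions(comment_expr, text):
    """Simplified model: drop the comment zones, then look for declarations."""
    without_comments = re.sub(comment_expr, "", text, flags=FLAGS)
    return re.findall(r"^(\w+) :: ", without_comments, FLAGS)


# Greedy: the first -- comment swallows the rest of the file.
greedy = visible_functions(r"\{-.*-\}|--.*$", source)

# Lazy: the -- comment stops at the end of its own line.
lazy = visible_functions(r"\{-.*-\}|--.*?$", source)

print(greedy)
print(lazy)
```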

Claudia Frank

@guy038

The functionList regex engine does NOT act as the Notepad++ one. By default, it considers that the dot meta-character ( . ) matches, absolutely, any character ( standard or EOL characters )

is that the only difference you found so far?

Cheers
Claudia

Lokathor

Ah! Regex engine differences, classic.

Your version partly works. I’m not sure why it’s only partly. I updated notepad++ to the latest version to be sure.

When I tested with this file, two functions were correctly detected:
https://github.com/Lokathor/galaxy-break/blob/master/src/Main.hs

But when I used another, it only detected the very bottom function (atomicPrintLn):
https://github.com/Lokathor/galaxy-break/blob/master/lib/Control/Console.hs

Other Haskell files I had around got similar results, with only the very bottom function being put in the function list.

guy038

Hi Lokathor, Claudia and All

Hey, guys, I’ve just found out a very nice way to build a single multi-line regex, with, in addition, some comments. If you can’t wait, just go to the end of this post ;-))

Lokathor, sorry to be late, but I preferred to reply to the Claudia’s question, first ! Next, I’ll try to understand why you cannot see all your functions, from the Console.hs file !

Claudia, I’ve just ended a series of tests and, so far, here are my first conclusions :

1) As said in my previous reply, by default, the functionList regex engine considers that the dot , . , matches, absolutely, any character ( standard or EOL ones )

2) The functionList regex engine searches, by default, in an insensitive way ! So, if you need to match any exact case, just insert the syntax (?-i), at the beginning of your regex

3) The functionList regex engine considers, by default, that all the Unicode blank characters are significant characters, in the regex, in the same way Notepad++ does !

But, if you insert the (?x) syntax ( PCRE-EXTENDED option ), at beginning of the regex, all the blank characters, below, are totally ignored, by the regex engine ! These are :

The 3 horizontal blank characters :
- \t Tabulation ( TAB ), \x20 Space ( SP ) and \xA0 Non Breaking space ( NBSP )
The 8 vertical blank characters :
- \r\n Windows EOL ( CRLF ), \n Line Feed ( LF ) and \r Carriage Return ( CR )
- \x0B Vertical Tabulation ( VT ), \f Form Feed ( FF ) and \x85 Next Line ( NEL )
- \x{2028} Line Separator ( LS ) and \x{2029} Paragraph Separator ( PS )

Remark : If you need to search for one of these characters, above, literally, while using the (?x) option, two solutions :

Escape that character, with a backslash ( \ ), just before the character
Use its own regex form, as, for instance, \n, \x20 or \x{2028}…

Moreover, if you use the (?x) syntax, any sharp character ( # ) ( NOT preceded by a backslash character ) begins a comment section, till the end of the current line

All these particularities are really interesting, when writing your parser, in the functionList.xml file. Indeed, just see, for instance, the quite neat regexes, created by MAPJe71 :

https://notepad-plus-plus.org/community/topic/12691/function-list-with-java-problems/6

Quite nice to be able to write multi-line regexes, with some comments, isn’t it ?

But the great news is that I found out a way to reproduce this behaviour, with the Find dialog !!!

For instance, follow the few steps, below :

Copy all the text, below, in a new tab :

(?x)

# BEGINNING of the regex

(\d+) # a NUMBER, stored as GROUP 1
\w+ # IMMEDIATELY followed by a WORD

# and

\1 # the SAME number, as before
\#\x20\# # ending with TWO “sharp” characters, separated by a single SPACE

| # OR

(?-i)ExAMplE # the word “ExAMplE” in that EXACT case

# END of the regex

ExAMplE 123Test123# # # our SUBJECT text
In that new tab, select ALL the text, WITHOUT the last line, ExAMplE 123Test...... ( IMPORTANT )
Open the Find dialog (Ctrl + F)
Click, TWICE, on the Find Next button

=> The regex engine should match, successively, the string ExAMplE, then the expression 123Test123# # !

Really magic, isn’t it :-))

Of course, the syntax (?x) must begin the regex search expression !
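The same free-spacing idea exists in other regex engines too. As a hedged illustration, here is the example above in Python, whose re module supports (?x) as well. Assumptions : re.IGNORECASE imitates the case-insensitive default mentioned earlier, and Python's scoped form (?-i:...) replaces (?-i), which Python does not accept in the middle of a pattern.

```python
import re

# (?x) must be the very first thing in the pattern, as in the Find dialog.
pattern = re.compile(r"""(?x)
    # BEGINNING of the regex
    (\d+)            # a NUMBER, stored as GROUP 1
    \w+              # IMMEDIATELY followed by a WORD
    \1               # the SAME number, as before
    \#\x20\#         # TWO "sharp" characters, separated by a single SPACE
    |                # OR
    (?-i:ExAMplE)    # the word "ExAMplE" in that EXACT case
    # END of the regex
""", re.IGNORECASE)

subject = "ExAMplE 123Test123# #"
matches = [m.group() for m in pattern.finditer(subject)]
print(matches)
```

Note that the escaped sharps ( \# ) and \x20 are matched literally, while every unescaped blank and every # comment is ignored.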

Otherwise, Claudia, I tried to test, the behaviour of the functionList regex engine, with :

The escaped, collating, symbolic and equivalence forms of characters
The character classes
The comments forms
The assertions
The look-arounds
The conditional IF structure
The priority order of the regex structures

And, up to now, except for those above, I was not able to see any other difference with the default N++ regex engine :-))

Cheers,

guy038

Claudia Frank

Hi Guy,
thanks - another goodie.

Cheers
Claudia

guy038

Hi Lokathor,

I understood what happens !

As you have a comment form {-.....-}, at beginning of your Console.hs file

{-# LANGUAGE CPP, Trustworthy #-}

and a last comment identical form, near the end, before your last function ( atomicPrintLn )

{-| performs a putStrLn action while holding the lock given and with special
processing to attempt to keep the buffered text working properly.
-}

when it executes the comments regex \{-.*-\}|--.*?$ and meets the comment form {-, the first alternative is chosen, but as the dot character matches, absolutely, any character, the part \{-.*-\}, of the comment regex, selects everything from the first comment till the end of the last comment ( so 76 lines ! ). Therefore, only the last function is displayed in the functionList panel !

So, the solution consists in adding a question mark, just after the .* part ( so before the two characters -\} ), in the comments regex. By this means, the regex will select each individual comment block, instead of catching almost everything !

Remark : We cannot use the partial syntax (?-s)\{-.*-\}, because the comment form {-.....-} may exist as a single line or as a block of several lines !

So, the final parser needed is, either :

<parser id="haskell_function" displayName="Haskell Function" commentExpr="\{-.*?-\}|--.*?$">
    <function
        mainExpr="^\w+ :: .*?$"
        displayMode="$functionName">
        <functionName>
            <nameExpr expr="^\w+" />
        </functionName>
    </function>
</parser>

OR

<parser id="haskell_function" displayName="Haskell Function" commentExpr="\{-.*?-\}|(?-s)--.*">
    <function
        mainExpr="(?-s)^\w+ :: .*"
        displayMode="$functionName">
        <functionName>
            <nameExpr expr="^\w+" />
        </functionName>
    </function>
</parser>

Cheers,

guy038

P.S. :

In your Console.hs file, you can search for, successively, the regexes :

(?s)\{-.*-\}
(?s)\{-.*?-\}

The difference is obvious !!
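For readers without the file at hand, the two searches can be compared on a small stand-in text in Python. Assumptions : the declarations first and lastOne are made-up placeholders ( not taken from Console.hs ), the helper visible is hypothetical and simply deletes comment zones before searching, and Python's re module is not the Notepad++ engine.

```python
import re

source = "\n".join([
    "{-# LANGUAGE CPP, Trustworthy #-}",
    "first :: IO ()",  # placeholder declaration
    "{-| performs a putStrLn action while holding the lock given",
    "-}",
    "lastOne :: IO ()",  # placeholder declaration
])

greedy_blocks = re.findall(r"(?s)\{-.*-\}", source)
lazy_blocks = re.findall(r"(?s)\{-.*?-\}", source)


def visible(comment_expr):
    """Simplified model: delete the comment zones, then list declarations."""
    stripped = re.sub(comment_expr, "", source)
    return re.findall(r"(?m)^(\w+) :: ", stripped)


print(len(greedy_blocks), visible(r"(?s)\{-.*-\}"))   # one huge block
print(len(lazy_blocks), visible(r"(?s)\{-.*?-\}"))    # one block per comment
```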

古旮

To @Lokathor
So, the problem seems to be the perl modifiers. Here’s some documentation that might help:
(?flags-not-flags)
alters which of the perl modifiers are in effect within the pattern, changes take effect from the point that the block is first seen and extend to any enclosing ).
The flags and not-flags could be:

i: case insensitive (default: off)
m: ^ and $ match embedded newlines (default: as per “. matches newline”)
s: dot matches newline (default: as per “. matches newline”)
x: Ignore unescaped whitespace in regex (default: off)

E.g. If you want the regexp…

to be case sensitive
NOT to match a newline character with dot

Then, your regexp shall start with: (?-si)

To @guy038 :
Nice to know the usage of (?x) (the “x” modifier).
But it’s quite easy to accomplish the same thing using some other regexp. As below:
STEP 1
find:(?-s)(?<!\\)#.*
replace:<EMPTY>
STEP 2
find:\s+
replace:<EMPTY>
As long as I don’t get the confirmed document, I prefer to do it this way.
BTW, so, it seems that, when the modifier x is on, character between # and \R in the regexp are treated as comment?
Best regards.

guy038

Hello, 古旮

Of course, I knew that, in order to match my previous example phrase ExAMplE 123Test123# #, the shorter regex was, simply :

(\d+)\w+\1\#\x20\#|(?-i)ExAMplE

But, I wanted to spread this regex over several lines, on purpose! I wanted to show that there is a simple way to split any complicated regex, with some comments, in order to explain the different parts of the regex :

Firstly, select any multi-lines regex block of text, previously created
Secondly, open any tab of the Find dialog

=> All the multi-line text is, automatically, filled in, in the Find what: field

Thirdly, just perform your find and/or replace operation

Cheers,

guy038

BTW, 古旮, your regex, in two steps, ( which allows to get the irreducible form of a regex, from any expanded form of that same regex ! ), can be performed, in one step, only, with the syntax, below ;-))

SEARCH \(\?x\)|\s+|(?-s)(?<!\\)#.*

REPLACE Leave EMPTY
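This one-step reduction can be checked in Python as a rough sketch. Assumptions : the first alternative is written here as \(\?x\), on the understanding that it is meant to match the literal text (?x), and the leading (?-s) is dropped because Python's dot already excludes newlines by default ( and Python has no global (?-s) form ). The expanded regex is the example from earlier in the thread, with its comment lines starting with # and the literal sharps escaped.

```python
import re

# Removes the literal "(?x)", all whitespace, and every unescaped # comment.
REDUCE = r"\(\?x\)|\s+|(?<!\\)#.*"

expanded = r"""(?x)
# BEGINNING of the regex
(\d+)          # a NUMBER, stored as GROUP 1
\w+            # IMMEDIATELY followed by a WORD
\1             # the SAME number, as before
\#\x20\#       # TWO sharp characters, separated by a single SPACE
|              # OR
(?-i)ExAMplE   # the word ExAMplE in that EXACT case
# END of the regex
"""

compact = re.sub(REDUCE, "", expanded)
print(compact)
```

The lookbehind (?<!\\) is what keeps the escaped sharps ( \# ) in place while the comment sharps are removed.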

Lokathor

It works. You’ve turned Notepad++ into nearly the perfect editor for my Haskell work. Many thanks.

I’m not sure what the process would be to add this back into the main notepad++ program so that others could have this functionality by default though.

MAPJe71

@Lokathor
I will add the Haskell parser to a future FunctionList update.
