How to: Match Nested Pairs closed by END

Per Isakson

Context: functionList for MATLAB

The PHP documentation shows a PCRE regex, \( ( (?>[^()]+) | (?R) )* \), which matches a string in parentheses while allowing unlimited nesting of parentheses. This regex works as expected in N++.

The question is: can this regex be modified to take strings, e.g. properties, methods, if, for, while, etc., as opening symbols and end as the closing symbol? I tried and failed.

Example: match if ... matching end in the text: if a, b, for a, b, while a, b, end, end, end, more text
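
The behaviour asked for here can be sketched outside of regex, as a cross-check for any recursive pattern. The Python snippet below is only an illustration and is not the recursive-regex approach discussed in this thread: it scans for keyword tokens and tracks the nesting depth with a counter. The names OPENERS, TOKEN and balanced_blocks are made up for the example, and it makes no attempt to skip comments, strings or MATLAB's use of end inside indexing.

```python
import re

# Hypothetical keyword list for the sketch; extend as needed.
OPENERS = ("if", "for", "while", "switch", "try", "parfor")
# Whole-word tokens only, so "forend" or "afor" are never seen as keywords.
TOKEN = re.compile(r"\b(?:%s|end)\b" % "|".join(OPENERS))


def balanced_blocks(text):
    """Return (start, stop) spans of the outermost opener ... end blocks."""
    spans = []
    depth = 0
    start = None
    for m in TOKEN.finditer(text):
        if m.group() == "end":
            if depth == 0:
                continue  # stray 'end' without an opener: ignored
            depth -= 1
            if depth == 0:
                spans.append((start, m.end()))
        else:
            if depth == 0:
                start = m.start()  # remember where the outermost block begins
            depth += 1
    return spans


sample = "if a, b, for a, b, while a, b, end, end, end, more text"
for start, stop in balanced_blocks(sample):
    print(sample[start:stop])
# prints: if a, b, for a, b, while a, b, end, end, end
```

A counter is enough here because every block is closed by the same word; a recursive regex has to express the same bookkeeping through backtracking.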

guy038

Hello, @per-isakson,

My first attempt, with the opening keywords if, for, while, switch, try, parfor and the closing keyword end, all in lower case, would be the regex below:

Just try it against the sample text :


if

  while

  end

end

forend  ( Test : normally NOT matched )

if

end

end


for if end while for if end end end for if end end end end end

for  if  end  while  for  if  end  end  end  for  if  end  end  end  end  end  end  end


for
  if
  end
   while
    for
      if
      end
    end
  end
  for
    if
    end
  end
end
end
end
end

You should get five matches: 2 single-line zones and 3 multi-line blocks.

Of course, as this regex just searches for the exact words, without word boundaries, it would also wrongly accept, for instance, the block below, selecting from the letters for to the letters end:

afor
....
....
....
endz

I tried to build a more complete regex with word boundaries, using \W and the look-arounds (?<=\W) and (?=\W), but my solutions led to other problems and didn’t match all the cases anyway! Then I realized that it would be better to detect, FIRST, any keyword used in that regex that is “glued” to other letters inside a bigger word in your code. For that purpose, use the regex (again!) below:

\w(if|(?<!par)for|while|switch|try|parfor|end)|(?1)\w
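
As a rough way to try this check outside N++, here is a Python sketch. Python's standard re module has no (?1) subroutine call, so the keyword group is simply written out twice; that expansion, and the names KEYWORDS, GLUED and glued_keywords, are assumptions made for the example.

```python
import re

# Same alternation as group 1 of the regex above, as a non-capturing group.
KEYWORDS = r"(?:if|(?<!par)for|while|switch|try|parfor|end)"
# Expanded by hand: a word character before the keyword, or one after it.
GLUED = re.compile(r"\w" + KEYWORDS + "|" + KEYWORDS + r"\w")


def glued_keywords(text):
    """Return every spot where a keyword is glued to a neighbouring letter."""
    return [m.group() for m in GLUED.finditer(text)]


sample = "afor x\nendz\nfor k\nend\nparfor k\nend"
print(glued_keywords(sample))
# prints: ['afor', 'endz']
```

The stand-alone for, end and parfor lines are not reported; only the glued forms afor and endz are.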

IMPORTANT: Sometimes, when clicking ONE MORE time on the Find Next button, the entire file contents are wrongly selected. It’s a well-known bug, which occurs mostly with recursive regular expressions :-(( I can’t explain that behaviour! Maybe my regex is not well-formed!?

Best Regards,

guy038

P.S. :

If this regex puts you in the right direction, I’ll give you, next time, some details on what it means!!

MAPJe71

@guy038 FYI it’s related to #13505.

Per Isakson

Hello @guy038

Thank you for the recursive regular expression. I have modified it a bit and included into my functionList.xml.

The expression works with TextFX, Quick, FindReplace and https://regex101.com/. (classdef must be the first keyword in the file, but that’s okay.)

The expression is costly to execute (some seconds on my old PC), and regex101 reports Catastrophic Backtracking for many of my test files. I failed to significantly improve the performance.

However, embedded in functionList.xml, it’s not a total success. For some test files that TextFX|Quick|FindReplace handles well, the Function List pane shows only the filename. I will report on that in the thread Trouble making a functionList parser for MATLAB, #13505.

Best Regards
per isakson

<classRange
    mainExpr = "(?x)(?s)                # dot matches new line
                (?-i)                   # case sensitive
                (                       # --- open 1st group 
                    \b                  # word boundary  
                    (                   # --- open 2nd group
                        classdef        # keywords that open a  
                        |properties     # Balanced Construct
                        |events         # that is closed by 
                        |enumaration    # 'end'.
                        |methods            
                        |function
                        |if
                        |for
                        |while
                        |switch
                        |try
                        |parfor
                    )                   # --- close 2nd group
                    \b                  # word boundary 
                )                       # --- close 1st group
                
                (                       # open 3rd capturing group    
                    (?:                 # open non-capturing group
                        (?!             # negative look-ahead
                        (?1)|\bend\b    # if not a keyword
                        ).              # then one character
                    )+                  # repeat 
                    |                   # until keyword found
                    (?0)                # recurse RE from start
                )+                      # repeat 3rd group
                \bend\b                 #      
                "

guy038

Hi, @per-isakson and All,

Sorry, but I was quite busy writing a ( long ! ) e-mail reply to @iona-hine.

https://notepad-plus-plus.org/community/topic/13513/proximity-search/11

And, while testing some regexes for him, I just noticed that the interesting syntax (?#), where # is a group number and which represents a subroutine call to group #, may, in some cases, especially in big files, lead to wrong results, for instance the wrong match of all the file contents!

So, referring to my previous regex :

It would be better, finally, to rewrite it as :

Therefore, @per-isakson, maybe it would be better to change, in your regex, the part:

                        (?1)|\bend\b

By the following :

                            \b                  # word boundary  
                            (                   # --- open 4th group
                                classdef        # keywords, of the 2nd group + 'end',
                                |properties     # which must NOT occur,
                                |events         # at ANY position, of the
                                |enumaration    # present sub-block, till
                                |methods        # its associated 'end' closing word
                                |function
                                |if
                                |for
                                |while
                                |switch
                                |try
                                |parfor
                                |end
                            )                   # --- close 4th group
                            \b                  # word boundary

BTW, you succeeded in including word boundaries in your regex! Fine :-) I can’t remember why I ran into problems while trying to do the same thing!

Oh! Maybe, instead of the keyword enumaration, you meant enumeration!

REMARK :

On my small example test, I tested the new syntax of my previous regex, but it produces the all-contents bug too :-((. So I suppose that it’s rather related to the recursion handling of the present N++ Boost regex engine ?!
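
Writing the keyword list out a second time, as suggested above, means the two lists must be kept in sync by hand. One possible way around that is to generate the expression from a single list. The Python sketch below only builds the pattern string; the names OPENERS, keyword_group and build_block_regex are hypothetical, the list uses the spelling enumeration, and the result is meant to be pasted into N++, since Python's own re module cannot run the (?0) recursion.

```python
import re

# Single source of truth for the opening keywords (assumed list, from the
# regex in this thread, with "enumeration" spelled out correctly).
OPENERS = ["classdef", "properties", "events", "enumeration", "methods",
           "function", "if", "for", "while", "switch", "try", "parfor"]


def keyword_group(words):
    """Return a whole-word, non-capturing alternation of the given words."""
    alternatives = "|".join(re.escape(w) for w in words)
    return r"\b(?:%s)\b" % alternatives


def build_block_regex(openers, closer="end"):
    """Assemble the recursive expression with no numbered subroutine call."""
    opening = keyword_group(openers)
    stop = keyword_group(list(openers) + [closer])  # openers + 'end'
    closing = keyword_group([closer])
    # opener, then (non-keyword characters | nested block)+, then closer
    return "(?s)(?-i)%s(?:(?:(?!%s).)+|(?0))+%s" % (opening, stop, closing)


print(build_block_regex(OPENERS))
```

The only recursion left in the generated string is (?0); the look-ahead repeats the keywords literally instead of calling a group.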

Cheers,

guy038

MAPJe71

@guy038 and @Per-Isakson,

Using a named capturing group for the keywords instead of a numbered group, and using it as a subroutine, probably wouldn’t make a difference, i.e. it would still lead to wrong results in big files …

<classRange
    mainExpr = "(?x)                    # free-spacing
                (?s)                    # dot matches new line
                (?-i)                   # case sensitive
                (?'KEYWORDS'            # --- open named capturing group
                    \b                  # word boundary
                    (?:                 # open non-capturing group with keywords that open a Balanced Construct that is closed by 'end'.
                        classdef
                    |   e(?:numeration|vents)
                    |   f(?:or|unction)
                    |   if
                    |   methods
                    |   p(?:arfor|roperties)
                    |   switch
                    |   try
                    |   while
                    )
                    \b                  # word boundary
                )                       # --- close named capturing group
                (?:                     # open non-capturing group
                    (?:                 # open non-capturing group
                        (?!             # negative look-ahead
                            (?&amp;KEYWORDS)
                        |   \bend\b
                        )               # if not a keyword
                        .               # then one character
                    )+                  # repeat until keyword found
                |   (?0)                # recurse RE from start
                )+                      # repeat
                \bend\b
                "

And subroutine definition(s)?
e.g.

<classRange
    mainExpr = "(?x)                            # free-spacing
                (?s)                            # dot matches new line
                (?-i)                           # case sensitive
                (?(DEFINE)                      # Define subroutines
                    (?'OPEN_KEYWORDS'
                        \b                      # word boundary
                        (?:                     # open non-capturing group with keywords that open a Balanced Construct that is closed by 'end'.
                            classdef
                        |   e(?:numeration|vents)
                        |   f(?:or|unction)
                        |   if
                        |   methods
                        |   p(?:arfor|roperties)
                        |   switch
                        |   try
                        |   while
                        )
                        \b                      # word boundary
                    )
                    (?'CLOSE_KEYWORDS'
                        \b                      # word boundary
                        end
                        \b                      # word boundary
                    )
                )
                (?&amp;OPEN_KEYWORDS)               # call subroutine
                (?:                             # open non-capturing group
                    (?:                         # open non-capturing group
                        (?!                     # negative look-ahead
                            (?&amp;OPEN_KEYWORDS)   # call subroutine
                        |   (?&amp;CLOSE_KEYWORDS)  # call subroutine
                        )                       # if not a keyword
                        .                       # then one character
                    )+                          # repeat until keyword found
                |   (?0)                        # recurse RE from start
                )+                              # repeat
                (?&amp;CLOSE_KEYWORDS)              # call subroutine
                "

guy038

Hi, @mapje71,

I recently noticed a difference between repeating a group #n and using its associated subroutine call (?n), while using some regexes in a .txt file that @iona-hine sent me by e-mail a couple of days ago. Refer to this post, about the general problem of finding a range of characters between two words ( either of the forms Word1......Word2 OR Word2......Word1 ! )

https://notepad-plus-plus.org/community/topic/13513/proximity-search/6

Iona’s file, of size 533,237 bytes, is a ONE-line UTF-8 file, which contains 532,875 characters, from column 1 to column 532,875 !, organized in 74,115 words. The characters are, mostly, word characters ( 458,337 ) + other symbols ( 663 ) and space characters ( 73,875 )

In that file, Iona tries to search for any range between the initial boundary man_n and the final boundary city_n, or the opposite, with a maximum of 50 words in between! It’s interesting to note that this file contains 467 occurrences of the word man_n but ONLY ONE occurrence of the word city_n. Therefore, any regex looking for a range between these two words should find only ONE occurrence!

Using, for instance, the regexes A and B, below, we get, as expected, ONE match ( a range of 167 characters, beginning at column 21,481 ), in both cases.

Regex A : (?si)(?<=\W)(man_n)((?:\W+\w+){0,50}\W+)(city_n)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)

Regex B : (?si)(?<=\W)man_n(?:\W+\w+){0,50}\W+city_n(?=\W)|(?<=\W)city_n(?:\W+\w+){0,50}\W+man_n(?=\W)

Now, if we try the regex C, we get TWO matches ! ( The first one is correct, but the second match wrongly selects all the file contents ! )

Regex C : (?si)(?<=\W)(city_n)((?:\W+\w+){0,50}\W+)(man_n)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)

However, if we use the equivalent regex D, below, without the (?#) syntaxes, it finds ONE match only, without any bug !!

Regex D : (?si)(?<=\W)city_n(?:\W+\w+){0,50}\W+man_n(?=\W)|(?<=\W)man_n(?:\W+\w+){0,50}\W+city_n(?=\W)
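
Since regexes B and D contain no subroutine calls, their logic can be cross-checked in another engine. Below is a small Python sketch using regex B unchanged on two made-up sentences (not Iona’s file). It only confirms what the pattern is meant to select; it says nothing about the behaviour of the N++ Boost engine, and the names REGEX_B and proximity_matches are invented for the example.

```python
import re

# Regex B from above, split over two string literals for readability.
REGEX_B = re.compile(
    r"(?si)(?<=\W)man_n(?:\W+\w+){0,50}\W+city_n(?=\W)"
    r"|(?<=\W)city_n(?:\W+\w+){0,50}\W+man_n(?=\W)"
)


def proximity_matches(text):
    """Return every range between man_n and city_n, in either order."""
    return [m.group() for m in REGEX_B.finditer(text)]


# Made-up sample: two man_n, a single city_n, as in the situation described.
forward = " a man_n walked into the city_n and another man_n left. "
backward = " city_n gates; one man_n "

print(proximity_matches(forward))   # ['man_n walked into the city_n']
print(proximity_matches(backward))  # ['city_n gates; one man_n']
```

With a single city_n in the text, only one range is reported, which is the result expected from all four regexes.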

But, @mapje71, this behaviour could occur because of my weak hardware configuration ( An old Win XP laptop, with 1 Mo of memory ! )
I can just imagine your surprise, guys! And you’re right, I quickly need a 21st-century machine! But, on the other hand, working without any UAC feature and other goodies of a modern OS is quite relaxing, too :-))

So, @mapje71, I’m going to send you that txt file by e-mail, for further tests. And I hope that you’ll succeed in sorting it out!

Cheers,

guy038

MAPJe71

Hi @guy038,

Thanks for sending “A01466.headed.xml.txt”.
I can confirm your findings, with some additions; note the influence of “wrap around”.
Tested on a Desktop running Windows XP Home + Sp3 with Windows Classic GUI ;-)
Need to get my Windows 10 x64 ready to be able to dig in and debug this!

Regex A: (?si)(?<=\W)(man_n)((?:\W+\w+){0,50}\W+)(city_n)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)
and
Regex B: (?si)(?<=\W)man_n(?:\W+\w+){0,50}\W+city_n(?=\W)|(?<=\W)city_n(?:\W+\w+){0,50}\W+man_n(?=\W)
and
Regex D: (?si)(?<=\W)city_n(?:\W+\w+){0,50}\W+man_n(?=\W)|(?<=\W)man_n(?:\W+\w+){0,50}\W+city_n(?=\W)

Pass i.e. one match even after repeated search:

Find w/ wrap around;
Using RegexTester (by @Claudia-Frank);
Using RegexBuddy w/ boost::regex 1.58-1.59 ECMAScript (default) flavor

FAIL:

Find w/o wrap around - “Can’t find text”

Regex C: (?si)(?<=\W)(city_n)((?:\W+\w+){0,50}\W+)(man_n)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)
and
Regex G: (?i)(?(DEFINE)(\bcity_n\b)((?:\W+\w+){0,50}\W+)(\bman_n\b))(?1)(?2)(?3)|(?3)(?2)(?1)

Pass i.e. one match even after repeated search:

Using RegexTester (by @Claudia-Frank);
Using RegexBuddy w/ boost::regex 1.58-1.59 ECMAScript (default) flavor

FAIL:

Find w/ wrap around - toggles between two matches on repeated search (partial and complete text);
Find w/o wrap around - “Can’t find text”

Regex E: (?i)(?(DEFINE)(\bcity_n\b)((?:\W+\w+){0,50}\W+)(\bman_n\b))(?<=\W)(?1)(?2)(?3)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)
and
Regex F: (?i)(?(DEFINE)(\bcity_n\b)((?:\W+\w+){0,50}\W+)(\bman_n\b))(?<=\W)(?:(?1)(?2)(?3)|(?3)(?2)(?1))(?=\W)

Pass i.e. one match even after repeated search:

Using RegexTester (by @Claudia-Frank);
Using RegexBuddy w/ boost::regex 1.58-1.59 ECMAScript (default) flavor

FAIL:

Find w/ wrap around - toggles between two matches on repeated search (partial and complete text);
Find w/o wrap around - toggles between two matches on repeated search (complete text and “Can’t find text”)

Regards,
Menno