How to: Match Nested Pairs closed by END
-
Context: functionList for MATLAB
The PHP documentation shows a PCRE regex,
\( ( (?>[^()]+) | (?R) )* \)
which solves the problem of matching a string in parentheses, allowing for unlimited nested parentheses. This regex works as expected in N++.
The question is: can this regex be modified to take strings, e.g. properties, methods, if, for, while, etc., as opening symbols and end as the closing symbol? I failed.
Example: match if ... matching end of
if a, b, for a, b, while a, b, end, end, end, more text
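For reference, the expected result for the example above can be pinned down without any recursive regex. The following is a minimal Python sketch, not the recursive-regex approach asked about: it swaps in a plain depth counter over keyword tokens. The function name first_balanced_block and the keyword list are assumptions made for illustration, and the sketch ignores MATLAB comments, strings and the use of end inside indexing expressions.

```python
import re

# Assumed opener list for illustration; extend as needed.
OPENERS = ("if", "for", "while", "switch", "try", "parfor")
# Whole-word openers or 'end'.
TOKEN = re.compile(r"\b(?:%s|end)\b" % "|".join(OPENERS))


def first_balanced_block(text):
    """Return the first 'opener ... matching end' span, or None."""
    depth = 0
    start = None
    for m in TOKEN.finditer(text):
        if m.group() == "end":
            if depth == 0:
                continue  # stray 'end' before any opener: ignore it
            depth -= 1
            if depth == 0:
                return text[start:m.end()]
        else:
            if depth == 0:
                start = m.start()  # remember where the outermost block begins
            depth += 1
    return None  # no opener was ever closed


sample = "if a, b, for a, b, while a, b, end, end, end, more text"
print(first_balanced_block(sample))
```

A regex solution should select the same span: everything from the first if up to and including its matching end, leaving ", more text" outside the match.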
-
Hello, @per-isakson,
My first attempt, with the initial keywords if, for, while, switch, try, parfor and the final keyword end, all in lower case, would be the regex below :
(?s-i)(if|for|while|switch|try|parfor)((?:(?!(?1)|end).)+|(?R))+end
Just try it against the sample text :
if while end end forend ( Test : normally NOT matched ) if end end for if end while for if end end end for if end end end end end for if end while for if end end end for if end end end end end end end for if end while for if end end end for if end end end end end end
You should obtain five occurrences : 2 single-line zones and 3 multi-line blocks
Of course, as this regex just searches for the exact words, without word boundaries, it would not reject, for instance, the wrong block below, selecting from the letters for to the letters end
afor .... .... .... endz
I tried to build a more complete regex, with word boundaries, such as \W, and the look-arounds (?<=\W) and (?=\W), but my solutions led to other problems and didn’t match all the cases, anyway ! Then, I realized that it would be better to detect, FIRST, in your code, any keyword used in that regex which is “glued” to other letters, in a bigger word ! For that purpose, use the regex ( again ! ) below :
\w(if|(?<!par)for|while|switch|try|parfor|end)|(?1)\w
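As a rough cross-check outside N++, the “glued keyword” regex can be tried in Python. This is only a sketch: Python’s built-in re module has no (?1) subroutine call, so the keyword group is written out twice, which is assumed to be equivalent here. The names KEYWORDS, GLUED and glued_keywords are made up for the example.

```python
import re

# The keyword alternation from the regex above, including the (?<!par) guard
# that stops the 'for' inside 'parfor' from being reported.
KEYWORDS = r"(?:if|(?<!par)for|while|switch|try|parfor|end)"

# \w(keyword) | (keyword)\w, with the (?1) call expanded by hand.
GLUED = re.compile(r"\w" + KEYWORDS + "|" + KEYWORDS + r"\w")


def glued_keywords(text):
    """Return every spot where a keyword touches another word character."""
    return [m.group() for m in GLUED.finditer(text)]


print(glued_keywords("afor .... .... .... endz"))  # ['afor', 'endz']
print(glued_keywords("parfor k end"))              # []
```

An empty result suggests that the keywords in the text are not glued to other word characters, so the boundary-free recursive regex should not be misled by words such as afor or endz.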
IMPORTANT : Sometimes, when clicking ONE MORE time on the Find Next button, all the file contents are wrongly selected. It’s a well-known bug, which occurs mostly while using recursive regular expressions :-(( I can’t explain that behaviour ! Maybe my regex is not well-formed !?
Best Regards,
guy038
P.S. :
If this regex puts you in the right direction, I will give you, next time, some details on what it means !!
-
-
Hello @guy038
Thank you for the recursive regular expression. I have modified it a bit and included into my functionList.xml.
The expression works with TextFX, Quick, FindReplace and https://regex101.com/. (classdef must be the first keyword in the file, but that’s okay.)
The expression is costly to execute (some seconds on my old PC) and regex101 reports Catastrophic Backtracking for many of my test files. I failed to significantly improve the performance.
However, embedded in functionList.xml it’s not a total success. For some test files, which TextFX|Quick|FindReplace handles well, only the filename is shown in the Function List pane. I will report on that in the thread Trouble making a functionList parser for MATLAB, #13505.
Best Regards
per isakson
<classRange mainExpr = "(?x)(?s)        # dot matches new line
        (?-i)                           # case sensitive
        (                               # --- open 1st group
            \b                          # word boundary
            (                           # --- open 2nd group
                classdef                # keywords that open a
                |properties             # Balanced Construct
                |events                 # that is closed by
                |enumaration            # 'end'.
                |methods
                |function
                |if
                |for
                |while
                |switch
                |try
                |parfor
            )                           # --- close 2nd group
            \b                          # word boundary
        )                               # --- close 1st group
        (                               # open 3rd capturing group
            (?:                         # open non-capturing group
                (?!                     # negative look-ahead
                    (?1)|\bend\b        # if not a keyword
                ).                      # then one character
            )+                          # repeat
            |                           # until keyword found
            (?0)                        # recurse RE from start
        )+                              # repeat 3rd group
        \bend\b                         #
    "
-
Hi, @per-isakson and All,
Sorry, but I was quite busy answering a ( long ! ) e-mail to @iona-hine.
https://notepad-plus-plus.org/community/topic/13513/proximity-search/11
And, while testing some regexes for him, I just noticed that the interesting syntax (?#), where # stands for a group number and which is a subroutine call to that group, may, in some cases, especially in big files, lead to wrong results, for instance, to a wrong match of all the file contents !
So, referring to my previous regex :
(?s-i)(if|for|while|switch|try|parfor)((?:(?!(?1)|end).)+|(?R))+end
It would be better, finally, to rewrite it as :
(?s-i)(if|for|while|switch|try|parfor)((?:(?!if|for|while|switch|try|parfor|end).)+|(?R))+end
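Since the rewritten regex no longer needs (?1), only the (?R) recursion keeps it from running in engines without recursion support. A minimal Python sketch of one workaround follows, under a loud assumption: the recursion is unrolled to a fixed nesting depth, which is a different technique from true recursion and fails on blocks nested deeper than that limit. The names OPEN, FILL and block_pattern are hypothetical.

```python
import re

OPEN = r"(?:if|for|while|switch|try|parfor)"
# One character that does not start an opening keyword or 'end'.
FILL = r"(?:(?!if|for|while|switch|try|parfor|end).)"


def block_pattern(depth):
    """Build 'OPEN ... end' with the (?R) recursion unrolled `depth` times."""
    pattern = OPEN + FILL + "+end"  # innermost level: no nested block allowed
    for _ in range(depth):
        # Each outer level accepts filler characters or one level less of nesting.
        pattern = OPEN + "(?:" + FILL + "|" + pattern + ")+end"
    return pattern


# re.DOTALL plays the role of (?s); matching is case sensitive by default.
BLOCK = re.compile(block_pattern(4), re.DOTALL)

sample = "if a, b, for a, b, while a, b, end, end, end, more text"
match = BLOCK.search(sample)
print(match.group() if match else None)
```

Because a filler character can never start a keyword and a nested block must start with one, the two alternatives are mutually exclusive at every position, which should limit backtracking. If the depth is too small, the outermost block is not matched and the search falls through to an inner block instead.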
Therefore, @per-isakson, maybe it would be better to change, in your regex, the part :
(?1)|\bend\b
By the following :
\b                      # word boundary
(                       # --- open 4th group
    classdef            # keywords, of the 2nd group + 'end',
    |properties         # which must NOT occur,
    |events             # at ANY position, of the
    |enumaration        # present sub-block, till
    |methods            # its associated 'end' closing word
    |function
    |if
    |for
    |while
    |switch
    |try
    |parfor
    |end
)                       # --- close 4th group
\b                      # word boundary
BTW, you succeeded in including word boundaries in your regex ! Fine :-) I can’t remember why I ran into problems while trying to do the same thing !
Oh ! Maybe, instead of the keyword enumaration, you meant enumeration !
REMARK :
On my small example text, I tested the new syntax of my previous regex, but it produces the all-contents bug, too :-((. So, I suppose that it’s rather related to the recursion handling of the present N++ Boost regex engine ?!
Cheers,
guy038
-
@guy038 and @Per-Isakson,
Using a named capturing group for the keywords instead of a numbered group, and using it as a subroutine, probably wouldn’t make a difference, i.e. it would still lead to wrong results in big files …
<classRange mainExpr = "(?x)            # free-spacing
        (?s)                            # dot matches new line
        (?-i)                           # case sensitive
        (?'KEYWORDS'                    # --- open named capturing group
            \b                          # word boundary
            (?:                         # open non-capturing group with keywords that open a Balanced Construct that is closed by 'end'.
                classdef
            |   e(?:numeration|vents)
            |   f(?:or|unction)
            |   if
            |   methods
            |   p(?:arfor|roperties)
            |   switch
            |   try
            |   while
            )
            \b                          # word boundary
        )                               # --- close named capturing group
        (?:                             # open non-capturing group
            (?:                         # open non-capturing group
                (?!                     # negative look-ahead
                    (?&KEYWORDS)
                |   \bend\b
                )                       # if not a keyword
                .                       # then one character
            )+                          # repeat until keyword found
        |   (?0)                        # recurse RE from start
        )+                              # repeat
        \bend\b
    "
And subroutine definition(s)?
e.g.
<classRange mainExpr = "(?x)            # free-spacing
        (?s)                            # dot matches new line
        (?-i)                           # case sensitive
        (?(DEFINE)                      # Define subroutines
            (?'OPEN_KEYWORDS'
                \b                      # word boundary
                (?:                     # open non-capturing group with keywords that open a Balanced Construct that is closed by 'end'.
                    classdef
                |   e(?:numeration|vents)
                |   f(?:or|unction)
                |   if
                |   methods
                |   p(?:arfor|roperties)
                |   switch
                |   try
                |   while
                )
                \b                      # word boundary
            )
            (?'CLOSE_KEYWORDS'
                \b                      # word boundary
                end
                \b                      # word boundary
            )
        )
        (?&OPEN_KEYWORDS)               # call subroutine
        (?:                             # open non-capturing group
            (?:                         # open non-capturing group
                (?!                     # negative look-ahead
                    (?&OPEN_KEYWORDS)   # call subroutine
                |   (?&CLOSE_KEYWORDS)  # call subroutine
                )                       # if not a keyword
                .                       # then one character
            )+                          # repeat until keyword found
        |   (?0)                        # recurse RE from start
        )+                              # repeat
        (?&CLOSE_KEYWORDS)              # call subroutine
    "
-
Hi, @mapje71,
I recently noticed a difference between repeating a group #n and using its associated subroutine call (?n), while using some regexes in a .txt file that @iona-hine sent me by e-mail a couple of days ago. Refer to this post, about the general problem of finding a range of characters between two words ( either of the forms Word1......Word2 OR Word2......Word1 ! )
https://notepad-plus-plus.org/community/topic/13513/proximity-search/6
Iona’s file, of size 533,237, is a ONE-line UTF-8 file which contains 532,875 characters, from column 1 to column 532,875, organized in 74,115 words. The characters are mostly word characters ( 458,337 ), plus other symbols ( 663 ) and space characters ( 73,875 )
In that file, Iona tries to search for any range between the initial boundary man_n and the final boundary city_n, or the opposite, with a maximum of 50 words in between. It’s interesting to note that this file contains 467 occurrences of the word man_n but ONLY ONE occurrence of the word city_n. Therefore, any regex looking for a range between these two words should find only ONE occurrence !
Using, for instance, the regexes A and B below, we get, as expected, ONE match ( a range of 167 characters, beginning at column 21,481 ), in both cases.
Regex A :
(?si)(?<=\W)(man_n)((?:\W+\w+){0,50}\W+)(city_n)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)
Regex B :
(?si)(?<=\W)man_n(?:\W+\w+){0,50}\W+city_n(?=\W)|(?<=\W)city_n(?:\W+\w+){0,50}\W+man_n(?=\W)
Now, if we try the regex C, we get TWO matches ! ( The first one is correct, but the second match wrongly selects all the file contents ! )
Regex C :
(?si)(?<=\W)(city_n)((?:\W+\w+){0,50}\W+)(man_n)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)
However, if we use the equivalent regex D below, without the (?#) syntaxes, it gives ONE match only, without any bug !!
Regex D :
(?si)(?<=\W)city_n(?:\W+\w+){0,50}\W+man_n(?=\W)|(?<=\W)man_n(?:\W+\w+){0,50}\W+city_n(?=\W)
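Regexes B and D contain no subroutine calls, so they can be compared in any engine with look-arounds. The following Python sketch runs both on a made-up one-line text, since the real file is not reproduced here. It only illustrates that the two orderings of the alternation are expected to give the same single match; it says nothing about the N++ behaviour described above.

```python
import re

# Regex B and Regex D, copied from the post and split over two lines each.
REGEX_B = (r"(?si)(?<=\W)man_n(?:\W+\w+){0,50}\W+city_n(?=\W)"
           r"|(?<=\W)city_n(?:\W+\w+){0,50}\W+man_n(?=\W)")
REGEX_D = (r"(?si)(?<=\W)city_n(?:\W+\w+){0,50}\W+man_n(?=\W)"
           r"|(?<=\W)man_n(?:\W+\w+){0,50}\W+city_n(?=\W)")

# Hypothetical miniature text standing in for the real 532,875-character line.
text = " a man_n walked to the city_n today "

matches_b = [m.group() for m in re.finditer(REGEX_B, text)]
matches_d = [m.group() for m in re.finditer(REGEX_D, text)]

print(matches_b)  # ['man_n walked to the city_n']
print(matches_d)  # ['man_n walked to the city_n']
```

Running the same comparison on the real file in a second engine would be one way to tell whether the extra whole-file match comes from the regex itself or from the search loop around it.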
But, @mapje71, this behaviour could be due to my weak hardware configuration ( an old Win XP laptop, with 1 Mo of memory ! )
I can just imagine your surprise, guys ! And you’re right, I quickly need a 21st-century machine ! But, on the other hand, working without any UAC feature and other goodies of a modern OS is quite relaxing, too :-))
So, @mapje71, I’m going to send you that txt file by e-mail, for further tests. And I hope that you’ll succeed in sorting it out !
Cheers,
guy038
-
Hi @guy038,
Thanks for sending “A01466.headed.xml.txt”.
Confirming your findings with some additions, note the influence of “wrap around”.
Tested on a Desktop running Windows XP Home + Sp3 with Windows Classic GUI ;-)
Need to get my Windows 10 x64 ready to be able to dig in and debug this!
Regex A:
(?si)(?<=\W)(man_n)((?:\W+\w+){0,50}\W+)(city_n)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)
and
Regex B:
(?si)(?<=\W)man_n(?:\W+\w+){0,50}\W+city_n(?=\W)|(?<=\W)city_n(?:\W+\w+){0,50}\W+man_n(?=\W)
and
Regex D:
(?si)(?<=\W)city_n(?:\W+\w+){0,50}\W+man_n(?=\W)|(?<=\W)man_n(?:\W+\w+){0,50}\W+city_n(?=\W)
Pass i.e. one match even after repeated search:
- Find w/ wrap around;
- Using RegexTester (by @Claudia-Frank);
- Using RegexBuddy w/ boost::regex 1.58-1.59 ECMAScript (default) flavor
FAIL:
- Find w/o wrap around - “Can’t find text”
Regex C:
(?si)(?<=\W)(city_n)((?:\W+\w+){0,50}\W+)(man_n)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)
and
Regex G:
(?i)(?(DEFINE)(\bcity_n\b)((?:\W+\w+){0,50}\W+)(\bman_n\b))(?1)(?2)(?3)|(?3)(?2)(?1)
Pass i.e. one match even after repeated search:
- Using RegexTester (by @Claudia-Frank);
- Using RegexBuddy w/ boost::regex 1.58-1.59 ECMAScript (default) flavor
FAIL:
- Find w/ wrap around - toggles between two matches on repeated search (partial and complete text);
- Find w/o wrap around - “Can’t find text”
Regex E:
(?i)(?(DEFINE)(\bcity_n\b)((?:\W+\w+){0,50}\W+)(\bman_n\b))(?<=\W)(?1)(?2)(?3)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)
and
Regex F:
(?i)(?(DEFINE)(\bcity_n\b)((?:\W+\w+){0,50}\W+)(\bman_n\b))(?<=\W)(?:(?1)(?2)(?3)|(?3)(?2)(?1))(?=\W)
Pass i.e. one match even after repeated search:
- Using RegexTester (by @Claudia-Frank);
- Using RegexBuddy w/ boost::regex 1.58-1.59 ECMAScript (default) flavor
FAIL:
- Find w/ wrap around - toggles between two matches on repeated search (partial and complete text);
- Find w/o wrap around - toggles between two matches on repeated search (complete text and “Can’t find text”)
Regards,
Menno