    How to: Match Nested Pairs closed by END

    8 Posts 3 Posters 4.0k Views
    • Per Isakson

      Context: functionList for MATLAB

      The PHP documentation shows a PCRE regex, \( ( (?>[^()]+) | (?R) )* \), which solves the problem of matching a string in parentheses, allowing for unlimited nested parentheses. This regex works as expected in N++.

      The question is: can this regex be modified to take strings, e.g. properties, methods, if, for, while etc., as opening symbols and end as the closing symbol? I tried and failed.

      Example: in the text "if a, b, for a, b, while a, b, end, end, end, more text", match from the first if up to its matching end, i.e. everything except ", more text".

      • guy038

        Hello, @per-isakson,

        My first attempt, with the opening key-words if, for, while, switch, try, parfor and the closing key-word end, all in lower-case, would be the regex below :

        (?s-i)(if|for|while|switch|try|parfor)((?:(?!(?1)|end).)+|(?R))+end

        Just try it against the sample text :

        
        if
        
          while
        
          end
        
        end
        
        forend  ( Test : normally NOT matched )
        
        if
        
        end
        
        end
        
        
        for if end while for if end end end for if end end end end end
        
        for  if  end  while  for  if  end  end  end  for  if  end  end  end  end  end  end  end
        
        
        for
          if
          end
           while
            for
              if
              end
            end
          end
          for
            if
            end
          end
        end
        end
        end
        end
        

        You should obtain five occurrences, capturing 2 single-line zones and 3 multi-line blocks
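As a cross-check that does not depend on recursion support in the regex engine, the same nesting can be followed with a plain token scan and a stack. The following is only a sketch in Python, not the regex above: the name find_blocks is made up for illustration, and, unlike the regex, it treats the key-words as whole words (\b), so forend is never seen as a key-word. It also knows nothing about comments or strings in the code.

```python
import re

# Opening key-words taken from the regex above; "end" closes every block.
OPENERS = ("if", "for", "while", "switch", "try", "parfor")
CLOSER = "end"


def find_blocks(text, openers=OPENERS, closer=CLOSER):
    """Return the outermost opener...end blocks in text (hypothetical helper)."""
    words = list(openers) + [closer]
    token = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b")

    stack = []   # start offsets of the openers that are still unclosed
    spans = []   # (start, end) of every completed block, nested ones included
    for m in token.finditer(text):
        if m.group() != closer:
            stack.append(m.start())
        elif stack:                      # an 'end' without an opener is ignored
            spans.append((stack.pop(), m.end()))

    # Keep only the blocks that are not contained in an earlier, larger block.
    outer = []
    for start, end in sorted(spans):
        if not outer or start >= outer[-1][1]:
            outer.append((start, end))
    return [text[s:e] for s, e in outer]


sample = "if a, b, for a, b, while a, b, end, end, end, more text"
print(find_blocks(sample))
```

An opener that never gets its end is simply dropped, while the complete blocks nested inside it are still reported, which is also how the regex behaves when it fails at one key-word and retries at the next one.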


        Of course, as this regex just searches for the exact letters, without word boundaries, it would also accept, for instance, the wrong block below, selecting from the letters for to the letters end :

        afor
        ....
        ....
        ....
        endz
        

        I tried to build a more complete regex, with word boundaries, using \W and the look-arounds (?<=\W) and (?=\W), but my solutions led to other problems and didn’t match all the cases, anyway ! Then, I realized that it would be better to detect, FIRST, in your code, any key-word of that regex which is “glued” to other letters, inside a bigger word ! For that purpose, use the regex ( again ! ) below :

        \w(if|(?<!par)for|while|switch|try|parfor|end)|(?1)\w
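For anyone who prefers to run this pre-check outside N++, here is a minimal sketch in Python. Python's built-in re module has no (?1) subroutine call, so the key-word list is simply spelled out twice; the name find_glued is hypothetical, and the text is scanned line by line only to get line numbers.

```python
import re

# Same key-words as in the regex above, including the (?<!par) guard on "for".
KEYWORDS = r"(?:if|(?<!par)for|while|switch|try|parfor|end)"

# \w + key-word   OR   key-word + \w   (the second branch stands in for (?1)\w)
GLUED = re.compile(r"\w" + KEYWORDS + r"|" + KEYWORDS + r"\w")


def find_glued(text):
    """Return (line number, matched text) for each glued key-word (hypothetical helper)."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for m in GLUED.finditer(line):
            hits.append((lineno, m.group()))
    return hits


sample = "afor\n....\n....\n....\nendz\nparfor k\nif a\nend"
print(find_glued(sample))   # [(1, 'afor'), (5, 'endz')]
```

Any line reported here contains a key-word glued to other word characters, so it is worth a look before running the block regex without word boundaries.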

        IMPORTANT : Sometimes, when clicking ONE MORE time on the Find Next button, all the file contents are wrongly selected. It’s a well-known bug, which occurs mostly with recursive regular expressions :-(( I can’t explain that behaviour ! Maybe my regex is not well-formed !?

        Best Regards,

        guy038

        P.S. :

        If this regex puts you in the right direction, I’ll give you, next time, some details on what it means !!

        • MAPJe71

          @guy038 FYI it’s related to #13505.

          • Per Isakson

            Hello @guy038

            Thank you for the recursive regular expression. I have modified it a bit and included it in my functionList.xml.

            The expression works with TextFX|Quick|FindReplace and https://regex101.com/. (classdef must be the first keyword in the file, but that’s okay.)

            The expression is costly to execute (some seconds on my old PC) and regex101 reports Catastrophic Backtracking for many of my test files. I failed to significantly improve the performance.

            However, embedded in functionList.xml it’s not a total success. Some test files, which TextFX|Quick|FindReplace handles well, only produce the filename in the Function List pane. I will report on that in the thread Trouble making a functionList parser for MATLAB, #13505.

            Best Regards
            per isakson

            <classRange
                mainExpr = "(?x)(?s)                # dot matches new line
                            (?-i)                   # case sensitive
                            (                       # --- open 1st group 
                                \b                  # word boundary  
                                (                   # --- open 2nd group
                                    classdef        # keywords that open a  
                                    |properties     # Balanced Construct
                                    |events         # that is closed by 
                                    |enumaration    # 'end'.
                                    |methods            
                                    |function
                                    |if
                                    |for
                                    |while
                                    |switch
                                    |try
                                    |parfor
                                )                   # --- close 2nd group
                                \b                  # word boundary 
                            )                       # --- close 1st group
                            
                            (                       # open 3rd capturing group    
                                (?:                 # open non-capturing group
                                    (?!             # negative look-ahead
                                    (?1)|\bend\b    # if not a keyword
                                    ).              # then one character
                                )+                  # repeat 
                                |                   # until keyword found
                                (?0)                # recurse RE from start
                            )+                      # repeat 3rd group
                            \bend\b                 #      
                            "
            • guy038

              Hi, @per-isakson and All,

              Sorry for the delay, but I was quite busy answering @iona-hine with a ( long ! ) e-mail.

              https://notepad-plus-plus.org/community/topic/13513/proximity-search/11

              And, while testing some regexes for that e-mail, I just noticed that the interesting syntax (?n), which is a subroutine call to group n, may, in some cases, especially in big files, lead to wrong results, as, for instance, a wrong match of the whole file contents !

              So, referring to my previous regex :

              (?s-i)(if|for|while|switch|try|parfor)((?:(?!(?1)|end).)+|(?R))+end

              It would be better, finally, to rewrite it as :

              (?s-i)(if|for|while|switch|try|parfor)((?:(?!if|for|while|switch|try|parfor|end).)+|(?R))+end
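Writing the key-word list out twice by hand gets error-prone as the list grows. A small sketch in Python that only assembles the regex text from one list: the name build_block_regex is made up for illustration, and, as Python's built-in re module cannot run (?R), the resulting string is meant to be pasted into the N++ Find dialog, not compiled in Python.

```python
def build_block_regex(openers, closer="end"):
    """Assemble the written-out regex text, without (?1) (hypothetical helper)."""
    opening = "|".join(openers)                   # group 1: the opening key-words
    stop = "|".join(list(openers) + [closer])     # look-ahead: openers + closer
    return f"(?s-i)({opening})((?:(?!{stop}).)+|(?R))+{closer}"


pattern = build_block_regex(["if", "for", "while", "switch", "try", "parfor"])
print(pattern)
```

With the six key-words above, the function should give back exactly the rewritten regex, so adding a key-word later only needs a change in one place.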


              Therefore, @per-isakson, maybe it would be better to change, in your regex, the part :

                                      (?1)|\bend\b
              

              By the following :

                                          \b                  # word boundary  
                                          (                   # --- open 4th group
                                              classdef        # keywords, of the 2nd group + 'end',
                                              |properties     # which must NOT occur,
                                              |events         # at ANY position, of the
                                              |enumaration    # present sub-block, till
                                              |methods        # its associated 'end' closing word
                                              |function
                                              |if
                                              |for
                                              |while
                                              |switch
                                              |try
                                              |parfor
                                              |end
                                          )                   # --- close 4th group
                                          \b                  # word boundary 
              
              

              BTW, you succeeded in including word boundaries in your regex ! Fine :-) I can’t remember why I ran into problems while trying to do the same thing !

              Oh ! Maybe, instead of the keyword enumaration, you meant enumeration !


              REMARK :

              On my small example text, I tested the new syntax of my previous regex, but it produces the “all contents” bug too :-((. So, I suppose that it’s rather related to the recursion handling of the present N++ Boost regex engine ?!

              Cheers,

              guy038

              • MAPJe71

                @guy038 and @Per-Isakson,

                Using a named capturing group for the keywords instead of a numbered group, and calling it as a subroutine, probably wouldn’t make a difference, i.e. it would still lead to wrong results in big files …

                <classRange
                    mainExpr = "(?x)                    # free-spacing
                                (?s)                    # dot matches new line
                                (?-i)                   # case sensitive
                                (?'KEYWORDS'            # --- open named capturing group
                                    \b                  # word boundary
                                    (?:                 # open non-capturing group with keywords that open a Balanced Construct that is closed by 'end'.
                                        classdef
                                    |   e(?:numeration|vents)
                                    |   f(?:or|unction)
                                    |   if
                                    |   methods
                                    |   p(?:arfor|roperties)
                                    |   switch
                                    |   try
                                    |   while
                                    )
                                    \b                  # word boundary
                                )                       # --- close named capturing group
                                (?:                     # open non-capturing group
                                    (?:                 # open non-capturing group
                                        (?!             # negative look-ahead
                                            (?&amp;KEYWORDS)
                                        |   \bend\b
                                        )               # if not a keyword
                                        .               # then one character
                                    )+                  # repeat until keyword found
                                |   (?0)                # recurse RE from start
                                )+                      # repeat
                                \bend\b
                                "
                

                And what about subroutine definition(s)?
                E.g.

                <classRange
                    mainExpr = "(?x)                            # free-spacing
                                (?s)                            # dot matches new line
                                (?-i)                           # case sensitive
                                (?(DEFINE)                      # Define subroutines
                                    (?'OPEN_KEYWORDS'
                                        \b                      # word boundary
                                        (?:                     # open non-capturing group with keywords that open a Balanced Construct that is closed by 'end'.
                                            classdef
                                        |   e(?:numeration|vents)
                                        |   f(?:or|unction)
                                        |   if
                                        |   methods
                                        |   p(?:arfor|roperties)
                                        |   switch
                                        |   try
                                        |   while
                                        )
                                        \b                      # word boundary
                                    )
                                    (?'CLOSE_KEYWORDS'
                                        \b                      # word boundary
                                        end
                                        \b                      # word boundary
                                    )
                                )
                                (?&amp;OPEN_KEYWORDS)               # call subroutine
                                (?:                             # open non-capturing group
                                    (?:                         # open non-capturing group
                                        (?!                     # negative look-ahead
                                            (?&amp;OPEN_KEYWORDS)   # call subroutine
                                        |   (?&amp;CLOSE_KEYWORDS)  # call subroutine
                                        )                       # if not a keyword
                                        .                       # then one character
                                    )+                          # repeat until keyword found
                                |   (?0)                        # recurse RE from start
                                )+                              # repeat
                                (?&amp;CLOSE_KEYWORDS)              # call subroutine
                                "
                
                • guy038

                  Hi, @mapje71,

                  I recently noticed a difference between repeating a group #n and using its associated subroutine call (?n), while using some regexes in a .txt file that @iona-hine sent me by e-mail a couple of days ago. Refer to this post, which deals with the general problem of finding a range of characters between two words ( Either the form Word1......Word2 OR Word2......Word1 ! )

                  https://notepad-plus-plus.org/community/topic/13513/proximity-search/6

                  Iona’s file, of size 533,237 bytes, is a ONE-line UTF-8 file, which contains 532,875 characters, from column 1 to column 532,875 !, organized in 74,115 words. The characters are, mostly, word characters ( 458,337 ) + other symbols ( 663 ) and space characters ( 73,875 )

                  In that file, Iona tries to search for any range between the initial boundary man_n and the final boundary city_n, or the opposite, with a maximum of 50 words in between ! It’s interesting to note that this file contains 467 words man_n but ONLY ONE occurrence of the word city_n. Therefore, any regex looking for a range between these two words should find only ONE occurrence !

                  Using, for instance, the regexes A and B, below, we get, as expected, ONE match ( a range of 167 characters, beginning at column 21,481 ), in both cases.

                  Regex A : (?si)(?<=\W)(man_n)((?:\W+\w+){0,50}\W+)(city_n)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)

                  Regex B : (?si)(?<=\W)man_n(?:\W+\w+){0,50}\W+city_n(?=\W)|(?<=\W)city_n(?:\W+\w+){0,50}\W+man_n(?=\W)


                  Now, if we try the regex C, we get TWO matches ! ( The first one is correct, but the second match wrongly selects all the file contents ! )

                  Regex C : (?si)(?<=\W)(city_n)((?:\W+\w+){0,50}\W+)(man_n)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)

                  However, if we use the equivalent regex D, below, without the (?n) subroutine calls, it finds ONE match only, without any bug !!

                  Regex D : (?si)(?<=\W)city_n(?:\W+\w+){0,50}\W+man_n(?=\W)|(?<=\W)man_n(?:\W+\w+){0,50}\W+city_n(?=\W)
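Since the written-out forms B and D avoid the subroutine calls, they can also be generated and tried outside N++. A sketch in Python, under the assumption that Python's re module is an acceptable stand-in for a quick test (it is a different engine than the Boost one in N++, so it says nothing about the bug itself). The name proximity_regex and its max_words parameter are made up for illustration.

```python
import re


def proximity_regex(word_a, word_b, max_words=50):
    """Regex text for word_a ... word_b OR word_b ... word_a (hypothetical helper)."""
    a, b = re.escape(word_a), re.escape(word_b)
    gap = r"(?:\W+\w+){0,%d}\W+" % max_words      # at most max_words words between
    return (
        r"(?si)"
        rf"(?<=\W){a}{gap}{b}(?=\W)"              # word_a first, then word_b
        rf"|(?<=\W){b}{gap}{a}(?=\W)"             # or the opposite order
    )


text = " a man_n walked to the big city_n today "
m = re.search(proximity_regex("man_n", "city_n"), text)
print(m.group())   # man_n walked to the big city_n
```

Called with ("man_n", "city_n") the function should produce regex B, and with ("city_n", "man_n") regex D, so both orders come from the same template.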

                  But, @mapje71, this behaviour could occur because of my weak hardware configuration ( An old Win XP laptop, with 1 MB of memory ! )
                  I can just imagine your surprise, guys ! And you’re right, I quickly need a 21st century machine ! But, on the other hand, working without any UAC feature and other goodies of modern OSes is quite relaxing, too :-))

                  So, @mapje71, I’m going to send you that txt file, by e-mail, for further tests. And I hope that you’ll succeed in sorting it out !

                  Cheers,

                  guy038

                  • MAPJe71

                    Hi @guy038,

                    Thanks for sending “A01466.headed.xml.txt”.
                    Confirming your findings with some additions, note the influence of “wrap around”.
                    Tested on a Desktop running Windows XP Home + Sp3 with Windows Classic GUI ;-)
                    Need to get my Windows 10 x64 ready to be able to dig in and debug this!


                    Regex A: (?si)(?<=\W)(man_n)((?:\W+\w+){0,50}\W+)(city_n)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)
                    and
                    Regex B: (?si)(?<=\W)man_n(?:\W+\w+){0,50}\W+city_n(?=\W)|(?<=\W)city_n(?:\W+\w+){0,50}\W+man_n(?=\W)
                    and
                    Regex D: (?si)(?<=\W)city_n(?:\W+\w+){0,50}\W+man_n(?=\W)|(?<=\W)man_n(?:\W+\w+){0,50}\W+city_n(?=\W)

                    Pass i.e. one match even after repeated search:

                    1. Find w/ wrap around;
                    2. Using RegexTester (by @Claudia-Frank);
                    3. Using RegexBuddy w/ boost::regex 1.58-1.59 ECMAScript (default) flavor

                    FAIL:

                    1. Find w/o wrap around - “Can’t find text”

                    Regex C: (?si)(?<=\W)(city_n)((?:\W+\w+){0,50}\W+)(man_n)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)
                    and
                    Regex G: (?i)(?(DEFINE)(\bcity_n\b)((?:\W+\w+){0,50}\W+)(\bman_n\b))(?1)(?2)(?3)|(?3)(?2)(?1)

                    Pass i.e. one match even after repeated search:

                    1. Using RegexTester (by @Claudia-Frank);
                    2. Using RegexBuddy w/ boost::regex 1.58-1.59 ECMAScript (default) flavor

                    FAIL:

                    1. Find w/ wrap around - toggles between two matches on repeated search (partial and complete text);
                    2. Find w/o wrap around - “Can’t find text”

                    Regex E: (?i)(?(DEFINE)(\bcity_n\b)((?:\W+\w+){0,50}\W+)(\bman_n\b))(?<=\W)(?1)(?2)(?3)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)
                    and
                    Regex F: (?i)(?(DEFINE)(\bcity_n\b)((?:\W+\w+){0,50}\W+)(\bman_n\b))(?<=\W)(?:(?1)(?2)(?3)|(?3)(?2)(?1))(?=\W)

                    Pass i.e. one match even after repeated search:

                    1. Using RegexTester (by @Claudia-Frank);
                    2. Using RegexBuddy w/ boost::regex 1.58-1.59 ECMAScript (default) flavor

                    FAIL:

                    1. Find w/ wrap around - toggles between two matches on repeated search (partial and complete text);
                    2. Find w/o wrap around - toggles between two matches on repeated search (complete text and “Can’t find text”)

                    Regards,
                    Menno
