How to: Match Nested Pairs closed by END

    Per Isakson
    last edited by Mar 27, 2017, 9:00 PM

    Context: functionList for MATLAB

    The PHP documentation shows a PCRE regex, \( ( (?>[^()]+) | (?R) )* \), which solves the problem of matching a string in parentheses, allowing for unlimited nested parentheses. This regex works as expected in N++.

    The question is: can this regex be modified to take strings, e.g. properties, methods, if, for, while, etc., as opening symbols and end as the closing symbol? I tried and failed.

    Example: match from if to its matching end in: if a, b, for a, b, while a, b, end, end, end, more text
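To make the expected result concrete, here is a minimal Python sketch that pairs each opening keyword with its end by counting nesting depth. It is a cross-check for what a candidate regex should select, not a regex solution and not something functionList can use. The names OPENERS, TOKEN and outer_blocks are hypothetical, and the sketch assumes that keywords never occur in comments, in strings, or as the end used in indexing such as a(end).

```python
import re

# Hypothetical keyword set for illustration; extend it as needed.
OPENERS = {"if", "for", "while", "switch", "try", "parfor"}
TOKEN = re.compile(r"\b(?:if|for|while|switch|try|parfor|end)\b")


def outer_blocks(text):
    """Return (start, stop) spans of the outermost opener ... end blocks."""
    spans = []
    depth = 0
    start = None
    for m in TOKEN.finditer(text):
        if m.group(0) in OPENERS:
            if depth == 0:
                start = m.start()  # remember where the outermost block opens
            depth += 1
        elif depth > 0:  # an 'end' that closes an open block
            depth -= 1
            if depth == 0:
                spans.append((start, m.end()))
        # an 'end' seen at depth 0 is unbalanced and simply ignored
    return spans


if __name__ == "__main__":
    sample = "if a, b, for a, b, while a, b, end, end, end, more text"
    for begin, stop in outer_blocks(sample):
        print(sample[begin:stop])
```

On the example line, the single span runs from the if to the third end, which is the selection a working recursive regex should reproduce.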

      guy038
      last edited by guy038 Mar 28, 2017, 11:17 PM Mar 28, 2017, 11:11 PM

      Hello, @per-isakson,

      My first attempt, with the opening keywords if, for, while, switch, try, parfor and the closing keyword end, all in lower case, would be the regex below:

      (?s-i)(if|for|while|switch|try|parfor)((?:(?!(?1)|end).)+|(?R))+end

      Just try it against the sample text :

      
      if
      
        while
      
        end
      
      end
      
      forend  ( Test : normally NOT matched )
      
      if
      
      end
      
      end
      
      
      for if end while for if end end end for if end end end end end
      
      for  if  end  while  for  if  end  end  end  for  if  end  end  end  end  end  end  end
      
      
      for
        if
        end
         while
          for
            if
            end
          end
        end
        for
          if
          end
        end
      end
      end
      end
      end
      

      You should obtain five matches: 2 single-line zones and 3 multi-line blocks.


      Of course, as this regex just searches for the exact words, without word boundaries, it would also match, for instance, the wrong block below, selecting from the letters for to the letters end:

      afor
      ....
      ....
      ....
      endz
      

      I tried to build a more complete regex with word boundaries, using \W and the look-arounds (?<=\W) and (?=\W), but my solutions led to other problems and didn’t match all the cases anyway! Then I realized that it would be better to detect, FIRST, any keyword used in that regex that is “glued” to other letters inside a bigger word in your code. For that purpose, use the regex below:

      \w(if|(?<!par)for|while|switch|try|parfor|end)|(?1)\w
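The same check can be sketched outside Notepad++ with Python's standard re module. That module has no (?1) subroutine call, so in this sketch the keyword alternation is kept in one string and inserted twice. The names KEYWORDS, GLUED and find_glued are hypothetical, and this is only an approximation of the regex above for quick experiments.

```python
import re

# Same alternation as the regex above; (?<!par) keeps the 'for' inside
# 'parfor' from being reported as a glued keyword.
KEYWORDS = r"(?:if|(?<!par)for|while|switch|try|parfor|end)"

# A keyword preceded by a word character, or followed by one.
GLUED = re.compile(rf"\w{KEYWORDS}|{KEYWORDS}\w")


def find_glued(text):
    """Return every place where a keyword is glued to other word characters."""
    return [m.group(0) for m in GLUED.finditer(text)]


if __name__ == "__main__":
    for line in ("afor", "endz", "parfor k = 1:10", "if x, end"):
        print(repr(line), find_glued(line))
```

Any line for which find_glued returns a non-empty list contains a keyword embedded in a longer word and deserves a look before the block regex is trusted on that file.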

      IMPORTANT: Sometimes, when clicking ONE MORE time on the Find Next button, all the file contents are wrongly selected. It’s a well-known bug, which occurs mostly with recursive regular expressions :-(( I can’t explain that behaviour! Maybe my regex is not well-formed!?

      Best Regards,

      guy038

      P.S. :

      If this regex puts you in the right direction, I will give you, next time, some details on what it means!

        MAPJe71
        last edited by MAPJe71 Mar 29, 2017, 7:15 PM Mar 29, 2017, 7:14 PM

        @guy038 FYI it’s related to #13505 .

          Per Isakson
          last edited by Apr 1, 2017, 2:52 AM

          Hello @guy038

          Thank you for the recursive regular expression. I have modified it a bit and included it in my functionList.xml.

          The expression works with TextFX|Quick|FindReplace and https://regex101.com/. (classdef must be the first keyword in the file, but that’s okay.)

          The expression is costly to execute (some seconds on my old PC), and regex101 reports Catastrophic Backtracking for many of my test files. I failed to significantly improve the performance.

          However, embedded in functionList.xml, it’s not a total success. Some test files, which TextFX|Quick|FindReplace handles well, only produce the filename in the Function List pane. I will report on that in the thread Trouble making a functionList parser for MATLAB, #13505.

          Best Regards
          per isakson

          <classRange
              mainExpr = "(?x)(?s)                # dot matches new line
                          (?-i)                   # case sensitive
                          (                       # --- open 1st group 
                              \b                  # word boundary  
                              (                   # --- open 2nd group
                                  classdef        # keywords that open a  
                                  |properties     # Balanced Construct
                                  |events         # that is closed by 
                                  |enumaration    # 'end'.
                                  |methods            
                                  |function
                                  |if
                                  |for
                                  |while
                                  |switch
                                  |try
                                  |parfor
                              )                   # --- close 2nd group
                              \b                  # word boundary 
                          )                       # --- close 1st group
                          
                          (                       # open 3rd capturing group    
                              (?:                 # open non-capturing group
                                  (?!             # negative look-ahead
                                  (?1)|\bend\b    # if not a keyword
                                  ).              # then one character
                              )+                  # repeat 
                              |                   # until keyword found
                              (?0)                # recurse RE from start
                          )+                      # repeat 3rd group
                          \bend\b                 #      
                          "
            guy038
            last edited by Apr 1, 2017, 2:24 PM

            Hi, @per-isakson and All,

            Sorry for the delay, but I was quite busy writing a (long!) e-mail answer to @iona-hine.

            https://notepad-plus-plus.org/community/topic/13513/proximity-search/11

            And, while testing some regexes for that question, I just noticed that the interesting syntax (?n), a subroutine call to group number n, may, in some cases, especially in big files, lead to wrong results, for instance a wrong match of the entire file contents!

            So, referring to my previous regex :

            (?s-i)(if|for|while|switch|try|parfor)((?:(?!(?1)|end).)+|(?R))+end

            It would be better, finally, to rewrite it as :

            (?s-i)(if|for|while|switch|try|parfor)((?:(?!if|for|while|switch|try|parfor|end).)+|(?R))+end
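Spelling the keyword list out twice means the two copies can drift apart when a keyword is added. As a hedged convenience, the expression can be generated from a single list and then pasted into the Find dialog. The helper name build_block_regex is hypothetical, and the sketch only builds the string; Python's standard re module cannot run it, because re does not support (?R).

```python
def build_block_regex(openers, closer="end"):
    """Build the recursive block regex from one list of opening keywords."""
    alternation = "|".join(openers)
    return (
        "(?s-i)"                                    # dot matches newline, case sensitive
        f"({alternation})"                          # group 1: an opening keyword
        f"((?:(?!{alternation}|{closer}).)+|(?R))+"  # plain text or a nested block
        f"{closer}"                                 # the closing keyword
    )


if __name__ == "__main__":
    print(build_block_regex(["if", "for", "while", "switch", "try", "parfor"]))
```

With the six keywords used so far, the generated string is character for character the rewritten regex above.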


            Therefore, @per-isakson, it may be better to change, in your regex, the part:

                                    (?1)|\bend\b
            

            to the following:

                                        \b                  # word boundary  
                                        (                   # --- open 4th group
                                            classdef        # keywords, of the 2nd group + 'end',
                                            |properties     # which must NOT occur,
                                            |events         # at ANY position, of the
                                            |enumaration    # present sub-block, till
                                            |methods        # its associated 'end' closing word
                                            |function
                                            |if
                                            |for
                                            |while
                                            |switch
                                            |try
                                            |parfor
                                            |end
                                        )                   # --- close 4th group
                                        \b                  # word boundary 
            
            

            BTW, you succeeded in including word boundaries in your regex! Fine :-) I can’t remember why I ran into problems while trying to do the same thing!

            Oh! Maybe, instead of the keyword enumaration, you meant enumeration!


            REMARK :

            On my small test example, I tested the new syntax of my previous regex, but it produces the all-contents bug, too :-(( So I suppose that it’s rather related to the recursion handling of the present N++ Boost regex engine?!

            Cheers,

            guy038

              MAPJe71
              last edited by Apr 1, 2017, 8:18 PM

              @guy038 and @Per-Isakson,

              Using a named capturing group for the keywords instead of a numbered group, and calling it as a subroutine, probably wouldn’t make a difference, i.e. it would still lead to wrong results in big files …

              <classRange
                  mainExpr = "(?x)                    # free-spacing
                              (?s)                    # dot matches new line
                              (?-i)                   # case sensitive
                              (?'KEYWORDS'            # --- open named capturing group
                                  \b                  # word boundary
                                  (?:                 # open non-capturing group with keywords that open a Balanced Construct that is closed by 'end'.
                                      classdef
                                  |   e(?:numeration|vents)
                                  |   f(?:or|unction)
                                  |   if
                                  |   methods
                                  |   p(?:arfor|roperties)
                                  |   switch
                                  |   try
                                  |   while
                                  )
                                  \b                  # word boundary
                              )                       # --- close named capturing group
                              (?:                     # open non-capturing group
                                  (?:                 # open non-capturing group
                                      (?!             # negative look-ahead
                                          (?&amp;KEYWORDS)
                                      |   \bend\b
                                      )               # if not a keyword
                                      .               # then one character
                                  )+                  # repeat until keyword found
                              |   (?0)                # recurse RE from start
                              )+                      # repeat
                              \bend\b
                              "
              

              And what about subroutine definition(s)? E.g.:

              <classRange
                  mainExpr = "(?x)                            # free-spacing
                              (?s)                            # dot matches new line
                              (?-i)                           # case sensitive
                              (?(DEFINE)                      # Define subroutines
                                  (?'OPEN_KEYWORDS'
                                      \b                      # word boundary
                                      (?:                     # open non-capturing group with keywords that open a Balanced Construct that is closed by 'end'.
                                          classdef
                                      |   e(?:numeration|vents)
                                      |   f(?:or|unction)
                                      |   if
                                      |   methods
                                      |   p(?:arfor|roperties)
                                      |   switch
                                      |   try
                                      |   while
                                      )
                                      \b                      # word boundary
                                  )
                                  (?'CLOSE_KEYWORDS'
                                      \b                      # word boundary
                                      end
                                      \b                      # word boundary
                                  )
                              )
                              (?&amp;OPEN_KEYWORDS)               # call subroutine
                              (?:                             # open non-capturing group
                                  (?:                         # open non-capturing group
                                      (?!                     # negative look-ahead
                                          (?&amp;OPEN_KEYWORDS)   # call subroutine
                                      |   (?&amp;CLOSE_KEYWORDS)  # call subroutine
                                      )                       # if not a keyword
                                      .                       # then one character
                                  )+                          # repeat until keyword found
                              |   (?0)                        # recurse RE from start
                              )+                              # repeat
                              (?&amp;CLOSE_KEYWORDS)              # call subroutine
                              "
              
                guy038
                last edited by guy038 Apr 2, 2017, 9:54 AM Apr 2, 2017, 1:54 AM

                Hi, @mapje71,

                I recently noticed a difference between spelling out the pattern of a group n again and using its associated subroutine call (?n), while running some regexes against a .txt file that @iona-hine sent me by e-mail a couple of days ago. Refer to this post, which discusses the general problem of finding a range of characters between two words (in either form, Word1......Word2 OR Word2......Word1):

                https://notepad-plus-plus.org/community/topic/13513/proximity-search/6

                Iona’s file, 533,237 bytes in size, is a ONE-line UTF-8 file which contains 532,875 characters, from column 1 to column 532,875, organized in 74,115 words. The characters are mostly word characters (458,337), plus other symbols (663) and space characters (73,875).

                In that file, Iona tries to search for any range between the initial boundary man_n and the final boundary city_n, or the opposite, with a maximum of 50 words in between. It’s interesting to note that this file contains 467 occurrences of the word man_n but ONLY ONE occurrence of the word city_n. Therefore, any regex looking for a range between these two words should find only ONE occurrence!

                Using, for instance, the regexes A and B, below, we get, as expected, ONE match ( a range of 167 characters, beginning at column 21,481 ), in both cases.

                Regex A : (?si)(?<=\W)(man_n)((?:\W+\w+){0,50}\W+)(city_n)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)

                Regex B : (?si)(?<=\W)man_n(?:\W+\w+){0,50}\W+city_n(?=\W)|(?<=\W)city_n(?:\W+\w+){0,50}\W+man_n(?=\W)


                Now, if we try regex C, we get TWO matches! (The first one is correct, but the second match wrongly selects all the file contents!)

                Regex C : (?si)(?<=\W)(city_n)((?:\W+\w+){0,50}\W+)(man_n)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)

                However, if we use the equivalent regex D below, without the (?n) syntax, it finds ONE match only, without any bug!

                Regex D : (?si)(?<=\W)city_n(?:\W+\w+){0,50}\W+man_n(?=\W)|(?<=\W)man_n(?:\W+\w+){0,50}\W+city_n(?=\W)
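Regexes B and D use no subroutine calls, so their intended behaviour can be reproduced with Python's standard re module. The sketch below builds the same shape of pattern for any two words. The name proximity_pattern and the sample sentence are made up for illustration. Python's engine is not the Boost engine used by N++, so this only shows what the regex is meant to select and says nothing about the all-contents bug.

```python
import re


def proximity_pattern(word_a, word_b, max_words=50):
    """Match word_a ... word_b, or word_b ... word_a, with at most
    max_words words in between (the spelled-out form of regexes B and D)."""
    a, b = re.escape(word_a), re.escape(word_b)
    gap = rf"(?:\W+\w+){{0,{max_words}}}\W+"  # up to max_words words, then a separator
    return re.compile(
        rf"(?si)(?<=\W){a}{gap}{b}(?=\W)|(?<=\W){b}{gap}{a}(?=\W)"
    )


if __name__ == "__main__":
    text = " a man_n walks to the city_n today "
    match = proximity_pattern("man_n", "city_n").search(text)
    print(match.group(0) if match else None)
```

Because both orders are written out explicitly, swapping the two arguments gives an equivalent search, which mirrors the relation between regex B and regex D.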

                But, @mapje71, this behaviour could be due to my weak hardware configuration (an old Win XP laptop, with 1 Mo of memory!)
                I can just imagine your surprise, guys! And you’re right, I quickly need a 21st-century machine! But, on the other hand, working without any UAC feature and other goodies of a modern OS is quite relaxing, too :-))

                So, @mapje71, I’m going to send you that txt file by e-mail, for further tests. And I hope that you’ll succeed in sorting it out!

                Cheers,

                guy038

                  MAPJe71
                  last edited by Apr 2, 2017, 11:07 AM

                  Hi @guy038,

                  Thanks for sending “A01466.headed.xml.txt”.
                  Confirming your findings with some additions, note the influence of “wrap around”.
                  Tested on a Desktop running Windows XP Home + Sp3 with Windows Classic GUI ;-)
                  Need to get my Windows 10 x64 ready to be able to dig in and debug this!


                  Regex A: (?si)(?<=\W)(man_n)((?:\W+\w+){0,50}\W+)(city_n)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)
                  and
                  Regex B: (?si)(?<=\W)man_n(?:\W+\w+){0,50}\W+city_n(?=\W)|(?<=\W)city_n(?:\W+\w+){0,50}\W+man_n(?=\W)
                  and
                  Regex D: (?si)(?<=\W)city_n(?:\W+\w+){0,50}\W+man_n(?=\W)|(?<=\W)man_n(?:\W+\w+){0,50}\W+city_n(?=\W)

                  Pass i.e. one match even after repeated search:

                  1. Find w/ wrap around;
                  2. Using RegexTester (by @Claudia-Frank);
                  3. Using RegexBuddy w/ boost::regex 1.58-1.59 ECMAScript (default) flavor

                  FAIL:

                  1. Find w/o wrap around - “Can’t find text”

                  Regex C: (?si)(?<=\W)(city_n)((?:\W+\w+){0,50}\W+)(man_n)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)
                  and
                  Regex G: (?i)(?(DEFINE)(\bcity_n\b)((?:\W+\w+){0,50}\W+)(\bman_n\b))(?1)(?2)(?3)|(?3)(?2)(?1)

                  Pass i.e. one match even after repeated search:

                  1. Using RegexTester (by @Claudia-Frank);
                  2. Using RegexBuddy w/ boost::regex 1.58-1.59 ECMAScript (default) flavor

                  FAIL:

                  1. Find w/ wrap around - toggles between two matches on repeated search (partial and complete text);
                  2. Find w/o wrap around - “Can’t find text”

                  Regex E: (?i)(?(DEFINE)(\bcity_n\b)((?:\W+\w+){0,50}\W+)(\bman_n\b))(?<=\W)(?1)(?2)(?3)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)
                  and
                  Regex F: (?i)(?(DEFINE)(\bcity_n\b)((?:\W+\w+){0,50}\W+)(\bman_n\b))(?<=\W)(?:(?1)(?2)(?3)|(?3)(?2)(?1))(?=\W)

                  Pass i.e. one match even after repeated search:

                  1. Using RegexTester (by @Claudia-Frank);
                  2. Using RegexBuddy w/ boost::regex 1.58-1.59 ECMAScript (default) flavor

                  FAIL:

                  1. Find w/ wrap around - toggles between two matches on repeated search (partial and complete text);
                  2. Find w/o wrap around - toggles between two matches on repeated search (complete text and “Can’t find text”)

                  Regards,
                  Menno
