Repeated capturing groups

  • J
    Joe McCay
    last edited by Aug 19, 2024, 11:17 PM

    Is it possible to reference a captured group that is repeated multiple times? For example,

    ^insert into[ \t]([-_.:/&a-zAZ0-9]+)[ \t]*[(](?:([-_.:/&a-zAZ0-9]+)[,]*){1,5}[)]
    

    will match all of the following.

    INSERT INTO mine (countrycode,statecode,id,statename,sort)
    

    Is there a way I can reference the individual matches (like countrycode or id)? If I use ‘$2’, I get the last match of sort. The results for $3, $4, and $5 are empty. I would like to capture and reference the individual matches without having to repeat the same regular expression.

    • C
      Coises @Joe McCay
      last edited by Aug 20, 2024, 12:59 AM

      @Joe-McCay said in Repeated capturing groups:

      Is there a way I can reference the individual matches (like countrycode or id)? If I use ‘$2’, I get the last match of sort. The results for $3, $4, and $5 are empty. I would like to capture and reference the individual matches without having to repeat the same regular expression.

      There is no way to do that using the Notepad++ regular expression implementation.

      The closest I could come was this:
      ^insert into[ \t]([-_.:/&a-zAZ0-9]+)[ \t]*[(]([-_.:/&a-zAZ0-9]+)(?:,((?2)))?(?:,((?2)))?(?:,((?2)))?(?:,((?2)))?[)]
      which isn’t much better than just repeating the expression — though if the actual expression were more complex or subject to change, the technique might help. The problem is that you have to have an actual, written-out pair of parentheses for each numbered capture group; if a parenthesized group matches more than once, only the last match is saved.
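
The "only the last match is saved" behaviour is not specific to Notepad++. As a rough illustration, here is the original pattern run through Python's standard `re` module. This is an assumption-laden sketch: Python's engine is not the Boost engine Notepad++ uses, and it merely happens to behave the same way on this point.

```python
import re

# Character class copied from the thread (including its "AZ").
ITEM = r"[-_.:/&a-zAZ0-9]+"

# One written-out capture group (group 2) inside a repeated non-capturing group.
PATTERN = re.compile(
    r"^insert into[ \t](" + ITEM + r")[ \t]*[(](?:(" + ITEM + r")[,]*){1,5}[)]",
    re.IGNORECASE,
)

line = "INSERT INTO mine (countrycode,statecode,id,statename,sort)"
m = PATTERN.match(line)

table = m.group(1)             # the table name
last_column = m.group(2)       # only the final iteration of the repeat survives
group_count = PATTERN.groups   # how many numbered groups the pattern defines

print(table, last_column, group_count)
```

The pattern defines only two numbered groups, so there is nothing for $3, $4, or $5 to refer to, however many times the repeat runs.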

      The Boost regular expression engine which Notepad++ uses has an option for that called Repeated Captures, but it is only accessible through the programming interface; there is no support for using it in a replacement string. A plugin could use this feature, but it would have to call its own copy of Boost::regex directly; I don’t know if any of the scripting interfaces can do it.
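
Where a script is an option, a different technique can stand in for repeated captures: a two-pass approach that first captures the whole parenthesized list as one group, then extracts the items from it. The sketch below uses plain Python and its standard `re` module, not Notepad++ or Boost; the function name `split_insert` is hypothetical.

```python
import re

ITEM = r"[-_.:/&a-zAZ0-9]+"  # character class copied from the thread

# Pass 1: capture the table name and the whole column list as single groups.
OUTER = re.compile(
    r"^insert into[ \t](" + ITEM + r")[ \t]*\(((?:" + ITEM + r",*){1,5})\)",
    re.IGNORECASE,
)


def split_insert(line):
    """Return (table, [columns]), or None if the line does not match."""
    m = OUTER.match(line)
    if m is None:
        return None
    # Pass 2: pull every individual item out of the captured list.
    columns = re.findall(ITEM, m.group(2), re.IGNORECASE)
    return m.group(1), columns
```

This sidesteps the limitation rather than removing it: each item is found by a second search instead of by a numbered group in the first one.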

      • J
        Joe McCay @Coises
        last edited by Joe McCay Aug 20, 2024, 2:26 PM Aug 20, 2024, 2:21 PM

        @Coises Thanks. That is what I thought.

        • G
          guy038
          last edited by guy038 Aug 20, 2024, 4:05 PM Aug 20, 2024, 3:22 PM

          Hello, @joe-mccay, @coises and all,

          I found a solution, similar to @coises’s, which seems slightly easier to understand :

          SEARCH (?i-s)^insert into[ \t]([-_.:/&a-zAZ0-9]+)[ \t]*\(((?1)),?((?1)?),?((?1)?),?((?1)?),?((?1)?)\)

          From this INPUT text, below :

          INSERT INTO mine (countrycode,statecode,id,statename,sort)
          INSERT INTO mine (countrycode,statecode,id,statename)
          INSERT INTO mine (countrycode,statecode,id)
          INSERT INTO mine (countrycode,statecode)
          INSERT INTO mine (countrycode)
          

          The following regex S/R :

          SEARCH (?i-s)^insert into[ \t]([-_.:/&a-zAZ0-9]+)[ \t]*\(((?1)),?((?1)?),?((?1)?),?((?1)?),?((?1)?)\)

          REPLACE >$1< >$2< >$3< >$4< >$5< >$6<

          Would produce this OUTPUT text :

          >mine<    >countrycode<    >statecode<    >id<    >statename<    >sort<
          >mine<    >countrycode<    >statecode<    >id<    >statename<    ><
          >mine<    >countrycode<    >statecode<    >id<    ><    ><
          >mine<    >countrycode<    >statecode<    ><    ><    ><
          >mine<    >countrycode<    ><    ><    ><    ><
          

          Notes :

          • After the first part (?i-s)^insert into[ \t], group 1 is the part [-_.:/&a-zAZ0-9]+

          • Then, after possible blank chars and the opening parenthesis, the subexpression call (?1), which re-uses the regex of group 1, is itself surrounded with parentheses to form group 2, and is followed by an optional comma ,?

          • Next, the (?1) call is made optional and enclosed, as before, in parentheses to form the optional group 3

          • The regex then includes three more ,?((?1)?) sections to cover the possible groups 4 to 6, and ends with the closing parenthesis
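
The structure described in these notes can be tried outside Notepad++ with a sketch like the following. It rests on two assumptions: Python's standard `re` module has no (?1) syntax, so a copy of the group 1 pattern is spliced in wherever (?1) appears, and the `(?i-s)` prefix is replaced by the `re.IGNORECASE` flag (the `-s` part has no effect here, since the regex contains no dot outside a character class). The helper name `groups_of` is hypothetical.

```python
import re

ITEM = r"[-_.:/&a-zAZ0-9]+"   # the pattern of group 1, copied from the thread
SUB = "(?:" + ITEM + ")"      # stands in for the (?1) subexpression call

SEARCH = re.compile(
    r"^insert into[ \t](" + ITEM + r")[ \t]*\("
    + "(" + SUB + ")"              # group 2: mandatory first item
    + (",?(" + SUB + "?)") * 4     # groups 3 to 6: optional items
    + r"\)",
    re.IGNORECASE,
)


def groups_of(line):
    """Return the six captured groups as a tuple, or None if there is no match."""
    m = SEARCH.match(line)
    return m.groups() if m else None
```

Because each optional group is written as ((?1)?), it still takes part in the match when its item is absent and captures an empty string, which is consistent with the empty >< fields in the OUTPUT text.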

          Best Regards,

          guy038

          P.S. :

          See the fundamental difference between these two regexes :

          A (?-is)(\d+)ABC\1

          and

          B (?-is)(\d+)ABC(?1)

          Given the INPUT text :

          1ABC1
          12345ABC12345
          456ABC456
          89ABC89
          
          456ABC789
          789ABC456
          0ABC123456789
          0123456789ABC1
          111ABC999
          

          The regex (?-is)(\d+)ABC\1 matches only the first four lines of the INPUT text, whereas the regex (?-is)(\d+)ABC(?1) also matches the five other lines, below !

          Indeed, the regex (?-is)(\d+)ABC(?1) matches the same text as the regex (?-is)(\d+)ABC(\d+). So, the (?1) syntax is just a shortcut for the regex which defines the whole group 1 !

          But the \1 syntax, in the regex A, represents the text actually matched by group 1 ( i.e. a back-reference to group 1 )
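
The difference can be checked with a short script. This is only a sketch in Python's standard `re` module, which supports \1 but not (?1); as an assumption, regex B is therefore written with the group 1 pattern spelled out a second time.

```python
import re

LINES = [
    "1ABC1",
    "12345ABC12345",
    "456ABC456",
    "89ABC89",
    "456ABC789",
    "789ABC456",
    "0ABC123456789",
    "0123456789ABC1",
    "111ABC999",
]

# Regex A: \1 must repeat the exact text that group 1 captured.
REGEX_A = re.compile(r"(\d+)ABC\1")

# Regex B, expanded: the pattern of group 1 is re-used, not its captured text.
REGEX_B = re.compile(r"(\d+)ABC(?:\d+)")

matched_a = [s for s in LINES if REGEX_A.search(s)]
matched_b = [s for s in LINES if REGEX_B.search(s)]

print(len(matched_a), len(matched_b))
```

Regex A keeps only the lines whose digits are the same on both sides of ABC, while the expanded regex B accepts any digits after ABC.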

          • M
            Mark Olson
            last edited by Aug 20, 2024, 6:01 PM

            C# System.Text.RegularExpressions supports repeated capture groups, so in principle someone could build a C# plugin that does regex search with repeated capture groups.

            I could even add support for repeated capture groups to the regex search functionalities of the JsonTools plugin, since it is implemented in C#. I’m just not currently sure what would be the most user-friendly way to do that.

            • M
              mkupper @Joe McCay
              last edited by Aug 20, 2024, 6:47 PM

              @Joe-McCay I’m going to re-do @guy038’s solution a little to make it something that seemed a little more understandable to me…

              (?xi)                    # (?x) enables free-spacing mode, which allows me to spread the expression over several lines and allows for # prefixed comments. (?i) enables ignore-case mode so that [a-z] also matches [A-Z]
                  ^insert\ into[\ \t]  # Due to free-spacing mode we need a backslash in front of spaces that we want to be part of the match pattern
                  ([-_.:/&a-z0-9]+)    # I removed the seemingly spurious "AZ" you had which is also not needed as we are in ignore-case mode
                  [\ \t]*
                  \(
                  ((?1))               # This reuses the $1 regexp to match the first parameter of the INSERT INTO
                  (?:,((?1)))?         # The second up through fifth parameters are optional with all of them also reusing the $1 regexp
                  (?:,((?1)))?
                  (?:,((?1)))?
                  (?:,((?1)))?
                  \)
              

              Look for free-spacing on https://npp-user-manual.org/docs/searching/#search-modifiers to see how the (?x) and (?i) things work.

              Look for subexpression on https://npp-user-manual.org/docs/searching/ to see how the (?ℕ) thing works. Subexpressions were used by both @Coises and @guy038 and are key to doing what you want to do.
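
For readers who want to experiment with the free-spacing layout outside Notepad++, here is a rough equivalent using Python's standard `re` module. The assumptions: `re.VERBOSE` and `re.IGNORECASE` take the place of (?xi), and because Python's `re` has no (?1) syntax, the group 1 pattern is inserted by name wherever (?1) would be. The names `ITEM` and `columns` are hypothetical.

```python
import re

ITEM = r"[-_.:/&a-z0-9]+"  # the group 1 pattern from the free-spacing version

SEARCH = re.compile(
    rf"""
    ^insert\ into[\ \t]   # spaces must be escaped in verbose mode
    ({ITEM})              # group 1: the table name
    [\ \t]*
    \(
    ({ITEM})              # group 2: the first parameter
    (?:,({ITEM}))?        # groups 3 to 6: optional parameters
    (?:,({ITEM}))?
    (?:,({ITEM}))?
    (?:,({ITEM}))?
    \)
    """,
    re.VERBOSE | re.IGNORECASE,
)


def columns(line):
    """Return the list of captured parameters, or None if the line does not match."""
    m = SEARCH.match(line)
    if m is None:
        return None
    # In Python, an optional group that did not take part in the match is None.
    return [g for g in m.groups()[1:] if g is not None]
```

One design difference from the ,?((?1)?) form: with (?:,(...))? the comma and the item are optional together, so a skipped group is reported as unset (None in Python) rather than as an empty string.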

              • G
                guy038
                last edited by Aug 20, 2024, 11:36 PM

                Hi, @joe-mccay, @coises, @mark-olson, @mkupper and All,

                Ah… yes, @mkupper’s formulation of the search regex is very clever and quite clear, thanks to the free-spacing mode !


                I particularly like :

                • The (?:,((?1)))? syntax, where the comma and the (?1) form are made optional together

                • The use of the leading i modifier to simplify the group 1 syntax

                Bravo !!

                BR

                guy038

                The Community of users of the Notepad++ text editor.