
    Repeated capturing groups

    • Joe McCay

      Is it possible to reference a captured group that is repeated multiple times? For example,

      ^insert into[ \t]([-_.:/&a-zAZ0-9]+)[ \t]*[(](?:([-_.:/&a-zAZ0-9]+)[,]*){1,5}[)]
      

      will match all of the following.

      INSERT INTO mine (countrycode,statecode,id,statename,sort)
      

      Is there a way I can reference the individual matches (like countrycode or id)? If I use ‘$2’, I get only the last match, sort. The results for $3, $4, and $5 are empty. I would like to capture and reference the individual matches without having to repeat the same regular expression.

      • Coises @Joe McCay

        @Joe-McCay said in Repeated capturing groups:

        Is there a way I can reference the individual matches (like countrycode or id)? If I use ‘$2’, I get only the last match, sort. The results for $3, $4, and $5 are empty. I would like to capture and reference the individual matches without having to repeat the same regular expression.

        There is no way to do that using the Notepad++ regular expression implementation.

        The closest I could come was this:
        ^insert into[ \t]([-_.:/&a-zAZ0-9]+)[ \t]*[(]([-_.:/&a-zAZ0-9]+)(?:,((?2)))?(?:,((?2)))?(?:,((?2)))?(?:,((?2)))?[)]
        which isn’t much better than just repeating the expression — though if the actual expression were more complex or subject to change, the technique might help. The problem is that you have to have an actual, written-out pair of parentheses for each numbered capture group; if a parenthesized group matches more than once, only the last match is saved.
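For readers who want to see the last-match-only rule outside Notepad++, here is a minimal sketch in Python. It assumes Python's standard `re` module, which is not the Boost engine but treats a repeated group the same way in this respect. The names `NAME`, `repeated` and `whole_list` are made up for the illustration, and the stray "AZ" is dropped from the character class. The second pattern shows one possible workaround in a scripting context: capture the whole column list once and split it afterwards.

```python
import re

# Assumption: the question's character class, minus the stray "AZ".
NAME = r"[-_.:/&a-z0-9]+"

# Group 2 sits inside a {1,5} repetition, as in the original question.
repeated = re.compile(
    rf"^insert into[ \t]({NAME})[ \t]*\((?:({NAME}),?){{1,5}}\)",
    re.IGNORECASE,
)

line = "INSERT INTO mine (countrycode,statecode,id,statename,sort)"
m = repeated.match(line)

table = m.group(1)
last_column = m.group(2)  # only the final iteration of the group is kept

# Two-step workaround: capture the whole list as one group, then split it.
whole_list = re.compile(
    rf"^insert into[ \t]({NAME})[ \t]*\(({NAME}(?:,{NAME}){{0,4}})\)",
    re.IGNORECASE,
)
columns = whole_list.match(line).group(2).split(",")

print(table, last_column, columns)
```

This only helps when a script does the processing; a Notepad++ replacement string cannot perform the second step.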

        The Boost regular expression engine which Notepad++ uses has an option for that called Repeated Captures, but it is only accessible through the programming interface; there is no support for using it in a replacement string. A plugin could use this feature, but it would have to call its own copy of Boost::regex directly; I don’t know if any of the scripting interfaces can do it.

        • Joe McCay @Coises

          @Coises Thanks. That is what I thought.

          • guy038

            Hello, @joe-mccay, @coises and all,

            I found a solution, similar to @coises’s, which seems slightly easier to understand:

            SEARCH (?i-s)^insert into[ \t]([-_.:/&a-zAZ0-9]+)[ \t]*\(((?1)),?((?1)?),?((?1)?),?((?1)?),?((?1)?)\)

            From this INPUT text:

            INSERT INTO mine (countrycode,statecode,id,statename,sort)
            INSERT INTO mine (countrycode,statecode,id,statename)
            INSERT INTO mine (countrycode,statecode,id)
            INSERT INTO mine (countrycode,statecode)
            INSERT INTO mine (countrycode)
            

            the following regex S/R:

            SEARCH (?i-s)^insert into[ \t]([-_.:/&a-zAZ0-9]+)[ \t]*\(((?1)),?((?1)?),?((?1)?),?((?1)?),?((?1)?)\)

            REPLACE >$1< >$2< >$3< >$4< >$5< >$6<

            would produce this OUTPUT text:

            >mine<    >countrycode<    >statecode<    >id<    >statename<    >sort<
            >mine<    >countrycode<    >statecode<    >id<    >statename<    ><
            >mine<    >countrycode<    >statecode<    >id<    ><    ><
            >mine<    >countrycode<    >statecode<    ><    ><    ><
            >mine<    >countrycode<    ><    ><    ><    ><
            

            Notes :

            • After the first part (?i-s)^insert into[ \t], group 1 is the part [-_.:/&a-zAZ0-9]+

            • Then, after optional blank characters and the opening parenthesis, the subexpression call (?1) re-runs the group 1 pattern. It is itself wrapped in parentheses to form group 2, and is followed by an optional comma ,?

            • Next, (?1) is used again, this time optionally, and is wrapped in parentheses as before to form the optional group 3

            • The regex then contains three more ,?((?1)?) parts, covering groups 4 to 6, and ends with the closing parenthesis
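The search and replace above can be approximated in Python as a rough check. This is only a sketch: Python's standard `re` module has no (?1) syntax, so each call is expanded by hand into the group 1 pattern, and the names `GROUP1`, `OPTIONAL`, `search` and `replacement` are invented here. The single space between the `>…<` fields is also an assumption, taken from the REPLACE string as it appears in this post.

```python
import re

GROUP1 = r"[-_.:/&a-zAZ0-9]+"          # group 1 pattern exactly as posted
OPTIONAL = rf",?((?:{GROUP1})?)"       # hand-expanded form of ,?((?1)?)

# (?i-s) becomes IGNORECASE; MULTILINE lets ^ match at every line start.
search = re.compile(
    rf"^insert into[ \t]({GROUP1})[ \t]*\(({GROUP1})" + OPTIONAL * 4 + r"\)",
    re.IGNORECASE | re.MULTILINE,
)
replacement = r">\1< >\2< >\3< >\4< >\5< >\6<"

text = "\n".join([
    "INSERT INTO mine (countrycode,statecode,id,statename,sort)",
    "INSERT INTO mine (countrycode,statecode,id,statename)",
    "INSERT INTO mine (countrycode,statecode,id)",
    "INSERT INTO mine (countrycode,statecode)",
    "INSERT INTO mine (countrycode)",
])

result = search.sub(replacement, text).split("\n")
for row in result:
    print(row)
```

Because every optional group is written as ((…)?), it always takes part in the match and simply captures an empty string when a column is missing, which is why the empty `><` fields appear.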

            Best Regards,

            guy038

            P.S. :

            Note the fundamental difference between these two regexes:

            A (?-is)(\d+)ABC\1

            and

            B (?-is)(\d+)ABC(?1)

            Given the INPUT text :

            1ABC1
            12345ABC12345
            456ABC456
            89ABC89
            
            456ABC789
            789ABC456
            0ABC123456789
            0123456789ABC1
            111ABC999
            

            The regex (?-is)(\d+)ABC\1 matches only the first four lines of the INPUT text, whereas the regex (?-is)(\d+)ABC(?1) also matches the five other lines.

            Indeed, the regex (?-is)(\d+)ABC(?1) matches the same text as the regex (?-is)(\d+)ABC(\d+). So the (?1) syntax is just a shortcut for the pattern that defines group 1.

            By contrast, the \1 syntax in regex A stands for the text currently captured by group 1 (i.e. a back-reference to group 1).
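The difference can be checked with a short Python sketch. Python's standard `re` module does not support (?1), so regex B is written here in its expanded form (\d+)ABC(\d+), which the text above describes as matching the same input. The variable names are illustrative only.

```python
import re

lines = [
    "1ABC1",
    "12345ABC12345",
    "456ABC456",
    "89ABC89",
    "456ABC789",
    "789ABC456",
    "0ABC123456789",
    "0123456789ABC1",
    "111ABC999",
]

# Regex A: \1 must repeat the exact text that group 1 captured.
backreference = re.compile(r"(\d+)ABC\1")

# Regex B, expanded: the group 1 *pattern* is applied again, so any digits do.
pattern_reuse = re.compile(r"(\d+)ABC(\d+)")

matched_a = [s for s in lines if backreference.search(s)]
matched_b = [s for s in lines if pattern_reuse.search(s)]

print(matched_a)
print(matched_b)
```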

            • Mark Olson

              C# System.Text.RegularExpressions supports repeated capture groups, so in principle someone could build a C# plugin that does regex search with repeated capture groups.

              I could even add support for repeated capture groups to the regex search functionalities of the JsonTools plugin, since it is implemented in C#. I’m just not currently sure what would be the most user-friendly way to do that.

              • mkupper @Joe McCay

                @Joe-McCay I’m going to re-do @guy038’s solution a little to make it something that seemed a little more understandable to me…

                (?xi)                    # (?x) enables free-spacing mode, which lets me spread the expression over several lines and add #-prefixed comments. (?i) enables ignore-case mode so that [a-z] also matches [A-Z]
                    ^insert\ into[\ \t]  # Due to free-spacing mode we need a backslash in front of spaces that we want to be part of the match pattern
                    ([-_.:/&a-z0-9]+)    # I removed the seemingly spurious "AZ" you had which is also not needed as we are in ignore-case mode
                    [\ \t]*
                    \(
                    ((?1))               # This reuses the $1 regexp to match the first parameter of the INSERT INTO
                    (?:,((?1)))?         # The second up through fifth parameters are optional with all of them also reusing the $1 regexp
                    (?:,((?1)))?
                    (?:,((?1)))?
                    (?:,((?1)))?
                    \)
                

                Look for free-spacing on https://npp-user-manual.org/docs/searching/#search-modifiers to see how the (?x) and (?i) things work.

                Look for subexpression on https://npp-user-manual.org/docs/searching/ to see how the (?ℕ) thing works. Subexpressions were used by both @Coises and @guy038 and are key to doing what you want to do.
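For comparison, the same free-spacing layout can be sketched in Python, where the equivalent flag is `re.VERBOSE`. This is an approximation and not the Notepad++ syntax: Python's standard `re` module has no (?1), so the hypothetical `PARAM` string is substituted wherever the expression above uses it. Inside a character class Python keeps whitespace even in verbose mode, so only the space in "insert into" needs a backslash.

```python
import re

PARAM = r"[-_.:/&a-z0-9]+"   # illustrative stand-in for the (?1) calls

pattern = re.compile(
    rf"""
    ^insert\ into[ \t]      # escaped space: verbose mode drops bare whitespace
    ({PARAM})               # group 1: table name
    [ \t]*
    \(
    ({PARAM})               # group 2: first column, required
    (?:,({PARAM}))?         # groups 3 to 6: optional, each tied to its comma
    (?:,({PARAM}))?
    (?:,({PARAM}))?
    (?:,({PARAM}))?
    \)
    """,
    re.VERBOSE | re.IGNORECASE,
)

three = pattern.match("INSERT INTO mine (countrycode,statecode,id)")
trailing_comma = pattern.match("INSERT INTO mine (countrycode,)")

print(three.groups())
print(trailing_comma)
```

Tying each comma to its parameter means a dangling comma is rejected. In Python, an optional group that took no part in the match is reported as None, not as an empty string.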

                • guy038

                  Hi, @joe-mccay, @coises, @mark-olson, @mkupper and All,

                  Ah… yes, @mkupper’s formulation of the search regex is very clever and quite clear, thanks to the free-spacing mode!


                  I particularly like :

                  • The (?:,((?1)))? syntax, which makes the comma and the (?1) call optional together

                  • The use of the leading i modifier to simplify the group 1 syntax

                  Bravo !!

                  BR

                  guy038

                  The Community of users of the Notepad++ text editor.