    Repeated capturing groups

    7 Posts 5 Posters 875 Views
    • Joe McCay

      Is it possible to reference a capture group that matched multiple times? For example,

      ^insert into[ \t]([-_.:/&a-zAZ0-9]+)[ \t]*[(](?:([-_.:/&a-zAZ0-9]+)[,]*){1,5}[)]
      

      will match all of the following.

      INSERT INTO mine (countrycode,statecode,id,statename,sort)
      

      Is there a way I can reference the individual matches (like countrycode or id)? If I use ‘$2’, I get the last match of sort. The results for $3, $4, and $5 are empty. I would like to capture and reference the individual matches without having to repeat the same regular expression.

      • Coises @Joe McCay

        @Joe-McCay said in Repeated capturing groups:

        Is there a way I can reference the individual matches (like countrycode or id)? If I use ‘$2’, I get the last match of sort. The results for $3, $4, and $5 are empty. I would like to capture and reference the individual matches without having to repeat the same regular expression.

        There is no way to do that using the Notepad++ regular expression implementation.

        The closest I could come was this:
        ^insert into[ \t]([-_.:/&a-zAZ0-9]+)[ \t]*[(]([-_.:/&a-zAZ0-9]+)(?:,((?2)))?(?:,((?2)))?(?:,((?2)))?(?:,((?2)))?[)]
        which isn’t much better than just repeating the expression — though if the actual expression were more complex or subject to change, the technique might help. The problem is that you have to have an actual, written-out pair of parentheses for each numbered capture group; if a parenthesized group matches more than once, only the last match is saved.
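The "only the last match is saved" behavior is easy to reproduce outside Notepad++. The following is a minimal sketch using Python's built-in `re` module, which handles repeated groups the same way in this respect. It is an illustration only, not the Boost engine that Notepad++ uses, and the names `PATTERN`, `line` and `m` are made up for the example. The character class drops the `AZ` because `re.IGNORECASE` is assumed.

```python
import re

# The original pattern, with the repeated group (?:(...),*){1,5}.
# Assumption: case-insensitive matching, so a-z also covers A-Z.
PATTERN = re.compile(
    r"^insert into[ \t]([-_.:/&a-z0-9]+)[ \t]*\((?:([-_.:/&a-z0-9]+),*){1,5}\)",
    re.IGNORECASE,
)

line = "INSERT INTO mine (countrycode,statecode,id,statename,sort)"
m = PATTERN.match(line)

# There are only two written-out capture groups, so only two values exist.
# Group 2 matched five times, but each iteration overwrote the previous one.
print(PATTERN.groups)   # number of capture groups in the pattern
print(m.groups())       # table name and the LAST column only
```

Since the pattern contains just two pairs of capturing parentheses, there is nothing for `$3`, `$4` or `$5` to refer to, which is why they come back empty.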

        The Boost regular expression engine which Notepad++ uses has an option for that called Repeated Captures, but it is only accessible through the programming interface; there is no support for using it in a replacement string. A plugin could use this feature, but it would have to call its own copy of Boost::regex directly; I don’t know if any of the scripting interfaces can do it.
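For anyone doing this from a script rather than from the Notepad++ Replace dialog, every repetition can be recovered without Repeated Captures by matching in two passes: first grab the whole parenthesized list, then pull the individual names out of it. The sketch below uses Python's standard `re` module and is a different technique from the Boost feature described above; `HEADER_RE`, `NAME_RE` and `all_columns` are hypothetical names. It is also looser than the original pattern, since it does not enforce the 1 to 5 limit.

```python
import re

# Pass 1: table name plus everything between the parentheses.
HEADER_RE = re.compile(
    r"^insert into[ \t]([-_.:/&a-z0-9]+)[ \t]*\(([^)]*)\)",
    re.IGNORECASE,
)
# Pass 2: each individual name inside the captured list.
NAME_RE = re.compile(r"[-_.:/&a-z0-9]+", re.IGNORECASE)


def all_columns(line):
    """Return (table, [columns...]) or None if the line does not match."""
    m = HEADER_RE.match(line)
    if m is None:
        return None
    return m.group(1), NAME_RE.findall(m.group(2))


table, columns = all_columns(
    "INSERT INTO mine (countrycode,statecode,id,statename,sort)"
)
print(table, columns)
```

Each column is then available by index, for example `columns[2]` for `id`.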

        • Joe McCay @Coises

          @Coises Thanks. That is what I thought.

          • guy038

            Hello, @joe-mccay, @coises and all,

            I found a solution, similar to @coises’s one, which seems slightly easier to understand :

            SEARCH (?i-s)^insert into[ \t]([-_.:/&a-zAZ0-9]+)[ \t]*\(((?1)),?((?1)?),?((?1)?),?((?1)?),?((?1)?)\)

            From this INPUT text, below :

            INSERT INTO mine (countrycode,statecode,id,statename,sort)
            INSERT INTO mine (countrycode,statecode,id,statename)
            INSERT INTO mine (countrycode,statecode,id)
            INSERT INTO mine (countrycode,statecode)
            INSERT INTO mine (countrycode)
            

            The following regex S/R :

            SEARCH (?i-s)^insert into[ \t]([-_.:/&a-zAZ0-9]+)[ \t]*\(((?1)),?((?1)?),?((?1)?),?((?1)?),?((?1)?)\)

            REPLACE >$1< >$2< >$3< >$4< >$5< >$6<

            Would produce this OUTPUT text :

            >mine<    >countrycode<    >statecode<    >id<    >statename<    >sort<
            >mine<    >countrycode<    >statecode<    >id<    >statename<    ><
            >mine<    >countrycode<    >statecode<    >id<    ><    ><
            >mine<    >countrycode<    >statecode<    ><    ><    ><
            >mine<    >countrycode<    ><    ><    ><    ><
            

            Notes :

            • After the first part (?i-s)^insert into[ \t], group 1 is the part [-_.:/&a-zAZ0-9]+

            • Then, after possible blank chars and the opening parenthesis, the subroutine call (?1), which re-runs the group 1 pattern, is itself surrounded with parentheses to form group 2, and is followed by an optional comma char ,?

            • Next, (?1) is repeated, this time optionally, and enclosed in parentheses as before, to form the optional group 3

            • The whole regex includes three more ,?((?1)?) parts, to cover the optional groups 4 to 6, and ends with the closing parenthesis !
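The search/replace described in these notes can be approximated in Python as a rough cross-check. This is a sketch under assumptions: Python's built-in `re` has no `(?1)` subroutine call, so the group 1 sub-pattern is stored in a variable (`NAME`, a made-up name) and interpolated wherever the post writes `(?1)`. The replacement uses a single space between fields and `\1` in place of `$1`.

```python
import re

NAME = r"[-_.:/&a-z0-9]+"        # stands in for the group 1 pattern / (?1)
OPT = rf",?((?:{NAME})?)"        # optional comma, then a group that may match empty text

SEARCH = re.compile(
    rf"^insert into[ \t]({NAME})[ \t]*\(({NAME})" + OPT * 4 + r"\)",
    re.IGNORECASE | re.MULTILINE,
)
REPLACE = r">\1< >\2< >\3< >\4< >\5< >\6<"

text = """\
INSERT INTO mine (countrycode,statecode,id,statename,sort)
INSERT INTO mine (countrycode,statecode,id)
INSERT INTO mine (countrycode)
"""

# Groups 3 to 6 always take part in the match, so missing columns
# come out as empty strings rather than as unset groups.
result = SEARCH.sub(REPLACE, text)
print(result)
```

Because each optional part is written as `((?1)?)`, with the `?` inside the capturing parentheses, the group still participates in the match and simply captures an empty string when a column is absent.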

            Best Regards,

            guy038

            P.S. :

            See the fundamental difference between these two regexes :

            A (?-is)(\d+)ABC\1

            and

            B (?-is)(\d+)ABC(?1)

            Given the INPUT text :

            1ABC1
            12345ABC12345
            456ABC456
            89ABC89
            
            456ABC789
            789ABC456
            0ABC123456789
            0123456789ABC1
            111ABC999
            

            The regex (?-is)(\d+)ABC\1 matches only the first four lines of the INPUT text, whereas the regex (?-is)(\d+)ABC(?1) also matches the five other lines below !

            Indeed, the regex (?-is)(\d+)ABC(?1) matches the same text as the regex (?-is)(\d+)ABC(\d+), except that (?1) does not create a new capture group. So, the (?1) syntax is just a shortcut for the sub-pattern that defines group 1 !

            But the \1 syntax, in regex A, is a back-reference : it stands for the text currently captured by group 1
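The difference between a back-reference and a re-used sub-pattern can be checked with a short Python sketch. Python's built-in `re` does not support `(?1)`, so regex B is written here in its expanded form `(\d+)ABC(\d+)`, which matches the same text. The variable names are invented for the example.

```python
import re

# Regex A: \1 must repeat the exact text that group 1 captured.
BACKREF = re.compile(r"(\d+)ABC\1")
# Regex B, expanded: the \d+ pattern is applied again, to any digits.
SUBPATTERN = re.compile(r"(\d+)ABC(\d+)")

lines = [
    "1ABC1",
    "12345ABC12345",
    "456ABC456",
    "89ABC89",
    "456ABC789",
    "789ABC456",
    "0ABC123456789",
    "0123456789ABC1",
    "111ABC999",
]

backref_hits = [s for s in lines if BACKREF.search(s)]
subpattern_hits = [s for s in lines if SUBPATTERN.search(s)]

print(backref_hits)          # only lines whose two numbers are identical
print(len(subpattern_hits))  # every line with digits on both sides of ABC
```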

            • Mark Olson

              C# System.Text.RegularExpressions supports repeated capture groups, so in principle someone could build a C# plugin that does regex search with repeated capture groups.

              I could even add support for repeated capture groups to the regex search functionalities of the JsonTools plugin, since it is implemented in C#. I’m just not currently sure what would be the most user-friendly way to do that.

              • mkupper @Joe McCay

                @Joe-McCay I’m going to re-do @guy038’s solution a little to make it something that seemed a little more understandable to me…

                (?xi)                    # (?x) Enables free-spacing mode which allows me to spread the expression over several lines and allows for # prefixed comments. (?i) enables ignore-case mode so that [a-z] also matches [A-Z]
                    ^insert\ into[\ \t]  # Due to free-spacing mode we need a backslash in front of spaces that we want to be part of the match pattern
                    ([-_.:/&a-z0-9]+)    # I removed the seemingly spurious "AZ" you had which is also not needed as we are in ignore-case mode
                    [\ \t]*
                    \(
                    ((?1))               # This reuses the $1 regexp to match the first parameter of the INSERT INTO
                    (?:,((?1)))?         # The second up through fifth parameters are optional with all of them also reusing the $1 regexp
                    (?:,((?1)))?
                    (?:,((?1)))?
                    (?:,((?1)))?
                    \)
                

                Look for free-spacing on https://npp-user-manual.org/docs/searching/#search-modifiers to see how the (?x) and (?i) things work.

                Look for subexpression on https://npp-user-manual.org/docs/searching/ to see how the (?ℕ) thing works. Subexpressions were used by both @Coises and @guy038 and are key to doing what you want to do.
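The same layout carries over to other engines that have a free-spacing mode. As a sketch only, here is how it might look with Python's built-in `re`, using `re.VERBOSE` for `(?x)` and `re.IGNORECASE` for `(?i)`. Python's `re` lacks the `(?1)` subexpression call, so the shared pattern is interpolated from a variable; `NAME`, `INSERT_RE` and `parse_insert` are hypothetical names.

```python
import re

# Stand-in for the group 1 pattern that (?1) would re-use.
NAME = r"[-_.:/&a-z0-9]+"

INSERT_RE = re.compile(
    rf"""
    ^insert\ into[\ \t]   # spaces must be escaped in free-spacing mode
    ({NAME})              # group 1: table name
    [\ \t]*
    \(
    ({NAME})              # group 2: first parameter, required
    (?:,({NAME}))?        # groups 3 to 6: optional parameters,
    (?:,({NAME}))?        # each with its own written-out parentheses
    (?:,({NAME}))?
    (?:,({NAME}))?
    \)
    """,
    re.VERBOSE | re.IGNORECASE,
)


def parse_insert(line):
    """Return the six groups as a tuple, or None if the line does not match."""
    m = INSERT_RE.match(line)
    return m.groups() if m else None


print(parse_insert("INSERT INTO mine (countrycode,statecode,id)"))
```

One difference from the `((?1)?)` form: when the whole `(?:,(...))?` part is skipped, the inner group does not participate at all, so Python reports it as `None` rather than as an empty string.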

                • guy038

                  Hi, @joe-mccay, @coises, @mark-olson, @mkupper and All,

                  Ah… yes, @mkupper’s formulation of the search regex is very clever and quite clear, thanks to the free-spacing mode !


                  I particularly like :

                  • The (?:,((?1)))? syntax, which makes the comma and the (?1) call optional together

                  • The use of the leading i modifier to simplify the group 1 syntax

                  Bravo !!

                  BR

                  guy038
