Repeated capturing groups

Joe McCay

Is it possible to reference a captured group that is repeated multiple times? For example,

^insert into[ \t]([-_.:/&a-zAZ0-9]+)[ \t]*[(](?:([-_.:/&a-zAZ0-9]+)[,]*){1,5}[)]

will match all of the following.

INSERT INTO mine (countrycode,statecode,id,statename,sort)

Is there a way I can reference the individual matches (like countrycode or id)? If I use ‘$2’, I get the last match of sort. The results for $3, $4, and $5 are empty. I would like to capture and reference the individual matches without having to repeat the same regular expression.

Coises

@Joe-McCay said in Repeated capturing groups:

Is there a way I can reference the individual matches (like countrycode or id)? If I use ‘$2’, I get the last match of sort. The results for $3, $4, and $5 are empty. I would like to capture and reference the individual matches without having to repeat the same regular expression.

There is no way to do that using the Notepad++ regular expression implementation.

The closest I could come was this:
^insert into[ \t]([-_.:/&a-zAZ0-9]+)[ \t]*[(]([-_.:/&a-zAZ0-9]+)(?:,((?2)))?(?:,((?2)))?(?:,((?2)))?(?:,((?2)))?[)]
which isn’t much better than just repeating the expression — though if the actual expression were more complex or subject to change, the technique might help. The problem is that you have to have an actual, written-out pair of parentheses for each numbered capture group; if a parenthesized group matches more than once, only the last match is saved.
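The "only the last match is saved" behaviour is not specific to Boost. As a rough illustration, here is a minimal sketch using Python's standard `re` module (an assumption for demonstration only; Notepad++ itself uses Boost). The "AZ" in the character class is dropped because the ignore-case flag is on:

```python
import re

line = "INSERT INTO mine (countrycode,statecode,id,statename,sort)"

# Same shape as the original pattern: one capturing group sitting inside
# a repeated non-capturing group.
pattern = re.compile(
    r"^insert into[ \t]([-_.:/&a-z0-9]+)[ \t]*[(](?:([-_.:/&a-z0-9]+)[,]*){1,5}[)]",
    re.IGNORECASE,
)

m = pattern.match(line)
table = m.group(1)        # the table name
last_column = m.group(2)  # only the final repetition survives
print(table, last_column)
print(pattern.groups)     # number of capture groups the pattern defines
```

The pattern defines only two groups, however many times the inner one repeats, which is why there is nothing to find in $3, $4 or $5.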

The Boost regular expression engine which Notepad++ uses has an option for that called Repeated Captures, but it is only accessible through the programming interface; there is no support for using it in a replacement string. A plugin could use this feature, but it would have to call its own copy of Boost::regex directly; I don’t know if any of the scripting interfaces can do it.
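For comparison, a standalone script can approximate repeated captures in two steps: capture the whole parenthesized list once, then split it. The sketch below uses Python's standard `re` module; the name `insert_columns` is hypothetical, and this is a workaround, not the Boost Repeated Captures feature. It is also stricter than the original pattern, since it requires exactly one comma between names:

```python
import re

NAME = r"[-_.:/&a-z0-9]+"  # same character class; case is handled by the flag

# Step 1: capture the table name and the entire column list as one group.
STATEMENT = re.compile(
    rf"^insert into[ \t]({NAME})[ \t]*\(({NAME}(?:,{NAME}){{0,4}})\)",
    re.IGNORECASE,
)

def insert_columns(line):
    """Return (table, [columns]), or None if the line does not match."""
    m = STATEMENT.match(line)
    if m is None:
        return None
    # Step 2: split the captured list to recover every repetition.
    return m.group(1), m.group(2).split(",")

result = insert_columns("INSERT INTO mine (countrycode,statecode,id,statename,sort)")
print(result)
```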

Joe McCay

@Coises Thanks. That is what I thought.

guy038

Hello, @joe-mccay, @coises and all,

I found a solution, similar to @coises’s, which seems slightly easier to understand:

SEARCH (?i-s)^insert into[ \t]([-_.:/&a-zAZ0-9]+)[ \t]*\(((?1)),?((?1)?),?((?1)?),?((?1)?),?((?1)?)\)

From this INPUT text, below :

INSERT INTO mine (countrycode,statecode,id,statename,sort)
INSERT INTO mine (countrycode,statecode,id,statename)
INSERT INTO mine (countrycode,statecode,id)
INSERT INTO mine (countrycode,statecode)
INSERT INTO mine (countrycode)

The following regex S/R :

SEARCH (?i-s)^insert into[ \t]([-_.:/&a-zAZ0-9]+)[ \t]*\(((?1)),?((?1)?),?((?1)?),?((?1)?),?((?1)?)\)

REPLACE >$1< >$2< >$3< >$4< >$5< >$6<

Would produce this OUTPUT text :

>mine<    >countrycode<    >statecode<    >id<    >statename<    >sort<
>mine<    >countrycode<    >statecode<    >id<    >statename<    ><
>mine<    >countrycode<    >statecode<    >id<    ><    ><
>mine<    >countrycode<    >statecode<    ><    ><    ><
>mine<    >countrycode<    ><    ><    ><    ><

Notes :

After the first part (?i-s)^insert into[ \t], group 1 is the part [-_.:/&a-zAZ0-9]+
Then, after any blank characters and the opening parenthesis, the subexpression call (?1) re-runs the group 1 pattern; it is itself wrapped in parentheses to form group 2 and is followed by an optional comma ,?
Next, (?1) is repeated, this time made optional, and again wrapped in parentheses to form the optional group 3
The whole regex includes three more ,?((?1)?) parts, covering the optional groups 4 to 6, and ends with the closing parenthesis
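To check the OUTPUT table outside Notepad++, the same structure can be sketched in Python. Note the assumption: Python's standard `re` module has no (?1) subexpression call, so the group 1 pattern is spliced in from a shared string (the hypothetical name `NAME`) instead. The "AZ" is dropped because the ignore-case flag is on.

```python
import re

NAME = r"[-_.:/&a-z0-9]+"  # what group 1 matches; stands in for (?1)

SEARCH = re.compile(
    rf"^insert into[ \t]({NAME})[ \t]*\(({NAME})"  # groups 1 and 2
    + rf",?((?:{NAME})?)" * 4                       # optional groups 3 to 6
    + r"\)",
    re.IGNORECASE,
)

lines = [
    "INSERT INTO mine (countrycode,statecode,id,statename,sort)",
    "INSERT INTO mine (countrycode,statecode,id,statename)",
    "INSERT INTO mine (countrycode,statecode,id)",
    "INSERT INTO mine (countrycode,statecode)",
    "INSERT INTO mine (countrycode)",
]

rows = [SEARCH.match(line).groups() for line in lines]
for row in rows:
    # Same layout idea as the REPLACE string: each group between > and <
    print("    ".join(f">{value}<" for value in row))
```

Because the optional part sits inside each capturing group, every group takes part in the match and simply comes back empty when a column is absent.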

Best Regards,

guy038

P.S. :

See the fundamental difference between these two regexes :

A (?-is)(\d+)ABC\1

and

B (?-is)(\d+)ABC(?1)

Given the INPUT text :

1ABC1
12345ABC12345
456ABC456
89ABC89

456ABC789
789ABC456
0ABC123456789
0123456789ABC1
111ABC999

The regex (?-is)(\d+)ABC\1 matches only the first four lines of the INPUT text, whereas the regex (?-is)(\d+)ABC(?1) also matches the five other lines below!

Indeed, the regex (?-is)(\d+)ABC(?1) matches exactly like the regex (?-is)(\d+)ABC(\d+), except that (?1) does not create a group 2. So, the (?1) syntax is just a shortcut for the pattern that defines group 1!

But the \1 syntax, in regex A, represents the current value of group 1 (i.e. a back-reference to the text that group 1 actually matched)
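The difference can be checked with a short Python sketch. One assumption to flag: Python's standard `re` module has no (?1) syntax, so regex B is written with the group 1 pattern \d+ spelled out again, which is what (?1) stands for here.

```python
import re

lines = [
    "1ABC1",
    "12345ABC12345",
    "456ABC456",
    "89ABC89",
    "456ABC789",
    "789ABC456",
    "0ABC123456789",
    "0123456789ABC1",
    "111ABC999",
]

regex_a = re.compile(r"(\d+)ABC\1")   # \1: the same TEXT that group 1 matched
regex_b = re.compile(r"(\d+)ABC\d+")  # stand-in for (?1): the same PATTERN again

matched_a = [s for s in lines if regex_a.search(s)]
matched_b = [s for s in lines if regex_b.search(s)]

print(len(matched_a), "lines match A")
print(len(matched_b), "lines match B")
```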

Mark Olson

C# System.Text.RegularExpressions supports repeated capture groups, so in principle someone could build a C# plugin that does regex search with repeated capture groups.

I could even add support for repeated capture groups to the regex search functionalities of the JsonTools plugin, since it is implemented in C#. I’m just not currently sure what would be the most user-friendly way to do that.

mkupper

@Joe-McCay I’m going to re-do @guy038’s solution a little to make it something that seemed a little more understandable to me…

(?xi)                    # (?x) enables free-spacing mode, which lets me spread the expression over several lines and add #-prefixed comments. (?i) enables ignore-case mode so that [a-z] also matches [A-Z]
    ^insert\ into[\ \t]  # Due to free-spacing mode we need a backslash in front of spaces that we want to be part of the match pattern
    ([-_.:/&a-z0-9]+)    # I removed the seemingly spurious "AZ" you had which is also not needed as we are in ignore-case mode
    [\ \t]*
    \(
    ((?1))               # This reuses the $1 regexp to match the first parameter of the INSERT INTO
    (?:,((?1)))?         # The second up through fifth parameters are optional with all of them also reusing the $1 regexp
    (?:,((?1)))?
    (?:,((?1)))?
    (?:,((?1)))?
    \)

Look for free-spacing on https://npp-user-manual.org/docs/searching/#search-modifiers to see how the (?x) and (?i) things work.

Look for subexpression on https://npp-user-manual.org/docs/searching/ to see how the (?ℕ) thing works. Subexpressions were used by both @Coises and @guy038 and are key to doing what you want to do.
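For anyone who wants to try the same layout in a script, here is a minimal sketch in Python, whose `re.VERBOSE` flag plays the role of (?x). The assumptions: Python's standard `re` module has no (?1), so the group 1 pattern is spliced in from a shared string (hypothetical name `NAME`), and groups that do not take part in a match come back as None rather than empty.

```python
import re

NAME = r"[-_.:/&a-z0-9]+"  # the group 1 pattern, reused where (?1) would be

PATTERN = re.compile(
    rf"""
    ^insert\ into[\ \t]   # escaped spaces survive verbose mode
    ({NAME})              # group 1: table name
    [\ \t]*
    \(
    ({NAME})              # group 2: first column (required)
    (?:,({NAME}))?        # groups 3 to 6: optional columns
    (?:,({NAME}))?
    (?:,({NAME}))?
    (?:,({NAME}))?
    \)
    """,
    re.VERBOSE | re.IGNORECASE,
)

m = PATTERN.match("INSERT INTO mine (countrycode,statecode,id)")
raw = m.groups()               # groups that did not take part are None
filled = m.groups(default="")  # or ask for empty strings instead
print(raw)
print(filled)
```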

guy038

Hi, @joe-mccay, @coises, @mark-olson, @mkupper and All,

Ah… yes, @mkupper’s formulation of the search regex is very clever and quite clear, thanks to the free-spacing mode!

I particularly like :

The (?:,((?1)))? syntax, where the comma and the (?1) form are made optional together
The use of the leading i modifier to simplify the group 1 syntax

Bravo !!

BR

guy038