Repeated capturing groups
-
Is it possible to reference a captured group that is repeated multiple times? For example,
^insert into[ \t]([-_.:/&a-zAZ0-9]+)[ \t]*[(](?:([-_.:/&a-zAZ0-9]+)[,]*){1,5}[)]
will match all of the following.
INSERT INTO mine (countrycode,statecode,id,statename,sort)
Is there a way I can reference the individual matches (like countrycode or id)? If I use ‘$2’, I get the last match of sort. The results for $3, $4, and $5 are empty. I would like to capture and reference the individual matches without having to repeat the same regular expression.
-
@Joe-McCay said in Repeated capturing groups:
Is there a way I can reference the individual matches (like countrycode or id)? If I use ‘$2’, I get the last match of sort. The results for $3, $4, and $5 are empty. I would like to capture and reference the individual matches without having to repeat the same regular expression.
There is no way to do that using the Notepad++ regular expression implementation.
The closest I could come was this:
^insert into[ \t]([-_.:/&a-zAZ0-9]+)[ \t]*[(]([-_.:/&a-zAZ0-9]+)(?:,((?2)))?(?:,((?2)))?(?:,((?2)))?(?:,((?2)))?[)]
which isn’t much better than just repeating the expression, though if the actual expression were more complex or subject to change, the technique might help. The problem is that you have to have an actual, written-out pair of parentheses for each numbered capture group; if a parenthesized group matches more than once, only the last match is saved.
The Boost regular expression engine which Notepad++ uses has an option for that called Repeated Captures, but it is only accessible through the programming interface; there is no support for using it in a replacement string. A plugin could use this feature, but it would have to call its own copy of Boost::regex directly; I don’t know if any of the scripting interfaces can do it.
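The "only the last match is saved" behaviour can be reproduced outside Notepad++. Below is a minimal sketch using Python's standard `re` module, which is not the Boost engine but behaves the same way on this point. The pattern is a cleaned-up version of the one in the question: the `AZ` is dropped and case-insensitive matching is assumed. The second half shows a possible workaround that is an assumption of this sketch, not something Notepad++ offers: capture the whole column list as one group and split it afterwards.

```python
import re

# Subpattern for one identifier (assumption: "AZ" in the original was a typo,
# and matching is case-insensitive).
NAME = r"[-_.:/&a-z0-9]+"

# One capture group inside a repeated non-capturing group, as in the question.
repeated = re.compile(
    rf"^insert into[ \t]({NAME})[ \t]*\((?:({NAME}),?){{1,5}}\)",
    re.IGNORECASE,
)

line = "INSERT INTO mine (countrycode,statecode,id,statename,sort)"
m = repeated.match(line)
last_only = m.group(2)  # only the final iteration of the repeated group survives

# Workaround sketch: capture the whole list once, then split on commas.
whole_list = re.compile(
    rf"^insert into[ \t]({NAME})[ \t]*\(((?:{NAME},?){{1,5}})\)",
    re.IGNORECASE,
)
columns = whole_list.match(line).group(2).split(",")

print(last_only)
print(columns)
```

The two-step approach needs a scripting environment, so it does not help with a plain Replace dialog.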
-
@Coises Thanks. That is what I thought.
-
Hello, @joe-mccay, @coises and all,
I found a solution, similar to @coises’s one, which seems slightly easier to understand :
SEARCH
(?i-s)^insert into[ \t]([-_.:/&a-zAZ0-9]+)[ \t]*\(((?1)),?((?1)?),?((?1)?),?((?1)?),?((?1)?)\)
From this INPUT text, below :
INSERT INTO mine (countrycode,statecode,id,statename,sort)
INSERT INTO mine (countrycode,statecode,id,statename)
INSERT INTO mine (countrycode,statecode,id)
INSERT INTO mine (countrycode,statecode)
INSERT INTO mine (countrycode)
The following regex S/R :
SEARCH
(?i-s)^insert into[ \t]([-_.:/&a-zAZ0-9]+)[ \t]*\(((?1)),?((?1)?),?((?1)?),?((?1)?),?((?1)?)\)
REPLACE
>$1< >$2< >$3< >$4< >$5< >$6<
Would produce this OUTPUT text :
>mine< >countrycode< >statecode< >id< >statename< >sort<
>mine< >countrycode< >statecode< >id< >statename< ><
>mine< >countrycode< >statecode< >id< >< ><
>mine< >countrycode< >statecode< >< >< ><
>mine< >countrycode< >< >< >< ><
Notes :
-
After the first part (?i-s)^insert into[ \t] , group 1 is the part [-_.:/&a-zAZ0-9]+
-
Then, after possible blank chars and the opening parenthesis, the regex (?1) is repeated and itself surrounded with parentheses to get group 2, and is followed by a possible comma char ,?
-
Again, the regex (?1) is, this time, optionally repeated and enclosed, as before, between parentheses to get the optional group 3
-
The whole regex includes three other ranges ,?((?1)?) to cover the possible groups four to six, and ends with the closing parenthesis !
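To experiment with this search/replace outside Notepad++, here is a rough Python equivalent. Python's standard `re` module has no `(?1)` syntax, so this sketch pastes the subpattern text in wherever `(?1)` appears. The `NAME` variable and the shortened three-line input are assumptions made for illustration.

```python
import re

# Text of group 1; "AZ" is dropped because matching is case-insensitive here.
NAME = r"[-_.:/&a-z0-9]+"

# ({NAME}) plays the role of ((?1)); ((?:{NAME})?) plays the role of ((?1)?).
pattern = re.compile(
    rf"^insert into[ \t]({NAME})[ \t]*\(({NAME})"
    + rf",?((?:{NAME})?)" * 4
    + r"\)",
    re.IGNORECASE | re.MULTILINE,
)

text = """INSERT INTO mine (countrycode,statecode,id,statename,sort)
INSERT INTO mine (countrycode,statecode,id)
INSERT INTO mine (countrycode)"""

# Same replacement as >$1< >$2< ... >$6<, written with Python's \N syntax.
result = pattern.sub(r">\1< >\2< >\3< >\4< >\5< >\6<", text)
print(result)
```

Because groups 3 to 6 wrap an optional subpattern rather than being optional themselves, they always take part in the match and are simply empty when a column is missing.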
Best Regards,
guy038
P.S. :
See the fundamental difference between these two regexes :
A
(?-is)(\d+)ABC\1
and
B
(?-is)(\d+)ABC(?1)
Given the INPUT text :
1ABC1
12345ABC12345
456ABC456
89ABC89
456ABC789
789ABC456
0ABC123456789
0123456789ABC1
111ABC999
The regex (?-is)(\d+)ABC\1 matches only the first four lines of the INPUT text, whereas the regex (?-is)(\d+)ABC(?1) also matches the five other lines !
Indeed, the regex (?-is)(\d+)ABC(?1) matches exactly what the regex (?-is)(\d+)ABC(\d+) matches. So, the (?1) syntax is just a shortcut for the regex which defines the whole of group 1 !
But the \1 syntax, in the regex A, represents the present value of group 1 ( i.e. a back-reference to group 1 )
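The difference can be checked with a short script. This is a sketch in Python's standard `re` module, which supports `\1` but not `(?1)`, so regex B is written out with the subpattern repeated by hand. The variable names are illustrative only.

```python
import re

lines = """1ABC1
12345ABC12345
456ABC456
89ABC89
456ABC789
789ABC456
0ABC123456789
0123456789ABC1
111ABC999""".splitlines()

# Regex A: \1 must match the same text that group 1 captured.
backref = re.compile(r"(\d+)ABC\1")
# Stand-in for regex B: the \d+ subpattern is applied again, to any digits.
reapplied = re.compile(r"(\d+)ABC(\d+)")

matched_a = [s for s in lines if backref.search(s)]
matched_b = [s for s in lines if reapplied.search(s)]

print(len(matched_a), len(matched_b))
```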
-
C#’s System.Text.RegularExpressions supports repeated capture groups, so in principle someone could build a C# plugin that does regex search with repeated capture groups.
I could even add support for repeated capture groups to the regex search functionalities of the JsonTools plugin, since it is implemented in C#. I’m just not currently sure what would be the most user-friendly way to do that.
-
@Joe-McCay I’m going to re-do @guy038’s solution a little to make it something that seemed a little more understandable to me…
(?xi)                  # (?x) Enables free-spacing mode which allows me to spread the expression over several lines and allows for # prefixed comments. (?i) enabled ignore-case mode so that [a-z] also matches [A-Z]
^insert\ into[\ \t]    # Due to free-spacing mode we need a backslash in front of spaces that we want to be part of the match pattern
([-_.:/&a-z0-9]+)      # I removed the seemingly spurious "AZ" you had which is also not needed as we are in ignore-case mode
[\ \t]*
\(
((?1))                 # This reuses the $1 regexp to match the first parameter of the INSERT INTO
(?:,((?1)))?           # The second up through fifth parameters are optional with all of them also reusing the $1 regexp
(?:,((?1)))?
(?:,((?1)))?
(?:,((?1)))?
\)
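For comparison, here is a sketch of the same layout in Python's standard `re` module, whose `re.VERBOSE` flag is its counterpart to free-spacing mode. Python has no `(?1)`, so the subpattern is interpolated from a `NAME` variable, which is an assumption of this example. In Python, an optional group that did not take part in the match is reported as `None`.

```python
import re

NAME = r"[-_.:/&a-z0-9]+"  # text of group 1, reused where (?1) would appear

pattern = re.compile(
    rf"""
    ^insert\ into[ \t]   # a bare space is ignored in verbose mode, so escape it
    ({NAME})             # group 1: table name
    [ \t]* \(
    ({NAME})             # group 2: first column, required
    (?:,({NAME}))?       # groups 3 to 6: comma and name are optional together
    (?:,({NAME}))?
    (?:,({NAME}))?
    (?:,({NAME}))?
    \)
    """,
    re.VERBOSE | re.IGNORECASE,
)

m = pattern.match("INSERT INTO mine (countrycode,statecode,id)")
groups = m.groups()
print(groups)
```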
Look for free-spacing on https://npp-user-manual.org/docs/searching/#search-modifiers to see how the (?x) and (?i) things work.
Look for subexpression on https://npp-user-manual.org/docs/searching/ to see how the (?ℕ) thing works. Subexpressions were used by both @Coises and @guy038 and are key to doing what you want to do.
-
Hi, @joe-mccay, @coises, @mark-olson, @mkupper and All,
Ah… yes, @mkupper’s formulation of the search regex is very clever and quite clear, thanks to the free-spacing mode !
I particularly like :
-
The (?:,((?1)))? syntax, where you join the optional states of both the (?1) form and the comma
-
The use of the leading i modifier to simplify the group 1 syntax
Bravo !!
BR
guy038
-