-
+ BACKTRACKING CONTROL verbs - FIRST post
Hi, All,
Since Notepad++
v7.7
, its regex engine uses a new Boost regex library. First, I advice you to to see here , which discusses of variousBoost
version numbers.Related to our present regex engine status, these versions are :
-
The general Boost C++ libraries version number :
-1.64.0
-
The Boost-Regex library version number :
-5.1.3
-
The Boost-Regex documentation version number :
1.70.0
Our Boost regex engine uses the
PCRE
( Perl Compatible Regular Expression syntax ) and, since the old1.55.0
documentation, the new1.70.0
regex documentation, contains only a significant difference, relative to a new feature, namedBacktracking Control Verbs
. Refer to the official Boost documentation, below :Note that on Rexegg site, it is said :
The slow speed of adoption of the backtracking control verbs reflects some simple truths: they are rarely used, likely because they are poorly known and little understood by a considerable proportion of those few who have heard of them.
Frankly, this lack of awareness is not an issue because there is so much basic material that most people, who only occasionally use regex, need to master before the features offered by backtracking verbs become meaningful !
I would add that the use of the backtracking control verbs are really significant in rare occasions only ! However, I’ll try to give you an oversight of these new assertions and some examples to describe them as well as some practical uses, found out from general documentation and from some of my tests !
But, before studying these backtracking control Verbs, let’s say a word about the backtracking process itself !
-
If we consider the simple regex
\w+\w{3}\d+
, against the string ABC12345DEFABC56667892, the main steps are :-
First, the
\w+
pattern matches, from the starting position1
, all the string ABC12345DEFABC56667892 -
Then, the
\w+
backtracks ( so decreases by3
the current value of quantifier+
), in order that the\w+
pattern matches the string ABC12345DEFABC56667 and that the\w{3}
pattern matches the string 892 -
Finally, the
\w+\w{3}
backtracks ( so decrease by1
the current value of quantifier+
, in\w+
pattern ), in order that the\w+
pattern matches ABC12345DEFABC5666 , the\w{3}
pattern matches the string 789 and the\d+
pattern matches the digit 2
-
In this specific regex, only the first
+
quantifier of\w+
was involved in the backtracking process. The+
quantifier of\d+
was not concerned at all as nothing follows the pattern\d+
Remark : It is important to note that the backtracking process is not related, exactly, to backward directions, but, rather, to the idea that the regex engine recalculates an other possible match, from current starting position, in the string !
-
Let’s consider the regex
(?-i)\w+?C\d{6}
, against the same string ABC12345DEFABC56667892-
First, the regex
\w+?
, with the lazy quantifier beginning at1
, matches the letter A. But as the other letter is not a C, it backtracks and, from position1
in string, increases the quantifier to2
to get the string AB and, this time, theC
pattern does match the next C letter -
Then the
\d{6}
pattern tries to find six digits. However, after the five digits 12345, the D letter cannot be matched by the\d
pattern. Thus, the regex engine backtracks to starting position1
and increase the common value2
of the+?
quantifier, by one, …till value13
, in order that the next char is, again, a letter C -
Finally the
\d{6}
pattern looks for six consecutive digits and does find a match with the string 566678. So, the search process stops with pattern\w+?
= ABC12345DEFAB, patternC
= C and pattern\d{6}
= 566678
-
To begin with, let’s say that the backtracking control verbs can be described as zero-width assertions, absolutely invisible when the regex engine looks forward, in current regex pattern. Six different verbs are managed by the
Boost
regex engineAs mentioned in the Rexegg site, the two verbs
(*ACCEPT)
and(*FAIL)
should not be strictly designed as backtracking control Verbs. Indeed :-
The
(*ACCEPT)
assertion should be simply seen as acontrol
Verb -
The
(*FAIL)
assertion should be called abacktracking
Verb.
It is important to point out that these two verbs act as soon as they are encountered, in forward direction !
- Four other verbs
(*THEN)
,(*PRUNE)
,(*SKIP)
and(*COMMIT)
can be fully designed asbacktracking control
verbs, as they act only during the backtracking process and, of course, forces the regex engine to realize a specific action
However, if we consider the general pattern
Regex_A(*VERB)Regex_B
, note that backtracking is allowed, both, inside theRegex_A
pattern and/or inside theRegex_B
pattern. However, the regex engine cannot backtrack from theRegex_B
part to theRegex_A
part, crossing backward the(*VERB)
assertion !-
For instance, the regex
\w+\d+(*PRUNE).+\d
, against the string AAAA123456789BBBB----YYYY0ZZZZ, matches the part AAAA123456789BBBB----YYYY0 ( Refer to thePRUNE
backtracking control verb, further on ! )-
The
regex_A
(\w+\d+
) backtracks from last B letter to digit 8, in order to match the AAAA123456789 part -
The
regex_B
(.+\d
) backtracks from last Z to last Y, in order to match the BBBB----YYYY0 part
-
-
As no additional backtracking process is needed, an overall match occurs
-
Now, this same regex, against the string AAAA123456789BBBB----YYYYZZZZ ( with
0
digit missing ) would not match anything :-
First, the
regex_A
(\w+\d+
), as before, would match the string AAAAA123456789 -
However, the
Regex_B
(.+\d
) cannot obviously match the remainder BBB----YYYYZZZZ -
Even after backtracking to the first B letter, no match can be found, because no more digit occurs. So, the regex engine would need to backtrack onto the
Regex_A
pattern till before the 8 digit, in order that.+
would match the 8 digit and\d
would match the 9 digit -
Because of the
(*PRUNE)
verb, this behavior is not allowed and the match attempt fails at current position, Then, the regex engine advances to next position, in string, but no match occurs at any subsequent position, too !
-
Just see the difference with the simple regex
\w+\d+.+\d
, against the string AAAA123456789BBBB----YYYYZZZZ. This time, the substring AAAA123456789 matches ! (\w+
= AAAA123456,\d+
= 7,.+
= 8 and\d
= 9 )
-
When the regex engine moves forward, in the pattern, these four verbs have absolutely no influence and they always match. So their special semantics are never activated !
-
When the regex engine moves downward, because backtracking is needed, it’s not allowed to cross them. In other words, the regex engine can never go back to the left of these backtracking control verbs !
Finally, the
(*MARK)
verb is not supported by the presentBoost
library, implemented in Notepad++, whatever its version
We’ll begin to study the
4
true backtracking control verbs according to this order :(*COMMIT)
,(*SKIP)
,(*PRUNE)
and(*THEN)
. Next, it should be easier, for anyone, to figure out the relations and differences between each of them :-)Then, we’ll study the two remaining verbs
(*ACCEPT)
and(*FAIL)
. The later is often used in conjunction with other backtracking control verbs to perform enhanced behaviorsThis rather long documentation must be divided into five posts !
_______________
(*COMMIT)
_______________In
BOOST
documentation, it is said :(*COMMIT)
Has no effect unless backtracked onto, in which case all subsequent matching/searching attempts are abandoned.This means that, if at current position in string, the regex engine can match the part before
(*COMMIT)
but cannot match the part after(*COMMIT)
, as any backtracking is canceled to the left of(*COMMIT)
, the regex engine discards any further search and the current match attempt just fails. So, the regex engine simply abandons any further match attempt, the overall match just fails and the search process stopsNote that after the general failure, the current starting position remains right after the last successful attempt
In other words, the general regex
regex_A(*COMMIT)regex_B
means that ifregex_A
have already matched, thenregex_B
must necessarily match too, right after. If not, the overall match fails !On Rexegg site it is said :
“You can read the
(*COMMIT)
token as past this point, we are committed to finding a match, in this match attempt, or none at all”.So, for instance, the regex
(*COMMIT)ABC
means that if the string ABC is not found, at caret location, the search process simply stops !
Consider the regex
(?-i)AB+(*COMMIT)CD|\w{2}.|AB+(*COMMIT)CH
, against the text ---ABBZE---ABBBBCH---ABBBCY---ABBBBCD---—ABBZE—ABBBBCH—ABBBCY—ABBBBCD—
So, according to the effect of a
(*COMMIT)
backtracking control verb, let’s follow the regex engine’s work :-
The first
3
positions of the string (\-\-\-
) are skipped, as no alternative can match a dash character -
So, the starting position is right before the string ABBZE :
-
The first found pattern
AB+(*COMMIT)CD
matches ABB but cannot match CD -
So, it tries to backtrack at left of the
(*COMMIT)
syntax to test other values ofAB+
-
Because of the
(*COMMIT)
control verb, this behavior is forbidden So, all which is matched, so far (ABB
), is discarded and the regex engine abandons any further match attempt, too. So the overall match fails. -
However, note that the starting position remains at the very beginning of the scanned string, so right before the string ---ABBZE
-
The regex
1(?=2(*COMMIT)3)
finds a1
digit IF followed with 23, but all the search process stops completely as soon as a1
digit is followed by a2
and any char different from3
. Note that, after this failure, the current starting point remains after the last successful attempt ! ( the digit1
which is followed with the string 23 )Whereas the regex
1(?=23)
find a1
digit IF followed with 23 and search process goes on, to next positions, in string, to search for other successful match attempts as, for instance, the third value123
!Test these two regexes
1(?=2(*COMMIT)3)
and1(?=23)
, against the text, below :10 11 123 OK 14 18 19 123 OK 12 123 KO 127
Third example : the regex
(a(*COMMIT)b)+ac
can be re-written(a(*COMMIT)ba(*COMMIT)ba.....a(*COMMIT)b)ac
fails against the text abac, because in the second repeat,a(*COMMIT)b
fails to match ac and, as backtracking is not allowed in order to choose only one repeat, followed with ac, the overall match fails due to the verb(*COMMIT)
It’s worth to point out that the
(*COMMIT)
verb does not always commit. It’s the case when an other backtracking control verb is located at right at the(*COMMIT)
verb :For instance, the regex
(?-i)AB+(*COMMIT)(*SKIP)CD|\w{2}.
, against the string ---ABBZE--- does match the string ZE-, because the(*SKIP)
verb if found, first, before reaching, backwards, the(*COMMIT)
verb. In contrast, with the almost similar regex(?-i)AB+(*COMMIT)B(*SKIP)CD|\w{2}.
, the regex engine would need to backtrack at left of the(*COMMIT)
verb => Immediate failure !
I found a practical use of the
(**cOMMIT)
verb :We all know, that if the
Wrap around
option is not used, the process is run from current location till the very end of file(s). Now, imagine that you would like to execute an S/R from current position till the very beginning of file. The backward option is usually not enabled when regex search is involved and, anyway, some regexes do not behave properly, in backward direction !So, add a simple character, not used yet, alone in a line, anywhere in your current file. I chose the
#
character but, of course, any other char may be chosen ! Now, see the behavior of the regex#(*COMMIT*)\w+|\w+
-
As long as the
#
character is not found, at current starting position, the second alternative of the regex is used and matches any word (\w+
) -
As soon as the
#
character is found, at current starting position, the regex engine tries to find some word chars, right after the#
symbol. This is impossible as it is followed with a line-break ! So, due to the(*COMMIT)
verb, the overall regex fails and the process stops !
So, if we consider the shorter search regex
(#(*COMMIT)|)\w+
and the replacement regex\U$0
with theWrap around
option ticked, this S/R would change any word in uppercase but the replacement process would stop soon as a#
symbol, not immediately followed by a word char, is found, further on, in current text ! -
-
+ BACKTRACKING CONTROL verbs - SECOND post ( continuation )
_______________
(*SKIP)
_______________In
BOOST
documentation, it is said :(*SKIP)
Behaves the same as(*PRUNE)
except that it is assumed that no match can possibly occur prior to the current point in the string being searched. This can be used to optimize searches by skipping over chunks of text that have already been determined can not form a match.This means that, if at current position in string, the regex engine can match the part before
(*SKIP)
but cannot match the part after(*SKIP)
, as any backtracking is canceled to the left of(*SKIP)
, the regex engine discards any further search and the current match attempt just fails. So the regex engine must advance to the location in the string coresponding to where(*SKIP)
was matched, for the next new match attempt !In other words, if the
(*SKIP)
is crossed, during the backtracking process, the regex search process stops, at current position and begins again, at location right after the(*SKIP)
verb. So, the text matched before(*SKIP)
will never be part of a subsequent successful match ! Sometimes, this behavior can save a lot of fruitless match attempts !
Consider the regex
(?-i)AB+(*SKIP)CD|\w{2}.|AB+(*SKIP)CH
, against the text ---ABBZE---ABBBBCH---ABBBCY---ABBBBCD---So, according to the effect of a
(*SKIP)
backtracking control verb, let’s follow the regex engine’s work :-
The first
3
positions of the string (\-\-\-
) are skipped, as no alternative can match a dash character -
So, the starting position is right before the string ABBZE :
-
The first found pattern
AB+(*SKIP)CD
matches ABB but cannot matchCD
-
So, it tries to backtrack at left of the
(*SKIP)
syntax to test other values ofAB+
-
Because of the
(*SKIP)
control verb, this behavior is forbidden So, all which is matched, so far (ABB
), is discarded and the regex engine moves to the location where(*SKIP)
occurs, so right after the ABB string
-
-
Now, the starting position is right before the string ZE :
-
The first alternative cannot match in forward direction but the second one
\w{2}.
does match the string ZE- -
So, after
2
moves, again, the starting position is, now, right before the string ABBBBCH :-
At this stage, it’s worth noting that last alternative
AB+(*SKIP)CH
could match the ABBBBCH string ! However it won’t be used, because the first alternativeAB+(*SKIP)CD
is tested first -
So, the first found pattern
AB+(*SKIP)CD
matches ABBBB but cannot matchCD
-
So, it tries to backtrack at left of the
(*SKIP)
syntax to test other values ofAB+
-
Because of the
(*SKIP)
control verb, this behavior is forbidden So, all which is matched, so far (ABBBB
), is discarded and the regex engine moves to the location where(*SKIP)
occurs, so right after the ABBBB string
-
-
Now, the starting position is right before the string CH :
-
Again, the first alternative cannot match in forward direction but the second one,
\w{2}.
, does match the string CH- -
After
2
moves, the starting position is, now, right before the string ABBBCY :-
Process is identical as it was when the starting position was right before the string ABBZE
-
Because of the
(*SKIP)
control verb, current match attempt fails and the regex engine moves to the location where(*SKIP)
occurs, so right before the string CY :
-
-
The first alternative fails, in forward direction but the second pattern
\w{2}.
does match the string CY- -
Finally, after
2
moves, the starting position is, now, right before the string ABBBBCD :- This time, the first alternative
AB+(*SKIP)CD
just matches, in forward direction, all the remaining letters, the string ABBBBCD
- This time, the first alternative
See, below, the letters matched by each numbered occurrence :
---ABBZE---ABBBBCH---ABBBCY---ABBBBCD--- 111 222 333 4444444
Just compare with the same regex, below, where we omit the
(*SKIP)
control verb :(?-i)AB+CD|\w{2}.|AB+CH
Note, again, the letters matched by each occurrence, keeping in mind that the regex engine, in that case, always tries the alternatives from left to right, period !
---ABBZE---ABBBBCH---ABBBCY--ABBBBCD---- 111222 333444 555666 7777777
A second example : against the string aaacaaabxxaacaaaa---, beginning a line, let’s test some regexes :
-
The regex
\w+(*SKIP)b|.+
:-
First this regex matches all the letters aaacaaabxxaacaaaa and tries to backtrack in order to get a final b letter
-
As the backtracking process is forbidden because of the
(*SKIP)
control verb, current match attempt fails -
Then, the starting position moves to the location where
(*SKIP)
is encountered, so right after all the letters -
Finally, as the first alternative does not match, the second alternative
.+
does match the string ---
-
aaacaaabxxaacaaaa--- 111
-
The regex
a+(*SKIP)b|.+
:-
First this regex matches all the letters aaa and, as the next c letter does not match the
b
pattern, it tries to backtrack -
As the backtracking process is forbidden because of the
(*SKIP)
control verb, current match attempt fails -
Then, the starting position moves to the location where
(*SKIP)
is encountered, so right after the aaa string -
Finally, as the first alternative cannot match, the second alternative
.+
alternative does match the remaining characters caaabxxaacaaaa---
-
aaacaaabxxaacaaaa--- 11111111111111111
See the differences with the similar regexes, without the
(*SKIP)
control verb :-
The regex
\w+b|.+
-
First this regex matches all the letters and, as the next - character does not match the
b
pattern, it backtracks from the last a letter to the letter a, before the b letter, for a first match aaacaaab -
Then, as the first alternative cannot match, the second alternative
.+
matches all the remaining characters, so the string xxaacaaaa---
-
aaacaaabxxaacaaaa--- 11111111222222222222
-
The regex
\w+?b|.+
-
First this regex matches the first letter a and, as the next a letter does not match the
b
pattern, it backtracks from the first a letter to the letter a, before the letter b, for a first match aaacaaab -
Then, as the first alternative cannot match, the second alternative
.+
matches all the remaining characters xxaacaaaa---, as above
-
-
The regex
\w++b|.+
-
First this regex matches all the letters and, as the next - character does not match the
b
pattern, the regex engine tries to backtrack -
But, because of the possessive quantifier
++
, the backtracking process is not allowed and the match attempt fails -
So, the second alternative
.+
is tried and matches all the remaining characters, so the entire string aaacaaabxxaacaaaa---
-
Note that, on Rexegg site , it is said :
If
(*SKIP)
is the last backtracking control verb in a template, a match failure beyond(*SKIP)
should, in principle, always trigger the(*SKIP)
control verb. Indeed, backtracking is needed to test new match attempts at current position in string !In practice, internally, among all kinds of clever optimizations,
Perl
,PCRE
andPython
’s regex package are smart enough to avoid fruitless backtracking. But externally,PCRE
andPython
behave as though they were backtracking all the way backThis seems also the behavior of the
Boost
regex engine. If these optimizations had been taken into account, the regex engine would have behaved, sometimes, as if no(*SKIP)
backtracking control verb was part of the overall regexNow, regarding practical uses of the
(*SKIP)
backtracking control verb, we can mention the association of this verb with the verb(*FAIL)
, which will be studied later
_______________
(*PRUNE)
_______________In
BOOST
documentation, it is said :(*PRUNE)
Has no effect unless backtracked onto, in which case all the backtracking information, prior to this point, is discarded.This means that, if at current position in string, the regex engine can match the part before
(*PRUNE)
but cannot match the part after(*PRUNE)
, as any backtracking is canceled to the left of(*PRUNE)
, the regex engine discards any further search and the current match attempt just fails. Thus, the regex engine then advances to next position, in the string, for a new match attemptIn other words, if the
(*PRUNE)
is crossed, during the backtracking process, the regex search process stops, at current position and process begins again, at next starting position, in string !Usually, the
(*PRUNE)
verb is simply an alternative to an atomic group or to a possessive quantifier. However, there are some uses of the(*PRUNE)
verb that cannot be expressed in any other way !
Consider the regex
(?-i)AB+(*PRUNE)CD|\w{3}|AB+(*PRUNE)CH
, against the text ---ABBZE---ABBBBCH---ABBBCY---ABBBBCD---So, according to the effect of a
(*PRUNE)
backtracking control verb, let’s follow the regex engine’s work :-
The first
3
positions of the string ( --- ) are skipped, as no alternative can match a dash character -
So, the starting position is right before the string ABBZE :
-
The first found pattern
AB+(*PRUNE)CD
matches ABB but cannot match CD -
So, it tries to backtrack at left of the
(*PRUNE)
syntax to test other values ofAB+
-
Because of the
(*PRUNE)
control verb, this behavior is forbidden So, all which is matched, so far (ABB
), is discarded and the regex engine moves to the next starting position, in string. So, right before the string BBZE
-
-
The first alternative cannot match in forward direction but the second alternative
\w{3}
does match the string BBZ -
Then, the starting position is right before the letter E, which cannot be matched by any part of the overall regex
-
So, after
3
moves, again, the starting position is, now, right before the string ABBBBCH :-
At this stage it’s worth noting that last alternative
AB+(*PRUNE)CH
could match the ABBBBCH string ! However it won’t be used, because the first alternativeAB+(*PRUNE)CD
is tried first -
So, the first found pattern
AB+(*PRUNE)CD
matches ABBBB but cannot match CD -
So, it tries to backtrack at left of the
(*PRUNE)
syntax to test other values ofAB+
-
Because of the
(*PRUNE)
control verb, this behavior is forbidden So, all which is matched, so far (ABBBB
), is discarded and the regex engine moves, immediately, to the next starting position, in string. So right before the string BBBBCH
-
-
Again, the first alternative cannot match in forward direction but the second one,
\w{3}
, does match the string BBB -
Then, the starting position is right before the string BCH :
-
The first alternative cannot match, in forward direction, but the second alternative,
\w[3}
, does match the string BCH -
After
3
moves, the starting position is, now, right before the string ABBBCY :-
Process is identical as it was when the starting position was right before the string ABBZE
-
Because of the
(*PRUNE)
control verb, current match attempt fails and the regex engine moves to the next starting position, right before the string BBBCY -
The first alternative fails, in forward direction but the second pattern
\w{3}
does match the string BBB
-
-
Then, the starting position is right before the string CY, which cannot be matched by any part of the overall regex
-
Finally, after
3
moves, the starting position is, now, right before the string ABBBBCD :- This time, the first alternative
AB+(*PRUNE)CD
just matches, in forward direction, all the remaining letters, the string ABBBBCD
- This time, the first alternative
See, below, the letters matched by each numbered occurrence :
---ABBZE---ABBBBCH---ABBBCY---ABBBBCD--- 111 222333 444 5555555
Just compare with the same regex, below, where we omit the
(*PRUNE)
control verb :(?-i)AB+CD|\w{3}|AB+CH
Note, again, the letters matched by each occurrence, keeping in mind that the regex engine, in that case, always tries the alternatives from left to right, period !
---ABBZE---ABBBBCH---ABBBCY--ABBBBCD--- 111 222333 444555 5555555
If the regex part, located at left of a
(*PRUNE)
backtracking control verb is moderately complex and that a lot of backtracking actions are probable, it may save the regex engine a lot a time because it cancels any backtracking process and skips to next starting position, in stringHowever, in this case, in order for
(*PRUNE)
to become efficient, note that :-
All backtracking time must be greater that time to compile the pattern with this special verb
-
Our
Boost
regex engine may have performed some internal optimizations that prevents backtracking, anyway -
Sometimes, the use of the
(*SKIP)
backtracking control verb is even better !
Personally, I don’t see any practical use of the
(*PRUNE)
backtracking control verb, yet :-((One more point :
In PCRE2, it is said :
In an anchored pattern
(*PRUNE)
has the same effect as(*COMMIT)
Seemingly, it’s not the case, with our
Boost
regex engine, whatever the pattern is anchored or not :For instance, against the simple line : BBBBCX---
-
The regex
B+(*PRUNE)CD|.+
matches the string CX---, but the regexB+(*COMMIT)CD|.+
just fails -
The regex
^B+(*PRUNE)CD|.+
, with an anchored pattern, matches the string BBBCX---*, but the regex^B+(*COMMIT)CD|.+
fails, also
-
-
+ BACKTRACKING CONTROL verbs - THIRD post ( continuation )
_______________
(*THEN)
_______________Preliminary comment :
I must confess that, among all the backtracking control verbs, the description and behavior of the
(*THEN)
verb, with ourBoost
regex engine, is a bit tricky and I still cannot explain some regex patterns, containing the(*THEN)
verb. So, it quite possible that my general understanding of that specific verb is wrong … Just tell me !!The
(*THEN)
verb is generally used in an alternation structure. When the(*THEN)
verb occurs outside an alternation, it behaves exactly like the(*PRUNE)
verbIn
BOOST
documentation, it is said :(*THEN)
has no effect unless backtracked onto, in which case all subsequent alternatives, in a group of alternations, are discardedThis means that, if at current position in string, the regex engine can match the part before
(*THEN)
but cannot match the part after(*THEN)
, of an alternative, in a group, as any backtracking is canceled to the left of(*THEN)
, the current match attempt is discarded and all the subsequent alternatives of the current group. So, the regex engine skips to the pattern of the next alternative, located outside the current groupThe
(*THEN)
verb is a little confusing ! After, numerous tests to determine its true behavior, I came up with these explanations :- Only the first alternative, containing
(*THEN)
, as well as all alternatives without the(*THEN)
verb, located or not in a group, are tried by the regex engine
And, in case of a backtracking at left of the
(*THEN)
verb of this first alternative :-
If this alternative is located outside a group or is the single alternative of a group, without an alternation symbol
|
:-
The current group is not considered as a true group of alternatives
-
Thus, the
(*THEN)
verb rather acts as a(*PRUNE)
verb and the current starting position is increased by1
-
-
If this alternative is located in a group, with, at least, two alternatives, with
(*THEN)
, separated by an alternation symbol|
:-
The current group is considered as a true group of alternatives and the semantics of the
(*THEN)
verb is activated -
So, the current starting position remains unchanged
-
-
Then, in both cases, the regex engine skips to the next alternative, outside the current group, whatever this one is included or not in a group
For instance, let’s consider the regex, below, using the free-spacing mode :
(?x-i) ( AB+(*THEN)CH | AB+(*THEN)CD | AB+(*THEN)CT ) | # Line 1 ( AB+(*THEN)ZE | AB+(*THEN)ZK ) | # Line 2 ( \w{3} ) | # Line 3 ( AB+(*THEN)HA ) | # Line 4 ( AB+(*THEN)ZX | AB+(*THEN)HY ) # Line 5
against the one-line text, below, where the added numbers, designate each numbered occurrence :
ABBCD---ABBBBCH---ABBBBBCT---ABBBZK---ABBBBZE---ABBBBBZX---ABBBBHJ---ABHA---ABBBBBBHY--- 111 2222222 333444 555666 7777777 888999 000111 222 333444555
As you can see, only the first alternative of groups, line
1
and2
, are taken in account, because, on backtracking to the left of(*THEN)
, the regex engine then skips, immediately, to the next group of alternativesHowever, note that the first alternative of groups, line
4
and5
are never tried because the\w{3}
pattern comes before !So, except for the strings ABBBBCH and ABBBBZE, all the other strings are partially matched by the
\w{3}
pattern, line3
!
Now, let’s consider the regex, below, in free-spacing mode :
(?x-i) ( AB+(*THEN)CH | AB+(*THEN)CD | AB+(*THEN)CT ) | # Line 1 ( AB+(*THEN)ZE | AB+(*THEN)ZK ) | # Line 2 ( AB+(*THEN)HA ) | # Line 3 Group WITHOUT an | => (*THEN) acts as (*PRUNE) ( \w{3} ) | # Line 4 ( AB+(*THEN)ZX | AB+(*THEN)HY ) # Line 5
against the same subject string :
ABBCD---ABBBBCH---ABBBBBCT---ABBBZK---ABBBBZE---ABBBBBZX---ABBBBHJ---ABHA---ABBBBBBHY--- 111 2222222 333444 555 6666666 777888 999000 1111 222333
This time, only the first alternative of groups, line
1
,2
and3
, are taken in account, because, on backtracking to the left of(*THEN)
, the regex engine then skips, immediately, to the next group of alternativesBut, this time, due to the lack of any alternation symbol of the group, line
3
, the backtracking control verb(*THEN)
acts rather as the(*PRUNE)
verb. So, in case of a unsuccessful backtracking operation, the starting position of the regex engine advances to next position !Therefore, as the strings ABBCD, ABBBBBCT, ABBBZK, ABBBBBZX, ABBBBHJ and ABBBBBBHY cannot match the
AB+(*THEN)HA
pattern, in line3
, the regex engine goes to next position, which always begins with a B letter.And, as all alternatives, beginning with an
A
letter, cannot match, the\w{3}
pattern is the only one which can match something (3
consecutive letters )
Just compare with the same regex, without any
(*THEN)
verb, below, again the same string :(?x-i) ( AB+CH | AB+CD | AB+CT ) | # Line 1 ( AB+ZE | AB+ZK ) | # Line 2 ( AB+HA ) | # Line 3 ( \w{3} ) | # Line 4 ( AB+ZX | AB+HY ) # Line 5
Note, again, the letters matched by each occurrence, keeping in mind that the regex engine, in that case, always tries the alternatives from left to right, period ! This explains why the strings ABBBBBZX and ABBBBBBHY are never matched by the patterns
AB+ZX
andAB+HY
ABBCD---ABBBBCH---ABBBBBCT---ABBBZK---ABBBBZE---ABBBBBZX---ABBBBHJ---ABHA---ABBBBBBHY--- 11111 2222222 33333333 444444 5555555 666777 888999 0000 111222333
As said before, if the
(*THEN)
verb is not inside an alternation group, it acts like the(*PRUNE)
backtracking control verb.For instance, against the string ---1123ABBBBCHIKOP--- :
-
With the regex
(?-i)\d+(AB+(*THEN)CD)|\w{8}
or(?-i)\d+(AB+(*PRUNE)CD)|\w{8}+
, containing a group, without any alternation symbol :-
Because no D letter, right after the C letter, it fails, and, as backtracking is not allowed because of the verb
(*THEN)
or(*PRUNE)
, the regex engine moves to the next starting position, in string, and retries the first alternative -
After successive moves to next starting position and immediate failure to match the CD string, caret is, now, right before the letter A. This time, the first alternative cannot match. So, the second alternative just matches the next eight characters ABBBBCHI :
-
---1123ABBBBCHIKOP--- 11111111
-
Now, with the regex
(?x-i)\d+(AB+(*THEN)CD|\w\w)|\w{8}
, containing a group with an alternation symbol :-
After the failure in matching D, right after letter C, and, as backtracking on
AB+
is not allowed because of the verb(*THEN)
, the regex engine skips and tries the next alternative of this inner group,\w\w
-
But, this time, the backtracking process onto
\d+
is possible, from the first position in string ( the regex engine must find a range of digits, not followed with the A letter). Thus, it easily finds the 1123A string, where\d+
= 112 and\w\w
= 3A :
-
---1123ABBBBCHIKOP--- 1111122222222
When the regex expression, located before a
(*THEN)
backtracking control verb, is complex and time-consuming, the(*THEN)
verb acts in the same way as an atomic group, because it prevents from any backtracking in current used alternative. However the use of the(*THEN)
verb seems limited and, personally, I haven’t found out, yet, any practical use of this verb !As I said before, my present reasoning about the
(*THEN)
backtracking control verb may be completely erroneous.Here are some oddities that I still cannot understand ! Can someone can explain me :-
Why the regex
(?-i)A(B+(*THEN)L|FK)|\w\w
matches BX, in the string ---ABX--- ? -
and why the regex
(?-i)A(B+(*THEN)L|BK)|\w\w
matches AB, in the same string ---ABX--- ?
IMPORTANT : The
4
full backtracking control verbs, described, above, provide4
different strengths of control when subsequent matching fails :-
The
(*THEN)
is the weakest, carrying on the match at the next alternative, first -
The
(*PRUNE)
verb comes next, failing the match at the current starting position, even in case of subsequent alternatives, and moving, in string, to the next position -
The
(*SKIP)
verb is similar to(*PRUNE)
, except that the regex engine moves, in string, to the(*SKIP)
location ( so, possibly more than one character ! ) -
Finally, the
(*COMMIT)
verb is the strongest and always causes the overall match to fail
In next chapters, we’ll study the two verbs
(*ACCEPT)
and(*FAIL)
. In contrast to the previous verbs, the(*ACCEPT)
and(*FAIL)
verbs act as soon as they are encountered !
_______________
(*ACCEPT)
_______________In
BOOST
documentation, it is said :(*ACCEPT)
Causes the pattern to be considered matched at the current point. Any half-open sub-expressions are closed at the current point.This means that, if at current position in string, the regex engine can match the part right before the
(*ACCEPT)
verb, this partial match is considered as a successful match attempt and any pattern, after the(*ACCEPT)
verb, is simply discarded. Then, as usual, the regex engine advances to the next starting position, in string, for a new match attemptIf the
(*ACCEPT)
backtracking control verb is located inside a capturing group, that group is set to whatever characters have been matched, up to that point !Let’s consider the regex
\d+(*ACCEPT).+|\w+
, against the string 12345----abcde-----
First, the regex matches the digits 12345
-
Without the
(*ACCEPT)
verb, the ending.+
part of the current alternative would have matched all the remaining chars of line -
But, as soon as the
(*ACCEPT)
verb is reached, the current pattern, so far,\d+
, is considered as successful and the ending.+
part of the current alternative is discarded by the regex engine -
Then current starting position is right before the first dash char, which cannot match the regex
-
After moving till right after the last dash character, the second alternative does match the string abcde
The interest of the
(*ACCEPT)
verb is generally limited. But, sometimes, it can be very useful when used in a list of alternatives !
A practical use is described on the [rexegg] (https://www.rexegg.com/backtracking-control-verbs.html#accept) site :
Let’s suppose you would like to match the
3
upper-case strings BAZ, BO and BIZ. Of course, an obvious solution is(?-i)BAZ|BO|BIZ
But, if B and Z stand for complicated sub-patterns, you would probably factor these two patterns, which is normally impossible because of the BO syntax.
With the
B(?:A|O(*ACCEPT)|I)Z
syntax, containing the(*ACCEPT)
verb, the factorization of the B pattern, at the beginning and the Z pattern, at the end becomes possible !Indeed, when the regex engine uses the alternative
O(*ACCEPT)
( so the overall regexBO(*ACCEPT)Z
), it just matches the BO string as the match ofZ
is discarded, due to the(*ACCEPT)
verbAs usual, a solution without any backtracking control verb is still possible, with conditional patterns. You could use, either, the regexes below, with the conditional structure
(?(1)...|...)
SEARCH
B(?:(A|I)|O)(?(1)Z)
orB(?:A|I|(O))(?(1)|Z)
The
(*ACCEPT)
backtracking control verb is the only backtracking verb that is allowed to be quantified. And, when the quantifier is ungreeedy, this allows the(*ACCEPT)
verb to acts only when a backtracking phase, from quantifier0
to quantifier1
occurs !-
Consider the regex
A(*ACCEPT)??BC
, where the(*ACCEPT)
verb is located inside an ungreedy quantified group, with a minimum of0
, against the string AAAABCBBBBBAAAABCCCCCC-
If current starting point is right before an A, followed with BC, as the quantifier takes first the value
0
=> the(*ACCEPT)
verb is not taken in account. Thus, the regex is identical toABC
and matches the string ABC, in forward direction, without any backtracking process -
If an A is not followed by BC, at current starting position, a backtrack is needed and the value
1
is tried => Thus, the regex becomesA(*ACCEPT)B
which always matches an A letter, in all cases
-
-
Now, consider the regex
A(*ACCEPT)?BC
, where the(*ACCEPT)
verb is located inside a greedy quantified group, with a maximum of1
, against the same string :- This time, whether an A is followed or not with the string BC, at current starting point, as the quantifier takes first the value
1
=> The(*ACCEPT)
verb is always taken in account. So, the regex is identical to the patternA(*ACCEPT)BC
, which always matches a A letter, in all cases
- This time, whether an A is followed or not with the string BC, at current starting point, as the quantifier takes first the value
- Only the first alternative, containing
-
+ BACKTRACKING CONTROL verbs - FOURTH post ( continuation )
_______________
(*FAIL)
_______________In
BOOST
documentation, it is said :(*FAIL)
Causes the match to fail unconditionally at this point, can be used to force the engine to backtrackThis means that, if at current position in string, the active part of the pattern meets a zero-width
(*FAIL)
verb, then the current match attempt is always considered as unsuccessful. So the regex engine, necessarily, backtracks in order to find out other patterns, in the overall regex, to make a match attempt successful, at current location in stringBecause of the inconditional failure of the current match attempt, due to
(*FAIL)
, the current starting position has not changed. For instance, the regexAABBCC(*FAIL)|.+
, against the string AABBCCABDEF---GGAC, will match all string contents, from the start location1
, even if the string AABBCC has been matched before the(*FAIL)
verb was encountered !Note that the
(*FAIL)
syntax can be abbreviated as(*F)
and may also be replaced by a negative look-ahead of an empty string(?!)
, which is an always false assertion because any character is, necessarily, followed with an empty string. You could use, also, a regex like(?=A)B
which is, obviously, an always false assertion, too !One example : the regex
A((A|Z)(*FAIL)|B|C)D
is just equivalent to the regexA(B|C)D
, as the alternative(A|Z)(*FAIL)
always fails ! So, this regex only matches the strings ABD and ACD, against the text AZD ABD AAD ACD
I see
4
practical uses of the(*FAIL)
backtracking control verb :- If the overall regex contains some complicated alternatives, you can temporarily omit one of them, for testing. Simply add, after this alternative and before the
|
symbol, a(*F)
verb, which will always fails any match attempt of this specific alternative. For instance, with the free-spacing mode, the regex, below, would only match the strings AAAA, BBBB, DDDD and EEEE
(?x-i) AAAA | BBBB | CCCC (*F) | DDDD | EEEE
Note that we could have used, also, the
#
comment char, placing all the patternCCCC
in comments :(?x-i) AAAA | BBBB | # CCCC | DDDD | EEEE
- The
(*FAIL)
or(*F)
syntaxes can be used to force the regex engine to backtrack, after defining some groups, not intended to be consumed at current position in string, but later
For instance, let’s examine the regex
(?-i)([A-Z])([a-z])([0-9])(*F)|(?1)(?3)(?2)|((?1)|(?2))(?3){2}
against the string A3h AAD x45 Z4a b8Q H09 BAc 000 Y1y-
The first alternative is the part
([A-Z])([a-z])([0-9])(*F)
which defines3
groups :[A-Z]
,[a-z]
and[0-9]
standing ,respectively, for an uppercase letter, a lowercase letter and a digit -
The second alternative
(?1)(?3)(?2)
represents subroutine calls to the groups ( An uppercase letter followed with a digit and a lowercase letter ) -
The third alternative
((?1)|(?2))(?3){2}
represents other subroutine calls to groups ( A letter followed with two digits )
Note that it does not matter whether the first alternative is matched or not. Each time the overall regex is processed, the regex engine always tries this alternative first and, in all cases, it attempts to match the three regexes, consecutively, storing them as groups
1
and2
and3
, for further use, via the subroutine calls !Remark : The use of the
(*FAIL)
verb can also be emulated with the special(DEFINE)
variable, in a conditional group, so the(?(DEFINE).....)
syntaxIn our particular case, it would lead to the regex
(?-i)(?(DEFINE)([A-Z])([a-z])([0-9]))(?1)(?3)(?2)|((?1)|(?2))(?3){2}
Now, I also found out a very simple way to emulate the
(*FAIL)
verb or the(?(DEFINE)...)
conditional structure :- Use the syntax
(?-i)([A-Z])([a-z])([0-9])¤|(?1)(?3)(?2)|((?1)|(?2))(?3){2}
, which substitutes the(*FAIL)
verb with any char or string, not present in your text. I chose the¤
character.
Indeed, although the first alternative
([A-Z])([a-z])([0-9])¤
will never match, the three regexes[A-Z]
,[a-z]
and[0-9]
are, indeed, stored as groups1
,2
and3
for further use, via the subroutine calls(?1)
,(?2)
and(?3)
;-))
So, these regexes, with, either,
(*F)
verb, the(DEFINE)
assertion or the non-existent char¤
, matches any3-CHARS
string of non-accentuated characters :-
Beginning with an upper-case letter, ending with a lower-case letter and with a digit in between
-
Beginning with any letter and followed with two digits
Az1 A3h AAD x45 Z4a Az1 b8Q H09 BAc 000 Y1y 111 222 333 444 555
- In
Perl
, the(*FAIL)
verb is commonly used to get all the different branches of a match tree
In the Rexegg site, it is said, for instance :
-
This script line
'abc' =~ /\w+(?{print "$&\n";})(*F)/
prints, successively, the strings abc, ab, a, bc, b and c -
First, the
\w+
matches the whole string abc -
Then, the code capsule prints the match
-
Now, the
(*F)
verb forces the regex engine to backtrack -
So, the regex engine gives up the
c
-
And the callback prints ab
-
Again, the
(*F)
verb forces the regex engine to backtrack -
And so on !
- In next chapter we’ll learn about the
(*SKIP)(*FAIL)
construct which is a powerful way of excluding some patterns, from the match !
A last point regarding the
(*FAIL)
backtracking control verb :As mentioned in rexegg, this verb is not the opposite of the
(*ACCEPT)
verb. If it were the case, then the(*FAIL)
would mean, at current point, in string, the match attempt just fails. In fact, the true opposite of the(*ACCEPT)
verb is the combination of the two verbs(*PRUNE)
and(*FAIL)
or, even better, the combination of the two verbs(*SKIP)
and(*FAIL)
!For instance, the regex
(?-i)([a-z]+(*ACCEPT)|\w+)[A-Z]+\d+
against the string azrdfgsdFGSDFG12345 is just equivalent to the pattern(?-i)([a-z]+(*ACCEPT))
which matches the beginning azrdfgs of the stringSo, we’re tying to find out a verb structure which means DON’T accept, like the
\K
syntax, which would not take in account or, in other words, would skip the[a-z]+
pattern, matched so far !If we substitute the
(*ACCEPT)
verb with the(*FAIL)
verb, in the same pattern, so the regex(?-i)([a-z]+(*FAIL)|\w+)|[A-Z]+\d+
, it matches all the subject string azrdfgsdFGSDFG12345 which isn’t our expected match. Indeed, because of the(*FAIL)
verb , it forces the regex engine to backtrack, at current position in string, for an other match attempt… which is successful, as the second alternative\w+
graps all the subject stringNow if we add the
(*PRUNE)
verb, before the(*FAIL)
verb, we get the new regex(?-i)([a-z]+(*PRUNE)(*FAIL)|\w+)[A-Z]+\d+
which does matches the remaining part of the string after the lower-case letters. Now, after meeting the(PRUNE)
verb, in backward direction the match attempt fails but, this time, the regex engine moves to the next starting position before doing a new match attemptWe would get the same expected result with the regex
(?-i)([a-z]+(*SKIP)(*FAIL)|\w+)[A-Z]+\d+
, which would, globally, skip the part matched, so far, by the pattern before the(*SKIP)
verb ! The association of these two verbs is described right below.
_______________
(*SKIP)(*FAIL)
_______________The combination
(*SKIP)(*FAIL)
or(*SKIP)(*F)
means that everything matched on left side of(*SKIP)(*FAIL)
is discarded and that the regex engine must move to the location right after this discarded match for a new match attempt !A generic regex would be : What_I_do_not_want
(*SKIP)(*FAIL)|
What_I_want or What_I_do_not_want(*SKIP)(*F)|
What_I_wantFor instance, in order to surround any word with two
-
characters, except whose already surrounded by curly braces,{}
, use one of the regex S/R, below : :-
SEARCH
{[^}]*}(*SKIP)(*F)|\b\w+\b
REPLACE-$0-
-
SEARCH
{[^}]*}(*SKIP)(?!)|\b\w+\b
REPLACE-$0-
Test it, against the sentence : This is a test with some {text} to see {if} it works"
Each time a
{xxxx}
zone is reached, it’s skipped and the current starting position moves to the location where the(*F)
verb or the(?!)
syntax occurOf course, you may, simply, use the regex S/Rs below, without any control verb, for an exact replacement :
-
SEARCH
{[^}]*}|(\b\w+\b)
REPLACE?1-$0-:$0
-
SEARCH
({[^}]*}.*?\K)?\b\w+\b
REPLACE-$0-
Note the difference with this search, without the **
(*SKIP)
**part : :SEARCH
{[^}]*}(*F)|\b\w+\b
This time, as the
(*SKIP)
verb is absent, when the string {text} is reached and evaluated it backtracks because of(*F)
and, as the second alternative cannot match, too, the current starting position, in string, goes next and the part\w+
does match the text partSecond example :
Let’s consider the table, below, where the third delimiter
|
has been changed into the!
symbol| abc | def ! ghi | jkl | mno | | abc | def ! ghi | jkl | mno |
Then the regex S/R :
SEARCH
(?-s).*!(*SKIP)(*F)|\w+
REPLACE
\U$0
would change this table as :
| abc | def ! GHI | JKL | MNO | | abc | def ! GHI | JKL | MNO |
But, we can use the syntax
(?-s)(?!.*!)\w+
, without any backtracking control verb, too !
Before, with the
(*COMMIT)
verb and a dummy character, we saw a method to do an S/R only in the first part of a file.In a similar way, we can run an S/R, only in the last part of a file. Again, let’s choose the
#
character, inserted around the middle of current file. Then the following S/R would change all words uppercase, after the#
symbol, till the very end of fileSEARCH
(?s)\A^.*#(*SKIP)(*F)|\w+
REPLACE
\U$0
However, note that this shorter search syntax,
(?s)\A^.*#\K|\w+
, without any backtracking control verb, would produce the same result !So, in short, the
(*SKIP)(*FAIL)
construct means :Discard anything that has been matched, so far, at left of the two verbs
(*SKIP)(*FAIL)
, in the current regex alternative ! - If the overall regex contains some complicated alternatives, you can temporarily omit one of them, for testing. Simply add, after this alternative and before the
-
+ BACKTRACKING CONTROL verbs - FIFTH and LAST post
A last summary :
-
The zero-width
(*THEN)
,(*PRUNE)
,(*SKIP)
and(*COMMIT)
backtracking control verbs always match in forward direction but and always fail, on backtracking, when the(*......)
verb is crossed in backward direction -
The zero-width
(*ACCEPT)
backtracking control verb always matches the current match attempt, till the(*ACCEPT)
verb, in forward direction
–The zero-width
(*FAIL)
/(*F)
backtracking control verb always fails the current match attempt, in forward direction. So, it always backtracks in order to test other parts of the overall regex, if anyAnd, if we consider the general syntax
Regex_A(*......)RegexB
, the next starting position, of the regex engine, in string, is :-
When the verb
(*......)
is found in forward direction :-
For
(*THEN)
: Right after the characters matched by the overall patternRegex_A(*THEN)Regex_B
-
For
(*PRUNE)
: Right after the characters matched by the overall patternRegex_A(*PRUNE)Regex_B
-
For
(*SKIP)
: Right after the characters matched by the overall patternRegex_A(*SKIP)Regex_B
-
For
(*COMMIT)
: Right after the characters matched by the overall patternRegex_A(*COMMIT)Regex_B
-
For
(*ACCEPT)
: Right after the characters matched by theRegex_A
pattern part -
For
(*FAIL)
: Unchanged, so the location before running the overall patternRegex_A(*FAIL)Regex_B
-
-
When the verb
(*......)
is found in backward direction, because of backtracking :-
For
(*THEN)
: The current location when running the overall patternRegex_A(*THEN)Regex_B
-
For
(*PRUNE)
: The location next to the current location when running the overall patternRegex_A(*PRUNE)Regex_B
-
For
(*SKIP)
: Right after the characters matched by theRegex_A
pattern part -
For
(*COMMIT)
: Unchanged, so the location before running the overall patternRegex_A(*COMMIT)Regex_B
-
To conclude this study of the backtracking control verbs, here is a last example which recapitulates the different backtracking control verbs and some of their combinations. All the regexes, in the table, below, are tested against the same subject string :
aaaa1bbbcccccc#
•-----------------------------•----------------------------------------•-------------------------------------------------------•---------------------------------• | REGULAR Expression | Different MATCHES | SECOND regex alternative STATUS, after FIRST try | Caret AFTER 1st try of 2ND Alt. | •-----------------------------•----------------------------------------•-------------------------------------------------------•---------------------------------• | c+|a+\d|b+|# | aaaa1 bbb cccccc # | MATCH, in FORWARD direction | Right AFTER the 1 DIGIT | | c+|\w+\d|b+|# | aaaa1 bbb cccccc # | MATCH, AFTER backtracking from LAST c to 1 | Right AFTER the 1 DIGIT | | | | | | | c+|a++\d|b+|# | aaaa1 bbb cccccc # | MATCH, in FORWARD direction | Right AFTER the 1 DIGIT | | c+|\w++\d|b+|# | bbb cccccc # | NO match, as backtracking is NOT allowed | Right BEFORE the FIRST a | | | | | | | | | | | | c+|a+(*FAIL)\d|b+|# | bbb cccccc # | Match FAILURE, though MATCH in FORWARD direction | Right BEFORE the FIRST a | | c+|\w+(*FAIL)\d|b+|# | bbb cccccc # | Match FAILURE, though MATCH after BACKTRACKING | Right BEFORE the FIRST a | | | | | | | | | | | | c+|a+(*SKIP)\d|b+|# | aaaa1 bbb cccccc # | MATCH, in FORWARD direction | Right AFTER the 1 DIGIT | | c+|\w+(*SKIP)\d|b+|# | # ( + cccccc IF caret AFTER LAST b ) | NO match, as BACTRACKING is NOT allowed | Right BEFORE the # symbol | | | | | | | c+|a+(*SKIP)(*F)\d|b+|# | bbb cccccc # | Match FAILURE, though MATCH in FORWARD direction | Right BEFORE the 1 DIGIT | | c+|\w+(*SKIP)(*F)\d|b+|# | # ( + cccccc IF caret AFTER LAST b ) | Match FAILURE and NO match as NO backtracking | Right BEFORE the # symbol | | | | | | | | | | | | c+|a+(*PRUNE)\d|b+|# | aaaa1 bbb cccccc # | MATCH, in FORWARD direction | Right AFTER the 1 DIGIT | | c+|\w+(*PRUNE)\d|b+|# | cccccc # | Alternative DISCARDED + NO match as NO backtracking | Right AFTER the FIRST a | | | | | | | c+|a+(*PRUNE)(*F)\d|b+|# | bbb cccccc # | Match FAILURE, though MATCH in FORWARD direction | Right AFTER the FIRST a | | c+|\w+(*PRUNE)(*F)\d|b+|# | cccccc # | Match FAILURE and NO match as NO backtracking | Right AFTER the FIRST a | | | | | | | | | | | | c+|a+(*COMMIT)\d|b+|# | aaaa bbb cccccc # | MATCH, in FORWARD direction | Right AFTER the 1 DIGIT | | c+|\w+(*COMMIT)\d|b+|# | cccccc # ( IF caret AFTER LAST b ) | CANCELS any SUBSEQUENT attempt as NO backtracking | Right BEFORE the FIRST a | | | | | | | c+|a+(*ACCEPT)\d|b+|# | aaaa bbb cccccc # | MATCH of part BEFORE (*ACCEPT), in FORWARD direction | Right BEFORE the 1 DIGIT | | c+|\w+(*ACCEPT)\d|b+|# | aaaa1bbbcccccc # | MATCH of part BEFORE (*ACCEPT), in FORWARD direction | Right BEFORE the # symbol | | | | | | | c+|a+(*THEN)\d|b+|# | aaaa1 bbb cccccc # | MATCH, in FORWARD direction | Right AFTER the 1 DIGIT | | c+|\w+(*THEN)\d|b+|# | cccccc # | NO match, as BACTRACKING is NOT allowed | Right BEFORE the FIRST a | •-----------------------------•----------------------------------------•-------------------------------------------------------•---------------------------------•
For additional information, refer to :
https://www.rexegg.com/backtracking-control-verbs.html
https://www.rexegg.com/backtracking-control-verbs.html#skipfail
https://www.pcre.org/current/doc/html/pcre2pattern.html#SEC29
https://perl.developpez.com/documentations/en/5.20.1/perlre.html#Special-Backtracking-Control-Verbs
and this article on stackoverflow :
https://stackoverflow.com/questions/19992984/verbs-that-act-after-backtracking-and-failure
Now, it’s up to you ! Just experiment all these backtracking control verbs and try to find out some practical uses of them ;-))
Best regards,
guy038
-
-
https://community.notepad-plus-plus.org/post/55641
Hello, All,
During my study of the backtracking control verbs ( see posts above ), I noticed this article which described ways to pre-define some subroutines, which can be used to build numerous modular patterns, at the same time :
https://www.rexegg.com/regex-tricks.html#pseudo-define
Note that the special conditional
(?(DEFINE).....)
structure was created, originally, to this purpose, by modern regex engines, including ourBoost
regex engine ! I already discussed of this syntax in this post below :https://community.notepad-plus-plus.org/post/52608
So, for instance, with the free-spacing mode, the regex below :
(?x-i) (?(DEFINE) # START of the conditional DEFINE structure ([A-Z]{2}\d) # TWO CAPITAL letters, followed with a DIGIT in Group 1 ) # END of the conditional DEFINE structure (?1)-(?1)-(?1) # THREE "Group 1" triplets, separated with DASHES
Against the string YH5-RC6-UY0-BD5-AZ3-KL9, would match the two substrings YH5-RC6-UY0 and BD5-AZ3-KL9
Now, in Rexegg site, we are told that, instead of the
(?(DEFINE).....)
conditional structure, we can use the(*FAIL)
backtracking control verb to get a similar behavior, according to the syntax below, using an optional non-capturing group :(?x-i) (?: # START of an OPTIONAL NON-CAPTURING group ([A-Z]{2}\d) # TWO CAPITAL letters, followed with a DIGIT in Group 1 (*F) # Backtracking Control verb (*FAIL) )? # END of the OPTIONAL NON-CAPTURING group (?1)-(?1)-(?1) # THREE "Group 1" triplets, separated with DASHES
Remark that, instead of the
(*F)
verb, we may also use the negative look-ahead(?!)
syntax for identical results
Now, I’ve found out a more simple syntax :
(?x-i) ([A-Z]{2}\d) # TWO CAPITAL letters, followed with a DIGIT in Group 1 (*F) # Backtracking Control verb (*FAIL) | # OR (?1)-(?1)-(?1) # THREE "Group 1" triplets, separated with DASHES
How this regex syntax works, against the string YH5-RC6-UY0-BD5-AZ3-KL9 ?
-
First, the regex engine tries to match the part
([A-Z]{2}\d)
and, indeed, matches the substring HY5. -
At the same time, it stores the pattern
[A-Z]{2}\d
in group1
-
Then the regex engine meets the backtracking control verb
(*F)
which forces it to backtrack -
So the current match HY5 is discarded and the starting position remains right before the first letter Y
-
Then the regex engine tries the other alternative
?1)-(?1)-(?1)
and matches, successively, the two substrings YH5-RC6-UY0 and BD5-AZ3-KL9, as we expect to !
Note the effect of the
(*FAIL)
verb is exactly the same as if the regex engine would try to match a specific pattern that does not exist in subject string ! So, let simply substitute this verb with, for instance, the CURRENY sign¤
, of Unicode code point ``\x{00a4}`. which is generally not used in files !You can use the Windows input method, hitting the
Alt
key and typing number0164
on the numeric keypadSo we get the regex :
(?x-i) ([A-Z]{2}\d) # TWO CAPITAL letters, followed with a DIGIT in Group 1 ¤ # The CURRENCY sign character ( In fact, any NON-EXISTENT character, in CURRENT file ! ) | # OR (?1)-(?1)-(?1) # THREE "Group 1" triplets, separated with DASHES
Why my syntax also works :
-
First, the regex engine tries to match the part
([A-Z]{2}\d)
and, indeed, matches the substring HY5. -
At the same time, it stores the pattern
[A-Z]{2}\d
in group1
-
But it cannot match the following currency sign
¤
and the regex engine naturally backtracks -
The starting position remains right before the first letter Y
-
And the regex engine tries the other alternative
?1)-(?1)-(?1)
, which matches, successively, the two substrings YH5-RC6-UY0 and BD5-AZ3-KL9
Note that the regex engine does remember of the pattern contained in group
1
because, despite of the backtracking phase, the search process still goes on !Of course, instead of the single character
¤
, you can, either, choose a simple string, which does not exist in current file. For instance,///
or@@
or_X_
It’s also important to point out that you can perfectly use named group(s), for all these syntaxes. For instance, my last regex syntax can be rewritten :
(?x-i) (?<Seq>[A-Z]{2}\d) # TWO CAPITAL letters, followed with a DIGIT in NAMED group 'Seq' ¤ # The CURRENCY sign character ( In fact, any NON-EXISTENT character, in CURRENT file ! ) | # OR (?&Seq)-(?&Seq)-(?&Seq) # THREE 'Seq' triplets, separated with DASHES
Of course, all the syntaxes, above, which help us to create a library of pre-defined groups are mostly valuable, when you try to search between numerous patterns, built up from these elementary patterns !
To that purpose, let’s expose a more interesting and practical example, regarding the search of specific sequences within the genetic code !
If necessary, refer for a very basic background, about genetic code, at the end of this post !
From now on, the genetic notions invoked are only the fruit of my imagination and are, in no way, based on scientific data. !! It’s just used to demonstrate the interest of the above regex syntaxes
Let’s suppose that the
3
genetic sequencesCGUUUA
,GCCACUAAACAG
andAAUCGACAU
, named Seq_1, Seq_2 and Seq_3, are of main importance in order to build up greater genetic chains from a combination of these componentsNow, let’s assume that we want, at the same time, search for the seven combinations :
- Seq_2 + Seq_3
- Seq_1 + Seq_2
- Seq_1 + Seq_3
- Seq_2 + AAA codon + Seq_3
- Seq_2 + CCC codon + Seq_1
4
consecutive Seq_1 + Seq_24
consecutive Seq_1 + Seq_3
Then, the regex to search any of these sequences, delimited with a start and a stop codon, could be, in free-spacing mode :
(?x-i) # FREE-SPACING mode and NON-INSENSITIVE search (?<Seq_1>CGUUUA) # Seq_1 definition ( 6 bases ) (?<Seq_2>GCCACUAAACAG) # Seq_2 definition ( 12 bases ) (?<Seq_3>AAUCGACAU) # Seq_3 definition ( 9 bases ) X | # An INEXISTANT character in RNA sequence OR (AUG|GUG|UUG) # Possible START codons (?<Codon>[ACGU]{3})*? # Any number of CODONS, even ZERO, in the NAMED group Codon \K # ANYTHING matched, so far, is DISCARDED (?: # Start of a NON-CAPTURING group (?&Seq_2)(?&Seq_3) | # Chain 1 : Seq_2 + Seq_3 OR (?&Seq_1)(?&Seq_2) | # Chain 2 : Seq_1 + Seq_2 OR (?&Seq_1)(?&Seq_3) | # Chain 3 : Seq_1 + Seq_3 OR (?&Seq_2)AAA(?&Seq_3) | # Chain 4 : Seq_2 + AAA + Seq_3 OR (?&Seq_3)CCC(?&Seq_1) | # Chain 5 : Seq_3 + CCC + Seq_1 OR (?&Seq_1){4}(?&Seq_2) | # Chain 6 : FOUR consecutive Seq_1 + Seq_2 OR (?&Seq_1){4}(?&Seq_3) # Chain 7 : FOUR consecutive Seq_1 + Seq_3 ) # END of the NON-CAPTURING group (?= # START of a LOOK-AHEAD (?&Codon)* # Any number of CODONS, even ZERO (UAA|UGA|UAG) # Possible STOP codons ) # END of the LOOK-AHEAD You can test this regex against these lines, below, which **all** begin with a **start** codon and end with a **stop** codon : ~~~diff AUGUGCAACGAUCGUUUAAAUCGACAUGCCACUAAACAGUUACAUCAUACUGCCAACCAGGGCCAUGUUUAA # Chain 3 GUGAACCAGGGCCAUGUUCGUUUAAAUCGACAUAAUCGACAUAACCCCUUACUUGCUAAUUUCUGA # Chain 3 UUGCGUUUAAACCUAAAAGAUGGGGUCGCCACUAAACAGAAUCGACAUAAUCGACAUGGACCGUAG # Chain 1 UUGACUUUACAUCAUACUGCCGCCACUAAACAGAAAAAUCGACAUCCCCGUUUAAAACCUUGGAGCCCGUAG # Chains 4 then 5 AUGCGUUUACGUUUACGUUUACGUUUAGCCACUAAACAGUCAACUGGAGGAUCCCGGCAUUUUUAA # Chains 6 then 2 GUGUCGAACUGGGGAGCGCCACCUCAUAGUUGUCGUUUACGUUUACGUUUACGUUUAAAUCGACAUUGA # Chains 7 then 3
However, if we change the greedy quantifier, of the Codon group, with a lazy quantifier, we get other matches. This is expected because some areas overlap with other areas or include some others !
So, in order to correctly detect all these chains of nucleotides, we could look, only for the zero-length location, of the start of each sequence, with the appropriate regex, below :
(?x-i) # FREE-SPACING mode and NON-INSENSITIVE search (?<Seq_1>CGUUUA) # Seq_1 definition ( 6 bases ) (?<Seq_2>GCCACUAAACAG) # Seq_2 definition ( 12 bases ) (?<Seq_3>AAUCGACAU) # Seq_3 definition ( 9 bases ) X| # An INEXISTANT character in RNA sequence OR (?= # START of a LOOK-AHEAD (?&Seq_2)(?&Seq_3) | # Case 1 : Seq_2 + Seq_3 OR (?&Seq_1)(?&Seq_2) | # Case 2 : Seq_1 + Seq_2 OR (?&Seq_1)(?&Seq_3) | # Case 3 : Seq_1 + Seq_3 OR (?&Seq_2)AAA(?&Seq_3) | # Case 4 : Seq_2 + AAA + Seq_3 OR (?&Seq_3)CCC(?&Seq_1) | # Case 5 : Seq_3 + CCC + Seq_1 OR (?&Seq_1){4}(?&Seq_2) | # Case 6 : FOUR consecutive Seq_1 + Seq_2 OR (?&Seq_1){4}(?&Seq_3) # Case 7 : FOUR consecutive Seq_1 + Seq_3 ) # END of the LOOK-AHEAD
See below, with the indication of the start of each sequence :
v AUGUGCAACGAUCGUUUAAAUCGACAUGCCACUAAACAGUUACAUCAUACUGCCAACCAGGGCCAUGUUUAA # Chain 3 v GUGAACCAGGGCCAUGUUCGUUUAAAUCGACAUAAUCGACAUAACCCCUUACUUGCUAAUUUCUGA # Chain 3 v UUGCGUUUAAACCUAAAAGAUGGGGUCGCCACUAAACAGAAUCGACAUAAUCGACAUGGACCGUAG # Chain 1 v v UUGACUUUACAUCAUACUGCCGCCACUAAACAGAAAAAUCGACAUCCCCGUUUAAAACCUUGGAGCCCGUAG # Chains 4 then 5 v v AUGCGUUUACGUUUACGUUUACGUUUAGCCACUAAACAGUCAACUGGAGGAUCCCGGCAUUUUUAA # Chains 6 then 2 v v GUGUCGAACUGGGGAGCGCCACCUCAUAGUUGUCGUUUACGUUUACGUUUACGUUUAAAUCGACAUUGA # Chains 7 then 3
Now, if we want to completely match each composite sequence, we’ll use this last regex :
(?x-i) # FREE-SPACING mode and NON-INSENSITIVE search (?<Seq_1>CGUUUA) # Seq_1 definition ( 6 bases ) (?<Seq_2>GCCACUAAACAG) # Seq_2 definition ( 12 bases ) (?<Seq_3>AAUCGACAU) # Seq_3 definition ( 9 bases ) X| # An INEXISTANT character in RNA sequence OR (?: # START of a NON-CAPTURING group (?&Seq_2)(?&Seq_3) | # Case 1 : Seq_2 + Seq_3 OR (?&Seq_1)(?&Seq_2) | # Case 2 : Seq_1 + Seq_2 OR (?&Seq_1)(?&Seq_3) | # Case 3 : Seq_1 + Seq_3 OR (?&Seq_2)AAA(?&Seq_3) | # Case 4 : Seq_2 + AAA + Seq_3 OR (?&Seq_3)CCC(?&Seq_1) | # Case 5 : Seq_3 + CCC + Seq_1 OR (?&Seq_1){4}(?&Seq_2) | # Case 6 : FOUR consecutive Seq_1 + Seq_2 OR (?&Seq_1){4}(?&Seq_3) # Case 7 : FOUR consecutive Seq_1 + Seq_3 ) # END of a NON-CAPTURING group
See, below, the indication of each sequence with the
v
letter or the^
symbol :vvvvvvvvvvvvvvv AUGUGCAACGAUCGUUUAAAUCGACAUGCCACUAAACAGUUACAUCAUACUGCCAACCAGGGCCAUGUUUAA # Chain 3 vvvvvvvvvvvvvvv GUGAACCAGGGCCAUGUUCGUUUAAAUCGACAUAAUCGACAUAACCCCUUACUUGCUAAUUUCUGA # Chain 3 vvvvvvvvvvvvvvvvvvvvv UUGCGUUUAAACCUAAAAGAUGGGGUCGCCACUAAACAGAAUCGACAUAAUCGACAUGGACCGUAG # Chain 1 vvvvvvvvvvvvvvvvvvvvvvvv # Chain 4 UUGACUUUACAUCAUACUGCCGCCACUAAACAGAAAAAUCGACAUCCCCGUUUAAAACCUUGGAGCCCGUAG ^^^^^^^^^^^^^^^^^^ # Chain 5 vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv # Chain 6 AUGCGUUUACGUUUACGUUUACGUUUAGCCACUAAACAGUCAACUGGAGGAUCCCGGCAUUUUUAA ^^^^^^^^^^^^^^^^^^ # Chain 2 vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv # Chain 7 GUGUCGAACUGGGGAGCGCCACCUCAUAGUUGUCGUUUACGUUUACGUUUACGUUUAAAUCGACAUUGA ^^^^^^^^^^^^^^^ # Chain 3
Note that, in order to get, successively, all the occurrences, even in case of overlapping, hit the
F3
shortcut to get a match. Then, hit theLeft
arrow, followed by theRight
arrow, to advance of one position in text !
At last, I would say that it’s obvious that I’m not competing, with my simple regexes, against all the powerful ORF finding tools used by biologists ;-)) Refer to :
https://en.wikipedia.org/wiki/Open_reading_frame#ORF_finding_tools
Best Regards,
guy038
Here is a very basic summary, about genetic code, for the sole purpose of that present discussion :
DNA
is a double ordered helix chain, made of four nucleotides, associated by pairs, adenine (A
) with thymine (T
) and guanine (G
) with cytosine (C
), which carries all the genetic instructions for development, functioning, growth and reproduction of all known living organismsA
gene
is a sequence of nucleotides that encodes, either, the synthesis of theRNA
molecule from theDNA
molecule or the synthesis of a protein from aRNA
molecule and may contain more than1,000
pairs of nucleotides. Humans have nearly20,000
genes, whose about2,000
are thought essential to our survival !Each triplet of nucleotides, named codon, part of a double-stranded
DNA
or part of a single-strandedRNA
molecule, corresponds to an amino-acid, during the transcription process toRNA
or during the translation process to a proteinA single translated region of the genetic code, in the
DNA
molecule, is called anORF
( Open reading Frame ), containing, generally, a minimum of100
to150
codonsAfter the transcription phase, from the
DNA
molecule intoPre-mRNA
and the suppression of introns, we get the matureRNA
molecule, made of four nucleotides, associated by pairs, too : adenine (A
) with uracil (U
) and guanine (G
) with cytosine (C
)In
RNA
, theORF
zone has become theProtein Coding Region
, composed of :- A start codon
[ AUG , GUG , UUG ]
- A continuous stretch of codons
- A stop codon
[ UAA , UGA , UAG ]
If we consider, for instance, the genome of the Escherichia Coli bacteria
K-12
, the proportions of each start and stop codons are :START STOP Codon Codon AUG
83%UAA
63%GUG
14%UGA
29%UUG
3%UAG
8%Refer also to :
https://en.wikipedia.org/wiki/Introduction_to_genetics
https://en.wikipedia.org/wiki/DNA
https://en.wikipedia.org/wiki/RNA
https://en.wikipedia.org/wiki/Gene
https://en.wikipedia.org/wiki/Open_reading_frame
-
-
-
-
-
-
-