Generic Regex : How to use the couple of "Backtracking Control" verbs (*SKIP)(*FAIL) or (*SKIP)(*F) in regexes
-
Hi, All,
In this post, I will present the
(*SKIP)(*FAIL)
( or(*SKIP)(*F)
) powerful feature, used within regexes and made of these two Backtracking Control verbs.Some general considerations :
Backtracking control verbs can be described as zero-width assertions, absolutely invisible when the regex engine looks forward, in current regex pattern.
Now, if we consider the general pattern
Regex_A(*VERB)Regex_B
, note that backtracking is allowed, both, inside theRegex_A
pattern and/or inside theRegex_B
pattern. However, the regex engine cannot backtrack from theRegex_B
part to theRegex_A
part, crossing backward the(*VERB)
assertion !I will describe six different examples, using the
(*SKIP)(*FAIL)
functionality. And, regarding the first example, I’ll also show the difference introduced if we omit, on purpose, either the(*SKIP)
or the(*FAIL)
Backtracking Control verb.The associated generic regex is :
Regex A
(*SKIP)(*F)|
Regex BFrom the different examples, below, you should be able to guess the contents of, both, the
Regex A
andRegex B
parts !
The first example is a derived example from this article, shown in
rexegg.com
site :https://rexegg.com/backtracking-control-verbs.php#skipfail
Suppose that we want to match single words as long as they are NOT embedded between two curly braces
{...}
and surround them with two slashs. We can use this S/R :-
FIND
{\w+}(*SKIP)(*FAIL)|\w+
-
REPLACE
/$0/
Against this INPUT text :
This {is} a simple text to {see} how this {regex} works
This S/R would return the following OUTPUT text :
/This/ {is} /a/ /simple/ /text/ /to/ {see} /how/ /this/ {regex} /works/
Here is how this pattern works :
-
First, the regex part
{\w+}
tries to match any single word between a set of curly braces. -
As the first word
This
is not within curly braces, it skips the left branch of the|
alternation and, obviously, matches the right branch\w+
=> After replacement, we get the string/This/
. -
Now, the regex engine position is right after the
/
. As the space character cannot match any branch of the regex, the regex engine advances to the next position, right before the string{is}
. -
Again, it tries to match the left branch
{\w+}
which, indeed, is the successful match{is}
. -
But it encounters the
(*SKIP)
verb which always matches and, then, the(*FAIL)
verb which always fails. -
So, the combination of the two backtracking control verbs
(*SKIP)(*FAIL)
:-
Do not allow any backtracking process
-
Cancels the current match
{is}
, so far => The contents{is}
, that we want to avoid, has been skipped. -
Resets the regex engine position to the position right before
(*SKIP)
.
-
-
So, the position of the regex engine is now right after the closing curly brace of the string
{is}
-
Again, the next space character cannot be matched at all and the regex engine advances to the next position right before the article
a
-
As this word
a
is not within curly braces, the regex engine skips the left branch of the|
alternation and, obviously, matches the right branch\w+
=> After replacement, we get the string/a/
And so on …
Note that the only possibility for the pattern to succeed, against the example text, is that the left branch search fails before
(*SKIP)
, in order to allow the right branch to be tested. That is the case, for instance, at the very beginning against the first wordThis
.But, as soon as the left branch can be matched, the combination
(*SKIP)(*FAIL)
just discards the match, so far and restarts the regex engine at the position right before(*SKIP)
within the text to examine.
Note also that, generally, the
(*SKIP)(*FAIL)
syntax is not mandatory. We could have achieved the same result with any of the three following S/R :-
FIND
(?<=[^{])\w++(?=[^}])
-
REPLACE
/$0/
Or :
-
FIND
{\w+}|(\w+)
-
REPLACE
?1/$0/:$0
Or :
-
SEARCH
(?:{\w+}.*?)?\K\w+
-
REPLACE
/$0/
The three S/R, above, do give the same OUTPUT text as the regex with the
(*SKIP)(*FAIL)
syntax, i.e. :/This/ {is} /a/ /simple/ /text/ /to/ {see} /how/ /this/ {regex} /works/
Now, let’s go back to our first syntax and get rid of the
(*SKIP)
backtracking control verb :-
FIND
{\w+}(*FAIL)|\w+
-
REPLACE
/$0/
And let’s try to imagine how this new pattern works :
-
First, the regex part
{\w+}
tries to match any single word between a set of curly braces. -
As the first word
This
is not within curly braces, it skips the left branch of the|
alternation and, obviously, matches the right branch\w+
=> After replacement, we get the string/This/
. -
Now, the regex engine position is right after the
/
. As the space character cannot match any branch of the regex, the regex engine advances to the next position, right before the string{is}
. -
Again, it tries to match the left branch
{\w+}
which, indeed, is the successful match{is}
. -
Then, it encounters the
(*FAIL)
verb which always fails. -
Because of the
(*FAIL)
control verb, the current match{is}
is discarded and, as the right branch of the alternation cannot match either, the whole regex fails and the position of the regex engine advances to the next position, right after the opening curly brace of the string{is}
. -
This time, the word
is
can match the second branch of the alternative\w+
=> So, after replacement, we get the string{/is/}
.
And so on …
As expected, it would produce this kind of OUTPUT :
/This/ {/is/} /a/ /simple/ /text/ /to/ {/see/} /how/ /this/ {/regex/} /works/
Again, let’s go back to our first syntax and get rid of the
(*FAIL)
backtracking control verb :-
FIND
{\w+}(*SKIP)|\w+
-
REPLACE
/$0/
And let’s try to imagine how this other pattern works :
-
First, the regex part
{\w+}
tries to match any single word between a set of curly braces. -
As the first word
This
is not within curly braces, it skips the left branch of the|
alternation and, obviously, matches the right branch\w+
=> After replacement, we get the string/This/
. -
Now, the regex engine position is right after the
/
. As the space character cannot match any branch of the regex, the regex engine advances to the next position, right before the string{is}
. -
Again, it tries to match the left branch
{\w+}
which, indeed, is the successful match{is}
. -
Then, it encounters the
(*SKIP)
verb which always matches. -
So far, as no
(*FAIL)
occurs, the part{\w+}(*SKIP)
is a true match which, after replacement, returns the string/{is}/
( Note that if a successful match has NOT been found, as the
(*SKIP)
verb does NOT allow any backtracking, the regex engine would have discarded the current match attempt and would have advanced to the position right after the string{is}
)-
Now, the regex engine position is right after the last
/
. As the space character cannot match any branch of the regex, the regex engine advances to the next position, right before the stringa
-
As this word is not within curly braces, it will simply match the right branch of the alternation
\w+
which, after replacement, gives the string/a/
And so on …
So the OUTPUT text becomes :
/This/ /{is}/ /a/ /simple/ /text/ /to/ /{see}/ /how/ /this/ /{regex}/ /works/
Two remarks :
-
Given the present INPUT text, the
{\w+}
and\w+
are equally matched. So, the search regex could have been expressed, without any backtracking control verb, as simply :- FIND
{\w+}|\w+
- FIND
-
See the subtle difference with the preceding example, regarding the words embedded in curly braces (
{/is/}
vs/{is}/
! )
Now, let’s show a second example of the
(*SKIP)(*F)
feature with a table, where the third delimiter|
has been changed into the#
symbol :| abc | def # ghi | jkl | mno | | abc | def # ghi | jkl | mno |
Then, the following regex S/R would change any word, located after the
#
character, with the all-uppercase same word :-
FIND
(?-s).*#(*SKIP)(*F)|\w+
-
REPLACE
\U$0
| abc | def ! GHI | JKL | MNO | | abc | def ! GHI | JKL | MNO |
However, note that we can use, either, the following regex S/R, without any backtracking control verb, for the same result !
-
FIND
(?-s)(?!.*#)\w+
-
REPLACE
\U$0
Let’s go on with this third example, which run an S/R from after a specific and unique character till the very end of file.
The following S/R will change all words uppercase, after an unique
#
symbol, located near the middle of the file, till the very end of file.-
FIND
(?s)\A^.*#(*SKIP)(*F)|\w+
-
REPLACE
\U$0
So, from this INPUT text, placed in a new tab ( IMPORTANT )
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.
We get the OUTPUT text, below :
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. SEE THE GNU GENERAL PUBLIC LICENSE FOR MORE DETAILS. YOU SHOULD HAVE RECEIVED A COPY OF THE GNU GENERAL PUBLIC LICENSE ALONG WITH THIS PROGRAM. IF NOT, SEE <HTTPS://WWW.GNU.ORG/LICENSES/>.
Again, note that the shorter search syntax,
(?s)\A^.*#\K|\w+
, without any backtracking control verb, would produce the same result !
This fourth example is a derived example from this article on the
stackoverflow.com
site :https://stackoverflow.com/questions/24534782/
It matches any line of text, which does NOT contain the upper-case
ABC
string, anywhere in current line, but DO contain an upper-caseDEF
string, if and only if, it’s followed, further on, with the upper-caseXYZ
string. This can be achieved with :MARK
(?-is).*ABC(*SKIP)(*F)|^.*DEF(?=.*XYZ).*
If we consider the following INPUT text :
test DEF test XYZ DEF test test XYZ DEFXYZ ABC test ABC DEF test XYZ ABC DEF test ABC test XYZ ABC DEFXYZ test ABC DEF test ABC XYZ DEF test ABC test ABC XYZ DEFABCXYZ test ABC DEF test XYZ ABC DEF test ABC test XYZ ABC DEFXYZABC test OK DEF test XYZ OK DEF test OK test XYZ OK DEFXYZ OK ABC test OK ABC DEF test XYZ OK ABC DEF test OK ABC test XYZ OK ABC DEFXYZ OK test ABC OK DEF test ABC XYZ OK DEF test ABC OK test ABC XYZ OK DEFABCXYZ OK test ABC OK DEF test XYZ ABC OK DEF test ABC OK test XYZ ABC OK DEFXYZABC OK this test this DEF test XYZ this DEF test this test XYZ this DEFXYZ this ABC test this ABC DEF test XYZ this ABC DEF test this ABC test XYZ This ABC DEFXYZ this test ABC this DEF test ABC XYZ this DEF test ABC this test ABC XYZ This DEFABCXYZ this test ABC this DEF test XYZ ABC this DEF test ABC this test XYZ ABC this DEFXYZABC this test OK this DEF test XYZ OK this DEF test OK this test XYZ OK this DEFXYZ OK this ABC test OK this ABC DEF test XYZ OK this ABC DEF test OK this ABC test XYZ OK this ABCDEFXYZ OK this test ABC OK this DEF test ABC OK XYZ this DEF test ABC OK this test ABC XYZ OK this DEFABCXYZ OK this test ABC OK this DEF test XYZ ABC OK this DEF test ABC OK this test XYZ ABC OK this DEFXYZABC OK
The search regex would mark all these lines of our INPUT text :
DEF test XYZ DEFXYZ DEF test XYZ OK DEFXYZ OK this DEF test XYZ this DEFXYZ this DEF test XYZ OK this DEFXYZ OK
And, as expected, all these lines :
-
Do NOT contain any string
ABC
-
DO contain a string
DEF
-
DO contain a string
XYZ
, located after the stringDEF
Once again, note that the regex
(?-is)^((?!ABC).)*DEF(?=.*XYZ)((?!ABC).)*$
, without any(*SKIP)(*F)
syntax, would also mark the same results. But I admit that it’s a bit harder to understand !
In a nutshell, from the examples above, we can express the
(*SKIP)(*F)
feature as follows :( Regex A ) (*SKIP)(*F) | ( Regex B ) | <---- Part SKIPPED ---->|<- Part MATCHED ->
When the regex to match is simply duplicated at the left of the
(SKIP)
Backtracking Control` verb, this diagram becomes :( Regex A ) (*SKIP)(*F) | ( Regex A ) | <---- Part SKIPPED ---->|<- Part MATCHED ->
Or, more interesting, the template :
<------------------------------------------ Regex A ------------------------------------------> | <----- Regex B -----> | ( Cond 1 | Cond 2 | .... | Cond N ) ( MAIN Search Regex ) ( Cond 1 | Cond 2 | .... | Cond M ) (*SKIP)(*F) | ( MAIN Search Regex ) | <--------------------------------------------- What is SKIPPED --------------------------------------------->|<-- What is MATCHED -->
Meaning that :
-
If the main regex to match is preceded by a condition, from
1
toN
and followed by a condition, from1
toM
, NOT wanted, this whole regex/match is discarded -
In ALL other cases, the main regex, in the right branch of the alternation is always matched !
Note that this last template, with multiple conditions to avoid, can hardly be realized without the use of the
(*SKIP)(*F)
part !!Continuation and end on next post !
-
-
Hi, All,
For instance, let’s imagine this fifth example with the regex :
- MARK
(?-i)(?:before_1|before_2)\u+(?:after_1|after_2)(*SKIP)(*F)|\u+
From this regex, we deduce that :
-
If an upper-case word
\u+
is, both :-
Preceded by the string
before_1
or the stringbefore_2
-
AND
-
Followed by the string
after_1
or the stringafter_2
-
=> This match will be discarded, due to the combination
(*SKIP)(*F)
- In all other cases, the right branch of the alternation
\u+
is used and an upper-case word will be matched !
You may verify my hypotheses, against the INPUT text, below, pasted in a new tab
before_XYZafter_ before_XYZafter_1 before_XYZafter_2 before_1XYZafter before_1XYZafter_1 before_1XYZafter_2 before_2XYZafter before_2XYZafter_1 before_2XYZafter_2
Finally, this sixth example, will take in account, both, the recursion and the
(*SKIP)(*F)
syntax. Again, it’s a derived example from this article, on thestackoverflow.com
site :https://stackoverflow.com/questions/70216280
The regex below tries to match any line with unbalanced level of parentheses. This kind of search needs the use of recursion to be achieved !
MARK
(\((?:[^()\r\n]++|(?1))*\))(*SKIP)(*F)|[()]
Some explanations :
-
The
Group 1
contents is the string\((?:[^()\r\n]++)*\)
which represents a correct balanced level of parentheses ( i.e. an atomic group of characters, different from parentheses and line-breaks, surrounded with a couple of parentheses ). -
The recursion is then realized by the insertion of the group 1, so the
(?1)
syntax, as an alternative, within the contents of theGroup 1
itself. -
If correct sets of parentheses have been found in current line, the match is then discarded, because of the
(*SKIP)(*F)
part. -
If an incorrect set is found, the right branch of the alternation
[()]
, after the(*SKIP)(*F)
part, will match any parenthesis.
After running this regex against the INPUT text below that you’ll paste in a new tab :
(abc) ((abc) (ab(c))) ((a)bc) (((((a)(b)(c)) (a(b)c) ((a))bc) (ab(c)) (a((b)c) (a(bc)) ((ab)c) ((a)(b)(c)) ((a((bc)) ((ab))))c)
You should find
12
matches in the7
lines below :((abc) (ab(c))) (((((a)(b)(c)) ((a))bc) (a((b)c) ((a((bc)) ((ab))))c)
Of course, for any marked line, with unbalanced levels of parentheses in lines, you must study where are the excess parentheses to be removed in order to get correct sets of parentheses !
For example :
-
The unbalanced expression
(a((b)c)
can be interpreted, either, as the correct setsa((b)c)
or(a(b)c)
! -
The unbalanced expression
((ab))))c)
can be interpreted, either, as the correct sets((ab))c
or((ab)c)
Best Regards,
guy038
P.S. : Yet, another example of the
(*SKIP)(*F)
technique in this article, on thestackoverflow.com
site :https://stackoverflow.com/questions/53066132
FIND
(?i)\b(?:county coast|at the|grant pass)\b(*SKIP)(*F)|\b(?:coast|the|pass)\b
Globally, this regex searches, whatever the case, for :
- Any word coast, if NOT preceded by the word county
Or
- Any word the, if NOT preceded by the word at
Or
- Any word pass, if NOT preceded by the word grant
- MARK
-
P PeterJones referenced this topic
-
G guy038 referenced this topic
-
-
Hello @alan-kilborn and All,
Sorry for the typo. The correct link is :
https://stackoverflow.com/questions/24534782/
I also edit my previous post
BR
guy038