Generic Regex : How to use the couple of "Backtracking Control" verbs (*SKIP)(*FAIL) or (*SKIP)(*F) in regexes

guy038

Hi, All,

In this post, I will present the (*SKIP)(*FAIL) ( or (*SKIP)(*F) ) powerful feature, used within regexes and made of these two Backtracking Control verbs.

Some general considerations :

Backtracking control verbs can be described as zero-width assertions, absolutely invisible when the regex engine looks forward, in current regex pattern.

Now, if we consider the general pattern Regex_A(*VERB)Regex_B, note that backtracking is allowed, both, inside the Regex_A pattern and/or inside the Regex_B pattern. However, the regex engine cannot backtrack from the Regex_B part to the Regex_A part, crossing backward the (*VERB) assertion !

I will describe six different examples, using the (*SKIP)(*FAIL) functionality. And, regarding the first example, I’ll also show the difference introduced if we omit, on purpose, either the (*SKIP) or the (*FAIL) Backtracking Control verb.

The associated generic regex is :

Regex A(*SKIP)(*F)|Regex B

From the different examples, below, you should be able to guess the contents of, both, the Regex A and Regex B parts !

The first example is a derived example from this article, shown in rexegg.com site :

https://rexegg.com/backtracking-control-verbs.php#skipfail

Suppose that we want to match single words as long as they are NOT embedded between two curly braces {...} and surround them with two slashs. We can use this S/R :

FIND {\w+}(*SKIP)(*FAIL)|\w+
REPLACE /$0/

Against this INPUT text :

This {is} a simple text to {see} how this {regex} works

This S/R would return the following OUTPUT text :

/This/ {is} /a/ /simple/ /text/ /to/ {see} /how/ /this/ {regex} /works/

Here is how this pattern works :

First, the regex part {\w+} tries to match any single word between a set of curly braces.
As the first word This is not within curly braces, it skips the left branch of the | alternation and, obviously, matches the right branch \w+ => After replacement, we get the string /This/.
Now, the regex engine position is right after the /. As the space character cannot match any branch of the regex, the regex engine advances to the next position, right before the string {is}.
Again, it tries to match the left branch {\w+} which, indeed, is the successful match {is}.
But it encounters the (*SKIP) verb which always matches and, then, the (*FAIL) verb which always fails.
So, the combination of the two backtracking control verbs (*SKIP)(*FAIL) :
- Do not allow any backtracking process
- Cancels the current match {is}, so far => The contents {is}, that we want to avoid, has been skipped.
- Resets the regex engine position to the position right before (*SKIP).
So, the position of the regex engine is now right after the closing curly brace of the string {is}
Again, the next space character cannot be matched at all and the regex engine advances to the next position right before the article a
As this word a is not within curly braces, the regex engine skips the left branch of the | alternation and, obviously, matches the right branch \w+ => After replacement, we get the string /a/

And so on …

Note that the only possibility for the pattern to succeed, against the example text, is that the left branch search fails before (*SKIP), in order to allow the right branch to be tested. That is the case, for instance, at the very beginning against the first word This.

But, as soon as the left branch can be matched, the combination (*SKIP)(*FAIL) just discards the match, so far and restarts the regex engine at the position right before (*SKIP) within the text to examine.

Note also that, generally, the (*SKIP)(*FAIL) syntax is not mandatory. We could have achieved the same result with any of the three following S/R :

FIND (?<=[^{])\w++(?=[^}])
REPLACE /$0/

Or :

FIND {\w+}|(\w+)
REPLACE ?1/$0/:$0

Or :

SEARCH (?:{\w+}.*?)?\K\w+
REPLACE /$0/

The three S/R, above, do give the same OUTPUT text as the regex with the (*SKIP)(*FAIL) syntax, i.e. :

/This/ {is} /a/ /simple/ /text/ /to/ {see} /how/ /this/ {regex} /works/

Now, let’s go back to our first syntax and get rid of the (*SKIP) backtracking control verb :

FIND {\w+}(*FAIL)|\w+
REPLACE /$0/

And let’s try to imagine how this new pattern works :

First, the regex part {\w+} tries to match any single word between a set of curly braces.
As the first word This is not within curly braces, it skips the left branch of the | alternation and, obviously, matches the right branch \w+ => After replacement, we get the string /This/.
Now, the regex engine position is right after the /. As the space character cannot match any branch of the regex, the regex engine advances to the next position, right before the string {is}.
Again, it tries to match the left branch {\w+} which, indeed, is the successful match {is}.
Then, it encounters the (*FAIL) verb which always fails.
Because of the (*FAIL) control verb, the current match {is} is discarded and, as the right branch of the alternation cannot match either, the whole regex fails and the position of the regex engine advances to the next position, right after the opening curly brace of the string {is}.
This time, the word is can match the second branch of the alternative \w+ => So, after replacement, we get the string {/is/}.

And so on …

As expected, it would produce this kind of OUTPUT :

/This/ {/is/} /a/ /simple/ /text/ /to/ {/see/} /how/ /this/ {/regex/} /works/

Again, let’s go back to our first syntax and get rid of the (*FAIL) backtracking control verb :

FIND {\w+}(*SKIP)|\w+
REPLACE /$0/

And let’s try to imagine how this other pattern works :

First, the regex part {\w+} tries to match any single word between a set of curly braces.
As the first word This is not within curly braces, it skips the left branch of the | alternation and, obviously, matches the right branch \w+ => After replacement, we get the string /This/.
Now, the regex engine position is right after the /. As the space character cannot match any branch of the regex, the regex engine advances to the next position, right before the string {is}.
Again, it tries to match the left branch {\w+} which, indeed, is the successful match {is}.
Then, it encounters the (*SKIP) verb which always matches.
So far, as no (*FAIL) occurs, the part {\w+}(*SKIP) is a true match which, after replacement, returns the string /{is}/

( Note that if a successful match has NOT been found, as the (*SKIP) verb does NOT allow any backtracking, the regex engine would have discarded the current match attempt and would have advanced to the position right after the string {is} )

Now, the regex engine position is right after the last /. As the space character cannot match any branch of the regex, the regex engine advances to the next position, right before the string a
As this word is not within curly braces, it will simply match the right branch of the alternation \w+ which, after replacement, gives the string /a/

And so on …

So the OUTPUT text becomes :

/This/ /{is}/ /a/ /simple/ /text/ /to/ /{see}/ /how/ /this/ /{regex}/ /works/

Two remarks :

Given the present INPUT text, the {\w+} and \w+ are equally matched. So, the search regex could have been expressed, without any backtracking control verb, as simply :
- FIND {\w+}|\w+
See the subtle difference with the preceding example, regarding the words embedded in curly braces ( {/is/} vs /{is}/ ! )

Now, let’s show a second example of the (*SKIP)(*F) feature with a table, where the third delimiter | has been changed into the # symbol :

| abc | def # ghi | jkl | mno |
| abc | def # ghi | jkl | mno |

Then, the following regex S/R would change any word, located after the # character, with the all-uppercase same word :

FIND (?-s).*#(*SKIP)(*F)|\w+
REPLACE \U$0

| abc | def ! GHI | JKL | MNO |
| abc | def ! GHI | JKL | MNO |

However, note that we can use, either, the following regex S/R, without any backtracking control verb, for the same result !

FIND (?-s)(?!.*#)\w+
REPLACE \U$0

Let’s go on with this third example, which run an S/R from after a specific and unique character till the very end of file.

The following S/R will change all words uppercase, after an unique # symbol, located near the middle of the file, till the very end of file.

FIND (?s)\A^.*#(*SKIP)(*F)|\w+
REPLACE \U$0

So, from this INPUT text, placed in a new tab ( IMPORTANT )

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    #
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <https://www.gnu.org/licenses/>.

We get the OUTPUT text, below :

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    #
    MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.  SEE THE
    GNU GENERAL PUBLIC LICENSE FOR MORE DETAILS.

    YOU SHOULD HAVE RECEIVED A COPY OF THE GNU GENERAL PUBLIC LICENSE
    ALONG WITH THIS PROGRAM.  IF NOT, SEE <HTTPS://WWW.GNU.ORG/LICENSES/>.

Again, note that the shorter search syntax, (?s)\A^.*#\K|\w+, without any backtracking control verb, would produce the same result !

This fourth example is a derived example from this article on the stackoverflow.com site :

https://stackoverflow.com/questions/24534782/

It matches any line of text, which does NOT contain the upper-case ABC string, anywhere in current line, but DO contain an upper-case DEF string, if and only if, it’s followed, further on, with the upper-case XYZ string. This can be achieved with :

MARK (?-is).*ABC(*SKIP)(*F)|^.*DEF(?=.*XYZ).*

If we consider the following INPUT text :

test
DEF test XYZ
DEF test
test XYZ
DEFXYZ

ABC test
ABC DEF test XYZ
ABC DEF test
ABC test XYZ
ABC DEFXYZ

test ABC
DEF test ABC XYZ
DEF test ABC
test ABC XYZ
DEFABCXYZ

test ABC
DEF test XYZ ABC
DEF test ABC
test XYZ ABC
DEFXYZABC

test OK
DEF test XYZ OK
DEF test OK
test XYZ OK
DEFXYZ OK

ABC test OK
ABC DEF test XYZ OK
ABC DEF test OK
ABC test XYZ OK
ABC DEFXYZ OK

test ABC OK
DEF test ABC XYZ OK
DEF test ABC OK
test ABC XYZ OK
DEFABCXYZ OK

test ABC OK
DEF test XYZ ABC OK
DEF test ABC OK
test XYZ ABC OK
DEFXYZABC OK

this test
this DEF test XYZ
this DEF test
this test XYZ
this DEFXYZ

this ABC test
this ABC DEF test XYZ
this ABC DEF test
this ABC test XYZ
This ABC DEFXYZ

this test ABC
this DEF test ABC XYZ
this DEF test ABC
this test ABC XYZ
This DEFABCXYZ

this test ABC
this DEF test XYZ ABC
this DEF test ABC
this test XYZ ABC
this DEFXYZABC

this test OK
this DEF test XYZ OK
this DEF test OK
this test XYZ OK
this DEFXYZ OK

this ABC test OK
this ABC DEF test XYZ OK
this ABC DEF test OK
this ABC test XYZ OK
this ABCDEFXYZ OK

this test ABC OK
this DEF test ABC OK XYZ
this DEF test ABC OK
this test ABC XYZ OK
this DEFABCXYZ OK

this test ABC OK
this DEF test XYZ ABC OK
this DEF test ABC OK
this test XYZ ABC OK
this DEFXYZABC OK

The search regex would mark all these lines of our INPUT text :

DEF test XYZ
DEFXYZ
DEF test XYZ OK
DEFXYZ OK
this DEF test XYZ
this DEFXYZ
this DEF test XYZ OK
this DEFXYZ OK

And, as expected, all these lines :

Do NOT contain any string ABC
DO contain a string DEF
DO contain a string XYZ, located after the string DEF

Once again, note that the regex (?-is)^((?!ABC).)*DEF(?=.*XYZ)((?!ABC).)*$, without any (*SKIP)(*F) syntax, would also mark the same results. But I admit that it’s a bit harder to understand !

In a nutshell, from the examples above, we can express the (*SKIP)(*F) feature as follows :

    ( Regex A ) (*SKIP)(*F) |  ( Regex B )
                            |
    <---- Part SKIPPED ---->|<- Part MATCHED ->

When the regex to match is simply duplicated at the left of the (SKIP) Backtracking Control` verb, this diagram becomes :

    ( Regex A ) (*SKIP)(*F) |  ( Regex A )
                            |
    <---- Part SKIPPED ---->|<- Part MATCHED ->

Or, more interesting, the template :

    <------------------------------------------ Regex A ------------------------------------------>              |  <----- Regex B ----->
	                                                                                                             |
    ( Cond 1 | Cond 2 | .... |  Cond N ) (  MAIN Search Regex ) ( Cond 1 | Cond 2 | .... | Cond M ) (*SKIP)(*F)  |  ( MAIN Search Regex )
                                                                                                                 |
    <--------------------------------------------- What is SKIPPED --------------------------------------------->|<-- What is MATCHED -->

Meaning that :

If the main regex to match is preceded by a condition, from 1 to N and followed by a condition, from 1 to M, NOT wanted, this whole regex/match is discarded
In ALL other cases, the main regex, in the right branch of the alternation is always matched !

Note that this last template, with multiple conditions to avoid, can hardly be realized without the use of the (*SKIP)(*F) part !!

Continuation and end on next post !

guy038

Hi, All,

For instance, let’s imagine this fifth example with the regex :

MARK (?-i)(?:before_1|before_2)\u+(?:after_1|after_2)(*SKIP)(*F)|\u+

From this regex, we deduce that :

If an upper-case word \u+ is, both :
- Preceded by the string before_1 OR the string before_2
- AND
- Followed by the string after_1 OR the string after_2

=> This match will be discarded, due to the combination (*SKIP)(*F)

In all other cases, the right branch of the alternation \u+ is used and an upper-case word will be matched !

You may verify my hypotheses, against the INPUT text, below, pasted in a new tab

before_XYZafter_
before_XYZafter_1
before_XYZafter_2

before_1XYZafter
before_1XYZafter_1
before_1XYZafter_2

before_2XYZafter
before_2XYZafter_1
before_2XYZafter_2

Finally, this sixth example, will take in account, both, the recursion and the (*SKIP)(*F) syntax. Again, it’s a derived example from this article, on the stackoverflow.com site :

https://stackoverflow.com/questions/70216280

The regex below tries to match any line with unbalanced level of parentheses. This kind of search needs the use of recursion to be achieved !

MARK ($(?:[^()\r\n]++|(?1))*$)(*SKIP)(*F)|[()]

Some explanations :

The Group 1 contents is the string $(?:[^()\r\n]++)*$ which represents a correct balanced level of parentheses ( i.e. an atomic group of characters, different from parentheses and line-breaks, surrounded with a couple of parentheses ).
The recursion is then realized by the insertion of the group 1, so the (?1) syntax, as an alternative, within the contents of the Group 1 itself.
If correct sets of parentheses have been found in current line, the match is then discarded, because of the (*SKIP)(*F) part.
If an incorrect set is found, the right branch of the alternation [()], after the (*SKIP)(*F) part, will match any parenthesis.

After running this regex against the INPUT text below that you’ll paste in a new tab :

(abc)
((abc)
(ab(c)))
((a)bc)
(((((a)(b)(c))
(a(b)c)
((a))bc)
(ab(c))
(a((b)c)
(a(bc))
((ab)c)
((a)(b)(c))
((a((bc))
((ab))))c)

You should find 12 matches in the 7 lines below :

((abc)
(ab(c)))
(((((a)(b)(c))
((a))bc)
(a((b)c)
((a((bc))
((ab))))c)

Of course, for any marked line, with unbalanced levels of parentheses in lines, you must study where are the excess parentheses to be removed in order to get correct sets of parentheses !

For example :

The unbalanced expression (a((b)c) can be interpreted, either, as the correct sets a((b)c) or (a(b)c) !
The unbalanced expression ((ab))))c) can be interpreted, either, as the correct sets ((ab))c or ((ab)c)

Best Regards,

guy038

P.S. : Yet, another example of the (*SKIP)(*F) technique in this article, on the stackoverflow.com site :

https://stackoverflow.com/questions/53066132

Globally, this regex searches, whatever the case, for :

Any word coast, if NOT preceded by the word county

Or

Any word the, if NOT preceded by the word at

Or

Any word pass, if NOT preceded by the word grant

Alan Kilborn

@guy038 said :

https://stackoverflow.com/questions/27534782/

This results in “Page not found” for me.

guy038

Hello @alan-kilborn and All,

Sorry for the typo. The correct link is :

https://stackoverflow.com/questions/24534782/

I also edit my previous post

BR

guy038

Alan Kilborn

I was feeling deja vu as I was reading this blog, and the feeling was not misplaced.
These verbs were discussed previously as part of the more generic discussion here FAQ: Regex “Backtracking Control Verbs”; follow the link and the use the “find” function of your browser to jump to the text _______________ (*SKIP)(*FAIL) _______________.