Community
    • Login

    Generic Regex : How to use the couple of "Backtracking Control" verbs (*SKIP)(*FAIL) or (*SKIP)(*F) in regexes

    Scheduled Pinned Locked Moved Blogs
    5 Posts 2 Posters 389 Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • guy038G
      guy038
      last edited by guy038

      Hi, All,

      In this post, I will present the (*SKIP)(*FAIL) ( or (*SKIP)(*F) ) powerful feature, used within regexes and made of these two Backtracking Control verbs.

      Some general considerations :

      Backtracking control verbs can be described as zero-width assertions, absolutely invisible when the regex engine looks forward, in current regex pattern.

      Now, if we consider the general pattern Regex_A(*VERB)Regex_B, note that backtracking is allowed, both, inside the Regex_A pattern and/or inside the Regex_B pattern. However, the regex engine cannot backtrack from the Regex_B part to the Regex_A part, crossing backward the (*VERB) assertion !

      I will describe six different examples, using the (*SKIP)(*FAIL) functionality. And, regarding the first example, I’ll also show the difference introduced if we omit, on purpose, either the (*SKIP) or the (*FAIL) Backtracking Control verb.

      The associated generic regex is :

      Regex A(*SKIP)(*F)|Regex B

      From the different examples, below, you should be able to guess the contents of, both, the Regex A and Regex B parts !


      The first example is a derived example from this article, shown in rexegg.com site :

      https://rexegg.com/backtracking-control-verbs.php#skipfail

      Suppose that we want to match single words as long as they are NOT embedded between two curly braces {...} and surround them with two slashs. We can use this S/R :

      • FIND {\w+}(*SKIP)(*FAIL)|\w+

      • REPLACE /$0/

      Against this INPUT text :

      This {is} a simple text to {see} how this {regex} works
      

      This S/R would return the following OUTPUT text :

      /This/ {is} /a/ /simple/ /text/ /to/ {see} /how/ /this/ {regex} /works/
      

      Here is how this pattern works :

      • First, the regex part {\w+} tries to match any single word between a set of curly braces.

      • As the first word This is not within curly braces, it skips the left branch of the | alternation and, obviously, matches the right branch \w+ => After replacement, we get the string /This/.

      • Now, the regex engine position is right after the /. As the space character cannot match any branch of the regex, the regex engine advances to the next position, right before the string {is}.

      • Again, it tries to match the left branch {\w+} which, indeed, is the successful match {is}.

      • But it encounters the (*SKIP) verb which always matches and, then, the (*FAIL) verb which always fails.

      • So, the combination of the two backtracking control verbs (*SKIP)(*FAIL) :

        • Do not allow any backtracking process

        • Cancels the current match {is}, so far => The contents {is}, that we want to avoid, has been skipped.

        • Resets the regex engine position to the position right before (*SKIP).

      • So, the position of the regex engine is now right after the closing curly brace of the string {is}

      • Again, the next space character cannot be matched at all and the regex engine advances to the next position right before the article a

      • As this word a is not within curly braces, the regex engine skips the left branch of the | alternation and, obviously, matches the right branch \w+ => After replacement, we get the string /a/

      And so on …


      Note that the only possibility for the pattern to succeed, against the example text, is that the left branch search fails before (*SKIP), in order to allow the right branch to be tested. That is the case, for instance, at the very beginning against the first word This.

      But, as soon as the left branch can be matched, the combination (*SKIP)(*FAIL) just discards the match, so far and restarts the regex engine at the position right before (*SKIP) within the text to examine.


      Note also that, generally, the (*SKIP)(*FAIL) syntax is not mandatory. We could have achieved the same result with any of the three following S/R :

      • FIND (?<=[^{])\w++(?=[^}])

      • REPLACE /$0/

      Or :

      • FIND {\w+}|(\w+)

      • REPLACE ?1/$0/:$0

      Or :

      • SEARCH (?:{\w+}.*?)?\K\w+

      • REPLACE /$0/

      The three S/R, above, do give the same OUTPUT text as the regex with the (*SKIP)(*FAIL) syntax, i.e. :

      /This/ {is} /a/ /simple/ /text/ /to/ {see} /how/ /this/ {regex} /works/
      

      Now, let’s go back to our first syntax and get rid of the (*SKIP) backtracking control verb :

      • FIND {\w+}(*FAIL)|\w+

      • REPLACE /$0/

      And let’s try to imagine how this new pattern works :

      • First, the regex part {\w+} tries to match any single word between a set of curly braces.

      • As the first word This is not within curly braces, it skips the left branch of the | alternation and, obviously, matches the right branch \w+ => After replacement, we get the string /This/.

      • Now, the regex engine position is right after the /. As the space character cannot match any branch of the regex, the regex engine advances to the next position, right before the string {is}.

      • Again, it tries to match the left branch {\w+} which, indeed, is the successful match {is}.

      • Then, it encounters the (*FAIL) verb which always fails.

      • Because of the (*FAIL) control verb, the current match {is} is discarded and, as the right branch of the alternation cannot match either, the whole regex fails and the position of the regex engine advances to the next position, right after the opening curly brace of the string {is}.

      • This time, the word is can match the second branch of the alternative \w+ => So, after replacement, we get the string {/is/}.

      And so on …

      As expected, it would produce this kind of OUTPUT :

      /This/ {/is/} /a/ /simple/ /text/ /to/ {/see/} /how/ /this/ {/regex/} /works/
      

      Again, let’s go back to our first syntax and get rid of the (*FAIL) backtracking control verb :

      • FIND {\w+}(*SKIP)|\w+

      • REPLACE /$0/

      And let’s try to imagine how this other pattern works :

      • First, the regex part {\w+} tries to match any single word between a set of curly braces.

      • As the first word This is not within curly braces, it skips the left branch of the | alternation and, obviously, matches the right branch \w+ => After replacement, we get the string /This/.

      • Now, the regex engine position is right after the /. As the space character cannot match any branch of the regex, the regex engine advances to the next position, right before the string {is}.

      • Again, it tries to match the left branch {\w+} which, indeed, is the successful match {is}.

      • Then, it encounters the (*SKIP) verb which always matches.

      • So far, as no (*FAIL) occurs, the part {\w+}(*SKIP) is a true match which, after replacement, returns the string /{is}/

      ( Note that if a successful match has NOT been found, as the (*SKIP) verb does NOT allow any backtracking, the regex engine would have discarded the current match attempt and would have advanced to the position right after the string {is} )

      • Now, the regex engine position is right after the last /. As the space character cannot match any branch of the regex, the regex engine advances to the next position, right before the string a

      • As this word is not within curly braces, it will simply match the right branch of the alternation \w+ which, after replacement, gives the string /a/

      And so on …

      So the OUTPUT text becomes :

      /This/ /{is}/ /a/ /simple/ /text/ /to/ /{see}/ /how/ /this/ /{regex}/ /works/
      

      Two remarks :

      • Given the present INPUT text, the {\w+} and \w+ are equally matched. So, the search regex could have been expressed, without any backtracking control verb, as simply :

        • FIND {\w+}|\w+
      • See the subtle difference with the preceding example, regarding the words embedded in curly braces ( {/is/} vs /{is}/ ! )


      Now, let’s show a second example of the (*SKIP)(*F) feature with a table, where the third delimiter | has been changed into the # symbol :

      | abc | def # ghi | jkl | mno |
      | abc | def # ghi | jkl | mno |
      

      Then, the following regex S/R would change any word, located after the # character, with the all-uppercase same word :

      • FIND (?-s).*#(*SKIP)(*F)|\w+

      • REPLACE \U$0

      | abc | def ! GHI | JKL | MNO |
      | abc | def ! GHI | JKL | MNO |
      

      However, note that we can use, either, the following regex S/R, without any backtracking control verb, for the same result !

      • FIND (?-s)(?!.*#)\w+

      • REPLACE \U$0


      Let’s go on with this third example, which run an S/R from after a specific and unique character till the very end of file.

      The following S/R will change all words uppercase, after an unique # symbol, located near the middle of the file, till the very end of file.

      • FIND (?s)\A^.*#(*SKIP)(*F)|\w+

      • REPLACE \U$0

      So, from this INPUT text, placed in a new tab ( IMPORTANT )

          This program is free software: you can redistribute it and/or modify
          it under the terms of the GNU General Public License as published by
          the Free Software Foundation, either version 3 of the License, or
          (at your option) any later version.
      
          This program is distributed in the hope that it will be useful,
          but WITHOUT ANY WARRANTY; without even the implied warranty of
          #
          MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
          GNU General Public License for more details.
      
          You should have received a copy of the GNU General Public License
          along with this program.  If not, see <https://www.gnu.org/licenses/>.
      

      We get the OUTPUT text, below :

          This program is free software: you can redistribute it and/or modify
          it under the terms of the GNU General Public License as published by
          the Free Software Foundation, either version 3 of the License, or
          (at your option) any later version.
      
          This program is distributed in the hope that it will be useful,
          but WITHOUT ANY WARRANTY; without even the implied warranty of
          #
          MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.  SEE THE
          GNU GENERAL PUBLIC LICENSE FOR MORE DETAILS.
      
          YOU SHOULD HAVE RECEIVED A COPY OF THE GNU GENERAL PUBLIC LICENSE
          ALONG WITH THIS PROGRAM.  IF NOT, SEE <HTTPS://WWW.GNU.ORG/LICENSES/>.
      

      Again, note that the shorter search syntax, (?s)\A^.*#\K|\w+, without any backtracking control verb, would produce the same result !


      This fourth example is a derived example from this article on the stackoverflow.com site :

      https://stackoverflow.com/questions/24534782/

      It matches any line of text, which does NOT contain the upper-case ABC string, anywhere in current line, but DO contain an upper-case DEF string, if and only if, it’s followed, further on, with the upper-case XYZ string. This can be achieved with :

      MARK (?-is).*ABC(*SKIP)(*F)|^.*DEF(?=.*XYZ).*

      If we consider the following INPUT text :

      test
      DEF test XYZ
      DEF test
      test XYZ
      DEFXYZ
      
      ABC test
      ABC DEF test XYZ
      ABC DEF test
      ABC test XYZ
      ABC DEFXYZ
      
      test ABC
      DEF test ABC XYZ
      DEF test ABC
      test ABC XYZ
      DEFABCXYZ
      
      test ABC
      DEF test XYZ ABC
      DEF test ABC
      test XYZ ABC
      DEFXYZABC
      
      test OK
      DEF test XYZ OK
      DEF test OK
      test XYZ OK
      DEFXYZ OK
      
      ABC test OK
      ABC DEF test XYZ OK
      ABC DEF test OK
      ABC test XYZ OK
      ABC DEFXYZ OK
      
      test ABC OK
      DEF test ABC XYZ OK
      DEF test ABC OK
      test ABC XYZ OK
      DEFABCXYZ OK
      
      test ABC OK
      DEF test XYZ ABC OK
      DEF test ABC OK
      test XYZ ABC OK
      DEFXYZABC OK
      
      this test
      this DEF test XYZ
      this DEF test
      this test XYZ
      this DEFXYZ
      
      this ABC test
      this ABC DEF test XYZ
      this ABC DEF test
      this ABC test XYZ
      This ABC DEFXYZ
      
      this test ABC
      this DEF test ABC XYZ
      this DEF test ABC
      this test ABC XYZ
      This DEFABCXYZ
      
      this test ABC
      this DEF test XYZ ABC
      this DEF test ABC
      this test XYZ ABC
      this DEFXYZABC
      
      this test OK
      this DEF test XYZ OK
      this DEF test OK
      this test XYZ OK
      this DEFXYZ OK
      
      this ABC test OK
      this ABC DEF test XYZ OK
      this ABC DEF test OK
      this ABC test XYZ OK
      this ABCDEFXYZ OK
      
      this test ABC OK
      this DEF test ABC OK XYZ
      this DEF test ABC OK
      this test ABC XYZ OK
      this DEFABCXYZ OK
      
      this test ABC OK
      this DEF test XYZ ABC OK
      this DEF test ABC OK
      this test XYZ ABC OK
      this DEFXYZABC OK
      

      The search regex would mark all these lines of our INPUT text :

      DEF test XYZ
      DEFXYZ
      DEF test XYZ OK
      DEFXYZ OK
      this DEF test XYZ
      this DEFXYZ
      this DEF test XYZ OK
      this DEFXYZ OK
      

      And, as expected, all these lines :

      • Do NOT contain any string ABC

      • DO contain a string DEF

      • DO contain a string XYZ, located after the string DEF

      Once again, note that the regex (?-is)^((?!ABC).)*DEF(?=.*XYZ)((?!ABC).)*$, without any (*SKIP)(*F) syntax, would also mark the same results. But I admit that it’s a bit harder to understand !


      In a nutshell, from the examples above, we can express the (*SKIP)(*F) feature as follows :

          ( Regex A ) (*SKIP)(*F) |  ( Regex B )
                                  |
          <---- Part SKIPPED ---->|<- Part MATCHED ->
      

      When the regex to match is simply duplicated at the left of the (SKIP) Backtracking Control` verb, this diagram becomes :

          ( Regex A ) (*SKIP)(*F) |  ( Regex A )
                                  |
          <---- Part SKIPPED ---->|<- Part MATCHED ->
      

      Or, more interesting, the template :

          <------------------------------------------ Regex A ------------------------------------------>              |  <----- Regex B ----->
      	                                                                                                             |
          ( Cond 1 | Cond 2 | .... |  Cond N ) (  MAIN Search Regex ) ( Cond 1 | Cond 2 | .... | Cond M ) (*SKIP)(*F)  |  ( MAIN Search Regex )
                                                                                                                       |
          <--------------------------------------------- What is SKIPPED --------------------------------------------->|<-- What is MATCHED -->
      

      Meaning that :

      • If the main regex to match is preceded by a condition, from 1 to N and followed by a condition, from 1 to M, NOT wanted, this whole regex/match is discarded

      • In ALL other cases, the main regex, in the right branch of the alternation is always matched !

      Note that this last template, with multiple conditions to avoid, can hardly be realized without the use of the (*SKIP)(*F) part !!

      Continuation and end on next post !

      Alan KilbornA 1 Reply Last reply Reply Quote 4
      • guy038G
        guy038
        last edited by guy038

        Hi, All,

        For instance, let’s imagine this fifth example with the regex :

        • MARK (?-i)(?:before_1|before_2)\u+(?:after_1|after_2)(*SKIP)(*F)|\u+

        From this regex, we deduce that :

        • If an upper-case word \u+ is, both :

          • Preceded by the string before_1 OR the string before_2

          • AND

          • Followed by the string after_1 OR the string after_2

        => This match will be discarded, due to the combination (*SKIP)(*F)

        • In all other cases, the right branch of the alternation \u+ is used and an upper-case word will be matched !

        You may verify my hypotheses, against the INPUT text, below, pasted in a new tab

        before_XYZafter_
        before_XYZafter_1
        before_XYZafter_2
        
        before_1XYZafter
        before_1XYZafter_1
        before_1XYZafter_2
        
        before_2XYZafter
        before_2XYZafter_1
        before_2XYZafter_2
        

        Finally, this sixth example, will take in account, both, the recursion and the (*SKIP)(*F) syntax. Again, it’s a derived example from this article, on the stackoverflow.com site :

        https://stackoverflow.com/questions/70216280

        The regex below tries to match any line with unbalanced level of parentheses. This kind of search needs the use of recursion to be achieved !

        MARK (\((?:[^()\r\n]++|(?1))*\))(*SKIP)(*F)|[()]

        Some explanations :

        • The Group 1 contents is the string \((?:[^()\r\n]++)*\) which represents a correct balanced level of parentheses ( i.e. an atomic group of characters, different from parentheses and line-breaks, surrounded with a couple of parentheses ).

        • The recursion is then realized by the insertion of the group 1, so the (?1) syntax, as an alternative, within the contents of the Group 1 itself.

        • If correct sets of parentheses have been found in current line, the match is then discarded, because of the (*SKIP)(*F) part.

        • If an incorrect set is found, the right branch of the alternation [()], after the (*SKIP)(*F) part, will match any parenthesis.

        After running this regex against the INPUT text below that you’ll paste in a new tab :

        (abc)
        ((abc)
        (ab(c)))
        ((a)bc)
        (((((a)(b)(c))
        (a(b)c)
        ((a))bc)
        (ab(c))
        (a((b)c)
        (a(bc))
        ((ab)c)
        ((a)(b)(c))
        ((a((bc))
        ((ab))))c)
        

        You should find 12 matches in the 7 lines below :

        ((abc)
        (ab(c)))
        (((((a)(b)(c))
        ((a))bc)
        (a((b)c)
        ((a((bc))
        ((ab))))c)
        

        Of course, for any marked line, with unbalanced levels of parentheses in lines, you must study where are the excess parentheses to be removed in order to get correct sets of parentheses !

        For example :

        • The unbalanced expression (a((b)c) can be interpreted, either, as the correct sets a((b)c) or (a(b)c) !

        • The unbalanced expression ((ab))))c) can be interpreted, either, as the correct sets ((ab))c or ((ab)c)

        Best Regards,

        guy038

        P.S. : Yet, another example of the (*SKIP)(*F) technique in this article, on the stackoverflow.com site :

        https://stackoverflow.com/questions/53066132

        FIND (?i)\b(?:county coast|at the|grant pass)\b(*SKIP)(*F)|\b(?:coast|the|pass)\b

        Globally, this regex searches, whatever the case, for :

        • Any word coast, if NOT preceded by the word county

        Or

        • Any word the, if NOT preceded by the word at

        Or

        • Any word pass, if NOT preceded by the word grant
        1 Reply Last reply Reply Quote 3
        • PeterJonesP PeterJones referenced this topic on
        • guy038G guy038 referenced this topic on
        • Alan KilbornA
          Alan Kilborn @guy038
          last edited by

          @guy038 said :

          https://stackoverflow.com/questions/27534782/

          This results in “Page not found” for me.

          1 Reply Last reply Reply Quote 0
          • guy038G
            guy038
            last edited by

            Hello @alan-kilborn and All,

            Sorry for the typo. The correct link is :

            https://stackoverflow.com/questions/24534782/

            I also edit my previous post

            BR

            guy038

            1 Reply Last reply Reply Quote 2
            • Alan KilbornA
              Alan Kilborn
              last edited by Alan Kilborn

              I was feeling deja vu as I was reading this blog, and the feeling was not misplaced.
              These verbs were discussed previously as part of the more generic discussion here FAQ: Regex “Backtracking Control Verbs”; follow the link and the use the “find” function of your browser to jump to the text _______________ (*SKIP)(*FAIL) _______________.

              1 Reply Last reply Reply Quote 3
              • dr ramaanandD dr ramaanand referenced this topic on
              • dr ramaanandD dr ramaanand referenced this topic on
              • dr ramaanandD dr ramaanand referenced this topic on
              • First post
                Last post
              The Community of users of the Notepad++ text editor.
              Powered by NodeBB | Contributors