New "Backtracking Control Verbs" feature, available since Notepad++ v7.7



  • +                  BACKTRACKING CONTROL verbs
    -                          FIRST post
    

    Hi, All,

    Since Notepad++ v7.7, its regex engine uses a new Boost regex library. First, I advice you to to see here , which discusses of various Boost version numbers.

    Related to our present regex engine status, these versions are :

    • The general Boost C++ libraries version number : -1.64.0

    • The Boost-Regex library version number : -5.1.3

    • The Boost-Regex documentation version number : 1.70.0


    Our Boost regex engine uses the PCRE ( Perl Compatible Regular Expression syntax ) and, since the old 1.55.0 documentation, the new 1.70.0 regex documentation, contains only a significant difference, relative to a new feature, named Backtracking Control Verbs. Refer to the official Boost documentation, below :

    https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.backtracking_control_verbs

    Note that on Rexegg site, it is said :

    The slow speed of adoption of the backtracking control verbs reflects some simple truths: they are rarely used, likely because they are poorly known and little understood by a considerable proportion of those few who have heard of them.

    Frankly, this lack of awareness is not an issue because there is so much basic material that most people, who only occasionally use regex, need to master before the features offered by backtracking verbs become meaningful !

    I would add that the use of the backtracking control verbs are really significant in rare occasions only ! However, I’ll try to give you an oversight of these new assertions and some examples to describe them as well as some practical uses, found out from general documentation and from some of my tests !


    But, before studying these backtracking control Verbs, let’s say a word about the backtracking process itself !

    • If we consider the simple regex \w+\w{3}\d+, against the string ABC12345DEFABC56667892, the main steps are :

      • First, the \w+ pattern matches, from the starting position 1, all the string ABC12345DEFABC56667892

      • Then, the \w+ backtracks ( so decreases by 3 the current value of quantifier +), in order that the \w+ pattern matches the string ABC12345DEFABC56667 and that the \w{3} pattern matches the string 892

      • Finally, the \w+\w{3} backtracks ( so decrease by 1 the current value of quantifier +, in \w+ pattern ), in order that the \w+ pattern matches ABC12345DEFABC5666 , the \w{3} pattern matches the string 789 and the \d+ pattern matches the digit 2

    In this specific regex, only the first + quantifier of \w+ was involved in the backtracking process. The + quantifier of \d+ was not concerned at all as nothing follows the pattern \d+

    Remark : It is important to note that the backtracking process is not related, exactly, to backward directions, but, rather, to the idea that the regex engine recalculates an other possible match, from current starting position, in the string !

    • Let’s consider the regex (?-i)\w+?C\d{6}, against the same string ABC12345DEFABC56667892

      • First, the regex \w+?, with the lazy quantifier beginning at 1, matches the letter A. But as the other letter is not a B, it backtracks and, from position 1 in string, increases the quantifier to 2 to get the string AB and, this time, the C pattern does match the next C letter

      • Then the \d{6} pattern tries to find six digits. However, after the five digits 12345, the D letter cannot be matched by the \d pattern. Thus, the regex engine backtracks to starting position 1 and increase the common value 2 of the +? quantifier, by one, …till value 13, in order that the next char is, again, a letter C

      • Finally the \d{6} pattern looks for six consecutive digits and does find a match with the string 566678. So, the search process stops with pattern \w+? = ABC12345DEFAB, pattern C = C and pattern \d{6} = 566678


    To begin with, let’s say that the backtracking control verbs can be described as zero-width assertions, absolutely invisible when the regex engine looks forward, in current regex pattern. Six different verbs are managed by the Boost regex engine

    As mentioned in the Rexegg site, the two verbs (*ACCEPT) and (*FAIL) should not be strictly designed as backtracking control Verbs. Indeed :

    • The (*ACCEPT) assertion should be simply seen as a control Verb

    • The (*FAIL) assertion should be called a backtracking Verb.

    It is important to point out that these two verbs act as soon as they are encountered, in forward direction !

    • Four other verbs (*THEN), (*PRUNE), (*SKIP) and (*COMMIT) can be fully designed as backtracking control verbs, as they act only during the backtracking process and, of course, forces the regex engine to realize a specific action

    However, if we consider the general pattern Regex_A(*VERB)Regex_B, note that backtracking is allowed, both, inside the Regex_A pattern and/or inside the Regex_B pattern. However, the regex engine cannot backtrack from the Regex_B part to the Regex_A part, crossing backward the (*VERB) assertion !

    • For instance, the regex \w+\d+(*PRUNE).+\d, against the string AAAA123456789BBBB----YYYY0ZZZZ, matches the part AAAA123456789BBBB----YYYY0 ( Refer to the PRUNE backtracking control verb, further on ! )

      • The regex_A ( \w+d+ ) backtracks from last B letter to digit 8, in order to match the AAAA123456789 part

      • The regex_B ( .+\d ) backtracks from last Z to last Y, in order to match the BBBB----YYYY0 part

    • As no additional backtracking process is needed, an overall match occurs

    • Now, this same regex, against the string AAAA123456789BBBB----YYYYZZZZ ( with 0 digit missing ) would not match anything :

      • First, the regex_A ( \w+\d+ ), as before, would match the string AAAAA123456789

      • However, the Regex_B ( .+\d ) cannot obviously match the remainder BBB----YYYYZZZZ

      • Even after backtracking to the first B letter, no match can be found, because no more digit occurs. So, the regex engine would need to backtrack onto the Regex_A pattern till before the 8 digit, in order that .+ would match the 8 digit and \d would match the 9 digit

      • Because of the (*PRUNE) verb, this behavior is not allowed and the match attempt fails at current position, Then, the regex engine advances to next position, in string, but no match occurs at any subsequent position, too !

    Just see the difference with the simple regex \w+\d+.+\d, against the string AAAA123456789BBBB----YYYYZZZZ. This time, the substring AAAA123456789 matches ! ( \w+ = AAAA123456, \d+ = 7, .+ = 8 and \d = 9 )


    • When the regex engine moves forward, in the pattern, these four verbs have absolutely no influence and they always match. So their special semantics are never activated !

    • When the regex engine moves downward, because backtracking is needed, it’s not allowed to cross them. In other words, the regex engine can never go back to the left of these backtracking control verbs !

    Finally, the (*MARK) verb is not supported by the present Boost library, implemented in Notepad++, whatever its version


    We’ll begin to study the 4 true backtracking control verbs according to this order : (*COMMIT), (*SKIP), (*PRUNE) and (*THEN). Next, it should be easier, for anyone, to figure out the relations and differences between each of them :-)

    Then, we’ll study the two remaining verbs (*ACCEPT) and (*FAIL). The later is often used in conjunction with other backtracking control verbs to perform enhanced behaviors

    This rather long documentation must be divided into five posts !


    _______________ (*COMMIT) _______________

    In BOOST documentation, it is said :

    (*COMMIT) Has no effect unless backtracked onto, in which case all subsequent matching/searching attempts are abandoned.

    This means that, if at current position in string, the regex engine can match the part before (*COMMIT) but cannot match the part after (*COMMIT), as any backtracking is canceled to the left of (*COMMIT), the regex engine discards any further search and the current match attempt just fails. So, the regex engine simply abandons any further match attempt, the overall match just fails and the search process stops

    Note that after the general failure, the current starting position remains right after the last successful attempt

    In other words, the general regex regex_A(*COMMIT)regex_B means that if regex_A have already matched, then regex_B must necessarily match too, right after. If not, the overall match fails !

    On Rexegg site it is said :

    “You can read the (*COMMIT) token as past this point, we are committed to finding a match, in this match attempt, or none at all”.

    So, for instance, the regex (*COMMIT)ABC means that if the string ABC is not found, at caret location, the search process simply stops !


    Consider the regex (?-i)AB+(*COMMIT)CD|\w{2}.|AB+(*COMMIT)CH, against the text ---ABBZE---ABBBBCH---ABBBCY---ABBBBCD---

    —ABBZE—ABBBBCH—ABBBCY—ABBBBCD—

    So, according to the effect of a (*COMMIT) backtracking control verb, let’s follow the regex engine’s work :

    • The first 3 positions of the string ( \-\-\- ) are skipped, as no alternative can match a dash character

    • So, the starting position is right before the string ABBZE :

      • The first found pattern AB+(*SKIP)CD matches ABB but cannot match CD

      • So, it tries to backtrack at left of the (*COMMIT) syntax to test other values of AB+

      • Because of the (*COMMIT) control verb, this behavior is forbidden So, all which is matched, so far ( ABB ), is discarded and the regex engine abandons any further match attempt, too. So the overall match fails.

      • However, note that the starting position remains at the very beginning of the scanned string, so right before the string ---ABBZE


    The regex 1(?=2(*COMMIT)3) finds a 1 digit IF followed with 23, but all the search process stops completely as soon as a 1 digit is followed by a 2 and any char different from 3. Note that, after this failure, the current starting point remains after the last successful attempt ! ( the digit 1 which is followed with the string 23 )

    Whereas the regex 1(?=23) find a 1 digit IF followed with 23 and search process goes on, to next positions, in string, to search for other successful match attempts as, for instance, the third value 123 !

    Test these two regexes 1(?=2(*COMMIT)3) and 1(?=23), against the text, below :

    10
    11
    123  OK
    14
    18
    19
    123  OK
    12
    123  KO
    127
    

    Third example : the regex (a(*COMMIT)b)+ac can be re-written (a(*COMMIT)ba(*COMMIT)ba.....a(*COMMIT)b)ac fails against the text abac, because in the second repeat, a(*COMMIT)b fails to match ac and, as backtracking is not allowed in order to choose only one repeat, followed with ac, the overall match fails due to the verb (*COMMIT)


    It’s worth to point out that the (*COMMIT) verb does not always commit. It’s the case when an other backtracking control verb is located at right at the (*COMMIT) verb :

    For instance, the regex (?-i)AB+(*COMMIT)(*SKIP)CD|\w{2}., against the string ---ABBZE--- does match the string ZE-, because the (*SKIP) verb if found, first, before reaching, backwards, the (*COMMIT) verb. In contrast, with the almost similar regex (?-i)AB+(*COMMIT)B(*SKIP)CD|\w{2}., the regex engine would need to backtrack at left of the (*COMMIT) verb => Immediate failure !


    I found a practical use of the (**cOMMIT) verb :

    We all know, that if the Wrap around option is not used, the process is run from current location till the very end of file(s). Now, imagine that you would like to execute an S/R from current position till the very beginning of file. The backward option is usually not enabled when regex search is involved and, anyway, some regexes do not behave properly, in backward direction !

    So, add a simple character, not used yet, alone in a line, anywhere in your current file. I chose the # character but, of course, any other char may be chosen ! Now, see the behavior of the regex #(*COMMIT*)\w+|\w+

    • As long as the # character is not found, at current starting position, the second alternative of the regex is used and matches any word ( \w+ )

    • As soon as the # character is found, at current starting position, the regex engine tries to find some word chars, right after the # symbol. This is impossible as it is followed with a line-break ! So, due to the (*COMMIT) verb, the overall regex fails and the process stops !

    So, if we consider the shorter search regex (#(*COMMIT)|)\w+ and the replacement regex \U$0 with the Wrap around option ticked, this S/R would change any word in uppercase but the replacement process would stop soon as a # symbol, not immediately followed by a word char, is found, further on, in current text !



  • +                  BACKTRACKING CONTROL verbs
    -                 SECOND post ( continuation )
    

    _______________ (*SKIP) _______________

    In BOOST documentation, it is said :

    (*SKIP) Behaves the same as (*PRUNE) except that it is assumed that no match can possibly occur prior to the current point in the string being searched. This can be used to optimize searches by skipping over chunks of text that have already been determined can not form a match.

    This means that, if at current position in string, the regex engine can match the part before (*SKIP) but cannot match the part after (*SKIP), as any backtracking is canceled to the left of (*SKIP), the regex engine discards any further search and the current match attempt just fails. So the regex engine must advance to the location where the zero-width (*SKIP) verb occurs, for a new match attempt !

    In other words, if the (*SKIP) is crossed, during the backtracking process, the regex search process stops, at current position and begins again, at location right after the (*SKIP) verb. So, the text matched before (*SKIP) will never be part of a subsequent successful match ! Sometimes, this behavior can save a lot of fruitless match attempts !


    Consider the regex (?-i)AB+(*SKIP)CD|\w{2}.|AB+(*SKIP)CH, against the text ---ABBZE---ABBBBCH---ABBBCY---ABBBBCD---

    So, according to the effect of a (*SKIP) backtracking control verb, let’s follow the regex engine’s work :

    • The first 3 positions of the string ( \-\-\- ) are skipped, as no alternative can match a dash character

    • So, the starting position is right before the string ABBZE :

      • The first found pattern AB+(*SKIP)CD matches ABB but cannot match CD

      • So, it tries to backtrack at left of the (*SKIP) syntax to test other values of AB+

      • Because of the (*SKIP) control verb, this behavior is forbidden So, all which is matched, so far ( ABB ), is discarded and the regex engine moves to the location where (*SKIP) occurs, so right after the ABB string

    • Now, the starting position is right before the string ZE :

    • The first alternative cannot match in forward direction but the second one \w{2}. does match the string ZE-

    • So, after 2 moves, again, the starting position is, now, right before the string ABBBBCH :

      • At this stage, it’s worth noting that last alternative AB+(*SKIP)CH could match the ABBBBCH string ! However it won’t be used, because the first alternative AB+(*SKIP)CD is tested first

      • So, the first found pattern AB+(*PRUNE)CD matches ABBBB but cannot match CD

      • So, it tries to backtrack at left of the (*SKIP) syntax to test other values of AB+

      • Because of the (*SKIP) control verb, this behavior is forbidden So, all which is matched, so far ( ABBBB ), is discarded and the regex engine moves to the location where (*SKIP) occurs, so right after the ABBBB string

    • Now, the starting position is right before the string CH :

    • Again, the first alternative cannot match in forward direction but the second one, \w{2}., does match the string CH-

    • After 2 moves, the starting position is, now, right before the string ABBBCY :

      • Process is identical as it was when the starting position was right before the string ABBZE

      • Because of the (*SKIP) control verb, current match attempt fails and the regex engine moves to the location where (*SKIP) occurs, so right before the string CY :

    • The first alternative fails, in forward direction but the second pattern \w{2}. does match the string CY-

    • Finally, after 2 moves, the starting position is, now, right before the string ABBBBCD :

      • This time, the first alternative AB+(*SKIP)CD just matches, in forward direction, all the remaining letters, the string ABBBBCD

    See, below, the letters matched by each numbered occurrence :

    ---ABBZE---ABBBBCH---ABBBCY---ABBBBCD---
          111       222      333  4444444
    

    Just compare with the same regex, below, where we omit the (*SKIP) control verb :

    (?-i)AB+CD|\w{2}.|AB+CH

    Note, again, the letters matched by each occurrence, keeping in mind that the regex engine, in that case, always tries the alternatives from left to right, period !

    ---ABBZE---ABBBBCH---ABBBCY--ABBBBCD----
       111222  333444    555666  7777777
    

    A second example : against the string aaacaaabxxaacaaaa---, beginning a line, let’s test some regexes :

    • The regex \w+(*SKIP)b|.+ :

      • First this regex matches all the letters aaacaaabxxaacaaaa and tries to backtrack in order to get a final b letter

      • As the backtracking process is forbidden because of the (*SKIP) control verb, current match attempt fails

      • Then, the starting position moves to the location where (*SKIP) is encountered, so right after all the letters

      • Finally, as the first alternative does not match, the second alternative .+ does match the string ---

    aaacaaabxxaacaaaa---
                     111
    
    • The regex a+(*SKIP)b|.+ :

      • First this regex matches all the letters aaa and, as the next c letter does not match the b pattern, it tries to backtrack

      • As the backtracking process is forbidden because of the (*SKIP) control verb, current match attempt fails

      • Then, the starting position moves to the location where (*SKIP) is encountered, so right after the aaa string

      • Finally, as the first alternative cannot match, the second alternative .+ alternative does match the remaining characters caaabxxaacaaaa---

    aaacaaabxxaacaaaa---
       11111111111111111
    

    See the differences with the similar regexes, without the (*SKIP) control verb :

    • The regex \w+b|.+

      • First this regex matches all the letters and, as the next - character does not match the b pattern, it backtracks from the last a letter to the letter a, before the b letter, for a first match aaacaaab

      • Then, as the first alternative cannot match, the second alternative .+ matches all the remaining characters, so the string xxaacaaaa---

    aaacaaabxxaacaaaa---
    11111111222222222222
    
    • The regex \w+?b|.+

      • First this regex matches the first letter a and, as the next a letter does not match the b pattern, it backtracks from the first a letter to the letter a, before the letter b, for a first match aaacaaab

      • Then, as the first alternative cannot match, the second alternative .+ matches all the remaining characters xxaacaaaa---, as above

    • The regex \w++b|.+

      • First this regex matches all the letters and, as the next - character does not match the b pattern, the regex engine tries to backtrack

      • But, because of the possessive quantifier ++, the backtracking process is not allowed and the match attempt fails

      • So, the second alternative .+ is tried and matches all the remaining characters, so the entire string aaacaaabxxaacaaaa---


    Note that, on Rexegg site , it is said :

    If (*SKIP) is the last backtracking control verb in a template, a match failure beyond (*SKIP) should, in principle, always trigger the (*SKIP) control verb. Indeed, backtracking is needed to test new match attempts at current position in string !

    In practice, internally, among all kinds of clever optimizations, Perl, PCRE and Python's regex package are smart enough to avoid fruitless backtracking. But externally, PCRE and Python behave as though they were backtracking all the way back

    This seems also the behavior of the Boost regex engine. If these optimizations had been taken into account, the regex engine would have behaved, sometimes, as if no (*SKIP) backtracking control verb was part of the overall regex

    Now, regarding practical uses of the (*SKIP) backtracking control verb, we can mention the association of this verb with the verb (*FAIL), which will be studied later


    _______________ (*PRUNE) _______________

    In BOOST documentation, it is said :

    (*PRUNE) Has no effect unless backtracked onto, in which case all the backtracking information, prior to this point, is discarded.

    This means that, if at current position in string, the regex engine can match the part before (*PRUNE) but cannot match the part after (*PRUNE), as any backtracking is canceled to the left of (*PRUNE), the regex engine discards any further search and the current match attempt just fails. Thus, the regex engine then advances to next position, in the string, for a new match attempt

    In other words, if the (*PRUNE) is crossed, during the backtracking process, the regex search process stops, at current position and process begins again, at next starting position, in string !

    Usually, the (*PRUNE**) verb is simply an alternative to an atomic group or to a possessive quantifier. However, there are some uses of the (*PRUNE) verb that cannot be expressed in any other way !


    Consider the regex (?-i)AB+(*PRUNE)CD|\w{3}|AB+(*PRUNE)CH, against the text ---ABBZE---ABBBBCH---ABBBCY---ABBBBCD---

    So, according to the effect of a (*PRUNE) backtracking control verb, let’s follow the regex engine’s work :

    • The first 3 positions of the string ( --- ) are skipped, as no alternative can match a dash character

    • So, the starting position is right before the string ABBZE :

      • The first found pattern AB+(*PRUNE)CD matches ABB but cannot match CD

      • So, it tries to backtrack at left of the (*PRUNE) syntax to test other values of AB+

      • Because of the (*PRUNE) control verb, this behavior is forbidden So, all which is matched, so far ( ABB ), is discarded and the regex engine moves to the next starting position, in string. So, right before the string BBZE

    • The first alternative cannot match in forward direction but the second alternative \w{3} does match the string BBZ

    • Then, the starting position is right before the letter E, which cannot be matched by any part of the overall regex

    • So, after 3 moves, again, the starting position is, now, right before the string ABBBBCH :

      • At this stage it’s worth noting that last alternative AB+(*PRUNE)CH could match the ABBBBCH string ! However it won’t be used, because the first alternative AB+(*PRUNE)CD is tried first

      • So, the first found pattern AB+(*PRUNE)CD matches ABBBB but cannot match CD

      • So, it tries to backtrack at left of the (*PRUNE) syntax to test other values of AB+

      • Because of the (*PRUNE) control verb, this behavior is forbidden So, all which is matched, so far ( ABBBB ), is discarded and the regex engine moves, immediately, to the next starting position, in string. So right before the string BBBBCH

    • Again, the first alternative cannot match in forward direction but the second one, \w{3}, does match the string BBB

    • Then, the starting position is right before the string BCH :

    • The first alternative cannot match, in forward direction, but the second alternative, \w[3}, does match the string BCH

    • After 3 moves, the starting position is, now, right before the string ABBBCY :

      • Process is identical as it was when the starting position was right before the string ABBZE

      • Because of the (*PRUNE) control verb, current match attempt fails and the regex engine moves to the next starting position, right before the string BBBCY

      • The first alternative fails, in forward direction but the second pattern \w{3} does match the string BBB

    • Then, the starting position is right before the string CY, which cannot be matched by any part of the overall regex

    • Finally, after 3 moves, the starting position is, now, right before the string ABBBBCD :

      • This time, the first alternative AB+(*PRUNE)CD just matches, in forward direction, all the remaining letters, the string ABBBBCD

    See, below, the letters matched by each numbered occurrence :

    ---ABBZE---ABBBBCH---ABBBCY---ABBBBCD---
        111     222333    444     5555555
    

    Just compare with the same regex, below, where we omit the (*PRUNE) control verb :

    (?-i)AB+CD|\w{3}|AB+CH

    Note, again, the letters matched by each occurrence, keeping in mind that the regex engine, in that case, always tries the alternatives from left to right, period !

    ---ABBZE---ABBBBCH---ABBBCY--ABBBBCD---
       111     222333    444555  5555555
    

    If the regex part, located at left of a (*PRUNE) backtracking control verb is moderately complex and that a lot of backtracking actions are probable, it may save the regex engine a lot a time because it cancels any backtracking process and skips to next starting position, in string

    However, in this case, in order for (*PRUNE) to become efficient, note that :

    • All backtracking time must be greater that time to compile the pattern with this special verb

    • Our Boost regex engine may have performed some internal optimizations that prevents backtracking, anyway

    • Sometimes, the use of the (SKIP) backtracking control verb is even better !


    Personally, I don’t see any practical use of the (*PRUNE) backtracking control verb, yet :-((

    One more point :

    In PCRE2, it is said :

    In an anchored pattern (*PRUNE) has the same effect as (*COMMIT)

    Seemingly, it’s not the case, with our Boost regex engine, whatever the pattern is anchored or not :

    For instance, against the simple line : BBBBCX---

    • The regex B+(*PRUNE)CD|.+ matches the string CX---, but the regex B+(*COMMIT)CD|.+ just fails

    • The regex ^B+(*PRUNE)CD|.+, with an anchored pattern, matches the string BBBCX---*, but the regex ^B+(*COMMIT)CD|.+ fails, also



  • +            BACKTRACKING CONTROL verbs
    -            THIRD post ( continuation )
    

    _______________ (*THEN) _______________

    Preliminary comment :

    I must confess that, among all the backtracking control verbs, the description and behavior of the (*THEN) verb, with our Boost regex engine, is a bit tricky and I still cannot explain some regex patterns, containing the (*THEN) verb. So, it quite possible that my general understanding of that specific verb is wrong … Just tell me !!

    The (*THEN) verb is generally used in an alternation structure. When the (*THEN) verb occurs outside an alternation, it behaves exactly like the (*PRUNE) verb

    In BOOST documentation, it is said :

    (*THEN) has no effect unless backtracked onto, in which case all subsequent alternatives, in a group of alternations, are discarded

    This means that, if at current position in string, the regex engine can match the part before (*THEN) but cannot match the part after (*THEN), of an alternative, in a group, as any backtracking is canceled to the left of (*THEN), the current match attempt is discarded and all the subsequent alternatives of the current group. So, the regex engine skips to the pattern of the next alternative, located outside the current group

    The (*THEN) verb is a little confusing ! After, numerous tests to determine its true behavior, I came up with these explanations :

    • Only the first alternative, containing (*THEN), as well as all alternatives without the (*THEN) verb, located or not in a group, are tried by the regex engine

    And, in case of a backtracking at left of the (*THEN) verb of this first alternative :

    • If this alternative is located outside a group or is the single alternative of a group, without an alternation symbol | :

      • The current group is not considered as a true group of alternatives

      • Thus, the (*THEN) verb rather acts as a (*PRUNE) verb and the current starting position is increased by 1

    • If this alternative is located in a group, with, at least, two alternatives, with (*THEN), separated by an alternation symbol | :

      • The current group is considered as a true group of alternatives and the semantics of the (*THEN) verb is activated

      • So, the current starting position remains unchanged

    • Then, in both cases, the regex engine skips to the next alternative, outside the current group, whatever this one is included or not in a group


    For instance, let’s consider the regex, below, using the free-spacing mode :

    (?x-i)
    (  AB+(*THEN)CH  |  AB+(*THEN)CD  |  AB+(*THEN)CT  )  |  # Line 1
    (  AB+(*THEN)ZE  |  AB+(*THEN)ZK                   )  |  # Line 2
    (  \w{3}                                           )  |  # Line 3
    (  AB+(*THEN)HA                                    )  |  # Line 4
    (  AB+(*THEN)ZX  | AB+(*THEN)HY                    )     # Line 5
    

    against the one-line text, below, where the added numbers, designate each numbered occurrence :

    ABBCD---ABBBBCH---ABBBBBCT---ABBBZK---ABBBBZE---ABBBBBZX---ABBBBHJ---ABHA---ABBBBBBHY---
    111     2222222   333444     555666   7777777   888999     000111    222    333444555
    

    As you can see, only the first alternative of groups, line 1 and 2, are taken in account, because, on backtracking to the left of (*THEN), the regex engine then skips, immediately, to the next group of alternatives

    However, note that the first alternative of groups, line 4 and 5 are never tried because the \w{3} pattern comes before !

    So, except for the strings ABBBBCH and ABBBBZE, all the other strings are partially matched by the \w{3} pattern, line 3 !


    Now, let’s consider the regex, below, in free-spacing mode :

    (?x-i)
    (  AB+(*THEN)CH  |  AB+(*THEN)CD  |  AB+(*THEN)CT  )  |  # Line 1
    (  AB+(*THEN)ZE  |  AB+(*THEN)ZK                   )  |  # Line 2
    (  AB+(*THEN)HA                                    )  |  # Line 3  Group WITHOUT an |  => (*THEN) acts as (*PRUNE)
    (  \w{3}                                           )  |  # Line 4
    (  AB+(*THEN)ZX  | AB+(*THEN)HY                    )     # Line 5
    

    against the same subject string :

    ABBCD---ABBBBCH---ABBBBBCT---ABBBZK---ABBBBZE---ABBBBBZX---ABBBBHJ---ABHA---ABBBBBBHY---
     111    2222222    333444     555     6666666    777888     999000   1111    222333
    

    This time, only the first alternative of groups, line 1, 2 and 3, are taken in account, because, on backtracking to the left of (*THEN), the regex engine then skips, immediately, to the next group of alternatives

    But, this time, due to the lack of any alternation symbol of the group, line 3, the backtracking control verb (*THEN) acts rather as the (*PRUNE) verb. So, in case of a unsuccessful backtracking operation, the starting position of the regex engine advances to next position !

    Therefore, as the strings ABBCD, ABBBBBCT, ABBBZK, ABBBBBZX, ABBBBHJ and ABBBBBBHY cannot match the AB+(*THEN)HA pattern, in line 3, the regex engine goes to next position, which always begins with a B letter.

    And, as all alternatives, beginning with an A letter, cannot match, the \w{3} pattern is the only one which can match something ( 3 consecutive letters )


    Just compare with the same regex, without any (*THEN) verb, below, again the same string :

    (?x-i)
    (  AB+CH  |  AB+CD  |  AB+CT  )  |  # Line 1
    (  AB+ZE  |  AB+ZK            )  |  # Line 2
    (  AB+HA                      )  |  # Line 3
    (  \w{3}                      )  |  # Line 4
    (  AB+ZX  | AB+HY             )     # Line 5
    

    Note, again, the letters matched by each occurrence, keeping in mind that the regex engine, in that case, always tries the alternatives from left to right, period ! This explains why the strings ABBBBBZX and ABBBBBBHY are never matched by the patterns AB+ZX and AB+HY

    ABBCD---ABBBBCH---ABBBBBCT---ABBBZK---ABBBBZE---ABBBBBZX---ABBBBHJ---ABHA---ABBBBBBHY---
    11111   2222222   33333333   444444   5555555   666777     888999    0000   111222333
    

    As said before, if the (*THEN) verb is not inside an alternation group, it acts like the (*PRUNE) backtracking control verb.

    For instance, against the string ---1123ABBBBCHIKOP--- :

    • With the regex (?-i)\d+(AB+(*THEN)CD)|\w{8} or (?-i)\d+(AB+(*PRUNE)CD)|\w{8}+, containing a group, without any alternation symbol :

      • Because no D letter, right after the C letter, it fails, and, as backtracking is not allowed because of the verb (*THEN) or (*PRUNE) , the regex engine moves to the next starting position, in string, and retries the first alternative

      • After successive moves to next starting position and immediate failure to match the CD string, caret is, now, right before the letter A. This time, the first alternative cannot match. So, the second alternative just matches the next eight characters ABBBBCHI :

    ---1123ABBBBCHIKOP---
           11111111
    
    • Now, with the regex (?x-i)\d+(AB+(*THEN)CD|\w\w)|\w{8}, containing a group with an alternation symbol :

      • After the failure in matching D, right after letter C, and, as backtracking on AB+ is not allowed because of the verb (*THEN), the regex engine skips and tries the next alternative of this inner group, \w\w

      • But, this time, the backtracking process onto \d+ is possible, from the first position in string ( the regex engine must find a range of digits, not followed with the A letter). Thus, it easily finds the 1123A string, where \d+ = 112 and \w\w = 3A :

    ---1123ABBBBCHIKOP---
       1111122222222
    

    When the regex expression, located before a (*THEN) backtracking control verb, is complex and time-consuming, the (*THEN) verb acts in the same way as an atomic group, because it prevents from any backtracking in current used alternative. However the use of the (*THEN) verb seems limited and, personally, I haven’t found out, yet, any practical use of this verb !

    As I said before, my present reasoning about the (*THEN) backtracking control verb may be completely erroneous.Here are some oddities that I still cannot understand ! Can someone can explain me :

    • Why the regex (?-i)A(B+(*THEN)L|FK)|\w\w matches BX, in the string ---ABX--- ?

    • and why the regex (?-i)A(B+(*THEN)L|BK)|\w\w matches AB, in the same string ---ABX--- ?


    IMPORTANT : The 4 full backtracking control verbs, described, above, provide 4 different strengths of control when subsequent matching fails :

    • The (*THEN) is the weakest, carrying on the match at the next alternative, first

    • The (*PRUNE) verb comes next, failing the match at the current starting position, even in case of subsequent alternatives, and moving, in string, to the next position

    • The (*SKIP) verb is similar to (*PRUNE), except that the regex engine moves, in string, to the (*SKIP) location ( so, possibly more than one character ! )

    • Finally, the (*COMMIT) verb is the strongest and always causes the overall match to fail

    In next chapters, we’ll study the two verbs (*ACCEPT) and (*FAIL). In contrast to the previous verbs, the (*ACCEPT) and (*FAIL) verbs act as soon as they are encountered !


    _______________ (*ACCEPT) _______________

    In BOOST documentation, it is said :

    (*ACCEPT) Causes the pattern to be considered matched at the current point. Any half-open sub-expressions are closed at the current point.

    This means that, if at current position in string, the regex engine can match the part right before the (*ACCEPT) verb, this partial match is considered as a successful match attempt and any pattern, after the (*ACCEPT) verb, is simply discarded. Then, as usual, the regex engine advances to the next starting position, in string, for a new match attempt

    If the (*ACCEPT) backtracking control verb is located inside a capturing group, that group is set to whatever characters have been matched, up to that point !

    Let’s consider the regex \d+(*ACCEPT).+|\w+, against the string 12345----abcde----

    • First, the regex matches the digits 12345

    • Without the (*ACCEPT) verb, the ending .+ part of the current alternative would have matched all the remaining chars of line

    • But, as soon as the (*ACCEPT) verb is reached, the current pattern, so far, \d+, is considered as successful and the ending .+ part of the current alternative is discarded by the regex engine

    • Then current starting position is right before the first dash char, which cannot match the regex

    • After moving till right after the last dash character, the second alternative does match the string abcde

    The interest of the (*ACCEPT) verb is generally limited. But, sometimes, it can be very useful when used in a list of alternatives !


    A practical use is described on the [rexegg] (https://www.rexegg.com/backtracking-control-verbs.html#accept) site :

    Let’s suppose you would like to match the 3 upper-case strings BAZ, BO and BIZ. Of course, an obvious solution is (?-i)BAZ|BO|BIZ

    But, if B and Z stand for complicated sub-patterns, you would probably factor these two patterns, which is normally impossible because of the BO syntax.

    With the B(?:A|O(*ACCEPT)|I)Z syntax, containing the (*ACCEPT) verb, the factorization of the B pattern, at the beginning and the Z pattern, at the end becomes possible !

    Indeed, when the regex engine uses the alternative O(*ACCEPT) ( so the overall regex BO(*ACCEPT)Z ), it just matches the BO string as the match of Z is discarded, due to the (ACCEPT) verb

    As usual, a solution without any backtracking control verb is still possible, with conditional patterns. You could use, either, the regexes below, with the conditional structure (?(1)...|...)

    SEARCH B(?:(A|I)|O)(?(1)Z)    or    B(?:A|I|(O))(?(1)|Z)


    The (*ACCEPT) backtracking control verb is the only backtracking verb that is allowed to be quantified. And, when the quantifier is ungreeedy, this allows the (*ACCEPT) verb to acts only when a backtracking phase, from quantifier 0 to quantifier 1 occurs !

    • Consider the regex A(*ACCEPT)??BC, where the (*ACCEPT) verb is located inside an ungreedy quantified group, with a minimum of 0, against the string AAAABCBBBBBAAAABCCCCCC

      • If current starting point is right before an A, followed with BC, as the quantifier takes first the value 0 => the (*ACCEPT) verb is not taken in account. Thus, the regex is identical to ABC and matches the string ABC, in forward direction, without any backtracking process

      • If an A is not followed by BC, at current starting position, a backtrack is needed and the value 1 is tried => Thus, the regex becomes A(*ACCEPT)B which always matches an A letter, in all cases

    • Now, consider the regex A(*ACCEPT)?BC, where the (*ACCEPT) verb is located inside a greedy quantified group, with a maximum of 1, against the same string :

      • This time, whether an A is followed or not with the string BC, at current starting point, as the quantifier takes first the value 1 => The (*ACCEPT) verb is always taken in account. So, the regex is identical to the pattern A(*ACCEPT)BC, which always matches a A letter, in all cases


  • +                  BACKTRACKING CONTROL verbs
    -                 FOURTH post ( continuation )
    

    _______________ (*FAIL) _______________

    In BOOST documentation, it is said :

    (*FAIL) Causes the match to fail unconditionally at this point, can be used to force the engine to backtrack

    This means that, if at current position in string, the active part of the pattern meets a zero-width (*FAIL) verb, then the current match attempt is always considered as unsuccessful. So the regex engine, necessarily, backtracks in order to find out other patterns, in the overall regex, to make a match attempt successful, at current location in string

    Because of the inconditional failure of the current match attempt, due to (*FAIL), the current starting position has not changed. For instance, the regex AABBCC(*FAIL)|.+, against the string AABBCCABDEF---GGAC, will match all string contents, from the start location 1, even if the string AABBCC has been matched before the (*FAIL) verb was encountered !

    Note that the (*FAIL) syntax can be abbreviated as (*F) and may also be replaced by a negative look-ahead of an empty string (?!), which is an always false assertion because any character is, necessarily, followed with an empty string. You could use, also, a regex like (?=A)B which is, obviously, an always false assertion, too !

    One example : the regex A((A|Z)(*FAIL)|B|C)D is just equivalent to the regex A(B|C)D, as the alternative (A|Z)(*FAIL) always fails ! So, this regex only matches the strings ABD and ACD, against the text AZD ABD AAD ACD


    I see 4 practical uses of the (*FAIL) backtracking control verb :

    1. If the overall regex contains some complicated alternatives, you can temporarily omit one of them, for testing. Simply add, after this alternative and before the | symbol, a (*F) verb, which will always fails any match attempt of this specific alternative. For instance, with the free-spacing mode, the regex, below, would only match the strings AAAA, BBBB, DDDD and EEEE
    (?x-i)
    AAAA        |
    BBBB        |
    CCCC  (*F)  |
    DDDD        |
    EEEE
    

    Note that we could have used, also, the # comment char, placing all the pattern CCCC in comments :

    (?x-i)
    AAAA        |
    BBBB        |
    # CCCC      |
    DDDD        |
    EEEE
    

    1. The (*FAIL) or (*F) syntaxes can be used to force the regex engine to backtrack, after defining some groups, not intended to be consumed at current position in string, but later

    For instance, let’s examine the regex (?-i)([A-Z])([a-z])([0-9])(*F)|(?1)(?3)(?2)|((?1)|(?2))(?3){2} against the string A3h AAD x45 Z4a b8Q H09 BAc 000 Y1y

    • The first alternative is the part ([A-Z])([a-z])([0-9])(*F) which defines 3 groups : [A-Z], [a-z] and [0-9] standing ,respectively, for an uppercase letter, a lowercase letter and a digit

    • The second alternative (?1)(?3)(?2) represents subroutine calls to the groups ( An uppercase letter followed with a digit and a lowercase letter )

    • The third alternative ((?1)|(?2))(?3){2} represents other subroutine calls to groups ( A letter followed with two digits )

    Note that it does not matter whether the first alternative is matched or not. Each time the overall regex is processed, the regex engine always tries this alternative first and, in all cases, it attempts to match the three regexes, consecutively, storing them as groups 1 and 2 and 3, for further use, via the subroutine calls !

    Remark : The use of the (*FAIL) verb can also be emulated with the special (DEFINE) variable, in a conditional group, so the (?(DEFINE).....) syntax

    In our particular case, it would lead to the regex (?-i)(?(DEFINE)([A-Z])([a-z])([0-9]))(?1)(?3)(?2)|((?1)|(?2))(?3){2}

    Now, I also found out a very simple way to emulate the (*FAIL) verb or the (?(DEFINE)...) conditional structure :

    • Use the syntax (?-i)([A-Z])([a-z])([0-9])¤|(?1)(?3)(?2)|((?1)|(?2))(?3){2}, which substitutes the (*FAIL) verb with any char or string, not present in your text. I chose the ¤ character.

    Indeed, although the first alternative ([A-Z])([a-z])([0-9])¤ will never match, the three regexes [A-Z], [a-z] and [0-9] are, indeed, stored as groups 1, 2 and 3 for further use, via the subroutine calls (?1) , (?2) and (?3) ;-))


    So, these regexes, with, either, (*F) verb, the (DEFINE) assertion or the non-existent char ¤, matches any 3-CHARS string of non-accentuated characters :

    • Beginning with an upper-case letter, ending with a lower-case letter and with a digit in between

    • Beginning with any letter and followed with two digits

    Az1   A3h  AAD  x45  Z4a   Az1   b8Q   H09   BAc  000  Y1y
          111       222  333               444             555
    

    1. In Perl, the (*FAIL) verb is commonly used to get all the different branches of a match tree

    In the Rexegg site, it is said, for instance :

    • This script line 'abc' =~ /\w+(?{print "$&\n";})(*F)/ prints, successively, the strings abc, ab, a, bc, b and c

    • First, the \w+ matches the whole string abc

    • Then, the code capsule prints the match

    • Now, the (*F) verb forces the regex engine to backtrack

    • So, the regex engine gives up the c

    • And the callback prints ab

    • Again, the (*F) verb forces the regex engine to backtrack

    • And so on !


    1. In next chapter we’ll learn about the (*SKIP)(*FAIL) construct which is a powerful way of excluding some patterns, from the match !

    A last point regarding the (*FAIL) backtracking control verb :

    As mentioned in rexegg, this verb is not the opposite of the (*ACCEPT) verb. If it were the case, then the (*FAIL) would mean, at current point, in string, the match attempt just fails. In fact, the true opposite of the (*ACCEPT) verb is the combination of the two verbs (*PRUNE) and (*FAIL) or, even better, the combination of the two verbs (*SKIP) and (*FAIL) !

    For instance, the regex (?-i)([a-z]+(*ACCEPT)|\w+)[A-Z]+\d+ against the string azrdfgsdFGSDFG12345 is just equivalent to the pattern (?-i)([a-z]+(*ACCEPT)) which matches the beginning azrdfgs of the string

    So, we’re tying to find out a verb structure which means DON’T accept, like the \K syntax, which would not take in account or, in other words, would skip the [a-z]+ pattern, matched so far !

    If we substitute the (*ACCEPT) verb with the (*FAIL) verb, in the same pattern, so the regex (?-i)([a-z]+(*FAIL)|\w+)|[A-Z]+\d+, it matches all the subject string azrdfgsdFGSDFG12345 which isn’t our expected match. Indeed, because of the (*FAIL) verb , it forces the regex engine to backtrack, at current position in string, for an other match attempt… which is successful, as the second alternative \w+ graps all the subject string

    Now if we add the (*PRUNE) verb, before the (*FAIL) verb, we get the new regex (?-i)([a-z]+(*PRUNE)(*FAIL)|\w+)[A-Z]+\d+ which does matches the remaining part of the string after the lower-case letters. Now, after meeting the (PRUNE) verb, in backward direction the match attempt fails but, this time, the regex engine moves to the next starting position before doing a new match attempt

    We would get the same expected result with the regex (?-i)([a-z]+(*SKIP)(*FAIL)|\w+)[A-Z]+\d+, which would, globally, skip the part matched, so far, by the pattern before the (*SKIP) verb ! The association of these two verbs is described right below.


    _______________ (*SKIP)(*FAIL) _______________

    The combination (*SKIP)(*FAIL) or (*SKIP)(*F) means that everything matched on left side of (*SKIP)(*FAIL) is discarded and that the regex engine must move to the location right after this discarded match for a new match attempt !

    A generic regex would be : What_I_don’t want(*SKIP)(*FAIL)|What_I_want    or    What_I_don’t want(*SKIP)(*F)|What_I_want

    For instance, in order to surround any word with two - characters, except whose already surrounded by curly braces, {}, use one of the regex S/R, below : :

    • SEARCH {[^}]*}(*SKIP)(*F)|\b\w+\b    REPLACE -$0-

    • SEARCH {[^}]*}(*SKIP)(?!)|\b\w+\b    REPLACE -$0-

    Test it, against the sentence : This is a test with some {text} to see {if} it works"

    Each time a {xxxx} zone is reached, it’s skipped and the current starting position moves to the location where the (*F) verb or the (?!) syntax occur

    Of course, you may, simply, use the regex S/Rs below, without any control verb, for an exact replacement :

    • SEARCH {[^}]*}|(\b\w+\b)    REPLACE ?1-$0-:$0

    • SEARCH ({[^}]*}.*?\K)?\b\w+\b    REPLACE -$0-

    Note the difference with this search, without the **(*SKIP)**part : :

    SEARCH {[^}]*}(*F)|\b\w+\b

    This time, as the (*SKIP) verb is absent, when the string {text} is reached and evaluated it backtracks because of (*F) and, as the second alternative cannot match, too, the current starting position, in string, goes next and the part \w+ does match the text part

    Second example :

    Let’s consider the table, below, where the third delimiter | has been changed into the ! symbol

    | abc | def ! ghi | jkl | mno |
    | abc | def ! ghi | jkl | mno |
    

    Then the regex S/R :

    SEARCH (?-s).*!(*SKIP)(*F)|\w+

    REPLACE \U$0

    would change this table as :

    | abc | def ! GHI | JKL | MNO |
    | abc | def ! GHI | JKL | MNO |
    

    But, we can use the syntax (?-s)(?!.*!)\w+, without any backtracking control verb, too !


    Before, with the (*COMMIT) verb and a dummy character, we saw a method to do an S/R only in the first part of a file.

    In a similar way, we can run an S/R, only in the last part of a file. Again, let’s choose the # character, inserted around the middle of current file. Then the following S/R would change all words uppercase, after the # symbol, till the very end of file

    SEARCH (?s)\A^.*#(*SKIP)(*F)|\w+

    REPLACE \U$0

    However, note that this shorter search syntax, (?s)\A^.*#\K|\w+, without any backtracking control verb, would produce the same result !

    So, in short, the (*SKIP)(*FAIL) construct means :

    Discard anything that has been matched, so far, at left of the two verbs (*SKIP)(*FAIL), in the current regex alternative !



  • +                  BACKTRACKING CONTROL verbs
    -                     FIFTH and LAST post
    

    A last summary :

    • The zero-width (*THEN), (*PRUNE), (*SKIP) and (*COMMIT) backtracking control verbs always match in forward direction but and always fail, on backtracking, when the (*......) verb is crossed in backward direction

    • The zero-width (*ACCEPT) backtracking control verb always matches the current match attempt, till the (*ACCEPT) verb, in forward direction

    –The zero-width (*FAIL) / (*F) backtracking control verb always fails the current match attempt, in forward direction. So, it always backtracks in order to test other parts of the overall regex, if any

    And, if we consider the general syntax Regex_A(*......)RegexB, the next starting position, of the regex engine, in string, is :

    • When the verb (*......) is found in forward direction :

      • For (*THEN) : Right after the characters matched by the overall pattern Regex_A(*THEN)Regex_B

      • For (*PRUNE) : Right after the characters matched by the overall pattern Regex_A(*PRUNE)Regex_B

      • For (*SKIP) : Right after the characters matched by the overall pattern Regex_A(*SKIP)Regex_B

      • For (*COMMIT) : Right after the characters matched by the overall pattern Regex_A(*COMMIT)Regex_B

      • For (*ACCEPT) : Right after the characters matched by the Regex_A pattern part

      • For (*FAIL) : Unchanged, so the location before running the overall pattern Regex_A(*FAIL)Regex_B

    • When the verb (*......) is found in backward direction, because of backtracking :

      • For (*THEN) : The current location when running the overall pattern Regex_A(*THEN)Regex_B

      • For (*PRUNE) : The location next to the current location when running the overall pattern Regex_A(*PRUNE)Regex_B

      • For (*SKIP) : Right after the characters matched by the Regex_A pattern part

      • For (*COMMIT) : Unchanged, so the location before running the overall pattern Regex_A(*COMMIT)Regex_B


    To conclude this study of the backtracking control verbs, here is a last example which recapitulates the different backtracking control verbs and some of their combinations. All the regexes, in the table, below, are tested against the same subject string :

    aaaa1bbbcccccc#

    •-----------------------------•----------------------------------------•-------------------------------------------------------•---------------------------------•
    |     REGULAR Expression      |            Different MATCHES           |   SECOND regex alternative STATUS, after FIRST try    | Caret AFTER 1st try of 2ND Alt. |
    •-----------------------------•----------------------------------------•-------------------------------------------------------•---------------------------------•
    |  c+|a+\d|b+|#               |  aaaa1   bbb   cccccc   #              |  MATCH, in FORWARD direction                          |  Right AFTER  the 1 DIGIT       |
    |  c+|\w+\d|b+|#              |  aaaa1   bbb   cccccc   #              |  MATCH, AFTER backtracking from LAST c to 1           |  Right AFTER  the 1 DIGIT       |
    |                             |                                        |                                                       |                                 |
    |  c+|a++\d|b+|#              |  aaaa1   bbb   cccccc   #              |  MATCH, in FORWARD direction                          |  Right AFTER  the 1 DIGIT       |
    |  c+|\w++\d|b+|#             |          bbb   cccccc   #              |  NO match, as backtracking  is NOT allowed            |  Right BEFORE the FIRST a       |
    |                             |                                        |                                                       |                                 |
    |                             |                                        |                                                       |                                 |
    |  c+|a+(*FAIL)\d|b+|#        |          bbb   cccccc   #              |  Match FAILURE, though MATCH in FORWARD direction     |  Right BEFORE the FIRST a       |
    |  c+|\w+(*FAIL)\d|b+|#       |          bbb   cccccc   #              |  Match FAILURE, though MATCH after BACKTRACKING       |  Right BEFORE the FIRST a       |
    |                             |                                        |                                                       |                                 |
    |                             |                                        |                                                       |                                 |
    |  c+|a+(*SKIP)\d|b+|#        |  aaaa1   bbb   cccccc   #              |  MATCH, in FORWARD direction                          |  Right AFTER  the 1 DIGIT       |
    |  c+|\w+(*SKIP)\d|b+|#       |  #  ( + cccccc IF caret AFTER LAST b ) |  NO match, as BACTRACKING is NOT allowed              |  Right BEFORE the # symbol      |
    |                             |                                        |                                                       |                                 |
    |  c+|a+(*SKIP)(*F)\d|b+|#    |          bbb   cccccc   #              |  Match FAILURE, though MATCH in FORWARD direction     |  Right BEFORE the 1 DIGIT       |
    |  c+|\w+(*SKIP)(*F)\d|b+|#   |  #  ( + cccccc IF caret AFTER LAST b ) |  Match FAILURE and NO match as NO backtracking        |  Right BEFORE the # symbol      |
    |                             |                                        |                                                       |                                 |
    |                             |                                        |                                                       |                                 |
    |  c+|a+(*PRUNE)\d|b+|#       |  aaaa1   bbb   cccccc   #              |  MATCH, in FORWARD direction                          |  Right AFTER  the 1 DIGIT       |
    |  c+|\w+(*PRUNE)\d|b+|#      |                cccccc   #              |  Alternative DISCARDED + NO match as NO backtracking  |  Right AFTER  the FIRST a       |
    |                             |                                        |                                                       |                                 |
    |  c+|a+(*PRUNE)(*F)\d|b+|#   |          bbb   cccccc   #              |  Match FAILURE, though MATCH in FORWARD direction     |  Right AFTER  the FIRST a       |
    |  c+|\w+(*PRUNE)(*F)\d|b+|#  |                cccccc   #              |  Match FAILURE and NO match as NO backtracking        |  Right AFTER  the FIRST a       |
    |                             |                                        |                                                       |                                 |
    |                             |                                        |                                                       |                                 |
    |  c+|a+(*COMMIT)\d|b+|#      |  aaaa    bbb   cccccc   #              |  MATCH, in FORWARD direction                          |  Right AFTER  the 1 DIGIT       |
    |  c+|\w+(*COMMIT)\d|b+|#     |  cccccc   #  ( IF caret AFTER LAST b ) |  CANCELS any SUBSEQUENT attempt as NO backtracking    |  Right BEFORE the FIRST a       |
    |                             |                                        |                                                       |                                 |
    |  c+|a+(*ACCEPT)\d|b+|#      |  aaaa    bbb   cccccc   #              |  MATCH of part BEFORE (*ACCEPT), in FORWARD direction |  Right BEFORE the 1 DIGIT       |
    |  c+|\w+(*ACCEPT)\d|b+|#     |  aaaa1bbbcccccc         #              |  MATCH of part BEFORE (*ACCEPT), in FORWARD direction |  Right BEFORE the # symbol      |
    |                             |                                        |                                                       |                                 |
    |  c+|a+(*THEN)\d|b+|#        |  aaaa1   bbb   cccccc   #              |  MATCH, in FORWARD direction                          |  Right AFTER  the 1 DIGIT       |
    |  c+|\w+(*THEN)\d|b+|#       |                cccccc   #              |  NO match, as BACTRACKING is NOT allowed              |  Right BEFORE the FIRST a       |
    •-----------------------------•----------------------------------------•-------------------------------------------------------•---------------------------------•
    

    For additional information, refer to :

    https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.backtracking_control_verbs

    https://www.rexegg.com/backtracking-control-verbs.html

    https://www.pcre.org/current/doc/html/pcre2pattern.html#SEC29

    https://perl.developpez.com/documentations/en/5.20.1/perlre.html#Special-Backtracking-Control-Verbs

    and this article on stackoverflow :

    https://stackoverflow.com/questions/19992984/verbs-that-act-after-backtracking-and-failure


    Now, it’s up to you ! Just experiment all these backtracking control verbs and try to find out some practical uses of them ;-))

    Best regards,

    guy038



  • https://community.notepad-plus-plus.org/post/55641

    Hello, All,

    During my study of the backtracking control verbs ( see posts above ), I noticed this article which described ways to pre-define some subroutines, which can be used to build numerous modular patterns, at the same time :

    https://www.rexegg.com/regex-tricks.html#pseudo-define

    Note that the special conditional (?(DEFINE).....) structure was created, originally, to this purpose, by modern regex engines, whose our Boost regex engine ! I already discussed of this syntax in this post below :

    https://community.notepad-plus-plus.org/post/52608

    So, for instance, with the free-spacing mode, the regex below :

    (?x-i)
    (?(DEFINE)      #  START of the conditional DEFINE structure
    ([A-Z]{2}\d)    #    TWO CAPITAL letters, followed with a DIGIT in Group 1
    )               #  END of the conditional DEFINE structure
    (?1)-(?1)-(?1)  #  THREE "Group 1" triplets, separated with DASHES
    

    Against the string YH5-RC6-UY0-BD5-AZ3-KL9, would match the two substrings YH5-RC6-UY0 and BD5-AZ3-KL9


    Now, in Rexegg site, we are told that, instead of the (?(DEFINE).....) conditional structure, we can use the (*FAIL) backtracking control verb to get a similar behavior, according to the syntax below, using an optional non-capturing group :

    (?x-i)
    (?:             # START of an OPTIONAL NON-CAPTURING group
    ([A-Z]{2}\d)    #   TWO CAPITAL letters, followed with a DIGIT in Group 1
    (*F)            #   Backtracking Control verb (*FAIL)
    )?              # END of the OPTIONAL NON-CAPTURING group
    (?1)-(?1)-(?1)  # THREE "Group 1" triplets, separated with DASHES
    

    Remark that, instead of the (*F) verb, we may also use the negative look-ahead (?!) syntax for identical results


    Now, I’ve found out a more simple syntax :

    (?x-i)
    ([A-Z]{2}\d)    #  TWO CAPITAL letters, followed with a DIGIT in Group 1
    (*F)            #  Backtracking Control verb (*FAIL)
    |               #    OR
    (?1)-(?1)-(?1)  #  THREE "Group 1" triplets, separated with DASHES
    

    How this regex syntax works, against the string YH5-RC6-UY0-BD5-AZ3-KL9 ?

    • First, the regex engine tries to match the part ([A-Z]{2}\d) and, indeed, matches the substring HY5.

    • At the same time, it stores the pattern [A-Z]{2}\d in group 1

    • Then the regex engine meets the backtracking control verb (*F) which forces it to backtrack

    • So the current match HY5 is discarded and the starting position remains right before the first letter Y

    • Then the regex engine tries the other alternative ?1)-(?1)-(?1) and matches, successively, the two substrings YH5-RC6-UY0 and BD5-AZ3-KL9, as we expect to !


    Note the effect of the (*FAIL) verb is exactly the same as if the regex engine would try to match a specific pattern that does not exist in subject string ! So, let simply substitute this verb with, for instance, the CURRENY sign ¤, of Unicode code point ``\x{00a4}`. which is generally not used in files !

    You can use the Windows input method, hitting the Alt key and typing number 0164 on the numeric keypad

    So we get the regex :

    (?x-i)
    ([A-Z]{2}\d)    #  TWO CAPITAL letters, followed with a DIGIT in Group 1
    ¤               #  The CURRENCY sign character ( In fact, any NON-EXISTENT character, in CURRENT file ! )
    |               #    OR
    (?1)-(?1)-(?1)  #  THREE "Group 1" triplets, separated with DASHES
    

    Why my syntax also works :

    • First, the regex engine tries to match the part ([A-Z]{2}\d) and, indeed, matches the substring HY5.

    • At the same time, it stores the pattern [A-Z]{2}\d in group 1

    • But it cannot match the following currency sign ¤ and the regex engine naturally backtracks

    • The starting position remains right before the first letter Y

    • And the regex engine tries the other alternative ?1)-(?1)-(?1), which matches, successively, the two substrings YH5-RC6-UY0 and BD5-AZ3-KL9

    Note that the regex engine does remember of the pattern contained in group 1 because, despite of the backtracking phase, the search process still goes on !

    Of course, instead of the single character ¤, you can, either, choose a simple string, which does not exist in current file. For instance, /// or @@ or _X_

    It’s also important to point out that you can perfectly use named group(s), for all these syntaxes. For instance, my last regex syntax can be rewritten :

    (?x-i)
    (?<Seq>[A-Z]{2}\d)       #  TWO CAPITAL letters, followed with a DIGIT in NAMED group 'Seq'
    ¤                        #  The CURRENCY sign character ( In fact, any NON-EXISTENT character, in CURRENT file ! )
    |                        #    OR
    (?&Seq)-(?&Seq)-(?&Seq)  #  THREE 'Seq' triplets, separated with DASHES
    

    Of course, all the syntaxes, above, which help us to create a library of pre-defined groups are mostly valuable, when you try to search between numerous patterns, built up from these elementary patterns !

    To that purpose, let’s expose a more interesting and practical example, regarding the search of specific sequences within the genetic code !

    If necessary, refer for a very basic background, about genetic code, at the end of this post !

    From now on, the genetic notions invoked are only the fruit of my imagination and are, in no way, based on scientific data. !! It’s just used to demonstrate the interest of the above regex syntaxes

    Let’s suppose that the 3 genetic sequences CGUUUA, GCCACUAAACAG and AAUCGACAU, named Seq_1, Seq_2 and Seq_3, are of main importance in order to build up greater genetic chains from a combination of these components

    Now, let’s assume that we want, at the same time, search for the seven combinations :

    • Seq_2 + Seq_3
    • Seq_1 + Seq_2
    • Seq_1 + Seq_3
    • Seq_2 + AAA codon + Seq_3
    • Seq_2 + CCC codon + Seq_1
    • 4 consecutive Seq_1 + Seq_2
    • 4 consecutive Seq_1 + Seq_3

    Then, the regex to search any of these sequences, delimited with a start and a stop codon, could be, in free-spacing mode :

    (?x-i)                     #  FREE-SPACING mode and NON-INSENSITIVE search
    
    (?<Seq_1>CGUUUA)           #  Seq_1 definition  (  6 bases )
    (?<Seq_2>GCCACUAAACAG)     #  Seq_2 definition  ( 12 bases )
    (?<Seq_3>AAUCGACAU)        #  Seq_3 definition  (  9 bases )
    X |                        #  An INEXISTANT character in RNA sequence      OR
    
    (AUG|GUG|UUG)              #  Possible START codons
    (?<Codon>[ACGU]{3})*?      #  Any number of CODONS, even ZERO, in the NAMED group Codon
    \K                         #  ANYTHING matched, so far, is DISCARDED
    
    (?:                        #  Start of a NON-CAPTURING group
    
    (?&Seq_2)(?&Seq_3)      |  #    Chain 1 :  Seq_2 + Seq_3                   OR
    (?&Seq_1)(?&Seq_2)      |  #    Chain 2 :  Seq_1 + Seq_2                   OR
    (?&Seq_1)(?&Seq_3)      |  #    Chain 3 :  Seq_1 + Seq_3                   OR
    (?&Seq_2)AAA(?&Seq_3)   |  #    Chain 4 :  Seq_2 + AAA + Seq_3             OR
    (?&Seq_3)CCC(?&Seq_1)   |  #    Chain 5 :  Seq_3 + CCC + Seq_1             OR
    (?&Seq_1){4}(?&Seq_2)   |  #    Chain 6 :  FOUR consecutive Seq_1 + Seq_2  OR
    (?&Seq_1){4}(?&Seq_3)      #    Chain 7 :  FOUR consecutive Seq_1 + Seq_3
    
    )                          #  END of the NON-CAPTURING group
    
    (?=                        #  START of a LOOK-AHEAD
    (?&Codon)*                 #    Any number of CODONS, even ZERO
    (UAA|UGA|UAG)              #    Possible STOP codons
    )                          #  END of the LOOK-AHEAD
    
    You can test this regex against these lines, below, which **all** begin with a **start** codon and end with a **stop** codon :
    
    ~~~diff
    AUGUGCAACGAUCGUUUAAAUCGACAUGCCACUAAACAGUUACAUCAUACUGCCAACCAGGGCCAUGUUUAA  # Chain  3
    
    GUGAACCAGGGCCAUGUUCGUUUAAAUCGACAUAAUCGACAUAACCCCUUACUUGCUAAUUUCUGA        # Chain  3
    
    UUGCGUUUAAACCUAAAAGAUGGGGUCGCCACUAAACAGAAUCGACAUAAUCGACAUGGACCGUAG        # Chain  1
    
    UUGACUUUACAUCAUACUGCCGCCACUAAACAGAAAAAUCGACAUCCCCGUUUAAAACCUUGGAGCCCGUAG  # Chains 4 then 5
    
    AUGCGUUUACGUUUACGUUUACGUUUAGCCACUAAACAGUCAACUGGAGGAUCCCGGCAUUUUUAA        # Chains 6 then 2
    
    GUGUCGAACUGGGGAGCGCCACCUCAUAGUUGUCGUUUACGUUUACGUUUACGUUUAAAUCGACAUUGA     # Chains 7 then 3
    

    However, if we change the greedy quantifier, of the Codon group, with a lazy quantifier, we get other matches. This is expected because some areas overlap with other areas or include some others !

    So, in order to correctly detect all these chains of nucleotides, we could look, only for the zero-length location, of the start of each sequence, with the appropriate regex, below :

    (?x-i)                     #  FREE-SPACING mode and NON-INSENSITIVE search
    (?<Seq_1>CGUUUA)           #  Seq_1 definition (  6 bases )
    (?<Seq_2>GCCACUAAACAG)     #  Seq_2 definition ( 12 bases )
    (?<Seq_3>AAUCGACAU)        #  Seq_3 definition (  9 bases )
    X|                         #  An INEXISTANT character in RNA sequence     OR
    (?=                        #  START of a LOOK-AHEAD
    (?&Seq_2)(?&Seq_3)      |  #    Case 1 :  Seq_2 + Seq_3                   OR
    (?&Seq_1)(?&Seq_2)      |  #    Case 2 :  Seq_1 + Seq_2                   OR
    (?&Seq_1)(?&Seq_3)      |  #    Case 3 :  Seq_1 + Seq_3                   OR
    (?&Seq_2)AAA(?&Seq_3)   |  #    Case 4 :  Seq_2 + AAA + Seq_3             OR
    (?&Seq_3)CCC(?&Seq_1)   |  #    Case 5 :  Seq_3 + CCC + Seq_1             OR
    (?&Seq_1){4}(?&Seq_2)   |  #    Case 6 :  FOUR consecutive Seq_1 + Seq_2  OR
    (?&Seq_1){4}(?&Seq_3)      #    Case 7 :  FOUR consecutive Seq_1 + Seq_3
    )                          #  END of the LOOK-AHEAD
    

    See below, with the indication of the start of each sequence :

                v
    AUGUGCAACGAUCGUUUAAAUCGACAUGCCACUAAACAGUUACAUCAUACUGCCAACCAGGGCCAUGUUUAA  # Chain  3
                      v
    GUGAACCAGGGCCAUGUUCGUUUAAAUCGACAUAAUCGACAUAACCCCUUACUUGCUAAUUUCUGA        # Chain  3
                               v
    UUGCGUUUAAACCUAAAAGAUGGGGUCGCCACUAAACAGAAUCGACAUAAUCGACAUGGACCGUAG        # Chain  1
                         v              v
    UUGACUUUACAUCAUACUGCCGCCACUAAACAGAAAAAUCGACAUCCCCGUUUAAAACCUUGGAGCCCGUAG  # Chains 4 then 5
       v                 v
    AUGCGUUUACGUUUACGUUUACGUUUAGCCACUAAACAGUCAACUGGAGGAUCCCGGCAUUUUUAA        # Chains 6 then 2
                                     v                 v
    GUGUCGAACUGGGGAGCGCCACCUCAUAGUUGUCGUUUACGUUUACGUUUACGUUUAAAUCGACAUUGA     # Chains 7 then 3
    

    Now, if we want to completely match each composite sequence, we’ll use this last regex :

    (?x-i)                     #  FREE-SPACING mode and NON-INSENSITIVE search
    (?<Seq_1>CGUUUA)           #  Seq_1 definition (  6 bases )
    (?<Seq_2>GCCACUAAACAG)     #  Seq_2 definition ( 12 bases )
    (?<Seq_3>AAUCGACAU)        #  Seq_3 definition (  9 bases )
    X|                         #  An INEXISTANT character in RNA sequence     OR
    (?:                        #  START of a NON-CAPTURING group
    (?&Seq_2)(?&Seq_3)      |  #    Case 1 :  Seq_2 + Seq_3                   OR
    (?&Seq_1)(?&Seq_2)      |  #    Case 2 :  Seq_1 + Seq_2                   OR
    (?&Seq_1)(?&Seq_3)      |  #    Case 3 :  Seq_1 + Seq_3                   OR
    (?&Seq_2)AAA(?&Seq_3)   |  #    Case 4 :  Seq_2 + AAA + Seq_3             OR
    (?&Seq_3)CCC(?&Seq_1)   |  #    Case 5 :  Seq_3 + CCC + Seq_1             OR
    (?&Seq_1){4}(?&Seq_2)   |  #    Case 6 :  FOUR consecutive Seq_1 + Seq_2  OR
    (?&Seq_1){4}(?&Seq_3)      #    Case 7 :  FOUR consecutive Seq_1 + Seq_3
    )                          #  END of a NON-CAPTURING group
    

    See, below, the indication of each sequence with the v letter or the ^ symbol :

                vvvvvvvvvvvvvvv
    AUGUGCAACGAUCGUUUAAAUCGACAUGCCACUAAACAGUUACAUCAUACUGCCAACCAGGGCCAUGUUUAA   # Chain 3
    
                      vvvvvvvvvvvvvvv
    GUGAACCAGGGCCAUGUUCGUUUAAAUCGACAUAAUCGACAUAACCCCUUACUUGCUAAUUUCUGA         # Chain 3
    
                               vvvvvvvvvvvvvvvvvvvvv
    UUGCGUUUAAACCUAAAAGAUGGGGUCGCCACUAAACAGAAUCGACAUAAUCGACAUGGACCGUAG         # Chain 1
    
    
                         vvvvvvvvvvvvvvvvvvvvvvvv                              # Chain 4
    UUGACUUUACAUCAUACUGCCGCCACUAAACAGAAAAAUCGACAUCCCCGUUUAAAACCUUGGAGCCCGUAG
                                        ^^^^^^^^^^^^^^^^^^                     # Chain 5
    
    
       vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv                                    # Chain 6
    AUGCGUUUACGUUUACGUUUACGUUUAGCCACUAAACAGUCAACUGGAGGAUCCCGGCAUUUUUAA
                         ^^^^^^^^^^^^^^^^^^                                    # Chain 2
    
    
                                     vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv         # Chain 7
    GUGUCGAACUGGGGAGCGCCACCUCAUAGUUGUCGUUUACGUUUACGUUUACGUUUAAAUCGACAUUGA
                                                       ^^^^^^^^^^^^^^^         # Chain 3
    

    Note that, in order to get, successively, all the occurrences, even in case of overlapping, hit the F3 shortcut to get a match. Then, hit the Left arrow, followed by the Right arrow, to advance of one position in text !


    At last, I would say that it’s obvious that I’m not competing, with my simple regexes, against all the powerful ORF finding tools used by biologists ;-)) Refer to :

    https://en.wikipedia.org/wiki/Open_reading_frame#ORF_finding_tools

    Best Regards,

    guy038

    Here is a very basic summary, about genetic code, for the sole purpose of that present discussion :

    DNA is a double ordered helix chain, made of four nucleotides, associated by pairs, adenine ( A ) with thymine ( T ) and guanine ( G ) with cytosine ( C ), which carries all the genetic instructions for development, functioning, growth and reproduction of all known living organisms

    A gene is a sequence of nucleotides that encodes, either, the synthesis of the RNA molecule from the DNA molecule or the synthesis of a protein from a RNA molecule and may contain more than 1,000 pairs of nucleotides. Humans have nearly 20,000 genes, whose about 2,000 are thought essential to our survival !

    Each triplet of nucleotides, named codon, part of a double-stranded DNA or part of a single-stranded RNA molecule, corresponds to an amino-acid, during the transcription process to RNA or during the translation process to a protein

    A single translated region of the genetic code, in the DNA molecule, is called an ORF ( Open reading Frame ), containing, generally, a minimum of 100 to 150 codons

    After the transcription phase, from the DNA molecule into Pre-mRNA and the suppression of introns, we get the mature RNA molecule, made of four nucleotides, associated by pairs, too : adenine ( A ) with uracil ( U ) and guanine ( G ) with cytosine ( C )

    In RNA, the ORF zone has become the Protein Coding Region, composed of :

    • A start codon [ AUG , GUG , UUG ]
    • A continuous stretch of codons
    • A stop codon [ UAA , UGA , UAG ]

    If we consider, for instance, the genome of the Escherichia Coli bacteria K-12, the proportions of each start and stop codons are :

    START STOP
    Codon Codon
    AUG    83% UAA    63%
    GUG    14% UGA    29%
    UUG     3% UAG     8%

    Refer also to :

    https://en.wikipedia.org/wiki/Introduction_to_genetics

    https://en.wikipedia.org/wiki/DNA

    https://en.wikipedia.org/wiki/RNA

    https://en.wikipedia.org/wiki/Gene

    https://en.wikipedia.org/wiki/Open_reading_frame

    https://en.wikipedia.org/wiki/Start_codon

    https://en.wikipedia.org/wiki/Stop_codon


Log in to reply