generic-regex-replacing-in-a-specific-zone-of-text

Paul Wormer

I’m on a track to learning advanced regexes, and this regex is one of my challenges. I understand it halfway and have two remaining questions. For reference, I repeat the regex here:

(?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?-si:FR)

First question: why is there the alternation (?-si:BSR|(?!\A)\G)? Naively, I would simply start with (?-si:BSR) and place the caret somewhere before the string matched by BSR and then hit Find Next or Replace.

My second question is more related to Npp than to regexes: why doesn’t Replace work with this regex (even if one wants to replace one string only) and is Replace All necessary?

guy038

Hello, @paul-wormer and All,

I do not have enough spare time to fully answer your first question right now ! Just a matter of a few hours !

However, regarding your second question, it’s quite easy ! It happens that any time you insert a \K syntax, somewhere in a regex, the step by step replacement, with the Replace button is not allowed by the regex engine and the only possibility is to use the Replace All button !

Best regards,

guy038

guy038

Hello, @paul-wormer and All,

Let’s test it with the real text, below, that you’ll copy in a new N++ tab :



<try>01-23
456
7---89
</pos>

<val>37--001</val>
<text>This-is
-a</text>
<pos>4-1234</pos>

<val>37--002</val>
<text>-small---example</text>
<pos>9-0012</pos>


<val>37--003</val>
<text>-of-text-
which-</text>
<pos>1-9999</pos>


<val>37--004</val>

<text>need
-to-be-
modi
fied</text>

<pos>0-0000</pos>

Note the 2 empty lines at the beginning of the file !

Now, let’s suppose that we want to replace any range of dashes with a single space char, but ONLY on lines embedded in a multi-line <text>.....</text> section

If we use your formulation of the generic regex (?-si:BSR)(?s-i:(?!ESR).)*?\K(?-si:FR), we end up with the functional search regex (?-si:<text>)(?s-i:(?!</text>).)*?\K(?-si:-+) which can be simplified as :
SEARCH (?-i:<text>)(?s-i:(?!</text>).)*?\K-+
REPLACE \x20
Move the cursor at the very beginning of the new tab
Seemingly, the first match is correct : a dash, right after the part <text>This, but the subsequent matches are wrong : the regex matches ONLY the first range of dashes in each section surrounded with the multi-line tags <text> and </text>

Because of the lack of the \G syntax, the other dashes, present in each <text>.....</text> section, are not matched. Thus, we cannot use this regex form !
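
Python's built-in `re` module supports neither `\K` nor `\G`, so the following is only a rough sketch of the same behaviour, not the Notepad++ search itself. A capture group stands in for `\K`, and the sample string is made up for illustration. It suggests why, without `\G`, only the first range of dashes after each `<text>` gets replaced:

```python
import re

# Hypothetical sample; a capture group plays the role of \K here.
sample = "<text>a-b-c</text> d-e <text>f--g-h</text>"

# <text>, then the shortest run of chars that does not cross </text>, then dashes.
no_continuation = re.compile(r"(<text>(?s:(?!</text>).)*?)-+")

# Keep group 1 (what \K would discard) and replace the dashes with one space.
result = no_continuation.sub(lambda m: m.group(1) + " ", sample)
print(result)  # <text>a b-c</text> d-e <text>f g-h</text>
```

Each match has to begin with a fresh `<text>`, so the second dash of the first section and the last dash of the second section are left alone.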

You could say : but what about the generic regex (?-si:BSR|\G)(?s-i:(?!ESR).)*?\K(?-si:FR) which can be simplified as :

SEARCH (?-i:<text>|\G)(?s-i:(?!</text>).)*?\K-+
REPLACE \x20
Again, move the cursor at the very beginning of the new tab
First, it matches all occurrences of a dash in any line not surrounded with the multi-line tags <text> and </text> ( NOT wanted )
Then, as soon as it matches a BSR region ( <text> ) and, up to the very end of the file, it correctly matches any range of dashes in each line surrounded with the multi-line tags <text> and </text> ONLY

Because the \G syntax is matched at the very beginning of the file, some first wrong matches occur. Thus, this formulation is not correct either !

Finally, let’s use the complete generic regex (?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?-si:FR) which gives the functional one :

SEARCH (?-i:<text>|(?!\A)\G)(?s-i:(?!</text>).)*?\K-+
REPLACE \x20
Again, move the cursor at the very beginning of the new tab
As you can see, since the \G syntax is not allowed at the very beginning of current file, it correctly matches all the occurrences of any range of dashes, ONLY in the lines surrounded with the multi-line tags <text> and </text>.

This is the expected and desired behaviour !

However, note that, if we decide, on purpose, to start with the cursor on a line after the very beginning ( line 2, 3 or later ), our last regex will also find some initial wrong matches !

Best Regards,

guy038

Paul Wormer

@guy038 Thank you very much for your very elaborate answer (and also for your time). I will study carefully your text, at first glance it seems definitely worth my while. You really are the grandmaster of regular expressions! Thank you again.

guy038

Hi, @paul-wormer and All,

I realize that I forgot to mention the fundamental role of the \G syntax, in this kind of regex !

In the Boost reference manual, here, it is said :

Continuation Escape

The sequence \G matches only at the end of the last match found, or at the start of the text being matched if no previous match was found.

This escape is useful if you’re iterating over the matches contained within a text, and you want each subsequent match to start where the last one ended.

What does this mean ?

Well, when the caret is at the beginning of the file, the \G alternative cannot match because of the negative look-ahead (?!\A). So the only possibility is to match the BSR string, i.e. the string <text> in this case, followed by the smallest range of characters, even across several lines, that does not run through </text> … till a range of dashes

Then the \G feature takes over and selects from the next char to the nearest range of dashes, again. But, as it cannot go through the </text> string, this means that, necessarily, the next match will not be adjacent to the previous one.

Thus, the \G alternative cannot match anymore and the only possibility is to match a <text> string again ( the other alternative )

 bla bla <text> this is a-small test to see-if it</text> is OK.<text>We're looking again for a third dash-and a fourth-one and so-on</text>
         ----------------•-----------------•                   ------------------------------------------•------------•----------•
             1st match        2nd match                                    3rd match                       4th match    5th match
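
The interplay between the BSR alternative and `\G` can be imitated in Python, whose `re` module has no `\G`. The sketch below is an approximation under stated assumptions: `replace_in_zones` is a hypothetical helper, `Pattern.match(text, pos)` (which only matches at `pos`) plays the role of `\G`, and a capture group plays the role of `\K`. The BSR, ESR and FR arguments are inserted as raw regex fragments.

```python
import re

def replace_in_zones(text, bsr, esr, fr, repl):
    """Replace every FR match located between BSR and ESR (hypothetical helper)."""
    begin = re.compile(bsr)
    # Group 1 is the text that \K would discard; group 2 is the FR match.
    step = re.compile(r"((?s:(?!%s).)*?)(%s)" % (esr, fr))
    out, pos = [], 0
    while True:
        start = begin.search(text, pos)      # the BSR alternative
        if start is None:
            break
        out.append(text[pos:start.end()])
        pos = start.end()
        while True:
            hit = step.match(text, pos)      # anchored at pos, like \G
            if hit is None or hit.end() == pos:
                break                        # cannot continue: look for BSR again
            out.append(hit.group(1) + repl)
            pos = hit.end()
    out.append(text[pos:])
    return "".join(out)

line = ("bla bla <text> this is a-small test to see-if it</text> is OK."
        "<text>We're looking again for a third dash-and a fourth-one and so-on</text>")
print(replace_in_zones(line, "<text>", "</text>", "-+", " "))
```

The inner loop stops as soon as the anchored step would have to run through `</text>`, and the outer loop then looks for the next `<text>`, which mirrors the five matches drawn above.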

This behavior also explains why the SIMPLIFIED generic regex, shown below, where the BSR region and the different matches are restricted to one line only, does not need any ESR string ( </text> )

(?-s)(?-i:BSR|(?!\A)\G).*?\K(?-i:FR)

which gives the functional regex S/R :

SEARCH (?-s)(?-i:<text>|(?!\A)\G).*?\K-+

REPLACE \x20

If we imagine this text :


Line 1  <text>this is a-small text to see-where are-
-       ---------------•-----------------•---------•
Line 3     1st match       2nd match      3rd match
Line 4
Line 5  <text>all the-different matches of-a dash</text>
Line 6  -------------•--------------------•
Line 7     4th match        5th match

After the third match, the final dash at the end of line 1, the next match should be at the beginning of line 2. But it is not allowed because the regex engine would have to go from line 1 to line 2 and skip the two chars CR + LF

In that case, the \G feature is not respected anymore and, necessarily, the search goes on starting with the <text> string, in line 5, … till a dash

Note that the </text> string, at the end of each line, is not mandatory because the implicit gap CRLF, between two consecutive lines, resets, each time, the regex engine to the search of a <text> string first !!

Thus, to search any FR region in each line, containing a BSR region, simply use the simplified generic regex (?-s)(?-i:BSR|(?!\A)\G).*?\K(?-i:FR)

Best Regards,

guy038

Paul Wormer

I worked through Guy’s examples and understand them. I cannot say that I could already compose a regular expression of similar complexity, but at least I can follow Guy’s reasoning.

There is one open end though:

@guy038 said in generic-regex-replacing-in-a-specific-zone-of-text:

This behavior also explains why the SIMPLIFIED generic regex, shown below, where the BSR region and the different matches are restricted to one line only, does not need any ESR string ( </text> )
[…]
SEARCH (?-s)(?-i:<text>|(?!\A)\G).*?\K-+

I tried to match the string:

a-a <text>b-c</text> d-e

and placed the caret at the beginning of the file. Using the simplified regex above:

(?-s)(?-i:<text>|(?!\A)\G).*?\K-+

I found that the second and the third hyphen in the string match. Is this a case of “even Homer sometimes nods”, or do I make a mistake?

Alan Kilborn

@Paul-Wormer said in generic-regex-replacing-in-a-specific-zone-of-text:

do I make a mistake?

Yes, you do.
Why wouldn’t it match the last -, as you have nothing about </text> terminating the parsing in the regex…
The possible matching extends to the end of the line.

guy038

Hi, @paul-wormer, @alan-kilborn and All,

@paul-wormer, you didn’t make a mistake. It’s just that I did not explain myself properly !

When I said :

This behavior also explains why the SIMPLIFIED generic regex, shown below, where the BSR region and the different matches are restricted to one line only, does not need any ESR string ( </text> )

I should have added : … if we suppose that the closing </text> boundaries were implicitly at the very end of the lines containing the <text> string, instead of their present location !

For instance, write the text, below, in a new tab :

a-a foo b---c <text>bar d-e foo f---g bla</text> H--I blah j-k
This --- is
a-b foo c---d <text>e--f foo g-h bar i--j foo
a test
to - verify
A---0 bar b-c <text>d-e bla f-----g blah</text> foo h--i bar j-k
the ----- regexes

If you use the simplified generic regex (?-s)(?-i:BSR|(?!\A)\G).*?\K(?-i:FR), so the practical search (?-s)(?-i:<text>|(?!\A)\G).*?\K-+ :

It supposes that implicit </text> boundaries exist at the end of the lines, instead of the present ones

So, the text could also be rewritten :

a-a foo b---c <text>bar d-e foo f---g bla H--I blah j-k</text>
This --- is
a-b foo c---d <text>e--f foo g-h bar i--j foo</text>
a test
to - verify
A---0 bar b-c <text>d-e bla f-----g blah foo h--i bar j-k</text>
the ----- regexes

And, of course, the regex (?-s)(?-i:<text>|(?!\A)\G).*?\K-+, against these two texts, does match any range of dash characters, after the opening <text> boundary … till the end of each line

But, using again, the original text :

a-a foo b---c <text>bar d-e foo f---g bla</text> H--I blah j-k
This --- is
a-b foo c---d <text>e--f foo g-h bar i--j foo
a test
to - verify
A---0 bar b-c <text>d-e bla f-----g blah</text> foo h--i bar j-k
the ----- regexes

If we want to restrict the search to the part of each line, within the <text>...........</text> region, we must use, this time, the following regex :

(?-s)(?-i:<text>|(?!\A)\G)(?-i:(?!</text>).)*?\K-+

Note that, in line 3, which contains <text> but not </text>, the search for dashes still runs till the end of line 3 !
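
The difference between the two single-line regexes can be checked outside Notepad++ with a small Python sketch. Python's `re` lacks `\K` and `\G`, so a different technique is swapped in here: the whole zone is matched first, and a callback replaces the dashes inside it. The helper name `blank_dashes` is made up for this example.

```python
import re

def blank_dashes(match):
    # Replace each run of dashes inside the matched zone with one space.
    return re.sub(r"-+", " ", match.group())

line = "a-a <text>b-c</text> d-e"

# No ESR: the zone runs from <text> to the end of the line ('.' stops at \n).
to_end_of_line = re.sub(r"<text>.*", blank_dashes, line)

# With ESR: the zone stops right before </text>.
to_closing_tag = re.sub(r"<text>(?:(?!</text>).)*", blank_dashes, line)

print(to_end_of_line)   # a-a <text>b c</text> d e
print(to_closing_tag)   # a-a <text>b c</text> d-e
```

The first result agrees with the observation that the second and third hyphens match when no ESR is present; the second keeps the hyphen after `</text>` untouched.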

Best Regards,

guy038

Mark Olson

@Paul-Wormer
Time to unpack the (?s-i:(?!ESR).)*? part of the regex, which is what your new version is currently missing.

Essentially this is divided into four parts:

the flags: `?s-i`

These say that:

the . metacharacter should match everything (that’s the s flag)
we want to be case-sensitive. That’s the -i part.

consume a character unless the end of the search region is right ahead: `(?!ESR).`

This looks ahead without consuming any characters to see if the ESR is in front of you, and then stops if it is. An example:
Search string: cacb ab
regex: (?!ab).
At the start of the string, we look ahead for ab. The next character is c, so we consume c. (remember, this is SUCCESS because we want to NOT match the ESR).

string:     cacb ab
want:      _ab
match:      !
consumed: YES

Now we’re after the first c, before the first a. We look ahead for ab, but see ac instead, so we’re clear to advance.

string:     cacb ab
want:      __ab
match:       *!
consumed: YES

You can see that there are no ab anywhere except at the end of the string, so everything will match.

Let’s fast-forward to the end of the string:

string:     cacb ab
want:      ______ab
match:           **
consumed: NO

We’re now positioned between the whitespace and the ending ab. The next two characters are ab, so the lookahead fails: the a will NOT be consumed, and matching stops right after the whitespace character.

Do the above thing any number of times: `(?s-i:(?!ESR).)*?`

This just says to keep looking ahead and stopping if the ESR is ahead, then consuming a character, then looking ahead… until the ESR is reached or the entire string is consumed.

Interesting note: Rexegg.com refers to this as “tempered greed” because you’re greedily trying to eat the whole string, but checking to see if you’re full before you take each bite.
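
The walk-through above can be reproduced with Python's `re` module, which supports the same lookahead and scoped-flag syntax. This is a minimal sketch using the example string from the walk-through, with ab standing in for the ESR. A greedy `*` is used instead of `*?` so that the standalone pattern consumes as far as the lookahead allows; a lazy version on its own would simply match nothing.

```python
import re

# (?s-i:...) = dot also matches newlines, case-sensitive; ESR is "ab" here.
tempered = re.compile(r"(?s-i:(?!ab).)*")

m = tempered.match("cacb ab")
print(repr(m.group()))  # 'cacb '

# Thanks to the s flag, the token also walks across line breaks.
print(repr(tempered.match("x\ny ab z").group()))  # 'x\ny '
```

In both cases the match ends right before the first ab, with everything up to and including the preceding space consumed.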

Putting it all together:

So as @guy038 illustrated above, the (?-i:<text>|(?!\A)\G)(?s-i:(?!</text>).)*?\K-+ regular expression is going to start with (?-i:<text>|(?!\A)\G) by matching either <text> (the BSR) or the end of the last matched region (unless you wrapped around).
Now the (?s-i:(?!</text>).)*? part comes into play. It behaves as I described above: the negative lookahead for </text> ensures that you cannot go past the ESR.

The rest is the same, as you correctly identified:

forget everything you matched so far (the \K)
match any number of - characters (-+).

For the record, I think that part of the problem with the readability of this regex has to do with the flags. The version of the regex without flags, (?:<text>|(?!\A)\G)(?:(?!</text>).)*?\K-+, is I think a bit less confusing.
