generic-regex-replacing-in-a-specific-zone-of-text
-
I’m on a track to learning advanced regexes and this regex is one of my challenges. I understand it halfway and have two remaining questions. For reference's sake, I repeat the regex here:
(?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?-si:FR)
First question: why is there the alternation `(?-si:BSR|(?!\A)\G)`? Naively, I would simply start with `(?-si:BSR)`, place the caret somewhere before the string matched by `BSR`, and then hit `Find Next` or `Replace`.
My second question is more related to Npp than to regexes: why doesn’t `Replace` work with this regex (even if one wants to replace one string only), and why is `Replace All` necessary?
-
Hello, @paul-wormer and All,
I do not have enough spare time to fully answer your first question right now ! It is just a matter of a few hours !
However, regarding your second question, it’s quite easy ! It happens that, any time you insert a `\K` syntax somewhere in a regex, the step-by-step replacement with the `Replace` button is not allowed by the regex engine, and the only possibility is to use the `Replace All` button !
Best regards,
guy038
-
Hello, @paul-wormer and All,
Let’s test it with the real text, below, that you’ll copy in a new N++ tab :
<try>01-23 456 7---89 </pos> <val>37--001</val> <text>This-is -a</text> <pos>4-1234</pos> <val>37--002</val> <text>-small---example</text> <pos>9-0012</pos> <val>37--003</val> <text>-of-text- which-</text> <pos>1-9999</pos> <val>37--004</val> <text>need -to-be- modi fied</text> <pos>0-0000</pos>
Note the `2` empty lines at the beginning of the file !
Now, let’s suppose that we want to replace any range of dashes with a single space char, but ONLY on lines embedded in a multi-line section `<text>.....</text>`
- If we use your formulation of the generic regex `(?-si:BSR)(?s-i:(?!ESR).)*?\K(?-si:FR)`, we end up with the functional search regex `(?-si:<text>)(?s-i:(?!</text>).)*?\K(?-si:-+)`, which can be simplified as :
- SEARCH `(?-i:<text>)(?s-i:(?!</text>).)*?\K-+`
- REPLACE `\x20`
- Move the cursor at the very beginning of the new tab
- Seemingly, the first match is correct : a dash, right after the part `<text>This`, but the subsequent matches are wrong : it always matches the first occurrence ONLY, in each section surrounded with the multi-line tags `<text>` and `</text>`
Because of the lack of the `\G` syntax, the other dashes, present in each `<text>.....</text>` section, are not matched. Thus, we cannot use this regex form !
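To see why the form without `\G` stops after one dash range per section, here is a minimal Python sketch. It is only an approximation: Python's standard `re` module supports neither `\K` nor `\G`, so a capture group stands in for `\K`, and the `sample` string is a shortened, made-up variant of the test text.

```python
import re

# Hypothetical, shortened sample (not the full test text from the post).
sample = (
    "<val>37--001</val>\n"
    "<text>This-is\n-a</text>\n"
    "<pos>4-1234</pos>\n"
    "<text>-small---example</text>\n"
)

# Group 1 plays the role of everything before \K: it is put back unchanged.
# Every match has to begin with a fresh <text>, because nothing lets the
# next match continue from where the previous one ended.
no_continuation = re.compile(r"(?s)(<text>(?:(?!</text>).)*?)-+")

result = no_continuation.sub(lambda m: m.group(1) + " ", sample)
print(result)
```

Only the first dash range after each `<text>` is replaced; the later ones in the same section are left alone, which is the behaviour described above.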
You could say : but what about the generic regex `(?-si:BSR|\G)(?s-i:(?!ESR).)*?\K(?-si:FR)`, which can be simplified as :
- SEARCH `(?-i:<text>|\G)(?s-i:(?!</text>).)*?\K-+`
- REPLACE `\x20`
- Again, move the cursor at the very beginning of the new tab
- First, it matches all occurrences of a dash in any line not surrounded with the multi-line tags `<text>` and `</text>` ( NOT wanted )
- Then, as soon as it matches a BSR region ( `<text>` ) and up to the very end of the file, it correctly matches any range of dashes ONLY in the lines surrounded with the multi-line tags `<text>` and `</text>`
Because the `\G` syntax is matched at the very beginning of the file, some first wrong matches occur. Thus, this formulation is not correct either !
Finally, let’s use the complete generic regex `(?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?-si:FR)`, which gives the functional one :
- SEARCH `(?-i:<text>|(?!\A)\G)(?s-i:(?!</text>).)*?\K-+`
- REPLACE `\x20`
- Again, move the cursor at the very beginning of the new tab
- As you can see, since the `\G` syntax is not allowed at the very beginning of the current file, it correctly matches all the occurrences of any range of dashes, ONLY in the lines surrounded with the multi-line tags `<text>` and `</text>`.
This is the expected and desired behaviour !
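For readers who want the same end result outside Notepad++, here is a rough Python sketch. It does not use `\G` or `\K` (Python's standard `re` module has neither); it swaps in a different, two-pass technique: first locate each closed `BSR ... ESR` zone, then replace the FR inside it. The helper name `replace_in_zones` and the sample string are assumptions made up for this illustration, and unlike the regex above, a zone without a closing ESR is simply ignored.

```python
import re

def replace_in_zones(text, bsr, esr, fr, replacement):
    """Replace regex `fr` only between literal `bsr` and `esr` markers."""
    zone = re.compile(
        "(" + re.escape(bsr) + ")(.*?)(" + re.escape(esr) + ")", re.DOTALL
    )
    fr_re = re.compile(fr)

    def fix(m):
        # Keep the two markers untouched, rewrite only the zone content.
        return m.group(1) + fr_re.sub(replacement, m.group(2)) + m.group(3)

    return zone.sub(fix, text)

# Hypothetical, shortened sample text.
sample = (
    "<pos>4-1234</pos>\n"
    "<text>This-is\n-a</text>\n"
    "<val>37--002</val>\n"
    "<text>-small---example</text>\n"
)
fixed = replace_in_zones(sample, "<text>", "</text>", r"-+", " ")
print(fixed)
```

The dashes in the `<pos>` and `<val>` lines stay as they are, while every dash range inside a `<text>` section becomes a single space.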
However, note that if we decide, on purpose, to start with the cursor on a line after the very beginning ( such as line `2`, `3` or any other ), our last regex will also find some initial wrong matches !
Best Regards,
guy038
-
-
@guy038 Thank you very much for your very elaborate answer (and also for your time). I will study carefully your text, at first glance it seems definitely worth my while. You really are the grandmaster of regular expressions! Thank you again.
-
Hi, @paul-wormer and All,
I realize that I forgot to mention the fundamental role of the `\G` syntax in this kind of regex !
In the Boost reference manual, it is said :
Continuation Escape
The sequence `\G` matches only at the end of the last match found, or at the start of the text being matched if no previous match was found.
This escape is useful if you’re iterating over the matches contained within a text, and you want each subsequent match to start where the last one ended.
What does this mean ?
Well, when the caret is at the beginning of the file, the `\G` syntax is not allowed because of the negative look-ahead `(?!\A)`. So the only possibility to match is to match the BSR string, i.e. the string `<text>` in that case, followed by the smallest range, even on several lines, of any chars, without crossing `</text>` … till a range of dashes
Then the `\G` feature takes over and selects from the next char to the nearest range of dashes, again. But, as it cannot go through the `</text>` string, this means that, once no dash remains before `</text>`, the next match will necessarily not be adjacent to the previous one.
Thus, the `\G` assertion cannot match anymore and the only possibility is to match a `<text>` string again ( the other alternative )
    bla bla <text> this is a-small test to see-if it</text> is OK.<text>We're looking again for a third dash-and a fourth-one and so-on</text>
            ----------------•-----------------•                   ------------------------------------------•------------•----------•
                        1st match         2nd match                                                     3rd match    4th match  5th match
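The hand-over between the two alternatives can be mimicked in plain Python. This is only a rough emulation written for this explanation, not how the Boost engine is implemented: `zone_matches` is a hypothetical helper, BSR and ESR are treated as literal strings, FR is assumed never to match an empty string, and the search is assumed to start at offset 0 (the caret at the very beginning of the file).

```python
import re

def zone_matches(text, bsr, esr, fr):
    """Return (start, end) spans of FR, mimicking (?:BSR|(?!\\A)\\G)...\\K FR."""
    fr_re = re.compile(fr)

    def scan(i):
        # Lazy "tempered" walk: try FR here, otherwise step one char
        # forward, unless ESR starts at this position or the text ends.
        while True:
            m = fr_re.match(text, i)
            if m:
                return m
            if i == len(text) or text.startswith(esr, i):
                return None
            i += 1

    spans = []
    last_end = 0  # 0 means "no usable \G position" (the (?!\A) guard)
    while True:
        # The \G alternative: continue right after the previous match.
        m = scan(last_end) if last_end > 0 else None
        # The BSR alternative: look for the next opening marker.
        start = text.find(bsr, last_end)
        while m is None and start != -1:
            m = scan(start + len(bsr))
            start = text.find(bsr, start + 1)
        if m is None:
            return spans
        spans.append(m.span())
        last_end = m.end()

text = ("bla bla <text> this is a-small test to see-if it</text> is OK."
        "<text>We're looking again for a third dash-and a fourth-one and so-on</text>")
spans = zone_matches(text, "<text>", "</text>", r"-+")
print([text[s:e] for s, e in spans])
```

On the sample line above, the first two dashes are found by continuing from the previous match, the walk then stops at `</text>`, and the remaining three are found only after a new `<text>` has been located.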
This behavior also explains why the SIMPLIFIED generic regex, shown below, where the BSR region and the different matches are restricted to one line only, does not need any ESR string ( `</text>` )
(?-s)(?-i:BSR|(?!\A)\G).*?\K(?-i:FR)
which gives the functional regex S/R :
SEARCH `(?-s)(?-i:<text>|(?!\A)\G).*?\K-+`
REPLACE `\x20`
If we imagine this text :
    Line 1   <text>this is a-small text to see-where are-
    -        ---------------•-----------------•---------•
    Line 3              1st match         2nd match 3rd match
    Line 4
    Line 5   <text>all the-different matches of-a dash</text>
    Line 6   -------------•--------------------•
    Line 7            4th match            5th match
After the third match, of the final dash at the end of line `1`, the next match should be at the beginning of line `2`. But it is not allowed, because the regex engine would have to go from line `1` to line `2` and skip the two chars `CR` + `LF`
In that case, the `\G` feature is not respected anymore and, necessarily, the search goes on starting with the `<text>` string, in line `5`, … till a dash
Note that the `</text>` string, at the end of each line, is not mandatory because the implicit gap `CRLF`, between two consecutive lines, resets, each time, the regex engine to the search of a `<text>` string first !!
Thus, to search any FR region in each line containing a BSR region, simply use the simplified generic regex
(?-s)(?-i:BSR|(?!\A)\G).*?\K(?-i:FR)
Best Regards,
guy038
-
I worked through Guy’s examples and understand them. I cannot say that I could already compose a regular expression of similar complexity, but at least I can follow Guy’s reasoning.
There is one open end though:
@guy038 said in generic-regex-replacing-in-a-specific-zone-of-text:
This behavior also explains why the SIMPLIFIED generic regex, shown below, where the BSR region and the different matches are restricted to one line only, does not need any ESR string ( </text> )
[…]
SEARCH (?-s)(?-i:<text>|(?!\A)\G).*?\K-+
I tried to match the string:
a-a <text>b-c</text> d-e
and placed the caret at the beginning of the file. Using the simplified regex above:
(?-s)(?-i:<text>|(?!\A)\G).*?\K-+
I found that the second and the third hyphen in the string match. Is this a case of “even Homer sometimes nods”, or do I make a mistake?
-
@Paul-Wormer said in generic-regex-replacing-in-a-specific-zone-of-text:
do I make a mistake?
Yes, you do.
Why wouldn’t it match the last `-`, as you have nothing about `</text>` terminating the parsing in the regex…
The possible matching extends to the end of the line.
-
Hi, @paul-wormer, @alan-kilborn and All,
@paul-wormer, you didn’t make a mistake. It’s just that I did not explain myself properly !
When I said :
This behavior also explains why the SIMPLIFIED generic regex, shown below, where the BSR region and the different matches are restricted to one line only, does not need any ESR string ( `</text>` )
I should have added : … if we suppose that the closing `</text>` boundaries were implicitly at the very end of the lines containing the `<text>` string, instead of their present location !
For instance, write the text, below, in a new tab :
    a-a foo b---c <text>bar d-e foo f---g bla</text> H--I blah j-k
    This --- is
    a-b foo c---d <text>e--f foo g-h bar i--j foo
    a test to - verify
    A---0 bar b-c <text>d-e bla f-----g blah</text> foo h--i bar j-k
    the ----- regexes
If you use the simplified generic regex `(?-s)(?-i:BSR|(?!\A)\G).*?\K(?-i:FR)`, hence the practical search `(?-s)(?-i:<text>|(?!\A)\G).*?\K-+` :
It supposes that implicit `</text>` boundaries exist at the end of the lines, instead of the present ones
So, the text could also be rewritten :
    a-a foo b---c <text>bar d-e foo f-g bla H---I blah j--k</text>
    This --- is
    a-b foo c---d <text>e--f foo g-h bar i--j foo</text>
    a test to - verify
    A---0 bar b-c <text>d-e bla f-----g blah foo h--i bar j-k</text>
    the ----- regexes
And, of course, the regex `(?-s)(?-i:<text>|(?!\A)\G).*?\K-+`, against these two texts, does match any range of dash characters, after the opening `<text>` boundary … till the end of each line
But, using again the original text :
    a-a foo b---c <text>bar d-e foo f---g bla</text> H--I blah j-k
    This --- is
    a-b foo c---d <text>e--f foo g-h bar i--j foo
    a test to - verify
    A---0 bar b-c <text>d-e bla f-----g blah</text> foo h--i bar j-k
    the ----- regexes
If we want to restrict the search to the part of each line within the `<text>...........</text>` region, we must use, this time, the following regex :
(?-s)(?-i:<text>|(?!\A)\G)(?-i:(?!</text>).)*?\K-+
Note that, in line `3`, which contains `<text>` but not `</text>`, the search of the dashes still runs till the end of line `3` !
Best Regards,
guy038
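The difference between the two single-line variants can be checked with a small Python sketch. It does not run the Notepad++ regexes themselves (Python's standard `re` module has no `\G` or `\K`); the hypothetical helper `dash_runs` simply computes, for one line, which dash ranges fall inside the zone: up to the end of the line when no ESR is given, or up to `</text>` when one is.

```python
import re

def dash_runs(line, bsr="<text>", esr=None):
    """List (offset, run) for each dash run in the zones of a single line.

    esr=None mimics the simplified regex: the zone ends at the end of the line.
    """
    runs = []
    pos = line.find(bsr)
    while pos != -1:
        begin = pos + len(bsr)
        end = len(line) if esr is None else line.find(esr, begin)
        if end == -1:                  # zone never closed on this line
            end = len(line)
        for m in re.finditer(r"-+", line[begin:end]):
            runs.append((begin + m.start(), m.group()))
        pos = line.find(bsr, end)      # look for a further zone on the line
    return runs

line = "a-a <text>b-c</text> d-e"
print(dash_runs(line))                  # simplified form: zone runs to end of line
print(dash_runs(line, esr="</text>"))   # restricted form: zone stops at </text>
```

With the simplified form, the second and third hyphens of `a-a <text>b-c</text> d-e` are reported, as observed earlier in the thread; with the ESR given, only the hyphen in `b-c` remains.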
-
@Paul-Wormer
Time to unpack the `(?s-i:(?!ESR).)*?` part of the regex, which is what your new version is currently missing.
Essentially this is divided into four parts:
the flags: `?s-i`
These say that:
- the `.` metacharacter should match everything, including line-break characters (that’s the `s` flag)
- we want to be case-sensitive. That’s the `-i` part.
consume a character unless the end of the search region is right ahead: `(?!ESR).`
This looks ahead without consuming any characters to see if the ESR is in front of you, and then stops if it is. An example:
Search string: `cacb ab`
regex: `(?!ab).`
At the start of the string, we look ahead for `ab`. The next character is `c`, so we consume `c`. (remember, this is SUCCESS because we want to NOT match the ESR).
    string: cacb ab
    want: _ab
    match: !
    consumed: YES
Now we’re after the first `c`, before the first `a`. We look ahead for `ab`, but see `ac` instead, so we’re clear to advance.
    string: cacb ab
    want: __ab
    match: *!
    consumed: YES
You can see that there are no `ab` anywhere except at the end of the string, so everything will match.
Let’s fast-forward to the end of the string:
    string: cacb ab
    want: ______ab
    match: **
    consumed: NO
We’re now positioned between the blank space and the ending `ab`. The next two characters are `ab`, so the lookahead fails and the `a` will NOT be consumed: the match stops right after the whitespace character.
Do the above thing any number of times: `(?s-i:(?!ESR).)*?`
This just says to keep looking ahead and stopping if the ESR is ahead, then consuming a character, then looking ahead… until the ESR is reached or the entire string is consumed. Because the quantifier is the lazy `*?`, it also stops as soon as the rest of the regex (the FR) can match.
Interesting note: Rexegg.com refers to this as “tempered greed” because you’re greedily trying to eat the whole string, but checking to see if you’re full before you take each bite.
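This part of the regex needs neither `\G` nor `\K`, so it can be tried directly with Python's standard `re` module. A minimal sketch, using the `cacb ab` example from above; the helper `up_to` is a made-up name for this illustration and treats the ESR as a literal string:

```python
import re

text = "cacb ab"

# Single step: positions where "(?!ab)." is allowed to consume a character.
# Offset 5 (the "a" of the final "ab") is the only one that is refused.
print([m.start() for m in re.finditer(r"(?!ab).", text)])

# Repeated step: consume characters one by one, stopping right before "ab".
tempered = re.compile(r"(?:(?!ab).)*")
print(repr(tempered.match(text).group()))

def up_to(esr):
    """Build a pattern that consumes everything up to a literal ESR."""
    return re.compile(r"(?s)(?:(?!" + re.escape(esr) + r").)*")

# With the s flag, the walk also crosses line breaks.
print(repr(up_to("</text>").match("This-is\n-a</text> tail").group()))
```

The repeated form consumes `cacb ` (including the blank) and stops in front of `ab`; in the multi-line case it stops in front of `</text>`.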
Putting it all together:
So as @guy038 illustrated above, the `(?-i:<text>|(?!\A)\G)(?s-i:(?!</text>).)*?\K-+` regular expression is going to start with `(?-i:<text>|(?!\A)\G)` by matching either `<text>` (the BSR) or the end of the last matched region (unless you wrapped around).
Now the `(?s-i:(?!</text>).)*?` part comes into play. It behaves as I described above: the negative lookahead for `</text>` ensures that you cannot go past the ESR.
The rest is the same, as you correctly identified:
- forget everything you matched so far (the `\K`)
- match one or more `-` characters (`-+`).
For the record, I think that part of the problem with the readability of this regex has to do with the flags. The version of the regex without flags, `(?:<text>|(?!\A)\G)(?:(?!</text>).)*?\K-+`
, is I think a bit less confusing. - the