Intracacies of NPP Regex negative lookahead
-
Fictitious example of HTML code, which contains no newline sequences:
<tr trow="1"><td class="fiddlesticks_a1bc"><span>here is some text</span></td></tr><tr trow="2"><td class="fiddlesticks_de2f"><span>another string</span></td></tr><tr trow="3"><td class="fiddlesticks_g-hi"><span>miscellaneous data</span></td></tr><tr trow="4"><td class="fiddlesticks_jk-l"><span>blah blah blah blah</span></td></tr>
I want to match ONLY from the LAST <td class="fiddlesticks_ to the end of the data, and am attempting to employ a negative lookahead toward achieving that, but all my efforts have failed so far. For example:
(?!<tr trow="\d">)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>.+?</span></td></tr>\z
<td class="fiddlesticks_[A-Za-z0-9-]+">(?!<tr trow="\d">)<span>.+?</span></td></tr>\z
<td class="fiddlesticks_[A-Za-z0-9-]+"><span>(?!<tr trow="\d">).+?</span></td></tr>\z
<td class="fiddlesticks_[A-Za-z0-9-]+"><span>(?!<td class="fiddlesticks_).+?</span></td></tr>\z
All of the above result in everything from <td class="fiddlesticks_a1bc"> on being matched (everything but the opening <tr trow="1">). If I try this:
<td class="fiddlesticks_[A-Za-z0-9-]+"><span>(?!here is some text).+?</span></td></tr>\z
…it matches everything from <td class="fiddlesticks_de2f"> to the end. But these:
<td class="fiddlesticks_[A-Za-z0-9-]+"><span>(?!another string).+?</span></td></tr>\z
<td class="fiddlesticks_[A-Za-z0-9-]+">(?!miscellaneous data)<span>.+?</span></td></tr>\z
…both result in everything from <td class="fiddlesticks_a1bc"> on being matched again. Of course, I wouldn’t be able to use expressions like those for my actual data anyway, since the text between <span> and </span> could be almost anything. Is my negative lookahead usage incorrect?
Debug info, if it matters:
Notepad++ v7.9.5 (32-bit)
Build time : Mar 21 2021 - 02:09:07
Path : C:\Program Files (x86)\Notepad++\notepad++.exe
Admin mode : ON
Local Conf mode : OFF
OS Name : Windows 7 Ultimate (64-bit)
OS Build : 7601.0
Current ANSI codepage : 1252
Plugins : none
-
@M-Andre-Z-Eckenrode said in Intracacies of NPP Regex negative lookahead:
Is it all about using a negative lookahead here, meaning the question you are asking?
I want to match ONLY from the LAST <td class="fiddlesticks_ to the end of the data
If we key in on that, doesn’t this get you there?:
.*<td class="fiddlesticks_
Well it gets you to the last occurrence of the above, and you can add to it to get to the “end of the data”.
But if you really want/need the negative lookahead, it will have to be someone else, as your examples (and your need) are confusing me. :-)
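For anyone who wants to try the greedy approach outside Notepad++, here is a minimal sketch in Python. Note the assumptions: Python's re module is a different engine from the Boost one used by Notepad++, and the capture group around the tail is only added for illustration, it is not part of the suggestion above.

```python
import re

# Sample data from the original post (a single line, no newline sequences).
html = (
    '<tr trow="1"><td class="fiddlesticks_a1bc"><span>here is some text</span></td></tr>'
    '<tr trow="2"><td class="fiddlesticks_de2f"><span>another string</span></td></tr>'
    '<tr trow="3"><td class="fiddlesticks_g-hi"><span>miscellaneous data</span></td></tr>'
    '<tr trow="4"><td class="fiddlesticks_jk-l"><span>blah blah blah blah</span></td></tr>'
)

# The greedy .* first swallows the whole line, then backtracks until the
# literal that follows it can match, which happens at its LAST occurrence.
# The capture group (an illustrative addition) isolates the tail.
m = re.search(r'.*(<td class="fiddlesticks_.*)', html)
tail = m.group(1)
print(tail)
```

The whole match still starts at the beginning of the line, so this only helps when the leading part can be captured or discarded separately.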
-
Hello, @m-andre-z-eckenrode, @ekopalypse, @peterjones, @alan-kilborn and All,
Ah… interesting problem ! So, Andre, you’ve tried the 7 regular expressions listed below, without finding a way to match only from the last <td class="fiddlesticks_ string till the very end of file :-((
(A1) (?!<tr trow="\d">)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>.+?</span></td></tr>\z
(A2) <td class="fiddlesticks_[A-Za-z0-9-]+">(?!<tr trow="\d">)<span>.+?</span></td></tr>\z
(A3) <td class="fiddlesticks_[A-Za-z0-9-]+"><span>(?!<tr trow="\d">).+?</span></td></tr>\z
(A4) <td class="fiddlesticks_[A-Za-z0-9-]+"><span>(?!<td class="fiddlesticks_).+?</span></td></tr>\z
(A5) <td class="fiddlesticks_[A-Za-z0-9-]+"><span>(?!here is some text).+?</span></td></tr>\z
(A6) <td class="fiddlesticks_[A-Za-z0-9-]+"><span>(?!another string).+?</span></td></tr>\z
(A7) <td class="fiddlesticks_[A-Za-z0-9-]+">(?!miscellaneous data)<span>.+?</span></td></tr>\z
Indeed :
- Regex A5 matches anything from <td class="fiddlesticks_de2f"> till the very end of file
- All the other regexes match from <td class="fiddlesticks_a1bc"> till the very end of file
Before providing some solutions to your problem, I’m going to explain, first, what all these regexes match and why !
- First note that all your regexes end with </span></td></tr>\z. So, whatever is matched so far, the match must go on… till the string </span></td></tr> at the very end of file
- Regarding the regex A1 :
  - It first looks for a string <td class="fiddlesticks_[A-Za-z0-9-]+"> and, of course, when the caret is right before the string <td class="fiddlesticks_a1bc">, the negative look-ahead (?!<tr trow="\d">) is necessarily true, because the text at that position begins with <td and not with <tr. So, it matches, so far, the string <td class="fiddlesticks_a1bc">
  - Now, the regex engine tries to process the remaining part <span>.+?</span></td></tr>\z. You could say : as there is a lazy quantifier, it will match from <span> to its corresponding </span>. Not at all !
  - We must all remember that the fundamental characteristic of regex engines is that they try, BY ALL MEANS, to match something. So, when you get the message Find: Can't find text "something", you can be absolutely sure that all possibilities and alternatives, if any, have been tried and that the overall regex cannot find a solution !
  - As the overall regex must be anchored to the very end of file, the regex <span>.+?</span></td></tr>\z matches the first <span>. Then .+? matches the shortest range of text till … the last </span></td></tr> at the very end of file !
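This behaviour of the lazy quantifier can be checked outside Notepad++ with a small Python sketch. It is only an approximation of the Boost engine: \Z is used as a stand-in for \z, and the sample line is the one from the first post.

```python
import re

# Sample data from the original post (a single line, no newline sequences).
html = (
    '<tr trow="1"><td class="fiddlesticks_a1bc"><span>here is some text</span></td></tr>'
    '<tr trow="2"><td class="fiddlesticks_de2f"><span>another string</span></td></tr>'
    '<tr trow="3"><td class="fiddlesticks_g-hi"><span>miscellaneous data</span></td></tr>'
    '<tr trow="4"><td class="fiddlesticks_jk-l"><span>blah blah blah blah</span></td></tr>'
)

# Regex A1 from the thread; \Z (end of string) stands in for \z here.
a1 = r'(?!<tr trow="\d">)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>.+?</span></td></tr>\Z'

m = re.search(a1, html)
first_td = html.find('<td class="fiddlesticks_')

# The lazy .+? is extended, one character at a time, until the end anchor
# can succeed, so the match runs from the FIRST <td ...> to the end.
print(m.start() == first_td, m.end() == len(html))
```

A lazy quantifier only prefers the shortest match from a given starting position; it does not move the starting position to the right.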
- Regarding the regex A2 :
  - The first part <td class="fiddlesticks_[A-Za-z0-9-]+"> matches the <td class="fiddlesticks_a1bc"> string
  - Then, the part (?!<tr trow="\d">)<span> matches <span>, which is obviously different from any <tr trow="#"> string
  - And the remaining part .+?</span></td></tr>\z matches, as above, the shortest range of text till the last </span></td></tr> at the very end of file
- Regarding the regex A3 :
  - It is almost identical to the regex A2, except that the look-ahead is now tested right after the first <span> string, where there’s no <tr trow="#"> string either !
- Regarding the regex A4 :
  - It’s a variant of the regex A3 as, right after the first <span> string, there’s no string <td class="fiddlesticks_ at all !
  - So, again, the remaining part .+?</span></td></tr>\z matches all text after the first <span> till the very end of file
- Regarding the regex A6 ( I’ll speak of regex A5 later ) :
  - Again, the part <td class="fiddlesticks_[A-Za-z0-9-]+"><span> matches the string <td class="fiddlesticks_a1bc"><span>
  - The negative look-ahead (?!another string) is necessarily true when the caret is after the first <span> string, as the text found there is here is some text
  - And the final part .+?</span></td></tr>\z matches, as said above, the shortest range of characters till it reaches the very end of file !
- Regarding the regex A7 :
  - First, the regex <td class="fiddlesticks_[A-Za-z0-9-]+"> matches the string <td class="fiddlesticks_a1bc">
  - Then, the part (?!miscellaneous data)<span> matches <span>, which is, of course, different from the miscellaneous data string
  - Finally, the part .+?</span></td></tr>\z matches the shortest range of characters … till it reaches the very end of file !
- Regarding the regex A5 :
  - The regex <td class="fiddlesticks_[A-Za-z0-9-]+"><span> matches the string <td class="fiddlesticks_a1bc"><span>, first
  - Then, the negative look-ahead (?!here is some text) is evaluated. As this text does follow the <span> string, the look-ahead is false and the attempt at this location fails
  - So, the regex engine goes on, finding the string <td class="fiddlesticks_de2f"><span>, which is matched by the part <td class="fiddlesticks_[A-Za-z0-9-]+"><span>
  - Again, the negative look-ahead (?!here is some text) is tested. As the present text is another string, which is different from the here is some text string, the negative look-ahead is true
  - So, the final part .+?</span></td></tr>\z matches all the remaining text, till the very end of file
Now, here is a solution, using your different look-ahead syntaxes (?!here is some text), (?!another string), and (?!miscellaneous data)
Let’s consider the four regexes below :
(G1) (?-si)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>.+?</span></td></tr>\z
(G2) (?-si)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>(?!here is some text).+?</span></td></tr>\z
(G3) (?-si)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>(?!here is some text)(?!another string).+?</span></td></tr>\z
(G4) (?-si)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>(?!here is some text)(?!another string)(?!miscellaneous data).+?</span></td></tr>\z
- These 4 regexes will catch the text from the first, second, third and fourth <td class="fiddlesticks_ string, respectively, till the very end of file !
- For instance, in regex G3, in order to match the string <td class="fiddlesticks_g-hi"><span>miscellaneous data</span>, we need that, after the <span> string, the text is different from here is some text AND different from the another string text !
- Note that when multiple consecutive look-aheads are evaluated, the working position of the regex engine does not change : it’s the location right after any <span> string
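The effect of stacking the look-aheads can be sketched in Python as follows. This is an approximation under a few assumptions: \Z replaces \z, the (?-si) prefix is dropped because Python's re module is already case-sensitive and its dot already excludes newlines by default, and the capture group around the class suffix is added only to report where each match starts.

```python
import re

# Sample data from the original post (a single line, no newline sequences).
html = (
    '<tr trow="1"><td class="fiddlesticks_a1bc"><span>here is some text</span></td></tr>'
    '<tr trow="2"><td class="fiddlesticks_de2f"><span>another string</span></td></tr>'
    '<tr trow="3"><td class="fiddlesticks_g-hi"><span>miscellaneous data</span></td></tr>'
    '<tr trow="4"><td class="fiddlesticks_jk-l"><span>blah blah blah blah</span></td></tr>'
)

# Group 1 (an illustrative addition) captures the class suffix of the cell
# where the match starts.
prefix = r'<td class="fiddlesticks_([A-Za-z0-9-]+)"><span>'
suffix = r'.+?</span></td></tr>\Z'
excluded = ['here is some text', 'another string', 'miscellaneous data']

starts = []
for n in range(4):  # n = 0..3 look-aheads, mirroring G1..G4
    lookaheads = ''.join('(?!' + text + ')' for text in excluded[:n])
    m = re.search(prefix + lookaheads + suffix, html)
    starts.append(m.group(1))

print(starts)
```

Each extra look-ahead rules out one more cell as a starting point, and all of them are tested at the same position, right after <span>.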
As you can see, if you always want to get the last range, near the end of file, it would be difficult to generalize, as you would be forced to add as many negative look-aheads as there are strings to avoid :-((
So, the correct solution is to find :
- First, a string <td class="fiddlesticks_••••"><span>
- Then, any range of text which does not contain, for instance, the string <td at any location of that range, till the very end of file
These conditions can be achieved by the regex (G5) :
(?-si)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>((?!<td).)+\z
- Note the leading modifiers (?-si), which force :
  - The . meta-character to match standard characters only ( not EOL ones )
  - The search to be processed in a case-sensitive way
- As you can see, the part ((?!<td).)+ matches any standard character, provided that, at each position, the string <td cannot be found, between a <span> string and the very end of file !
- Whereas the regex G6 :
(?-si)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>(?!<td).+\z
would just match like most of your regexes ! Indeed, in that case, the negative look-ahead is tested right after a <span> string, only
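The difference between G5 and G6 can be reproduced with a short Python sketch. The same assumptions apply as for any test outside Notepad++: \Z stands in for \z and the (?-si) prefix is omitted, since Python's re defaults already behave that way.

```python
import re

# Sample data from the original post (a single line, no newline sequences).
html = (
    '<tr trow="1"><td class="fiddlesticks_a1bc"><span>here is some text</span></td></tr>'
    '<tr trow="2"><td class="fiddlesticks_de2f"><span>another string</span></td></tr>'
    '<tr trow="3"><td class="fiddlesticks_g-hi"><span>miscellaneous data</span></td></tr>'
    '<tr trow="4"><td class="fiddlesticks_jk-l"><span>blah blah blah blah</span></td></tr>'
)

# G5: the look-ahead sits INSIDE the repeated group, so it is re-tested
# before every single character that the dot consumes.
g5 = r'<td class="fiddlesticks_[A-Za-z0-9-]+"><span>((?!<td).)+\Z'

# G6: the look-ahead is tested only once, right after <span>.
g6 = r'<td class="fiddlesticks_[A-Za-z0-9-]+"><span>(?!<td).+\Z'

m5 = re.search(g5, html)
m6 = re.search(g6, html)

print(m5.group(0))                                        # last cell only
print(m6.start() == html.find('<td class="fiddlesticks_'))  # starts at first cell
```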
Now, I could have chosen one of the following regexes, instead of regex G5 :
(G7) (?-si)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>((?!class).)+\z
(G8) (?-si)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>((?!fiddlesticks_).)+\z
(G9) (?-si)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>((?!fidd).)+\z
(G10) (?-si)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>((?!<t).)+\z
(G11) (?-si)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>((?!_).)+\z
All of them, like the regex G5, would catch the last zone <td class="fiddlesticks_ till the very end of file ;-))
In other words, within the negative look-ahead, you must add an expression which :
- Does occur before the last <td class="fiddlesticks_jk-l">, after each earlier <td class="fiddlesticks_ string. So, it is a forbidden expression, met when starting from these locations
- Does not occur after the last <td class="fiddlesticks_jk-l">, till the very end of file. So, there, the negative look-ahead is always true
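These two conditions can be tried out with a small Python sketch, again with \Z standing in for \z and without the (?-si) prefix. The token blah is a hypothetical bad choice, added here only to show what happens when the second condition is violated.

```python
import re

# Sample data from the original post (a single line, no newline sequences).
html = (
    '<tr trow="1"><td class="fiddlesticks_a1bc"><span>here is some text</span></td></tr>'
    '<tr trow="2"><td class="fiddlesticks_de2f"><span>another string</span></td></tr>'
    '<tr trow="3"><td class="fiddlesticks_g-hi"><span>miscellaneous data</span></td></tr>'
    '<tr trow="4"><td class="fiddlesticks_jk-l"><span>blah blah blah blah</span></td></tr>'
)

prefix = r'<td class="fiddlesticks_[A-Za-z0-9-]+"><span>'
last_cell = html[html.rfind('<td class="fiddlesticks_'):]

# Forbidden expressions taken from G5 and G7..G11.
tokens = ['<td', 'class', 'fiddlesticks_', 'fidd', '<t', '_']
results = {}
for token in tokens:
    pattern = prefix + '((?!' + re.escape(token) + ').)+' + r'\Z'
    m = re.search(pattern, html)
    results[token] = m.group(0) if m else None

# Hypothetical bad choice: "blah" also occurs AFTER the last <td ...>,
# so the look-ahead blocks every candidate, including the last cell.
bad = re.search(prefix + r'((?!blah).)+\Z', html)

print(all(found == last_cell for found in results.values()), bad)
```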
Best Regards
guy038
-
-
A nice treatise.
Was anything learned specific to N++'s treatment of negative lookahead?
It didn’t seem like it to me.
It’s okay, though, I’m not complaining in any way. -
@Alan-Kilborn said in Intracacies of NPP Regex negative lookahead:
I want to match ONLY from the LAST <td class="fiddlesticks_ to the end of the data
If we key in on that, doesn’t this get you there?:
.*<td class="fiddlesticks_
Actually, I could probably use that in some circumstances, depending on the specific operation I’m trying to perform (which I didn’t go into in my post). Thanks for the suggestion — I’ll keep it in mind. But strictly speaking, it doesn’t do what I was trying to do, which was to match only a limited substring of the whole thing.
@guy038 said in Intracacies of NPP Regex negative lookahead:
Before providing some solutions to your problem, I’m going to explain, first, what all these regexes match and why !
As always, you’ve raised the bar with your thorough analysis and explanation. Thank you, sir! So often, I find myself coming up with non-working code, and while I certainly want to find the code that DOES work, I also have an innate desire to understand why my previous attempts didn’t do what I expected.
These conditions can be achieved by the regex (G5) :
(?-si)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>((?!<td).)+\z
And that’s the solution I was looking for! Thanks again! (Though I don’t think the (?-s) is necessary for the code I’m working on, since there are no EOL to be found.)
@Alan-Kilborn said in Intracacies of NPP Regex negative lookahead:
Was anything learned specific to N++’s treatment of negative lookahead?
It didn’t seem like it to me.
Perhaps not, but please consider that from my point of view when I wrote my initial post, and perhaps those of at least some of the others when posting similar regex questions in these forums, I honestly didn’t know if it was my attempts at negative lookahead that were inadequate, or the documentation (as turned out to be the case with my recent question about \` failing to match the end of file), or that regex functionality just wasn’t working.
-
Hello, @m-andre-z-eckenrode, @alan-kilborn and All,
You’re right about the (?-s) modifier : it’s not necessary. So the final regex G5 would be :
(?-i)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>((?!<td).)+\z
I also suppose that you’ve understood why, in regexes G3 and G4, we must use consecutive look-aheads ! Indeed, the text has to be different from Condition_1 AND Condition_2 AND Condition_3 … AND Condition_N
Note that the single negative look-ahead (?!Condition_1|Condition_2|Condition_3) is equivalent to these consecutive negative look-aheads : it fails as soon as any one of the alternatives matches at the current position, so it succeeds only if the text is different from Condition_1 AND Condition_2 AND Condition_3 ;-))
By contrast, the positive look-ahead (?=Condition_1|Condition_2|Condition_3) validates the overall regex if, at a particular position, Condition_1 OR Condition_2 OR Condition_3 is true !
Note that successive positive look-aheads, such as (?=Condition_1)(?=Condition_2)(?=Condition_3), are often a nonsense. Indeed, when the conditions exclude each other, it’s impossible to satisfy N conditions at the same time !
For instance, the regex \d+(?=ABC)(?=DEF)(?=GHI), against the text below, will never match anything !
012345ABC 012345DEF 012345GHI 012345___ 012345012
Of course, we could cheat a bit with the regex \d+(?=ABC)(?=\u+)(?=\w+), but the last two look-aheads are rather superfluous ! Test also the regexes \d+(?=\u+)(?=\w+) and \d+(?=\w+) against the sample
No, as you said, Alan, I haven’t learned anything different from what I already knew about look-arounds, in this specific example !
But I take advantage of this post to speak of 4 new tricks :
- (1A) A look-ahead can be located before its related expression :
For instance, the regex (?i-s)(?=....456)Guy would match the string Guy, whatever its case, if it’s followed by a space char and the string 456
- (1B) Similarly, a look-behind can be placed after its related expression :
For instance, the regex (?i-s)Guy(?<=123....) would match the string Guy, whatever its case, if it’s preceded by the string 123 and a space char
Test them against this line :
123 Guy 456
Unfortunately, it doesn’t help to make look-behinds, with variable-size strings, functional :-( We still need the \K feature
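Tricks 1A and 1B are not specific to Notepad++, so they can be tried in Python too. A minimal sketch, with one assumption to keep in mind: Python's re module does not accept a leading (?i-s) group, so case-insensitivity is requested with the re.IGNORECASE flag instead, while the dot already excludes newlines by default.

```python
import re

line = '123 Guy 456'

# 1A: the look-ahead is written BEFORE "Guy" and checks, from the same
# starting position, that 4 characters are followed by 456.
ahead_first = re.compile(r'(?=....456)Guy', re.IGNORECASE)

# 1B: the look-behind is written AFTER "Guy" and checks, from the position
# right after it, that the 7 preceding characters are 123 plus any 4 chars.
behind_last = re.compile(r'Guy(?<=123....)', re.IGNORECASE)

print(ahead_first.search(line).span())   # (4, 7)
print(behind_last.search(line).span())   # (4, 7)

# Counter-examples: the context no longer fits, so nothing matches.
print(ahead_first.search('123 Guy 789'))  # None
print(behind_last.search('999 Guy 456'))  # None
```

The look-behind here has a fixed width of 7 characters, which is why Python accepts it.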
- (2) A regex can contain part(s) in free-spacing mode and part(s) in normal mode, all mixed together !
For instance, the regex
(?-si)A.+B(?x) .+ C (?-x).+D.+E(?x) ( 1 | 2 | 3 | 4 |5 ) + 6 78 9 0(?-x)6 78 9 0
would match within the string
12345A12345B12345C12345D12345E12345678906 78 9 0ABCD
This particularity is interesting if you want to highlight a difficult part of the regex, either syntactically and/or functionally. For instance, let’s imagine that we want to match two ranges of digits, separated by a /, then followed by a space char and a range of upper-case letters, which must not contain the exact string ABC.
A typical syntax would be (?-si)\d+/\d+ ((?!ABC)\u)+. But it could also be expressed as (?-si)\d+/\d+ (?x) ( (?!ABC) \u )+ to show that, before each uppercase letter found, the string ABC must not be matched ! Test it against this string :
12345/67890 FSDGOUZERTOABCROTFOERTFGCV 12/34 FSDGOUZERTOXYZROTFOERTFGCV 1/0 ZZABCZZZ
- (3) Most of us ( and me, too ! ) think that the { and } symbols are always regex meta-characters. Not at all ! For instance, all the regexes below are functional :
  - 1{A}3
  - A{----}----Z
  - 12345{}67890
  - 1{2
  - 123}456
  - {}
  - {{}}
  - 1}2{3
  - a{-3 }
and match one or two of the lines below :
12345{}67890 1{A}3 A{----}----Z 1{2 123}456 1{ a }3 {} {{}} 1}2{3 a{-3 }
However, when there’s a digit, after the opening brace {, and possibly, a space char, this symbol needs to be escaped. For instance, the regexes :
  - 1\{ 2 }3 matches the string 1{ 2 }3
  - 1\{2}3 matches the string 1{2}3
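A similar brace behaviour can be observed in Python's re module. This is only an analogy: Python is a different engine from the Boost one in Notepad++, and its exact rules for braces containing spaces may differ, so those cases are left out of this sketch.

```python
import re

# In Python, a { that does not open a valid {n}, {n,} or {n,m} quantifier is
# taken literally, so each of these patterns matches its own text.
literal_patterns = [
    '1{A}3', 'A{----}----Z', '12345{}67890', '1{2', '123}456',
    '{}', '{{}}', '1}2{3', 'a{-3 }',
]
literal_ok = all(re.fullmatch(p, p) is not None for p in literal_patterns)

# With a digit directly inside the braces, {2} IS a quantifier:
quantified = re.fullmatch(r'1{2}3', '113')        # two consecutive 1, then 3
not_literal = re.fullmatch(r'1{2}3', '1{2}3')     # no match
escaped = re.fullmatch(r'1\{2}3', '1{2}3')        # escaping { is enough

print(literal_ok, quantified is not None, not_literal is None, escaped is not None)
```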
- (4) When a replacement zone contains space characters beginning and/or ending the field, you may surround the overall replacement with parentheses !
For instance, the regex S/R :
SEARCH (?-i)DEF
REPLACE ( $0 )
would change the string "ABCDEFGHI" into the string "ABC DEF GHI"
Adding the parentheses ( and ) helps us to easily visualize the replacement zone ;-))
Best Regards,
guy038
1A, 1B, and 2 are not that interesting, maybe because they are not very surprising! :-)
3, however, is a bit interesting. To restate it here:
when there’s a digit, after the opening brace {, and possibly, a space char, this symbol needs to be escaped.
It is interesting that you said that 1{2 is “functional”, even though it doesn’t meet the criterion. Perhaps it would need the } to make it require the escape on the { ?
I suppose that it sometimes needs the escape so that it isn’t confused with a usage like j{2}, for a match of two j characters (although that is a contrived usage since it is shorter to simply use jj).
4 is also somewhat interesting. Restating it:
When a replacement zone contains space characters beginning and/or ending the field, you may surround the overall replacement with parentheses
I’ve sort of always taken it as a given that if you want a literal ( or ) to appear in your replacement, you use \( and \).
But I haven’t thought too much about using them unescaped without a real need, such as in (?1x:y) or some other known constructs.
Do you think there is a good reason that ( or ) in the replace field just can’t be literalized when used without additional syntax (much like the { or } seems to be in your point 3 above)?
It would save a bit of time, as the usual route is to not pay extreme attention to what you’re doing when you want literals in the replacement, and you do your operation, and the unescaped ( or ) do not appear, and you think “darn it, I forgot to escape them”, and then you undo your replacement, add the \ to the replace field, and re-execute the replacement.
Hi, @alan-kilborn,
Regarding the regex syntax 1{2, this is considered as a pure literal expression, which correctly matches the 1{2 string
But if you want to match the literal string 1{2}, as this syntax has the regex meaning “two consecutive digits 1”, we need to escape the opening brace { only ( so 1\{2} ) to get a literal expression !
As defined here
All characters are treated as literals, except for characters $, \, (, ), ?, and :
If you want to write the $, ? and : characters literally, you do not need, most of the time, to escape them, because they are usually found outside their meaning context !
However, the three characters (, ) and \ must always be escaped, in the replacement zone, in order to be written literally !
Parentheses are normally used for lexical grouping in conditional expressions, with these syntaxes :
- (?DigitTrue_Exp) or (?{Digit}True_Exp) or (?NameTrue_Exp)
- (?DigitTrue_Exp:False_Exp) or (?{Digit}True_Exp:False_Exp) or (?NameTrue_Exp:False_Exp)
Apart from these cases, these two parentheses seem to just represent a pure empty string !
For instance :
SEARCH DEF
REPLACE 123(456 or REPLACE 123()456
would change the string ABCDEFGHI into ABC123456GHI
And the S/R :
SEARCH : DEF
REPLACE : 123( (((XYZ)OP(QRS)TUV ()) )789
would change the string "ABCDEFGHI" into "ABC123 XYZOPQRSTUV 789GHI"
Thus, the S/R :
SEARCH DEF
REPLACE ()
would change the string ABCDEFGHI into ABCGHI ! In other words, the () syntax, in the replacement zone, seems to be a synonym of an empty string ;-)
However, note that :
SEARCH DEF
REPLACE 123)456 or REPLACE 123)456(789
would change the string ABCDEFGHI into ABC123GHI only !
Now, placing some replacement meta-characters inside parentheses does not make them literal, and they keep their normal behavior :
For instance, the regex S/R :
SEARCH (DEF)|XYZ
REPLACE ---(123(?1TRUE:FALSE)456\\789)---
would change the string
ABCDEFGHI ABCXYZGHI
into
ABC---123TRUE456\789---GHI ABC---123FALSE456\789---GHI
Finally, the only practical application I found of using parentheses, is when you want to delimit a string beginning and/or ending with space characters !
Best Regards,
guy038
-