Intricacies of NPP Regex negative lookahead



  • Fictitious example of HTML code, which contains no newline sequences:

    <tr trow="1"><td class="fiddlesticks_a1bc"><span>here is some text</span></td></tr><tr trow="2"><td class="fiddlesticks_de2f"><span>another string</span></td></tr><tr trow="3"><td class="fiddlesticks_g-hi"><span>miscellaneous data</span></td></tr><tr trow="4"><td class="fiddlesticks_jk-l"><span>blah blah blah blah</span></td></tr>
    

    I want to match ONLY from the LAST <td class="fiddlesticks_ to the end of the data, and am attempting to employ a negative lookahead toward achieving that, but all my efforts have failed so far. For example:

    (?!<tr trow="\d">)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>.+?</span></td></tr>\z
    <td class="fiddlesticks_[A-Za-z0-9-]+">(?!<tr trow="\d">)<span>.+?</span></td></tr>\z
    <td class="fiddlesticks_[A-Za-z0-9-]+"><span>(?!<tr trow="\d">).+?</span></td></tr>\z
    <td class="fiddlesticks_[A-Za-z0-9-]+"><span>(?!<td class="fiddlesticks_).+?</span></td></tr>\z
    

    All of the above result in everything from <td class="fiddlesticks_a1bc"> on being matched (everything but the opening <tr trow="1">). If I try this:

    <td class="fiddlesticks_[A-Za-z0-9-]+"><span>(?!here is some text).+?</span></td></tr>\z
    

    …it matches everything from <td class="fiddlesticks_de2f"> to the end. But these:

    <td class="fiddlesticks_[A-Za-z0-9-]+"><span>(?!another string).+?</span></td></tr>\z
    <td class="fiddlesticks_[A-Za-z0-9-]+">(?!miscellaneous data)<span>.+?</span></td></tr>\z
    

    …all result in everything from <td class="fiddlesticks_a1bc"> on being matched again. Of course, I wouldn’t be able to use expressions like those for my actual data anyway, since the text between <span> and </span> could be almost anything. Is my negative lookahead usage incorrect?

    Debug info, if it matters:

    Notepad++ v7.9.5   (32-bit)
    Build time : Mar 21 2021 - 02:09:07
    Path : C:\Program Files (x86)\Notepad++\notepad++.exe
    Admin mode : ON
    Local Conf mode : OFF
    OS Name : Windows 7 Ultimate (64-bit) 
    OS Build : 7601.0
    Current ANSI codepage : 1252
    Plugins : none
    


  • @M-Andre-Z-Eckenrode said in Intricacies of NPP Regex negative lookahead:

    Is it all about using a negative lookahead here, meaning the question you are asking?

    I want to match ONLY from the LAST <td class="fiddlesticks_ to the end of the data

    If we key in on that, doesn’t this get you there?:

    .*<td class="fiddlesticks_

    Well it gets you to the last occurrence of the above, and you can add to it to get to the “end of the data”.
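    A minimal sketch of the greedy .* idea, using Python's re module as a stand-in for Notepad++'s Boost engine (an assumption: the two engines differ in places, but greedy backtracking behaves the same way here). The capturing group and the variable names are hypothetical and only there to show where the last occurrence begins:

```python
import re

# Sample data from the first post (one single line, no newlines).
html = (
    '<tr trow="1"><td class="fiddlesticks_a1bc"><span>here is some text</span></td></tr>'
    '<tr trow="2"><td class="fiddlesticks_de2f"><span>another string</span></td></tr>'
    '<tr trow="3"><td class="fiddlesticks_g-hi"><span>miscellaneous data</span></td></tr>'
    '<tr trow="4"><td class="fiddlesticks_jk-l"><span>blah blah blah blah</span></td></tr>'
)

# The greedy .* first swallows the whole line, then backtracks until the
# literal that follows it can match, which happens at the LAST occurrence.
m = re.search(r'.*(<td class="fiddlesticks_.*)', html)
last_cell = m.group(1)
print(last_cell)
# -> <td class="fiddlesticks_jk-l"><span>blah blah blah blah</span></td></tr>
```

    Note that the overall match still starts at the beginning of the line; only the group isolates the last cell.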

    But if you really want/need the negative lookahead, it will have to be someone else, as your examples (and your need) are confusing me. :-)



  • Hello, @m-andre-z-eckenrode, @ekopalypse, @peterjones, @alan-kilborn and All,

    Ah… interesting problem ! So, Andre, you’ve tried these 7 regular expressions, listed below, without finding a way to only match from the last <td class="fiddlesticks_ string till the very end of file :-(

    (A1) (?!<tr trow="\d">)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>.+?</span></td></tr>\z
    (A2) <td class="fiddlesticks_[A-Za-z0-9-]+">(?!<tr trow="\d">)<span>.+?</span></td></tr>\z
    (A3) <td class="fiddlesticks_[A-Za-z0-9-]+"><span>(?!<tr trow="\d">).+?</span></td></tr>\z
    (A4) <td class="fiddlesticks_[A-Za-z0-9-]+"><span>(?!<td class="fiddlesticks_).+?</span></td></tr>\z
    (A5) <td class="fiddlesticks_[A-Za-z0-9-]+"><span>(?!here is some text).+?</span></td></tr>\z
    (A6) <td class="fiddlesticks_[A-Za-z0-9-]+"><span>(?!another string).+?</span></td></tr>\z
    (A7) <td class="fiddlesticks_[A-Za-z0-9-]+">(?!miscellaneous data)<span>.+?</span></td></tr>\z
    

    Indeed :

    • Regex A5 matches anything from <td class="fiddlesticks_de2f"> till the very end of file

    • All the other regexes match from <td class="fiddlesticks_a1bc"> till the very end of file


    Before providing some solutions to your problem, I’m going to explain, first, what all these regexes match and why !

    • First note that all your regexes end with </span></td></tr>\z. So, whatever is matched so far, it must go on matching… till the string </span></td></tr> at the very end of file

    • Regarding the regex A1 :

      • It first looks for a string <td class="fiddlesticks_[A-Za-z0-9-]+"> and, of course, when the caret is right before the string <td class="fiddlesticks_a1bc">, the negative look-ahead (?!<tr trow="\d">) is necessarily true. So, it matches, so far, the <td class="fiddlesticks_a1bc">

      • Now, the regex engine tries to process the remaining part <span>.+?</span></td></tr>\z. You could say : as there is a lazy quantifier, it will match from <span> to its corresponding </span>. Not at all !

      • All of us, we must remember that the fundamental characteristic of regex engines is that they try, BY ALL MEANS, to match something. So, when you get a message Find: Can't find text "something" you’re absolutely sure that all possibilities and alternatives, if any, have been tried and that the overall regex cannot find a solution !

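      • As the overall regex must be anchored to the very end of file, the regex <span>.+?</span></td></tr>\z matches the first <span>. A lazy quantifier only tries the shortest candidate first : each time the rest of the regex fails, it extends by one more character. So .+? matches the shortest range of text till … the last </span></td></tr> at the very end of file !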

    • Regarding the regex A2 :

      • The first part <td class="fiddlesticks_[A-Za-z0-9-]+"> matches the <td class="fiddlesticks_a1bc"> string

      • Then, the part (?!<tr trow="\d">)<span> matches <span> which is obviously different from any <tr trow="#"> string

      • And the remaining part .+?</span></td></tr>\z matches, as above the shortest range of text till the last </span></td></tr> at the very end of file

    • Regarding the regex A3 :

      • It is almost identical to the regex A2, except that the look-ahead is tested right after the first <span> string, where there’s no <tr trow="#"> string either !

    • Regarding the regex A4 :

      • It’s a variant of the regex A3 as, right after the first <span> string, there’s no string <td class="fiddlesticks_ at all !

      • So, again the remaining part .+?</span></td></tr>\z matches all text after the first <span> till the very end of file

    • Regarding the regex A6 ( I’ll speak of regex A5 later ) :

      • Again, the part <td class="fiddlesticks_[A-Za-z0-9-]+"><span> matches the string <td class="fiddlesticks_a1bc"><span>

      • The negative look-ahead (?!another string) is necessarily true when the caret is after the first <span> string

      • And the final part .+?</span></td></tr>\z matches, as said above, the shortest range of characters till it reaches the very end of file !

    • Regarding the regex A7 :

      • First, the regex <td class="fiddlesticks_[A-Za-z0-9-]+"> matches the string <td class="fiddlesticks_a1bc">

      • Then, the part (?!miscellaneous data)<span> matches <span>, which is, of course, different from the miscellaneous data string

      • Finally, the part .+?</span></td></tr>\z matches the shortest range of characters … till it reaches the very end of file !

    • Regarding the regex A5 :

      • The regex <td class="fiddlesticks_[A-Za-z0-9-]+"><span> matches the string <td class="fiddlesticks_a1bc"><span>, first

      • Then, the negative look-ahead (?!here is some text) is evaluated. As this text does follow the <span> string, the look-ahead is false

      • So, the regex engine goes on, finding the string <td class="fiddlesticks_de2f"><span> which is matched by the part <td class="fiddlesticks_[A-Za-z0-9-]+"><span>

      • Again, the negative look-ahead (?!here is some text) is tested. As the present text is another string, which is different from the here is some text string, the negative look-ahead is true

      • So, the final part .+?</span></td></tr>\z matches all the remaining of the text, till the very end of file
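    The walk-through above can be reproduced with a short sketch. It uses Python's re module as a stand-in for Notepad++'s Boost engine, which is an assumption: Python spells the end-of-text anchor \Z where Boost uses \z, and the helper names (TD, TAIL, start_class) are hypothetical. For these seven regexes both engines should follow the same backtracking logic:

```python
import re

html = (
    '<tr trow="1"><td class="fiddlesticks_a1bc"><span>here is some text</span></td></tr>'
    '<tr trow="2"><td class="fiddlesticks_de2f"><span>another string</span></td></tr>'
    '<tr trow="3"><td class="fiddlesticks_g-hi"><span>miscellaneous data</span></td></tr>'
    '<tr trow="4"><td class="fiddlesticks_jk-l"><span>blah blah blah blah</span></td></tr>'
)

TD = r'<td class="fiddlesticks_[A-Za-z0-9-]+">'
TAIL = r'.+?</span></td></tr>\Z'  # \Z is Python's spelling of Boost's \z

attempts = {
    "A1": r'(?!<tr trow="\d">)' + TD + r'<span>' + TAIL,
    "A2": TD + r'(?!<tr trow="\d">)<span>' + TAIL,
    "A3": TD + r'<span>(?!<tr trow="\d">)' + TAIL,
    "A4": TD + r'<span>(?!<td class="fiddlesticks_)' + TAIL,
    "A5": TD + r'<span>(?!here is some text)' + TAIL,
    "A6": TD + r'<span>(?!another string)' + TAIL,
    "A7": TD + r'(?!miscellaneous data)<span>' + TAIL,
}

def start_class(pattern):
    """Return the class suffix of the <td> where the match begins."""
    m = re.search(pattern, html)
    return m.group(0)[len('<td class="fiddlesticks_'):].split('"', 1)[0]

starts = {name: start_class(p) for name, p in attempts.items()}
for name, suffix in starts.items():
    print(name, "starts at fiddlesticks_" + suffix)
# Only A5 starts at de2f; every other attempt starts at a1bc.
```

    In each case the lazy .+? ends up stretching to the end of the text, because nothing shorter lets the anchored tail succeed.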


    Now, here is a solution, using your different look-ahead syntaxes (?!here is some text), (?!another string), and (?!miscellaneous data)

    Let’s consider the four regexes below :

    (G1) (?-si)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>.+?</span></td></tr>\z
    (G2) (?-si)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>(?!here is some text).+?</span></td></tr>\z
    (G3) (?-si)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>(?!here is some text)(?!another string).+?</span></td></tr>\z
    (G4) (?-si)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>(?!here is some text)(?!another string)(?!miscellaneous data).+?</span></td></tr>\z
    
    • These 4 regexes will catch the text from the first, second, third and fourth <td class="fiddlesticks_ string, respectively, till the very end of file !

    • For instance, in regex G3, in order to match the string <td class="fiddlesticks_g-hi"><span>miscellaneous data</span> we need that, after the <span> string, the text is different from here is some text AND different from another string !

    • Note that when multiple consecutive look-aheads are evaluated, the working position of the regex engine does not change : it’s the location right after any <span> string
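    A small sketch of the G1 to G4 regexes, again in Python's re module as an assumed stand-in for Boost. The leading (?-si) is dropped, since Python is case-sensitive and keeps . off newlines by default, and \z becomes \Z. The names HEAD, TAIL and start_class are hypothetical:

```python
import re

html = (
    '<tr trow="1"><td class="fiddlesticks_a1bc"><span>here is some text</span></td></tr>'
    '<tr trow="2"><td class="fiddlesticks_de2f"><span>another string</span></td></tr>'
    '<tr trow="3"><td class="fiddlesticks_g-hi"><span>miscellaneous data</span></td></tr>'
    '<tr trow="4"><td class="fiddlesticks_jk-l"><span>blah blah blah blah</span></td></tr>'
)

HEAD = r'<td class="fiddlesticks_[A-Za-z0-9-]+"><span>'
TAIL = r'.+?</span></td></tr>\Z'

# Each extra look-ahead is tested at the SAME position (right after <span>),
# so every added condition rules out one more starting cell.
regexes = {
    "G1": HEAD + TAIL,
    "G2": HEAD + r'(?!here is some text)' + TAIL,
    "G3": HEAD + r'(?!here is some text)(?!another string)' + TAIL,
    "G4": HEAD + r'(?!here is some text)(?!another string)(?!miscellaneous data)' + TAIL,
}

def start_class(pattern):
    """Return the class suffix of the <td> where the match begins."""
    m = re.search(pattern, html)
    return m.group(0)[len('<td class="fiddlesticks_'):].split('"', 1)[0]

starts = {name: start_class(p) for name, p in regexes.items()}
print(starts)
# -> {'G1': 'a1bc', 'G2': 'de2f', 'G3': 'g-hi', 'G4': 'jk-l'}
```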


    As you can see, if you always want to get the last range, near the end of file, it would be difficult to generalize, as you would be forced to add as many negative look-aheads as there are strings to avoid :-((

    So, the correct solution is to find :

    • First, a string <td class="fiddlesticks_••••"><span>

    • Then, any range of text which does not contain, for instance, the string <td at any location of that range, till the very end of file

    • These conditions can be achieved by the regex (G5) : (?-si)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>((?!<td).)+\z

    • Note the leading modifiers (?-si) which force :

      • The . meta-character to match standard characters only ( not EOL ones )

      • The search to be processed in a case-sensitive way

    • As you can see, the part ((?!<td).)+ matches any standard character, if, at each position, the string <td cannot be found, between a <span> string and the very end of file !

    • Whereas the regex G6 : (?-si)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>(?!<td).+\z would just match like most of your regexes ! Indeed, in that case, the negative look-ahead is tested right after a <span> string, only
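    The difference between G5 and G6 can be checked with a minimal sketch, using Python's re module as an assumed stand-in for Boost (\Z replaces \z, and the (?-si) prefix is dropped because it matches Python's defaults). The variable names are hypothetical:

```python
import re

html = (
    '<tr trow="1"><td class="fiddlesticks_a1bc"><span>here is some text</span></td></tr>'
    '<tr trow="2"><td class="fiddlesticks_de2f"><span>another string</span></td></tr>'
    '<tr trow="3"><td class="fiddlesticks_g-hi"><span>miscellaneous data</span></td></tr>'
    '<tr trow="4"><td class="fiddlesticks_jk-l"><span>blah blah blah blah</span></td></tr>'
)

HEAD = r'<td class="fiddlesticks_[A-Za-z0-9-]+"><span>'

# G5: the look-ahead sits INSIDE the repeated group, so it is re-tested
# before every single character that gets consumed.
g5 = re.search(HEAD + r'((?!<td).)+\Z', html).group(0)

# G6: the look-ahead is tested ONCE, right after <span>; the greedy .+ is
# then free to run over any later <td.
g6 = re.search(HEAD + r'(?!<td).+\Z', html).group(0)

print(g5)                 # only the last cell, up to the end of the text
print(g6 == html[13:])    # True: everything after the opening <tr trow="1">
```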


    Now, I could have chosen one of the following regexes, instead of regex G5 :

     (G7) (?-si)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>((?!class).)+\z
     (G8) (?-si)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>((?!fiddlesticks_).)+\z
     (G9) (?-si)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>((?!fidd).)+\z
    (G10) (?-si)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>((?!<t).)+\z
    (G11) (?-si)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>((?!_).)+\z
    

    All of them, like the regex G5, would catch the last zone <td class="fiddlesticks_ till the very end of file ;-))

    In other words, within the negative look-ahead, you must add an expression which :

    • Does occur, before the last <td class="fiddlesticks_jk-l">. So, it’s a forbidden expression when met in these locations, and any match starting earlier is rejected

    • Does not occur, after the last <td class="fiddlesticks_jk-l"> till the very end of file. So, the negative look-ahead is always true
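    As a rough check of these two conditions, the sketch below tries the forbidden strings of G5 and G7 to G11 in turn. Python's re module is assumed as a stand-in for Boost (\Z instead of \z), and the names tokens and results are hypothetical:

```python
import re

html = (
    '<tr trow="1"><td class="fiddlesticks_a1bc"><span>here is some text</span></td></tr>'
    '<tr trow="2"><td class="fiddlesticks_de2f"><span>another string</span></td></tr>'
    '<tr trow="3"><td class="fiddlesticks_g-hi"><span>miscellaneous data</span></td></tr>'
    '<tr trow="4"><td class="fiddlesticks_jk-l"><span>blah blah blah blah</span></td></tr>'
)

HEAD = r'<td class="fiddlesticks_[A-Za-z0-9-]+"><span>'
last_cell = '<td class="fiddlesticks_jk-l"><span>blah blah blah blah</span></td></tr>'

# Forbidden strings of G5, G7, G8, G9, G10 and G11, in that order.
tokens = ['<td', 'class', 'fiddlesticks_', 'fidd', '<t', '_']

results = {}
for token in tokens:
    pattern = HEAD + r'((?!' + re.escape(token) + r').)+\Z'
    results[token] = re.search(pattern, html).group(0)

# None of the tokens occurs after the last <span>, so each look-ahead stays
# true all the way to the end and all six regexes select the same text.
print(all(found == last_cell for found in results.values()))  # True
```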

    Best Regards

    guy038



  • @guy038

    A nice treatise.
    Was anything learned specific to N++'s treatment of negative lookahead?
    It didn’t seem like it to me.
    It’s okay, though, I’m not complaining in any way.



  • @Alan-Kilborn said in Intricacies of NPP Regex negative lookahead:

    I want to match ONLY from the LAST <td class="fiddlesticks_ to the end of the data

    If we key in on that, doesn’t this get you there?:

    .*<td class="fiddlesticks_

    Actually, I could probably use that in some circumstances, depending on the specific operation I’m trying to perform (which I didn’t go into in my post). Thanks for the suggestion — I’ll keep it in mind. But strictly speaking, it doesn’t do what I was trying to do, which was to match only a limited substring of the whole thing.

    @guy038 said in Intricacies of NPP Regex negative lookahead:

    Before providing some solutions to your problem, I’m going to explain, first, what all these regexes match and why !

    As always, you’ve raised the bar with your thorough analysis and explanation. Thank you, sir! So often, I find myself coming up with non-working code, and while I certainly want to find the code that DOES work, I also have an innate desire to understand why my previous attempts didn’t do what I expected.

    • These conditions can be achieved by the regex (G5) : (?-si)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>((?!<td).)+\z

    And that’s the solution I was looking for! Thanks again! (Though I don’t think the (?-s) is necessary for the code I’m working on, since there are no EOL to be found.)

    @Alan-Kilborn said in Intricacies of NPP Regex negative lookahead:

    Was anything learned specific to N++’s treatment of negative lookahead?
    It didn’t seem like it to me.

    Perhaps not, but please consider that from my point of view when I wrote my initial post, and perhaps those of at least some of the others when posting similar regex questions in these forums, I honestly didn’t know if it was my attempts at negative lookahead that were inadequate, or the documentation (as turned out to be the case with my recent question about \` failing to match the end of file), or that regex functionality just wasn’t working.



  • Hello, @m-andre-z-eckenrode, @alan-kilborn and All,

    @m-andre-z-eckenrode :

    You’re right about the (?-s) modifier. It’s not necessary. So the final regex G5 would be :
    (?-i)<td class="fiddlesticks_[A-Za-z0-9-]+"><span>((?!<td).)+\z

    I also suppose that you’ve understood why, in regexes G3 and G4, we must use consecutive look-aheads ! Indeed, the text has to be different from Condition_1 AND Condition_2 AND Condition_3 … AND Condition_N

    Note that the single negative look-ahead (?!Condition_1|Condition_2|Condition_3) is equivalent to the consecutive look-aheads (?!Condition_1)(?!Condition_2)(?!Condition_3) : it fails as soon as any one of the alternatives matches at the current position, so it is true only if the text is different from Condition_1 AND Condition_2 AND Condition_3 ;-))

    The positive look-ahead (?=Condition_1|Condition_2|Condition_3), on the other hand, validates the overall regex if, at a particular position, the condition Condition_1 OR Condition_2 OR Condition_3 is true !

    Note that successive positive look-aheads, as (?=Condition_1)(?=Condition_2)(?=Condition_3), only make sense when all the conditions can be true at the same position. With mutually exclusive literal strings, it’s impossible to satisfy N conditions at the same time !

    For instance, the regex \d+(?=ABC)(?=DEF)(?=GHI), against the text below, will never match anything !

    012345ABC
    012345DEF
    012345GHI
    012345___
    012345012
    

    Of course, we could cheat a bit with the regex \d+(?=ABC)(?=\u+)(?=\w+), but the last two look-aheads are rather superfluous ! Test also the regexes \d+(?=\u+)(?=\w+) and \d+(?=\w+) against the sample
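    These four regexes can be tried against the sample with a short sketch. It assumes Python's re module as a stand-in for Boost; since Python has no \u upper-case class, [A-Z] is swapped in for it. The variable names are hypothetical:

```python
import re

sample = "012345ABC\n012345DEF\n012345GHI\n012345___\n012345012\n"

# Three look-aheads demanding three different literals at one position:
# they can never all be true, so nothing is found.
impossible = re.findall(r'\d+(?=ABC)(?=DEF)(?=GHI)', sample)

# The "cheat": the 2nd and 3rd look-aheads are implied by (?=ABC).
cheat = re.findall(r'\d+(?=ABC)(?=[A-Z]+)(?=\w+)', sample)

# Compatible conditions: an upper-case letter is also a word character.
upper_and_word = re.findall(r'\d+(?=[A-Z]+)(?=\w+)', sample)

# A word character alone also accepts _ and, after backtracking, a digit.
word_only = re.findall(r'\d+(?=\w+)', sample)

print(impossible)      # []
print(cheat)           # ['012345']
print(upper_and_word)  # ['012345', '012345', '012345']
print(word_only)       # ['012345', '012345', '012345', '012345', '01234501']
```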

    @alan-kilborn :

    No, as you said, Alan, I haven’t learned anything different from what I already knew about look-arounds, in this specific example !

    But I take advantage of this post to speak of 4 new tricks :


    • ( 1A ) A look-ahead can be located before its related expression :

    For instance, the regex (?i-s)(?=....456)Guy would match the string Guy, whatever its case, if it’s followed with a space char and the string 456

    • ( 1B ) Similarly, a look-behind can be placed, after its related expression :

    For instance, the regex (?i-s)Guy(?<=123....) would match the string Guy, whatever its case, if it’s preceded with the string 123 and a space char

    Test them against this line 123 Guy 456. Unfortunately, it doesn’t help to make look-behind, with variable size strings, functional :-( We still need the \K feature
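    Tricks 1A and 1B can be tried with the minimal sketch below, assuming Python's re module as a stand-in for Boost. Python does not accept a global (?i-s) prefix, so a plain (?i) is used, and its look-behind must be fixed-width, which 123.... is. The variable names are hypothetical:

```python
import re

line = "123 Guy 456"

# 1A: the look-ahead comes BEFORE Guy, so its four dots overlap "Guy "
# and 456 is checked beyond that.
ahead_first = re.search(r'(?i)(?=....456)Guy', line)

# 1B: the look-behind comes AFTER Guy, so its four dots overlap " Guy"
# and 123 is checked before that.
behind_last = re.search(r'(?i)Guy(?<=123....)', line)

print(ahead_first.group(0), ahead_first.span())   # Guy (4, 7)
print(behind_last.group(0), behind_last.span())   # Guy (4, 7)

# Counter-examples: the surrounding text no longer satisfies the look-arounds.
no_456 = re.search(r'(?i)(?=....456)Guy', "123 Guy 789")
no_123 = re.search(r'(?i)Guy(?<=123....)', "999 Guy 456")
print(no_456, no_123)                             # None None
```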


    • ( 2 ) A regex can contain part(s) with free-spacing mode and part(s) with normal mode, mixed all together !

    For instance, the regex (?-si)A.+B(?x) .+ C (?-x).+D.+E(?x) ( 1 | 2 | 3 | 4 |5 ) + 6 78 9 0(?-x)6 78 9 0 would match within the string 12345A12345B12345C12345D12345E12345678906 78 9 0ABCD

    This particularity is interesting if you want to highlight a difficult part of the regex, either syntactically and/or functionally. For instance, let’s imagine that we want to match two ranges of digits, separated with a / then followed with a space char and a range of upper-case letters, which must not contain the exact string ABC.

    A typical syntax would be (?-si)\d+/\d+ ((?!ABC)\u)+. But it could also be expressed as (?-si)\d+/\d+ (?x) ( (?!ABC) \u )+ to show that, before each uppercase letter found, the string ABC must not be matched ! Test it against this string 12345/67890 FSDGOUZERTOABCROTFOERTFGCV 12/34 FSDGOUZERTOXYZROTFOERTFGCV 1/0 ZZABCZZZ
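    A sketch of the same idea in Python's re module, which is only an approximation of the Boost behaviour described here: Python cannot switch (?x) on in the middle of a pattern, so the scoped group (?x: … ) is used as a stand-in, and [A-Z] replaces \u. The helper name all_matches is hypothetical:

```python
import re

text = ("12345/67890 FSDGOUZERTOABCROTFOERTFGCV "
        "12/34 FSDGOUZERTOXYZROTFOERTFGCV 1/0 ZZABCZZZ")

# Compact form: digits, slash, digits, ONE literal space, then upper-case
# letters, each of which must not be the first letter of ABC.
compact = r'\d+/\d+ ((?!ABC)[A-Z])+'

# Same regex with the tricky part in free-spacing mode. The space before
# (?x: is outside the group, so it is still a literal space.
spaced = r'\d+/\d+ (?x: ( (?!ABC) [A-Z] )+ )'

def all_matches(pattern):
    """Return the full text of every match (group 0) in the sample."""
    return [m.group(0) for m in re.finditer(pattern, text)]

compact_hits = all_matches(compact)
spaced_hits = all_matches(spaced)
print(compact_hits)
# -> ['12345/67890 FSDGOUZERTO', '12/34 FSDGOUZERTOXYZROTFOERTFGCV', '1/0 ZZ']
print(compact_hits == spaced_hits)  # True
```

    Each match stops just before an ABC, which is why the first and third ranges are cut short.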


    • ( 3 ) Most of us ( and me, too ! ) think that the { and } symbols are regex meta-characters. Not at all !. For instance, all the regexes, below, are functional :

      • 1{A}3

      • A{----}----Z

      • 12345{}67890

      • 1{2

      • 123}456

      • {}

      • {{}}

      • 1}2{3

      • a{-3 }

    and match one or two of the lines below :

    12345{}67890
    1{A}3
    A{----}----Z
    1{2
    123}456
    1{ a }3
    {}
    {{}}
    1}2{3
    a{-3   }
    
    • However, when there’s a digit, after the opening brace {, and possibly, a space char, this symbol needs to be escaped. For instance, the regexes :

      • 1\{ 2 }3 matches the string 1{ 2 }3

      • 1\{2}3 matches the string 1{2}3
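    A few of these brace cases can be checked with the sketch below. It runs on Python's re module, which is an assumption: Python happens to agree with the behaviour described here for these particular patterns, but it is not guaranteed to treat every brace pattern in the list (those with spaces, for instance) the way Boost does. The variable names are hypothetical:

```python
import re

# Digit directly inside braces: {2} is a quantifier, so 1{2}3 means "113".
quantifier_hit = re.fullmatch(r'1{2}3', '113')
quantifier_miss = re.fullmatch(r'1{2}3', '1{2}3')

# Escaping only the opening brace makes the whole thing literal.
escaped_hit = re.fullmatch(r'1\{2}3', '1{2}3')

# No valid quantifier can be read here, so the braces are plain characters.
unclosed_hit = re.fullmatch(r'1{2', '1{2')
letter_hit = re.fullmatch(r'1{A}3', '1{A}3')

print(bool(quantifier_hit), quantifier_miss)                  # True None
print(bool(escaped_hit), bool(unclosed_hit), bool(letter_hit))  # True True True
```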


    • ( 4 ) When a replacement zone contains space characters beginning and/or ending the field, you may surround the overall replacement with parentheses !

    For instance the regex S/R :

    SEARCH    (?-i)DEF
    
    REPLACE   (     $0     )
    
    would change the string "ABCDEFGHI"  into the string "ABC     DEF     GHI"
    

    Adding the brackets ( and ) helps us to easily visualize the replacement zone ;-))

    Best Regards,

    guy038



  • @guy038

    1A, 1B, and 2 are not that interesting, maybe because they are not very surprising! :-)


    3 however, is a bit interesting. To restate it here:

    when there’s a digit, after the opening brace {, and possibly, a space char, this symbol needs to be escaped.

    It is interesting, that you said that 1{2 is “functional”, even though it doesn’t meet the criterion. Perhaps it would need the } to make it require the escape on the { ?

    I suppose that it sometimes needs the escape so that it isn’t confused with a usage like: j{2} – for a match of two j characters (although that is a contrived usage since it is shorter to simply use jj).


    4 is also somewhat interesting. Restating it:

    When a replacement zone contains space characters beginning and/or ending the field, you may surround the overall replacement with parentheses

    I’ve sort of always taken it as a given that “if you want literal ( or ) to appear in your replacement, use \( and \)”.

    But I haven’t thought too much about using them unescaped without a real need, such as in (?1x:y) or some other known constructs.

    Do you think there is a good reason that ( or ) in the replace field just can’t be literalized when used without additional syntax (much like the { or } seems to be in your point 3 above)?

    It would save a bit of time, as the usual route is to not pay extreme attention to what you’re doing when you want literals in the replacement, and you do your operation, and the unescaped ( or ) do not appear, and you think “darn it, I forgot to escape them”, and then you undo your replacement, add the \ to the replace field, and re-execute the replacement.



  • Hi, @alan-kilborn,

    Regarding the regex syntax 1{2, this is considered as a pure literal expression, which correctly matches the 1{2 string

    But if you want to match the literal string 1{2}, as this syntax has the regex meaning “two consecutive digits 1”, we need to escape the opening brace { only ( so 1\{2} ) to get a literal expression !


    As defined here

    All characters are treated as literals, except for characters $, \, (, ), ?, and :

    If you want to write the $ ? and the : characters, literally, you do not need, most of the time, to escape them because they are usually found outside their meaning context !

    However, the three characters (, ) and \ must always be escaped, in the replacement zone, in order to be written literally !

    Parentheses are normally used for lexical grouping in conditional expressions, with these syntaxes :

    • (?DigitTrue_Exp) or (?{Digit}True_Exp) or (?{Name}True_Exp)

    • (?DigitTrue_Exp:False_Exp) or (?{Digit}True_Exp:False_Exp) or (?{Name}True_Exp:False_Exp)


    Apart from these cases, these two parentheses seem to just represent a pure empty string !

    For instance :

    SEARCH DEF

    REPLACE 123(456 or REPLACE 123()456

    would change the string ABCDEFGHI into ABC123456GHI

    And the S/R :

    SEARCH :    DEF
    
    REPLACE :   123(   (((XYZ)OP(QRS)TUV   ())   )789
    
    would change the string  "ABCDEFGHI"   into  "ABC123   XYZOPQRSTUV      789GHI"
    

    Thus, the S/R :

    SEARCH DEF

    REPLACE ()

    would change the string ABCDEFGHI into ABCGHI ! In other words, the () syntax, in the replacement zone, seems to be a synonym of an empty string ;-)

    However, note that :

    SEARCH DEF

    REPLACE 123)456 or REPLACE 123)456(789

    would change the string ABCDEFGHI into ABC123GHI only !


    Now, placing some replacement meta-characters, inside parentheses, does not make them literal and they keep their normal behavior :

    For instance, the regex S/R :

    SEARCH (DEF)|XYZ

    REPLACE ---(123(?1TRUE:FALSE)456\\789)---

    would change the string ABCDEFGHI ABCXYZGHI into ABC---123TRUE456\789---GHI ABC---123FALSE456\789---GHI


    Finally, the only practical application I found of using parentheses, is when you want to delimit a string beginning and/or ending with space characters !

    Best Regards,

    guy038