Regex with Negative Lookahead
-
I’m trying to write a regular expression to find either <section class=“lyrics”> or <section id=“corpus”>, followed optionally by a carriage return & line feed, then not followed by <div class=“audio”>.
This is what I’m trying, but it doesn’t work:
(<section class=“lyrics”>|<section id=“corpus”>)(\r\n)?(?!<div class=“audio”>)
It correctly matches this:
<section class=“lyrics”>
<h1 class=“screen-reader-only”>Lyrics</h1>
but incorrectly matches this:
<section class=“lyrics”>
<div class=“audio”>
Can anyone see what I’m doing wrong?
Here’s the Notepad++ debug info:
Notepad++ v8.4.5 (64-bit)
Build time : Sep 3 2022 - 04:05:32
Path : C:\Program Files\Notepad++\notepad++.exe
Command Line : “C:\Users\Dick\Documents\tch\site\htm\h\a\d\i\hadiwing.htm”
Admin mode : OFF
Local Conf mode : OFF
Cloud Config : OFF
OS Name : Windows 10 Home (64-bit)
OS Version : 21H2
OS Build : 19044.2006
Current ANSI codepage : 1252
Plugins :
mimeTools (2.8)
NppConverter (4.4)
NppExport (0.4)
-
@Sylvester-Bullitt said in Regex with Negative Lookahead:
Can anyone see what I’m doing wrong?
You mean besides annoying the reader by not posting your data inside grey code boxes (per the FAQ) and thus introducing curly double quotes into your data, so that someone has to adjust it before they can play with it? :-)
(<section class="lyrics">|<section id="corpus">)(\r\n)?(?!<div class="audio">)
<section class="lyrics">
<h1 class="screen-reader-only">Lyrics</h1>

<section class="lyrics">
<div class="audio">
Or maybe you really do have curly quotes…we have no way of knowing.
Regardless, I think the problem is that by making the \r\n optional, it is going with the easiest option and not matching it, and thus the negative lookahead is comparing against the \r\n and since that isn’t <div class…, you’ve got a match.
If you add a + after the ?, you turn it into a possessive quantifier that will snag the \r\n and won’t give it back.
Thus, I’d try:
(<section class="lyrics">|<section id="corpus">)(\r\n)?+(?!<div class="audio">)
Or, I could be wrong; but that seemed to work for me in some quick experimentation.
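For anyone who wants to see the backtracking outside Notepad++, here is a minimal sketch in Python. It assumes straight quotes and Windows (CRLF) line endings; the variable names are only illustrative, and Python's `re` engine is not the Boost engine Notepad++ uses, although it behaves the same way for this particular pattern.

```python
import re

# Original pattern from the question, with straight quotes.
ORIGINAL = re.compile(
    r'(<section class="lyrics">|<section id="corpus">)(\r\n)?(?!<div class="audio">)'
)

# Sample that should NOT match, using Windows (CRLF) line endings.
unwanted = '<section class="lyrics">\r\n<div class="audio">'

m = ORIGINAL.search(unwanted)

# The engine gives the optional line break back, so group 2 is unset and the
# lookahead is tested against "\r\n<div ...", which does not start with "<div".
print(repr(m.group(0)))  # '<section class="lyrics">'
print(m.group(2))        # None
```

The match stops right after the section tag, which shows that the optional line break ended up matching nothing.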
-
Sorry, I looked at the “read this before posting” section, but didn’t realize the FAQ had additional instructions.
Your solution works, but I didn’t understand why. This is the first time I’ve run across “possessive quantifiers.” Thanks for the quick assist!
-
@Sylvester-Bullitt said in Regex with Negative Lookahead:
This is the first time I’ve run across “possessive quantifiers.”
follow this link to read about “possessive quantifiers” in the multiplying-operators of regular expressions section of the NPP Online User Manual
-
Hi, @sylvester-bullitt , @alan-kilborn, @peterjones and All,
I’ve got the solution ! Give me some minutes to write a decent post !
BR
guy038
P.S… :
Oh, I did not notice the clever solution of @alan-kilborn, using a possessive modifier for the line-break in order to cancel any backtracking attempt, which would split the line-break in two zones \r and \n
-
Hello, @sylvester-bullitt, @alan-kilborn, @peterjones and All,
First of all, I suppose that the Alan solution can be simplified as :
SEARCH
(<section class="lyrics">|<section id="corpus">)\R?+(?!<div class="audio">)
FYI, @sylvester-bullitt, the \R syntax, in a search regex, represents any kind of line-break (\r\n, \n or \r) and a few other ones !
In addition, IF some characters may exist between the line-break or the first part AND the <div class="audio"> string, you may use this regex :
SEARCH
(?-s)(<section class="lyrics">|<section id="corpus">)\R?+(?!.*<div class="audio">)
Now, it’s easy to understand the role of the possessive modifier +, placed after the quantifier ! When the possessive modifier is present, it prevents the regex engine from backtracking into the quantified part so that another attempt can succeed !
It’s very important to understand that regex engines search, BY ALL MEANS, for a match of the current regex against the INPUT text
Indeed, let’s consider the text :
<section class="lyrics">
<div class="audio">
And the reduced regex :
<section class="lyrics">\R?(?!<div class="audio">)
As we use the simple \R? syntax, backtracking is possible. So :
- First, the regex engine matches the <section class="lyrics"> part, followed by the line-break chars \r\n
- But as it’s followed by the <div class="audio"> string, it does not satisfy the negative look-ahead (?!<div class="audio">)
- Thus, the regex engine backtracks and chooses only \r as a line-break
- Then, the string \n + <div class="audio"> does satisfy the negative look-ahead (?!<div class="audio">), producing a wrong match !
Now, if we consider the regex :
<section class="lyrics">\R?+(?!<div class="audio">)
with a + sign after \R?
As we use, this time, a possessive quantifier, backtracking is not allowed. So :
- First, the regex engine matches the <section class="lyrics"> part, followed by the line-break chars \r\n
- But as it’s followed by the <div class="audio"> string, it does not satisfy the negative look-ahead (?!<div class="audio">)
- However, as no backtracking is possible, the regex engine does not have another way to get a positive match and the whole process fails, as expected
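The "no giving back" behaviour can be sketched in Python as well. Two assumptions to flag: Python's `re` has no \R, so the sketch spells the line break as \r\n, and possessive quantifiers only exist in Python 3.11 and later, so the sketch emulates one with the older trick of capturing inside a lookahead and then consuming the capture with a backreference. The names are illustrative only.

```python
import re

TAGS = r'(<section class="lyrics">|<section id="corpus">)'

# Emulated possessive line break: the lookahead captures the optional CRLF
# as group 2, then \2 consumes exactly that text with nothing to give back.
# On Python 3.11+ the same idea can be written directly as (?:\r\n)?+
POSSESSIVE = re.compile(TAGS + r'(?=((?:\r\n)?))\2(?!<div class="audio">)')

wanted = '<section class="lyrics">\r\n<h1 class="screen-reader-only">Lyrics</h1>'
unwanted = '<section class="lyrics">\r\n<div class="audio">'

wanted_match = POSSESSIVE.search(wanted)
unwanted_match = POSSESSIVE.search(unwanted)

print(repr(wanted_match.group(0)))  # '<section class="lyrics">\r\n'
print(unwanted_match)               # None
```

Once the line break has been consumed, the only position left for the lookahead is directly in front of the div, so the unwanted case fails as intended.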
Now, a second solution is possible :
- Regex A
(<section class="lyrics">|<section id="corpus">)(?!\R?<div class="audio">)
OR
- Regex B
(?-s)(<section class="lyrics">|<section id="corpus">)(?!\R?.*<div class="audio">)
if some chars exist before the <div class="audio"> string
You may test the regex A against the text below :
--- YES ---
<section class="lyrics">
<h1 class="screen-reader-only">Lyrics</h1>
<section class="lyrics"><h1 class="screen-reader-only">Lyrics</h1>
<section id="corpus">
<h1 class="screen-reader-only">Lyrics</h1>
<section id="corpus"><h1 class="screen-reader-only">Lyrics</h1>
--- NO ---
<section class="lyrics">
<div class="audio">
<section id="corpus">
<div class="audio">
<section class="lyrics"><div class="audio">
<section id="corpus"><div class="audio">
As well as the regex B against the text below :
--- YES ---
<section class="lyrics">
<h1 class="screen-reader-only">Lyrics</h1>
<section class="lyrics"><h1 class="screen-reader-only">Lyrics</h1>
<section id="corpus">
<h1 class="screen-reader-only">Lyrics</h1>
<section id="corpus"><h1 class="screen-reader-only">Lyrics</h1>
--- NO ---
<section class="lyrics">
<div class="audio">
<section class="lyrics">
12345<div class="audio">
<section id="corpus">
<div class="audio">
<section id="corpus">
12345<div class="audio">
<section class="lyrics"><div class="audio">
<section id="corpus"><div class="audio">
<section class="lyrics">12345<div class="audio">
<section id="corpus">12345<div class="audio">
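A rough Python check of the second approach, where the optional line break moves inside the negative look-ahead. These are assumptions, not the Notepad++ syntax: Python's `re` has no \R, so `(?:\r\n|\n|\r)?` stands in for \R? (covering only the three common line breaks), and `[^\r\n]*` stands in for `.*` under (?-s). All names are illustrative.

```python
import re

TAGS = r'(<section class="lyrics">|<section id="corpus">)'
# Stand-in for \R?, which Python's re does not support. It covers only
# \r\n, \n and \r, not the rarer line-break characters.
EOL = r'(?:\r\n|\n|\r)?'

# Regex A: the optional line break sits inside the negative lookahead.
REGEX_A = re.compile(TAGS + r'(?!' + EOL + r'<div class="audio">)')
# Regex B: also rejects when other characters on the same line precede the
# div. [^\r\n]* plays the role of .* under (?-s).
REGEX_B = re.compile(TAGS + r'(?!' + EOL + r'[^\r\n]*<div class="audio">)')

yes_text = '<section class="lyrics">\r\n<h1 class="screen-reader-only">Lyrics</h1>'
no_text = '<section id="corpus">\r\n<div class="audio">'
no_text_with_filler = '<section class="lyrics">\r\n12345<div class="audio">'

for name, pattern in (("A", REGEX_A), ("B", REGEX_B)):
    results = [bool(pattern.search(t))
               for t in (yes_text, no_text, no_text_with_filler)]
    print(name, results)
# A [True, False, True]
# B [True, False, False]
```

Because the line break is only tested inside the look-ahead and never consumed, there is nothing for the engine to give back, so no possessive quantifier is needed here. Regex A still matches the "12345" sample, which is why Regex B exists.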
Best Regards,
guy038
-
Thanks for the lucid explanation, Guy!
In case anyone is interested, we’re using this in a project to add embedded audio players to the pages at the Web site here.
The regular expression helps us find the pages which still need to be updated.