    Regex with Negative Lookahead

    • Sylvester Bullitt

      I’m trying to write a regular expression to find either <section class=“lyrics”> or <section id=“corpus”>, followed optionally by a carriage return & line feed, then not followed by <div class=“audio”>.

      This is what I’m trying, but it doesn’t work:

      (<section class=“lyrics”>|<section id=“corpus”>)(\r\n)?(?!<div class=“audio”>)

      It correctly matches this:

      <section class=“lyrics”>
      <h1 class=“screen-reader-only”>Lyrics</h1>

      but incorrectly matches this:

      <section class=“lyrics”>
      <div class=“audio”>

      Can anyone see what I’m doing wrong?

      Here’s the Notepad++ debug info:

      Notepad++ v8.4.5 (64-bit)
      Build time : Sep 3 2022 - 04:05:32
      Path : C:\Program Files\Notepad++\notepad++.exe
      Command Line : “C:\Users\Dick\Documents\tch\site\htm\h\a\d\i\hadiwing.htm”
      Admin mode : OFF
      Local Conf mode : OFF
      Cloud Config : OFF
      OS Name : Windows 10 Home (64-bit)
      OS Version : 21H2
      OS Build : 19044.2006
      Current ANSI codepage : 1252
      Plugins :
      mimeTools (2.8)
      NppConverter (4.4)
      NppExport (0.4)

      • Alan Kilborn @Sylvester Bullitt

        @Sylvester-Bullitt said in Regex with Negative Lookahead:

        Can anyone see what I’m doing wrong?

        You mean besides annoying the reader by not posting your data inside grey code boxes (per the FAQ) and thus introducing curly double quotes into your data, so that someone has to adjust it before they can play with it? :-)

        (<section class="lyrics">|<section id="corpus">)(\r\n)?(?!<div class="audio">)
        
        <section class="lyrics">
        <h1 class="screen-reader-only">Lyrics</h1>
        
        <section class="lyrics">
        <div class="audio">
        

        Or maybe you really do have curly quotes…we have no way of knowing.

        Regardless, I think the problem is that, because the \r\n is optional, the engine is free to backtrack and not match it. The negative lookahead is then compared against the \r\n itself, and since that isn’t <div class…, you’ve got a match.

        If you add a + after the ?, you turn it into a possessive quantifier that will snag the \r\n and won’t give it back.

        Thus, I’d try:
        (<section class="lyrics">|<section id="corpus">)(\r\n)?+(?!<div class="audio">)

        Or, I could be wrong; but that seemed to work for me in some quick experimentation.
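
The behaviour described above can be reproduced outside Notepad++. Here is a minimal sketch in Python, assuming its `re` module backtracks the same way as Notepad++'s Boost engine for this particular pattern (it is a different engine). The names `PATTERN`, `GOOD` and `BAD` are made up for illustration.

```python
import re

# Original pattern from the question, written with straight quotes.
PATTERN = re.compile(
    r'(<section class="lyrics">|<section id="corpus">)'
    r'(\r\n)?(?!<div class="audio">)'
)

GOOD = '<section class="lyrics">\r\n<h1 class="screen-reader-only">Lyrics</h1>\r\n'
BAD = '<section class="lyrics">\r\n<div class="audio">\r\n'

good = PATTERN.search(GOOD)
bad = PATTERN.search(BAD)

# GOOD: the optional \r\n is consumed, so group 2 is set.
print(repr(good.group(0)), repr(good.group(2)))

# BAD: the lookahead fails right after \r\n, the engine backtracks and
# leaves \r\n unmatched (group 2 is None); the lookahead is then tested
# against "\r\n<div ...", which is not "<div ...", so a match is reported.
print(repr(bad.group(0)), repr(bad.group(2)))
```

The unwanted match on `BAD` stops right after the `>` of the section tag, which is the sign that the optional line-break was given back.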

        • Sylvester Bullitt

          Sorry, I looked at the “read this before posting” section, but didn’t realize the FAQ had additional instructions.

          Your solution works, but I didn’t understand why. This is the first time I’ve run across “possessive quantifiers.” Thanks for the quick assist!

          • PeterJones @Sylvester Bullitt

            @Sylvester-Bullitt said in Regex with Negative Lookahead:

            This is the first time I’ve run across “possessive quantifiers.”

            You can read about “possessive quantifiers” in the multiplying-operators part of the regular expressions section of the NPP Online User Manual.

            • guy038

              Hi, @sylvester-bullitt , @alan-kilborn, @peterjones and All,

              I’ve got the solution! Give me a few minutes to write a decent post!

              BR

              guy038

              P.S.:

              Oh, I did not notice the clever solution of @alan-kilborn, which uses a possessive quantifier on the line-break in order to cancel any backtracking attempt that would give the line-break back.

              • guy038

                Hello, @sylvester-bullitt, @alan-kilborn, @peterjones and All,

                First of all, I suppose that Alan’s solution can be simplified as:

                SEARCH (<section class="lyrics">|<section id="corpus">)\R?+(?!<div class="audio">)

                FYI, @sylvester-bullitt, the \R syntax, in a search regex, represents any kind of line-break ( \r\n, \n or \r ) and a few others!

                In addition, IF some characters may exist between the line-break (or the first part) AND the <div class="audio"> string, you may use this regex:

                SEARCH (?-s)(<section class="lyrics">|<section id="corpus">)\R?+(?!.*<div class="audio">)


                Now, it’s easy to understand the role of the possessive modifier +, placed after the quantifier! When the possessive modifier is present, it prevents the regex engine from backtracking into that quantifier in order to let another attempt succeed!

                It’s very important to understand that regex engines search, BY ALL MEANS, for a match of the current regex against the INPUT text.

                Indeed, let’s consider the text :

                <section class="lyrics">
                <div class="audio">
                

                And the reduced regex :

                <section class="lyrics">\R?(?!<div class="audio">)

                As we use the simple \R? syntax, backtracking is possible. So :

                • First, the regex engine matches the <section class="lyrics"> part, followed by the line-break chars \r\n

                • But as it’s followed by the <div class="audio"> string, it does not satisfy the negative look-ahead (?!<div class="audio">)

                • Thus, the regex engine backtracks and gives back the line-break, in whole or in part ( \R? is allowed to match less, down to nothing )

                • Then, the text at the new position begins with what is left of the line-break, not with <div class="audio">, so it does satisfy the negative look-ahead (?!<div class="audio">), producing a wrong match!

                Now, if we consider the regex :

                <section class="lyrics">\R?+(?!<div class="audio">), with a + sign after \R?

                As we use, this time, a possessive quantifier, backtracking is not allowed. So :

                • First, the regex engine matches the <section class="lyrics"> part, followed by the line-break chars \r\n

                • But as it’s followed by the <div class="audio"> string, it does not satisfy the negative look-ahead (?!<div class="audio">)

                • However, as no backtracking is possible, the regex engine does not have another way to get a positive match and the whole process fails, as expected
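
The two walk-throughs above can be compared side by side in a small Python sketch. Two assumptions are mine and not from the thread: Python's `re` has no \R, so `EOL` below is a stand-in covering only \r\n, \n and \r; and since possessive quantifiers are only available in `re` from Python 3.11 on, the sketch emulates `\R?+` with the lookahead-plus-backreference idiom `(?=(X))\1`, which consumes whatever the lookahead captured and leaves nothing to backtrack into. Exactly how much of the line-break the greedy version gives back can differ between engines, but the wrong match is the same.

```python
import re

# Stand-in for \R (assumption: only \r\n, \n and \r matter here).
EOL = r'(?:\r\n|\n|\r)'
HEAD = r'<section class="lyrics">'
TAIL = r'(?!<div class="audio">)'

# Greedy version: the optional line-break can be given back.
GREEDY = re.compile(HEAD + EOL + r'?' + TAIL)

# Emulated possessive version: the lookahead captures the optional
# line-break once, then \1 consumes exactly that text.
POSSESSIVE = re.compile(HEAD + r'(?=(' + EOL + r'?))\1' + TAIL)

TEXT = '<section class="lyrics">\r\n<div class="audio">\r\n'
OK_TEXT = '<section class="lyrics">\r\n<h1 class="screen-reader-only">Lyrics</h1>\r\n'

print(GREEDY.search(TEXT) is not None)         # True: the wrong match
print(POSSESSIVE.search(TEXT) is None)         # True: no match, as wanted
print(POSSESSIVE.search(OK_TEXT) is not None)  # True: still matches here
```

On Python 3.11 or later, `EOL + r'?+'` should express the same thing directly.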


                Now, a second solution is possible :

                • Regex A (<section class="lyrics">|<section id="corpus">)(?!\R?<div class="audio">)

                OR

                • Regex B (?-s)(<section class="lyrics">|<section id="corpus">)(?!\R?.*<div class="audio">) if some chars exist before the <div class="audio"> string

                You may test the regex A against the text below :

                --- YES ---
                
                <section class="lyrics">
                <h1 class="screen-reader-only">Lyrics</h1>
                
                <section class="lyrics"><h1 class="screen-reader-only">Lyrics</h1>
                
                <section id="corpus">
                <h1 class="screen-reader-only">Lyrics</h1>
                
                <section id="corpus"><h1 class="screen-reader-only">Lyrics</h1>
                
                --- NO ---
                
                <section class="lyrics">
                <div class="audio">
                
                <section id="corpus">
                <div class="audio">
                
                <section class="lyrics"><div class="audio">
                
                <section id="corpus"><div class="audio">
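
For anyone who prefers to script the check, here is a hedged sketch that runs an equivalent of regex A over the YES and NO samples above in Python. It assumes Windows (\r\n) line endings in the samples and replaces \R, which Python's `re` lacks, with the stand-in `EOL`; the names `REGEX_A`, `YES` and `NO` are illustrative only.

```python
import re

EOL = r'(?:\r\n|\n|\r)'  # stand-in for \R (assumption)

# Regex A: the optional line-break sits INSIDE the negative lookahead,
# so no possessive quantifier is needed.
REGEX_A = re.compile(
    r'(<section class="lyrics">|<section id="corpus">)'
    r'(?!' + EOL + r'?<div class="audio">)'
)

H1 = '<h1 class="screen-reader-only">Lyrics</h1>'
DIV = '<div class="audio">'

YES = [
    '<section class="lyrics">\r\n' + H1,
    '<section class="lyrics">' + H1,
    '<section id="corpus">\r\n' + H1,
    '<section id="corpus">' + H1,
]
NO = [
    '<section class="lyrics">\r\n' + DIV,
    '<section id="corpus">\r\n' + DIV,
    '<section class="lyrics">' + DIV,
    '<section id="corpus">' + DIV,
]

for sample in YES:
    assert REGEX_A.search(sample), sample
for sample in NO:
    assert not REGEX_A.search(sample), sample
print("regex A agrees with the YES / NO lists")
```

Because the lookahead never consumes text, the match itself is always just the opening section tag.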
                

                As well as the regex B against the text below :

                --- YES ---
                
                <section class="lyrics">
                <h1 class="screen-reader-only">Lyrics</h1>
                
                <section class="lyrics"><h1 class="screen-reader-only">Lyrics</h1>
                
                <section id="corpus">
                <h1 class="screen-reader-only">Lyrics</h1>
                
                <section id="corpus"><h1 class="screen-reader-only">Lyrics</h1>
                
                --- NO ---
                
                <section class="lyrics">
                <div class="audio">
                
                <section class="lyrics">
                12345<div class="audio">
                
                <section id="corpus">
                <div class="audio">
                
                <section id="corpus">
                12345<div class="audio">
                
                <section class="lyrics"><div class="audio">
                
                <section id="corpus"><div class="audio">
                
                <section class="lyrics">12345<div class="audio">
                
                <section id="corpus">12345<div class="audio">
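
The difference between regex A and regex B shows up on the `12345` samples. The sketch below is an approximation in Python, under these assumptions of mine: `EOL` stands in for \R, and `[^\r\n]*` stands in for `.*` under `(?-s)`, because Python's dot would also match \r. `REGEX_A`, `REGEX_B` and `SAMPLE` are hypothetical names.

```python
import re

EOL = r'(?:\r\n|\n|\r)'  # stand-in for \R (assumption)
SECTION = r'(<section class="lyrics">|<section id="corpus">)'

# Regex A: <div class="audio"> must come right after the optional line-break.
REGEX_A = re.compile(SECTION + r'(?!' + EOL + r'?<div class="audio">)')

# Regex B: other characters on the same line may precede the <div>.
REGEX_B = re.compile(
    SECTION + r'(?!' + EOL + r'?[^\r\n]*<div class="audio">)'
)

SAMPLE = '<section class="lyrics">\r\n12345<div class="audio">\r\n'
PLAIN = '<section id="corpus">\r\n<h1 class="screen-reader-only">Lyrics</h1>\r\n'

# Regex A still reports the section, because "12345" hides the <div>.
print(REGEX_A.search(SAMPLE) is not None)  # True
# Regex B looks through the rest of the line and rejects it.
print(REGEX_B.search(SAMPLE) is None)      # True
# Both accept a section that has no audio <div> on the next line.
print(REGEX_B.search(PLAIN) is not None)   # True
```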
                

                Best Regards,

                guy038

                • Sylvester Bullitt

                  Thanks for the lucid explanation, Guy!

                  In case anyone is interested, we’re using this in a project to add embedded audio players to the pages at the Web site here.

                  The regular expression helps us find the pages which still need to be updated.
