    Why does this regexp fail to match?

    • mkupper

      I have a log file

      02/19/2024 08:22 Event
      02/26/2024 13:37 Event
      03/04/2024 09:27 Event
      03/11/2024 08:35 Event
      03/18/2024 09:28 Event
      

      I wanted to remove the event times, leaving the dates, so I ran this replacement:
      Search: ^(?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9] )[012][0-9]:[0-5][0-9]\x20
      Replace: (empty)

      It failed to match any of the lines, but it worked once I moved the ^ beginning-of-line assertion inside the look-behind.
      Search: (?<=^[01][0-9]/[0-3][0-9]/20[0-9][0-9] )[012][0-9]:[0-5][0-9]\x20
      Replace: (empty)

      I’m wondering why the first version fails to match any lines.

      I was using npp v8.6.5 but then tried v8.5.8 and got the same not-matching issue in both versions.

      • Coises @mkupper

        @mkupper said in Why does this regexp fail to match?:

        I’m wondering why the first version fails to match any lines.

        Because you are asking it to first match the beginning of a line, then look at the preceding eleven characters (the date plus its trailing space) to see if they look like a date.

        Since the character immediately preceding the beginning of any line other than the first will be a line-ending character, that can never match.
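The difference between the two expressions can be reproduced outside Notepad++. The following is a minimal sketch using Python's standard `re` module, which is not the Boost engine Notepad++ uses but should behave the same way for this fixed-width lookbehind. The names `log`, `DATE`, `TIME`, `anchor_outside` and `anchor_inside` are made up for the illustration.

```python
import re

# Three lines from the log in the first post.
log = (
    "02/19/2024 08:22 Event\n"
    "02/26/2024 13:37 Event\n"
    "03/04/2024 09:27 Event\n"
)

DATE = r"[01][0-9]/[0-3][0-9]/20[0-9][0-9] "   # 11 characters wide
TIME = r"[012][0-9]:[0-5][0-9]\x20"

# ^ outside the lookbehind: the current position must be a line start AND
# must be immediately preceded by a date, which cannot both be true.
anchor_outside = re.compile(r"^(?<=" + DATE + r")" + TIME, re.MULTILINE)

# ^ inside the lookbehind: the date itself must begin at a line start.
anchor_inside = re.compile(r"(?<=^" + DATE + r")" + TIME, re.MULTILINE)

matches_outside = anchor_outside.findall(log)
cleaned = anchor_inside.sub("", log)

print(matches_outside)
print(cleaned)
```

With the anchor outside the lookbehind nothing is found; with the anchor inside it, the times are removed and the dates stay.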

        • mkupper @Coises

          Thank you @Coises. Does that mean that a lookbehind should be the first thing in a regexp?

          For example, if I insert a ~ in front of each line, giving ~02/19/2024 08:22 Event, I see that ~(?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9] )[012][0-9]:[0-5][0-9]\x20 fails to match.

          I’m assuming the scanner first looks for a ~ and positions an invisible cursor just after it, and then, when it gets to the (?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9] ), it backs up 11 characters (the width of the lookbehind) to see if there’s a match?

          I tested that using ~(?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9]~)[012][0-9]:[0-5][0-9] and it matches the ~hh:mm time stamps in

          ~02/19/2024~08:22 Event
          ~02/26/2024~13:37 Event
          ~03/04/2024~09:27 Event
          ~03/11/2024~08:35 Event
          ~03/18/2024~09:28 Event
          

          It’s a contrived example of a lookbehind that is not the first thing in the regexp. It only works because the scan for the ~ at the start of the regexp eventually reaches the second ~ in the data, and at that position the lookbehind matches. I suspect it’s too much mental gymnastics for production code.
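To check the "invisible cursor" picture, here is a small sketch in Python's `re` module (an assumption: Notepad++ uses Boost, but the mechanics of a fixed-width lookbehind are the same here). It records where each match starts; `data`, `contrived` and `hits` are illustrative names.

```python
import re

data = "~02/19/2024~08:22 Event\n~02/26/2024~13:37 Event\n"

# The leading ~ is consumed first. The lookbehind then inspects the 11
# characters that end at the current position, including that same ~.
contrived = re.compile(
    r"~(?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9]~)[012][0-9]:[0-5][0-9]"
)

# (start offset, matched text) for every match
hits = [(m.start(), m.group()) for m in contrived.finditer(data)]
print(hits)
```

The first ~ on each line is tried and rejected, because too little (or the wrong) text precedes the cursor; only the second ~ on each line, at offsets 11 and 35, passes the lookbehind.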

              • Coises @mkupper

              @mkupper said in Why does this regexp fail to match?:

              Thank you @Coises. Does that mean that a lookbehind should be the first thing in a regexp?

              Not necessarily, but usually.

              You would usually use a lookbehind for one of two reasons:

              1. You need to match something, but you don’t want to include the first part of that match in what will be highlighted or replaced.

              2. You need to verify that your match begins at a boundary, and the lookbehind that identifies that boundary for a particular match could include characters that were already consumed by the previous match.

              Case 1 is essentially for convenience: for replacement you could use a capture group instead of lookbehind and use the capture group in your replacement; for Find or Replace All (but not stepwise Replace) you could use \K; or, for Find, you could just accept that a longer segment will be highlighted than the part on which you really want to focus.
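The capture-group alternative for case 1 can be sketched as follows in Python's standard `re` module, which has no `\K`, so only the capture-group variant is shown. The names `log`, `with_group` and `dates_only` are hypothetical.

```python
import re

log = "02/19/2024 08:22 Event\n02/26/2024 13:37 Event\n"

# Capture the date (with its trailing space), consume the time, and put the
# captured date back via \1 in the replacement.
with_group = re.compile(
    r"^([01][0-9]/[0-3][0-9]/20[0-9][0-9] )[012][0-9]:[0-5][0-9] ",
    re.MULTILINE,
)

dates_only = with_group.sub(r"\1", log)
print(dates_only)
```

Here the ^ can sit at the very start of the expression, because the date is matched normally rather than looked back at.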

              Case 2 is probably uncommon, but for that situation I know of no alternative to a lookbehind.
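A toy illustration of case 2, again as a Python `re` sketch rather than anything Notepad++-specific: replace every "a" that follows another "a". The string and the variable names are invented for the example.

```python
import re

text = "aaaa"

# Consuming the preceding "a" means it is used up and cannot serve as the
# context for the next match.
consuming = re.sub(r"(a)a", r"\1X", text)

# A lookbehind only inspects the preceding "a", so every "a" that follows
# another "a" is found, even when that earlier "a" was itself just matched.
lookbehind = re.sub(r"(?<=a)a", "X", text)

print(consuming)
print(lookbehind)
```

The capture-group version misses every other candidate because adjacent matches would need to share a character; the lookbehind version does not have that problem.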

              In both of those cases, the look-behind would come at the beginning (or at the beginning of one or more alternatives where the alternation comes at the beginning). You are allowed to use lookbehinds in other ways, but I doubt there are many other cases where they are the best or most straightforward way to do something.

              I’m assuming the scanner first looked for an ~ and positions an invisible cursor just after the ~ and then when it gets to the (?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9] ) it will back up

              That’s correct.

                • Alan Kilborn @mkupper

                @mkupper said in Why does this regexp fail to match?:

                It’s a contrived example of a lookbehind that was not the first thing

                It’s so contrived it makes my head hurt. :-P
                I suspect this is the non-contrived version (which feels much better):
                (?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9])~[012][0-9]:[0-5][0-9]

                  • guy038

                  Hello, @mkupper, @coises, @alan-kilborn and All,

                  Referring to your first post, @mkupper: your first search expression, the one that does not work, could be rewritten as:

                  SEARCH (?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9] )^[012][0-9]:[0-5][0-9]\x20

                  This has the same shape as @alan-kilborn’s version, with the look-behind coming first.


                  Written this way, it’s easier to see why this regex cannot work. There are two separate reasons:

                  • The text at the beginning of each line never matches [012][0-9]:[0-5][0-9]\x20

                  • The text matched by the look-behind would have to occur before the beginning of a line, so it would have to end with line-ending characters!


                  However, note that the regex (?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9] \r\n)^[012][0-9]:[0-5][0-9]\x20, with the two chars \r\n at the end of the look-behind, after the space char, would match this text :

                  02/19/2024 
                  08:22 Event
                  02/26/2024 
                  13:37 Event
                  03/04/2024 
                  09:27 Event
                  03/11/2024 
                  08:35 Event
                  03/18/2024 
                  09:28 Event
                  

                  And, after the replacement, would produce this text:

                  02/19/2024 
                  Event
                  02/26/2024 
                  Event
                  03/04/2024 
                  Event
                  03/11/2024 
                  Event
                  03/18/2024 
                  Event
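The same replacement can be tried in Python as a rough check. This is a sketch under the assumption that Python's `re` with `re.MULTILINE` is close enough to the Notepad++ engine for this case; `split_log`, `pattern` and `result` are illustrative names, and only two of the five records are used.

```python
import re

# Date and time on separate lines, with Windows (\r\n) line endings.
split_log = (
    "02/19/2024 \r\n08:22 Event\r\n"
    "02/26/2024 \r\n13:37 Event\r\n"
)

# The lookbehind now ends with \r\n, so the text it describes really does
# sit immediately before a line start.
pattern = re.compile(
    r"(?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9] \r\n)^[012][0-9]:[0-5][0-9]\x20",
    re.MULTILINE,
)

result = pattern.sub("", split_log)
print(repr(result))
```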
                  

                  Finally, another solution to your problem is simply to use the following regex S/R:

                  SEARCH (?<=\x20)[012][0-9]:[0-5][0-9]\x20

                  REPLACE Leave EMPTY
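A quick Python sketch of this shorter expression (same caveat as before: `re` is standing in for the Notepad++ engine). The second sample line is invented to show one thing to watch for: without the date in the look-behind, any hh:mm preceded by a space is removed, not just the one after the date.

```python
import re

log = "02/19/2024 08:22 Event\n02/26/2024 13:37 Event\n"

short = re.compile(r"(?<=\x20)[012][0-9]:[0-5][0-9]\x20")

cleaned = short.sub("", log)
print(cleaned)

# Hypothetical line with a second time later in the text: both are removed.
extra = "02/19/2024 08:22 Backup at 23:15 done\n"
extra_cleaned = short.sub("", extra)
print(extra_cleaned)
```

For the log in the first post this makes no difference, since each line holds only one time.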

                  Best Regards,

                  guy038

                    • mkupper @Alan Kilborn

                    @Alan-Kilborn said in Why does this regexp fail to match?:

                    It’s so contrived it makes my head hurt. :-P
                    I suspect this is the non-contrived version (which feels much better):

                    I agree on the hurt, or at least spinning, head. In hindsight, I should have used ^date\K time which both works and is easier to mentally parse.

                    I now understand why my first expression failed in the original message on this thread.

                    FWIW, while parsing @Coises’ reply I found https://www.regular-expressions.info/lookaround.html via Google that has a decent explanation on why many regexp engines require fixed length or width lookbehinds. Some engines have ways around this restriction but each of those seems to create their own cans of worms.

                    For now I’ll stick with lookbehinds only as the first thing in an expression or use \K as I now see that if I have a lookbehind later in an expression, even after a zero length ^ match, that it can be a brain twister to figure out what is happening.

                    Here’s a simplified version of the contrived data and regexp that is easier to understand. The data is

                    abc
                    

                    Searching with any of these gives the same result:

                    (?<=a)bc
                    b(?<=ab)c
                    bc(?<=abc)
                    

                    The latter two end up inspecting the b or bc in the data twice.
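The three expressions can be compared directly. A minimal Python `re` sketch (the list and variable names are made up for the example):

```python
import re

data = "abc"

patterns = [
    r"(?<=a)bc",     # lookbehind first
    r"b(?<=ab)c",    # lookbehind after consuming b
    r"bc(?<=abc)",   # lookbehind after consuming bc
]

# (start, end) of the first match for each pattern
spans = [re.search(p, data).span() for p in patterns]
matched = [re.search(p, data).group() for p in patterns]

print(spans)
print(matched)
```

All three report the same span and the same text, bc; they differ only in how much already-consumed text the lookbehind re-reads.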
