Why does this regexp fail to match?

  • M
    mkupper
    last edited by Mar 18, 2024, 5:02 PM

    I have a log file

    02/19/2024 08:22 Event
    02/26/2024 13:37 Event
    03/04/2024 09:27 Event
    03/11/2024 08:35 Event
    03/18/2024 09:28 Event
    

    I wanted to remove the event times, leaving the dates, and so did
    Search: ^(?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9] )[012][0-9]:[0-5][0-9]\x20
    Replace: (empty)

    It failed to match any of the lines, but it worked once I moved the ^ beginning-of-line assertion inside the lookbehind.
    Search: (?<=^[01][0-9]/[0-3][0-9]/20[0-9][0-9] )[012][0-9]:[0-5][0-9]\x20
    Replace: (empty)

    I’m wondering why the first version fails to match any lines.

    I was using Notepad++ v8.6.5, then tried v8.5.8, and got the same failure to match in both versions.

    • C
      Coises @mkupper
      last edited by Mar 18, 2024, 5:09 PM

      @mkupper said in Why does this regexp fail to match?:

      I’m wondering why the first version fails to match any lines.

      Because you are asking it to first match the beginning of a line, then look at the preceding ten characters to see if they look like a date.

      Since the character immediately preceding the beginning of any line other than the first will be a line-ending character, that can never match.
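
A minimal sketch of the explanation above, using Python's `re` module as a stand-in for the Notepad++ regex engine (an assumption: the two engines differ in other respects, but both handle these fixed-width lookbehinds the same way). The names `LOG`, `DATE`, `TIME`, `outside`, and `inside` are illustrative only.

```python
import re

# Three lines of the log from the original post.
LOG = (
    "02/19/2024 08:22 Event\n"
    "02/26/2024 13:37 Event\n"
    "03/04/2024 09:27 Event\n"
)

DATE = r"[01][0-9]/[0-3][0-9]/20[0-9][0-9] "   # date followed by one space
TIME = r"[012][0-9]:[0-5][0-9]\x20"            # hh:mm followed by one space

# ^ outside the lookbehind: the position must be a line start AND
# be preceded by a date, which cannot both be true.
outside = re.compile(r"^(?<=" + DATE + r")" + TIME, re.MULTILINE)

# ^ inside the lookbehind: the date itself must begin at a line start.
inside = re.compile(r"(?<=^" + DATE + r")" + TIME, re.MULTILINE)

outside_hits = outside.findall(LOG)
inside_result = inside.sub("", LOG)

print(outside_hits)
print(inside_result)
```

With the assertion outside, nothing is found; with it inside, each time is removed and the dates are kept.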

      • M
        mkupper @Coises
        last edited by Mar 18, 2024, 5:58 PM

        Thank you @Coises. Does that mean that a lookbehind should be the first thing in a regexp?

        For example, if I insert a ~ in front of my lines to get ~02/19/2024 08:22 Event, I see that ~(?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9] )[012][0-9]:[0-5][0-9]\x20 fails to match.

        I’m assuming the scanner first looks for a ~ and positions an invisible cursor just after it, and then, when it gets to the (?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9] ), it backs up 11 characters (the width of the lookbehind) to see if there’s a match?

        I tested that using ~(?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9]~)[012][0-9]:[0-5][0-9] and it matches the ~hh:mm time stamps in

        ~02/19/2024~08:22 Event
        ~02/26/2024~13:37 Event
        ~03/04/2024~09:27 Event
        ~03/11/2024~08:35 Event
        ~03/18/2024~09:28 Event
        

        It’s a contrived example of a lookbehind that is not the first thing in the regexp. It only works because the scan for the ~ at the start of the regexp reaches the second ~ in the data, and at that point the lookbehind matches. I suspect it’s too much mental gymnastics for production code.
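
The two tilde experiments can be reproduced with a short sketch. Python's `re` module is assumed here as a substitute for the Notepad++ engine, and the variable names are illustrative.

```python
import re

DATE = r"[01][0-9]/[0-3][0-9]/20[0-9][0-9]"
TIME = r"[012][0-9]:[0-5][0-9]"

# After "~" is consumed, the lookbehind demands "date + space" just
# before the cursor, but the character before the cursor is "~".
space_version = re.compile(r"~(?<=" + DATE + r" )" + TIME)

# Here the lookbehind demands "date + ~", so the "~" that was just
# consumed is re-inspected as the last character of the lookbehind.
tilde_version = re.compile(r"~(?<=" + DATE + r"~)" + TIME)

space_found = space_version.search("~02/19/2024 08:22 Event")
tilde_found = tilde_version.search("~02/19/2024~08:22 Event")

print(space_found)
print(tilde_found.group())
```

The first search finds nothing. The second succeeds only at the second ~ in the line, where the eleven characters before the cursor are the date followed by that same ~.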

          • C
            Coises @mkupper
            last edited by Mar 18, 2024, 6:34 PM

            @mkupper said in Why does this regexp fail to match?:

            Thank you @Coises. Does that mean that a lookbehind should be the first thing in a regexp?

            Not necessarily, but usually.

            You would usually use a lookbehind for one of two reasons:

            1. You need to match something, but you don’t want to include the first part of that match in what will be highlighted or replaced.

            2. You need to verify that your match begins at a boundary, and the lookbehind that identifies that boundary for a particular match could include characters that were already consumed by the previous match.

            Case 1 is essentially for convenience: for replacement you could use a capture group instead of lookbehind and use the capture group in your replacement; for Find or Replace All (but not stepwise Replace) you could use \K; or, for Find, you could just accept that a longer segment will be highlighted than the part on which you really want to focus.

            Case 2 is probably uncommon, but for that situation I know of no alternative to a lookbehind.

            In both of those cases, the look-behind would come at the beginning (or at the beginning of one or more alternatives where the alternation comes at the beginning). You are allowed to use lookbehinds in other ways, but I doubt there are many other cases where they are the best or most straightforward way to do something.
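
Both cases can be sketched in Python (its `re` module is assumed as a stand-in for the Notepad++ engine, and the comma-separated string is hypothetical data, not from this thread). The first part shows the capture-group alternative for case 1. The second shows case 2, where the boundary character was already consumed by the previous match.

```python
import re

# Case 1: a capture group instead of a lookbehind. The date is matched
# and then written back by the replacement.
LOG = "02/19/2024 08:22 Event\n02/26/2024 13:37 Event\n"
with_group = re.sub(
    r"^([01][0-9]/[0-3][0-9]/20[0-9][0-9] )[012][0-9]:[0-5][0-9]\x20",
    r"\1",
    LOG,
    flags=re.MULTILINE,
)

# Case 2: each item must be preceded by a comma, and the match includes
# the trailing comma. That trailing comma is also the leading boundary
# of the next item.
ITEMS = ",a,b,c,"

# Consuming the leading comma skips "b": its comma was already used up.
consuming = re.findall(r",[a-z],", ITEMS)

# A lookbehind checks the comma without consuming it, so no item is skipped.
lookbehind = re.findall(r"(?<=,)[a-z],", ITEMS)

print(with_group)
print(consuming)
print(lookbehind)
```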

            I’m assuming the scanner first looked for an ~ and positions an invisible cursor just after the ~ and then when it gets to the (?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9] ) it will back up

            That’s correct.

            • A
              Alan Kilborn @mkupper
              last edited by Mar 18, 2024, 6:34 PM

              @mkupper said in Why does this regexp fail to match?:

              It’s a contrived example of a lookbehind that was not the first thing

              It’s so contrived it makes my head hurt. :-P
              I suspect this is the non-contrived version (which feels much better):
              (?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9])~[012][0-9]:[0-5][0-9]

              • G
                guy038
                last edited by Mar 18, 2024, 7:27 PM

                Hello, @mkupper, @coises, @alan-kilborn and All,

                If I refer to your first post, @mkupper, your first solution, which does not work, could be rewritten as :

                SEARCH (?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9] )^[012][0-9]:[0-5][0-9]\x20

                This form is similar to @alan-kilborn’s version, with the look-behind placed first.


                Written this way, it’s easier to see why this regex cannot work. Two different reasons explain this behaviour:

                • At the beginning of a line, the text never matches [012][0-9]:[0-5][0-9]\x20

                • The text matched by the look-behind would have to occur before the beginning of a line and would therefore have to include the line-ending characters!


                However, note that the regex (?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9] \r\n)^[012][0-9]:[0-5][0-9]\x20, with the two characters \r\n at the end of the look-behind, after the space, would match this text:

                02/19/2024 
                08:22 Event
                02/26/2024 
                13:37 Event
                03/04/2024 
                09:27 Event
                03/11/2024 
                08:35 Event
                03/18/2024 
                09:28 Event
                

                And would produce this output:

                02/19/2024 
                Event
                02/26/2024 
                Event
                03/04/2024 
                Event
                03/11/2024 
                Event
                03/18/2024 
                Event
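
As a rough check of the `\r\n` variant, the following sketch applies the same regex in Python. The `re` module is assumed as a substitute for the Notepad++ engine, and Windows line endings are written explicitly in the sample text.

```python
import re

# Date and time on separate lines, with Windows (\r\n) line endings.
TEXT = (
    "02/19/2024 \r\n08:22 Event\r\n"
    "02/26/2024 \r\n13:37 Event\r\n"
)

# The lookbehind now spans the date, the space, and the line ending,
# so it can be satisfied at the start of the following line.
pattern = re.compile(
    r"(?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9] \r\n)^[012][0-9]:[0-5][0-9]\x20",
    re.MULTILINE,
)

result = pattern.sub("", TEXT)
print(repr(result))
```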
                

                Finally, another solution to your problem is simply to use the following regex S/R:

                SEARCH (?<=\x20)[012][0-9]:[0-5][0-9]\x20

                REPLACE Leave EMPTY
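
A small sketch of this simpler search, again assuming Python's `re` module as a stand-in for the Notepad++ engine. The second sample line is hypothetical and is included only to show that, without the date in the look-behind, any hh:mm preceded by a space is removed, not just the one after the date.

```python
import re

simple = re.compile(r"(?<=\x20)[012][0-9]:[0-5][0-9]\x20")

# A line from the original log.
log_line = simple.sub("", "02/19/2024 08:22 Event")

# Hypothetical line with a second time later in the text.
busy_line = simple.sub("", "02/19/2024 08:22 Backup at 09:15 done")

print(log_line)
print(busy_line)
```

For the log shown in this thread, the simpler regex is sufficient because each line contains only one time.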

                Best Regards,

                guy038

                • M
                  mkupper @Alan Kilborn
                  last edited by Mar 18, 2024, 9:59 PM

                  @Alan-Kilborn said in Why does this regexp fail to match?:

                  It’s so contrived it makes my head hurt. :-P
                  I suspect this is the non-contrived version (which feels much better):

                  I agree on the hurt, or at least spinning, head. In hindsight, I should have used ^date\K time which both works and is easier to mentally parse.

                  I now understand why my first expression failed in the original message on this thread.

                  FWIW, while parsing @Coises’ reply I found https://www.regular-expressions.info/lookaround.html via Google, which has a decent explanation of why many regexp engines require fixed-length (fixed-width) lookbehinds. Some engines have ways around this restriction, but each of those seems to create its own can of worms.

                  For now I’ll stick with lookbehinds only as the first thing in an expression or use \K as I now see that if I have a lookbehind later in an expression, even after a zero length ^ match, that it can be a brain twister to figure out what is happening.

                  Here’s a simplified version of the contrived data and regexp that is easier to understand. The data is

                  abc
                  

                  Searching using any of these gets you the same results:

                  (?<=a)bc
                  b(?<=ab)c
                  bc(?<=abc)
                  

                  The latter two end up inspecting the b or bc in the data twice.
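
The equivalence of the three searches can be checked with a short sketch, assuming Python's `re` module behaves like the Notepad++ engine for these fixed-width lookbehinds.

```python
import re

DATA = "abc"
patterns = [r"(?<=a)bc", r"b(?<=ab)c", r"bc(?<=abc)"]

# Record the matched text and its (start, end) span for each pattern.
results = [(m.group(), m.span()) for m in (re.search(p, DATA) for p in patterns)]

for pattern, result in zip(patterns, results):
    print(pattern, result)
```

Each pattern matches the same two characters, bc, at the same position. Only the point at which the engine looks backward differs.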

                  The Community of users of the Notepad++ text editor.