Why does this regexp fail to match?

mkupper

I have a log file

02/19/2024 08:22 Event
02/26/2024 13:37 Event
03/04/2024 09:27 Event
03/11/2024 08:35 Event
03/18/2024 09:28 Event

I wanted to remove the event times, leaving the dates, so I ran:
Search: ^(?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9] )[012][0-9]:[0-5][0-9]\x20
Replace: (empty)

It failed to match any of the lines, but it worked when I moved the ^ beginning-of-line assertion inside the lookbehind:
Search: (?<=^[01][0-9]/[0-3][0-9]/20[0-9][0-9] )[012][0-9]:[0-5][0-9]\x20
Replace: (empty)

I’m wondering why the first version fails to match any lines.

I was using Notepad++ v8.6.5, then tried v8.5.8, and the pattern fails to match in both versions.

Coises

@mkupper said in Why does this regexp fail to match?:

I’m wondering why the first version fails to match any lines.

Because you are asking it to first match the beginning of a line, then look at the preceding eleven characters (ten for the date plus the trailing space) to see if they look like a date.

Since the character immediately preceding the beginning of any line other than the first will be a line-ending character, and nothing at all precedes the first line, that can never match.
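
The difference can be reproduced outside Notepad++. The sketch below uses Python's re module, which is an assumption for illustration only: Notepad++ uses a different engine (Boost), although ^ and fixed-width lookbehind behave the same way for these two patterns. The variable names are made up for the example.

```python
import re

log = "\n".join([
    "02/19/2024 08:22 Event",
    "02/26/2024 13:37 Event",
    "03/04/2024 09:27 Event",
])

DATE = r"[01][0-9]/[0-3][0-9]/20[0-9][0-9] "
TIME = r"[012][0-9]:[0-5][0-9]\x20"

# ^ first: the position must be a line start AND have a date right behind it.
caret_outside = re.compile("^(?<=" + DATE + ")" + TIME, re.MULTILINE)
# ^ inside: the date itself must begin at a line start.
caret_inside = re.compile("(?<=^" + DATE + ")" + TIME, re.MULTILINE)

unchanged = caret_outside.sub("", log)  # nothing matches, so nothing is removed
stripped = caret_inside.sub("", log)    # the times are removed

print(unchanged == log)
print(stripped)
```

With the ^ outside, the text just before any line start is either nothing or a line ending, so the lookbehind can never see a date there.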

mkupper

Thank you @Coises. Does that mean that a lookbehind should be the first thing in a regexp?

For example, if I insert a ~ in front of my lines to get ~02/19/2024 08:22 Event, then ~(?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9] )[012][0-9]:[0-5][0-9]\x20 fails to match.

I’m assuming the scanner first looked for an ~ and positions an invisible cursor just after the ~ and then when it gets to the (?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9] ) it will back up 11 characters (the width of the lookbehind) to see if there’s a match?

I tested that using ~(?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9]~)[012][0-9]:[0-5][0-9] and it matches the ~hh:mm timestamps in:

~02/19/2024~08:22 Event
~02/26/2024~13:37 Event
~03/04/2024~09:27 Event
~03/11/2024~08:35 Event
~03/18/2024~09:28 Event

It’s a contrived example of a lookbehind that was not the first thing in the regexp. It only works because the scan for the ~ at the start of the regexp lands on the second ~ in the data, and at that point the lookbehind matches. I suspect it’s too much mental gymnastics for production code.
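
Both tilde experiments can be checked with a short script. This is a sketch in Python's re module rather than Notepad++ itself (an assumption made for illustration), and the names space_version and tilde_version are invented here.

```python
import re

DATE = r"[01][0-9]/[0-3][0-9]/20[0-9][0-9]"
TIME = r"[012][0-9]:[0-5][0-9]"

# Lookbehind expects "date + space" to end right after the ~ just matched.
space_version = re.compile("~(?<=" + DATE + " )" + TIME + r"\x20")
# Lookbehind expects "date + ~", so the ~ just matched is its last character.
tilde_version = re.compile("~(?<=" + DATE + "~)" + TIME)

no_match = space_version.search("~02/19/2024 08:22 Event")
match = tilde_version.search("~02/19/2024~08:22 Event")

print(no_match)                     # None
print(match.group(), match.span())  # ~08:22 (11, 17)
```

The first pattern cannot succeed because the character just matched is a ~, while the lookbehind demands a space in that same position.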

Coises

@mkupper said in Why does this regexp fail to match?:

Thank you @Coises. Does that mean that a lookbehind should be the first thing in a regexp?

Not necessarily, but usually.

You would usually use a lookbehind for one of two reasons:

1. You need to match something, but you don’t want to include the first part of that match in what will be highlighted or replaced.
2. You need to verify that your match begins at a boundary, and the lookbehind that identifies that boundary for a particular match could include characters that were already consumed by the previous match.

Case 1 is essentially for convenience: for replacement you could use a capture group instead of lookbehind and use the capture group in your replacement; for Find or Replace All (but not stepwise Replace) you could use \K; or, for Find, you could just accept that a longer segment will be highlighted than the part on which you really want to focus.
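
As a rough illustration of the capture-group alternative for Case 1, here is a sketch using Python's re module (an assumption; Python's re has no \K, so only the capture-group variant is shown, and the sample data is taken from the original question).

```python
import re

log = "02/19/2024 08:22 Event\n02/26/2024 13:37 Event"

# Group 1 holds the date and its trailing space; the time is matched
# but not captured.
pattern = re.compile(
    r"^([01][0-9]/[0-3][0-9]/20[0-9][0-9] )[012][0-9]:[0-5][0-9]\x20",
    re.MULTILINE,
)

# Putting group 1 back drops only the time.
result = pattern.sub(r"\1", log)
print(result)
```

In Notepad++ the same idea would put a reference to group 1 in the Replace field instead of leaving it empty.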

Case 2 is probably uncommon, but for that situation I know of no alternative to a lookbehind.
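
A small made-up example of Case 2, sketched in Python's re module (an assumption for illustration): find every digit that is immediately preceded by another digit. The data "1234" is hypothetical.

```python
import re

text = "1234"

# Lookbehind: the preceding digit is checked but not consumed,
# so it can still be matched itself.
with_lookbehind = [m.group() for m in re.finditer(r"(?<=[0-9])[0-9]", text)]

# Capture group: the preceding digit is consumed, so it cannot serve
# as the boundary for the next match.
with_capture = [m.group(1) for m in re.finditer(r"[0-9]([0-9])", text)]

print(with_lookbehind)  # ['2', '3', '4']
print(with_capture)     # ['2', '4']
```

The capture-group version misses the 3 because the 2 that precedes it was already used up as part of the previous match.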

In both of those cases, the look-behind would come at the beginning (or at the beginning of one or more alternatives where the alternation comes at the beginning). You are allowed to use lookbehinds in other ways, but I doubt there are many other cases where they are the best or most straightforward way to do something.

I’m assuming the scanner first looked for an ~ and positions an invisible cursor just after the ~ and then when it gets to the (?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9] ) it will back up

That’s correct.

Alan Kilborn

@mkupper said in Why does this regexp fail to match?:

It’s a contrived example of a lookbehind that was not the first thing

It’s so contrived it makes my head hurt. :-P
I suspect this is the non-contrived version (which feels much better):
(?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9])~[012][0-9]:[0-5][0-9]

guy038

Hello, @mkupper, @coises, @alan-kilborn and All,

Referring to your first post, @mkupper, your first solution, which does not work, could be rewritten as:

SEARCH (?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9] )^[012][0-9]:[0-5][0-9]\x20

This is a form similar to @alan-kilborn’s!

This time, it’s easier to see why this regex cannot work. Two different reasons explain this behaviour:

1. The text at the beginning of each line is a date, so it never matches [012][0-9]:[0-5][0-9]\x20
2. The text matched by the look-behind would have to occur before the beginning of a line and so would have to end with line-ending characters!

However, note that the regex (?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9] \r\n)^[012][0-9]:[0-5][0-9]\x20, with the two chars \r\n at the end of the look-behind, after the space char, would match this text :

02/19/2024 
08:22 Event
02/26/2024 
13:37 Event
03/04/2024 
09:27 Event
03/11/2024 
08:35 Event
03/18/2024 
09:28 Event

And would produce this output:

02/19/2024 
Event
02/26/2024 
Event
03/04/2024 
Event
03/11/2024 
Event
03/18/2024 
Event
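
This variant can be checked with a short script. It is a sketch in Python's re module (an assumption; Notepad++ uses a different engine), with only two of the sample records and Windows line endings written out explicitly.

```python
import re

text = "02/19/2024 \r\n08:22 Event\r\n02/26/2024 \r\n13:37 Event\r\n"

# The look-behind now ends with \r\n, so it can end exactly where a
# new line begins.
pattern = re.compile(
    r"(?<=[01][0-9]/[0-3][0-9]/20[0-9][0-9] \r\n)^[012][0-9]:[0-5][0-9]\x20",
    re.MULTILINE,
)

result = pattern.sub("", text)
print(repr(result))
```

Once the look-behind accounts for the line ending, the ^ that follows it is satisfiable.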

Finally, another solution to your problem is simply to use the following regex S/R:

SEARCH (?<=\x20)[012][0-9]:[0-5][0-9]\x20

REPLACE Leave EMPTY
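
A quick check of this shorter search, sketched in Python's re module (an assumption for illustration). The second input line is hypothetical and is there only to show that, without the date in the look-behind, any hh:mm that has a space on both sides is removed.

```python
import re

pattern = re.compile(r"(?<=\x20)[012][0-9]:[0-5][0-9]\x20")

log = "02/19/2024 08:22 Event\n03/04/2024 09:27 Event"
result = pattern.sub("", log)

# Hypothetical line with a second time later in the text.
extra = pattern.sub("", "02/19/2024 08:22 Backup at 10:15 done")

print(result)
print(extra)  # 02/19/2024 Backup at done
```

For the log in the original question this makes no difference, since each line contains only one time.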

Best Regards,

guy038

mkupper

@Alan-Kilborn said in Why does this regexp fail to match?:

It’s so contrived it makes my head hurt. :-P
I suspect this is the non-contrived version (which feels much better):

I agree on the hurt, or at least spinning, head. In hindsight, I should have used ^date\K time which both works and is easier to mentally parse.

I now understand why my first expression failed in the original message on this thread.

FWIW, while parsing @Coises’ reply I found https://www.regular-expressions.info/lookaround.html via Google, which has a decent explanation of why many regexp engines require fixed-length (fixed-width) lookbehinds. Some engines have ways around this restriction, but each of those seems to create its own can of worms.
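
As one concrete case of the fixed-width restriction, Python's re module refuses to compile a lookbehind whose length can vary. This sketch shows Python's behaviour only; other engines, including the one in Notepad++, have their own rules. The helper name compiles is made up for the example.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

fixed = compiles(r"(?<=ab)c")     # lookbehind is always two characters wide
variable = compiles(r"(?<=a+)c")  # lookbehind could be any width >= 1

print(fixed, variable)  # True False
```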

For now I’ll stick with lookbehinds only as the first thing in an expression, or use \K. I now see that a lookbehind placed later in an expression, even after a zero-length ^ match, can be a brain twister to figure out.

Here’s a simplified version of the contrived data and regexp that is easier to understand. The data is

abc

Searching with any of these gets you the same result:

(?<=a)bc
b(?<=ab)c
bc(?<=abc)

The latter two end up inspecting the b or bc in the data twice.
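
The three searches can be compared with a short script, sketched here in Python's re module (an assumption; the thread itself is about Notepad++).

```python
import re

data = "abc"
patterns = [r"(?<=a)bc", r"b(?<=ab)c", r"bc(?<=abc)"]

# Record the matched text and its position for each pattern.
results = []
for p in patterns:
    m = re.search(p, data)
    results.append((m.group(), m.span()))

print(results)  # [('bc', (1, 3)), ('bc', (1, 3)), ('bc', (1, 3))]
```

All three report the same match; they differ only in how much already-matched text the lookbehind has to re-read.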