Why does this regexp get greedy when I use replace-all?

mkupper

I’m using Notepad++ v8.7 (64-bit) but this also applies to older versions.

I wanted to delete a leading tab if there is one. The regexps I have tried include:
Search: ^\t
Search: ^\t(?=.+)
Replace: (blank or empty)

Match-all highlights the first tabs on a line, as expected. Replace-all though gets greedy and removes all of the leading tabs. Why?

Something that works is:
Search: ^\t(.+)
Replace: \1

While using a replacement of \1 works and removes just the first tab, I was hoping to do this with a blank or empty replacement as I have a series of several search/replace-with-blank operations.

Alan Kilborn

@mkupper

Why does this regexp get greedy

It doesn’t get “greedy” in a regex sense.
After the first replacement is made, the search begins again at the (new) current position.
If the current position satisfies “beginning of line then tab” (and it does), you have another match there, to be replaced next.
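
The behavior described here can be imitated with a small Python sketch. This is only a toy model, not Notepad++'s actual code: the function name replace_all_in_place and the sample text are made up for illustration, and it uses Python's re module rather than the Boost engine. The one thing it copies is the rule that the next search starts at the right edge of whatever was just put into the buffer.

```python
import re

def replace_all_in_place(pattern, replacement, text):
    """Toy Replace All: edit the buffer, then resume searching at the
    right edge of the text that was just inserted (hypothetical helper)."""
    regex = re.compile(pattern, re.MULTILINE)
    pos = 0
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        new = m.expand(replacement)
        text = text[:m.start()] + new + text[m.end():]
        pos = m.start() + len(new)
        if m.start() == m.end():
            pos += 1  # empty match: step forward so the loop ends
    return text

sample = "\t\t\tfoo\n\tbar\n"
# Deleting the whole match leaves the position at the line start,
# so the next tab matches again, and again.
print(repr(replace_all_in_place(r"^\t", "", sample)))         # 'foo\nbar\n'
# Keeping the rest of the line moves the position past it.
print(repr(replace_all_in_place(r"^\t(.+)", r"\1", sample)))  # '\t\tfoo\nbar\n'
```

Note that Python's own re.sub scans the original string rather than the edited one, so on its own it would remove only one tab per line; the loop above exists precisely to mimic the restart-in-the-modified-text behavior.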

Match-all highlights the first tabs on a line

You mean Mark All (also same for Find All).
This works because nothing is removed. It’s the removal that’s the “problem” for the Replace or the Replace All.

Something that works is: Search: ^\t(.+) …

Yes, this is the correct form for this situation.

But… refer to HERE, because some consider it a bug.

Coises

@mkupper said in Why does this regexp get greedy when I use replace-all?:

While using a replacement of \1 works and removes just the first tab, I was hoping to do this with a blank or empty replacement as I have a series of several search/replace-with-blank operations.

For this particular case, you could use:
^\t*\K\t

This removes the last tab in a string of leading tabs, rather than the first, which amounts to the same thing. This will work with Replace All but not with step-by-step replace.

The \K means, “ignore the part of the match that comes before the \K and select only the part starting after it.” Due to a quirk in the way Notepad++ recognizes that the string to be replaced in the replace step is a string found by the find step, \K doesn’t work in step-by-step replace: the find part highlights the correct string, but replace doesn’t replace it.
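
As a rough illustration of why the \K form stops after one tab per line, here is a Python sketch. Python's re module has no \K, so this is an emulation under stated assumptions: capture group 1 stands in for "the part before \K", and the helper name drop_last_leading_tab is hypothetical. It models only Replace All, not the step-by-step quirk.

```python
import re

def drop_last_leading_tab(text):
    """Emulate replacing ^\\t*\\K\\t with nothing (hypothetical helper)."""
    # Group 1 plays the role of everything before \K: matched but kept.
    regex = re.compile(r"^(\t*)\t", re.MULTILINE)
    pos = 0
    while True:
        m = regex.search(text, pos)
        if m is None:
            return text
        keep_end = m.end(1)  # where \K would reset the start of the match
        text = text[:keep_end] + text[m.end():]
        # Resume at the right edge of the (empty) replacement. Unless the
        # line had a single tab, this is no longer a line start; either way
        # there is no tab left at this position to match again.
        pos = keep_end

print(repr(drop_last_leading_tab("\t\t\tfoo\n\tbar\nbaz")))  # '\t\tfoo\nbar\nbaz'
```

Because the match consumes every leading tab before the one that is deleted, the search resumes beyond all of them, so no second match is possible on the same line.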

Alan Kilborn

@mkupper said in Why does this regexp get greedy when I use replace-all?:

I was hoping to do this with a blank or empty replacement as I have a series of several search/replace-with-blank operations

Perhaps you can craft it so you can always leave \1 in Replace with, rather than keeping it at “empty”.

mkupper

Thank you @Alan-Kilborn and @Coises. It was one of those smack-forehead moments: I had not considered that the regexp engine keeps re-running the search/replace on a line until there are no more matches, and that deleting the entire match means the regexp cursor does not move, so it is still at the anchor.

I did another smack as I normally test by doing single search/replaces to see that it matches what I expected and replaces with what I expected and repeating the single search/replace a few times. Had I done that with ^\t I would have seen it remove one leading tab and then select the following tab as it’s now the leading tab. I would have then seen it siphoning up tabs one search/replace at a time.

@Alan-Kilborn said:

But… refer to HERE , because some consider it a bug.

That’s interesting food for thought. I experimented with expressions such as ^ and $, both of which cause the cursor to advance even though they should remain stuck at the start or end of a line. There’s already some special case logic within the engine.

I added a comment to the issue on GitHub, as (?<=a). demonstrates the same behavior as ^.

However, .$ and .(?=a) only remove one character from before the anchor when you do a search-replace-all. Thus, if it’s considered a bug then it only affects the prefix style anchors.

Coises

@mkupper said in Why does this regexp get greedy when I use replace-all?:

There’s already some special case logic within the engine.

I added a comment to the issue on GitHub, as (?<=a). demonstrates the same behavior as ^.

However, .$ and .(?=a) only remove one character from before the anchor when you do a search-replace-all. Thus, if it’s considered a bug then it only affects the prefix style anchors.

The position at which matching starts is an offset within the file. Think of it as representing an imaginary line at the left edge of the first character to be matched.

When a match is replaced, the next position to be matched is the imaginary line at the right edge of the replacement.

When you replace ^., the match starts at the left edge of the first character in a line (that’s what ^ requires). The match includes one character. When you replace it with nothing, the right edge of the replacement (the null string) is at the beginning of the line. The next match attempt starts there (which happens to be in the same place as the previous match), so it succeeds.

When you replace .$, the match succeeds when it starts at the left edge of the last character on a line. (So every position in the line is tried, until the position to the left of the last character in the line is reached; then the expression matches.) The match includes a single character; when it is replaced with the null string, the right edge of the replacement is at the end of the line. The next attempt to match begins there, but cannot match (because there is no non-line-ending character there), so match attempts are tried until the match begins at the left edge of the last character in the next line.

No special logic; it’s just how regular expressions work.
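
The "imaginary line" model above can be tried out in a short Python sketch. It is an approximation under assumptions: it uses Python's re module instead of Boost, the helper name replace_all_in_place and the sample strings are invented for illustration, and it ignores the Windows line-ending handling. It does show the asymmetry between prefix-style and suffix-style anchors.

```python
import re

def replace_all_in_place(pattern, replacement, text):
    """Toy Replace All: the next search starts at the right edge of the
    replacement inside the already-edited buffer (hypothetical helper)."""
    regex = re.compile(pattern, re.MULTILINE)
    pos = 0
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        new = m.expand(replacement)
        text = text[:m.start()] + new + text[m.end():]
        pos = m.start() + len(new)
        if m.start() == m.end():
            pos += 1  # empty match: step forward so the loop ends
    return text

# Prefix-style anchors: the right edge of the empty replacement still
# satisfies the anchor, so the match repeats at the same position.
print(repr(replace_all_in_place(r"^.", "", "abc\ndef")))    # '\n'
print(repr(replace_all_in_place(r"(?<=a).", "", "abcd")))   # 'a'

# Suffix-style anchors: the right edge sits at the anchor itself, where
# no character is left to match, so only one character is removed.
print(repr(replace_all_in_place(r".$", "", "abc\ndef")))    # 'ab\nde'
print(repr(replace_all_in_place(r".(?=a)", "", "dcba")))    # 'dca'
```

The same loop handles all four patterns with no special cases; the different outcomes come purely from where the resume position lands relative to the anchor.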

There is some special logic — enabled in Boost::regex, as I recall, not implemented in Notepad++ code — to avoid splitting up Windows line endings. A test shows that the results are strange if you have Windows line endings in a file in which some of the lines are empty, and you replace .$ with the empty string while . matches newline is enabled. When there are two line endings with nothing intervening (CR/LF/CR/LF), the first line ending (CR/LF) will match in the Find step, but only the LF is replaced in the Replace step, leaving CR/CR/LF. I’m not sure whether that’s a bug or just a “well, don’t do that!”

guy038

Hello, @mkupper, @alan-kilborn, @coises and All,

So, as a summary, to delete only the first standard character of every line of the current file, use one of the following regex S/R:

SEARCH (?-s)^.(.)?
SEARCH (?-s)^.(.*)
SEARCH (?-s)^.(.*\R)

And use :

REPLACE \1 or $1

Now, another possibility would be to add a character or string that you are certain is NOT part of the current file. Then run the two consecutive regex S/R below:

SEARCH ^
REPLACE Your character or string (like ¤ or xyzt, for example)

And :

SEARCH (?-s)¤.? or (?-s)xyzt.?
REPLACE Leave EMPTY

You may also join these two regex S/R into a single one:

SEARCH ¤.?|(^)
REPLACE ?1¤

Or :

SEARCH xyzt.?|(^)
REPLACE ?1xyzt

And click TWICE on the Replace All button !

Best Regards,

guy038