Regex search spanning lines when told not to

sngbrdb

This has been an issue for a number of versions… I’m currently on 6.9.2

If I do a regular expression search, but leave “. matches newline” unchecked, the regular expression will grab text on multiple lines. My current example:

Regex: ^([^\\]+\\){6}
Text file:

<blank line>
<blank line>
E:\Deploy\Self-Service Portal_backups\Original Deployed\SelfServicePortal\bin\AccountResources.Designer.cs -> E:\Web Sites\SelfServicePortal\bin\AccountResources.Designer.cs
E:\Deploy\Self-Service Portal_backups\Original Deployed\SelfServicePortal\bin\AccountResources.Designer.resx -> E:\Web Sites\SelfServicePortal\bin\AccountResources.Designer.resx
E:\Deploy\Self-Service Portal_backups\Original Deployed\SelfServicePortal\bin\AspNet.ScriptManager.bootstrap.dll -> E:\Web Sites\SelfServicePortal\bin\AspNet.ScriptManager.bootstrap.dll
E:\Deploy\Self-Service Portal_backups\Original Deployed\SelfServicePortal\bin\AspNet.ScriptManager.jQuery.dll -> E:\Web Sites\SelfServicePortal\bin\AspNet.ScriptManager.jQuery.dll
<blank line>
<blank line>

Every E:\ line is a new line (any line breaks are browser wrapping). The blank lines are irrelevant… I included them because they are included in the match, but the issue exists whether they’re there or not.

If I start at the beginning of the file and search next, I match:

<blank line>
<blank line>
E:\Deploy\Self-Service Portal_backups\Original Deployed\SelfServicePortal\ bin\AccountResources.Designer.cs -> E:\Web Sites\SelfServicePortal\bin\AccountResources.Designer.cs

I expect to skip the first two lines and start matching at E:\… but the regex engine matches [^\\]+ to newlines. By default, regex expressions should never span lines.

When I do a replace on that match (with nothing, I’m removing it), the next match lights up with:

bin\AccountResources.Designer.cs -> E:\Web Sites\SelfServicePortal\ bin\AccountResources.Designer.cs
E:\Deploy\Self…

The two blank lines are gone, and the search eats up the rest of the modified line and matches into the next line. If I do a replace-all, my text file ends up with only one line.

I can get around this with the following regex:
^([^\\]+\\){6}(.*)$

…replacing with \2. However, there are situations where that gives me grief because of the groups. And there’s clearly a mismatch between how ‘$’ is matched… in the first regex, the end of line character was greedy-matched. In the second, .* did not match the end of line character.

What’s going on?

MAPJe71

What’s going on?

No idea, but try ^([^\r\n\\]+?\\){6}.

MAPJe71

Wait, maybe I do know …

^ matches the start of the first blank line;
the first ([^\\]+\\) matches the two blank lines (including line breaks) and E:\ on the third line;
the second to sixth repetitions of ([^\\]+\\) match the remainder of the path on the third line, i.e. Deploy\Self-Service Portal_backups\Original Deployed\SelfServicePortal\bin\.

By default, regex expressions should never span lines.

Where did you get that?
By default, ^ and $ do match at line breaks, and a . (dot) does not match them.

But … that’s not the problem …
[^\\]+ will match 1 or more characters not being a backslash and that includes line breaks and blanks/whitespace.
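
A minimal sketch of this point, using Python's re module as a stand-in. This is an assumption: Notepad++ uses a different regex engine, and Python needs re.MULTILINE before ^ matches at every line start the way it does in Notepad++ by default. The sample text is a shortened, made-up version of the file above, and {3} is used instead of {6} to match it.

```python
import re

# Shortened stand-in for the file: two blank lines, then two path lines.
text = "\n\nE:\\a\\b\\c\\file1.cs\nE:\\a\\b\\c\\file2.cs\n"

# re.MULTILINE lets ^ match at the start of every line.
greedy = re.compile(r"^([^\\]+\\){3}", re.MULTILINE)
confined = re.compile(r"^([^\r\n\\]+\\){3}", re.MULTILINE)

# [^\\] accepts line breaks, so the first match starts on the blank lines.
first_greedy = greedy.search(text).group(0)
# [^\r\n\\] rejects line breaks, so the first match starts at "E:".
first_confined = confined.search(text).group(0)

print(repr(first_greedy))
print(repr(first_confined))
```

The only difference between the two patterns is the \r\n added to the negated class, which is what keeps each repetition on a single line.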

sngbrdb

In every language I know that supports regex (I’m reading Ruby is the sole exception), regex expressions don’t include the beginning or end of lines unless you explicitly enable multi-line mode. Unfortunately, there’s not a lot of supporting evidence for this - every site out there implies this is true and discusses how to enable multi-line mode if you need to span multiple lines, but it seems to be understood that single-line is the default.

If I run my regex and text through Expresso, I get the behavior I expect - single-line. I never saw this behavior in Notepad++ prior to the support for “. matches newline”.

But of course you’re correct, that’s exactly what notepad++ is doing… my question was more why. You’re saying multi-line is the default for notepad++, and after more digging, I found this on the wiki:

., \c
Matches any character. If you check the box which says “. matches newline”, the dot will indeed do that, enabling the “any” character to run over multiple lines. With the option unchecked, then . will only match characters within a line, and not the line ending characters (\r and \n)

… but also this:

[^…]
The complement of the characters in the set. For example, [^A-Za-z] means any character except an alphabetic character. Care should be taken with a complement list, as regular expressions are always multi-line, and hence [^ABC]* will match until the first A,B or C (or a, b or c if match case is off), including any newline characters. To confine the search to a single line, include the newline characters in the exception list, e.g. [^ABC\r\n].

So mystery solved, I guess… it just seems inconsistent that with respect to the . character, the documentation says you must check the box to get multi-line behavior, but with negation, you get multi-line behavior whether you want it or not. Every other regular expression engine seems to enable or disable multi-line as a whole, not on for some characters but off for others.

Thanks for the insight - I’ll add \r\n to my negated groups from now on.

MAPJe71

In every language I know that supports regex …

In my experience with RegEx engines the functionality for single- and multi-line have all been the same.

… you must check the box to get multi-line behavior …

By checking the . matches newline option box “single-line” behavior is enabled!

But I guess it relies on the definition/interpretation one uses for single- and multi-line …

“multi-line” (default enabled in Notepad++):
1. regard the text string as multiple string sections (lines) separated by line-breaks;
2. ^ and $ mark begin and end of a string section resp.;
3. . (dot) does not match line-breaks;
4. can be (re-)enabled with the (?m) modifier;
5. can be disabled with the (?-m) modifier.
“single-line” (default disabled in Notepad++):
1. regard the text string as one single string section (line);
2. ^ and $ mark begin and end of the text string resp.;
3. . (dot) does match line-breaks;
4. can be enabled with the (?s) modifier;
5. can be (re-)disabled with the (?-s) modifier.

sngbrdb

So “multi-line” does not span multiple lines. Got it. :|

But if “single-line” means all text is treated as one line, regardless of line breaks, then:

To confine the search to a single line, include the newline characters in the exception list, e.g. [^ABC\r\n]

… is contradictory. The documentation uses “single line” to mean “everything from immediately after a line break (or start of file) up to (but not including) the next line break”, where “line break” is chr(13)?chr(10) - a physical line on the screen. The implication is that omitting the \r\n in the negated group will allow the match to span multiple “lines”. You can’t say that single line means just a portion of the text, but also the entire text. Hence the confusion.

But for me, what’s really at issue is the fact that [^\\] matches line breaks when “.” does not. And after further research, there’s a gap in my understanding - that multi-line and single-line modes are not mutually exclusive. They mean two completely different things. #swear

So here’s where sh** gets real. First, NPP doesn’t recognize (?-m-s)^([^\\]+\\){6} as disabling both multi- and single-line modes; other engines do. No big deal, because NPP does recognize (?-m)(?-s)^([^\\]+\\){6}. Fine. However, the behavior is radically different than other engines, and worse, what you’re replacing the match with affects what’s matched.

Example: Same text as above, regex (?-s)(?m)^([^\\]+\\){6} replacing with the letter q. Result:

qAccountResources.Designer.cs -> E:\Web Sites\SelfServicePortal\bin\AccountResources.Designer.cs
qApplicationAccountResources.resx -> E:\Web Sites\SelfServicePortal\bin\ApplicationAccountResources.resx
qAspNet.ScriptManager.bootstrap.dll -> E:\Web Sites\SelfServicePortal\bin\AspNet.ScriptManager.bootstrap.dll
qAspNet.ScriptManager.jQuery.dll -> E:\Web Sites\SelfServicePortal\bin\AspNet.ScriptManager.jQuery.dll

After each match and replace, NPP moved down to the beginning of the next line and matched again. This is expected behavior (though the first two lines were collateral damage and disappeared).

Now replace the match with nothing, to just remove it:

AspNet.ScriptManager.jQuery.dll -> E:\Web Sites\SelfServicePortal\bin\AspNet.ScriptManager.jQuery.dll

This time, NPP did not move to the next line before beginning its match, and ate everything but the remnants of the last line that didn’t have enough backslashes to match. This is not expected behavior. Tools such as Expresso behave consistently, and leave all four lines intact, minus the replaced text at the beginning.
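
One possible way to picture the difference, offered purely as an assumption and not as a description of Notepad++ internals: if a tool replaces one match and then searches the modified text again from the end of the replacement, an empty replacement leaves the search position at a line start, so ^ matches there again. The helper replace_all_rescanning below is hypothetical, the sample text is made up, and Python's re module stands in for the real engine.

```python
import re

PATTERN = re.compile(r"^([^\\]+\\){2}", re.MULTILINE)


def replace_all_rescanning(text, replacement):
    """Hypothetical model: replace one match, then search the modified
    text again, starting right after the inserted replacement."""
    pos = 0
    while True:
        match = PATTERN.search(text, pos)
        if match is None:
            return text
        text = text[:match.start()] + replacement + text[match.end():]
        pos = match.start() + len(replacement)


sample = "a\\b\\c1\na\\b\\c2\n"

# With "q", the search resumes after the q, which is not a line start.
with_q = replace_all_rescanning(sample, "q")
# With "", the search resumes at a line start, so ^ matches again at once.
with_nothing = replace_all_rescanning(sample, "")
# A single pass over the unmodified text, for comparison.
single_pass = PATTERN.sub("", sample)

print(repr(with_q))
print(repr(with_nothing))
print(repr(single_pass))
```

In this model the q replacement leaves both lines in place, the empty replacement swallows the first line, and the single pass trims each line separately, which is the behavior reported for Expresso.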

Leaving the q as a replacement to continue testing, a regex like (?s)(?-m)^([^\\]+\\){6} in Expresso replaces text only on the first line, leaving the others intact. This is consistent with beginning the next match at the end of the current match… since we’re in single-line mode and not breaking the match at line breaks, we read the entire file in the first match and only had one match to replace. This is consistent with your explanation. The first two lines are again collateral damage.

qAccountResources.Designer.cs -> E:\Web Sites\SelfServicePortal\bin\AccountResources.Designer.cs
E:\Deploy\Self-Service Portal_backups\Original Deployed\SelfServicePortal\bin\AccountResources.Designer.resx -> E:\Web Sites\SelfServicePortal\bin\AccountResources.Designer.resx
E:\Deploy\Self-Service Portal_backups\Original Deployed\SelfServicePortal\bin\AspNet.ScriptManager.bootstrap.dll -> E:\Web Sites\SelfServicePortal\bin\AspNet.ScriptManager.bootstrap.dll
E:\Deploy\Self-Service Portal_backups\Original Deployed\SelfServicePortal\bin\AspNet.ScriptManager.jQuery.dll -> E:\Web Sites\SelfServicePortal\bin\AspNet.ScriptManager.jQuery.dll

Notepad++ gives me this:

qqqqqqAspNet.ScriptManager.jQuery.dll -> E:\Web Sites\SelfServicePortal\bin\AspNet.ScriptManager.jQuery.dll

Not only did it not advance according to multi-line rules, NPP’s behavior seems to be solely driven by the (?m) conditional - the behavior never changes, whether I use (?s) or (?-s).

I could expand this further with the testing I’ve done, but one of these tools is behaving incorrectly, and NPP isn’t acting as I’d expect.

In all cases, [^\\] eats the end-of-line as you indicate :(

MAPJe71

FYI: The hyphen inverts the meaning of the modifier(s) that follow i.e. (?-ms) to disable both.

… NPP’s behavior seems to be solely driven by the (?m) conditional - the behavior never changes, whether I use (?s) or (?-s).

It’s not caused by NPP, it’s caused by the regular expression ^([^\\]+\\){6}:

It contains a ^ and thus will be affected by the m modifier;
It does not contain a . (dot) and thus will not be affected by the s modifier.

NPP is behaving according to the specified expression.
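
A small sketch of that claim, assuming Python's re module as a stand-in for the Notepad++ engine, with re.MULTILINE and re.DOTALL playing the roles of the m and s modifiers. The text is a made-up, shortened version of the file in question.

```python
import re

text = "\n\nE:\\a\\b\\c\\file1.cs\nE:\\a\\b\\c\\file2.cs\n"
pattern = r"^([^\\]+\\){3}"  # contains ^ but no dot


def all_matches(flags):
    return [m.group(0) for m in re.finditer(pattern, text, flags)]


with_s = all_matches(re.MULTILINE | re.DOTALL)
without_s = all_matches(re.MULTILINE)
without_m = all_matches(0)

print(with_s == without_s)  # the s flag makes no difference here
print(len(without_s), len(without_m))
```

Toggling the s flag leaves the matches unchanged because the pattern has no dot, while toggling the m flag changes how many places ^ can match.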

I don’t believe I’ve ever read NPP’s documentation on Regular Expressions.
Jan Goyvaerts’ site has been my source of wisdom ;) and I use his RegexBuddy tool to check/verify regular expressions.

guy038

Hi sngbrdb and All,

First of all, as MAPJe71 said and, as you found out yourself, the regex [^,], for instance, matches, absolutely any of the 128,172 characters of the last Unicode 9.0 version, except for the usual comma sign !. So, to prevent any match of the regex engine to extend on several lines, it’s a good practise to include, systematically, the two characters \r\n in a negative character class. Therefore, our example should be re-written [^,\r\n]

Just note that the shorthand syntax \R, to match any kind of EOL character(s), cannot be used, inside a character class !

Now, I would like to clarify some points, relative to :

The two modifiers (?s) and (?m) and their opposite form (?-s) and (?-m)
The two assertions ^ and $
The dot meta-character .

As I presume that this post will be ( too ! ) long , just have a drink and… let’s go !

In the first place, it’s VERY important to realize that the two modifiers (?m) and (?s) do NOT deal with the same things :

The (?m) modifier, and its opposite form (?-m), change the meaning of the ^ and $ assertions
The (?s) modifier, and its opposite form (?-s), change the meaning of the . dot meta-character

By default, the regex engine of N++ considers any text as made of multiple lines. So :

The ^ symbol is a zero length assertion, which represents the location between an EOL character OR the very beginning of the current file and the first standard character of a line
The $ symbol is a zero length assertion, which represents the location between the last standard character of a line and an EOL character OR the very end of the current file

Although not necessary, and, especially, if all parts of your regex follow that behaviour, you may include, at the beginning of the regex, the (?m) modifier ( for multi-lines )

For instance, the regexes ^123 or (?m)^123 would match the 123 string of any line, which begins with the string 123 and the regexes 789$ or (?m)789$ would match the 789 string of any line, which ends with the string 789

On the contrary, when your regex begins with the (?-m) modifier ( for no multi-lines ) the regex engine considers all the contents of your current file as a single line. So, the meaning of the ^ and $ symbols is restricted :

The ^ symbol becomes a zero length assertion, which represents the location before the very first character of the current file
The $ symbol becomes a zero length assertion, which represents the location after the very last character of the current file

For instance, the regex (?-m)^123 would match a 123 string, at the beginning of the very first line of the current file and the regex (?-m)789$ would match a 789 string, at the end of the very last line of the current file. Notice, this implies that no EOL character follows the string 789. Indeed, in that case, the string 789 would not really end the file !!

You’ll probably agree, as I do, that the behaviour of the regex engine, when using a (?-m) modifier, seems rather uninteresting :-(( Indeed, the two regexes, above, could be, simply, re-written as \A123 and 789\z, with the zero-length assertions \A and \z
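
A short sketch of these anchors, under the assumption that Python's re module is an acceptable stand-in: there, ^ only matches at the start of the text unless re.MULTILINE is given (the reverse of the N++ default), and the end-of-text assertion is spelled \Z rather than \z. The sample text is made up.

```python
import re

text = "123 456\n123 789\n456 789"

# Without re.MULTILINE, ^ behaves like the (?-m) case: start of text only.
first_only = [m.start() for m in re.finditer(r"^123", text)]
anchored_start = [m.start() for m in re.finditer(r"\A123", text)]
# With re.MULTILINE, ^ matches at the start of every line.
every_line_start = [m.start() for m in re.finditer(r"^123", text, re.MULTILINE)]

# \Z matches only at the very end of the text.
last_only = [m.start() for m in re.finditer(r"789\Z", text)]
# With re.MULTILINE, $ matches at the end of every line.
every_line_end = [m.start() for m in re.finditer(r"789$", text, re.MULTILINE)]

print(first_only, anchored_start, every_line_start)
print(last_only, every_line_end)
```

The offsets show that ^ without multi-line mode and \A pick the same single position, which is why the (?-m) forms can be rewritten with \A and the end-of-text assertion.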

VERY IMPORTANT :

If your regex does NOT contain any ^ symbol, nor $ symbol, the modifiers (?m) and/or (?-m) are quite USELESS !!

By default, if the “. matches new line” option is UNCHECKED, the regex engine of N++ considers that the dot meta-character matches a standard character, only, and skips any EOL character !

Although not necessary, and, especially, if all parts of your regex follow that behaviour, you may include, at the beginning of the regex, the (?-s) modifier ( for NO single line )

Then, if we consider the simple text, below, with the two EOL characters \r\n, after digit 5

12345
67890

The regexes .+ or (?-s).+ would match, successively, the strings 12345 and 67890

On the contrary, when your regex begins with the (?s) modifier ( For single line ) AND/OR if the ". matches new line" option is CHECKED, the N++ regex engine considers that the dot meta-character can match, absolutely, any character ( standard and EOL ones ) !

Therefore, on the sample text above, the regex (?s).+ would match the overall string 12345\r\n67890, in one go !
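
The same two-line sample can be tried in Python's re module, used here as an assumed stand-in for the N++ engine. One difference to keep in mind: Python's dot excludes only \n, so the sketch uses a \n line ending instead of \r\n.

```python
import re

# Assumption: "\n" line ending. With "\r\n", Python's dot would still match "\r".
text = "12345\n67890"

# Default dot: stops at the line break, so each line is a separate match.
per_line = re.findall(r".+", text)
# (?s): the dot also matches the line break, so everything is one match.
whole_text = re.findall(r"(?s).+", text)

print(per_line)
print(whole_text)
```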

Notes :

The in-line modifiers (?s) and (?-s) have priority on the present state of the . matches new line option of the Find/Replace dialog. So :
- Even if that option is checked, the regex (?-s).+ would match any standard text, till an EOL character, excluded
- Even if that option is UNchecked, the regex (?s).+ would match all the subsequent text, till the end of the current file
Keep in mind that the combined use of the (?s) and (?-s) in-line modifiers, in a same regex, may be very interesting. For instance, the search of the regex (?s)(.+\R)(?-s)(.*123.*\R) and the replacement regex \2\1 would move the last line, containing the string 123, before the present contents of the current file, by swapping the two groups 1 and 2 !
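
A rough Python equivalent of that swap, with two stated assumptions: Python's re module does not support \R or switching modifiers in the middle of a pattern, so the sketch uses \n for the line ending and a scoped (?s:...) group to limit dot-matches-newline to the first part. The sample text is made up.

```python
import re

text = "aaa\nbbb\nx123y\nccc\n"

# (?s:...) applies dot-matches-newline to group 1 only;
# group 2 uses the default dot and therefore stays on one line.
pattern = re.compile(r"(?s:(.+\n))(.*123.*\n)")

# Swap the two groups: the line containing 123 moves to the top.
swapped = pattern.sub(r"\2\1", text, count=1)

print(repr(swapped))
```

Because group 1 is greedy, it backtracks from the end of the text, so the line that ends up in group 2 is the last one containing 123.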

VERY IMPORTANT :

As above, for the (?m) modifier, if your regex does NOT contain any . dot symbol, the modifiers (?s) and/or (?-s) are quite USELESS !!

Finally, let’s see the action of the two modifiers, m and s, used together. Consequently to what I said, just above, any regex containing these two modifiers should contain, at least, one dot meta-character and, either, a ^ or a $ symbol ! It will be the case, as we’re going to use the two regexes .{100,350}$ and ^.{100,350}, each of them preceded by one of the four modifier’s form, below :

(?s-m) ( in short : ^ and $ symbols match beginning and end of file / . symbol matches any character )
(?-sm) ( in short : ^ and $ symbols match beginning and end of file / . symbol matches standard characters )
(?m-s) ( in short : ^ and $ symbols match beginning and end of line / . symbol matches standard characters )
(?sm) ( in short : ^ and $ symbols match beginning and end of line / . symbol matches any character )
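
The four combinations can be compared on a tiny made-up text, assuming Python's re module as a stand-in, where re.DOTALL and re.MULTILINE correspond to s and m. The sketch uses the simplified pattern ^.+ rather than the bounded ^.{100,350}, so its match counts are not the ones listed for the license text.

```python
import re

text = "one\ntwo\nthree"


def matches(flags):
    # Every match of ^.+ under the given combination of flags.
    return re.findall(r"^.+", text, flags)


results = {
    "(?s-m)": matches(re.DOTALL),                 # ^ = start of text, dot = any char
    "(?-sm)": matches(0),                         # ^ = start of text, dot = no EOL
    "(?m-s)": matches(re.MULTILINE),              # ^ = start of line, dot = no EOL
    "(?sm)": matches(re.MULTILINE | re.DOTALL),   # ^ = start of line, dot = any char
}

for modifiers, found in results.items():
    print(modifiers, found)
```

With an unbounded .+ the two s-enabled cases both swallow the whole text in one match; a bounded quantifier such as {100,350} is what lets (?sm) produce several matches.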

To clearly notice the differences, between all these cases, let’s use the test text, below, corresponding to some parts of the license.txt file, slightly changed :

When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things.

To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it.

----- Test line, which contains 60 characters, ONLY ! ------

For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights.
Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations.

Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary.
To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all

Once this text, pasted in a new tab, and before applying the regexes, just verify that :

The word When, beginning the first line of that text, is NOT preceded by any character
The word all, ending the last line of that text, is NOT followed by any EOL character !

And, preferably :

Select the Word wrap behaviour, with the menu option View - Word wrap
Select the Show all characters behaviour, with the menu option View - Show Symbols - Show All Characters

Finally, in the Find dialog :

UNCHECK the Wrap around option
Select the Regular expression search mode

A last piece of advice : Before testing any of the regexes, below, just move the cursor back to the very beginning of this sample text ( so, before the word When ), with the CTRL + Home shortcut

Thus :

The regex (?s-m).{100,350}$ matches the last 350 characters, of the current file, spread out on one or several lines ( 1 occurrence )
The regex (?-sm).{100,350}$ matches the maximum of the last characters, if between 100 and 350, of the very last line, of the current file ( 1 occurrence )
The regex (?m-s).{100,350}$ matches the maximum of the last characters, if between 100 and 350, of any single line, of the current file ( 6 occurrences )
The regex (?sm).{100,350}$ matches a maximum range of any character, if between 100 and 350, followed by an EOL character OR finishing the current file, in one or several lines, empty or not ( 5 occurrences )

and :

The regex (?s-m)^.{100,350} matches the first 350 characters, of the current file, spread out on one or several lines ( See, note below ! )
The regex (?-sm)^.{100,350} matches the maximum of the first characters, if between 100 and 350, of the very first line, of the current file ( 1 occurrence )
The regex (?m-s)^.{100,350} matches the maximum of the first characters, if between 100 and 350, of any single line, of the current file ( 6 occurrences )
The regex (?sm)^.{100,350} matches a maximum range of any character, if between 100 and 350, preceded by an EOL character OR beginning the current file, in one or several lines, empty or not ( 4 occurrences )

IMPORTANT :

Due to an incorrect handling of backward assertions, the N++ regex engine may NOT produce, in some cases, the right matches ! This is just the case of the regex (?s-m)^.{100,350}, with the backward assertion ^. The regex engine should find one match, ONLY. However it, wrongly, finds 5 occurrences :-((

In fact, the regex engine seems, in that specific case, to use, instead, the regex (?s).{100,350}, which, simply, matches the longest string, till 350 characters, of any character, in one or several lines !

With the hope that this global overview could help you, in some cases !!

Best Regards,

guy038
