Change in the way end of line characters act in regex?
-
I’ve been running a relatively straightforward regex to pull certain lines from working text files; I identify those lines by putting a hash (#) at the start of each one.
I start by organizing the file, deleting all \n characters and then replacing every (line start) with “\n(line start)” (that’s not literal, there are specific characters that start each line that I use as a cue).
Then I do a complicated search and replace to put hashes in front of certain lines that I need to extract. ^(.*complicated string) => #\1
Finally, I use ^[^#].+\n to delete all lines that don’t start with a hash mark.
This has been working just fine for ages (seriously, I’ve been doing this for years). Suddenly, a week or so ago, it’s stopped working predictably-- it’s grabbing extra lines, or skipping ends or something. I tried using ^[^#].+$ instead, and that sort of works (although it leaves a lot of empty lines behind), but even that is somehow not grabbing what I expected. I haven’t changed my scripts at all. Does anyone have any idea if there was an update, or a change in a setting or checkbox or option or something that I might have missed somewhere?
Feel free to suggest other methods for isolating/extracting lines if you must, but bear in mind that what I’m describing above is about the limit of my experience with regex. Anything involving other tools or more complicated patterns is not likely to be useful and may be summarily ignored.
ETA: for whatever it’s worth, I just booted up an older text editor that we used to use, and the same regex is still working fine there; that’s feeding my suspicion that there was something Notepad-specific involved. I’m going to try to check PeterJones’s suggestion of the “. matches newline” tickbox. Next week. After turkey. Thanks for the quick responses, though!
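For readers who want to reproduce the mark-then-delete workflow outside the editor, here is a minimal Python sketch. It is not the poster's actual script: the `WANTED` pattern is a hypothetical stand-in for the "complicated string", and `mark_and_extract` is a name made up for illustration.

```python
import re

# Hypothetical stand-in for the "complicated string" that flags a wanted line.
WANTED = re.compile(r"^(.*TODO)", re.MULTILINE)


def mark_and_extract(text: str) -> str:
    """Prefix wanted lines with '#', then keep only the '#' lines."""
    # Same idea as the replace  ^(.*complicated string) => #\1
    marked = WANTED.sub(r"#\1", text)
    # Instead of deleting the other lines with a regex, split on any line
    # ending (str.splitlines handles \n, \r\n and \r, among others) and
    # keep the lines that start with a hash.
    kept = [line for line in marked.splitlines() if line.startswith("#")]
    return "\n".join(kept)


if __name__ == "__main__":
    sample = "alpha\nbeta TODO fix\ngamma\n"
    print(mark_and_extract(sample))
```

Because the final filter splits on line endings rather than matching `\n` literally, this version does not care whether the file uses Unix or Windows newlines.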
-
@Daniel-Brandon said:
Does anyone have any idea if there was a update, or change in setting or checkbox or option or something that I might have missed somewhere?
There have been no recent changes to regex in Notepad++.
The only options that affect regex are Match case and . matches newline, so, start there I guess. Well, other options can affect it, but from reading your posting I didn’t get the feeling like you were replacing-in-selection (e.g.), or if you were, I think you know what you’re doing.
it’s stopped working predictably-- it’s grabbing extra lines, or skipping ends or something.
Probably no one can offer any advice without a very specific example.
But odds are, if you take the time to put that together for posting, you yourself may spot something you are doing wrong.
-
@Daniel-Brandon said in Change in the way end of line characters act in regex?:
Suddenly, a week or so ago, it’s stopped working predictably-- it’s grabbing extra lines, or skipping ends or something.
Given that, and the fact that you said your regex had \n, I am wondering if your data source changed newline types.
Windows newlines are \r\n; linux newlines are \n; and the now-less-common ancient Mac newlines are \r. For the matching portion of a regex, you can use \R which will match any of those three as a line-ender; but \R doesn’t work in the replacement, so you would have to pick one of the sequences to use for your replacement.
If you are certain that your files have always been Windows EOL (and if they still are – you can use View > Show Symbol > Show All Characters so that you can see whether lines end with just [LF] or end with [CR][LF]), then it may be that you previously had . matches newline, which would have allowed the .+ in the ^[^#].+\n to match the [CR] (\r) before the [LF] (\n) that you were manually matching. So, if you turned off . matches newline unexpectedly, the . would stop matching the \r character, and the search would never match “one or more non-newline characters followed by exactly the LF character”.
-
@PeterJones said:
and the fact that you said your regex had \n
I considered that, but since OP has had the solution in place for a reasonably long time, I’d think he’d be aware.
Probably using Unix line endings in his data from the start, and that means that using \n in the regex is fine.
-
@PeterJones This sounds promising-- I thought I had checked that, but it’s worth checking again, and I didn’t know the trick to show the different line ends.
-
I considered that, but since OP has had the solution in place for a reasonably long time, I’d think he’d be aware. Probably using Unix line endings in his data from the start.
But that, in combination with the possibility that . matches newline was previously on and is now off, would have allowed his regex to “work” with Unix newlines even though his data had Windows newlines.
And the “solution in place for a reasonably long time” doesn’t rule out the other possibility I raised: that the data he’s getting has recently changed from Unix newlines to Windows newlines, so a regex that worked with the old Unix newlines wouldn’t work with the new Windows newlines. And if he doesn’t regularly show (or check) a particular file’s newlines, he might not have realized there was a change in the input data.
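The "data changed from Unix to Windows newlines" scenario can be illustrated with a small Python sketch. One assumption to note: Python's `.` excludes only `\n`, so the dot is written here as `[^\r\n]` to approximate Notepad++ with ". matches newline" unticked. It is an approximation, not the Notepad++ engine itself.

```python
import re

# ^[^#].+\n with "." spelled as [^\r\n], i.e. a dot that matches
# neither \r nor \n (approximating the unticked checkbox).
DELETE_UNMARKED = re.compile(r"^[^#][^\r\n]+\n", re.MULTILINE)

unix_data = "# keep\ndrop me\n# also keep\n"
windows_data = unix_data.replace("\n", "\r\n")

# Unix newlines: the unmarked line is removed as intended.
print(repr(DELETE_UNMARKED.sub("", unix_data)))

# Windows newlines: the \r sits between the text and the \n, so the
# pattern never matches and nothing is removed.
print(repr(DELETE_UNMARKED.sub("", windows_data)))
```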
-
@Daniel-Brandon said in Change in the way end of line characters act in regex?:
it’s grabbing extra lines, or skipping ends or something.
My immediate thought is that you might have ticked the “. matches newline” option. Read this from the online manual. Actual option is down about a page.
Otherwise images of your replacement regex overlaying the file being worked on (if you don’t mind showing your real data) might be helpful. Also consider showing all characters including line endings. That reference in the online manual is here.
Terry
PS writing on a tablet is so slow that others got there first!
-
| Finally, I use ^[^#].+\n to delete all lines that don’t start with a hash mark.
That expression seems to have three possible issues.
- It will match and delete \n\n# Some stuff you wanted to keep as [^#] matches \n in the data.
- It won’t match \n#\n~ in the data though it should be removing a line that only has a # at the beginning.
- While you use \n in your expressions can you guarantee that there is never a \r or \r\n pair in your data?
A safer expression is ^[^#\r\n].*\R.
Notepad++ regexp is a little different than some other regexp engines in that in Notepad++ [^#] will also match end of line characters such as \n or \r. Also, the [^#] only matches one character. If you have a two-character CRLF in your data it will match just the CR (\r) and the next character in the data stream is a LF (\n).
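The first two issues, and the effect of the safer expression, can be checked with a short Python sketch. Some assumptions: Python's `re` does not support `\R`, so `(?:\r\n|\n|\r)` stands in for it; `.*` is written as `[^\r\n]*` so the dot cannot swallow a `\r`; and the second issue is read here as "a line holding a single character is not removed", which is an interpretation of the bullet above.

```python
import re

EOL = r"(?:\r\n|\n|\r)"  # stand-in for \R
ORIGINAL = re.compile(r"^[^#].+\n", re.MULTILINE)
SAFER = re.compile(rf"^[^#\r\n][^\r\n]*{EOL}", re.MULTILINE)

blank_then_hash = "\n\n# keep me\n"     # an empty line before a marked line
one_char_line = "# a\nx\n# b\n"         # an unmarked line of one character
crlf_data = "# a\r\ndrop me\r\n# b\r\n"

# Issue 1: [^#] matches the empty line's \n, so the marked line is lost.
print(repr(ORIGINAL.sub("", blank_then_hash)))  # '\n'
print(repr(SAFER.sub("", blank_then_hash)))     # '\n\n# keep me\n'

# Issue 2: .+ needs a second character, so the line "x" survives.
print(repr(ORIGINAL.sub("", one_char_line)))    # '# a\nx\n# b\n'
print(repr(SAFER.sub("", one_char_line)))       # '# a\n# b\n'

# Issue 3: the safer form also copes with Windows line endings.
print(repr(SAFER.sub("", crlf_data)))           # '# a\r\n# b\r\n'
```

Note that Python's `^` in multiline mode only recognises a line start after `\n`, so this sketch would not handle files that use bare `\r` as the line ending.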
-
Welp, when I came back from holiday, I tried again with the next batch of data, making sure that “. matches newline” was unchecked, and everything is working as expected once again. Either that did somehow get checked, or gremlins got into my laptop (again). Thanks again for all the suggestions!
-
@Daniel-Brandon said in Change in the way end of line characters act in regex?:
Either that did somehow get checked, or gremlins got into my laptop (again)
Most of the regulars here use the “search modifiers” instead of ticking the “. matches newline” box. That’s because we can’t be sure whether users of our regex will have it ticked or not. The search modifier overrides that tick box and gives some certainty about what our solution will do.
The reference in the online manual is here. (?s) is the same as ticking that box and (?-s) is no tick in that box.
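To see what the modifier changes, here is a rough analogue in Python's `re` module, which has the same `(?s)` flag. One difference to be aware of: Python has no global `(?-s)`, so the scoped form `(?-s:...)` is used below to play that role. The sample text is made up for illustration.

```python
import re

text = "# a\nx y\n# b\nlast\n"

# (?s) behaves like ticking ". matches newline": the greedy .+ runs across
# line breaks and swallows everything up to the final \n.
ticked = re.sub(r"(?s)^[^#].+\n", "", text, flags=re.MULTILINE)

# (?-s:...) turns the option off inside the group, even though re.DOTALL
# is passed in, so the pattern works line by line again.
unticked = re.sub(r"(?-s:^[^#].+\n)", "", text, flags=re.MULTILINE | re.DOTALL)

print(repr(ticked))    # '# a\n'
print(repr(unticked))  # '# a\n# b\n'
```

With the option on, the marked line "# b" is deleted along with the unmarked ones, which is consistent with the "grabbing extra lines" symptom described earlier in the thread.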
Terry
-
@Daniel-Brandon It also appears that some (but not all!) of the files are indeed reaching me with \r Carriage Return characters instead of \n Line feed characters baked in, which should not be happening, and never happened before, but there you go. Thanks to everyone who suggested looking for them. Between that and matches newline, I’m really hoping I’ve got it sorted now.
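When some files arrive with stray carriage returns, one option is to normalise every line ending to a single form before running the usual search. A minimal Python sketch of that idea follows; `normalize_newlines` is a hypothetical helper name, and converting to `\n` is an arbitrary choice.

```python
import re


def normalize_newlines(text: str) -> str:
    """Turn every CRLF and every lone CR into a single LF."""
    # \r\n is tried first so it becomes one \n rather than two.
    return re.sub(r"\r\n|\r", "\n", text)


if __name__ == "__main__":
    mixed = "# keep\r\ndrop me\r# also keep\n"
    cleaned = normalize_newlines(mixed)
    # After normalising, the original ^[^#].+\n pattern behaves as expected.
    print(repr(re.sub(r"^[^#].+\n", "", cleaned, flags=re.MULTILINE)))
```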