Remove unwanted Carriage Return
-
@guy038 Thanks, Guy.
I’m noticing now why the 2nd regex I presented as invalid,
(?<[^\R])\R(?=[^\R])
, is invalid: bad syntax due to missing = after <.After that fix, it’s merely wrong (since the fancy \R construct is not decoded when appearing inside
[]
as you pointed out).Best,
Neil -
@neil-schipper said in Remove unwanted Carriage Return:
(?<!\R)\R(?!\R)
(?<[^\R])\R(?=[^\R])
My best explanation is that \R can be either 1 or 2 bytes, but a look-behind must be fixed width. Any comment?Since a negative look-behind must be fixed width, the way I use to get a valid expression is to replace
\R
by a\v
or vertical tab, as follows:(?<!\v)\R(?!\R)
-
@astrosofista said:
replace \R by a \v
Confirmed, and good to know, thanks. It’s strange that both constructs encode the 1 or 2 byte newline sequences, but only \v is valid.
a negative look-behind must be fixed width
Both positive and negative look-behinds have this limitation (perhaps what you meant to say).
-
@neil-schipper said in Remove unwanted Carriage Return:
Both positive and negative look-behinds have this limitation (perhaps what you meant to say).
Nope.
vs
Notice that only the lookbehind has the
pattern must be of fixed length
caveat; the lookahead can be variable width. -
@peterjones My comment pertains to the two kinds of look-behind, positive
(?<=
and negative(?<!
(and I’ve tested both), and makes no mention of the two kinds of look-aheads. -
replace \R by a \v
This doesn’t make sense in the desired usage above.
A\R
can be one or two characters, thus not fixed length.
A\v
is always only one character, when it matches. -
@alan-kilborn said in Remove unwanted Carriage Return:
A \v is always only one character, when it matches.
Good point. But @astrosofista’s trick does work with conventional UTF-8 \r\n line endings; it’s a more tolerant way of simply specifying \n, and would also handle other non-standard line ending formats.
Your comment reminds me that it’s risky to get too used to throwing \v around: if used in a matched text expression which will be replaced, it’s easy to inadvertently destroy a pristine file’s uniform line endings.
-
Sorry, I misread. Time to stop trying to think for the evening, apparently
-
@neil-schipper said in Remove unwanted Carriage Return:
@astrosofista’s \v trick does work with conventional UTF-8 \r\n line endings; it’s a more tolerant way of simply specifying \n, and would also handle other non-standard line ending formats.
I don’t think it has value. If you need to match \r\n then \v\v will match it, but it will also match other things (that maybe aren’t wanted), so…no real point in it.
-
@neil-schipper said in Remove unwanted Carriage Return:
Both positive and negative look-behinds have this limitation (perhaps what you meant to say).
Yes, that’s what I meant. My apologies for the possible misunderstanding.
Anyway, although both lookbehinds share such limitation, they differ because in the case of the positive lookbehind we have at our disposal the alternative of the operator \K, but no operator for the negative one.
It would be nice if this issue could be fixed sometime.
-
@astrosofista said in Remove unwanted Carriage Return:
It would be nice if this issue could be fixed sometime.
It’s just a coincidence that
\K
can be used as a variable length positive lookbehind.
Because there is no equivalent for a negative lookbehind, doesn’t mean there’s an issue that could be “fixed”. -
@alan-kilborn said in Remove unwanted Carriage Return:
@astrosofista said in Remove unwanted Carriage Return:
It would be nice if this issue could be fixed sometime.
Because there is no equivalent for a negative lookbehind, doesn’t mean there’s an issue that could be “fixed”.
The lookbehind fixed-length restriction is caused by the Boost::regex library, not by anything Notepad++ does. So the issue would have to be fixed there.
To find more of the history of Boost::regex, and how long they’ve known about that restriction, I went to https://www.boost.org/doc/libs/ (I kept cutting stuff out of the 1.78 URL until I found a page that listed what other versions were available), and looked at old versions until I found the earliest Boost::regex that I noticed lookbehind syntax documented: v1.33.1 from 2004 – where they already note that restriction. If they’ve known about that limitation since 2004 and never gotten rid of that restriction, there’s probably a good technical reason that it’s too hard to implement, and it’s not likely to change anytime soon.
-
@peterjones said in Remove unwanted Carriage Return:
there’s probably a good technical reason that it’s too hard to implement, and it’s not likely to change anytime soon.
I think the limitation probably is rooted in runtime complexity for the engine that would provide a poor user experience (way too long for it to examine every possible match, potential for engine catastrophic overflow, etc.). Just my hunch.
-
@alan-kilborn @astrosofista @PeterJones
Open ended variable length negative look-behinds using subexpressions like
hello.*
would have huge performance implications.Upper-bounded variable length expressions like
dog|puppy
or\d{1,500}
would be “easy-peasy” (ie, computers doing exactly what computer are good at doing), and extremely useful. -
@neil-schipper said in Remove unwanted Carriage Return:
would be “easy-peasy”
I am sure if you can get that PR written and submitted to Boost::regex, they would be quite happy for your implementation. ;-)
-
@neil-schipper said in Remove unwanted Carriage Return:
dog|puppy
Also, since Boost::regex was derived as a PCRE, with roots in Perl, it still has the TIMTOWTDI philosophy:
(?<=dog|puppy)chow
can be represented as((?<=dog)|(?<=puppy))chow
(?<!dog|puppy)chow
can be represented as((?<!dog)(?<!puppy))chow
- note that because of De Morgan’s Laws ,
NOT(A OR B)
becomes NOT(A) AND NOT(B)`
- note that because of De Morgan’s Laws ,
\d{1,500}
is admittedly harder to come up with an equivalent lookbehind that will work. And by harder, I mean, I couldn’t in the last 5 minutes. (Specifically, we obviously don’t want to construct 500 alternatives manually, or fill up that space in the regex. That would have been the “easy” alternative, but not practical.) -
@guy038 said in Remove unwanted Carriage Return:
… the parts [^\R] just match any character … But … R … [and r if case sensitivity]
Yes, confirmed.
Interestingly, however, I also confirmed that each of these sets:
[\d] [\w] [\r] [\n] [\x31] [^\d] [^\w] [^\r] [^\n] [^\x31]
do match the specified character class or control character, or their complement, exactly “as advertised”.
I couldn’t find any reference to the extremely exceptional behavior of
[\R]
and[^\R]
either in the npp docs or in the 1.7.8 Boost doc Peter linked to earlier.I can’t imagine I’m the first to notice this.
-
@neil-schipper said in Remove unwanted Carriage Return:
extremely exceptional behavior of [\R] and [^\R]
IMO there is no exceptional behavior here.
Everything inside[
…]
is “one character”.[\d]
is one digit character
etc.Because
\R
is variable and can be one or two characters, its use inside[
…]
is not considered.Thus
[\R]
will matchR
(orr
if not case sensitive specified).Easy enough to do
[\r\n]
anyway, right? -
@alan-kilborn said in Remove unwanted Carriage Return:
Everything inside […] is “one character”.
You are backfilling into the spec(s) from observation a concept that isn’t there, even though observation indeed suggests that’s a plausible description of the internals.
Reading the specs, \d and \R are “peers”, and behave thusly in other contexts, such as \d+ and \R+, and, (?=\d) and (?=\R).
Easy enough to do [\r\n] anyway, right?
There’s the loss of generality/abstraction. The specs themselves suggest we expect to encounter
\x85|\x{2028}|\x{2029}
line endings now and again.If 0.5% of people seeking regex help had files in those formats, an experienced person such as yourself would not simply include
[\r\n]
in a solution without elaborating on its limitations. -
-
-
Ok, so I guess you can persist with what you’re talking about, but I suppose you’ll just be talking to yourself. :-)