Remove unwanted Carriage Return
-
@neil-schipper said in Remove unwanted Carriage Return:
would be “easy-peasy”
I am sure if you can get that PR written and submitted to Boost::regex, they would be quite happy for your implementation. ;-)
-
@neil-schipper said in Remove unwanted Carriage Return:
dog|puppy
Also, since Boost::regex was derived as a PCRE, with roots in Perl, it still has the TIMTOWTDI philosophy:
(?<=dog|puppy)chow
can be represented as((?<=dog)|(?<=puppy))chow
(?<!dog|puppy)chow
can be represented as((?<!dog)(?<!puppy))chow
- note that because of De Morgan’s Laws,
NOT(A OR B)
becomes NOT(A) AND NOT(B)`
- note that because of De Morgan’s Laws,
\d{1,500}
is admittedly harder to come up with an equivalent lookbehind that will work. And by harder, I mean, I couldn’t in the last 5 minutes. (Specifically, we obviously don’t want to construct 500 alternatives manually, or fill up that space in the regex. That would have been the “easy” alternative, but not practical.) -
@guy038 said in Remove unwanted Carriage Return:
… the parts [^\R] just match any character … But … R … [and r if case sensitivity]
Yes, confirmed.
Interestingly, however, I also confirmed that each of these sets:
[\d] [\w] [\r] [\n] [\x31] [^\d] [^\w] [^\r] [^\n] [^\x31]
do match the specified character class or control character, or their complement, exactly “as advertised”.
I couldn’t find any reference to the extremely exceptional behavior of
[\R]
and[^\R]
either in the npp docs or in the 1.7.8 Boost doc Peter linked to earlier.I can’t imagine I’m the first to notice this.
-
@neil-schipper said in Remove unwanted Carriage Return:
extremely exceptional behavior of [\R] and [^\R]
IMO there is no exceptional behavior here.
Everything inside[
…]
is “one character”.[\d]
is one digit character
etc.Because
\R
is variable and can be one or two characters, its use inside[
…]
is not considered.Thus
[\R]
will matchR
(orr
if not case sensitive specified).Easy enough to do
[\r\n]
anyway, right? -
@alan-kilborn said in Remove unwanted Carriage Return:
Everything inside […] is “one character”.
You are backfilling into the spec(s) from observation a concept that isn’t there, even though observation indeed suggests that’s a plausible description of the internals.
Reading the specs, \d and \R are “peers”, and behave thusly in other contexts, such as \d+ and \R+, and, (?=\d) and (?=\R).
Easy enough to do [\r\n] anyway, right?
There’s the loss of generality/abstraction. The specs themselves suggest we expect to encounter
\x85|\x{2028}|\x{2029}
line endings now and again.If 0.5% of people seeking regex help had files in those formats, an experienced person such as yourself would not simply include
[\r\n]
in a solution without elaborating on its limitations. -
-
-
Ok, so I guess you can persist with what you’re talking about, but I suppose you’ll just be talking to yourself. :-)
-
Reading the specs, \d and \R are “peers”,
I disagree. But I do agree that maybe it could be explained better in the Notepad++ Searching document. However, that document does point you to the canonical Boost regex documentation, which is the official spec for the regex used by Notepad++; and, in my opinion, the Boost documents can only be interpreted to say that
\R
behaves differently than\d
or\r
or\n
or even\h
or\v
or\s
.In that document, you will see that
\r
,\n
,\t
,\v
and others are listed under the sentence, “The following escape sequences are all synonyms for single characters:” – meaning that each of those sequences matches only a single character at a time. So\v
might match any of the vertical spaces (CR, LF, and the weird ones), but a single\v
in a regex will only match a single character at a time. So if you had the stringAB\r\n
and matched for\v
, the first FIND would find just the\r
.The
\R
is described in its own section called “Matching Line Endings”, which shows that it expands into(?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}])
, which is an expression with parentheses around it and including an internal alternation|
– searching for(?>)
, you find it’s the syntax for an independent sub-expression . This is different than all the single-character escapes listed previously. With the same stringAB\r\n
, searching for\R
, the first FIND would match the two-character sequence\r\n
. The\R
behaves differently than all those other single-character escapes, because it can match multiple characters at once.Then if you back up a few paragraphs to Character sets, you will see the rules for character sets, including the sentence, “A bracket expression may contain any combination of the following:”. The sub-sections that follow underneath that are “Single characters”, “Character ranges”, “Negation”, “Character classes”, “Collating Elements”, “Collating Elements”, “Equivalence classes”, “Escaped Characters”, and “Combinations”. Note that none of those include “independent sub-expression”, or any other term that references a parentheses-based expression.
The bracket
[]
-based character sets cannot contain parentheses()
-based expressions. That is why\R
does not work in a bracket[]
-based class.An updated version of the usermanual mentions of
\R
can be found at https://github.com/pryrt/npp-usermanual/blob/backslashBigR/content/docs/searching.md (that temporary URL will be changed to the permanent URL by moderator power once the changes are merged into the main usermanual repository)- First, it’s been moved out of the Control Characters section into its own special section:
- Second, the Character Classes section has been improved to note that character classes cannot contain any parentheses-based group, including
\R
. - Third, in the Character Escape Sequences section, which contains the
\h
,\v
, and\s
(and thus people might assume that\R
fits in there), it is clarified that being a group causes\R
to be treated differently:
Hopefully, this is sufficient description in enough locations that it will prevent future confusion when users are looking up the meaning of
\R
and whether or not it can go inside a character class. -
-
-