Regex - Find Upper Case Followed by Its Lower Case Version
-
I’m trying to find errors introduced by manually splitting lines. The specific error I’m looking for is an upper case letter followed by its lower case equivalent. I tried the following regex, but it didn’t work:
\u\L$0
Is there a way to do this with a regex?
-
Hmm… not as easy as I’d hoped.
There are some fatal flaws in your existing search regex:
- Depending on the state of your case-insensitive flag, `\u` might also match lowercase; I would recommend the explicit `(?-i)\u` to make sure it’s case-sensitive.
- `$0` doesn’t exist during the matching phase, because you don’t have a “whole match” yet to refer to. I’d recommend putting the `\u` into a group and using `\1` to backreference that group’s value.
  - Rather than being an empty string because the “whole match” isn’t defined yet, I think `$0` actually resolves to trying to match end-of-line followed by a literal `0`; but since end-of-line is always a zero-width match that doesn’t consume the actual CR and/or LF characters, end-of-line-followed-by-zero will never match anything.
- `\L` has two different meanings, depending on whether it’s a SEARCH or a REPLACE token:
  - In the SEARCH, it’s a character escape sequence which means “any character that is not lowercase”.
    - So your regex effectively says “any uppercase character, followed by any character that’s not lowercase, followed by something that matches `$0`”. I’m not 100% sure whether the `$0` in a SEARCH expression means “use the empty match, because there is no value for the whole match yet”, or whether it means the end-of-line-followed-by-`0` reading described above.
  - In the REPLACE, it’s a substitution escape sequence for converting the next character(s) into lowercase, if possible.
    - Based on your proposed regex, I am assuming you were thinking you could use it with the “convert the next character(s) into lowercase” meaning, but that’s not available in the SEARCH expression.
So instead of trying to “convert” the previously matched value to lowercase to continue the match, I would use a case-insensitive wrapper around the `\1` backreference: `(?-i)(\u)(?i:\1)`. However, this has the side effect that it would match `BB`, not just `Bb`.
If you don’t actually care about the case of the second copy of the letter, I’d recommend sticking with that one.
If you insist on caring that the second instance of the repeat letter must be lowercase to match, you could also add a negative lookahead that says “the next character cannot be uppercase”:
(?-i)(\u)(?!\u)(?i:\1)
… or you could add a positive lookahead that says “the next character must be lowercase”:
(?-i)(\u)(?=\l)(?i:\1)
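If you want to sanity-check the logic outside Notepad++, here is a rough sketch in Python. It is only an approximation: Python’s `re` module has no `\u` or `\l`, so I am assuming the ASCII ranges `[A-Z]` and `[a-z]` as stand-ins, and the sample line and helper name are made up for illustration.

```python
import re

# Python's re has no \u / \l classes, so ASCII ranges stand in for them.
# These approximate the Notepad++ (Boost) expressions; the engines differ.
ANY_CASE_REPEAT = re.compile(r"([A-Z])(?i:\1)")          # ~ (?-i)(\u)(?i:\1)
NOT_UPPER_NEXT = re.compile(r"([A-Z])(?![A-Z])(?i:\1)")  # ~ (?-i)(\u)(?!\u)(?i:\1)
LOWER_NEXT = re.compile(r"([A-Z])(?=[a-z])(?i:\1)")      # ~ (?-i)(\u)(?=\l)(?i:\1)


def find_all(pattern, text):
    """Return every full match (hypothetical helper, not a Notepad++ feature)."""
    return [m.group(0) for m in pattern.finditer(text)]


sample = "Tthe BBC said Hhello to Bob"  # invented test line

print(find_all(ANY_CASE_REPEAT, sample))  # ['Tt', 'BB', 'Hh']
print(find_all(NOT_UPPER_NEXT, sample))   # ['Tt', 'Hh']
print(find_all(LOWER_NEXT, sample))       # ['Tt', 'Hh']
```

The first pattern also reports `BB`, which is the side effect mentioned above; the two lookahead variants drop it.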
-
(?i:(\u)\1)(?<=(?-i:\u\l))
Beware the difference between `1` (one) and `l` (lower case L).
This makes use of two somewhat obscure features of Notepad++ regular expressions.
- `(?i:...)` and `(?-i:...)` are used to make the enclosed expressions case insensitive or case sensitive. Case sensitivity applies even to upper or lower case matches and back-references. So in the above expression, the first `\u` matches any letter, while the second `\u` matches only upper case letters.
- `(?<=...)` is used to make a test against preceding characters (called a lookbehind). In this case, after finding two characters such that the first character is a letter and the second character is the same as the first (ignoring case), we look back to check that the first is uppercase and the second is lowercase.
`\L$0` does not do what you think it does. In the find field, `\L` means any single character other than a lower case letter. The case-changing function is only available in replacement strings. You can’t use the dollar sign that way in the match field, either.
-
Hello, @Sylvester-Bullitt, @peterjones, @coises and All,
The @coises’s answer is quite clever. Personally, I ended up with this search/mark regex :
SEARCH / MARK
(?=(?-i:\u\l))(?i:(\u)\1)
which uses a look-ahead expression at the beginning, instead of the look-behind expression at the end of the @coises’s regex :
(?i:(\u)\1)(?<=(?-i:\u\l))
You could say it’s a minor difference, but it isn’t! Indeed, as our Boost regex engine does not allow look-behinds containing variable-length expressions, my version has the advantage of working with any syntax in the look-ahead! For example, from the INPUT text:
Aaaaaaaa Axxxxxx
Bbbbbbbbbbbbbbbbbbbb Bxxxxxx
Cccc Cxxxxxx
AAAAAAAA Axxxxxx
BBBBBBBBBBBBBBBBBBBB Bxxxxxx
CCCC Cxxxxxx
- The regex `(?=(?-i:\u\l+))(?i:(\u)\1+)` would mark the left part of the first three lines, before the space char
- But the regex `(?i:(\u)\1+)(?<=(?-i:\u\l+))` would just display the message `Find: Invalid regular expression`
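As a rough cross-check outside Notepad++, the same contrast can be sketched in Python. This is only an analogy: Python’s `re` module is not the Boost engine, it has no `\u` or `\l` (I am assuming the ASCII ranges `[A-Z]` and `[a-z]` instead), and its error message differs. It does, however, also refuse variable-length look-behinds.

```python
import re

# ASCII stand-ins for \u and \l; this approximates the Notepad++ regex above.
LOOKAHEAD = re.compile(r"(?=[A-Z][a-z]+)(?i:([a-z])\1+)")

input_text = """Aaaaaaaa Axxxxxx
Bbbbbbbbbbbbbbbbbbbb Bxxxxxx
Cccc Cxxxxxx
AAAAAAAA Axxxxxx
BBBBBBBBBBBBBBBBBBBB Bxxxxxx
CCCC Cxxxxxx"""

# Collect what the look-ahead version would "mark".
marked = [m.group(0) for m in LOOKAHEAD.finditer(input_text)]
print(marked)  # the left part of the first three lines only

# The look-behind version needs a variable-length test ([a-z]+),
# which Python's re rejects at compile time.
try:
    re.compile(r"(?i:([a-z])\1+)(?<=[A-Z][a-z]+)")
    lookbehind_ok = True
except re.error as exc:
    lookbehind_ok = False
    print("rejected:", exc)
```

The look-ahead can be moved to the front precisely because it is tested before any characters are consumed, so its length never has to be known in advance.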
Best Regards,
guy038
-