    Regex - Find Upper Case Followed by Its Lower Case Version

    • Sylvester Bullitt

      I’m trying to find errors introduced by manually splitting lines. The specific error I’m looking for is an upper case letter followed by its lower case equivalent. I tried the following regex, but it didn’t work:

      \u\L$0
      

      Is there a way to do this with a regex?

      • PeterJones @Sylvester Bullitt

        @Sylvester-Bullitt ,

        Hmm… not as easy as I’d hoped.

        There are a few fatal flaws in your existing search regex:

        • depending on the state of your case-insensitive flag, \u might also match lowercase; I would recommend the explicit (?-i)\u in order to make sure it’s case-sensitive.
        • $0 doesn’t exist during the matching phase, because there is no “whole match” yet to refer to. Instead, put the \u into a group and use \1 to backreference that group’s value.
          • Rather than resolving to an empty string because the “whole match” isn’t defined yet, I think $0 actually tries to match end-of-line followed by a literal 0; but since end-of-line is always a zero-width match that doesn’t consume the actual CR and/or LF characters, end-of-line followed by the character 0 will never match anything.
        • \L has two different meanings, depending on whether it’s a SEARCH or a REPLACE token
          • in the SEARCH, it’s a character escape sequence which means “any character that is not lowercase”
            • so your regex thus effectively says “any uppercase character, followed by any character that’s not lowercase, followed by something that matches $0”
            • I’m not 100% sure whether the $0 in a SEARCH expression means “use the empty match, because there is no value for the whole match yet”, or whether it means end-of-line followed by a literal 0, as I guessed above
          • in the REPLACE, it’s a substitution escape sequence for converting the following character(s) to lowercase, if possible
          • based on your proposed regex, I am assuming you thought you could use it with the “convert the next character(s) to lowercase” meaning, but that’s not available in a SEARCH expression

        So instead of trying to “convert” the previously matched value to lowercase to continue the match, I would wrap the \1 backreference in a case-insensitive group: (?-i)(\u)(?i:\1). However, this has the side effect that it would match BB, not just Bb.

        If you don’t actually care about the case of the second copy of the letter, I’d recommend sticking with that one.

        If the second instance of the repeated letter must be lowercase to match, you could add a negative lookahead that says “the next character cannot be uppercase”: (?-i)(\u)(?!\u)(?i:\1)

        … or you could add a positive lookahead that says “the next character must be lowercase”: (?-i)(\u)(?=\l)(?i:\1)
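For anyone who wants to try the three variants outside Notepad++, here is a minimal Python sketch. It is only an approximation: Python's re module has no \u or \l, so the ASCII classes [A-Z] and [a-z] stand in for them (my assumption), and since Python is case-sensitive by default the leading (?-i) is dropped. The names sample and hits are made up for this illustration.

```python
import re

# ASCII stand-ins for Notepad++'s \u and \l (an assumption; Python's re lacks them).
UPPER = "[A-Z]"
LOWER = "[a-z]"

# Python's re is case-sensitive by default, so no leading (?-i) is needed.
# (?i:\1) makes only the backreference case-insensitive.
any_case = re.compile(rf"({UPPER})(?i:\1)")              # second letter in either case
neg_ahead = re.compile(rf"({UPPER})(?!{UPPER})(?i:\1)")  # next char must not be uppercase
pos_ahead = re.compile(rf"({UPPER})(?={LOWER})(?i:\1)")  # next char must be lowercase

sample = "BB Bb bB bb Ab"


def hits(pattern, text):
    """Return the text of every non-overlapping match."""
    return [m.group(0) for m in pattern.finditer(text)]


print(hits(any_case, sample))
print(hits(neg_ahead, sample))
print(hits(pos_ahead, sample))
```

In this sketch the first pattern reports both BB and Bb, while the two lookahead variants report only Bb, which mirrors the side effect described above.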

        • Coises

          (?i:(\u)\1)(?<=(?-i:\u\l))
          Beware the difference between 1 (one) and l (lower case L).

          This makes use of two somewhat obscure features of Notepad++ regular expressions.

          (?i:...) and (?-i:...) are used to make the included expressions case insensitive or case sensitive. Case sensitivity applies even to the upper case and lower case classes (\u and \l) and to back-references. So in the above expression, the first \u matches any letter, while the second \u matches only upper case letters.

          (?<=...) is used to make a test against preceding characters (called a lookbehind). In this case, after finding two characters such that the first character is a letter and the second character is the same as the first (ignoring case), we look back to check that the first is uppercase and the second is lowercase.

          \L$0 does not do what you think it does. In the find field, \L means any single character other than a lower case letter. The changing case function is only available in replacement strings. You can’t use the dollar sign that way, either, in the match field.
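The "match the pair first, then look back to check the cases" idea can be sketched in Python as well. This is a rough translation, not the Notepad++ regex itself: [A-Za-z] stands in for \u under (?i), and [A-Z] / [a-z] stand in for the case-sensitive \u and \l (all ASCII-only assumptions). The function name find_split_errors is hypothetical.

```python
import re

# Step 1: any letter followed by the same letter, ignoring case.
# Step 2: a fixed-width look-behind confirms the pair is upper-then-lower.
# ASCII classes are stand-ins for Notepad++'s \u and \l (an assumption).
pair_then_check = re.compile(r"([A-Za-z])(?i:\1)(?<=[A-Z][a-z])")


def find_split_errors(text):
    """Return (offset, matched_text) for each upper-then-same-lower pair."""
    return [(m.start(), m.group(0)) for m in pair_then_check.finditer(text)]


print(find_split_errors("see Ccat, BB, bb, bB, Zz"))
```

Only Cc and Zz should be reported here; ee, BB, bb and bB all pass the first step but fail the look-behind check.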

          • guy038

            Hello, @Sylvester-Bullitt, @peterjones, @coises and All,

            @coises’s answer is quite clever. Personally, I ended up with this search/mark regex:

            SEARCH / MARK (?=(?-i:\u\l))(?i:(\u)\1)

            which uses a look-ahead expression at the beginning, instead of the look-behind expression at the end of @coises’s regex:

            (?i:(\u)\1)(?<=(?-i:\u\l))


            You could say it’s a minor difference, but it isn’t! Indeed, as our Boost regex engine does not allow look-behinds containing non-fixed-length expressions, my version has the advantage of working with any syntax inside the look-ahead!

            For example, from the INPUT text :

            Aaaaaaaa Axxxxxx
            Bbbbbbbbbbbbbbbbbbbb Bxxxxxx
            Cccc Cxxxxxx
            
            AAAAAAAA Axxxxxx
            BBBBBBBBBBBBBBBBBBBB Bxxxxxx
            CCCC Cxxxxxx
            
            • The regex (?=(?-i:\u\l+))(?i:(\u)\1+) would mark the left part of the first three lines, before the space char

            • But the regex (?i:(\u)\1+)(?<=(?-i:\u\l+)) would just display the message Find: Invalid regular expression
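As a side note, the same contrast can be reproduced with Python's re module, which also rejects variable-length look-behinds. The sketch below is an approximation only: [A-Z], [a-z] and [A-Za-z] are ASCII stand-ins for \u and \l (my assumption), and the variable names are made up for the illustration.

```python
import re

INPUT = """\
Aaaaaaaa Axxxxxx
Bbbbbbbbbbbbbbbbbbbb Bxxxxxx
Cccc Cxxxxxx

AAAAAAAA Axxxxxx
BBBBBBBBBBBBBBBBBBBB Bxxxxxx
CCCC Cxxxxxx
"""

# Look-ahead first: variable-length content is accepted inside (?=...).
ahead = re.compile(r"(?=[A-Z][a-z]+)([A-Za-z])(?i:\1)+")
marked = [m.group(0) for m in ahead.finditer(INPUT)]

# Look-behind last: the + makes it variable-length, which Python's re
# refuses at compile time, much like the Boost engine does.
try:
    re.compile(r"([A-Za-z])(?i:\1)+(?<=[A-Z][a-z]+)")
    behind_error = None
except re.error as exc:
    behind_error = str(exc)

print(marked)
print(behind_error)
```

The look-ahead pattern should pick out the left-hand word of the first three lines only, while compiling the look-behind pattern raises an error instead of producing a usable regex.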

            Best Regards,

            guy038
