
    Unexpected regex behaviour

    • Claudia Frank

      Hi,

      I have a regular expression which is this

      editor\.([a-z]+)(?=[\(|\r\n])
      

      From my understanding, it looks for a string which has

      • editor. (literally) followed by
      • one or more lowercase characters followed by either
      • ( or carriage return / newline

      So I don’t expect that this line

      editor.changeInsertion(int length, const char *text)
      

      is found but this line should be found

      editor.changeinsertion(int length, const char *text)
      

      What happens is that both lines are found.
      If I check the Match case box then it works: only the second line is found.

      Is this expected behaviour or did I misunderstand the regex meaning?

      Cheers
      Claudia

        • guy038

        Hi, Claudia,

        Oh ! Interesting problem, indeed !

        The regex syntax [a-z] doesn’t mean that you, necessarily, search for one lowercase letter, only ! In the same way, the syntax [A-Z] doesn’t try to match an uppercase letter, exclusively ! Indeed, it all depends on the current state of the Match case option :

        • If the Match case option is checked, the regex engine DOES care about the case of the letter matched ( Upper or Lower ). Then :

          • The regex [A-Z] searches for an uppercase letter, exclusively, between A and Z, inclusive

          • The regex [a-z] searches for a lowercase letter, exclusively, between a and z, inclusive

        • If the Match case option is NOT checked, the regex engine does NOT care about the case of the letter matched. So :

          • The regex [A-Z] searches for an uppercase OR a lowercase letter ( equivalent to the regex [A-Za-z] )

          • The regex [a-z] searches for a lowercase OR an uppercase letter ( equivalent to the regex [A-Za-z] )

        Moreover, if you use the (?i) OR (?-i) modifier, before the square bracket range, you force the regex engine to behave in a case-insensitive / case-sensitive way, independently of the current state of the Match case option !
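As a rough illustration outside Notepad++, Claudia's pattern can be tried with Python's re module. This is only a sketch under assumptions: Python's engine is not the Boost engine used by Notepad++, the re.IGNORECASE flag merely stands in for an unchecked Match case box, the helper name found is made up for the example, and Python writes the modifier in its scoped form (?-i:...).

```python
import re

PATTERN = r"editor\.([a-z]+)(?=[\(|\r\n])"
MIXED = "editor.changeInsertion(int length, const char *text)"
LOWER = "editor.changeinsertion(int length, const char *text)"


def found(pattern, line, match_case):
    """Return the matched text or None (hypothetical helper).

    match_case=False mimics an unchecked Match case box by passing
    re.IGNORECASE; this is an approximation, not Notepad++ itself.
    """
    flags = 0 if match_case else re.IGNORECASE
    m = re.search(pattern, line, flags)
    return m.group(0) if m else None


# Case-sensitive: only the all-lowercase line is found.
print(found(PATTERN, MIXED, match_case=True))   # None
print(found(PATTERN, LOWER, match_case=True))   # editor.changeinsertion

# Case-insensitive: [a-z] also accepts 'I', so the mixed-case line is found too.
print(found(PATTERN, MIXED, match_case=False))  # editor.changeInsertion

# A scoped (?-i:...) keeps the class case-sensitive even with the flag on.
STRICT = r"editor\.((?-i:[a-z])+)(?=[\(|\r\n])"
print(found(STRICT, MIXED, match_case=False))   # None
```

The last call shows the point made above: a local modifier overrides the global case setting for the bracket range it wraps.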


        I sum up all the different cases, in a table, below :

        •------------•-----------------------------------------------------------------------------------------------------------------•  
        |   Option   |                               REGEX Syntax for matching 1 NON ACCENTUATED letter                                |
        |            •---------------•---------------•---------------•---------------•----------------•----------------•---------------•
        | Match case |     [A-Z]     |     [a-z]     |   (?i)[A-Z]   |   (?i)[a-z]   |   (?-i)[A-Z]   |   (?-i)[a-z]   |   [A-Za-z]    |
        •------------•===============•===============•===============•===============•================•================•===============•
        |     NO     |  Upper/Lower  |  Upper/Lower  |  Upper/Lower  |  Upper/Lower  |     Upper      |     Lower      |  Upper/Lower  |
        •------------•---------------•---------------•---------------•---------------•----------------•----------------•---------------•
        |    YES     |     Upper     |     Lower     |  Upper/Lower  |  Upper/Lower  |     Upper      |     Lower      |  Upper/Lower  |
        •------------•---------------•---------------•---------------•---------------•----------------•----------------•---------------•
        

        From that table, it’s easy to see that :

        • The use of the (?-i) modifier implies a case-sensitive search of letters, whether the Match case option is checked or NOT

        • The use of the (?i) modifier implies a case-insensitive search of letters, whether the Match case option is checked or NOT, as does the use of the regex [A-Za-z] !
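The table can be reproduced, at least approximately, with a few lines of Python. Again this is a sketch and not Notepad++ itself: re.IGNORECASE plays the role of "Match case = NO", Python's scoped modifiers (?i:...) and (?-i:...) replace (?i) and (?-i), and only the single letter G / g is probed, which is an arbitrary choice. The names letters_matched and build_table are invented for the example.

```python
import re

# Python spellings of the seven table columns (scoped inline flags).
PATTERNS = ["[A-Z]", "[a-z]", "(?i:[A-Z])", "(?i:[a-z])",
            "(?-i:[A-Z])", "(?-i:[a-z])", "[A-Za-z]"]


def letters_matched(pattern, match_case):
    """Classify a pattern as 'Upper', 'Lower' or 'Upper/Lower'.

    Only 'G' and 'g' are tested, as stand-ins for any non-accented letter.
    """
    flags = 0 if match_case else re.IGNORECASE
    upper = re.fullmatch(pattern, "G", flags) is not None
    lower = re.fullmatch(pattern, "g", flags) is not None
    hits = (("Upper", upper), ("Lower", lower))
    return "/".join(name for name, hit in hits if hit)


def build_table():
    """Return one row per Match case state, in the table's column order."""
    return {
        ("YES" if match_case else "NO"):
            [letters_matched(p, match_case) for p in PATTERNS]
        for match_case in (False, True)
    }


for option, row in build_table().items():
    print(option, row)
```

With Python's engine the two printed rows agree with the NO and YES rows of the table above.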

        Best Regards,

        guy038

          • Claudia Frank @guy038

          Hi guy038,

          thank you for testing this.
          When you say “necessarily, search for one lowercase letter, only”, does this mean
          it is expected behaviour, even by regex definition? If so, why do I have to distinguish between A-Z and a-z?
          Don’t get me wrong, I’m absolutely fine with it if I have to use the Match case check box, I just
          want to understand whether, from a regex point of view, this is a misunderstanding on my side.

          Cheers
          Claudia
          Btw. I saw I got a folder on your notepad tab list - yeah ;-)

            • guy038

            Hi Claudia,

            Just a simple test, because my reply to your previous post seems to have been flagged as spam ( !? ) I get the message :

            Error
            Post content was flagged as spam by Akismet.com

            guy038

              • guy038

              Hello Claudia and All,

              Because of some tiring days, this week, at work, and, also, because of a nice ski-day, at Courchevel, on Saturday ( quite tired too, as it was the first outing, this winter ! ), I have not posted anything yet. Each evening, I have been trying to sort out these case problems, little by little, with regexes of the form [Char1-Char2] or [Char1-Char2]+. However, it turns out that it’s even worse than I thought, at first sight :-((

              Globally speaking, we must distinguish TWO main cases :

              A) The Match case option, of the Search/Replace/Mark dialog, is checked

              Let’s consider the simple test string, with the 26 letters, in both cases, below :

              ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
              

              Then, for instance, the search of the regex [F-q]+ matches the string FGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopq

              Things are quite simple : The regex [Char1-Char2] matches any character, whose code-point is >= to Char1’s code-point AND <= to Char2’s code-point !

              Note two points :

              • This case-sensitive search is the default behaviour in most regex engines

              • The code-point of Char2 MUST be >= to the code-point of Char1. Otherwise, when clicking on the Find Next button, you get, logically, the error message Find:Invalid regular expression.

              Strangely, with N++, when searching for the invalid regex [F-D], if you click on the Count button, on the Find All of the Mark tab, or on some other buttons, you just get the message Count: 0 matches or Mark: 0 matches, instead of the error message !?
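The case-sensitive rule of case A) can be checked by hand against a regex engine. The sketch below uses Python's re module as a stand-in (an assumption, since Notepad++ uses Boost) and compares a plain code-point test with the result of [F-q]+. It also shows that, in Python, an inverted range such as [F-D] is rejected when the pattern is compiled. The helper names in_range and is_valid are invented for the example.

```python
import re

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def in_range(ch, lo, hi):
    """Case-sensitive rule from the post: lo <= ch <= hi, by code-point."""
    return ord(lo) <= ord(ch) <= ord(hi)


# Characters selected by the code-point rule versus the regex match.
by_hand = "".join(ch for ch in LETTERS if in_range(ch, "F", "q"))
by_regex = re.search(r"[F-q]+", LETTERS).group(0)
print(by_regex)             # FGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopq
print(by_hand == by_regex)  # True


def is_valid(pattern):
    """Return True if Python's re accepts the pattern, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(is_valid("[F-q]"), is_valid("[F-D]"))  # True False
```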


              B) The Match case option, of the Search/Replace/Mark dialog, is UNCHECKED

              Things become more complicated ! Let’s consider our previous regex [F-q]+ and the same test string.

              Logically, as the search is case-insensitive, any letter of the previously found range FGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopq is matched, in either lowercase or uppercase. And, as this range contains all 26 letters of the alphabet, the whole test string should be matched. Indeed, that is the correct behaviour, which I verified on the site https://regex101.com , with the i modifier ( insensitive )

              Unfortunately, with Notepad++, the regex [F-q]+ matches, successively, the two strings FGHIJKLMNOPQ, then fghijklmnopq, only :-( The regex engine wrongly considers only the letters whose lowercase AND uppercase forms both belong to the range [F…q] ! I would consider this behaviour as a bug !

              Moreover, the two regexes, [F-q]+ and [F-Q]+, wrongly give the same result with the N++ regex engine, contrary to what the Regex101 site gives !
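For comparison, here is how another engine treats the same two regexes. This is a small sketch with Python's re module and its re.IGNORECASE flag, which is an assumption standing in for an insensitive search; it says nothing about Boost itself. The helper name runs is invented.

```python
import re

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def runs(pattern):
    """Return every case-insensitive match of pattern in the test string."""
    return re.findall(pattern, LETTERS, re.IGNORECASE)


# [F-q] covers every letter in one case or the other: one single match.
print(runs(r"[F-q]+") == [LETTERS])  # True

# [F-Q] only covers F..Q, so two separate runs are found.
print(runs(r"[F-Q]+"))  # ['FGHIJKLMNOPQ', 'fghijklmnopq']
```

In this engine the two regexes give different results, in line with what is described for Regex101 above.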


              I also found some other weird cases, when one limit of the range is not a letter ! Let’s consider the standard ASCII list string ( from character \x21 to character \x7e ), below :

              !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
              

              and suppose that you use the N++ Mark feature, with the two options Purge for each search and Wrap around checked

              Then, for instance, the two regexes [5-l] and [K-~] match, respectively, ANY character in :

              56789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijkl
              
              KLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
              

              when the Match case option is CHECKED. Luckily, it’s the expected behaviour !!

              And these same regexes match, respectively, ANY character in :

              56789:;<=>?@ABCDEFGHIJKL[\]^_`abcdefghijkl
              
              KLMNOPQRSTUVWXYZklmnopqrstuvwxyz{|}~
              

              when the Match case option is UNCHECKED. Why, among other things, is the six-character block, below :

              [\]^_`
              

              taken into account in the first regex and NOT in the second one !!!???
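One possible explanation, offered only as a guess: the engine might lowercase both limits of the range and the candidate character before comparing code-points. This has not been checked against the Boost source or confirmed anywhere in this thread; the function name lowered_endpoint_model is invented, and the model is kept only because it happens to reproduce the three unchecked results quoted above.

```python
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
ASCII = "".join(chr(c) for c in range(0x21, 0x7F))  # '!' .. '~'


def lowered_endpoint_model(lo, hi, text=ASCII):
    """HYPOTHETICAL rule, not verified against Boost:
    lowercase both range limits and the candidate, then compare code-points.
    """
    lo_cp, hi_cp = ord(lo.lower()), ord(hi.lower())
    return "".join(ch for ch in text if lo_cp <= ord(ch.lower()) <= hi_cp)


# [5-l] stays 5..l, so the block [\]^_` (0x5b-0x60) is inside the range.
print(lowered_endpoint_model("5", "l"))

# [K-~] becomes k..~ (0x6b-0x7e), which leaves the block [\]^_` outside.
print(lowered_endpoint_model("K", "~"))

# [F-q] becomes f..q, which gives F..Q and f..q only.
print(lowered_endpoint_model("F", "q", LETTERS))
```

Under this assumption the six-character block survives with [5-l] because the lower limit 5 is unchanged by lowercasing, whereas with [K-~] the lower limit moves from K up to k, past the block.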


              So, I’m asking for people who could test these two regexes [5-l] and [K-~], in an INSENSITIVE way, against the test string above ( range from \x21 to \x7e ), with regex tools other than the Boost regex engine, in order to know which single characters are really matched.

              In my opinion, to be logical, the correct behaviour is :

              • The regex [5-l] should match ANY character in :

                56789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz

              • The regex [K-~] should match ANY character in :

                ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
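The expected behaviour described here can be written down as a rule: a character matches if it, or its other-case form, lies in the range. The sketch below states that rule and compares it with one non-Boost engine, Python's re module with re.IGNORECASE. This is a single data point under that assumption, not a survey of engines, and the function names are invented.

```python
import re

ASCII = "".join(chr(c) for c in range(0x21, 0x7F))  # '!' .. '~'


def expected_insensitive(lo, hi, text=ASCII):
    """Keep ch if ch itself, ch.lower() or ch.upper() lies in lo..hi."""
    def hit(c):
        return ord(lo) <= ord(c) <= ord(hi)
    return "".join(ch for ch in text
                   if hit(ch) or hit(ch.lower()) or hit(ch.upper()))


def python_insensitive(pattern, text=ASCII):
    """Characters of text matched by pattern under re.IGNORECASE."""
    return "".join(re.findall(pattern, text, re.IGNORECASE))


for lo, hi in (("5", "l"), ("K", "~")):
    pattern = f"[{lo}-{hi}]"
    same = expected_insensitive(lo, hi) == python_insensitive(pattern)
    print(pattern, expected_insensitive(lo, hi), same)
```

For both regexes, Python's result is identical to the rule, and to the two character lists given above.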


              Therefore, and to sum up, when using the Regular expression search mode, I advise you to :

              • Always click, once, on the Find Next button to ensure that your regex is a valid one

              • Preferably, apart from simple searches, such as a single word, always check the Match case option, to get logical results from the regex engine


              Claudia, I tried to find some information about the case insensitivity feature. Look at the different links ( in no particular order ), below :

              http://www.rexegg.com/regex-modifiers.html#i

              http://www.pcre.org/current/doc/html/pcre2pattern.html#SEC9

              http://userguide.icu-project.org/strings/regexp

              http://unicode.org/faq/casemap_charprop.html

              and also :

              http://www.regular-expressions.info/modifiers.html

              http://perldoc.perl.org/perlre.html#Modifiers

              http://www.tutorialspoint.com/perl/perl_regular_expressions.htm

              http://stackoverflow.com/questions/3754097/what-is-the-best-way-to-match-only-letters-in-a-regex


              BTW, Claudia, the site http://www.rexegg.com seems a very, very interesting site, with valuable examples to study. See, for instance, these topics, below :

              http://www.rexegg.com/regex-uses.html

              http://www.rexegg.com/regex-style.html

              http://www.rexegg.com/regex-best-trick.html

              Best Regards,

              guy038

                • Michel Merlin

                PMJI, but I think throwing the Match case option into this problem makes its complexity explode; just see the length of the ~24 Jan post from “guy038” starting (after the greeting) with “Because of some tiring days, this week…” (sorry for this lengthy way of pointing to a post, but absolute dates and every civilized way have been removed).

                So I suggest just slightly changing the Ctrl+F “Find” box so that the “Match case” line is treated just like “Match whole word only”, i.e. gets GRAYED when “Regular expression” is selected. This would make the program much more reliable and powerful, since it would be more general and more compliant with what its labels are saying.

                Now I know from years of experience that the Notepad++ developers hate everything new or different and that this NIH syndrome of theirs will most probably get this suggestion thrown in the trash even before being read… yet I submit it anyway.

                Versailles, Fri 29 Jan 2016 21:10:00 +0100
