Unexpected regex behaviour



  • Hi,

    I have a regular expression which is this

    editor\.([a-z]+)(?=[\(|\r\n])
    

    From my understanding, it looks for a string which has

    • editor. (literally) followed by
    • one or more lowercase characters, followed by either
    • ( or carriage return / newline

    So I don’t expect that this line

    editor.changeInsertion(int length, const char *text)
    

    is found but this line should be found

    editor.changeinsertion(int length, const char *text)
    

    What happens is that both lines are found.
    If I check the Match case box, then it works: only the second line is found.

    Is this expected behaviour or did I misunderstand the regex meaning?
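
The observation above can be reproduced outside Notepad++. The following is a minimal sketch, assuming Python's `re` module as a stand-in for the Boost engine used by Notepad++ (the two engines are not identical), with `re.IGNORECASE` playing the role of an unchecked Match case box. The helper name `found_lines` is made up for this illustration.

```python
import re

# Claudia's pattern, copied verbatim. Inside a character class "|" is a
# literal pipe, so the lookahead accepts "(", "|", "\r" or "\n".
PATTERN = r"editor\.([a-z]+)(?=[\(|\r\n])"

LINES = [
    "editor.changeInsertion(int length, const char *text)",
    "editor.changeinsertion(int length, const char *text)",
]

def found_lines(flags=0):
    """Return the lines in which PATTERN matches under the given flags."""
    return [line for line in LINES if re.search(PATTERN, line, flags)]

# Case-sensitive search (comparable to Match case CHECKED).
case_sensitive = found_lines()

# Case-insensitive search (comparable to Match case UNCHECKED).
case_insensitive = found_lines(re.IGNORECASE)

print(case_sensitive)
print(case_insensitive)
```

With the case-sensitive search only the second line is found; with the case-insensitive search both lines are found, which is the behaviour described in the question.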

    Cheers
    Claudia



  • Hi, Claudia,

    Oh ! Interesting problem, indeed !

    The regex syntax [a-z] doesn’t mean that you, necessarily, search for one lowercase letter, only ! In the same way, the syntax [A-Z] doesn’t try to match an uppercase letter, exclusively ! Indeed, it all depends on the current state of the Match case option :

    • If the Match case is checked, the regex engine DOES care about the case of the letter matched ( Upper or Lower ). Then :

      • The regex [A-Z] searches for an uppercase letter, exclusively, between A and Z, included

      • The regex [a-z] searches for a lowercase letter, exclusively, between a and z, included

    • If the Match case is NOT checked, the regex engine does NOT care about the case of the letter matched. So :

      • The regex [A-Z] searches for an uppercase OR a lowercase letter ( equivalent to the regex [A-Za-z] )

      • The regex [a-z] searches for a lowercase OR an uppercase letter ( equivalent to the regex [A-Za-z] )

    Moreover, if you’re using the (?i) OR (?-i) modifier, before the square bracket range, you force the regex engine to behave, in an insensitive / sensitive way, independently of the current state of the Match case option !


    I sum up all the different cases, in a table, below :

    •------------•-----------------------------------------------------------------------------------------------------------------•  
    |   Option   |                               REGEX Syntax for matching 1 NON ACCENTUATED letter                                |
    |            •---------------•---------------•---------------•---------------•----------------•----------------•---------------•
    | Match case |     [A-Z]     |     [a-z]     |   (?i)[A-Z]   |   (?i)[a-z]   |   (?-i)[A-Z]   |   (?-i)[a-z]   |   [A-Za-z]    |
    •------------•===============•===============•===============•===============•================•================•===============•
    |     NO     |  Upper/Lower  |  Upper/Lower  |  Upper/Lower  |  Upper/Lower  |     Upper      |     Lower      |  Upper/Lower  |
    •------------•---------------•---------------•---------------•---------------•----------------•----------------•---------------•
    |    YES     |     Upper     |     Lower     |  Upper/Lower  |  Upper/Lower  |     Upper      |     Lower      |  Upper/Lower  |
    •------------•---------------•---------------•---------------•---------------•----------------•----------------•---------------•
    

    From that table, it’s easy to see that :

    • The use of the (?-i) modifier implies a case-sensitive search of letters, whether the Match case option is checked or NOT

    • The use of the (?i) modifier implies a case-insensitive search of letters, whether the Match case option is checked or NOT, as does the use of the regex [A-Za-z] !
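
The table can be rebuilt programmatically. This is only a sketch, assuming Python's `re` module rather than the Boost engine of Notepad++. Python has no unscoped `(?-i)`, so the scoped forms `(?i:...)` and `(?-i:...)` (Python 3.6+) are used as stand-ins, and `re.IGNORECASE` stands in for an unchecked Match case box. The helper names `matches` and `describe` are made up for this illustration.

```python
import re

def matches(pattern, letter, match_case):
    """True if `pattern` matches the single character `letter`.

    `match_case` plays the role of the Match case checkbox:
    False makes the whole search case-insensitive by default.
    """
    flags = 0 if match_case else re.IGNORECASE
    return re.fullmatch(pattern, letter, flags) is not None

# Scoped inline flags replace the (?i) / (?-i) prefixes of the table.
PATTERNS = ["[A-Z]", "[a-z]", "(?i:[A-Z])", "(?i:[a-z])",
            "(?-i:[A-Z])", "(?-i:[a-z])", "[A-Za-z]"]

LABELS = {(True, True): "Upper/Lower", (True, False): "Upper",
          (False, True): "Lower", (False, False): "None"}

def describe(pattern, match_case):
    """Say which case(s) of a sample letter the pattern accepts."""
    upper = matches(pattern, "Q", match_case)
    lower = matches(pattern, "q", match_case)
    return LABELS[(upper, lower)]

# One row per state of the Match case option, one column per pattern.
table = {match_case: [describe(p, match_case) for p in PATTERNS]
         for match_case in (False, True)}

for match_case, row in table.items():
    print("YES" if match_case else "NO ", row)
```

The two printed rows correspond to the NO and YES rows of the table, in the same column order.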

    Best Regards,

    guy038



  • Hi guy038,

    thank you for testing this.
    When you say “necessarily, search for one lowercase letter, only”, does this mean
    it is expected behaviour, even by regex definition? If so, why do I have to distinguish between A-Z and a-z?
    Don’t get me wrong, I’m absolutely fine with it if I have to use the Match case check box; I just
    want to understand whether, from a regex point of view, this is a misunderstanding on my side.

    Cheers
    Claudia
    Btw. I saw I got a folder on your notepad tab list - yeah ;-)



  • Hi Claudia,

    Just a simple test, because my reply to your previous post seems to be considered as spam ( !? ). I get the message :

    Error
    Post content was flagged as spam by Akismet.com

    guy038



  • Hello Claudia and All,

    Because of some tiring days at work this week and, also, because of a nice ski-day at Courchevel on Saturday ( quite tiring too, as it was the first outing this winter ! ), I have not posted anything yet. Each evening, I tried to sort out these case problems, little by little, with regexes of the form [Char1-Char2] or [Char1-Char2]+. However, it turns out that it’s even worse than I thought, at first sight :-((

    Globally speaking, we must distinguish TWO main cases :

    A) The Match case option, of the Search/Replace/Mark dialog, is checked

    Let’s consider the simple test string, with the 26 letters, in both cases, below :

    ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
    

    Then, for instance, the search of the regex [F-q]+ matches the string FGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopq

    Things are quite simple : The regex [Char1-Char2] matches any character whose code-point is >= Char1’s code-point AND <= Char2’s code-point !

    Note two points :

    • This case-sensitive way of searching is the default in most regex engines

    • The code-point of Char2 MUST be >= the code-point of Char1. Otherwise, when clicking on the Find Next button, you get, logically, the error message Find:Invalid regular expression.

    Strangely, with N++, when searching for the wrong regex [F-D], if you click on the Count button, on the Find All of the Mark tab, or on some other buttons, you just get the message Count: 0 matches or Mark: 0 matches, instead of the error message !?
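
The code-point rule of case A can be checked with a few lines of Python. This is a sketch under the assumption that Python's `re` module, in its default case-sensitive mode, is an acceptable stand-in for a search with Match case checked. It says nothing about the Notepad++ dialog messages themselves.

```python
import re
import string

# The 52-letter test string: A..Z followed by a..z.
TEST = string.ascii_uppercase + string.ascii_lowercase

# Case-sensitive search: a range is simply an interval of code points.
found = re.search(r"[F-q]+", TEST).group()
print(found)

# The same set spelled out with ord(): every character between "F" and "q".
by_code_point = "".join(c for c in TEST if ord("F") <= ord(c) <= ord("q"))

# A reversed range is rejected as soon as the pattern is compiled.
try:
    re.compile(r"[F-D]")
    reversed_range_ok = True
except re.error as exc:
    reversed_range_ok = False
    print("invalid regex:", exc)
```

Here `found` and `by_code_point` are the same string, FGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopq, and the reversed range [F-D] raises an error instead of silently matching nothing.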


    B) The Match case option, of the Search/Replace/Mark dialog, is UNCHECKED

    Things become more complicated ! Let’s consider our previous regex [F-q]+ and the same test string.

    Logically, as the search is performed in an insensitive way, any letter of the previously found range FGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopq is matched, in either lowercase or uppercase. And, as this range contains all the 26 letters of the alphabet, the whole test string should be matched. Indeed, that is the correct behaviour, which I verified on the site https://regex101.com , with the i modifier ( insensitive )

    Unfortunately, with Notepad++, the regex [F-q]+ matches, successively, the two strings FGHIJKLMNOPQ, then fghijklmnopq, only :-( The regex engine wrongly considers only the letters that are in the range [F…q] in both lower AND upper case ! I would consider this behaviour as a bug !

    Moreover, the two regexes, [F-q]+ and [F-Q]+, wrongly give the same result with the N++ regex engine, contrary to what the Regex101 site gives !
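
For comparison, here is how another engine treats the two ranges in insensitive mode. This is a sketch assuming a recent Python 3 and its `re` module with `re.IGNORECASE`. It shows one engine's behaviour only and is not a statement about Boost.

```python
import re
import string

# The 52-letter test string: A..Z followed by a..z.
TEST = string.ascii_uppercase + string.ascii_lowercase

# Insensitive search with the wide range F..q (crosses from upper to lower case).
wide = re.findall(r"[F-q]+", TEST, re.IGNORECASE)

# Insensitive search with the narrow range F..Q (uppercase letters only).
narrow = re.findall(r"[F-Q]+", TEST, re.IGNORECASE)

print(wide)
print(narrow)
```

With this engine, [F-q]+ matches the whole test string in one go, whereas [F-Q]+ gives the two separate matches FGHIJKLMNOPQ and fghijklmnopq. So the two regexes do give different results here.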


    I also found some other weird cases, when one limit of the range is not a letter ! Let’s consider the standard ASCII list string ( from character \x21 to character \x7e ), below :

    !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
    

    and suppose that you use the N++ Mark feature, with the two options Purge for each search and Wrap around checked

    Then, for instance, the two regexes [5-l] and [K-~] match, respectively, ANY character in :

    56789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijkl
    
    KLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
    

    when the Match case option is CHECKED. Luckily, it’s the expected behaviour !!

    And these same regexes match, respectively, ANY character in :

    56789:;<=>?@ABCDEFGHIJKL[\]^_`abcdefghijkl
    
    KLMNOPQRSTUVWXYZklmnopqrstuvwxyz{|}~
    

    when the Match case option is UNCHECKED. Why, among other things, is the six-character block, below :

    [\]^_`
    

    taken into account with the first regex and NOT with the second one !!!???


    So, I’m asking people to test these two regexes, [5-l] and [K-~], in an INSENSITIVE way, against the test string above ( range from \x21 to \x7e ), with regex tools other than the Boost regex engine, in order to know which single characters are really matched.

    In my opinion, to be logical, the correct behaviour is :

    • The regex [5-l] should match ANY character in :

      56789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz

    • The regex [K-~] should match ANY character in :

      ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
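
As one data point for the test requested above, the two ranges can be tried with Python. This is a sketch under two assumptions: the "logical" rule is read as "a character matches if any of its case variants lies inside the code-point range", and Python's `re` module (recent Python 3) is used as the other engine. The helper names `expected_insensitive` and `engine_insensitive` are made up for this illustration.

```python
import re
import string

# Printable ASCII from \x21 to \x7e, backslash included.
ASCII = "".join(chr(cp) for cp in range(0x21, 0x7F))

def expected_insensitive(lo, hi, text=ASCII):
    """Characters of `text` having some case variant inside [lo-hi]."""
    def in_range(c):
        return ord(lo) <= ord(c) <= ord(hi)
    return "".join(c for c in text
                   if in_range(c) or in_range(c.lower()) or in_range(c.upper()))

def engine_insensitive(pattern, text=ASCII):
    """Characters of `text` matched one by one by `pattern`, ignoring case."""
    return "".join(re.findall(pattern, text, re.IGNORECASE))

UP, LOW = string.ascii_uppercase, string.ascii_lowercase

rule_5_l = expected_insensitive("5", "l")
rule_K_tilde = expected_insensitive("K", "~")
engine_5_l = engine_insensitive(r"[5-l]")
engine_K_tilde = engine_insensitive(r"[K-~]")

print(engine_5_l)
print(engine_K_tilde)
```

With this engine, both regexes give exactly the two "logical" strings listed above, and the rule-based helper agrees with the engine.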


    Therefore, and to sum up, when using the Regular expression search mode, I advise you to :

    • Always click, once, on the Find Next button to ensure that your regex is a valid one

    • Preferably, apart from simple searches, such as a single word, always check the Match case option, to get logical results from the regex engine


    Claudia, I tried to find some information about the case insensitivity feature. Look at the different links ( in no particular order ), below :

    http://www.rexegg.com/regex-modifiers.html#i

    http://www.pcre.org/current/doc/html/pcre2pattern.html#SEC9

    http://userguide.icu-project.org/strings/regexp

    http://unicode.org/faq/casemap_charprop.html

    and also :

    http://www.regular-expressions.info/modifiers.html

    http://perldoc.perl.org/perlre.html#Modifiers

    http://www.tutorialspoint.com/perl/perl_regular_expressions.htm

    http://stackoverflow.com/questions/3754097/what-is-the-best-way-to-match-only-letters-in-a-regex


    BTW, Claudia, the site http://www.rexegg.com seems a very, very interesting site, with valuable examples to study. See, for instance, these topics, below :

    http://www.rexegg.com/regex-uses.html

    http://www.rexegg.com/regex-style.html

    http://www.rexegg.com/regex-best-trick.html

    Best Regards,

    guy038



  • PMJI, but I think throwing the Match case option into this problem is exploding its complexity; just see the length of the ~24 Jan post from “guy038” starting (after the greeting) with “Because of some tiring days, this week…” (sorry for this lengthy way of pointing to a post, but absolute dates and every civilized way of doing so have been removed).

    So I suggest to just slightly change the Ctrl+F “Find” box so that the “Match case” line is treated just like the “Match whole word only”, i.e. gets GRAYED when “Regular expression” is selected. This would make the problem much more reliable and powerful since making it more general and more compliant with what its labels are saying.

    Now I know by years of experience that the Notepad++ developers hate everything new or different and that this NIH syndrome of theirs will most probably get this suggestion thrown to trash even before being read… yet I submit it anyway.

    Versailles, Fri 29 Jan 2016 21:10:00 +0100

