Unexpected regex behaviour

Reply to Unexpected regex behaviour on Fri, 29 Jan 2016 20:10:00 GMT

Michel Merlin — Fri, 29 Jan 2016 20:10:00 GMT

PMJI, but I think throwing the Match case option into this problem makes its complexity explode; just look at the length of the ~24 Jan post from “guy038” starting (after the greeting) with “Because of some tiring days, this week…” (sorry for this lengthy way of pointing to a post, but absolute dates and every civilized way of referencing have been removed).

So I suggest slightly changing the Ctrl+F “Find” box so that the “Match case” line is treated just like “Match whole word only”, i.e. gets GRAYED out when “Regular expression” is selected. This would make the program much more reliable and powerful, since it would be more general and more consistent with what its labels say.

Now I know from years of experience that the Notepad++ developers hate everything new or different, and that this NIH syndrome of theirs will most probably get this suggestion thrown in the trash even before it is read… yet I submit it anyway.

Versailles, Fri 29 Jan 2016 21:10:00 +0100

Reply to Unexpected regex behaviour on Sun, 24 Jan 2016 20:53:11 GMT

guy038 — Sun, 24 Jan 2016 20:53:11 GMT

Hello Claudia and All,

Because of some tiring days, this week, at work, and also because of a nice ski day at Courchevel on Saturday ( quite tiring too, as it was the first outing this winter ! ), I have not posted anything yet. Each evening, I tried to sort out these case problems, little by little, with regexes of the form [Char1-Char2] or [Char1-Char2]+. However, it turns out that it’s even worse than I thought at first sight :-((

Globally speaking, we must distinguish TWO main cases :

A) The Match case option, of the Search/Replace/Mark dialog, is checked

Let’s consider the simple test string, with the 26 letters, in both cases, below :

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

Then, for instance, the search of the regex [F-q]+ matches the string FGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopq

Things are quite simple : The regex [Char1-Char2] matches any character whose code point is >= Char1’s code point AND <= Char2’s code point !

Note two points :

- This case-sensitive way of searching is the default in most regex engines
- The code point of Char2 MUST be >= the code point of Char1. Otherwise, when clicking on the Find Next button, you get, logically, the error message Find:Invalid regular expression.
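
The code-point rule and the reversed-range error can also be checked outside Notepad++. What follows is only a minimal sketch using Python's re module, which is not the Boost engine used by N++; the assumption is that the case-sensitive semantics are the same in both. The helper name in_range is made up for illustration.

```python
import re

TEST = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def in_range(ch, char1, char2):
    """Code-point test that a case-sensitive [Char1-Char2] is expected to perform."""
    return ord(char1) <= ord(ch) <= ord(char2)

# Case-sensitive search: the first run of characters between F and q.
by_regex = re.search(r"[F-q]+", TEST).group(0)
by_code_point = "".join(ch for ch in TEST if in_range(ch, "F", "q"))

# A reversed range such as [F-D] is rejected when the pattern is compiled.
try:
    re.compile(r"[F-D]")
    reversed_range_error = None
except re.error as exc:
    reversed_range_error = str(exc)

print(by_regex)
print(by_regex == by_code_point)
print(reversed_range_error)
```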

Strangely, with N++, while searching for the invalid regex [F-D], if you click on the Count button, on the Find All button of the Mark tab, or on some other buttons, you just get the message Count: 0 matches or Mark: 0 matches, instead of the error message !?

B) The Match case option, of the Search/Replace/Mark dialog, is UNCHECKED

Things become more complicated ! Let’s consider our previous regex [F-q]+ and the same test string.

Logically, as the search is now performed in a case-insensitive way, any letter of the previously found range FGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopq is matched in either lowercase or uppercase. And, as this range contains all 26 letters of the alphabet, the whole test string should be matched. Indeed, that is the correct behaviour, which I verified on the site https://regex101.com , with the i modifier ( insensitive )
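
The same check can be repeated with a second reference engine. This is a minimal sketch, assuming that Python's re module with the re.IGNORECASE flag is a fair stand-in for the i modifier of Regex101 (it is not the Boost engine used by N++):

```python
import re

TEST = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

# re.IGNORECASE plays the role of the i modifier (Match case unchecked).
sensitive = re.search(r"[F-q]+", TEST).group(0)
insensitive = re.search(r"[F-q]+", TEST, re.IGNORECASE).group(0)

print(len(sensitive), sensitive)
print(len(insensitive), insensitive)
print(insensitive == TEST)  # is the whole test string matched ?
```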

Unfortunately, with Notepad++, the regex [F-q]+ only matches, successively, the two strings FGHIJKLMNOPQ, then fghijklmnopq :-( The regex engine wrongly considers only the letters that are, in both lower AND upper case, in the range [F…q] ! I would consider this behaviour a bug !

Moreover, the two regexes [F-q]+ and [F-Q]+ wrongly give the same result with the N++ regex engine, contrary to what the Regex101 site gives !

I also found some other weird cases, when one limit of the range is not a letter ! Let’s consider the standard ASCII list string ( from character \x21 to character \x7e ), below :

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

and suppose that you use the N++ Mark feature, with the two options Purge for each search and Wrap around checked

Then, for instance, the two regexes [5-l] and [K-~] match, respectively, ANY character in :

56789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijkl

KLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

when the Match case option is CHECKED. Luckily, it’s the expected behaviour !!

And these same regexes match, respectively, ANY character in :

56789:;<=>?@ABCDEFGHIJKL[\]^_`abcdefghijkl

KLMNOPQRSTUVWXYZklmnopqrstuvwxyz{|}~

when the Match case option is UNCHECKED. Why, among other things, is the six-character block, below :

[\]^_`

taken into account with the first regex and NOT with the second one !!!???
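
One possible reading of these results, offered as a guess inferred only from the outputs listed in this post and not as a description of the Boost source code : if the engine lowercases both limits of the range, and then compares the lowercased candidate character against that folded range, all the insensitive results reported above are reproduced. A minimal Python sketch of that hypothetical model ( the function name fold_range_match is invented for illustration ) :

```python
ASCII = "".join(chr(code) for code in range(0x21, 0x7F))
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def fold_range_match(char1, char2, text):
    """Hypothetical model: lowercase the range limits AND the candidate character."""
    low, high = char1.lower(), char2.lower()
    return "".join(ch for ch in text if low <= ch.lower() <= high)

print(fold_range_match("5", "l", ASCII))
print(fold_range_match("K", "~", ASCII))
print(fold_range_match("F", "q", ALPHABET))
print(fold_range_match("F", "Q", ALPHABET))
```

Under this assumption, [K-~] is handled as [k-~], which begins after the block [\]^_`, whereas [5-l] still spans that block, and both [F-q] and [F-Q] are handled as [f-q].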

So, I’m asking people who can test these two regexes, [5-l] and [K-~], in an INSENSITIVE way, against the test string above ( range from \x21 to \x7e ), with regex tools other than the Boost regex engine, to report which individual characters are really matched.

As for me, to be logical, the correct behaviour would be :

The regex [5-l] should match ANY character in :

56789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz

The regex [K-~] should match ANY character in :

ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
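
For anyone wanting to run the two regexes [5-l] and [K-~] against another engine, here is a sketch of how the test could be done with Python's re module. It only gives one more data point from a different engine and says nothing about Boost itself; the helper name matched_chars is made up for illustration.

```python
import re

# Standard ASCII list string, from character \x21 to character \x7e.
ASCII = "".join(chr(code) for code in range(0x21, 0x7F))

def matched_chars(pattern, text, flags=0):
    """Return the characters of text that the one-character regex matches."""
    regex = re.compile(pattern, flags)
    return "".join(ch for ch in text if regex.fullmatch(ch))

for pattern in (r"[5-l]", r"[K-~]"):
    print(pattern, "sensitive   :", matched_chars(pattern, ASCII))
    print(pattern, "insensitive :", matched_chars(pattern, ASCII, re.IGNORECASE))
```

Each character is tested on its own with fullmatch, so the output lists exactly the one-character matches, independently of how runs of characters would be grouped by a + quantifier.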

Therefore, and to sum up, when using the Regular expression search mode, I advise you to :

- Always click, once, on the Find Next button to ensure that your regex is a valid one
- Preferably, apart from simple searches, such as a single word, always check the Match case option, to get logical results from the regex engine

Claudia, I tried to find some information about the case-insensitivity feature. Look at the different links ( in no particular order ), below :

http://www.rexegg.com/regex-modifiers.html#i

http://www.pcre.org/current/doc/html/pcre2pattern.html#SEC9

http://userguide.icu-project.org/strings/regexp

http://unicode.org/faq/casemap_charprop.html

and also :

http://www.regular-expressions.info/modifiers.html

http://perldoc.perl.org/perlre.html#Modifiers

http://www.tutorialspoint.com/perl/perl_regular_expressions.htm

http://stackoverflow.com/questions/3754097/what-is-the-best-way-to-match-only-letters-in-a-regex

BTW, Claudia, the site http://www.rexegg.com seems a very, very interesting site, with valuable examples to study. See, for instance, these topics, below :

http://www.rexegg.com/regex-uses.html

http://www.rexegg.com/regex-style.html

http://www.rexegg.com/regex-best-trick.html

Best Regards,

guy038

Reply to Unexpected regex behaviour on Sun, 24 Jan 2016 20:33:02 GMT

guy038 — Sun, 24 Jan 2016 20:33:02 GMT

Hi Claudia,

Just a simple test, because my reply to your previous post seems to be considered as spam ( !? ). I get the message :

Error
Post content was flagged as spam by Akismet.com

guy038

Reply to Unexpected regex behaviour on Mon, 18 Jan 2016 13:30:15 GMT

Claudia Frank — Mon, 18 Jan 2016 13:30:15 GMT

Hi guy038,

thank you for testing this.
When you say “necessarily, search for one lowercase letter, only”, does this mean
it is expected behaviour, even by regex definition? If so, why do I have to distinguish between A-Z and a-z?
Don’t get me wrong, I’m absolutely fine with it if I have to use the Match case check box; I just
want to understand whether, from a regex point of view, this is a misunderstanding on my side.

Cheers
Claudia
Btw. I saw I got a folder on your notepad tab list - yeah ;-)

Reply to Unexpected regex behaviour on Mon, 18 Jan 2016 02:56:16 GMT

guy038 — Mon, 18 Jan 2016 02:56:16 GMT

Hi, Claudia,

Oh ! Interesting problem, indeed !

The regex syntax [a-z] doesn’t mean that you, necessarily, search for one lowercase letter, only ! In the same way, the syntax [A-Z] doesn’t try to match an uppercase letter, exclusively ! Indeed, it all depends on the current state of the Match case option :

If the Match case option is checked, the regex engine DOES care about the case of the letter matched ( Upper or Lower ). Then :
- The regex [A-Z] searches for an uppercase letter, exclusively, between A and Z, inclusive
- The regex [a-z] searches for a lowercase letter, exclusively, between a and z, inclusive
If the Match case option is NOT checked, the regex engine does NOT care about the case of the letter matched. So :
- The regex [A-Z] searches for an uppercase OR a lowercase letter ( equivalent to the regex [A-Za-z] )
- The regex [a-z] searches for a lowercase OR an uppercase letter ( equivalent to the regex [A-Za-z] )

Moreover, if you’re using the (?i) OR (?-i) modifier, before the square bracket range, you force the regex engine to behave in a case-insensitive / case-sensitive way, independently of the current state of the Match case option !

I sum up all the different cases, in a table, below :

•------------•-----------------------------------------------------------------------------------------------------------------•  
|   Option   |                               REGEX Syntax for matching 1 NON ACCENTUATED letter                                |
|            •---------------•---------------•---------------•---------------•----------------•----------------•---------------•
| Match case |     [A-Z]     |     [a-z]     |   (?i)[A-Z]   |   (?i)[a-z]   |   (?-i)[A-Z]   |   (?-i)[a-z]   |   [A-Za-z]    |
•------------•===============•===============•===============•===============•================•================•===============•
|     NO     |  Upper/Lower  |  Upper/Lower  |  Upper/Lower  |  Upper/Lower  |     Upper      |     Lower      |  Upper/Lower  |
•------------•---------------•---------------•---------------•---------------•----------------•----------------•---------------•
|    YES     |     Upper     |     Lower     |  Upper/Lower  |  Upper/Lower  |     Upper      |     Lower      |  Upper/Lower  |
•------------•---------------•---------------•---------------•---------------•----------------•----------------•---------------•

From that table, it’s easy to see that :

- The use of the (?-i) modifier implies a case-sensitive search of letters, whether the Match case option is checked or NOT
- The use of the (?i) modifier implies a case-insensitive search of letters, whether the Match case option is checked or NOT, as does the use of the regex [A-Za-z] !
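
The table can be cross-checked with a short script. This is only a sketch in Python's re module, with two assumptions to keep in mind : Python is not the Boost engine used by Notepad++, and it has no bare (?-i) switch, so the scoped forms (?i:…) and (?-i:…) ( Python 3.6 or later ) stand in for the two modifiers. The flag re.IGNORECASE plays the role of an unchecked Match case option, and the function name classify is made up for illustration.

```python
import re

PATTERNS = ["[A-Z]", "[a-z]", "(?i:[A-Z])", "(?i:[a-z])",
            "(?-i:[A-Z])", "(?-i:[a-z])", "[A-Za-z]"]

def classify(pattern, match_case):
    """Report which case(s) of the letter M the pattern accepts."""
    flags = 0 if match_case else re.IGNORECASE
    upper = re.fullmatch(pattern, "M", flags) is not None
    lower = re.fullmatch(pattern, "m", flags) is not None
    if upper and lower:
        return "Upper/Lower"
    if upper:
        return "Upper"
    if lower:
        return "Lower"
    return "None"

# One row per state of the Match case option, in the same order as the table.
for match_case in (False, True):
    row = [classify(pattern, match_case) for pattern in PATTERNS]
    print("YES" if match_case else "NO ", row)
```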

Best Regards,

guy038