Can't figure out what is wrong when selecting text within parentheses

David Chiu

Dear all,

Some help needed.

Text:
The odds ratio is 1.4 (See Reference 12) and NTT is 5 (See Reference 13).

I want to select only the parenthesized references, i.e. (See Reference 12) and (See Reference 13).

I use the regex: \(See .+\)

But the result is that everything from the first opening parenthesis to the last closing parenthesis is selected as one match:

(See Reference 12) and NTT is 5 (See Reference 13)

instead of two separate matches:

(See Reference 12)
(See Reference 13)

I have examined the syntax and cannot find out why.

Thanks in advance.

Scott Sumner

@David-Chiu

Your regex as specified is “greedy”. Use .+? instead of .+
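
The difference can also be checked outside the editor. The following is a minimal sketch using Python's standard `re` module (an assumption; the thread itself uses the Notepad++ search dialog), applied to David's sample sentence. The variable names are illustrative only.

```python
import re

text = "The odds ratio is 1.4 (See Reference 12) and NTT is 5 (See Reference 13)."

# Greedy: .+ takes as much as possible, so the match runs to the LAST ")".
greedy = re.findall(r"\(See .+\)", text)

# Lazy: .+? takes as little as possible, so each match stops at the FIRST ")".
lazy = re.findall(r"\(See .+?\)", text)

print(greedy)
print(lazy)
```

The greedy pattern returns one long match, while the lazy pattern returns the two references separately.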

David Chiu

Oh, yes, it works.

So .+? makes the match lazy; now I understand what lazy and greedy mean.

But another issue now comes up.
Text:
In patients with recurrent cellulitis due to S. aureus, attempting decolonization is reasonable; this is discussed further separately. (See “Methicillin-resistant Staphylococcus aureus (MRSA) in adults: Prevention and control”, section on ‘Decolonization’ and “Methicillin-resistant Staphylococcus aureus in children: Prevention and control”.)

I intended to select everything in the outermost pair of parentheses, but when I use the lazy syntax, it selects less when there is another pair of parentheses inside.

So when I use the lazy form, \(See .+?\), the selection ends unexpectedly at the first closing parenthesis:

(See “Methicillin-resistant Staphylococcus aureus (MRSA)

This would not happen if I used the greedy one on this text, but that selects too much in other situations.

Scott Sumner

@David-Chiu

Okay, so not so much my strong suit, but for that kind of processing you need something called a “recursive regular expression”. You can google that and do some reading, but here’s a link that deals with nested parenthesis processing with a regular expression: http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to-match-nested-patterns

From that I derived the following regex that seems to do what you need, as long as all the parentheses are balanced:
(?=\(See)(\((?>[^()]+|(?1))*\))

That’s my shot at it; if you need something more complicated, you should now have the tools (after you read and learn) to get where you need to go yourself. :-D
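
For readers who want to see what "balanced" means outside a regex engine: Python's built-in `re` module does not support recursive calls such as `(?1)`, so the sketch below swaps in a different technique, a plain depth counter that tracks nesting by hand. It is not the regex above, only an illustration of the same idea; the function name `find_see_groups` and the shortened sample text are hypothetical.

```python
def find_see_groups(text, opener="(See"):
    """Return each balanced parenthesized span that starts with `opener`."""
    results = []
    start = text.find(opener)
    while start != -1:
        depth = 0
        end = None
        for i in range(start, len(text)):
            if text[i] == "(":
                depth += 1          # entering a nested group
            elif text[i] == ")":
                depth -= 1          # leaving a group
                if depth == 0:      # the opener's own parenthesis just closed
                    end = i + 1
                    break
        if end is None:
            # Unbalanced from this opener; try the next occurrence instead.
            start = text.find(opener, start + 1)
        else:
            results.append(text[start:end])
            start = text.find(opener, end)
    return results


# Hypothetical, shortened sample in the spirit of the texts quoted above.
sample = ('Discussed separately. (See "Staphylococcus aureus (MRSA) in adults: '
          'Prevention and control".) Odds ratio is 1.4 (See Reference 12).')
groups = find_see_groups(sample)
print(groups)
```

Like the regex, this only gives sensible results when the parentheses are balanced.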

guy038

Hello David,

Scott is right about it. You need a recursive regex pattern.

The simplest recursive regex that I’ve found is :

SEARCH \(([^()\r\n]|(?0))*\)

This regex matches the longest range of text, on a single line, that contains well-balanced parentheses and is enclosed in an outer pair of parentheses, which are included in the match

Just test it against the text below :

(This)--()sen(tence)con(tains(a lot)of)paren(theses)and the ((regex))matches((the())longest)range((of))well((()))balanced(((((((parentheses), enclosed )inside two) final )parentheses

Notes :

The regex first tries to match an opening round bracket \(
The part [^()\r\n] matches any single character that is neither a parenthesis nor an EOL character
The part (?0) is a reference to the whole regex \(([^()\r\n]|(?0))*\), that is to say, a nested group of the form (.....)
As this reference (?0) is located inside the pattern to which it refers ( i.e. the whole regex ), the regex automatically becomes a recursive regex
The two sub-regexes [^()\r\n] and (?0) are the two branches of an alternation, which can be repeated from 0 to n times, thanks to the quantifier *
Finally, the regex matches a closing round bracket \)
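
The notes above can be mirrored step by step in ordinary code. The following is a rough sketch in Python (chosen only for illustration; its built-in `re` module has no `(?0)`), where a function calls itself at the point where the regex would recurse. The names `match_balanced` and `find_balanced` are hypothetical, and the sample is a prefix of the test sentence above.

```python
def match_balanced(text, pos):
    """Try to match one balanced (...) group starting at text[pos].

    Returns the index just past the closing parenthesis, or None.
    """
    if pos >= len(text) or text[pos] != "(":    # \(
        return None
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch == ")":                            # \)
            return i + 1
        if ch == "(":                            # (?0): recurse into a nested group
            inner_end = match_balanced(text, i)
            if inner_end is None:
                return None
            i = inner_end
        elif ch in "\r\n":                       # single-line variant excludes EOL
            return None
        else:                                    # [^()\r\n]
            i += 1
    return None  # text ended before the group was closed


def find_balanced(text):
    """Collect every outermost balanced group, scanning left to right."""
    found = []
    pos = 0
    while pos < len(text):
        end = match_balanced(text, pos) if text[pos] == "(" else None
        if end is None:
            pos += 1
        else:
            found.append(text[pos:end])
            pos = end
    return found


sample = "(This)--()sen(tence)con(tains(a lot)of)paren(theses)"
balanced = find_balanced(sample)
print(balanced)
```

An opening parenthesis that is never closed is skipped, and the scan resumes one character later, which is also how the regex search behaves.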

Remark :

If your text and parentheses may span several lines, prefer the recursive regex below :

SEARCH \(([^()]|(?0))*\)

Best regards

guy038

P.S. :

If you consider, for instance, the regex ((\d+)[a-z])([aeiouy])(?2)\3 :

The first group contains the regex (\d+)[a-z]
The second group contains the regex \d+ ( an integer )
The third group contains the regex [aeiouy] ( a vowel )
The reference (?2), located outside the group to which it refers ( group 2, \d+ ), is called, in that case, a subroutine call ( instead of a recursive subpattern ). We could have replaced (?2) by the pattern of group 2, i.e. \d+
Finally, the back-reference \3 refers to the text actually matched by group 3, whose regex is [aeiouy]

This regex matches expressions like :

123ai4567i
78zu12345u
999ha999a

but would fail to match :

123ai4567e
78zu12345y

As I said above, the two regexes ((\d+)[a-z])([aeiouy])(?2)\3 and ((\d+)[a-z])([aeiouy])\d+\3 are strictly equivalent !
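
The examples listed above can be checked with a short script. This is a sketch in Python, whose built-in `re` module has no subroutine-call syntax, so it uses the second, written-out form of the regex with \d+ in place of (?2). The variable names are illustrative only.

```python
import re

# (?2) is written out as \d+ ; \3 is a back-reference to the vowel in group 3.
pattern = re.compile(r"((\d+)[a-z])([aeiouy])\d+\3")

should_match = ["123ai4567i", "78zu12345u", "999ha999a"]
should_fail = ["123ai4567e", "78zu12345y"]

results = {s: bool(pattern.search(s)) for s in should_match + should_fail}
for s, ok in results.items():
    print(s, "matches" if ok else "no match")
```

The failing strings fail only because the final vowel differs from the one captured by group 3.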

Beware of the main difference between the regexes (\d)(?1) ( = (\d)\d ) and (\d)\1 :

The regex (\d)(?1) matches any two-digit number from 00 to 99
The regex (\d)\1 matches only a two-digit number in which the same digit occurs twice ( 00, 11, ... 99 )
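
A small sketch of that difference, again in Python and again with (?1) written out as \d because the built-in `re` module lacks subroutine calls. The sample values are hypothetical and are not the list referred to in this post.

```python
import re

any_two_digits = re.compile(r"(\d)\d")     # same pattern reused: any second digit
same_digit_twice = re.compile(r"(\d)\1")   # same TEXT reused: the captured digit

samples = ["00", "12", "77", "90"]  # hypothetical test values

any_hits = [s for s in samples if any_two_digits.fullmatch(s)]
same_hits = [s for s in samples if same_digit_twice.fullmatch(s)]

print(any_hits)
print(same_hits)
```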

Test these two regexes against the following list :

David Chiu

Dear Scott and Guy

Thank you for your help and detailed explanation.
Scott's one works for me, as I need "See " at the beginning of the parentheses.
I tried to modify Guy’s one to work for me but could not work it out.

Thanks, I will continue to study it.