Can't figure out what is wrong when selecting text within parentheses
-
Dear all,
Some help needed.
Text:
The odds ratio is 1.4 (See Reference 12) and NTT is 5 (See Reference 13).
I want to select only the parenthesized references, i.e. (See Reference 12) and (See Reference 13).
I use the RegExp:
\(See .+\)
But the result is that everything from the first opening parenthesis to the last closing parenthesis is selected:
(See Reference 12) and NTT is 5 (See Reference 13)
instead of the two separate selections (See Reference 12) and (See Reference 13).
I have examined the syntax and cannot find out why.
Thanks in advance.
-
Your regex as specified is “greedy”: .+ grabs as much as it can, so the match runs to the last closing parenthesis on the line. Use
.+?
instead of
.+
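To see the difference in isolation, here is a minimal sketch in Python, assuming its built-in re module (a different engine from the one used in this thread, but with the same greedy and lazy quantifiers). The variable names are only illustrative.

```python
import re

text = "The odds ratio is 1.4 (See Reference 12) and NTT is 5 (See Reference 13)."

# Greedy: .+ runs to the last ")" on the line, so there is one long match.
greedy = re.findall(r"\(See .+\)", text)

# Lazy: .+? stops at the first ")" that lets the pattern succeed,
# so each reference is matched separately.
lazy = re.findall(r"\(See .+?\)", text)

print(greedy)  # ['(See Reference 12) and NTT is 5 (See Reference 13)']
print(lazy)    # ['(See Reference 12)', '(See Reference 13)']
```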
-
Oh, yes, it works.
So
.+?
makes the match lazy. Now I understand what lazy and greedy mean.
But another issue now comes up.
Text:
In patients with recurrent cellulitis due to S. aureus, attempting decolonization is reasonable; this is discussed further separately. (See “Methicillin-resistant Staphylococcus aureus (MRSA) in adults: Prevention and control”, section on ‘Decolonization’ and “Methicillin-resistant Staphylococcus aureus in children: Prevention and control”.)
I intend to select everything inside the outermost parentheses, but when I use the lazy syntax, it selects less when there is another pair of parentheses inside.
So when I use the code:
.+?
the selection ends unexpectedly at the first closing parenthesis, just after (MRSA):
(See “Methicillin-resistant Staphylococcus aureus (MRSA)
This would not happen if I used the greedy one on this text, but the greedy one selects too much in other situations.
-
Okay, so not so much my strong suit, but for that kind of processing you need something called a “recursive regular expression”. You can google that and do some reading, but here’s a link that deals with nested parenthesis processing with a regular expression: http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to-match-nested-patterns
From that I derived the following regex that seems to do what you need, as long as all the parentheses are balanced:
(?=\(See)(\((?>[^()]+|(?1))*\))
That’s my shot at it; if you need something more complicated, you should now have the tools (after you read and learn) to get where you need to go yourself. :-D
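If the regex engine at hand has no recursion support (Python's built-in re module, for example, does not), the same idea can be approximated without a regex by counting nesting depth. The following is only a sketch of that swapped-in technique, not the recursive regex itself; the function name find_see_groups and its opener parameter are hypothetical, and it assumes the parentheses are balanced.

```python
def find_see_groups(text, opener="(See"):
    """Return each balanced parenthesized span that starts with `opener`."""
    spans = []
    start = text.find(opener)
    while start != -1:
        depth = 0
        end = -1
        # Walk forward from the opener, tracking the nesting depth.
        for i in range(start, len(text)):
            if text[i] == "(":
                depth += 1
            elif text[i] == ")":
                depth -= 1
                if depth == 0:
                    end = i + 1  # position just past the matching ")"
                    break
        if end == -1:
            # No matching ")": skip this opener and try the next one.
            start = text.find(opener, start + 1)
        else:
            spans.append(text[start:end])
            start = text.find(opener, end)
    return spans


simple = "The odds ratio is 1.4 (See Reference 12) and NTT is 5 (See Reference 13)."
nested = 'Discussed separately. (See "S. aureus (MRSA) in adults" and "children".) End.'

print(find_see_groups(simple))
print(find_see_groups(nested))
```

On the first string this returns the two references separately; on the second it returns the whole outer group, inner (MRSA) included.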
-
Hello David,
Scott is right about it. You need a recursive regex pattern.
The simplest recursive regex that I’ve found is:
SEARCH
\(([^()\r\n]|(?0))*\)
This regex matches the longest range of text, on a single line, that contains well-balanced parentheses and is enclosed by an outer pair of parentheses, which is included in the match.
Just test it against the text below:
(This)--()sen(tence)con(tains(a lot)of)paren(theses)and the ((regex))matches((the())longest)range((of))well((()))balanced(((((((parentheses), enclosed )inside two) final )parentheses
Notes:
- The regex first tries to match an opening round bracket
\(
- The part
[^()\r\n]
matches any single character other than a parenthesis or an EOL character
- The part
(?0)
is a reference to the whole regex
\(([^()\r\n]|(?0))*\)
that is to say, a second, nested (.....) form
- As this reference
(?0)
is located inside the group to which it refers (i.e. the whole regex), the regex automatically becomes a recursive regex
- The two sub-regexes
[^()\r\n]
and
(?0)
are the two branches of an alternation, which can be repeated from 0 to n times because of the quantifier
*
- Finally, the regex matches a closing round bracket
\)
Remark :
If your text and its parentheses may span several lines, prefer the recursive regex below:
SEARCH
\(([^()]|(?0))*\)
Best regards
guy038
P.S.:
If you consider, for instance, the regex
((\d+)[a-z])([aeiouy])(?2)\3
then:
- The first group contains the regex
(\d+)[a-z]
- The second group contains the regex
\d+
(an integer)
- The third group contains the regex
[aeiouy]
(a vowel)
- The reference
(?2)
is located outside the group to which it refers,
\d+
so it is called, in that case, a subroutine call (instead of a recursive subpattern), and we could have replaced
(?2)
by the pattern of group 2, i.e.
\d+
- Finally, the back-reference
\3
refers to the text actually matched by group 3,
[aeiouy]
This regex matches expressions like :
- 123ai4567i
- 78zu12345u
- 999ha999a
but would fail to match :
- 123ai4567e
- 78zu12345y
As I said above, the two regexes
((\d+)[a-z])([aeiouy])(?2)\3
and
((\d+)[a-z])([aeiouy])\d+\3
are strictly equivalent!
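The expanded form can be checked quickly in Python. This is only a sketch, assuming Python's built-in re module, which does not accept the subroutine call (?2); that is why the second, written-out regex is the one used here. The list names are illustrative.

```python
import re

# Expanded form of ((\d+)[a-z])([aeiouy])(?2)\3, with (?2) written out as \d+.
pattern = re.compile(r"((\d+)[a-z])([aeiouy])\d+\3")

should_match = ["123ai4567i", "78zu12345u", "999ha999a"]
should_fail = ["123ai4567e", "78zu12345y"]

for s in should_match + should_fail:
    # fullmatch requires the whole string to fit the pattern.
    print(s, bool(pattern.fullmatch(s)))
```

The first three strings report True because the final vowel repeats the one captured by group 3; the last two report False because it differs.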
Beware of the main difference between the regexes
(\d)(?1)
(which is equivalent to (\d)\d) and
(\d)\1
- The regex
(\d)(?1)
matches any two-digit integer from 00 to 99
- The regex
(\d)\1
matches any two-digit integer made of the same digit twice
Test these two regexes against the following list :
10 11 13 27 34 40 44 63 66 98 99
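For readers who want to run that test outside a search dialog, here is a small Python sketch. It assumes Python's built-in re module, which has back-references but no (?1) subroutine call, so the first regex is written in its expanded form (\d)\d. The variable names are illustrative.

```python
import re

numbers = "10 11 13 27 34 40 44 63 66 98 99".split()

any_two_digits = re.compile(r"(\d)\d")    # what (\d)(?1) expands to
same_digit_twice = re.compile(r"(\d)\1")  # back-reference: same digit again

matched_any = [n for n in numbers if any_two_digits.fullmatch(n)]
matched_same = [n for n in numbers if same_digit_twice.fullmatch(n)]

print(matched_any)   # every number in the list
print(matched_same)  # ['11', '44', '66', '99']
```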
-
Dear Scott and Guy
Thank you for your help and detailed explanation.
Scott's one works for me, as I need "See " at the beginning of the parenthesis.
I tried to modify Guy’s one to work for me but could not work it out.
Thanks, I will continue to study it.