Can't figure out what is wrong when selecting text within parentheses.



  • Dear all,

    Some help needed.

    Text:
    The odds ratio is 1.4 (See Reference 12) and NTT is 5 (See Reference 13).

    I want to select the parenthesized references:

    (See Reference 12)
    (See Reference 13)

    I use the regex: \(See .+\)

    But the result is that everything from the first opening parenthesis to the last closing parenthesis is selected:

    (See Reference 12) and NTT is 5 (See Reference 13)

    instead of the two separate references.

    I have examined the syntax and cannot find out why.

    Thanks in advance.



  • @David-Chiu

    Your regex as specified is “greedy”. Use .+? instead of .+
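
    A minimal sketch of the greedy versus lazy difference, using Python's standard re module rather than the Notepad++ search dialog (an assumption; the two quantifiers behave the same way for this pattern). The sample text is the one from the question.

```python
import re

text = "The odds ratio is 1.4 (See Reference 12) and NTT is 5 (See Reference 13)."

# Greedy: .+ grabs as much as possible, then backs up to the LAST ")".
greedy = re.findall(r"\(See .+\)", text)

# Lazy: .+? grabs as little as possible, so it stops at the FIRST ")".
lazy = re.findall(r"\(See .+?\)", text)

print(greedy)  # ['(See Reference 12) and NTT is 5 (See Reference 13)']
print(lazy)    # ['(See Reference 12)', '(See Reference 13)']
```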



  • Oh, yes, it works.

    So .+? makes the match lazy; now I understand what lazy and greedy mean.

    But another issue now comes up.
    Text:
    In patients with recurrent cellulitis due to S. aureus, attempting decolonization is reasonable; this is discussed further separately. (See “Methicillin-resistant Staphylococcus aureus (MRSA) in adults: Prevention and control”, section on ‘Decolonization’ and “Methicillin-resistant Staphylococcus aureus in children: Prevention and control”.)

    I intended to select everything inside the outermost parentheses, but with the lazy syntax the selection is too short when there is another pair of parentheses inside.

    So when I use the lazy .+? , the selection ends unexpectedly at the first closing parenthesis, right after (MRSA):

    (See “Methicillin-resistant Staphylococcus aureus (MRSA)

    This would not happen with the greedy version on this text, but the greedy version selects too much in other situations.



  • @David-Chiu

    Okay, so not so much my strong suit, but for that kind of processing you need something called a “recursive regular expression”. You can google that and do some reading, but here’s a link that deals with nested parenthesis processing with a regular expression: http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to-match-nested-patterns

    From that I derived the following regex that seems to do what you need, as long as all the parentheses are balanced:
    (?=\(See)(\((?>[^()]+|(?1))*\))

    That’s my shot at it; if you need something more complicated, you should now have the tools (after you read and learn) to get where you need to go yourself.    :-D
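
    For readers who want to see the same idea outside a regex engine: Python's standard re module does not support the (?1) recursion used above, so the sketch below swaps in a plain depth counter instead. It is not the regex itself, only an illustration of what "balanced" means here. The function name find_see_notes and the shortened sample text are made up for this example.

```python
def find_see_notes(text):
    """Return every balanced "(See ...)" span, nested parentheses included."""
    results = []
    pos = 0
    while True:
        start = text.find("(See", pos)
        if start == -1:
            break
        depth = 0
        end = None
        for i in range(start, len(text)):
            if text[i] == "(":
                depth += 1
            elif text[i] == ")":
                depth -= 1
                if depth == 0:      # the opening "(See" is now closed
                    end = i + 1
                    break
        if end is None:
            pos = start + 1         # unbalanced: skip this opener
        else:
            results.append(text[start:end])
            pos = end
    return results


sample = ('this is discussed further separately. (See "Methicillin-resistant '
          'Staphylococcus aureus (MRSA) in adults: Prevention and control".) '
          'NTT is 5 (See Reference 13).')
notes = find_see_notes(sample)
for note in notes:
    print(note)
```

    The inner (MRSA) pair raises the depth to 2 and lowers it back to 1, so the scan only stops at the parenthesis that closes the outer "(See".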



  • Hello David,

    Scott is right about it. You need a recursive regex pattern.

    The simplest recursive regex that I’ve found is :

    SEARCH \(([^()\r\n]|(?0))*\)

    This regex matches the longest range of text, on a single line, that contains well-balanced parentheses and is enclosed by an outer pair of parentheses, which is included in the match

    Just test it against the text, below :

    (This)--()sen(tence)con(tains(a lot)of)paren(theses)and the ((regex))matches((the())longest)range((of))well((()))balanced(((((((parentheses), enclosed )inside two) final )parentheses
    

    Notes :

    • The regex first tries to match an opening round bracket \(

    • The part [^()\r\n] matches any single character other than a parenthesis or an EOL character

    • The part (?0) is a reference to the whole regex \(([^()\r\n]|(?0))*\), that is to say, a second, nested form (.....)

    • As this reference (?0) is located inside the group to which it refers ( i.e. the whole regex ), the regex automatically becomes a recursive regex

    • The two sub-regexes [^()\r\n] and (?0) are the two branches of an alternation, which can be repeated from 0 to n times ( the * quantifier )

    • Finally, the regex matches an ending round bracket \)
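
    The notes above can be mirrored step by step in a small recursive function. This is only a sketch in Python, whose standard re module has no (?0), so the recursion is written by hand; the names match_balanced and find_balanced are invented for this illustration.

```python
def match_balanced(text, pos=0):
    """Return the index just past a balanced (...) group starting at pos, else None."""
    if pos >= len(text) or text[pos] != "(":       # opening \(
        return None
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch == ")":                              # ending \)
            return i + 1
        if ch == "(":                              # plays the role of (?0)
            inner_end = match_balanced(text, i)
            if inner_end is None:
                return None
            i = inner_end
        elif ch in "\r\n":                         # excluded by [^()\r\n]
            return None
        else:                                      # any other character
            i += 1
    return None


def find_balanced(text):
    """Collect all non-overlapping balanced groups, scanning left to right."""
    spans = []
    i = 0
    while i < len(text):
        end = match_balanced(text, i) if text[i] == "(" else None
        if end is None:
            i += 1
        else:
            spans.append(text[i:end])
            i = end
    return spans


print(find_balanced("con(tains(a lot)of)paren(theses)"))
# ['(tains(a lot)of)', '(theses)']
```

    Like the single-line regex, this version gives up on a group that contains a line break; dropping the "\r\n" check gives the behaviour of a variant without \r\n in the character class.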

    Remark :

    If your parenthesized text may span several lines, prefer the recursive regex below :

    SEARCH \(([^()]|(?0))*\)

    Best regards

    guy038

    P.S. :

    If you consider, for instance, the regex ((\d+)[a-z])([aeiouy])(?2)\3 :

    • The first group contains the regex (\d+)[a-z]

    • The second group contains the regex \d+ ( an integer )

    • The third group contains the regex [aeiouy] ( a vowel )

    • The reference (?2) is located outside the group to which it refers, \d+, so in that case it is called a subroutine call ( instead of a recursive subpattern ). It re-runs the pattern of group 2, so we could have written \d+ in its place

    • Finally, the back-reference \3 refers to the text actually matched by group 3, [aeiouy], not to its pattern

    This regex matches expressions like :

    • 123ai4567i
    • 78zu12345u
    • 999ha999a

    but would fail to match :

    • 123ai4567e
    • 78zu12345y

    As I said above, the two regexes ((\d+)[a-z])([aeiouy])(?2)\3 and ((\d+)[a-z])([aeiouy])\d+\3 match exactly the same strings !
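
    A quick check of the examples above, sketched with Python's standard re module. That module has no (?2) subroutine call, so only the hand-expanded form with \d+ is tested here; that substitution is an assumption of this sketch, justified by the equivalence just stated.

```python
import re

# Hand-expanded form: (?2) replaced by the pattern of group 2, \d+
pattern = re.compile(r"((\d+)[a-z])([aeiouy])\d+\3")

samples = ["123ai4567i", "78zu12345u", "999ha999a", "123ai4567e", "78zu12345y"]
results = {s: bool(pattern.fullmatch(s)) for s in samples}

for s, ok in results.items():
    print(s, "matches" if ok else "does not match")
```

    The last two samples fail only because the final vowel differs from the one captured by group 3, which is exactly what the back-reference \3 enforces.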


    Beware of the main difference between the regexes (\d)(?1) ( = (\d)\d ) and (\d)\1 :

    • The regex (\d)(?1) would match any two-digit number from 00 to 99

    • The regex (\d)\1 would match any two-digit number whose two digits are the same

    Test these two regexes against the following list :

    10
    11
    13
    27
    34
    40
    44
    63
    66
    98
    99
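
    The same test can be sketched in Python's standard re module. Since that module lacks (?1), the subroutine call is written out as (\d)\d, the equivalent form given above; the variable names are only for illustration.

```python
import re

numbers = ["10", "11", "13", "27", "34", "40", "44", "63", "66", "98", "99"]

# (\d)\d : the second \d re-uses the PATTERN, so any digit is accepted
any_two_digits = [n for n in numbers if re.fullmatch(r"(\d)\d", n)]

# (\d)\1 : the back-reference re-uses the matched TEXT, so the digit must repeat
same_digit_twice = [n for n in numbers if re.fullmatch(r"(\d)\1", n)]

print(any_two_digits)    # all eleven numbers
print(same_digit_twice)  # ['11', '44', '66', '99']
```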


  • Dear Scott and Guy

    Thank you for your help and detailed explanation.
    Scott's regex works for me, as I need "See " at the beginning of the parentheses.
    I tried to modify Guy’s regex to work for me but could not work it out.

    Thanks, I will continue to study it.

