Generic regex: Selecting a complete hierarchy between an opening HTML/XML tag and its corresponding closing tag
guy038 last edited by guy038
To select an entire block of text between an opening tag and its CORRESPONDING closing tag, we have to use recursive regex syntax. Indeed, the text between the tags may itself contain one or several of the same opening and closing tags!
For instance, given this text :
<div> bla bla <div> blah bla bla </div> bla bla <div> blah bla bla </div> blah blah </div>
The simple regex ^\h*<div>(?s:.*?)</div> does not work, as it pairs the first opening tag <div> with the first closing tag </div> that it meets, which is not its corresponding closing tag!
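To make the failure concrete, here is a small Python sketch run against the sample text above. It is only an illustration: Python's re module does not support \h, so [ \t]* is used as a stand-in, and the variable names are mine.

```python
import re

sample = ("<div> bla bla <div> blah bla bla </div> bla bla "
          "<div> blah bla bla </div> blah blah </div>")

# [ \t]* stands in for \h*, which Python's re module does not support.
lazy = re.compile(r"^[ \t]*<div>(?s:.*?)</div>", re.MULTILINE)

# The lazy quantifier stops at the FIRST </div>, so the match holds
# two opening tags but only one closing tag.
first_match = lazy.search(sample).group(0)
print(first_match)
print(first_match.count("<div>"), first_match.count("</div>"))
```

The match ends at the closing tag of the first inner block, so the outer <div> is left unbalanced.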
Of course, we could use the syntax ^\h*<div>(?s:((?!</?div>).)*?)</div>. However, this solution is not right either, as it would only select the inner blocks, ignoring the outer block.
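A short Python sketch of this second attempt, under the same assumption as before ([ \t]* replaces \h*, which Python's re lacks). The multi-line test text is hypothetical, with each tag on its own line so that the ^ anchor can reach the inner blocks.

```python
import re

nested = (
    "<div>\n"
    "  outer\n"
    "  <div>\n"
    "    inner 1\n"
    "  </div>\n"
    "  <div>\n"
    "    inner 2\n"
    "  </div>\n"
    "</div>\n"
)

# Same idea as the regex above: between the tags, refuse any character
# that begins an opening or closing div tag.
tempered = re.compile(r"^[ \t]*<div>(?s:((?!</?div>).)*?)</div>", re.MULTILINE)

blocks = [m.group(0) for m in tempered.finditer(nested)]
for block in blocks:
    print(repr(block))
```

Only the two inner blocks are reported. The attempt starting at the outer <div> fails as soon as it reaches the first nested <div>, so the outer block is never selected.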
So, the only valid solution is to use recursive regex syntax! The following generic regex looks a bit intimidating and may scare you at first sight!
The road map is simple:
Open the Find dialog (Ctrl + F)
Copy / paste the generic regex, above, into the search field
Replace all uppercase TAG names with a correct html/xml tag name
Select the Match case option (IMPORTANT, as a leading modifier (?-i) may make this generic regex fail, in some cases)
Launch the search
=> From the current cursor position, the regex should find the next complete sequence of paired tags. Voila!
You may even try with the a tag. Although the generic regex remains correct, the recursive syntax is useless in this case! Indeed, the <a href=......<a href=.......</a>....</a> syntax, for instance, is quite illegal and will never occur!
What is this regex searching for?
First, it searches for an opening TAG, with its exact case
Then, in a multi-line non-capturing group, it looks for either:
Any character that does not begin an opening or closing TAG sequence, with its exact case
A recursive call to the entire regex
This alternative is repeated a number of times, in a non-capturing group
And, finally, it searches for the closing TAG, with its exact case
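For readers who want to see these three steps spelled out, here is a minimal Python sketch. Python's built-in re module has no (?R), so a hand-written recursive function stands in for the recursive regex: it is not the generic regex itself. The names match_block and find_block are hypothetical, and the sketch assumes attribute-less tags such as <div>, as in the sample text.

```python
def match_block(text, pos, tag="div"):
    """Return the end index of the balanced block starting at pos, or None."""
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    # Step 1: an opening TAG, with its exact case.
    if not text.startswith(open_tag, pos):
        return None
    i = pos + len(open_tag)
    # Step 2: either a recursive call on a nested block, or any single
    # character that does not begin an opening or closing TAG.
    while i < len(text):
        if text.startswith(close_tag, i):
            # Step 3: the closing TAG, with its exact case.
            return i + len(close_tag)
        if text.startswith(open_tag, i):
            end = match_block(text, i, tag)
            if end is None:
                return None  # nested block never closes
            i = end
        else:
            i += 1
    return None  # ran out of text before the closing TAG


def find_block(text, tag="div", start=0):
    """Return the next complete block of paired tags at or after start."""
    pos = text.find(f"<{tag}>", start)
    while pos != -1:
        end = match_block(text, pos, tag)
        if end is not None:
            return text[pos:end]
        pos = text.find(f"<{tag}>", pos + 1)
    return None


sample = ("<div> bla bla <div> blah bla bla </div> bla bla "
          "<div> blah bla bla </div> blah blah </div>")
print(find_block(sample) == sample)
```

On the sample text, the whole outer block is returned, nested blocks included. The string comparisons are case-sensitive, which plays the role of the Match case option.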
Note that the (?R) syntax of the recursive call refers to the entire regex, just as the $0 syntax refers to the entire match in a replacement part!
Robin Cruise last edited by
@guy038 super answer, thanks