Generic regex: Selecting a complete hierarchy between an opening HTML/XML tag and its corresponding closing tag

guy038

In order to select an entire block of text, between an opening tag and its CORRESPONDING closing tag, we have to use the recursive regex syntax. Indeed, the range of chars between the two tags may, itself, contain one or several pairs of the same opening and closing tags !

For instance, given this text :

<div>
    bla bla
    <div>
	    blah bla bla
    </div>
	bla bla
    <div>
	    blah bla bla
    </div>
    blah blah
</div>

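The simple regex ^\h*<div>(?s:.*?)</div> does not work, as its lazy quantifier stops at the first </div> found. So, it matches the outer opening tag <div> together with the closing tag of the first inner block, and not with its corresponding closing tag !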
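For those who prefer to check this outside Notepad++, here is a small Python sketch of the same failure. It is only an approximation : Python's re module does not know \h, so [ \t] is used instead ( an assumption of this sketch ), and the re.M flag is needed so that ^ matches at each line start, as it does in Notepad++. The name SAMPLE is just an illustrative name for the text above.

```python
import re

# The sample text from the post (indentation normalized to spaces)
SAMPLE = """<div>
    bla bla
    <div>
        blah bla bla
    </div>
    bla bla
    <div>
        blah bla bla
    </div>
    blah blah
</div>
"""

# [ \t]* stands in for \h*, which Python's re module does not support
lazy = re.compile(r"^[ \t]*<div>(?s:.*?)</div>", re.M)

match = lazy.search(SAMPLE)
matched_text = match.group(0)

# The match contains two opening tags but only one closing tag :
# it ends at the closing tag of the first INNER block
print(matched_text)
print(matched_text.count("<div>"), matched_text.count("</div>"))
```

The match starts at the outer <div> but ends right after the first inner </div>, so the selection is unbalanced.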

Of course, we could use the syntax ^\h*<div>(?s:((?!</?div>).)*?)</div>. However, this solution is not right either, as it only selects the two inner blocks and never the outer block, which contains nested <div> tags.
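The behaviour of this second regex can also be sketched in Python, under the same assumptions as before : [ \t] replaces \h, which Python's re module does not support, re.M makes ^ match at each line start, and SAMPLE is just an illustrative name for the example text.

```python
import re

# The sample text from the post (indentation normalized to spaces)
SAMPLE = """<div>
    bla bla
    <div>
        blah bla bla
    </div>
    bla bla
    <div>
        blah bla bla
    </div>
    blah blah
</div>
"""

# Each char consumed between the tags must NOT begin <div> or </div>
tempered = re.compile(r"^[ \t]*<div>(?s:((?!</?div>).)*?)</div>", re.M)

blocks = [m.group(0) for m in tempered.finditer(SAMPLE)]

# Only blocks without any nested <div> can match
print(len(blocks))
for block in blocks:
    print(repr(block))
```

Only the two inner blocks are listed. The outer block can never match, because the negative look-ahead refuses to walk over the nested <div> tags.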

So, the only valid solution is to use the recursive regex syntax ! The following generic regex may look a bit intimidating, at first sight !

<TAG(?:>|\x20)(?s:(?!<TAG>)(?!<TAG\x20)(?!</TAG>).|(?R))*</TAG>

The road map is simple :

Open the Find dialog ( Ctrl + F )
Copy / paste the generic regex, above, in the Find what: zone
Replace all uppercase TAG names with a correct html/xml tag. For instance, tr or div
Select the Regular expression search mode
Tick the Match case option ( IMPORTANT, as a leading (?-i) modifier may make this generic regex fail, in some cases )
Click on the Find Next button

=> From the current cursor position, the regex should find the next complete sequence of paired tags. Voila !
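If you need this regex for several tags, the replacement of the TAG names can be scripted. The following Python lines are a minimal sketch : the names GENERIC and build_tag_regex are hypothetical helpers, not part of Notepad++, and the resulting string is meant to be pasted in the Find what: zone, not to be run by Python's re module, which does not support (?R).

```python
# The generic regex from the post, kept as a raw string so that \x20 stays literal
GENERIC = r"<TAG(?:>|\x20)(?s:(?!<TAG>)(?!<TAG\x20)(?!</TAG>).|(?R))*</TAG>"


def build_tag_regex(tag: str) -> str:
    """Return the generic regex with every uppercase TAG replaced by `tag`."""
    return GENERIC.replace("TAG", tag)


# Regex ready to be pasted in the Find what: zone, for the div tag
print(build_tag_regex("div"))
```

The tag is inserted as is, so it should be a plain tag name, such as div or tr, in the exact case used in your file.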

Remark :

You may even try with the a tag. Although the generic regex remains correct, the recursive syntax is useless, in this case ! Indeed, a nested <a href=......<a href=.......</a>....</a> syntax, for instance, is invalid HTML and should never occur !

What is this regex searching for ?

First, it searches for an opening TAG, with its exact case ( part <TAG(?:>|\x20) )
Then, in a non-capturing group with the (?s) modifier, so that the dot also matches line-break chars, it looks for, either :
- Any single character which does not begin an opening or closing TAG sequence, with its exact case
- A recursive call to the entire regex (?R), which matches a complete nested TAG block
This alternative is repeated zero or more times ( so, the part (?s:(?!<TAG>)(?!<TAG\x20)(?!</TAG>).|(?R))* )
And, finally, it searches for the closing TAG, with its exact case ( part </TAG> )
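Note that the (?R) syntax, which can also be written (?0), is a recursive call to the entire regex, i.e. to group 0, in the same way as the $0 syntax stands for the entire match, in a replacement part. However, (?R) re-applies the pattern itself : it does not re-match the text already found !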
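To make the logic of the recursion more concrete, here is a rough Python emulation which counts the nesting depth, instead of recursing. This is NOT the Notepad++ regex engine : Python's re module has no (?R), so the function find_balanced_block and its parameters are hypothetical names of this sketch. It should select the same block as the generic regex on well-formed input, but I give it for illustration only.

```python
import re


def find_balanced_block(text, tag, start=0):
    """Return (begin, end) of the first complete <tag>...</tag> block, or None."""
    name = re.escape(tag)
    # Same opening test as the generic regex : <TAG> or <TAG followed by a space
    opening = re.compile(rf"<{name}(?:>|\x20)")
    token = re.compile(rf"<{name}(?:>|\x20)|</{name}>")
    for first in opening.finditer(text, start):
        depth = 1  # we are inside one opening tag
        for tok in token.finditer(text, first.end()):
            # A closing tag decreases the depth, a nested opening tag increases it
            depth += -1 if tok.group(0).startswith("</") else 1
            if depth == 0:
                return first.start(), tok.end()
        # This opening tag is never closed : try the next opening tag
    return None


text = '<p>x</p>\n<div>\n  a\n  <div>b</div>\n  <div class="c">d</div>\n</div>\n<div>e</div>'
begin, end = find_balanced_block(text, "div")
block = text[begin:end]
print(block)
```

Each nested opening tag plays the role of one (?R) call, and the block is complete when the depth is back to zero. Like the regex, an opening tag without its closing tag is skipped, and the search goes on from the next opening tag.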

Robin Cruise

@guy038 super answer, thanks