    Generic regex: Selecting a complete hierarchy between an opening HTML/XML tag and its corresponding closing tag

      guy038

      To select an entire block of text between an opening tag and its CORRESPONDING closing tag, we have to use recursive regex syntax. Indeed, the text between the two tags may itself contain one or several pairs of the same opening and closing tags !

      For instance, given this text :

      <div>
          bla bla
          <div>
              blah bla bla
          </div>
          bla bla
          <div>
              blah bla bla
          </div>
          blah blah
      </div>
      

      The simple regex ^\h*<div>(?s:.*?)</div> does not work : it starts at the outer opening tag <div> but stops at the first </div> it meets, which closes an inner block and is not the corresponding closing tag !

      Of course, we could use the syntax ^\h*<div>(?s:((?!</?div>).)*?)</div>. However, this solution is not right either, as it only selects the inner blocks and never the outer block.
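
      To see both failures concretely, here is a minimal Python sketch that runs the two regexes above against the sample text. It is only an illustration, under two assumptions : Python's re module does not know \h, so [ \t] is used in its place, and the names SAMPLE, LAZY and TEMPERED are made up for this example.

```python
import re

# Sample text from the post, with the indentation normalized to spaces.
SAMPLE = """<div>
    bla bla
    <div>
        blah bla bla
    </div>
    bla bla
    <div>
        blah bla bla
    </div>
    blah blah
</div>
"""

# Same regexes as above, with \h written as [ \t] (Python's re has no \h).
LAZY = re.compile(r"^[ \t]*<div>(?s:.*?)</div>", re.MULTILINE)
TEMPERED = re.compile(r"^[ \t]*<div>(?s:((?!</?div>).)*?)</div>", re.MULTILINE)

# The lazy regex starts at the outer <div> but stops at the first </div>,
# so its match holds two opening tags and a single closing tag.
lazy_match = LAZY.search(SAMPLE).group(0)
print(lazy_match.count("<div>"), lazy_match.count("</div>"))

# The tempered regex refuses to cross any <div> or </div>, so it can only
# match blocks that contain no nested tag: the two inner blocks.
inner_blocks = [m.group(0) for m in TEMPERED.finditer(SAMPLE)]
print(len(inner_blocks))
```

      Neither result is the complete outer block, which is why a recursive construct is needed.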

      So, the only valid solution is to use recursive regex syntax ! The following generic regex may look intimidating at first sight :

      <TAG(?:>|\x20)(?s:(?!<TAG>)(?!<TAG\x20)(?!</TAG>).|(?R))*</TAG>

      The road map is simple :

      • Open the Find dialog ( Ctrl + F )

      • Copy / paste the generic regex, above, in the Find what: zone

      • Replace every uppercase TAG with the name of an actual HTML/XML tag, for instance tr or div

      • Tick the Match case option ( IMPORTANT : do not rely on a leading (?-i) modifier instead, as it may make this generic regex fail in some cases )

      • Click on the Find Next button

      => From the current cursor position, the regex should find the next complete sequence of paired tags. Voila !
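
      The substitution step of the road map can also be scripted. The Python sketch below is only a convenience, not part of the original method : build_pattern and GENERIC are hypothetical names, and the check on the tag name is an assumption made to keep regex metacharacters out of the pattern. The resulting string is meant to be pasted in the Find what: zone; Python's own re module does not support (?R) and cannot run it.

```python
import re

# The generic regex from the post, with TAG as the placeholder.
GENERIC = r"<TAG(?:>|\x20)(?s:(?!<TAG>)(?!<TAG\x20)(?!</TAG>).|(?R))*</TAG>"


def build_pattern(tag: str) -> str:
    """Return the generic regex with every TAG replaced by a real tag name."""
    # Accept only plain tag names so that no regex metacharacter slips in
    # (an assumption of this sketch, stricter than what HTML/XML allows).
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9]*", tag):
        raise ValueError(f"unexpected tag name: {tag!r}")
    return GENERIC.replace("TAG", tag)


print(build_pattern("div"))
# prints <div(?:>|\x20)(?s:(?!<div>)(?!<div\x20)(?!</div>).|(?R))*</div>
```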

      Remark :

      You may even try with the a tag. Although the generic regex remains correct, the recursive syntax is useless in this case ! Indeed, nested links such as <a href=......<a href=.......</a>....</a> are illegal and should never occur !


      What is this regex searching for ?

      • First, it searches for an opening TAG, with its exact case ( part <TAG(?:>|\x20) )

      • Then, in a non-capturing group where the dot also matches line-break chars ( the (?s:....) syntax ), it looks for either :

        • Any character, not beginning an opening or closing TAG sequence, with its exact case

        • A recursive call to the entire regex (?R)

      • That non-capturing group is repeated zero or more times ( part (?s:(?!<TAG>)(?!<TAG\x20)(?!</TAG>).|(?R))* )

      • And, finally, it searches for the closing TAG, with its exact case ( part </TAG> )

      • Note that the (?R) syntax, which recursively calls the entire regex, is analogous to the $0 syntax in a replacement part, which stands for the entire match !
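
      The logic described above can be followed step by step outside of a regex engine. Python's standard re module has no (?R), so the sketch below swaps in a different technique, an explicit depth counter, that recognizes the same opening and closing tokens as the generic regex. The function name find_balanced and the SAMPLE text are hypothetical, and this is an approximation of the regex's behaviour, not a replacement for it.

```python
import re

SAMPLE = """<div>
    bla bla
    <div>
        blah bla bla
    </div>
    blah blah
</div>
"""


def find_balanced(text, tag, start=0):
    """Return (begin, end) of the next complete <tag>...</tag> block, or None."""
    name = re.escape(tag)
    # Same tokens as the generic regex: "<TAG>" or "<TAG " opens, "</TAG>" closes.
    opener = re.compile(rf"<{name}(?:>|\x20)")
    token = re.compile(rf"(<{name}(?:>|\x20))|</{name}>")

    first = opener.search(text, start)
    while first:
        depth = 0
        # Walk through every opening/closing token, starting at this opener.
        for m in token.finditer(text, first.start()):
            depth += 1 if m.group(1) else -1   # group 1 is set for an opener
            if depth == 0:                     # the corresponding closing tag
                return first.start(), m.end()
        # This opener is never closed: retry from the next opening tag,
        # as the regex engine would do after a failed attempt.
        first = opener.search(text, first.end())
    return None


begin, end = find_balanced(SAMPLE, "div")
print(SAMPLE[begin:end])
```

      Each increment of the counter plays the role of one (?R) call, and the return to zero corresponds to the final </TAG> of the outermost call. Like the Match case option, the search is case-sensitive.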

        Robin Cruise @guy038

        @guy038 super answer, thanks

        The Community of users of the Notepad++ text editor.