    Regex: Delete all html tags inside 2 other tags, except <a href=.*?"> and </a>

    • Vasile Caraus

      I want to delete all HTML tags between two other tags, except <a href=".*?"> and </a>.

      For example:

      <p class="mb-40px">Another blending </h2>option is to all the <div>brushstrokes to show. In the painting of trees above, I didn’t spend much time trying to <a href=https://orfun.com/acrylic class="color-bebe" target="_new">blend the colors</a>. I simply mix each color and apply it without fussing with it.</p>
      

      In the example above, <div> and </h2> must be deleted, but <a href ...> and </a> must be kept.

      Output:

      <p class="mb-40px">Another blending option is to all the brushstrokes to show. In the painting of trees above, I didn’t spend much time trying to <a href=https://orfun.com/acrylic class="color-bebe" target="_new">blend the colors</a>. I simply mix each color and apply it without fussing with it.</p>
      

      My regex is not very good:

      (?s-i)^.+<p class="mb-40px">\R|</p>.+|(?-s)(<a href.*>)?(?|(.+)(</a>)|(.+))$

      Replace by:

      ?2(?1:<p class="mb-40px">)$0(?3:</p>):$1

      • Mark Olson @Vasile Caraus

        Tough challenge! But I believe I have a regex that will meet your need.

        FIND: (?s-i)(?:<p[^>]*>|(?!\A)\G)(?:(?!</p>).)*?\K<(?!(?:/[ap]>|a\x20[^>]+>))[^>]*>
        REPLACE WITH: <empty>

        I converted

        <p class="mb-40px">Delete <h2>ALL </h2>of the <div>html</div>
        <abc foo="bar">tags inside </abc> of a p element
            <abstract>even this one here</abstract> 
            <a href=https://orfun.com/acrylic class="color-bebe" target="_new">UNLESS THE TAG IS AN a tag</a>.
            <A HREF="blah">uppercase A tags don't count</A>
        Text should be left as is</p>
        <div>
        This is not a p tag, so <all>the tags</all>
        in these <here>tags</here> should be left <a href="orneorne">untouched.</a>
        <p>but <a href="reorn">not</a> <this>tag!</this></p>
        </div>
        

        into this:

        <p class="mb-40px">Delete ALL of the html
        tags inside  of a p element
            even this one here 
            <a href=https://orfun.com/acrylic class="color-bebe" target="_new">UNLESS THE TAG IS AN a tag</a>.
            uppercase A tags don't count
        Text should be left as is</p>
        <div>
        This is not a p tag, so <all>the tags</all>
        in these <here>tags</here> should be left <a href="orneorne">untouched.</a>
        <p>but <a href="reorn">not</a> tag!</p>
        </div>
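
The same conversion can be approximated outside Notepad++. What follows is a minimal sketch in Python, not the method used in this thread: Python's standard `re` module does not support `\G` or `\K`, so the sketch swaps in a two-step approach (match each `<p ...>` to `</p>` region, then delete the unwanted tags inside it). The names `P_REGION`, `UNWANTED_TAG`, and `strip_tags_in_p` are hypothetical, and unlike the FIND regex the sketch ignores a `<p>` that has no closing `</p>`.

```python
import re

# Region: an opening <p ...> tag, everything up to the first </p>, and that </p>.
# Matching is case-sensitive, like (?-i) in the FIND regex.
P_REGION = re.compile(r"(<p[^>]*>)(.*?)(</p>)", re.DOTALL)

# Any tag except </a>, </p>, and an opening <a ...> tag
# (the same negative lookahead as in the FIND regex).
UNWANTED_TAG = re.compile(r"<(?!(?:/[ap]>|a\x20[^>]+>))[^>]*>")


def strip_tags_in_p(html: str) -> str:
    """Delete unwanted tags, but only between <p ...> and </p>."""
    def clean(match: re.Match) -> str:
        opening, body, closing = match.groups()
        # Only the region body is rewritten; the p tags themselves are kept.
        return opening + UNWANTED_TAG.sub("", body) + closing

    return P_REGION.sub(clean, html)


sample = (
    '<p class="mb-40px">Another blending </h2>option is to all the '
    '<div>brushstrokes to show. '
    '<a href=https://orfun.com/acrylic class="color-bebe">blend the colors</a>.</p>\n'
    '<div>outside a p element, <all>tags</all> stay</div>'
)
print(strip_tags_in_p(sample))
```

In the printed result, `</h2>` and `<div>` are gone from the paragraph, the `<a ...>` and `</a>` tags survive, and the second line is unchanged because it is not inside a p element.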
        
        • Vasile Caraus @Mark Olson

          @Mark-Olson said in Regex: Delete all html tags inside 2 other tags, except <a href=.*?"> and </a>:

          (?s-i)(?:<p[^>]*>|(?!\A)\G)(?:(?!</p>).)*?\K<(?!(?:/[ap]>|a\x20[^>]+>))[^>]*>

          Thanks a lot. But how did you manage to find this solution?

          And how did you convert the text?

          Where does this [ap] come from? I have never seen it before!

          • Mark Olson @Vasile Caraus

            @Vasile-Caraus said in Regex: Delete all html tags inside 2 other tags, except <a href=.*?"> and </a>:

            And how did you convert the text?

            I just used Notepad++'s Find/Replace dialog, with the regular expression search mode turned on.

            Thanks a lot. But how did you manage to find this solution?

            Since you’ve taken an interest, I’ll give a pretty detailed explanation of my regex.

            By the way, I have a slight update that should work just as well, but is simpler:
            Replace (?s-i)(?:<p[^>]*>|(?!\A)\G)(?:(?!</p>).)*?\K<(?!(?:/[ap]>|a\x20))[^>]*> with nothing.

            1. It’s modeled on guy038’s now-famous regex for replacing in a specific region of text. I won’t explain the parts of my regex that are indebted to it; you can read his excellent explanation in the linked post.
            2. Specifically, the BSR (the regex that begins the search region) is <p[^>]*>, which is an opening p tag, and the ESR (the regex that ends the search region) is </p>, the closing p tag.
            3. So far this accounts for the first part of the regex, (?s-i)(?:<p[^>]*>|(?!\A)\G)(?:(?!</p>).)*?\K. But the tricky part is matching only tags other than <a> and the closing </p> tag.
            4. We know that any tag we want to remove matches <[^>]*>, that is, an opening <, some stuff, and a closing >.
            5. To distinguish the tags we want to remove, we’ll do a negative lookahead right after the opening <, so we get <(?!{%distinguishing text%})[^>]*>.
            6. Let’s start by observing that the tag cannot be a closing a or p tag. This is the /[ap]> branch of the negative lookahead, where [ap] simply means “a or p”.
            7. Next we need to rule out opening a tags. This is the a\x20 branch of the negative lookahead. By the way, \x20 is just another way to say space, as in the space you make with your space bar. Regex aficionados like to use \x20, because it can’t be mistaken for any other character.
            8. So we arrive at the final regex, (?s-i)(?:<p[^>]*>|(?!\A)\G)(?:(?!</p>).)*?\K<(?!(?:/[ap]>|a\x20))[^>]*>
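
The negative lookahead from steps 5 to 7 can be tried in isolation. The sketch below is only an illustration in Python, whose `re` module accepts this part of the regex unchanged (the `\G` and `\K` region logic is left out because `re` does not support it). The names `REMOVABLE_TAG` and `is_removed` are hypothetical.

```python
import re

# Tag pattern from the simplified regex: "<", a negative lookahead that rejects
# "/a>", "/p>" and "a" followed by a space, then the rest of the tag up to ">".
REMOVABLE_TAG = re.compile(r"<(?!(?:/[ap]>|a\x20))[^>]*>")


def is_removed(tag: str) -> bool:
    """Return True if the pattern matches the whole tag, i.e. the tag would be deleted."""
    return REMOVABLE_TAG.fullmatch(tag) is not None


for tag in ["<div>", "</h2>", '<abc foo="bar">', '<A HREF="blah">',
            '<a href="reorn">', "</a>", "</p>"]:
    print(f"{tag:18} {'removed' if is_removed(tag) else 'kept'}")
```

The first four tags are reported as removed and the last three as kept. `<abc foo="bar">` is removed because the character after `a` is not a space, and `<A HREF="blah">` is removed because matching is case-sensitive. One consequence of the `a\x20` branch is that a bare `<a>` with no attributes would also be removed, since the lookahead requires a space after the `a`.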
            The Community of users of the Notepad++ text editor.