Regex: Delete all html tags inside 2 other tags, except <a href=.*?"> and </a>

    Vasile Caraus
    last edited by Vasile Caraus Jun 23, 2023, 9:56 AM

    I want to delete all HTML tags inside two other tags, except <a href=".*?"> and </a>.

    For example:

    <p class="mb-40px">Another blending </h2>option is to all the <div>brushstrokes to show. In the painting of trees above, I didn’t spend much time trying to <a href=https://orfun.com/acrylic class="color-bebe" target="_new">blend the colors</a>. I simply mix each color and apply it without fussing with it.</p>
    

    In this example, <div> and </h2> must be deleted, but the <a href ...> and </a> tags must be kept.

    Output:

    <p class="mb-40px">Another blending option is to all the brushstrokes to show. In the painting of trees above, I didn’t spend much time trying to <a href=https://orfun.com/acrylic class="color-bebe" target="_new">blend the colors</a>. I simply mix each color and apply it without fussing with it.</p>
    

    My regex is not very good:

    (?s-i)^.+<p class="mb-40px">\R|</p>.+|(?-s)(<a href.*>)?(?|(.+)(</a>)|(.+))$

    Replace by:

    ?2(?1:<p class="mb-40px">)$0(?3:</p>):$1

      Mark Olson @Vasile Caraus
      last edited by Mark Olson Jun 24, 2023, 1:25 AM Jun 24, 2023, 1:20 AM

      Tough challenge! But I believe I have a regex that will meet your need.

      FIND: (?s-i)(?:<p[^>]*>|(?!\A)\G)(?:(?!</p>).)*?\K<(?!(?:/[ap]>|a\x20[^>]+>))[^>]*>
      REPLACE WITH: <empty>

      I converted

      <p class="mb-40px">Delete <h2>ALL </h2>of the <div>html</div>
      <abc foo="bar">tags inside </abc> of a p element
          <abstract>even this one here</abstract> 
          <a href=https://orfun.com/acrylic class="color-bebe" target="_new">UNLESS THE TAG IS AN a tag</a>.
          <A HREF="blah">uppercase A tags don't count</A>
      Text should be left as is</p>
      <div>
      This is not a p tag, so <all>the tags</all>
      in these <here>tags</here> should be left <a href="orneorne">untouched.</a>
      <p>but <a href="reorn">not</a> <this>tag!</this></p>
      </div>
      

      into this:

      <p class="mb-40px">Delete ALL of the html
      tags inside  of a p element
          even this one here 
          <a href=https://orfun.com/acrylic class="color-bebe" target="_new">UNLESS THE TAG IS AN a tag</a>.
          uppercase A tags don't count
      Text should be left as is</p>
      <div>
      This is not a p tag, so <all>the tags</all>
      in these <here>tags</here> should be left <a href="orneorne">untouched.</a>
      <p>but <a href="reorn">not</a> tag!</p>
      </div>
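
For readers who want to try the same transformation outside Notepad++, here is a minimal Python sketch. It is not the regex from this post: Python's standard `re` module has no `\G` or `\K`, so the sketch swaps in a two-step approach (find each `<p ...>...</p>` region, then strip tags inside it with a callback). The names `P_REGION`, `UNWANTED_TAG`, and `strip_tags_in_p` are made up for illustration. One assumed difference: a `<p>` with no closing `</p>` is left untouched here.

```python
import re

# Opening <p ...> tag, lazily captured body, closing </p>.
# DOTALL lets the body span several lines, like (?s) in the original regex.
P_REGION = re.compile(r'(<p[^>]*>)(.*?)(</p>)', re.DOTALL)

# Any tag except </a>, </p>, and an opening "<a " tag.
# Matching is case-sensitive, like (?-i) in the original regex.
UNWANTED_TAG = re.compile(r'<(?!(?:/[ap]>|a\x20))[^>]*>')


def strip_tags_in_p(html: str) -> str:
    """Remove unwanted tags, but only between <p ...> and </p>."""
    def clean(match: re.Match) -> str:
        opening, body, closing = match.groups()
        return opening + UNWANTED_TAG.sub('', body) + closing

    return P_REGION.sub(clean, html)


sample = ('<p class="mb-40px">Another blending </h2>option <div>here '
          '<a href=https://orfun.com/acrylic class="color-bebe">blend</a>.</p>')
print(strip_tags_in_p(sample))
```

On this sample the `</h2>` and `<div>` tags disappear while the `<a ...>` and `</a>` tags stay, and text outside any `<p>` element is returned unchanged.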
      
        Vasile Caraus @Mark Olson
        last edited by Vasile Caraus Jun 24, 2023, 2:17 AM Jun 24, 2023, 2:03 AM

        @Mark-Olson said in Regex: Delete all html tags inside 2 other tags, except <a href=.*?"> and </a>:

        (?s-i)(?:<p[^>]*>|(?!\A)\G)(?:(?!</p>).)*?\K<(?!(?:/[ap]>|a\x20[^>]+>))[^>]*>

        Thanks a lot. But how did you arrive at this solution?

        And how did you convert the text?

        And what is this [ap]? I have never seen it before!

          Mark Olson @Vasile Caraus
          last edited by Mark Olson Jun 24, 2023, 2:35 AM Jun 24, 2023, 2:32 AM

          @Vasile-Caraus said in Regex: Delete all html tags inside 2 other tags, except <a href=.*?"> and </a>:

          And how did you convert the text?

          I just used the find/replace form, with regular expressions on.

          Thanks a lot. But how did you arrive at this solution?

          Since you’ve taken an interest, I’ll give a pretty detailed explanation of my regex.

          By the way, I have a slight update that should work just as well, but is simpler:
          Replace (?s-i)(?:<p[^>]*>|(?!\A)\G)(?:(?!</p>).)*?\K<(?!(?:/[ap]>|a\x20))[^>]*> with nothing.

          1. It’s modeled on guy038’s now-famous regex for replacing in a specific region of text. I won’t explain all the parts of this regex that are indebted to that; you can just read his excellent explanation in the linked post.
          2. Specifically, the BSR (Begin Search Region) is <p[^>]*>, which is an opening p tag, and the ESR (End Search Region) is </p>, the closing p tag.
          3. So far this accounts for the first part of the regex, (?s-i)(?:<p[^>]*>|(?!\A)\G)(?:(?!</p>).)*?\K. But the tricky part is matching only tags other than <a> and the closing </p> tag.
          4. We know that any tag we want to remove contains <[^>]*>, that is, an opening <, some stuff, and a closing >.
          5. To distinguish the tags we want to remove, we’ll do a negative lookahead right after the opening <, so we get <(?!{%distinguishing text%})[^>]*>.
          6. Let’s start by observing that the tag cannot be a closing a or p tag. This is the /[ap]> branch of the negative lookahead, where [ap] simply means “a or p”.
          7. Next we need to rule out opening a tags. This is the a\x20 branch of the negative lookahead. By the way, \x20 is just another way to say space, as in the space you make with your space bar. Regex aficionados like to use \x20, because it can’t be mistaken for any other character.
          8. So we arrive at the final regex, (?s-i)(?:<p[^>]*>|(?!\A)\G)(?:(?!</p>).)*?\K<(?!(?:/[ap]>|a\x20))[^>]*>
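
The tag-matching half of the regex (steps 4 to 7) uses only syntax that Python's standard `re` module also accepts, so it can be tried in isolation. The sketch below is an illustration, not part of the Notepad++ solution; `TAG` and `fragment` are hypothetical names, and the fragment stands for text that is already inside a p element.

```python
import re

# "<", then a negative lookahead with two branches, then the rest of the tag:
#   /[ap]>   rules out the closing tags </a> and </p>
#   a\x20    rules out an opening a tag ("a" followed by a space)
TAG = re.compile(r'<(?!(?:/[ap]>|a\x20))[^>]*>')

fragment = 'x <h2>y</h2> <a href="u">z</a> <abc>w</abc> <A HREF="v">q</A>'

# Tags that would be removed; <a href="u"> and </a> are not among them.
print(TAG.findall(fragment))
# ['<h2>', '</h2>', '<abc>', '</abc>', '<A HREF="v">', '</A>']
```

Two details are worth noticing. First, `<abc>` and `</abc>` are matched, because `a\x20` needs a space right after the `a` and `/[ap]>` needs the `>` right after the letter. Second, taken alone this pattern would also match an opening `<p ...>` tag; in the full regex that tag is consumed before `\K`, so it is never part of the text that gets replaced.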