Regex: Delete all html tags inside 2 other tags, except <a href=.*?"> and </a>
-
I want to delete all HTML tags inside two other tags, except
<a href=".*?">
and </a>
For example:
<p class="mb-40px">Another blending </h2>option is to all the <div>brushstrokes to show. In the painting of trees above, I didn’t spend much time trying to <a href=https://orfun.com/acrylic class="color-bebe" target="_new">blend the colors</a>. I simply mix each color and apply it without fussing with it.</p>
In the case above, <div> and </h2> must be deleted, but <a href ...> and </a> must be kept.
Output:
<p class="mb-40px">Another blending option is to all the brushstrokes to show. In the painting of trees above, I didn’t spend much time trying to <a href=https://orfun.com/acrylic class="color-bebe" target="_new">blend the colors</a>. I simply mix each color and apply it without fussing with it.</p>
My regex is not too good:
(?s-i)^.+<p class="mb-40px">\R|</p>.+|(?-s)(<a href.*>)?(?|(.+)(</a>)|(.+))$
Replace by:
?2(?1:<p class="mb-40px">)$0(?3:</p>):$1
-
Tough challenge! But I believe I have a regex that will meet your need.
FIND:
(?s-i)(?:<p[^>]*>|(?!\A)\G)(?:(?!</p>).)*?\K<(?!(?:/[ap]>|a\x20[^>]+>))[^>]*>
REPLACE WITH: <empty>
I converted
<p class="mb-40px">Delete <h2>ALL </h2>of the <div>html</div> <abc foo="bar">tags inside </abc> of a p element <abstract>even this one here</abstract> <a href=https://orfun.com/acrylic class="color-bebe" target="_new">UNLESS THE TAG IS AN a tag</a>. <A HREF="blah">uppercase A tags don't count</A> Text should be left as is</p> <div> This is not a p tag, so <all>the tags</all> in these <here>tags</here> should be left <a href="orneorne">untouched.</a> <p>but <a href="reorn">not</a> <this>tag!</this></p> </div>
into this:
<p class="mb-40px">Delete ALL of the html tags inside of a p element even this one here <a href=https://orfun.com/acrylic class="color-bebe" target="_new">UNLESS THE TAG IS AN a tag</a>. uppercase A tags don't count Text should be left as is</p> <div> This is not a p tag, so <all>the tags</all> in these <here>tags</here> should be left <a href="orneorne">untouched.</a> <p>but <a href="reorn">not</a> tag!</p> </div>
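For anyone who wants to reproduce this outside Notepad++, here is a rough Python sketch of the same idea. Python's standard re module has no \G or \K, so this is not a port of the regex above: it swaps in a two-step approach (find each <p ...>...</p> region, then strip the unwanted tags inside it). The function name strip_tags_inside_p and the sample string are made up for illustration, and unlike the Notepad++ regex this version only touches regions that have a closing </p>.

```python
import re

# A tag to strip: any <...> that is not </a>, </p>, or an opening "<a " tag.
# This is the same negative lookahead as in the Notepad++ regex, and it is
# case-sensitive, like (?-i), so uppercase <A ...> tags are stripped too.
TAG_TO_STRIP = re.compile(r"<(?!(?:/[ap]>|a\x20))[^>]*>")

# One <p ...> ... </p> region. DOTALL plays the role of the (?s) flag.
# Assumption: every region of interest has a closing </p>.
P_REGION = re.compile(r"(<p[^>]*>)(.*?)(</p>)", re.DOTALL)


def strip_tags_inside_p(html: str) -> str:
    """Remove every tag inside <p>...</p> except <a ...> and </a>."""

    def clean(match: re.Match) -> str:
        opening, body, closing = match.groups()
        # Only the text between the p tags is cleaned; the tags themselves stay.
        return opening + TAG_TO_STRIP.sub("", body) + closing

    return P_REGION.sub(clean, html)


# Hypothetical sample, much shorter than the text in the thread.
sample = '<p class="x">Keep <b>bold</b> and <a href="u">link</a>.</p> <div><i>out</i></div>'
print(strip_tags_inside_p(sample))
# prints: <p class="x">Keep bold and <a href="u">link</a>.</p> <div><i>out</i></div>
```

Doing it in two steps keeps each pattern short, at the cost of not being a single find/replace you can paste into the Notepad++ dialog.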
-
@Mark-Olson said in Regex: Delete all html tags inside 2 other tags, except <a href=.*?"> and </a>:
(?s-i)(?:<p[^>]*>|(?!\A)\G)(?:(?!</p>).)*?\K<(?!(?:/[ap]>|a\x20[^>]+>))[^>]*>
Thanks a lot. But how did you manage to find this solution?
And how did you convert the text?
And what is this
[ap]
? I have never seen it before!
-
@Vasile-Caraus said in Regex: Delete all html tags inside 2 other tags, except <a href=.*?"> and </a>:
And how did you convert the text ?
I just used the find/replace form, with regular expressions on.
thanks a lot. But how did you manage to find this solution?
Since you’ve taken an interest, I’ll give a pretty detailed explanation of my regex.
By the way, I have a slight update that should work just as well, but is simpler:
Replace
(?s-i)(?:<p[^>]*>|(?!\A)\G)(?:(?!</p>).)*?\K<(?!(?:/[ap]>|a\x20))[^>]*>
with nothing.
- It’s modeled off of guy038’s now-famous replacing in a specific region of text regex. I won’t explain all the parts of this regex that are indebted to that; you can just read his excellent explanation in the linked post.
- Specifically, the BSR is <p[^>]*>, which is an opening p tag, and the ESR is </p>, the closing p tag.
- So far this accounts for the first part of the regex, (?s-i)(?:<p[^>]*>|(?!\A)\G)(?:(?!</p>).)*?\K , but the tricky part is matching only tags other than <a> and the closing </p> tag.
- We know that any tag we want to remove contains <[^>]*>, that is, an opening <, some stuff, and a closing >.
- To distinguish the tags we want to remove, we’ll do a negative lookahead right after the opening <, so we get <(?!{%distinguishing text%})[^>]*>
- Let’s start by observing that the tag cannot be a closing a or p tag. This is the /[ap]> branch of the negative lookahead, where [ap] simply means “a or p”.
- Next we need to rule out opening a tags. This is the a\x20 branch of the negative lookahead. By the way, \x20 is just another way to say space, as in the space you make with your space bar. Regex aficionados like to use \x20 because it can’t be mistaken for any other character.
- So we arrive at the final regex,
(?s-i)(?:<p[^>]*>|(?!\A)\G)(?:(?!</p>).)*?\K<(?!(?:/[ap]>|a\x20))[^>]*>
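If you want to poke at the negative lookahead on its own, the tag-matching tail of the regex also works unchanged in Python's re module (only the \G and \K parts are missing there). The sketch below checks a handful of tags one by one; the helper name is_removable and the list of tags are just for illustration.

```python
import re

# Tail of the regex only: "<", then the negative lookahead, then the rest of the tag.
# The region logic (\G and \K) is left out on purpose.
REMOVABLE_TAG = re.compile(r"<(?!(?:/[ap]>|a\x20))[^>]*>")


def is_removable(tag: str) -> bool:
    """True if the regex tail would match (and so delete) this whole tag."""
    return REMOVABLE_TAG.fullmatch(tag) is not None


for tag in ["<div>", "</h2>", "<abstract>", "<a href=x>", "</a>", "</p>", "<A HREF=x>", "<a>"]:
    print(f"{tag:12} removable: {is_removable(tag)}")

# Notes on the results:
# - <abstract> is removable: it starts with "a", but "a" is not followed by a space.
# - <A HREF=x> is removable: the match is case-sensitive, as with (?-i).
# - A bare <a> with no attributes is also removable, because only "a" followed by
#   a space is excluded by the lookahead.
```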