How to remove all HTML tags except <p> or <h1> <h2> tags?
-
I have several articles in txt files under a directory.
The articles’ html code is somehow messed up.
I wish to remove all HTML tags except <p>, <h1>, and <h2> tags.
The following regex removes all HTML tags:
<[^>]+>
How do I add an exception so that any p, h1, or h2 tags are kept?
Thank you in advance for sharing your RegEx knowledge!
-
I would recommend a negative lookahead assertion: FIND =
<(?!h1|h2|p)[^>]+>
That says: “look for <, look ahead and make sure the next characters aren’t h1, h2, or p, then consume one or more non-> characters up to the first > found.”
-
Thank you for the reply.
This regex now removes all HTML tags except the h1, h2, and p opening tags:
<(?!h1|h2|p)[^>]+>
But I notice that it also removes the closing </h1>, </h2>, and </p> tags.
I tried the patterns below to keep those closing tags, but they failed:
<(?!h1|/h2|h2|/h2||p|/p)[^>]+>
or this
<(?!h1|\h2|h2|\h2||p|\p)[^>]+>
Would you advise how to keep the closing tags?
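For anyone following along, the behaviour is easy to reproduce on a small sample. The sketch below assumes Python's re module (the thread does not say which tool is being used) and a made-up sample string. The lookahead is checked right after <, and in a closing tag the next character is /, not h or p, so the lookahead passes and the tag is matched. The first attempt fails for a different reason: the empty alternative in || always matches, so the negative lookahead always fails and nothing is removed at all.

```python
import re

# Pattern from the reply above: do not match tags that start with h1, h2, or p.
pattern = re.compile(r"<(?!h1|h2|p)[^>]+>")

# Hypothetical sample text, made up for illustration.
sample = "<div><h1>Title</h1><p>Some <b>bold</b> text</p></div>"

# Opening <h1> and <p> survive, but </h1> and </p> are matched,
# because the character right after "<" is "/" and the lookahead passes.
removed = pattern.findall(sample)
print(removed)
print(pattern.sub("", sample))

# First attempt from this post: the empty alternative in "||" always matches,
# so the negative lookahead always fails and the pattern matches nothing.
attempt = re.compile(r"<(?!h1|/h2|h2|/h2||p|/p)[^>]+>")
print(attempt.findall(sample))
```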
-
I found that this regex does the job:
</?(?!a)(?!p)(?!ul)(?!li)(?!h)\w*\b[^>]*>
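Note that the pattern above keeps more than the question asked for: it also leaves a, ul, and li tags alone, along with any tag whose name merely starts with a, p, ul, li, or h (for example pre or h3). If only p, h1, and h2 should survive, a stricter variant is possible. The following is a minimal sketch, assuming Python's re module and UTF-8 text files; the names KEEP_ONLY, strip_tags, and clean_directory are hypothetical and not from the thread. Putting the optional / inside the lookahead covers closing tags, and \b stops p from also protecting names such as pre.

```python
import re
from pathlib import Path

# Keep only <p>, <h1>, <h2> and their closing tags; remove every other tag.
# The optional "/" sits inside the lookahead so closing tags are checked too,
# and \b requires the tag name to end right after p, h1, or h2.
KEEP_ONLY = re.compile(r"<(?!/?(?:p|h1|h2)\b)[^>]+>", re.IGNORECASE)


def strip_tags(text: str) -> str:
    """Remove all tags except p, h1, and h2 (opening and closing)."""
    return KEEP_ONLY.sub("", text)


def clean_directory(folder: str) -> None:
    """Rewrite every .txt file in `folder` in place (hypothetical helper).

    Assumes the files are UTF-8 encoded; back them up before running.
    """
    for path in Path(folder).glob("*.txt"):
        original = path.read_text(encoding="utf-8")
        path.write_text(strip_tags(original), encoding="utf-8")


# Made-up sample for illustration.
sample = '<div><h1>Title</h1><p class="x">Some <b>bold</b> <pre>text</pre></p></div>'
print(strip_tags(sample))
```

As with any regex-based approach, this is only suitable for simple cleanup; tags containing a > inside an attribute value would not be handled correctly.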