How to remove all HTML tags except <p> or <h1> <h2> tags?
I have several articles in txt files under a directory.
The articles’ HTML is somehow messed up.
I wish to remove all HTML tags except <p>, <h1>, and <h2> tags.
The following code removes all HTML tags.
How do I add an exception?
I want to keep any p, h1, or h2 tags.
Thank you in advance for your sharing of RegEx knowledge!
PeterJones
I would recommend a negative lookahead assertion: FIND = <(?!h1|h2|p)[^>]+>
That says, “look for <, look ahead and make sure what follows isn’t h1, h2, or p, then consume one or more non-> characters up to the first >.”
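To see what this pattern matches outside the editor, here is a minimal sketch using Python's re module. The pattern is the one from the reply; the sample string and variable names are made up for illustration, and Python's regex engine is assumed to treat this particular expression the same way the editor's does.

```python
import re

# Pattern from the reply: any tag whose text right after "<"
# does not start with h1, h2 or p.
PATTERN = re.compile(r"<(?!h1|h2|p)[^>]+>")

# Hypothetical sample input, not taken from the original articles.
sample = "<div><h1>Title</h1><p>Some <b>bold</b> text</p></div>"

# Replace every match with nothing, as an empty "Replace with" field would.
stripped = PATTERN.sub("", sample)
print(stripped)  # -> <h1>Title<p>Some bold text
```

Because the lookahead only checks how the tag text starts, any tag beginning with a slash still matches, and so would be removed.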
Thank you for the reply.
This code now replaces all HTML tags except the h1, h2, and p tags.
But I notice that it also replaces the closing </h1>, </h2>, and </p>.
I tried the patterns below to keep those closing tags, but they failed.
Would you advise how to keep the trailing tags?
I found this code will do the job
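For readers following along, one way to keep both the opening and the closing tags is to allow an optional slash inside the lookahead. The sketch below is an illustration in Python, not necessarily the code referred to above; the names KEEP, PATTERN, and strip_tags are hypothetical. The added \b is a further assumption: it stops names that merely start with p, such as pre, from being kept.

```python
import re

# Tag names to keep (hypothetical helper names for illustration).
KEEP = ("h1", "h2", "p")

# "<", then a negative lookahead that rejects an optional "/" followed by
# one of the kept names ending at a word boundary, then the rest of the tag.
PATTERN = re.compile(r"<(?!/?(?:%s)\b)[^>]+>" % "|".join(KEEP), re.IGNORECASE)


def strip_tags(html: str) -> str:
    """Remove every tag except opening and closing h1, h2 and p tags."""
    return PATTERN.sub("", html)


sample = '<div><h1>Title</h1><p class="x">Some <b>bold</b> <pre>text</pre></p></div>'
print(strip_tags(sample))  # -> <h1>Title</h1><p class="x">Some bold text</p>
```

The equivalent expression for a Find field would be <(?!/?(?:h1|h2|p)\b)[^>]+> with an empty replacement, assuming the editor's regex engine supports lookaheads and \b as the earlier reply suggests.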