How to remove all HTML tags except <p> or <h1> <h2> tags?
-
I have several articles in txt files under a directory.
The articles’ HTML is somehow messed up.
I wish to remove all HTML tags except <p>, <h1>, and <h2> tags.
The following regex removes all HTML tags:
<[^>]+>
How can I add an exception so that any p, h1, or h2 tags are kept?
Thank you in advance for sharing your RegEx knowledge!
-
I would recommend a negative lookahead assertion: FIND =
<(?!h1|h2|p)[^>]+>
That says: “look for <, look ahead and make sure it isn’t followed by h1, h2, or p, then consume one or more non-> characters up to the first > found.”
-
Thank you for the reply.
This code now replaces all HTML tags except the h1, h2, and p tags:
<(?!h1|h2|p)[^>]+>
But I notice that it also replaces the closing </h1>, </h2>, and </p> tags.
I tried the patterns below to keep those closing tags as well, but they failed:
<(?!h1|/h2|h2|/h2||p|/p)[^>]+>
or this:
<(?!h1|\h2|h2|\h2||p|\p)[^>]+>
Would you advise how to keep the trailing tags?
-
I found that this pattern does the job:
</?(?!a)(?!p)(?!ul)(?!li)(?!h)\w*\b[^>]*>
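Note that the pattern above keeps more than was asked for: any tag whose name starts with a, p, ul, li, or h survives (so <pre> and <html> are kept too). For anyone who wants only p, h1, and h2, and wants to run this outside an editor, here is a minimal Python sketch. It is only one possible approach, not the method used in this thread: the names KEEP, strip_tags, and clean_directory are made up for illustration, and it assumes the files are UTF-8 encoded and end in .txt. The optional slash sits inside the lookahead so that closing tags are protected as well, and \b stops p from also matching pre.

```python
import re
from pathlib import Path

# Match any tag unless its name (with an optional leading "/") is p, h1, or h2.
# Assumption: tags contain no literal ">" inside attribute values.
KEEP = re.compile(r"<(?!/?(?:p|h1|h2)\b)[^>]+>", re.IGNORECASE)

def strip_tags(html: str) -> str:
    """Remove every tag except opening and closing p, h1, and h2 tags."""
    return KEEP.sub("", html)

def clean_directory(folder: str) -> None:
    """Rewrite each .txt file in folder in place (back up the files first)."""
    for path in Path(folder).glob("*.txt"):
        text = path.read_text(encoding="utf-8")
        path.write_text(strip_tags(text), encoding="utf-8")

sample = '<div><h1 class="t">Title</h1><p>Hello <b>world</b></p><pre>x</pre></div>'
print(strip_tags(sample))
# prints: <h1 class="t">Title</h1><p>Hello world</p>x
```

In an editor's find-and-replace, the same pattern would go in the FIND field with an empty replacement, provided the editor's regex engine supports lookahead.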