How to remove all HTML tags except <p> or <h1> <h2> tags?

NZ Select

I have several articles in txt files under a directory.

The articles’ html code is somehow messed up.

I want to remove all HTML tags except the <p>, <h1>, and <h2> tags.

The following regex removes all HTML tags:
<[^>]+>

How do I add an exception, so that any p, h1, or h2 tags are kept?

Thank you in advance for your sharing of RegEx knowledge!

PeterJones

@NZ-Select ,

I would recommend a negative lookahead assertion: FIND = <(?!h1|h2|p)[^>]+>. That says: “look for <, look ahead and make sure the next characters aren’t h1, h2, or p, then consume one or more non-> characters up to the first > found.”
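
For anyone who wants to try the lookahead outside Notepad++, here is a minimal sketch in Python. It assumes Python's re module treats these constructs the same way Notepad++ does, and the sample string is made up purely for illustration. Note that a closing tag starts with "/" right after the "<", so the lookahead does not see h1, h2, or p there and the closing tag is matched as well.

```python
import re

# Pattern from the post: match any tag that does not start with h1, h2, or p.
pattern = re.compile(r"<(?!h1|h2|p)[^>]+>")

# Hypothetical sample text, not taken from the original articles.
sample = "<div><h1>Title</h1><p>Some <b>bold</b> text</p></div>"

# Replace every match with nothing, like an empty "Replace with" box.
result = pattern.sub("", sample)
print(result)  # <h1>Title<p>Some bold text
```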

NZ Select

@PeterJones

Thank you for the reply.

This regex now replaces all HTML tags except the h1, h2, and p tags:
<(?!h1|h2|p)[^>]+>

But I noticed that it also replaces the closing </h1>, </h2>, and </p> tags.
I tried the patterns below to keep those closing tags, but they failed:
<(?!h1|/h2|h2|/h2||p|/p)[^>]+>
or this
<(?!h1|\h2|h2|\h2||p|\p)[^>]+>

Could you advise how to keep the closing tags?

NZ Select

I found that this regex does the job:
</?(?!a)(?!p)(?!ul)(?!li)(?!h)\w*\b[^>]*>
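
A quick way to check what this keeps is a small Python sketch, again assuming Python's re module behaves like Notepad++ for these constructs. The pattern above keeps every tag whose name starts with a, p, ul, li, or h (so <pre> survives too). The second pattern, strict, is a hypothetical variant that is not from this thread: it puts the optional slash inside the lookahead and adds \b so that only p, h1, and h2 are kept. The sample string is made up.

```python
import re

# Pattern from the post above: keeps tags whose names start with a, p, ul, li, or h.
posted = re.compile(r"</?(?!a)(?!p)(?!ul)(?!li)(?!h)\w*\b[^>]*>")

# Hypothetical stricter variant (an assumption, not from the thread):
# the optional "/" sits inside the lookahead, and \b stops "p" from matching "pre".
strict = re.compile(r"<(?!/?(?:p|h1|h2)\b)[^>]+>", re.IGNORECASE)

# Hypothetical sample text.
sample = '<div><h1>Title</h1><pre>x</pre><p class="a">Some <b>bold</b> text</p></div>'

posted_result = posted.sub("", sample)
strict_result = strict.sub("", sample)

print(posted_result)  # <h1>Title</h1><pre>x</pre><p class="a">Some bold text</p>
print(strict_result)  # <h1>Title</h1>x<p class="a">Some bold text</p>
```

If the stricter behaviour is wanted in Notepad++, the same expression, <(?!/?(?:p|h1|h2)\b)[^>]+>, should be usable in the Find box with Regular expression mode selected, but test it on a copy of the files first.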