How to remove all HTML tags except <p>, <h1>, and <h2> tags?

  • I have several articles in txt files under a directory.

    The articles’ HTML is somehow messed up.

    I wish to remove all HTML tags except the <p>, <h1>, and <h2> tags.

    The following code removes all HTML tags.

    How can I add an exception,
    so that any p, h1, or h2 tags are kept?

    Thank you in advance for sharing your regex knowledge!

  • @NZ-Select,

    I would recommend a negative lookahead assertion: FIND = <(?!h1|h2|p)[^>]+>. That says: “look for <, look ahead and make sure the next characters aren’t h1, h2, or p, then consume one or more non-> characters up to the first > found.”
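
To see what the suggested pattern matches, here is a minimal sketch in Python. It assumes Python's `re` module treats this particular pattern the same way as the Notepad++ search engine; the names `PATTERN`, `strip_tags`, and `sample` are made up for the illustration and are not from the thread.

```python
import re

# Pattern quoted from the post: "<", then a negative lookahead that rejects
# h1, h2 or p as the next characters, then everything up to the first ">".
PATTERN = re.compile(r"<(?!h1|h2|p)[^>]+>")


def strip_tags(html: str) -> str:
    """Delete every tag the pattern matches (replace with nothing)."""
    return PATTERN.sub("", html)


# Hypothetical sample input, only for demonstration.
sample = "<div><h1>Title<p>Text <b>bold</b><span>more</span></div>"
result = strip_tags(sample)
print(result)  # <h1>Title<p>Text boldmore
```

The lookahead only inspects the characters right after `<`, so it decides per tag whether the match may proceed; it does not consume anything itself.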

  • @PeterJones

    Thank you for the reply.

    This code now replaces all HTML tags except the h1, h2, and p tags.

    But I notice that it also replaces the closing </h1>, </h2>, and </p> tags.
    I tried the following to keep those closing tags, but it failed.
    or this

    Would you advise how to keep the trailing tags?
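
The closing tags are removed because the character right after `<` is `/`, so the lookahead for h1, h2, or p never applies to them. One possible way around this, shown here as a hedged sketch and not as the code anyone in this thread posted, is to allow an optional `/` inside the lookahead. The `\b` word boundary is an extra assumption added so that tags such as `<pre>` are not mistaken for `<p>`. The names `KEEP_PATTERN`, `strip_other_tags`, and `sample` are hypothetical, and the example uses Python's `re` module, which is assumed to behave like the Notepad++ engine for this pattern.

```python
import re

# Illustrative pattern (an assumption, not taken from the thread):
# "/?" lets the lookahead also protect closing tags, and "\b" makes sure
# the tag name ends there, so <pre> or <param> are not treated as <p>.
KEEP_PATTERN = re.compile(r"<(?!/?(?:h1|h2|p)\b)[^>]+>")


def strip_other_tags(html: str) -> str:
    """Remove every tag except opening and closing h1, h2 and p tags."""
    return KEEP_PATTERN.sub("", html)


# Hypothetical sample input, only for demonstration.
sample = "<div><h1>Title</h1><p>Text <b>bold</b></p><pre>x</pre></div>"
result = strip_other_tags(sample)
print(result)  # <h1>Title</h1><p>Text bold</p>x
```

A tag with attributes, such as `<p class="x">`, is also kept, because the space after `p` still counts as a word boundary.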

  • I found that this code does the job.

