How to remove all HTML tags except <p> or <h1> <h2> tags?



  • I have several articles in txt files under a directory.

    The articles’ HTML is somewhat messed up.

    I wish to remove all HTML tags except the <p>, <h1>, and <h2> tags.

    The following regex removes all HTML tags:
    <[^>]+>

    How do I add an exception,
    so that any p, h1, or h2 tags are kept?

    Thank you in advance for your sharing of RegEx knowledge!



  • @NZ-Select ,

    I would recommend a negative lookahead assertion: FIND = <(?!h1|h2|p)[^>]+>. That says, “look for <, look ahead to make sure it isn’t followed by h1, h2, or p, then consume one or more non-> characters up to the first > found.”
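
For anyone who wants to try the lookahead outside the editor, here is a minimal sketch using Python's `re` module. The sample string and the variable names are made up for illustration; only the pattern comes from the post above. Note what happens to the closing tags: the character after `<` is `/`, so the lookahead does not protect them.

```python
import re

# Pattern from the post above: match any tag unless "<" is
# immediately followed by h1, h2 or p.
STRIP_EXCEPT_OPENING = re.compile(r"<(?!h1|h2|p)[^>]+>")

# Hypothetical sample input, not taken from the original articles.
sample = '<div class="x"><h1>Title</h1><p>Some <b>bold</b> text</p></div>'

# Replace every matching tag with nothing.
result = STRIP_EXCEPT_OPENING.sub("", sample)
print(result)  # <h1>Title<p>Some bold text
```

The opening `<h1>` and `<p>` survive, while `</h1>` and `</p>` are removed along with the other tags.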



  • @PeterJones

    Thank you for the reply.

    This code now replaces all HTML tags except the h1, h2, and p tags:
    <(?!h1|h2|p)[^>]+>

    But I notice that it also replaces the closing </h1>, </h2>, and </p> tags.
    I tried the patterns below to keep those closing tags, but they failed:
    <(?!h1|/h2|h2|/h2||p|/p)[^>]+>
    or this
    <(?!h1|\h2|h2|\h2||p|\p)[^>]+>

    Would you advise how to keep the closing tags?



  • I found that this code does the job:
    </?(?!a)(?!p)(?!ul)(?!li)(?!h)\w*\b[^>]*>
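
Note that the pattern above keeps more than the question asked for: it also spares a, ul, and li, plus any tag whose name merely starts with a, p, ul, li, or h (pre, article, hr, and so on). If only p, h1, and h2 should survive, one possible variant puts an optional slash and a word boundary inside the lookahead. The sketch below is an assumption on my part rather than something posted in this thread, tested here with Python's `re` module on made-up sample strings.

```python
import re

# Hypothetical variant: skip a tag only when "<" is followed by an
# optional "/" and exactly the name p, h1 or h2 (\b ends the name).
KEEP_P_H1_H2 = re.compile(r"<(?!/?(?:p|h1|h2)\b)[^>]+>", re.IGNORECASE)

# Made-up sample inputs for illustration.
sample = '<div class="x"><h1>Title</h1><p>Some <b>bold</b> text</p></div>'
sample_pre = '<pre>code</pre><h2 id="a">Sub</h2>'

cleaned = KEEP_P_H1_H2.sub("", sample)
cleaned_pre = KEEP_P_H1_H2.sub("", sample_pre)

print(cleaned)      # <h1>Title</h1><p>Some bold text</p>
print(cleaned_pre)  # code<h2 id="a">Sub</h2>
```

The same pattern text, `<(?!/?(?:p|h1|h2)\b)[^>]+>`, uses only lookahead, grouping, and `\b`, so it should carry over to the editor's regex search, but try it on a copy of the files first.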

