Need help replacing HTML tags! Any help welcome.



  • I need help with replacing and removing a series of HTML tags in 500+ Excel cells.

    • There are HTML tags that I need to replace with new tag names and there are tags that I would like to remove altogether.
    • The text I need to format is case sensitive, and in some cases, the tags contain capitalization. I need to replace all capitalized tags to lower-case lettering.
    • Lastly, some tags have additional formatting included within the tag, but I would like to simplify them to their most basic tag naming convention. An asterisk is noted in some tags to indicate wildcard HTML tags where I would like to replace all variations of that tag that start the way I indicated.

    There are variations of the following tags but here are just a few examples:

    Find and Replace:
    Find <DIV style…> and replace with <DIV>
    Find <STRONG> and replace with <b>
    Find <SPAN lang*> and replace with <span>

    Remove Altogether:
    <FONT Style…> and </FONT>
    <tbody*> and </tbody>



  • Hello, @carolin-marschke,

    Reading your post carefully, I noticed a small contradiction ! You said :

    I need to replace all capitalized tags to lower-case lettering.

    And, further on, the example :

    Find <DIV style…> and replace with <DIV>

    So, I suppose that the replacement should be <div> !

    Anyway, for all your text manipulations, here are some appropriate regexes to run, after opening the Replace dialog and checking the Regular expression search mode :


    • For changes such as :

    Find <DIV style…> and replace with <DIV>
    Find <SPAN lang*> and replace with <span>

    Use the regex S/R : SEARCH = (?i-s)<(SPAN|DIV).*?> and REPLACE = <\L\1>

    Notes :

    • The search is case-insensitive ( i ) and the dot matches standard characters only, not EOL ones ( -s )

    • It looks for the tag SPAN or DIV, in any case, stored as group 1 ( <(SPAN|DIV) ), possibly followed by some characters, up to the nearest closing > character ( .*?> ), in the same line

    • In replacement, it just rewrites the group 1 tag ( \1 ), in lower case, due to the preceding syntax \L
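For anyone scripting this outside Notepad++, the same idea can be sketched in Python. This is only a rough translation under stated assumptions: Python's re module has no \L replacement syntax, so a replacement function lowercases group 1 instead, and the flags are passed as arguments rather than written as (?i-s). The name simplify_tags is hypothetical.

```python
import re

# Same search as above: "<SPAN" or "<DIV" in any case, then the shortest
# run of characters up to the next ">". In Python, "." does not match a
# line feed unless re.DOTALL is set, which plays the role of (?-s).
TAG = re.compile(r"<(SPAN|DIV).*?>", re.IGNORECASE)

def simplify_tags(text: str) -> str:
    """Rewrite each opening span/div tag as the bare, lowercase tag."""
    return TAG.sub(lambda m: "<" + m.group(1).lower() + ">", text)

if __name__ == "__main__":
    sample = '<DIV style="color:red">Hi <SPAN lang=EN-US>there</SPAN></DIV>'
    print(simplify_tags(sample))
```

Closing tags such as </SPAN> are left untouched, because the pattern requires the tag name immediately after the < character.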


    • For changes such as :

    Find <STRONG> and replace with <b>

    Use, for instance, the regex S/R : SEARCH = (?i)(<strong>)|(<em>) and REPLACE = (?1<b>)(?2<i>)

    Notes :

    • Simple replacement of one string by another ! The general syntax is :

    SEARCH = (?i)(Word1)|(Word2)|(Word3)........|(Wordn)

    REPLACE = (?1subst1)(?2subst2)(?3subst3)........(?nsubstn)
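The conditional replacement syntax (?1...)(?2...) is specific to the Notepad++ regex engine. As a hedged sketch, the same "one substitute per group" idea could be written in Python with a lookup table keyed by group number. The names SUBSTITUTES and swap_tags are hypothetical, and the table only covers the two tags from the example.

```python
import re

# One capturing group per alternative, as in (Word1)|(Word2)|...
PATTERN = re.compile(r"(<strong>)|(<em>)", re.IGNORECASE)

# Group number -> replacement text, mirroring (?1<b>)(?2<i>).
SUBSTITUTES = {1: "<b>", 2: "<i>"}

def swap_tags(text: str) -> str:
    """Replace each match with the substitute of the group that matched."""
    # m.lastindex is the number of the group that took part in the match,
    # since exactly one alternative (and so one group) can match at a time.
    return PATTERN.sub(lambda m: SUBSTITUTES[m.lastindex], text)

if __name__ == "__main__":
    print(swap_tags("<STRONG>bold</STRONG> and <em>italic</em>"))
```

Only the opening tags are handled here, as in the regex above. The closing tags would need their own alternatives, for example (</strong>) and (</em>), with matching entries in the table.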


    • For suppressions such as :

    <FONT Style…> and </FONT>
    <tbody*> and </tbody>

    Use the regex S/R : SEARCH = (?is)<(FONT|tbody).+?</\1>\R? and REPLACE = EMPTY

    Notes :

    • The search is case-insensitive ( i ) and the dot matches any character ( standard or EOL ones ) ( s )

    • It looks for the tags FONT or tbody, in any case, stored as group 1 ( <(FONT|tbody) ), up to the nearest closing tag IDENTICAL to group 1 ( </\1> ), possibly followed by an EOL sequence ( \R? )

    • As the replacement zone is empty, the block <Tag...>.......</Tag>, which most of the time spans several lines, is simply deleted
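A rough Python equivalent of this deletion, for illustration only, is sketched below. It assumes the re module, where \R does not exist, so the optional line break is approximated with (?:\r\n|\n|\r)?. The name remove_blocks is hypothetical.

```python
import re

# re.DOTALL lets "." cross line breaks, like the (?s) modifier, and \1
# refers back to whichever tag name was captured by group 1.
BLOCK = re.compile(
    r"<(FONT|tbody).+?</\1>(?:\r\n|\n|\r)?",
    re.IGNORECASE | re.DOTALL,
)

def remove_blocks(text: str) -> str:
    """Delete every <FONT ...>...</FONT> or <tbody ...>...</tbody> block."""
    return BLOCK.sub("", text)

if __name__ == "__main__":
    sample = 'before<FONT Style="bold">\nsome text\n</FONT>\nafter'
    print(repr(remove_blocks(sample)))
```

Note that this removes everything between the opening and the closing tag, including the enclosed text, not just the two tags themselves. It is worth trying on a copy of the data first.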


    Hoping that this first attempt will be useful to you !

    Best Regards,

    guy038



  • Hello @guy038 ,

    Thank you very much for your response. Also, you are correct in catching my mistake. Thanks for pointing that out. I will use your suggestions and get back to you shortly to let you know if those worked for me.

    Best,

    Carolin



  • @guy038 ,

    Hello! Thank you again for your help. I am now stumbling upon an issue with the following tag (any possible variations of it after the first few characters):

    <?xml *>

    There may be variations of strings that begin with this lettering. I tried to replace these by the following search function:

    (?i)<?xml[^>]*>

    But once replaced, the following remained in my script:
    <?

    How do you recommend I remove all instances of " <? " ?



  • Hi, @carolin-marschke,

    I suppose that you’re speaking about the tag below, which generally starts an XML file !

    <?xml version="1.0" encoding="UTF-8" ?>

    Your troubles come from the question mark character ( ? ), which is considered, in the regex universe, as a special character, also called a meta-character. So, by default, these characters are NOT simple literal characters ! In your regex, the syntax <? means "an optional < character", so the match can only begin at the string xml and the two characters <? are left behind.


    There are two different sets of meta-characters :

    • Those, which are recognized anywhere in the pattern, EXCEPT within square brackets :

      • \ General escape character, with several uses

      • ^ Start of a line ( or a file )

      • $ End of a line ( or a file )

      • . Match any character, except new-line ones ( by default )

      • [ Start a class definition or a class range

      • | Start of an alternative branch

      • ( Start of a sub-pattern group

      • ) End of a sub-pattern group

      • { Start a Min / Max range of a quantifier

      • } End a Min / Max range of a quantifier

      • * 0 to more times, the preceding character or group

      • +

        • 1 to more times, the preceding character or group

        • Possessive behaviour of the quantifiers *, + and ?

      • ?

        • 0 or 1 time, the preceding character or group

        • Meaning extender, for groups or conditions, (....)

        • Minimizer of the quantifiers *, + and ?

    • Those, which are recognized within square brackets ( character class ), EXCLUSIVELY :

      • \ General escape character

      • ^ Negate the class, if first character of the class

      • - Character range indicator

      • [: Start of a POSIX character class, if followed by regular POSIX syntax

      • :] End of a POSIX character class

    So, Carolin, if you need to search for any of the above characters, as a literal, you must escape it with the backslash character \

    Therefore, your regex must be rewritten as : (?i)<\?xml[^>]*>, with a \, right before the ? character !
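The effect of the missing backslash can be reproduced with a small Python sketch. It assumes Python's re module rather than the Notepad++ engine, but the two engines treat an unescaped ? the same way in this pattern. The sample text and variable names are made up for the demonstration.

```python
import re

xml_text = '<?xml version="1.0" encoding="UTF-8" ?>\n<root/>'

# Unescaped: "<?" means "an optional <", so the match starts at "xml".
unescaped = re.compile(r"<?xml[^>]*>", re.IGNORECASE)

# Escaped: "\?" is a literal question mark, so "<?xml" is matched whole.
escaped = re.compile(r"<\?xml[^>]*>", re.IGNORECASE)

leftover = unescaped.sub("", xml_text)  # the characters "<?" survive
cleaned = escaped.sub("", xml_text)     # the whole declaration is removed

if __name__ == "__main__":
    print(repr(leftover))
    print(repr(cleaned))
```

With the unescaped pattern the two characters <? stay in the text, which is the symptom described above. The escaped pattern removes the declaration completely.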

    However, the regex (?-is)<\?xml.+?> gives better results ! Indeed, due to the -s modifier, any dot will match standard characters only, and due to the -i modifier, the search is case-sensitive. So, after matching the literal string <?xml, with that exact case ( <\?xml ), it looks for the shortest non-null range of characters ( .+? ), of the current line, up to a closing symbol >

    Assume that the single XML line <?xml version="1.0" encoding="UTF-8" ?> were split into four lines, as below :

    <?xml ver
    sion="1.0" enc
    oding="UT
    F-8" ?>
    

    My regex doesn’t match this incorrect text, unlike your regex ! To get the same behaviour, you should change your regex to (?i)<\?xml[^>\r\n]*>. This time, the syntax [^>\r\n] matches any character different from the > character AND also different from any EOL character :-))
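The difference between the three patterns can be checked with a short Python sketch. It assumes the re module and \n line endings. In Python the dot excludes line feeds by default, which stands in for the (?-s) modifier. The variable names are hypothetical.

```python
import re

one_line = '<?xml version="1.0" encoding="UTF-8" ?>'
split_text = '<?xml ver\nsion="1.0" enc\noding="UT\nF-8" ?>'

patterns = {
    "negated class": re.compile(r"<\?xml[^>]*>", re.IGNORECASE),
    "lazy dot": re.compile(r"<\?xml.+?>"),
    "class without EOL": re.compile(r"<\?xml[^>\r\n]*>", re.IGNORECASE),
}

# For each pattern, record whether it matches the intact line and the
# text that was wrongly split over four lines.
results = {
    name: (bool(rx.search(one_line)), bool(rx.search(split_text)))
    for name, rx in patterns.items()
}

if __name__ == "__main__":
    for name, (on_one_line, on_split) in results.items():
        print(f"{name}: one line={on_one_line}, split={on_split}")
```

All three patterns match the intact declaration. Only the first one also matches the split text, because [^>] accepts line-break characters.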

    Cheers,

    guy038