Need help replacing HTML tags! Any help welcome.

Girlswhocode312

I need help with replacing and removing a series of HTML tags for 500+ Excel cells.

There are HTML tags that I need to replace with new tag names and there are tags that I would like to remove altogether.
The text I need to format is case sensitive, and in some cases, the tags contain capitalization. I need to replace all capitalized tags to lower-case lettering.
Lastly, some tags have additional formatting included within the tag, but I would like to simplify them to their most basic tag naming convention. An asterisk is noted in some tags to indicate wildcard HTML tags where I would like to replace all variations of that tag that start the way I indicated.

There are variations of the following tags but here are just a few examples:

Find and Replace:
Find <DIV style…> and replace with <DIV>
Find <STRONG> and replace with <b>
Find <SPAN lang*> and replace with <span>

Remove Altogether:
<FONT Style…> and </FONT>
<tbody*> and </tbody>

guy038

Hello, @carolin-marschke,

Reading your post carefully, I noticed a small contradiction ! You said :

I need to replace all capitalized tags to lower-case lettering.

And, further on, the example :

Find <DIV style…> and replace with <DIV>

So, I suppose that the replacement should be <div> !

Anyway, for all your text manipulations, here are some appropriate regexes to run, after opening the Replace dialog and checking the Regular expression search mode :

For changes such as :

Find <DIV style…> and replace with <DIV>
Find <SPAN lang*> and replace with <span>

Use the regex S/R : SEARCH = (?i-s)<(SPAN|DIV).*?> and REPLACE = <\L\1>

Notes :

Search is case-insensitive ( i ) and the dot matches standard characters only, not EOL ones ( -s )
It looks for the tag SPAN or DIV, in any case, stored as group 1 ( <(SPAN|DIV) ), possibly followed by some characters, up to the nearest closing > character ( .*?> ), on the same line
In replacement, it just rewrites the group 1 tag ( \1 ), in lower case, due to the preceding \L syntax
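
For anyone who prefers to script this outside Notepad++, here is a minimal Python sketch of the same idea. It is an approximation, not the Notepad++ engine: Python's re module has no \L operator in replacements, so a small function does the lowercasing instead. The names TAG_RE, simplify_tags and sample are made up for illustration.

```python
import re

# Same search as above; Python's dot already stops at "\n" by default,
# which plays the role of the -s modifier here.
TAG_RE = re.compile(r"<(SPAN|DIV).*?>", re.IGNORECASE)

def simplify_tags(text: str) -> str:
    """Reduce <SPAN ...> / <DIV ...> opening tags to bare lower-case tags."""
    # No \L in Python replacements, so lowercase group 1 in a function.
    return TAG_RE.sub(lambda m: "<" + m.group(1).lower() + ">", text)

sample = '<DIV style="color:red">Hi <SPAN lang=EN-US>there</SPAN></DIV>'
print(simplify_tags(sample))  # <div>Hi <span>there</SPAN></DIV>
```

Note that, exactly like the regex above, this only touches the opening tags: the closing </SPAN> and </DIV> start with </ and so are left as they are.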

For changes such as :

Find <STRONG> and replace with <b>

Use, for instance, the regex S/R : SEARCH = (?i)(<strong>)|(<em>) and REPLACE = (?1<b>)(?2<i>)

Notes :

Simple change of one string to another ! The general syntax is :

SEARCH = (?i)(Word1)|(Word2)|(Word3)........|(Wordn)

REPLACE = (?1subst1)(?2subst2)(?3subst3)........(?nsubstn)
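
The conditional replacement syntax (?1...)(?2...) is not available in Python's re module, but the same "several words, several substitutions" idea can be sketched with a dictionary lookup. This is only an illustrative equivalent; SUBSTITUTIONS, PATTERN and replace_tags are hypothetical names, and the mapping holds just the two pairs from the example above.

```python
import re

# Hypothetical mapping for illustration: search word -> substitution.
SUBSTITUTIONS = {"<strong>": "<b>", "<em>": "<i>"}

# Build Word1|Word2|... with every word escaped so it is taken literally.
PATTERN = re.compile(
    "|".join(re.escape(word) for word in SUBSTITUTIONS),
    re.IGNORECASE,
)

def replace_tags(text: str) -> str:
    """Replace each matched word by its substitution, ignoring case."""
    return PATTERN.sub(lambda m: SUBSTITUTIONS[m.group(0).lower()], text)

print(replace_tags("<STRONG>bold</STRONG> and <em>italic</em>"))
# <b>bold</STRONG> and <i>italic</em>
```

As with the regex above, the closing tags are not touched unless pairs such as "</strong>" are added to the mapping as well.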

For deletions such as :

<FONT Style…> and </FONT>
<tbody*> and </tbody>

Use the regex S/R : SEARCH = (?is)<(FONT|tbody).+?</\1>\R? and REPLACE = EMPTY

Notes :

Search is case-insensitive ( i ) and the dot matches any character, standard or EOL ( s )
It looks for the tags FONT or tbody, in any case, stored as group 1 ( <(FONT|tbody) ), up to the nearest closing tag IDENTICAL to group 1 ( </\1> ), possibly followed by an EOL sequence ( \R? )
As the replacement zone is empty, the whole block <Tag...>.......</Tag>, which most of the time spans several lines, is simply deleted
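
A rough Python counterpart of this deletion, for comparison. Python's re module does not know \R, so the sketch spells out the usual line endings by hand; that substitution, and the names BLOCK_RE, remove_blocks and sample, are assumptions made for the example.

```python
import re

# (?is) becomes the IGNORECASE and DOTALL flags; \R? is approximated by an
# optional "\r\n", "\n" or "\r", because Python's re has no \R.
BLOCK_RE = re.compile(
    r"<(FONT|tbody).+?</\1>(?:\r\n|\n|\r)?",
    re.IGNORECASE | re.DOTALL,
)

def remove_blocks(text: str) -> str:
    """Delete each <FONT ...>...</FONT> or <tbody ...>...</tbody> block."""
    return BLOCK_RE.sub("", text)

sample = 'keep\n<FONT Style="x">drop\nthis</FONT>\nkeep too\n'
print(repr(remove_blocks(sample)))  # 'keep\nkeep too\n'
```

Keep in mind that, as stated in the notes above, the text between the opening and closing tag is deleted together with the tags.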

Hoping that this first attempt will be useful to you !

Best Regards,

guy038

Girlswhocode312

Hello @guy038 ,

Thank you very much for your response. Also, you are correct in catching my mistake. Thanks for pointing that out. I will use your suggestions and get back to you shortly to let you know if those worked for me.

Best,

Carolin

Girlswhocode312

@guy038 ,

Hello! Thank you again for your help. I am now stumbling upon an issue with the following tag (any possible variations of it after the first few characters):

<?xml *>

There may be variations of strings that begin with this lettering. I tried to replace these by the following search function:

(?i)<?xml[^>]*>

But once replaced, the following remained in my script:
<?

How do you recommend I remove all instances of " <? " ?

guy038

Hi, @carolin-marschke,

I suppose that you’re speaking about the tag below, which generally starts an XML file !

<?xml version="1.0" encoding="UTF-8" ?>

Your troubles come from the Question mark character ( ? ), which is considered, in the regex universe, as a special character, also called a meta-character. So, by default, these characters are NOT simple literal characters !

There are two different sets of meta-characters :

Those, which are recognized anywhere in the pattern, EXCEPT within square brackets :
- \ General escape character, with several uses
- ^ Start of a line ( or a file )
- $ End of a line ( or a file )
- . Match any character, except new-line ones ( by default )
- [ Start a class definition or a class range
- | Start of an alternative branch
- ( Start of a sub-pattern group
- ) End of a sub-pattern group
- { Start a Min / Max range of a quantifier
- } End a Min / Max range of a quantifier
- * 0 or more times, the preceding character or group
- +
  - 1 or more times, the preceding character or group
  - Possessive behaviour of the quantifiers *, + and ?
- ?
  - 0 or 1 time, the preceding character or group
  - Meaning extender, for groups or conditions, right after an opening parenthesis ( (?....) )
  - Minimizer of the quantifiers *, + and ?

Those, which are recognized within square brackets ( character class ), EXCLUSIVELY :
- \ General escape character
- ^ Negate the class, if first character of the class
- - Character range indicator
- [: Start of a POSIX character class, if followed by regular POSIX syntax
- :] End of a POSIX character class

So, Carolin, if you need to search for any of the above characters, as a literal, you must escape it with the backslash character \

Therefore, your regex must be rewritten as : (?i)<\?xml[^>]*>, with a \, right before the ? character !
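
The effect of the missing backslash can be reproduced in a few lines of Python, whose re module treats ? the same way in this case. This is just a sketch to show the mechanism; the variable names are made up, and re.escape is Python's helper for escaping a literal string, not something Notepad++ offers.

```python
import re

line = '<?xml version="1.0" encoding="UTF-8" ?>'

# Without the backslash, "<?" means "an optional <", so the match can only
# start at the "x" of "xml" and the two characters "<?" are left behind.
unescaped = re.compile(r"<?xml[^>]*>", re.IGNORECASE)
print(repr(unescaped.sub("", line)))  # '<?'

# With the backslash, "?" is a literal question mark and the whole tag goes.
escaped = re.compile(r"<\?xml[^>]*>", re.IGNORECASE)
print(repr(escaped.sub("", line)))  # ''

# re.escape adds the needed backslashes to a literal string automatically.
literal_start = re.compile(re.escape("<?xml"))
print(literal_start.match(line) is not None)  # True
```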

However, the regex (?-is)<\?xml.+?> gives better results ! Indeed, due to the -i modifier, the search is case-sensitive and, due to the -s modifier, the dot matches standard characters only. So, after matching the literal string <?xml, with that exact case ( <\?xml ), it looks for the shortest non-empty range of characters ( .+? ), on the current line, up to a closing > symbol

Suppose that the single XML line <?xml version="1.0" encoding="UTF-8" ?> were split into four lines, as below :

<?xml ver
sion="1.0" enc
oding="UT
F-8" ?>

My regex doesn’t match this incorrect text, unlike your regex ! To get the same behaviour, you should change your regex to (?i)<\?xml[^>\r\n]*>. This time, the syntax [^>\r\n] matches any character different from > AND also different from any EOL character :-))
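
The difference between the three regexes can be checked with a short Python sketch. It is only an approximation of the Notepad++ behaviour: Python's dot excludes just "\n" by default, so the broken sample below uses "\n" line ends, and all names are hypothetical.

```python
import re

single = '<?xml version="1.0" encoding="UTF-8" ?>'
broken = '<?xml ver\nsion="1.0" enc\noding="UT\nF-8" ?>'

# Dot version: the dot does not cross "\n", so the match stays on one line.
dot_version = re.compile(r"<\?xml.+?>")
# Negated class: [^>] also matches line ends, so it runs across lines.
class_version = re.compile(r"<\?xml[^>]*>", re.IGNORECASE)
# Negated class that also excludes "\r" and "\n": one line only again.
bounded_version = re.compile(r"<\?xml[^>\r\n]*>", re.IGNORECASE)

for name, regex in [("dot", dot_version),
                    ("class", class_version),
                    ("bounded", bounded_version)]:
    on_single = regex.search(single) is not None
    on_broken = regex.search(broken) is not None
    print(name, on_single, on_broken)
```

All three match the single-line tag; only the plain [^>] version also matches the tag that was split over four lines.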

Cheers,

guy038