Need help replacing HTML tags! Any help welcome.
-
I need help with replacing and removing a series of HTML tags for 500+ excel cells.
- There are HTML tags that I need to replace with new tag names and there are tags that I would like to remove altogether.
- The text I need to format is case sensitive, and in some cases, the tags contain capitalization. I need to replace all capitalized tags to lower-case lettering.
- Lastly, some tags have additional formatting included within the tag, but I would like to simplify them to their most basic tag naming convention. An asterisk is noted in some tags to indicate wildcard HTML tags where I would like to replace all variations of that tag that start the way I indicated.
There are variations of the following tags but here are just a few examples:
Find and Replace:
Find <DIV style…> and replace with <DIV>
Find <STRONG> and replace with <b>
Find <SPAN lang*> and replace with <span>
Remove Altogether:
<FONT Style…> and </FONT>
<tbody*> and </tbody>
-
Hello, @carolin-marschke,
Reading your post carefully, I noticed a small contradiction! You said:
I need to replace all capitalized tags to lower-case lettering.
And, further on, the example :
Find <DIV style…> and replace with <DIV>
So, I suppose that the replacement should be <div>!
Anyway, for all your text manipulations, here are some appropriate regexes to run, after opening the Replace dialog and checking the Regular expression search mode:
- For changes, as :
Find <DIV style…> and replace with <DIV>
Find <SPAN lang*> and replace with <span>
Use the regex S/R:
SEARCH = `(?i-s)<(SPAN|DIV).*?>`
REPLACE = `<\L\1>`
Notes:
- The search is case-insensitive (`i`) and the dot matches standard characters only, not EOL ones (`-s`)
- It looks for the tag SPAN or DIV, in any case, stored as group 1 (`<(SPAN|DIV)`), possibly followed by some characters, up to the nearest closing `>` (`.*?>`), on the same line
- In the replacement, it just rewrites the group 1 tag (`\1`), in lower case, due to the preceding `\L` syntax
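For anyone who would rather script this than use the Replace dialog, the same idea can be sketched in Python. This is only an illustration, assuming Python's standard `re` module: it has no `\L` replacement operator, so a callback lowercases group 1 instead. The names `TAG`, `simplify_tags` and `sample` are made up for the example.

```python
import re

# Opening <span ...> or <div ...> tag, in any letter case.
# By default the dot does not match a line feed, which roughly mirrors (?-s).
TAG = re.compile(r"<(SPAN|DIV).*?>", re.IGNORECASE)

def simplify_tags(text):
    # Rewrite each match as the bare tag name in lower case (the role of \L\1).
    return TAG.sub(lambda m: "<" + m.group(1).lower() + ">", text)

sample = '<DIV style="color:red">Hello <SPAN lang="en-US">world</SPAN></DIV>'
print(simplify_tags(sample))  # <div>Hello <span>world</SPAN></DIV>
```

Note that closing tags such as `</SPAN>` are left alone, because the pattern only looks at tags that start with `<SPAN` or `<DIV`.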
- For changes, as :
Find <STRONG> and replace with <b>
Use, for instance, the regex S/R:
SEARCH = `(?i)(<strong>)|(<em>)`
REPLACE = `(?1<b>)(?2<i>)`
Notes:
- This is a simple change of one string into another. The general syntax is:
SEARCH = `(?i)(Word1)|(Word2)|(Word3)........|(Wordn)`
REPLACE = `(?1subst1)(?2subst2)(?3subst3)........(?nsubstn)`
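The "several words, several substitutions" scheme can also be sketched in Python. Python's `re` module does not understand the conditional `(?1...)` replacement syntax, so this sketch swaps in a plain dictionary lookup inside a callback. The mapping `SUBSTITUTIONS` and the function name `swap_tags` are hypothetical, chosen only for the example.

```python
import re

# Hypothetical mapping for illustration; keys are written in lower case.
SUBSTITUTIONS = {"<strong>": "<b>", "<em>": "<i>"}

# One alternation of all the keys, searched case-insensitively like (?i).
PATTERN = re.compile("|".join(re.escape(k) for k in SUBSTITUTIONS), re.IGNORECASE)

def swap_tags(text):
    # Look up the replacement for whichever alternative matched.
    return PATTERN.sub(lambda m: SUBSTITUTIONS[m.group(0).lower()], text)

print(swap_tags("<STRONG>bold</STRONG> and <em>italic</em>"))
# <b>bold</STRONG> and <i>italic</em>
```

As with the regex it mirrors, only the strings listed are changed, so closing tags would need their own entries (for example `"</strong>": "</b>"`) if they should be converted too.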
- For suppressions, as:
<FONT Style…> and </FONT>
<tbody*> and </tbody>
Use the regex S/R:
SEARCH = `(?is)<(FONT|tbody).+?</\1>\R?`
REPLACE = EMPTY
Notes:
- The search is case-insensitive (`i`) and the dot matches any character, standard or EOL (`s`)
- It looks for the tags FONT or tbody, in any case, stored as group 1 (`<(FONT|tbody)`), up to the nearest closing tag IDENTICAL to group 1 (`</\1>`), possibly followed by an EOL sequence (`\R?`)
- As the replacement zone is empty, the block `<Tag...>.......</Tag>`, most of the time multi-line, is simply deleted
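A rough Python equivalent of this block deletion, for illustration only. It assumes the standard `re` module, which has no `\R`, so the optional line break is spelled out as `(?:\r\n|\n|\r)?`. The names `BLOCK`, `drop_blocks` and `sample` are invented for the sketch.

```python
import re

# DOTALL lets the dot cross line breaks, like (?s); IGNORECASE mirrors (?i).
# \1 requires the closing tag to repeat the name captured as group 1.
BLOCK = re.compile(r"<(FONT|tbody).+?</\1>(?:\r\n|\n|\r)?",
                   re.IGNORECASE | re.DOTALL)

def drop_blocks(text):
    # An empty replacement deletes the whole block, content included.
    return BLOCK.sub("", text)

sample = 'keep\n<FONT Style="x">gone\nalso gone</FONT>\nkeep too\n'
print(repr(drop_blocks(sample)))  # 'keep\nkeep too\n'
```

Keep in mind that everything between the opening and closing tag disappears along with the tags themselves, exactly as described in the last note.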
Hoping that this first attempt will be useful to you !
Best Regards,
guy038
-
Hello @guy038 ,
Thank you very much for your response. Also, you are correct in catching my mistake. Thanks for pointing that out. I will use your suggestions and get back to you shortly to let you know if those worked for me.
Best,
Carolin
-
@guy038 ,
Hello! Thank you again for your help. I am now stumbling upon an issue with the following tag (any possible variations of it after the first few characters):
<?xml *>
There may be variations of strings that begin with this lettering. I tried to replace these by the following search function:
(?i)<?xml[^>]*>
But once replaced, the following remained in my script:
<?
How do you recommend I remove all instances of "<?"?
-
Hi, @carolin-marschke,
I suppose that you're speaking about the tag below, which generally starts an XML file:
<?xml version="1.0" encoding="UTF-8" ?>
Your troubles come from the question mark character (`?`), which is considered, in the regex universe, as a special character, also called a meta-character. So, by default, these characters are NOT simple literal characters!
There are two different sets of meta-characters:
- Those which are recognized anywhere in the pattern, EXCEPT within square brackets:
  - `\` General escape character, with several uses
  - `^` Start of a line (or a file)
  - `$` End of a line (or a file)
  - `.` Match any character, except new-line ones (by default)
  - `[` Start a class definition or a class range
  - `|` Start of an alternative branch
  - `(` Start of a sub-pattern group
  - `)` End of a sub-pattern group
  - `{` Start a Min / Max range of a quantifier
  - `}` End a Min / Max range of a quantifier
  - `*` 0 or more times, the preceding character or group
  - `+`
    - 1 or more times, the preceding character or group
    - Possessive behaviour of the quantifiers `*`, `+` and `?`
  - `?`
    - 0 or 1 time, the preceding character or group
    - Meaning extender, for groups or conditions, `(?....)`
    - Minimizer of the quantifiers `*`, `+` and `?`
- Those which are recognized within square brackets (character class), EXCLUSIVELY:
  - `\` General escape character
  - `^` Negate the class, if first character of the class
  - `-` Character range indicator
  - `[:` Start of a POSIX character class, if followed by regular POSIX syntax
  - `:]` End of a POSIX character class
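To make the difference between the two sets concrete, here is a small sketch using Python's `re` module, picked only for illustration. The test strings are made up, and the point is simply that the same character changes role depending on whether it sits inside square brackets.

```python
import re

# Outside a class, ? is a quantifier: the "u" becomes optional.
optional_u = re.compile(r"colou?r")
print(bool(optional_u.fullmatch("color")), bool(optional_u.fullmatch("colour")))
# True True

# Escaped with a backslash, ? is a literal question mark.
print(re.findall(r"why\?", "why? why not"))  # ['why?']

# Inside a class, ? needs no escape, and ^ is literal when it is not first.
print(re.findall(r"[?^]", "a?b^c"))          # ['?', '^']

# As the first character of a class, ^ negates it.
print(re.findall(r"[^?^]", "a?b^c"))         # ['a', 'b', 'c']
```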
So, Carolin, if you need to search for any of the above characters as a literal, you must escape it with the backslash character `\`.
Therefore, your regex must be rewritten as:
`(?i)<\?xml[^>]*>`, with a `\` right before the `?` character!
However, the regex `(?-is)<\?xml.+?>` gives better results! Indeed, due to the `-s` modifier, any dot matches standard characters only and, due to the `-i` modifier, the search is case-sensitive. So, after matching the literal string <?xml, with that exact case (`<\?xml`), it looks for the shortest non-null range of characters (`.+?`) on the current line, up to a closing `>` symbol.
Assuming that the unique XML line <?xml version="1.0" encoding="UTF-8" ?> were split into four lines, as below:
<?xml ver
sion="1.0" enc
oding="UT
F-8" ?>
My regex doesn't match this incorrect text, unlike your regex! To get the same behaviour, you should change your regex to `(?i)<\?xml[^>\r\n]*>`. This time, the syntax `[^>\r\n]` matches any character different from the `>` character AND also different from any EOL character :-))
Cheers,
guy038
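P.S. The behaviour of the three patterns discussed in this reply can be checked with a short Python sketch. It assumes Python's `re` module rather than the Notepad++ engine, so `re.IGNORECASE` stands in for `(?i)`, and the variable names are invented for the example.

```python
import re

line = '<?xml version="1.0" encoding="UTF-8" ?>'

# Unescaped: "<?" means "an optional <", so the match only starts at "xml".
unescaped = re.compile(r"<?xml[^>]*>", re.IGNORECASE)
print(unescaped.sub("", line))            # <?

# Escaped: \? is a literal question mark, so the whole tag is matched.
escaped = re.compile(r"<\?xml[^>\r\n]*>", re.IGNORECASE)
print(repr(escaped.sub("", line)))        # ''

# The tag wrongly split over four lines.
broken = '<?xml ver\nsion="1.0" enc\noding="UT\nF-8" ?>'

# The dot stops at a line feed, so the single-line pattern finds nothing.
print(re.search(r"<\?xml.+?>", broken))   # None

# [^>] also accepts line breaks, so this pattern spans all four lines.
print(re.search(r"<\?xml[^>]*>", broken) is not None)  # True
```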
-