Need help replacing HTML tags! Any help welcome.
-
I need help with replacing and removing a series of HTML tags in 500+ Excel cells.
- There are HTML tags that I need to replace with new tag names and there are tags that I would like to remove altogether.
- The text I need to format is case sensitive, and in some cases, the tags contain capitalization. I need to replace all capitalized tags to lower-case lettering.
- Lastly, some tags have additional formatting included within the tag, but I would like to simplify them to their most basic tag naming convention. An asterisk is noted in some tags to indicate wildcard HTML tags where I would like to replace all variations of that tag that start the way I indicated.
There are variations of the following tags but here are just a few examples:
Find and Replace:
Find <DIV style…> and replace with <DIV>
Find <STRONG> and replace with <b>
Find <SPAN lang*> and replace with <span>
Remove Altogether:
<FONT Style…> and </FONT>
<tbody*> and </tbody>
-
Hello, @carolin-marschke,
Reading your post carefully, I noticed a small contradiction! You said:
I need to replace all capitalized tags to lower-case lettering.
And, further on, the example :
Find <DIV style…> and replace with <DIV>
So, I suppose that the replacement should be <div>!
Anyway, for all your text manipulations, here are some appropriate regexes to run, after opening the Replace dialog and checking the Regular expression search mode:
- For changes such as:
Find <DIV style…> and replace with <DIV>
Find <SPAN lang*> and replace with <span>
Use the regex S/R:
SEARCH = (?i-s)<(SPAN|DIV).*?>
REPLACE = <\L\1>
Notes:
- The search is case-insensitive ( i ) and the dot matches standard characters only, not EOL characters ( -s )
- It looks for the tag SPAN or DIV, in any case, stored as group 1 ( <(SPAN|DIV) ), possibly followed by some characters, up to the nearest > that closes the tag ( .*?> ), on the same line
- In the replacement, it just rewrites the group 1 tag ( \1 ) in lower case, due to the preceding \L syntax
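For anyone who would rather script this outside the editor, here is a minimal Python sketch of the same idea. It is only an assumption about how one might port it: Python's re module has no \L in replacements, so a small function lower-cases group 1 instead. The names TAG, simplify_tags and sample are made up for the illustration.

```python
import re

# Same search as (?i-s)<(SPAN|DIV).*?> : case-insensitive, and the dot
# does not cross line breaks (re.DOTALL is deliberately not set).
TAG = re.compile(r"<(SPAN|DIV).*?>", re.IGNORECASE)


def simplify_tags(text: str) -> str:
    """Rewrite each opening SPAN/DIV tag as the bare tag name in lower case."""
    # Python has no \L, so the lower-casing is done in a replacement function.
    return TAG.sub(lambda m: "<" + m.group(1).lower() + ">", text)


sample = '<DIV style="color:red">Hi</DIV> <SPAN lang="en-US">there</SPAN>'
print(simplify_tags(sample))
```

Note that closing tags such as </DIV> are left untouched by this pattern, just as with the regex above, because the tag name must follow the < immediately.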
- For changes such as:
Find <STRONG> and replace with <b>
Use, for instance, the regex S/R:
SEARCH = (?i)(<strong>)|(<em>)
REPLACE = (?1<b>)(?2<i>)
Notes:
- This is a simple replacement of one string by another. The general syntax is:
SEARCH = (?i)(Word1)|(Word2)|(Word3)........|(Wordn)
REPLACE = (?1subst1)(?2subst2)(?3subst3)........(?nsubstn)
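The conditional replacement syntax (?1subst1)(?2subst2) is not available in Python's re module, so a rough equivalent, if you ever script this, is an alternation plus a dictionary lookup. This is only a sketch: SUBSTITUTIONS, PATTERN and replace_tags are hypothetical names, and the mapping holds just the two pairs from the example above.

```python
import re

# Hypothetical mapping: lower-case search string -> replacement.
SUBSTITUTIONS = {"<strong>": "<b>", "<em>": "<i>"}

# One alternation built from the keys, matched case-insensitively like (?i).
PATTERN = re.compile(
    "|".join(re.escape(key) for key in SUBSTITUTIONS), re.IGNORECASE
)


def replace_tags(text: str) -> str:
    """Replace each matched string by its counterpart from SUBSTITUTIONS."""
    # The match is lower-cased before the lookup, since the keys are lower case.
    return PATTERN.sub(lambda m: SUBSTITUTIONS[m.group(0).lower()], text)


print(replace_tags("<STRONG>bold</STRONG> and <Em>italic</Em>"))
```

Closing tags would need their own entries in the mapping, for instance "</strong>" mapped to "</b>".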
- For suppressions such as:
<FONT Style…> and </FONT>
<tbody*> and </tbody>
Use the regex S/R:
SEARCH = (?is)<(FONT|tbody).+?</\1>\R?
REPLACE = EMPTY
Notes:
- The search is case-insensitive ( i ) and the dot matches any character, standard or EOL ( s )
- It looks for the tags FONT or tbody, in any case, stored as group 1 ( <(FONT|tbody) ), up to the nearest closing tag IDENTICAL to group 1 ( </\1> ), possibly followed by EOL character(s) ( \R? )
- As the replacement zone is empty, the block <Tag...>.......</Tag>, which most of the time spans several lines, is simply deleted
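Here is a hedged Python sketch of this block deletion, in case it helps to test on a sample first. As far as I know, Python's re module does not support \R, so an explicit alternation of line endings stands in for it. BLOCK, remove_blocks and sample are names invented for the example.

```python
import re

# Port of (?is)<(FONT|tbody).+?</\1>\R? : re.DOTALL lets the dot cross
# line breaks, and (?:\r\n|\r|\n)? replaces \R?, which re does not support.
BLOCK = re.compile(
    r"<(FONT|tbody).+?</\1>(?:\r\n|\r|\n)?", re.IGNORECASE | re.DOTALL
)


def remove_blocks(text: str) -> str:
    """Delete each <FONT ...>...</FONT> or <tbody ...>...</tbody> block."""
    return BLOCK.sub("", text)


sample = 'keep <FONT Style="x">gone\nalso gone</FONT>\nkeep too'
print(repr(remove_blocks(sample)))
```

Be aware that, like the regex above, this deletes everything between the opening and closing tags, not only the tags themselves.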
Hoping that this first attempt will be useful to you !
Best Regards,
guy038
-
Hello @guy038 ,
Thank you very much for your response. Also, you are correct in catching my mistake. Thanks for pointing that out. I will use your suggestions and get back to you shortly to let you know if those worked for me.
Best,
Carolin
-
@guy038 ,
Hello! Thank you again for your help. I am now stumbling upon an issue with the following tag (any possible variations of it after the first few characters):
<?xml *>
There may be variations of strings that begin with this lettering. I tried to replace these by the following search function:
(?i)<?xml[^>]*>
But once replaced, the following remained in my script:
<?
How do you recommend I remove all instances of "<?"?
-
Hi, @carolin-marschke,
I suppose that you're speaking about the declaration below, which generally starts an XML file!
<?xml version="1.0" encoding="UTF-8" ?>
Your troubles come from the question mark character ( ? ), which is considered, in the regex universe, as a special character, also called a meta-character. So, by default, these characters are NOT simple literal characters!
There are two different sets of meta-characters:
- Those which are recognized anywhere in the pattern, EXCEPT within square brackets:
  - \ General escape character, with several uses
  - ^ Start of a line ( or a file )
  - $ End of a line ( or a file )
  - . Match any character, except new-line ones ( by default )
  - [ Start a class definition or a class range
  - | Start of an alternative branch
  - ( Start of a sub-pattern group
  - ) End of a sub-pattern group
  - { Start a Min / Max range of a quantifier
  - } End a Min / Max range of a quantifier
  - * 0 to more times, the preceding character or group
  - +
    - 1 to more times, the preceding character or group
    - Possessive behaviour of the quantifiers *, + and ?
  - ?
    - 0 or 1 time, the preceding character or group
    - Meaning extender, for groups or conditions, (?....)
    - Minimizer of the quantifiers *, + and ?
- Those which are recognized within square brackets ( character class ), EXCLUSIVELY:
  - \ General escape character
  - ^ Negate the class, if first character of the class
  - - Character range indicator
  - [: Start of a POSIX character class, if followed by regular POSIX syntax
  - :] End of a POSIX character class
So, Carolin, if you need to search for any of the above characters as a literal, you must escape it with the backslash character \
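To see the effect of a missing escape, here is a small Python sketch (Python's re treats ? the same way on this point). It shows why the unescaped pattern leaves <? behind, and uses re.escape to build the literal prefix. The variable names are made up for the illustration.

```python
import re

line = '<?xml version="1.0" encoding="UTF-8" ?>'

# Unescaped: "<?" means "an optional <", so the match can only start at
# "xml" and the two leading characters are never consumed.
unescaped = re.compile(r"<?xml[^>]*>", re.IGNORECASE)
left_over = unescaped.sub("", line)
print(repr(left_over))

# Escaped: re.escape puts a backslash before the ? of the literal prefix,
# so the whole declaration is matched and removed.
escaped = re.compile(re.escape("<?xml") + r"[^>]*>", re.IGNORECASE)
cleaned = escaped.sub("", line)
print(repr(cleaned))
```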
Therefore, your regex must be rewritten as:
(?i)<\?xml[^>]*>
with a \ right before the ? character!
However, the regex
(?-is)<\?xml.+?>
gives better results! Indeed, due to the -s modifier, any dot matches standard characters only and, due to the -i modifier, the search is case-sensitive. So, after matching the literal string <?xml, with that exact case ( <\?xml ), it looks for the shortest non-null range of characters ( .+? ) of the current line, up to a closing > symbol.
Assuming that the single XML line <?xml version="1.0" encoding="UTF-8" ?> were split into four lines, as below:
<?xml ver
sion="1.0" enc
oding="UT
F-8" ?>
My regex doesn't match this incorrect text, unlike your regex! To get the same behaviour, you should change your regex to:
(?i)<\?xml[^>\r\n]*>
This time, the syntax [^>\r\n] matches any character different from the > character AND also different from any EOL character :-))
Cheers,
guy038
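P.S. The three XML patterns from this post can be compared with a short Python sketch. It assumes Python's re module, whose defaults are already case-sensitive with a dot that stops at line breaks, so the (?-is) prefix is simply left out. The dictionary keys and variable names are made up for the illustration.

```python
import re

one_line = '<?xml version="1.0" encoding="UTF-8" ?>'
split_text = '<?xml ver\nsion="1.0" enc\noding="UT\nF-8" ?>'

patterns = {
    # (?i)<\?xml[^>]*> : the negated class also matches EOL characters
    "class": re.compile(r"<\?xml[^>]*>", re.IGNORECASE),
    # (?-is)<\?xml.+?> : the dot stops at the end of the line
    "dot": re.compile(r"<\?xml.+?>"),
    # (?i)<\?xml[^>\r\n]*> : EOL characters excluded from the class
    "class_no_eol": re.compile(r"<\?xml[^>\r\n]*>", re.IGNORECASE),
}

# For each pattern: does it match the single line, and the split text?
results = {
    name: (bool(p.search(one_line)), bool(p.search(split_text)))
    for name, p in patterns.items()
}
for name, (single, multi) in results.items():
    print(name, single, multi)
```

Only the first pattern matches the text split over four lines, and all three match the normal single-line declaration.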
-