Regex to remove specific classes from specific tags in HTML code

Dario de Judicibus

I need to use regular expressions to remove specific classes from specific tags in HTML code. For example: remove class fred from tag p only.

<p class="matt fred">some text</p> → <p class="matt">some text</p>
<h1 class="matt fred">some text</h1> → <h1 class="matt fred">some text</h1>
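<p class="fred">some text</p> → <p>some text</p>
<p id="fred">some text</p> → <p id="fred">some text</p>
<p class="matt fred jane">some text</p> → <p class="matt jane">some text</p>
<p class="fred jane">some text</p> → <p class="jane">some text</p>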

Please, note that a tag may have one or more classes and that the class to be removed can be in any position, so it is necessary to manage space separators too.

Thank you in advance.

guy038

Hello Dario

No problem ! So, starting from your original example, to which I added some more possible cases :

<p class="matt fred">some text</p>
<p id="12345" class="matt fred">some text</p>

<h1 class="matt fred">some text</h1>
<h1 id="12345" class="matt fred">some text</h1>

<p class="fred">some text</p>
<p id="12345" class="fred">some text</p>

<p id="fred">some text</p>
<p type="xxx" id="fred">some text</p>

<p class="matt fred jane">some text</p>
<p id="12345" class="matt fred jane">some text</p>

<p class="fred jane">some text</p>
<p id="12345" class="fred jane">some text</p>

I propose the following S/R :

SEARCH (?-s)<p[^<>\r\n]*(\K class="fred"| class=".+\K fred| class="\Kfred )

REPLACE Leave EMPTY

REMARKS :

In the search regex, there is a space character :
- Before each word class
- Before the second occurrence of the word fred
- After the third occurrence of the word fred
The Regular expression search mode needs to be set, of course !
You must use the Replace All button, exclusively ( The step-by-step replacement, with the Replace button, does nothing, because that mode handles look-behinds and backward assertions, here \K, incorrectly ! )

After clicking on the Replace All button, you should get the changed text, below :

<p class="matt">some text</p>
<p id="12345" class="matt">some text</p>

<h1 class="matt fred">some text</h1>
<h1 id="12345" class="matt fred">some text</h1>

<p>some text</p>
<p id="12345">some text</p>

<p id="fred">some text</p>
<p type="xxx" id="fred">some text</p>

<p class="matt jane">some text</p>
<p id="12345" class="matt jane">some text</p>

<p class="jane">some text</p>
<p id="12345" class="jane">some text</p>

NOTES :

The (?-s) syntax, at the beginning of the regex, ensures that any dot matches a single standard character only, and NOT an End of Line character
Then the part <p[^<>\r\n]* looks for the string <p, followed by a range, possibly empty, of characters different from <, >, \r and \n, up to one of the three parts of the alternative (....|....|....) :
- The string class="fred", preceded by a space character. As the \K syntax comes first, the regex engine resets the start of the match there and forgets everything matched before it. So only this whole string, space included, is deleted during the replacement phase
- The string class=", preceded by a space character, then a non-empty range of standard characters ( .+ ), then \K and the string fred preceded by a space. Because of the \K, only that last string, the space and fred, is deleted
- The string class=", preceded by a space character, then \K and the string fred followed by a space character. Again, because of the \K, only the string fred and the space after it are deleted during replacement
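For readers who want to try the three cases outside Notepad++, here is a minimal Python sketch. It is not the same regex : Python's built-in re module has no \K, so the text that \K would "forget" is captured in groups and written back by the replacement function. The names PATTERN and remove_fred are made up for this illustration.

```python
import re

# Capture-group rewrite of the search regex. Python's `re` has no \K,
# so whatever \K would keep out of the match is captured and re-emitted.
PATTERN = re.compile(
    r'(<p[^<>\r\n]*)'      # group 1: "<p" plus any attributes before class
    r'(?: class="fred"'    # case 1: fred is the only class, drop the attribute
    r'|( class=".+) fred'  # case 2: fred comes after other classes (group 2 kept)
    r'|( class=")fred )'   # case 3: fred comes first, other classes follow (group 3 kept)
)


def remove_fred(line: str) -> str:
    """Remove the class fred from a <p> tag, at most once per tag."""
    def keep(m):
        # Write back everything except the deleted attribute, " fred" or "fred ".
        return m.group(1) + (m.group(2) or m.group(3) or "")
    return PATTERN.sub(keep, line)


print(remove_fred('<p id="12345" class="matt fred jane">some text</p>'))
print(remove_fred('<h1 class="matt fred">some text</h1>'))
```

The dot does not match line breaks by default in Python, which corresponds to the (?-s) setting of the original search.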

If you don’t want to repeat the word fred several times in the search regex, here is a second solution :

SEARCH (?(DEFINE)(fred))(?-s)<p[^<>\r\n]*(\K class="(?1)"| class=".+\K (?1)| class="\K(?1) )

REPLACE Leave EMPTY

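The (?(DEFINE)....) construction is a special conditional regex syntax : the group inside it is never matched at that position, it is only defined, so that it can be called later with (?1). Refer to the two links, below, for further information :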

http://www.boost.org/doc/libs/1_48_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.conditional_expressions

http://www.regular-expressions.info/subroutine.html#define
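In a language whose regex engine lacks (?(DEFINE)....) and subroutine calls, the same goal, writing the class name once, can be reached by building the pattern from a variable. The sketch below does this with Python's built-in re module; build_pattern and remove_class are hypothetical helper names, capture groups stand in for \K, and the \b after the tag name is an addition of this sketch ( it keeps a search for p from matching <pre> ), not part of the regex above.

```python
import re


def build_pattern(tag: str, cls: str) -> "re.Pattern":
    """Compile a search pattern that removes class `cls` from `tag` elements."""
    t, c = re.escape(tag), re.escape(cls)  # the class name is written only once
    return re.compile(
        rf'(<{t}\b[^<>\r\n]*)'                                  # tag and leading attributes
        rf'(?: class="{c}"|( class=".+) {c}|( class="){c} )'    # the same three cases
    )


def remove_class(html: str, tag: str, cls: str) -> str:
    """Apply the pattern and write back the parts that must be kept."""
    return build_pattern(tag, cls).sub(
        lambda m: m.group(1) + (m.group(2) or m.group(3) or ""), html
    )


print(remove_class('<p class="matt fred jane">some text</p>', 'p', 'fred'))
print(remove_class('<h1 class="matt fred">some text</h1>', 'h1', 'matt'))
```

Passing the tag as a parameter as well covers the "specific classes from specific tags" part of the question with one helper.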

To end, as you said :

Please, note that a tag may have one or more classes and that the class to be removed can be in any position, so it is necessary to manage space separators too.

Just tell me about any missing cases that may occur in your HTML text, and I’ll try to improve the regex :-)

Best Regards,

guy038

guy038

Hi, Dario

At the end of my previous post, I said :

If you don’t want to repeat the word fred several times in the search regex,…

and I gave a second, but rather complicated, regex !

Thinking about it, here is a third, simpler regex for the same purpose :

SEARCH (?-s)<p[^<>\r\n]*(\K class="(fred)"| class=".+\K (?2)| class="\K(?2) )

REPLACE Leave EMPTY

Remember that the (?n) syntax, called a subroutine call, represents the regex enclosed in the nth pair of round brackets, counting the opening brackets from left to right, and that it can be placed after or before the group to which it refers !

In our example, the regex is, simply, the string fred and the (?2) forms are located after the second group (fred)

It’s very important to note that the similar regex, which uses back-references, does not work ! Indeed, the regex :

(?-s)<p[^<>\r\n]*(\K class="(fred)"| class=".+\K \2| class="\K\2 ), matches, only, the two lines, below :

<p class="fred">some text</p>
<p id="12345" class="fred">some text</p>

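Why ? Well, these positive matches all come from the first case of the alternative. A back-reference stands for the text that the group actually matched, not for its regex. So, when the engine tries the two other cases of the alternative ( " class=".+\K \2" or " class="\K\2 " ), group 2 is not set, as the first part of the alternative, the only one which contains (fred), has not been used, and the back-reference fails !!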
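The same effect can be observed with Python's built-in re module, which also makes a back-reference to an unset group fail. This is only an illustration under that assumption about the engine : the \K parts are left out ( re does not support them ), so group 1 here plays the role of group 2 above, and the names BACKREF and LITERAL are invented for the sketch.

```python
import re

# Back-reference form: group 1 is set only when the first alternative matches.
BACKREF = re.compile(
    r'<p[^<>\r\n]*(?: class="(fred)"| class=".+ \1| class="\1 )'
)

# Same pattern with the literal repeated: each alternative is self-sufficient.
LITERAL = re.compile(
    r'<p[^<>\r\n]*(?: class="fred"| class=".+ fred| class="fred )'
)

samples = [
    '<p class="fred">some text</p>',
    '<p class="matt fred">some text</p>',
    '<p class="matt fred jane">some text</p>',
]

for s in samples:
    # First column: back-reference form, second column: literal form.
    print(bool(BACKREF.search(s)), bool(LITERAL.search(s)), s)
```

Only the first sample is found by the back-reference form, while the literal form finds all three, which mirrors the difference between \2 and (?2) described above.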

Cheers,

guy038