    Regex to remove specific classes from specific tags in HTML code

    • Dario de Judicibus

      I need to use regular expressions to remove specific classes from specific tags in HTML code. For example: remove class fred from tag p only.

      <p class="matt fred">some text</p> → <p class="matt">some text</p>
      <h1 class="matt fred">some text</h1> → <h1 class="matt fred">some text</h1>
      <p class="fred">some text</p> → <p>some text</p>
      <p id="fred">some text</p> → <p id="fred">some text</p>
      <p class="matt fred jane">some text</p> → <p class="matt jane">some text</p>
      <p class="fred jane">some text</p> → <p class="jane">some text</p>

      Please, note that a tag may have one or more classes and that the class to be removed can be in any position, so it is necessary to manage space separators too.

      Thank you in advance.

      • guy038

        Hello Dario

        No problem! Here is your original example, to which I added a few more possible cases:

        <p class="matt fred">some text</p>
        <p id="12345" class="matt fred">some text</p>
        
        <h1 class="matt fred">some text</h1>
        <h1 id="12345" class="matt fred">some text</h1>
        
        <p class="fred">some text</p>
        <p id="12345" class="fred">some text</p>
        
        <p id="fred">some text</p>
        <p type="xxx" id="fred">some text</p>
        
        <p class="matt fred jane">some text</p>
        <p id="12345" class="matt fred jane">some text</p>
        
        <p class="fred jane">some text</p>
        <p id="12345" class="fred jane">some text</p>
        

        I propose the following search/replace (S/R):

        SEARCH (?-s)<p[^<>\r\n]*(\K class="fred"| class=".+\K fred| class="\Kfred )

        REPLACE Leave EMPTY

        REMARKS :

        • In the search regex, there is a space character :

          • Before each word class

          • Before the second occurrence of the word fred

          • After the third occurrence of the word fred

        • The Regular expression search mode must be selected, of course!

        • You must use the Replace All button exclusively. Step-by-step replacement with the Replace button does nothing, due to incorrect handling of look-behinds and backward assertions!


        After clicking the Replace All button, you should get the changed text below:

        <p class="matt">some text</p>
        <p id="12345" class="matt">some text</p>
        
        <h1 class="matt fred">some text</h1>
        <h1 id="12345" class="matt fred">some text</h1>
        
        <p>some text</p>
        <p id="12345">some text</p>
        
        <p id="fred">some text</p>
        <p type="xxx" id="fred">some text</p>
        
        <p class="matt jane">some text</p>
        <p id="12345" class="matt jane">some text</p>
        
        <p class="jane">some text</p>
        <p id="12345" class="jane">some text</p>
        

        NOTES :

        • The (?-s) modifier at the start of the regex ensures that a dot matches any single standard character but never an End of Line character

        • Then the part <p[^<>\r\n]* looks for the string <p, followed by any run, possibly empty, of characters other than <, >, \r and \n, up to one of the three branches of the alternation (....|....|....) :

          • The string class="fred", preceded by a space character. Because \K comes before this string, the regex engine resets the start of the match there and forgets everything matched so far. So only this whole string, with its leading space, is deleted during the replacement phase

          • The string class=".+\K fred, that is, a space, class=", a non-empty run of standard characters, then a space and fred. Again, because of \K, only the latter part, the space and fred, is deleted

          • The string class="\Kfred , that is, a space, class=", then fred and a trailing space character. Again, because of \K, only fred and the trailing space are deleted during replacement
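
The expected results above can also be cross-checked outside Notepad++. The following is a minimal Python sketch, not the S/R described in this post: Python's standard re module has no \K, so it swaps in a different technique (find each opening <p> tag, then rewrite its class attribute one class name at a time). The names remove_class, P_TAG and CLASS_ATTR are made up for this illustration, and the sketch assumes double-quoted class attributes, as in the examples.

```python
import re

# Opening <p ...> tags only; \b keeps tags such as <pre> out of the match.
P_TAG = re.compile(r'<p\b[^<>]*>')
# A double-quoted class attribute together with the whitespace before it.
CLASS_ATTR = re.compile(r'\s+class="([^"]*)"')


def remove_class(html, cls="fred"):
    """Remove class `cls` from every <p> tag in `html` (illustrative only)."""

    def fix_attr(attr_match):
        classes = attr_match.group(1).split()
        if cls not in classes:
            return attr_match.group(0)  # nothing to remove, keep as written
        kept = [c for c in classes if c != cls]
        # Drop the whole attribute when no class is left.
        return ' class="{}"'.format(" ".join(kept)) if kept else ""

    def fix_tag(tag_match):
        return CLASS_ATTR.sub(fix_attr, tag_match.group(0))

    return P_TAG.sub(fix_tag, html)


if __name__ == "__main__":
    for line in ['<p class="matt fred">some text</p>',
                 '<h1 class="matt fred">some text</h1>',
                 '<p id="12345" class="fred">some text</p>']:
        print(remove_class(line))
```

Because the sketch compares whole class names, a class such as frederick is left alone, and the \b after <p keeps it from touching other tags whose names merely start with p.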


        If you don’t want to repeat the word fred several times in the search regex, here is a second solution:

        SEARCH (?(DEFINE)(fred))(?-s)<p[^<>\r\n]*(\K class="(?1)"| class=".+\K (?1)| class="\K(?1) )

        REPLACE Leave EMPTY

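        The (?(DEFINE)...) construct is a special conditional regex syntax: the group inside it is never matched where it stands, it only defines a pattern that the (?1) calls re-use later. Refer to the two links below for further information: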

        http://www.boost.org/doc/libs/1_48_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.conditional_expressions

        http://www.regular-expressions.info/subroutine.html#define
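
Another way to avoid typing the class name several times is to generate the search string. The sketch below is only an illustration: build_search is a hypothetical helper, not part of Notepad++, and it simply assembles the first SEARCH regex of this post for a given tag and class. The result is meant to be pasted into the Find what field; Python's own re module cannot run it, since it lacks \K.

```python
import re


def build_search(tag, cls):
    """Return the three-branch search regex for `tag` and `cls` as text."""
    t = re.escape(tag)  # escape characters that are special in a regex
    c = re.escape(cls)
    return (
        r'(?-s)<' + t + r'[^<>\r\n]*('
        + r'\K class="' + c + r'"'      # class is the only one: drop the attribute
        + r'| class=".+\K ' + c         # class is not first: drop " cls"
        + r'| class="\K' + c + r' )'    # class is first: drop "cls "
    )


if __name__ == "__main__":
    print(build_search("p", "fred"))
```

Calling build_search("p", "fred") reproduces the first SEARCH regex character for character, so the REPLACE field is still left empty.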


        To end, as you said :

        Please, note that a tag may have one or more classes and that the class to be removed can be in any position, so it is necessary to manage space separators too.

        Just tell me about any missing cases that may occur in your HTML text and I’ll try to improve the regex :-)

        Best Regards,

        guy038

        • guy038

          Hi, Dario

          At the end of my previous post, I said :

          If you don’t want to repeat the word fred several times in the search regex,…

          and I gave a second, but rather complicated, regex!

          Thinking about it, here is a third and simpler regex, for the same purpose:

          SEARCH (?-s)<p[^<>\r\n]*(\K class="(fred)"| class=".+\K (?2)| class="\K(?2) )

          REPLACE Leave EMPTY

          Remember that the (?n) syntax, called a subroutine call, re-runs the regex enclosed in the nth pair of round brackets, counting from left to right, and can be placed after or before the group to which it refers!

          In our example, that regex is simply the string fred, and both (?2) calls are located after the second group, (fred)


          It is very important to note that the similar regex using back-references does not work! Indeed, the regex:

          (?-s)<p[^<>\r\n]*(\K class="(fred)"| class=".+\K \2| class="\K\2 )

          matches only the two lines below:

          <p class="fred">some text</p>
          <p id="12345" class="fred">some text</p>
          

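          Why? Those two matches come from the first branch of the alternation. When the engine tries the two other branches ( " class=".+\K \2" or " class="\K\2 " ), group 2 has not been set, because the first branch, which contains it, took no part in the match. A back-reference to an unset group cannot match, whereas a subroutine call such as (?2) re-uses the pattern itself and does not need the group to have matched!!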
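
The difference can be reproduced with a small Python sketch. It rests on several assumptions: Python's standard re module supports neither \K nor the (?2) subroutine call, so \K is dropped (only "does the line match?" is checked) and the subroutine call is imitated by inserting the same sub-pattern text again. The names NAME, backref and reuse exist only for this illustration.

```python
import re

NAME = "fred"  # the class name, written once

# Back-reference version: \2 repeats the TEXT captured by group 2.
# Group 2 sits in the first branch, so it is unset in the other two.
backref = re.compile(
    r'<p[^<>\r\n]*( class="(' + NAME + r')"| class=".+ \2| class="\2 )'
)

# Pattern re-use version: the sub-pattern itself appears in every branch,
# which is the effect a subroutine call has in engines that support it.
reuse = re.compile(
    r'<p[^<>\r\n]*( class="(' + NAME + r')"'
    r'| class=".+ ' + NAME + r'| class="' + NAME + r' )'
)

lines = [
    '<p class="fred">some text</p>',
    '<p class="matt fred">some text</p>',
    '<p class="matt fred jane">some text</p>',
]

backref_hits = [bool(backref.search(s)) for s in lines]
reuse_hits = [bool(reuse.search(s)) for s in lines]

if __name__ == "__main__":
    print(backref_hits)
    print(reuse_hits)
```

With these three lines, backref_hits is [True, False, False] while reuse_hits is [True, True, True]: the back-reference version only finds the line handled by the first branch, as described above.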

          Cheers,

          guy038

          The Community of users of the Notepad++ text editor.