Select all exclamation marks ! from a specific html tag



  • <p class="ONE">LoveThePlanet products are free from ! parabens and palm oil, despatched in paper envelopes !as well</p>

    <p class="TWO">LoveThePlanet products are free from ! parabens and palm oil, despatched in paper envelopes !as well</p>

    So, I want to select every exclamation mark ! inside the tag <p class="ONE">...</p> and replace it with the same mark, with one space before and one space after it.

    I made a regex, but it also matches inside the second tag, and I want to modify only the first one (<p class="ONE">)

    FIND: (?:<p class="one">|\G)[^|"]*\K \!\h|\!\h*

    REPLACE BY: \x20!\x20

    So, the only problem is that my regex also modifies the second html tag <p class="TWO">, and I want to modify only the first html tag <p class="ONE">



  • Hello, @robin-cruise and All,

    I suppose that the following regex S/R should do the trick :

    SEARCH (?i-s)(?:<p class="ONE">|\G).*?\K\h*!\h*

    REPLACE \x20!\x20

    Here are the key points and steps :

    • We want to look for any exclamation mark, possibly preceded and/or followed by horizontal blank characters ( the character class [\t\x20\xA0] ), so the simple regex \h*!\h*, and replace it with an exclamation mark surrounded by a space char on each side, so the replacement \x20!\x20

    • We also want to do this search ONLY IF the tag is <p class="ONE">. As, in your example, you’re using both upper and lower case, I suppose that the correct regex should be (?i)<p class="ONE">, with a search insensitive to case. If not, use either (?-i)<p class="ONE"> or (?-i)<p class="one">

    • Now, the main idea is :

      • To look for the shortest range of standard characters between the tag <p class="ONE"> and our searched expression \h*!\h*, with the lazy quantifier *? and select only the searched expression, with the \K feature, so the regex (?i-s)<p class="ONE">.*?\K\h*!\h*

      • To look, again, for the shortest range of standard chars from the position right after the end of the previous match, with the \G assertion and our searched expression, with the regex \G.*?\K\h*!\h*

      • And so on… … If we factor the anchor of the characters range, which is either <p class="ONE"> or \G, we end with the regex (?i-s)(?:<p class="ONE">|\G).*?\K\h*!\h*, as above !

    • The important point to understand is that the range of chars, before reaching the searched expression, consists of standard characters. So, when the end of tag <p class="ONE">...........</p> is reached, the only way to go on is to skip the EOL characters, located after </p>. But, in that case, the \G assertion is not verified anymore and, necessarily, the next match will have, first, to search for the other anchor <p class="ONE"> !
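
For readers who want to try the same idea outside Notepad++, here is a minimal sketch in Python. It is not the regex above: Python's standard re module has neither \G nor \K nor \h, so this version swaps in a two-step approach (first isolate the rest of the line after the tag, then fix the exclamation marks inside it). The function name pad_exclamations is made up for this illustration, and the blank class [\t \xa0] is the one quoted in this post.

```python
import re

# Step 1: the opening tag plus the rest of its line.
# '.' does not cross a line break by default, which plays the role of (?-s).
ZONE = re.compile(r'(<p class="ONE">)(.*)', re.IGNORECASE)

# Step 2: an exclamation mark with optional horizontal blanks around it
# (tab, space, no-break space, as listed in the post).
BANG = re.compile(r'[\t \xa0]*![\t \xa0]*')


def pad_exclamations(text: str) -> str:
    """Surround each '!' with single spaces, only after <p class="ONE">."""
    def fix_zone(match: re.Match) -> str:
        # Keep the tag untouched, rewrite only the text that follows it.
        return match.group(1) + BANG.sub(' ! ', match.group(2))
    return ZONE.sub(fix_zone, text)


sample = (
    '<p class="ONE">free from ! parabens, envelopes !as well</p>\n'
    '<p class="TWO">free from ! parabens, envelopes !as well</p>'
)
print(pad_exclamations(sample))
```

Only the first line should change; the line with class="TWO" never enters the zone, which is the same guarantee the \G anchor gives in the Notepad++ regex.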


    If we use the free-spacing mode, our regex can be expressed as :

    
    (?xi-s)               #  FREE-SPACING mode, INSENSITIVE to case, any DOT = 1 STANDARD character ONLY
    (?:                   #  NON-CAPTURING group
      <p\ class="ONE">    #    LITERAL string <p class="ONE"> ( Note the ESCAPED SPACE char, after <p )
    |                     #  The ALTERNATION symbol
      \G                  #    MATCHES the position RIGHT AFTER the PREVIOUS match, ONLY
    )                     #  End of the NON-CAPTURING group
     .*?                  #  The SHORTEST range, possibly EMPTY, of STANDARD characters till ...
     \K                   #  CANCELS any MATCH so far and RESETS the regex ENGINE position to PRESENT position
    \h*  !  \h*           #  ... the SEARCHED expression, so an EXCLAMATION MARK, possibly PRECEDED and/or FOLLOWED with HORIZONTAL BLANK characters
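
As a side note, the same free-spacing layout exists in Python as the re.VERBOSE flag. The sketch below only illustrates the layout rules (ignored whitespace, # comments, the escaped space); it does not reproduce \G, \K or \h, which Python's re module lacks. All the names in it are made up for this illustration.

```python
import re

# Compact form of the searched expression, with the blank class from the post.
COMPACT = re.compile(r'[\t\x20\xA0]*![\t\x20\xA0]*')

# Same expression in free-spacing layout: whitespace and # comments are ignored.
SPACED = re.compile(r"""
    [\t\x20\xA0]*    # horizontal blanks before, possibly none
    !                # the exclamation mark itself
    [\t\x20\xA0]*    # horizontal blanks after, possibly none
""", re.VERBOSE)

# The literal space in the tag must be escaped, or free-spacing mode drops it.
TAG_ESCAPED = re.compile(r'<p\ class="ONE">', re.VERBOSE | re.IGNORECASE)
TAG_UNESCAPED = re.compile(r'<p class="ONE">', re.VERBOSE | re.IGNORECASE)

sample = 'envelopes !as well'
same_result = COMPACT.sub(' ! ', sample) == SPACED.sub(' ! ', sample)
escaped_matches = TAG_ESCAPED.match('<p class="ONE">') is not None
unescaped_matches = TAG_UNESCAPED.match('<p class="ONE">') is not None
print(same_result, escaped_matches, unescaped_matches)
```

The unescaped variant silently becomes <pclass="ONE">, which is why the space after <p is escaped in the free-spacing regex above.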
    

    Best Regards,

    guy038



  • thank you very much



  • @guy038

    Isn’t this just another variant on THIS ??



  • @Alan-Kilborn you can also compare the solutions in both topics and see for yourself whether they are the same and whether they can be applied in the same context :)



  • Hello, @alan-kilborn, @robin-cruise and All,

    Alan, you’re quite right about it. For instance, the three main search regexes that I provided to @robin-cruise, expressed in free-spacing mode, are :

    Regex A   (?xs)    (?:  <My\ Tag>         |  \G  )    ((?!^<).)+?    <a\ href="    \K               (?=/)
    
    Regex B   (?xs)    (?:  <!--\ BEGIN\ -->  |  \G  )    ((?!^<).)+?    <a\ href="    \K               (?=/)
    
    Regex C   (?xi-s)  (?:  <p\ class="ONE">  |  \G  )        .*?                      \K    \h*!\h*
    

    They follow the generic scheme, below :

    SEARCH (?-s)(BR|\G)((?!ER).)*?\KSR        OR        (?s)(BR|\G)((?!ER).)*?\KSR

    REPLACE RR

    where :

    • BR ( Beginning Regex ) is the regex which defines the start of the specific area where a Search Regex match is looked for

    • ER ( Excluded Regex ) is the regex which defines the characters and/or strings that are forbidden between the Beginning Regex position and the next Search Regex match. It implicitly defines a zone, where the Search Regex may occur and not elsewhere !

    • SR ( Search Regex ) is the regex which defines the expression to search for, provided that the Beginning Regex has been matched and the Excluded Regex has not been matched so far, at any position

    • RR ( Replace Regex ) is simply the replacement expression substituted for each Search Regex match

    Note that, when the ER zone is not needed, these S/R can be simplified as :

    SEARCH (?-s)(BR|\G).*?\KSR        OR        (?s)(BR|\G).*?\KSR


    For instance :

    • In the regex A, BR = <My Tag>, ER = ^<, SR = EMPTY string between <a\ href=" and /, RR = https://link.ca

    • In the regex B, BR = <!-- BEGIN -->, ER = ^<, SR = EMPTY string between <a\ href=" and /, RR = https://link.ca

    • In the regex C, BR = <p\ class="ONE">, ER = None, SR = \h*!\h*, RR = \x20!\x20

    Note that :

    • In regexes A and B, due to the multi-line search enabled by the leading (?s) modifier, an Excluded Regex is necessary so that a match does not run over into another section <My Tag> or <!-- BEGIN -->, starting at the beginning of a line. Hence the negative look-ahead (?!^<) in the expression ((?!^<).)+?

    • In regex C, the Excluded Regex is implicit: it could be written as the negative look-ahead (?![\r\n]), applied to each character of the shortest range .*?, hence the syntax ((?![\r\n]).)*?. Indeed, because of the leading (?-s) modifier, no char of that range can ever be an EOL character. So it implicitly defines a zone, from just after the string <p\ class="ONE"> to the end of the current line ( so, in this example, till the first </p> included ), where to search for \h*!\h*, and the shortest range of standard characters can be written with the simple syntax (?-s).*? !
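
To make the BR / ER / SR / RR scheme concrete, here is a rough Python sketch of the same logic. It is an approximation, not a translation: Python's re module has no \G or \K, so the function first cuts out each zone ( BR followed by characters that never start an ER match ) and then applies SR and RR inside that zone only. The function name replace_in_zone and its parameters are hypothetical, and turning on re.MULTILINE so that ^ matches at every line start is an assumption meant to mirror the behaviour relied on in regexes A and B.

```python
import re


def replace_in_zone(text, br, sr, rr, er=None, dotall=False):
    """Replace sr by rr, but only inside zones opened by br.

    A zone runs from the end of a br match up to the first position where
    er would match (or, when er is None, as far as '.' can reach).
    """
    # Assumption: MULTILINE, so that ^ in er or sr matches at each line start.
    flags = re.MULTILINE | (re.DOTALL if dotall else 0)
    # Zone body: a run of characters, none of which starts an er match.
    body = r'.*' if er is None else rf'(?:(?!{er}).)*'
    zone = re.compile(rf'(?P<br>{br})(?P<zone>{body})', flags)
    search = re.compile(sr, flags)

    def fix(match):
        # br is copied unchanged; sr only ever sees the zone text itself.
        return match.group('br') + search.sub(rr, match.group('zone'))

    return zone.sub(fix, text)


# Regex C: BR = <p class="ONE">, no ER, SR = blanks ! blanks, RR = ' ! '
text = (
    '<p class="ONE">free from ! parabens, envelopes !as well</p>\n'
    '<p class="TWO">free from ! parabens, envelopes !as well</p>'
)
result = replace_in_zone(text, r'(?i:<p class="ONE">)', r'[\t \xa0]*![\t \xa0]*', ' ! ')
print(result)
```

One difference worth keeping in mind: here SR must fit entirely inside the zone and cannot look at the text around it, whereas in the Notepad++ regex only the characters before SR are checked against ER.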

    Best Regards,

    guy038



  • @guy038 very well explained, thank you



  • @guy038

    I as well like your explanation.
    It could help people start learning how to solve these types of problems.
    Perhaps in the future posters (and especially repetitive posters asking the same questions for similar situations) could be directed to this solution to try before asking for more help.

