Select all exclamation marks ! from a specific html tag



  • <p class="ONE">LoveThePlanet products are free from ! parabens and palm oil, despatched in paper envelopes !as well</p>

    <p class="TWO">LoveThePlanet products are free from ! parabens and palm oil, despatched in paper envelopes !as well</p>

    So, I want to select all exclamation marks ! from the tag <p class="ONE">...</p> and to replace it with a space before and after it

    I made a regex, but it select also the second tag, but I want to modify only the first one (<p class="ONE">)

    FIND: (?:<p class="one">|\G)[^|"]*\K \!\h|\!\h*

    REPLACE BY: \x20!\x20

    So, the single problem is that my regex will modify also the second html tag <p class="TWO">, and I want to modify only the first html tag <p class="ONE">



  • Hello, @robin-cruise and All,

    I suppose that the following regex S/R should do the trick :

    SEARCH (?i-s)(?:<p class="ONE">|\G).*?\K\h*!\h*

    REPLACE \x20!\x20

    Here are the key points and steps :

    • We want to look for any exclamation mark, possibly preceded and/or followed with horizontal blanks characters ( the character class [\t\x20\xA0] ) so the simple regex \h*!\h* and replace it with an exclamation mark surrounded with a space char so the regex \x20!\x20

    • We also want to do this search ONLY IF  the tag is <p class="ONE">. As, in your example, you’re using the upper and lower case, I suppose that the correct regex should be (?i)<p class="ONE">, with a search Insensible to case. If not, use either (?-i)<p class="ONE"> or (?-i)<p class="one">

    • Now, the main idea is :

      • To look for the shortest range of standard characters between the tag <p class="ONE"> and our searched expression \h*!\h*, with the lazy quantifier *? and select only the searched expression, with the \K feature, so the regex (?i-s)<p class="ONE">.*?\K\h*!\h*

      • To look, again, for the shortest range of standard chars from the position right after the end of the previous match, with the \G assertion and our searched expression, with the regex \G.*?\K\h*!\h*

      • And so on… … If we factor the anchor of the characters range, which is either <p class="ONE"> or \G, we end with the regex (?i-s)(?:<p class="ONE">|\G).*?\K\h*!\h*, as above !

    • The important point to understand is that the range of chars, before reaching the searched expression, consists of standard characters. So, when the end of tag <p class="ONE">...........</p> is reached, the only way to go on is to skip the EOL characters, located after </p>. But, in that case, the \G assertion is not verified anymore and, necessarily, the next match will have, first, to search for the other anchor <p class="ONE"> !


    If we use the free-spacing mode, our regex can be expressed as :

    
    (?xi-s)               #  FREE-SPACING mode, INSENSIBLE to case, any DOT = 1 STANDARD character ONLY
    (?:                   #  NON-CAPTURING group
      <p\ class="ONE">    #    LITERAL string <p class="ONE"> ( Note the ESCAPED SPACE char, after <p )
    |                     #  The ALTERNATION symbol
      \G                  #    MATCHES the position RIGHT AFTER the PREVIOUS match, ONLY
    )                     #  End of the NON-CAPTURING group
     .*?                  #  The SHORTEST range, possiblY NUL, of STANDARD characters till ...
     \K                   #  CANCELS any MATCH so far and RESETS the regex ENGINE position to PRESENT position
    \h*  !  \h*           #  ... the SEARCHED expression, so an EXCLAMATION MARK, possibly PRECEDED and/or FOLLOWED with HORIZONTAL BLANK characters
    

    Best Regards,

    guy038



  • thank you very much



  • @guy038

    Isn’t this is just another variant on THIS ??



  • @Alan-Kilborn you can compare also the solutions on both topics, see yourself if is the same and if it can be applied in the same context :)



  • Hello, @alan-kilborn, @robin-cruise and All,

    See the updated version of this post, with the @alan-kilborn advices at    https://community.notepad-plus-plus.org/post/62123


    Alan, you quite right about it. For instance, the three main search regexes that I provided to @robin-cruise, expressed with the free-spacing mode, are, finally :

    Regex A   (?xs)    (?:  <My\ Tag>         |  \G  )    ((?!^<).)+?    <a\ href="    \K               (?=/)
    
    Regex B   (?xs)    (?:  <!--\ BEGIN\ -->  |  \G  )    ((?!^<).)+?    <a\ href="    \K               (?=/)
    
    Regex C   (?xi-s)  (?:  <p\ class="ONE">  |  \G  )        .*?                      \K    \h*!\h*
    

    They follow the generic scheme, below :

    SEARCH (?-s)(BR|\G)((?!ER).)*?\KSR        OR        (?s)(BR|\G)((?!ER).)*?\KSR

    REPLACE RR

    where :

    • BR ( Begining Regex ) is the regex which defines the start of the specific area to look for a possible Search Regex match

    • ER ( Excluded Regex ) is the regex which defines the characters and/or strings forbidden, from the Begining Regex position until a next Search Regex match. It, implicitly, defines a zone, where the Search Regex may occur and not elsewhere !

    • SR ( Search Regex ) is the regex which defines the expression to search for, if , both, the Begining Regex has been matched and the Excluded Regex has not been matched so far, at any position

    • RR ( Replace Regex ) is simply the regex which defines the regex expression replacing the Search Regex

    Note that, when the ER zone is not needed , these S/R can be simplified as :

    SEARCH (?-s)(BR|\G).*?\KSR        OR        (?s)(BR|\G).*?\KSR


    For instance :

    • In the regex A, BR = <My Tag>, ER = ^<, SR = EMPTY string between <a\ href=" and /, RR = https://link.ca

    • In the regex B, BR = <!-- BEGIN -->, ER = ^<, SR = EMPTY string between <a\ href=" and /, RR = https://link.ca

    • In the regex C, BR = <p\ class="ONE">, ER = None, SR = \h*!\h*, RR = \x20!\x20

    Note that :

    • In regexes A and B, due to the muti-lines search with the leading (?s) modifier, an Excluding Regex is necessary to not overlap through an other section <My Tag> or <!-- BEGIN -->, starting at beginning of line. Hence the negative look-ahead (?!^<) in the expression ((?!^<).)+?

    • in regex C, the Excluded Regex is implicit as it could be written with the negative look-ahead (?![\r\n]) which is applied to each character of the shortest range .*? , hence the syntax ((?![\r\n]).)*?. Indeed, because of the leading (?-s) modifier, any char of that range will never be an EOL character. So, it defines, implicitly, a zone after the string <p\ class="ONE"> till the first </p> included, where to search for \h*!\h* and the shortest range of any standard characters can just be defined with the simple syntax (?-s).*? !

    Best Regards,

    guy038



  • @guy038 very well explained, thank you



  • @guy038

    I as well like your explanation.
    It could help people start learning how to solve these types of problems.
    Perhaps in the future posters (and especially repetitive posters asking the same questions for similar situations) could be directed to this solution to try before asking for more help.



  • @guy038 said:

    ER ( Excluded Regex ) is the regex which defines the characters and/or strings forbidden, from the Begining Regex position until a next Search Regex match. It, implicitly, defines a zone, where the Search Regex may occur and not elsewhere !

    I was trying to use this, but I’m sort of confused about the “ER”, and perhaps it is just trying to decode the sentence above.

    What I was needing to do is find, inside a function foo for a function parameter of, literally, 0xBA or 0xDE. Thus, I want to match:

    x = foo(0, 12, 0xBA, 34, 27);  // this is my foo function
    

    But foo could also be spread across several lines:

    x = foo(0, 
        12, 
        34, 
        0xDE, 
        27);  // this is another way I could write my foo function
    

    So I set up the technique this way:

    BR = foo\(
    ER = \);
    SR = 0x(BA|DE)

    to get a final search regex of (?s)(foo\(|\G)((?!\);).)*?\K0x(BA|DE)

    It seemed to work, but I really was unsure about my “ER” expression, so @guy038 , if you could comment and shed some additional light on it for me, I’d appreciate it.



  • Hi, @alan-kilborn,

    I thought it was better to write this post with Word and provide a screenshot, in order to see colored zones and some writing styles ;-))

    The sample text used is :

    x = foo(0, 
        12, 
        34, 
        0xDE, 
        12, 
        0xBA, 
        34, 
        27);  // this is another way I could write my foo function
    
    0xDE
    This is
    
    0xBA
    a test
    
    x = foo(0, 
        12, 
        34, 
        0xDE, 
        12, 
        0xBA, 
        34, 
        27);  // this is another way I could write my foo function
    
    0xDE
    This is
    
    0xBA
    a test
    

    a9772e5c-7f7a-440b-936f-96c4ebe588d9-image.png

    Best Regards,

    guy038

    Alan, as it could be difficult to rewrite all the regexes for tests, here they are, in their order of appearance :

    • (?s)(?!\);).

    • (?s)(foo\(|\G)((?!\);).)*?\K0x(BA|DE)    : Your regex

    • (?s)(foo\(|\G)((?!\);).)*?\K0x(BA|DE)(?=((?!\);).)*?\);)

    • (?s)(foo\(|\G).*?\K0x(BA|DE)

    • (?s)(foo\(|\G).*?\K0x(BA|DE)(?=.*?\);)

    Oh, I just saw the caret of my Word document, located inside the first (?s)(?!\);). regex ! Don’t pay any attention ;-))



  • @guy038

    Yes, that clarifies things; thank you for that.


    Onto a new aspect…

    Again, here’s your original general case regex:

    (?-s)(BR|\G)((?!ER).)*?\KSR

    Would it be better to express it this way?:

    (?-s)((?:BR)|\G)((?!ER).)*?\K(?:SR)

    So that the BR and SR expressions “stay together” if they are “complicated”? Or are they already totally “safe” the way you expressed them in the original? I’m not totally sure of the precedence of the | operator, and especially not the \K – is the \K of “top priority”?

    The ER already seems sufficiently “wrapped” via (?!) and shouldn’t need any more than that, although the outer grouping on ER seems as if it could be non-capturing as well, so maybe:

    (?-s)((?:BR)|\G)(?:(?!ER).)*?\K(?:SR)

    I’m not trying to take this totally off-topic into regex land, but I intend to use this technique with N++ a lot in the future, so (to me) it is worth exploring fully.



  • Hi, @alan-kilborn,

    Nice deductions, indeed ! You’re right in many ways : using non-capturing groups, everywhere, should be beneficial in all cases . :

    • Firstly, using the non-capturing group (?:(?!ER).) prevents the regex engine from storing any single character between the BR/current location and the SR, one at a time, which should increase the global performance of the overall regex ( as some code simplification in a loop ! )

    • Secondly, using the non-capturing group (?:SR) can be interesting if you should re-use a part of the SR, in the replacement part and ensures you that you just have to start with group 1 !

    • Now, I think that the first part ((?:BR)|\G) could simply be expressed as (?:BR|\G), because the zero-length assertion \G is not going to be stored, anyway ;-))


    Finally, we end with these generic expressions :

    SEARCH (?s)(?:BR|\G)(?:(?!ER).)*?\K(?:SR)    OR    (?-s)(?:BR|\G)(?:(?!ER).)*?\K(?:SR)

    REPLACE RR

    where :

    • BR ( Begining Regex ) is the regex which defines the start of the specific area to look for a possible Search Regex match

    • ER ( Excluded Regex ) is the regex which defines the characters and/or strings forbidden, from the Begining Regex position until a next Search Regex match. It, implicitly, defines areas of continuous characters, where the Search Regex must occur and not elsewhere !

    • SR ( Search Regex ) is the regex which defines the expression to search for, if , both, the Begining Regex has been matched and the Excluded Regex has not been matched so far, at any position, between BR and SR

    • RR ( Replace Regex ) is simply the regex which defines the regex expression replacing the Search Regex

    Note, that I rewrote the last part of the the ER and SR definitions !

    And, if this ER zone is not needed, these generic regexes can be simplified as :

    SEARCH (?s)(?:BR|\G).*?\K(?:SR)    OR    (?-s)(?:BR|\G).*?\K(?:SR)

    IMPORTANT : Because the ER regex implicitly defines several non-contiguous areas where SR may exist, when the regex engine skip from a zone ( the yellow area of my previous post ) to the next non-contiguous zone ( The blue area, after the ending parenthesis ), the \G is not verified anymore and only the first alternative BR must occur first to get, later, a possible match of SR


    So, your previous regex could be written as :

    SEARCH (?s)(?:foo\(|\G)(?:(?!\);).)*?\K(?:0x(BA|DE))

    And using the free-spacing mode (?x), it becomes :

    
    (?xs)  (?: foo\( | \G )  (?: (?! \); ). )*?  \K  (?: 0x(BA|DE) )        TESTED => OK
               ¯¯¯¯¯                 ¯¯¯                 ¯¯¯¯¯¯¯¯¯
                BR                   ER                     SR
    

    Best Regards,

    guy038



  • @guy038 said in Select all exclamation marks ! from a specific html tag:

    (?s)(foo\(|\G)((?!\);).)*?\K0x(BA|DE)    : Your regex

    I seem to have found a problem; with this text:

    int y = 0xBA;
    
    int z = 0xDE;
    
    int x = foo(0,
        12,
        34,
        0xDE,
        12,
        0xBA,
        34,
        27);  // this is another way I could write my foo function
    

    I get hits on the y = and z = lines, even though I thought they had to be inside the foo( and ); delimiters for there to be such hits…



  • Hello, @alan-kilborn and All,

    Unfortunately, we should have predicted such behavior !

    Basically, your regex (?s)(foo\(|\G)((?!\);).)*?\K0x(BA|DE) looks, either :

    • For the literal string foo(, followed by any char till the first literal string 0xBA or 0xDE

    • For any char , right after the previous match ( \G ) till the first literal string 0xBA or 0xDE

    So, given this sample :

    0xBA    ( Line A )
    0xBA
    
    0xDE
    
    int x = foo(0,
    
    0xBA
    
    0xDE
    
    );
    
        ( Line B )
    
    0xDE
    
    0xBA
    
    int x = foo(0,
    
    0xDE
    
    0xBA
    
    );
    
    0xBA
    
    0xDE
    
    
    0xBA
    
    0xDE
    
    int x = foo(0,
    
    0xBA
    
    0xDE
    
    );
    
    0xBA
    
    0xDE
    

    Move the caret to the very beginning of line B, for instance. Normally, as the next 0xDE is still outside a function f00 range, it should not be matched. However, it does match this occurrence ! Why ?

    Because of the combination of the (?s) modifier, which considers any char and the \G assertion : wherever your caret is located, the \G assertion is always true when your first execute your regex . Indeed, in this case, the regex engine considers that a virtual previous occurrence occurred and stopped right before the caret location. So, it will always find the nearest literal string 0xBA or 0xDE, at any location ( refer to the regex (?s)\G.*?0x(BA|DE) )


    Luckily, I found out a solution, which supposes that three hypotheses are verified :

    • You must use the N++ version 7.9.1 or a later version, which correctly handles the behavior of the \A assertion

    • You systematically must move the caret to the very beginning of current file ( implicit for a Find All in Current Document, a Find in all Opened Documents or a Find All operation ! )

    • You must use the (?!\A)\G syntax, in the overall regex ( instead of \G ! )


    So the generic regexes, of my previous post, should be improved as :

    SEARCH (?s)(?:BR|(?!\A)\G)(?:(?!ER).)*?\K(?:SR)    OR    (?-s)(?:BR|(?!\A)\G)(?:(?!ER).)*?\K(?:SR)

    And gives, for your specific regex :

    (?xs)  (?: foo\( | (?! \A ) \G )  (?: (?! \); ). )*?  \K  (?: 0x(BA|DE) )
    

    You may verify, with the provided sample, that, at soon as the caret is not at the very beginning of the first line ( Line A ), before running this improved regex, it wrongly matches the two strings 0xBA and the string 0xDE, located before the first foo\( string !

    Hence, the necessity to respect the second hypothesis above, which ensures that the \A assertion is true, before regex execution. By this means, the second alternative of BR : (?!\A)\G will not be true, at the first execution of the regex ;-)

    BR

    guy038


Log in to reply