Select all exclamation marks ! from a specific html tag

guy038

Hello, @alan-kilborn, @robin-cruise and All,

See the updated version of this post, incorporating @alan-kilborn's advice, at https://community.notepad-plus-plus.org/post/62123

Alan, you're quite right about it. For instance, the three main search regexes that I provided to @robin-cruise, expressed in free-spacing mode, are, finally :

Regex A   (?xs)    (?:  <My\ Tag>         |  \G  )    ((?!^<).)+?    <a\ href="    \K               (?=/)

Regex B   (?xs)    (?:  <!--\ BEGIN\ -->  |  \G  )    ((?!^<).)+?    <a\ href="    \K               (?=/)

Regex C   (?xi-s)  (?:  <p\ class="ONE">  |  \G  )        .*?                      \K    \h*!\h*

They follow the generic scheme, below :

SEARCH (?-s)(BR|\G)((?!ER).)*?\KSR OR (?s)(BR|\G)((?!ER).)*?\KSR

REPLACE RR

where :

BR ( Beginning Regex ) is the regex which defines the start of the specific area in which to look for a possible Search Regex match
ER ( Excluded Regex ) is the regex which defines the characters and/or strings that are forbidden between the Beginning Regex position and the next Search Regex match. It implicitly defines a zone where the Search Regex may occur, and nowhere else !
SR ( Search Regex ) is the regex which defines the expression to search for, provided that the Beginning Regex has been matched and the Excluded Regex has not been matched, so far, at any position
RR ( Replace Regex ) is simply the replacement expression which replaces the Search Regex match

Note that, when the ER zone is not needed, these search regexes can be simplified as :

SEARCH (?-s)(BR|\G).*?\KSR OR (?s)(BR|\G).*?\KSR

For instance :

In regex A, BR = <My Tag>, ER = ^<, SR = the EMPTY string between <a\ href=" and /, RR = https://link.ca
In regex B, BR = <!--\ BEGIN\ -->, ER = ^<, SR = the EMPTY string between <a\ href=" and /, RR = https://link.ca
In regex C, BR = <p\ class="ONE">, ER = None, SR = \h*!\h*, RR = \x20!\x20

Note that :

In regexes A and B, because of the multi-line search enabled by the leading (?s) modifier, an Excluded Regex is necessary so that the match does not run over into another section <My Tag> or <!-- BEGIN -->, starting at the beginning of a line. Hence the negative look-ahead (?!^<) in the expression ((?!^<).)+?
In regex C, the Excluded Regex is implicit : it could be written as the negative look-ahead (?![\r\n]), applied to each character of the shortest range .*?, which gives the syntax ((?![\r\n]).)*?. However, because of the leading (?-s) modifier, no character of that range can ever be an EOL character. So the regex implicitly defines a zone, from the string <p\ class="ONE"> up to the end of the current line ( the closing </p> included, when the paragraph sits on a single line ), in which to search for \h*!\h*, and the shortest range of standard characters can simply be written (?-s).*? !
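
The generic scheme can also be tried outside Notepad++. The sketch below is only an approximation in Python, whose standard re module supports neither \G nor \K : instead of one single regex, it locates each BR, ends the zone at the first ER ( or at the end of the text ) and searches for SR inside that zone only. An SR match that would extend past the end of the zone is therefore not reported. The function name find_in_zones and the sample text are made up for the illustration.

```python
import re

def find_in_zones(text, br, sr, er=None):
    # Rough stand-in for (BR|\G)((?!ER).)*?\KSR without \G or \K:
    # every BR match opens a zone that ends just before the next ER match
    # (or at the end of the text), and SR is searched inside that zone only.
    br_re, sr_re = re.compile(br), re.compile(sr)
    er_re = re.compile(er) if er is not None else None
    found = []
    pos = 0
    while pos <= len(text):
        opener = br_re.search(text, pos)
        if opener is None:
            break
        zone_begin = opener.end()
        closer = er_re.search(text, zone_begin) if er_re else None
        zone_end = closer.start() if closer else len(text)
        found.extend(m.span() for m in sr_re.finditer(text, zone_begin, zone_end))
        # Look for the next BR after this zone (always moving forward)
        pos = max(zone_end, opener.start() + 1)
    return found

# Hypothetical sample in the spirit of regex C: ER is the end of line,
# and [ \t] replaces \h, which Python's re does not know.
sample = '<p class="ONE">Hi  !there!</p>\n<p class="TWO">No!</p>\n'
spans = find_in_zones(sample, r'<p class="ONE">', r'[ \t]*![ \t]*', er=r'[\r\n]')
print([sample[a:b] for a, b in spans])   # ['  !', '!']
```

Only the two exclamation marks of the ONE paragraph are reported; the one in the TWO paragraph lies outside any zone.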

Best Regards,

guy038

Robin Cruise

@guy038 very well explained, thank you

Alan Kilborn

@guy038

I like your explanation as well.
It could help people start learning how to solve these types of problems.
Perhaps in the future posters (and especially repetitive posters asking the same questions for similar situations) could be directed to this solution to try before asking for more help.

Alan Kilborn

@guy038 said:

ER ( Excluded Regex ) is the regex which defines the characters and/or strings forbidden, from the Beginning Regex position until a next Search Regex match. It, implicitly, defines a zone, where the Search Regex may occur and not elsewhere !

I was trying to use this, but I’m sort of confused about the “ER”; perhaps it is just a matter of decoding the sentence above.

What I needed to do is find, inside a call to a function foo, a parameter that is, literally, 0xBA or 0xDE. Thus, I want to match:

x = foo(0, 12, 0xBA, 34, 27);  // this is my foo function

But foo could also be spread across several lines:

x = foo(0, 
    12, 
    34, 
    0xDE, 
    27);  // this is another way I could write my foo function

So I set up the technique this way:

BR = foo\(
ER = \);
SR = 0x(BA|DE)

to get a final search regex of (?s)(foo\(|\G)((?!\);).)*?\K0x(BA|DE)

It seemed to work, but I really was unsure about my “ER” expression, so @guy038 , if you could comment and shed some additional light on it for me, I’d appreciate it.
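
One way to see what the “ER” does here is to capture the stretch of text that ((?!\);).)* is allowed to walk across. Python’s re module lacks \G and \K but does support this “any character that does not start ER” construct, so the sketch below (an illustration with a made-up sample, not the Notepad++ regex itself) captures everything from foo( up to, but not including, the first );

```python
import re

# Made-up sample: one multi-line foo call, then a value outside of it
text = """x = foo(0,
    12,
    0xDE,
    27);  // my foo function
y = 0xBA;
"""

# (?s) lets . cross line ends; (?!\);) stops the walk right before ");"
zone = re.search(r"(?s)foo\(((?:(?!\);).)*)", text)
print(repr(zone.group(1)))

# SR is then searched only inside that captured zone
hits = re.findall(r"0x(?:BA|DE)", zone.group(1))
print(hits)   # ['0xDE']
```

The 0xBA on the last line is not reported, because the captured zone stops before the ); that closes the call.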

guy038

Hi, @alan-kilborn,

I thought it was better to write this post with Word and provide a screenshot, in order to see colored zones and some writing styles ;-))

The sample text used is :

x = foo(0, 
    12, 
    34, 
    0xDE, 
    12, 
    0xBA, 
    34, 
    27);  // this is another way I could write my foo function

0xDE
This is

0xBA
a test

Best Regards,

guy038

Alan, as retyping all the regexes for your tests could be tedious, here they are, in their order of appearance :

(?s)(?!\);).
(?s)(foo\(|\G)((?!\);).)*?\K0x(BA|DE) : Your regex
(?s)(foo\(|\G)((?!\);).)*?\K0x(BA|DE)(?=((?!\);).)*?\);)
(?s)(foo\(|\G).*?\K0x(BA|DE)
(?s)(foo\(|\G).*?\K0x(BA|DE)(?=.*?\);)

Oh, I just saw the caret of my Word document, located inside the first (?s)(?!\);). regex ! Don’t pay any attention ;-))

Alan Kilborn

@guy038

Yes, that clarifies things; thank you for that.

Onto a new aspect…

Again, here’s your original general case regex:

(?-s)(BR|\G)((?!ER).)*?\KSR

Would it be better to express it this way?:

(?-s)((?:BR)|\G)((?!ER).)*?\K(?:SR)

So that the BR and SR expressions “stay together” if they are “complicated”? Or are they already totally “safe” the way you expressed them in the original? I’m not totally sure of the precedence of the | operator, and especially not the \K – is the \K of “top priority”?

The ER already seems sufficiently “wrapped” via (?!…) and shouldn’t need any more than that, although the outer grouping on ER seems as if it could be non-capturing as well, so maybe:

(?-s)((?:BR)|\G)(?:(?!ER).)*?\K(?:SR)

I’m not trying to take this totally off-topic into regex land, but I intend to use this technique with N++ a lot in the future, so (to me) it is worth exploring fully.

guy038

Hi, @alan-kilborn,

Nice deductions, indeed ! You’re right in many ways : using non-capturing groups everywhere should be beneficial in all cases :

Firstly, using the non-capturing group (?:(?!ER).) spares the regex engine from storing, one at a time, every single character between the BR/current location and the SR, which should improve the overall performance of the regex ( like simplifying the code inside a loop ! )
Secondly, using the non-capturing group (?:SR) is useful if you need to re-use a part of the SR in the replacement : as no other group captures anything, the groups that you define inside SR are numbered from group 1 !
Now, I think that the first part ((?:BR)|\G) could simply be expressed as (?:BR|\G), because the zero-length assertion \G is not going to be stored, anyway ;-))
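
The effect on group numbering can be checked with a small Python sketch. It is only an illustration : \G and \K are left out because Python’s re module does not support them, and they do not create groups anyway.

```python
import re

# Same BR / ER / SR as in the foo example, once with capturing groups
# and once with non-capturing groups around BR and the ER step.
capturing     = re.compile(r"(?s)(foo\()((?!\);).)*?0x(BA|DE)")
non_capturing = re.compile(r"(?s)(?:foo\()(?:(?!\);).)*?0x(BA|DE)")

text = "x = foo(0, 12, 0xBA, 34, 27);"

print(capturing.groups)                      # 3
print(non_capturing.groups)                  # 1
print(capturing.search(text).group(3))       # BA
print(non_capturing.search(text).group(1))   # BA
```

With the non-capturing form, the (BA|DE) part of SR is simply group 1, whatever BR and ER look like.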

Finally, we end with these generic expressions :

SEARCH (?s)(?:BR|\G)(?:(?!ER).)*?\K(?:SR) OR (?-s)(?:BR|\G)(?:(?!ER).)*?\K(?:SR)

REPLACE RR

where :

BR ( Beginning Regex ) is the regex which defines the start of the specific area in which to look for a possible Search Regex match
ER ( Excluded Regex ) is the regex which defines the characters and/or strings that are forbidden between the Beginning Regex position and the next Search Regex match. It implicitly defines areas of contiguous characters, in which the Search Regex must occur, and nowhere else !
SR ( Search Regex ) is the regex which defines the expression to search for, provided that the Beginning Regex has been matched and the Excluded Regex has not been matched, so far, at any position between BR and SR
RR ( Replace Regex ) is simply the replacement expression which replaces the Search Regex match

Note that I rewrote the last part of the ER and SR definitions !

And, if this ER zone is not needed, these generic regexes can be simplified as :

SEARCH (?s)(?:BR|\G).*?\K(?:SR) OR (?-s)(?:BR|\G).*?\K(?:SR)

IMPORTANT : Because the ER regex implicitly defines several non-contiguous areas where SR may exist, when the regex engine skips from one zone ( the yellow area of my previous post ) to the next non-contiguous zone ( the blue area, after the closing parenthesis ), the \G assertion no longer holds, so the first alternative BR must match again before any further match of SR is possible

So, your previous regex could be written as :

SEARCH (?s)(?:foo\(|\G)(?:(?!\);).)*?\K(?:0x(BA|DE))

And using the free-spacing mode (?x), it becomes :


(?xs)  (?: foo\( | \G )  (?: (?! \); ). )*?  \K  (?: 0x(BA|DE) )        TESTED => OK
           ¯¯¯¯¯                 ¯¯¯                 ¯¯¯¯¯¯¯¯¯
            BR                   ER                     SR

Best Regards,

guy038

Alan Kilborn

@guy038 said in Select all exclamation marks ! from a specific html tag:

(?s)(foo\(|\G)((?!\);).)*?\K0x(BA|DE) : Your regex

I seem to have found a problem; with this text:

int y = 0xBA;

int z = 0xDE;

int x = foo(0,
    12,
    34,
    0xDE,
    12,
    0xBA,
    34,
    27);  // this is another way I could write my foo function

I get hits on the y = and z = lines, even though I thought they had to be inside the foo( and ); delimiters for there to be such hits…

guy038

Hello, @alan-kilborn and All,

Unfortunately, we should have predicted such behavior !

Basically, your regex (?s)(foo\(|\G)((?!\);).)*?\K0x(BA|DE) looks for either :

The literal string foo(, followed by any characters, without crossing the string );, up to the first literal string 0xBA or 0xDE
Any characters, right after the previous match ( \G ), without crossing the string );, up to the first literal string 0xBA or 0xDE

So, given this sample :

0xBA    ( Line A )
0xBA

0xDE

int x = foo(0,

0xBA

0xDE

);

    ( Line B )

0xDE

0xBA

int x = foo(0,

0xDE

0xBA

);

0xBA

0xDE


0xBA

0xDE

int x = foo(0,

0xBA

0xDE

);

0xBA

0xDE

Move the caret to the very beginning of line B, for instance. Normally, as the next 0xDE is still outside any foo function range, it should not be matched. However, the regex does match this occurrence ! Why ?

Because of the combination of the (?s) modifier, which lets the dot match any character, EOL included, and the \G assertion : wherever your caret is located, the \G assertion is always true when you first execute your regex. Indeed, in this case, the regex engine behaves as if a virtual previous match had ended right at the caret location. So, it will always find the nearest literal string 0xBA or 0xDE, at any location ( refer to the regex (?s)\G.*?0x(BA|DE) )

Luckily, I found a solution, which relies on three hypotheses :

You must use N++ version 7.9.1 or later, which correctly handles the \A assertion
You must always move the caret to the very beginning of the current file before searching ( implicit for a Find All in Current Document, a Find in all Opened Documents or a Find All operation ! )
You must use the (?!\A)\G syntax in the overall regex ( instead of \G ! )

So the generic regexes, of my previous post, should be improved as :

SEARCH (?s)(?:BR|(?!\A)\G)(?:(?!ER).)*?\K(?:SR) OR (?-s)(?:BR|(?!\A)\G)(?:(?!ER).)*?\K(?:SR)

Which gives, for your specific regex :

(?xs)  (?: foo\( | (?! \A ) \G )  (?: (?! \); ). )*?  \K  (?: 0x(BA|DE) )

You may verify, with the provided sample, that, as soon as the caret is not at the very beginning of the first line ( Line A ) before running this improved regex, it wrongly matches the two strings 0xBA and the string 0xDE located before the first foo( string !

Hence, the necessity to respect the second hypothesis above, which ensures that the \A assertion is true, before regex execution. By this means, the second alternative of BR : (?!\A)\G will not be true, at the first execution of the regex ;-)
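
The behavior described above can be imitated with a rough Python sketch. It is an emulation under assumptions, not the Notepad++ engine : Python’s re module has no \G, so each alternative of (?:foo\(|\G) gets its own pattern, the \G case is anchored at the caret with match(), and the (?!\A) guard is modeled by the test caret > 0. The function name first_hit and the short sample are made up.

```python
import re

# Shared tail: walk lazily without crossing ");", then capture SR
SCAN = r"(?:(?!\);).)*?(0x(?:BA|DE))"
after_br = re.compile(r"(?s)foo\(" + SCAN)   # the BR alternative
after_g = re.compile(r"(?s)" + SCAN)         # the \G alternative

def first_hit(text, caret, guard):
    # Offset of the first SR reported by a search started at `caret`.
    # guard=False imitates plain \G, guard=True imitates (?!\A)\G.
    g_allowed = caret > 0 if guard else True
    if g_allowed:
        m = after_g.match(text, caret)   # \G holds only at the caret itself
        if m:
            return m.start(1)
    m = after_br.search(text, caret)
    return m.start(1) if m else None

text = "0xBA\n0xDE\nint x = foo(0,\n0xBA\n);\n0xDE\n"

print(first_hit(text, 0, guard=False))  # 0  : stray 0xBA on the first line
print(first_hit(text, 0, guard=True))   # 25 : first 0xBA inside foo( ... );
print(first_hit(text, 1, guard=True))   # 5  : stray 0xDE, caret not at file start
```

The guarded form only gives the expected first hit when the search starts at offset 0, which is the second hypothesis above.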

Best Regards,

guy038

Alan Kilborn

@guy038 said in Select all exclamation marks ! from a specific html tag:

So the generic regexes, of my previous post, should be improved as :
SEARCH (?s)(?:BR|(?!\A)\G)(?:(?!ER).)*?\K(?:SR) OR (?-s)(?:BR|(?!\A)\G)(?:(?!ER).)*?\K(?:SR)

So, Guy, just a note again to say thanks for this.
I have employed it 3 or 4 times in the last week, and I anticipate much more usage in the future.
Very handy!

One good example is in a section of a log file I have to process repeatedly.
The section starts with certain line contents and ends with certain other line contents (thus BR and ER).
Inside this section there are subsection headers (that have a consistent pattern to their format), and also “WARNING”, “ERROR”, “FAILED”, etc. text that follows the subsection headers (identifying problems within that subsection).
By combining the headers and the error text bits in an OR’d together regex (to form the SR), I can create some nice output (in the Search result window) that clearly identifies the subsections that have “problems” and those that are “clean”.

So very nice.