Select all exclamation marks ! from a specific html tag
-
<p class="ONE">LoveThePlanet products are free from ! parabens and palm oil, despatched in paper envelopes !as well</p>
<p class="TWO">LoveThePlanet products are free from ! parabens and palm oil, despatched in paper envelopes !as well</p>
So, I want to select all exclamation marks
!
from the tag<p class="ONE">...</p>
and to replace it with a space before and after itI made a regex, but it select also the second tag, but I want to modify only the first one (
<p class="ONE">
)FIND:
(?:<p class="one">|\G)[^|"]*\K \!\h|\!\h*
REPLACE BY:
\x20!\x20
So, the single problem is that my regex will modify also the second html tag
<p class="TWO">
, and I want to modify only the first html tag<p class="ONE">
-
Hello, @robin-cruise and All,
I suppose that the following regex S/R should do the trick :
SEARCH
(?i-s)(?:<p class="ONE">|\G).*?\K\h*!\h*
REPLACE
\x20!\x20
Here are the key points and steps :
-
We want to look for any exclamation mark, possibly preceded and/or followed with horizontal blanks characters ( the character class
[\t\x20\xA0]
) so the simple regex\h*!\h*
and replace it with an exclamation mark surrounded with a space char so the regex\x20!\x20
-
We also want to do this search ONLY IF the tag is
<p class="ONE">
. As, in your example, you’re using the upper and lower case, I suppose that the correct regex should be(?i)<p class="ONE">
, with a searchI
nsensible to case. If not, use either(?-i)<p class="ONE">
or(?-i)<p class="one">
-
Now, the main idea is :
-
To look for the shortest range of standard characters between the tag
<p class="ONE">
and our searched expression\h*!\h*
, with the lazy quantifier*?
and select only the searched expression, with the\K
feature, so the regex(?i-s)<p class="ONE">.*?\K\h*!\h*
-
To look, again, for the shortest range of standard chars from the position right after the end of the previous match, with the
\G
assertion and our searched expression, with the regex\G.*?\K\h*!\h*
-
And so on… … If we factor the anchor of the characters range, which is either
<p class="ONE">
or\G
, we end with the regex(?i-s)(?:<p class="ONE">|\G).*?\K\h*!\h*
, as above !
-
-
The important point to understand is that the range of chars, before reaching the searched expression, consists of standard characters. So, when the end of tag
<p class="ONE">...........</p>
is reached, the only way to go on is to skip theEOL
characters, located after</p>
. But, in that case, the\G
assertion is not verified anymore and, necessarily, the next match will have, first, to search for the other anchor<p class="ONE">
!
If we use the
free-spacing
mode, our regex can be expressed as :(?xi-s) # FREE-SPACING mode, INSENSIBLE to case, any DOT = 1 STANDARD character ONLY (?: # NON-CAPTURING group <p\ class="ONE"> # LITERAL string <p class="ONE"> ( Note the ESCAPED SPACE char, after <p ) | # The ALTERNATION symbol \G # MATCHES the position RIGHT AFTER the PREVIOUS match, ONLY ) # End of the NON-CAPTURING group .*? # The SHORTEST range, possiblY NUL, of STANDARD characters till ... \K # CANCELS any MATCH so far and RESETS the regex ENGINE position to PRESENT position \h* ! \h* # ... the SEARCHED expression, so an EXCLAMATION MARK, possibly PRECEDED and/or FOLLOWED with HORIZONTAL BLANK characters
Best Regards,
guy038
-
-
thank you very much
-
-
@Alan-Kilborn you can compare also the solutions on both topics, see yourself if is the same and if it can be applied in the same context :)
-
Hello, @alan-kilborn, @robin-cruise and All,
See the updated version of this post, with the @alan-kilborn advices at https://community.notepad-plus-plus.org/post/62123
Alan, you quite right about it. For instance, the three main search regexes that I provided to @robin-cruise, expressed with the free-spacing mode, are, finally :
Regex A (?xs) (?: <My\ Tag> | \G ) ((?!^<).)+? <a\ href=" \K (?=/) Regex B (?xs) (?: <!--\ BEGIN\ --> | \G ) ((?!^<).)+? <a\ href=" \K (?=/) Regex C (?xi-s) (?: <p\ class="ONE"> | \G ) .*? \K \h*!\h*
They follow the generic scheme, below :
SEARCH
(?-s)(
BR|\G)((?!
ER).)*?\K
SR OR(?s)(
BR|\G)((?!
ER).)*?\K
SRREPLACE RR
where :
-
BR ( Begining Regex ) is the regex which defines the start of the specific area to look for a possible Search Regex match
-
ER ( Excluded Regex ) is the regex which defines the characters and/or strings
forbidden
, from the Begining Regex position until a next Search Regex match. It, implicitly, defines a zone, where the Search Regex may occur and not elsewhere ! -
SR ( Search Regex ) is the regex which defines the expression to search for, if , both, the Begining Regex has been matched and the Excluded Regex has not been matched so far, at any position
-
RR ( Replace Regex ) is simply the regex which defines the regex expression replacing the Search Regex
Note that, when the ER zone is not needed , these S/R can be simplified as :
SEARCH
(?-s)(
BR|\G).*?\K
SR OR(?s)(
BR|\G).*?\K
SR
For instance :
-
In the regex A, BR =
<My Tag>
, ER =^<
, SR = EMPTY string between<a\ href="
and/
, RR =https://link.ca
-
In the regex B, BR =
<!-- BEGIN -->
, ER =^<
, SR = EMPTY string between<a\ href="
and/
, RR =https://link.ca
-
In the regex C, BR =
<p\ class="ONE">
, ER =None
, SR =\h*!\h*
, RR =\x20!\x20
Note that :
-
In regexes A and B, due to the muti-lines search with the leading
(?s)
modifier, an Excluding Regex is necessary to not overlap through an other section<My Tag>
or<!-- BEGIN -->
, starting at beginning of line. Hence the negative look-ahead(?!^<)
in the expression((?!^<).)+?
-
in regex C, the Excluded Regex is implicit as it could be written with the negative look-ahead
(?![\r\n])
which is applied to each character of the shortest range.*?
, hence the syntax((?![\r\n]).)*?
. Indeed, because of the leading(?-s)
modifier, any char of that range will never be an EOL character. So, it defines, implicitly, a zone after the string<p\ class="ONE">
till the first</p>
included, where to search for\h*!\h*
and the shortest range of any standard characters can just be defined with the simple syntax(?-s).*?
!
Best Regards,
guy038
-
-
@guy038 very well explained, thank you
-
I as well like your explanation.
It could help people start learning how to solve these types of problems.
Perhaps in the future posters (and especially repetitive posters asking the same questions for similar situations) could be directed to this solution to try before asking for more help. -
@guy038 said:
ER ( Excluded Regex ) is the regex which defines the characters and/or strings forbidden, from the Begining Regex position until a next Search Regex match. It, implicitly, defines a zone, where the Search Regex may occur and not elsewhere !
I was trying to use this, but I’m sort of confused about the “ER”, and perhaps it is just trying to decode the sentence above.
What I was needing to do is find, inside a function
foo
for a function parameter of, literally,0xBA
or0xDE
. Thus, I want to match:x = foo(0, 12, 0xBA, 34, 27); // this is my foo function
But
foo
could also be spread across several lines:x = foo(0, 12, 34, 0xDE, 27); // this is another way I could write my foo function
So I set up the technique this way:
BR =
foo\(
ER =\);
SR =0x(BA|DE)
to get a final search regex of
(?s)(foo\(|\G)((?!\);).)*?\K0x(BA|DE)
It seemed to work, but I really was unsure about my “ER” expression, so @guy038 , if you could comment and shed some additional light on it for me, I’d appreciate it.
-
Hi, @alan-kilborn,
I thought it was better to write this post with Word and provide a screenshot, in order to see colored zones and some writing styles ;-))
The sample text used is :
x = foo(0, 12, 34, 0xDE, 12, 0xBA, 34, 27); // this is another way I could write my foo function 0xDE This is 0xBA a test x = foo(0, 12, 34, 0xDE, 12, 0xBA, 34, 27); // this is another way I could write my foo function 0xDE This is 0xBA a test
Best Regards,
guy038
Alan, as it could be difficult to rewrite all the regexes for tests, here they are, in their order of appearance :
-
(?s)(?!\);).
-
(?s)(foo\(|\G)((?!\);).)*?\K0x(BA|DE)
: Your regex -
(?s)(foo\(|\G)((?!\);).)*?\K0x(BA|DE)(?=((?!\);).)*?\);)
-
(?s)(foo\(|\G).*?\K0x(BA|DE)
-
(?s)(foo\(|\G).*?\K0x(BA|DE)(?=.*?\);)
Oh, I just saw the caret of my Word document, located inside the first
(?s)(?!\);).
regex ! Don’t pay any attention ;-)) -
-
Yes, that clarifies things; thank you for that.
Onto a new aspect…
Again, here’s your original general case regex:
(?-s)(
BR|\G)((?!
ER).)*?\K
SRWould it be better to express it this way?:
(?-s)((?:
BR)|\G)((?!
ER).)*?\K(?:
SR)
So that the BR and SR expressions “stay together” if they are “complicated”? Or are they already totally “safe” the way you expressed them in the original? I’m not totally sure of the precedence of the
|
operator, and especially not the\K
– is the\K
of “top priority”?The ER already seems sufficiently “wrapped” via
(?!
…)
and shouldn’t need any more than that, although the outer grouping onER
seems as if it could be non-capturing as well, so maybe:(?-s)((?:
BR)|\G)(?:(?!
ER).)*?\K(?:
SR)
I’m not trying to take this totally off-topic into regex land, but I intend to use this technique with N++ a lot in the future, so (to me) it is worth exploring fully.
-
Hi, @alan-kilborn,
Nice deductions, indeed ! You’re right in many ways : using non-capturing groups, everywhere, should be beneficial in all cases . :
-
Firstly, using the non-capturing group
(?:(?!
ER).)
prevents the regex engine from storing any single character between the BR/current location and the SR, one at a time, which should increase the global performance of the overall regex ( as some code simplification in a loop ! ) -
Secondly, using the non-capturing group
(?:
SR)
can be interesting if you should re-use a part of the SR, in the replacement part and ensures you that you just have to start with group1
! -
Now, I think that the first part
((?:
BR)|\G)
could simply be expressed as(?:
BR|\G)
, because the zero-length assertion\G
is not going to be stored, anyway ;-))
Finally, we end with these generic expressions :
SEARCH
(?s)(?:
BR|\G)(?:(?!
ER).)*?\K(?:
SR)
OR(?-s)(?:
BR|\G)(?:(?!
ER).)*?\K(?:
SR)
REPLACE RR
where :
-
BR ( Begining Regex ) is the regex which defines the start of the specific area to look for a possible Search Regex match
-
ER ( Excluded Regex ) is the regex which defines the characters and/or strings
forbidden
, from the Begining Regex position until a next Search Regex match. It, implicitly, defines areas of continuous characters, where the Search Regex must occur and not elsewhere ! -
SR ( Search Regex ) is the regex which defines the expression to search for, if , both, the Begining Regex has been matched and the Excluded Regex has not been matched so far, at any position, between BR and SR
-
RR ( Replace Regex ) is simply the regex which defines the regex expression replacing the Search Regex
Note, that I rewrote the last part of the the ER and SR definitions !
And, if this ER zone is not needed, these generic regexes can be simplified as :
SEARCH
(?s)(?:
BR|\G).*?\K(?:
SR)
OR(?-s)(?:
BR|\G).*?\K(?:
SR)
IMPORTANT : Because the ER regex implicitly defines several non-contiguous areas where SR may exist, when the regex engine skip from a zone ( the yellow area of my previous post ) to the next non-contiguous zone ( The blue area, after the ending parenthesis ), the
\G
is not verified anymore and only the first alternative BR must occur first to get, later, a possible match of SR
So, your previous regex could be written as :
SEARCH
(?s)(?:foo\(|\G)(?:(?!\);).)*?\K(?:0x(BA|DE))
And using the free-spacing mode
(?x)
, it becomes :(?xs) (?: foo\( | \G ) (?: (?! \); ). )*? \K (?: 0x(BA|DE) ) TESTED => OK ¯¯¯¯¯ ¯¯¯ ¯¯¯¯¯¯¯¯¯ BR ER SR
Best Regards,
guy038
-
-
@guy038 said in Select all exclamation marks ! from a specific html tag:
(?s)(foo\(|\G)((?!\);).)*?\K0x(BA|DE)
: Your regexI seem to have found a problem; with this text:
int y = 0xBA; int z = 0xDE; int x = foo(0, 12, 34, 0xDE, 12, 0xBA, 34, 27); // this is another way I could write my foo function
I get hits on the
y =
andz =
lines, even though I thought they had to be inside thefoo(
and);
delimiters for there to be such hits… -
Hello, @alan-kilborn and All,
Unfortunately, we should have predicted such behavior !
Basically, your regex
(?s)(foo\(|\G)((?!\);).)*?\K0x(BA|DE)
looks, either :-
For the literal string
foo(
, followed by any char till the first literal string0xBA
or0xDE
-
For any char , right after the previous match (
\G
) till the first literal string0xBA
or0xDE
So, given this sample :
0xBA ( Line A ) 0xBA 0xDE int x = foo(0, 0xBA 0xDE ); ( Line B ) 0xDE 0xBA int x = foo(0, 0xDE 0xBA ); 0xBA 0xDE 0xBA 0xDE int x = foo(0, 0xBA 0xDE ); 0xBA 0xDE
Move the caret to the very beginning of line
B
, for instance. Normally, as the next0xDE
is still outside a functionf00
range, it should not be matched. However, it does match this occurrence ! Why ?Because of the combination of the
(?s)
modifier, which considers any char and the\G
assertion : wherever your caret is located, the\G
assertion is always true when your first execute your regex . Indeed, in this case, the regex engine considers that a virtual previous occurrence occurred and stopped right before the caret location. So, it will always find the nearest literal string0xBA
or0xDE
, at any location ( refer to the regex(?s)\G.*?0x(BA|DE)
)
Luckily, I found out a solution, which supposes that three hypotheses are verified :
-
You must use the N++ version
7.9.1
or a later version, which correctly handles the behavior of the\A
assertion -
You systematically must move the caret to the very beginning of current file ( implicit for a
Find All in Current Document
, aFind in all Opened Documents
or aFind All
operation ! ) -
You must use the
(?!\A)\G
syntax, in the overall regex ( instead of\G
! )
So the generic regexes, of my previous post, should be improved as :
SEARCH
(?s)(?:
BR|(?!\A)\G)(?:(?!
ER).)*?\K(?:
SR)
OR(?-s)(?:
BR|(?!\A)\G)(?:(?!
ER).)*?\K(?:
SR)
And gives, for your specific regex :
(?xs) (?: foo\( | (?! \A ) \G ) (?: (?! \); ). )*? \K (?: 0x(BA|DE) )
You may verify, with the provided sample, that, at soon as the caret is not at the very beginning of the first line ( Line
A
), before running this improved regex, it wrongly matches the two strings0xBA
and the string0xDE
, located before the firstfoo\(
string !Hence, the necessity to respect the second hypothesis above, which ensures that the
\A
assertion is true, before regex execution. By this means, the second alternative of BR :(?!\A)\G
will not be true, at the first execution of the regex ;-)BR
guy038
-
-
@guy038 said in Select all exclamation marks ! from a specific html tag:
So the generic regexes, of my previous post, should be improved as :
SEARCH (?s)(?:BR|(?!\A)\G)(?:(?!ER).)?\K(?:SR) OR (?-s)(?:BR|(?!\A)\G)(?:(?!ER).)?\K(?:SR)So, Guy, just a note again to say thanks for this.
I have employed it 3 or 4 times in the last week, and I anticipate much more usage in the future.
Very handy!One good example is in a section of a log file I have to process repeatedly.
The section starts with certain line contents and ends with certain other line contents (thus BR and ER).
Inside this section there are subsection headers (that have a consistent pattern to their format), and also “WARNING”, “ERROR”, “FAILED” , etc. text that follow the subsection headers (identifying problems within that subsection).
By combining the headers and the error text bits in an OR’d together regex (to form the SR),I can create some nice output (in the Search result window) that identifies clearly the subsections that have “problems” and those that are “clean”.So very nice.