regex: Find all lines starting with a specific tag and ending with a different tag
-
@guy038 said:
Because alternative branches are tried from left to right, and options are not reset until the end of the subpattern is reached, an option setting in one branch does affect subsequent branches
Is there something new or surprising here?
In the example you gave, this part:
(?i:friday|saturday|sunday)
corresponds to my “revised Wiki” entry of:
(?flags:searchpattern)
where in this case searchpattern = friday|saturday|sunday, but it is still just a regex pattern…
I feel like I am missing something here because my first thought is of course this is the way it works and nothing is new here.
-
Hi, @robin-cruise, @alan-kilborn and All,
Ah… yes, Robin, you’re quite right! That’s the usual drawback of working out a regex without having the real text to test it against!
I’m not an HTML coder and maybe I’m about to talk nonsense, but let’s suppose the following text, without the ending tag, at line 4:
<p class="amigo">1. Blah blah blah <br> 2. Blah blah blah <br> 3. Blah blah blah <br> 4. Blah blah blah ....
How can I decide between this case A:
<p class="amigo">1. Blah blah blah </p> 2. Blah blah blah <br> 3. Blah blah blah <br> 4. Blah blah blah <br> ....
and this case B, below?
<p class="amigo">1. Blah blah blah <br> 2. Blah blah blah <br> 3. Blah blah blah <br> 4. Blah blah blah </p> ....
Of course, feel free to send me an e-mail with your real text, if you don’t mind, just telling me where you would like to replace the <br> with </p>.
To @alan-kilborn,
Of course, I didn’t say that it was a hidden rule or a workaround! I just wanted to point out, for beginners in the “regex world”, that if the search pattern is an alternation with several branches, either:
- In a non-capturing group, as (?i:friday|saturday|sunday) or (?:(?i)friday|saturday|sunday)
- In a capturing group, as ((?i)friday|saturday|sunday)
then the modifier, (?i) in this example, affects every branch of the alternation.
Best Regards,
guy038
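For anyone who wants to check this behaviour outside Notepad++, here is a minimal sketch using Python's re module as a stand-in. That is an assumption on my side: Notepad++ relies on the Boost regex engine, and the two engines do not handle every inline-modifier form the same way (recent Python versions reject a global (?i) that is not at the very start of the pattern), so only the scoped (?i:...) form is shown. The variable names are purely illustrative.

```python
import re

# Scoped, case-insensitive group: the flag covers all three branches.
all_branches = re.compile(r"(?i:friday|saturday|sunday)")

samples = ["FRIDAY", "Saturday", "sUnDaY", "Monday"]
results = {s: all_branches.fullmatch(s) is not None for s in samples}
print(results)

# For contrast: the flag is scoped to the first branch only,
# so the other two branches stay case-sensitive.
first_only = re.compile(r"(?i:friday)|saturday|sunday")
print(first_only.fullmatch("FRIDAY") is not None)    # True
print(first_only.fullmatch("SATURDAY") is not None)  # False
```

To limit the modifier to a single branch, the group has to close before the first | , as in the second pattern.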
-
-
I think I see what you are saying: some people might mistakenly think that the (?i) only affects friday in the example above.
In general this brings up a good point, or a question: what is the operator precedence in regexes?
For example, if you have the regex friday|saturday|sunday, we know that it truly means (friday)|(saturday)|(sunday), without the capturing of course. But I suppose it could mean frida(y|s)aturda(y|s)unday, although I know it doesn’t. What I don’t know are the real rules for when one needs non-capturing parens and when one doesn’t.
Maybe this is a hard question to ask, and it isn’t really Notepad++ related…
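The two readings of friday|saturday|sunday can be compared directly. The sketch below uses Python's re module rather than Notepad++ (an assumption for illustration; alternation has the lowest precedence in both engines), and the variable names are made up for the example.

```python
import re

plain = re.compile(r"friday|saturday|sunday")
grouped = re.compile(r"(?:friday)|(?:saturday)|(?:sunday)")
narrow = re.compile(r"frida(?:y|s)aturda(?:y|s)unday")

words = ["friday", "saturday", "sunday", "fridayaturdayunday"]

for w in words:
    print(w,
          plain.fullmatch(w) is not None,
          grouped.fullmatch(w) is not None,
          narrow.fullmatch(w) is not None)

# Parentheses are needed as soon as the alternation must cover
# only part of the pattern:
print(re.fullmatch(r"gr(?:a|e)y", "grey") is not None)  # True
print(re.fullmatch(r"gra|ey", "grey") is not None)      # False: means gra OR ey
```

The plain and grouped patterns accept the three day names and nothing else, while the narrow pattern only accepts the odd string fridayaturdayunday. So a rule of thumb is: add (non-capturing) parentheses whenever | should apply to less than everything around it.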
-
Hi, @alan-kilborn and All,
Regarding the precedence of regex operators, taken from the link,
The table below gives the hierarchy of these operators, listed from the highest priority (level 1) to the lowest priority (level 8) :
- POSIX-based bracket character set : [:Class character:], [=Equivalent Class=] and [.Collating element.]
- Escaped characters : \...
- Bracket character set ( negative or not ) : [^.....] and [.....]
- Grouping ( capturing or not ) : (.....) and (?:.....)
- Quantifiers : *, +, ?, {n}, {m,n} and {m,}
- Concatenation ( implicit )
- Anchoring : ^ and $
- Alternation : |
Here are some examples to verify this hierarchy :
- Between level 1 and level 2 :
The regex [[=\=]] matches only the backslash \ and is NOT read as the regex [[==]], which is, besides, invalid !
- Between level 2 and level 3 :
The regex \[1] means the regex \[, that is the literal character [, followed by the string 1], and NOT the regex \1, which is what you would get if [1] were evaluated first as a character class matching the digit 1
- Between level 3 and level 4 :
The regex [(123)45] matches any one of the digits 1, 2, 3, 4 and 5, as well as the parentheses ( and ), and NOT the number 123, as a group, or the digits 4 or 5, which can be found with the regex (123)|[45]
- Between level 4 and level 5 :
The regex (123)+ represents the number 123, possibly repeated, and NOT the number 12 followed by a run of one or more consecutive digits 3, which can be found with the regex 123+
- Between level 5 and level 6 :
The regex 123+45+ matches the number 12, followed by a run of one or more consecutive digits 3, followed by the digit 4, followed by a run of one or more consecutive digits 5, and NOT one or more repetitions of the number 123 followed by one or more repetitions of the number 45, which can be obtained with the regex (123)+(45)+
- Between level 6 and level 7 :
I have not been able to pin down the difference between implicit concatenation of regexes ( for instance, the regex a followed by the regex b, resulting in the regex ab ) and anchoring, which defines zero-length regexes matching specific locations in the file contents !
Indeed, if we consider the simple regex ^123, to my mind, the regex ^1 immediately followed by the regex 23, the regex ^12 immediately followed by the regex 3, or even the zero-length regex ^ followed by the regex 123, all seem identical to the regex ^123 !?
A bit off topic : just notice that string concatenation does NOT represent the same concept as regex concatenation ! For instance, the regex [12] followed by the regex [34] matches all elements of the set {13, 14, 23, 24}, whereas the string 12 followed by the string 34 represents the single-element set {1234}
- Between level 7 and level 8 :
The regex ^12|34$ matches the number 12 beginning a line OR the number 34 ending a line. It does NOT match only a line consisting of the number 12 OR the number 34 ( which can be found with the regex ^(12|34)$ ), NOR a line beginning with the digit 1, ending with the digit 4 and having either the digit 2 OR the digit 3 in between ( which can be found with the regex ^1(2|3)4$ )
Best regards,
Merry Christmas and Happy Holidays to all ;-))
guy038
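The examples for levels 3 to 8 above can be checked with a short script. This is only a sketch using Python's re module, which is an assumption for illustration: Notepad++ uses the Boost engine, and the level 1 and level 2 examples are left out because Python's re does not support POSIX equivalence classes such as [[=\=]]. The helper names full and found are made up for this example.

```python
import re
from itertools import product

def full(pattern, text):
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

def found(pattern, text):
    """True if pattern matches somewhere in text."""
    return re.search(pattern, text) is not None

checks = {
    # Levels 3/4: inside [...], ( and ) are plain characters, not a group.
    "class_paren": full(r"[(123)45]", "("),                # True
    "class_123": full(r"[(123)45]", "123"),                # False
    "group_123": full(r"(123)|[45]", "123"),               # True
    # Levels 4/5: a quantifier applies to the group, else to the last atom.
    "grouped_repeat": full(r"(123)+", "123123"),           # True
    "bare_repeat": full(r"123+", "123123"),                # False
    "bare_threes": full(r"123+", "12333"),                 # True
    # Levels 5/6: quantifiers bind tighter than concatenation.
    "digit_runs": full(r"123+45+", "12333455"),            # True
    "blocks_bare": full(r"123+45+", "1231234545"),         # False
    "blocks_grouped": full(r"(123)+(45)+", "1231234545"),  # True
    # Levels 7/8: anchors bind tighter than alternation.
    "anchor_alt": found(r"^12|34$", "1299"),               # True
    "anchor_group": found(r"^(12|34)$", "1299"),           # False
    "anchor_inner": found(r"^1(2|3)4$", "1299"),           # False
}

for name, value in checks.items():
    print(name, value)

# Regex concatenation versus string concatenation:
# [12][34] matches every pairing of one character from each class.
pairs = {a + b for a, b in product("12", "34")}
print(sorted(pairs))                                # ['13', '14', '23', '24']
print(all(full(r"[12][34]", p) for p in pairs))     # True
print(full(r"[12][34]", "1234"))                    # False
```

The line 1299 starts with 12 but is neither 12 nor 34 alone, which is why only the ungrouped ^12|34$ finds something in it.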
P.S. :
I’ve also found a great article on operator precedence in the main programming and scripting languages ;-)) Just click below :
- POSIX based Bracket Character set :