    regex: Find all lines starting with a specific tag and ending with a different tag

    regex
    24 Posts 6 Posters 18.8k Views
    • PeterJonesP
      PeterJones
      last edited by

      @Robin-Cruise ,

      Oh, I think I just realized your confusion. @Terry-R used the word “bracketed” to mean “surrounded by parentheses (...)”, but you interpreted it to mean “surrounded by angle brackets <...>”.

      Given different people’s terminology in reference to parentheses/parenthesis, braces, brackets, curly brackets, square brackets, angle brackets, I try to be as explicit as possible, and often give examples of which I mean (though I sometimes fail at this).

      • PeterJonesP
        PeterJones
        last edited by

        Regexr.com calls (?-s) a “mode modifier”.

        PCRE (Perl Compatible Regular Expressions) have their origin in Perl, and I learned my regex through Perl, so when I’m confused, I go to Perl’s perlre manpage. That shows a fully-expanded non-capturing group with enabled modifiers, disabled modifiers, and a non-capturing pattern: (?adluimnsx-imnsx:pattern)

        • adluimnsx are the possible pattern modifiers to enable (where you can have 0 or more of the modifiers),
        • -imnsx are the possible pattern modifiers to disable (where you can have 0 or more of the modifiers after the -),
        • pattern is the part of the regex pattern that you want to group and match, but not capture.
        • If you don’t have modifiers, it shortens to (?:pattern);
        • if you don’t have a pattern and just want modifiers, it shortens to (?adluimnsx-imnsx).
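As a rough illustration of the modifier syntax above, here is a minimal sketch using Python's `re` module. That is an assumption made for the sake of a runnable example: Python's engine is neither Perl nor the one in Notepad++, and it supports only some of the modifier letters, but it does accept the scoped form `(?s:pattern)` and the plain non-capturing form `(?:pattern)`.

```python
import re

text = "first line\nsecond line"

# Without the s (dot-all) modifier, "." stops at a newline,
# so the match cannot run past the end of the first line.
plain = re.search(r"first.+line", text)

# (?s:...) enables dot-all only inside the group,
# so ".+" is allowed to cross the newline.
dotall = re.search(r"(?s:first.+line)", text)

# (?:...) with no modifiers only groups; it does not create a capture.
# The only captured group here is the explicit (line).
grouped = re.search(r"(?:first|second) (line)", text)

print(repr(plain.group(0)))
print(repr(dotall.group(0)))
print(grouped.groups())
```

The first search stays on one line, the second spans both lines, and the third reports a single capture because the `(?:...)` part is not counted.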
        • Terry RT
          Terry R
          last edited by

          Sorry @Robin-Cruise Peter explained the confusion well.
          @PeterJones said:

          @Terry-R used the word “bracketed” to mean…

          Yes I meant the “round brackets” ( and ). As Peter stated there are SO many different varieties it’s very easy to get confused. Your regex101 link above would have shown you what each capture group referred to and there were the “round brackets” mentioned.

          I would strongly suggest you study the various characters used that have special meaning, following the links above that Peter and myself mentioned. Unless you get these basics sorted you will have ALL sorts of problems trying to create regexes successfully.

          Terry

          • Alan KilbornA
            Alan Kilborn
            last edited by

            Be careful here. Notepad++ doesn’t use PCRE regular expressions, no matter what the N++ wiki says. It uses Boost regular expressions.

            Of these:

            (?adluimnsx-imnsx)

            Boost does not support adlun, reducing what it does support to:

            (?imsx-imsx)

            If you use one of the invalid ones, Notepad++ will say “Find: Invalid regular expression”
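The same kind of difference shows up in other engines. As an analogy only (Python's `re` module is neither PCRE nor Boost, and its set of accepted letters is its own), this small sketch checks whether a pattern compiles. The helper name `is_valid` is made up for the example.

```python
import re


def is_valid(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# i, m, s and x are accepted by Python's re module.
supported = is_valid(r"(?imsx)abc")

# n is a Perl modifier that Python's re module does not know,
# so compiling the pattern fails instead of matching anything.
unsupported = is_valid(r"(?n)abc")

print(supported, unsupported)
```

The point is only that a modifier letter copied from the Perl documentation is rejected outright by an engine that does not implement it, much like the Notepad++ message quoted above.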

            • Alan KilbornA
              Alan Kilborn
              last edited by Alan Kilborn

              But I did learn something new from Peter’s post, that does work with Boost. I didn’t know that you could include a :pattern inside, for example, (?i). So it is perfectly legal to do this:

              (?-i)b(?i:b)b to match bbb or bBb but not Bbb, BBB, bBB, etc.

              Note that the ?i only applies to what is inside the enclosing round brackets. After the closing one, the outer leading (?-i) goes back into being in effect.

              Before this new knowledge I would have achieved the same thing this way: (?-i)b((?i)b)b or even messier (?-i)b(?:(?i)b)b or (?-i)b(?i)b(?-i)b

              Note also that (?i:pattern) is a non-capturing group. And i could be s or whatever is legal (see prior post).

              Before some wiseguy points out that b[bB]b works just as well…these are just made up examples to show a technique, not true real-life searches.
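For anyone who wants to try the scoped form outside Notepad++, here is a minimal sketch in Python, whose `re` module also accepts `(?i:pattern)`. One assumption to note: Python has no leading `(?-i)`, but it is case-sensitive by default, so the pattern is written as `b(?i:b)b`.

```python
import re

# Case-sensitive by default; only the middle "b" is matched case-insensitively.
pattern = re.compile(r"b(?i:b)b")

candidates = ["bbb", "bBb", "Bbb", "BBB", "bBB"]
matched = [s for s in candidates if pattern.fullmatch(s)]

print(matched)

# (?i:...) is a non-capturing group, so the pattern has no capture groups.
print(pattern.groups)
```

Only `bbb` and `bBb` are accepted, and the group count is zero, which agrees with the note that `(?i:pattern)` does not capture.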

              • Alan KilbornA
                Alan Kilborn
                last edited by

                Continues to blah, blah, blah on and on…

                So checking the Notepad++ regex wiki: http://docs.notepad-plus-plus.org/index.php/Regular_Expressions

                I see that this syntax IS there, but I guess I never understood it because it is SO f*cked up. :-)

                Here’s what the Wiki says:

                (?:flags-not-flags ...), (?:flags-not-flags:...)
                Applies flags and not-flags to search inside the parentheses. Such a construct may have flags and may have not-flags - if it has neither, it is just a non-marking group, which is just a readability enhancer.
                

                Here’s how I would write it so that it is (hopefully) understandable:

                (?flags-notflags:searchpattern)
                Applies flags and notflags to the searchpattern inside the parentheses and forms a non-capturing group. Such a construct may optionally have flags and may optionally have -notflags ; if it has neither, it is just a simple non-capturing group. Note that the effect of flags/notflags only applies to the searchpattern inside the enclosing parentheses.
                
                • guy038G
                  guy038
                  last edited by guy038

                  Hi, all,

                  It may be a bit late, but here is a regex S/R which detects any tag, other than </p>, that ends a range <p class.....>.............<...>, whether that range sits on one line or spans several lines, and replaces it with </p>, whatever its location on the current line

                  SEARCH (?s)<p class[^<]+\K(?!</p>)<(?-s).+>

                  REPLACE </p>

                  Remark : It’s important to point out that, due to the \K syntax, this regex S/R works only if you click on the Replace All button ! ( Any step-by-step replacement, with the Replace button, will not work )

                  So, assuming this test text :

                  bla bla   <p class=“amigo”>My mother is at home.<br>   bla bla
                  
                  bla blah
                  
                  bla bla   <p class=“amigo”>My mother
                   is at 
                  
                  home.<br>   bla bla
                  
                  bla blah
                  
                  bla bla   <p class=“amigo”>My mother
                   is at home.<h>   bla bla
                  
                  bla blah
                  
                  bla bla   <p class=“amigo”>My mother is at home.</p>   ====== NOT CHANGED ======
                  
                  bla blah
                  
                  bla bla   <p class=“amigo”>My mother is at home.</h>   bla bla
                  
                  bla blah
                  
                  bla bla   <p class=“amigo”>My
                   mother
                   is at
                   home.</p>   ====== NOT CHANGED ======
                  
                  bla blah
                  
                  bla bla   <p class=“amigo”>My mother is at home.</a>   bla bla
                  
                  bla blah
                  
                  bla bla   <p class=“amigo”>My mother is at home.<p>   bla bla
                  
                  bla blah
                  
                  bla bla   <p class=“amigo”>My mother is at home.<br>   bla bla
                  
                  bla blah
                  

                  You would obtain :

                  bla bla   <p class=“amigo”>My mother is at home.</p>   bla bla
                  
                  bla blah
                  
                  bla bla   <p class=“amigo”>My mother
                   is at 
                  
                  home.</p>   bla bla
                  
                  bla blah
                  
                  bla bla   <p class=“amigo”>My mother
                   is at home.</p>   bla bla
                  
                  bla blah
                  
                  bla bla   <p class=“amigo”>My mother is at home.</p>   ====== NOT CHANGED ======
                  
                  bla blah
                  
                  bla bla   <p class=“amigo”>My mother is at home.</p>   bla bla
                  
                  bla blah
                  
                  bla bla   <p class=“amigo”>My
                   mother
                   is at
                   home.</p>   ====== NOT CHANGED ======
                  
                  bla blah
                  
                  bla bla   <p class=“amigo”>My mother is at home.</p>   bla bla
                  
                  bla blah
                  
                  bla bla   <p class=“amigo”>My mother is at home.</p>   bla bla
                  
                  bla blah
                  
                  bla bla   <p class=“amigo”>My mother is at home.</p>   bla bla
                  
                  bla blah
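The search/replace above can be approximated outside Notepad++. This is a minimal sketch in Python, under two assumptions: Python's `re` module has no `\K`, so the text before the tag is captured in group 1 and written back; and no `(?s)` / `(?-s)` switches are needed, because `[^<]+` already crosses line ends and `.` stops at a newline by default. The function name `fix_closing_tag` is made up for the example, and the sample uses straight quotes.

```python
import re

# Group 1 stands in for everything before \K: it is matched, then put back unchanged.
# (?!</p>) skips ranges that already end with </p>.
# <.+> matches the offending tag up to the last ">" on the same line.
pattern = re.compile(r"(<p class[^<]+)(?!</p>)<.+>")


def fix_closing_tag(text: str) -> str:
    """Replace the first tag that follows a '<p class' range with </p>, unless it already is </p>."""
    return pattern.sub(r"\g<1></p>", text)


sample = (
    'bla bla   <p class="amigo">My mother is at home.<br>   bla bla\n'
    'bla bla   <p class="amigo">My mother\n'
    ' is at home.<h>   bla bla\n'
    'bla bla   <p class="amigo">My mother is at home.</p>   unchanged\n'
)

fixed = fix_closing_tag(sample)
print(fixed)
```

Unlike the `\K` version, this variant does not depend on a Replace All button, since the kept text is simply re-inserted by the replacement string.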
                  

                  Best regards,

                  guy038

                  P.S. : To continue the @alan-kilborn discussion on flags, from the link below :

                  https://gammon.com.au/pcre/pcrepattern.html#SEC11

                  It is said :

                  Because alternative branches are tried from left to right, and options are not reset until the end of the subpattern is reached, an option setting in one branch does affect subsequent branches

                  So, for instance, the regex (?-i)WEDNESDAY|(?i:friday|saturday|sunday)|Monday would match :

                  • WEDNESDAY and Monday, in that exact case

                  • Friday, as well as Saturday and Sunday in any case
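The behaviour of that example can be checked with a short Python sketch. One assumption: Python's `re` module does not accept a leading `(?-i)`, but matching is case-sensitive by default, so the modifier is simply left out.

```python
import re

# Only the branches inside (?i:...) ignore case; the outer branches stay case-sensitive.
days = re.compile(r"WEDNESDAY|(?i:friday|saturday|sunday)|Monday")

samples = ["WEDNESDAY", "wednesday", "Monday", "MONDAY", "Friday", "SATURDAY", "sUnDaY"]
results = {s: days.fullmatch(s) is not None for s in samples}

for word, ok in results.items():
    print(f"{word:10} {'match' if ok else 'no match'}")
```

All three branches inside the scoped group accept any case, while `WEDNESDAY` and `Monday` are accepted only in that exact case.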

                  • Robin CruiseR
                    Robin Cruise
                    last edited by

                    Thanks everyone for help.

                    guy038 also made a beautiful regex. But there is a case where the formula does not fit. Suppose:

                    Case 1

                    <p class=“amigo”>1. Blah blah blah <br>
                          2. Blah blah blah <br>
                          3.  Blah blah blah <br>
                          4.  Blah blah blah       </p>
                    

                    Case 2

                    <p class=“amigo”>1. Blah blah blah <br>
                          2. Blah blah blah <br>
                          3.  Blah blah blah <br>
                          4.  Blah blah blah        <br>
                     new sentence here </p>
                    

                    In these 2 cases, your regex (?s)<p class[^<]+\K(?!</p>)<(?-s).+> with the replacement </p> will replace the first instance of <br>, and that will ruin the html code.

                    So, in these 2 cases, I would like to search and replace only in those paragraphs that contain a single instance of <br>. If there are more <br>, the replacement with </p> should not take place.
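The problem described here can be reproduced with a small Python sketch. It assumes a Python rendering of the regex in which a capture group stands in for `\K` (Python's `re` module has no `\K`), and it only demonstrates the unwanted replacement on Case 1; it is not a fix.

```python
import re

# Python stand-in for the regex under discussion: group 1 replaces the \K part.
pattern = re.compile(r"(<p class[^<]+)(?!</p>)<.+>")

case_1 = (
    '<p class="amigo">1. Blah blah blah <br>\n'
    '      2. Blah blah blah <br>\n'
    '      3.  Blah blah blah <br>\n'
    '      4.  Blah blah blah       </p>\n'
)

# subn also returns how many replacements were made.
result, count = pattern.subn(r"\g<1></p>", case_1)

print(result)
print("replacements:", count)
```

The first `<br>` becomes `</p>`, so the paragraph now contains two closing tags and the remaining lines fall outside it, which is exactly the damage described above.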

                    • Alan KilbornA
                      Alan Kilborn
                      last edited by

                      @guy038 said:

                      Because alternative branches are tried from left to right, and options are not reset until the end of the subpattern is reached, an option setting in one branch does affect subsequent branches

                      Is there something new or surprising here?

                      In the example you gave, this part:

                      (?i:friday|saturday|sunday)

                      It corresponds to my “revised Wiki” entry of:

                      (?flags:searchpattern)

                      where in this case:

                      searchpattern = friday|saturday|sunday

                      but it is still just a re pattern…

                      I feel like I am missing something here because my first thought is of course this is the way it works and nothing is new here.

                      • guy038G
                        guy038
                        last edited by

                        Hi, @robin-cruise, @alan-kilborn and All,

                        Ah…, yes, Robin, you’re quite right ! It’s the usual drawback of building a regex without having the real text to test it against !

                        I’m not an HTML coder and maybe I’m about to talk nonsense, but let’s suppose the following text, without the ending tag, at line 4. So :

                        <p class=“amigo”>1. Blah blah blah <br>
                              2. Blah blah blah <br>
                              3.  Blah blah blah <br>
                              4.  Blah blah blah
                              ....
                        

                        How can I decide between this case A :

                        <p class=“amigo”>1. Blah blah blah </p>
                              2. Blah blah blah <br>
                              3. Blah blah blah <br>
                              4. Blah blah blah <br>
                              ....
                        

                        And this case B, below ?

                        <p class=“amigo”>1. Blah blah blah <br>
                              2. Blah blah blah <br>
                              3. Blah blah blah <br>
                              4. Blah blah blah    </p>
                              ....
                        

                        Of course, feel free to send me an e-mail, with your true text, if you don’t mind, just telling me where you would like to replace the <br> with </p>


                        To @alan-kilborn,

                        Of course, I didn’t say that it was a hidden rule or a work-around ! I just wanted to point out, for beginners to “regex world”, the fact that, if the search pattern is an alternative, with several branches, either :

                        • In a non-capturing group, as (?i:friday|saturday|sunday) or (?:(?i)friday|saturday|sunday)

                        • In a capturing group, as ((?i)friday|saturday|sunday)

                        The modifier, (?i) in this example, affects any branch of the alternative

                        Best Regards,

                        guy038

                        • Alan KilbornA
                          Alan Kilborn
                          last edited by

                          @guy038

                          I think I see what you are saying. You are saying that some people might mistakenly think that the (?i) only affects friday in the example above.

                          In general this brings up a good point, or a question. What is the precedence of regexes?

                          For example if you have the regex friday|saturday|sunday we know that it truly means (friday)|(saturday)|(sunday) – without the capturing of course. But it could mean frida(y|s)aturda(y|s)unday I suppose – but I know it doesn’t. But I don’t know the real rules of when one needs non-capturing parens…and when one doesn’t.
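The two readings can be compared directly. This is a minimal sketch in Python (an assumption for illustration; on this point its `re` module behaves like the other engines discussed here), using non-capturing groups for the alternative reading.

```python
import re

# Alternation binds last, so each whole word is one branch.
flat = re.compile(r"friday|saturday|sunday")

# The same thing with the implied grouping written out.
explicit = re.compile(r"(?:friday)|(?:saturday)|(?:sunday)")

# The other conceivable reading, which needs explicit parentheses to get.
narrow = re.compile(r"frida(?:y|s)aturda(?:y|s)unday")

words = ["friday", "saturday", "sunday", "fridasaturdasunday"]

flat_hits = [w for w in words if flat.fullmatch(w)]
explicit_hits = [w for w in words if explicit.fullmatch(w)]
narrow_hits = [w for w in words if narrow.fullmatch(w)]

print(flat_hits)
print(explicit_hits)
print(narrow_hits)
```

The first two patterns accept the same three day names, and only the third accepts the odd run-together string, so parentheses are needed exactly when `|` should cover less than the whole surrounding pattern.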

                          Maybe this is a hard question to ask, and it isn’t really Notepad++ related…

                          • guy038G
                            guy038
                            last edited by guy038

                            Hi, @alan-kilborn and All,

                            Regarding regex operators precedence, taken from the link,

                            https://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.operator_precedence

                            The table, below, gives the hierarchy of these operators, listed from the highest priority to the lowest priority :

                            1. POSIX based Bracket Character set : [:Class character:], [=Equivalent Class=], and [.Collating element.]
                            2. Escaped characters : \...
                            3. Bracket Character set, ( negative or not ) : [^.....] and [.....]
                            4. Grouping, ( capturing or not ) : (.....) and (?:.....)
                            5. Quantifiers : *, +, ?, {n}, {m,n} and {m,}
                            6. Concatenation ( Implicit )
                            7. Anchoring : ^ and $
                            8. Alternation : |

                            Here are some examples to verify this hierarchy :

                            • Between level 1 and level 2 :

                            The regex [[=\=]] matches the backslash \ only : the [= and =] delimiters are recognized first, so \= is NOT treated as an escaped = sign, which would give the regex [[==]] ( which is, besides, invalid ! )

                            • Between level 2 and level 3 :

                            The regex \[1] is read as the escaped character \[ , so the literal character [, followed with the literal string 1]. It is NOT read as \ followed with the bracket set [1] : as [1] simply matches the 1 digit, that reading would amount to the regex \1

                            • Between level 3 and level 4 :

                            The regex [(123)45] matches any single one of the digits 1, 2, 3, 4 and 5, as well as the parentheses ( and ). It does NOT match the number 123, as a group, or one of the digits 4 or 5, which is what the regex (123)|[45] finds

                            • Between level 4 and level 5 :

                            The regex (123)+ matches the number 123, possibly repeated, and NOT the number 12 followed with one or more consecutive digits 3, which is what the regex 123+ finds

                            • Between level 5 and level 6 :

                            The regex 123+45+ matches the number 12, followed with one or more consecutive digits 3, followed with the digit 4, followed with one or more consecutive digits 5. It does NOT match one or more repetitions of the number 123 followed with one or more repetitions of the number 45, which is what the regex (123)+(45)+ finds

                            • Between level 6 and level 7 :

                            I have not been able to detail differences between implicit concatenation of regexes ( for instance, regex a, followed with regex b resulting in the regex ab ) and anchoring which defines zero-length regexes, matching specific locations in file contents !

                            Indeed, if we consider the simple regex ^123, to my mind, the regex ^1, immediately followed with the regex 23 or the regex ^12, immediately followed with the regex 3 and the regex ^123, or even the zero-length regex ^ followed with the regex 123, seem all identical !?

                            A bit off topic : just notice that string concatenation does NOT represent the same concept as regex concatenation ! For instance, the regex [12], followed with the regex [34] matches all elements of the set { 13, 14, 23, 24 }, whereas the string 12, followed with string 34, represents the single-element set { 1234 }

                            • Between level 7 and level 8 :

                            The regex ^12|34$ matches the number 12, beginning a line, OR the number 34, ending a line. It does NOT match a line containing the number 12 OR the number 34, only ( which can be found with the regex ^(12|34)$ ), NOR a line beginning with the 1 digit, ending with the 4 digit and with either digit 2 OR 3 in between ( which can be found with the regex ^1(2|3)4$ )
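Several of these examples can be checked mechanically. The sketch below uses Python's `re` module, which is an assumption for illustration: it is not Boost, and it has no POSIX `[=...=]` classes, so level 1 is left out. The helper `full` is made up for the example.

```python
import re


def full(pattern: str, text: str) -> bool:
    """True if the whole text is matched by the pattern."""
    return re.fullmatch(pattern, text) is not None


checks = {
    # Levels 2 and 3: \[ is a literal "[", then "1]" is plain text.
    "escape before bracket set": full(r"\[1]", "[1]"),
    # Levels 3 and 4: inside [...] the parentheses are ordinary characters.
    "bracket set before grouping": all(full(r"[(123)45]", c) for c in "(123)45")
    and not full(r"[(123)45]", "123"),
    # Levels 4 and 5: the quantifier applies to the whole group, or to the last digit only.
    "grouping before quantifier": full(r"(123)+", "123123")
    and not full(r"123+", "123123")
    and full(r"123+", "12333"),
    # Levels 5 and 6: each + binds to one digit before concatenation happens.
    "quantifier before concatenation": full(r"123+45+", "1233455")
    and not full(r"123+45+", "1231234545")
    and full(r"(123)+(45)+", "1231234545"),
}

# Regex concatenation: [12] followed by [34] accepts four two-digit strings.
pairs = sorted(a + b for a in "1234" for b in "1234" if full(r"[12][34]", a + b))

# Levels 7 and 8: anchors bind to their own branch before | is applied.
anchors = {
    "^12|34$ in '12xx'": re.search(r"^12|34$", "12xx") is not None,
    "^12|34$ in 'xx34'": re.search(r"^12|34$", "xx34") is not None,
    "^(12|34)$ in '12xx'": re.search(r"^(12|34)$", "12xx") is not None,
    "^(12|34)$ in '34'": re.search(r"^(12|34)$", "34") is not None,
    "^1(2|3)4$ in '124'": re.search(r"^1(2|3)4$", "124") is not None,
}

print(checks)
print(pairs)
print(anchors)
```

Every entry in `checks` comes out true, `pairs` lists the four strings of the set { 13, 14, 23, 24 }, and the anchor results show that `^12|34$` accepts lines the parenthesized versions reject.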

                            Best regards,

                            Merry Christmas and Happy Holidays to all ;-))

                            guy038

                            P.S. :

                            I’ve also found a great article on operator precedence in the main programming and scripting languages ;-)) Just click below :

                            https://rosettacode.org/wiki/Operator_precedence
