    Select all exclamation marks ! from a specific html tag

    • guy038

      Hello, @alan-kilborn, @robin-cruise and All,

      See the updated version of this post, which incorporates @alan-kilborn's advice, at    https://community.notepad-plus-plus.org/post/62123


      Alan, you are quite right about it. For instance, the three main search regexes that I provided to @robin-cruise, expressed in free-spacing mode, are:

      Regex A   (?xs)    (?:  <My\ Tag>         |  \G  )    ((?!^<).)+?    <a\ href="    \K               (?=/)
      
      Regex B   (?xs)    (?:  <!--\ BEGIN\ -->  |  \G  )    ((?!^<).)+?    <a\ href="    \K               (?=/)
      
      Regex C   (?xi-s)  (?:  <p\ class="ONE">  |  \G  )        .*?                      \K    \h*!\h*
      

      They follow the generic scheme, below :

      SEARCH (?-s)(BR|\G)((?!ER).)*?\KSR        OR        (?s)(BR|\G)((?!ER).)*?\KSR

      REPLACE RR

      where :

      • BR ( Beginning Regex ) is the regex which defines the start of the specific area in which to look for a possible Search Regex match

      • ER ( Excluded Regex ) is the regex which defines the characters and/or strings forbidden from the Beginning Regex position until the next Search Regex match. It implicitly defines a zone where the Search Regex may occur, and nowhere else!

      • SR ( Search Regex ) is the regex which defines the expression to search for, provided that the Beginning Regex has been matched and the Excluded Regex has not been matched, so far, at any position

      • RR ( Replace Regex ) is simply the replacement expression which is substituted for each Search Regex match

      Note that, when the ER zone is not needed, this search/replace can be simplified as :

      SEARCH (?-s)(BR|\G).*?\KSR        OR        (?s)(BR|\G).*?\KSR


      For instance :

      • In regex A, BR = <My Tag>, ER = ^<, SR = the empty string between <a\ href=" and /, RR = https://link.ca

      • In regex B, BR = <!-- BEGIN -->, ER = ^<, SR = the empty string between <a\ href=" and /, RR = https://link.ca

      • In regex C, BR = <p\ class="ONE">, ER = none, SR = \h*!\h*, RR = \x20!\x20

      Note that :

      • In regexes A and B, because of the multi-line search enabled by the leading (?s) modifier, an Excluded Regex is necessary so that a match does not run on into another section, <My Tag> or <!-- BEGIN -->, starting at the beginning of a line. Hence the negative look-ahead (?!^<) in the expression ((?!^<).)+?

      • In regex C, the Excluded Regex is implicit. It could be written with the negative look-ahead (?![\r\n]), applied to each character of the shortest range .*?, which gives the syntax ((?![\r\n]).)*?. However, because of the leading (?-s) modifier, no character of that range can ever be an EOL character. So the regex implicitly defines a zone, from the string <p\ class="ONE"> to the end of the current line ( hence up to and including the first </p>, when the paragraph sits on a single line ), in which to search for \h*!\h*, and the shortest range of standard characters can simply be written (?-s).*?
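For readers who want to check the effect of regex C outside Notepad++, here is a minimal Python sketch. It is an emulation, not a port: Python's standard re module supports neither \G nor \K nor \h, so the zone (the rest of the line after the BR) is cut out by hand and [ \t] stands in for \h. The function name and the sample text are made up for the illustration.

```python
import re

BR = re.compile(r'<p class="ONE">', re.I)   # Beginning Regex of regex C (case-insensitive)
SR = re.compile(r'[ \t]*![ \t]*')           # Search Regex, with [ \t] standing in for \h

def space_out_bangs(text):
    # Replace each SR match by ' ! ', but only after a BR match
    # and only up to the end of that same line.
    out = []
    for line in text.splitlines(keepends=True):
        m = BR.search(line)
        if m:
            # Zone = everything after the BR on this line, as with (?-s).*?
            line = line[:m.end()] + SR.sub(' ! ', line[m.end():])
        out.append(line)
    return ''.join(out)

sample = '<p class="ONE">Hello!World !</p>\n<p class="TWO">No!</p>\n'
print(space_out_bangs(sample))
```

Only the exclamation marks of the first paragraph should be rewritten; the line with class="TWO" contains no BR and is left untouched.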

      Best Regards,

      guy038

      • Robin Cruise, in reply to @guy038

        @guy038 very well explained, thank you

        • Alan Kilborn, in reply to @guy038

          @guy038

          I like your explanation as well.
          It could help people start learning how to solve these types of problems.
          Perhaps in the future posters (and especially repetitive posters asking the same questions for similar situations) could be directed to this solution to try before asking for more help.

          • Alan Kilborn, in reply to @guy038

            @guy038 said:

            ER ( Excluded Regex ) is the regex which defines the characters and/or strings forbidden from the Beginning Regex position until the next Search Regex match. It implicitly defines a zone where the Search Regex may occur, and nowhere else!

            I was trying to use this, but I’m sort of confused about the “ER”, and perhaps it is just a matter of decoding the sentence above.

            What I needed to do is find, inside a call to a function foo, a function parameter that is, literally, 0xBA or 0xDE. Thus, I want to match:

            x = foo(0, 12, 0xBA, 34, 27);  // this is my foo function
            

            But foo could also be spread across several lines:

            x = foo(0, 
                12, 
                34, 
                0xDE, 
                27);  // this is another way I could write my foo function
            

            So I set up the technique this way:

            BR = foo\(
            ER = \);
            SR = 0x(BA|DE)

            to get a final search regex of (?s)(foo\(|\G)((?!\);).)*?\K0x(BA|DE)

            It seemed to work, but I really was unsure about my “ER” expression, so @guy038 , if you could comment and shed some additional light on it for me, I’d appreciate it.
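As a cross-check for this kind of search, here is a small Python sketch that uses a plain two-pass approach instead of the single-regex BR/ER/SR technique (Python's standard re module has no \G or \K). It first isolates each zone between foo( and the next ); and then looks for the SR inside it. Unlike the single regex above, it assumes that every foo( call is actually closed by a ); and the names ZONE, SR and hex_args_in_foo are hypothetical.

```python
import re

ZONE = re.compile(r"foo\((.*?)\);", re.S)   # from BR "foo(" up to the first ER ");"
SR = re.compile(r"0x(?:BA|DE)")             # the parameter values to look for

def hex_args_in_foo(source):
    # Return every 0xBA / 0xDE found between "foo(" and the next ");".
    return [hit.group()
            for zone in ZONE.finditer(source)
            for hit in SR.finditer(zone.group(1))]

one_line = "x = foo(0, 12, 0xBA, 34, 27);  // this is my foo function\n"
multi_line = "x = foo(0, \n    12, \n    34, \n    0xDE, \n    27);\n"
outside = "int y = 0xBA;\n"

print(hex_args_in_foo(one_line))
print(hex_args_in_foo(multi_line))
print(hex_args_in_foo(outside))
```

Comparing its hits with those of the Notepad++ regex is a quick way to see whether the ER expression really confines the matches to the foo( ... ); zones.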

            • guy038

              Hi, @alan-kilborn,

              I thought it was better to write this post with Word and provide a screenshot, in order to see colored zones and some writing styles ;-))

              The sample text used is :

              x = foo(0, 
                  12, 
                  34, 
                  0xDE, 
                  12, 
                  0xBA, 
                  34, 
                  27);  // this is another way I could write my foo function
              
              0xDE
              This is
              
              0xBA
              a test
              
              x = foo(0, 
                  12, 
                  34, 
                  0xDE, 
                  12, 
                  0xBA, 
                  34, 
                  27);  // this is another way I could write my foo function
              
              0xDE
              This is
              
              0xBA
              a test
              

              [ Screenshot of the Word document: the sample text above with colored zones, together with the regexes listed below ]

              Best Regards,

              guy038

              Alan, as it could be tedious to retype all the regexes for testing, here they are, in their order of appearance :

              • (?s)(?!\);).

              • (?s)(foo\(|\G)((?!\);).)*?\K0x(BA|DE)    : Your regex

              • (?s)(foo\(|\G)((?!\);).)*?\K0x(BA|DE)(?=((?!\);).)*?\);)

              • (?s)(foo\(|\G).*?\K0x(BA|DE)

              • (?s)(foo\(|\G).*?\K0x(BA|DE)(?=.*?\);)

              Oh, I just saw that the caret of my Word document was captured inside the first regex (?s)(?!\);). of the screenshot ! Pay no attention to it ;-))

              • Alan Kilborn, in reply to @guy038

                @guy038

                Yes, that clarifies things; thank you for that.


                Onto a new aspect…

                Again, here’s your original general case regex:

                (?-s)(BR|\G)((?!ER).)*?\KSR

                Would it be better to express it this way?:

                (?-s)((?:BR)|\G)((?!ER).)*?\K(?:SR)

                So that the BR and SR expressions “stay together” if they are “complicated”? Or are they already totally “safe” the way you expressed them in the original? I’m not totally sure of the precedence of the | operator, and especially not the \K – is the \K of “top priority”?

                The ER already seems sufficiently “wrapped” via (?!…) and shouldn’t need any more than that, although the outer grouping on ER seems as if it could be non-capturing as well, so maybe:

                (?-s)((?:BR)|\G)(?:(?!ER).)*?\K(?:SR)

                I’m not trying to take this totally off-topic into regex land, but I intend to use this technique with N++ a lot in the future, so (to me) it is worth exploring fully.

                • guy038

                  Hi, @alan-kilborn,

                  Nice deductions, indeed ! You’re right in many ways : using non-capturing groups everywhere should be beneficial in all cases :

                  • Firstly, using the non-capturing group (?:(?!ER).) prevents the regex engine from storing, one at a time, every single character between the BR ( or the current location ) and the SR, which should improve the overall performance of the regex ( like simplifying the code inside a loop ! )

                  • Secondly, using the non-capturing group (?:SR) is useful if you need to re-use a part of the SR in the replacement, because the numbering of your own groups then simply starts at group 1 !

                  • Now, I think that the first part ((?:BR)|\G) could simply be expressed as (?:BR|\G), because the zero-length assertion \G is not going to be stored, anyway ;-))
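The effect of non-capturing groups on group numbering can be checked with a short Python sketch. Python's standard re module has no \G or \K, so only the BR alternative is kept here and the sample line is taken from the thread; the variable names are hypothetical. It only illustrates the numbering, not the performance claim.

```python
import re

text = "x = foo(0, 12, 0xBA, 34, 27);"

# Every part wrapped in a capturing group: the (BA|DE) part ends up as group 4
capturing = re.compile(r"(foo\()((?!\);).)*?(0x(BA|DE))", re.S)

# Non-capturing wrappers: the only stored group is (BA|DE), which is group 1
non_capturing = re.compile(r"(?:foo\()(?:(?!\);).)*?(?:0x(BA|DE))", re.S)

m1 = capturing.search(text)
m2 = non_capturing.search(text)

print(capturing.groups, m1.group(4))
print(non_capturing.groups, m2.group(1))
```

With the non-capturing form, a replacement that needs the BA or DE part can refer to group 1, whatever the BR and ER expressions contain.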


                  Finally, we end with these generic expressions :

                  SEARCH (?s)(?:BR|\G)(?:(?!ER).)*?\K(?:SR)    OR    (?-s)(?:BR|\G)(?:(?!ER).)*?\K(?:SR)

                  REPLACE RR

                  where :

                  • BR ( Beginning Regex ) is the regex which defines the start of the specific area in which to look for a possible Search Regex match

                  • ER ( Excluded Regex ) is the regex which defines the characters and/or strings forbidden from the Beginning Regex position until the next Search Regex match. It implicitly defines areas of contiguous characters where the Search Regex must occur, and nowhere else!

                  • SR ( Search Regex ) is the regex which defines the expression to search for, provided that the Beginning Regex has been matched and the Excluded Regex has not been matched, so far, at any position between BR and SR

                  • RR ( Replace Regex ) is simply the replacement expression which is substituted for each Search Regex match

                  Note that I rewrote the last part of the ER and SR definitions !

                  And, if this ER zone is not needed, these generic regexes can be simplified as :

                  SEARCH (?s)(?:BR|\G).*?\K(?:SR)    OR    (?-s)(?:BR|\G).*?\K(?:SR)

                  IMPORTANT : Because the ER regex implicitly defines several non-contiguous areas where SR may exist, when the regex engine skips from one zone ( the yellow area in the screenshot of my previous post ) to the next non-contiguous zone ( the blue area, after the closing parenthesis ), the \G assertion no longer holds, so the first alternative, BR, must match again before any further SR match can be found


                  So, your previous regex could be written as :

                  SEARCH (?s)(?:foo\(|\G)(?:(?!\);).)*?\K(?:0x(BA|DE))

                  And using the free-spacing mode (?x), it becomes :

                  
                  (?xs)  (?: foo\( | \G )  (?: (?! \); ). )*?  \K  (?: 0x(BA|DE) )        TESTED => OK
                             ¯¯¯¯¯                 ¯¯¯                 ¯¯¯¯¯¯¯¯¯
                              BR                   ER                     SR
                  

                  Best Regards,

                  guy038

                  • Alan Kilborn, in reply to @guy038

                    @guy038 said in Select all exclamation marks ! from a specific html tag:

                    (?s)(foo\(|\G)((?!\);).)*?\K0x(BA|DE)    : Your regex

                    I seem to have found a problem; with this text:

                    int y = 0xBA;
                    
                    int z = 0xDE;
                    
                    int x = foo(0,
                        12,
                        34,
                        0xDE,
                        12,
                        0xBA,
                        34,
                        27);  // this is another way I could write my foo function
                    

                    I get hits on the y = and z = lines, even though I thought they had to be inside the foo( and ); delimiters for there to be such hits…

                    • guy038

                      Hello, @alan-kilborn and All,

                      Unfortunately, we should have predicted such behavior !

                      Basically, your regex (?s)(foo\(|\G)((?!\);).)*?\K0x(BA|DE) looks for either :

                      • For the literal string foo(, followed by any char till the first literal string 0xBA or 0xDE

                      • For any char , right after the previous match ( \G ) till the first literal string 0xBA or 0xDE

                      So, given this sample :

                      0xBA    ( Line A )
                      0xBA
                      
                      0xDE
                      
                      int x = foo(0,
                      
                      0xBA
                      
                      0xDE
                      
                      );
                      
                          ( Line B )
                      
                      0xDE
                      
                      0xBA
                      
                      int x = foo(0,
                      
                      0xDE
                      
                      0xBA
                      
                      );
                      
                      0xBA
                      
                      0xDE
                      
                      
                      0xBA
                      
                      0xDE
                      
                      int x = foo(0,
                      
                      0xBA
                      
                      0xDE
                      
                      );
                      
                      0xBA
                      
                      0xDE
                      

                      Move the caret to the very beginning of line B, for instance. Normally, as the next 0xDE is still outside any foo function range, it should not be matched. However, the regex does match this occurrence ! Why ?

                      Because of the combination of the (?s) modifier, which lets the dot match any character, and the \G assertion : wherever your caret is located, the \G assertion is always true when you first execute your regex. Indeed, in this case, the regex engine behaves as if a virtual previous match had ended right at the caret location. So, it will always find the nearest literal string 0xBA or 0xDE, at any location ( refer to the regex (?s)\G.*?0x(BA|DE) )


                      Luckily, I found a solution, which requires that three conditions be met :

                      • You must use N++ version 7.9.1 or later, which correctly handles the behavior of the \A assertion

                      • You must systematically move the caret to the very beginning of the current file ( this is implicit for a Find All in Current Document, a Find in all Opened Documents or a Find All operation ! )

                      • You must use the (?!\A)\G syntax in the overall regex ( instead of \G ! )


                      So the generic regexes, of my previous post, should be improved as :

                      SEARCH (?s)(?:BR|(?!\A)\G)(?:(?!ER).)*?\K(?:SR)    OR    (?-s)(?:BR|(?!\A)\G)(?:(?!ER).)*?\K(?:SR)

                      And gives, for your specific regex :

                      (?xs)  (?: foo\( | (?! \A ) \G )  (?: (?! \); ). )*?  \K  (?: 0x(BA|DE) )
                      

                      You may verify, with the provided sample, that as soon as the caret is not at the very beginning of the first line ( Line A ) when you run this improved regex, it wrongly matches the two 0xBA strings and the 0xDE string located before the first foo( string !

                      Hence the need to respect the second condition above, which ensures that the \A assertion is true before the regex is executed. That way, the second alternative of BR, (?!\A)\G, cannot be true at the first execution of the regex ;-)
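The difference between \G and (?!\A)\G can be reproduced with a rough Python emulation. Python's standard re module supports neither \G nor \K, so the hypothetical helper zone_search below tracks the end of the previous match by hand. It assumes the caret is at the very beginning of the text and that SR never matches an empty string, and it uses the sample text from the problem report earlier in the thread.

```python
import re

def zone_search(text, br, er, sr, g_true_at_start):
    # Hypothetical helper emulating (?s)(?:BR|\G)(?:(?!ER).)*?\K(?:SR),
    # with the caret at the very beginning of the text.
    # g_true_at_start=True  mimics plain \G (true at the caret before any match);
    # g_true_at_start=False mimics (?!\A)\G (false at the start of the file).
    tail = "(?:(?!" + er + ").)*?(?P<sr>" + sr + ")"
    from_br = re.compile("(?:" + br + ")" + tail, re.S)   # BR alternative
    from_g = re.compile(tail, re.S)                        # \G alternative
    hits, pos, g_ok = [], 0, g_true_at_start
    while True:
        m = None
        if g_ok:
            # \G only holds exactly where the previous match ended
            m = from_br.match(text, pos) or from_g.match(text, pos)
        if m is None:
            m = from_br.search(text, pos)   # otherwise a new BR is required
        if m is None:
            return hits
        if m.end() == pos:
            raise ValueError("empty match: SR must not match an empty string")
        hits.append(m.group("sr"))
        pos, g_ok = m.end(), True

sample = """int y = 0xBA;

int z = 0xDE;

int x = foo(0,
    12,
    34,
    0xDE,
    12,
    0xBA,
    34,
    27);  // this is another way I could write my foo function
"""

print(zone_search(sample, r"foo\(", r"\);", r"0x(?:BA|DE)", True))
print(zone_search(sample, r"foo\(", r"\);", r"0x(?:BA|DE)", False))
```

With g_true_at_start set to True the two values on the y = and z = lines are reported as well; with False only the two values inside foo( ... ); remain, which is the behavior that the (?!\A)\G form is meant to give when the search starts at the top of the file.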

                      Best Regards,

                      guy038

                      • Alan Kilborn, in reply to @guy038

                        @guy038 said in Select all exclamation marks ! from a specific html tag:

                        So the generic regexes, of my previous post, should be improved as :
                        SEARCH (?s)(?:BR|(?!\A)\G)(?:(?!ER).)*?\K(?:SR)    OR    (?-s)(?:BR|(?!\A)\G)(?:(?!ER).)*?\K(?:SR)

                        So, Guy, just a note again to say thanks for this.
                        I have employed it 3 or 4 times in the last week, and I anticipate much more usage in the future.
                        Very handy!

                        One good example is in a section of a log file I have to process repeatedly.
                        The section starts with certain line contents and ends with certain other line contents (thus BR and ER).
                        Inside this section there are subsection headers (that have a consistent pattern to their format), and also “WARNING”, “ERROR”, “FAILED” , etc. text that follow the subsection headers (identifying problems within that subsection).
                        By combining the headers and the error text bits in an OR’d-together regex (to form the SR), I can create some nice output (in the Search result window) that clearly identifies the subsections that have “problems” and those that are “clean”.

                        So very nice.
