generic-regex-replacing-in-a-specific-zone-of-text

    Paul Wormer
    last edited by Paul Wormer May 5, 2023, 9:22 AM May 5, 2023, 9:19 AM

    I’m on track to learn advanced regexes, and this regex is one of my challenges. I understand it halfway and have two remaining questions. For reference, I repeat the regex here:

    (?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?-si:FR)
    

    First question: why is there the alternation (?-si:BSR|(?!\A)\G)? Naively, I would simply start with (?-si:BSR) and place the caret somewhere before the string matched by BSR and then hit Find Next or Replace.

    My second question is more related to Npp than to regexes: why doesn’t Replace work with this regex (even if one wants to replace one string only) and is Replace All necessary?

      guy038
      last edited by guy038 May 5, 2023, 10:50 AM May 5, 2023, 10:49 AM

      Hello, @paul-wormer and All,

      I don’t have the spare time to fully answer your first question right now ! It’s just a matter of a few hours !

      However, regarding your second question, it’s quite easy ! It happens that any time you insert a \K syntax, somewhere in a regex, the step by step replacement, with the Replace button is not allowed by the regex engine and the only possibility is to use the Replace All button !

      Best regards,

      guy038

        guy038
        last edited by guy038 May 6, 2023, 12:55 AM May 5, 2023, 3:41 PM

        Hello, @paul-wormer and All,

        Let’s test it with the real text, below, that you’ll copy in a new N++ tab :

        
        
        <try>01-23
        456
        7---89
        </pos>
        
        <val>37--001</val>
        <text>This-is
        -a</text>
        <pos>4-1234</pos>
        
        <val>37--002</val>
        <text>-small---example</text>
        <pos>9-0012</pos>
        
        
        <val>37--003</val>
        <text>-of-text-
        which-</text>
        <pos>1-9999</pos>
        
        
        <val>37--004</val>
        
        <text>need
        -to-be-
        modi
        fied</text>
        
        <pos>0-0000</pos>
        

        Note the 2 empty lines at the beginning of the file !


        Now, let’s suppose that we want to replace any range of dashes with a single space char, but ONLY on lines embedded in a multi-line section <text>.....</text>

        • If we use your formulation of the generic regex (?-si:BSR)(?s-i:(?!ESR).)*?\K(?-si:FR), we end up with the functional search regex (?-si:<text>)(?s-i:(?!</text>).)*?\K(?-si:-+) which can be simplified as :

        • SEARCH (?-i:<text>)(?s-i:(?!</text>).)*?\K-+

        • REPLACE \x20

        • Move the cursor to the very beginning of the new tab

        • Seemingly, the first match is correct : a dash, right after the part <text>This, but the subsequent matches are incomplete : the regex matches ONLY the first range of dashes in each section surrounded with the multi-line tags <text> and </text>

        Because of the lack of the \G syntax, the other dashes, present in each <text>.....</text> section, are not matched. Thus, we cannot use this regex form !


        You could say : but what about the generic regex (?-si:BSR|\G)(?s-i:(?!ESR).)*?\K(?-si:FR) which can be simplified as :

        • SEARCH (?-i:<text>|\G)(?s-i:(?!</text>).)*?\K-+

        • REPLACE \x20

        • Again, move the cursor to the very beginning of the new tab

        • First, it matches all occurrences of a dash in any line not surrounded with the multi-line tags <text> and </text> ( NOT wanted )

        • Then, as soon as it matches a BSR region ( <text> ) and, up to the very end of the file, it correctly matches any range of dashes in each line surrounded with the multi-line tags <text> and </text> ONLY

        Because the \G syntax is matched at the very beginning of the file, some first wrong matches occur. Thus, this formulation is not correct either !


        Finally, let’s use the complete generic regex (?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?-si:FR) which gives the functional one :

        • SEARCH (?-i:<text>|(?!\A)\G)(?s-i:(?!</text>).)*?\K-+

        • REPLACE \x20

        • Again, move the cursor to the very beginning of the new tab

        • As you can see, since the \G syntax is not allowed at the very beginning of current file, it correctly matches all the occurrences of any range of dashes, ONLY in the lines surrounded with the multi-line tags <text> and </text>.

        This is the expected and desired behaviour !
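
Outside Notepad++, the same end result can be sketched in Python. Note that Python's standard `re` module supports neither `\G` nor `\K`, so this is not a port of the regex above: it swaps in a two-pass approach (first isolate each `<text>...</text>` zone, then replace the dash ranges inside it). The name `replace_in_zones` and its parameters are made up for illustration, and the sketch assumes that every `<text>` has a matching `</text>`.

```python
import re


def replace_in_zones(text, bsr="<text>", esr="</text>", fr=r"-+", repl=" "):
    """Replace every match of `fr` with `repl`, but only between `bsr` and `esr`.

    `bsr` and `esr` are literal strings; `fr` is a regular expression.
    Assumption: each `bsr` is closed by an `esr` (unclosed zones are ignored).
    """
    # Pass 1: find each zone lazily, across line breaks (re.DOTALL).
    zone = re.compile(re.escape(bsr) + r"(.*?)" + re.escape(esr), re.DOTALL)

    def fix_zone(match):
        # Pass 2: rewrite only the text captured between the two boundaries.
        return bsr + re.sub(fr, repl, match.group(1)) + esr

    return zone.sub(fix_zone, text)


sample = "<val>37--001</val>\n<text>This-is\n-a</text>\n<pos>4-1234</pos>\n"
result = replace_in_zones(sample)
print(result)
```

Only the dashes inside the `<text>` zone are replaced; those in the `<val>` and `<pos>` lines are left alone.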


        However, note that, if we decide, on purpose, to start with the cursor on a later line ( such as line 2, 3 or beyond ), our last regex will also find some initial wrong matches !

        Best Regards,

        guy038

          Paul Wormer @guy038
          last edited by May 5, 2023, 4:03 PM

          @guy038 Thank you very much for your very elaborate answer (and also for your time). I will study carefully your text, at first glance it seems definitely worth my while. You really are the grandmaster of regular expressions! Thank you again.

            guy038
            last edited by guy038 May 6, 2023, 3:27 PM May 6, 2023, 1:01 AM

            Hi, @paul-wormer and All,

            I realize that I forgot to mention the fundamental role of the \G syntax, in this kind of regex !

            The Boost reference manual says :

            Continuation Escape

            The sequence \G matches only at the end of the last match found, or at the start of the text being matched if no previous match was found.

            This escape is useful if you’re iterating over the matches contained within a text, and you want each subsequent match to start where the last one ended.


            What does this mean ?

            Well, when the caret is at the beginning of the file, the \G syntax is not allowed because of the negative look-ahead (?!\A). So the only possibility to match is to match the BSR string, i.e. the string <text> in this case, followed by the smallest range, even on several lines, of any chars not crossing </text> … till a range of dashes

            Then the \G feature takes over and selects from the next char to the nearest range of dashes, again. But, once the scan reaches the </text> string, which it cannot go through, the next match can, necessarily, no longer be adjacent to the previous one.

            Thus, the \G assertion cannot match anymore and the only possibility is to match a <text> string again ( the other alternative )

             bla bla <text> this is a-small test to see-if it</text> is OK.<text>We're looking again for a third dash-and a fourth-one and so-on</text>
                     ----------------•-----------------•                   ------------------------------------------•------------•----------•
                         1st match        2nd match                                    3rd match                       4th match    5th match
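
The chaining described above can be imitated in plain Python to watch it step by step. This is only a rough emulation under stated assumptions: Python's standard `re` module has no `\G` or `\K`, so the variable `last_end` stands in for `\G`, and a named group stands in for the part kept after `\K`. The names `zone_matches` and `allow_g_at_start` are hypothetical, and the `pos != 0` logic of `(?!\A)` is modelled simply by not letting `\G` match before any previous match exists.

```python
import re


def zone_matches(text, bsr, esr, fr, allow_g_at_start=False):
    r"""Return the (start, end) spans that FR would match.

    Emulates (?:BSR|(?!\A)\G)(?s:(?!ESR).)*?\K(?:FR) with the standard
    library only. bsr, esr and fr are regular expressions.
    """
    head = re.compile(bsr)
    # Lazy run of characters that never steps onto ESR, then FR.
    tail = re.compile(r"(?s:(?:(?!(?:" + esr + r")).)*?)(?P<fr>" + fr + r")")
    spans = []
    # last_end plays the role of \G: the end of the previous match.
    # Starting it at 0 mimics a plain \G, which also matches at the start.
    last_end = 0 if allow_g_at_start else None
    pos = 0
    while pos <= len(text):
        match = None
        opened = head.match(text, pos)          # first alternative: BSR
        if opened:
            match = tail.match(text, opened.end())
        if match is None and last_end == pos:   # second alternative: \G
            match = tail.match(text, pos)
        if match:
            spans.append(match.span("fr"))
            last_end = match.end("fr")
            # Guard against an empty FR match looping forever.
            pos = last_end if last_end > pos else pos + 1
        else:
            pos += 1                            # retry one character further
    return spans


line = "a-b <text>c-d--e</text> f-g <text>h-i</text>"
strict = zone_matches(line, "<text>", "</text>", "-+")
loose = zone_matches(line, "<text>", "</text>", "-+", allow_g_at_start=True)
print([line[s:e] for s, e in strict])  # ['-', '--', '-']
print(loose[0])                        # (1, 2): the unwanted dash in "a-b"
```

With `allow_g_at_start=True` the emulation behaves like the `BSR|\G` variant and picks up a dash before the first `<text>`; without it, only the dashes inside the two zones are found, and the chain is broken at each `</text>`.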
            

            This behavior also explains why the SIMPLIFIED generic regex, shown below, where the BSR region and the different matches are restricted to one line only, does not need any ESR string ( </text> )

            (?-s)(?-i:BSR|(?!\A)\G).*?\K(?-i:FR)

            which gives the functional regex S/R :

            SEARCH (?-s)(?-i:<text>|(?!\A)\G).*?\K-+

            REPLACE \x20

            If we imagine this text :

            
            Line 1  <text>this is a-small text to see-where are-
            -       ---------------•-----------------•---------•
            Line 3     1st match       2nd match      3rd match
            Line 4
            Line 5  <text>all the-different matches of-a dash</text>
            Line 6  -------------•--------------------•
            Line 7     4th match        5th match
            

            After the third match of the final dash, at the end of line 1, the next match should be at the beginning of line 2. But this is not allowed because the regex engine would have to go from line 1 to line 2 and skip the two chars CR + LF

            In that case, the \G feature is not respected anymore and, necessarily, the search goes on starting with the <text> string, in line 5, … till a dash

            Note that the </text> string, at the end of each line, is not mandatory because the implicit gap CRLF, between two consecutive lines, resets, each time, the regex engine to the search of a <text> string first !!

            Thus, to search any FR region in each line, containing a BSR region, simply use the simplified generic regex (?-s)(?-i:BSR|(?!\A)\G).*?\K(?-i:FR)

            Best Regards,

            guy038

              Paul Wormer @guy038
              last edited by Paul Wormer May 6, 2023, 1:43 PM May 6, 2023, 1:42 PM

              I worked through Guy’s examples and understand them. I cannot say that I could already compose a regular expression of similar complexity, but at least I can follow Guy’s reasoning.

              There is one open end though:

              @guy038 said in generic-regex-replacing-in-a-specific-zone-of-text:

              This behavior also explains why the SIMPLIFIED generic regex, shown below, where the BSR region and the different matches are restricted to one line only, does not need any ESR string ( </text> )
              […]
              SEARCH (?-s)(?-i:<text>|(?!\A)\G).*?\K-+

              I tried to match the string:

              a-a <text>b-c</text> d-e
              

              and placed the caret at the beginning of the file. Using the simplified regex above:

              (?-s)(?-i:<text>|(?!\A)\G).*?\K-+
              

              I found that the second and the third hyphen in the string match. Is this a case of “even Homer sometimes nods”, or do I make a mistake?

                Alan Kilborn @Paul Wormer
                last edited by Alan Kilborn May 6, 2023, 3:14 PM May 6, 2023, 3:13 PM

                @Paul-Wormer said in generic-regex-replacing-in-a-specific-zone-of-text:

                do I make a mistake?

                Yes, you do.
                Why wouldn’t it match the last -, as you have nothing about </text> terminating the parsing in the regex…
                The possible matching extends to the end of the line.
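
A small Python sketch may make this concrete. It is not the Notepad++ regex itself (Python's standard `re` module lacks `\G` and `\K`); it swaps in a per-line approach that should give the same result as the simplified regex for single-line zones: everything after the first `<text>` on a line is searched, right up to the line end. The name `replace_to_end_of_line` is made up for illustration.

```python
import re


def replace_to_end_of_line(text, bsr="<text>", fr=r"-+", repl=" "):
    """On each line, replace `fr` only after the first `bsr`, up to the line end."""
    out = []
    for line in text.splitlines(keepends=True):
        before, found, after = line.partition(bsr)
        if found:
            # No ESR is checked here, so the zone runs to the end of the line.
            after = re.sub(fr, repl, after)
        out.append(before + found + after)
    return "".join(out)


print(replace_to_end_of_line("a-a <text>b-c</text> d-e"))
# a-a <text>b c</text> d e
```

The first hyphen is untouched, while the second and the third are both replaced, because nothing tells the search to stop at `</text>`.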

                  guy038
                  last edited by guy038 May 6, 2023, 3:23 PM May 6, 2023, 3:21 PM

                  Hi, @paul-wormer, @alan-kilborn and All,

                  @paul-wormer, you didn’t make a mistake. It’s just that I did not explain myself properly !

                  When I said :

                  This behavior also explains why the SIMPLIFIED generic regex, shown below, where the BSR region and the different matches are restricted to one line only, does not need any ESR string ( </text> )

                  I should have added : … if we suppose that the closing </text> boundaries were implicitly at the very end of the lines containing the <text> string, instead of their present location !


                  For instance, write the text, below, in a new tab :

                  a-a foo b---c <text>bar d-e foo f---g bla</text> H--I blah j-k
                  This --- is
                  a-b foo c---d <text>e--f foo g-h bar i--j foo
                  a test
                  to - verify
                  A---0 bar b-c <text>d-e bla f-----g blah</text> foo h--i bar j-k
                  the ----- regexes
                  

                  If you use the simplified generic regex (?-s)(?-i:BSR|(?!\A)\G).*?\K(?-i:FR), so the practical search (?-s)(?-i:<text>|(?!\A)\G).*?\K-+ :

                  It supposes that implicit </text> boundaries exist at the end of the lines, instead of the present ones

                  So, the text could also be rewritten :

                  a-a foo b---c <text>bar d-e foo f---g bla H--I blah j-k</text>
                  This --- is
                  a-b foo c---d <text>e--f foo g-h bar i--j foo</text>
                  a test
                  to - verify
                  A---0 bar b-c <text>d-e bla f-----g blah foo h--i bar j-k</text>
                  the ----- regexes
                  

                  And, of course, the regex (?-s)(?-i:<text>|(?!\A)\G).*?\K-+, against these two texts, does match any range of dash characters, after the opening <text> boundary … till the end of each line


                  But, using again, the original text :

                  a-a foo b---c <text>bar d-e foo f---g bla</text> H--I blah j-k
                  This --- is
                  a-b foo c---d <text>e--f foo g-h bar i--j foo
                  a test
                  to - verify
                  A---0 bar b-c <text>d-e bla f-----g blah</text> foo h--i bar j-k
                  the ----- regexes
                  

                  If we want to restrict the search to the part of each line, within the <text>...........</text> region, we must use, this time, the following regex :

                  (?-s)(?-i:<text>|(?!\A)\G)(?-i:(?!</text>).)*?\K-+

                  Note that, in line 3, which contains <text> but not </text>, the search for dashes still runs till the end of line 3 !

                  Best Regards,

                  guy038

                    Mark Olson @Paul Wormer
                    last edited by Mark Olson May 6, 2023, 3:37 PM May 6, 2023, 3:33 PM

                    @Paul-Wormer
                    Time to unpack the (?s-i:(?!ESR).)*? part of the regex, which is what your new version is currently missing.

                    Essentially this is divided into four parts:

                    the flags: ?s-i

                    These say that:

                    • the . metacharacter should match everything (that’s the s flag)
                    • we want to be case-sensitive. That’s the -i part.

                    consume a character unless the end of the search region is right ahead: (?!ESR).

                    This looks ahead without consuming any characters to see if the ESR is in front of you, and then stops if it is. An example:
                    Search string: cacb ab
                    regex: (?!ab).
                    At the start of the string, we look ahead for ab. The next character is c, so we consume c. (remember, this is SUCCESS because we want to NOT match the ESR).

                    string:     cacb ab
                    want:      _ab
                    match:      !
                    consumed: YES
                    

                    Now we’re after the first c, before the first a. We look ahead for ab, but see ac instead, so we’re clear to advance.

                    string:     cacb ab
                    want:      __ab
                    match:       *!
                    consumed: YES
                    

                    You can see that there are no ab anywhere except at the end of the string, so everything will match.

                    Let’s fast-forward to the end of the string:

                    string:     cacb ab
                    want:      ______ab
                    match:           **
                    consumed: NO
                    

                    We’re now positioned between the blankspace and the ending ab. The next two characters are ab, so the lookahead fails: the a will NOT be consumed, and the match stops here (the whitespace just before it has already been consumed).

                    Do the above thing any number of times: (?s-i:(?!ESR).)*?

                    This just says to keep looking ahead and stopping if the ESR is ahead, then consuming a character, then looking ahead… until the ESR is reached or the entire string is consumed.
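
This part of the regex uses only lookahead, which Python's standard `re` module also supports, so the walk-through can be checked with a short sketch. The string `cacb ab` and the stop word `ab` come from the example above; the variable names are just for illustration.

```python
import re

text = "cacb ab"

# One step: consume a single character unless "ab" starts right here.
step = re.compile(r"(?!ab).", re.DOTALL)
print(step.match(text, 0) is not None)  # True: "ca" is ahead, so "c" is consumed
print(step.match(text, 5) is not None)  # False: "ab" is ahead, nothing is consumed

# The same step repeated: stops right before the final "ab".
consumed = re.match(r"(?:(?!ab).)*", text, re.DOTALL).group()
print(repr(consumed))  # 'cacb '
```

The repetition is written greedily here (`*`) to show how far it can go; in the full regex the lazy `*?` stops earlier, as soon as the FR part can match.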

                    Interesting note: Rexegg.com refers to this as “tempered greed” because you’re greedily trying to eat the whole string, but checking to see if you’re full before you take each bite.

                    Putting it all together:

                    So as @guy038 illustrated above, the (?-i:<text>|(?!\A)\G)(?s-i:(?!</text>).)*?\K-+ regular expression is going to start with (?-i:<text>|(?!\A)\G) by matching either <text> (the BSR) or the end of the last matched region (unless you wrapped around).
                    Now the (?s-i:(?!</text>).)*? part comes into play. It behaves as I described above: the negative lookahead for </text> ensures that you cannot go past the ESR.

                    The rest is the same, as you correctly identified:

                    • forget everything you matched so far (the \K)
                    • match any number of - characters (-+).

                    For the record, I think that part of the problem with the readability of this regex has to do with the flags. The version of the regex without flags, (?:<text>|(?!\A)\G)(?:(?!</text>).)*?\K-+, is I think a bit less confusing.
