    generic-regex-replacing-in-a-specific-zone-of-text

      Paul Wormer

      I’m working on learning advanced regexes, and this regex is one of my challenges. I understand about half of it and have two remaining questions. For reference, I repeat the regex here:

      (?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?-si:FR)
      

      First question: why is there the alternation (?-si:BSR|(?!\A)\G)? Naively, I would simply start with (?-si:BSR) and place the caret somewhere before the string matched by BSR and then hit Find Next or Replace.

      My second question is more about Npp than about regexes: why doesn’t Replace work with this regex (even if one wants to replace only one string), and why is Replace All necessary?

        guy038

        Hello, @paul-wormer and All,

        I do not have enough spare time to fully answer your first question right now ! It is just a matter of a few hours !

        However, your second question is quite easy to answer ! Any time a regex contains the \K syntax, the step-by-step replacement with the Replace button is not allowed by the regex engine, and the only possibility is to use the Replace All button !

        Best regards,

        guy038

          guy038

          Hello, @paul-wormer and All,

          Let’s test it with the real text below, which you can copy into a new N++ tab :

          
          
          <try>01-23
          456
          7---89
          </pos>
          
          <val>37--001</val>
          <text>This-is
          -a</text>
          <pos>4-1234</pos>
          
          <val>37--002</val>
          <text>-small---example</text>
          <pos>9-0012</pos>
          
          
          <val>37--003</val>
          <text>-of-text-
          which-</text>
          <pos>1-9999</pos>
          
          
          <val>37--004</val>
          
          <text>need
          -to-be-
          modi
          fied</text>
          
          <pos>0-0000</pos>
          

          Note the 2 empty lines at the beginning of the file !


          Now, let’s suppose that we want to replace any range of dashes with a single space char, but ONLY inside a multi-line section <text>.....</text>

          • If we use your formulation of the generic regex (?-si:BSR)(?s-i:(?!ESR).)*?\K(?-si:FR), we end up with the functional search regex (?-si:<text>)(?s-i:(?!</text>).)*?\K(?-si:-+) which can be simplified as :

          • SEARCH (?-i:<text>)(?s-i:(?!</text>).)*?\K-+

          • REPLACE \x20

          • Move the cursor to the very beginning of the new tab

          • Seemingly, the first match is correct : a dash, right after the part <text>This, but the subsequent matches are wrong : the regex matches ONLY the first range of dashes in each section surrounded with the multi-line tags <text> and </text>, since every match has to begin with a new <text> string

          Because of the lack of the \G syntax, the other dashes, present in each <text>.....</text> section, are not matched. Thus, we cannot use this regex form !


          You could say : but what about the generic regex (?-si:BSR|\G)(?s-i:(?!ESR).)*?\K(?-si:FR) which can be simplified as :

          • SEARCH (?-i:<text>|\G)(?s-i:(?!</text>).)*?\K-+

          • REPLACE \x20

          • Again, move the cursor to the very beginning of the new tab

          • First, it matches all the ranges of dashes located before the first <text> tag, so outside any <text>.....</text> section ( NOT wanted )

          • Then, as soon as it matches a BSR region ( <text> ) and up to the very end of the file, it correctly matches any range of dashes located between the multi-line tags <text> and </text> ONLY

          Because the \G syntax also matches at the very beginning of the file, some wrong matches occur first. Thus, this formulation is not correct either !


          Finally, let’s use the complete generic regex (?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?-si:FR) which gives the functional one :

          • SEARCH (?-i:<text>|(?!\A)\G)(?s-i:(?!</text>).)*?\K-+

          • REPLACE \x20

          • Again, move the cursor to the very beginning of the new tab

          • As you can see, since the \G syntax is not allowed at the very beginning of the current file, it correctly matches all the ranges of dashes, ONLY between the multi-line tags <text> and </text>.

          This is the expected and desired behaviour !


          However, note that if we decide, on purpose, to start with the cursor somewhere after the very beginning of the file ( on line 2, 3 or any other ), our last regex will also find some wrong matches first !
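For readers who want to check the expected result outside Notepad++, here is a minimal Python sketch. Python's standard `re` module supports neither `\G` nor `\K`, so this is not the regex above: it swaps in a different technique, namely finding each zone with a negative look-ahead and replacing inside it with a callback. The function name `replace_in_zones` and its parameters are made up for this illustration, and BSR and ESR are assumed to be literal strings.

```python
import re

def replace_in_zones(text, bsr="<text>", esr="</text>", fr=r"-+", repl=" "):
    """Replace every FR match located after a BSR and before the next ESR."""
    # A zone starts right after BSR and runs over every char that does not
    # begin ESR (the same negative look-ahead idea as in the generic regex).
    zone = re.compile(
        r"(?<=%s)(?:(?!%s).)*" % (re.escape(bsr), re.escape(esr)),
        re.DOTALL,  # like (?s) : a zone may span several lines
    )
    # Inside each zone, replace every FR match; text outside zones is untouched.
    return zone.sub(lambda m: re.sub(fr, repl, m.group()), text)

# A shortened version of the sample text (assumption: "\n" line endings).
sample = (
    "<val>37--001</val>\n"
    "<text>This-is\n"
    "-a</text>\n"
    "<pos>4-1234</pos>\n"
    "<text>-small---example</text>\n"
)
print(replace_in_zones(sample))
```

Only the dashes between `<text>` and `</text>` should become spaces; the dashes in the `<val>` and `<pos>` lines should stay as they are.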

          Best Regards,

          guy038

            Paul Wormer @guy038

            @guy038 Thank you very much for your very elaborate answer (and also for your time). I will study your text carefully; at first glance it definitely seems worth my while. You really are the grandmaster of regular expressions! Thank you again.

              guy038

              Hi, @paul-wormer and All,

              I realize that I forgot to mention the fundamental role of the \G syntax, in this kind of regex !

              The Boost reference manual says :

              Continuation Escape

              The sequence \G matches only at the end of the last match found, or at the start of the text being matched if no previous match was found.

              This escape is useful if you’re iterating over the matches contained within a text, and you want each subsequent match to start where the last one ended.


              What does this mean ?

              Well, when the caret is at the beginning of the file, the \G syntax is not allowed because of the negative look-ahead (?!\A). So the only possibility is to match the BSR string, i.e. the string <text> in this case, followed by the smallest range, even on several lines, of any char that does not begin the string </text> … till a range of dashes

              Then the \G feature takes over and selects from the next char to the nearest range of dashes, again. But, as it cannot go through the </text> string, at some point no further match can start exactly where the previous one ended.

              Thus, the \G alternative cannot match anymore and the only possibility is to match a <text> string again ( the other alternative )

               bla bla <text> this is a-small test to see-if it</text> is OK.<text>We're looking again for a third dash-and a fourth-one and so-on</text>
                       ----------------•-----------------•                   ------------------------------------------•------------•----------•
                           1st match        2nd match                                    3rd match                       4th match    5th match
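The chaining described above can be imitated in Python as a rough sketch. Python's `re` has no `\G`, but `Pattern.match(text, pos)` only succeeds if the match starts exactly at `pos`, which plays a similar role. The function `zone_matches`, its parameters and the `guard` switch (standing in for `(?!\A)`) are hypothetical names for this illustration, and FR is assumed never to match an empty string.

```python
import re

def zone_matches(text, bsr="<text>", esr="</text>", fr="-+", guard=True):
    """Return the (start, end) spans of FR found inside BSR...ESR zones."""
    start = re.compile(bsr)
    # Lazily skip chars that do not begin ESR, then capture FR. Group 1 plays
    # the role of \K : only the FR part is reported.
    cont = re.compile(r"(?:(?!%s).)*?(%s)" % (esr, fr), re.DOTALL)
    spans = []
    # 'pos' is where \G would match : the end of the previous match.
    # With the guard, the start of the text does not count as such a place.
    pos = None if guard else 0
    scan = 0
    while True:
        # Continuation attempt, anchored at 'pos' much like \G.
        m = cont.match(text, pos) if pos is not None else None
        if m is None:
            s = start.search(text, scan)   # fall back to the BSR alternative
            if s is None:
                break
            scan = s.end()
            m = cont.match(text, s.end())
            if m is None:                  # no FR before ESR in this zone
                pos = None
                continue
        spans.append(m.span(1))
        pos = scan = m.end()
    return spans

line = ("bla bla <text> this is a-small test to see-if it</text> is OK."
        "<text>We're looking again for a third dash-and a fourth-one and so-on</text>")
print(len(zone_matches(line)))                                  # 5

print(zone_matches("a-a <text>b-c</text> d-e"))                 # [(11, 12)]
print(zone_matches("a-a <text>b-c</text> d-e", guard=False))    # [(1, 2), (11, 12)]
```

The last call shows what the `(?!\A)` guard prevents: without it, the dash before the first `<text>` is also reported.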
              

              This behavior also explains why the SIMPLIFIED generic regex, shown below, where the BSR region and the different matches are restricted to one line only, does not need any ESR string ( </text> )

              (?-s)(?-i:BSR|(?!\A)\G).*?\K(?-i:FR)

              which gives the functional regex S/R :

              SEARCH (?-s)(?-i:<text>|(?!\A)\G).*?\K-+

              REPLACE \x20

              If we imagine this text :

              
              Line 1  <text>this is a-small text to see-where are-
              -       ---------------•-----------------•---------•
              Line 3     1st match       2nd match      3rd match
              Line 4
              Line 5  <text>all the-different matches of-a dash</text>
              Line 6  -------------•--------------------•
              Line 7     4th match        5th match
              

              After the third match, on the final dash at the end of line 1, the next match should be at the beginning of line 2. But this is not allowed, because the regex engine would have to go from line 1 to line 2 and skip the two chars CR + LF, which the dot cannot match when (?-s) is set

              In that case, the \G alternative cannot match anymore and, necessarily, the search goes on starting with the <text> string, in line 5, … till a dash

              Note that the </text> string, at the end of each line, is not mandatory because the implicit gap CRLF, between two consecutive lines, resets, each time, the regex engine to the search of a <text> string first !!

              Thus, to search any FR region in each line, containing a BSR region, simply use the simplified generic regex (?-s)(?-i:BSR|(?!\A)\G).*?\K(?-i:FR)

              Best Regards,

              guy038

                Paul Wormer @guy038

                I worked through Guy’s examples and understand them. I cannot say that I could already compose a regular expression of similar complexity, but at least I can follow Guy’s reasoning.

                There is one loose end, though:

                @guy038 said in generic-regex-replacing-in-a-specific-zone-of-text:

                This behavior also explains why the SIMPLIFIED generic regex, shown below, where the BSR region and the different matches are restricted to one line only, does not need any ESR string ( </text> )
                […]
                SEARCH (?-s)(?-i:<text>|(?!\A)\G).*?\K-+

                I tried to match the string:

                a-a <text>b-c</text> d-e
                

                and placed the caret at the beginning of the file. Using the simplified regex above:

                (?-s)(?-i:<text>|(?!\A)\G).*?\K-+
                

                I found that the second and the third hyphen in the string match. Is this a case of “even Homer sometimes nods”, or do I make a mistake?

                  Alan Kilborn @Paul Wormer

                  @Paul-Wormer said in generic-regex-replacing-in-a-specific-zone-of-text:

                  do I make a mistake?

                  Yes, you do.
                  Why wouldn’t it match the last -, as you have nothing about </text> terminating the parsing in the regex…
                  The possible matching extends to the end of the line.
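As a rough illustration of that point, the net effect of the simplified regex on a single line can be sketched in Python. This is a swapped-in technique using `str.partition`, not the `\G`/`\K` regex itself, which Python's `re` does not support. `replace_after_bsr` is a hypothetical helper name, and BSR is treated as a literal string here.

```python
import re

def replace_after_bsr(line, bsr="<text>", fr=r"-+", repl=" "):
    # Everything after the first BSR on the line is the search zone.
    # Since nothing tests for an ESR, the zone runs to the end of the line.
    head, sep, tail = line.partition(bsr)
    if not sep:            # no BSR on this line : nothing is replaced
        return line
    return head + sep + re.sub(fr, repl, tail)

print(replace_after_bsr("a-a <text>b-c</text> d-e"))
# a-a <text>b c</text> d e
```

The first hyphen is left alone, while the second and third are both replaced, which agrees with the observation in the question.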

                    guy038

                    Hi, @paul-wormer, @alan-kilborn and All,

                    @paul-wormer, you didn’t make a mistake. It’s just that I did not explain myself properly !

                    When I said :

                    This behavior also explains why the SIMPLIFIED generic regex, shown below, where the BSR region and the different matches are restricted to one line only, does not need any ESR string ( </text> )

                    I should have added : … if we suppose that the closing </text> boundaries were implicitly at the very end of the lines containing the <text> string, instead of their present location !


                    For instance, write the text, below, in a new tab :

                    a-a foo b---c <text>bar d-e foo f---g bla</text> H--I blah j-k
                    This --- is
                    a-b foo c---d <text>e--f foo g-h bar i--j foo
                    a test
                    to - verify
                    A---0 bar b-c <text>d-e bla f-----g blah</text> foo h--i bar j-k
                    the ----- regexes
                    

                    If you use the simplified generic regex (?-s)(?-i:BSR|(?!\A)\G).*?\K(?-i:FR), i.e. the practical search (?-s)(?-i:<text>|(?!\A)\G).*?\K-+ :

                    It behaves as if implicit </text> boundaries existed at the end of the lines, instead of the present ones

                    So, the text could also be rewritten :

                    a-a foo b---c <text>bar d-e foo f---g bla H--I blah j-k</text>
                    This --- is
                    a-b foo c---d <text>e--f foo g-h bar i--j foo</text>
                    a test
                    to - verify
                    A---0 bar b-c <text>d-e bla f-----g blah foo h--i bar j-k</text>
                    the ----- regexes
                    

                    And, of course, the regex (?-s)(?-i:<text>|(?!\A)\G).*?\K-+, against these two texts, does match any range of dash characters, after the opening <text> boundary … till the end of each line


                    But, using again, the original text :

                    a-a foo b---c <text>bar d-e foo f---g bla</text> H--I blah j-k
                    This --- is
                    a-b foo c---d <text>e--f foo g-h bar i--j foo
                    a test
                    to - verify
                    A---0 bar b-c <text>d-e bla f-----g blah</text> foo h--i bar j-k
                    the ----- regexes
                    

                    If we want to restrict the search to the part of each line, within the <text>...........</text> region, we must use, this time, the following regex :

                    (?-s)(?-i:<text>|(?!\A)\G)(?-i:(?!</text>).)*?\K-+

                    Note that, in line 3, which contains <text> but not </text>, the search for dashes still runs till the end of line 3 !

                    Best Regards,

                    guy038

                      Mark Olson @Paul Wormer

                      @Paul-Wormer
                      Time to unpack the (?s-i:(?!ESR).)*? part of the regex, which is what your new version is currently missing.

                      Essentially this is divided into four parts:

                      the flags: ?s-i

                      These say that:

                      • the . metacharacter should match every character, including line breaks (that’s the s flag)
                      • we want to be case-sensitive. That’s the -i part.
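Python's `re` module (3.6 or later) happens to accept the same scoped-flag syntax, so the effect of each flag can be tried in isolation. This is only a small sketch; the variable names are made up for the illustration, and `re.IGNORECASE` stands in for a search whose "Match case" option is switched off.

```python
import re

# (?s:...) lets '.' match a line break, inside that group only.
dot_default = bool(re.search(r"a.b", "a\nb"))        # False
dot_with_s = bool(re.search(r"(?s:a.b)", "a\nb"))    # True

# (?-i:...) forces case-sensitive matching inside the group,
# even when the search as a whole ignores case.
upper_hit = bool(re.search(r"(?-i:<text>)", "<TEXT>", re.IGNORECASE))  # False
lower_hit = bool(re.search(r"(?-i:<text>)", "<text>", re.IGNORECASE))  # True

print(dot_default, dot_with_s, upper_hit, lower_hit)
```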

                      consume a character unless the end of the search region is right ahead: (?!ESR).

                      This looks ahead without consuming any characters to see if the ESR is in front of you, and then stops if it is. An example:
                      Search string: cacb ab
                      regex: (?!ab).
                      At the start of the string, we look ahead for ab. The next character is c, so we consume c. (remember, this is SUCCESS because we want to NOT match the ESR).

                      string:     cacb ab
                      want:      _ab
                      match:      !
                      consumed: YES
                      

                      Now we’re after the first c, before the first a. We look ahead for ab, but see ac instead, so we’re clear to advance.

                      string:     cacb ab
                      want:      __ab
                      match:       *!
                      consumed: YES
                      

                      You can see that there is no ab anywhere except at the end of the string, so every character before it will be consumed.

                      Let’s fast-forward to the end of the string:

                      string:     cacb ab
                      want:      ______ab
                      match:           **
                      consumed: NO
                      

                      We’re now positioned between the blank space and the ending ab. The next two characters are ab, so the lookahead fails: the a will NOT be consumed, and the matching stops right here, just before the ESR.

                      Do the above thing any number of times: (?s-i:(?!ESR).)*?

                      This just says to keep looking ahead and stopping if the ESR is ahead, then consuming a character, then looking ahead… until the ESR is reached or the entire string is consumed.

                      Interesting note: Rexegg.com refers to this as a “tempered greedy token” because you’re greedily trying to eat the whole string, but checking to see if you’re full before you take each bite.
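The walk-through above can be reproduced with Python's `re` module, which supports negative lookaheads. This is just a small sketch with made-up variable names; `Pattern.match(s, pos)` is used to try a single guarded step at a chosen position.

```python
import re

s = "cacb ab"
step = re.compile(r"(?!ab).")     # one guarded step : look ahead, then consume

at_start = step.match(s, 0)       # 'ca' is not 'ab', so 'c' is consumed
at_blank = step.match(s, 4)       # ' a' is not 'ab', so the blank is consumed
at_esr = step.match(s, 5)         # 'ab' is right ahead : the step fails (None)

# Repeating the step stops just before the final 'ab'.
greedy = re.match(r"(?:(?!ab).)*", s).group()     # 'cacb '

# The lazy form only takes as many steps as the following token needs.
lazy = re.match(r"(?:(?!ab).)*?b", s).group()     # 'cacb'

print(repr(greedy), repr(lazy), at_esr)
```

In the generic regex the lazy form is used, so the steps stop as soon as the FR part can match, and in any case before the ESR.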

                      Putting it all together:

                      So as @guy038 illustrated above, the (?-i:<text>|(?!\A)\G)(?s-i:(?!</text>).)*?\K-+ regular expression is going to start with (?-i:<text>|(?!\A)\G) by matching either <text> (the BSR) or the end of the last matched region (unless you wrapped around).
                      Now the (?s-i:(?!</text>).)*? part comes into play. It behaves as I described above: the negative lookahead for </text> ensures that you cannot go past the ESR.

                      The rest is the same, as you correctly identified:

                      • forget everything you matched so far (the \K)
                      • match any number of - characters (-+).

                      For the record, I think that part of the problem with the readability of this regex has to do with the flags. The version of the regex without flags, (?:<text>|(?!\A)\G)(?:(?!</text>).)*?\K-+, is I think a bit less confusing.
