    I don't understand why this simple regex doesn't work

    General Discussion
    14 Posts 5 Posters 834 Views
    • Terry R @pbarney

      @pbarney

      You could look at the problem a slightly different way:

      1. Use (SID=[^\x20]+) and replace with \r\n${1}\r\n. This puts all those SID strings on separate lines.
      2. Then bookmark lines with the SID= on them.
      3. Then remove non bookmarked lines.

      It might sound slightly simplistic and indeed if you are wanting to learn more about regex (and push your boundaries) then it can be done in the manner you want. Someone can provide you with that solution in time, however sometimes looking at the solution and breaking it down can be advantageous.
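The three steps above can be sketched outside Notepad++ as well. This is only an illustration in Python's `re` module, not what Notepad++ runs internally: the sample text is hypothetical, `\n` is used instead of `\r\n`, and the bookmark-and-remove steps are approximated by filtering lines.

```python
import re

# Hypothetical sample: SID tokens buried in other text, each followed by a space.
text = "foo SID=111 bar\nno id here\nbaz SID=222 qux\n"

# Step 1: put every SID string on its own line.
# [^\x20]+ matches everything up to the next space, as in the post.
split = re.sub(r"(SID=[^\x20]+)", r"\n\1\n", text)

# Steps 2 and 3: "bookmark" the lines containing SID= and drop the rest.
kept = [line for line in split.splitlines() if "SID=" in line]

print(kept)  # ['SID=111', 'SID=222']
```

Note that `[^\x20]+` stops only at a space, so any punctuation glued to the number would be kept along with it.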

      Terry

      • pbarney @PeterJones

        @PeterJones, thanks for taking a stab at it. The problem here is that it still leaves other, non-SID data in the file.

        So a match isn’t dependent upon SID=\d+ being on that line.

        Yes, that is my intention. If there isn’t a SID=\d+ in the line, it should select the entire line and replace it with nothing, but if there is a SID=\d+ in the line, it should capture it into group 1, replacing everything on the line with group 1.

        My attempt here is to render the sample data as:

        SID=324221815251191
        SID=32422181753241
        SID=324221819525920
        SID=324221821424161
        

        …although I recognize that it will actually contain a number of blank lines, which I’m fine with.

        I’m just not sure why it’s not actually capturing my string into the capture group at all.

          • pbarney @Terry R

          @Terry-R said in I don’t understand why this simple regex doesn’t work:

          You could look at the problem a slightly different way […] if you are wanting to learn more about regex (and push your boundaries) then it can be done […] sometimes looking at the solution and breaking it down can be advantageous.

          You’re absolutely right! In this case, yes, I am attempting to square up my RegEx skills here, so this question isn’t so much about “how to get the job done” as it is “how to get the job done with RegEx.” In any case, I’ve actually completed the work, I’d just like to know what I’m misunderstanding about my regex.

            • Terry R @pbarney

            @pbarney said in I don’t understand why this simple regex doesn’t work:

            I’d just like to know what I’m misunderstanding about my regex.

            I thought that might be the case. I was thinking on some of the steps I’d take when my regex didn’t produce the output I wanted. In your case I’d start like this:

            [screenshot: 5ca2e6f2-34a4-4a1d-9c54-3b0e8a3da9dc-image.png]

            which immediately shows me that the line always seems to put it all into group 1 regardless.

            Terry

              • PeterJones @pbarney

              @pbarney said in I don’t understand why this simple regex doesn’t work:

              …although I recognize that it will actually contain a number of blank lines, which I’m fine with.

              Ah, okay, that wasn’t clear because you didn’t show your desired “after” data.

              I’m just not sure why it’s not actually capturing my string into the capture group at all.

              Because the group is optional (0-or-1 instances of SID=\d+) and is followed by a greedy .*, the ? is allowed to match zero instances while the characters after it are matched greedily. The group then holds an empty string, and the SID and everything after it are matched as part of the .*. Unfortunately, even with a $ anchor at the end, and with the trailing .* made non-greedy, one of the two .* still seems to take precedence over the 0-or-1. As Terry’s experiments showed, it appears to be the left .*?.
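The behaviour is easy to reproduce in a small script. The sketch below uses Python's `re` module as a stand-in (Notepad++ uses a different regex engine, so treat the equivalence as an assumption), with hypothetical sample lines. It tries the original pattern and the variant with a `$` anchor and a non-greedy tail:

```python
import re

# Hypothetical sample lines; only the second one starts with SID=.
lines = ["abc SID=123 def", "SID=456 xyz", "no id here"]

variants = {
    "original": r"^.*?(SID=\d+)?.*",
    "anchored, lazy tail": r"^.*?(SID=\d+)?.*?$",
}

# Record what group 1 holds for each line under each variant.
captures = {
    name: [re.match(rx, line).group(1) for line in lines]
    for name, rx in variants.items()
}

for name, groups in captures.items():
    print(name, groups)
```

In both variants group 1 is only filled for the line that begins with SID=; a SID in the middle of a line is never captured.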

              So with that knowledge, I can fix it by using a negative lookahead trick: the pattern (?!NEG). will match one of any character that is not the start of NEG, and (?:(?!NEG).)*? will match 0-or-more-characters non-greedy, as long as those characters aren’t the start of NEG.

              Taking that concept, replace each .*? with (?:(?!SID=\d+).)*?, and it will work:

              • FIND = (?-s)^(?:(?!SID=\d+).)*?(SID=\d+)?(?:(?!SID=\d+).)*?$
              • REPLACE = $1
              • RESULT
                
                
                SID=324221815251191
                
                
                SID=32422181753241
                
                SID=324221819525920
                SID=324221821424161
                
                … of the 9 lines, 4 are replaced with something and 5 become empty except for newline
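For anyone who wants to try the lookahead pattern outside Notepad++, here is a minimal sketch in Python. The sample text is hypothetical, `re.MULTILINE` stands in for per-line `^` and `$`, and Python's `.` already excludes newlines, which plays the role of `(?-s)`:

```python
import re

# Hypothetical sample; the thread's real data is not reproduced here.
text = "abc SID=123 def\nno id here\nSID=456 xyz\n"

# Each ".*?" is replaced by a run of characters that may not start "SID=\d+".
tempered = re.compile(
    r"^(?:(?!SID=\d+).)*?(SID=\d+)?(?:(?!SID=\d+).)*?$",
    re.MULTILINE,
)

# Equivalent of REPLACE = $1: keep group 1 if it matched, else nothing.
result = tempered.sub(lambda m: m.group(1) or "", text)

print(repr(result))  # 'SID=123\n\nSID=456\n'
```

The line without a SID becomes empty but keeps its newline, matching the RESULT listing above.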
                • pbarney @PeterJones

                @PeterJones, wow! Thank you for taking the time to do that. It’s a real trip around the block for what seems like it should be just a quick stop next door.

                Interestingly, I can get the result I want also, if I break it up into two steps:

                Step 1: .*?(SID=\d+).*
                Step 2: .*?(SID=\d+)?.*

                Not sure why this works. The second step by itself seems to be able to identify when the first group isn’t there, but can’t identify when it is there.
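A likely reason the two-step version works can be seen by running it in a script. The sketch below uses Python's `re` with hypothetical data, and assumes both steps replace the match with group 1 (the replacement field is not stated in the post):

```python
import re

# Hypothetical sample text.
text = "abc SID=123 def\nno id here\nxyz SID=456\n"

def keep_group1(m):
    # Assumed replacement: group 1 if it matched, otherwise nothing.
    return m.group(1) or ""

# Step 1: the group is mandatory, so only SID lines match,
# and each one is reduced to its SID string.
step1 = re.sub(r".*?(SID=\d+).*", keep_group1, text)

# Step 2: the group is optional again, but every SID now sits at the
# start of its line, which is the first place the pattern tries it.
step2 = re.sub(r".*?(SID=\d+)?.*", keep_group1, step1)

print(repr(step1))  # 'SID=123\nno id here\nSID=456\n'
print(repr(step2))  # 'SID=123\n\nSID=456\n'
```

After step 1 every SID begins its line, so in step 2 the lazy `.*?` can stay empty and the optional group still finds its text.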

                  • Neil Schipper @pbarney

                  Hi @pbarney

                  … but can’t identify when it is there

                  Well I don’t have a rigorous answer (without digging deep into regex engine details) except to say the form of the expression <any text><optional text><any text> gives the engine a lot of freedom, clearly too much, to efficiently meet the match condition.

                  You would benefit from learning about alternates (in matching) and Substitution Conditionals in Substitutions (“Replace with” fields).

                  Consider:

                  water
                  milk
                  water bottle
                  
                  carton of milk
                  water bottle
                  

                  To preserve only watery lines:
                  Find: (^.*?water.*(?:\R|\z))|^.*(?:\R|\z)
                  To preserve only milky lines:
                  Find: (^.*?milk.*(?:\R|\z))|^.*(?:\R|\z)
                  In either of the above cases use:
                  Replace with: ?1$1:

                  Empty lines anywhere, and trailing lines that don’t have an end-of-line, are handled as one would hope.
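The same alternation idea can be sketched in Python. Python's `re` has neither `\R` nor a conditional replacement such as `?1$1:`, so this version substitutes `\r?\n` for `\R`, `\Z` for `\z`, and a replacement function for the conditional. The helper name `keep_lines` is made up for this example:

```python
import re

def keep_lines(text: str, word: str) -> str:
    """Keep only the lines containing `word`; delete all other lines."""
    pattern = re.compile(
        # Alternative 1 captures a whole matching line (with its line ending);
        # alternative 2 matches any other line without capturing it.
        r"(^.*?" + re.escape(word) + r".*(?:\r?\n|\Z))|^.*(?:\r?\n|\Z)",
        re.MULTILINE,
    )
    # Stand-in for "?1$1:": emit group 1 if it matched, else nothing.
    return pattern.sub(lambda m: m.group(1) or "", text)

sample = "water\nmilk\nwater bottle\n\ncarton of milk\nwater bottle"

watery = keep_lines(sample, "water")
milky = keep_lines(sample, "milk")

print(repr(watery))  # 'water\nwater bottle\nwater bottle'
print(repr(milky))   # 'milk\ncarton of milk\n'
```

The sample deliberately ends without a final line break to exercise the end-of-text branch.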

                  Also, when experimenting with F&R actions, Ctrl+Z is a reliable friend.

                    • guy038

                    Hello, @pbarney, @peterjones, @terry-r, @neil-schipper and All,

                    For this specific search, I think we could imagine this process :

                    • Treat all the text as one single line, so use the leading (?s) modifier

                    • Search for anything followed by a first SID=\d+ string, stored as group 1

                    • Replace with group 1 followed by a line-break, if group 1 exists

                    • Redo the two previous steps as long as a SID=\d+ string can be found

                    • Finally, delete any remaining text once no further SID=\d+ string can be found


                    Thus, from this INPUT text :

                    Lorem ipsum dolor sit amet, libero turpis non cras ligula, id commodo, aenean 
                    est in volutpat amet sodales, porttitor bibendum facilisi suspendisse, aliquam 
                    ipsum ante morbi sed ipsum SID=324221815251191 mollis. Sollicitudin viverra, vel 
                    varius eget sit mollis. Commodo enim aliquam suspendisse tortor cum diam, commodo 
                    facilisis, rutrum et duis nisl porttitor, vel eleifend odio ultricies ut, orci in 
                    SID=32422181753241& adipiscing felis velit nibh. Consectetuer porttitor feugiat 
                    vestibulum sit feugiat, voluptates dui eros libero. Etiam vestibulum at lectus. 
                    Donec vivamus. Vel donec et scelerisque vestibulum. Condimentum SID=324221819525920
                    aliquam, mollit magna velit nec, SID=324221821424161 tempor cursus vitae sit 
                    

                    The following regex S/R :

                    SEARCH (?xs) .*? ( SID = \d+ ) | .+    or    (?s).*?(SID=\d+)|.+

                    REPLACE ?1\1\r\n

                    Would immediately produce this OUTPUT text :

                    SID=324221815251191
                    SID=32422181753241
                    SID=324221819525920
                    SID=324221821424161
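As a rough cross-check outside Notepad++, the single-pass search and replace can be sketched in Python. The conditional replacement `?1\1\r\n` has no direct equivalent in Python's `re`, so a replacement function is used instead; the sample text is hypothetical and `\n` replaces `\r\n`:

```python
import re

# Hypothetical text spanning several lines.
text = "lorem SID=111 ipsum\ndolor sit\namet SID=222& tail\n"

# re.DOTALL corresponds to the (?s) modifier: "." also matches line breaks.
pattern = re.compile(r".*?(SID=\d+)|.+", re.DOTALL)

# Group 1 plus a line break when a SID was found, nothing for the leftover text.
result = pattern.sub(lambda m: m.group(1) + "\n" if m.group(1) else "", text)

print(repr(result))  # 'SID=111\nSID=222\n'
```

The second alternative `.+` only gets a chance after the last SID, where it swallows the remaining text in one match.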
                    

                    Best Regards,

                    guy038

                      • pbarney @Neil Schipper

                      @Neil-Schipper said:

                      You would benefit from learning about alternates (in matching) and Substitution Conditionals in Substitutions (“Replace with” fields).

                      I agree! I’ve used alternates a bit, but I was completely unaware of substitution conditionals. Thank you for pointing that out. I think that will prove to be a powerful tool for me!

                        • pbarney @guy038

                        @guy038, thank you so much for coming up with that. That approach is less intuitive for me, but I’m continuing to learn that there can be many ways to approach problems with regular expressions.

                        I see how yours is working, and it makes complete sense to me now, although I’m still wondering why the regex engine is interpreting my original search in the way that it does. I suppose it will be a mystery for the ages. I’m just happy that you see this stuff clearly enough to point out alternatives.

                        You’re like a jungle guide who knows where the dangers are on the well-traveled trails, so you know alternate trails to take instead.

                          • Terry R @pbarney

                          @pbarney said in I don’t understand why this simple regex doesn’t work:

                          although I’m still wondering why the regex engine is interpreting my original search in the way that it does.

                          My test was meant to show which group was capturing the characters, and what it showed was that on every line my group 1 captured the entire line.

                          If you consider your original regex, you gave the first portion a non-greedy option .*?, then group 1 (inside the brackets) you also gave it the option to NOT capture by using the ?.

                          Now in my mind (trying to relate to how the regular expression engine will work) I’m thinking that I start capturing using the .*? in a non-greedy fashion, eventually getting to the end of the line. I then look at group 1 and as it contains the ? as well I feel that I don’t need to backtrack to release characters so group 1 can capture them, so I don’t.

                          As the first portion of your regex satisfied the requirements (matched) and the remainder of the regex didn’t require captures to be made there was no requirement to backtrack.

                          Terry

                            • guy038

                            Hello, @pbarney and All,

                            I’ll try to explain why your initial regex ^.*?(SID=\d+)?.* cannot work !

                            To begin with, let’s consider the first part of your regex :

                            ^.*?(SID=\d+)?

                            If you try this regex, against your text :

                            Lorem ipsum dolor sit amet, libero turpis non cras ligula, id commodo, aenean 
                            est in volutpat amet sodales, porttitor bibendum facilisi suspendisse, aliquam 
                            ipsum ante morbi sed ipsum SID=324221815251191 mollis. Sollicitudin viverra, vel 
                            varius eget sit mollis. Commodo enim aliquam suspendisse tortor cum diam, commodo 
                            facilisis, rutrum et duis nisl porttitor, vel eleifend odio ultricies ut, orci in 
                            SID=32422181753241& adipiscing felis velit nibh. Consectetuer porttitor feugiat 
                            vestibulum sit feugiat, voluptates dui eros libero. Etiam vestibulum at lectus. 
                            Donec vivamus. Vel donec et scelerisque vestibulum. Condimentum SID=324221819525920
                            aliquam, mollit magna velit nec, SID=324221821424161 tempor cursus vitae sit 
                            

                            You’ll note that it always matches a zero-length string, except on the 6th line, which begins with the SID=.... string. Why ?

                            Well, as you decided to use a lazy quantifier ( *?, or equivalently {0,}? ), the regex engine first tries the minimum match, i.e. the empty string, at the beginning of the line and, of course, cannot see the SID=... string there. But that does not matter, as the SID=... string is optional. So, the regex engine considers this zero-length match a correct match for the current line ! And so on, until …

                            The 6th line, where the SID=... string does begin the line. So, the regex engine considers this string a correct match for this 6th line. And so on…
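The zero-length matches are easy to observe in a script. This sketch uses Python's `re`, which is assumed here to treat the lazy quantifier the same way as Notepad++'s engine, on a few hypothetical lines:

```python
import re

# Hypothetical lines; only the second one begins with SID=.
lines = ["ipsum SID=111 mollis", "SID=222& adipiscing", "Donec vivamus"]

prefix = re.compile(r"^.*?(SID=\d+)?")

# group(0) is the overall matched text for each line.
matched = [prefix.match(line).group(0) for line in lines]

print(matched)  # ['', 'SID=222', '']
```

Only the line that starts with SID= yields a non-empty match; the others match the empty string at the start of the line.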


                            Now, when you add the final part .*, then, at the beginning of each line, due to the lazy quantifier, your regex is equivalent to :

                            • ^.*?.* ( in other words, equivalent to .* ), if the SID=... string is not at the beginning of the current line. As group 1 is not defined in this case, the regex engine simply replaces the current line, without its line-break, with nothing, resulting in an empty line

                            • (SID=\d+).* if the SID=... string begins the current line. In this case group 1 is defined and the regex engine replaces the entire contents of the current line with the string SID=.....


                            Finally, note that your second regex ^.*?(SID=\d+).* matches ONLY the lines containing a SID=... string. Thus, it’s obvious that the other lines remain untouched !

                            Nevertheless, it was easy to solve your problem. You ( and I ) could have thought of this regex S/R !

                            SEARCH (?-s)^.*(SID=\d+).*|.+\R

                            REPLACE \1

                            • When a line contains the SID=.... string, it just rewrites that string ( group 1 )

                            • When a line does not contain a SID=.... string, the second alternative of the regex, .+\R, grabs all contents of the current line WITH its line-break. But, as this second alternative does not refer to group 1 at all, nothing is rewritten during the replacement, and the line is simply deleted
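Here is a small Python sketch of that last search and replace, for experimenting outside Notepad++. It is an approximation: `\R` is not available in Python's `re`, so `\r?\n` is used instead, and the sample text is hypothetical:

```python
import re

# Hypothetical sample text.
text = "abc SID=123 def\nno id here\nSID=456 xyz\n"

# Alternative 1 rewrites a SID line as its SID string (line break untouched);
# alternative 2 swallows a non-SID line together with its line break.
pattern = re.compile(r"^.*(SID=\d+).*|.+\r?\n", re.MULTILINE)

# Equivalent of REPLACE \1: group 1 if defined, otherwise nothing.
result = pattern.sub(lambda m: m.group(1) or "", text)

print(repr(result))  # 'SID=123\nSID=456\n'
```

Unlike the patterns that leave blank lines behind, this one removes the non-SID lines entirely. Because the leading `.*` is greedy, a line holding several SID strings would keep only the last one.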

                            Best Regards,

                            guy038

                            The Community of users of the Notepad++ text editor.