    I don't understand why this simple regex doesn't work

    • pbarney

      I have a file containing many lines, and some of the lines contain a single pattern within them like SID=29394030394.

      My desire is to remove everything from the file except for those strings.

      I have the following in the Find/Replace dialog:

      Find what: ^.*?(SID=\d+)?.*
      Replace with: \1

      But it simply removes everything from the file (edit: unless SID=\d+ is the first thing on the line).

      If I remove the second question mark, i.e., ^.*?(SID=\d+).*, then it will successfully replace each found instance with just the string, but it leaves the rest of the lines intact.

      What am I missing?

      Sample data:

      Lorem ipsum dolor sit amet, libero turpis non cras ligula, id commodo, aenean 
      est in volutpat amet sodales, porttitor bibendum facilisi suspendisse, aliquam 
      ipsum ante morbi sed ipsum SID=324221815251191 mollis. Sollicitudin viverra, vel 
      varius eget sit mollis. Commodo enim aliquam suspendisse tortor cum diam, commodo 
      facilisis, rutrum et duis nisl porttitor, vel eleifend odio ultricies ut, orci in 
      SID=32422181753241& adipiscing felis velit nibh. Consectetuer porttitor feugiat 
      vestibulum sit feugiat, voluptates dui eros libero. Etiam vestibulum at lectus. 
      Donec vivamus. Vel donec et scelerisque vestibulum. Condimentum SID=324221819525920
      aliquam, mollit magna velit nec, SID=324221821424161 tempor cursus vitae sit 
      
      • PeterJones @pbarney

        @pbarney said in I don’t understand why this simple regex doesn’t work:

        But it simply removes everything from the file (edit: unless SID=\d+ is the first thing on the line).

        You told it: start at the beginning of the line, grab as few characters as possible until SID=\d+, put SID=\d+ into group 1 if that sequence exists or put nothing into group 1 if SID=\d+ isn’t on that line, and grab every remaining character. Then replace that match with just the contents of group 1. So a match isn’t dependent upon SID=\d+ being on that line.

        If you have . matches newline checkmarked, then the .*? at the beginning will grab every character, including newline, from the beginning of the file (which obviously starts at the beginning of a line) until the first SID=\d+, wherever that is; then put the SID in the group#1, then grab every character after including newlines all the way to the end of the file. And if you don’t have SID=\d+ anywhere, it won’t care, because you told it to match 0 or 1 instances, so a file without any will just match everything else in the file, set group#1 to empty, and replace everything with that empty string.

        If you unmark . matches newline, one might think it will be better, in that each .*? will be limited to a single line of data. But because (SID=\d+)? has the ?, it will still match 0 or 1 instance of that, so it will still match lines that don’t have SID=\d+. (Though it will leave a line ending for each line… how kind)
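
The behaviour described above can be reproduced outside Notepad++ with a small Python sketch. It is only an approximation: Python's re module is not the Boost engine that Notepad++ uses, SAMPLE is a shortened copy of the sample data, and re.MULTILINE stands in for having ". matches newline" unchecked with ^ matching at each line start.

```python
import re

# Shortened copy of the sample data (line text trimmed for brevity).
SAMPLE = """\
Lorem ipsum dolor sit amet
est in volutpat amet sodales
ipsum ante morbi sed ipsum SID=324221815251191 mollis. Sollicitudin
varius eget sit mollis
facilisis, rutrum et duis
SID=32422181753241& adipiscing felis
vestibulum sit feugiat
Donec vivamus. Condimentum SID=324221819525920
aliquam, mollit magna velit nec, SID=324221821424161 tempor"""

# The original pattern. "." does not match newlines by default in Python.
ORIGINAL = re.compile(r"^.*?(SID=\d+)?.*", re.MULTILINE)

# Replace each match with group 1, or with nothing when group 1 did not take part.
result = ORIGINAL.sub(lambda m: m.group(1) or "", SAMPLE).split("\n")
print(result)
# ['', '', '', '', '', 'SID=32422181753241', '', '', '']
```

Every line is matched, but only the line that starts with SID= keeps anything, which agrees with the symptom reported in the question.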

        I think what you really want is to make sure . matches newline is unchecked (or use (?-s) in your regex, which turns off that feature regardless of the checkmark), and remove the ? after (SID=\d+), resulting in just keeping the first SID=\d+ for each line:

        FIND = (?-s)^.*?(SID=\d+).*

        your dummy data becomes

        Lorem ipsum dolor sit amet, libero turpis non cras ligula, id commodo, aenean 
        est in volutpat amet sodales, porttitor bibendum facilisi suspendisse, aliquam 
        SID=324221815251191
        varius eget sit mollis. Commodo enim aliquam suspendisse tortor cum diam, commodo 
        facilisis, rutrum et duis nisl porttitor, vel eleifend odio ultricies ut, orci in 
        SID=32422181753241
        vestibulum sit feugiat, voluptates dui eros libero. Etiam vestibulum at lectus. 
        SID=324221819525920
        SID=324221821424161
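
As a rough cross-check of this result, here is a minimal Python sketch of the same search and replace. The assumptions: Python's re module approximates (but is not) the Boost engine in Notepad++, SAMPLE is a shortened copy of the sample data, and the replacement is taken to be \1 as in the original question.

```python
import re

# Shortened copy of the sample data (line text trimmed for brevity).
SAMPLE = """\
Lorem ipsum dolor sit amet
est in volutpat amet sodales
ipsum ante morbi sed ipsum SID=324221815251191 mollis. Sollicitudin
varius eget sit mollis
facilisis, rutrum et duis
SID=32422181753241& adipiscing felis
vestibulum sit feugiat
Donec vivamus. Condimentum SID=324221819525920
aliquam, mollit magna velit nec, SID=324221821424161 tempor"""

# Group 1 is mandatory here, so lines without SID=\d+ are not matched at all.
FIRST_FIX = re.compile(r"^.*?(SID=\d+).*", re.MULTILINE)

kept = FIRST_FIX.sub(r"\1", SAMPLE).split("\n")
for line in kept:
    print(line)
```

Lines containing a SID are reduced to that SID, while lines without one are never matched and so stay as they were.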
        
        • Terry R @pbarney

          @pbarney

          You could look at the problem a slightly different way:

          1. Use (SID=[^\x20]+) and replace with \r\n${1}\r\n. This puts all those SID strings on separate lines.
          2. Then bookmark lines with the SID= on them.
          3. Then remove non bookmarked lines.
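
The three steps above can be sketched in Python as follows. This is an illustration only: the bookmarking steps are Notepad++ menu actions, approximated here by filtering lines, the two-line text is a hypothetical excerpt of the sample data, and "\n" is used instead of "\r\n".

```python
import re

# Hypothetical two-line excerpt of the sample data.
text = (
    "ipsum SID=324221815251191 mollis. Commodo enim\n"
    "SID=32422181753241& adipiscing felis velit nibh"
)

# Step 1: put every SID string on its own line.
split_up = re.sub(r"(SID=[^\x20]+)", r"\n\1\n", text)

# Steps 2 and 3: "bookmark" lines containing SID= and drop all other lines.
sid_lines = [line for line in split_up.split("\n") if "SID=" in line]
print(sid_lines)
# ['SID=324221815251191', 'SID=32422181753241&']
```

Note that [^\x20]+ takes every non-space character, so the trailing & stays attached to the second SID; \d+ would stop at the digits.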

          It might sound slightly simplistic and indeed if you are wanting to learn more about regex (and push your boundaries) then it can be done in the manner you want. Someone can provide you with that solution in time, however sometimes looking at the solution and breaking it down can be advantageous.

          Terry

          • pbarney @PeterJones

            @PeterJones, thanks for taking a stab at it. The problem here is that it still leaves other, non-SID data in the file.

            So a match isn’t dependent upon SID=\d+ being on that line.

            Yes, that is my intention. If there isn’t a SID=\d+ in the line, it should select the entire line and replace it with nothing, but if there is a SID=\d+ in the line, it should capture it into group 1, replacing everything on the line with group 1.

            My attempt here is to render the sample data as:

            SID=324221815251191
            SID=32422181753241
            SID=324221819525920
            SID=324221821424161
            

            …although I recognize that it will actually contain a number of blank lines, which I’m fine with.

            I’m just not sure why it’s not actually capturing my string into the capture group at all.

            • pbarney @Terry R

              @Terry-R said in I don’t understand why this simple regex doesn’t work:

              You could look at the problem a slightly different way […] if you are wanting to learn more about regex (and push your boundaries) then it can be done […] sometimes looking at the solution and breaking it down can be advantageous.

              You’re absolutely right! In this case, yes, I am attempting to square up my RegEx skills here, so this question isn’t so much about “how to get the job done” as it is “how to get the job done with RegEx.” In any case, I’ve actually completed the work, I’d just like to know what I’m misunderstanding about my regex.

              • Terry R @pbarney

                @pbarney said in I don’t understand why this simple regex doesn’t work:

                I’d just like to know what I’m misunderstanding about my regex.

                I thought that might be the case. I was thinking on some of the steps I’d take when my regex didn’t produce the output I wanted. In your case I’d start like this:

                [image: 5ca2e6f2-34a4-4a1d-9c54-3b0e8a3da9dc-image.png]

                which immediately shows me that the line always seems to put it all into group 1 regardless.

                Terry

                • PeterJones @pbarney

                  @pbarney said in I don’t understand why this simple regex doesn’t work:

                  …although I recognize that it will actually contain a number of blank lines, which I’m fine with.

                  Ah, okay, that wasn’t clear because you didn’t show your desired “after” data.

                  I’m just not sure why it’s not actually capturing my string into the capture group at all.

                  Because the group is optional (0-or-1 instances of SID=\d+) and is followed by a greedy .*, the engine is free to let the ? match 0 times: the group matches an empty string, and the SID and everything after it are consumed as part of the .*. Unfortunately, even with a $ anchor at the end, and with the final .* made non-greedy, one of the two .* still seems to take precedence over the 0-or-1; as Terry’s experiments showed, it appears to be the left .*?.

                  So with that knowledge, I can fix it by using a negative lookahead trick: the pattern (?!NEG). will match one of any character that is not the start of NEG, and (?:(?!NEG).)*? will match 0-or-more-characters non-greedy, as long as those characters aren’t the start of NEG.

                  Taking that concept, replace each .*? with (?:(?!SID=\d+).)*?, and it will work:

                  • FIND = (?-s)^(?:(?!SID=\d+).)*?(SID=\d+)?(?:(?!SID=\d+).)*?$
                  • REPLACE = $1
                  • RESULT
                    
                    
                    SID=324221815251191
                    
                    
                    SID=32422181753241
                    
                    SID=324221819525920
                    SID=324221821424161
                    
                    … of the 9 lines, 4 are replaced with something and 5 become empty except for newline
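
The negative lookahead trick above can be tried in Python as well. This is a sketch under assumptions: Python's re module only approximates the Boost engine in Notepad++, SAMPLE is a shortened copy of the sample data, and a replacement function stands in for $1 so that an unset group becomes an empty string.

```python
import re

# Shortened copy of the sample data (line text trimmed for brevity).
SAMPLE = """\
Lorem ipsum dolor sit amet
est in volutpat amet sodales
ipsum ante morbi sed ipsum SID=324221815251191 mollis. Sollicitudin
varius eget sit mollis
facilisis, rutrum et duis
SID=32422181753241& adipiscing felis
vestibulum sit feugiat
Donec vivamus. Condimentum SID=324221819525920
aliquam, mollit magna velit nec, SID=324221821424161 tempor"""

# (?:(?!SID=\d+).)*? consumes characters one at a time, but never steps onto
# the start of a SID, so the optional group is forced to capture it.
TEMPERED = re.compile(
    r"^(?:(?!SID=\d+).)*?(SID=\d+)?(?:(?!SID=\d+).)*?$", re.MULTILINE
)

lines = TEMPERED.sub(lambda m: m.group(1) or "", SAMPLE).split("\n")
print(lines)
```

With the shortened sample this gives 4 SID lines and 5 empty lines, in the same positions as the RESULT listing. A line holding two SIDs would not match this pattern at all, since neither tempered part may cross a SID.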
                  • pbarney @PeterJones

                    @PeterJones, wow! Thank you for taking the time to do that. It’s a real trip around the block for what seems like it should be just a quick stop next door.

                    Interestingly, I can get the result I want also, if I break it up into two steps:

                    Step 1: .*?(SID=\d+).*
                    Step 2: .*?(SID=\d+)?.*

                    Not sure why this works. The second step by itself seems to be able to identify when the first group isn’t there, but can’t identify when it is there.
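
A Python sketch of the two steps may help show what each one contributes. Assumptions: both steps replace with group 1, Python's re module stands in for Notepad++'s Boost engine, and SAMPLE is a shortened copy of the sample data.

```python
import re

# Shortened copy of the sample data (line text trimmed for brevity).
SAMPLE = """\
Lorem ipsum dolor sit amet
est in volutpat amet sodales
ipsum ante morbi sed ipsum SID=324221815251191 mollis. Sollicitudin
varius eget sit mollis
facilisis, rutrum et duis
SID=32422181753241& adipiscing felis
vestibulum sit feugiat
Donec vivamus. Condimentum SID=324221819525920
aliquam, mollit magna velit nec, SID=324221821424161 tempor"""

# Step 1: lines with a SID are reduced to that SID, so the SID now starts the line.
step1 = re.sub(r".*?(SID=\d+).*", r"\1", SAMPLE)

# Step 2: the optional group only captures when the SID is at the very start
# of the match, which step 1 has just guaranteed for every SID line.
step2 = re.sub(r".*?(SID=\d+)?.*", lambda m: m.group(1) or "", step1)

print(step2.split("\n"))
```

After step 1 every SID sits at the start of its line, which is the one case the question notes the optional-group pattern handles correctly.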

                    • Neil Schipper @pbarney

                      Hi @pbarney

                      … but can’t identify when it is there

                      Well I don’t have a rigorous answer (without digging deep into regex engine details) except to say the form of the expression <any text><optional text><any text> gives the engine a lot of freedom, clearly too much, to efficiently meet the match condition.

                      You would benefit from learning about alternates (in matching) and Substitution Conditionals in Substitutions (“Replace with” fields).

                      Consider:

                      water
                      milk
                      water bottle
                      
                      carton of milk
                      water bottle
                      

                      To preserve only watery lines:
                      Find: (^.*?water.*(?:\R|\z))|^.*(?:\R|\z)
                      To preserve only milky lines:
                      Find: (^.*?milk.*(?:\R|\z))|^.*(?:\R|\z)
                      In either of the above cases use:
                      Replace with: ?1$1:

                      Empty lines anywhere, and trailing lines that don’t have an end-of-line, are handled as one would hope.
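
For comparison, the same idea can be sketched in Python. Python's re module has neither \R, \z, nor the ?1$1: conditional replacement, so this sketch swaps in \n, \Z and a replacement function; those substitutions are assumptions of the sketch, not part of the recipe above.

```python
import re

TEXT = "water\nmilk\nwater bottle\n\ncarton of milk\nwater bottle\n"

# First alternative captures a whole "watery" line (with its line ending) in
# group 1; the second alternative matches any other line without capturing.
KEEP_WATER = re.compile(r"(^.*?water.*(?:\n|\Z))|^.*(?:\n|\Z)", re.MULTILINE)


def keep_group_1(match):
    """Emulate the ?1$1: replacement: emit group 1 if it matched, else nothing."""
    return match.group(1) or ""


watery = KEEP_WATER.sub(keep_group_1, TEXT)
print(repr(watery))
# 'water\nwater bottle\nwater bottle\n'
```

The alternation decides which lines survive, and the conditional replacement only has to ask whether group 1 took part in the match.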

                      Also, when experimenting with F&R actions, Ctrl+Z is a reliable friend.

                      • guy038

                        Hello, @pbarney, @peterjones, @terry-r, @neil-schipper and All,

                        For this specific search, I think we could imagine this process :

                        • Treat all the text as one single line, so we’ll use the leading (?s) modifier

                        • Search for anything followed by a first string SID=\d+, stored as group 1

                        • Replace with group 1 followed by a line-break, if group 1 exists

                        • Redo the two previous steps as long as a SID=\d+ string can be found

                        • Finally, delete any remaining text, when no SID=\d+ string can be found anymore


                        Thus, from this INPUT text :

                        Lorem ipsum dolor sit amet, libero turpis non cras ligula, id commodo, aenean 
                        est in volutpat amet sodales, porttitor bibendum facilisi suspendisse, aliquam 
                        ipsum ante morbi sed ipsum SID=324221815251191 mollis. Sollicitudin viverra, vel 
                        varius eget sit mollis. Commodo enim aliquam suspendisse tortor cum diam, commodo 
                        facilisis, rutrum et duis nisl porttitor, vel eleifend odio ultricies ut, orci in 
                        SID=32422181753241& adipiscing felis velit nibh. Consectetuer porttitor feugiat 
                        vestibulum sit feugiat, voluptates dui eros libero. Etiam vestibulum at lectus. 
                        Donec vivamus. Vel donec et scelerisque vestibulum. Condimentum SID=324221819525920
                        aliquam, mollit magna velit nec, SID=324221821424161 tempor cursus vitae sit 
                        

                        The following regex S/R :

                        SEARCH (?xs) .*? ( SID = \d+ ) | .+    or    (?s).*?(SID=\d+)|.+

                        REPLACE ?1\1\r\n

                        Would immediately produce this OUTPUT text :

                        SID=324221815251191
                        SID=32422181753241
                        SID=324221819525920
                        SID=324221821424161
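
The process above can be approximated in Python. This is a sketch, not the Notepad++ behaviour itself: Python's re module has no ?1 conditional replacement, so a replacement function plays that role, "\n" replaces "\r\n", and SAMPLE is a shortened copy of the INPUT text.

```python
import re

# Shortened copy of the sample data (line text trimmed for brevity).
SAMPLE = """\
Lorem ipsum dolor sit amet
est in volutpat amet sodales
ipsum ante morbi sed ipsum SID=324221815251191 mollis. Sollicitudin
varius eget sit mollis
facilisis, rutrum et duis
SID=32422181753241& adipiscing felis
vestibulum sit feugiat
Donec vivamus. Condimentum SID=324221819525920
aliquam, mollit magna velit nec, SID=324221821424161 tempor"""

# (?s) lets "." cross line breaks, so each match runs from the end of the
# previous SID to the end of the next one; ".+" mops up the text after the last SID.
SINGLE_PASS = re.compile(r"(?s).*?(SID=\d+)|.+")

out = SINGLE_PASS.sub(
    lambda m: (m.group(1) + "\n") if m.group(1) else "", SAMPLE
)
print(out)
```

Because the whole text is treated as one line, no blank lines are left behind between the SID strings.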
                        

                        Best Regards,

                        guy038

                        • pbarney @Neil Schipper

                          @Neil-Schipper said:

                          You would benefit from learning about alternates (in matching) and Substitution Conditionals in Substitutions (“Replace with” fields).

                          I agree! I’ve used alternates a bit, but I was completely unaware of substitution conditionals. Thank you for pointing that out. I think that will prove to be a powerful tool for me!

                          • pbarney @guy038

                            @guy038, thank you so much for coming up with that. That approach is less intuitive for me, but I’m continuing to learn that there can be many ways to approach problems with regular expressions.

                            I see how yours is working, and it makes complete sense to me now, although I’m still wondering why the regex engine is interpreting my original search in the way that it does. I suppose it will be a mystery for the ages. I’m just happy that you see this stuff clearly enough to point out alternatives.

                            You’re like a jungle guide who knows where the dangers are on the well-traveled trails, so you know alternate trails to take instead.

                            • Terry R @pbarney

                              @pbarney said in I don’t understand why this simple regex doesn’t work:

                              although I’m still wondering why the regex engine is interpreting my original search in the way that it does.

                              My test, which was meant to show which group was capturing the characters, showed that on every line my group 1 captured the entire line.

                              If you consider your original regex, you gave the first portion a non-greedy option .*?, then group 1 (inside the brackets) you also gave it the option to NOT capture by using the ?.

                              Now in my mind (trying to relate to how the regular expression engine will work) I’m thinking that I start capturing using the .*? in a non-greedy fashion, eventually getting to the end of the line. I then look at group 1 and as it contains the ? as well I feel that I don’t need to backtrack to release characters so group 1 can capture them, so I don’t.

                              As the first portion of your regex satisfied the requirements (matched) and the remainder of the regex didn’t require captures to be made there was no requirement to backtrack.

                              Terry

                              • guy038

                                Hello, @pbarney and All,

                                I’ll try to explain to you why your initial regex ^.*?(SID=\d+)?.* cannot work !

                                To begin with, let’s consider the first part of your regex :

                                ^.*?(SID=\d+)?

                                If you try this regex, against your text :

                                Lorem ipsum dolor sit amet, libero turpis non cras ligula, id commodo, aenean 
                                est in volutpat amet sodales, porttitor bibendum facilisi suspendisse, aliquam 
                                ipsum ante morbi sed ipsum SID=324221815251191 mollis. Sollicitudin viverra, vel 
                                varius eget sit mollis. Commodo enim aliquam suspendisse tortor cum diam, commodo 
                                facilisis, rutrum et duis nisl porttitor, vel eleifend odio ultricies ut, orci in 
                                SID=32422181753241& adipiscing felis velit nibh. Consectetuer porttitor feugiat 
                                vestibulum sit feugiat, voluptates dui eros libero. Etiam vestibulum at lectus. 
                                Donec vivamus. Vel donec et scelerisque vestibulum. Condimentum SID=324221819525920
                                aliquam, mollit magna velit nec, SID=324221821424161 tempor cursus vitae sit 
                                

                                You’ll note that it always matches a zero-length string, except on the 6th line, which begins with the SID=.... string. Why ?

                                Well, as you decided to use a lazy quantifier ( *?, or equivalently {0,}? ), the regex engine begins by matching the minimum string, i.e. the empty string, at the beginning of the line and, of course, cannot see the string SID=... at this position. But it does not matter, as the SID=... string is optional. So, the regex engine considers this zero-length match a correct match for the current line ! And so on till …

                                The 6th line, where the SID=... string does begin the line. So, the regex engine considers this string a correct match for this 6th line. And so on…


                                Now, when you add the final part .*, then, at each beginning of line, due to the lazy quantifier, your regex is equivalent to :

                                • ^.*?.* ( in other words, equivalent to .* ), if the SID=... string is not at the beginning of the current line. As group 1 takes no part in the match, it is not defined, so the regex engine simply replaces the current line, without its line-break, with nothing, resulting in an empty line

                                • (SID=\d+).* if the SID=... string begins the current line. In this case group 1 is defined and the regex engine replaces the whole contents of the current line with the string SID=.....
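
The two cases above can be checked line by line with a short Python sketch. Assumptions: Python's re module stands in for Notepad++'s Boost engine, and the three lines are shortened excerpts of the sample text.

```python
import re

# Shortened excerpts: no SID, SID in the middle, SID at the start.
lines = [
    "Lorem ipsum dolor sit amet",
    "ipsum ante morbi sed ipsum SID=324221815251191 mollis.",
    "SID=32422181753241& adipiscing felis",
]

FIRST_PART = re.compile(r"^.*?(SID=\d+)?")   # first part of the regex only
FULL = re.compile(r"^.*?(SID=\d+)?.*")       # the complete original regex

# For each line: (text matched by the first part, group 1 of the full regex).
report = [(FIRST_PART.match(l).group(0), FULL.match(l).group(1)) for l in lines]
for row in report:
    print(row)
# ('', None)
# ('', None)
# ('SID=32422181753241', 'SID=32422181753241')
```

The lazy .*? never has a reason to grow, because the optional group is allowed to match nothing at the very first position.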


                                Finally, note that your second regex ^.*?(SID=\d+).* matches ONLY the lines containing a SID=... string. Thus, it’s obvious that the other lines remain untouched !

                                Nevertheless, it was easy to solve your problem. You ( and I ) could have thought of this regex S/R !

                                SEARCH (?-s)^.*(SID=\d+).*|.+\R

                                REPLACE \1

                                • When a line contains the SID=.... string, it just rewrites that string ( group 1 )

                                • When a line does not contain a SID=.... string, the second alternative of the regex, .+\R, grabs all contents of the current line WITH its line-break. But, as this second alternative does not involve group 1 at all, nothing is rewritten during the replacement, and the lines are just deleted
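
A minimal Python sketch of this last S/R, under the usual assumptions: Python's re module only approximates Notepad++'s Boost engine, \n stands in for \R (which Python's re does not support), and SAMPLE is a shortened copy of the sample text.

```python
import re

# Shortened copy of the sample data (line text trimmed for brevity).
SAMPLE = """\
Lorem ipsum dolor sit amet
est in volutpat amet sodales
ipsum ante morbi sed ipsum SID=324221815251191 mollis. Sollicitudin
varius eget sit mollis
facilisis, rutrum et duis
SID=32422181753241& adipiscing felis
vestibulum sit feugiat
Donec vivamus. Condimentum SID=324221819525920
aliquam, mollit magna velit nec, SID=324221821424161 tempor"""

# First alternative: a line containing a SID, captured in group 1 (line break
# left in place). Second alternative: any other non-empty line WITH its line break.
FINAL = re.compile(r"^.*(SID=\d+).*|.+\n", re.MULTILINE)

out = FINAL.sub(lambda m: m.group(1) or "", SAMPLE)
print(out)
```

Since the first .* is greedy, a line with several SID strings would keep only the last one; the sample data has at most one per line.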

                                Best Regards,

                                guy038

                                The Community of users of the Notepad++ text editor.