Find a specific sequence on every line

  • G
    George Martinez
    last edited by George Martinez Oct 18, 2021, 4:11 AM Oct 18, 2021, 4:09 AM

    I have a large file that looks like the following:

    sdfsdf3 sdff9dg port: 2522 dgfgdfg
    d*f@@fsdf #sdfd sdf s port: 52
    sdf g2dfg53354 !df gdfgdf port: 81 sdf gfdgdfg

    Every line has port: followed by a number.

    I need to get rid of everything on every line except for port:#. How do I do this?
    I know I need to use regex but don’t know how to do something this complex.
    Thanks, and I appreciate it.

    • N
      Neil Schipper
      last edited by Oct 18, 2021, 10:39 AM

      This works for me:

      find what: .*(port:\h\d+)\s.*
      replace with: $1

      “. matches newline” not checked

      • G
        guy038
        last edited by guy038 Oct 18, 2021, 12:19 PM Oct 18, 2021, 12:18 PM

        Hello, @george-martinez, @neil-schipper and All,

        Neil, I would prefer this regex S/R :

        SEARCH (?-si)^.+port:\h*(\d+).*

        REPLACE port $1

        See the difference with your version, against the text below :

        sdfsdf3 sdff9dg port:      2522 dgfgdfg
        d*f@@fsdf #sdfd sdf s port:52
        sdf g2dfg53354 !df gdfgdf port: 81 sdf gfdgdfg
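To see the difference outside the editor, here is a rough Python sketch that runs both search/replace pairs on the three test lines, one line at a time. It is only an approximation: Python's `re` module is not the Boost engine Notepad++ uses, it has no `\h` escape (so `[ \t]` stands in for it as an assumption), it does not accept the `(?-si)` prefix (Python's defaults already behave that way), and `\1` replaces `$1`.

```python
import re

# Neil's search, with Boost's \h approximated as [ \t] (assumption).
NEIL = re.compile(r".*(port:[ \t]\d+)\s.*")
# guy038's preferred search, same approximation, (?-si) prefix dropped.
GUY = re.compile(r"^.+port:[ \t]*(\d+).*")

SAMPLE = [
    "sdfsdf3 sdff9dg port:      2522 dgfgdfg",
    "d*f@@fsdf #sdfd sdf s port:52",
    "sdf g2dfg53354 !df gdfgdf port: 81 sdf gfdgdfg",
]

# Apply each search/replace pair to every line separately.
neil_out = [NEIL.sub(r"\1", line) for line in SAMPLE]
guy_out = [GUY.sub(r"port \1", line) for line in SAMPLE]

for before, n, g in zip(SAMPLE, neil_out, guy_out):
    print(f"{before!r}\n  Neil:   {n!r}\n  guy038: {g!r}")
```

In this sketch the first two lines are left untouched by the first pattern, because it expects exactly one blank after the colon, while the second pattern rewrites all three lines.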
        

        Note that my first attempt was the regex S/R below, which is correct but does not output aligned numbers, yet ;-))

        SEARCH (?-si)^.+(port:\h*\d+).*

        REPLACE $1

        Best Regards,

        guy038

        P.S. :

        Neil, you don’t need to add the comment “. matches newline not checked”, if you begin your regexes with the (?-s) modifier. This modifier forces the regex engine to consider any regex . as a single standard char, EVEN IF the . matches newline option is ticked ! And vice-versa for the (?s) modifier
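For readers who know Python better than Boost, the same idea can be sketched with Python's `re` module, where the `re.DOTALL` flag plays the role of the ". matches newline" checkbox. This is an analogy under assumptions, not the Notepad++ behaviour itself: as far as I know Python does not accept a bare `(?-s)` prefix, so the sketch uses a scoped `(?-s:...)` group to switch the flag off.

```python
import re

text = "first\nsecond"

# Default: "." stops at a line feed, so the match stays on one line.
default_match = re.match(r"f.*", text).group()

# (?s) at the start of the pattern lets "." cross the line feed.
dotall_match = re.match(r"(?s)f.*", text).group()

# A scoped (?-s:...) group turns DOTALL off again inside the group,
# even though the caller passed re.DOTALL (the "ticked checkbox").
scoped_match = re.match(r"(?-s:f.*)", text, re.DOTALL).group()

print(repr(default_match), repr(dotall_match), repr(scoped_match))
```

The point carried over from the P.S. is that a modifier written inside the pattern wins over whatever option was set outside it.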

        • N
          Neil Schipper @guy038
          last edited by Oct 18, 2021, 6:55 PM

          Hello @guy038,

          Your solution is probably more robust and general than mine, but we can’t really say because the spec is incomplete, and all three of our solutions could fail on some data.

          • spec does not say whether lines may start with 'port: ', which my solution accepts and both of your solutions skip
          • spec does not say whether 'port: ' (when not at start) must be preceded by whitespace, so all three of our solutions process sdf g2fport: 5354 !df df port: 82 sdf into port 82, possibly failing to capture port 5354
          • spec says neither “exactly one space after the colon” (me) nor “any run of whitespace” (you) after the colon
          • (this one is trivial, but) spec doesn’t say whether output should keep the colon (which your preferred solution, probably inadvertently, leaves out)
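The first two bullets can be checked with a small Python sketch. Assumptions: `[ \t]` stands in for Boost's `\h`, the `(?-si)` prefix is dropped because Python's defaults already match it, each line is processed on its own, and the line `port: 7 abc` is a made-up test case. The `findall` call at the end is an alternative technique, not one of the three solutions in this thread.

```python
import re

NEIL = re.compile(r".*(port:[ \t]\d+)\s.*")   # first solution
GUY = re.compile(r"^.+port:[ \t]*(\d+).*")    # guy038's preferred solution

starts_with_port = "port: 7 abc"                     # hypothetical line
two_ports = "sdf g2fport: 5354 !df df port: 82 sdf"  # line quoted in the bullet

# Bullet 1: a line that begins with "port:".
neil_start = NEIL.sub(r"\1", starts_with_port)      # rewritten
guy_start = GUY.sub(r"port \1", starts_with_port)   # left unchanged

# Bullet 2: the greedy leading ".*" / ".+" keeps only the LAST port.
neil_two = NEIL.sub(r"\1", two_ports)
guy_two = GUY.sub(r"port \1", two_ports)

# Alternative (not from the thread): collect every port number on the line.
all_ports = re.findall(r"port:[ \t]*(\d+)", two_ports)

print(neil_start, guy_start, neil_two, guy_two, all_ports, sep="\n")
```

Whether keeping only the last port is right or wrong depends on the missing part of the spec, which is the point of the bullets.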

          So we’re all guessing to some degree.

          And based on the loose way things are often specified, people taking our advice are probably not brutally testing our solutions against reams of test cases, and we can only hope and pray that they’re not involved with anything life-critical.

          In regard to selecting “line oriented” vs. “sea of bytes oriented” by prefixing every search string with a (? directive, I do recognize its power (and only recently, thanks to posts by yourself and the other 2 or 3 regex solution-providers at this site). I also think that including it is a disincentive for regex noobs to dig in and learn what the whole search string is doing. It’s an extra subtlety on top of what are already somewhat subtle concepts & constructs, like ‘+’ vs. ‘*’, and dozens more.

          Actually I regard placing in the find dialog, the option that modifies ‘.’ something of a design flaw, and I feel it should really be hidden away in Preferences. My reasoning is that a super-majority of users are non-sophisticates, and a super-majority of use cases are line-oriented; a good number of this kind of user will be able and willing to learn a few regex tricks now and again and build their competence over time.

          OTOH, people dealing with “sea of bytes” situations are more likely to be computer/data savvy who are in a position to “read the docs” and become regex sophisticates; if not they must seek advice from reliable sophisticates and apply their solutions in innocence/ignorance.

          • N
            Neil Schipper @Neil Schipper
            last edited by Oct 18, 2021, 7:24 PM

            @Neil-Schipper said in Find a specific sequence on every line:

            spec doesn’t say whether output should keep the colon

            Oops: he did include the colon in required output. He also (maybe inadvertently) omitted the space between colon and the number.

            So @george-martinez: test, test, test.

            • G
              guy038
              last edited by Oct 18, 2021, 9:31 PM

              Hi, @george-martinez, @neil-schipper and All,

              Regarding the P.S. of my previous post, it was provided specifically for your own knowledge of N++ Boost regexes, since you seem to have mastered the basics of regexes, and not for regex noobs, of course !

              You may find some info on these modifiers, at the beginning of my other post :

              https://community.notepad-plus-plus.org/post/70509


              Oh…, as you spoke about the colon char, I’ve just realized that I omitted the colon char in the Replace regex :-(

              So my final version should be :

              SEARCH (?-si)^.+port:\h*(\d+).*

              REPLACE port: $1
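As a rough cross-check of this final version, here is a Python sketch that applies it to a whole buffer at once, the way the editor does. Assumptions: `[ \t]` approximates Boost's `\h`, `re.MULTILINE` makes `^` match at every line start, the `(?-si)` prefix is dropped, the text uses plain `\n` line endings, and `keep_ports` is a hypothetical helper name.

```python
import re

# Final search regex, adapted to Python's re module (see assumptions above).
FINAL = re.compile(r"^.+port:[ \t]*(\d+).*", re.MULTILINE)


def keep_ports(text: str) -> str:
    """Replace every matching line with 'port: <number>'; other lines stay as-is."""
    return FINAL.sub(r"port: \1", text)


sample = (
    "sdfsdf3 sdff9dg port: 2522 dgfgdfg\n"
    "d*f@@fsdf #sdfd sdf s port: 52\n"
    "sdf g2dfg53354 !df gdfgdf port: 81 sdf gfdgdfg\n"
)

result = keep_ports(sample)
print(result)
```

With `\r\n` line endings the Python version would also swallow the `\r`, because Python's `.` only excludes `\n`, so that case needs extra care outside the editor.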

              BR

              guy038

              • N
                Neil Schipper @guy038
                last edited by Oct 19, 2021, 8:36 AM

                Hi @guy038,

                Your #70509 was quite extraordinary in depth and detail. I started a reply but did not publish because I’m still absorbing some aspects of it, and, I have some uncertainties. I do intend to reply, I hope in the next day or so.

                In regard to this thread, you did not respond to the main points of my last post.

                First point is that because the spec is loose, it’s tough to be sure whether any of our three solutions are truly complete (and you only referred to the least interesting of the four bullets).

                For example, both of your solutions would leave this line unchanged:

                port: 184 06 dfghjk
                

                because they don’t match port: in column 1. But maybe that can’t occur in the data files @george-martinez has to process, and your solution is fine. We just don’t know. (Also, I could have had a 5th bullet to mention that your solutions force case sensitivity, but this was not specified.) Anyway, I’ll try to remember when I offer solutions to state if I think there are unspecified aspects of the data described/provided by the request-maker that could make the solution misfire.

                The second point questions whether it’s a good idea to prefix every solution with a (? directive on pedagogical grounds. I tried to convey that I learned it recently (from you, plural, since I don’t remember if I saw it first in a post by you or TJ, PJ, or AK), and that I recognize that it’s a robust technique.

                • G
                  guy038
                  last edited by guy038 Oct 19, 2021, 11:46 AM Oct 19, 2021, 11:37 AM

                  Hi, @george-martinez, @neil-schipper and All,

                  I agree that the right regex solution highly depends on OP’s needs ! So I’ll try to be a bit more exhaustive ;-))


                  The regexes below match ANY line, without its line-break chars, containing the string port:, followed by an INTEGER, under the following conditions :

                  Case A  The string 'port....###', with this EXACT case, may occur at ANY position of the current line
                  Case B  The string 'port....###', with this EXACT case, is FOLLOWED                                      with some STANDARD characters
                  Case C  The string 'port....###', with this EXACT case, is PRECEDED                                      with some STANDARD characters
                  Case D  The string 'port....###', with this EXACT case, is PRECEDED and FOLLOWED                         with some STANDARD characters
                  Case E  The string 'port....###', with this EXACT case, BEGINS          the current line
                  Case F  The string 'port....###', with this EXACT case, BEGINS          the current line and is FOLLOWED with some STANDARD characters
                  Case G  The string 'port....###', with this EXACT case, ENDS            the current line
                  Case H  The string 'port....###', with this EXACT case, ENDS            the current line and is PRECEDED with some STANDARD characters
                  Case I  The string 'port....###', with this EXACT case, BEGINS and ENDS the current line
                  
                  Case J  The string 'port....###', WHATEVER its case,    may occur at ANY position of the current line
                  Case K  The string 'port....###', WHATEVER its case,    is FOLLOWED                                      with some STANDARD characters
                  Case L  The string 'port....###', WHATEVER its case,    is PRECEDED                                      with some STANDARD characters
                  Case M  The string 'port....###', WHATEVER its case,    is PRECEDED and FOLLOWED                         with some STANDARD characters
                  Case N  The string 'port....###', WHATEVER its case,    BEGINS          the current line
                  Case O  The string 'port....###', WHATEVER its case,    BEGINS          the current line and is FOLLOWED with some STANDARD characters
                  Case P  The string 'port....###', WHATEVER its case,    ENDS            the current line
                  Case Q  The string 'port....###', WHATEVER its case,    ENDS            the current line and is PRECEDED with some STANDARD characters
                  Case R  The string 'port....###', WHATEVER its case,    BEGINS and ENDS the current line
                  
                  and :
                  
                  Case 1  ANY range of consecutive HORIZONTAL BLANK char(s), even NONE, between the string 'port:' and the INTEGER
                  Case 2  ANY range of consecutive HORIZONTAL BLANK char(s),            between the string 'port:' and the INTEGER
                  Case 3  A SINGLE SPACE      char                                      between the string 'port:' and the INTEGER
                  Case 4  A SINGLE TABULATION char                                      between the string 'port:' and the INTEGER
                  
                  

                  Thus, here is the table of the different search regexes, according to their respective conditions

                  At the INTERSECTION of a column (the character(s) allowed between the string port: and the integer) and a row (the possible locations of the string port:.......#### and its case), you’ll find the appropriate search regex :


                           Case 1                      Case 2                       Case 3                      Case 4
                  
                  (?-is)^.*port:\h*(\d+).*    (?-is)^.*port:\h+(\d+).*    (?-is)^.*port:\x20(\d+).*    (?-is)^.*port:\t(\d+).*    Case A
                  (?-is)^.*port:\h*(\d+).+    (?-is)^.*port:\h+(\d+).+    (?-is)^.*port:\x20(\d+).+    (?-is)^.*port:\t(\d+).+    Case B
                  
                  (?-is)^.+port:\h*(\d+).*    (?-is)^.+port:\h+(\d+).*    (?-is)^.+port:\x20(\d+).*    (?-is)^.+port:\t(\d+).*    Case C
                  (?-is)^.+port:\h*(\d+).+    (?-is)^.+port:\h+(\d+).+    (?-is)^.+port:\x20(\d+).+    (?-is)^.+port:\t(\d+).+    Case D
                  
                  (?-is)^port:\h*(\d+).*      (?-is)^port:\h+(\d+).*      (?-is)^port:\x20(\d+).*      (?-is)^port:\t(\d+).*      Case E
                  (?-is)^port:\h*(\d+).+      (?-is)^port:\h+(\d+).+      (?-is)^port:\x20(\d+).+      (?-is)^port:\t(\d+).+      Case F
                  
                  (?-is)^.*port:\h*(\d+)$     (?-is)^.*port:\h+(\d+)$     (?-is)^.*port:\x20(\d+)$     (?-is)^.*port:\t(\d+)$     Case G
                  (?-is)^.+port:\h*(\d+)$     (?-is)^.+port:\h+(\d+)$     (?-is)^.+port:\x20(\d+)$     (?-is)^.+port:\t(\d+)$     Case H
                  
                  (?-is)^port:\h*(\d+)$       (?-is)^port:\h+(\d+)$       (?-is)^port:\x20(\d+)$       (?-is)^port:\t(\d+)$       Case I
                  
                  (?i-s)^.*port:\h*(\d+).*    (?i-s)^.*port:\h+(\d+).*    (?i-s)^.*port:\x20(\d+).*    (?i-s)^.*port:\t(\d+).*    Case J
                  (?i-s)^.*port:\h*(\d+).+    (?i-s)^.*port:\h+(\d+).+    (?i-s)^.*port:\x20(\d+).+    (?i-s)^.*port:\t(\d+).+    Case K
                  
                  (?i-s)^.+port:\h*(\d+).*    (?i-s)^.+port:\h+(\d+).*    (?i-s)^.+port:\x20(\d+).*    (?i-s)^.+port:\t(\d+).*    Case L
                  (?i-s)^.+port:\h*(\d+).+    (?i-s)^.+port:\h+(\d+).+    (?i-s)^.+port:\x20(\d+).+    (?i-s)^.+port:\t(\d+).+    Case M
                  
                  (?i-s)^port:\h*(\d+).*      (?i-s)^port:\h+(\d+).*      (?i-s)^port:\x20(\d+).*      (?i-s)^port:\t(\d+).*      Case N
                  (?i-s)^port:\h*(\d+).+      (?i-s)^port:\h+(\d+).+      (?i-s)^port:\x20(\d+).+      (?i-s)^port:\t(\d+).+      Case O
                  
                  (?i-s)^.*port:\h*(\d+)$     (?i-s)^.*port:\h+(\d+)$     (?i-s)^.*port:\x20(\d+)$     (?i-s)^.*port:\t(\d+)$     Case P
                  (?i-s)^.+port:\h*(\d+)$     (?i-s)^.+port:\h+(\d+)$     (?i-s)^.+port:\x20(\d+)$     (?i-s)^.+port:\t(\d+)$     Case Q
                  
                  (?i-s)^port:\h*(\d+)$       (?i-s)^port:\h+(\d+)$       (?i-s)^port:\x20(\d+)$       (?i-s)^port:\t(\d+)$       Case R
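The table is a cross product of four choices: what may precede the string, what may follow it, what sits between the colon and the integer, and the case sensitivity. A minimal Python sketch can generate the same 72 search strings from those choices. The helper name `build_search` and the option labels are made up for illustration; the output is Boost syntax meant to be pasted into the Notepad++ Find dialog, not run by Python's own `re` module.

```python
from itertools import product

# What may come before / after 'port:...###' on the line.
BEFORE = {"any": ".*", "some": ".+", "none": ""}
AFTER = {"any": ".*", "some": ".+", "none": "$"}
# What separates 'port:' from the integer (Cases 1 to 4).
GAP = {"blank*": r"\h*", "blank+": r"\h+", "space": r"\x20", "tab": r"\t"}


def build_search(before: str, after: str, gap: str, ignore_case: bool = False) -> str:
    """Return one Boost search regex for the chosen options (hypothetical helper)."""
    flags = "(?i-s)" if ignore_case else "(?-is)"
    return f"{flags}^{BEFORE[before]}port:{GAP[gap]}(\\d+){AFTER[after]}"


# Every combination: 2 case modes x 3 x 3 positions x 4 gaps = 72 regexes.
all_patterns = [
    build_search(b, a, g, i)
    for i, b, a, g in product((False, True), BEFORE, AFTER, GAP)
]

print(len(all_patterns))
print(build_search("some", "any", "blank*"))           # Case C, Case 1
print(build_search("none", "none", "tab", True))       # Case R, Case 4
```

For instance, "some" before and "any" after corresponds to Case C, and "none" on both sides corresponds to Case I or Case R depending on the case mode.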
                  

                  Best Regards,

                  guy038

                  • A
                    Alan Kilborn @guy038
                    last edited by Alan Kilborn Oct 19, 2021, 11:41 AM Oct 19, 2021, 11:41 AM

                    @guy038 said in Find a specific sequence on every line:

                    right regex solution highly depends on OP’s needs ! So I’ll try to be a bit more exhaustive

                    I think you are going to exhaust yourself if you solve every possible problem that someone could be asking for. :-)

                    Maybe our new moderator could come up with a set of rules for asking data manipulation questions. If you don’t ask your question correctly, you just get redirected back to the instructions, until you meet the criteria for an adequately stated problem. Hmm, maybe that would exhaust him as well. :-)

                    • N
                      Neil Schipper @guy038
                      last edited by Oct 19, 2021, 7:34 PM

                      @guy038 Many of your posts are amazing, and they seem to be getting more amazing.

                      I’m not sure if you (or anyone) has done something like this before – ie, coding an entire family of regex search expressions for all conceivable variations on a given requirement – but it’s intriguing to imagine an engine behind a friendly interface asking users to specify key aspects of their requirements in natural language, such that the engine would then generate a list like you’ve done here.

                      It could even be built into Np++ or an add-on. That would make a lot of requests here redundant. And some folk would have to find a new hobby.

                      • N
                        Neil Schipper @Alan Kilborn
                        last edited by Oct 19, 2021, 7:43 PM

                        @Alan-Kilborn said in Find a specific sequence on every line:

                        I think you are going to exhaust yourself if you solve every possible problem that someone could be asking for. :-)

                        I worry that if someone inadvertently posted nothing more than “I need a regex”, @guy038, cool and composed as ever, would charge forth, and how the terabytes would fly!

                        • A
                          Alan Kilborn @Neil Schipper
                          last edited by Oct 19, 2021, 8:21 PM

                          @Neil-Schipper said in Find a specific sequence on every line:

                          it’s intriguing to imagine an engine behind a friendly interface asking users to specify key aspects of their requirements in natural language

                          That would be, well, pure “magic”:

                          [image: 444080f6-9b1c-4dc6-bad7-357552cc007e-image.png]

                          The Community of users of the Notepad++ text editor.