
    Removing all lines that don't contain multiple texts in multiple files

    Help wanted · 8 Posts · 4 Posters · 372 Views
    • APPROVED DTX
      last edited by

      Hi, I need help with this. I have over 500 text files that contain many lines each, and I do not wish to use the bookmark method, as that would mean doing one file at a time.

      Example:

      #01128: 3N
      #01175: 003P000000000000
      #011A8: 3O
      #01224: 45000047000000000000000000000000
      #01228: 45000047000000004B00004D00000000
      #0142D: 0000005200000000
      #0142C: 0000005300000000
      #017AA: 45000047000000004B00004D00000000

      Let’s say I want to keep the lines containing 28:, A8:, 2C:, 2D: (which are lines 1, 3, 5, 6 and 7) and delete the rest of the lines, which don’t contain any of these 4 specific strings. Is there a solution to this without doing one file at a time? Thank you.

      • Terry R
        last edited by

        @APPROVED-DTX said in Removing all lines that don't contain multiple texts in multiple files:

        Is there a solution to this without doing one file at a time? Thank you.

        Yes there is. You can use the “Find in Files” function. The regex to use would be:
        Find What: (?-s)^(?!.*?(28:|A8:|2C:|2D:)).+\R*
        Replace With: (leave this field empty)
        Set the Search Mode to “Regular expression”. You may want to open one file first and test, just so you can be happy with the result, then complete the rest of the files using “Find in Files”. I will leave you to decide on the directory, the “In all sub-folders” option and the filter.

        There is one possible side effect. If the last line is one of those removed, then you will finish with a blank line as the last line. I’ve tried a couple of options, including looking for a “previous” line ending for the line being deleted, but so far no perfect option has been found.
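(An aside, not part of the original reply.) With over 500 files, the same clean-up could also be scripted outside Notepad++. Below is a rough Python sketch under my own assumptions: the `data` directory name and the `*.txt` filter are placeholders, and Python’s `re` engine differs from the Boost engine Notepad++ uses (it has no \R, so a plain \n is consumed instead):

```python
import re
from pathlib import Path

# Same negative-lookahead idea as the Find-in-Files regex: match (and delete)
# any line that does NOT contain one of the four markers.
PATTERN = re.compile(r'^(?!.*?(28:|A8:|2C:|2D:)).+\n?', re.MULTILINE)

def filter_file(path: Path) -> None:
    """Remove every line of `path` that lacks all four markers."""
    text = path.read_text()
    path.write_text(PATTERN.sub('', text))

# Hypothetical usage: apply to every .txt file in a 'data' directory.
data_dir = Path('data')
if data_dir.is_dir():
    for f in data_dir.glob('*.txt'):
        filter_file(f)
```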

        Try it and please do come back to us with your findings.

        Terry

        • APPROVED DTX
          last edited by

          Oh thank you so much! It works just how I wanted. Yes, there is still a blank line at the end but it doesn’t affect what I need to work on. Thanks once again!

          • guy038
            last edited by guy038

            Hi, @approved-dtx, @terry-r and All,

            @terry-r

            This easier syntax (?-s)^(?!.*?(28|A8|2C|2D):).+\R* could be used, too.

            And, as there is only one colon per line, the lazy quantifier *? is not needed anyway!

            So, the regex (?-s)^(?!.*(28|A8|2C|2D):).+\R is a good candidate!
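As a quick sanity check (mine, not from the post), both alternation styles flag the same lines of the OP’s sample for deletion. The check below uses Python’s re module, a different engine from Notepad++’s Boost regex, so the \R part is dropped and each line is tested on its own:

```python
import re

lines = [
    "#01128: 3N",
    "#01175: 003P000000000000",
    "#011A8: 3O",
    "#01224: 45000047000000000000000000000000",
    "#01228: 45000047000000004B00004D00000000",
    "#0142D: 0000005200000000",
    "#0142C: 0000005300000000",
    "#017AA: 45000047000000004B00004D00000000",
]

# Terry's alternation (colon repeated) vs. the shorter one (colon factored out).
original = re.compile(r'^(?!.*?(28:|A8:|2C:|2D:)).+')
factored = re.compile(r'^(?!.*(28|A8|2C|2D):).+')

doomed_a = [l for l in lines if original.match(l)]   # lines to delete
doomed_b = [l for l in lines if factored.match(l)]
assert doomed_a == doomed_b  # both regexes select the same lines
```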

            BR

            guy038

            • Terry R
              last edited by

              @guy038 said in Removing all lines that don't contain multiple texts in multiple files:

              This easier syntax (?-s)^(?!.*?(28|A8|2C|2D):).+\R* could be used, too

              I did consider shortening it slightly, as your first alternative shows, but considered it a further complication for newcomers to regex, for little gain.

              Interestingly, your second alternative is something I wouldn’t actually do. I just confirmed my suspicions about it using the “regex101” testing website. For the examples the OP provided, mine (and your first alternative) took 718 steps to complete. Your second alternative, however, took 1058 steps (I included a * at the end, as otherwise the last line wasn’t selected). I did note that the “time” to process was ~0 ms in all 3 cases.

              So I suppose, disregarding the time to process, if we assume all steps are roughly equal, your 2nd regex could take approximately 50% longer (in terms of steps) to process a large file than either of the first 2. It all depends on whether there are more longer lines or more shorter lines. As mine will “avoid” a line that DOES have the match (and therefore stays) in 7 character checks, your greedy regex will take more character checks on a long line, and less on a short line. In the examples there are ONLY 2 lines with short strings, where your greedy regex would take fewer steps. This is backed up by the regex101 numbers: 718 steps vs 1058 steps.

              Of course, where a line does NOT meet the match, and is therefore flagged for deletion, both our regexes take an equal number of steps to process that line, as they BOTH have to check EVERY character.

              Does my logic make sense?

              Terry

              • guy038
                last edited by guy038

                Hello, @terry-r and All,

                 In my previous post, I affirmed that the regex (?-s)^(?!.*(28|A8|2C|2D):).+\R, with the simple greedy quantifier *, may be used instead of the same regex (?-s)^(?!.*?(28|A8|2C|2D):).+\R, with the lazy quantifier *?, because there is only one colon character in each line of the OP’s text.

                 To simplify, let’s suppose that we want to study this similar regex (?-s)^(?!.*A8:).+\R. We certainly don’t have to bother about the literal A8 string, nor about the EOL chars. Thus, let us simply consider the two general regexes:

                (?-s)^(?!.*:).+    and    (?-s)^(?!.*?:).+


                 When I tried to go to the regex101 site, I got this message:

                 Unfortunately it seems your browser does not meet the criteria to properly render and utilize this website.

                 Please upgrade your browser and come back

                 Two months ago, I was still able to access that site. So, my XP machine is definitely an outdated laptop :-((

                Thus, I decided to do tests with Notepad++ itself.

                I created two files, A and B, of 50,000 lines, each composed of 1000 digits. Then :

                • In file A, I inserted 1 colon character, at column 11, in every odd line

                • In file B, I inserted 1 colon character, at column 991, in every odd line

                 So, files A and B each contain 25,000 lines with a colon and 25,000 lines without one!


                After numerous repeated tests, here are the results, after a click on the Replace All button, on my old Win XP SP3 laptop :

                   •----------•--------------------•---------------•--------•
                   |   File   |   SEARCH regex     |  REPLACEMENT  |  TIME  |   
                   •----------•--------------------•---------------•--------•
                   |  File A  |  (?-s)^(?!.*:).+   |     EMPTY     |   8s   |
                   |          |                    |               |        |
                   |  File A  |  (?-s)^(?!.*?:).+  |     EMPTY     |  5.5s  |
                   |          |                    |               |        |
                   |  File B  |  (?-s)^(?!.*:).+   |     EMPTY     |   7s   |
                   |          |                    |               |        |
                   |  File B  |  (?-s)^(?!.*?:).+  |     EMPTY     |   7s   |
                   •----------•--------------------•---------------•--------•
                

                 And, indeed, we can say that the use of the lazy quantifier *? seems to produce, globally, a quicker execution time than the greedy quantifier *!

                Note the irony of such a sentence: we need a lazy quantifier for faster execution ;-))


                Then I built a third file C, containing 50,000 lines, each composed of 1000 digits. And I inserted 1 colon character in column 11 of each line ( => 50,000 colon chars ) !

                 I ran tests with N++ 7.9.2 (the last release which supports Win XP!) against file C, using these two regex S/R:

                • I :

                  • SEARCH (\d+):(\d+)

                  • REPLACE \2:\1

                • II

                  • SEARCH (\d+?):(\d+)

                  • REPLACE \2:\1

                 Note that running these S/R successively moves the colon from column 11 to column 991 in each line, then moves it back from column 991 to column 11.

                 This time, after a first and then a second click on Replace All, whichever S/R was used (I or II), the results were quite similar: about 22.5 s in all cases!
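The colon-shuttling behaviour of these two S/R can be sketched with Python’s re module (a different engine from Boost, but the capture groups work the same way here; the line shape is mine):

```python
import re

# S/R I from the post: greedy digit runs on both sides of the colon,
# swapped by the replacement \2:\1.
swap = re.compile(r'(\d+):(\d+)')

line = '0123456789' + ':' + '9' * 990     # colon at column 11, as in file C
once = swap.sub(r'\2:\1', line)
assert once.index(':') == 990             # 0-based index 990 = column 991
twice = swap.sub(r'\2:\1', once)
assert twice == line                      # the second pass moves it back

# S/R II (lazy first group) produces the same result on these lines:
lazy_swap = re.compile(r'(\d+?):(\d+)')
assert lazy_swap.sub(r'\2:\1', line) == once
```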


                 Now, Terry, you’re perfectly right: in order to identify nice regex S/R, we should, globally, take execution time and number of steps into account more often!

                Best Regards,

                guy038

                • Alan Kilborn @guy038
                  last edited by

                  @guy038 said in Removing all lines that don't contain multiple texts in multiple files:

                   In order to identify nice regex S/R, we should, globally, take execution time and number of steps into account more often!

                   It is a good thing to do in general, but for most posters here it won’t matter. They just want to transform some data, usually not a “huge” amount.

                  So IMO if everyone starts analyzing each regex question’s answer for performance, it’s rather a waste of time. Of course, it’s YOUR time, so feel free.

                  Now, if they’re going to put a regex into some sort of “production” use on some “massive” data set, where performance would be a critical factor, then they aren’t going to be using Notepad++ for it (and likely not the Boost engine, but could be) and it really gets off-topic fast.

                  • Terry R @guy038
                    last edited by

                    @guy038 said in Removing all lines that don't contain multiple texts in multiple files:

                    In order to identify nice regex S/R, we should, globally, take execution time and number of steps into account more often!

                     @guy038, I commend you for your in-depth study of the effect of lazy and greedy quantifiers on execution times. It’s good to know that it backed up my theory. I suppose I hadn’t, until now, actually spent much time considering the effect, but I had subconsciously been applying the theory anyway.

                     When you say “Note the irony of such a sentence: we need a lazy quantifier for faster execution ;-))”, I think that it is actually quite self-explanatory.

                     I suppose I’ve thought that a greedy quantifier must still process the line 1 character at a time until the end of the line is reached (unless there is some magical way it can instantly grab ALL the characters on a line), just doing an end-of-line test (I’m only talking about the (?-s) option). Then it must process 1 character at a time in reverse for the actual regex match it is looking for. The lazy quantifier JUST processes 1 character at a time, proceeding across the line until the match is found.

                    It’s possible (as I stated earlier) that the (.*) might be an extremely fast process, but even if the match string is evenly positioned across a line there (to my mind) would still be some processing overhead for a greedy quantifier with respect to a lazy one. At this point I’m actually being contrary to a previous statement I made:
                    “As mine will “avoid” a line that DOES have the match (and therefore stays) in 7 character checks, your greedy regex will take more character checks on a long line, and less on a short line.”
                    I now think that a greedy quantifier will ALWAYS make more tests (steps) in processing the same line, regardless of whether there is a match or not.
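This reasoning can be spot-checked with a small micro-benchmark (my own construction) using Python’s re module. It is a different engine from Boost, so the numbers don’t transfer directly to Notepad++, but the early-colon, long-tail line is exactly the shape where the greedy lookahead has to run to end-of-line and backtrack:

```python
import re
import timeit

# An early colon followed by a long tail of digits: worst case for the
# greedy lookahead, best case for the lazy one.
line = '0123456789:' + '0' * 990

greedy = re.compile(r'^(?!.*:).+')   # runs to end-of-line, then backtracks
lazy = re.compile(r'^(?!.*?:).+')    # stops as soon as the colon is found

# Both correctly refuse the line: it contains a colon, so it "stays".
assert greedy.match(line) is None and lazy.match(line) is None

t_greedy = timeit.timeit(lambda: greedy.match(line), number=50_000)
t_lazy = timeit.timeit(lambda: lazy.match(line), number=50_000)
print(f'greedy: {t_greedy:.3f}s   lazy: {t_lazy:.3f}s')
```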

                     And @Alan-Kilborn, indeed you are right to some degree: all this doesn’t matter for most OPs. But I only have to turn to the many topics on which seasoned members have contributed ever tighter code as alternative answers to OP questions. One might show an answer; then another picks that regex apart, sees what makes it tick, realises there are some efficiencies to be made, and presents an updated version. Now, with @guy038’s tests, we have yet another tool to add to our toolbox: the consideration that a lazy quantifier isn’t so lazy after all.

                    Cheers
                    Terry

                     PS: and sorry, @guy038, for your problems with XP not being able to use the regex101 site. I will have to say, though, that the old adage “flogging a dead horse” comes to mind.
                    https://en.wikipedia.org/wiki/Flogging_a_dead_horse

                    The Community of users of the Notepad++ text editor.