Removing all lines that don't contain multiple texts in multiple files



  • Hi, I need help with this. I have over 500 text files that each contain a lot of lines, and I do not want to use the bookmark method because it means doing one file at a time.

    Example:

    #01128: 3N
    #01175: 003P000000000000
    #011A8: 3O
    #01224: 45000047000000000000000000000000
    #01228: 45000047000000004B00004D00000000
    #0142D: 0000005200000000
    #0142C: 0000005300000000
    #017AA: 45000047000000004B00004D00000000

    Let’s say I want to keep lines containing 28:, A8:, 2C:, 2D: (which are lines 1,3,5,6,7) and delete the rest of the lines that don’t contain these 4 specific texts. Is there a solution to this without doing one file at a time? Thank you.



  • @APPROVED-DTX said in Removing all lines that don't contain multiple texts in multiple files:

    Is there a solution to this without doing one file at a time? Thank you.

    Yes there is. You can use the “Find in Files” function. The regex to use would be:
    Find What: (?-s)^(?!.*?(28:|A8:|2C:|2D:)).+\R*
    Replace With: leave this field empty.
    Because this is a regex, set the search mode to “Regular expression”. You may want to open one file first and test, so you can be happy with the result, then complete the rest of the files using “Find in Files”. I will leave you to decide on the directory, “In all sub-folders” and filter options.
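
    For readers who want to try the same filter outside Notepad++, here is a minimal Python sketch of the idea. It is an approximation under stated assumptions, not the Notepad++ behaviour itself: Python's `re` module has no `\R` and no global `(?-s)`, so the line ending is written as `\r?\n`, and the helper names `drop_unwanted_lines` and `process_folder` are hypothetical.

```python
import re
from pathlib import Path

# Python's "." already excludes newlines, so (?-s) is not needed here.
# (?:\r?\n)* stands in for \R* from the Notepad++ (Boost) regex.
DROP_LINE = re.compile(r"^(?!.*?(?:28:|A8:|2C:|2D:)).+(?:\r?\n)*", re.MULTILINE)


def drop_unwanted_lines(text: str) -> str:
    """Remove every non-empty line that contains none of the four markers."""
    return DROP_LINE.sub("", text)


def process_folder(folder: str, file_glob: str = "*.txt") -> None:
    """Rewrite each matching file in place (hypothetical helper; back up first)."""
    for path in Path(folder).glob(file_glob):
        original = path.read_text(encoding="utf-8")
        path.write_text(drop_unwanted_lines(original), encoding="utf-8")


# The example lines from the question, with a final line ending.
SAMPLE = "\n".join([
    "#01128: 3N",
    "#01175: 003P000000000000",
    "#011A8: 3O",
    "#01224: 45000047000000000000000000000000",
    "#01228: 45000047000000004B00004D00000000",
    "#0142D: 0000005200000000",
    "#0142C: 0000005300000000",
    "#017AA: 45000047000000004B00004D00000000",
]) + "\n"

if __name__ == "__main__":
    print(drop_unwanted_lines(SAMPLE), end="")
```

    The folder name, the `*.txt` filter and the UTF-8 encoding are all assumptions to adjust; testing on a copy of one file first is as sensible here as it is in Notepad++.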

    There is one possible side effect. If the last line is one of those removed, the file will end with a blank line. I’ve tried a couple of options, including looking for a “previous” line ending before the line being deleted, but so far I have not found a perfect one.

    Try it and please do come back to us with your findings.

    Terry



  • Oh thank you so much! It works just how I wanted. Yes, there is still a blank line at the end but it doesn’t affect what I need to work on. Thanks once again!



  • Hi, @approved-dtx, @terry-r and All,

    @terry-r

    This easier syntax (?-s)^(?!.*?(28|A8|2C|2D):).+\R* could be used, too

    And, as there is only one colon, per line, the lazy quantifier *? is not needed, anyway !

    So, the regex (?-s)^(?!.*(28|A8|2C|2D):).+\R is a good candidate !
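
    As a rough cross-check, the three variants discussed so far can be compared in Python. This is only a sketch under assumptions: `\R` is approximated by `\r?\n`, the groups are written as non-capturing, and the dictionary keys are made-up labels. It also shows that the last variant, which requires a line ending, leaves a final line alone when the file has no closing EOL.

```python
import re

NL = r"(?:\r?\n)"  # stand-in for \R, which Python's re module lacks

VARIANTS = {
    "original": r"^(?!.*?(?:28:|A8:|2C:|2D:)).+" + NL + "*",
    "shared_colon": r"^(?!.*?(?:28|A8|2C|2D):).+" + NL + "*",
    "greedy": r"^(?!.*(?:28|A8|2C|2D):).+" + NL,
}


def apply_variant(name: str, text: str) -> str:
    """Delete the lines matched by the named variant."""
    return re.sub(VARIANTS[name], "", text, flags=re.MULTILINE)


LINES = [
    "#01128: 3N",
    "#01175: 003P000000000000",
    "#011A8: 3O",
    "#01224: 45000047000000000000000000000000",
    "#01228: 45000047000000004B00004D00000000",
    "#0142D: 0000005200000000",
    "#0142C: 0000005300000000",
    "#017AA: 45000047000000004B00004D00000000",
]
WITH_FINAL_EOL = "\n".join(LINES) + "\n"
WITHOUT_FINAL_EOL = "\n".join(LINES)

if __name__ == "__main__":
    for name in VARIANTS:
        kept = apply_variant(name, WITHOUT_FINAL_EOL).splitlines()
        print(name, "keeps", len(kept), "lines when there is no final EOL")
```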

    BR

    guy038



  • @guy038 said in Removing all lines that don't contain multiple texts in multiple files:

    This easier syntax (?-s)^(?!.*?(28|A8|2C|2D):).+\R* could be used, too

    I did consider shortening it slightly as your first alternative shows but considered it a further complication for newcomers to regex to understand for little gain.

    Interestingly, your second alternative is something I wouldn’t actually do. I just confirmed my suspicions about it using the “regex101” testing website. For the examples the OP provided, mine (and your first alternative) took 718 steps to complete. Your second alternative, however, took 1058 steps (I included a * at the end, otherwise the last line wasn’t selected). I did note that the “time” to process was ~0ms in all 3 cases.

    So, disregarding the time to process, if we assume all steps are roughly equal, your 2nd regex could take approximately 50% longer (in terms of steps) to process a large file than either of the first 2. It all depends on whether there are more long lines or more short lines. Mine will “avoid” a line that DOES have the match (and therefore stays) in 7 character checks, whereas your greedy regex will take more character checks on a long line, and fewer on a short line. In the examples there are ONLY 2 lines with short strings where your greedy regex would take fewer steps, which is backed up by the regex101 numbers: 718 steps vs 1058 steps.

    Of course, where a line does NOT meet the match and is therefore flagged for deletion, both our regexes would take an equal number of steps to process that line, as they BOTH have to check EVERY character.

    Does my logic make sense?

    Terry



  • Hello, @terry-r and All,

    In my previous post, I stated that this regex (?-s)^(?!.*(28|A8|2C|2D):).+\R with a simple greedy quantifier * may be used, instead of the same regex (?-s)^(?!.*?(28|A8|2C|2D):).+\R with the lazy quantifier *?, because there is only one colon character in each line of OP’s text

    To simplify, let’s suppose that we want to study this similar regex (?-s)^(?!.*A8:).+\R. We don’t need to bother with the literal A8 string or with the EOL chars, either. Thus, let’s simply consider the two general regexes :

    (?-s)^(?!.*:).+    and    (?-s)^(?!.*?:).+


    When I tried to go to the regex101 site, I got this message :

    Unfortunately it seems your browser does not meet the criteria to properly render and utilize this website.

    Please upgrade your browser and come back

    Two months ago, I was able to access that site. So, my XP machine is definitely an outdated laptop :-((

    Thus, I decided to do tests with Notepad++ itself.

    I created two files, A and B, of 50,000 lines, each composed of 1000 digits. Then :

    • In file A, I inserted 1 colon character, at column 11, in every odd line

    • In file B, I inserted 1 colon character, at column 991, in every odd line

    So, files A and B contain, each, 25,000 lines with a colon and 25,000 lines without colon !
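
    Test files of this shape can be generated with a short script. The following is only a sketch of one possible way: the function names and output file names are hypothetical, and it assumes that “at column 11” means the colon becomes the 11th character of the line.

```python
def build_lines(n_lines: int, width: int, colon_column: int) -> list[str]:
    """Return n_lines rows of `width` digits.

    Odd rows (1st, 3rd, ...) get one colon inserted so that it becomes
    character number `colon_column` (1-based); even rows have no colon.
    """
    digits = "".join(str(i % 10) for i in range(width))
    with_colon = digits[:colon_column - 1] + ":" + digits[colon_column - 1:]
    return [with_colon if n % 2 == 1 else digits for n in range(1, n_lines + 1)]


def write_test_file(path: str, colon_column: int,
                    n_lines: int = 50_000, width: int = 1000) -> None:
    """Write one test file; each line ends with the platform's default EOL."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in build_lines(n_lines, width, colon_column):
            handle.write(row + "\n")


# Example calls (file names are placeholders; each file is roughly 50 MB):
#   write_test_file("file_A.txt", colon_column=11)
#   write_test_file("file_B.txt", colon_column=991)
```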


    After numerous repeated tests, here are the results, after a click on the Replace All button, on my old Win XP SP3 laptop :

       •----------•--------------------•---------------•--------•
       |   File   |   SEARCH regex     |  REPLACEMENT  |  TIME  |   
       •----------•--------------------•---------------•--------•
       |  File A  |  (?-s)^(?!.*:).+   |     EMPTY     |   8s   |
       |          |                    |               |        |
       |  File A  |  (?-s)^(?!.*?:).+  |     EMPTY     |  5.5s  |
       |          |                    |               |        |
       |  File B  |  (?-s)^(?!.*:).+   |     EMPTY     |   7s   |
       |          |                    |               |        |
       |  File B  |  (?-s)^(?!.*?:).+  |     EMPTY     |   7s   |
       •----------•--------------------•---------------•--------•
    

    And, indeed, we can say that the use of the lazy quantifier *? seems to produce, globally, a quicker execution time than the greedy quantifier * !

    Note the irony of such a sentence: we need a lazy quantifier for faster execution ;-))
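
    A comparable experiment can be run outside Notepad++. The sketch below uses Python's `re` engine, not the Boost engine behind the table above, so its timings are not comparable with those figures and may not even rank the two patterns the same way. The smaller line count and the helper names are assumptions chosen to keep the run short.

```python
import re
import time

# Python's "." excludes newlines by default, so (?-s) is dropped.
GREEDY = re.compile(r"^(?!.*:).+", re.MULTILINE)
LAZY = re.compile(r"^(?!.*?:).+", re.MULTILINE)


def make_text(colon_index: int, n_lines: int = 2000, width: int = 1000) -> str:
    """Alternate lines with one colon (at 0-based colon_index) and lines without."""
    plain = "0" * width
    marked = plain[:colon_index] + ":" + plain[colon_index:]
    return "\n".join(marked if i % 2 == 0 else plain for i in range(n_lines))


def timed_replace(pattern: re.Pattern, text: str):
    """Empty every line matched by pattern; return the result and elapsed seconds."""
    start = time.perf_counter()
    result = pattern.sub("", text)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    for label, colon_index in (("A", 10), ("B", 990)):
        text = make_text(colon_index)
        for name, pattern in (("greedy", GREEDY), ("lazy", LAZY)):
            _, seconds = timed_replace(pattern, text)
            print(f"File {label}  {name:6s}  {seconds:.3f}s")
```

    Repeating each measurement several times, as was done for the table, helps smooth out noise from the machine.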


    Then I built a third file C, containing 50,000 lines, each composed of 1000 digits. And I inserted 1 colon character in column 11 of each line ( => 50,000 colon chars ) !

    I did tests, with N++ 7.9.2 ( the last release which supports Win XP !) against File C, using these two regex S/R :

    • I :

      • SEARCH (\d+):(\d+)

      • REPLACE \2:\1

    • II

      • SEARCH (\d+?):(\d+)

      • REPLACE \2:\1

    Note that running either S/R twice in succession moves the colon from column 11 to column 991 in each line, then moves it back from column 991 to column 11

    This time, after a first then a second click on the Replace All button, whatever the S/R used ( I or II ), the results are quite similar : about 22.5s in all cases !
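
    The column swap performed by S/R I and II can be checked on a single line in Python. This is an illustrative sketch only: the function name is made up, and the line uses two distinct digits so that the swap is visible.

```python
import re


def swap_around_colon(line: str, lazy: bool = False) -> str:
    """Swap the digit runs on each side of the colon (S/R I, or S/R II if lazy)."""
    pattern = r"(\d+?):(\d+)" if lazy else r"(\d+):(\d+)"
    return re.sub(pattern, r"\2:\1", line)


# 10 digits, a colon in column 11, then 990 digits, as in file C.
line = "1" * 10 + ":" + "2" * 990
once = swap_around_colon(line)    # colon now in column 991
twice = swap_around_colon(once)   # colon back in column 11

if __name__ == "__main__":
    print(line.index(":") + 1, once.index(":") + 1, twice.index(":") + 1)
```

    Both patterns produce the same match here, because the match has to start at the first digit of the line and end at its last digit either way.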


    Now, Terry, you’re perfectly right : In order to identify nice regex S/R, we should, globally, take execution time and number of steps into account, more often !

    Best Regards,

    guy038



  • @guy038 said in Removing all lines that don't contain multiple texts in multiple files:

    In order to identify nice regex S/R, we should, globally, take execution time and number of steps into account, more often !

    It is a good thing to do, in general, but for most posters here it won’t matter. They just want to transform some data, usually not a “huge” amount.

    So IMO if everyone starts analyzing each regex question’s answer for performance, it’s rather a waste of time. Of course, it’s YOUR time, so feel free.

    Now, if they’re going to put a regex into some sort of “production” use on some “massive” data set, where performance would be a critical factor, then they aren’t going to be using Notepad++ for it (and likely not the Boost engine, but could be) and it really gets off-topic fast.



  • @guy038 said in Removing all lines that don't contain multiple texts in multiple files:

    In order to identify nice regex S/R, we should, globally, take execution time and number of steps into account, more often !

    @guy038, I commend you for your in-depth study of the effect of lazy and greedy quantifiers on execution times. It’s good to know that it backed up my theory. I suppose I hadn’t, until now, spent much time considering the effect, but I had subconsciously been applying the theory anyway.

    When you say “Note the irony of such a sentence: we need a lazy quantifier for faster execution ;-))” I think that it is actually quite self explanatory.

    I suppose I’ve thought that a greedy quantifier must still process the line 1 character at a time until the end of the line is reached (unless there is some magical way it can instantly grab ALL characters on a line), just doing an end of line test (I’m only talking about the (?-s) option). Then it must process 1 character at a time in reverse for the actual regex match it is looking for. The lazy quantifier JUST processes 1 character at a time, proceeding across the line until the match is found.

    It’s possible (as I stated earlier) that the (.*) might be an extremely fast process, but even if the match string is evenly positioned across a line there (to my mind) would still be some processing overhead for a greedy quantifier with respect to a lazy one. At this point I’m actually being contrary to a previous statement I made:
    “As mine will “avoid” a line that DOES have the match (and therefore stays) in 7 character checks, your greedy regex will take more character checks on a long line, and less on a short line.”
    I now think that a greedy quantifier will ALWAYS make more tests (steps) in processing the same line, regardless of whether there is a match or not.
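
    The mental model described above can be written down as a toy step counter. This sketch only mirrors that model (consume to the end of the line, then back up, versus walk forward until the match); it is not how Boost or regex101 count steps, and real engines may optimise either case. guy038's File B timings, where both patterns took the same time, suggest that actual behaviour does not follow the model exactly. The function names are hypothetical.

```python
def lazy_checks(line: str, token: str) -> int:
    """Positions tried by a toy '.*?token' scan, moving left to right."""
    for start in range(len(line) + 1):
        if line.startswith(token, start):
            return start + 1
    return len(line) + 1  # no match: every position was tried


def greedy_checks(line: str, token: str) -> int:
    """Steps of a toy '.*token' scan: consume the whole line, then back up."""
    consumed = len(line)  # one step per character swallowed by .*
    for tried, start in enumerate(range(len(line), -1, -1), start=1):
        if line.startswith(token, start):
            return consumed + tried
    return consumed + len(line) + 1  # no match: backed up all the way


if __name__ == "__main__":
    short_line = "#01128: 3N"
    long_line = "0" * 990 + ":" + "0" * 9
    print(lazy_checks(short_line, "28:"), greedy_checks(short_line, "28:"))
    print(lazy_checks(long_line, ":"), greedy_checks(long_line, ":"))
```

    In this toy model the greedy count is always the larger of the two, which matches the revised statement above rather than the earlier one.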

    And @Alan-Kilborn, indeed you are right to some degree: all this doesn’t matter for most OPs. But I only have to turn to the many topics on which seasoned members have contributed ever tighter code as alternative answers to OP questions. One might show an answer, then another picks that regex apart, sees what makes it tick, realises there are some efficiencies to be made and presents an updated version. Now, with @guy038’s tests, we can see yet another tool to add to our toolbox: the consideration that a lazy quantifier isn’t so lazy after all.

    Cheers
    Terry

    PS: sorry to hear, @guy038, about your problems with XP not being able to use the regex101 site. I will have to say, though, that the old adage “flogging a dead horse” comes to mind.
    https://en.wikipedia.org/wiki/Flogging_a_dead_horse

