How to remove paragraphs with a specific pattern?

  • H
    Harl Xu
    last edited by Apr 28, 2020, 6:21 PM

    I work with hundreds of txt files that are formatted as follows:

    LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
    Room ‘xxx’ Seat #2 is occupied
    Seat 1: Mr.Hotseat
    Seat 2: könönen84
    *** NOTES ***
    seated at, amount ($$)

    LOG #7: 2020/04/15 0:48:55 CUST [2020/04/15 13:48:55 ET]
    Room ‘xxx’ Seat #2 is occupied
    Seat 1: Mr.Hotseat
    Seat 2: könönen84
    Seat 3: -
    Seat 4: -
    *** NOTES ***
    seated at, amount ($$)***

    I wish to delete entire paragraphs in which the word ‘Seat’ occurs 3 times or fewer (in this case, the 1st paragraph).
    Can someone please provide some suggestions and thoughts on this?

    Thank you very much.

    • A
      astrosofista @Harl Xu
      last edited by Apr 28, 2020, 8:34 PM

      Hi @Harl-Xu, All

      Try this:

      Place the caret at the beginning of the file. Then open the Replace dialog (Control + H) and copy the following line into the Find what: field:

      (?-s)^LOG #.*\R^Room.*\R(?:^Seat \d+:.*\R){1,2}^\*.*\R^seated.*\R\R?

      Leave the Replace with: field empty.

      Select the Regular expression search mode, and click on the Replace All button.

      The regex will delete all paragraphs not containing the Seat 3: string.

      Hope this helps.

      • A
        Alan Kilborn @astrosofista
        last edited by Apr 28, 2020, 9:03 PM

        @astrosofista

        It’s not exactly to the OP’s spec, but it may be fulfilling the OP’s need! We will see. :-)

        • A
          astrosofista @Alan Kilborn
          last edited by Apr 28, 2020, 9:15 PM

          @Alan-Kilborn

          I guess we are reading again a message that is ambiguous in a different way. I count three times the term Seat in the paragraph to be deleted, but OP may have meant that the three seats should be at the beginning of a line.

          It doesn’t matter much anyway, since the regex is very easy to adapt to how many times Seat should appear.

          Let’s see :)

          • H
            Harl Xu @astrosofista
            last edited by Harl Xu Apr 29, 2020, 5:18 AM Apr 29, 2020, 5:18 AM

            @astrosofista

            Hello Sir,

            Thank you for the help. I’m sorry if you found my post ambiguous. I’m trying hard to compose my post in English.

            The only thing that is constant between the paragraphs I’m working on is that they always start with ‘LOG #’. And there is always a blank line to separate the paragraphs.

            The wording or number of lines in a paragraph will vary, hence the code doesn’t work with other paragraphs. ‘Seat’ could be placed anywhere.

            All I want is to select from ‘LOG #’ until the blank line, count the word ‘Seat’, then delete the entire selection if it matches my criteria.

            Thank you.
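
The procedure described above (take each block from ‘LOG #’ to the next blank line, count ‘Seat’, delete the block if the count is low enough) can also be sketched outside the editor. This is only a minimal Python sketch under stated assumptions: blocks are separated by blank lines, ‘Seat’ is counted case-sensitively anywhere in the block, and the threshold is 3. The function name `drop_sparse_blocks` and the `max_seats` parameter are illustrative and do not come from the thread; the sample text is abbreviated and uses plain ASCII quotes.

```python
import re

def drop_sparse_blocks(text: str, max_seats: int = 3) -> str:
    """Remove blank-line-separated blocks that start with 'LOG #' and
    contain 'Seat' at most max_seats times (case-sensitive, anywhere)."""
    # Split the text wherever one or more blank lines occur.
    blocks = re.split(r"\n\s*\n", text.strip())
    kept = [
        block for block in blocks
        if not (block.startswith("LOG #") and block.count("Seat") <= max_seats)
    ]
    return "\n\n".join(kept) + "\n"

sample = (
    "LOG #2: 2020/04/14 0:48:55 CUST\n"
    "Room 'xxx' Seat #2 is occupied\n"
    "Seat 1: Mr.Hotseat\n"
    "Seat 2: könönen84\n"
    "*** NOTES ***\n"
    "\n"
    "LOG #7: 2020/04/15 0:48:55 CUST\n"
    "Room 'xxx' Seat #2 is occupied\n"
    "Seat 1: Mr.Hotseat\n"
    "Seat 2: könönen84\n"
    "Seat 3: -\n"
    "Seat 4: -\n"
    "*** NOTES ***\n"
)

result = drop_sparse_blocks(sample)
print(result)  # only the LOG #7 block remains
```

Under this counting rule the first block holds three occurrences (including ‘Seat #2’ in the Room line) and is dropped, while the second holds five and is kept. Whether the Room line should count at all is exactly the open question in this thread, so the rule may need adjusting.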

            • A
              astrosofista @Harl Xu
              last edited by Apr 29, 2020, 4:39 PM

              Hi @Harl-Xu

              Don’t worry about language issues, as English isn’t my first language either. When in trouble, try a translation service such as DeepL.com, if it is available for your language.

              Your message is ambiguous in a crucial sense, because we aren’t sure how to count the Seat instances. Let me show you what I mean, say:

              LOG #7: 2020/04/15 0:48:55 CUST [2020/04/15 13:48:55 ET]
              Room ‘xxx’ Seat #2 is occupied
              Seat 1: Mr.Hotseat
              Seat 2: könönen84
              Seat 3: -
              *** NOTES ***
              seated at, amount ($$)***
              

              If I take into account Seat #2 (mentioned in line 2), then the paragraph includes 4 instances of the word Seat, so, applying the provided rule, the paragraph LOG #7 should not be deleted. However, if Seat #2 should not be counted, then LOG #7 includes only 3 instances of the word Seat and by the rule it should be deleted. See our problem?
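
To make the two readings concrete, here is a small Python sketch that counts ‘Seat’ in the LOG #7 paragraph both ways. It is only an illustration of the ambiguity, not part of any proposed solution; the variable names are made up for this example and the sample uses plain ASCII quotes.

```python
import re

paragraph = """\
LOG #7: 2020/04/15 0:48:55 CUST [2020/04/15 13:48:55 ET]
Room 'xxx' Seat #2 is occupied
Seat 1: Mr.Hotseat
Seat 2: könönen84
Seat 3: -
*** NOTES ***
seated at, amount ($$)
"""

# Reading 1: every case-sensitive occurrence of "Seat", wherever it appears.
count_anywhere = len(re.findall(r"Seat", paragraph))

# Reading 2: only "Seat " at the beginning of a line.
count_line_start = len(re.findall(r"^Seat ", paragraph, flags=re.MULTILINE))

print(count_anywhere, count_line_start)  # 4 3
```

With a limit of 3, the first reading keeps the paragraph and the second deletes it. Neither ‘Hotseat’ nor ‘seated’ is counted, because the search is case-sensitive.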

              So, in order to better help you, I (we) need to know exactly how to count those instances. Also, please provide at least 3 examples of paragraphs that match the posted regex and 3 examples that fail to match. The examples are necessary to try to catch some regularity in them, which in turn will make a regex approach possible.

              Best Regards.

              • G
                guy038
                last edited by guy038 Apr 29, 2020, 5:50 PM Apr 29, 2020, 5:41 PM

                Hello, @harl-xu, @Astrosofista, @alan-kilborn and All,

                @harl-xu, one of @astrosofista’s statements is fundamental. He said:

                Also, please provide at least 3 examples of paragraphs that match the posted regex and 3 examples that fail to match. The examples are necessary to try to catch some regularity in them, which in turn will make a regex approach possible.

                This statement could be simplified as:

                A fair number of examples of WHAT must be caught and WHAT must be ignored, to find some regularity in these two sets of examples! This approach helps us build the right regular expression, adapted to your particular case!

                Now, I was waiting for @astrosofista’s reply before proposing my own solution.


                I tried to guess your needs and supposed that you want to count the Seat words only if they begin a line and are followed by a space character.

                • If we also assume that all the lines Seat <number>:, in a LOG # section, are consecutive, here is my first version :

                  • SEARCH (?s-i)^LOG\x20#((?:(?!^Seat\x20).)+?)(?-s:^Seat.+\R){0,3}?(?1)\R{2,}

                  • REPLACE Leave EMPTY

                • Later, I found a second, improved version which allows the lines Seat <Number>: to be located anywhere in a section, after the line LOG #.......

                  • SEARCH (?s-i)^LOG\x20#(?:((?:(?!^Seat\x20).)+?)^Seat\x20){0,3}?(?1)\R{2,}

                  • REPLACE Leave EMPTY

                Notes :

                • This 2nd version still counts ONLY the lines which begin with Seat <number>:

                • You may modify the number of required lines by changing the lazy quantifier {0,3}?. Note that this regex S/R will also delete any section without any line beginning with Seat <Number>, with that exact case. If that is not desired, change the quantifier to {1,3}?

                • Moreover, any LOG # section can be separated from another section by any positive number of truly empty lines!


                Here is an extended version of the second version, using the FREE-spacing regex mode, with some explanations in comments :

                (?xs-i)                  # Search in FREE-SPACING, SINGLE-LINE (dot matches line-breaks) and CASE-SENSITIVE modes
                ^LOG\x20\#               # String "LOG #", at BEGINNING of line
                (?:                      # START of the first NON-CAPTURING group
                (                        #   START of Group 1
                (?: (?!^Seat\x20) . )+?  #     SHORTEST NON-NULL range of ANY char, WITHOUT "Seat\x20" at BEGINNING of line
                )                        #   END of Group 1
                ^Seat\x20                #   followed by the STRING "Seat " at BEGINNING of line
                )                        # END of the first NON-CAPTURING group
                {0,3}?                   # repeated from 0 to 3 TIMES, as FEW times as possible (lazy)
                (?1)                     # followed, again, by ANOTHER group 1 ( a SUBROUTINE CALL to the group 1 REGEX )
                \R{2,}                   # ENDING with, at least, TWO CONSECUTIVE line-breaks
                

                Finally, following @astrosofista’s last post, if we consider that we must count any Seat <Number> string, whatever its location in a section, after the LOG # string, here is my third regex version:

                • SEARCH (?s-i)^LOG\x20#(?:((?:(?!Seat\x20).)+?)Seat\x20){0,3}?(?1)\R{2,}

                • REPLACE Leave EMPTY

                Best Regards,

                guy038

                • G
                  guy038
                  last edited by guy038 Apr 29, 2020, 9:00 PM Apr 29, 2020, 8:31 PM

                  Hi, @harl-xu, @Astrosofista, @alan-kilborn and All,

                  To simplify and understand the general architecture, we can decompose, for instance, the second version of the search regex, according to this schema :

                            ANY char                  ANY char                  ANY char                  ANY char          
                               V                         V                         V                         V              
                  LOG #.................^Seat\x20.................^Seat\x20.................^Seat\x20.................\R{2,}
                       \_______________/         \_______________/         \_______________/         \_______________/      
                               v                         v                         v                         v              
                            Group 1                    Group 1                  Group 1                (?1) = Group 1       
                       \________________________/\________________________/\________________________/\_______________/      
                                   v                         v                         v                                    
                          NON-capturing group       NON-capturing group       NON-capturing group                           
                       ______________________________________________________________________________                       
                                         REPEATED a MINIMUM, from ZERO to THREE times                                       
                  
                  Note : NO Group 1 match contains the string "Seat " at the BEGINNING of a line, due to the negative LOOK-AHEAD structure (?!^Seat\x20)
                  

                  Hope you like it !

                  Cheers,

                  guy038

                  • H
                    Harl Xu @guy038
                    last edited by Harl Xu Apr 30, 2020, 7:51 AM Apr 30, 2020, 7:49 AM

                    Hi @guy038, @astrosofista, All

                    I want to match 'Seat ', wherever it is positioned. So I go with solution #3. But upon testing, solution #2 seems to produce the same hits as solution #3. But at least I can continue with my project now…

                    @astrosofista, the words ROOM and seated in my explanation are irrelevant, because they might not be there. That’s my bad, sorry.

                    You all are my saviors. Thank you so much.

                    • G
                      guy038
                      last edited by guy038 Apr 30, 2020, 11:13 AM Apr 30, 2020, 10:50 AM

                      Hi, @harl-xu, @astrosofista, @alan-kilborn, @ekopalypse, @michael-vincent and All,

                      @harl-xu, there is, indeed, a difference between solutions 2 and 3, below :

                      • Regex 2 : (?s-i)^LOG\x20#(?:((?:(?!^Seat\x20).)+?)^Seat\x20){0,3}?(?1)\R{2,}

                      • Regex 3 : (?s-i)^LOG\x20#(?:((?:(?!Seat\x20).)+?)Seat\x20){0,3}?(?1)\R{2,}


                      For instance, against this short example, below, which contains four LOG # sections :

                      LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
                      Room ‘xxx’ Seat #2 is occupied
                      *** NOTES ***
                      seated at, amount ($$)
                      
                      LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
                      Room ‘xxx’ Seat #2 is occupied
                      Seat 1: Mr.Hotseat
                      *** NOTES ***
                      seated at, amount ($$)
                      
                      LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
                      Room ‘xxx’ Seat #2 is occupied
                      Seat 1: Mr.Hotseat
                      Seat 2: könönen84
                      *** NOTES ***
                      seated at, amount ($$)
                      
                      LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
                      Room ‘xxx’ Seat #2 is occupied
                      Seat 1: Mr.Hotseat
                      Seat 2: könönen84
                      Seat 3: Blah blah
                      *** NOTES ***
                      seated at, amount ($$)
                      

                      Regex 2 matches all 4 sections, whereas regex 3 does not match the last section! Why?

                      • Regex 2 looks for no more than 3 strings "Seat " beginning a line, within a LOG # section

                      • Regex 3 looks for no more than 3 strings "Seat " anywhere in a line, within a LOG # section

                      So, because the line Room ‘xxx’ Seat #2 is occupied, present in all sections, contains the string Seat #2, the last LOG # section has, in total, FOUR strings "Seat ". Thus, regex 3 cannot match the last LOG # section. Elementary!
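
The 4-versus-3 result can be reproduced outside Notepad++ with a hedged Python port. Python's standard `re` module has neither the `(?1)` subroutine call nor `\R`, so this sketch writes the Group 1 pattern out twice and substitutes an explicit line-break alternation. These substitutions are assumptions of the port, not the original Boost regexes, and the helper name `build` is made up for illustration. The sample sections are abbreviated and use plain ASCII quotes.

```python
import re

NEWLINE = r"(?:\r\n|\n|\r)"  # stand-in for \R, which Python's re lacks

def build(seat: str, max_count: int = 3) -> re.Pattern:
    # chunk plays the role of Group 1; it is written out twice because
    # Python's re has no (?1) subroutine call.
    chunk = rf"(?:(?!{seat}).)+?"
    return re.compile(
        rf"^LOG\x20#(?:{chunk}{seat}){{0,{max_count}}}?{chunk}{NEWLINE}{{2,}}",
        re.MULTILINE | re.DOTALL,
    )

regex2 = build(r"^Seat\x20")  # "Seat " only at the beginning of a line
regex3 = build(r"Seat\x20")   # "Seat " anywhere in a line

HEAD = "LOG #2: 2020/04/14 0:48:55 CUST\nRoom 'xxx' Seat #2 is occupied\n"
TAIL = "*** NOTES ***\nseated at, amount ($$)\n\n"
seat_lines = ["Seat 1: Mr.Hotseat\n", "Seat 2: könönen84\n", "Seat 3: Blah blah\n"]

# Four sections holding 0, 1, 2 and 3 "Seat <n>:" lines, as in the example above.
text = "".join(HEAD + "".join(seat_lines[:n]) + TAIL for n in range(4))

print(len(regex2.findall(text)), len(regex3.findall(text)))  # 4 3

# Deleting the regex 3 matches leaves only the last section.
remaining = regex3.sub("", text)
```

Note that every section, including the last one, must be followed by an empty line for the trailing two-line-break requirement to be met.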

                      Best Regards,

                      guy038

                      P.S. :

                      With regexes 2 or 3, a LOG # section will be considered as having 3 "Seat " strings even if the lines "Seat " are not consecutive

                      Regex 1, below, is more restrictive because the strings "Seat " must begin a line and, in addition, all these lines must be consecutive!

                      • Regex 1 : (?s-i)^LOG\x20#((?:(?!^Seat\x20).)+?)(?-s:^Seat.+\R){0,3}?(?1)\R{2,}

                      For instance, regex 1 would only match the second LOG # section, below :

                      LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
                      Seat 1: Mr.Hotseat
                      *** NOTES ***
                      Seat 2: könönen84
                      seated at, amount ($$)
                      Seat 3: Blah blah
                      
                      LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
                      Seat 1: Mr.Hotseat
                      Seat 2: könönen84
                      Seat 3: Blah blah
                      *** NOTES ***
                      seated at, amount ($$)
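
The extra condition imposed by regex 1 (all lines starting with "Seat " must sit next to each other) can be checked on its own with a few lines of Python. This sketch is not a port of regex 1 and ignores the limit of 3; it only tests the consecutiveness condition. The function name `seat_lines_are_consecutive` is hypothetical and the sample lines are abbreviated.

```python
def seat_lines_are_consecutive(section: str) -> bool:
    """True if all lines starting with 'Seat ' form one unbroken run."""
    flags = [line.startswith("Seat ") for line in section.splitlines()]
    if True not in flags:
        return True  # no Seat lines at all, so nothing is interrupted
    first = flags.index(True)
    last = len(flags) - 1 - flags[::-1].index(True)
    # Every line between the first and last Seat line must also be a Seat line.
    return all(flags[first:last + 1])

scattered = (
    "LOG #2: 2020/04/14 0:48:55 CUST\n"
    "Seat 1: Mr.Hotseat\n"
    "*** NOTES ***\n"
    "Seat 2: könönen84\n"
    "seated at, amount ($$)\n"
    "Seat 3: Blah blah\n"
)
grouped = (
    "LOG #2: 2020/04/14 0:48:55 CUST\n"
    "Seat 1: Mr.Hotseat\n"
    "Seat 2: könönen84\n"
    "Seat 3: Blah blah\n"
    "*** NOTES ***\n"
    "seated at, amount ($$)\n"
)

print(seat_lines_are_consecutive(scattered), seat_lines_are_consecutive(grouped))  # False True
```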
                      
                      • H
                        Harl Xu @guy038
                        last edited by Harl Xu Apr 30, 2020, 11:44 AM Apr 30, 2020, 11:43 AM

                        Hi @guy038 , and All

                        Thank you for taking the extra effort to explain the differences. Those schematics and details… You are so cool… :)
                        Then, like I said in my post above, regex 3 is what I need.

                        Best Regards,

                        Harl

                        The Community of users of the Notepad++ text editor.