    How to remove paragraphs with specific pattern ?

    • Harl XuH
      Harl Xu
      last edited by

      I work with hundreds of txt files that are formatted as follows:

      LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
      Room ‘xxx’ Seat #2 is occupied
      Seat 1: Mr.Hotseat
      Seat 2: könönen84
      *** NOTES ***
      seated at, amount ($$)

      LOG #7: 2020/04/15 0:48:55 CUST [2020/04/15 13:48:55 ET]
      Room ‘xxx’ Seat #2 is occupied
      Seat 1: Mr.Hotseat
      Seat 2: könönen84
      Seat 3: -
      Seat 4: -
      *** NOTES ***
      seated at, amount ($$)***

      I wish to delete entire paragraphs in which the word ‘Seat’ occurs 3 times or fewer (in this case, the 1st paragraph).
      Can someone please provide some suggestions and thoughts on this?

      Thank you very much.

      • astrosofistaA
        astrosofista @Harl Xu
        last edited by

        Hi @Harl-Xu, All

        Try this:

        Place the caret at the beginning of the file. Then open the Replace dialog (Ctrl + H) and copy the following line into the Find what: field:

        (?-s)^LOG #.*\R^Room.*\R(?:^Seat \d+:.*\R){1,2}^\*.*\R^seated.*\R\R?

        Leave the Replace with: field empty.

        Select the Regular expression search mode, and click on the Replace All button.

        The regex will delete all paragraphs not containing the Seat 3: string.
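
For anyone who wants to try the same idea outside Notepad++, here is a rough Python sketch of that search. It is an approximation, not an exact port: Python's `re` module has no `\R`, so the pattern assumes line endings are already normalized to `\n`. The function name `remove_short_paragraphs` and the shortened sample text are made up for this example.

```python
import re

# Approximate port of the Notepad++ search above; \R is replaced by \n.
# Without re.DOTALL, "." stops at line ends, which mirrors (?-s).
PATTERN = re.compile(
    r"^LOG #.*\n"               # header line
    r"^Room.*\n"                # room line
    r"(?:^Seat \d+:.*\n){1,2}"  # one or two "Seat N:" lines
    r"^\*.*\n"                  # "*** NOTES ***" line
    r"^seated.*\n"              # last line of the paragraph
    r"\n?",                     # optional blank separator line
    re.MULTILINE,
)


def remove_short_paragraphs(text: str) -> str:
    """Delete paragraphs that have this exact layout and at most two Seat lines."""
    return PATTERN.sub("", text)


sample = (
    "LOG #2: first\n"
    "Room 'xxx' Seat #2 is occupied\n"
    "Seat 1: Mr.Hotseat\n"
    "Seat 2: könönen84\n"
    "*** NOTES ***\n"
    "seated at, amount ($$)\n"
    "\n"
    "LOG #7: second\n"
    "Room 'xxx' Seat #2 is occupied\n"
    "Seat 1: Mr.Hotseat\n"
    "Seat 2: könönen84\n"
    "Seat 3: -\n"
    "Seat 4: -\n"
    "*** NOTES ***\n"
    "seated at, amount ($$)\n"
)

print(remove_short_paragraphs(sample))  # only the LOG #7 paragraph remains
```

Like the original regex, this only works when every paragraph follows the LOG / Room / Seat / NOTES / seated layout exactly.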

        Hope this helps.

        • Alan KilbornA
          Alan Kilborn @astrosofista
          last edited by

          @astrosofista

          It’s not exactly to the OP’s spec, but it may be fulfilling the OP’s need! We will see. :-)

          • astrosofistaA
            astrosofista @Alan Kilborn
            last edited by

            @Alan-Kilborn

            I guess we are, once again, reading an ambiguous message in different ways. I count three occurrences of the term Seat in the paragraph to be deleted, but the OP may have meant that the three seats should be at the beginning of a line.

            It doesn’t matter much anyway, since the regex is very easy to adapt to how many times Seat should appear.

            Let’s see :)

            • Harl XuH
              Harl Xu @astrosofista
              last edited by Harl Xu

              @astrosofista

              Hello Sir,

              Thank you for the help. I’m sorry if you found my post ambiguous. I’m trying hard to compose my post in English.

              The only thing that is constant across the paragraphs I’m working on is that they always start with ‘LOG #’. And there is always a blank line separating the paragraphs.

              The wording and the number of lines in a paragraph will vary, hence the code doesn’t work with other paragraphs. ‘Seat’ could be placed anywhere.

              All I want is to select from ‘LOG #’ to the next blank line, count the occurrences of the word ‘Seat’, then delete the entire selection if the count matches my criteria.
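
That procedure (take each block from ‘LOG #’ to the next blank line, count ‘Seat’, delete the block if the count is low enough) can also be sketched as a small Python script. This is only an illustration under a few assumptions: the file content is already in a string with `\n` line endings, blank lines are truly empty, and the counting is a plain case-sensitive substring count. The name `drop_sparse_sections` and its parameters are hypothetical.

```python
import re


def drop_sparse_sections(text: str, word: str = "Seat", max_count: int = 3) -> str:
    """Remove 'LOG #' paragraphs in which `word` occurs max_count times or fewer."""
    # Paragraphs are separated by one or more empty lines.
    paragraphs = re.split(r"\n{2,}", text.strip("\n"))
    kept = []
    for para in paragraphs:
        # str.count is case-sensitive, so "Hotseat" and "seated" are not counted.
        if para.startswith("LOG #") and para.count(word) <= max_count:
            continue  # criteria met: drop the whole paragraph
        kept.append(para)
    return "\n\n".join(kept) + ("\n" if kept else "")


sample = (
    "LOG #2: first\n"
    "Room 'xxx' Seat #2 is occupied\n"
    "Seat 1: Mr.Hotseat\n"
    "Seat 2: könönen84\n"
    "*** NOTES ***\n"
    "seated at, amount ($$)\n"
    "\n"
    "LOG #7: second\n"
    "Room 'xxx' Seat #2 is occupied\n"
    "Seat 1: Mr.Hotseat\n"
    "Seat 2: könönen84\n"
    "Seat 3: -\n"
    "Seat 4: -\n"
    "*** NOTES ***\n"
    "seated at, amount ($$)\n"
)

print(drop_sparse_sections(sample))  # only the LOG #7 paragraph remains
```

Here the first paragraph contains ‘Seat’ three times (including ‘Seat #2’ on the Room line), so it is removed; the second contains it five times and is kept.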

              Thank you.

              • astrosofistaA
                astrosofista @Harl Xu
                last edited by

                Hi @Harl-Xu

                Don’t worry about language issues, as English isn’t my first language either. When in trouble, try a translation service such as DeepL.com, if it is available for your language.

                Your message is ambiguous in a crucial sense, because we aren’t sure how to count the Seat instances. Let me show you what I mean, say:

                LOG #7: 2020/04/15 0:48:55 CUST [2020/04/15 13:48:55 ET]
                Room ‘xxx’ Seat #2 is occupied
                Seat 1: Mr.Hotseat
                Seat 2: könönen84
                Seat 3: -
                *** NOTES ***
                seated at, amount ($$)***
                

                If I take into account Seat #2 —mentioned in line 2—, then the paragraph includes 4 instances of the word Seat, so, applying the provided rule, the paragraph LOG #7 should not be deleted. However, if Seat #2 should not be counted, then LOG #7 includes only 3 instances of the word Seat and by the rule it should be deleted. See our problem?

                So, in order to better help you, I (we) need to know exactly how to count those instances. Also, please provide at least 3 examples of paragraphs that match the posted regex and 3 examples that fail to match. The examples are necessary to try to catch some regularity in them, which in turn will make a regex approach possible.

                Best Regards.

                • guy038G
                  guy038
                  last edited by guy038

                  Hello, @harl-xu, @Astrosofista, @alan-kilborn and All,

                  @harl-xu, one of @astrosofista’s statements is fundamental. He said :

                  Also, please provide at least 3 examples of paragraphs that match the posted regex and 3 examples that fail to match. The examples are necessary to try to catch some regularity in them, which in turn will make a regex approach possible.

                  That statement could be simplified as :

                  A fair number of examples of WHAT must be caught and WHAT must be ignored, to find some regularity in these two sets of examples ! This approach helps us to build a regular expression adapted to your personal case !

                  Now, I was waiting for @astrosofista’s reply before proposing my own solution.


                  I tried to guess your needs and I supposed that you want to count the Seat words only if they begin a line and are followed by a space character.

                  • If we also assume that all the lines Seat <number>:, in a LOG # section, are consecutive, here is my first version :

                    • SEARCH (?s-i)^LOG\x20#((?:(?!^Seat\x20).)+?)(?-s:^Seat.+\R){0,3}?(?1)\R{2,}

                    • REPLACE Leave EMPTY

                  • Later, I found a second, improved version, which allows the lines Seat <Number>: to be located anywhere in a section, after the line LOG #.......

                    • SEARCH (?s-i)^LOG\x20#(?:((?:(?!^Seat\x20).)+?)^Seat\x20){0,3}?(?1)\R{2,}

                    • REPLACE Leave EMPTY

                  Notes :

                  • This 2nd version still counts lines which begin with Seat <number>:, ONLY

                  • You may change the maximum number of counted lines by modifying the lazy quantifier {0,3}?. Note that this regex S/R will also delete any section that has no line beginning with Seat <Number>, with that exact case. If this is not desired, change the quantifier to {1,3}?

                  • Moreover, any LOG # section can be separated from another section by any positive number of pure empty lines !


                  Here is an extended version of the second version, using the FREE-spacing regex mode, with some explanations in comments :

                  (?xs-i)                  # Search in FREE-SPACING, SINGLE line and NON-INSENSITIVE modes
                  ^LOG\x20\#               # String "LOG #", BEGINNING of line
                  (?:                      # START of the first NON-CAPTURING group
                  (                        #   START of Group 1
                  (?: (?!^Seat\x20) . )+?  #     SHORTEST NON-NULL Range of ANY char, WITHOUT "Seat\x20" at BEGINNING of line
                  )                        #   END of Group 1
                  ^Seat\x20                #   followed with the STRING "Seat " at BEGINNING of line 
                  )                        # END of the first NON-CAPTURING group
                  {0,3}?                   # present a MINIMUM of 0 to 3 TIMES
                  (?1)                     # followed, again, with ANOTHER group 1 ( a SUBROUTINE CALL to the group 1 REGEX )
                  \R{2,}                   # ENDING with, at least, TWO CONSECUTIVE line-breaks 
                  

                  Finally, following @astrosofista’s last post, if we consider that we must count any Seat <Number> string, whatever its location in a section, after the LOG # string, here is my third regex version :

                  • SEARCH (?s-i)^LOG\x20#(?:((?:(?!Seat\x20).)+?)Seat\x20){0,3}?(?1)\R{2,}

                    • REPLACE Leave EMPTY
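
For readers who want to test the third version in a script, here is a hedged Python sketch of it. Python's `re` module supports neither the `(?1)` subroutine call nor `\R`, so this adaptation repeats the sub-pattern by hand and assumes the text uses plain `\n` line endings. The names `NOT_SEAT`, `build_pattern` and `remove_sections` are invented for this example, and the sample text is shortened.

```python
import re

# Shortest non-empty run of characters that never starts a "Seat " string
# (this is what Group 1 and the (?1) call match in the original regex).
NOT_SEAT = r"(?:(?!Seat ).)+?"


def build_pattern(max_seats: int = 3) -> re.Pattern:
    # (?1) is replaced by a second copy of NOT_SEAT, and \R by \n.
    return re.compile(
        rf"^LOG #(?:{NOT_SEAT}Seat ){{0,{max_seats}}}?{NOT_SEAT}\n{{2,}}",
        re.MULTILINE | re.DOTALL,
    )


def remove_sections(text: str, max_seats: int = 3) -> str:
    """Delete every LOG # section containing at most max_seats 'Seat ' strings."""
    return build_pattern(max_seats).sub("", text)


sample = (
    "LOG #1: a\nRoom 'xxx' Seat #2 is occupied\n*** NOTES ***\nseated at\n\n"
    "LOG #2: b\nRoom 'xxx' Seat #2 is occupied\nSeat 1: Mr.Hotseat\n"
    "*** NOTES ***\nseated at\n\n"
    "LOG #3: c\nRoom 'xxx' Seat #2 is occupied\nSeat 1: Mr.Hotseat\n"
    "Seat 2: könönen84\n*** NOTES ***\nseated at\n\n"
    "LOG #4: d\nRoom 'xxx' Seat #2 is occupied\nSeat 1: Mr.Hotseat\n"
    "Seat 2: könönen84\nSeat 3: Blah blah\n*** NOTES ***\nseated at\n\n"
)

print(remove_sections(sample))  # only the LOG #4 section remains
```

As with the Notepad++ regex, a section is only matched when it is followed by at least two line-breaks, so a final section without a trailing blank line is left alone.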

                  Best Regards,

                  guy038

                  • guy038G
                    guy038
                    last edited by guy038

                    Hi, @harl-xu, @Astrosofista, @alan-kilborn and All,

                    To simplify and understand the general architecture, we can decompose, for instance, the second version of the search regex, according to this schema :

                              ANY char                  ANY char                  ANY char                  ANY char          
                                 V                         V                         V                         V              
                    LOG #.................^Seat\x20.................^Seat\x20.................^Seat\x20.................\R{2,}
                         \_______________/         \_______________/         \_______________/         \_______________/      
                                 v                         v                         v                         v              
                              Group 1                    Group 1                  Group 1                (?1) = Group 1       
                         \________________________/\________________________/\________________________/\_______________/      
                                     v                         v                         v                                    
                            NON-capturing group       NON-capturing group       NON-capturing group                           
                         ______________________________________________________________________________                       
                                           REPEATED a MINIMUM, from ZERO to THREE times                                       
                    
                    Note : ALL the GROUP 1 do NOT contain any string "^Seat ", due to the LOOK-AHEAD structure (?!^Seat\x20)
                    

                    Hope you like it !

                    Cheers,

                    guy038

                    • Harl XuH
                      Harl Xu @guy038
                      last edited by Harl Xu

                      Hi @guy038, @astrosofista, All

                      I want to match 'Seat ' wherever it occurs, so I’ll go with solution #3. Upon testing, though, solution #2 seems to produce the same hits as solution #3. At least I can continue with my project now…

                      @astrosofista, the words Room and seated in my example are irrelevant, because they might not be there. That’s my bad, sorry.

                      You all are my saviors. Thank you so much.

                      • guy038G
                        guy038
                        last edited by guy038

                        Hi, @harl-xu, @astrosofista, @alan-kilborn, @ekopalypse, @michael-vincent and All,

                        @harl-xu, there is, indeed, a difference between solutions 2 and 3, below :

                        • Regex 2 : (?s-i)^LOG\x20#(?:((?:(?!^Seat\x20).)+?)^Seat\x20){0,3}?(?1)\R{2,}

                        • Regex 3 : (?s-i)^LOG\x20#(?:((?:(?!Seat\x20).)+?)Seat\x20){0,3}?(?1)\R{2,}


                        For instance, against this short example, below, which contains four LOG # sections :

                        LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
                        Room ‘xxx’ Seat #2 is occupied
                        *** NOTES ***
                        seated at, amount ($$)
                        
                        LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
                        Room ‘xxx’ Seat #2 is occupied
                        Seat 1: Mr.Hotseat
                        *** NOTES ***
                        seated at, amount ($$)
                        
                        LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
                        Room ‘xxx’ Seat #2 is occupied
                        Seat 1: Mr.Hotseat
                        Seat 2: könönen84
                        *** NOTES ***
                        seated at, amount ($$)
                        
                        LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
                        Room ‘xxx’ Seat #2 is occupied
                        Seat 1: Mr.Hotseat
                        Seat 2: könönen84
                        Seat 3: Blah blah
                        *** NOTES ***
                        seated at, amount ($$)
                        

                        The regex 2 matches all the 4 sections whereas the regex 3 does not match the last section ! Why ?

                        • With regex 2, it looks for not more than 3 strings "Seat ", beginning a line, within a LOG # section

                        • With regex 3, it looks for not more than 3 strings "Seat ", anywhere in a line, within a LOG # section

                        So, because of the line Room ‘xxx’ Seat #2 is occupied, in all sections, which contains the string Seat #2, the last LOG # section has, finally, FOUR strings "Seat ". Thus, the regex 3 cannot match the last LOG # section. Elementary !
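
The two ways of counting can be checked quickly with a few lines of Python. This is just an illustrative sketch: it counts the strings directly instead of running regexes 2 and 3, and the sample uses plain ASCII quotes.

```python
import re

# The last LOG # section of the example above.
section = (
    "LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]\n"
    "Room 'xxx' Seat #2 is occupied\n"
    "Seat 1: Mr.Hotseat\n"
    "Seat 2: könönen84\n"
    "Seat 3: Blah blah\n"
    "*** NOTES ***\n"
    "seated at, amount ($$)\n"
)

# What regex 2 counts: "Seat " only at the beginning of a line.
at_line_start = len(re.findall(r"^Seat ", section, re.MULTILINE))
# What regex 3 counts: "Seat " anywhere in the section.
anywhere = len(re.findall(r"Seat ", section))

print(at_line_start, anywhere)  # 3 4
```

With a limit of three, the first count allows the section to be matched and the second does not.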

                        Best Regards,

                        guy038

                        P.S. :

                        With regexes 2 or 3, a LOG # section will be counted as having 3 "Seat " strings even if the lines containing them are not consecutive

                        The regex 1, below, was more restrictive because, both, the strings "Seat " must begin a line and all these lines must also be consecutive !

                        • Regex 1 : (?s-i)^LOG\x20#((?:(?!^Seat\x20).)+?)(?-s:^Seat.+\R){0,3}?(?1)\R{2,}

                        For instance, the regex 1 would only match the second LOG #, below :

                        LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
                        Seat 1: Mr.Hotseat
                        *** NOTES ***
                        Seat 2: könönen84
                        seated at, amount ($$)
                        Seat 3: Blah blah
                        
                        LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
                        Seat 1: Mr.Hotseat
                        Seat 2: könönen84
                        Seat 3: Blah blah
                        *** NOTES ***
                        seated at, amount ($$)
                        
                        • Harl XuH
                          Harl Xu @guy038
                          last edited by Harl Xu

                          Hi @guy038 , and All

                          Thank you for taking the extra time to explain the differences. Those schematics and details… You are so cool… :)
                          As I said in my post above, regex 3 is what I need.

                          Best Regards,

                          Harl
