How to remove paragraphs with a specific pattern?



  • I work with hundreds of txt files that are formatted as follows:

    LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
    Room ‘xxx’ Seat #2 is occupied
    Seat 1: Mr.Hotseat
    Seat 2: könönen84
    *** NOTES ***
    seated at, amount ($$)

    LOG #7: 2020/04/15 0:48:55 CUST [2020/04/15 13:48:55 ET]
    Room ‘xxx’ Seat #2 is occupied
    Seat 1: Mr.Hotseat
    Seat 2: könönen84
    Seat 3: -
    Seat 4: -
    *** NOTES ***
    seated at, amount ($$)***

    I wish to delete every paragraph in which the word ‘Seat’ occurs 3 times or fewer (in this case, the 1st paragraph).
    Can someone please provide some suggestions and thoughts on this?

    Thank you very much.



  • Hi @Harl-Xu, All

    Try this:

    Place the caret at the beginning of the file. Then open the Replace dialog (Control + H) and paste the following line into the Find what: field:

    (?-s)^LOG #.*\R^Room.*\R(?:^Seat \d+:.*\R){1,2}^\*.*\R^seated.*\R\R?

    Leave the Replace with: field empty.

    Select the Regular expression search mode, and click on the Replace All button.

    The regex will delete every paragraph with this layout that has only one or two Seat <number>: lines, that is, every paragraph not containing the Seat 3: string.
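
    For anyone who would rather script this outside the editor, the same search can be approximated in Python. This is only a sketch under assumptions: Python's re module has no \R, so \n is used instead (files with plain LF line endings are assumed), and the sample text and the names PATTERN, sample and result are made up for illustration.

```python
import re

# Port of the search regex above; \R is replaced by \n (assumption: LF line endings).
# re.MULTILINE makes ^ match at the start of every line; "." does not cross lines.
PATTERN = re.compile(
    r"^LOG #.*\n^Room.*\n(?:^Seat \d+:.*\n){1,2}^\*.*\n^seated.*\n\n?",
    flags=re.MULTILINE,
)

sample = (
    "LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]\n"
    "Room 'xxx' Seat #2 is occupied\n"
    "Seat 1: Mr.Hotseat\n"
    "Seat 2: könönen84\n"
    "*** NOTES ***\n"
    "seated at, amount ($$)\n"
    "\n"
    "LOG #7: 2020/04/15 0:48:55 CUST [2020/04/15 13:48:55 ET]\n"
    "Room 'xxx' Seat #2 is occupied\n"
    "Seat 1: Mr.Hotseat\n"
    "Seat 2: könönen84\n"
    "Seat 3: -\n"
    "Seat 4: -\n"
    "*** NOTES ***\n"
    "seated at, amount ($$)***\n"
)

# Replacing every match with an empty string mimics "Replace All" with an empty field.
result = PATTERN.sub("", sample)
print(result)  # only the LOG #7 paragraph is left
```

    As in the editor, this only works for paragraphs that follow the exact LOG / Room / Seat / NOTES / seated layout.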

    Hope this helps.



  • @astrosofista

    It’s not exactly to the OP’s spec, but it may be fulfilling the OP’s need! We will see. :-)



  • @Alan-Kilborn

    I guess we are, once again, reading an ambiguous message in different ways. I count the term Seat three times in the paragraph to be deleted, but the OP may have meant that the three Seat instances should be at the beginning of a line.

    It doesn’t matter much anyway, since the regex is very easy to adapt to how many times Seat should appear.

    Let’s see :)



  • @astrosofista

    Hello Sir,

    Thank you for the help. I’m sorry if you found my post ambiguous. I’m trying hard to compose my post in English.

    The only thing that is constant between the paragraphs I’m working on is that they always start with ‘LOG #’. And there is always a blank line separating those paragraphs.

    The wording and the number of lines in a paragraph will vary, hence the code doesn’t work with other paragraphs. ‘Seat’ could be placed anywhere.

    All I want is to select from ‘LOG #’ up to the next blank line, count the word ‘Seat’, then delete the entire selection if it matches my criteria.
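
    The three steps described here (take each block from ‘LOG #’ to the next blank line, count ‘Seat’, delete when the count is low enough) can also be scripted directly. The following is a minimal Python sketch, not a Notepad++ feature: the function name drop_sections, its parameters and the sample text are all hypothetical, and it assumes LF line endings, truly empty separator lines, and that ‘Seat’ is counted case-sensitively as a whole word anywhere in the block.

```python
import re

def drop_sections(text, word="Seat", max_count=3):
    """Drop 'LOG #' sections in which `word` occurs `max_count` times or fewer."""
    # Assumption: sections are separated by one or more completely empty lines.
    sections = re.split(r"\n{2,}", text.strip("\n"))
    # Case-sensitive whole-word match, so "Hotseat" and "seated" are not counted.
    word_re = re.compile(rf"\b{re.escape(word)}\b")
    kept = [
        section
        for section in sections
        if not (section.startswith("LOG #") and len(word_re.findall(section)) <= max_count)
    ]
    return "\n\n".join(kept) + "\n" if kept else ""

sample = (
    "LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]\n"
    "Room 'xxx' Seat #2 is occupied\n"
    "Seat 1: Mr.Hotseat\n"
    "Seat 2: könönen84\n"
    "*** NOTES ***\n"
    "seated at, amount ($$)\n"
    "\n"
    "LOG #7: 2020/04/15 0:48:55 CUST [2020/04/15 13:48:55 ET]\n"
    "Room 'xxx' Seat #2 is occupied\n"
    "Seat 1: Mr.Hotseat\n"
    "Seat 2: könönen84\n"
    "Seat 3: -\n"
    "Seat 4: -\n"
    "*** NOTES ***\n"
    "seated at, amount ($$)***\n"
)

result = drop_sections(sample)
print(result)  # the LOG #2 section (3 x "Seat") is gone, LOG #7 (5 x "Seat") stays
```

    Because the counting is ordinary code rather than part of one big regex, the rule (whole word, line start only, a different limit) is easy to change once the exact criteria are known.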

    Thank you.



  • Hi @Harl-Xu

    Don’t worry about language issues, as English isn’t my first language either. When in trouble, try a translation service such as DeepL.com, if it is available for your language.

    Your message is ambiguous in a crucial sense, because we aren’t sure how to count the Seat instances. Let me show you what I mean, say:

    LOG #7: 2020/04/15 0:48:55 CUST [2020/04/15 13:48:55 ET]
    Room ‘xxx’ Seat #2 is occupied
    Seat 1: Mr.Hotseat
    Seat 2: könönen84
    Seat 3: -
    *** NOTES ***
    seated at, amount ($$)***
    

    If I take into account Seat #2 (mentioned in line 2), then the paragraph includes 4 instances of the word Seat, so, applying the provided rule, the paragraph LOG #7 should not be deleted. However, if Seat #2 should not be counted, then LOG #7 includes only 3 instances of the word Seat and by the rule it should be deleted. See our problem?
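
    To make the two readings concrete, here is a small Python check on that same paragraph. It is a sketch only: the patterns \bSeat\b and ^Seat\b are just one way of expressing the two readings, and the variable names are made up.

```python
import re

paragraph = (
    "LOG #7: 2020/04/15 0:48:55 CUST [2020/04/15 13:48:55 ET]\n"
    "Room 'xxx' Seat #2 is occupied\n"
    "Seat 1: Mr.Hotseat\n"
    "Seat 2: könönen84\n"
    "Seat 3: -\n"
    "*** NOTES ***\n"
    "seated at, amount ($$)***\n"
)

# Reading 1: every case-sensitive occurrence of the whole word "Seat".
anywhere = len(re.findall(r"\bSeat\b", paragraph))

# Reading 2: only "Seat" at the very beginning of a line.
line_start = len(re.findall(r"^Seat\b", paragraph, flags=re.MULTILINE))

print(anywhere, line_start)  # 4 3
```

    With a limit of 3, the first reading keeps this paragraph and the second one deletes it.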

    So, in order to better help you, I (we) need to know exactly how to count those instances. Also, please provide at least 3 examples of paragraphs that match the posted regex and 3 examples that fail to match. The examples are necessary to try to catch some regularity in them, which in turn will make a regex approach possible.

    Best Regards.



  • Hello, @harl-xu, @Astrosofista, @alan-kilborn and All,

    @harl-xu, one of @astrosofista’s statements is fundamental. He said :

    Also, please provide at least 3 examples of paragraphs that match the posted regex and 3 examples that fail to match. The examples are necessary to try to catch some regularity in them, which in turn will make a regex approach possible.

    A statement which could be simplified as :

    A fair number of examples of WHAT must be caught and WHAT must be ignored, to find out some regularity in these two sets of examples ! This approach helps us to build up the right regular expression, adapted to your particular case !

    Now, I was waiting for @astrosofista’s reply before proposing my own solution.


    I tried to guess your needs and I supposed that you want to count the Seat words only if they begin a line and are followed by a space char.

    • If we also assume that all the lines Seat <number>:, in a LOG # section, are consecutive, here is my first version :

      • SEARCH (?s-i)^LOG\x20#((?:(?!^Seat\x20).)+?)(?-s:^Seat.+\R){0,3}?(?1)\R{2,}

      • REPLACE Leave EMPTY

    • Later, I found a second, improved version which allows the lines Seat <Number>: to be located anywhere in a section, after the line LOG #.......

      • SEARCH (?s-i)^LOG\x20#(?:((?:(?!^Seat\x20).)+?)^Seat\x20){0,3}?(?1)\R{2,}

      • REPLACE Leave EMPTY

    Notes :

    • This 2nd version still counts lines which begin with Seat <number>:, ONLY

    • You may modify the number of allowed lines by changing the lazy quantifier {0,3}?. Note that this regex S/R will also delete any section without any line beginning with Seat <Number>, with that exact case. If this is not desired, change the quantifier to {1,3}?

    • Moreover, any LOG # section can be separated, from another section, by any positive number of pure empty lines !


    Here is an extended version of the second version, using the FREE-spacing regex mode, with some explanations in comments :

    (?xs-i)                  # Search in FREE-SPACING, SINGLE line and CASE-SENSITIVE modes
    ^LOG\x20\#               # String "LOG #", BEGINNING of line
    (?:                      # START of the first NON-CAPTURING group
    (                        #   START of Group 1
    (?: (?!^Seat\x20) . )+?  #     SHORTEST NON-NULL Range of ANY char, WITHOUT "Seat\x20" at BEGINNING of line
    )                        #   END of Group 1
    ^Seat\x20                #   followed with the STRING "Seat " at BEGINNING of line 
    )                        # END of the first NON-CAPTURING group
    {0,3}?                   # repeated from 0 to 3 TIMES, as FEW times as possible ( LAZY quantifier )
    (?1)                     # followed, again, with ANOTHER group 1 ( a SUBROUTINE CALL to the group 1 REGEX )
    \R{2,}                   # ENDING with, at least, TWO CONSECUTIVE line-breaks 
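
    For readers who want to try this second version outside the editor, here is a rough Python port. It is a sketch under assumptions, not an exact equivalent: Python's re module supports neither the (?1) subroutine call nor \R, so the group 1 sub-pattern is repeated by hand through the made-up name NO_SEAT_RUN and \n stands in for \R (LF line endings assumed). The names REGEX_2, sample and result are also only illustrative.

```python
import re

# Shortest non-empty run of characters that never crosses a line-initial "Seat ".
NO_SEAT_RUN = r"(?: (?!^Seat\x20) . )+?"

REGEX_2 = re.compile(
    rf"""
    ^LOG\x20\#            # the string "LOG #" at the beginning of a line
    (?:                   # start of the repeated non-capturing group
        {NO_SEAT_RUN}
        ^Seat\x20         #   followed by "Seat " at the beginning of a line
    ){{0,3}}?             # repeated 0 to 3 times, as few times as possible
    {NO_SEAT_RUN}
    \n{{2,}}              # the section ends with at least two line breaks
    """,
    flags=re.VERBOSE | re.DOTALL | re.MULTILINE,
)

sample = """\
LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
Room 'xxx' Seat #2 is occupied
Seat 1: Mr.Hotseat
Seat 2: könönen84
*** NOTES ***
seated at, amount ($$)

LOG #7: 2020/04/15 0:48:55 CUST [2020/04/15 13:48:55 ET]
Room 'xxx' Seat #2 is occupied
Seat 1: Mr.Hotseat
Seat 2: könönen84
Seat 3: -
Seat 4: -
*** NOTES ***
seated at, amount ($$)***

"""

result = REGEX_2.sub("", sample)
print(result)  # LOG #2 (two line-initial "Seat ") is removed, LOG #7 (four) is kept
```

    The second NO_SEAT_RUN plays the role of the (?1) call: the sub-pattern is simply written out again, which is the usual workaround when subroutine calls are not available.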
    

    Finally, following @astrosofista’s last post, if we consider that we must count any Seat <Number> string, whatever its location in a section, after the LOG # string, here is my third regex version :

    • SEARCH (?s-i)^LOG\x20#(?:((?:(?!Seat\x20).)+?)Seat\x20){0,3}?(?1)\R{2,}

    • REPLACE Leave EMPTY

    Best Regards,

    guy038



  • Hi, @harl-xu, @Astrosofista, @alan-kilborn and All,

    To simplify and understand the general architecture, we can decompose, for instance, the second version of the search regex, according to this schema :

              ANY char                  ANY char                  ANY char                  ANY char          
                 V                         V                         V                         V              
    LOG #.................^Seat\x20.................^Seat\x20.................^Seat\x20.................\R{2,}
         \_______________/         \_______________/         \_______________/         \_______________/      
                 v                         v                         v                         v              
              Group 1                    Group 1                  Group 1                (?1) = Group 1       
         \________________________/\________________________/\________________________/\_______________/      
                     v                         v                         v                                    
            NON-capturing group       NON-capturing group       NON-capturing group                           
         ______________________________________________________________________________                       
                           REPEATED a MINIMUM, from ZERO to THREE times                                       
    
    Note : ALL the GROUP 1 do NOT contain any string "^Seat ", due to the LOOK-AHEAD structure (?!^Seat\x20)
    

    Hope you like it !

    Cheers,

    guy038



  • Hi @guy038, @astrosofista, All

    I want to match 'Seat ', wherever it is positioned. So I’ll go with solution #3. Upon testing, though, solution #2 seems to give the same hits as solution #3. But at least I can continue with my project now…

    @astrosofista, the words Room and seated in my example are irrelevant, because they might not be there. That’s my bad, sorry.

    You all are my saviors. Thank you so much.



  • Hi, @harl-xu, @astrosofista, @alan-kilborn, @ekopalypse, @michael-vincent and All,

    @harl-xu, there is, indeed, a difference between solutions 2 and 3, below :

    • Regex 2 : (?s-i)^LOG\x20#(?:((?:(?!^Seat\x20).)+?)^Seat\x20){0,3}?(?1)\R{2,}

    • Regex 3 : (?s-i)^LOG\x20#(?:((?:(?!Seat\x20).)+?)Seat\x20){0,3}?(?1)\R{2,}


    For instance, against this short example, below, which contains four LOG # sections :

    LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
    Room ‘xxx’ Seat #2 is occupied
    *** NOTES ***
    seated at, amount ($$)
    
    LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
    Room ‘xxx’ Seat #2 is occupied
    Seat 1: Mr.Hotseat
    *** NOTES ***
    seated at, amount ($$)
    
    LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
    Room ‘xxx’ Seat #2 is occupied
    Seat 1: Mr.Hotseat
    Seat 2: könönen84
    *** NOTES ***
    seated at, amount ($$)
    
    LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
    Room ‘xxx’ Seat #2 is occupied
    Seat 1: Mr.Hotseat
    Seat 2: könönen84
    Seat 3: Blah blah
    *** NOTES ***
    seated at, amount ($$)
    

    The regex 2 matches all the 4 sections whereas the regex 3 does not match the last section ! Why ?

    • With regex 2, it looks for not more than 3 strings "Seat ", beginning a line, within a LOG # section

    • With regex 3, it looks for not more than 3 strings "Seat ", anywhere in a line, within a LOG # section

    So, because of the line Room ‘xxx’ Seat #2 is occupied, in all sections, which contains the string Seat #2, the last LOG # section has, finally, FOUR strings "Seat ". Thus, the regex 3 cannot match the last LOG # section. Elementary !
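
    The difference can also be checked with a small Python sketch. Again this is only an approximation under assumptions: the helper section_regex is made up, the (?1) subroutine call is expanded by hand, \n replaces \R (LF line endings assumed), and the four sections are rebuilt from the example above with plain quotes.

```python
import re

def section_regex(seat):
    """Build a Python port of regex 2 or 3; `seat` is the sub-pattern to count."""
    run = rf"(?:(?!{seat}).)+?"  # hand-expanded replacement for group 1 / (?1)
    return re.compile(rf"^LOG #(?:{run}{seat}){{0,3}}?{run}\n{{2,}}", re.M | re.S)

REGEX_2 = section_regex(r"^Seat ")  # "Seat " at the beginning of a line
REGEX_3 = section_regex(r"Seat ")   # "Seat " anywhere in a line

header = (
    "LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]\n"
    "Room 'xxx' Seat #2 is occupied\n"
)
footer = "*** NOTES ***\nseated at, amount ($$)\n\n"
seats = ["Seat 1: Mr.Hotseat\n", "Seat 2: könönen84\n", "Seat 3: Blah blah\n"]

# Four sections with 0, 1, 2 and 3 "Seat <n>:" lines, each followed by an empty line.
text = "".join(header + "".join(seats[:n]) + footer for n in range(4))

matched_2 = len(REGEX_2.findall(text))
matched_3 = len(REGEX_3.findall(text))
print(matched_2, matched_3)  # 4 3
```

    The fourth section escapes the port of regex 3 because the Room line contributes a fourth "Seat ".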

    Best Regards,

    guy038

    P.S. :

    With regexes 2 or 3, a LOG # section will be considered as having 3 strings "Seat " even if the lines containing "Seat " are not consecutive

    The regex 1, below, was more restrictive because, both, the strings "Seat " must begin a line and all these lines must also be consecutive !

    • Regex 1 : (?s-i)^LOG\x20#((?:(?!^Seat\x20).)+?)(?-s:^Seat.+\R){0,3}?(?1)\R{2,}

    For instance, the regex 1 would only match the second LOG # section, below :

    LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
    Seat 1: Mr.Hotseat
    *** NOTES ***
    Seat 2: könönen84
    seated at, amount ($$)
    Seat 3: Blah blah
    
    LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
    Seat 1: Mr.Hotseat
    Seat 2: könönen84
    Seat 3: Blah blah
    *** NOTES ***
    seated at, amount ($$)
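
    The same behaviour can be reproduced with a Python port of regex 1. It is a sketch only: (?1) is written out by hand through the made-up name RUN, \n replaces \R (LF line endings assumed), and the two sections are copies of the example above.

```python
import re

RUN = r"(?:(?!^Seat ).)+?"  # hand-expanded group 1 / (?1): never crosses a line-initial "Seat "

# (?-s:...) switches DOTALL off locally, so ".+" stops at the end of each Seat line.
REGEX_1 = re.compile(
    rf"^LOG #{RUN}(?-s:^Seat.+\n){{0,3}}?{RUN}\n{{2,}}",
    re.M | re.S,
)

log_line = "LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]\n"

scattered = (
    log_line
    + "Seat 1: Mr.Hotseat\n"
    + "*** NOTES ***\n"
    + "Seat 2: könönen84\n"
    + "seated at, amount ($$)\n"
    + "Seat 3: Blah blah\n"
    + "\n"
)
consecutive = (
    log_line
    + "Seat 1: Mr.Hotseat\n"
    + "Seat 2: könönen84\n"
    + "Seat 3: Blah blah\n"
    + "*** NOTES ***\n"
    + "seated at, amount ($$)\n"
    + "\n"
)

remaining = REGEX_1.sub("", scattered + consecutive)
print(remaining == scattered)  # True: only the section with consecutive Seat lines is removed
```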
    


  • Hi @guy038 , and All

    Thank you for taking on some extra work to explain the differences. Those schematics and details… You are so cool… :)
    Then, as I said in my post above, regex 3 is what I need.

    Best Regards,

    Harl

