How to remove paragraphs with a specific pattern?
-
I work with hundreds of txt files that are formatted as follows:
LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
Room ‘xxx’ Seat #2 is occupied
Seat 1: Mr.Hotseat
Seat 2: könönen84
*** NOTES ***
seated at, amount ($$)

LOG #7: 2020/04/15 0:48:55 CUST [2020/04/15 13:48:55 ET]
Room ‘xxx’ Seat #2 is occupied
Seat 1: Mr.Hotseat
Seat 2: könönen84
Seat 3: -
Seat 4: -
*** NOTES ***
seated at, amount ($$)***

I wish to delete entire paragraphs in which the word ‘Seat’ occurs 3 times or fewer (in this case, the 1st paragraph).
Can someone please provide some suggestions and thoughts on this? Thank you very much.
-
Hi @Harl-Xu, All
Try this:
Place the caret at the beginning of the file. Then open the Find panel (Control + F) and copy the following line into the Find what: field:

(?-s)^LOG #.*\R^Room.*\R(?:^Seat \d+:.*\R){1,2}^\*.*\R^seated.*\R\R?

Leave the Replace with: field empty. Select the Regular expression search mode, and click on the Replace All button.

The regex will delete all paragraphs not containing the Seat 3: string.

Hope this helps.
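
For anyone who wants to try the same idea outside Notepad++, here is a minimal Python sketch of the regex above. It rests on some assumptions: Python's built-in re module has no \R, so "\n" line endings are assumed, and the sample text is simply the two paragraphs from the question with the trailing *** left out.

```python
import re

# Port of the regex above. Assumption: "\n" line endings, since the
# standard-library re module does not support \R.
PATTERN = re.compile(
    r"^LOG #.*\n^Room.*\n(?:^Seat \d+:.*\n){1,2}^\*.*\n^seated.*\n\n?",
    re.MULTILINE,
)

SAMPLE = (
    "LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]\n"
    "Room ‘xxx’ Seat #2 is occupied\n"
    "Seat 1: Mr.Hotseat\n"
    "Seat 2: könönen84\n"
    "*** NOTES ***\n"
    "seated at, amount ($$)\n"
    "\n"
    "LOG #7: 2020/04/15 0:48:55 CUST [2020/04/15 13:48:55 ET]\n"
    "Room ‘xxx’ Seat #2 is occupied\n"
    "Seat 1: Mr.Hotseat\n"
    "Seat 2: könönen84\n"
    "Seat 3: -\n"
    "Seat 4: -\n"
    "*** NOTES ***\n"
    "seated at, amount ($$)\n"
)

# Paragraphs with only one or two "Seat <n>:" lines are removed.
cleaned = PATTERN.sub("", SAMPLE)
print(cleaned)
```

On this sample only the LOG #7 paragraph should remain, because it contains a Seat 3: line.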
-
It’s not exactly to the OP’s spec, but it may be fulfilling the OP’s need! We will see. :-)
-
I guess we are reading again a message that is ambiguous in a different way. I count three times the term Seat in the paragraph to be deleted, but OP may have meant that the three seats should be at the beginning of a line.
It doesn’t matter much anyway, since the regex is very easy to adapt to how many times Seat should appear.
Let’s see :)
-
Hello Sir,
Thank you for the help. I’m sorry if you found my post ambiguous. I’m trying hard to compose my post in English.
The only thing that is constant between the paragraphs I’m working on is that they always start with ‘LOG #’. And there is always a blank line to separate those paragraphs.
The wording or number of lines in a paragraph will vary, hence the code doesn’t work with other paragraphs. ‘Seat’ could be placed anywhere.
All I want is to select from ‘LOG #’ to the next blank line, count the word ‘Seat’, then delete the entire selection if it matches my criteria.
Thank you.
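
The procedure described above (take each paragraph from ‘LOG #’ to the next blank line, count ‘Seat’, delete when the count is low enough) can also be sketched without a regex search and replace. This is only an illustration under assumptions: the function name drop_sparse_sections is made up, "\n" line endings are assumed, blank lines are assumed to be truly empty, and the string "Seat " (capital S, followed by a space) is counted wherever it occurs.

```python
import re

def drop_sparse_sections(text, word="Seat ", max_count=3):
    """Remove blank-line-separated paragraphs that start with 'LOG #' and
    contain `word` at most `max_count` times (case-sensitive).
    Hypothetical helper, not part of any tool mentioned in this thread."""
    paragraphs = re.split(r"\n{2,}", text.strip("\n"))
    kept = [
        p for p in paragraphs
        if not (p.startswith("LOG #") and p.count(word) <= max_count)
    ]
    return "\n\n".join(kept) + "\n"

SAMPLE = (
    "LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]\n"
    "Room ‘xxx’ Seat #2 is occupied\n"
    "Seat 1: Mr.Hotseat\n"
    "Seat 2: könönen84\n"
    "*** NOTES ***\n"
    "seated at, amount ($$)\n"
    "\n"
    "LOG #7: 2020/04/15 0:48:55 CUST [2020/04/15 13:48:55 ET]\n"
    "Room ‘xxx’ Seat #2 is occupied\n"
    "Seat 1: Mr.Hotseat\n"
    "Seat 2: könönen84\n"
    "Seat 3: -\n"
    "Seat 4: -\n"
    "*** NOTES ***\n"
    "seated at, amount ($$)\n"
)

# LOG #2 holds "Seat " three times (including "Seat #2"), so it is dropped;
# LOG #7 holds it five times, so it is kept.
result = drop_sparse_sections(SAMPLE)
print(result)
```

Changing max_count, or passing a different word, adapts the rule without touching the splitting logic.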
-
Hi @Harl-Xu
Don’t worry about language issues, as English isn’t my first language either. When in trouble, try using a translation service such as DeepL.com, if it is available for your language.
Your message is ambiguous in a crucial sense, because we aren’t sure how to count the Seat instances. Let me show you what I mean, say:

LOG #7: 2020/04/15 0:48:55 CUST [2020/04/15 13:48:55 ET]
Room ‘xxx’ Seat #2 is occupied
Seat 1: Mr.Hotseat
Seat 2: könönen84
Seat 3: -
*** NOTES ***
seated at, amount ($$)***

If I take into account Seat #2 (mentioned in line 2), then the paragraph includes 4 instances of the word Seat, so, applying the provided rule, the paragraph LOG #7 should not be deleted. However, if Seat #2 should not be counted, then LOG #7 includes only 3 instances of the word Seat and by the rule it should be deleted. See our problem?

So, in order to better help you, I (we) need to know exactly how to count those instances. Also, please provide at least 3 examples of paragraphs that match the posted regex and 3 examples that fail to match. The examples are necessary to try to catch some regularity in them, which in turn will make a regex approach possible.

Best Regards.
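
To make the two ways of counting concrete, here is a small Python sketch run against the LOG #7 paragraph quoted above. The two patterns are just one possible reading of each interpretation (case-sensitive "Seat" anywhere, versus a "Seat <number>:" prefix at the start of a line); they are assumptions, not the OP's stated rule.

```python
import re

PARAGRAPH = (
    "LOG #7: 2020/04/15 0:48:55 CUST [2020/04/15 13:48:55 ET]\n"
    "Room ‘xxx’ Seat #2 is occupied\n"
    "Seat 1: Mr.Hotseat\n"
    "Seat 2: könönen84\n"
    "Seat 3: -\n"
    "*** NOTES ***\n"
    "seated at, amount ($$)***"
)

# Interpretation A: every case-sensitive occurrence of "Seat", anywhere.
anywhere = len(re.findall(r"Seat", PARAGRAPH))

# Interpretation B: only "Seat <number>:" at the beginning of a line.
line_start = len(re.findall(r"^Seat \d+:", PARAGRAPH, re.MULTILINE))

print(anywhere, line_start)  # 4 3
```

Note that the lowercase "seat" in "Mr.Hotseat" and "seated" is not counted by either pattern.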
-
Hello, @harl-xu, @Astrosofista, @alan-kilborn and All,
@harl-xu, one of @astrosofista’s statements is fundamental. He said:
Also, please provide at least 3 examples of paragraphs that match the posted regex and 3 examples that fail to match. The examples are necessary to try to catch some regularity in them, which in turn will make a regex approach possible.
This statement could be simplified as:
A fair number of examples of WHAT must be caught and WHAT must be ignored, to find out some regularity in these two sets of examples! This approach helps us to build up the perfect regular expression, adapted to your personal case!
Now, I was waiting for a reply from @astrosofista before proposing my own solution.
I tried to guess your needs and I supposed that you want to count the Seat words only if they begin a line and are followed by a space char.
If we also assume that all the Seat <number>: lines, in a LOG # section, are consecutive, here is my first version:

SEARCH (?s-i)^LOG\x20#((?:(?!^Seat\x20).)+?)(?-s:^Seat.+\R){0,3}?(?1)\R{2,}

REPLACE Leave EMPTY

Later, I found a second, improved version which allows the Seat <Number>: lines to be located anywhere in a section, after the LOG #....... line:

SEARCH (?s-i)^LOG\x20#(?:((?:(?!^Seat\x20).)+?)^Seat\x20){0,3}?(?1)\R{2,}

REPLACE Leave EMPTY

Notes:

- This 2nd version still counts ONLY the lines which begin with Seat <number>:
- You may modify the number of required lines by changing the lazy quantifier {0,3}?. Note that this regex S/R will also delete any section without any line beginning with Seat <Number>, with that exact case. If not desired, change the quantifier to {1,3}?
- Moreover, any LOG # section can be separated from another section by any positive number of pure empty lines!

Here is an extended version of the second version, using the FREE-spacing regex mode, with some explanations in comments :

(?xs-i)                        # Search in FREE-SPACING, SINGLE line and NON-INSENSITIVE modes
^LOG\x20\#                     # String "LOG #", BEGINNING of line
(?:                            # START of the first NON-CAPTURING group
  (                            # START of Group 1
    (?: (?!^Seat\x20) . )+?    # SHORTEST NON-NULL Range of ANY char, WITHOUT "Seat\x20" at BEGINNING of line
  )                            # END of Group 1
  ^Seat\x20                    # followed with the STRING "Seat " at BEGINNING of line
)                              # END of the first NON-CAPTURING group
{0,3}?                         # present a MINIMUM of 0 to 3 TIMES
(?1)                           # followed, again, with ANOTHER group 1 ( a SUBROUTINE CALL to the group 1 REGEX )
\R{2,}                         # ENDING with, at least, TWO CONSECUTIVE line-breaks

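
As a side note for readers who script this instead of using Notepad++, the free-spacing layout carries over to Python's re.VERBOSE flag. The sketch below is an adaptation under assumptions, not the regex above verbatim: the standard-library re module has neither \R nor the (?1) subroutine call, so the group 1 sub-pattern is written out a second time and line breaks are matched as an optional "\r" followed by "\n". The sample text is taken from the first post, without the trailing ***.

```python
import re

PATTERN = re.compile(
    r"""
    ^LOG\x20\#                    # "LOG #" at the beginning of a line
    (?:                           # non-capturing group, repeated 0 to 3 times (lazy)
        (?: (?!^Seat\x20) . )+?   #   shortest run of chars with no "Seat " at a line start
        ^Seat\x20                 #   then "Seat " at the beginning of a line
    ){0,3}?
    (?: (?!^Seat\x20) . )+?       # same sub-pattern again, standing in for the (?1) call
    (?:\r?\n){2,}                 # at least two consecutive line breaks
    """,
    re.VERBOSE | re.DOTALL | re.MULTILINE,
)

SAMPLE = (
    "LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]\n"
    "Room ‘xxx’ Seat #2 is occupied\n"
    "Seat 1: Mr.Hotseat\n"
    "Seat 2: könönen84\n"
    "*** NOTES ***\n"
    "seated at, amount ($$)\n"
    "\n"
    "LOG #7: 2020/04/15 0:48:55 CUST [2020/04/15 13:48:55 ET]\n"
    "Room ‘xxx’ Seat #2 is occupied\n"
    "Seat 1: Mr.Hotseat\n"
    "Seat 2: könönen84\n"
    "Seat 3: -\n"
    "Seat 4: -\n"
    "*** NOTES ***\n"
    "seated at, amount ($$)\n"
    "\n"
)

# LOG #2 has two lines starting with "Seat " and is removed;
# LOG #7 has four such lines and is left alone.
cleaned = PATTERN.sub("", SAMPLE)
print(cleaned)
```

As with the original, a section is only matched when it is followed by at least one empty line, so the last section of a file needs a trailing blank line to be considered.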
Finally, from @astrosofista’s last post, if we consider that we must count any Seat <Number> string, whatever its location in a section, after the LOG # string, here is my third version of the regex:

SEARCH (?s-i)^LOG\x20#(?:((?:(?!Seat\x20).)+?)Seat\x20){0,3}?(?1)\R{2,}

REPLACE Leave EMPTY

Best Regards,
guy038
-
-
Hi, @harl-xu, @Astrosofista, @alan-kilborn and All,
To simplify and understand the general architecture, we can decompose, for instance, the second version of the search regex, according to this schema:

         ANY char                  ANY char                  ANY char                  ANY char
             V                         V                         V                         V
LOG #.................^Seat\x20.................^Seat\x20.................^Seat\x20.................\R{2,}
     \_______________/         \_______________/         \_______________/         \_______________/
             v                         v                         v                         v
          Group 1                   Group 1                   Group 1               (?1) = Group 1
     \________________________/\________________________/\________________________/\_______________/
                  v                         v                         v
        NON-capturing group       NON-capturing group       NON-capturing group
     ______________________________________________________________________________
                      REPEATED a MINIMUM, from ZERO to THREE times

Note: ALL the GROUP 1 do NOT contain any string "^Seat ", due to the LOOK-AHEAD structure (?!^Seat\x20)

Hope you like it !
Cheers,
guy038
-
Hi @guy038, @astrosofista, All
I want to match 'Seat ', wherever it appears, so I went with solution #3. Upon testing, though, solution #2 seems to get the same hits as solution #3. At least I can continue with my project now…
@astrosofista, the words Room and seated in my explanation are irrelevant, because they might not be there. That’s my bad, sorry.
You all are my saviors. Thank you so much.
-
Hi, @harl-xu, @astrosofista, @alan-kilborn, @ekopalypse, @michael-vincent and All,
@harl-xu, there is, indeed, a difference between solutions 2 and 3, below:

- Regex 2: (?s-i)^LOG\x20#(?:((?:(?!^Seat\x20).)+?)^Seat\x20){0,3}?(?1)\R{2,}
- Regex 3: (?s-i)^LOG\x20#(?:((?:(?!Seat\x20).)+?)Seat\x20){0,3}?(?1)\R{2,}

For instance, against this short example, below, which contains four LOG # sections:

LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
Room ‘xxx’ Seat #2 is occupied
*** NOTES ***
seated at, amount ($$)

LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
Room ‘xxx’ Seat #2 is occupied
Seat 1: Mr.Hotseat
*** NOTES ***
seated at, amount ($$)

LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
Room ‘xxx’ Seat #2 is occupied
Seat 1: Mr.Hotseat
Seat 2: könönen84
*** NOTES ***
seated at, amount ($$)

LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
Room ‘xxx’ Seat #2 is occupied
Seat 1: Mr.Hotseat
Seat 2: könönen84
Seat 3: Blah blah
*** NOTES ***
seated at, amount ($$)

The regex 2 matches all the 4 sections whereas the regex 3 does not match the last section! Why?

- With regex 2, it looks for not more than 3 strings "Seat ", beginning a line, within a LOG # section
- With regex 3, it looks for not more than 3 strings "Seat ", anywhere in a line, within a LOG # section

So, because of the line Room ‘xxx’ Seat #2 is occupied, in all sections, which contains the string Seat #2, the last LOG # section has, finally, FOUR strings "Seat ". Thus, the regex 3 cannot match the last LOG # section. Elementary!

Best Regards,
guy038
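
The difference between regexes 2 and 3 on the four-section example can be checked with a short Python sketch. Same caveats as any port to the standard-library re module: \R and (?1) are not available there, so the helper build_pattern (a made-up name) writes the group 1 sub-pattern out twice and matches line breaks as an optional "\r" followed by "\n". Each section in the sample is assumed to be followed by one empty line, including the last one.

```python
import re

def build_pattern(seat):
    """Build a stdlib-re port of regex 2 or 3 from the given "Seat " sub-pattern.
    Hypothetical helper: (?1) is expanded by repeating `body`."""
    body = rf"(?:(?!{seat}).)+?"
    return re.compile(
        rf"^LOG\x20#(?:{body}{seat}){{0,3}}?{body}(?:\r?\n){{2,}}",
        re.DOTALL | re.MULTILINE,
    )

REGEX_2 = build_pattern(r"^Seat\x20")  # "Seat " only at the beginning of a line
REGEX_3 = build_pattern(r"Seat\x20")   # "Seat " anywhere

HEADER = (
    "LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]\n"
    "Room ‘xxx’ Seat #2 is occupied\n"
)
FOOTER = "*** NOTES ***\nseated at, amount ($$)\n\n"
SEAT_LINES = [
    "Seat 1: Mr.Hotseat\n",
    "Seat 2: könönen84\n",
    "Seat 3: Blah blah\n",
]

# Four sections holding 0, 1, 2 and 3 "Seat <n>:" lines respectively.
SAMPLE = "".join(HEADER + "".join(SEAT_LINES[:n]) + FOOTER for n in range(4))

print(len(REGEX_2.findall(SAMPLE)))  # 4
print(len(REGEX_3.findall(SAMPLE)))  # 3
```

The last section is the one regex 3 skips: "Seat #2" in the Room line brings its total to four "Seat " strings.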
P.S.:

With regexes 2 or 3, a LOG # section will be considered as having 3 "Seat " strings even if the "Seat " lines are not consecutive.

The regex 1, below, was more restrictive because the strings "Seat " must begin a line and all these lines must also be consecutive!

- Regex 1: (?s-i)^LOG\x20#((?:(?!^Seat\x20).)+?)(?-s:^Seat.+\R){0,3}?(?1)\R{2,}

For instance, the regex 1 would only match the second LOG # section, below:

LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
Seat 1: Mr.Hotseat
*** NOTES ***
Seat 2: könönen84
seated at, amount ($$)
Seat 3: Blah blah

LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]
Seat 1: Mr.Hotseat
Seat 2: könönen84
Seat 3: Blah blah
*** NOTES ***
seated at, amount ($$)

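
The behaviour of regex 1 on these two sections can be reproduced with a Python sketch, again as an adaptation rather than the exact Notepad++ regex: the (?1) call is expanded by repeating the sub-pattern, \R becomes an optional "\r" followed by "\n", and (?-s:^Seat.+\R) is rewritten with a character class that excludes line breaks. Both sections are assumed to be followed by one empty line.

```python
import re

# Shortest run of chars with no "Seat " at the beginning of a line
# (this is what group 1 and the (?1) call stand for in regex 1).
BODY = r"(?:(?!^Seat\x20).)+?"

REGEX_1 = re.compile(
    r"^LOG\x20#" + BODY
    + r"(?:^Seat[^\r\n]+\r?\n){0,3}?"   # up to 3 CONSECUTIVE lines starting with "Seat"
    + BODY
    + r"(?:\r?\n){2,}",
    re.DOTALL | re.MULTILINE,
)

SECTION_A = (  # "Seat" lines scattered through the section
    "LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]\n"
    "Seat 1: Mr.Hotseat\n"
    "*** NOTES ***\n"
    "Seat 2: könönen84\n"
    "seated at, amount ($$)\n"
    "Seat 3: Blah blah\n"
    "\n"
)
SECTION_B = (  # "Seat" lines consecutive
    "LOG #2: 2020/04/14 0:48:55 CUST [2020/04/14 13:48:55 ET]\n"
    "Seat 1: Mr.Hotseat\n"
    "Seat 2: könönen84\n"
    "Seat 3: Blah blah\n"
    "*** NOTES ***\n"
    "seated at, amount ($$)\n"
    "\n"
)
SAMPLE = SECTION_A + SECTION_B

matches = list(REGEX_1.finditer(SAMPLE))
# Only one match is expected, and it starts where SECTION_B begins.
print(len(matches), matches[0].start() == len(SECTION_A))
```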
-
-
Hi @guy038 , and All
Thank you for taking the extra work to explain the differences. That schematic and those details… You are so cool… :)
Then, like I said in my post above, regex 3 is what I need.

Best Regards,
Harl