Show (or keep) subsets of a file
-
@Coises posed several questions that I’ll answer here:
Can you tell us whether the delimiter strings are always the same? Do they always start at the beginning of a new line?
For a single search, the delimiter strings for all blocks to be found are the same, not “either string1 or string2”. I would change the delimiter strings for different invocations. (“Yesterday I searched for blocks that were delimited by string1 and string2, but today, in another file, I need to search for blocks delimited by string3 and string4.”)
[Start of edit]
And yes, they are always at the beginning of a new line. It wasn’t clear to me if I could specify only enough to make the string unique or if I had to specify the entire line contents.
[End of edit]
If each delimiter string is not exactly the same, every time, we need to know enough details to determine the patterns they follow that differentiate them from other lines in the file.
I assume(!) that using the entire line for the start string and the end string should be sufficient; otherwise the file would be ill-formed because the delimiter strings wouldn’t really be delimiters. But yes, I see your next point.
Do the delimiter strings contain any characters other than letters, numbers, spaces? Rather than get into every last detail, if they do contain characters other than letters, numbers and spaces, could they ever contain the specific sequence \E (backslash followed by capital E)? If they cannot, the strings can be enclosed in a \Q …\E pair to “quote” them so there is no need to worry about exactly what special characters need escaping.
They would contain an asterisk, if I have to specify the entire line. If only a unique portion of the line is needed then I can skip specifying the asterisk, but I’m not sure if the syntax of the regular expression is trying to match the entire line or not. But they would never contain a backslash, so “\E” would not be in the delimiters (or in the blocks themselves). The \Q…\E syntax is something I’ll probably use every time, whether it’s needed or not.
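For readers who script this kind of search outside Notepad++, here is a minimal sketch of the same quoting idea in Python. It is an illustration only, not what Notepad++ does internally: Python's `re` module has no `\Q…\E`, so `re.escape` is used as the assumed equivalent, and the helper name `quoted_line_start` is hypothetical. It also shows that a unique leading portion of the line is enough, because the pattern is anchored only at the start of the line.

```python
import re

def quoted_line_start(delimiter: str) -> re.Pattern:
    # re.escape backslash-escapes every regex metacharacter in the delimiter,
    # which plays the role of wrapping it in \Q...\E.
    # "^" with re.MULTILINE anchors the match at the start of any line.
    return re.compile("^" + re.escape(delimiter), re.MULTILINE)

sample = "*Block start\ndata\n*Block end\n"

# The full delimiter, asterisk included, is matched literally.
full = quoted_line_start("*Block start").findall(sample)

# A leading portion also matches: nothing forces the pattern to cover
# the whole line, so "*Block" finds both delimiter lines.
prefix = quoted_line_start("*Block").findall(sample)
```

Here `full` holds one match and `prefix` holds two, one for each delimiter line.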
A hex string (involving only letters, numbers and possibly spaces) will be no problem. But we need to be precise about what “would vary” means.
It would be your first case below, where I would have a single start/end string for each search, not the “either string1 or string2” case.
Do you mean that each time you do this search, you will start with a copy of the whole file and search for a specific target string?
Yes.
Or do you mean that there will be several different target strings, and you will want to get all the blocks that contain any of them into a single file?
No.
Or something else?
No.
-
I noticed that I had misspecified one of the occurrences of the starting string – I had used uppercase where the starting string did not. I retried the search using the correct case and also the \Q…\E syntax mentioned by @Coises :
(?s)^\Q*Block start\E((?!\Q*Block start\E).)+?\Q80 00010000\E.+?^\Q*Block end\E\R(*SKIP)(*F)|(?-s)^.*\R
But I got the same result: 36 occurrences were replaced and the resulting file contained only a single line of “*Block end”.
-
Ooh, getting close. Apparently some of the lines had trailing blanks on them. (I inadvertently added them when I was “sanitizing” the file so I could post it.) After removing the trailing blanks, the previous search I showed correctly identified blocks 1 and 2, but it did not identify block 5. I’m trying to parse the regular expression to see why, but the learning curve is steep…
-
Okay, I identified a workaround. As long as the ending string delimiting the last block isn’t the last line in the file, all blocks are located. I can make sure I add a trailing line before I run the Replace, so I shouldn’t have a problem. Thanks for your help. Back to work…
-
@Mark-Boonie said in Show (or keep) subsets of a file:
As long as the ending string delimiting the last block isn’t the last line in the file
Use (\R|\Z) at the end to allow the match to end with a newline or the end of the file. Most regulars here assume that all lines end with a newline.
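The effect of a missing final newline can be reproduced in a small Python sketch. This is an assumed translation, not the Notepad++ engine: Python's `re` has no `\R`, so `\n` stands in for it, and Python's `\Z` matches only at the very end of the string.

```python
import re

text = "line one\nline two"  # the last line has no trailing newline

# Requires a newline after every line, so the last line is never matched.
needs_newline = re.compile(r"^.*\n", re.MULTILINE)

# Accepts either a newline or the end of the text after each line.
newline_or_end = re.compile(r"^.*(?:\n|\Z)", re.MULTILINE)

with_required = needs_newline.findall(text)
with_alternative = newline_or_end.findall(text)
```

With the required newline only the first line is found; with the alternative both lines are.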
-
@PeterJones said in Show (or keep) subsets of a file:
(\R|\Z)
It didn’t quite work, @PeterJones, although it’s certainly possible that I messed up the syntax. I used this search string:
(?s)^\Q*Block start\E((?!\Q*Block start\E).)+?\Q80 00010000\E.+?^\Q*Block end\E\R(*SKIP)(*F)|(?-s)^.*(\R|\Z)
And this file:
*Block start
00000000013FC200 00200280 00010000 00000000 00000001
00000000013FC210 00000002 CC5CDDA0 00000000 00000000
*Block end
Extra stuff
*Block start
00000000013FC200 00200280 00010000 00000000 00000002
00000000013FC210 00000002 CC5CDDA0 00000000 00000000
00000000013FC220 00000000 00000000 01266100 01266100
00000000013FC230 00808000 013FC2B8 00000000 00000000
*Block end
*Block start
00000000013FC200 00200280 00020000 00000000 00000003
00000000013FC210 00000002 CC5CDDA0 00000000 00000000
00000000013FC220 00000000 00000000 01266100 01266100
00000000013FC230 00808000 013FC2B8 00000000 00000000
*Block end
*Block start
00000000013FC200 00200280 00030000 00000000 00000004
00000000013FC210 00000002 CC5CDDA0 00000000 00000000
00000000013FC220 00000000 00000000 01266100 01266100
00000000013FC230 00808000 013FC2B8 00000000 00000000
*Block end
Extra stuff
*Block start
00000000013FC200 00200280 00010000 00000000 00000005
*Block end
*Block start
00000000013FC200 00200280 00010000 00000000 00000006
*Block end
Note that the last delimited block, with the ‘6’ as the last character before the ending string, is not found but should be.
-
Sorry, I hadn’t noticed there was more than one \R in the original regex. You would have to use the alternate just before the (*SKIP) as well.
-
@PeterJones - Perfect! Thanks, everyone, for your help.
-
Hello, @mark-boonie, @coises, @peterjones and All,
Oh…, I’m terribly sorry ! I ate and watched TV for a few hours, forgetting all about you !
But, anyway, @mark-boonie, you cleverly found out the true solution !
So, considering your INPUT text, below, pasted in a new tab :
*Block start
00000000013FC200 00200280 00010000 00000000 00000001
00000000013FC210 00000002 CC5CDDA0 00000000 00000000
*Block end
Extra stuff
*Block start
00000000013FC200 00200280 00010000 00000000 00000002
00000000013FC210 00000002 CC5CDDA0 00000000 00000000
00000000013FC220 00000000 00000000 01266100 01266100
00000000013FC230 00808000 013FC2B8 00000000 00000000
*Block end
*Block start
00000000013FC200 00200280 00020000 00000000 00000003
00000000013FC210 00000002 CC5CDDA0 00000000 00000000
00000000013FC220 00000000 00000000 01266100 01266100
00000000013FC230 00808000 013FC2B8 00000000 00000000
*Block end
*Block start
00000000013FC200 00200280 00030000 00000000 00000004
00000000013FC210 00000002 CC5CDDA0 00000000 00000000
00000000013FC220 00000000 00000000 01266100 01266100
00000000013FC230 00808000 013FC2B8 00000000 00000000
*Block end
Extra stuff
*Block start
00000000013FC200 00200280 00010000 00000000 00000005
*Block end
*Block start
00000000013FC200 00200280 00010000 00000000 00000006
*Block end
Here is the final, minimal search regex, which works in all cases, whether a block or extra stuff ends your text, and even without a final line-break :
(?s)^\*Block start\h*(?:(?!\*Block start).)+?80 00010000.+?^\*Block end\h*\R?(*SKIP)(*F)|(?-s)^.*\R?
This way, the strings *Block start and *Block end must begin a line but may be followed by trailing blank characters, too !
Of course, the \Q.....\E syntax is safer but, generally, just one or two characters must be escaped within your delimiters and/or your target string !
You just have to remember to escape any of the characters * + ? ( ) ^ $ | [ ] { . and the \ itself. For example, you do not need to escape your present target string at all, as it simply contains digits and space chars !
But if your delimiters had been [*start|string*] and [*end|string*], or your target string (0x3F5B+0x7), surrounding them with \Q...\E would have been preferable !
Best Regards,
guy038
-
Hello, @mark-boonie and All,
I said in this post that we can translate the regex’s logic to :
What_I_do_not_want(*SKIP)(*F)|What_I_want
See also the excellent article on this topic at https://www.rexegg.com/backtracking-control-verbs.php#skipfail !
But, regarding your present example, @mark-boonie, I suppose that we should invert the logic and say :
What_I_want_to_keep(*SKIP)(*F)|What_I_want_to_delete
This means that any multi-line block, with delimiters Block start and Block end, containing the string 80 00010000 is not considered ( text is skipped ) and that the contents of any other single line, with its line-break, matched thanks to the (?-s) modifier, must be deleted.
Note that the use of the Backtracking Control Verbs (*SKIP) and (*F) is not mandatory at all ! We could have used this syntax, instead, for similar results :
- SEARCH (?s)^\*Block start\h*((?!\*Block start).)+?80 00010000.+?^\*Block end\h*\R?|(?-s)^.*\R?
- REPLACE (?1$0)
- We simply change the non-capturing group (?:(?!\*Block start).)+? into a capturing group ((?!\*Block start).)+?
- We tell that, in replacement, we must rewrite any block entirely ( $0 ), if the group 1 exists, thus the (?1$0) syntax
- And, as there is no colon char and text after (?1$0, nothing must be taken into account if the group 1 is absent, which is the case in the (?-s)^.*\R? part !
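The same keep-or-delete logic can be sketched outside Notepad++. The Python version below is an assumed translation, not the regex from this thread run as-is: Python's `re` has no `\h`, `\R`, `(*SKIP)(*F)` or `(?1$0)`, so `[ \t]` and `\n` stand in for the first two, the scoped group `(?s:...)` replaces the `(?s)` ... `(?-s)` switch, and a replacement callback plays the role of the conditional replacement. The names `BLOCK_OR_LINE`, `keep_matching_blocks` and the group name `block` are hypothetical.

```python
import re

# First alternative: a whole block containing the target string.
# Second alternative: any single line with its optional line-break.
BLOCK_OR_LINE = re.compile(
    r"(?P<block>(?s:^\*Block start[ \t]*(?:(?!\*Block start).)+?80 00010000"
    r".+?^\*Block end[ \t]*\n?))"
    r"|^.*\n?",
    re.MULTILINE,
)

def keep_matching_blocks(text: str) -> str:
    # Like (?1$0): rewrite the whole match when the block group took part
    # in the match, otherwise replace the matched line with nothing.
    return BLOCK_OR_LINE.sub(lambda m: m.group("block") or "", text)

sample = "\n".join([
    "*Block start",
    "00000000013FC200 00200280 00010000 00000000 00000001",
    "*Block end",
    "Extra stuff",
    "*Block start",
    "00000000013FC200 00200280 00020000 00000000 00000003",
    "*Block end",
    "*Block start",
    "00000000013FC200 00200280 00010000 00000000 00000006",
    "*Block end",  # deliberately no final line-break
])

kept = keep_matching_blocks(sample)
```

On this shortened sample, `kept` contains only the first and last blocks; the block with 00020000 and the Extra stuff line are removed, and the last block is found even though the text does not end with a line-break.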
Best regards,
guy038
-