Match consecutive lines that start with the same word

Reply to Match consecutive lines that start with the same word on Tue, 16 Jul 2024 08:06:14 GMT

Ross Brown — Tue, 16 Jul 2024 08:06:14 GMT

@PeterJones Thanks Peter
Noted and thanks for the helpful links.

Reply to Match consecutive lines that start with the same word on Tue, 16 Jul 2024 08:05:10 GMT

Ross Brown — Tue, 16 Jul 2024 08:05:10 GMT

@Mark-Olson Thanks Mark
It works perfectly and you have provided a really clear explanation. I was going to bookmark and remove the rows but your additional code was a bonus. I can follow the logic (helped by your clear explanation) it’s the groups I am getting stuck on. I will do some more research in this area. Thanks again!

Reply to Match consecutive lines that start with the same word on Mon, 15 Jul 2024 16:05:03 GMT

PeterJones — Mon, 15 Jul 2024 16:05:03 GMT

@Ross-Brown ,

unfortunately I can’t post the link as I’m a newbie.

Asking questions in a way that makes it easy for us to help you would probably earn more upvotes. (But since this was enough for Mark to figure out what you wanted, I gave another upvote.)

But in the future, it would make it a lot easier for us to help you if you would give us your example data as text using the button when you are creating your post, so we can copy/paste, rather than making us try to type the same thing we see in a screenshot. That way it ends up in the code box with the “copy code” button, like in Mark’s reply.

(I had started an answer that was similar to Mark’s, but he posted before I got very far, so I stopped that part of my reply, and didn’t include any specifics for your situation; he explained it better than I was doing.)

----

Useful References

Reply to Match consecutive lines that start with the same word on Mon, 15 Jul 2024 16:01:09 GMT

Mark Olson — Mon, 15 Jul 2024 16:01:09 GMT

Hi @Ross-Brown

Keep up the good work learning regular expressions!

This is a task where lookahead and lookbehind can be useful, because you want to check whether the next line has some text, without moving forward to that line.

I came up with the regular expression (?-s)^(TEXT_TO_MATCH)(.*)(\R|\z)(?=\1) to solve your problem, assuming you want to keep only lines that start with TEXT_TO_MATCH and that are not followed by another line that starts with TEXT_TO_MATCH.

This regular expression does the following:

^(TEXT_TO_MATCH) attempts to find TEXT_TO_MATCH at the beginning of a line, then stores it as the first capture group
(.*) consumes the rest of the line (since (?-s) was specified at the start of the regex) and stores it as the second capture group
(\R|\z)(?=\1) stores the line ending (CRLF, CR, or LF) as the third capture group, but then fails the match if it sees that the next line does not start with the first capture group.

For example, let’s say you wanted to clear lines (remove their text but leave them empty) if they start with AAA or BBB followed by a normal space character and the next line has the same beginning.

Then you would replace TEXT_TO_MATCH in our original regex with (?:AAA|BBB)\x20, since that matches AAA or BBB followed by a normal space character, and we get the regex (?-s)^((?:AAA|BBB)\x20)(.*)(\R|\z)(?=\1)

We can test this out on this example:

AAA A   [Header row 1]
BBB B   [Data row 1]
AAA A   [Header row 2]
AAA A   [Header row 3]
AAA A   [Header row 4]
AAA A   [Header row 5]
BBB B   [Data row 2]
AAA A   [Header row 6]
BBB B   [Data row 3]
AAA A   [Header row 7]
BBB B   [Data row 4]
BBB B   [Data row 5]
AAA A   [Header row 8]

If we replace (?-s)^((?:AAA|BBB)\x20)(.*)(\R|\z)(?=\1) with ${3}, we clear everything except the line ending from each matched line, and get:

AAA A   [Header row 1]
BBB B   [Data row 1]



AAA A   [Header row 5]
BBB B   [Data row 2]
AAA A   [Header row 6]
BBB B   [Data row 3]
AAA A   [Header row 7]

BBB B   [Data row 5]
AAA A   [Header row 8]

I hope that helped!