Delete all text outside a specific HTML tag pair

pelocalizer

Hi everybody!

I have a set of HTML files where the content in scope is defined by the tag pair

So I would need to delete all text outside this tag pair.

Any idea about using a Search/Replace regex or patterns in order to achieve this?

Thank you very much!

PeterJones

@pelocalizer said:

Any idea about using a Search/Replace regex or patterns in order to achieve this?

Plenty of ideas. There’s a lot you aren’t telling us, so I have to assume that you’ll be happy with my solution here.

Start with a file:

<span>out of bounds</span>
<p class="text"> and </p>
<span>this is out of bounds, too</span>

Want it to be:

<p class="text"> and </p>

Find What = (?s)\A.*(<p class="text">.*?</p>).*\Z
Replace With = $1
Mode = regular expression

I assumed that the forum converted your straight quotes into smart quotes, so the regex uses straight quotes. I also assumed that you want only one instance of what’s in scope, so I’m deleting everything before the start of the p and everything after the end of the p. Note that the leading .* is greedy: if the file contains more than one such p, only the last one is kept.
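
For anyone who would rather script this across many files than use the Replace dialog, here is a minimal Python sketch of the same regex. It assumes that Python's `re` module treats this particular pattern the same way the Notepad++ engine does, and the function name `keep_in_scope` is made up for illustration, not something from Notepad++.

```python
import re

# Same pattern as in the Find What box above. (?s) lets "." match newlines,
# \A and \Z anchor the match to the start and end of the whole text.
PATTERN = re.compile(r'(?s)\A.*(<p class="text">.*?</p>).*\Z')


def keep_in_scope(html: str) -> str:
    """Return only the captured <p class="text">...</p> element.

    Hypothetical helper: if the pattern does not match, the input is
    returned unchanged. Because the leading .* is greedy, the last
    matching element wins when there are several.
    """
    return PATTERN.sub(r"\1", html, count=1)


if __name__ == "__main__":
    sample = (
        "<span>out of bounds</span>\n"
        '<p class="text"> and </p>\n'
        "<span>this is out of bounds, too</span>\n"
    )
    print(keep_in_scope(sample))  # prints: <p class="text"> and </p>
```

To apply it to a set of files you would read each file, pass its text through `keep_in_scope`, and write the result back; work on a backup copy first, since the replacement discards everything outside the captured element.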

If you want better help than that, give us better information, and show that you’ve read and understood my boilerplate below, including the links mentioned.

-----
FYI: I often add this to my response in regex threads, unless I am sure the original poster has seen it before. Here is some helpful information for finding out more about regular expressions, and for formatting posts in this forum (especially quoting data) so that we can fully understand what you’re trying to ask:

This forum is formatted using Markdown, with a help link buried on the little grey ? in the COMPOSE window/pane when writing your post. For more about how to use Markdown in this forum, please see @Scott-Sumner’s post in the “how to markdown code on this forum” topic, and my updates near the end. It is very important that you use these formatting tips – using single backtick marks around small snippets, and using code-quoting for pasting multiple lines from your example data files – because otherwise, the forum will change normal quotes ("") to curly “smart” quotes (“”), will change hyphens to dashes, will sometimes hide asterisks (or if your text is c:\folder\*.txt, it will show up as c:\folder*.txt, missing the backslash). If you want to clearly communicate your text data to us, you need to properly format it.

If you have further search-and-replace (“matching”, “marking”, “bookmarking”, regular expression, “regex”) needs, study this FAQ and the documentation it points to. Before asking a new regex question, understand that for future requests, many of us will expect you to show what data you have (exactly), what data you want (exactly), what regex you already tried (to show that you’re showing effort), why you thought that regex would work (to prove it wasn’t just something randomly typed), and what data you’re getting with an explanation of why that result is wrong. When you show that effort, you’ll see us bend over backward to get things working for you. If you need help formatting, see the paragraph above.

Please note that for all regex and related queries, it is best if you are explicit about what needs to match, and what shouldn’t match, and have multiple examples of both in your example dataset. Often, what shouldn’t match helps define the regular expression as much or more than what should match.