Removing everything but the content of certain HTML tags

  • Say, I open an HTML page in Notepad++.

    The page has a lot of stuff in it, including these two tags:

    <div id="first id" class="first class">CONTENT</div>

    <div id="second id" class="second class">CONTENT</div>

    I’d like to remove everything from the file except the CONTENT of these two tags. How could I do that in the most efficient manner?

  • I’m not sure whether this is the most efficient way, but it is one way: you could use a regular expression that matches the whole file, then replace the match with placeholders for the parenthesized groups you want to keep. Open the Replace dialog (Ctrl + H) and enter the following regular expression in “Find what”:

    (.*?)(<div id=\"first id\" class=\"first class\">)(.*?)(<\/div>)(.*?)(<div id=\"second id\" class=\"second class\">)(.*?)(<\/div>)(.*)

    And in the “Replace with” field: ${3}${7}
    Or, alternatively: ${3}\r\n${7}

    The second one adds a line break between the two contents. You must also set “Search Mode” to “Regular expression” and tick the “. matches newline” checkbox. Finally, click “Replace All”.
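
    If you ever need to do the same thing outside Notepad++, here is a minimal Python sketch of the same idea. It is only an illustration: the sample page in `html` is made up (only the two div tags come from your question), and `re.DOTALL` plays the role of the “. matches newline” option. Like the Notepad++ version, it assumes each tag occurs once, in this order, and that CONTENT has no nested div, because the lazy `(.*?)` stops at the first closing `</div>`.

```python
import re

# Hypothetical sample page; only the two div tags come from the question.
html = (
    "<html><body><p>header</p>\n"
    '<div id="first id" class="first class">FIRST CONTENT</div>\n'
    "<p>middle</p>\n"
    '<div id="second id" class="second class">SECOND\nCONTENT</div>\n'
    "<p>footer</p></body></html>\n"
)

# Same nine groups as the "Find what" expression; groups 3 and 7 are the contents.
pattern = re.compile(
    r'(.*?)(<div id="first id" class="first class">)(.*?)(</div>)'
    r'(.*?)(<div id="second id" class="second class">)(.*?)(</div>)(.*)',
    re.DOTALL,  # let "." match line breaks too
)

# Equivalent of the "${3}\r\n${7}" replacement.
result = pattern.sub(r"\g<3>\r\n\g<7>", html)
print(repr(result))
```

    Since the expression swallows the whole text in a single match, everything outside groups 3 and 7 is dropped in one replacement.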

  • This worked well! Thank you!
