HTML - Only save the heading and paragraphs text, remove images and links
-
Hi :) I'm pretty new to Notepad++, but its functions feel amazing.
I wonder if it's possible to use some kind of regex to clean HTML code so that it contains only the text within the headings (<h> to </h>) and paragraphs (<p> to </p>),
removing everything else in between (images, etc.) as well as attributes such as links, classes, ids, etc. within a tag?
Example:
-----------------------------raw version------------------------------------
<p><img alt=“” class=“aligncenter size-large wp-image-129073” height=“658” loading=“lazy” sizes=“(max-width: 1170px) 100vw, 1170px” src=“https://331mrnu3ylm2k3db3s1xd1hg-wpengine.netdna-ssl.com/wp-content/uploads/2017/12/coffee-film-1170x658.jpg” srcset=“https://331mrnu3ylm2k3db3s1xd1hg-wpengine.netdna-ssl.com/wp-content/uploads/2017/12/coffee-film-1170x658.jpg 1170w, https://331mrnu3ylm2k3db3s1xd1hg-wpengine.netdna-ssl.com/wp-content/uploads/2017/12/coffee-film-770x433.jpg 770w, https://331mrnu3ylm2k3db3s1xd1hg-wpengine.netdna-ssl.com/wp-content/uploads/2017/12/coffee-film-768x432.jpg 768w, https://331mrnu3ylm2k3db3s1xd1hg-wpengine.netdna-ssl.com/wp-content/uploads/2017/12/coffee-film-830x467.jpg 830w, https://331mrnu3ylm2k3db3s1xd1hg-wpengine.netdna-ssl.com/wp-content/uploads/2017/12/coffee-film.jpg 1920w” width=“1170”></p>
<p>Earlier this month we unveiled the nominees for the Ninth Annual Sprudgie Awards. <a class=“addbackground” href=“https://sprudge.com/vote”>Voting is now open</a> across a dozen categories, honoring the very best in coffee. Voting ends December 31st, 2017 at 11:59 PM.</p>
<p>In this feature, we’re spotlighting the 2017 nominees for Best Coffee Film/Video, one of the tightest races in all Sprudgie Award categories. Past winners for Best Coffee Film/Video include the <em><a class=“addbackground” href=“https://www.netflix.com/title/80109415”>Gilmore Girls: A Year In The Life,</a> <a href=“https://www.facebook.com/BaristaFilm/”>Barista</a>, <a class=“addbackground” href=“https://sprudge.com/dunkin-love-everything-54551.html”>Dunkin Love</a>, <a class=“addbackground” href=“https://sprudge.com/now-watching-hey-girl-guide-coffeeing-maria-hill.html”>Hey Girl Guide To Coffeeing</a>, <a class=“addbackground” href=“http://comediansincarsgettingcoffee.com/”>Comedians in Cars Getting Coffee</a>, </em>and <em><a class=“addbackground” href=“http://info.stumptowncoffee.com/kenya-video/”>“Kenya”</a> (Stumptown Coffee Roasters).</em></p>
<p>Let’s meet this year’s nominees!</p><!-- Either there are no banners, they are disabled or none qualified for this location! -->
<h3 id=“rb-The-Young-and-the-Spoonless-by-Cafe-Imports”>The Young and the Spoonless by Cafe Imports</h3>
<p><iframe allowfullscreen=“allowfullscreen” frameborder=“0” height=“675” loading=“lazy” mozallowfullscreen=“mozallowfullscreen” src=“https://player.vimeo.com/video/215119548” title=“Introducing: The Young and the Spoonless” webkitallowfullscreen=“webkitallowfullscreen” width=“1200”></iframe></p>
<h3 id=“rb-I-Yelp-By-The-Way-by-Dapper-amp-Wise”>I Yelp By The Way by Dapper & Wise</h3>
<blockquote class=“instagram-media” data-instgrm-captioned=“data-instgrm-captioned” data-instgrm-permalink=“https://www.instagram.com/p/BbNDSpaFT_L/” data-instgrm-version=“8” style=“background:#FFF; border:0; border-radius:3px; box-shadow:0 0 1px 0 rgba(0,0,0,0.5),0 1px 10px 0 rgba(0,0,0,0.15); margin: 1px; max-width:658px; padding:0; width:99.375%; width:-webkit-calc(100% - 2px); width:calc(100% - 2px);”>
<div style=“padding:8px;”>
<div style=“background:#F8F8F8; line-height:0; margin-top:40px; padding:28.10185185185185% 0; text-align:center; width:100%;”>
<div style=“background:url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAACwAAAAsCAMAAAApWqozAAAABGdBTUEAALGPC/xhBQAAAAFzUkdCAK7OHOkAAAAMUExURczMzPf399fX1+bm5mzY9AMAAADiSURBVDjLvZXbEsMgCES5/P8/t9FuRVCRmU73JWlzosgSIIZURCjo/ad+EQJJB4Hv8BFt+IDpQoCx1wjOSBFhh2XssxEIYn3ulI/6MNReE07UIWJEv8UEOWDS88LY97kqyTliJKKtuYBbruAyVh5wOHiXmpi5we58Ek028czwyuQdLKPG1Bkb4NnM+VeAnfHqn1k4+GPT6uGQcvu2h2OVuIf/gWUFyy8OWEpdyZSa3aVCqpVoVvzZZ2VTnn2wU8qzVjDDetO90GSy9mVLqtgYSy231MxrY6I2gGqjrTY0L8fxCxfCBbhWrsYYAAAAAElFTkSuQmCC); display:block; height:44px; margin:0 auto -44px; position:relative; top:-22px; width:44px;”></div>
</div>
-----------------------------cleaned version------------------------------------
<p>Earlier this month we unveiled the nominees for the Ninth Annual Sprudgie Awards. Voting is now open across a dozen categories, honoring the very best in coffee. Voting ends December 31st, 2017 at 11:59 PM.</p>
<p>In this feature, we’re spotlighting the 2017 nominees for Best Coffee Film/Video, one of the tightest races in all Sprudgie Award categories. Past winners for Best Coffee Film/Video include the Gilmore Girls: A Year In The Life, Barista, Dunkin Love, Hey Girl Guide To Coffeeing, Comedians in Cars Getting Coffee, and “Kenya” (Stumptown Coffee Roasters).</p>
<p>Let’s meet this year’s nominees!</p>
<h3>The Young and the Spoonless by Cafe Imports</h3>
<h3>I Yelp By The Way by Dapper & Wise</h3>
-
Hello, @daniel-norin,
Sorry, but I have to be away for at least 2 hours!
I will give you a first version as soon as possible!
Best Regards
guy038
-
Hello, @daniel-norin and All,
This is a first version, which will surely require modifications later on!
Firstly, we delete some unwanted parts in the lines that we want to keep:
- Open the Replace dialog (Ctrl + H)
- SEARCH (?s-i)<a .+?>|</a>|</?em>
- REPLACE Leave EMPTY
- Tick the Wrap around option only
- Select the Regular expression search mode
- Click on the Replace All button
- Close the dialog (Esc)
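For readers who would rather script this first step, here is a minimal Python sketch of the same deletion. It is an adaptation, not part of the Notepad++ procedure: Python's `re` does not accept a bare `(?s-i)` prefix, so `re.DOTALL` stands in for `(?s)` and the default case-sensitive matching covers `-i`. The names `INLINE_TAGS`, `strip_inline_tags` and the shortened `sample` string are made up for illustration.

```python
import re

# Port of the SEARCH pattern above. re.DOTALL plays the role of (?s);
# Python matches case-sensitively by default, which corresponds to (-i).
INLINE_TAGS = re.compile(r"<a .+?>|</a>|</?em>", re.DOTALL)


def strip_inline_tags(html: str) -> str:
    """Delete <a ...>, </a>, <em> and </em> tags, keeping the text between them."""
    return INLINE_TAGS.sub("", html)


# Shortened, hypothetical excerpt in the style of the raw version.
sample = (
    '<p>Past winners include the <em><a class="addbackground" '
    'href="https://sprudge.com/dunkin-love-everything-54551.html">Dunkin Love</a>'
    "</em> and others.</p>"
)

print(strip_inline_tags(sample))
# <p>Past winners include the Dunkin Love and others.</p>
```

Only the tags are removed; the link text and the surrounding `<p>` element are left untouched, which is what the second step relies on.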
Secondly, we mark some text which will be copied to the clipboard:
- Open the Mark dialog (Ctrl + M)
- SEARCH (?s-i)<h([1-6])[^>]*>\K.+?(?=</h\1>)|<p>\K\w.+?(?=</p>)
- Tick the Wrap around option only
- Select the Regular expression search mode
- Click on the Mark All button
- Click on the Copy Marked Text button
- Close the dialog (Esc)
Now, select all the text with Ctrl + A and hit the Ctrl + V shortcut.
The expected cleaned text should appear:
Earlier this month we unveiled the nominees for the Ninth Annual Sprudgie Awards. Voting is now open across a dozen categories, honoring the very best in coffee. Voting ends December 31st, 2017 at 11:59 PM.
In this feature, we’re spotlighting the 2017 nominees for Best Coffee Film/Video, one of the tightest races in all Sprudgie Award categories. Past winners for Best Coffee Film/Video include the Gilmore Girls: A Year In The Life, Barista, Dunkin Love, Hey Girl Guide To Coffeeing, Comedians in Cars Getting Coffee, and "Kenya" (Stumptown Coffee Roasters).
Let’s meet this year’s nominees!
The Young and the Spoonless by Cafe Imports
I Yelp By The Way by Dapper & Wise
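The mark-and-copy step can also be approximated outside Notepad++. The sketch below is a Python adaptation under two stated assumptions: Python's `re` has no `\K`, so capture groups hold the text to keep, and the opening heading tag is matched with `[^>]*` so that it cannot run past the tag when the dot matches newlines. `KEEP`, `extract_text` and the shortened `sample` are hypothetical names and data.

```python
import re
from typing import List

# Capture groups replace \K: "heading" and "para" hold the text to keep.
# [^>]* confines the opening-tag match to a single tag.
KEEP = re.compile(
    r"<h([1-6])[^>]*>(?P<heading>.+?)</h\1>|<p>(?P<para>\w.+?)</p>",
    re.DOTALL,
)


def extract_text(html: str) -> List[str]:
    """Return heading and paragraph texts in document order."""
    return [m.group("heading") or m.group("para") for m in KEEP.finditer(html)]


# Shortened, hypothetical excerpt; paragraphs that start with a tag are skipped
# because \w must match right after <p>.
sample = """<p><img alt="" src="coffee-film.jpg"></p>
<p>Let's meet this year's nominees!</p><!-- no banners -->
<h3 id="rb-The-Young-and-the-Spoonless-by-Cafe-Imports">The Young and the Spoonless by Cafe Imports</h3>
<p><iframe src="https://player.vimeo.com/video/215119548"></iframe></p>
<h3 id="rb-I-Yelp-By-The-Way-by-Dapper-amp-Wise">I Yelp By The Way by Dapper & Wise</h3>
"""

for line in extract_text(sample):
    print(line)
```

As with Copy Marked Text, each kept fragment ends up on its own line, and the `<p>` elements that contain only an image or an iframe contribute nothing.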
Best Regards,
guy038
-
@daniel-norin, @guy038, all
Just for the sake of variety, here is my take. It seems to work fine on the sample input; I hope it can deal with other inputs too.
Note that it does not remove empty paragraphs, such as the first and sixth. Anyway, it is easy to delete them in a second step.
Instructions are similar to the ones provided by @guy038. Open the Replace dialog (Ctrl + H) and type:
Search: (?s)<(?!/?p>|/?h3>?).*?>| id=.*?(?=>)
Replace: [leave empty]
Put the caret at the very beginning of the document, select the Regular Expression mode and click on Replace or Replace All.
Output:
<p></p>
<p>Earlier this month we unveiled the nominees for the Ninth Annual Sprudgie Awards. Voting is now open across a dozen categories, honoring the very best in coffee. Voting ends December 31st, 2017 at 11:59 PM.</p>
<p>In this feature, we’re spotlighting the 2017 nominees for Best Coffee Film/Video, one of the tightest races in all Sprudgie Award categories. Past winners for Best Coffee Film/Video include the Gilmore Girls: A Year In The Life, Barista, Dunkin Love, Hey Girl Guide To Coffeeing, Comedians in Cars Getting Coffee, and “Kenya” (Stumptown Coffee Roasters).</p>
<p>Let’s meet this year’s nominees!</p>
<h3>The Young and the Spoonless by Cafe Imports</h3>
<p></p>
<h3>I Yelp By The Way by Dapper & Wise</h3>
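A rough Python equivalent of this single-pass replacement, followed by the "second step" that drops the empty paragraphs, might look like the sketch below. The search pattern is the one given above with `re.DOTALL` in place of `(?s)`; the `EMPTY_P` pattern, the function name `clean` and the shortened `sample` are my own assumptions, since the post does not say how the empty paragraphs would be deleted.

```python
import re

# Search pattern from the post: remove every tag except <p>, </p>, <h3 ...>, </h3>,
# then remove the id attribute left inside the <h3> opening tags.
STRIP = re.compile(r"<(?!/?p>|/?h3>?).*?>| id=.*?(?=>)", re.DOTALL)

# Assumed second step: delete paragraphs that are empty after stripping,
# together with one trailing newline if present.
EMPTY_P = re.compile(r"<p>\s*</p>\n?")


def clean(html: str) -> str:
    """Keep only <p> and <h3> elements, without attributes or empty paragraphs."""
    return EMPTY_P.sub("", STRIP.sub("", html))


# Shortened, hypothetical excerpt in the style of the raw version.
sample = (
    '<p><img alt="" src="coffee-film.jpg"></p>\n'
    '<p><a class="addbackground" href="https://sprudge.com/vote">Voting is now open</a>'
    " across a dozen categories.</p>\n"
    '<h3 id="rb-The-Young-and-the-Spoonless-by-Cafe-Imports">'
    "The Young and the Spoonless by Cafe Imports</h3>\n"
)

print(clean(sample), end="")
# <p>Voting is now open across a dozen categories.</p>
# <h3>The Young and the Spoonless by Cafe Imports</h3>
```

Like the Notepad++ version, this only protects `<p>` and `<h3>`; other heading levels would need to be added to the lookahead.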
Stay healthy