Hello, @andrew-anderson and All,
After two days of working on it, from time to time, I found a way to isolate non-HTML tags, easily enough, with three consecutive regex S/R ;-))
To begin with, I got the list of all regular HTML tags from this site :
https://www.w3schools.com/TAGs/
Secondly, after looking at some HTML documents, I restricted that list to the most commonly used HTML tags ( to my mind ! ), listed below in alphabetical order :
<!--.......-->
<!DOCTYPE ...>
<a>
<b>
<body>
<br>
<dd>
<div>
<dl>
<dt>
<font>
<form>
<h1>
<h2>
<h3>
<h4>
<h5>
<h6>
<head>
<hr>
<html>
<i>
<iframe>
<img>
<input>
<li>
<link>
<meta>
<ol>
<option>
<p>
<script>
<select>
<source>
<span>
<style>
<table>
<tbody>
<td>
<th>
<title>
<tr>
<u>
<ul>
Finally, I built three regex S/R which, run one after the other, reduce any HTML document to a short list of tag names, none of which belongs to the list above. From that short list, it should be easy to point out the remaining non-HTML tags !
Well, let’s go !
Copy the HTML document to analyse, in a new tab ( IMPORTANT )
Select that new tab ( You do not, even, need to change the language to HTML ! )
With the first S/R, we get rid of all comments and of the DOCTYPE declaration
SEARCH (?s)<!(--.+?--|DOCTYPE.+?)>
REPLACE Leave EMPTY
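For anyone who wants to check this first step outside N++, here is a minimal Python sketch of the same search. It assumes that Python's re module treats this pattern like the Boost engine does, which seems safe since only (?s), lazy quantifiers and an alternation are involved. The function name strip_comments_and_doctype and the sample string are made up for the illustration.

```python
import re

# Same pattern as the first S/R; (?s) lets the dot cross line breaks.
COMMENT_OR_DOCTYPE = re.compile(r"(?s)<!(--.+?--|DOCTYPE.+?)>")

def strip_comments_and_doctype(html):
    """Delete every <!-- ... --> comment and the <!DOCTYPE ...> declaration."""
    return COMMENT_OR_DOCTYPE.sub("", html)

# Hypothetical sample, including a multi-line and a conditional comment.
sample = (
    "<!DOCTYPE html>\n"
    "<html>\n"
    "<!-- a comment\n"
    "spanning two lines -->\n"
    "<body>\n"
    "<!--[if lte IE 7]><link rel=\"stylesheet\"/><![endif]-->\n"
    "</body>\n"
    "</html>\n"
)

cleaned = strip_comments_and_doctype(sample)
print(cleaned)
```

Note that the tag hidden inside the conditional comment disappears with the comment, so it will not show up in the later steps.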
With the second S/R, we ONLY keep the tag names found in any of the three forms <tag> , </tag> and <tag followed by a space character, rewritten one per line
SEARCH (?s).*?<(?|(\w+) |/?(\w+)?>)|.*\z
REPLACE ?1\1\r\n
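The second step can be approximated in Python as follows. This is only a sketch, not a literal port : Python's re module has no branch reset group (?|...), so two ordinary groups are used instead, and finditer() plays the role of the leading .*? and of the final .*\z alternative. The function name tag_names and the sample string are hypothetical.

```python
import re

# Two ordinary groups replace the branch reset group of the N++ regex:
# group 1 catches "<tag " and group 2 catches "<tag>" or "</tag>".
TAG = re.compile(r"<(?:(\w+) |/?(\w+)?>)")

def tag_names(html):
    """Return the tag names, in document order, one entry per match."""
    names = []
    for m in TAG.finditer(html):
        name = m.group(1) or m.group(2)
        if name:  # skip degenerate matches such as "<>" or "</>"
            names.append(name)
    return names

# Hypothetical sample with the three forms and the three odd tags.
sample = '<body>\n<bla bla>\n<guy title="Test">\n</abcde>\n<p id="x">Hi<br /></p>\n'
print("\n".join(tag_names(sample)))
```

As with the N++ regex, a < sign used as a comparison operator inside a script, directly followed by a word and a space, would also be reported as a tag, so the resulting list is worth a quick visual check.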
With the third and last S/R, we simply delete any HTML tag belonging to the common HTML list given above
SEARCH (?-i)^(a|body|br|b|dd|div|dl|dt|font|form|h[1-6]|head|hr|html|iframe|img|input|i|link|li|meta|ol|option|p|script|select|source|span|style|table|tbody|td|th|title|tr|ul|u)\R
REPLACE Leave EMPTY
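Here is a small Python sketch of that last filtering step, assuming the tag names are already available as a list of strings ( one per line in N++ ). The alternation is copied from the regex above; fullmatch() stands in for the ^ and \R anchors, and the function name uncommon_tags is an invention for the example.

```python
import re

# Same alternation as the third S/R. Matching is case-sensitive by
# default in Python, which corresponds to the (?-i) modifier.
COMMON = re.compile(
    r"a|body|br|b|dd|div|dl|dt|font|form|h[1-6]|head|hr|html|iframe|img|input|i"
    r"|link|li|meta|ol|option|p|script|select|source|span|style|table|tbody"
    r"|td|th|title|tr|ul|u"
)

def uncommon_tags(names):
    """Keep only the names that are NOT in the common HTML list."""
    return [name for name in names if not COMMON.fullmatch(name)]

# Hypothetical extract of the list produced by the second step.
names = ["html", "head", "base", "body", "bla", "guy", "abcde", "div", "p", "a", "a", "p"]
print(uncommon_tags(names))
```

Like the S/R itself, this keeps duplicates : a non-listed tag that occurs ten times in the document stays ten times in the result.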
Now, let’s put these regexes into practice !
Below is the source code of the main page of the N++ site. I just added three non-regular tags to that code, right after the <body> tag :
<!DOCTYPE html>
<html lang="en" class="home midcol">
<head>
<meta charset="utf-8" />
<title>Notepad++ Home</title>
<meta name="description" content="Notepad++: a free source code editor which supports several programming languages running under the MS Windows environment."/>
<meta name="keywords" content="Notepad++, telechargement, gratuit, free source code editor, remplacant de Notepad++, Notepad2, netpad, open source, web editor, html editor, xml editor, php editor, asp editor, javascript editor, java editor, c++ editor, c# editor, objective-c editor, NFO editor, VB editor, CSS, SQL, Pascal, Perl, Python, Lua, Regular Expression Search"/>
<link rel="alternate" type="application/rss+xml" title="Follow Notepad++ with RSS" href="/feed.rss"/>
<link rel="stylesheet" type="text/css" href="/assets/css/npp_c1.css"/>
<link rel="stylesheet" type="text/css" href="/assets/css/fonts/droidserif.css"/>
<link rel="shortcut icon" href="/assets/images/favicon.ico" type="image/x-icon" />
<!--[if lte IE 7]><link rel="stylesheet" type="text/css" href="/assets/css/ie67.css"/><![endif]-->
<script type="text/javascript">
window.___gcfg = {lang: 'en'};
(function()
{var po = document.createElement("script");
po.type = "text/javascript"; po.async = true;po.src = "https://apis.google.com/js/plusone.js";
var s = document.getElementsByTagName("script")[0];
s.parentNode.insertBefore(po, s);
})();</script>
<script type="text/javascript" src="https://code.jquery.com/jquery-1.5.min.js"></script>
<script type="text/javascript" src="/assets/js/npp_c1.js"></script>
<script type="text/javascript" src="https://apis.google.com/js/plusone.js"></script>
<base href="/" />
</head>
<body>
<bla bla>
<guy title="Test">
</abcde>
<div id="wrapper">
<div id="content">
<p id="skip"><a href="#content" title="Skip to main content">Skip to main content</a></p>
<div id="left">
<ul id="translate">
<li class="english"><a href="/en/">English</a></li>
<li class="french"><a href="/fr/">French</a></li>
<li class="chinese"><a href="/zh/">Chinese</a></li>
</ul><p id="transmore"><a href="choose-your-language.html" title="choose from more languages">more languages</a></p>
<h1><a href="/">Notepad++</a></h1>
<ul id="nav"><li class="first active"><a href="https://notepad-plus-plus.org/" title="Home" >Home</a></li>
<li><a href="download/" title="Download" >Download</a></li>
<li><a href="news/" title="News" >News</a></li>
<li><a href="features/" title="Features" >Features</a></li>
<li><a href="resources.html" title="Resources" >Resources</a></li>
<li><a href="contribute/" title="Contribute" >Contribute</a></li>
<li><a href="donate/" title="Donate" >Donate</a></li>
<li><a href="/community" title="Community" >Community</a></li>
<li><a href="contributors/" title="Contributors" >Contributors</a></li>
<li class="last"><a href="links.html" title="Links" >Links</a></li>
</ul>
<p id="download"><a href="download/v7.5.html">Download</a><br>Current Version: <span>7.5</span></p>
<style>
#carbonads {
display: block;
//overflow: hidden;
margin-top: 3em;
padding: 2em;
//border-top: solid 1px #cd8e2f;
//border-bottom: solid 1px #a67326;
//background-color: hsla(204, 15%, 19%, .6);
font-size: 11px;
font-family: Verdana, "Helvetica Neue", Helvetica, sans-serif;
line-height: 1.5;
}
#carbonads span {
display: block;
overflow: hidden;
}
.carbon-text {
display: block;
margin-bottom: 1em;
text-align: left;
//width:240px;
}
.carbon-img {
display: block;
margin: 30px auto 1em;
text-align: center;
}
.carbon-poweredby {
display: block;
text-align: right;
font-size: 10px;
}
</style>
<script async type="text/javascript" src="//cdn.carbonads.com/carbon.js?zoneid=1673&serve=C6AILKT&placement=notepadplusplusorg" id="_carbonads_js"></script>
</div>
<div id="midcol">
<h2>News</h2>
<ol id="news">
<li class="first"><a href="news/notepad-7.5-released.html">Notepad++ 7.5 released</a> Aug 16 2017</li>
<li><a href="news/notepad-7.4.2-released.html">Notepad++ 7.4.2 released</a> Jun 18 2017</li>
<li><a href="news/back-to-v7.3.3.html">Back to v7.3.3</a> Jun 08 2017</li>
<li><a href="news/notepad-7.4.1-released.html">Notepad++ 7.4.1 released</a> May 18 2017</li>
<li><a href="news/notepad-7.4-released.html">Notepad++ 7.4 released</a> May 14 2017</li>
<li><a href="news/notepad-7.3.3-fix-cia-hacking-issue.html">Fix CIA Hacking Issue</a> Mar 08 2017</li>
<li><a href="news/notepad-7.3.2-released.html">Notepad++ 7.3.2 released</a> Feb 13 2017</li>
<li><a href="news/notepad-7.3.1-released.html">Notepad++ 7.3.1 released</a> Jan 17 2017</li>
<li><a href="news/notepad-7.3-released.html">Notepad++ 7.3 released</a> Jan 01 2017</li>
<li><a href="news/notepad-7.2.2-released.html">Notepad++ 7.2.2 released</a> Nov 27 2016</li>
</ol>
<p id="morenews"><a href="news/">More news »</a></p>
</div>
<div id="main">
<h2>About</h2>
<p>Notepad++ is a free (as in "free speech" and also as in "free beer") source code editor and Notepad replacement that supports several languages. Running in the MS Windows environment, its use is governed by <a href="http://www.gnu.org/copyleft/gpl.html" target="_blank">GPL</a> License.</p>
<p>Based on the powerful editing component <a href="http://www.scintilla.org/" target="_blank">Scintilla</a>, <span>Notepad++</span> is written in C++ and uses pure Win32 API and STL which ensures a higher execution speed and smaller program size. By optimizing as many routines as possible without losing user friendliness, <span>Notepad++</span> is trying to reduce the world carbon dioxide emissions. When using less CPU power, the PC can throttle down and reduce power consumption, resulting in a greener environment.<br /> </p>
<p><img title="Screenshot" src="/assets/images/notepad4ever.gif" alt="Screenshot" /></p>
<p>You're encouraged to <a href="/contribute/binary-translation-howto.html">translate Notepad++</a> into your native language if there's not already a translation present in the <a href="/contribute/binary-translations.html">Binary Translations page</a>.</p>
<p><span>I hope you enjoy Notepad++ as much as I enjoy coding it.</span></p>
</div>
</div>
<div id="footer">
<!-- start ecreate box -->
<div id="ecCredit">
<div id="ecBG"></div><div id="ecLinkBG"></div><div id="ecSeeOurWork"></div>
<div id="ecBox">
<p><a href="http://www.ecreate.com.au">Ecreate is a Perth based Web and graphic design agency.</a></p>
<ul>
<li id="ecURL">»» <a href="http://www.ecreate.com.au" target="_blank">www.ecreate.com.au</a></li>
<li id="ecTwitter"><a href="http://www.twitter.com/ecreate" target="_blank">Follow Ecreate on Twitter</a></li>
<li id="ecFacebook"><a href="http://www.facebook.com/ecreate.com.au" target="_blank">Like us on Facebook</a></li>
</ul>
</div>
<p id="ecLink"><a href="http://www.ecreate.com.au" target="_blank">Website kindly donated by <span>Ecreate</span></a></p>
</div>
<!-- end ecreate box -->
<p id="share">
<a href="https://plus.google.com/+notepad-plus-plus/" rel="publisher" class="gplus" target="_blank">Notepad++ on Google+</a>
<!-- a href="http://www.facebook.com/Notepad.plus.plus" target="_blank">Like Notepad++ on Facebook</a -->
<a href="http://twitter.com/notepad_plus" class="twitter" target="_blank">Follow Notepad++ on Twitter</a>
<a href="feed.rss" class="rss">RSS News Feed</a>
</p>
<div id="plusone">
<!-- Place this tag where you want the +1 button to render. -->
<div class="g-plusone"></div>
<!-- Place this tag after the last +1 button tag. -->
<script type="text/javascript">
(function() {
var po = document.createElement('script'); po.type = 'text/javascript'; po.async = true;
po.src = 'https://apis.google.com/js/plusone.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(po, s);
})();
</script>
</div>
<p id="copy">Copyright © Don Ho 2016</p>
<p id="validate">
<a href="http://validator.w3.org/check?uri=referer">HTML</a> • <a href="http://jigsaw.w3.org/css-validator/check/referer">CSS</a></p>
<!-- Google Analytics Begin -->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-47715314-1', 'notepad-plus-plus.org');
ga('send', 'pageview');
</script>
<!-- Google Analytics End -->
</div>
</div>
</body>
</html>
After the first regex S/R has been executed, the DOCTYPE declaration and the eight comments are removed
After execution of the second regex S/R, you get the text below :
html
head
meta
title
title
meta
meta
link
link
link
link
script
script
script
script
script
script
script
script
base
head
body
bla
guy
abcde
div
div
p
a
a
p
div
ul
li
a
a
li
li
a
a
li
li
a
a
li
ul
p
a
a
p
h1
a
a
h1
ul
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
ul
p
a
a
br
span
span
p
style
style
script
script
div
div
h2
h2
ol
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
ol
p
a
a
p
div
div
h2
h2
p
a
a
p
p
a
a
span
span
span
span
br
p
p
img
p
p
a
a
a
a
p
p
span
span
p
div
div
div
div
div
div
div
div
div
div
div
p
a
a
p
ul
li
a
a
li
li
a
a
li
li
a
a
li
ul
div
p
a
span
span
a
p
div
p
a
a
a
a
a
a
p
div
div
div
script
script
div
p
p
p
a
a
a
a
p
script
script
div
div
body
html
And, after running the third regex S/R, you should obtain the very short list :
base
bla
guy
abcde
Obviously, as the <base> tag is a true HTML tag, this implies that this particular code contains only three non-HTML tags, written in black foreground colour !
NON-HTML tags :
bla
guy
abcde
Best Regards,
guy038
P.S. :
You may choose, for the third regex, a shortened list. For instance :
(?-i)^(a|body|br|b|div|font|form|h[1-6]|head|hr|html|img|input|i|li|ol|p|script|span|style|table|td|th|title|tr|ul|u)\R
Of course, the resulting list will be longer but it shouldn’t be very difficult to sort the non-HTML tags out !
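A word about the order of terms in a list of alternatives. It matters when nothing after the group forces the regex engine to backtrack : against the tag <body>, the regex <(b|body|br) stops at <b, whereas <(body|br|b) matches <body. When the group is followed by > or \R, as in <(b|body|br)> or in the third regex above, the engine backtracks to the next alternative and any order gives the same matches. Listing the longer names first, as in <(body|br|b)>, is simply a safe habit !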
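To see how the order of alternatives behaves on a concrete case, here is a small check in Python. Python's re module is a backtracking engine that tries alternatives from left to right; I assume the Boost engine of N++ behaves the same way here, which is easy to verify with a quick search in N++ itself.

```python
import re

text = "<body>"

# Nothing follows the group: the first alternative that fits is kept.
short_first = re.match(r"<(b|body|br)", text).group(1)
long_first = re.match(r"<(body|br|b)", text).group(1)

# A ">" follows the group: once "b" fails to reach ">", the engine
# backtracks and tries the next alternatives.
delimited = re.match(r"<(b|body|br)>", text).group(1)

print(short_first, long_first, delimited)  # b body body
```

So, at least in this engine, the order only changes the result when the group is not followed by a delimiter such as > or a line break.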
I’ll give detailed explanations of these 3 regexes, very soon ;-))