How to search for inactive HTML tags?

Andrew Anderson

This question is a bit complicated, so here we go.

You know how, when you have something in Notepad++ designated as an HTML document, all the tags (i.e. stuff between < and >) are shown in different colors? But anything that is between < and > but is not an HTML tag has no color; those characters are just a regular black.

My question is this: Is there a way to search for all the black < and > characters in an HTML document? With thousands of colored HTML tags, I really don’t have the time to look through every single one to see if there are any black non-tags I missed. (They also don’t show up at all when you look at the document in the form of a web page–so you can’t search for them that way.)

Thanks in advance!

PeterJones

In this example, which of the following would you expect to find?

just the brackets for spaced
the brackets for spaced and leftspace
the brackets for spaced and leftspace and rightspace
the brackets for everything but html

And what about ?

Also, it appears that Scintilla’s underlying HTML parser gets it wrong:

This stackoverflow post points out that at least one HTML spec (xhtml) requires no space between the < and the name of the tag, so < leftspace> should not be interpreted as a tag, known or otherwise.

As a first level, the FIND Regex: <(?![\w/!]) will use a negative lookahead to find any < not followed by a word character (alphanumeric or underscore), /, or !. So that could find the bad opening <.
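The lookahead can be tried outside Notepad++ as well. This is a minimal sketch, assuming Python's standard `re` module as a stand-in for the Notepad++ search engine; the name `stray_open_offsets` is made up for illustration.

```python
import re

# Same idea as the FIND regex above: a "<" that is NOT followed by
# a word character, "/" or "!".
STRAY_OPEN = re.compile(r"<(?![\w/!])")

def stray_open_offsets(text: str) -> list[int]:
    """Return the offset of every '<' that cannot start a tag or comment."""
    return [m.start() for m in STRAY_OPEN.finditer(text)]

sample = "<p>a < b</p> <!-- c --> < leftspace>"
offsets = stray_open_offsets(sample)
```

For this sample, only the `<` in `a < b` and the one before ` leftspace` are reported; `<p>`, `</p>` and the comment are skipped.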

I had hoped that (?<!\<[\w/!][^>]+)> would find a closing > that was not preceded by a token which matches a valid start (<alnum, </, or <!), followed by one or more non-> characters… but it claims “Find: Invalid Regular Expression”, so I must have something dodgy there. Checking documentation: at least in Perl, lookbehinds have to be fixed width, and I think the Boost regex library that NPP uses is similar to the Perl engine… I think that would explain it.
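The fixed-width restriction is easy to reproduce. This is a minimal sketch, assuming Python's standard `re` module, which is not the Boost engine but also rejects variable-width lookbehind; the helper name `compiles` is made up for illustration.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# The lookbehind from the post: "[^>]+" can have any length, so it is rejected.
VARIABLE_WIDTH = r"(?<!<[\w/!][^>]+)>"

# A lookbehind of exactly two characters is accepted, but it only catches
# a ">" placed right after "<x", which is far weaker than what was wanted.
FIXED_WIDTH = r"(?<!<[\w/!])>"

results = {p: compiles(p) for p in (VARIABLE_WIDTH, FIXED_WIDTH)}
```

Here `results` maps the variable-width pattern to `False` and the fixed-width one to `True`.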

That said, I don’t know what I’d try next to find an end-> that wasn’t preceded by a valid start-of-tag: logically, what I am looking for is NOT( "<[alnum][non>]*" OR "</[non>]*" OR "<!--[non>]*--") FOLLOWED BY ">", but without variable-width lookbehind, I don’t know how. @guy038 (our regular expression guru), do you have any ideas?
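One possible way around the missing variable-width lookbehind is to match every well-formed tag or comment first and capture only the > that is left over. This is a minimal sketch of that idea, assuming Python's `re` module rather than Notepad++; the pattern and the name `stray_close_offsets` are illustrative and do not come from the thread. It also assumes that no attribute value contains a literal >.

```python
import re

# Alternatives are tried left to right at each position:
#   1. a whole comment            <!-- ... -->
#   2. a whole tag-like token     <name ...>, </name>, <!DOCTYPE ...>
#   3. any ">" still unconsumed   (captured in group 1)
TOKEN = re.compile(r"<!--.*?-->|<[\w/!][^>]*>|(>)", re.DOTALL)

def stray_close_offsets(text: str) -> list[int]:
    """Return the offset of every '>' that does not close a tag or comment."""
    return [m.start(1) for m in TOKEN.finditer(text) if m.group(1) is not None]

sample = "x <p>ok</p> 1 > 0 < left> <!-- a > b -->"
offsets = stray_close_offsets(sample)
```

In this sample, the > in `1 > 0` and the one after `< left` are reported, while the > inside the comment is swallowed by the first alternative.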

guy038

Hello, @andrew-anderson and All,

After two days, working from time to time, I found out a way to isolate non-HTML tags, easily enough, with three consecutive regex S/R ;-))

To begin with, I got, from that site, the list of all regular HTML tags :

https://www.w3schools.com/TAGs/

Secondly, from some HTML documents, I restricted that list to the most common HTML tags used ( to my mind ! ), below, by alphabetic order :

<!--.......-->
<!DOCTYPE ...>

<a>
<b>
<body>
<br>
<dd>
<div>
<dl>
<dt>
<font>
<form>
<h1>
<h2>
<h3>
<h4>
<h5>
<h6>
<head>
<hr>
<html>
<i>
<iframe>
<img>
<input>
<li>
<link>
<meta>
<ol>
<option>
<p>
<script>
<select>
<source>
<span>
<style>
<table>
<tbody>
<td>
<th>
<title>
<tr>
<u>
<ul>

Finally, I built three regex S/R which, run one after the other, reduce any HTML document to a short list of tag names, none of which belongs to the list above, and from which it should be easy to pick out the remaining non-HTML tags !

Well, let’s go !

Copy the HTML document to analyse, in a new tab ( IMPORTANT )
Select that new tab ( You do not, even, need to change the language to HTML ! )
With the first S/R, we get rid of all comments and of the DOCTYPE declaration

SEARCH (?s)<!(--.+?--|DOCTYPE.+?)>

REPLACE Leave EMPTY
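For readers who want to try this step outside Notepad++, here is a minimal sketch assuming Python's `re` module; the function name `strip_comments_and_doctype` is made up for illustration. The pattern is the SEARCH regex above, with the `(?s)` modifier expressed as `re.DOTALL`.

```python
import re

# "<!" followed either by a comment body ("--...--") or by "DOCTYPE...",
# then the closing ">". DOTALL lets "." cross line breaks, as (?s) does.
COMMENT_OR_DOCTYPE = re.compile(r"<!(--.+?--|DOCTYPE.+?)>", re.DOTALL)

def strip_comments_and_doctype(html: str) -> str:
    """Delete every comment and the DOCTYPE declaration."""
    return COMMENT_OR_DOCTYPE.sub("", html)

sample = (
    "<!DOCTYPE html>\n"
    "<p>x</p><!-- a\nb -->"
    "<!--[if lte IE 7]><link/><![endif]-->"
)
cleaned = strip_comments_and_doctype(sample)
```

The conditional comment is removed as a whole, including the `<link/>` it wraps, because the lazy `.+?` only stops at a `--` that is immediately followed by `>`.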

With the second S/R, we keep ONLY the tag names found in one of the three forms <tag> , </tag> and <tag followed by a space character, and rewrite them one per line

SEARCH (?s).*?<(?|(\w+) |/?(\w+)?>)|.*\z

REPLACE ?1\1\r\n
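The same extraction can be sketched in Python, with some changes: Python's `re` module has no branch reset group `(?|...)` and no conditional replacement, so this version uses two ordinary groups and collects the names into a list instead of rewriting the text. The name `tag_names` is made up for illustration.

```python
import re

# Either "<name " (a tag with attributes) or "<name>", "</name>",
# "<>" or "</>". The last two carry no name and are simply dropped.
TAG_NAME = re.compile(r"<(?:(\w+) |/?(\w+)?>)")

def tag_names(html: str) -> list[str]:
    """Return the tag names in document order, one entry per tag found."""
    names = []
    for m in TAG_NAME.finditer(html):
        name = m.group(1) or m.group(2)
        if name:
            names.append(name)
    return names

sample = '<body>\n <bla bla>\n<guy title="Test">\n </abcde>\n<div id="w"><br /></div>'
names = tag_names(sample)
```

Joining the result with `"\r\n"` would give the same one-name-per-line listing that the S/R produces.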

With the third and last S/R, we, simply, delete any HTML tag, belonging to the common HTML list, given above

SEARCH (?-i)^(a|body|br|b|dd|div|dl|dt|font|form|h[1-6]|head|hr|html|iframe|img|input|i|link|li|meta|ol|option|p|script|select|source|span|style|table|tbody|td|th|title|tr|ul|u)\R

REPLACE Leave EMPTY
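This last step can also be sketched in Python, under two assumptions: `\R` is not available in Python's `re` module, so `\r?\n` is used instead, and `(?-i)` is dropped because matching is case-sensitive by default there. The function name `drop_common` is made up for illustration.

```python
import re

# The alternation is copied from the SEARCH regex above.
COMMON_TAG_LINE = re.compile(
    r"^(?:a|body|br|b|dd|div|dl|dt|font|form|h[1-6]|head|hr|html|iframe|img"
    r"|input|i|link|li|meta|ol|option|p|script|select|source|span|style"
    r"|table|tbody|td|th|title|tr|ul|u)\r?\n",
    re.MULTILINE,
)

def drop_common(listing: str) -> str:
    """Delete every line that consists only of a common HTML tag name."""
    return COMMON_TAG_LINE.sub("", listing)

listing = "html\r\nbase\r\nbla\r\nguy\r\nabcde\r\nbody\r\n"
remaining = drop_common(listing)
```

Only whole lines are deleted: `base` and `bla` both start with `b`, yet they survive because the line break must follow the name immediately.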

Now, let’s put into practice these regexes !

Here is, below, the main page source code of N++ site. I just added three non-regular tags to that code, right after the <body> tag :

<!DOCTYPE html> 
<html lang="en" class="home midcol">
<head>
<meta charset="utf-8" />
<title>Notepad++ Home</title>
<meta name="description" content="Notepad++: a free source code editor which supports several programming languages running under the MS Windows environment."/>
<meta name="keywords" content="Notepad++, telechargement, gratuit, free source code editor, remplacant de Notepad++, Notepad2, netpad, open source, web editor, html editor, xml editor, php editor, asp editor, javascript editor, java editor, c++ editor, c# editor, objective-c editor, NFO editor, VB editor, CSS, SQL, Pascal, Perl, Python, Lua, Regular Expression Search"/>

<link rel="alternate" type="application/rss+xml" title="Follow Notepad++ with RSS" href="/feed.rss"/>
<link rel="stylesheet" type="text/css" href="/assets/css/npp_c1.css"/>
<link rel="stylesheet" type="text/css" href="/assets/css/fonts/droidserif.css"/>
<link rel="shortcut icon" href="/assets/images/favicon.ico" type="image/x-icon" />
<!--[if lte IE 7]><link rel="stylesheet" type="text/css" href="/assets/css/ie67.css"/><![endif]-->



<script type="text/javascript">
window.___gcfg = {lang: 'en'};
(function()
{var po = document.createElement("script");
po.type = "text/javascript"; po.async = true;po.src = "https://apis.google.com/js/plusone.js";
var s = document.getElementsByTagName("script")[0];
s.parentNode.insertBefore(po, s);
})();</script>

<script type="text/javascript" src="https://code.jquery.com/jquery-1.5.min.js"></script>
<script type="text/javascript" src="/assets/js/npp_c1.js"></script>
<script type="text/javascript" src="https://apis.google.com/js/plusone.js"></script>
<base href="/" />
</head>
<body>
    <bla bla>
	<guy title="Test">
    </abcde>	
	<div id="wrapper">
		<div id="content">
			<p id="skip"><a href="#content" title="Skip to main content">Skip to main content</a></p> 
			<div id="left">
				<ul id="translate">
<li class="english"><a href="/en/">English</a></li>
<li class="french"><a href="/fr/">French</a></li>
<li class="chinese"><a href="/zh/">Chinese</a></li>
</ul><p id="transmore"><a href="choose-your-language.html" title="choose from more languages">more languages</a></p>

				<h1><a href="/">Notepad++</a></h1>
				
				<ul id="nav"><li class="first active"><a href="https://notepad-plus-plus.org/" title="Home" >Home</a></li>
<li><a href="download/" title="Download" >Download</a></li>
<li><a href="news/" title="News" >News</a></li>
<li><a href="features/" title="Features" >Features</a></li>
<li><a href="resources.html" title="Resources" >Resources</a></li>
<li><a href="contribute/" title="Contribute" >Contribute</a></li>
<li><a href="donate/" title="Donate" >Donate</a></li>
<li><a href="/community" title="Community" >Community</a></li>
<li><a href="contributors/" title="Contributors" >Contributors</a></li>
<li class="last"><a href="links.html" title="Links" >Links</a></li>
</ul>
				
				<p id="download"><a href="download/v7.5.html">Download</a><br>Current Version: <span>7.5</span></p>

				<style>
#carbonads {
  display: block;
  //overflow: hidden;
  margin-top: 3em;
  padding: 2em;
  //border-top: solid 1px #cd8e2f;
  //border-bottom: solid 1px #a67326;
  //background-color: hsla(204, 15%, 19%, .6);
  font-size: 11px;
  font-family: Verdana, "Helvetica Neue", Helvetica, sans-serif;
  line-height: 1.5;

}

#carbonads span {
  display: block;
  overflow: hidden;
}

.carbon-text {
  display: block;
  margin-bottom: 1em;
  text-align: left;
  //width:240px;
}

.carbon-img {
  display: block;
  margin: 30px auto 1em;
  text-align: center;
}

.carbon-poweredby {
  display: block;
  text-align: right;
  font-size: 10px;
}
</style>

<script async type="text/javascript" src="//cdn.carbonads.com/carbon.js?zoneid=1673&serve=C6AILKT&placement=notepadplusplusorg" id="_carbonads_js"></script>


			</div>
			
			<div id="midcol">
				<h2>News</h2>
<ol id="news">
<li class="first"><a href="news/notepad-7.5-released.html">Notepad++ 7.5 released</a> Aug 16 2017</li>
<li><a href="news/notepad-7.4.2-released.html">Notepad++ 7.4.2 released</a> Jun 18 2017</li>
<li><a href="news/back-to-v7.3.3.html">Back to v7.3.3</a> Jun 08 2017</li>
<li><a href="news/notepad-7.4.1-released.html">Notepad++ 7.4.1 released</a> May 18 2017</li>
<li><a href="news/notepad-7.4-released.html">Notepad++ 7.4 released</a> May 14 2017</li>
<li><a href="news/notepad-7.3.3-fix-cia-hacking-issue.html">Fix CIA Hacking Issue</a> Mar 08 2017</li>
<li><a href="news/notepad-7.3.2-released.html">Notepad++ 7.3.2 released</a> Feb 13 2017</li>
<li><a href="news/notepad-7.3.1-released.html">Notepad++ 7.3.1 released</a> Jan 17 2017</li>
<li><a href="news/notepad-7.3-released.html">Notepad++ 7.3 released</a> Jan 01 2017</li>
<li><a href="news/notepad-7.2.2-released.html">Notepad++ 7.2.2 released</a> Nov 27 2016</li>
</ol>
<p id="morenews"><a href="news/">More news &raquo;</a></p>
			</div>
			
			<div id="main">

				<h2>About</h2>

				<p>Notepad++ is a free (as in "free speech" and also as in "free beer") source code editor and Notepad replacement that supports several languages. Running in the MS Windows environment, its use is governed by <a href="http://www.gnu.org/copyleft/gpl.html" target="_blank">GPL</a> License.</p>
<p>Based on the powerful editing component <a href="http://www.scintilla.org/" target="_blank">Scintilla</a>, <span>Notepad++</span> is written in C++ and uses pure Win32 API and STL which ensures a higher execution speed and smaller program size. By optimizing as many routines as possible without losing user friendliness, <span>Notepad++</span> is trying to reduce the world carbon dioxide emissions. When using less CPU power, the PC can throttle down and reduce power consumption, resulting in a greener environment.<br /> </p>
<p><img title="Screenshot" src="/assets/images/notepad4ever.gif" alt="Screenshot" /></p>
<p>You're encouraged to <a href="/contribute/binary-translation-howto.html">translate Notepad++</a> into your native language if there's not already a translation present in the <a href="/contribute/binary-translations.html">Binary Translations page</a>.</p>
<p><span>I hope you enjoy Notepad++ as much as I enjoy coding it.</span></p>

			</div>
		</div>

		<div id="footer">
               
		<!-- start ecreate box -->
				<div id="ecCredit">
					<div id="ecBG"></div><div id="ecLinkBG"></div><div id="ecSeeOurWork"></div>
					<div id="ecBox">
						<p><a href="http://www.ecreate.com.au">Ecreate is a Perth based Web and graphic design agency.</a></p>
						<ul>
							<li id="ecURL">&raquo;&raquo; &nbsp; <a href="http://www.ecreate.com.au" target="_blank">www.ecreate.com.au</a></li>
							<li id="ecTwitter"><a href="http://www.twitter.com/ecreate" target="_blank">Follow Ecreate on Twitter</a></li>
							<li id="ecFacebook"><a href="http://www.facebook.com/ecreate.com.au" target="_blank">Like us on Facebook</a></li>
						</ul>
					</div>
					<p id="ecLink"><a href="http://www.ecreate.com.au" target="_blank">Website kindly donated by <span>Ecreate</span></a></p>
				</div>
			<!-- end ecreate box -->

		<p id="share">
			<a href="https://plus.google.com/+notepad-plus-plus/"  rel="publisher" class="gplus"  target="_blank">Notepad++ on Google+</a>
			<!-- a href="http://www.facebook.com/Notepad.plus.plus" target="_blank">Like Notepad++ on Facebook</a -->
			<a href="http://twitter.com/notepad_plus" class="twitter" target="_blank">Follow Notepad++ on Twitter</a>
			<a href="feed.rss" class="rss">RSS News Feed</a>
		</p>

<div id="plusone">
		<!-- Place this tag where you want the +1 button to render. -->
		<div class="g-plusone"></div>

		<!-- Place this tag after the last +1 button tag. -->
		<script type="text/javascript">
		  (function() {
			var po = document.createElement('script'); po.type = 'text/javascript'; po.async = true;
			po.src = 'https://apis.google.com/js/plusone.js';
			var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(po, s);
		  })();
		</script>
		</div>
		

		<p id="copy">Copyright &copy; Don Ho 2016</p>
		<p id="validate">
		<a href="http://validator.w3.org/check?uri=referer">HTML</a> &bull; <a href="http://jigsaw.w3.org/css-validator/check/referer">CSS</a></p>

<!-- Google Analytics Begin -->
<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
  ga('create', 'UA-47715314-1', 'notepad-plus-plus.org');
  ga('send', 'pageview');
</script>
<!-- Google Analytics End -->

</div>

	</div>
</body> 
</html>

After the first regex S/R has been executed, the DOCTYPE declaration and the eight comments are removed
After execution of the second regex S/R, you get the text below :

html
head
meta
title
title
meta
meta
link
link
link
link
script
script
script
script
script
script
script
script
base
head
body
bla
guy
abcde
div
div
p
a
a
p
div
ul
li
a
a
li
li
a
a
li
li
a
a
li
ul
p
a
a
p
h1
a
a
h1
ul
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
ul
p
a
a
br
span
span
p
style
style
script
script
div
div
h2
h2
ol
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
li
a
a
li
ol
p
a
a
p
div
div
h2
h2
p
a
a
p
p
a
a
span
span
span
span
br
p
p
img
p
p
a
a
a
a
p
p
span
span
p
div
div
div
div
div
div
div
div
div
div
div
p
a
a
p
ul
li
a
a
li
li
a
a
li
li
a
a
li
ul
div
p
a
span
span
a
p
div
p
a
a
a
a
a
a
p
div
div
div
script
script
div
p
p
p
a
a
a
a
p
script
script
div
div
body
html

And, after running the third regex S/R, you should obtain the very short list :

base
bla
guy
abcde

Obviously, as <base> is a genuine HTML tag ( it is just not part of my common list ), this implies that this particular code contains only three non-HTML tags, written in black foreground colour !

NON-HTML tags :
bla
guy
abcde

Best Regards,

guy038

P.S. :

You may choose, for the third regex, a shortened list. For instance :

(?-i)^(a|body|br|b|div|font|form|h[1-6]|head|hr|html|img|input|i|li|ol|p|script|span|style|table|td|th|title|tr|ul|u)\R

Of course, the resulting list will be longer but it shouldn’t be very difficult to sort the non-HTML tags out !

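It’s important to watch the order of terms in a list of alternatives, because the regex engine tries them from left to right. When nothing after the group forces a complete match, the longer names must come first: against <body>, the regex <(body|br|b) captures body, whereas <(b|body|br) and <(b|br|body) stop at b. When the group is followed by the closing > ( or by \R, as in the third regex ), the engine backtracks to the next alternative, so the order is less critical there, but writing the longer names first remains the safe habit !!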
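The effect of alternation order can be checked with a few lines of Python. This is only a sketch: Python's `re` module stands in for Notepad++'s Boost engine, on the assumption that both try alternatives from left to right and backtrack when the rest of the pattern fails. The helper name `first_name` is made up for illustration.

```python
import re

def first_name(pattern: str, text: str):
    """Return what group 1 captured for the first match, or None."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

tag = "<body>"

# Nothing follows the group: the first alternative that fits wins.
short_first = first_name(r"<(b|body|br)", tag)    # captures "b"
long_first = first_name(r"<(body|br|b)", tag)     # captures "body"

# A closing ">" after the group forces the engine to backtrack
# past "b" and try "body".
anchored = first_name(r"<(b|body|br)>", tag)      # captures "body"
```

So, at least in this engine, the order changes the result only when the pattern does not pin down what comes after the tag name.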
I’ll give detailed explanations of these 3 regexes, very soon :;))