How to search for inactive HTML tags?

  • Andrew Anderson
    Aug 24, 2017, 2:51 AM

    This question is a bit complicated, so here we go.

    You know how, when a file in Notepad++ is designated as an HTML document, all the tags (i.e. the stuff between < and >) are shown in different colors? But anything that is between < and > and is not an HTML tag has no color: those characters are just regular black.

    My question is this: Is there a way to search for all the black < and > characters in an HTML document? With thousands of colored HTML tags, I really don’t have the time to look through every single one to see if there are any black non-tags I missed. (They also don’t show up at all when you look at the document in the form of a web page–so you can’t search for them that way.)

    Thanks in advance!

    • PeterJones
      Aug 24, 2017, 1:42 PM (last edited Aug 24, 2017, 1:43 PM)

      In this example, which of the following would you expect to find?

      1. just the brackets for spaced
      2. the brackets for spaced and leftspace
      3. the brackets for spaced and leftspace and rightspace
      4. the brackets for everything but html

      And what about <!-- comment -->?

      Also, it appears that Scintilla’s underlying HTML parser gets it wrong:

      This stackoverflow post points out that at least one HTML spec (xhtml) requires no space between the < and the name of the tag, so < leftspace> should not be interpreted as a tag, known or otherwise.

      As a first step, the FIND regex <(?![\w/!]) uses a negative lookahead to find any < not followed by a word character, /, or !. That finds the bad opening <.
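To try the lookahead outside Notepad++, here is a minimal Python sketch. It assumes Python's `re` treats this pattern the same way the Notepad++ engine does, and the helper name `find_stray_open` is made up for illustration.

```python
import re

# Same pattern as the Notepad++ search: "<" not followed by
# a word character, "/" or "!".
STRAY_OPEN = re.compile(r'<(?![\w/!])')

def find_stray_open(text):
    """Return the offsets of every "<" that does not start a tag-like token."""
    return [m.start() for m in STRAY_OPEN.finditer(text)]

if __name__ == "__main__":
    # Hypothetical sample, not taken from the thread.
    print(find_stray_open('a < b <p> <>'))
```

Note that this only covers the opening bracket; a tag such as `<bla bla>` still passes, because `b` is a word character.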

      I had hoped that (?<!\<[\w/!][^>]+)> would find a closing > that was not preceded by a valid tag start (<alnum, </, or <!) followed by one or more non-> characters, but Notepad++ reports “Find: Invalid Regular Expression”. Checking the documentation: in Perl, lookbehinds have to be fixed width, and I think the Boost regex library that NPP uses behaves like the Perl engine here, which would explain the error.

      That said, I don’t know what I’d try next to find an end-> that wasn’t preceded by a valid start-of-tag: logically, what I am looking for is NOT( "<[alnum][non>]*" OR "</[non>]*" OR "<!--[non>]*--") FOLLOWED BY ">", but without variable-width lookbehind, I don’t know how. @guy038 (our regular expression guru), do you have any ideas?
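One possible way around the missing variable-width lookbehind, sketched in Python (whose built-in `re` has the same fixed-width restriction): match the tag-like tokens first so they are consumed, and capture only a `>` that is left over. This is an illustration of the logic described above, not something proposed in the thread, and `find_stray_close` is a hypothetical helper name.

```python
import re

# Three alternatives, tried left to right at each position:
#   1. a whole comment, consumed and ignored
#   2. a tag-like token: "<", then a word character, "/" or "!",
#      then any non-">" characters up to the closing ">"
#   3. a lone ">", captured in group 1
TAG_OR_STRAY = re.compile(r'<!--.*?-->|<[\w/!][^>]*>|(>)', re.S)

def find_stray_close(text):
    """Return offsets of ">" characters that do not close a tag-like token."""
    return [m.start(1) for m in TAG_OR_STRAY.finditer(text)
            if m.group(1) is not None]

if __name__ == "__main__":
    # Hypothetical sample: the ">" after "a" and the one after "< x" are stray.
    print(find_stray_close('<p>a > b</p> < x>'))
```

Because the tag branch stops at the first `>`, an attribute value that itself contains `>` would produce a false positive, so the hits still need a quick manual check.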

      • guy038
        Aug 26, 2017, 2:30 PM (last edited Aug 26, 2017, 2:36 PM)

        Hello, @andrew-anderson and All,

        After two days, working on it from time to time, I found a fairly easy way to isolate non-HTML tags, with three consecutive regex S/R ;-))

        To begin with, I got, from that site, the list of all regular HTML tags :

        https://www.w3schools.com/TAGs/

        Secondly, looking at some HTML documents, I restricted that list to the most commonly used HTML tags ( to my mind ! ), given below in alphabetical order :

        <!--.......-->
        <!DOCTYPE ...>
        
        <a>
        <b>
        <body>
        <br>
        <dd>
        <div>
        <dl>
        <dt>
        <font>
        <form>
        <h1>
        <h2>
        <h3>
        <h4>
        <h5>
        <h6>
        <head>
        <hr>
        <html>
        <i>
        <iframe>
        <img>
        <input>
        <li>
        <link>
        <meta>
        <ol>
        <option>
        <p>
        <script>
        <select>
        <source>
        <span>
        <style>
        <table>
        <tbody>
        <td>
        <th>
        <title>
        <tr>
        <u>
        <ul>
        

        Finally, I built three regex S/R which, run one after the other, reduce any HTML document to a short list of tag names that are not in the list above. From that list it should be easy to pick out the remaining non-HTML tags !

        Well, let’s go !

        • Copy the HTML document to analyse, in a new tab ( IMPORTANT )

        • Select that new tab ( You do not, even, need to change the language to HTML ! )

        • With the first S/R, we get rid of all comments and of the DOCTYPE declaration

        SEARCH (?s)<!(--.+?--|DOCTYPE.+?)>

        REPLACE Leave EMPTY

        • With the second S/R, we keep ONLY the tag name from any of the three forms <tag> , </tag> and <tag followed by a space character, and rewrite those names one per line

        SEARCH (?s).*?<(?|(\w+) |/?(\w+)?>)|.*\z

        REPLACE ?1\1\r\n

        • With the third and last S/R, we simply delete every tag name that belongs to the common HTML list given above

        SEARCH (?-i)^(a|body|br|b|dd|div|dl|dt|font|form|h[1-6]|head|hr|html|iframe|img|input|i|link|li|meta|ol|option|p|script|select|source|span|style|table|tbody|td|th|title|tr|ul|u)\R

        REPLACE Leave EMPTY
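For readers who want to run the same three steps outside Notepad++, they can be approximated in Python as below. This is a sketch under assumptions: Python's `re` has no branch reset group `(?|...)`, so two ordinary groups stand in for group 1, and the function name `unknown_tags` is made up for illustration.

```python
import re

# The common-tag list from the post, as a set (case-sensitive, like the
# (?-i) option in the third search).
COMMON_TAGS = set(
    "a b body br dd div dl dt font form h1 h2 h3 h4 h5 h6 head hr html i "
    "iframe img input li link meta ol option p script select source span "
    "style table tbody td th title tr u ul".split()
)

def unknown_tags(html, known=COMMON_TAGS):
    """Return, in document order, the tag names that are not in `known`."""
    # Step 1: drop comments and the DOCTYPE declaration.
    text = re.sub(r'(?s)<!(--.+?--|DOCTYPE.+?)>', '', html)
    # Step 2: collect the name from "<tag ", "<tag>" or "</tag>".
    # Exactly one of the two groups is non-empty for a named tag.
    names = [a or b for a, b in re.findall(r'<(?:(\w+) |/?(\w+)?>)', text)]
    # Step 3: keep names outside the known list, skipping empty hits like "<>".
    return [n for n in names if n and n not in known]

if __name__ == "__main__":
    # Hypothetical miniature of the test page used in this post.
    sample = ('<!DOCTYPE html><html lang="en"><body><!-- c --><bla bla>'
              '<guy title="T"></abcde><p>x</p><base href="/" /></body></html>')
    print(unknown_tags(sample))
```

Unlike the S/R version, this leaves the original text untouched, so the copy-to-a-new-tab precaution is not needed here.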


        Now, let’s put into practice these regexes !

        Here is, below, the main page source code of N++ site. I just added three non-regular tags to that code, right after the <body> tag :

        <!DOCTYPE html> 
        <html lang="en" class="home midcol">
        <head>
        <meta charset="utf-8" />
        <title>Notepad++ Home</title>
        <meta name="description" content="Notepad++: a free source code editor which supports several programming languages running under the MS Windows environment."/>
        <meta name="keywords" content="Notepad++, telechargement, gratuit, free source code editor, remplacant de Notepad++, Notepad2, netpad, open source, web editor, html editor, xml editor, php editor, asp editor, javascript editor, java editor, c++ editor, c# editor, objective-c editor, NFO editor, VB editor, CSS, SQL, Pascal, Perl, Python, Lua, Regular Expression Search"/>
        
        <link rel="alternate" type="application/rss+xml" title="Follow Notepad++ with RSS" href="/feed.rss"/>
        <link rel="stylesheet" type="text/css" href="/assets/css/npp_c1.css"/>
        <link rel="stylesheet" type="text/css" href="/assets/css/fonts/droidserif.css"/>
        <link rel="shortcut icon" href="/assets/images/favicon.ico" type="image/x-icon" />
        <!--[if lte IE 7]><link rel="stylesheet" type="text/css" href="/assets/css/ie67.css"/><![endif]-->
        
        
        
        <script type="text/javascript">
        window.___gcfg = {lang: 'en'};
        (function()
        {var po = document.createElement("script");
        po.type = "text/javascript"; po.async = true;po.src = "https://apis.google.com/js/plusone.js";
        var s = document.getElementsByTagName("script")[0];
        s.parentNode.insertBefore(po, s);
        })();</script>
        
        <script type="text/javascript" src="https://code.jquery.com/jquery-1.5.min.js"></script>
        <script type="text/javascript" src="/assets/js/npp_c1.js"></script>
        <script type="text/javascript" src="https://apis.google.com/js/plusone.js"></script>
        <base href="/" />
        </head>
        <body>
            <bla bla>
        	<guy title="Test">
            </abcde>	
        	<div id="wrapper">
        		<div id="content">
        			<p id="skip"><a href="#content" title="Skip to main content">Skip to main content</a></p> 
        			<div id="left">
        				<ul id="translate">
        <li class="english"><a href="/en/">English</a></li>
        <li class="french"><a href="/fr/">French</a></li>
        <li class="chinese"><a href="/zh/">Chinese</a></li>
        </ul><p id="transmore"><a href="choose-your-language.html" title="choose from more languages">more languages</a></p>
        
        				<h1><a href="/">Notepad++</a></h1>
        				
        				<ul id="nav"><li class="first active"><a href="https://notepad-plus-plus.org/" title="Home" >Home</a></li>
        <li><a href="download/" title="Download" >Download</a></li>
        <li><a href="news/" title="News" >News</a></li>
        <li><a href="features/" title="Features" >Features</a></li>
        <li><a href="resources.html" title="Resources" >Resources</a></li>
        <li><a href="contribute/" title="Contribute" >Contribute</a></li>
        <li><a href="donate/" title="Donate" >Donate</a></li>
        <li><a href="/community" title="Community" >Community</a></li>
        <li><a href="contributors/" title="Contributors" >Contributors</a></li>
        <li class="last"><a href="links.html" title="Links" >Links</a></li>
        </ul>
        				
        				<p id="download"><a href="download/v7.5.html">Download</a><br>Current Version: <span>7.5</span></p>
        
        				<style>
        #carbonads {
          display: block;
          //overflow: hidden;
          margin-top: 3em;
          padding: 2em;
          //border-top: solid 1px #cd8e2f;
          //border-bottom: solid 1px #a67326;
          //background-color: hsla(204, 15%, 19%, .6);
          font-size: 11px;
          font-family: Verdana, "Helvetica Neue", Helvetica, sans-serif;
          line-height: 1.5;
        
        }
        
        #carbonads span {
          display: block;
          overflow: hidden;
        }
        
        .carbon-text {
          display: block;
          margin-bottom: 1em;
          text-align: left;
          //width:240px;
        }
        
        .carbon-img {
          display: block;
          margin: 30px auto 1em;
          text-align: center;
        }
        
        .carbon-poweredby {
          display: block;
          text-align: right;
          font-size: 10px;
        }
        </style>
        
        <script async type="text/javascript" src="//cdn.carbonads.com/carbon.js?zoneid=1673&serve=C6AILKT&placement=notepadplusplusorg" id="_carbonads_js"></script>
        
        
        			</div>
        			
        			<div id="midcol">
        				<h2>News</h2>
        <ol id="news">
        <li class="first"><a href="news/notepad-7.5-released.html">Notepad++ 7.5 released</a> Aug 16 2017</li>
        <li><a href="news/notepad-7.4.2-released.html">Notepad++ 7.4.2 released</a> Jun 18 2017</li>
        <li><a href="news/back-to-v7.3.3.html">Back to v7.3.3</a> Jun 08 2017</li>
        <li><a href="news/notepad-7.4.1-released.html">Notepad++ 7.4.1 released</a> May 18 2017</li>
        <li><a href="news/notepad-7.4-released.html">Notepad++ 7.4 released</a> May 14 2017</li>
        <li><a href="news/notepad-7.3.3-fix-cia-hacking-issue.html">Fix CIA Hacking Issue</a> Mar 08 2017</li>
        <li><a href="news/notepad-7.3.2-released.html">Notepad++ 7.3.2 released</a> Feb 13 2017</li>
        <li><a href="news/notepad-7.3.1-released.html">Notepad++ 7.3.1 released</a> Jan 17 2017</li>
        <li><a href="news/notepad-7.3-released.html">Notepad++ 7.3 released</a> Jan 01 2017</li>
        <li><a href="news/notepad-7.2.2-released.html">Notepad++ 7.2.2 released</a> Nov 27 2016</li>
        </ol>
        <p id="morenews"><a href="news/">More news &raquo;</a></p>
        			</div>
        			
        			<div id="main">
        
        				<h2>About</h2>
        
        				<p>Notepad++ is a free (as in "free speech" and also as in "free beer") source code editor and Notepad replacement that supports several languages. Running in the MS Windows environment, its use is governed by <a href="http://www.gnu.org/copyleft/gpl.html" target="_blank">GPL</a> License.</p>
        <p>Based on the powerful editing component <a href="http://www.scintilla.org/" target="_blank">Scintilla</a>, <span>Notepad++</span> is written in C++ and uses pure Win32 API and STL which ensures a higher execution speed and smaller program size. By optimizing as many routines as possible without losing user friendliness, <span>Notepad++</span> is trying to reduce the world carbon dioxide emissions. When using less CPU power, the PC can throttle down and reduce power consumption, resulting in a greener environment.<br /> </p>
        <p><img title="Screenshot" src="/assets/images/notepad4ever.gif" alt="Screenshot" /></p>
        <p>You're encouraged to <a href="/contribute/binary-translation-howto.html">translate Notepad++</a> into your native language if there's not already a translation present in the <a href="/contribute/binary-translations.html">Binary Translations page</a>.</p>
        <p><span>I hope you enjoy Notepad++ as much as I enjoy coding it.</span></p>
        
        			</div>
        		</div>
        
        		<div id="footer">
                       
        		<!-- start ecreate box -->
        				<div id="ecCredit">
        					<div id="ecBG"></div><div id="ecLinkBG"></div><div id="ecSeeOurWork"></div>
        					<div id="ecBox">
        						<p><a href="http://www.ecreate.com.au">Ecreate is a Perth based Web and graphic design agency.</a></p>
        						<ul>
        							<li id="ecURL">&raquo;&raquo; &nbsp; <a href="http://www.ecreate.com.au" target="_blank">www.ecreate.com.au</a></li>
        							<li id="ecTwitter"><a href="http://www.twitter.com/ecreate" target="_blank">Follow Ecreate on Twitter</a></li>
        							<li id="ecFacebook"><a href="http://www.facebook.com/ecreate.com.au" target="_blank">Like us on Facebook</a></li>
        						</ul>
        					</div>
        					<p id="ecLink"><a href="http://www.ecreate.com.au" target="_blank">Website kindly donated by <span>Ecreate</span></a></p>
        				</div>
        			<!-- end ecreate box -->
        
        		<p id="share">
        			<a href="https://plus.google.com/+notepad-plus-plus/"  rel="publisher" class="gplus"  target="_blank">Notepad++ on Google+</a>
        			<!-- a href="http://www.facebook.com/Notepad.plus.plus" target="_blank">Like Notepad++ on Facebook</a -->
        			<a href="http://twitter.com/notepad_plus" class="twitter" target="_blank">Follow Notepad++ on Twitter</a>
        			<a href="feed.rss" class="rss">RSS News Feed</a>
        		</p>
        
        <div id="plusone">
        		<!-- Place this tag where you want the +1 button to render. -->
        		<div class="g-plusone"></div>
        
        		<!-- Place this tag after the last +1 button tag. -->
        		<script type="text/javascript">
        		  (function() {
        			var po = document.createElement('script'); po.type = 'text/javascript'; po.async = true;
        			po.src = 'https://apis.google.com/js/plusone.js';
        			var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(po, s);
        		  })();
        		</script>
        		</div>
        		
        
        		<p id="copy">Copyright &copy; Don Ho 2016</p>
        		<p id="validate">
        		<a href="http://validator.w3.org/check?uri=referer">HTML</a> &bull; <a href="http://jigsaw.w3.org/css-validator/check/referer">CSS</a></p>
        
        <!-- Google Analytics Begin -->
        <script>
          (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
          (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
          m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
          })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
          ga('create', 'UA-47715314-1', 'notepad-plus-plus.org');
          ga('send', 'pageview');
        </script>
        <!-- Google Analytics End -->
        
        </div>
        
        	</div>
        </body> 
        </html> 
        
        • After the first regex has been executed, the DOCTYPE declaration and the eight comments are removed

        • After execution of the second regex S/R, you get the text below :

        html
        head
        meta
        title
        title
        meta
        meta
        link
        link
        link
        link
        script
        script
        script
        script
        script
        script
        script
        script
        base
        head
        body
        bla
        guy
        abcde
        div
        div
        p
        a
        a
        p
        div
        ul
        li
        a
        a
        li
        li
        a
        a
        li
        li
        a
        a
        li
        ul
        p
        a
        a
        p
        h1
        a
        a
        h1
        ul
        li
        a
        a
        li
        li
        a
        a
        li
        li
        a
        a
        li
        li
        a
        a
        li
        li
        a
        a
        li
        li
        a
        a
        li
        li
        a
        a
        li
        li
        a
        a
        li
        li
        a
        a
        li
        li
        a
        a
        li
        ul
        p
        a
        a
        br
        span
        span
        p
        style
        style
        script
        script
        div
        div
        h2
        h2
        ol
        li
        a
        a
        li
        li
        a
        a
        li
        li
        a
        a
        li
        li
        a
        a
        li
        li
        a
        a
        li
        li
        a
        a
        li
        li
        a
        a
        li
        li
        a
        a
        li
        li
        a
        a
        li
        li
        a
        a
        li
        ol
        p
        a
        a
        p
        div
        div
        h2
        h2
        p
        a
        a
        p
        p
        a
        a
        span
        span
        span
        span
        br
        p
        p
        img
        p
        p
        a
        a
        a
        a
        p
        p
        span
        span
        p
        div
        div
        div
        div
        div
        div
        div
        div
        div
        div
        div
        p
        a
        a
        p
        ul
        li
        a
        a
        li
        li
        a
        a
        li
        li
        a
        a
        li
        ul
        div
        p
        a
        span
        span
        a
        p
        div
        p
        a
        a
        a
        a
        a
        a
        p
        div
        div
        div
        script
        script
        div
        p
        p
        p
        a
        a
        a
        a
        p
        script
        script
        div
        div
        body
        html
        

        And, after running the third regex S/R, you should obtain the very short list :

        base
        bla
        guy
        abcde
        

        Obviously, as the <base> tag is a true HTML tag (it is just missing from the common list above), this particular code contains only three non-HTML tags, the ones shown in black foreground colour !

        NON-HTML tags :
        bla
        guy
        abcde
        

        Best Regards,

        guy038

        P.S. :

        • You may choose, for the third regex, a shortened list. For instance :

        (?-i)^(a|body|br|b|div|font|form|h[1-6]|head|hr|html|img|input|i|li|ol|p|script|span|style|table|td|th|title|tr|ul|u)\R

        Of course, the resulting list will be longer but it shouldn’t be very difficult to sort the non-HTML tags out !

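        • Take care with the order of terms in a list of alternatives, because the engine tries them from left to right. When nothing follows the group, (b|body|br) stops at b even in front of body, so put the longer names first, as in <(body|br|b)> rather than <(b|body|br)> or <(b|br|body)>. With a closing > or \R right after the group, the engine can backtrack to the longer alternative, but longest-first ordering remains the safe habit !!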

        • I’ll give detailed explanations of these 3 regexes, very soon :;))
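The point about the order of alternatives can be checked quickly in Python, assuming its `re` module follows the same left-to-right rule as the Boost engine in Notepad++. The variable names are only for illustration.

```python
import re

# Nothing follows the group: the first alternative that matches wins,
# so "b" is taken even though the text is "body".
unanchored = re.match(r'(b|body|br)', 'body').group(1)

# Longer names first: the whole word is matched.
longest_first = re.match(r'(body|br|b)', 'body').group(1)

# A ">" after the group forces the engine to backtrack past "b"
# and try "body", so the order no longer changes the result.
anchored = re.match(r'<(b|body|br)>', '<body>').group(1)

print(unanchored, longest_first, anchored)  # b body body
```

This suggests that the third S/R, which ends with \R, tolerates either order, while a search with nothing after the group does not.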
