    How to search for inactive HTML tags?

    • Andrew Anderson

      This question is a bit complicated, so here we go.

      You know how, when a file in Notepad++ is designated as an HTML document, all the tags (i.e. the stuff between < and >) are shown in different colors? Anything that sits between < and > but is not an HTML tag gets no color; those characters are just regular black.

      My question is this: Is there a way to search for all the black < and > characters in an HTML document? With thousands of colored HTML tags, I really don’t have the time to look through every single one to see if there are any black non-tags I missed. (They also don’t show up at all when you look at the document in the form of a web page–so you can’t search for them that way.)

      Thanks in advance!

      • PeterJones

        In this example, which of the following would you expect to find?

        1. just the brackets for spaced
        2. the brackets for spaced and leftspace
        3. the brackets for spaced and leftspace and rightspace
        4. the brackets for everything but html

        And what about <!-- comment -->?

        Also, it appears that Scintilla’s underlying HTML parser gets it wrong:

        This stackoverflow post points out that at least one HTML spec (xhtml) requires no space between the < and the name of the tag, so < leftspace> should not be interpreted as a tag, known or otherwise.

        As a first step, searching in Regular expression mode for <(?![\w/!]) uses a negative lookahead to find any < that is not followed by a word character, /, or !. That finds the bad opening <.
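To try that lookahead outside Notepad++, here is a minimal sketch using Python's re module, which accepts the same pattern. The sample string and the name STRAY_OPEN are made up for illustration; they are not from the original document.

```python
import re

# Same pattern as in the post: a "<" that is NOT followed by a word
# character, "/" or "!".
STRAY_OPEN = re.compile(r"<(?![\w/!])")

# Hypothetical sample: two real tags, one comment, and two stray "<".
sample = "<p>ok</p> < leftspace> <!-- c --> 1 < 2"

# Offsets of every "<" that does not start a tag-like token.
stray_offsets = [m.start() for m in STRAY_OPEN.finditer(sample)]
print(stray_offsets)
```

Only the "<" before " leftspace" and the one in "1 < 2" are reported; the "<" of the tags and of the comment are skipped.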

        I had hoped that (?<!\<[\w/!][^>]+)> would find a closing > that was not preceded by a token which matches a valid start (<alnum, </, or <!), followed by one or more non-> characters… but it claims “Find: Invalid Regular Expression”, so I must have something dodgy there. Checking documentation: at least in Perl, lookbehinds have to be fixed width, and I think the Boost regex library that NPP uses is similar to the Perl engine… I think that would explain it.

        That said, I don’t know what I’d try next to find an end-> that wasn’t preceded by a valid start-of-tag: logically, what I am looking for is NOT( "<[alnum][non>]*" OR "</[non>]*" OR "<!--[non>]*--") FOLLOWED BY ">", but without variable-width lookbehind, I don’t know how. @guy038 (our regular expression guru), do you have any ideas?
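One common way around the missing variable-width lookbehind is to drop the lookbehind altogether: let the regex consume every tag-like span first, and capture a > only when it is reached outside such a span. The sketch below shows the idea in Python (whose re module also rejects variable-width lookbehind). It is not something Notepad++ provides as a single Find expression, and the pattern, function name and sample text are assumptions for illustration only.

```python
import re

# Alternatives are tried left to right at each position:
#   1. a whole comment            <!-- ... -->
#   2. any other tag-like span    <name ...>, </name>, <!DOCTYPE ...>
#   3. a lone ">"                 (captured in group 1)
# A ">" reaches alternative 3 only if it was not consumed by 1 or 2.
TAG_OR_STRAY = re.compile(r"(?s)<!--.*?-->|<[\w/!][^>]*>|(>)")

def stray_closers(text):
    """Return offsets of '>' characters that do not close a tag-like span."""
    return [m.start(1) for m in TAG_OR_STRAY.finditer(text) if m.group(1)]

# Hypothetical sample: the ">" in "a > b" and the one after "spaced"
# are not preceded by a valid start-of-tag.
sample = "<p>a > b</p> < spaced > <br>"
print(stray_closers(sample))
```

This only checks that a span looks like a tag; it does not check whether the tag name is a known HTML element, and a > inside a quoted attribute value would end the span early.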

        • guy038

          Hello, @andrew-anderson and All,

          After two days of working on it from time to time, I found a way to isolate non-HTML tags, easily enough, with three consecutive regex S/R ;-))

          To begin with, I took the list of all regular HTML tags from this site :

          https://www.w3schools.com/TAGs/

          Secondly, after looking at some HTML documents, I restricted that list to the most commonly used HTML tags ( to my mind ! ), listed below in alphabetical order :

          <!--.......-->
          <!DOCTYPE ...>
          
          <a>
          <b>
          <body>
          <br>
          <dd>
          <div>
          <dl>
          <dt>
          <font>
          <form>
          <h1>
          <h2>
          <h3>
          <h4>
          <h5>
          <h6>
          <head>
          <hr>
          <html>
          <i>
          <iframe>
          <img>
          <input>
          <li>
          <link>
          <meta>
          <ol>
          <option>
          <p>
          <script>
          <select>
          <source>
          <span>
          <style>
          <table>
          <tbody>
          <td>
          <th>
          <title>
          <tr>
          <u>
          <ul>
          

          Finally, I built three regex S/R which, run one after the other, reduce any HTML document to a short list of tag names, none of which belongs to the list above. From that short list, it should be easy to point out the remaining non-HTML tags !

          Well, let’s go !

          • Copy the HTML document to analyse into a new tab ( IMPORTANT : the S/R below delete most of the text )

          • Select that new tab ( You do not, even, need to change the language to HTML ! )

          • With the first S/R, we get rid of all comments and of the DOCTYPE declaration

          SEARCH (?s)<!(--.+?--|DOCTYPE.+?)>

          REPLACE Leave EMPTY

          • With the second S/R, we keep ONLY the tag names found in one of the three forms <tag> , </tag> or <tag followed by a space character, and rewrite them one per line

          SEARCH (?s).*?<(?|(\w+) |/?(\w+)?>)|.*\z

          REPLACE ?1\1\r\n

          • With the third and last S/R, we simply delete every tag name that belongs to the common HTML list given above

          SEARCH (?-i)^(a|body|br|b|dd|div|dl|dt|font|form|h[1-6]|head|hr|html|iframe|img|input|i|link|li|meta|ol|option|p|script|select|source|span|style|table|tbody|td|th|title|tr|ul|u)\R

          REPLACE Leave EMPTY
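For readers who prefer a script, the three steps can be sketched in Python. This is only an approximation of the S/R above, under these assumptions: Python's re has no branch reset group (?|...), so step 2 uses two capture groups instead; step 3 is a set lookup rather than a regex; COMMON_TAGS, uncommon_tag_names and the sample text are names invented for this illustration.

```python
import re

# The common-tag list given earlier in the post.
COMMON_TAGS = {
    "a", "b", "body", "br", "dd", "div", "dl", "dt", "font", "form",
    "h1", "h2", "h3", "h4", "h5", "h6", "head", "hr", "html", "i",
    "iframe", "img", "input", "li", "link", "meta", "ol", "option", "p",
    "script", "select", "source", "span", "style", "table", "tbody",
    "td", "th", "title", "tr", "u", "ul",
}

def uncommon_tag_names(html):
    """Return tag names, in document order, that are not in COMMON_TAGS."""
    # Step 1: drop comments and the DOCTYPE declaration (regex of S/R 1).
    stripped = re.sub(r"(?s)<!(--.+?--|DOCTYPE.+?)>", "", html)
    # Step 2: collect the name from "<tag " or from "<tag>" / "</tag>".
    names = [m.group(1) or m.group(2)
             for m in re.finditer(r"<(?:(\w+) |/?(\w+)>)", stripped)]
    # Step 3: keep only names outside the common list (case-sensitive).
    return [n for n in names if n not in COMMON_TAGS]

# Hypothetical, shortened sample with three made-up tags after <body>.
sample = """<!DOCTYPE html>
<html lang="en">
<body>
    <bla bla>
    <guy title="Test">
    </abcde>
    <!-- a comment -->
    <base href="/" />
    <div id="wrapper"><p>text</p></div>
</body>
</html>
"""
print(uncommon_tag_names(sample))
```

On this small sample the remaining names are bla, guy, abcde and base. As with the S/R, a genuine but less common tag such as base still has to be ruled out by eye.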


          Now, let’s put into practice these regexes !

          Below is the source code of the main page of the N++ site. I just added three non-regular tags to it, right after the <body> tag :

          <!DOCTYPE html> 
          <html lang="en" class="home midcol">
          <head>
          <meta charset="utf-8" />
          <title>Notepad++ Home</title>
          <meta name="description" content="Notepad++: a free source code editor which supports several programming languages running under the MS Windows environment."/>
          <meta name="keywords" content="Notepad++, telechargement, gratuit, free source code editor, remplacant de Notepad++, Notepad2, netpad, open source, web editor, html editor, xml editor, php editor, asp editor, javascript editor, java editor, c++ editor, c# editor, objective-c editor, NFO editor, VB editor, CSS, SQL, Pascal, Perl, Python, Lua, Regular Expression Search"/>
          
          <link rel="alternate" type="application/rss+xml" title="Follow Notepad++ with RSS" href="/feed.rss"/>
          <link rel="stylesheet" type="text/css" href="/assets/css/npp_c1.css"/>
          <link rel="stylesheet" type="text/css" href="/assets/css/fonts/droidserif.css"/>
          <link rel="shortcut icon" href="/assets/images/favicon.ico" type="image/x-icon" />
          <!--[if lte IE 7]><link rel="stylesheet" type="text/css" href="/assets/css/ie67.css"/><![endif]-->
          
          
          
          <script type="text/javascript">
          window.___gcfg = {lang: 'en'};
          (function()
          {var po = document.createElement("script");
          po.type = "text/javascript"; po.async = true;po.src = "https://apis.google.com/js/plusone.js";
          var s = document.getElementsByTagName("script")[0];
          s.parentNode.insertBefore(po, s);
          })();</script>
          
          <script type="text/javascript" src="https://code.jquery.com/jquery-1.5.min.js"></script>
          <script type="text/javascript" src="/assets/js/npp_c1.js"></script>
          <script type="text/javascript" src="https://apis.google.com/js/plusone.js"></script>
          <base href="/" />
          </head>
          <body>
              <bla bla>
          	<guy title="Test">
              </abcde>	
          	<div id="wrapper">
          		<div id="content">
          			<p id="skip"><a href="#content" title="Skip to main content">Skip to main content</a></p> 
          			<div id="left">
          				<ul id="translate">
          <li class="english"><a href="/en/">English</a></li>
          <li class="french"><a href="/fr/">French</a></li>
          <li class="chinese"><a href="/zh/">Chinese</a></li>
          </ul><p id="transmore"><a href="choose-your-language.html" title="choose from more languages">more languages</a></p>
          
          				<h1><a href="/">Notepad++</a></h1>
          				
          				<ul id="nav"><li class="first active"><a href="https://notepad-plus-plus.org/" title="Home" >Home</a></li>
          <li><a href="download/" title="Download" >Download</a></li>
          <li><a href="news/" title="News" >News</a></li>
          <li><a href="features/" title="Features" >Features</a></li>
          <li><a href="resources.html" title="Resources" >Resources</a></li>
          <li><a href="contribute/" title="Contribute" >Contribute</a></li>
          <li><a href="donate/" title="Donate" >Donate</a></li>
          <li><a href="/community" title="Community" >Community</a></li>
          <li><a href="contributors/" title="Contributors" >Contributors</a></li>
          <li class="last"><a href="links.html" title="Links" >Links</a></li>
          </ul>
          				
          				<p id="download"><a href="download/v7.5.html">Download</a><br>Current Version: <span>7.5</span></p>
          
          				<style>
          #carbonads {
            display: block;
            //overflow: hidden;
            margin-top: 3em;
            padding: 2em;
            //border-top: solid 1px #cd8e2f;
            //border-bottom: solid 1px #a67326;
            //background-color: hsla(204, 15%, 19%, .6);
            font-size: 11px;
            font-family: Verdana, "Helvetica Neue", Helvetica, sans-serif;
            line-height: 1.5;
          
          }
          
          #carbonads span {
            display: block;
            overflow: hidden;
          }
          
          .carbon-text {
            display: block;
            margin-bottom: 1em;
            text-align: left;
            //width:240px;
          }
          
          .carbon-img {
            display: block;
            margin: 30px auto 1em;
            text-align: center;
          }
          
          .carbon-poweredby {
            display: block;
            text-align: right;
            font-size: 10px;
          }
          </style>
          
          <script async type="text/javascript" src="//cdn.carbonads.com/carbon.js?zoneid=1673&serve=C6AILKT&placement=notepadplusplusorg" id="_carbonads_js"></script>
          
          
          			</div>
          			
          			<div id="midcol">
          				<h2>News</h2>
          <ol id="news">
          <li class="first"><a href="news/notepad-7.5-released.html">Notepad++ 7.5 released</a> Aug 16 2017</li>
          <li><a href="news/notepad-7.4.2-released.html">Notepad++ 7.4.2 released</a> Jun 18 2017</li>
          <li><a href="news/back-to-v7.3.3.html">Back to v7.3.3</a> Jun 08 2017</li>
          <li><a href="news/notepad-7.4.1-released.html">Notepad++ 7.4.1 released</a> May 18 2017</li>
          <li><a href="news/notepad-7.4-released.html">Notepad++ 7.4 released</a> May 14 2017</li>
          <li><a href="news/notepad-7.3.3-fix-cia-hacking-issue.html">Fix CIA Hacking Issue</a> Mar 08 2017</li>
          <li><a href="news/notepad-7.3.2-released.html">Notepad++ 7.3.2 released</a> Feb 13 2017</li>
          <li><a href="news/notepad-7.3.1-released.html">Notepad++ 7.3.1 released</a> Jan 17 2017</li>
          <li><a href="news/notepad-7.3-released.html">Notepad++ 7.3 released</a> Jan 01 2017</li>
          <li><a href="news/notepad-7.2.2-released.html">Notepad++ 7.2.2 released</a> Nov 27 2016</li>
          </ol>
          <p id="morenews"><a href="news/">More news &raquo;</a></p>
          			</div>
          			
          			<div id="main">
          
          				<h2>About</h2>
          
          				<p>Notepad++ is a free (as in "free speech" and also as in "free beer") source code editor and Notepad replacement that supports several languages. Running in the MS Windows environment, its use is governed by <a href="http://www.gnu.org/copyleft/gpl.html" target="_blank">GPL</a> License.</p>
          <p>Based on the powerful editing component <a href="http://www.scintilla.org/" target="_blank">Scintilla</a>, <span>Notepad++</span> is written in C++ and uses pure Win32 API and STL which ensures a higher execution speed and smaller program size. By optimizing as many routines as possible without losing user friendliness, <span>Notepad++</span> is trying to reduce the world carbon dioxide emissions. When using less CPU power, the PC can throttle down and reduce power consumption, resulting in a greener environment.<br /> </p>
          <p><img title="Screenshot" src="/assets/images/notepad4ever.gif" alt="Screenshot" /></p>
          <p>You're encouraged to <a href="/contribute/binary-translation-howto.html">translate Notepad++</a> into your native language if there's not already a translation present in the <a href="/contribute/binary-translations.html">Binary Translations page</a>.</p>
          <p><span>I hope you enjoy Notepad++ as much as I enjoy coding it.</span></p>
          
          			</div>
          		</div>
          
          		<div id="footer">
                         
          		<!-- start ecreate box -->
          				<div id="ecCredit">
          					<div id="ecBG"></div><div id="ecLinkBG"></div><div id="ecSeeOurWork"></div>
          					<div id="ecBox">
          						<p><a href="http://www.ecreate.com.au">Ecreate is a Perth based Web and graphic design agency.</a></p>
          						<ul>
          							<li id="ecURL">&raquo;&raquo; &nbsp; <a href="http://www.ecreate.com.au" target="_blank">www.ecreate.com.au</a></li>
          							<li id="ecTwitter"><a href="http://www.twitter.com/ecreate" target="_blank">Follow Ecreate on Twitter</a></li>
          							<li id="ecFacebook"><a href="http://www.facebook.com/ecreate.com.au" target="_blank">Like us on Facebook</a></li>
          						</ul>
          					</div>
          					<p id="ecLink"><a href="http://www.ecreate.com.au" target="_blank">Website kindly donated by <span>Ecreate</span></a></p>
          				</div>
          			<!-- end ecreate box -->
          
          		<p id="share">
          			<a href="https://plus.google.com/+notepad-plus-plus/"  rel="publisher" class="gplus"  target="_blank">Notepad++ on Google+</a>
          			<!-- a href="http://www.facebook.com/Notepad.plus.plus" target="_blank">Like Notepad++ on Facebook</a -->
          			<a href="http://twitter.com/notepad_plus" class="twitter" target="_blank">Follow Notepad++ on Twitter</a>
          			<a href="feed.rss" class="rss">RSS News Feed</a>
          		</p>
          
          <div id="plusone">
          		<!-- Place this tag where you want the +1 button to render. -->
          		<div class="g-plusone"></div>
          
          		<!-- Place this tag after the last +1 button tag. -->
          		<script type="text/javascript">
          		  (function() {
          			var po = document.createElement('script'); po.type = 'text/javascript'; po.async = true;
          			po.src = 'https://apis.google.com/js/plusone.js';
          			var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(po, s);
          		  })();
          		</script>
          		</div>
          		
          
          		<p id="copy">Copyright &copy; Don Ho 2016</p>
          		<p id="validate">
          		<a href="http://validator.w3.org/check?uri=referer">HTML</a> &bull; <a href="http://jigsaw.w3.org/css-validator/check/referer">CSS</a></p>
          
          <!-- Google Analytics Begin -->
          <script>
            (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
            (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
            m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
            })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
            ga('create', 'UA-47715314-1', 'notepad-plus-plus.org');
            ga('send', 'pageview');
          </script>
          <!-- Google Analytics End -->
          
          </div>
          
          	</div>
          </body> 
          </html> 
          
          • After the first regex S/R has been executed, the DOCTYPE tag and the eight comment tags are removed

          • After execution of the second regex S/R, you get the text below :

          html
          head
          meta
          title
          title
          meta
          meta
          link
          link
          link
          link
          script
          script
          script
          script
          script
          script
          script
          script
          base
          head
          body
          bla
          guy
          abcde
          div
          div
          p
          a
          a
          p
          div
          ul
          li
          a
          a
          li
          li
          a
          a
          li
          li
          a
          a
          li
          ul
          p
          a
          a
          p
          h1
          a
          a
          h1
          ul
          li
          a
          a
          li
          li
          a
          a
          li
          li
          a
          a
          li
          li
          a
          a
          li
          li
          a
          a
          li
          li
          a
          a
          li
          li
          a
          a
          li
          li
          a
          a
          li
          li
          a
          a
          li
          li
          a
          a
          li
          ul
          p
          a
          a
          br
          span
          span
          p
          style
          style
          script
          script
          div
          div
          h2
          h2
          ol
          li
          a
          a
          li
          li
          a
          a
          li
          li
          a
          a
          li
          li
          a
          a
          li
          li
          a
          a
          li
          li
          a
          a
          li
          li
          a
          a
          li
          li
          a
          a
          li
          li
          a
          a
          li
          li
          a
          a
          li
          ol
          p
          a
          a
          p
          div
          div
          h2
          h2
          p
          a
          a
          p
          p
          a
          a
          span
          span
          span
          span
          br
          p
          p
          img
          p
          p
          a
          a
          a
          a
          p
          p
          span
          span
          p
          div
          div
          div
          div
          div
          div
          div
          div
          div
          div
          div
          p
          a
          a
          p
          ul
          li
          a
          a
          li
          li
          a
          a
          li
          li
          a
          a
          li
          ul
          div
          p
          a
          span
          span
          a
          p
          div
          p
          a
          a
          a
          a
          a
          a
          p
          div
          div
          div
          script
          script
          div
          p
          p
          p
          a
          a
          a
          a
          p
          script
          script
          div
          div
          body
          html
          

          And, after running the third regex S/R, you should obtain the very short list :

          base
          bla
          guy
          abcde
          

          Obviously, as the <base> tag is a true HTML tag, this particular code contains only three non-HTML tags, written in black foreground colour !

          NON-HTML tags :
          bla
          guy
          abcde
          

          Best Regards,

          guy038

          P.S. :

          • You may choose, for the third regex, a shortened list. For instance :

          (?-i)^(a|body|br|b|div|font|form|h[1-6]|head|hr|html|img|input|i|li|ol|p|script|span|style|table|td|th|title|tr|ul|u)\R

          Of course, the resulting list will be longer but it shouldn’t be very difficult to sort the non-HTML tags out !

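          • Mind the order of terms in a list of alternatives : the regex engine tries them from left to right and keeps the first one that lets the overall match succeed. For instance, with nothing after the group, (b|body|br) applied to body matches only the b, whereas (body|br|b) matches the whole name. When the group is followed by a delimiter, as in <(body|br|b)> or in the third regex above ( which ends with \R ), the engine backtracks to the right alternative whatever the order, but listing longer names before their prefixes remains the safe habit !!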

          • I’ll give detailed explanations of these 3 regexes, very soon :;))
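A small check of how the order of alternatives plays out, sketched with Python's re module rather than Notepad++'s Boost engine. The assumption is that both engines try alternatives from left to right and backtrack in the same way for patterns this simple; the variable names are invented for the illustration.

```python
import re

# With nothing after the alternation, the first alternative that
# matches at the current position wins, even if a longer one exists.
short_first = re.match(r"b|body|br", "body").group()
long_first = re.match(r"body|br|b", "body").group()

# With a closing ">" after the group, the attempt with "b" fails at
# the ">" and the engine backtracks to the next alternative.
delimited = re.match(r"<(b|body|br)>", "<body>").group(1)

print(short_first, long_first, delimited)
```

Putting longer names before their prefixes gives the same result in both situations, which is why the lists above are written that way.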
