    Regex - Finding Invalid Characters in HTML Tags

    • Sylvester Bullitt

      I’m trying to write a regular expression to find invalid characters (e.g., soft hyphens, Unicode U+00AD) inside HTML tags like <title>, <figcaption>, <h1>, <h2>, etc.

      I’ve run into an issue that I’m not sure there’s a solution for, but I thought I’d run it by the community to make sure.

      MY REGEX:

      <((title|figcaption|h\d)).*?[­].*?</\1
      

      THE TEXT WHERE IT INCORRECTLY FINDS A MATCH:

      <h1 class="screen-reader-only">Introduction</h1>
      <div class="preface-text">
      <p><span>Born:</span> No­vem­ber 8, 1877, Mo­berg, Den­mark.</p>
      <p><span>Died:</span> June 22, 1970, <span class="map" onclick='show("Minneapolis,MN")'>Min­ne­ap­olis</span>, Min­ne­so­ta.</p>
      <p><span>Buried:</span> <span class="map" onclick="show('45.01030,-93.21580')">Sun­set Me­mo­ri­al Park</span>, Min­ne­ap­olis, Min­ne­so­ta.</p>
      </div>
      <figure><img alt="portrait" class="notable" src="../../../../../img/a/a/b/e/aaberg_jc.jpg" height="204" width="153"></figure>
      </section>
      
      <section id="biography">
      <h1>Biography</h1>
      

      The regex finds a soft hyphen between an <h1> and an </h1> tag (in the word Minneapolis, for example), but the opening <h1> and the closing </h1> belong to two separate elements. I only want a match if the invalid character is inside a single element.

      I’ve tried non-greedy operators, but can’t find a way to prevent the bad match.

      Can anyone see how to make this work?

      • Terry R @Sylvester Bullitt

        @Sylvester-Bullitt said in Regex - Finding Invalid Characters in HTML Tags:

        Can anyone see how to make this work?

        Yes, and it’s in the FAQ posts here.

        Your regex keeps searching until it finds what you asked for. With no limit on its movement, it will find it, even if, as you say, it has gone past the boundary you thought you had set.

        Have a read of that FAQ post and make the necessary changes. I’m sure it will work as you want afterwards.

        Terry
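To illustrate the point about the missing limit, here is a minimal sketch in Python's `re` module. It is not Notepad++'s Boost engine, so details may differ, and it assumes the "dot matches newline" behaviour that the reported match implies. The shortened sample text and the variable names are made up for the demonstration. A lazy `.*?` still expands as far as it must, so the first pattern runs from one heading into the next; the second pattern adds a negative look-ahead that refuses to step over the closing tag.

```python
import re

SHY = "\u00ad"  # soft hyphen, U+00AD

# Shortened stand-in for the text in the question (hypothetical sample).
html = (
    '<h1 class="screen-reader-only">Introduction</h1>\n'
    f"<p>Min{SHY}ne{SHY}ap{SHY}olis</p>\n"
    "<h1>Biography</h1>\n"
)

# The pattern from the question, with the soft hyphen written as \xad.
loose = re.compile(r"<((title|figcaption|h\d)).*?[\xad].*?</\1", re.DOTALL)

# Before consuming each character, check that the closing tag does not start here.
bounded = re.compile(
    r"<(title|figcaption|h\d)\b[^>]*>(?:(?!</\1>).)*?\xad", re.DOTALL
)

loose_match = loose.search(html)      # spans both headings
bounded_match = bounded.search(html)  # no heading here contains a soft hyphen
heading_match = bounded.search(f"<h2>Second Head{SHY}ing</h2>")
```

In this sketch `loose_match` starts at the first `<h1` and ends at the `</h1` of the "Biography" heading, while `bounded_match` is `None` because neither heading contains a soft hyphen.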

        • Sylvester Bullitt @Terry R

          @Terry-R Partial success, so far.

          I got the single-line search to work, but ran into problems searching for a repeated text zone.

          Here’s the failing regex. It’s trying to find soft hyphens in a <title>, <figcaption>, or heading (<h1>, <h2>, etc.):

          (?-si:<(title|figcaption|h\d)>|(?!\A)\G)(?s-i:(?!<\1>).)*?\K(?-si:­)
          

          Here’s the text to search. The regex finds the soft hyphens in the specified tags, but also finds soft hyphens outside those tags (e.g., in the <p class="citation"> paragraph). I need the regex to ignore any soft hyphens not in the specified tags.

          <!DOCTYPE HTML>
          <html lang="en-us">
          
          <head>
          <meta charset="utf-8">
          <title>Regex Te­st</title>
          </head>
          
          <body>
          <h1>George Portrait</h1>
          <figure><img alt="portrait" src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b6/Gilbert_Stuart_Williamstown_Portrait_of_George_Washington.jpg/800px-Gilbert_Stuart_Williamstown_Portrait_of_George_Washington.jpg" width="200" height="243"><figcaption>George Washington<br>1732–1799</figcaption></figure>
          
          <p class="citation"><span class="author">Anderson, Ro­bert</span> &amp; <span class="author">Gail North</span>. <cite class="book">Mu­sic En­cy­clo­ped­ia</cite>. New York: Ster­ling Pub­lish­ing, 1979.</p>
          
          <h2>Second Head­ing</h2>
          <p>This second heading is the only one with a soft hyphen. 
          The <em>first</em> &lt;h1&gt; has <em>no</em> soft hyphen.</p>
          </body>
          
          </html>
          

          I use Notepad++ version 8.6 (64-bit). What am I missing?

          • Terry R @Sylvester Bullitt

            @Sylvester-Bullitt said in Regex - Finding Invalid Characters in HTML Tags:

            What am I missing?

            I’ve had a look, but as I’m not the creator of the regexes I linked you to, I cannot immediately see the problem. Yet, as you say, it’s finding the character outside the desired HTML elements.

            @guy038 is the creator and he will need to investigate this.

            Terry

            • Sylvester Bullitt @Terry R

              @Terry-R Ok. Thanks for taking a look.

              Will @guy038 automatically get a copy of my question?

              • Terry R @Sylvester Bullitt

                @Sylvester-Bullitt

                No, but whenever someone mentions another member, as I (and you) did, so that the member name shows in red, that member gets a notification in the upper right area, next to their account avatar.

                Generally that is enough, but sometimes creating a chat (which is off air) and leaving a message there will also register in the same area, and gives further impetus to look at what it refers to.

                Bear in mind all members are using their free time to check in on the forum and @guy038 is in the French locale so time zones can also cause a delay in response.

                Terry

                • guy038

                  Hello, @sylvester-bullitt, @terry-r and All,

                  @sylvester-bullitt, I understand why your version of the generic regex does not work in this specific case!

                  In your example, below :

                  (?-si:<(title|figcaption|h\d)>|(?!\A)\G)(?s-i:(?!<\1>).)*?\K(?-si:­)

                  First, I suppose that you forgot the normal forward slash /, in the ESR region !

                  Thus, the regex should be :

                  (?-si:<(title|figcaption|h\d)>|(?!\A)\G)(?s-i:(?!</\1>).)*?\K(?-si:­)

                  So, why does this regex not work as expected?

                  Well, from the very beginning of the file, the regex first looks for the exact string title, figcaption, or h followed by a digit, surrounded by angle brackets and stored as group 1.

                  But, right after that first match, it uses the second alternative (?!\A)\G and looks for a range of chars up to a soft-hyphen char, with the condition that it does not go through the corresponding ending tag, (?!</\1>). Unfortunately, in this second alternative, group 1 does not exist. Thus, the search would only stop temporarily when the string </> is found, a case which never happens!
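The effect of the unset group can be reproduced in a small sketch with Python's `re` module. This is an illustration under assumptions, not the Notepad++ engine: `re` has no `\G`, so the made-up marker `@` stands in for the second alternative, and the sample strings are hypothetical. In Python a back-reference to a group that did not take part in the match simply fails, so the negative look-ahead always succeeds. Whether Boost treats the unset group as failing or as empty, the outcome described above is the same: the look-ahead never stops the scan at the real ending tag.

```python
import re

SHY = "\u00ad"  # soft hyphen, U+00AD

# "@" is a made-up stand-in for the (?!\A)\G alternative, which re lacks.
pattern = re.compile(r"(?:<(h\d)>|@)(?:(?!</\1>).)*?\xad", re.DOTALL)

# First alternative: group 1 is "h2", so the look-ahead stops at </h2>
# and the soft hyphen in the paragraph is never reached.
with_group = pattern.search(f"<h2>Heading</h2><p>Ro{SHY}bert</p>")

# Second alternative: group 1 is unset, so nothing stops the scan and
# the soft hyphen outside the heading is matched.
without_group = pattern.search(f"@Heading</h2><p>Ro{SHY}bert</p>")
```

Here `with_group` is `None`, while `without_group` is a match whose group 1 is `None`.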

                  Then, if you want to temporarily stop the search after any of the </title>, </figcaption> or </h\d> ending tags, two solutions are possible:

                  • Instead of using the back-reference \1, replace it with the group reference (?1) (a subroutine call), which represents the regex of group 1 itself and NOT the present value of that group!

                  • Simplify the ESR region so that it does not care about the exact name of the ending tag, while keeping the full BSR region as before


                  So, follow these steps :

                  • Open or switch to your file

                  • Move to the very beginning of your text ( Ctrl + Home ) IMPORTANT

                  • Open the Replace dialog ( Ctrl + H )

                  • Untick all the box options

                  • If you prefer the first solution :

                    • SEARCH (?-si:<(title|figcaption|h[1-6])>|(?!\A)\G)(?s:(?!</(?1)>).)*?\K\xAD

                    • REPLACE \x2D

                  • If you prefer the second solution :

                    • SEARCH (?-si:<(?:title|figcaption|h[1-6])>|(?!\A)\G)(?s:(?!</).)*?\K\xAD

                    • REPLACE \x2D

                  • Select the Regular expression search mode

                  • Click, repeatedly, on the Find Next button to verify that it only finds the soft-hyphen chars in ranges <title>......</title> or <figcaption>......</figcaption> or <h[1-6]>......</h[1-6]>

                  • If necessary, click on the Replace All button to replace each of these specific soft-hyphen characters with a regular hyphen (\x2D). You CANNOT use the single Replace button, because of the \K syntax within the regex!
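For readers who want to try the idea outside Notepad++, the second solution can be approximated in Python. This is a sketch under assumptions: Python's `re` supports neither `\G` nor `\K`, so instead of matching one soft hyphen at a time it matches each whole region (from the opening tag up to the next `</`) and replaces the soft hyphens inside it with a callback. The function name `replace_soft_hyphens` and the sample text are made up for illustration.

```python
import re

SHY = "\u00ad"  # soft hyphen, U+00AD

# From an opening <title>, <figcaption> or <h1>..<h6> tag up to,
# but not including, the next "</" (same stop rule as the second solution).
REGION = re.compile(r"<(?:title|figcaption|h[1-6])>(?:(?!</).)*", re.DOTALL)


def replace_soft_hyphens(html: str) -> str:
    """Replace soft hyphens with '-' inside the selected regions only."""
    return REGION.sub(lambda m: m.group(0).replace(SHY, "-"), html)


sample = (
    f"<title>Regex Te{SHY}st</title>\n"
    f'<p class="citation">Anderson, Ro{SHY}bert</p>\n'
    f"<h2>Second Head{SHY}ing</h2>\n"
)
cleaned = replace_soft_hyphens(sample)
```

The soft hyphens in the title and the heading become regular hyphens, while the one in the citation paragraph is left alone.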


                  REMARK :

                  Whichever option you choose, note that the regex would, in theory, match the soft-hyphen character in all three cases below: the correct case <h2>......</h2>, but also, for example, the wrong syntaxes <h2>......</title> and <h2>......</figcaption> :-((

                  <h2>Second Head­ing</h2>
                  <h2>Second Head­ing</title>
                  <h2>Second Head­ing</figcaption>
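If the mismatched cases matter, one possible tightening is to require that the region ends with the closing tag of the same name. The sketch below uses Python's `re` rather than the Notepad++ engine, and the names `STRICT` and `fix_matched_elements` are hypothetical. It has a limitation of its own: an element that contains a nested closing tag, such as `</em>`, is skipped entirely, so treat it as a trade-off rather than a better regex.

```python
import re

SHY = "\u00ad"  # soft hyphen, U+00AD

# The region stops at the first "</", which must then be the closing
# tag with the same name as the opening tag (back-reference \1).
STRICT = re.compile(
    r"<(title|figcaption|h[1-6])>(?:(?!</).)*</\1>", re.DOTALL
)


def fix_matched_elements(html: str) -> str:
    """Replace soft hyphens only in elements whose tags pair up."""
    return STRICT.sub(lambda m: m.group(0).replace(SHY, "-"), html)


sample = (
    f"<h2>Second Head{SHY}ing</h2>\n"
    f"<h2>Second Head{SHY}ing</title>\n"
    f"<h2>Second Head{SHY}ing</figcaption>\n"
)
result = fix_matched_elements(sample)
```

Only the first line, where `<h2>` is closed by `</h2>`, is changed.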
                  

                  Test this INPUT text. It should be OK !

                  <!DOCTYPE HTML>
                  <html lang="en-us">
                  
                  <head>
                  <meta charset="utf-8">
                  <title>Regex Te­st</title>
                  </head>
                  
                  <body>
                  <h1>George Portrait</h1>
                  <figure><img alt="portrait" src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b6/Gilbert_Stuart_Williamstown_Portrait_of_George_Washington.jpg/800px-Gilbert_Stuart_Williamstown_Portrait_of_George_Washington.jpg" width="200" height="243"><figcaption>George Washi
                  ngton<br>1732­1799</figcaption></figure>
                  
                  <p class="citation"><span class="author">Anderson, Ro­bert</span> &amp; <span class="author">Gail North</span>. <cite class="book">Mu­sic En­cy­clo­ped­ia</cite>. New York: Ster­ling Pub­lish­ing, 1979.</p>
                  
                  <h2>Second Head­ing</h2>
                  <h2>Second Head­ing</title>
                  <h2>Second Head­ing</figcaption>
                  
                  <p>This second heading is the only one with a soft hyphen. 
                  The <em>first</em> &lt;h1&gt; has <em>no</em> soft hyphen.</p>
                  </body>
                  
                  </html>
                  

                  Best Regards

                  guy038

                  • Sylvester Bullitt @guy038

                    @guy038 Wow. I feel like I’m sipping from a fire hose! I bow to far greater experience than mine. Here’s the regex I finally chose:

                    (?-si:<(?:title|figcaption|h[1-6])>|(?!\A)\G)(?s:(?!</).)*?\K\xAD
                    

                    I like your idea of using the \xAD for the soft hyphen: It clarifies what the regex is doing!
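As a quick cross-check of what `\xAD` stands for, here is a short Python sketch (the variable names are illustrative only). It looks the character up in the Unicode database and confirms that the escaped form finds a literal soft hyphen.

```python
import re
import unicodedata

soft_hyphen = "\xad"  # same code point as U+00AD

name = unicodedata.name(soft_hyphen)          # official Unicode name
category = unicodedata.category(soft_hyphen)  # "Cf" = format character

# The escape in the pattern matches the invisible character in the text.
found = re.findall(r"\xAD", f"Te{soft_hyphen}st")
```

Because the character is a format character that usually renders as nothing, writing it as an escape keeps the pattern readable.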

                    I ran the regex against my entire Web site (16,000 HTML files), and it works perfectly.
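A site-wide check like this can also be scripted. The following is only a sketch, not how the search above was run: it assumes UTF-8 encoded `.html` files, uses Python's `re` with a region pattern equivalent in spirit to the chosen regex, and the function name `files_with_soft_hyphens` and the temporary test files are made up for the example.

```python
import re
import tempfile
from pathlib import Path

SHY = "\u00ad"  # soft hyphen, U+00AD
REGION = re.compile(r"<(?:title|figcaption|h[1-6])>(?:(?!</).)*", re.DOTALL)


def files_with_soft_hyphens(root: Path) -> list[Path]:
    """Return the .html files under root with a soft hyphen in a region."""
    hits = []
    for path in sorted(root.rglob("*.html")):
        text = path.read_text(encoding="utf-8")  # assumes UTF-8 files
        if any(SHY in m.group(0) for m in REGION.finditer(text)):
            hits.append(path)
    return hits


# Small self-contained demonstration on two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "bad.html").write_text(f"<h1>Head{SHY}ing</h1>", encoding="utf-8")
    (root / "ok.html").write_text(
        f"<h1>Heading</h1><p>Ro{SHY}bert</p>", encoding="utf-8"
    )
    flagged = [p.name for p in files_with_soft_hyphens(root)]
```

Only `bad.html` is flagged, because the soft hyphen in `ok.html` sits in a paragraph rather than in one of the selected tags.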

                    I’ll be the first to admit I don’t fully understand why it works, but I can’t argue with success. I’m going to have to study it for a while, but it now has a permanent home in my toolbox.

                    Thanks for taking the time to look at this thorny issue!

                    The Community of users of the Notepad++ text editor.