Regex - Finding Invalid Characters in HTML Tags

Sylvester Bullitt

I’m trying to write a regular expression to find invalid characters (e.g., soft hyphens, Unicode U+00AD) inside HTML tags like <title>, <figcaption>, <h1>, <h2>, etc.

I’ve run into an issue that I’m not sure there’s a solution for, but thought I’d run it by the community to make sure.

MY REGEX:

<((title|figcaption|h\d)).*?[].*?</\1

THE TEXT WHERE IT INCORRECTLY FINDS A MATCH:

<h1 class="screen-reader-only">Introduction</h1>
<div class="preface-text">
<p><span>Born:</span> November 8, 1877, Moberg, Denmark.</p>
<p><span>Died:</span> June 22, 1970, <span class="map" onclick='show("Minneapolis,MN")'>Minneapolis</span>, Minnesota.</p>
<p><span>Buried:</span> <span class="map" onclick="show('45.01030,-93.21580')">Sunset Memorial Park</span>, Minneapolis, Minnesota.</p>
</div>
<figure><img alt="portrait" class="notable" src="../../../../../img/a/a/b/e/aaberg_jc.jpg" height="204" width="153"></figure>
</section>

<section id="biography">
<h1>Biography</h1>

The regex finds a soft hyphen between <h1> and </h1> tags (in the word Minneapolis, for example), but the match runs from the <h1> of one element to the </h1> of a different one. I only want a match if the invalid character is inside a single element.

I’ve tried non-greedy operators, but can’t find a way to prevent the bad match.

Can anyone see how to make this work?

Terry R

@Sylvester-Bullitt said in Regex - Finding Invalid Characters in HTML Tags:

Can anyone see how to make this work?

Yes, and it’s in the FAQ posts here.

Your regex is trying to find what you want, and without a limit on its movement it will find it, even if, as you say, it has gone past the boundary you thought you had set.

Have a read of that FAQ post and make the changes necessary. I’m sure it will work as you want afterwards.
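
For readers following along, the overshoot and one common way of limiting it can be reproduced outside Notepad++. The sketch below uses Python's standard re module rather than Notepad++'s Boost engine, so it is only an approximation, and the lookahead shown is not necessarily the approach the FAQ takes. The sample string and the names naive and tempered are made up for illustration, and it assumes the empty-looking [] in the original regex held a literal soft hyphen, written here as \xad.

```python
import re

# Hypothetical sample: the only soft hyphen (U+00AD) sits in a <p>,
# between two separate <h1> elements.
text = "<h1>Intro</h1>\n<p>Minne\xadapolis</p>\n<h1>Bio</h1>"

# Lazy quantifiers shorten the match, but they still cross element boundaries.
naive = re.compile(r"<(title|figcaption|h\d).*?\xad.*?</\1", re.DOTALL)

# A negative lookahead before each character refuses to step over the
# closing tag that corresponds to the captured opening tag.
tempered = re.compile(
    r"<(title|figcaption|h\d)\b[^>]*>(?:(?!</\1>).)*?\xad", re.DOTALL
)

naive_hit = naive.search(text)
tempered_hit = tempered.search(text)
inside_hit = tempered.search("<h1>In\xadtro</h1>")

print(naive_hit is not None)    # the naive pattern reports a (bad) match
print(tempered_hit is None)     # the limited pattern reports none
print(inside_hit is not None)   # yet it still matches inside one heading
```

The naive match starts at the first <h1> and ends at the closing tag of the second one, which is the behaviour described in the question.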

Terry

Sylvester Bullitt

@Terry-R Partial success, so far.

I got the single-line search to work, but ran into problems searching for a repeated text zone.

Here’s the failing regex. It’s trying to find soft hyphens in a <title>, <figcaption>, or heading (<h1>, <h2>, etc.):

(?-si:<(title|figcaption|h\d)>|(?!\A)\G)(?s-i:(?!<\1>).)*?\K(?-si:)

Here’s the text to search. The regex finds the soft hyphens in the specified tags, but also finds soft hyphens outside those tags (i.e., in the <p class="citation"> paragraph). I need the regex to ignore any soft hyphens not in the specified tags.

<!DOCTYPE HTML>
<html lang="en-us">

<head>
<meta charset="utf-8">
<title>Regex Test</title>
</head>

<body>
<h1>George Portrait</h1>
<figure><img alt="portrait" src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b6/Gilbert_Stuart_Williamstown_Portrait_of_George_Washington.jpg/800px-Gilbert_Stuart_Williamstown_Portrait_of_George_Washington.jpg" width="200" height="243"><figcaption>George Washington<br>1732–1799</figcaption></figure>

<p class="citation"><span class="author">Anderson, Robert</span> &amp; <span class="author">Gail North</span>. <cite class="book">Music Encyclopedia</cite>. New York: Sterling Publishing, 1979.</p>

<h2>Second Heading</h2>
<p>This second heading is the only one with a soft hyphen. 
The <em>first</em> &lt;h1&gt; has <em>no</em> soft hyphen.</p>
</body>

</html>

I use Notepad++ version 8.6 (64-bit). What am I missing?

Terry R

@Sylvester-Bullitt said in Regex - Finding Invalid Characters in HTML Tags:

What am I missing?

I’ve had a look, but as I’m not the creator of the regexes I linked you to, I cannot immediately see the problem. As you say, though, it’s finding the character outside of the desired HTML code.

@guy038 is the creator and he will need to investigate this.

Terry

Sylvester Bullitt

@Terry-R Ok. Thanks for taking a look.

Will @guy038 automatically get a copy of my question?

Terry R

@Sylvester-Bullitt

No, but whenever someone mentions another member, as you and I both did (the member name then shows in red), that member gets a notification in the upper right area, next to their account avatar.

Generally that is enough but sometimes creating a chat (which is off air) and leaving it will also register in the same area and is further impetus to look at what it refers to.

Bear in mind all members are using their free time to check in on the forum and @guy038 is in the French locale so time zones can also cause a delay in response.

Terry

guy038

Hello, @sylvester-bullitt, @terry-r and All,

@sylvester-bullitt, I understand why your version of the generic regex does not work in this specific case !

In your example, below :

(?-si:<(title|figcaption|h\d)>|(?!\A)\G)(?s-i:(?!<\1>).)*?\K(?-si:)

First, I suppose that you forgot the forward slash / in the ESR (End Search-region Regex) part !

Thus, the regex should be :

(?-si:<(title|figcaption|h\d)>|(?!\A)\G)(?s-i:(?!</\1>).)*?\K(?-si:)

So, why does this regex not work as expected ?

Well, from the very beginning of the file, the regex first looks for the string title, figcaption or h followed by a digit (h\d), stored as group 1 and surrounded by angle brackets.

But, right after, it uses the second alternative (?!\A)\G and looks for a range of chars up to a soft-hyphen char, with the condition that it does not go past the corresponding ending tag (?!</\1>). Unfortunately, in this second alternative, group 1 is not defined. Thus, the search would only stop temporarily when the string </> is found, a case which never happens !

So, if you want the search to stop temporarily at any of the </title>, </figcaption> or </h\d> ending tags, two solutions are possible :

Instead of using the back-reference \1, replace it with the group reference (?1), which stands for the regex of group 1 itself and NOT for the present value of that group !
Simplify the ESR region so that it does not care about the exact name of the ending tag, while keeping the full BSR (Begin Search-region Regex) region, as before
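
The difference between a back-reference and a group reference can be sketched in Python. Note the assumptions: Python's standard re module has no (?1) syntax, so the sketch imitates it by pasting the same sub-pattern text (the made-up name TAG) into the pattern a second time. It only illustrates the idea and is not the Boost syntax that Notepad++ uses.

```python
import re

# Hypothetical helper: the sub-pattern that group 1 holds in the regex above.
TAG = r"(?:title|figcaption|h[1-6])"

# Back-reference: \1 must repeat the exact text captured by group 1.
by_value = re.compile(r"<(" + TAG + r")>.*?</\1>")

# Pattern reuse (the idea behind (?1)): the closing tag may be anything
# the sub-pattern can match, whatever the opening tag captured.
by_pattern = re.compile(r"<(" + TAG + r")>.*?</" + TAG + r">")

same = "<h2>Second Heading</h2>"
mixed = "<h2>Second Heading</title>"

print(bool(by_value.search(same)), bool(by_value.search(mixed)))
print(bool(by_pattern.search(same)), bool(by_pattern.search(mixed)))
```

Only the back-reference version rejects the mismatched pair. The pattern-reuse version accepts it because it checks the shape of the closing tag, not its agreement with the opening one.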

So, follow these steps :

Open or switch to your file
Move to the very beginning of your text ( Ctrl + Home ) IMPORTANT
Open the Replace dialog ( Ctrl + H )
Untick all the box options
If you prefer the first solution :
- SEARCH (?-si:<(title|figcaption|h[1-6])>|(?!\A)\G)(?s:(?!</(?1)>).)*?\K\xAD
- REPLACE \x2D
If you prefer the second solution :
- SEARCH (?-si:<(?:title|figcaption|h[1-6])>|(?!\A)\G)(?s:(?!</).)*?\K\xAD
- REPLACE \x2D
Select the Regular expression search mode
Click, repeatedly, on the Find Next button to verify that it only finds the soft-hyphen chars in ranges <title>......</title> or <figcaption>......</figcaption> or <h[1-6]>......</h[1-6]>
If necessary, click on the Replace All button to replace each of these specific soft-hyphen characters with a regular hyphen-minus char (-). You CANNOT use the Replace button, due to the \K syntax within the regex !
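
Outside Notepad++, the effect of the second solution can be approximated in two passes. The sketch below is an illustration under stated assumptions, not the method given in this post: Python's standard re module supports neither \G nor \K, so it first matches each whole range from an opening <title>, <figcaption> or <h1> to <h6> tag up to the next </, then replaces the soft hyphens inside that range. The names REGION and fix_soft_hyphens and the sample string are invented for the example.

```python
import re

# An opening tag of interest, then every character up to (not including)
# the next "</". DOTALL lets "." cross line breaks, like (?s:...) does.
REGION = re.compile(r"<(?:title|figcaption|h[1-6])>(?:(?!</).)*", re.DOTALL)


def fix_soft_hyphens(html: str) -> str:
    """Replace U+00AD with '-' only inside the matched ranges."""
    return REGION.sub(lambda m: m.group(0).replace("\xad", "-"), html)


# Hypothetical sample with soft hyphens inside and outside the target tags.
sample = (
    "<figcaption>George Washi\xadngton<br>1732\xad1799</figcaption>\n"
    "<p class=\"citation\">Music Encyclo\xadpedia</p>\n"
    "<h2>Second Hea\xadding</h2>\n"
)

result = fix_soft_hyphens(sample)
print(result.count("\xad"))  # soft hyphens left: only the one in <p>
```

As with the regex itself, the range ends at the first </ of any kind, so the sketch does not check that the closing tag agrees with the opening one.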

REMARK :

Whichever option you choose, note that, in theory, the regex would match the soft-hyphen character in all three cases below: the correct case <h2>......</h2> but also, for example, the wrong syntaxes <h2>......</title> and <h2>......</figcaption> :-((

<h2>Second Heading</h2>
<h2>Second Heading</title>
<h2>Second Heading</figcaption>

Test this INPUT text. It should be OK !

<!DOCTYPE HTML>
<html lang="en-us">

<head>
<meta charset="utf-8">
<title>Regex Test</title>
</head>

<body>
<h1>George Portrait</h1>
<figure><img alt="portrait" src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b6/Gilbert_Stuart_Williamstown_Portrait_of_George_Washington.jpg/800px-Gilbert_Stuart_Williamstown_Portrait_of_George_Washington.jpg" width="200" height="243"><figcaption>George Washi
ngton<br>17321799</figcaption></figure>

<p class="citation"><span class="author">Anderson, Robert</span> &amp; <span class="author">Gail North</span>. <cite class="book">Music Encyclopedia</cite>. New York: Sterling Publishing, 1979.</p>

<h2>Second Heading</h2>
<h2>Second Heading</title>
<h2>Second Heading</figcaption>

<p>This second heading is the only one with a soft hyphen. 
The <em>first</em> &lt;h1&gt; has <em>no</em> soft hyphen.</p>
</body>

</html>

Best Regards

guy038

Sylvester Bullitt

@guy038 Wow. I feel like I’m sipping from a fire hose! I bow to far greater experience than mine. Here’s the regex I finally chose:

(?-si:<(?:title|figcaption|h[1-6])>|(?!\A)\G)(?s:(?!</).)*?\K\xAD

I like your idea of using \xAD for the soft hyphen: it clarifies what the regex is doing!

I ran the regex against my entire Web site (16,000 HTML files), and it works perfectly.

I’ll be the first to admit I don’t fully understand why it works, but I can’t argue with success. I’m going to have to study it for a while, but it now has a permanent home in my toolbox.

Thanks for taking the time to look at this thorny issue!