Regex - Finding Invalid Characters in HTML Tags
-
I’m trying to write a regular expression to find invalid characters (e.g., soft hyphens, Unicode U+00AD) inside HTML tags like <title>, <figcaption>, <h1>, <h2>, etc.
I’ve run into an issue that I’m not sure there’s a solution for, but thought I’d run it by the community to make sure.
MY REGEX:
<((title|figcaption|h\d)).*?[].*?</\1
THE TEXT WHERE IT INCORRECTLY FINDS A MATCH:
<h1 class="screen-reader-only">Introduction</h1> <div class="preface-text"> <p><span>Born:</span> November 8, 1877, Moberg, Denmark.</p> <p><span>Died:</span> June 22, 1970, <span class="map" onclick='show("Minneapolis,MN")'>Minneapolis</span>, Minnesota.</p> <p><span>Buried:</span> <span class="map" onclick="show('45.01030,-93.21580')">Sunset Memorial Park</span>, Minneapolis, Minnesota.</p> </div> <figure><img alt="portrait" class="notable" src="../../../../../img/a/a/b/e/aaberg_jc.jpg" height="204" width="153"></figure> </section> <section id="biography"> <h1>Biography</h1>
The regex finds a soft hyphen between <h1> and </h1> tags (in the word Minneapolis, for example), but the match runs from the opening <h1> of one heading to the closing </h1> of a different heading. I only want a match if the invalid character is within a single element.
I’ve tried non-greedy operators, but can’t find a way to prevent the bad match.
Can anyone see how to make this work?
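For illustration, the cross-tag match can be reproduced outside Notepad++. The sketch below uses Python's standard re module, which is an assumption of this example (Notepad++ uses the Boost regex engine, so details may differ). The sample string and the names naive and tempered are hypothetical. It suggests that the lazy .*? is free to run past the first </h1>, while a variant that refuses to step over the matching closing tag is not.

```python
import re

SOFT_HYPHEN = "\u00ad"

# Hypothetical sample: the only soft hyphen sits in a <p>, between two <h1> elements.
html = (
    "<h1>Introduction</h1> "
    "<p>Died: June 22, 1970, Minne" + SOFT_HYPHEN + "apolis, Minnesota.</p> "
    "<h1>Biography</h1>"
)

# Same idea as the regex in the question: lazy, but free to cross closing tags.
naive = re.compile(r"<(title|figcaption|h\d).*?\xad.*?</\1", re.DOTALL)

# Variant: a character is consumed only if it does not start the closing tag.
tempered = re.compile(
    r"<(title|figcaption|h\d)\b[^>]*>(?:(?!</\1>).)*?\xad", re.DOTALL
)

bad = naive.search(html)
print(bad is not None)        # the naive pattern reports a match
print(tempered.search(html))  # the tempered pattern finds nothing here
```

The naive match starts in the first heading and ends in the second, which is exactly the unwanted behaviour described above.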
-
@Sylvester-Bullitt said in Regex - Finding Invalid Characters in HTML Tags:
Can anyone see how to make this work?
Yes, and it’s in the FAQ posts here.
Your regex is trying to find what you want, and without a limit on its movement it will find it, even if, as you say, it has gone past the boundary you thought you had set.
Have a read of that FAQ post and make the necessary changes. I’m sure it will work as you want afterwards.
Terry
-
@Terry-R Partial success, so far.
I got the single-line search to work, but ran into problems searching for a repeated text zone.
Here’s the failing regex. It’s trying to find soft hyphens in a <title>, <figcaption>, or heading (<h1>, <h2>, etc.):
(?-si:<(title|figcaption|h\d)>|(?!\A)\G)(?s-i:(?!<\1>).)*?\K(?-si:)
Here’s the text to search. The regex finds the soft hyphens in the specified tags, but also finds soft hyphens outside those tags (i.e., in the <citation> tag). I need the regex to ignore any soft hyphens not in the specified tags.
<!DOCTYPE HTML> <html lang="en-us"> <head> <meta charset="utf-8"> <title>Regex Test</title> </head> <body> <h1>George Portrait</h1> <figure><img alt="portrait" src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b6/Gilbert_Stuart_Williamstown_Portrait_of_George_Washington.jpg/800px-Gilbert_Stuart_Williamstown_Portrait_of_George_Washington.jpg" width="200" height="243"><figcaption>George Washington<br>1732–1799</figcaption></figure> <p class="citation"><span class="author">Anderson, Robert</span> & <span class="author">Gail North</span>. <cite class="book">Music Encyclopedia</cite>. New York: Sterling Publishing, 1979.</p> <h2>Second Heading</h2> <p>This second heading is the only one with a soft hyphen. The <em>first</em> <h1> has <em>no</em> soft hyphen.</p> </body> </html>
I use Notepad++ version 8.6 (64-bit). What am I missing?
-
@Sylvester-Bullitt said in Regex - Finding Invalid Characters in HTML Tags:
What am I missing?
I’ve had a look, but as I’m not the creator of the regexes I linked you to, I cannot immediately see the problem. As you say, it’s finding the character outside of the desired HTML tags.
@guy038 is the creator and he will need to investigate this.
Terry
-
-
No, but whenever someone mentions another member, as I (and you) did, so that the member name appears in red, that member gets a notification in the upper right area, next to their account avatar.
Generally that is enough, but sometimes creating a chat (which is off air) and leaving a message there will also register in the same area, and is further impetus to look at what it refers to.
Bear in mind that all members are using their free time to check in on the forum, and @guy038 is in the French locale, so time zones can also cause a delay in response.
Terry
-
Hello, @sylvester-bullitt, @terry-r and All,
@sylvester-bullitt, I understand why your version of the generic regex does not work in this specific case!
In your example, below :
(?-si:<(title|figcaption|h\d)>|(?!\A)\G)(?s-i:(?!<\1>).)*?\K(?-si:)
First, I suppose that you forgot the normal forward slash / in the ESR region! Thus, the regex should be:
(?-si:<(title|figcaption|h\d)>|(?!\A)\G)(?s-i:(?!</\1>).)*?\K(?-si:)
So, why does this regex not work as expected?
Well, from the very beginning of the file, the regex first looks for the exact string title, figcaption or h\d, stored as group 1 and surrounded by angle brackets.
But, right after that, it uses the second alternative (?!\A)\G and looks for a range of characters up to a soft-hyphen character, with the condition that it does not go through the corresponding ending tag: (?!</\1>). Unfortunately, in this second alternative, group 1 does not exist. Thus, the search would only stop temporarily when the string </> is found, a case which never happens!
Then, if you want to temporarily stop the search after any of the </title>, </figcaption> or </h\d> ending tags, two solutions are possible:
- Instead of using the back-reference \1, replace it with a group reference (?1), which represents the regex pattern identified as group 1 and NOT the present value of that group!
- Simplify the ESR region so that it does not care about the exact name of the ending tag, while keeping the full BSR region, as before.
So, follow these steps:
- Open or switch to your file
- Move to the very beginning of your text (Ctrl + Home). IMPORTANT
- Open the Replace dialog (Ctrl + H)
- Untick all the box options
- If you prefer the first solution:
  - SEARCH (?-si:<(title|figcaption|h[1-6])>|(?!\A)\G)(?s:(?!</(?1)>).)*?\K\xAD
  - REPLACE \x2D
- If you prefer the second solution:
  - SEARCH (?-si:<(?:title|figcaption|h[1-6])>|(?!\A)\G)(?s:(?!</).)*?\K\xAD
  - REPLACE \x2D
- Select the Regular expression search mode
- Click repeatedly on the Find Next button to verify that it only finds the soft-hyphen characters in the ranges <title>......</title>, <figcaption>......</figcaption> or <h[1-6]>......</h[1-6]>
- If necessary, click on the Replace All button to replace each of these specific soft-hyphen characters with a usual dash character. You CANNOT use the Replace button, due to the \K syntax within the regex!
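As a cross-check outside Notepad++, the second solution can be approximated with Python's standard re module. This is only a sketch under assumptions: Python's re has neither \G nor \K, so the code swaps in a different technique (match each zone from an opening tag up to the next </, then replace inside it). ZONE, replace_soft_hyphens and the sample string are hypothetical names and data.

```python
import re

# A zone starts at one of the opening tags and runs until the next "</",
# mirroring the BSR and the simplified ESR (?!</) of the second solution.
ZONE = re.compile(r"<(?:title|figcaption|h[1-6])>(?:(?!</).)*", re.DOTALL)


def replace_soft_hyphens(html: str, replacement: str = "-") -> str:
    """Replace U+00AD only inside the zones matched by ZONE."""
    return ZONE.sub(lambda m: m.group(0).replace("\u00ad", replacement), html)


sample = (
    "<title>Re\u00adgex Test</title>"
    "<p class=\"citation\">En\u00adcyclopedia</p>"
    "<h2>Sec\u00adond Heading</h2>"
)
expected = (
    "<title>Re-gex Test</title>"
    "<p class=\"citation\">En\u00adcyclopedia</p>"
    "<h2>Sec-ond Heading</h2>"
)
print(replace_soft_hyphens(sample) == expected)  # True
```

Because the zone ends at the first </ of any tag, a soft hyphen that follows an inline element such as <em>...</em> inside a heading is left alone by this sketch. The second Notepad++ regex appears to behave the same way, which is worth keeping in mind for headings with nested markup.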
REMARK:
Whatever the option chosen, note that the regex would, in theory, match the soft-hyphen character within the three cases below: the right case <h2>......</h2>, but also, for example, the wrong syntaxes <h2>......</title> and <h2>......</figcaption>
:-((<h2>Second Heading</h2> <h2>Second Heading</title> <h2>Second Heading</figcaption>
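The remark can be checked with a small Python sketch (an assumption of this example, since Notepad++ uses the Boost engine). Python's re has no (?1) group reference, so the alternation is simply repeated through the hypothetical NAMES string; count_hits is likewise a made-up helper. The soft hyphens in the samples are written as \u00ad so that they stay visible.

```python
import re

NAMES = r"(?:title|figcaption|h[1-6])"

# Zone: an opening tag, then everything up to ANY closing tag from NAMES,
# which is what (?!</(?1)>) expresses in the first solution.
ZONE = re.compile(rf"<{NAMES}>((?:(?!</{NAMES}>).)*)", re.DOTALL)


def count_hits(html: str) -> int:
    """Count soft hyphens (U+00AD) located inside a zone."""
    return sum(m.group(1).count("\u00ad") for m in ZONE.finditer(html))


for closing in ("</h2>", "</title>", "</figcaption>"):
    print(closing, count_hits("<h2>Sec\u00adond Heading" + closing))
print("<p>", count_hits("<p>Sec\u00adond</p>"))
```

Each of the three closings reports one hit, mismatched or not, while the <p> sample reports none. Pairing an opening tag only with its own closing tag would need a different pattern than the ones discussed here.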
Test this INPUT text. It should be OK !
<!DOCTYPE HTML> <html lang="en-us"> <head> <meta charset="utf-8"> <title>Regex Test</title> </head> <body> <h1>George Portrait</h1> <figure><img alt="portrait" src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b6/Gilbert_Stuart_Williamstown_Portrait_of_George_Washington.jpg/800px-Gilbert_Stuart_Williamstown_Portrait_of_George_Washington.jpg" width="200" height="243"><figcaption>George Washi ngton<br>17321799</figcaption></figure> <p class="citation"><span class="author">Anderson, Robert</span> & <span class="author">Gail North</span>. <cite class="book">Music Encyclopedia</cite>. New York: Sterling Publishing, 1979.</p> <h2>Second Heading</h2> <h2>Second Heading</title> <h2>Second Heading</figcaption> <p>This second heading is the only one with a soft hyphen. The <em>first</em> <h1> has <em>no</em> soft hyphen.</p> </body> </html>
Best Regards
guy038
-
-
@guy038 Wow. I feel like I’m sipping from a fire hose! I bow to far greater experience than mine. Here’s the regex I finally chose:
(?-si:<(?:title|figcaption|h[1-6])>|(?!\A)\G)(?s:(?!</).)*?\K\xAD
I like your idea of using the \xAD for the soft hyphen: It clarifies what the regex is doing!
I ran the regex against my entire Web site (16,000 HTML files), and it works perfectly.
I’ll be the first to admit I don’t fully understand why it works, but I can’t argue with success. I’m going to have to study it for a while, but it now has a permanent home in my toolbox.
Thanks for taking the time to look at this thorny issue!