Find Duplicate HTML Tags with Regex

Sylvester Bullitt

I’m trying to write a regex to find duplicate HTML tag IDs. Here’s the expression I’m working on. It works only if the HTML tags are on the same line.

((id=".+").+\g-1)

Here’s the HTML I’m running it against:

<ol>
<li id="duplicate-id">Testing <span id="duplicate-id">regex</span> to find duplicate HTML tag ID’s.</li>
<li>It finds the duplicates in the next two list elements (even though they’re on the same line)</li>
<li>Duplicate tags in two elements but <em>on one line</em>. First duplicate <span id="another-duplicate">here</span>.</li><li>Then a second one <span id="another-duplicate">here</span></li>
<li>It does <em>not</em> find the duplicate ID’s in the <em>next</em> two elements <em>on separate lines</em>.</li>
<li>First duplicate <span id="another-duplicate">here</span></li>
<li>And, after a line break, a <span id="another-duplicate">second</span> duplicate</li>
</ol>

Notes:

I include the outer enclosing parentheses because I want to append this to a larger regex once it works.
I need to check over 10,000 files, so an online HTML checker is impractical.

The Boost documentation says the dot character (just before \g in the above regex) includes line breaks by default. But I seem to be getting the opposite behavior.

I’m using Notepad++ 7.8.2 (64-bit) on Windows 10.

Can anyone see what I’m doing wrong?

Alan Kilborn

@Sylvester-Bullitt

Not looking too hard at your specific problem, but just at the “.” and “line breaks” part.

It is going to depend upon the . matches newline box setting. If unticked, it won’t match linebreaks. If ticked, it will.

Alternatively, you can ignore that checkbox completely and include (?s) at the start of your regex to allow linebreak matching, and (?-s) to disallow it.
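As a rough illustration of what (?s) changes, here is a minimal sketch in Python. It uses Python's re module rather than the Boost engine inside Notepad++, so treat it as an analogy, not a reproduction. The sample HTML and the tightened id="[^"]+" sub-pattern are assumptions made for the example.

```python
import re

# Two list items on separate lines that share an id (sample data, assumed).
html = (
    '<li>First duplicate <span id="another-duplicate">here</span></li>\n'
    '<li>And a <span id="another-duplicate">second</span> duplicate</li>\n'
)

# Without (?s), "." stops at the line break, so the back-reference
# to the first id can never be reached on the next line.
without_s = re.search(r'(id="[^"]+").+\1', html)

# With (?s) at the start, "." also matches "\n", so the match spans both lines.
with_s = re.search(r'(?s)(id="[^"]+").+\1', html)

print(without_s)        # None
print(with_s.group(1))  # id="another-duplicate"
```

Using [^"]+ instead of .+ inside the quotes keeps the captured ID from running past its closing quote once the dot is allowed to match line breaks.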

Hope this helps…

Sylvester Bullitt

Thanks for replying, Alan. I can’t use the “matches newline” check box in my scenario, because this is one of many queries combined into a large consolidated query, and it would cause the other subqueries to fail.

I tried the (?s) option, but it seemed to have no effect. I must have been using it wrong. If I have one large query with a lot of subqueries, does it have to go at the start of the entire query, or can it be used at the beginning of a subquery? That is, would it work with an overall query structured like this?

(subquery1)|((?s)subquery2)|(subquery3) etc.

Alan Kilborn

@Sylvester-Bullitt

You can turn it on and off as needed. Example:

(?-s)blahblahblah(?s)blahblah(?-s)blahblahblah

or

(?s)blahblahblah(?-s)blahblah(?s)blahblahblah

where blah is an intermediate regular expression sequence.

In this case, only what’s between (?s) and (?-s) in the first case is going to be subject to the . character matching linebreaks. For the second case, it would be everywhere except in the middle.

PeterJones

@Alan-Kilborn said in Find Duplicate HTML Tags with Regex:

You can turn it on and off as needed

You can also turn it on for just the given subexpression using the colon syntax:

(?-s:subquery1)|(?s:subquery2)

https://npp-user-manual.org/docs/searching/#search-modifiers
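A small sketch of the colon syntax, again in Python's re module as a stand-in for the Notepad++ engine (the sample text and the a.c / x.z sub-patterns are made up for illustration). Only the second alternative lets the dot cross a line break:

```python
import re

# Sample text (assumed): "a" and "c" split by a line break,
# "x" and "z" split by a line break, then "abc" on one line.
text = "a\nc x\nz abc"

# (?-s:...) keeps "." from matching "\n" inside the first alternative only;
# (?s:...) lets "." match "\n" inside the second alternative only.
pattern = re.compile(r'(?-s:a.c)|(?s:x.z)')

matches = pattern.findall(text)
print(matches)  # ['x\nz', 'abc']
```

The "a\nc" at the start is skipped because the first alternative is not allowed to cross the line break, while "x\nz" is found by the second.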

Alan Kilborn

@PeterJones

So with my newfound knowledge based on your post, I tried this search regex to solve a realworld problem:

(?-i)(XXX)|(Xxx)|(?i:YYY)

The intent was to match XXX or Xxx exactly or to match any case of YYY. My replace string was:

(?1QQ)(?2ZZ)(?3AA)

and I think that is where the trouble lies. Changing the search expression slightly to:

(?-i)(XXX)|(Xxx)|((?i)YYY)

made it work as intended. Apparently there was no “group #3” with the first search syntax.

It appears that if you use the (?_:____) type of syntax, a capturing group is not formed. Hindsight says it makes sense, but it was a tad bit unexpected, especially since I had the (YYY) part at first, successfully capturing, and just slipped in the ?i: bit later when I determined I needed to ignore case. That’s when the regex gods said “Gotcha, padawan!”
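The group-counting difference can be checked with a short sketch in Python's re module (an analogy only, since Notepad++ uses Boost). One assumption to flag: recent Python versions reject an inline (?i) in the middle of a pattern, so the capturing variant below wraps the scoped form in plain parentheses instead of using ((?i)YYY).

```python
import re

# Scoped-modifier group: groups 1 and 2 exist, but (?i:YYY) is not numbered.
non_capturing = re.compile(r'(XXX)|(Xxx)|(?i:YYY)')

# Wrapping the scoped group in ordinary parentheses creates group 3.
capturing = re.compile(r'(XXX)|(Xxx)|((?i:YYY))')

print(non_capturing.groups)              # 2
print(capturing.groups)                  # 3
print(capturing.search("yyy").group(3))  # yyy
```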

PeterJones

@Alan-Kilborn said in Find Duplicate HTML Tags with Regex:

Apparently there was no “group #3” with the first search syntax.

Yes. As it says in Readability Enhancements, “(?:subset) ⇒ A grouping construct for the subset expression that doesn’t count as a subexpression (doesn’t get numbered or named)”. That was meant to apply to the (?enable-disable:subset) as well, but maybe it’s not clear enough. I’ll think about adding in something more explicit either in that paragraph, or in the Search Modifiers below (or both).

guy038

Hi, @sylvester-bullitt, @alan-kilborn, @peterjones and All,

Some observations about in-line modifiers :

Let’s consider the text :

1 abc
2 abC
3 aBc
4 aBC
5 Abc
6 AbC
7 ABc
8 ABC

The regex (?-i)a((?i)b)c matches lines 1 and 3, and group 1 contains the letter b or B.
Using a non-capturing group, the full syntax of the regex becomes (?-i)a(?:(?i)b)c, which matches the same lines 1 and 3 but, this time, no group is defined.
As a convenient shorthand, in the case of a non-capturing group, the option may appear between the question mark and the colon, giving the alternate syntax (?-i)a(?i:b)c, which produces the same results as the former regex.
Note that, because options are not reset until the end of a subpattern, if that subpattern contains alternatives, an option set in one branch does affect all the subsequent branches!
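For anyone who wants to check the shorthand form outside Notepad++, here is a minimal sketch using Python's re module (not Boost, so only the scoped (?i:b) syntax is shown; the list simply holds the eight sample strings without their line numbers):

```python
import re

# The eight sample strings, in line order 1..8.
lines = ["abc", "abC", "aBc", "aBC", "Abc", "AbC", "ABc", "ABC"]

# Case-sensitive overall; only "b" is matched case-insensitively.
pattern = re.compile(r'a(?i:b)c')

# Collect the 1-based line numbers whose whole text matches.
matched = [n for n, s in enumerate(lines, start=1) if pattern.fullmatch(s)]
print(matched)  # [1, 3]
```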

So, assuming the text :

1 ab
2 aB
3 Ab
4 AB
5 cd
6 cD
7 Cd
8 CD

The regex ((?-i)ab|CD), with group 1, and the regex (?-i:ab|CD), with a non-capturing group, both match lines 1 and 8 only.
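The non-capturing variant can be sketched in Python's re module as follows. Two assumptions: the re.IGNORECASE flag stands in for an unticked "Match case" box, and only the scoped (?-i:...) form is shown, because recent Python versions do not accept an inline (?-i) inside a group the way Boost does.

```python
import re

# The eight sample strings, in line order 1..8.
lines = ["ab", "aB", "Ab", "AB", "cd", "cD", "Cd", "CD"]

# The search is case-insensitive by default here, but the scoped group
# turns case sensitivity back on for BOTH alternatives inside it.
pattern = re.compile(r'(?-i:ab|CD)', re.IGNORECASE)

matched = [n for n, s in enumerate(lines, start=1) if pattern.fullmatch(s)]
print(matched)  # [1, 8]
```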

REMARK :

Note that the previous rule holds even if, when choosing between the different alternatives, the regex engine has not gone through the branch containing the option. For instance, let’s consider the regex (?-i)(a(?i)b|cd)
In this regex, due to the initial (?-i) modifier, the search should be case-sensitive, except for the part (?i)b, which should match the letter b or B. However, this regex matches lines 1 and 2 and lines 5 to 8 in the above text!
For instance, when the regex engine is located right before the first letter of line 8 and chooses the second alternative, in order to process c rather than a, it has not gone through the caseless option applied to the letter b. Still, it does match the uppercase letter C, because the (?i) modifier is carried over into the cd alternative as well!

Best Regards,

guy038