
    Find Duplicate HTML Tags with Regex

    8 Posts 4 Posters 864 Views
    • Sylvester Bullitt

      I’m trying to write a regex to find duplicate HTML tag IDs. Here’s the expression I’m working on. It works only if the HTML tags are on the same line.

      ((id=".+").+\g-1)
      

      Here’s the HTML I’m running it against:

      <ol>
      <li id="duplicate-id">Testing <span id="duplicate-id">regex</span> to find duplicate HTML tag ID’s.</li>
      <li>It finds the duplicates in the next two list elements (even though they’re on the same line)</li>
      <li>Duplicate tags in two elements but <em>on one line</em>. First duplicate <span id="another-duplicate">here</span>.</li><li>Then a second one <span id="another-duplicate">here</span></li>
      <li>It does <em>not</em> find the duplicate ID’s in the <em>next</em> two elements <em>on separate lines</em>.</li>
      <li>First duplicate <span id="another-duplicate">here</span></li>
      <li>And, after a line break, a <span id="another-duplicate">second</span> duplicate</li>
      </ol>
      
      

      Notes:

      • I include the outer enclosing parentheses because I want to append this to a larger regex once it works.

      • I need to check over 10,000 files, so an online HTML checker is impractical.

      The Boost documentation says the dot character (just before \g in the above regex) includes line breaks by default. But I seem to be getting the opposite behavior.

      I’m using Notepad++ 7.8.2 (64-bit) on Windows 10.

      Can anyone see what I’m doing wrong?

      • Alan Kilborn @Sylvester Bullitt

        @Sylvester-Bullitt

        Not looking too hard at your specific problem, but just at the “.” and “line breaks” part.

        It is going to depend upon the setting of the “. matches newline” checkbox. If it is unticked, . won’t match line breaks. If it is ticked, it will.

        Alternatively, you can ignore that checkbox completely and include (?s) at the start of your regex to allow linebreak matching, and (?-s) to disallow it.
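
A minimal sketch of the effect of (?s), written in Python rather than Notepad++. Python's `re` module is not the Boost engine that Notepad++ uses, so treat this only as an illustration of the general idea. The pattern is also an assumption: it uses `[^"]+` instead of `.+` inside the quotes so that the ID cannot run past its closing quote, and `\1` instead of `\g-1`.

```python
import re

# Two lines taken from the sample HTML in the first post.
html = (
    '<li>First duplicate <span id="another-duplicate">here</span></li>\n'
    '<li>And, after a line break, a <span id="another-duplicate">second</span> duplicate</li>\n'
)

# Without (?s), "." stops at the line break, so the repeat on line 2 is never reached.
no_dotall = re.search(r'(id="[^"]+").+\1', html)

# With (?s) at the very start, "." also matches "\n".
dotall = re.search(r'(?s)(id="[^"]+").+\1', html)

print(no_dotall)        # None
print(dotall.group(1))  # id="another-duplicate"
```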

        Hope this helps…

        • Sylvester Bullitt

          Thanks for replying, Alan. I can’t use the “. matches newline” check box in my scenario, because this is one of many queries combined into a large consolidated query, and it would cause the other subqueries to fail.

          I tried the (?s) option, but it seemed to have no effect. I must have been using it wrong. If I have one large query with a lot of subqueries, does it have to go at the start of the entire query, or can it be used at the beginning of a subquery? That is, would it work with an overall query structured like this?

          (subquery1)|((?s)subquery2)|(subquery3) etc.
          
          • Alan Kilborn @Sylvester Bullitt

            @Sylvester-Bullitt

            You can turn it on and off as needed. Example:

            (?-s)blahblahblah(?s)blahblah(?-s)blahblahblah

            or

            (?s)blahblahblah(?-s)blahblah(?s)blahblahblah

            where blah is an intermediate regular expression sequence.

            In the first case, only what’s between (?s) and (?-s) is subject to the . character matching line breaks. In the second case, it applies everywhere except in the middle.

            • PeterJones @Alan Kilborn

              @Alan-Kilborn said in Find Duplicate HTML Tags with Regex:

              You can turn it on and off as needed

              You can also turn it on for just the given subexpression using the colon syntax:

              (?-s:subquery1)|(?s:subquery2)
              

              https://npp-user-manual.org/docs/searching/#search-modifiers
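
The colon syntax can be sketched in Python, whose `re` module also accepts scoped modifiers such as `(?s:...)` and `(?-s:...)`. This is an illustration under the assumption that the scoping behaves the same way as in Notepad++'s Boost engine; the two sub-patterns `a.b` and `c.d` are placeholders for subquery1 and subquery2.

```python
import re

# "." inside (?-s:...) never matches a line break; inside (?s:...) it does.
pattern = re.compile(r'(?-s:a.b)|(?s:c.d)')

same_line = pattern.findall("a-b c-d")
split_lines = pattern.findall("a\nb c\nd")

print(same_line)    # ['a-b', 'c-d']
print(split_lines)  # ['c\nd']
```

Only the second alternative matches across the line break, which is the behavior the asker wanted for a single subquery inside a larger alternation.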

              • Alan Kilborn @PeterJones

                @PeterJones

                So with my newfound knowledge based on your post, I tried this search regex to solve a real-world problem:

                (?-i)(XXX)|(Xxx)|(?i:YYY)

                The intent was to match XXX or Xxx exactly or to match any case of YYY. My replace string was:

                (?1QQ)(?2ZZ)(?3AA)

                and I think that is where the trouble lies. Changing the search expression slightly to:

                (?-i)(XXX)|(Xxx)|((?i)YYY)

                made it work as intended. Apparently there was no “group #3” with the first search syntax.

                It appears that if you use the (?_:____) type of syntax, a capturing group is not formed. Hindsight says it makes sense, but it was a tad bit unexpected, especially since I had the (YYY) part at first, successfully capturing, and just slipped in the ?i: bit later when I determined I needed to ignore case. That’s when the regex gods said “Gotcha, padawan!”
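
The missing group can be seen by counting groups in Python. This is a rough sketch, not the Notepad++ behavior itself. Python's `re` rejects a mid-pattern `(?i)` in recent versions, so the working variant is written here as `((?i:YYY))`, and Python has no `(?1QQ)` conditional replacement, so the hypothetical `replace` function stands in for it.

```python
import re

# (?i:YYY) groups without capturing, so only two numbered groups exist.
two_groups = re.compile(r'(XXX)|(Xxx)|(?i:YYY)')
# Wrapping it in ordinary parentheses creates group 3.
three_groups = re.compile(r'(XXX)|(Xxx)|((?i:YYY))')

print(two_groups.groups)    # 2
print(three_groups.groups)  # 3


def replace(match):
    """Stand-in for the replacement string (?1QQ)(?2ZZ)(?3AA)."""
    if match.group(1) is not None:
        return "QQ"
    if match.group(2) is not None:
        return "ZZ"
    return "AA"


result = three_groups.sub(replace, "XXX Xxx yYy")
print(result)  # QQ ZZ AA
```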

                • PeterJones

                  @Alan-Kilborn said in Find Duplicate HTML Tags with Regex:

                  Apparently there was no “group #3” with the first search syntax.

                  Yes. As it says in Readability Enhancements, “(?:subset) ⇒ A grouping construct for the subset expression that doesn’t count as a subexpression (doesn’t get numbered or named)”. That was meant to apply to the (?enable-disable:subset) as well, but maybe it’s not clear enough. I’ll think about adding in something more explicit either in that paragraph, or in the Search Modifiers below (or both).

                  • guy038

                    Hi, @sylvester-bullitt, @alan-kilborn, @peterjones and All,

                    Some observations about in-line modifiers:

                    Let’s consider the text:

                    1 abc
                    2 abC
                    3 aBc
                    4 aBC
                    5 Abc
                    6 AbC
                    7 ABc
                    8 ABC
                    
                    • The regex (?-i)a((?i)b)c matches lines 1 and 3, and group 1 contains the letter b or B

                    • Using a non-capturing group, the full syntax of the regex becomes (?-i)a(?:(?i)b)c. It matches the same lines 1 and 3 but, this time, no group is defined

                    • As a convenient shorthand for a non-capturing group, the option may appear between the question mark and the colon, giving the alternate syntax (?-i)a(?i:b)c, which produces the same results as the former regex

                    • Note that, because options are not reset until the end of a subpattern, if that subpattern contains alternatives, an option setting in one branch does affect all the subsequent branches!

                    So, assuming the text :

                    1 ab
                    2 aB
                    3 Ab
                    4 AB
                    5 cd
                    6 cD
                    7 Cd
                    8 CD
                    
                    • The regex ((?-i)ab|CD), with group 1, and the regex (?-i:ab|CD), with a non-capturing group, both match lines 1 and 8 only
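
The results for the two sample texts above can be checked with a short Python sketch. Python's `re` is a different engine from the one in Notepad++ and does not accept a mid-pattern `(?i)`, so only the scoped `(?i:...)` and `(?-i:...)` forms are used here. The helper `matching_line_numbers` is a hypothetical name, and the leading line numbers of the sample texts are dropped.

```python
import re

lines_1 = ["abc", "abC", "aBc", "aBC", "Abc", "AbC", "ABc", "ABC"]
lines_2 = ["ab", "aB", "Ab", "AB", "cd", "cD", "Cd", "CD"]


def matching_line_numbers(pattern, lines):
    """Return the 1-based numbers of the lines in which the pattern is found."""
    return [n for n, line in enumerate(lines, start=1) if pattern.search(line)]


# Case-sensitive overall; only the letter b is matched caselessly.
only_b_caseless = re.compile(r'a(?i:b)c')

# Caseless overall, but both alternatives in the group are case-sensitive.
whole_group_sensitive = re.compile(r'(?-i:ab|CD)', re.IGNORECASE)

print(matching_line_numbers(only_b_caseless, lines_1))        # [1, 3]
print(matching_line_numbers(whole_group_sensitive, lines_2))  # [1, 8]
```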

                    REMARK:

                    • Note that the previous rule still holds even if, when choosing between the different alternatives, the regex engine has not yet seen the later option. For instance, let’s consider the regex (?-i)(a(?i)b|cd)

                    • In this regex, due to the initial (?-i) modifier, the search should be case-sensitive, except for the part (?i)b, which should match the letter b or B. However, this regex matches lines 1 and 2 and lines 5 to 8 in the above text!

                    • For instance, when the regex engine is positioned right before the first letter of line 8 and chooses the second alternative, in order to process c rather than a, it has not encountered the caseless option applied to the letter b. Still, it does match the uppercase letter C, because the (?i) modifier is carried on into the cd alternative as well!

                    Best Regards,

                    guy038
