Find Duplicate HTML Tags with Regex

8 Posts 4 Posters 884 Views
  • S
    Sylvester Bullitt
    last edited by Jan 25, 2020, 7:58 PM

    I’m trying to write a regex to find duplicate HTML tag IDs. Here’s the expression I’m working on. It works only if the duplicated IDs are on the same line.

    ((id=".+").+\g-1)
    

    Here’s the HTML I’m running it against:

    <ol>
    <li id="duplicate-id">Testing <span id="duplicate-id">regex</span> to find duplicate HTML tag ID’s.</li>
    <li>It finds the duplicates in the next two list elements (even though they’re on the same line)</li>
    <li>Duplicate tags in two elements but <em>on one line</em>. First duplicate <span id="another-duplicate">here</span>.</li><li>Then a second one <span id="another-duplicate">here</span></li>
    <li>It does <em>not</em> find the duplicate ID’s in the <em>next</em> two elements <em>on separate lines</em>.</li>
    <li>First duplicate <span id="another-duplicate">here</span></li>
    <li>And, after a line break, a <span id="another-duplicate">second</span> duplicate</li>
    </ol>
    
    

    Notes:

    • I include the outer enclosing parentheses because I want to append this to a larger regex once it works.

    • I need to check over 10,000 files, so an online HTML checker is impractical.

    The Boost documentation says the dot character (just before \g in the above regex) includes line breaks by default. But I seem to be getting the opposite behavior.

    I’m using Notepad++ 7.8.2 (64-bit) on Windows 10.

    Can anyone see what I’m doing wrong?
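As an illustration of the goal described above, here is a minimal sketch in Python's `re` module, which is an assumption of this example and not the Boost engine that Notepad++ uses. Python does not accept `\g-1` inside a pattern, so the sketch uses `\1` instead. It also swaps `.+` inside the quotes for `[^"]+`, since a greedy `.+` can run past the closing quote of the attribute. The sample text is shortened from the HTML above, with a hypothetical `unique-id` line added. A second, different technique (counting every `id` value) is included because it reports all duplicates rather than only the first pair.

```python
import re
from collections import Counter

HTML = """<ol>
<li id="duplicate-id">Testing <span id="duplicate-id">regex</span></li>
<li>First duplicate <span id="another-duplicate">here</span></li>
<li>And, after a line break, a <span id="another-duplicate">second</span> duplicate</li>
<li id="unique-id">No duplicate here</li>
</ol>"""

# Back-reference form, close to the original expression: [^"]+ keeps the
# captured ID inside one attribute, and (?s:...) lets the dot cross line breaks.
BACKREF = re.compile(r'(id="[^"]+")(?s:.+?)\1')

# Counting form: collect every id value, then report those seen more than once.
ID_ATTR = re.compile(r'\bid="([^"]+)"')


def duplicate_ids(text):
    """Return the sorted id values that occur more than once in text."""
    counts = Counter(ID_ATTR.findall(text))
    return sorted(name for name, n in counts.items() if n > 1)


first_pair = BACKREF.search(HTML)
print(first_pair.group(1))
print(duplicate_ids(HTML))
```

For a large batch of files, the counting form could be called once per file; how the files are read and listed is left out of this sketch.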

    • A
      Alan Kilborn @Sylvester Bullitt
      last edited by Jan 25, 2020, 9:47 PM

      @Sylvester-Bullitt

      Not looking too hard at your specific problem, but just at the “.” and “line breaks” part.

      It is going to depend upon the . matches newline box setting. If unticked, it won’t match linebreaks. If ticked, it will.

      Alternatively, you can ignore that checkbox completely and include (?s) at the start of your regex to allow linebreak matching, and (?-s) to disallow it.

      Hope this helps…
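The effect of the `s` flag can be sketched outside Notepad++ as well. The example below assumes Python's `re` module, where the dot does not match a line break unless `(?s)` (or `re.DOTALL`) is set; the sample text and pattern are made up for the demonstration.

```python
import re

TEXT = 'id="x" first line\nsecond line id="x"'

# Without the s flag the dot stops at the line break, so no match is found.
no_dotall = re.search(r'(id="[^"]+").+\1', TEXT)

# With a leading (?s) the dot also matches "\n", so the match spans both lines.
with_dotall = re.search(r'(?s)(id="[^"]+").+\1', TEXT)

print(no_dotall)
print(with_dotall.group(0) == TEXT)
```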

      • S
        Sylvester Bullitt
        last edited by Jan 26, 2020, 7:07 PM

        Thanks for replying, Alan. I can’t use the “matches newline” check box in my scenario, because this is one of many queries combined into a large consolidated query, and it would cause the other subqueries to fail.

        I tried the (?s) option, but it seemed to have no effect. I must have been using it wrong. If I have one large query with a lot of subqueries, does it have to go at the start of the entire query, or can it be used at the beginning of a subquery? That is, would it work with an overall query structured like this?

        (subquery1)|((?s)subquery2)|(subquery3) etc.
        
        • A
          Alan Kilborn @Sylvester Bullitt
          last edited by Jan 26, 2020, 7:39 PM

          @Sylvester-Bullitt

          You can turn it on and off as needed. Example:

          (?-s)blahblahblah(?s)blahblah(?-s)blahblahblah

          or

          (?s)blahblahblah(?-s)blahblah(?s)blahblahblah

          where blah is an intermediate regular expression sequence.

          In this case, only what’s between (?s) and (?-s) in the first case is going to be subject to the . character matching linebreaks. For the second case, it would be everywhere except in the middle.

          • P
            PeterJones @Alan Kilborn
            last edited by PeterJones Jan 26, 2020, 8:07 PM Jan 26, 2020, 8:06 PM

            @Alan-Kilborn said in Find Duplicate HTML Tags with Regex:

            You can turn it on and off as needed

            You can also turn it on for just the given subexpression using the colon syntax:

            (?-s:subquery1)|(?s:subquery2)
            

            https://npp-user-manual.org/docs/searching/#search-modifiers
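The colon syntax can be tried in Python too, which is an assumption of this sketch and not the Boost engine. Python supports the scoped forms `(?s:...)` and `(?-s:...)`, but recent versions reject a bare `(?s)` in the middle of a pattern, so only the scoped form is shown. The sample text and the `a.b` / `c.d` subpatterns are made up for the demonstration.

```python
import re

TEXT = "a\nb c\nd"

# re.DOTALL stands in for a ticked ". matches newline" box (an assumption
# made for this demo); each group then overrides it for its own branch only.
pattern = re.compile(r'(?s:a.b)|(?-s:c.d)', re.DOTALL)

# Only the first branch may cross a line break, so "c\nd" is not matched.
matches = pattern.findall(TEXT)
print(matches)
```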

            • A
              Alan Kilborn @PeterJones
              last edited by Alan Kilborn Jan 30, 2020, 9:01 PM Jan 30, 2020, 8:59 PM

              @PeterJones

              So with my newfound knowledge based on your post, I tried this search regex to solve a realworld problem:

              (?-i)(XXX)|(Xxx)|(?i:YYY)

              The intent was to match XXX or Xxx exactly or to match any case of YYY. My replace string was:

              (?1QQ)(?2ZZ)(?3AA)

              and I think that is where the trouble lies. Changing the search expression slightly to:

              (?-i)(XXX)|(Xxx)|((?i)YYY)

              made it work as intended. Apparently there was no “group #3” with the first search syntax.

              It appears that if you use the (?_:____) type of syntax, a capturing group is not formed. Hindsight says it makes sense, but it was a tad bit unexpected, especially since I had the (YYY) part at first, successfully capturing, and just slipped in the ?i: bit later when I determined I needed to ignore case. That’s when the regex gods said “Gotcha, padawan!”
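The missing group can be checked by counting capturing groups. This is a sketch in Python's `re` module, not Notepad++: Python is case-sensitive by default, so the leading `(?-i)` is dropped, and `((?i)YYY)` is written as `((?i:YYY))` because recent Python versions reject a bare `(?i)` in mid-pattern. Python's `re.sub` has no `(?1QQ)` conditional replacement, so the `REPLACEMENTS` table and `replace` helper are hypothetical stand-ins for it.

```python
import re

# (?i:YYY) scopes the flag but does not capture, so only two groups exist.
scoped_only = re.compile(r'(XXX)|(Xxx)|(?i:YYY)')

# Wrapping the scoped group in plain parentheses restores group 3.
scoped_and_captured = re.compile(r'(XXX)|(Xxx)|((?i:YYY))')

print(scoped_only.groups)
print(scoped_and_captured.groups)

# Hypothetical stand-in for the (?1QQ)(?2ZZ)(?3AA) replacement string:
# pick the replacement by the number of the group that matched.
REPLACEMENTS = {1: "QQ", 2: "ZZ", 3: "AA"}


def replace(match):
    return REPLACEMENTS[match.lastindex]


result = scoped_and_captured.sub(replace, "XXX Xxx yYy xxx")
print(result)
```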

              • P
                PeterJones
                last edited by Jan 30, 2020, 9:13 PM

                @Alan-Kilborn said in Find Duplicate HTML Tags with Regex:

                Apparently there was no “group #3” with the first search syntax.

                Yes. As it says in Readability Enhancements, “(?:subset) ⇒ A grouping construct for the subset expression that doesn’t count as a subexpression (doesn’t get numbered or named)”. That was meant to apply to the (?enable-disable:subset) form as well, but maybe it’s not clear enough. I’ll think about adding something more explicit either in that paragraph, or in the Search Modifiers section below it (or both).

                • G
                  guy038
                  last edited by guy038 Jan 31, 2020, 1:45 AM Jan 31, 2020, 1:39 AM

                  Hi, @sylvester-bullitt, @alan-kilborn, @peterjones and All,

                  Some observations about in-line modifiers :

                  Let’s consider the text :

                  1 abc
                  2 abC
                  3 aBc
                  4 aBC
                  5 Abc
                  6 AbC
                  7 ABc
                  8 ABC
                  
                  • The regex (?-i)a((?i)b)c matches lines 1 and 3, and group 1 contains the letter b or B

                  • Using a non-capturing group, the full syntax of the regex becomes (?-i)a(?:(?i)b)c and matches the same lines 1 and 3 but, this time, no group is defined

                  • As a convenient shorthand, in case of a non-capturing group, the option may appear between the question mark and the colon, giving the alternate syntax (?-i)a(?i:b)c, which produces the same results as the former regex

                  • Note that, because options are not reset until the end of a subpattern, if that subpattern contains alternatives, an option setting in one branch does affect all the subsequent branches !

                  So, assuming the text :

                  1 ab
                  2 aB
                  3 Ab
                  4 AB
                  5 cd
                  6 cD
                  7 Cd
                  8 CD
                  
                  • The regex ((?-i)ab|CD), with group 1, and the regex (?-i:ab|CD), with a non-capturing group, both match lines 1 and 8 only
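The line numbers quoted for the scoped forms can be reproduced with a short sketch. It assumes Python's `re` module, not the Boost engine, and uses only the colon syntax, because recent Python versions reject a bare `(?i)` or `(?-i)` in mid-pattern. The `matching_lines` helper is hypothetical, and each sample word is tested on its own without its line number.

```python
import re

WORDS_1 = ["abc", "abC", "aBc", "aBC", "Abc", "AbC", "ABc", "ABC"]
WORDS_2 = ["ab", "aB", "Ab", "AB", "cd", "cD", "Cd", "CD"]


def matching_lines(pattern, words, flags=0):
    """Return the 1-based numbers of the words the pattern matches in full."""
    regex = re.compile(pattern, flags)
    return [n for n, word in enumerate(words, start=1) if regex.fullmatch(word)]


# Only the b is case-insensitive; a and c must be lowercase.
print(matching_lines(r'a(?i:b)c', WORDS_1))

# re.IGNORECASE stands in for an unticked "Match case" box; the scoped
# (?-i:...) then makes both alternatives case-sensitive.
print(matching_lines(r'(?-i:ab|CD)', WORDS_2, re.IGNORECASE))
```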

                  REMARK :

                  • Note that the previous rule still holds even if, when choosing between the different alternatives, the regex engine has not yet encountered the option. For instance, let’s consider the regex (?-i)(a(?i)b|cd)

                  • In this regex, due to the initial (?-i) modifier, the search should be case-sensitive, except for the part (?i)b, which should match the letter b or B. However, this regex matches lines 1 and 2 and lines 5 to 8 in the above text !

                  • For instance, when the regex engine is located right before the first letter of line 8 and chooses the second alternative, in order to process c rather than a, it has not seen the caseless option applied to the letter b. Still, it does match the uppercase letter C, because the (?i) modifier is carried over into the cd alternative as well !

                  Best Regards,

                  guy038
