Regex: How to find a duplicate tag on consecutive lines?



  • Hi, in the code below you will see that a closing tag is repeated on two consecutive lines, which is a problem:

      </url>
     </url>
    

    The code:

    <url>
      <loc>https://my-website.com/en/example-love.html</loc>
      <lastmod>2018-11-30T17:19:37+00:00</lastmod>
     <changefreq>weekly</changefreq>
      <priority>0.6400</priority>
    </url>
      </url>
      <url>
      <loc>https://my-website.com/en/my-cat-is-here.html</loc>
      <lastmod>2018-11-30T17:19:37+00:00</lastmod>
     <changefreq>weekly</changefreq>
      <priority>0.6400</priority>
    </url>
    

    I want to find the rss.xml file that contains a duplicate </url> on consecutive lines.

    I don’t know why my regex is not working:

    FIND: (?-i:</url>|(?!\A)\G)(?s-i:(?!<url>).)*?\K(?-i:(?!</url>))
    Replace by: (Leave empty)



  • Hi, @robin-cruise,

    The correct regex is :

    (?-i:</url>|(?!\A)\G)(?s-i:(?!<url>).)*?\K(?-i:</url>)

    Indeed, as your search regex ends with \K(?-i:(?!</url>)), it looks for an empty string that is not followed by </url> ( with this exact case ). Not what you expected !
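
As a rough illustration of that point, here is a minimal Python sketch. It is a simplification, not the original regex: Python's `re` module has no \K or \G, so only the trailing negative look-ahead is kept, and the sample text is made up for the demonstration.

```python
import re

# Made-up sample: two closing tags on consecutive lines.
text = "</url>\n</url>\n"

# Only the tail of the original regex: a negative look-ahead consumes nothing,
# so it matches the empty string wherever "</url>" does NOT start.
broken_tail = re.compile(r'(?!</url>)')

# The corrected tail matches the tag itself.
fixed_tail = re.compile(r'</url>')

empty_hits = [(m.start(), m.group(0)) for m in broken_tail.finditer(text)]
tag_hits = [(m.start(), m.group(0)) for m in fixed_tail.finditer(text)]

print(empty_hits)  # zero-length matches, none of them at a tag position
print(tag_hits)    # the two tags, at offsets 0 and 7
```

Every hit of the first pattern is zero-length, which is why a replace with an empty string has nothing to remove.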


    Now, why do you bother to use this complicated regex syntax ?

    Simply use this shorter regex S/R :

    SEARCH (</url>)\s+\K\1

    REPLACE Leave EMPTY

    Best Regards,

    guy038



  • @guy038 said in Regex: How to find a duplicate tag on consecutive lines?:

    Now, why do you bother to use this complicated regex syntax ?

    I suspect the reason for that is the OP just pasted some regex because our policy to get (repeat) help here is to show-us-what-you-tried-first.

    And currently, OP is our champion of write-my-regex-for-me.

    So it has been paste-anything-because-that-ticks-the-box-and-now-I-will-get-the-help-I-need.



  • Hello, @alan-kilborn and All,

    Alan, I’m not fooled and I know for a fact that some people are just showing us a regular expression, loosely deduced from a previous regex, already posted on the forum, serving as an alibi for a supposed search !

    But, well: I had the opportunity, in one short reply, both to correct the OP’s version and to give the solution to Robin’s question, which was clearly stated !

    BR

    guy038



  • @guy038 said in Regex: How to find a duplicate tag on consecutive lines?:

    (</url>)\s+\K\1

    Hello @guy038, if I run this search and replace, I get an empty line between the tags, such as:

    </url>
      
      <url>
    

    How can I modify the two regexes you gave, so that no empty line is left after the replacement?



  • Hi, @robin-cruise,

    Yes, I made two little mistakes but you could have found the problem by yourself !

    • First I forgot the (?-i) syntax, at the very beginning of the regex, to ensure that the whole process is case sensitive !

    • Secondly, my previous regex (</url>)\s+\K\1 first searches for the </url> string, according to the match case option, then some whitespace characters ( so any non-empty run of space, tab and/or End of Line chars ), then the </url> string again, according to the match case option, too. But I forgot to also match the line-break chars after the second </url>, which can be achieved with the \R syntax

    Thus, the exact regex to use is (?-i)(</url>)\s+\K\1\R and, of course, click on the Replace All button, only ( not on the Replace one, due to the presence of the \K construction )
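
For anyone who wants to run the same clean-up outside Notepad++, here is a minimal Python sketch of the idea. It is an adaptation, not the regex above: Python's `re` has no \K or \R, so the part to keep is captured in group 1 and the line break is spelled out. The names `DUPLICATE_CLOSE`, `has_duplicate_close` and `drop_duplicate_close` are hypothetical, and matching is case sensitive by default, like (?-i).

```python
import re

# Group 1 = first </url> plus the whitespace after it (what \K would keep).
# The rest = the duplicate </url> and its line break (what \R would match).
DUPLICATE_CLOSE = re.compile(r'(</url>\s+)</url>[ \t]*(?:\r\n|\n|\r)')


def has_duplicate_close(xml_text: str) -> bool:
    """Tell whether a </url> is directly followed by another </url>."""
    return DUPLICATE_CLOSE.search(xml_text) is not None


def drop_duplicate_close(xml_text: str) -> str:
    """Remove the second </url> of each consecutive pair, with its line break."""
    return DUPLICATE_CLOSE.sub(r'\1', xml_text)


sample = (
    "    <priority>0.6400</priority>\n"
    "    </url>\n"
    "      </url>\n"
    "      <url>\n"
)
cleaned = drop_duplicate_close(sample)
print(cleaned)
```

As with the Notepad++ version, the indentation of the removed line stays in front of the next <url>, but no empty line is left behind. `has_duplicate_close` could be called on the content of each rss.xml file to spot the affected ones.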

    Cheers,

    guy038



  • @guy038 said in Regex: How to find a duplicate tag on consecutive lines?:

    (?-i)(</url>)\s+\K\1\R

    Thanks. But can you tell me when I can use that \1 in the FIND field? I usually use it in the replace field, not in FIND.

    ok, I know that \1 refers to the first bracket (</url>). But if I make a regex search, something like (<loc>).*\1 it doesn’t work, seems that notepad++ cannot find the text.

    I also tried (<loc>).*|\1\R

    so, when exactly can I use \1 or \2 in the FIND option?



  • Hello, @Robin-Cruise and All,

    Ah, interesting point !

    Let’s work against this simple line, pasted in a new tab :

    ----abc12345abc----def67890def----abc12345def----def67890abc----
    
    • The regex (?-i)(abc)\d+\1|(def)\d+\2 correctly matches the strings abc12345abc and def67890def :

      • The group 1 is the string abc, re-used in the search regex as the back-reference \1, in its first alternative (abc)\d+\1

      • The group 2 is the string def, re-used in the search regex as the back-reference \2, in its second alternative (def)\d+\2

      • Each back-reference, \1 and \2, is used inside the same alternative that defines its group, so it is always set when it is evaluated

    • The almost identical regex (?-i)(abc)\d+\1|(def)\d+\1 just matches the string abc12345abc ! Why ?

      • The group 1 is the string abc, re-used in the search regex as the back-reference \1, in its first alternative (abc)\d+\1

      • The group 2 is the string def, not re-used in this search regex

      • When trying the second alternative (def)\d+\1, the back-reference \1 points to group 1 ( the string abc ), which is set only when the first alternative takes part in the match. Here group 1 is unset, so this second alternative will never match !

    • However, note that :

      • The regex (?i)(def)\d+\1, used on its own, does match the string def67890def

      • The regex (?i)(def)\d+\2, used on its own, produces the message Invalid regular expression, as no group 2 is defined in this regex

    • Now, let’s use the \K syntax in the regex (?-i)(abc)\d+\1|(def)\d+\2.+\Kabc\d+\2

      • The group 1 is the string abc, re-used in the search regex as the back-reference \1, in its first alternative (abc)\d+\1

      • The group 2 is the string def, re-used in the search regex as the back-reference \2, in its second alternative (def)\d+\2.+\Kabc\d+\2

      • The first alternative, (abc)\d+\1, as above, matches the string abc12345abc

      • The second alternative, (def)\d+\2.+\Kabc\d+\2 first matches the def67890def string and defines the group 2 ( def ) then, further on, after reset of search by the \K feature, matches the string abc12345def because the groups defined, before \K, are still defined after \K !

    • Finally, let’s consider the regex (?-i)(abc)\d+\1|(def)\d+\2|abc\d+\2|def\d+\1, composed of 4 alternatives

      • Only the first two alternatives match, successively, the two strings abc12345abc and def67890def

      • In the last two alternatives, the back-references \2 and \1 refer to a group which is not set when the third / fourth alternative is tried, so these alternatives never match !

    • Now, if, instead of the back-references \1 and \2, we use the subroutine calls (?1) and (?2), which represent the exact regexes in groups 1 and 2 ( so the strings abc and def ), we get the regex (?-i)(abc)\d+\1|(def)\d+\2|abc\d+(?2)|def\d+(?1) which does match the four strings : abc12345abc, def67890def, abc12345def and def67890abc

      • As you can see, subroutine calls (?n), where n is the group number, are defined by the regex engine before executing any alternative and, thus, are defined in any part of the overall regex where they occur. You may even create a regex containing a subroutine call to a group which comes later ! For instance, the regex (?1):(?1):(\d\d) would match the string 03:05:45 or 11:52:17. Just compare with the regex \1:\1:(\d\d) which is totally invalid !
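
The alternatives above can be checked outside Notepad++ too. The following is a minimal Python sketch, with two caveats: Python's `re` is a different engine from the one in Notepad++, although it treats these back-references the same way, and it has no \K or subroutine calls. The \K example is therefore left out, and (?1) / (?2) are imitated by repeating the sub-pattern text, which is only an approximation. The helper `matches` and the names `G1` and `G2` are made up for the demonstration.

```python
import re

line = "----abc12345abc----def67890def----abc12345def----def67890abc----"


def matches(pattern: str) -> list[str]:
    """Return every full match of `pattern` in the sample line."""
    return [m.group(0) for m in re.finditer(pattern, line)]


# Each back-reference sits in the alternative that sets its group.
both = matches(r'(abc)\d+\1|(def)\d+\2')

# \1 in the second alternative refers to a group that is unset there.
only_abc = matches(r'(abc)\d+\1|(def)\d+\1')

# Four alternatives; the last two back-reference unset groups.
still_two = matches(r'(abc)\d+\1|(def)\d+\2|abc\d+\2|def\d+\1')

# Stand-in for (?1) and (?2): repeat the sub-pattern text itself.
G1, G2 = r'abc', r'def'
all_four = matches(rf'({G1})\d+\1|({G2})\d+\2|abc\d+(?:{G2})|def\d+(?:{G1})')

# A back-reference to a group that does not exist is rejected at compile time.
try:
    re.compile(r'(def)\d+\2')
    group_2_error = None
except re.error as exc:
    group_2_error = str(exc)

print(both, only_abc, still_two, all_four, group_2_error, sep="\n")
```

The first and third lists hold abc12345abc and def67890def, the second holds abc12345abc only, and the fourth holds all four strings.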

    Back to your questions, @robin-cruise, you said :

    But if I make a regex search, something like (<loc>).*\1 it doesn’t work, seems that notepad++ cannot find the text.

    Sorry, but it does work !

    Try the regex (?s)(<loc>).*\1 against this text :

    <loc>
    bla
    bla
    blah
    <loc>
    

    As for the (<loc>).*|\1\R regex, obviously, only the first alternative can match. Indeed, the second alternative \1\R contains a back-reference \1 to group 1, which is not defined in THIS alternative !

    Finally, your (<loc>).*|\1\R regex is simply equivalent to one of the two forms, depending on whether the dot matches line-break chars or not :

    • (?s)(<loc>).*

    • (?-s)(<loc>).*
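
Both remarks can be reproduced with a minimal Python sketch, under the assumption that Python's `re` is an acceptable stand-in for the Notepad++ engine here. Since `re` has no \R, \r?\n is used in its place, and the variable names are made up for the demonstration.

```python
import re

text = "<loc>\nbla\nbla\nblah\n<loc>\n"

# With (?s) the dot also matches line breaks, so .* can reach the second <loc>.
spanning = re.search(r'(?s)(<loc>).*\1', text)

# Without it, .* stops at the end of the first line and \1 never matches.
single_line = re.search(r'(<loc>).*\1', text)

# The second alternative back-references a group that is unset in it,
# so it never contributes a match: both lists come out identical.
with_dead_branch = [m.group(0) for m in re.finditer(r'(<loc>).*|\1\r?\n', text)]
without_branch = [m.group(0) for m in re.finditer(r'(<loc>).*', text)]

print(repr(spanning.group(0)))
print(single_line)
print(with_dead_branch == without_branch)
```

So a search like (<loc>).*\1 finds nothing when the two <loc> strings sit on different lines and the dot does not match line breaks.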

    Best Regards,

    guy038



  • @guy038 great answer, thanks !

