Regex: How to find a duplicate tag on consecutive lines?

Robin Cruise

Hi, in the code below you will see that the same closing tag appears on two consecutive lines, which is a problem:

  </url>
 </url>

The code:

<url>
  <loc>https://my-website.com/en/example-love.html</loc>
  <lastmod>2018-11-30T17:19:37+00:00</lastmod>
 <changefreq>weekly</changefreq>
  <priority>0.6400</priority>
</url>
  </url>
  <url>
  <loc>https://my-website.com/en/my-cat-is-here.html</loc>
  <lastmod>2018-11-30T17:19:37+00:00</lastmod>
 <changefreq>weekly</changefreq>
  <priority>0.6400</priority>
</url>

I want to find the rss.xml file that contains a duplicate </url> on consecutive lines.

I don’t know why my regex is not working:

FIND: (?-i:</url>|(?!\A)\G)(?s-i:(?!<url>).)*?\K(?-i:(?!</url>))
Replace by: (Leave empty)

guy038

Hi, @robin-cruise,

The correct regex is :

(?-i:</url>|(?!\A)\G)(?s-i:(?!<url>).)*?\K(?-i:</url>)

Indeed, as your search regex ends with \K(?-i:(?!</url>)), you are looking for an empty string which is not followed by </url>, with this exact case. Not what you expected !
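
To see why that ending can only ever yield an empty match, here is a minimal sketch in Python. It assumes Python's re module, which is not the Boost engine used by Notepad++ and has neither \K nor \G, so only the final lookahead is illustrated. The sample string is made up for the illustration.

```python
import re

sample = "</url>\n</url>"

# A negative lookahead is zero-width: it tests what follows but consumes nothing.
empty = re.search(r"(?!</url>)", sample)

# Matching the literal tag instead consumes the six characters of the tag.
literal = re.search(r"</url>", sample)

print(repr(empty.group(0)), empty.span())
print(repr(literal.group(0)), literal.span())
```

The lookahead version succeeds at the first position where </url> does not follow, and the matched text is empty, so there is nothing to replace.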

Now, why do you bother to use this complicated regex syntax ?

Simply use this shorter regex S/R :

SEARCH (</url>)\s+\K\1

REPLACE Leave EMPTY
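
For readers who want to check a file outside Notepad++, the detection part of this search can be sketched in Python. This is only an approximation: Python's re module has no \K, so the sketch just reports whether a duplicate closing tag exists. The helper name has_duplicate_close and the two sample strings are hypothetical.

```python
import re

# Same idea as (</url>)\s+\1 : a closing tag, some whitespace, then the same tag again.
DUPLICATE = re.compile(r"(</url>)\s+\1")

def has_duplicate_close(text):
    """Return True if two </url> tags are separated only by whitespace."""
    return DUPLICATE.search(text) is not None

broken = "<url>\n  <loc>a</loc>\n</url>\n  </url>\n  <url>\n  <loc>b</loc>\n</url>\n"
clean = "<url>\n  <loc>a</loc>\n</url>\n  <url>\n  <loc>b</loc>\n</url>\n"

print(has_duplicate_close(broken))
print(has_duplicate_close(clean))
```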

Best Regards,

guy038

Alan Kilborn

@guy038 said in Regex: How to find a duplicate tag on consecutive lines?:

Now, why do you bother to use this complicated regex syntax ?

I suspect the reason for that is the OP just pasted some regex because our policy to get (repeat) help here is to show-us-what-you-tried-first.

And currently, OP is our champion of write-my-regex-for-me.

So it has been paste-anything-because-that-ticks-the-box-and-now-I-will-get-the-help-I-need.

guy038

Hello, @alan-kilborn and All,

Alan, I’m not fooled: I know for a fact that some people just show us a regular expression, loosely derived from a previous regex already posted on the forum, as an alibi for a supposed attempt !

But, well, in one short reply I had the opportunity both to correct the OP’s version and to give a solution to Robin’s question, correctly expressed !

BR

guy038

Robin Cruise

@guy038 said in Regex: How to find a duplicate tag on consecutive lines?:

(</url>)\s+\K\1

Hello @guy038, if I run this search and replace, I get an empty line between the tags, such as:

</url>
  
  <url>

How can I modify the two regexes you made so that no empty line is left after the replacement?

guy038

Hi, @robin-cruise,

Yes, I made two little mistakes but you could have found the problem by yourself !

First, I forgot the (?-i) syntax at the very beginning of the regex, which ensures that the whole search is case sensitive !
Secondly, my previous regex (</url>)\s+\K\1 first searches for the </url> string, according to the match case option, then some whitespace characters ( so any non-empty run of space, tab and/or End of Line chars ), then the </url> string again, also according to the match case option. But I forgot to match the line-break chars after the second </url>, which can be achieved with the \R syntax

Thus, the exact regex to use is (?-i)(</url>)\s+\K\1\R and, of course, click on the Replace All button, only ( not on the Replace one, due to the presence of the \K construction )
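
The difference between the two versions can be reproduced with a rough Python sketch. It rests on some assumptions: Python's re module has no \K, so the part before \K is captured and written back in the replacement, and it has no \R, so (?:\r\n|\n|\r) stands in for the common line breaks. Python matching is case sensitive by default, which corresponds to (?-i). The sample text is a shortened, made-up version of the sitemap above.

```python
import re

text = "<url>\n  <loc>a</loc>\n</url>\n  </url>\n  <url>\n  <loc>b</loc>\n</url>\n"

# First version: the duplicate tag is removed but its line break stays behind.
without_eol = re.sub(r"(</url>)(\s+)\1", r"\1\2", text)

# Corrected version: the line break after the duplicate tag is removed as well.
with_eol = re.sub(r"(</url>)(\s+)\1(?:\r\n|\n|\r)", r"\1\2", text)

print(repr(without_eol))
print(repr(with_eol))
```

In the first result a whitespace-only line remains between </url> and <url>, which is the empty line reported above. In the second result that line is gone.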

Cheers,

guy038

Robin Cruise

@guy038 said in Regex: How to find a duplicate tag on consecutive lines?:

(?-i)(</url>)\s+\K\1\R

Thanks. But can you tell me when I can use \1 in the FIND field? I usually use it in the Replace field, not in FIND.

ok, I know that \1 refers to the first bracket (</url>). But if I make a regex search, something like (<loc>).*\1 it doesn’t work, seems that notepad++ cannot find the text.

I also tried (<loc>).*|\1\R

So, when exactly can I use \1 or \2 in the FIND field?

guy038

Hello, @Robin-Cruise and All,

Ah, interesting point !

Let’s work against this simple line, pasted in a new tab :

----abc12345abc----def67890def----abc12345def----def67890abc----

The regex (?-i)(abc)\d+\1|(def)\d+\2 correctly matches the strings abc12345abc and def67890def :
- The group 1 is the string abc, re-used in the search regex as the back-reference \1, in its first alternative (abc)\d+\1
- The group 2 is the string def, re-used in the search regex as the back-reference \2, in its second alternative (def)\d+\2
- Each back-reference to a group, \1 and \2, is totally defined when using each alternative
The almost identical regex (?-i)(abc)\d+\1|(def)\d+\1 just matches the string abc12345abc ! Why ?
- The group 1 is the string abc, re-used in the search regex as the back-reference \1, in its first alternative (abc)\d+\1
- The group 2 is the string def, not re-used in this search regex
- When trying the second alternative (def)\d+\1, the regex engine does not know the back-reference \1 which refers to group 1 ( the string abc ), defined in the first alternative. Thus, this second alternative will never match !
However, note that :
- The regex (?i)(def)\d+\1, only, would normally match the string def67890def
- The regex (?i)(def)\d+\2, only, outputs the message Invalid regular expression, as no group 2 is defined in this regex
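
The same experiment can be repeated in Python as a quick check. Note the assumption: this uses Python's re module rather than the Boost engine of Notepad++, but on these particular patterns it behaves as described above, since a back-reference to a group that did not take part in the match simply fails. The helper name matches is hypothetical.

```python
import re

line = "----abc12345abc----def67890def----abc12345def----def67890abc----"

def matches(pattern):
    """Return the full text of every match of pattern in line."""
    return [m.group(0) for m in re.finditer(pattern, line)]

both = matches(r"(abc)\d+\1|(def)\d+\2")
first_only = matches(r"(abc)\d+\1|(def)\d+\1")

# A back-reference to a group number that does not exist is rejected at compile time.
try:
    re.compile(r"(def)\d+\2")
    compiled = True
except re.error:
    compiled = False

print(both)
print(first_only)
print(compiled)
```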
Now, let’s use the \K syntax in the regex (?-i)(abc)\d+\1|(def)\d+\2.+\Kabc\d+\2
- The group 1 is the string abc, re-used in the search regex as the back-reference \1, in its first alternative (abc)\d+\1
- The group 2 is the string def, re-used in the search regex as the back-reference \2, in its second alternative (def)\d+\2.+\Kabc\d+\2
- The first alternative, (abc)\d+\1, as above, matches the string abc12345abc
- The second alternative, (def)\d+\2.+\Kabc\d+\2 first matches the def67890def string and defines the group 2 ( def ) then, further on, after reset of search by the \K feature, matches the string abc12345def because the groups defined, before \K, are still defined after \K !
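
A hedged Python approximation of this \K example: Python's re module has no \K, so the text that Notepad++ would report after \K is captured in an extra group 3 instead. This is a sketch of the idea, not the same regex.

```python
import re

line = "----abc12345abc----def67890def----abc12345def----def67890abc----"

# Group 3 plays the role of "everything after \K" in the second alternative.
pattern = re.compile(r"(abc)\d+\1|(def)\d+\2.+(abc\d+\2)")

# For the first alternative group 3 is unset, so the whole match is reported.
reported = [m.group(3) or m.group(0) for m in pattern.finditer(line)]

print(reported)
```

Group 2 is still set when the engine reaches the \2 inside group 3, which is the point made above about groups defined before \K.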
Finally, let’s consider the regex (?-i)(abc)\d+\1|(def)\d+\2|abc\d+\2|def\d+\1, composed of 4 alternatives
- Only the first two alternatives match, successively, the two strings abc12345abc and def67890def
- In the last two alternatives, the back-references \1 and \2 refer to a group which is not defined when the third / fourth alternative is executed !
Now, if, instead of the back-references \1 and \2, we use the subroutine calls (?1) and (?2), which represent the exact regexes in groups 1 and 2 ( so the strings abc and def ), we get the regex (?-i)(abc)\d+\1|(def)\d+\2|abc\d+(?2)|def\d+(?1) which does match the four strings : abc12345abc, def67890def, abc12345def and def67890abc
- As you can see, subroutine calls of the form (?N), where N is a group number, are resolved by the regex engine before executing any alternative and, thus, are defined in any part of the overall regex where they occur. You may even create a regex containing a subroutine call to a group which comes later ! For instance, the regex (?1):(?1):(\d\d) would match the string 03:05:45 or 11:52:17. Just compare with the regex \1:\1:(\d\d), which is totally invalid !
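
Python's standard re module does not support subroutine calls such as (?1), so the sketch below only imitates them by pasting the sub-pattern text where the call would be. That is an assumption made for illustration, not how Notepad++ does it, but it shows the key point: a subroutine call re-uses the pattern of a group, not the text the group matched. The names G1, G2 and two_digits are hypothetical.

```python
import re

line = "----abc12345abc----def67890def----abc12345def----def67890abc----"

G1 = "abc"  # pattern of group 1
G2 = "def"  # pattern of group 2

# (?:...) with the pasted pattern stands in for the calls (?2) and (?1).
pattern = re.compile(rf"({G1})\d+\1|({G2})\d+\2|{G1}\d+(?:{G2})|{G2}\d+(?:{G1})")
found = [m.group(0) for m in pattern.finditer(line)]

# Imitation of (?1):(?1):(\d\d) : the pattern of the later group is re-used twice.
two_digits = r"\d\d"
clock = re.compile(rf"(?:{two_digits}):(?:{two_digits}):({two_digits})")

# A back-reference placed before its group is rejected by Python as well.
try:
    re.compile(r"\1:\1:(\d\d)")
    forward_backref_ok = True
except re.error:
    forward_backref_ok = False

print(found)
print(clock.fullmatch("03:05:45") is not None, forward_backref_ok)
```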

Back to your questions, @robin-cruise, you said :

But if I make a regex search, something like (<loc>).*\1 it doesn’t work, seems that notepad++ cannot find the text.

Sorry, but it does work !

Try the regex (?s)(<loc>).*\1 against this text :

<loc>
bla
bla
blah
<loc>
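
A small Python check of the same test, assuming Python's re module as a stand-in for the Notepad++ engine. It suggests that the dot matching line breaks, or not, decides whether the back-reference can be reached on this text.

```python
import re

text = "<loc>\nbla\nbla\nblah\n<loc>\n"

# With (?s) the dot also matches line breaks, so .* can span several lines.
dot_all = re.search(r"(?s)(<loc>).*\1", text)

# Without it, .* stops at the end of the line and \1 is never reached here.
dot_line = re.search(r"(<loc>).*\1", text)

print(dot_all.span())
print(dot_line)
```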

As for the (<loc>).*|\1\R regex, obviously, only the first alternative can match. Indeed, the second alternative \1\R contains a back-reference \1 to group 1, which is not defined in THIS alternative !

Finally, your (<loc>).*|\1\R regex is simply equivalent to one of the two following forms, depending on whether the dot matches line-break chars :

(?s)(<loc>).*
(?-s)(<loc>).*
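
This equivalence can be checked with a short Python sketch, under two assumptions: Python's re module is used instead of the Notepad++ engine, and \n stands in for \R, which Python does not have. The helper name spans is hypothetical.

```python
import re

text = "<loc>\nbla\nbla\nblah\n<loc>\n"

def spans(pattern):
    """Return the (start, end) positions of every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]

# Dot matches line breaks: one single match covering the whole text.
same_with_s = spans(r"(?s)(<loc>).*|\1\n") == spans(r"(?s)(<loc>).*")

# Dot stops at line breaks: one match per line containing <loc>.
same_without_s = spans(r"(<loc>).*|\1\n") == spans(r"(<loc>).*")

print(same_with_s, same_without_s)
print(spans(r"(<loc>).*"))
```

In both modes the second alternative never contributes a match, because group 1 is not set when that alternative is tried.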

Best Regards,

guy038

Robin Cruise

@guy038 great answer, thanks !