Regex: How to find a duplicate tag on consecutive lines?
-
Hi, in the code below, you will see that two lines are repeated one after the other, which is a problem:
</url>
</url>
The code:
<url>
<loc>https://my-website.com/en/example-love.html</loc>
<lastmod>2018-11-30T17:19:37+00:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.6400</priority>
</url>
</url>
<url>
<loc>https://my-website.com/en/my-cat-is-here.html</loc>
<lastmod>2018-11-30T17:19:37+00:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.6400</priority>
</url>
I want to find out the rss.xml file that contains a duplicate </url> on consecutive lines.
I don’t know why my regex is not working:
FIND:
(?-i:</url>|(?!\A)\G)(?s-i:(?!<url>).)*?\K(?-i:(?!</url>))
Replace by: (Leave empty)
-
Hi, @robin-cruise,
The correct regex is :
(?-i:</url>|(?!\A)\G)(?s-i:(?!<url>).)*?\K(?-i:</url>)
Indeed, as your search regex ends with \K(?-i:(?!</url>)), you are looking for an empty string that is not followed by </url> with this exact case. Not what you expected!
Now, why do you bother to use this complicated regex syntax ?
Simply use this shorter regex S/R :
SEARCH (</url>)\s+\K\1
REPLACE Leave EMPTY
Best Regards,
guy038
-
@guy038 said in Regex: How to find a duplicate tag on consecutive lines?:
Now, why do you bother to use this complicated regex syntax ?
I suspect the reason for that is the OP just pasted some regex because our policy to get (repeat) help here is to show-us-what-you-tried-first.
And currently, OP is our champion of write-my-regex-for-me.
So it has been paste-anything-because-that-ticks-the-box-and-now-I-will-get-the-help-I-need.
-
Hello, @alan-kilborn and All,
Alan, I’m not fooled and I know for a fact that some people are just showing us a regular expression, loosely deduced from a previous regex, already posted on the forum, serving as an alibi for a supposed search !
But, well: I had the opportunity, in one short reply, both to correct the OP’s version and to give the solution to Robin’s question, correctly expressed!
BR
guy038
-
@guy038 said in Regex: How to find a duplicate tag on consecutive lines?:
(</url>)\s+\K\1
Hello @guy038. If I run the search and replace, I will have an empty line between tags. Such as:
</url>

<url>
How can I modify those 2 regexes you made, so as not to have an empty line after replacement?
-
Hi, @robin-cruise,
Yes, I made two little mistakes but you could have found the problem by yourself !
- First, I forgot the (?-i) syntax at the very beginning of the regex, which ensures that the whole process is case sensitive!
- Secondly, my previous regex (</url>)\s+\K\1 first searches for the </url> string, according to the Match case option, then some whitespace characters (any non-empty run of space, tab and/or end-of-line chars), then the </url> string again, according to the Match case option, too. But I forgot to search for the line-break chars after the second </url>, which can be achieved with the \R syntax.
Thus, the exact regex to use is
(?-i)(</url>)\s+\K\1\R
and, of course, click on the Replace All button only (not on the Replace one, due to the presence of the \K construction).
Cheers,
guy038
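For readers who want to check the logic outside Notepad++, here is a minimal Python sketch of the same clean-up. It is a loose translation under stated assumptions, not the Notepad++ regex itself: Python's re module supports neither \K nor \R, so the text before the duplicate is captured and written back, and the line break is spelled out. The shortened sample text and the names DUPLICATE_CLOSE and drop_duplicate_close are made up for illustration.

```python
import re

# Python's re has no \K or \R, so keep the first </url> and the whitespace
# after it in group 1, then match the duplicate </url> plus its line break.
# Matching is case sensitive by default, like (?-i) in the Notepad++ regex.
DUPLICATE_CLOSE = re.compile(r"(</url>\s+)</url>(?:\r\n|\n|\r)")


def drop_duplicate_close(text):
    """Remove a second </url> that directly follows a first one."""
    return DUPLICATE_CLOSE.sub(r"\1", text)


sample = (
    "<url>\n"
    "<loc>https://my-website.com/en/example-love.html</loc>\n"
    "</url>\n"
    "</url>\n"
    "<url>\n"
    "<loc>https://my-website.com/en/my-cat-is-here.html</loc>\n"
    "</url>\n"
)

cleaned = drop_duplicate_close(sample)
print(cleaned)
```

Because the line break after the duplicate is part of the match, no empty line is left behind between </url> and the next <url>.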
-
@guy038 said in Regex: How to find a duplicate tag on consecutive lines?:
(?-i)(</url>)\s+\K\1\R
Thanks. But can you tell me when I can use \1 in the FIND option? I usually use it in the replace section, not in FIND.
OK, I know that \1 refers to the first bracket (</url>). But if I make a regex search, something like (<loc>).*\1 it doesn’t work, seems that notepad++ cannot find the text.
I also tried (<loc>).*|\1\R
So, when exactly can I use \1 or \2 in the FIND option?
-
Hello, @Robin-Cruise and All,
Ah, interesting point !
Let’s work against this simple line, pasted in a new tab :
----abc12345abc----def67890def----abc12345def----def67890abc-----
The regex (?-i)(abc)\d+\1|(def)\d+\2 correctly matches the strings abc12345abc and def67890def:
- Group 1 is the string abc, re-used in the search regex as the back-reference \1, in its first alternative (abc)\d+\1
- Group 2 is the string def, re-used in the search regex as the back-reference \2, in its second alternative (def)\d+\2
- Each back-reference to a group, \1 and \2, is fully defined within the alternative that uses it
The almost identical regex (?-i)(abc)\d+\1|(def)\d+\1 matches only the string abc12345abc! Why?
- Group 1 is the string abc, re-used in the search regex as the back-reference \1, in its first alternative (abc)\d+\1
- Group 2 is the string def, not re-used in this search regex
- When trying the second alternative (def)\d+\1, the regex engine does not know the back-reference \1, which refers to group 1 (the string abc), defined in the first alternative. Thus, this second alternative will never match!
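The two regexes above can be tried outside Notepad++ as well. The following is only a sketch using Python's re module, which is a different engine from the one in Notepad++, but it treats a back-reference to a group that did not take part in the match the same way: the alternative fails. The helper name matches is hypothetical.

```python
import re

line = "----abc12345abc----def67890def----abc12345def----def67890abc-----"


def matches(pattern):
    """Return the full text of every match of pattern in the sample line."""
    return [m.group(0) for m in re.finditer(pattern, line)]


# Each alternative refers back to the group it defines itself.
both = matches(r"(abc)\d+\1|(def)\d+\2")

# The second alternative refers to group 1, which is unset when it is tried.
only_first = matches(r"(abc)\d+\1|(def)\d+\1")

print(both)
print(only_first)
```

The first list holds abc12345abc and def67890def, the second only abc12345abc, in line with the explanation above.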
However, note that:
- The regex (?i)(def)\d+\1, on its own, would normally match the string def67890def
- The regex (?i)(def)\d+\2, on its own, outputs the message Invalid regular expression, as no group 2 is defined in this regex
Now, let’s use the \K syntax in the regex (?-i)(abc)\d+\1|(def)\d+\2.+\Kabc\d+\2
- Group 1 is the string abc, re-used in the search regex as the back-reference \1, in its first alternative (abc)\d+\1
- Group 2 is the string def, re-used in the search regex as the back-reference \2, in its second alternative (def)\d+\2.+\Kabc\d+\2
- The first alternative, (abc)\d+\1, as above, matches the string abc12345abc
- The second alternative, (def)\d+\2.+\Kabc\d+\2, first matches the def67890def string and defines group 2 (def); then, further on, after the \K feature resets the start of the match, it matches the string abc12345def, because the groups defined before \K are still defined after \K!
Finally, let’s consider the regex (?-i)(abc)\d+\1|(def)\d+\2|abc\d+\2|def\d+\1, composed of 4 alternatives
- Only the first two alternatives match, successively, the two strings abc12345abc and def67890def
- In the last two alternatives, the back-references \1 and \2 refer to a group that is not defined when the third or fourth alternative is executed!
Now, if, instead of the back-references \1 and \2, we use the subroutine calls (?1) and (?2), which represent the exact regexes in groups 1 and 2 (so the strings abc and def), we get the regex (?-i)(abc)\d+\1|(def)\d+\2|abc\d+(?2)|def\d+(?1), which does match the four strings: abc12345abc, def67890def, abc12345def and def67890abc
- As you can see, subroutine calls of the form (?n), where n is a group number, are defined by the regex engine before executing any alternative and, thus, are defined in any part of the overall regex where they occur. You may even create a regex containing a subroutine call to a group which comes later! For instance, the regex (?1):(?1):(\d\d) would match the string 03:05:45 or 11:52:17. Just compare with the regex \1:\1:(\d\d), which is totally invalid!
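Python's re module has no (?1) subroutine syntax, so the sketch below only imitates the idea: the sub-pattern is pasted in again as a pattern, instead of referring to the text a group captured. The names G1, G2 and TWO_DIGITS are hypothetical helpers, and this is an approximation of the behaviour described above, not the Notepad++ engine.

```python
import re

line = "----abc12345abc----def67890def----abc12345def----def67890abc-----"

# Back-references only: alternatives 3 and 4 refer to groups that are unset.
backrefs = r"(abc)\d+\1|(def)\d+\2|abc\d+\2|def\d+\1"

# Stand-in for (?1) and (?2): reuse the sub-patterns themselves,
# not the text they captured.
G1, G2 = "abc", "def"
reused = rf"({G1})\d+\1|({G2})\d+\2|abc\d+{G2}|def\d+{G1}"

found_backrefs = [m.group(0) for m in re.finditer(backrefs, line)]
found_reused = [m.group(0) for m in re.finditer(reused, line)]

# Same idea for (?1):(?1):(\d\d): the pattern \d\d is reused three times,
# so the three numbers may differ.
TWO_DIGITS = r"\d\d"
clock = re.compile(rf"{TWO_DIGITS}:{TWO_DIGITS}:({TWO_DIGITS})")

# A back-reference instead demands the same captured text each time.
same_text = re.compile(r"(\d\d):\1:\1")

print(found_backrefs)
print(found_reused)
print(bool(clock.fullmatch("03:05:45")), bool(same_text.fullmatch("03:05:45")))
```

Reusing the pattern finds all four strings, while the back-reference version finds only the first two. Likewise 03:05:45 is accepted when \d\d is reused as a pattern, but not when the same captured digits are required three times.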
Back to your questions, @robin-cruise, you said :
But if I make a regex search, something like (<loc>).*\1 it doesn’t work, seems that notepad++ cannot find the text.
Sorry, but it does work !
Try the regex (?s)(<loc>).*\1 against this text:
<loc> bla bla blah <loc>
As for the (<loc>).*|\1\R regex, obviously, only the first alternative can match. Indeed, the second alternative \1\R contains a back-reference \1 to group 1, which is not defined in THIS alternative!
Finally, your (<loc>).*|\1\R regex is simply equivalent to one of the two forms:
- (?s)(<loc>).*
- (?-s)(<loc>).*
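Both points can be checked with a short Python sketch. The multi-line sample text is an assumption made for illustration, and \n stands in for \R, which Python's re module does not support.

```python
import re

# Hypothetical sample: two <loc> tags separated by several lines.
text = "<loc>\nbla\nbla\nblah\n<loc>"

# With (?s) the dot also matches line breaks, so .* can reach the second <loc>.
with_dotall = re.search(r"(?s)(<loc>).*\1", text)

# Without it, .* stops at the end of the line and \1 is never reached.
without_dotall = re.search(r"(<loc>).*\1", text)

# The second alternative can never match: group 1 is unset when it is tried,
# so only the first alternative produces matches.
alternation = re.compile(r"(<loc>).*|\1\n")
alt_matches = [m.group(0) for m in alternation.finditer(text)]

print(with_dotall is not None, without_dotall is not None)
print(alt_matches)
```

So when the two tags sit on different lines, whether (<loc>).*\1 finds anything depends on whether the dot is allowed to match line breaks.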
Best Regards,
guy038
-
@guy038 great answer, thanks !