Regex: How to find a duplicate tag on consecutive lines?
-
hi, in the code below, you will see that two lines are repeated one after the other, which is a problem:
</url>
</url>
The code:
<url>
<loc>https://my-website.com/en/example-love.html</loc>
<lastmod>2018-11-30T17:19:37+00:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.6400</priority>
</url>
</url>
<url>
<loc>https://my-website.com/en/my-cat-is-here.html</loc>
<lastmod>2018-11-30T17:19:37+00:00</lastmod>
<changefreq>weekly</changefreq>
<priority>0.6400</priority>
</url>
I want to find the rss.xml file that contains a duplicate </url> on consecutive lines. I don’t know why my regex is not working:
FIND:
(?-i:</url>|(?!\A)\G)(?s-i:(?!<url>).)*?\K(?-i:(?!</url>))
REPLACE BY: (leave empty)
-
Hi, @robin-cruise,
The correct regex is :
(?-i:</url>|(?!\A)\G)(?s-i:(?!<url>).)*?\K(?-i:</url>)
Indeed, your search regex ends with
\K(?-i:(?!</url>))
which means that you are looking for an empty string that is not followed by </url>, with this exact case. Not what you expected!
Now, why do you bother to use this complicated regex syntax ?
Simply use this shorter regex S/R :
SEARCH
(</url>)\s+\K\1
REPLACE
Leave EMPTY
Best Regards,
guy038
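For anyone who needs to check many files outside Notepad++, the same idea can be sketched in Python. This is only a sketch under assumptions: Python's standard re module has no \K, so the pattern below simply matches both tags, and the function name has_duplicate_close_tag and the shortened sample are made up for illustration.

```python
import re

# Same idea as (</url>)\s+\K\1, minus \K (not supported by Python's re):
# a closing tag, some whitespace, then the very same tag again.
# Python's re is case-sensitive by default, like (?-i).
DUPLICATE_CLOSE = re.compile(r"(</url>)\s+\1")


def has_duplicate_close_tag(text: str) -> bool:
    """Return True if two </url> tags are separated only by whitespace."""
    return DUPLICATE_CLOSE.search(text) is not None


# Shortened, hypothetical sample modelled on the sitemap in the question.
sample = """<url>
<loc>https://my-website.com/en/example-love.html</loc>
</url>
</url>
<url>
<loc>https://my-website.com/en/my-cat-is-here.html</loc>
</url>
"""

print(has_duplicate_close_tag(sample))
```

Reading each candidate file and passing its content to the function would be enough to list the files that contain the duplicate.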
-
@guy038 said in Regex: How to find a duplicate tag on consecutive lines?:
Now, why do you bother to use this complicated regex syntax ?
I suspect the reason for that is the OP just pasted some regex because our policy to get (repeat) help here is to show-us-what-you-tried-first.
And currently, OP is our champion of write-my-regex-for-me.
So it has been paste-anything-because-that-ticks-the-box-and-now-I-will-get-the-help-I-need.
-
Hello, @alan-kilborn and All,
Alan, I’m not fooled, and I know for a fact that some people just show us a regular expression, loosely derived from a previous regex already posted on the forum, serving as an alibi for a supposed attempt!
But, well: it gave me the opportunity, in a single short reply, to correct the OP’s version and to give the solution to Robin’s question, correctly expressed!
BR
guy038
-
@guy038 said in Regex: How to find a duplicate tag on consecutive lines?:
(</url>)\s+\K\1
Hello @guy038. If I run the search and replace, I am left with an empty line between the tags, such as:
</url>

<url>
How can I modify those 2 regexes you made so that no empty line is left after the replacement?
-
Hi, @robin-cruise,
Yes, I made two little mistakes, but you could have found the problem by yourself!
- First, I forgot the (?-i) syntax at the very beginning of the regex, which ensures that the whole search is case-sensitive.
- Secondly, my previous regex
(</url>)\s+\K\1
first searches for the </url> string, according to the Match case option, then for some whitespace characters (any non-empty run of space, tab and/or end-of-line chars), then for the </url> string again, also according to the Match case option. But I forgot to search for the line-break chars after the second </url>, which can be matched with the \R syntax.
Thus, the exact regex to use is
(?-i)(</url>)\s+\K\1\R
and, of course, click on the Replace All button only (not on the Replace one, due to the presence of the \K construction).
Cheers,
guy038
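The corrected search and replace can also be approximated in Python. A minimal sketch, assuming Python's standard re module, which supports neither \K nor \R: the text that \K would have kept is captured and written back, and (?:\r\n|\n|\r) stands in for \R. The function name drop_duplicate_close_tags is hypothetical.

```python
import re

# First </url>, the whitespace after it, the duplicate </url>, then the
# line break that ends the duplicate's line.
DUPLICATE_WITH_EOL = re.compile(r"(</url>)(\s+)\1(?:\r\n|\n|\r)")


def drop_duplicate_close_tags(text: str) -> str:
    """Remove a duplicated </url> line without leaving an empty line."""
    # \1\2 writes back the first tag and its trailing whitespace, which is
    # the part that \K keeps out of the match in the Notepad++ regex.
    return DUPLICATE_WITH_EOL.sub(r"\1\2", text)


before = "<priority>0.6400</priority>\n</url>\n</url>\n<url>\n"
after = drop_duplicate_close_tags(before)
print(after)
```

Because the line break of the duplicate is part of the match, the following <url> moves up directly under the remaining </url>.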
-
-
@guy038 said in Regex: How to find a duplicate tag on consecutive lines?:
(?-i)(</url>)\s+\K\1\R
Thanks. But can you tell me when I can use \1 in the FIND field? I usually use it in the replace section, not in FIND.
OK, I know that \1 refers to the first bracket, (</url>). But if I make a regex search, something like
(<loc>).*\1
it doesn’t work, seems that notepad++ cannot find the text. I also tried
(<loc>).*|\1\R
So, when exactly can I use \1 or \2 in the FIND field?
-
Hello, @Robin-Cruise and All,
Ah, interesting point !
Let’s work against this simple line, pasted in a new tab :
----abc12345abc----def67890def----abc12345def----def67890abc----
- The regex
  (?-i)(abc)\d+\1|(def)\d+\2
  correctly matches the strings abc12345abc and def67890def:
  - Group 1 is the string abc, re-used in the search regex as the back-reference \1, in its first alternative (abc)\d+\1
  - Group 2 is the string def, re-used in the search regex as the back-reference \2, in its second alternative (def)\d+\2
  - Each back-reference, \1 and \2, is fully defined within the alternative that uses it
- The almost identical regex
  (?-i)(abc)\d+\1|(def)\d+\1
  matches only the string abc12345abc! Why?
  - Group 1 is the string abc, re-used in the search regex as the back-reference \1, in its first alternative (abc)\d+\1
  - Group 2 is the string def, not re-used in this search regex
  - When trying the second alternative (def)\d+\1, the back-reference \1 refers to group 1 (the string abc), which is only set by the first alternative. As group 1 has not taken part in the current match attempt, the regex engine does not know \1 and this second alternative will never match!
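The difference between these two regexes can be reproduced outside Notepad++. A small sketch using Python's standard re module, which is a different engine from the one in Notepad++ but treats a back-reference to a group that did not take part in the match the same way: it simply fails. The helper name matches is made up for illustration.

```python
import re

line = "----abc12345abc----def67890def----abc12345def----def67890abc----"


def matches(pattern: str) -> list:
    """Return the full text of every match of pattern in line."""
    return [m.group(0) for m in re.finditer(pattern, line)]


# \1 and \2 are each used in the alternative that sets their group.
both = matches(r"(abc)\d+\1|(def)\d+\2")

# Here \1 is used in an alternative where group 1 never took part.
first_only = matches(r"(abc)\d+\1|(def)\d+\1")

print(both)        # ['abc12345abc', 'def67890def']
print(first_only)  # ['abc12345abc']
```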
- However, note that:
  - The regex (?i)(def)\d+\1, on its own, would normally match the string def67890def
  - The regex (?i)(def)\d+\2, on its own, outputs the message Invalid regular expression, as no group 2 is defined in this regex
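Other engines reject a reference to a non-existent group in a similar way. As an illustration only, here is how Python's standard re module reacts; the exact wording of its error message differs from the Notepad++ one, so the sketch just records that an error was raised.

```python
import re

# Group 1 exists, so this compiles and finds def67890def.
ok = re.search(r"(?i)(def)\d+\1", "----def67890def----")

# Group 2 does not exist anywhere in the pattern, so compiling fails.
try:
    re.compile(r"(?i)(def)\d+\2")
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(ok.group(0))
print(error_message)
```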
- Now, let’s use the \K syntax in the regex
  (?-i)(abc)\d+\1|(def)\d+\2.+\Kabc\d+\2
  - Group 1 is the string abc, re-used in the search regex as the back-reference \1, in its first alternative (abc)\d+\1
  - Group 2 is the string def, re-used in the search regex as the back-reference \2, in its second alternative (def)\d+\2.+\Kabc\d+\2
  - The first alternative, (abc)\d+\1, as above, matches the string abc12345abc
  - The second alternative, (def)\d+\2.+\Kabc\d+\2, first matches the def67890def string and defines group 2 (def). Then, further on, after the \K feature has reset the start of the match, it matches the string abc12345def, because the groups defined before \K are still defined after \K!
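Python's standard re module has no \K, so this case can only be approximated there. In the sketch below, which is an emulation and not the Notepad++ regex itself, the part that would follow \K is captured as an extra group 3; it still shows that \2 stays usable later in the same alternative.

```python
import re

line = "----abc12345abc----def67890def----abc12345def----def67890abc----"

# Group 3 captures what \K would have left as the reported match.
# \2 is used twice inside the second alternative.
pattern = re.compile(r"(abc)\d+\1|(def)\d+\2.+(abc\d+\2)")

# Report group 3 when it took part in the match, else the whole match.
found = [m.group(3) or m.group(0) for m in pattern.finditer(line)]
print(found)  # ['abc12345abc', 'abc12345def']
```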
- Finally, let’s consider the regex
  (?-i)(abc)\d+\1|(def)\d+\2|abc\d+\2|def\d+\1
  composed of 4 alternatives:
  - Only the first two alternatives match, successively, the two strings abc12345abc and def67890def
  - In the last two alternatives, the back-references \2 and \1 refer to a group that is not defined when the third or fourth alternative is executed!
- Now, if, instead of the back-references \1 and \2, we use the subroutine calls (?1) and (?2), which represent the exact regexes in groups 1 and 2 (so the strings abc and def), we get the regex
  (?-i)(abc)\d+\1|(def)\d+\2|abc\d+(?2)|def\d+(?1)
  which does match the four strings: abc12345abc, def67890def, abc12345def and def67890abc
  - As you can see, subroutine calls, (?n), where n is the group number, are defined by the regex engine before executing any alternative and, thus, are defined in any part of the overall regex where they occur. You may even create a regex containing a subroutine call to a group that comes later! For instance, the regex (?1):(?1):(\d\d) would match the string 03:05:45 or 11:52:17. Just compare with the regex \1:\1:(\d\d), which is totally invalid!
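Python's standard re module does not support subroutine calls, but the idea that (?1) re-runs a sub-pattern, instead of re-matching captured text, can be imitated by pasting the sub-pattern wherever the call would appear. The names G1, G2 and DD below are hypothetical helpers for this emulation, not part of any regex syntax.

```python
import re

line = "----abc12345abc----def67890def----abc12345def----def67890abc----"

# Sub-patterns that the subroutine calls (?1) and (?2) would re-run.
G1 = "abc"
G2 = "def"

# (?:...) around the pasted sub-pattern stands in for (?2) and (?1).
pattern = re.compile(
    rf"({G1})\d+\1|({G2})\d+\2|abc\d+(?:{G2})|def\d+(?:{G1})"
)
all_four = [m.group(0) for m in pattern.finditer(line)]

# A forward "call" is just the sub-pattern written out before its group.
DD = r"\d\d"
time_like = re.compile(rf"(?:{DD}):(?:{DD}):({DD})")

print(all_four)
print(bool(time_like.fullmatch("03:05:45")))
```

Since the sub-pattern is available everywhere in the overall regex, the last two alternatives no longer depend on which group took part in the current match attempt.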
Back to your questions, @robin-cruise, you said :
But if I make a regex search, something like (<loc>).*\1 it doesn’t work, seems that notepad++ cannot find the text.
Sorry, but it does work !
Try the regex
(?s)(<loc>).*\1
against this text:
<loc> bla bla blah <loc>
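The role of (?s) is easy to check in any engine that supports it. A quick sketch in Python, assuming, purely for illustration, that the two <loc> strings sit on different lines; the sample text is made up.

```python
import re

# Hypothetical sample: the two <loc> strings are on different lines.
text = "<loc>\nbla\nbla\nblah\n<loc>\n"

# Without (?s) the dot stops at each line break, so \1 is never reached.
single_line = re.search(r"(<loc>).*\1", text)

# With (?s) the dot also matches line breaks.
dot_all = re.search(r"(?s)(<loc>).*\1", text)

print(single_line)
print(dot_all.group(0) == text.rstrip("\n"))
```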
As for the
(<loc>).*|\1\R
regex, obviously, only the first alternative can match. Indeed, the second alternative, \1\R, contains a back-reference \1 to group 1, which is not defined in THIS alternative!
Finally, your
(<loc>).*|\1\R
regex is simply equivalent to one of the two forms, depending on the . matches newline option:
- (?s)(<loc>).*
- (?-s)(<loc>).*
Best Regards,
guy038
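That the second alternative is dead weight can be verified by comparing the matches with and without it. A sketch in Python's standard re module, where (?:\r\n|\n|\r) stands in for \R (not available there) and the sample text is invented:

```python
import re

text = "<loc>first</loc>\n<loc>second</loc>\n"

# Second alternative uses \1 although group 1 is set only by the first one.
with_dead_branch = re.compile(r"(<loc>).*|\1(?:\r\n|\n|\r)")
without_it = re.compile(r"(<loc>).*")

a = [m.group(0) for m in with_dead_branch.finditer(text)]
b = [m.group(0) for m in without_it.finditer(text)]
print(a == b)
```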
-
-
@guy038 great answer, thanks !