Regex: How can I find those html files with links that are not identical in different places?
-
I have this link at the beginning of an html page:
<link rel="canonical" href="https://xxx.com/en/page-AAA.html" />
I also have another link in the middle of the file:
<a href="https://xxx.com/en/page-AAA.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>
You can see that these are the same link, but in different contexts and places.
But how can I find the html files where the links in those two places are not identical?
Suppose the first link is:
<link rel="canonical" href="https://xxx.com/en/page-CCC.html" />
In this case the two links are not identical, so the regex should find the file that contains the different links. How can I do this with regex?
-
Hi, @robin-Cruise and All,
Let’s suppose you have, at least, two links of the form https://xxx.com/en/••••••••••.•••• , where the ••••••••••.•••• part is different.
Then, the regex
(?s)(https://xxx.com/en/)([^"]+)".+?\1(?!\2").+?"
will match the range between these two links, included!
Thus, the regex does not match anything if all the https://xxx.com/en/••••••••••.•••• links of the current file have the same ••••••••••.•••• part.
Best Regards,
guy038
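
As a quick check of the regex above outside the editor, here is a minimal Python sketch. It assumes that Python's re module handles this pattern the same way as the editor's regex engine, which should be the case for the constructs used here (the (?s) flag, back-references and a negative look-ahead). The function name has_mismatched_links and the two sample strings are made up for illustration, based on the snippets in the question.

```python
import re

# The regex from the post above, unchanged.
# (?s) lets "." match line breaks too, so the two links may sit on different lines.
PATTERN = re.compile(r'(?s)(https://xxx.com/en/)([^"]+)".+?\1(?!\2").+?"')

# Hypothetical sample: canonical link and language link point to the same page.
same = '''<link rel="canonical" href="https://xxx.com/en/page-AAA.html" />
<a href="https://xxx.com/en/page-AAA.html"><img src="index_files/flag_lang_en.jpg" /></a>'''

# Hypothetical sample: the canonical link points to a different page.
different = '''<link rel="canonical" href="https://xxx.com/en/page-CCC.html" />
<a href="https://xxx.com/en/page-AAA.html"><img src="index_files/flag_lang_en.jpg" /></a>'''


def has_mismatched_links(html: str) -> bool:
    """Return True if the regex finds two links whose page parts differ."""
    return PATTERN.search(html) is not None


print(has_mismatched_links(same))       # False
print(has_mismatched_links(different))  # True
```

A file is reported only when the regex matches, that is, when at least two of its https://xxx.com/en/ links differ.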
-
@guy038 Thanks a lot. You are the best!
-
By the way, @guy038, can you explain what this part of your regex does?
\1(?!\2")
-
Hello, @robin-cruise and All,
In the search regex (?s)(https://xxx.com/en/)([^"]+)".+?\1(?!\2").+?" :
- The regex part https://xxx.com/en/ looks for the literal string https://xxx.com/en/, stored as group 1
- The regex part ([^"]+)" represents the remainder of the internet address (for instance, the string page-AAA.html), followed by a double quote, because [^"]+ is a non-null range of consecutive chars, all different from " , stored as group 2
- Now, the part .+? stands for the shortest range of any char till…
- The contents of group 1 ( \1 ). So another string https://xxx.com/en/
- Which must be followed by .+?" , which represents the shortest non-null range of any char before a double quote…
- But ONLY IF this range is different from \2 followed by a double quote (i.e. different, for instance, from the string page-AAA.html and a " char)
Note also that the [^"]+" syntax, without the parentheses, is more restrictive than .+?" and must be preferred because of the negative look-ahead (?!\2")
Best Regards,
guy038
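
To see the groups described above in action, and to run the check over a whole folder, here is a hedged Python sketch. It swaps the trailing .+?" for [^"]+" , which is only one reading of the closing note, not a confirmed correction. The function names first_mismatch and files_with_mismatched_links are hypothetical, as is the assumption that the files are UTF-8 *.html files in a single folder.

```python
import re
from pathlib import Path

# Same regex as in the thread, except that the trailing .+?" is replaced
# by [^"]+" (an assumption about what the closing note recommends).
PATTERN = re.compile(r'(?s)(https://xxx.com/en/)([^"]+)".+?\1(?!\2")[^"]+"')


def first_mismatch(html):
    """Return (group 1, group 2) of the first match, or None if all links agree."""
    m = PATTERN.search(html)
    # group 1 is the fixed prefix, group 2 the page part of the first link.
    return (m.group(1), m.group(2)) if m else None


def files_with_mismatched_links(folder):
    """Return the names of the *.html files in folder where the regex matches."""
    hits = []
    for path in sorted(Path(folder).glob("*.html")):
        # errors="replace" keeps the scan going on files with odd encodings.
        text = path.read_text(encoding="utf-8", errors="replace")
        if PATTERN.search(text):
            hits.append(path.name)
    return hits
```

Calling files_with_mismatched_links on a folder path returns the names of the files worth opening. For a file whose canonical link is page-CCC.html and whose language link is page-AAA.html, first_mismatch returns the prefix https://xxx.com/en/ and the page part page-CCC.html.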
-