Regex: How can I find those html files with links that are not identical in different places?
-
I have this link at the beginning of an html page:
<link rel="canonical" href="https://xxx.com/en/page-AAA.html" />
I also have another link in the middle of the file:
<a href="https://xxx.com/en/page-AAA.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>
As you can see, these are the same link, but in different contexts and places.
But how can I find those html files where the links in these two places are not identical?
Suppose the first link is instead:
<link rel="canonical" href="https://xxx.com/en/page-CCC.html" />
In this case the two links are not identical, so the regex should find the file that contains the different links. How can I do this with regex?
-
Hi, @robin-Cruise and All,
Let’s suppose you have at least two links of the form
https://xxx.com/en/••••••••••.••••
where the ••••••••••.•••• part is different. Then, the regex
(?s)(https://xxx.com/en/)([^"]+)".+?\1(?!\2").+?"
will match the range between these two links, both included!
Thus, the regex does not match anything if all the https://xxx.com/en/••••••••••.•••• links of the current file have the same ••••••••••.•••• part.
Best Regards,
guy038
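For anyone who wants to run the same check over a folder of files, here is a minimal Python sketch. It assumes that Python's `re` module treats this pattern the same way the editor's regex engine does, that the files are UTF-8, and that they sit directly in one folder; the function names `has_mismatched_links` and `files_with_mismatched_links` are made up for this illustration.

```python
import re
from pathlib import Path

# Pattern copied verbatim from the post above (the dots in the host name are
# left unescaped, as in the original).
MISMATCH = re.compile(r'(?s)(https://xxx.com/en/)([^"]+)".+?\1(?!\2").+?"')


def has_mismatched_links(html: str) -> bool:
    """Return True if a later link under the prefix differs from an earlier one."""
    return MISMATCH.search(html) is not None


def files_with_mismatched_links(folder: str) -> list[str]:
    """Names of the *.html files in `folder` where the pattern finds a match."""
    hits = []
    for path in sorted(Path(folder).glob("*.html")):
        # Assumption: files are UTF-8; undecodable bytes are replaced.
        text = path.read_text(encoding="utf-8", errors="replace")
        if has_mismatched_links(text):
            hits.append(path.name)
    return hits


# Two hypothetical pages built from the links in the question.
same = (
    '<link rel="canonical" href="https://xxx.com/en/page-AAA.html" />\n'
    '<a href="https://xxx.com/en/page-AAA.html"><img src="index_files/flag_lang_en.jpg" /></a>\n'
)
different = same.replace("page-AAA.html", "page-CCC.html", 1)

print(has_mismatched_links(same))       # False
print(has_mismatched_links(different))  # True
```

Calling `files_with_mismatched_links("some_folder")` would then list only the pages whose links disagree.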
-
@guy038 thanks a lot. You are the best!
-
By the way, @guy038, can you explain what this part of your regex does?
\1(?!\2")
-
Hello, @robin-cruise and All,
In the search regex
(?s)(https://xxx.com/en/)([^"]+)".+?\1(?!\2").+?"
- The regex part https://xxx.com/en/ looks for the literal string https://xxx.com/en/, stored as group 1
- The regex part ([^"]+)" represents the remainder of the internet address (for instance, the string page-AAA.html), followed by a double quote, because [^"]+ is a non-null range of consecutive chars, all different from ", stored as group 2
- Now, the part .+? stands for the shortest range of any chars till…
- The group 1 (\1), so another string https://xxx.com/en/
- Which must be followed by .+?", which represents the shortest non-null range of any chars before a double quote…
- But ONLY IF this range is different from \2" (i.e. different, for instance, from the string page-AAA.html and a " char)
Note also that the [^"]+" syntax, without the parentheses, is more restrictive than .+?" and must be preferred because of the negative look-ahead (?!\2")
Best Regards,
guy038
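The breakdown above can be checked piece by piece. The following is only a sketch in Python, on the assumption that its `re` module handles these constructs like the editor's engine; the sample text is hypothetical, and in the second part group 2 is fixed by hand to page-AAA.html so that the look-ahead can be seen in isolation.

```python
import re

# Pattern copied from the post; (?s) lets "." also match line breaks.
pattern = re.compile(r'(?s)(https://xxx.com/en/)([^"]+)".+?\1(?!\2").+?"')

# Hypothetical page fragment with two different links.
text = (
    '<link rel="canonical" href="https://xxx.com/en/page-CCC.html" />'
    '<a href="https://xxx.com/en/page-AAA.html">en</a>'
)

m = pattern.search(text)
print(m.group(1))  # https://xxx.com/en/
print(m.group(2))  # page-CCC.html
print(m.group(0))  # runs from the first link to the closing quote of the second

# The look-ahead alone, with group 2 written out by hand as page-AAA.html.
lookahead = re.compile(r'https://xxx\.com/en/(?!page-AAA\.html")')

# Same link: the look-ahead rejects it, so there is no match.
print(lookahead.search('"https://xxx.com/en/page-AAA.html"') is None)      # True
# Different link: accepted.
print(lookahead.search('"https://xxx.com/en/page-CCC.html"') is None)      # False
# Longer link starting with the same text: accepted, because the double
# quote is part of the look-ahead.
print(lookahead.search('"https://xxx.com/en/page-AAA.html.bak"') is None)  # False
```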