Regex: How can I find those html files with links that are not identical in different places?

  • I have this link at the beginning of an HTML page:

    <link rel="canonical" href="" />

    I also have another link in the middle of the file:

    <a href=""><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>

    As you can see, both are the same link, but in different contexts and places.

    But how can I find those html files with links that are not identical in those different places?

    Suppose the first link were instead: <link rel="canonical" href="" />. In that case the two links are no longer identical, so the regex should find that file, since it contains different links.

    How can I do this with Regex?

  • Hi, @robin-Cruise and All,

    Let’s suppose you have at least two links of the form ••••••••••.••••, where the part ••••••••••.•••• is different.

    Then the regex (?s)([^"]+)".+?\1(?!\2").+?" will match the whole range from the first of these links to the second, both links included!

    Thus, the regex matches nothing if all the ••••••••••.•••• links in the current file have the same ••••••••••.•••• part.
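
    As a rough illustration of that behaviour, here is a minimal sketch using Python's re module rather than the editor's own search. The prefix https://example.com/ and the helper name has_mismatched_links are assumptions made for the example, since the real address is not shown in this thread.

```python
import re

# Hypothetical shared prefix; the real address is not shown in the thread.
PREFIX = "https://example.com/"

# Same shape as the regex above, with the assumed prefix as group 1.
PATTERN = re.compile(
    r'(?s)(' + re.escape(PREFIX) + r')([^"]+)".+?\1(?!\2").+?"'
)


def has_mismatched_links(html: str) -> bool:
    """Return True if two links sharing PREFIX end in different remainders."""
    return PATTERN.search(html) is not None


same = (
    '<link rel="canonical" href="https://example.com/page-AAA.html" />\n'
    '<a href="https://example.com/page-AAA.html"><img src="flag.jpg" /></a>'
)
# Change only the canonical link so the two addresses differ.
different = same.replace('page-AAA.html" />', 'page-BBB.html" />')

print(has_mismatched_links(same))       # identical links
print(has_mismatched_links(different))  # differing links
```

    Under these assumptions, the first call should report no match and the second should report a match.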

    Best Regards,


  • @guy038 thanks a lot. You are the best !

  • By the way, @guy038, can you explain what this part of your regex does?


  • Hello, @robin-cruise and All,

    In the search regex (?s)([^"]+)".+?\1(?!\2").+?" :

    • The regex part looks for the literal string, stored as group 1

    • The regex part ([^"]+)" matches the remainder of the internet address (for instance, the string page-AAA.html), stored as group 2, followed by a double quote: [^"]+ is a non-empty run of consecutive characters, all different from "

    • Now, the part .+? stands for the shortest range of any char till…

      • The group 1 ( \1 ), i.e. another occurrence of the same literal string

      • Which must be followed by .+?", which represents the shortest non-null range of any char before a double-quote

      • But ONLY IF this range is different from \2 ( i.e. different, for instance, from the string page-AAA.html and a " char )

    Note also that the [^"]+" syntax, without the parentheses, is more restrictive than .+?" and must be preferred because of the negative look-ahead (?!\2")
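
    The breakdown above can be sketched as an annotated pattern. This is only an illustration in Python's re module, with re.DOTALL standing in for (?s); the prefix https://example.com/ is an assumption, because the real literal string is not shown in this thread.

```python
import re

# Hypothetical prefix standing in for the address missing from the thread.
ANNOTATED = re.compile(
    r'''
    (https://example\.com/)   # group 1: the literal prefix (assumed)
    ([^"]+)"                  # group 2: remainder of the address, then a quote
    .+?                       # shortest run of any characters up to
    \1                        # the next occurrence of the prefix
    (?!\2")                   # only if NOT followed by the same remainder and quote
    .+?"                      # the differing remainder, up to its closing quote
    ''',
    re.DOTALL | re.VERBOSE,
)

sample = (
    '<link rel="canonical" href="https://example.com/page-AAA.html" />\n'
    '<a href="https://example.com/page-BBB.html">en</a>'
)

m = ANNOTATED.search(sample)
if m:
    print("group 1:", m.group(1))
    print("group 2:", m.group(2))
    print("whole match:", m.group(0))
```

    In this sketch the whole match runs from the start of the first address to the closing quote of the second one, which mirrors the "range between the two links" described earlier.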

    Best Regards,

