    Regex: How can I find those html files with links that are not identical in different places?

      Robin Cruise

      I have this link at the beginning of an HTML page:

      <link rel="canonical" href="https://xxx.com/en/page-AAA.html" />

      I also have another link in the middle of the file:

      <a href="https://xxx.com/en/page-AAA.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>

      As you can see, both contain the same URL, but in different contexts and places.

      How can I find the HTML files in which the links in those two places are not identical?

      Suppose the first link were <link rel="canonical" href="https://xxx.com/en/page-CCC.html" /> instead. The two links would then no longer be identical, so the regex should find the file that contains the differing links.

      How can I do this with a regex?

        guy038

        Hi, @robin-Cruise and All,

        Let’s suppose a file contains at least two links of the form https://xxx.com/en/••••••••••.••••, whose ••••••••••.•••• parts differ.

        Then the regex (?s)(https://xxx.com/en/)([^"]+)".+?\1(?!\2").+?" matches the range from the first of these links through the second, both included.

        Thus, the regex matches nothing if all the https://xxx.com/en/••••••••••.•••• links in the current file have the same ••••••••••.•••• part.
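
For anyone who wants to run the same check outside the editor, here is a minimal sketch in Python, assuming its `re` module handles this pattern the same way (it supports `(?s)`, backreferences and negative look-aheads). The names `has_mismatched_links` and `find_mismatched_files` are hypothetical helpers invented for this illustration, and the sample strings are shortened versions of the two links from the question.

```python
import re
from pathlib import Path

# The regex from the post, unchanged. (?s) lets "." match newlines too,
# so the two links may sit on different lines of the file.
MISMATCH = re.compile(r'(?s)(https://xxx.com/en/)([^"]+)".+?\1(?!\2").+?"')


def has_mismatched_links(html: str) -> bool:
    """Return True if two https://xxx.com/en/ links in the text differ."""
    return MISMATCH.search(html) is not None


def find_mismatched_files(folder: str) -> list[str]:
    """Hypothetical helper: names of *.html files in folder with differing links."""
    hits = []
    for path in sorted(Path(folder).glob("*.html")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if has_mismatched_links(text):
            hits.append(path.name)
    return hits


# Both links point to page-AAA.html: nothing should be reported.
same = (
    '<link rel="canonical" href="https://xxx.com/en/page-AAA.html" />\n'
    '<a href="https://xxx.com/en/page-AAA.html"><img src="flag_lang_en.jpg" /></a>\n'
)
# Only the first (canonical) link is changed to page-CCC.html.
different = same.replace("page-AAA.html", "page-CCC.html", 1)

print(has_mismatched_links(same))       # False
print(has_mismatched_links(different))  # True
```

In the editor itself, the equivalent is to run the regex through Find in Files in Regular expression mode; the script is only an alternative for batch checks.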

        Best Regards,

        guy038

            Robin Cruise @guy038

            @guy038 Thanks a lot. You are the best!

              Robin Cruise

              By the way, @guy038, can you explain what this part of your regex does?

              \1(?!\2")

                guy038

                Hello, @robin-cruise and All,

                In the search regex (?s)(https://xxx.com/en/)([^"]+)".+?\1(?!\2").+?" :

                • The regex part (https://xxx.com/en/) looks for the literal string https://xxx.com/en/ and stores it as group 1

                • The regex part ([^"]+)" matches the remainder of the internet address (for instance, the string page-AAA.html), followed by a double quote: [^"]+ is a non-empty run of consecutive characters, all different from ", and it is stored as group 2

                • Now, the part .+? stands for the shortest range of any characters up to…

                  • Group 1 ( \1 ), so another occurrence of the string https://xxx.com/en/

                  • Which must be followed by .+?", the shortest non-empty range of any characters ending with a double quote…

                  • But ONLY IF the text right after \1 is not \2 followed by " (i.e. not, for instance, the string page-AAA.html and a " char). This is what the negative look-ahead (?!\2") checks

                Note also that the [^"]+" syntax, without the parentheses, is more restrictive than the final .+?" and should be preferred in its place, because of the negative look-ahead (?!\2")
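
To see the groups and the look-ahead at work, here is a small sketch in Python, assuming its `re` module treats this pattern like the editor's engine does. The function name `explain_match` and the sample text are hypothetical, made up for this illustration only.

```python
import re

PATTERN = re.compile(r'(?s)(https://xxx.com/en/)([^"]+)".+?\1(?!\2").+?"')


def explain_match(text: str):
    """Return (group 1, group 2, whole match), or None when nothing matches."""
    m = PATTERN.search(text)
    if m is None:
        return None
    return m.group(1), m.group(2), m.group(0)


# Hypothetical sample: the canonical link and the flag link disagree.
sample = (
    '<link rel="canonical" href="https://xxx.com/en/page-CCC.html" />\n'
    '<a href="https://xxx.com/en/page-AAA.html">en</a>'
)

prefix, first_page, whole = explain_match(sample)
print(prefix)      # group 1: https://xxx.com/en/
print(first_page)  # group 2: page-CCC.html
# The whole match runs from the first URL through the second URL
# and its closing double quote.
print(whole.endswith('https://xxx.com/en/page-AAA.html"'))  # True

# The negative look-ahead in isolation: "foo" matches only where it is
# NOT immediately followed by "bar", and the look-ahead consumes nothing.
print(re.search(r'foo(?!bar)', 'foobar foobaz').start())  # 7
```

Because the look-ahead consumes no characters, the final part of the regex still has to match the second address itself, which is why the whole match ends at that address's closing quote.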

                Best Regards,

                guy038

                The Community of users of the Notepad++ text editor.