    Regex: How can I find those html files with links that are not identical in different places?

    • Robin Cruise

      I have this link at the beginning of an HTML page:

      <link rel="canonical" href="https://xxx.com/en/page-AAA.html" />

      I also have another link in the middle of the file:

      <a href="https://xxx.com/en/page-AAA.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>

      As you can see, both links point to the same URL, but they appear in different contexts and places.

      But how can I find the HTML files in which the links in these two places are not identical?

      Suppose the first link is <link rel="canonical" href="https://xxx.com/en/page-CCC.html" /> instead. In that case the two links are no longer identical, so the regex should find that file, because it contains different links.

      How can I do this with a regex?

      • guy038

        Hi, @robin-Cruise and All,

        Let’s suppose you have at least two links of the form https://xxx.com/en/••••••••••.••••, where the ••••••••••.•••• part differs.

        Then the regex (?s)(https://xxx.com/en/)([^"]+)".+?\1(?!\2").+?" will match the whole range between these two links, both links included!

        Thus, the regex does not match anything if all the https://xxx.com/en/••••••••••.•••• links in the current file have the same ••••••••••.•••• part.
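
        For anyone who wants to try the behaviour outside Notepad++, here is a minimal sketch in Python, assuming Python's re module treats this pattern the same way as the Notepad++ regex engine. The function name has_mismatched_links and the two sample strings are made up for the illustration; only the pattern itself comes from the post above.

```python
import re

# The pattern from the post, unchanged; (?s) lets "." cross line breaks.
PATTERN = re.compile(r'(?s)(https://xxx.com/en/)([^"]+)".+?\1(?!\2").+?"')


def has_mismatched_links(html: str) -> bool:
    """Return True if two https://xxx.com/en/ links in the text differ."""
    return PATTERN.search(html) is not None


# Hypothetical sample: both links point to page-AAA.html.
same = (
    '<link rel="canonical" href="https://xxx.com/en/page-AAA.html" />\n'
    '<a href="https://xxx.com/en/page-AAA.html">'
    '<img src="index_files/flag_lang_en.jpg" title="en" alt="en" /></a>\n'
)

# Same text, but the canonical link now points to page-CCC.html.
different = same.replace("page-AAA.html", "page-CCC.html", 1)

print(has_mismatched_links(same))       # identical links: nothing to report
print(has_mismatched_links(different))  # differing links: file is flagged
```

        To scan a folder, the same function could be applied to the text of each .html file, but in Notepad++ itself the Find in Files dialog with the regex above does the job directly.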

        Best Regards,

        guy038

            • Robin Cruise @guy038

            @guy038 thanks a lot. You are the best!

              • Robin Cruise

              By the way, @guy038, can you explain what this part of your regex does?

              \1(?!\2")

                • guy038

                Hello, @robin-cruise and All,

                In the search regex (?s)(https://xxx.com/en/)([^"]+)".+?\1(?!\2").+?" :

                • The part (https://xxx.com/en/) looks for the literal string https://xxx.com/en/ and stores it as group 1

                • The part ([^"]+)" matches the remainder of the internet address (for instance, the string page-AAA.html), followed by a double quote. Here [^"]+ is a non-empty run of consecutive characters, all different from ", and it is stored as group 2

                • Next, the part .+? stands for the shortest range of any characters up to…

                  • Group 1 ( \1 ), that is, another occurrence of the string https://xxx.com/en/

                  • Which must be followed by .+?", the shortest non-empty range of any characters before a double quote…

                  • But ONLY IF the text right after \1 is not \2 followed by a " character (for instance, not the string page-AAA.html followed by "). This is what the negative look-ahead (?!\2") checks

                Note also that the [^"]+" syntax (group 2, without its parentheses) is more restrictive than .+?" and must be preferred because of the negative look-ahead (?!\2"): a lazy .+? could be extended past the closing double quote by backtracking, which would let the look-ahead succeed even when all the links are identical
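
                The breakdown above can also be written out piece by piece. The sketch below is only an illustration in Python, assuming its re module behaves like the Notepad++ engine for this pattern: STRICT is the regex from this thread split into commented parts, and LOOSE is a hypothetical variant (not from the thread) in which group 2 is written as .+? so that the closing note can be tried out. The sample text is made up and contains three identical links.

```python
import re

# The regex from the thread, split into commented pieces. re.VERBOSE ignores
# the layout whitespace and comments; re.DOTALL replaces the leading (?s).
STRICT = re.compile(r'''
    (https://xxx.com/en/)  # group 1: the fixed prefix
    ([^"]+)"               # group 2: rest of the address, then its closing quote
    .+?                    # shortest run of any characters up to...
    \1                     # ...another occurrence of the prefix
    (?!\2")                # not followed by the same address and a quote
    .+?"                   # the differing address and its closing quote
''', re.VERBOSE | re.DOTALL)

# Hypothetical variant, for comparison only: group 2 written as .+? instead.
LOOSE = re.compile(r'(?s)(https://xxx.com/en/)(.+?)".+?\1(?!\2").+?"')

# Made-up sample with three identical links, so nothing should be reported.
text = (
    'a "https://xxx.com/en/p.html" '
    'b "https://xxx.com/en/p.html" '
    'c "https://xxx.com/en/p.html" d'
)

strict_hit = STRICT.search(text)
loose_hit = LOOSE.search(text)

print(strict_hit is None)     # the original pattern stays silent
print(loose_hit is not None)  # the loose variant reports a false positive
```

                In the loose variant, backtracking lets group 2 grow past the first closing quote, so \2 is no longer just the address and the look-ahead no longer compares like with like.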

                Best Regards,

                guy038

                The Community of users of the Notepad++ text editor.