Regex: How can I find those html files with links that are not identical in different places?

  • R
    Robin Cruise
    last edited by Robin Cruise Aug 5, 2021, 3:28 PM Aug 5, 2021, 3:27 PM

    I have this link at the beginning of an HTML page:

    <link rel="canonical" href="https://xxx.com/en/page-AAA.html" />

    I also have another link in the middle of the file:

    <a href="https://xxx.com/en/page-AAA.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>

    As you can see, these are the same link, but in different contexts and places.

    How can I find the HTML files in which the links in these two places are not identical?

    Suppose the first link were <link rel="canonical" href="https://xxx.com/en/page-CCC.html" /> instead. The two links would then differ, so the regex should find that file.

    How can I do this with a regex?

    • G
      guy038
      last edited by Aug 6, 2021, 2:38 PM

      Hi, @robin-Cruise and All,

      Let’s suppose you have at least two links of the form https://xxx.com/en/••••••••••.••••, where the ••••••••••.•••• part differs.

      Then the regex (?s)(https://xxx.com/en/)([^"]+)".+?\1(?!\2").+?" will match the range spanning these two links, both included.

      Conversely, the regex matches nothing if every https://xxx.com/en/••••••••••.•••• link in the current file has the same ••••••••••.•••• part.
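As a quick way to check this behaviour outside the editor, here is a minimal Python sketch. It assumes Python's `re` module treats this particular pattern the same way as the editor's regex engine; the helper name `has_mismatched_links` and the two sample strings are made up for illustration.

```python
import re

# The regex from the post, unchanged. The dots in "xxx.com" are unescaped,
# so each one matches any character; that is harmless for this check.
MISMATCH = re.compile(r'(?s)(https://xxx.com/en/)([^"]+)".+?\1(?!\2").+?"')


def has_mismatched_links(html: str) -> bool:
    """Return True if a later https://xxx.com/en/ link differs from an earlier one."""
    return MISMATCH.search(html) is not None


# Sample page where both links point to page-AAA.html.
same = (
    '<link rel="canonical" href="https://xxx.com/en/page-AAA.html" />\n'
    '<a href="https://xxx.com/en/page-AAA.html"><img src="index_files/flag_lang_en.jpg" /></a>\n'
)
# Same page, but the canonical link now points to page-CCC.html.
different = same.replace("page-AAA.html", "page-CCC.html", 1)

print(has_mismatched_links(same))
print(has_mismatched_links(different))
```

To scan real files, the same function could be called on the text of each file, for example with `pathlib.Path.read_text`.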

      Best Regards,

      guy038

        • R
          Robin Cruise @guy038
          last edited by Aug 7, 2021, 4:18 PM

          @guy038 thanks a lot. You are the best !

          • R
            Robin Cruise
            last edited by Aug 7, 2021, 8:15 PM

            By the way, @guy038, can you explain what this part of your regex does?

            \1(?!\2")

            • G
              guy038
              last edited by Aug 10, 2021, 9:14 AM

              Hello, @robin-cruise and All,

              In the search regex (?s)(https://xxx.com/en/)([^"]+)".+?\1(?!\2").+?" :

              • The regex part (https://xxx.com/en/) looks for the string https://xxx.com/en/ and stores it as group 1

              • The regex part ([^"]+)" matches the remainder of the internet address (for instance, the string page-AAA.html), stored as group 2, followed by a double quote. [^"]+ is a non-empty run of consecutive chars, none of which is a "

              • Now, the part .+? stands for the shortest range of any chars, line breaks included thanks to (?s), up to…

                • The contents of group 1 ( \1 ), so another string https://xxx.com/en/

                • Which must be followed by .+?", the shortest non-empty range of any chars up to and including a double quote…

                • But ONLY IF the text right after \1 does not begin with \2 and a " char (i.e. the address does not continue with, for instance, the string page-AAA.html" ). This is what the negative look-ahead (?!\2") checks, without consuming any chars
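The breakdown above can also be written as an annotated pattern. The following is only a sketch in Python, whose `re.VERBOSE` flag allows comments inside a regex; `re.DOTALL` plays the role of `(?s)`. The name `ANNOTATED` and the sample `page` string are invented for illustration.

```python
import re

# Same pattern as in the post, split over several lines so that each part
# can carry a comment. re.DOTALL replaces the leading (?s).
ANNOTATED = re.compile(
    r'''
    (https://xxx.com/en/)   # group 1: the fixed start of the address
    ([^"]+)"                # group 2: rest of the address, then the closing quote
    .+?                     # shortest run of any chars up to the next part
    \1                      # the text of group 1 again
    (?!\2")                 # refuse if this address continues exactly like group 2
    .+?"                    # otherwise take this second address and its quote
    ''',
    re.VERBOSE | re.DOTALL,
)

# Sample page whose two links differ (page-CCC.html versus page-AAA.html).
page = (
    '<link rel="canonical" href="https://xxx.com/en/page-CCC.html" />\n'
    '<a href="https://xxx.com/en/page-AAA.html">en</a>\n'
)

m = ANNOTATED.search(page)
print(m.group(1))  # text captured by group 1
print(m.group(2))  # text captured by group 2
print(m.group(0))  # whole match, from the first link to the end of the second
```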

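              Note also that the [^"]+ syntax used in group 2 is more restrictive than .+? and must be preferred because of the negative look-ahead (?!\2"): with .+?, backtracking could extend group 2 past the first closing double quote, so the look-ahead could succeed even when the two addresses are identical.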
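To illustrate why `[^"]+` matters in group 2, here is a small Python sketch comparing the regex from the post with a hypothetical variant that uses `(.+?)` instead. The names `STRICT` and `LOOSE`, the variant itself, and the sample page are assumptions for the demonstration, not part of the original answer.

```python
import re

# The regex from the post.
STRICT = re.compile(r'(?s)(https://xxx.com/en/)([^"]+)".+?\1(?!\2").+?"')
# Hypothetical variant, NOT from the post: group 2 is a lazy dot instead of [^"]+.
LOOSE = re.compile(r'(?s)(https://xxx.com/en/)(.+?)".+?\1(?!\2").+?"')

# Both links are identical; another quoted attribute sits between them.
page = (
    '<link rel="canonical" href="https://xxx.com/en/page-AAA.html" />\n'
    '<meta name="viewport" content="width=device-width" />\n'
    '<a href="https://xxx.com/en/page-AAA.html">en</a>\n'
)

strict_match = STRICT.search(page)
loose_match = LOOSE.search(page)

# STRICT finds nothing, as wanted for identical links.
print(strict_match)
# LOOSE reports a match: group 2 was stretched past the first closing quote,
# so the look-ahead no longer compares the two addresses.
print(repr(loose_match.group(2)))
```

With the lazy dot, the engine is free to grow group 2 until the look-ahead passes, which is why the variant flags a file whose links are in fact the same.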

              Best Regards,

              guy038

              The Community of users of the Notepad++ text editor.