    Regex: Find those files that doesn't contain the same link in 2 different html tags

    30 Posts 5 Posters 1.7k Views
    • guy038

      Hello, @hellena-crainicu, @terry-r and All,

      I don’t think that working on copies is necessary ;-)) So, Hellena, simply use this regex :

      SEARCH (?s)<link\h+rel="canonical"\h*\Khref="([^"]+)".+<a href="(?!\1).+?"


      Notes :

      • The \h syntax is equivalent to the [\t\x20\xA0] syntax

      • The group 1 is the regex [^"]+ and represents the link ••••• in the expression <link rel="canonical" href="•••••" />

      • Due to the <a href="(?!\1).+?" part, this link must not be present in the href attribute of the <a> tag

      • The \K feature discards the text matched so far ( <link\h+rel="canonical"\h* ) from the reported match and resets the start of the match at the word href. So, the overall regex will only catch the range of chars between the first href="•••••" expression and the last <a href="•••••" one !


      Finally, the main problem was to be sure that the range of chars, between the two double-quotes of the first link, does end at the closing double-quote and not later, because of the internal backtracking process of the regex engine !

      For instance, let’s suppose this text, with the same link test.com

      href="test.com" test="value" href="test.com"
      

      Oddly, the regex (?-si)href="(.+?)".+href="(?!\1).+?" matches this text. Common sense says that it shouldn’t, as we have the negative look-ahead (?!\1) structure !?

      So why ? Let’s try to follow the regex engine process !

      • First the regex engine matches the href=" string and catches the shortest range of chars till a double-quote so the value test.com is stored in group 1

      • Then, it matches the part .+href=". But, as the second link is the same as the first one, the negative look-ahead which follows prevents the regex from matching the remaining range of chars

      • Now, that’s the important point : the regex engine backtracks and tries, by all means, to get a positive match attempt !

        • The regex engine moves back to the location right after the first href=" string and extends group 1 to the next shortest range of chars followed by a double-quote. Thus, this time, the value test.com" test= is stored as group 1 ! Indeed, that text is embedded between " chars, too !

        • Then, again, it matches the part .+href=". And, now, as the second link test.com is obviously different from the contents of group 1 ( test.com" test= ), the negative look-ahead returns TRUE and the overall regex wrongly matches the complete text href="test.com" test="value" href="test.com"

      • We now understand the way to get the right regex. We just need to prevent any char between the double-quotes from being, itself, a " char !

      • So, the second regex version (?-si)href="([^"]+)".+href="(?!\1).+?", as expected, does not find the text

      href="test.com" test="value" href="test.com"
      

      And it would match this one !

      href="test.com" test="value" href="tests.com"
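The backtracking trap described above can be reproduced outside Notepad++. What follows is only a minimal sketch in Python, whose re module also accepts a back-reference inside a negative look-ahead. The variable names are illustrative, and the (?-si) prefix is dropped on the assumption that Python's defaults (case-sensitive, dot not matching newline) give the same behaviour here.

```python
import re

same = 'href="test.com" test="value" href="test.com"'
different = 'href="test.com" test="value" href="tests.com"'

# First version: the lazy .+? is free to run past the closing double-quote
loose = re.compile(r'href="(.+?)".+href="(?!\1).+?"')
# Second version: [^"]+ can never contain a double-quote
strict = re.compile(r'href="([^"]+)".+href="(?!\1).+?"')

loose_hit = loose.search(same)
strict_hit_same = strict.search(same)
strict_hit_diff = strict.search(different)

print(loose_hit.group(0) == same)   # True: the whole text is wrongly matched
print(loose_hit.group(1))           # test.com" test=   (more than the link itself)
print(strict_hit_same)              # None: identical links are not reported
print(strict_hit_diff.group(1))     # test.com
```

Printing group 1 is a quick way to check whether a capture has silently grown beyond the attribute value it was meant to hold.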
      

      Best Regards,

      guy038

      • Terry R

        @guy038 said in Regex: Find those files that doesn't contain the same link in 2 different html tags:

        SEARCH (?s)<link\h+rel="canonical"\h*\Khref="([^"]+)".+<a href="(?!\1).+?"

        I used the below example set for my test and got the 3 mismatch hits I created. When I ran your regex on my example set I only got 1 hit. I think I see where your interpretation differed from mine. I did not know for sure there would ONLY be 2 https references in each file, the OP wasn’t specific enough. Now that I see your interpretation I can see that the OP may have suggested that. So certainly if that’s the case I have definitely overworked my regex.

        Cheers
        Terry

        <link rel="canonical" href="https://mywebsite.com/en/truth.html"/>
        
        text text
            
        text
        
        <img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="https://mywebsite.com/en/love.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a>
        <link rel="canonical" href="https://mywebsite.com/en/ttt.html"/>
        
        text text
            
        text
        
        <img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="https://mywebsite.com/en/ttt.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a>
        <link rel="canonical" href="https://mywebsite.com/en/truth.html"/>
        
        text text
            
        text
        
        <img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="https://mywebsite.com/en/sloven.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a>
        
        <link rel="canonical" href="https://mywebsite.com/en/lovel.html"/>
        
        text text
            
        text
        
        <img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="https://mywebsite.com/en/lovely.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a>
        
        <link rel="canonical" href="https://mywebsite.com/en/lov.html"/>
        
        text text
            
        text
        
        <img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="https://mywebsite.com/en/lov.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a>
        
        • guy038

          Hi, @hellena-crainicu, @terry-r and All,

          Reading Terry’s post made me realize that I had not considered the possibility of successive pairs <link rel="canonical" href=" – <a href=" in the same HTML file !

          For instance, against the text :

          <link rel="canonical" href="https://mywebsite.com/en/truth.html"
          
          text text
          
          <a href="https://mywebsite.com/en/truth.html"
          
          text
          
          
          text
          
          
          <link rel="canonical" href="https://mywebsite.com/en/love.html"
          
          text text
          
          <a href="https://mywebsite.com/en/love.html"
          

          My previous regex would wrongly match all the text after the first <link rel="canonical". Indeed, as the two links of each pair are identical ( 2 × truth and 2 × love ), I suppose, @Hellena-crainicu, that you do not want a match in that specific case either !


          So, @hellena-crainicu, prefer this second version, more robust !

          SEARCH / MARK (?s)<link\h+rel="canonical"\h*\Khref="([^"]+)"((?!<link).)+?<a href="(?!\1).+?"

          As you can see, the changed part is ((?!<link).)+? which represents the shortest range of characters, not containing the string <link at any position, between the first and last href attribute !
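The difference between the two versions can be checked with a short script. This is only a sketch in Python, not the Notepad++ (Boost) engine: Python's re module has neither \h nor \K, so [ \t] stands in for \h and \K is simply dropped, which changes where the reported match starts but not whether a match exists. The sample text is a shortened copy of the one above, and the "mismatch" variant is a made-up case.

```python
import re

# Two canonical/anchor pairs in one file; each pair holds identical links
html = '''<link rel="canonical" href="https://mywebsite.com/en/truth.html"
text text
<a href="https://mywebsite.com/en/truth.html"
text
<link rel="canonical" href="https://mywebsite.com/en/love.html"
text text
<a href="https://mywebsite.com/en/love.html"
'''

# Python adaptations (assumptions): [ \t] replaces \h, and \K is dropped
first = re.compile(
    r'(?s)<link[ \t]+rel="canonical"[ \t]*href="([^"]+)".+<a href="(?!\1).+?"')
second = re.compile(
    r'(?s)<link[ \t]+rel="canonical"[ \t]*href="([^"]+)"((?!<link).)+?<a href="(?!\1).+?"')

first_hit = first.search(html)
second_hit = second.search(html)

# Hypothetical file whose second anchor differs from its canonical link
mismatch = html.replace('<a href="https://mywebsite.com/en/love.html"',
                        '<a href="https://mywebsite.com/en/lovely.html"')
mismatch_hit = second.search(mismatch)

print(first_hit is not None)    # True: pairs the first canonical with the last anchor
print(second_hit)               # None: no pair of links differs
print(mismatch_hit.group(1))    # https://mywebsite.com/en/love.html
```

Because ((?!<link).)+? refuses to step over the next <link, each canonical link is only compared with the anchors that come before the following <link tag.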

          BR

          guy038

          • Terry R @Terry R

            @Terry-R said in Regex: Find those files that doesn't contain the same link in 2 different html tags:

            I did not know for sure there would ONLY be 2 https references in each file, the OP wasn’t specific enough.

            I note that I did show the OP a previous test I did, which had 3 sets I tested against (image). The OP did not mention at that time that there was only 1 set in each file. I guess we need the OP to verify whether there is ONLY 1 set in each html file or MANY!

            So @Hellena-Crainicu, does each html file contain only 1 set of https references (so 2 https references in each file) or many sets that the test must be carried out on?

            Terry

              • Hellena Crainicu

                Now I see there is a small problem.

                @Terry-R Your regex seems to work on my example: (?s)^<link rel.+?https://([^"]+).+?https://(*SKIP)(?!\1) finds only the files whose links are different.

                @guy038 Your regex also works on my example: (?s)<link\h+rel="canonical"\h*\Khref="([^"]+)"((?!<link).)+?<a href="(?!\1).+?"

                But I had put that TEXT TEXT between those 2 tags, which means that those links can be found more than twice. For this reason we have specified exactly the lines to be considered. Please see the entire code I have:

                <link rel="canonical" href="https://mywebsite.com/en/truth.html" />
                
                <meta name="copyright" content="me, https://mywebsite.com/"/>
                <link rel="sitemap" type="application/rss+xml" href="rss.xml" /> 
                <link rel="image_src" type="image/jpeg" href="https://mywebsite.com/icon-facebook.jpg" style="display:none"/>    
                <meta itemprop="image" content="https://mywebsite.com/icon-facebook.jpg"/>
                <meta property="og:image" content="https://mywebsite.com/icon-facebook.jpg"/>
                <meta property="og:type"  content="article" />
                <meta property="fb:app_id" content="2156440"/>
                <meta property="fb:admins" content="16454242"/>
                <meta property="og:url" content="https://mywebsite.com/en/other-car.html"/>
                
                <body>
                
                TEXT TEXT
                
                <div class="search">
                                <div align="left">
                
                                  <a href="https://mywebsite.com/hope.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="https://mywebsite.com/fr/book.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="https://mywebsite.com/en/truth.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="https://mywebsite.com/es/green.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="https://mywebsite.com/pt/yellow.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="https://mywebsite.com/ar/truth.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="https://mywebsite.com/zh/truth.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="https://mywebsite.com/hi/truth.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="https://mywebsite.com/de/truth.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="https://mywebsite.com/ru/truth.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a>
                
                TEXT TEXT
                
                
                <div id="pixxell"> <a href="https://mywebsite.com/en/book-miracle.html">I find a miracle </div>
                
                TEXT TEXT
                
                • Hellena Crainicu

                  so, the second tag <img src=...> is extracted from the <div class="search"> section. This part must be taken into account.

                  • guy038

                    Hi, @hellena-crainicu, @terry-r and All,

                    I’m really sorry but I still don’t understand what is your goal !

                    • First, my last regex (?s)<link\h+rel="canonical"\h*\Khref="([^"]+)"((?!<link).)+?<a href="(?!\1).+?", contrary to what you said, does not match anything against your last text, even if I remove the TEXT TEXT parts !? Anyway, I don’t mind, as the next regex version will surely be very different !

                    Now, the important points are :

                    • Firstly : Does the HTML text that you provided, and which is repeated below, represent a real part of your HTML files ?
                    <link rel="canonical" href="https://mywebsite.com/en/truth.html" />
                    
                    <meta name="copyright" content="me, https://mywebsite.com/"/>
                    <link rel="sitemap" type="application/rss+xml" href="rss.xml" /> 
                    <link rel="image_src" type="image/jpeg" href="https://mywebsite.com/icon-facebook.jpg" style="display:none"/>    
                    <meta itemprop="image" content="https://mywebsite.com/icon-facebook.jpg"/>
                    <meta property="og:image" content="https://mywebsite.com/icon-facebook.jpg"/>
                    <meta property="og:type"  content="article" />
                    <meta property="fb:app_id" content="2156440"/>
                    <meta property="fb:admins" content="16454242"/>
                    <meta property="og:url" content="https://mywebsite.com/en/other-car.html"/>
                    
                    <body>
                    
                    TEXT TEXT
                    
                    <div class="search">
                                    <div align="left">
                    
                                      <a href="https://mywebsite.com/hope.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="https://mywebsite.com/fr/book.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="https://mywebsite.com/en/truth.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="https://mywebsite.com/es/green.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="https://mywebsite.com/pt/yellow.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="https://mywebsite.com/ar/truth.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="https://mywebsite.com/zh/truth.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="https://mywebsite.com/hi/truth.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="https://mywebsite.com/de/truth.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="https://mywebsite.com/ru/truth.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a>
                    
                    TEXT TEXT
                    
                    
                    <div id="pixxell"> <a href="https://mywebsite.com/en/book-miracle.html">I find a miracle </div>
                    
                    TEXT TEXT
                    
                    • Secondly : If so, I suppose that the first line <link rel="canonical" href="https://mywebsite.com/en/truth.html" /> with the link https://mywebsite.com/en/truth.html is the first of the two links to consider in the future ( correct ! ) regex

                    • Thirdly : I also suppose that any of the links below, located after <div class="search"> and followed by an <img src=•••••> tag, may be taken as the second link to be considered in the future regex

                    href="https://mywebsite.com/hope.html">
                    href="https://mywebsite.com/fr/book.html">
                    href="https://mywebsite.com/en/truth.html">
                    href="https://mywebsite.com/es/green.html">
                    href="https://mywebsite.com/pt/yellow.html">
                    href="https://mywebsite.com/ar/truth.html">
                    href="https://mywebsite.com/zh/truth.html">
                    href="https://mywebsite.com/hi/truth.html">
                    href="https://mywebsite.com/de/truth.html">
                    href="https://mywebsite.com/ru/truth.html">
                    

                    • Fourthly : As the tag <link rel="canonical" ••••• /> contains the link href="https://mywebsite.com/en/truth.html", I suppose that, considering the list of links above, you would like the regex to match :

                    • FROM the expression <link rel="canonical" href="https://mywebsite.com/en/truth.html" /> or, at least, its link href="https://mywebsite.com/en/truth.html"

                    • TO the first link different from href="https://mywebsite.com/en/truth.html", so, in this example, the first link of the list href="https://mywebsite.com/hope.html">


                    As you can see, it is generally much more difficult to fully understand what the OP’s needs are than to find out any kind of regex ;-))

                    BR

                    guy038

                    • Terry R @Hellena Crainicu

                      @Hellena-Crainicu said in Regex: Find those files that doesn't contain the same link in 2 different html tags:

                      so, the second tag <img src=...> is extracted from the <div class="search"> section

                      So hopefully you now understand that providing us with the most accurate information possible is key to getting a workable solution.

                      With my regex I think all you need to add is </a>&nbsp; <a href=" directly in front of the second https string. I’m not on a PC currently, instead typing on a smartphone. But if you can try adding these characters I think my regex will work. It’s all a matter of getting the regex to consume characters up until the https tag we wish to check against. This adjustment should get you there. That is unless there are other nbsp in between. But then your own regex was using the nbsp as well so I feel confident.

                      Terry

                      • Hellena Crainicu

                        @guy038 and @Terry-R

                        There are 2 particular lines, which are not repeated:

                        <link rel="canonical" href="https://mywebsite.com/en/truth.html" />

                        and

                        <img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="https://mywebsite.com/en/love.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a>

                        So, to make myself better understood: I translated the site into ten languages.

                        For the French section I have href="https://mywebsite.com/fr/truth.html">
                        For the Russian section I have href="https://mywebsite.com/ru/truth.html">

                        And so on. See those little De / FR / Ru / Hi / Ar …

                        I want to check if there is any link that I omitted (in the German section, “de”), and this link for german section is only found in the line with <img src =

                        Note that the links in English and German are the same, only the content of the html files is different.

                        The line with <link rel="canonical" represents the page in English. For example: https://mywebsite.com/en/truth.html

                        The <img src=.. tag also has a link, but that link must be https://mywebsite.com/de/truth.html, not the same as the canonical https://mywebsite.com/en/truth.html

                        That is why I must find those files that don’t contain the same link in the canonical and <img …de> tags. If they are identical, it means that I forgot to translate the German section ( /de/ ), because I copied the file from English and translated only the text.

                        • Alan Kilborn

                          @guy038 said in Regex: Find those files that doesn't contain the same link in 2 different html tags:

                          I’m really sorry but I still don’t understand what is your goal !

                          I could see early on that this was where this thread was going to go; sometimes you can just “spot them”. :-)
                          Kudos to @guy038 and @Terry-R for carrying things on…

                          • Hellena Crainicu

                            I believe it would have been much simpler to use the ?! operator.

                            Find with Regex:

                            alt="de" /></a>&nbsp; <a href="https://mywebsite.com/(?!de)

                            because the important thing was not to have the same link as for the English part. Once it does not have EN, it means that it is not the same. ;)

                            • guy038

                              Hi, @hellena-crainicu, @terry-r, @alan-kilborn and All,

                              @hellena-crainicu, in an earlier post, you provided an HTML text which contained this very long line :

                              <a href="https://mywebsite.com/hope.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="https://mywebsite.com/fr/book.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="https://mywebsite.com/en/truth.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="https://mywebsite.com/es/green.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="https://mywebsite.com/pt/yellow.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="https://mywebsite.com/ar/truth.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="https://mywebsite.com/zh/truth.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="https://mywebsite.com/hi/truth.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="https://mywebsite.com/de/truth.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="https://mywebsite.com/ru/truth.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a>
                              

                              In order to better see all the contents of this loooooong line, I split it into 10 lines, corresponding to your 10 languages !

                              <a href="https://mywebsite.com/hope.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; 
                              <a href="https://mywebsite.com/fr/book.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; 
                              <a href="https://mywebsite.com/en/truth.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; 
                              <a href="https://mywebsite.com/es/green.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; 
                              <a href="https://mywebsite.com/pt/yellow.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; 
                              <a href="https://mywebsite.com/ar/truth.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; 
                              <a href="https://mywebsite.com/zh/truth.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; 
                              <a href="https://mywebsite.com/hi/truth.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; 
                              <a href="https://mywebsite.com/de/truth.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; 
                              <a href="https://mywebsite.com/ru/truth.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a>
                              

                              Now, if we isolate the line, relative to German, we get :

                              <a href="https://mywebsite.com/de/truth.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; 
                              

                              Apparently, in that line, the link <a href="https://mywebsite.com/de/truth.html"> occurs first and the part alt="de" occurs later !

                              Now, the regex of your last post seems to search first for the alt="de" /></a>&nbsp; string, followed by a space char, and then for the beginning of the <a> tag : <a href="https://mywebsite.com/(?!de)

                              So, exactly the opposite of your previous example !? Again, I’m totally confused ! Could you provide us with an exact and real example ?

                              Best Regards,

                              guy038

                              • Terry R @Hellena Crainicu

                                @Hellena-Crainicu said in Regex: Find those files that doesn't contain the same link in 2 different html tags:

                                I want to check if there is any link that I omitted (in the German section, “de”), and this link for german section is only found in the line with <img src =

                                At this point (thanks @Alan-Kilborn) I’m about to throw in the towel (that means give up). I don’t think even with your latest post I’m entirely clear on what you need to do.

                                I understand you have the first line in the set, which has the <link rel="canonical" href="https://mywebsite.com/en/truth.html" />. First question: which part of the link are you testing against? Is it mywebsite.com/en/truth.html or just truth.html?

                                I get that further into the set you have duplicate information, each with a language mentioned, fr, en, es, pt, ar, zh, hi, de, ru. Second question, is it that you only want to test the https reference with the de portion?

                                A third question. In each html file, is there just 1 set of the "link rel=… to …? I ask because even the last example suggested that the example was not complete. There are the starting tags <div class="search"> and <div align="left">, yet no closing tags (</div>) appear for them. We need to be clear on what content exists if the example isn’t complete. What other data have you excluded? You have already excluded data which you thought irrelevant, yet later you realised it was important. Anything which appears between the start point and the end point of the test (so between the 2 https references) is relevant and must be included to give the best chance of supplying a workable solution.

                                Please answer all 3 questions precisely. If unable to, then the towel gets chucked and I’m out. And as @guy038 points out, your original regex, where you look for the alt="de" string followed by the https reference, is clearly wrong, as the alt="de" appears after the associated https reference. It’s good that someone else spotted that, as I was reading the example multiple times wondering how it should have worked with your regexes.

                                One parting suggestion. It’s almost at the point where making copies of all the files and editing each to remove unneeded portions would greatly simplify the resulting regex to do the actual test. You would leave enough unique information so that the relevant section (if more than 1 in each file) would show you where to look in the appropriate original file so you could perform the necessary edit to fix the link.

                                Terry

                                • Hellena Crainicu @guy038

                                  Hello @guy038, that was my mistake, it should be: <a href="https://mywebsite.com/ru/truth.html

                                  <a href="https://mywebsite.com/hope.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="https://mywebsite.com/fr/book.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="https://mywebsite.com/en/truth.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="https://mywebsite.com/es/green.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="https://mywebsite.com/pt/yellow.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="https://mywebsite.com/ar/truth.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="https://mywebsite.com/zh/truth.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="https://mywebsite.com/hi/truth.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="https://mywebsite.com/de/truth.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="https://mywebsite.com/ru/truth.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a>
                                  

                                  so my last regex should be: alt="de" /></a>&nbsp; <a href="https://mywebsite.com/(?!ru)
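This corrected pattern can be tried on the flag line from the example. The sketch below uses Python, which accepts the pattern as written. The "forgotten" string is a hypothetical file in which the link after the German flag was not changed to /ru/; what a real untranslated file looks like is an assumption here.

```python
import re

# Pattern quoted from the post: the link after the "de" flag must not start with ru
pattern = re.compile(r'alt="de" /></a>&nbsp; <a href="https://mywebsite.com/(?!ru)')

# End of the German flag entry followed by the start of the Russian one
translated = ('<img src="index_files/flag_lang_de.jpg" width="28" height="19" '
              'title="de" alt="de" /></a>&nbsp; '
              '<a href="https://mywebsite.com/ru/truth.html">')

# Hypothetical untranslated file: the link still points to /en/
forgotten = translated.replace('/ru/truth.html', '/en/truth.html')

translated_hit = pattern.search(translated)
forgotten_hit = pattern.search(forgotten)

print(translated_hit is None)       # True: a correct file is not reported
print(forgotten_hit is not None)    # True: the leftover /en/ link is reported
```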

                                  The problem with this forum is that I cannot edit a post again after a couple of minutes, and I forgot to change it. Of course, I didn’t think anyone would still be interested.

                                  • Alan KilbornA
                                    Alan Kilborn @Hellena Crainicu
                                    last edited by

                                    @Hellena-Crainicu said in Regex: Find those files that doesn't contain the same link in 2 different html tags:

                                    The problem with this forum is that I cannot edit a post again after a couple of minutes, and I forgot to change it.

                                    It’s good to have the posting history, exactly as it is.
                                    That way, later posts will make sense.
                                    If earlier posts could change, I think we’d totally have a mess in some of the threads here (and probably this thread is a great example of that).
                                    If you have new/corrected information, just add an additional post.

                                    BUT…be aware that those helping you are putting a lot of time/effort into it.
                                    So you really should think hard about what you are posting and try to get it right the first time, to avoid others wasting their time.
                                    Sure, errors happen, but there’s a difference between an honest mistake and someone who just hasn’t bothered to think things through enough.

                                        • Vasile CarausV
                                          Vasile Caraus
                                          last edited by Vasile Caraus

                                          Happy Easter, friends.

                                          Another solution could be the following:

                                          1. Select the link you want from the canonical line: (<link rel="canonical" href=")(.*?)(" \/>)
                                          2. Select the second link, from the ru section: (alt="de" \/></a>&nbsp; <a href=")(.*?)(><img src="index_files\/flag_lang_ru)
                                          3. Combine these two regexes by joining them with (.*?), and put (\2) in the second link, after its (.*?). The \2 back-reference refers to the second group, that is, the link in the canonical line.

                                          So the regex becomes: (<link rel="canonical" href=")(.*?)(" \/>)(.*?)(alt="de" \/></a>&nbsp; <a href=")(.*?)(\2)(><img src="index_files\/flag_lang_ru)

                                          Alternatively, we can try (?!\2) instead of (\2) and run a Find with the ". matches newline" option checked.

                                          So the regex becomes: (<link rel="canonical" href=")(.*?)(" \/>)(.*?)(alt="de" \/></a>&nbsp; <a href=")(.*?)(?!\2)(><img src="index_files\/flag_lang_ru)

                                          I don’t know why it is not working. I believe my thinking was correct. :)

                                          <link rel="canonical" href="https://mywebsite.com/en/truth.html" />
                                          
                                          <meta name="copyright" content="me, https://mywebsite.com/"/>
                                          <link rel="sitemap" type="application/rss+xml" href="rss.xml" /> 
                                          <link rel="image_src" type="image/jpeg" href="https://mywebsite.com/icon-facebook.jpg" style="display:none"/>    
                                          <meta itemprop="image" content="https://mywebsite.com/icon-facebook.jpg"/>
                                          <meta property="og:image" content="https://mywebsite.com/icon-facebook.jpg"/>
                                          <meta property="og:type"  content="article" />
                                          <meta property="fb:app_id" content="2156440"/>
                                          <meta property="fb:admins" content="16454242"/>
                                          <meta property="og:url" content="https://mywebsite.com/en/other-car.html"/>
                                          
                                          <body>
                                          
                                          TEXT TEXT
                                          
                                          <div class="search">
                                                          <div align="left">
                                          
                                                            <a href="https://mywebsite.com/hope.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="https://mywebsite.com/fr/book.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="https://mywebsite.com/en/truth.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="https://mywebsite.com/es/green.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="https://mywebsite.com/pt/yellow.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="https://mywebsite.com/ar/truth.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="https://mywebsite.com/zh/truth.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="https://mywebsite.com/hi/truth.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="https://mywebsite.com/de/truth.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="https://mywebsite.com/ru/truth.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a>
                                          
                                          TEXT TEXT
                                          
                                          
                                          <div id="pixxell"> <a href="https://mywebsite.com/en/book-miracle.html">I find a miracle </div>
                                          
                                          TEXT TEXT
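
A possible explanation, sketched in Python rather than in Notepad++. The `re` module is an assumption here, and the helper `page`, the names `BACKREF`, `LATE_CHECK` and `EARLY_CHECK`, and the trimmed sample pages are all made up for illustration. In the `(\2)` version, the closing double-quote that sits between the link and `><img` appears not to be accounted for, so the match fails even when the two links are identical. In the `(?!\2)` version, the preceding `(.*?)` has already consumed the link and its closing double-quote by the time the look-ahead is tested, so the look-ahead only ever inspects `><img` and always succeeds. One hedged alternative is to test the look-ahead immediately after `href="` and to capture the canonical link with `[^"]+`:

```python
import re

# Pieces shared by the patterns (the optional \/ escapes are written as plain /).
HEAD = r'(<link rel="canonical" href=")(.*?)(" />)(.*?)(alt="de" /></a>&nbsp; <a href=")'
RU_IMG = r'><img src="index_files/flag_lang_ru'

# The two attempts from the post, with "." matching newlines (re.DOTALL).
BACKREF = re.compile(HEAD + r'(.*?)(\2)(' + RU_IMG + ')', re.DOTALL)
LATE_CHECK = re.compile(HEAD + r'(.*?)(?!\2)(' + RU_IMG + ')', re.DOTALL)

# Hypothetical variant: the look-ahead is tested right after href=", before the
# link is consumed, and the canonical link cannot run past its closing quote.
EARLY_CHECK = re.compile(
    r'<link rel="canonical" href="([^"]+)" />.*?'
    r'alt="de" /></a>&nbsp; <a href="(?!\1")[^"]*"' + RU_IMG,
    re.DOTALL,
)


def page(canonical: str, ru_link: str) -> str:
    """Build a trimmed, made-up page with a canonical link and a ru flag link."""
    return (
        f'<link rel="canonical" href="{canonical}" />\n'
        '<body>\n'
        '<img src="index_files/flag_lang_de.jpg" title="de" alt="de" /></a>&nbsp; '
        f'<a href="{ru_link}"><img src="index_files/flag_lang_ru.jpg" title="ru" alt="ru" /></a>\n'
    )


EN = "https://mywebsite.com/en/truth.html"
RU = "https://mywebsite.com/ru/truth.html"
same = page(EN, EN)        # both tags carry the same link
different = page(EN, RU)   # the links differ

for name, rx in [("BACKREF", BACKREF), ("LATE_CHECK", LATE_CHECK), ("EARLY_CHECK", EARLY_CHECK)]:
    print(name, bool(rx.search(same)), bool(rx.search(different)))
```

On these made-up pages, `BACKREF` matches neither page, `LATE_CHECK` matches both, and `EARLY_CHECK` matches only the page whose two links differ. Whether Notepad++ behaves identically on the real files is an assumption that is worth checking on a copy first.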
                                          
                                          The Community of users of the Notepad++ text editor.