    why does this parsing not work? Replace the content of html tags between comments section

    • Robin Cruise

      Hello, I read this topic and @guy038's answer over HERE

      I have a similar request. I have this HTML code.

             <!-- Form Start -->
          <div class="55df" >
              <div class="gg44">
                  <h4 class="header-text">United Romaketh</h4>
                  <img src="https://sensy.com/33.png" alt="Alternate Text" />
              </div>
              <!-- Identify  -->
              <div class="45de">
                  <div class="col">
                      <div class="body-text">
                          <h4>Marcus 33</h4>
                          <p>Can you tell me?</p>
                      </div>
                      <div class="yy">
                          <!-- Yesd. -->
      
                          <form action="https://www.gre.com/" method="post" class="similar">
      
                              <!-- Stones -->
                              <input type="hidden" name="business" value="erer@gmail.com">
      
                              <!-- button simple -->
                              <input type="hidden" name="cmd" value="_buttons">
      
                              <!-- contribution -->
                              <input type="hidden" name="item_name" value="Maxim">
                              <input type="hidden" name="item_number" value="Maxim">
                              <select name="amount"><option value="3.00">&euro;3.00</option><option value="5.00">&euro;5.00</option><option value="10.00">&euro;10.00</option><option value="25.00">&euro;25.00</option><option value="50.00">&euro;50.00</option></select>
                              <input type="hidden" name="currency" value="DOL">
      
                              <!-- Display the button. -->
                              <input class="paypal-img" type="image" src="https://www.pitt.com/hh.gif" border="0" name="submit" title="yy" alt="button" />
                          </form>
      
      
      
      
                      </div>
                  </div>
                  <div class="col">
                      <div class="body-text">
                          <h4>I am here</h4>
                          <p>My text here</p>
                      </div>
                      <div class="gono">
                          <form action="https://www.concate.com/donate" method="post" target="_top">
                              <input type="hidden" name="444" value="7Z7JBUL" />
                              <input class="r-img" type="image" src="https://www.rer.com/" border="0" name="submit" title="d" alt="sdsd" />
                              <img alt="" border="0" src="https://www.dd.com/pixel.gif" width="1" height="1" />
                          </form>
                      </div>
                  </div>
      
              </div>
      
          </div>
          <div class="Home Store">
              <h4>I am here</h4>
              <p>my dauther loves me</p>
              <div class="text-sdsg">
                  Love Me tender
              </div>
          </div>
          
                      <!-- Form Final -->
      

      The problem: I must replace all <p></p> with <p class="STAR-ONE"></p>, but only in the section from <!-- Form Start --> to <!-- Form Final -->

      My regex seems OK, but it does not replace correctly. Instead of <p class="STAR-ONE"></p> it gives me <p><p class="STAR-ONE"></p>

      I really don't know where the mistake in my regex is…

      Search: (?:.*?<!-- Form Start -->|\G).*?\K(<p>).*?(?=</p>.*?<!-- Form Final -->)
      Replace by: \1<p class="STAR-ONE">\3

      • Robin Cruise

        @Robin-Cruise said in why does this parsing not work? Replace the content of html tags between comments section:

        (?:.*?<!-- Form Start -->|\G).*?\K(<p>).*?(?=</p>.*?<!-- Form Final -->)

        I found the answer, thanks:

        FIND: (?:.*?<!-- Form Start -->|\G).*?\K<p>(.*?)(?=</p>.*?<!-- Form Final -->)
        Replace by: \2<p class="STAR-ONE">\1\3

        CHECK Wrap around
        CHECK Regular expression
        CHECK . matches newline

        • guy038

          Hello, @robin-cruise and All,

          Firstly, I just cannot understand your last regex S/R :

          SEARCH (?:.*?<!-- Form Start -->|\G).*?\K<p>(.*?)(?=</p>.*?<!-- Form Final -->)
          REPLACE \2<p class="STAR-ONE">\1\3

          Indeed, in replacement, you have three groups 1, 2 and 3 but your search regex contains only one group (.*?) !? I verified that, after each replace operation, the groups 2 and 3 are always empty !
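To illustrate the point about empty groups, here is a small Python sketch of both replacements applied to a single paragraph. It is only an approximation: Python's `re` has no `\K`, and it rejects references to groups that do not exist, so the empty groups are written out as empty strings (an assumption based on the observation above that groups 2 and 3 are always empty). The region markers are left out to keep the example short.

```python
import re

line = "<p>Can you tell me?</p>"

# First attempt: group 1 is the literal "<p>". The paragraph text is matched
# but not captured, so it disappears from the replacement.
first = re.sub(
    r"(<p>).*?(?=</p>)",
    lambda m: m.group(1) + '<p class="STAR-ONE">' + "",  # "" stands in for the empty \3
    line,
)

# Second attempt: group 1 is now the paragraph text; \2 and \3 contribute nothing.
second = re.sub(
    r"<p>(.*?)(?=</p>)",
    lambda m: "" + '<p class="STAR-ONE">' + m.group(1) + "",
    line,
)

print(first)
print(second)
```

In other words, the second replacement behaves like `<p class="STAR-ONE">\1` alone.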


          Secondly, you should have expressed your initial goal as :

          " I have some <p>Some Text</p> zones and I would like to change them into <p class="STAR-ONE">SAME Text</p> "

          It’s a little bit clearer !


          Thirdly :

          • You could have added the (?s) modifier at the beginning of your regex, so that the result does not depend on the . matches newline option !

          • You could have used the more accurate ^\h* syntax, instead of .*?, at the beginning of the non-capturing group

          • You could have added the negative look-ahead (?!\A), right before the \G assertion. Indeed, as the Wrap around option is set, the regex engine starts the replacement process from the very beginning of the file, whatever the current caret location, and the (?!\A) syntax ensures that the regex engine will not use the second alternative \G but will look, instead, for a <!-- Form Start --> line, first !


          Fourthly, your regex cannot work with, for instance, the text below, where I duplicated your initial example and, in between, I inserted the same section, without the boundaries <!-- Form Start --> and <!-- Form Final -->

                 <!-- Form Start -->
              <div class="55df" >
                  <div class="gg44">
                      <h4 class="header-text">United Romaketh</h4>
                      <img src="https://sensy.com/33.png" alt="Alternate Text" />
                  </div>
                  <!-- Identify  -->
                  <div class="45de">
                      <div class="col">
                          <div class="body-text">
                              <h4>Marcus 33</h4>
                              <p>Can you tell me?</p>
                          </div>
                          <div class="yy">
                              <!-- Yesd. -->
          
                              <form action="https://www.gre.com/" method="post" class="similar">
          
                                  <!-- Stones -->
                                  <input type="hidden" name="business" value="erer@gmail.com">
          
                                  <!-- button simple -->
                                  <input type="hidden" name="cmd" value="_buttons">
          
                                  <!-- contribution -->
                                  <input type="hidden" name="item_name" value="Maxim">
                                  <input type="hidden" name="item_number" value="Maxim">
                                  <select name="amount"><option value="3.00">&euro;3.00</option><option value="5.00">&euro;5.00</option><option value="10.00">&euro;10.00</option><option value="25.00">&euro;25.00</option><option value="50.00">&euro;50.00</option></select>
                                  <input type="hidden" name="currency" value="DOL">
          
                                  <!-- Display the button. -->
                                  <input class="paypal-img" type="image" src="https://www.pitt.com/hh.gif" border="0" name="submit" title="yy" alt="button" />
                              </form>
          
          
          
          
                          </div>
                      </div>
                      <div class="col">
                          <div class="body-text">
                              <h4>I am here</h4>
                              <p>My text here</p>
                          </div>
                          <div class="gono">
                              <form action="https://www.concate.com/donate" method="post" target="_top">
                                  <input type="hidden" name="444" value="7Z7JBUL" />
                                  <input class="r-img" type="image" src="https://www.rer.com/" border="0" name="submit" title="d" alt="sdsd" />
                                  <img alt="" border="0" src="https://www.dd.com/pixel.gif" width="1" height="1" />
                              </form>
                          </div>
                      </div>
          
                  </div>
          
              </div>
              <div class="Home Store">
                  <h4>I am here</h4>
                  <p>my dauther loves me</p>
                  <div class="text-sdsg">
                      Love Me tender
                  </div>
              </div>
              
                          <!-- Form Final -->
          <!-- ------------------------------------------------------------------------------------------------------------------------------- -->
              <div class="55df" >
                  <div class="gg44">
                      <h4 class="header-text">United Romaketh</h4>
                      <img src="https://sensy.com/33.png" alt="Alternate Text" />
                  </div>
                  <!-- Identify  -->
                  <div class="45de">
                      <div class="col">
                          <div class="body-text">
                              <h4>Marcus 33</h4>
                              <p>Can you tell me?</p>
                          </div>
                          <div class="yy">
                              <!-- Yesd. -->
          
                              <form action="https://www.gre.com/" method="post" class="similar">
          
                                  <!-- Stones -->
                                  <input type="hidden" name="business" value="erer@gmail.com">
          
                                  <!-- button simple -->
                                  <input type="hidden" name="cmd" value="_buttons">
          
                                  <!-- contribution -->
                                  <input type="hidden" name="item_name" value="Maxim">
                                  <input type="hidden" name="item_number" value="Maxim">
                                  <select name="amount"><option value="3.00">&euro;3.00</option><option value="5.00">&euro;5.00</option><option value="10.00">&euro;10.00</option><option value="25.00">&euro;25.00</option><option value="50.00">&euro;50.00</option></select>
                                  <input type="hidden" name="currency" value="DOL">
          
                                  <!-- Display the button. -->
                                  <input class="paypal-img" type="image" src="https://www.pitt.com/hh.gif" border="0" name="submit" title="yy" alt="button" />
                              </form>
          
          
          
          
                          </div>
                      </div>
                      <div class="col">
                          <div class="body-text">
                              <h4>I am here</h4>
                              <p>My text here</p>
                          </div>
                          <div class="gono">
                              <form action="https://www.concate.com/donate" method="post" target="_top">
                                  <input type="hidden" name="444" value="7Z7JBUL" />
                                  <input class="r-img" type="image" src="https://www.rer.com/" border="0" name="submit" title="d" alt="sdsd" />
                                  <img alt="" border="0" src="https://www.dd.com/pixel.gif" width="1" height="1" />
                              </form>
                          </div>
                      </div>
          
                  </div>
          
              </div>
              <div class="Home Store">
                  <h4>I am here</h4>
                  <p>my dauther loves me</p>
                  <div class="text-sdsg">
                      Love Me tender
                  </div>
              </div>
          <!-- ------------------------------------------------------------------------------------------------------------------------------- -->
                 <!-- Form Start -->
              <div class="55df" >
                  <div class="gg44">
                      <h4 class="header-text">United Romaketh</h4>
                      <img src="https://sensy.com/33.png" alt="Alternate Text" />
                  </div>
                  <!-- Identify  -->
                  <div class="45de">
                      <div class="col">
                          <div class="body-text">
                              <h4>Marcus 33</h4>
                              <p>Can you tell me?</p>
                          </div>
                          <div class="yy">
                              <!-- Yesd. -->
          
                              <form action="https://www.gre.com/" method="post" class="similar">
          
                                  <!-- Stones -->
                                  <input type="hidden" name="business" value="erer@gmail.com">
          
                                  <!-- button simple -->
                                  <input type="hidden" name="cmd" value="_buttons">
          
                                  <!-- contribution -->
                                  <input type="hidden" name="item_name" value="Maxim">
                                  <input type="hidden" name="item_number" value="Maxim">
                                  <select name="amount"><option value="3.00">&euro;3.00</option><option value="5.00">&euro;5.00</option><option value="10.00">&euro;10.00</option><option value="25.00">&euro;25.00</option><option value="50.00">&euro;50.00</option></select>
                                  <input type="hidden" name="currency" value="DOL">
          
                                  <!-- Display the button. -->
                                  <input class="paypal-img" type="image" src="https://www.pitt.com/hh.gif" border="0" name="submit" title="yy" alt="button" />
                              </form>
          
          
          
          
                          </div>
                      </div>
                      <div class="col">
                          <div class="body-text">
                              <h4>I am here</h4>
                              <p>My text here</p>
                          </div>
                          <div class="gono">
                              <form action="https://www.concate.com/donate" method="post" target="_top">
                                  <input type="hidden" name="444" value="7Z7JBUL" />
                                  <input class="r-img" type="image" src="https://www.rer.com/" border="0" name="submit" title="d" alt="sdsd" />
                                  <img alt="" border="0" src="https://www.dd.com/pixel.gif" width="1" height="1" />
                              </form>
                          </div>
                      </div>
          
                  </div>
          
              </div>
              <div class="Home Store">
                  <h4>I am here</h4>
                  <p>my dauther loves me</p>
                  <div class="text-sdsg">
                      Love Me tender
                  </div>
              </div>
              
                          <!-- Form Final -->
          

          Finally, we should use, from this post, the generic regex S/R, below :

          SEARCH (?-i:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?-i:FR)

          REPLACE RR

          with :

          • BSR ( Begin Search-region Regex ) = ^\h*<!-- Form Start -->

          • ESR ( End Search-region Regex ) = ^\h*<!-- Form Final -->

          • FR ( Find Regex ) = <p>

          • RR ( Replacement Regex ) = <p class="STAR-ONE">

          This leads to the effective regex :

          • SEARCH (?-i:^\h*<!-- Form Start -->|(?!\A)\G)(?s-i:(?!^\h*<!-- Form Final -->).)*?\K(?-i:<p>)

          • REPLACE <p class="STAR-ONE">

          But, as the -i modifier is used everywhere and as the (?s) dot is used for a single ., only, we can even simplify the S/R as :

          • SEARCH (?s-i)(?:^\h*<!-- Form Start -->|(?!\A)\G)(?:(?!^\h*<!-- Form Final -->).)*?\K<p>

          • REPLACE <p class="STAR-ONE">

          which correctly matches 6 occurrences in my above example ( = 3 x 2 zones <!-- Form Start -->•••••<!-- Form Final --> ) !
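For readers who want to check the same idea outside Notepad++, here is a minimal Python sketch. Python's `re` module supports neither `\G` nor `\K`, so this is not the regex above but a two-step substitute: first locate each region between the two markers, then replace `<p>` inside that region only. The function name `add_class_in_regions` and the short sample are hypothetical, and unlike the regex above the sketch does not require the markers to start a line and it ignores a start marker that has no matching end marker.

```python
import re

BEGIN = "<!-- Form Start -->"
END = "<!-- Form Final -->"

# One region = begin marker, the shortest possible body, end marker.
REGION = re.compile(re.escape(BEGIN) + r".*?" + re.escape(END), re.DOTALL)


def add_class_in_regions(html: str) -> tuple[str, int]:
    """Replace <p> with <p class="STAR-ONE"> only inside marked regions."""
    count = 0

    def fix_region(match: re.Match) -> str:
        nonlocal count
        fixed, n = re.subn(r"<p>", '<p class="STAR-ONE">', match.group(0))
        count += n
        return fixed

    return REGION.sub(fix_region, html), count


# Hypothetical sample: two marked regions with an unmarked paragraph in between.
sample = (
    "<!-- Form Start -->\n<p>one</p>\n<!-- Form Final -->\n"
    "<p>outside</p>\n"
    "<!-- Form Start -->\n<p>two</p>\n<p>three</p>\n<!-- Form Final -->\n"
)
result, replaced = add_class_in_regions(sample)
print(replaced)
print(result)
```

The paragraph between the two regions is left untouched, which is the behaviour the (?!ESR) part of the generic regex is there to guarantee.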

          Best Regards,

          guy038

          • Robin Cruise

            thank you.

            also, maybe @guy038 can help me with a similar problem:

            I have these 4 lines:

             <link rel="canonical" href="https://website.com/en/camera.html" />
            

            and

            	<div class="somers"><a href="https://othersite/fffffon.html" class="flags bg" hreflang="bg" title="bk"></a>
            <a href="https://roberta.com/test-lofet.html" class="flags sk" hreflang="sk" title="sk"></a>
            <a href="https://cameleon.com/america.html" class="flags uk" hreflang="uk" title="uk"></a>
            

            I want to copy https://website.com/en/camera.html from the canonical tag and use it to replace the 3 links on the other lines.

            My regex changes only the first of the three links, and I don't know why :(

            Search: (?s)<link rel="canonical" href="(.*?)"\h/>.*?<a href="\K.*?(?="\hclass="flags)
            Replace by: \1

            The pattern I follow is:

            FIND: (?s)PART-A(.*?)PART-B.*?SECOND-A\K.*?(?=SECOND-B)
            REPLACE BY: \1

            The desired output:

            	<div class="somers"><a href="https://website.com/en/camera.html" class="flags bg" hreflang="bg" title="bk"></a>
            <a href="https://website.com/en/camera.html" class="flags sk" hreflang="sk" title="sk"></a>
            <a href="https://website.com/en/camera.html" class="flags uk" hreflang="uk" title="uk"></a>
            
            • guy038

              Hi @robin-cruise,

              • Can you specify if the line <link rel="canonical" href="https://website.com/en/camera.html" /> occurs only once, in each HTML file ?

              • Does this line always come before the different <a href="•••••••••••••••" class="flags expressions ?

              TIA,

              Cheers

              guy038

              • Robin Cruise

                hello @guy038

                yes, <link rel="canonical" href="https://website.com/en/camera.html" /> occurs only once in each HTML file.

                and yes, that line always comes before the different <a href="•••••••••••••••" class="flags expressions.

                The canonical line is near the beginning of the file.
                All those <a href="•••••••••••••••" class="flags tags are at the end of the file, in the footer.

                • Robin Cruise

                  can you help me @guy038 ?

                  • guy038

                    Hello @robin-cruise ,

                    Sorry, I spent a lot of time on @xaviermdq's problem ! Refer here !

                    I won’t be long. I’ve already imagined something which should work !

                    BR

                    guy038

                    • guy038

                      Hi, @robin-cruise and All,

                      The general problem is : how can we modify some lines using an expression ( https://website.com/en/camera.html ) located before these lines ? Somehow, we need to copy the address found in the <link ••••• /> tag to somewhere after the lines to modify !
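A rough way to see why the single-pass regex from the question changes only one link: every match has to begin at the `<link rel="canonical"` tag, and that tag occurs only once per file. The Python sketch below counts the matches on the sample lines from the question. It is an approximation, since Python's `re` has neither `\K` nor `\h`; a second capture group and `[ \t]` are used instead.

```python
import re

html = (
    '<link rel="canonical" href="https://website.com/en/camera.html" />\n'
    '<div class="somers"><a href="https://othersite/fffffon.html" class="flags bg" hreflang="bg" title="bk"></a>\n'
    '<a href="https://roberta.com/test-lofet.html" class="flags sk" hreflang="sk" title="sk"></a>\n'
    '<a href="https://cameleon.com/america.html" class="flags uk" hreflang="uk" title="uk"></a>\n'
)

# Same shape as the regex in the question: canonical tag first, then one flag link.
single_pass = re.compile(
    r'(?s)<link rel="canonical" href="(.*?)"[ \t]/>.*?<a href="(.*?)(?="[ \t]class="flags)'
)

matches = list(single_pass.finditer(html))
print(len(matches))          # number of links the single pass can reach
print(matches[0].group(2))   # the only link that gets matched
```

After the first match, the search resumes past the first link, where no further canonical tag exists, so the second and third links are never reached.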


                      So I propose to decompose the problem into two smaller ones :

                      • Firstly, we store in a comment, at the very end of each HTML file, the address located in the <link ••••• /> tag with this regex S/R :

                      SEARCH (?-is)<link rel="canonical" href="(.+?)"(?s).+\K

                      REPLACE \r\n<!-- \1 -->

                      • Secondly :

                        • We replace the address, in each <a href="••••••••••" class="flags••••••••••></a> tag found, with the stored address in the last comment of the file, at the very end of file

                        • We delete this temporary comment, as well

                      With this regex S/R :

                      SEARCH (?-is)<a href="\K.*?(?="\h+class="flags(?s).+<!-- (.+) -->\z)|(?-s)<!--.+\z

                      REPLACE ?1\1
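As a cross-check outside Notepad++, the same two steps can be sketched in Python. Because a script can keep the address in a variable, the temporary comment is not needed here; this is a substitute for the S/R above, not a translation of it. The function name `apply_canonical` and the extra `class="menu"` link on example.invalid are hypothetical, added only to show that non-flag links are left alone.

```python
import re


def apply_canonical(html: str) -> str:
    """Copy the canonical URL into the href of every class="flags ..." link."""
    canonical = re.search(r'<link rel="canonical" href="([^"]+)"', html)
    if canonical is None:
        return html  # no canonical tag: nothing to copy from
    url = canonical.group(1)
    # Replace only the href value of <a> tags directly followed by class="flags
    return re.sub(
        r'(<a href=")[^"]*(?="[ \t]+class="flags)',
        lambda m: m.group(1) + url,
        html,
    )


sample = (
    '<link rel="canonical" href="https://website.com/en/camera.html" />\n'
    '<a href="https://othersite/fffffon.html" class="flags bg" hreflang="bg" title="bk"></a>\n'
    '<a href="https://roberta.com/test-lofet.html" class="flags sk" hreflang="sk" title="sk"></a>\n'
    '<a href="https://example.invalid/other.html" class="menu">not a flag link</a>\n'
)
updated = apply_canonical(sample)
print(updated)
```

The replacement is passed as a function so that any backslash in the address is inserted literally.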

                      Best Regards,

                      guy038

                      • Robin Cruise

                        thanks @guy038

                        • Robin Cruise

                          I found a solution that works with PowerShell. It replaces the links with the URL from the canonical link tag:

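                          $sourcedir = "C:\Folder1\"
                          $resultsdir = "C:\Folder1\"
                          
                          Get-ChildItem -Path $sourcedir -Filter *.html | ForEach-Object {
                              # Read the whole file as a single string
                              $content = Get-Content -Path $_.FullName -Raw
                              # Extract the URL of the canonical link tag
                              $replaceValue = (Select-String -InputObject $content -Pattern '(?<=<link rel="canonical" href=")[^"]+(?=" />)').Matches.Value
                              # Skip files without a canonical tag, otherwise every URL would be blanked
                              if (-not $replaceValue) { return }
                              # Replace every https URL ending in .html (all of them, not only the flag links)
                              $content = $content -replace 'https://[^"\s]+\.html', $replaceValue
                              # -NoNewline avoids appending an extra line break to the raw content
                              Set-Content -Path (Join-Path $resultsdir $_.Name) -Value $content -NoNewline
                          }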
                          
                          The Community of users of the Notepad++ text editor.