why does this parsing not work? Replace the content of html tags between comments section



  • Hello, I read this topic and @guy038’s answer over HERE.

    I have a similar request. I have this HTML code:

           <!-- Form Start -->
        <div class="55df" >
            <div class="gg44">
                <h4 class="header-text">United Romaketh</h4>
                <img src="https://sensy.com/33.png" alt="Alternate Text" />
            </div>
            <!-- Identify  -->
            <div class="45de">
                <div class="col">
                    <div class="body-text">
                        <h4>Marcus 33</h4>
                        <p>Can you tell me?</p>
                    </div>
                    <div class="yy">
                        <!-- Yesd. -->
    
                        <form action="https://www.gre.com/" method="post" class="similar">
    
                            <!-- Stones -->
                            <input type="hidden" name="business" value="erer@gmail.com">
    
                            <!-- button simple -->
                            <input type="hidden" name="cmd" value="_buttons">
    
                            <!-- contribution -->
                            <input type="hidden" name="item_name" value="Maxim">
                            <input type="hidden" name="item_number" value="Maxim">
                            <select name="amount"><option value="3.00">&euro;3.00</option><option value="5.00">&euro;5.00</option><option value="10.00">&euro;10.00</option><option value="25.00">&euro;25.00</option><option value="50.00">&euro;50.00</option></select>
                            <input type="hidden" name="currency" value="DOL">
    
                            <!-- Display the button. -->
                            <input class="paypal-img" type="image" src="https://www.pitt.com/hh.gif" border="0" name="submit" title="yy" alt="button" />
                        </form>
    
    
    
    
                    </div>
                </div>
                <div class="col">
                    <div class="body-text">
                        <h4>I am here</h4>
                        <p>My text here</p>
                    </div>
                    <div class="gono">
                        <form action="https://www.concate.com/donate" method="post" target="_top">
                            <input type="hidden" name="444" value="7Z7JBUL" />
                            <input class="r-img" type="image" src="https://www.rer.com/" border="0" name="submit" title="d" alt="sdsd" />
                            <img alt="" border="0" src="https://www.dd.com/pixel.gif" width="1" height="1" />
                        </form>
                    </div>
                </div>
    
            </div>
    
        </div>
        <div class="Home Store">
            <h4>I am here</h4>
            <p>my dauther loves me</p>
            <div class="text-sdsg">
                Love Me tender
            </div>
        </div>
        
                    <!-- Form Final -->
    

    The problem: I must replace every <p></p> with <p class="STAR-ONE"></p> inside the section that runs from <!-- Form Start --> to <!-- Form Final -->

    My regex looks fine to me, but it does not replace correctly. Instead of <p class="STAR-ONE"></p> it gives me <p><p class="STAR-ONE"></p>

    I really don’t know where the mistake in my regex is…?!

    Search: (?:.*?<!-- Form Start -->|\G).*?\K(<p>).*?(?=</p>.*?<!-- Form Final -->)
    Replace by: \1<p class="STAR-ONE">\3



  • @Robin-Cruise said in why does this parsing not work? Replace the content of html tags between comments section:

    (?:.*?<!-- Form Start -->|\G).*?\K(<p>).*?(?=</p>.*?<!-- Form Final -->)

    I found the answer, thanks:

    FIND: (?:.*?<!-- Form Start -->|\G).*?\K<p>(.*?)(?=</p>.*?<!-- Form Final -->)
    Replace by: \2<p class="STAR-ONE">\1\3

    CHECK Wrap around
    CHECK Regular expression
    CHECK . matches newline



  • Hello, @robin-cruise and All,

    Firstly, I just cannot understand your last regex S/R :

    SEARCH (?:.*?<!-- Form Start -->|\G).*?\K<p>(.*?)(?=</p>.*?<!-- Form Final -->)
    REPLACE \2<p class="STAR-ONE">\1\3

    Indeed, the replacement refers to three groups ( 1, 2 and 3 ) but your search regex contains only one group, (.*?) !? I verified that, after each replace operation, groups 2 and 3 are always empty !
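
    The group numbering can be checked outside Notepad++ with a minimal Python sketch. It is only an approximation: Python's `re` module has no \K, so the sketch does not match the text before <p> at all. It also leaves out \2 and \3, because Python raises an error for undefined groups instead of treating them as empty. The variable names are made up for illustration.

```python
import re

sample = '<p>Can you tell me?</p>'

# First attempt: group 1 is the <p> tag itself.
first_try = re.compile(r'(<p>).*?(?=</p>)')
# Second attempt: group 1 is the text that follows the tag.
second_try = re.compile(r'<p>(.*?)(?=</p>)')

# \1 re-inserts the captured <p>, and the matched text is dropped.
doubled = first_try.sub(r'\1<p class="STAR-ONE">', sample)

# \1 now restores the text after the new opening tag.
fixed = second_try.sub(r'<p class="STAR-ONE">\1', sample)

print(doubled)
print(fixed)
```

    In the first attempt the whole match ( the tag plus its text ) is replaced by the tag plus the new tag, which would explain the doubled <p> in the original question.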


    Secondly, you should have expressed your initial goal as :

    " I have some <p>Some Text</p> zones and I would like to change them into <p class="STAR-ONE">SAME Text</p> "

    It’s a little bit clearer !


    Thirdly :

    • You could have added the (?s) modifier, at the beginning of your regex, so that the . matches newline option no longer matters !

    • You could have used the more accurate ^\h* syntax, instead of .*?, at the beginning of the non-capturing group

    • You could have added the negative look-ahead (?!\A), right before the \G assertion. Indeed, as the Wrap around option is set, the regex engine starts the replacement process from the very beginning of the file, whatever the current caret location, and the (?!\A) syntax ensures that the regex engine will not use the second alternative \G there but will look, instead, for a <!-- Form Start --> line, first !
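
    The first point can be illustrated with a short Python sketch, assuming Python's `re` module as a stand-in for the Notepad++ engine ( its inline (?s) flag plays the same role as the . matches newline checkbox ). The \h and \G points are not covered, since `re` supports neither.

```python
import re

text = "<!-- Form Start -->\n<div>\n<p>Hello</p>"
pattern = r"<!-- Form Start -->.*?<p>"

# Without the flag, '.' cannot cross the line breaks, so nothing matches.
without_flag = re.search(pattern, text)

# With the inline modifier, '.' also matches newline characters.
with_inline = re.search(r"(?s)" + pattern, text)

# The external option gives the same result as the inline modifier.
with_option = re.search(pattern, text, re.DOTALL)

print(without_flag)
print(with_inline.group(0) == with_option.group(0))
```

    Writing the modifier inside the regex makes the search independent of the state of the dialog option.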


    Fourthly, your regex cannot work with, for instance, the text below, where I duplicated your initial example and, in between, I inserted the same section, without the boundaries <!-- Form Start --> and <!-- Form Final -->

           <!-- Form Start -->
        <div class="55df" >
            <div class="gg44">
                <h4 class="header-text">United Romaketh</h4>
                <img src="https://sensy.com/33.png" alt="Alternate Text" />
            </div>
            <!-- Identify  -->
            <div class="45de">
                <div class="col">
                    <div class="body-text">
                        <h4>Marcus 33</h4>
                        <p>Can you tell me?</p>
                    </div>
                    <div class="yy">
                        <!-- Yesd. -->
    
                        <form action="https://www.gre.com/" method="post" class="similar">
    
                            <!-- Stones -->
                            <input type="hidden" name="business" value="erer@gmail.com">
    
                            <!-- button simple -->
                            <input type="hidden" name="cmd" value="_buttons">
    
                            <!-- contribution -->
                            <input type="hidden" name="item_name" value="Maxim">
                            <input type="hidden" name="item_number" value="Maxim">
                            <select name="amount"><option value="3.00">&euro;3.00</option><option value="5.00">&euro;5.00</option><option value="10.00">&euro;10.00</option><option value="25.00">&euro;25.00</option><option value="50.00">&euro;50.00</option></select>
                            <input type="hidden" name="currency" value="DOL">
    
                            <!-- Display the button. -->
                            <input class="paypal-img" type="image" src="https://www.pitt.com/hh.gif" border="0" name="submit" title="yy" alt="button" />
                        </form>
    
    
    
    
                    </div>
                </div>
                <div class="col">
                    <div class="body-text">
                        <h4>I am here</h4>
                        <p>My text here</p>
                    </div>
                    <div class="gono">
                        <form action="https://www.concate.com/donate" method="post" target="_top">
                            <input type="hidden" name="444" value="7Z7JBUL" />
                            <input class="r-img" type="image" src="https://www.rer.com/" border="0" name="submit" title="d" alt="sdsd" />
                            <img alt="" border="0" src="https://www.dd.com/pixel.gif" width="1" height="1" />
                        </form>
                    </div>
                </div>
    
            </div>
    
        </div>
        <div class="Home Store">
            <h4>I am here</h4>
            <p>my dauther loves me</p>
            <div class="text-sdsg">
                Love Me tender
            </div>
        </div>
        
                    <!-- Form Final -->
    <!-- ------------------------------------------------------------------------------------------------------------------------------- -->
        <div class="55df" >
            <div class="gg44">
                <h4 class="header-text">United Romaketh</h4>
                <img src="https://sensy.com/33.png" alt="Alternate Text" />
            </div>
            <!-- Identify  -->
            <div class="45de">
                <div class="col">
                    <div class="body-text">
                        <h4>Marcus 33</h4>
                        <p>Can you tell me?</p>
                    </div>
                    <div class="yy">
                        <!-- Yesd. -->
    
                        <form action="https://www.gre.com/" method="post" class="similar">
    
                            <!-- Stones -->
                            <input type="hidden" name="business" value="erer@gmail.com">
    
                            <!-- button simple -->
                            <input type="hidden" name="cmd" value="_buttons">
    
                            <!-- contribution -->
                            <input type="hidden" name="item_name" value="Maxim">
                            <input type="hidden" name="item_number" value="Maxim">
                            <select name="amount"><option value="3.00">&euro;3.00</option><option value="5.00">&euro;5.00</option><option value="10.00">&euro;10.00</option><option value="25.00">&euro;25.00</option><option value="50.00">&euro;50.00</option></select>
                            <input type="hidden" name="currency" value="DOL">
    
                            <!-- Display the button. -->
                            <input class="paypal-img" type="image" src="https://www.pitt.com/hh.gif" border="0" name="submit" title="yy" alt="button" />
                        </form>
    
    
    
    
                    </div>
                </div>
                <div class="col">
                    <div class="body-text">
                        <h4>I am here</h4>
                        <p>My text here</p>
                    </div>
                    <div class="gono">
                        <form action="https://www.concate.com/donate" method="post" target="_top">
                            <input type="hidden" name="444" value="7Z7JBUL" />
                            <input class="r-img" type="image" src="https://www.rer.com/" border="0" name="submit" title="d" alt="sdsd" />
                            <img alt="" border="0" src="https://www.dd.com/pixel.gif" width="1" height="1" />
                        </form>
                    </div>
                </div>
    
            </div>
    
        </div>
        <div class="Home Store">
            <h4>I am here</h4>
            <p>my dauther loves me</p>
            <div class="text-sdsg">
                Love Me tender
            </div>
        </div>
    <!-- ------------------------------------------------------------------------------------------------------------------------------- -->
           <!-- Form Start -->
        <div class="55df" >
            <div class="gg44">
                <h4 class="header-text">United Romaketh</h4>
                <img src="https://sensy.com/33.png" alt="Alternate Text" />
            </div>
            <!-- Identify  -->
            <div class="45de">
                <div class="col">
                    <div class="body-text">
                        <h4>Marcus 33</h4>
                        <p>Can you tell me?</p>
                    </div>
                    <div class="yy">
                        <!-- Yesd. -->
    
                        <form action="https://www.gre.com/" method="post" class="similar">
    
                            <!-- Stones -->
                            <input type="hidden" name="business" value="erer@gmail.com">
    
                            <!-- button simple -->
                            <input type="hidden" name="cmd" value="_buttons">
    
                            <!-- contribution -->
                            <input type="hidden" name="item_name" value="Maxim">
                            <input type="hidden" name="item_number" value="Maxim">
                            <select name="amount"><option value="3.00">&euro;3.00</option><option value="5.00">&euro;5.00</option><option value="10.00">&euro;10.00</option><option value="25.00">&euro;25.00</option><option value="50.00">&euro;50.00</option></select>
                            <input type="hidden" name="currency" value="DOL">
    
                            <!-- Display the button. -->
                            <input class="paypal-img" type="image" src="https://www.pitt.com/hh.gif" border="0" name="submit" title="yy" alt="button" />
                        </form>
    
    
    
    
                    </div>
                </div>
                <div class="col">
                    <div class="body-text">
                        <h4>I am here</h4>
                        <p>My text here</p>
                    </div>
                    <div class="gono">
                        <form action="https://www.concate.com/donate" method="post" target="_top">
                            <input type="hidden" name="444" value="7Z7JBUL" />
                            <input class="r-img" type="image" src="https://www.rer.com/" border="0" name="submit" title="d" alt="sdsd" />
                            <img alt="" border="0" src="https://www.dd.com/pixel.gif" width="1" height="1" />
                        </form>
                    </div>
                </div>
    
            </div>
    
        </div>
        <div class="Home Store">
            <h4>I am here</h4>
            <p>my dauther loves me</p>
            <div class="text-sdsg">
                Love Me tender
            </div>
        </div>
        
                    <!-- Form Final -->
    

    Finally, we should use, from this post, the generic regex S/R, below :

    SEARCH (?-i:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?-i:FR)

    REPLACE RR

    with :

    • BSR ( Begin Search-region Regex ) = ^\h*<!-- Form Start -->

    • ESR ( End Search-region Regex ) = ^\h*<!-- Form Final -->

    • FR ( Find Regex ) = <p>

    • RR ( Replacement Regex ) = <p class="STAR-ONE">

    This leads to the effective regex :

    • SEARCH (?-i:^\h*<!-- Form Start -->|(?!\A)\G)(?s-i:(?!^\h*<!-- Form Final -->).)*?\K(?-i:<p>)

    • REPLACE <p class="STAR-ONE">
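
    The substitution of BSR, ESR and FR into the generic regex is plain text assembly, as this small Python sketch shows. The helper name `build_search` is hypothetical. The resulting string is meant to be pasted into the Notepad++ Find what field; Python's own `re` module cannot compile it, because it supports neither \G, \K nor \h.

```python
# Generic search regex from the thread, with named placeholders.
TEMPLATE = r"(?-i:{BSR}|(?!\A)\G)(?s-i:(?!{ESR}).)*?\K(?-i:{FR})"

def build_search(bsr, esr, fr):
    """Fill the three sub-regexes into the generic template (text only)."""
    return TEMPLATE.format(BSR=bsr, ESR=esr, FR=fr)

search = build_search(
    bsr=r"^\h*<!-- Form Start -->",
    esr=r"^\h*<!-- Form Final -->",
    fr=r"<p>",
)
print(search)
```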

    But, as the -i modifier is used everywhere and as the (?s) modifier only affects a single dot, we can even simplify the S/R as :

    • SEARCH (?s-i)(?:^\h*<!-- Form Start -->|(?!\A)\G)(?:(?!^\h*<!-- Form Final -->).)*?\K<p>

    • REPLACE <p class="STAR-ONE">

    which correctly matches 6 occurrences in my above example ( = 3 x 2 zones <!-- Form Start -->•••••<!-- Form Final --> ) !
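
    For comparison, the same "replace only inside a region" idea can be sketched in Python. This is not a translation of the regex above: Python's `re` has no \G or \K, so the sketch first locates each region and then replaces inside it. It also assumes every region has both its start and its end comment. The function name is made up for illustration.

```python
import re

# A region runs from a "Form Start" comment line to the next "Form Final" one.
REGION = re.compile(
    r"(?ms)^[ \t]*<!-- Form Start -->.*?^[ \t]*<!-- Form Final -->"
)

def add_class_in_regions(html):
    """Return (new_html, count), rewriting <p> only inside marked regions."""
    total = 0

    def fix_region(match):
        nonlocal total
        new_text, n = re.subn(r"<p>", '<p class="STAR-ONE">', match.group(0))
        total += n
        return new_text

    return REGION.sub(fix_region, html), total

sample = (
    "<!-- Form Start -->\n<p>one</p>\n<p>two</p>\n<!-- Form Final -->\n"
    "<p>outside</p>\n"
    "  <!-- Form Start -->\n<p>three</p>\n  <!-- Form Final -->\n"
)
result, count = add_class_in_regions(sample)
print(count)
print(result)
```

    The <p> between the two regions is left untouched, which is the behaviour the (?!ESR) look-ahead enforces in the Notepad++ version.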

    Best Regards,

    guy038



  • thank you.

    also, maybe @guy038 can help me with a similar problem:

    I have these 4 lines:

     <link rel="canonical" href="https://website.com/en/camera.html" />
    

    and

    	<div class="somers"><a href="https://othersite/fffffon.html" class="flags bg" hreflang="bg" title="bk"></a>
    <a href="https://roberta.com/test-lofet.html" class="flags sk" hreflang="sk" title="sk"></a>
    <a href="https://cameleon.com/america.html" class="flags uk" hreflang="uk" title="uk"></a>
    

    I want to copy https://website.com/en/camera.html from the canonical tag and use it to replace the 3 links on the other lines.

    My regex changes only the first of the three links, and I don’t know why :(

    Search: (?s)<link rel="canonical" href="(.*?)"\h/>.*?<a href="\K.*?(?="\hclass="flags)
    Replace by: \1

    The pattern I follow is:

    FIND: (?s)PART-A(.*?)PART-B.*?SECOND-A\K.*?(?=SECOND-2)
    REPLACE BY: \1

    The desired output:

    	<div class="somers"><a href="https://website.com/en/camera.html" class="flags bg" hreflang="bg" title="bk"></a>
    <a href="https://website.com/en/camera.html" class="flags sk" hreflang="sk" title="sk"></a>
    <a href="https://website.com/en/camera.html" class="flags uk" hreflang="uk" title="uk"></a>
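
    A likely reason why only the first link changes: every match of the search above has to start at the <link rel="canonical" tag, and that tag occurs only once. The Python sketch below counts the matches on a shortened copy of the sample. It is an approximation, because Python's `re` has no \K or \h: the part after \K becomes a second group and \h is written as [ \t].

```python
import re

html = (
    '<link rel="canonical" href="https://website.com/en/camera.html" />\n'
    '<a href="https://othersite/fffffon.html" class="flags bg"></a>\n'
    '<a href="https://roberta.com/test-lofet.html" class="flags sk"></a>\n'
    '<a href="https://cameleon.com/america.html" class="flags uk"></a>\n'
)

# Same shape as the search regex: canonical tag first, then one flag link.
pattern = re.compile(
    r'(?s)<link rel="canonical" href="(.*?)"[ \t]/>'
    r'.*?<a href="(.*?)(?="[ \t]class="flags)'
)

matches = pattern.findall(html)
print(len(matches))
print(matches[0])
```

    After the first match, the search resumes behind the first link, where no canonical tag is left to start a second match.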


  • Hi @robin-cruise,

    • Can you specify if the line <link rel="canonical" href="https://website.com/en/camera.html" /> occurs only once, in each HTML file ?

    • Does this line always come before the different <a href="•••••••••••••••" class="flags expressions ?

    TIA,

    Cheers

    guy038



  • hello @guy038

    yes, <link rel="canonical" href="https://website.com/en/camera.html" /> occurs only once, in each HTML file.

    and yes, that line always comes before the different <a href="•••••••••••••••" class="flags expressions.

    The canonical line is near the beginning of the file.
    All those <a href="•••••••••••••••" class="flags links are at the end of the files, in the footer.



  • can you help me @guy038 ?



  • Hello @robin-cruise ,

    Sorry, I spent a lot of time with the @xaviermdq’s problem ! Refer here !

    I won’t be long. I’ve already imagined something which should work !

    BR

    guy038



  • Hi, @robin-cruise and All,

    The general problem is : how to modify some lines with an expression ( https://website.com/en/camera.html ) which is located before these lines ? Somehow, we need to copy the address found in the <link ••••• /> tag to somewhere after the lines to modify !


    So I propose to decompose the problem into two smaller ones :

    • Firstly, we store in a comment, at the very end of each HTML file, the address located in the <link ••••• /> tag with this regex S/R :

    SEARCH (?-is)<link rel="canonical" href="(.+?)"(?s).+\K

    REPLACE \r\n<!-- \1 -->

    • Secondly :

      • We replace the address, in each <a href="••••••••••" class="flags••••••••••></a> tag found, with the stored address in the last comment of the file, at the very end of file

      • We delete this temporary comment, as well

    With this regex S/R :

    SEARCH (?-is)<a href="\K.*?(?="\h+class="flags(?s).+<!-- (.+) -->\z)|(?-s)<!--.+\z

    REPLACE ?1\1
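
    Outside Notepad++, a script can keep the address in a variable, so the temporary comment is not needed. The following Python sketch shows that variant under the same assumptions as above ( a single canonical tag, flag links marked by class="flags ). The function name is hypothetical.

```python
import re

CANONICAL = re.compile(r'<link rel="canonical" href="([^"]+)"')
FLAG_HREF = re.compile(r'(<a href=")[^"]*("\s+class="flags)')

def point_flags_to_canonical(html):
    """Copy the canonical URL into every <a ... class="flags ..."> href."""
    found = CANONICAL.search(html)
    if found is None:
        return html  # no canonical tag: leave the text untouched
    url = found.group(1)
    # A function replacement inserts the URL literally, without escapes.
    return FLAG_HREF.sub(lambda m: m.group(1) + url + m.group(2), html)

page = (
    '<link rel="canonical" href="https://website.com/en/camera.html" />\n'
    '<a href="https://othersite/fffffon.html" class="flags bg"></a>\n'
    '<a href="https://example.org/other.html" class="menu"></a>\n'
    '<a href="https://roberta.com/test-lofet.html" class="flags sk"></a>\n'
)
updated = point_flags_to_canonical(page)
print(updated)
```

    Links without the flags class, such as the made-up example.org one, are not modified.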

    Best Regards,

    guy038



  • thanks @guy038



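  • I found a solution that works with PowerShell. It replaces the flag links with the URL from the canonical link tag:

    $sourcedir = "C:\Folder1\"
    $resultsdir = "C:\Folder1\"
    
    Get-ChildItem -Path $sourcedir -Filter *.html | ForEach-Object {
        # Read the whole file as one string
        $content = Get-Content -Path $_.FullName -Raw
        # Extract the canonical URL (everything up to the closing quote)
        $canonical = [regex]::Match($content, '(?<=<link rel="canonical" href=")[^"]+')
        if ($canonical.Success) {
            # Double any $ so that -replace inserts the URL literally
            $replaceValue = $canonical.Value.Replace('$', '$$')
            # Rewrite only the href of the <a ... class="flags ..."> links
            $content = $content -replace '(?<=<a href=")[^"]*(?="\s+class="flags)', $replaceValue
            Set-Content -Path (Join-Path $resultsdir $_.Name) -Value $content -NoNewline
        }
    }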
    
