    Regex: Delete everything that falls between the html comments except some tags

    • Hellena Crainicu

      Hello, I have these HTML lines that start with <!-- MAIN START --> and end with <!-- MAIN FINAL -->

      <!-- MAIN START -->
      
      <div align="center">
            <table width="33" border="0">
              <tr>
                <td>
            <h1 class="tre" itemprop="sfe">Text here</h1>
                </td>
              </tr>
              <tr>
                <td class="rest">Something, by Author</td>
              </tr>
            </table>
          <h2 class="blast2"><img src="sfa.jpg" alt="hip" />
         <map name="goon" id="m2_34">
      <p class="my_2">I love myself</p>
      <area shape="rect" coords="45,74,582" href="#plata" alt="" />
      </map>
          </h2>
            <p class="my_2">Why this text text?</p>
      <p class="my_3">test text text</p>
            <p class="my_2">test text text</p>
      <p class="my_3">test text text</p>
        </div>
        <p align="justify" class="justify_em">Yes</p>
       
           <!-- MAIN FINAL -->
      

      I want to remove everything between these comments and keep only the HTML tags such as <p class=...</p>

      The output should be:

      <!-- MAIN START -->
      
       <p class="my_2">I love myself</p>
       <p class="my_2">Why this text text?</p>
       <p class="my_3">test text text</p>
       <p class="my_2">test text text</p>
       <p class="my_3">test text text</p>
         
      <!-- MAIN FINAL -->
      

      My regex is not working:

      Find: (<\!-- MAIN START -->).*(?!<p class=.*</p>).*(<\!-- MAIN FINAL -->)

      Replace by: \1\2\3

      • Vasile Caraus

        I don’t know if it’s possible with regex and Notepad++, but it’s definitely possible with PowerShell:

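        # Source and destination folders (adjust to your setup)
        $sourcedir = "C:\Folder1\"
        $resultsdir = "C:\Folder2\"
        Get-ChildItem -Path $sourcedir -Filter *.html | ForEach-Object {
            $output = @()
            # Force an array, even for a one-line file
            $content = @(Get-Content -Path $_.FullName)
            # First line holding each marker comment
            $start = $content | Where-Object { $_ -match '<!-- MAIN START -->' } | Select-Object -First 1
            $final = $content | Where-Object { $_ -match '<!-- MAIN FINAL -->' } | Select-Object -First 1
            # Look the marker positions up once, not on every loop iteration
            $startIndex = $content.IndexOf($start)
            $finalIndex = $content.IndexOf($final)
            for ($i = 0; $i -lt $content.Count; $i++) {
                # Between the markers, skip every line that holds no <p class=...> paragraph
                if (($i -gt $startIndex) -and ($i -lt $finalIndex) -and ($content[$i] -notmatch '<p class=')) {
                    continue
                }
                $output += $content[$i]
            }
            # Write the filtered file, under the same name, into the results folder
            $output | Out-File -FilePath (Join-Path $resultsdir $_.Name)
        }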
        
        • PeterJones @Hellena Crainicu

          @Hellena-Crainicu ,

          Our regex guru @guy038 has provided a “generic regex” in this post which provides a generic way of finding specific text (FR) and replacing it with specific text (RR), but only when found between a start-of-region and end-of-region pair (BSR and ESR).


          You would have to figure out the FR, RR, BSR, and ESR, and then plug those values into the generic expression…

          The find expression will be the hardest, because you want to find any text that isn’t an HTML paragraph. I don’t immediately have an idea of how to do that for sure… The others are pretty straightforward:

          • FR = … unsure yet … it’s going to involve lookaheads
          • RR = since you want to replace the found text with nothing, you can just leave this blank
          • BSR = <!-- MAIN START -->
          • ESR = <!-- MAIN FINAL -->

          My initial guess was that the FR was going to be something like ((?!<p.*?</p>).)+. Unfortunately, that exact sequence won’t work: the look-ahead only rejects the position of the < in <p...; one character further on, the text no longer starts with <p, so the expression starts matching again, from the p onwards.
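The behaviour described above can be reproduced outside Notepad++. Here is a minimal sketch in Python, assuming that its `re` module handles this particular look-ahead the same way as the regex engine in Notepad++. The one-line sample string is made up for illustration and is not taken from the thread.

```python
import re

# First guess for FR: consume one character at a time, as long as the current
# position is not the start of a complete <p...</p> element.
naive_fr = re.compile(r'((?!<p.*?</p>).)+')

# Hypothetical one-line sample, not taken from the thread.
sample = 'ab<p class="x">hi</p>cd'

matches = [m.group(0) for m in naive_fr.finditer(sample)]
print(matches)  # → ['ab', 'p class="x">hi</p>cd']

# Deleting every match removes the paragraph too; only its first '<' survives.
print(repr(naive_fr.sub('', sample)))  # → '<'
```

The second match starts at the `p` right after the `<`, which is why this expression cannot be used as the FR as it stands.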

          My next guess is that the generic regex would have to be modified somewhat, probably by adding an or-condition next to the ESR… but I’m not sure how. Maybe @guy038 will be able to find time to experiment with this one.

          Alternatively, it would be pretty easy in some tool other than pure Notepad++. @Vasile-Caraus has provided an off-topic PowerShell answer. If you want to stay closer to Notepad++, you could use one of the scripting plugins, such as PythonScript: the script would implement start and end checks similar to the ones @Vasile-Caraus wrote, but in Python, and would use the PythonScript interface to read and update the document currently open in Notepad++.
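As a rough illustration of that idea, here is a line-based sketch in plain Python. The function name and the sample are made up; inside Notepad++ the text would presumably come from and go back to the document through PythonScript's `editor.getText()` and `editor.setText(...)`, but that wiring is an assumption and is left out so that the sketch also runs on its own.

```python
START_MARK = '<!-- MAIN START -->'
FINAL_MARK = '<!-- MAIN FINAL -->'

def keep_paragraph_lines(text):
    """Between the two marker comments, keep only lines containing '<p class='."""
    kept = []
    inside = False
    for line in text.splitlines():
        if FINAL_MARK in line:
            inside = False      # the closing marker line itself is always kept
        if not inside or '<p class=' in line:
            kept.append(line)
        if START_MARK in line:
            inside = True       # filtering starts on the line after the marker
    return '\n'.join(kept)

# Hypothetical miniature of the sample from the first post.
sample = '\n'.join([
    '<div>before</div>',
    '<!-- MAIN START -->',
    '<div align="center">',
    '<p class="my_2">I love myself</p>',
    '</div>',
    '<!-- MAIN FINAL -->',
    '<div>after</div>',
])
print(keep_paragraph_lines(sample))
```

Like the PowerShell version, this works on whole lines, so a line holding a paragraph together with other markup is kept unchanged.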

          • Hellena Crainicu

            Thank you, @Vasile-Caraus. It works very well with PowerShell, but it would have been faster and easier with regex.

            • guy038

              Hello, @hellena-crainicu and All,

              Before giving, in the second part of this post, the specific solution for @hellena-crainicu, here is a simple example to show you the difficulties I had to face !

              Let’s start with this text :

              
              ---000000---DEF---
              START---12345---6789---1111199999---DEF---STOP
              ---GHI---000
              00000---
              START---123---PQR---456---789STOP
              ---00000000---00---GHI---0000---
              START987---AAA---654
              ---ZZZ---321---STOP
              ---0000---000000000
              

              And let’s suppose that we want to rewrite all numbers located between the boundaries START and STOP, each on a new line.

              If, in addition, we want to add a line-break after the START opening boundary, we need to slightly modify the generic regex discussed above by @peterjones, as we are searching for two independent strings simultaneously. This leads to the following regex S/R :

              (A) SEARCH   (?sx-i)(?: (START) | (?!\A)\G )  (?: (?!STOP). )*?  (\d+ (\R)? )
              
              (A) REPLACE  (?1\1\r\n\r\n)\2(?3:\r\n)
              

              which would change the initial text to :

              ---000000---DEF---
              START
              
              12345
              6789
              1111199999
              ---DEF---STOP
              ---GHI---000
              00000---
              START
              
              123
              456
              789
              STOP
              ---00000000---00---GHI---0000---
              START
              
              987
              654
              321
              ---STOP
              ---0000---000000000
              

              As you can see :

              • The START boundary is clearly defined

              • The different numbers, located between START and STOP are correctly rewritten one per line and extra stuff is deleted

              • However, between the last number and the closing boundary STOP, some extra characters are still not deleted :-(
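The role of the `(?!\A)\G` branch in search (A) may be easier to see when written as a loop. Python's standard `re` module has no `\G`, so the sketch below is an illustration and not a port of the Notepad++ regex: it imitates `\G` with `Pattern.match(text, pos)`, which only succeeds if the match begins exactly at `pos`. The helper name and the sample string are made up.

```python
import re

# One "step" of search (A): skip as little as possible without crossing STOP,
# then grab a number.
STEP = re.compile(r'(?:(?!STOP).)*?(\d+)', re.DOTALL)

def numbers_in_first_region(text):
    """Collect the numbers located between the first START and its STOP."""
    start = re.search(r'START', text)
    if start is None:
        return []
    pos = start.end()
    found = []
    while True:
        # match(text, pos) must begin exactly at pos: the stand-in for \G.
        step = STEP.match(text, pos)
        if step is None:
            break  # the chain of contiguous matches is broken: region is done
        found.append(step.group(1))
        pos = step.end()
    return found

print(numbers_in_first_region('00 START--12--34--STOP 56'))  # → ['12', '34']
```

The digits before START and after STOP are never reached, because every step has to start where the previous one ended and may not run across STOP.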

              No problem, we may modify this S/R to include the search of STOP, too, within a non-capturing group, giving :

              (B) SEARCH   (?sx-i)(?: (START) | (?!\A)\G )  (?: (?!STOP). )*?  (?: ( \d+ (\R)? ) | (STOP) )
              
              (B) REPLACE  (?1\1\r\n\r\n)(?2\2(?3:\r\n))(?4\r\n\4)
              

              And we will take the opportunity to add a line-break, right before the closing section STOP

              Thus, we obtain :

              ---000000---DEF---
              START
              
              12345
              6789
              1111199999
              
              STOP000
              00000
              123
              456
              789
              
              STOP00000000
              00
              0000
              987
              654
              321
              
              STOP0000
              000000000
              

              Unfortunately, it seems that the 0 digits are also processed like the other numbers, although they are not part of a START •••••STOP region :-((

              Indeed, after matching some stuff ending with STOP, the search process restarts immediately and considers the following characters, as we have specified the (?s) modifier ! So, how can we tell the regex engine to jump directly to the next START boundary ?

              I had the idea to search only for the beginning of the STOP string, for instance the string ST, and to add a negative look-ahead (?!OP), evaluated only once, right after the START string or the location of the previous match

              So :

              • First, the extra chars before STOP, as well as ST itself, are replaced with the string \r\nST

              • Now, the regex engine is located right before the OP string of the word STOP. However, because of the look-ahead (?!OP), it must advance by one position for the condition (?!OP) to become true. As this new match attempt does not start where the previous match ended, the \G assertion forces the match attempt to fail !

              • Thus, the string OP and the text after it are left unmodified, and the next match must necessarily begin with another START string, that is, the beginning of another allowed region !

              (C) SEARCH   (?sx-i)(?: (START) | (?!\A)\G )  (?!OP) (?: (?!STOP). )*?  (?: ( \d+ (\R)? ) | (ST) )
              
              (C) REPLACE  (?1\1\r\n\r\n)(?2\2(?3:\r\n))(?4\r\n\4)
              

              After replacement, we get :

              ---000000---DEF---
              START
              
              12345
              6789
              1111199999
              
              STOP
              ---GHI---000
              00000---
              START
              
              123
              456
              789
              
              STOP
              ---00000000---00---GHI---0000---
              START
              
              987
              654
              321
              
              STOP
              ---0000---000000000
              

              This time, it is easy to see that the following parts of the text are not modified at all by the replacement, as expected :

              • Before the first START boundary

              • After a STOP boundary and before a START boundary

              • After the last STOP boundary
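For comparison, the same end result can be reached without `\G` by handling each region in two steps: first isolate a START ••••• STOP region, then rewrite its content. This is a sketch of an alternative technique, not the single-regex method described above; it relies on a replacement callback in Python's `re.sub`, uses `\n` instead of `\r\n`, and all names are made up.

```python
import re

# A whole region, from START to the nearest STOP (DOTALL: may span lines).
REGION = re.compile(r'START(.*?)STOP', re.DOTALL)

def rewrite_region(match):
    """Rebuild one region with its numbers listed one per line."""
    numbers = re.findall(r'\d+', match.group(1))
    return 'START\n\n' + '\n'.join(numbers) + '\n\nSTOP'

def rewrite_numbers(text):
    # Text outside the regions is never passed to the callback, so it stays as is.
    return REGION.sub(rewrite_region, text)

print(rewrite_numbers('--00--\nSTART--12--AB--34STOP\n--56--'))
```

Because only the regions are handed to the callback, the zeros between a STOP and the next START cannot be touched by mistake.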


              Now, @hellena-crainicu, as promised, here is the regex S/R to achieve what you want :

              SEARCH (?s)(?:^\h*(<!-- MAIN START -->)(?:\h*\R)+|(?!\A)\G)(?!->)(?:(?!<!-- MAIN FINAL -->).)*?(?:^\h*(<p class=".+?</p>(\R)?)|^(?:\h*\R)*\h*(<!-- MAIN FINAL -))

              REPLACE (?1\1\r\n\r\n)\2(?3:\r\n)\4

              You may test it against this sample text, below, containing two sections <!-- MAIN START --> ••••• <!-- MAIN FINAL -->, embedded into three other sections !

              <div align="center">
                    <table width="33" border="0">
                      <tr>
                        <td>
                    <h1 class="tre" itemprop="sfe">Text here</h1>
                        </td>
                      </tr>
                      <tr>
                        <td class="rest">Something, by Author</td>
                      </tr>
                    </table>
                  <h2 class="blast2"><img src="sfa.jpg" alt="hip" />
                 <map name="goon" id="m2_34">
              <p class="my_2">I love myself</p>
              <area shape="rect" coords="45,74,582" href="#plata" alt="" />
              </map>
                  </h2>
                    <p class="my_2">Why this text text?</p>
              <p class="my_3">test text text</p>
                    <p class="my_2">test text text</p>
              <p class="my_3">test text text</p>
                </div>
                <p align="justify" class="justify_em">Yes</p>
              
              			         <!-- MAIN START -->
                 
              		
              
              
              <div align="center">
                    <table width="33" border="0">
                      <tr>
                        <td>
                    <h1 class="tre" itemprop="sfe">Text here</h1>
                        </td>
                      </tr>
                      <tr>
                        <td class="rest">Something, by Author</td>
                      </tr>
                    </table>
                  <h2 class="blast2"><img src="sfa.jpg" alt="hip" />
                 <map name="goon" id="m2_34">
              <p class="my_2">I love myself</p>
              <area shape="rect" coords="45,74,582" href="#plata" alt="" />
              </map>
                  </h2>
                    <p class="my_2">Why this text text?</p>
              <p class="my_3">test text text</p>      <p class="my_2">test text text</p>
              <p class="my_3">test text text</p>
                </div>
                <p align="justify" class="justify_em">Yes</p>
               
                   <!-- MAIN FINAL -->
              
              <div align="center">
                    <table width="33" border="0">
                      <tr>
                        <td>
                    <h1 class="tre" itemprop="sfe">Text here</h1>
                        </td>
                      </tr>
                      <tr>
                        <td class="rest">Something, by Author</td>
                      </tr>
                    </table>
                  <h2 class="blast2"><img src="sfa.jpg" alt="hip" />
                 <map name="goon" id="m2_34">
              <p class="my_2">I love myself</p>
              <area shape="rect" coords="45,74,582" href="#plata" alt="" />
              </map>
                  </h2>
                    <p class="my_2">Why this text text?</p>
              <p class="my_3">test text text</p>
                    <p class="my_2">test text text</p>
              <p class="my_3">test text text</p>
                </div>
                <p align="justify" class="justify_em">Yes</p>
               
              <!-- MAIN START -->
              
              <div align="center">
                    <table width="33" border="0">
                      <tr>
                        <td>
                    <h1 class="tre" itemprop="sfe">Text here</h1>
                        </td>
                      </tr>
                      <tr>
                        <td class="rest">Something, by Author</td>
                      </tr>
                    </table>
                  <h2 class="blast2"><img src="sfa.jpg" alt="hip" />
                 <map name="goon" id="m2_34">
              <p class="my_2">I love myself</p>
              <area shape="rect" coords="45,74,582" href="#plata" alt="" />
              </map>
                  </h2>
                    <p class="my_2">Why this text text?</p>
              <p class="my_3">test text text</p>
                    <p class="my_2">test text text</p>
              <p class="my_3">test text text</p>
                </div>
                <p align="justify" class="justify_em">Yes</p>
               
                   <!-- MAIN FINAL -->
              
              <div align="center">
                    <table width="33" border="0">
                      <tr>
                        <td>
                    <h1 class="tre" itemprop="sfe">Text here</h1>
                        </td>
                      </tr>
                      <tr>
                        <td class="rest">Something, by Author</td>
                      </tr>
                    </table>
                  <h2 class="blast2"><img src="sfa.jpg" alt="hip" />
                 <map name="goon" id="m2_34">
              <p class="my_2">I love myself</p>
              <area shape="rect" coords="45,74,582" href="#plata" alt="" />
              </map>
                  </h2>
                    <p class="my_2">Why this text text?</p>
              <p class="my_3">test text text</p>
                    <p class="my_2">test text text</p>
              <p class="my_3">test text text</p>
                </div>
                <p align="justify" class="justify_em">Yes</p>
              

              You should get the expected text :

              <div align="center">
                    <table width="33" border="0">
                      <tr>
                        <td>
                    <h1 class="tre" itemprop="sfe">Text here</h1>
                        </td>
                      </tr>
                      <tr>
                        <td class="rest">Something, by Author</td>
                      </tr>
                    </table>
                  <h2 class="blast2"><img src="sfa.jpg" alt="hip" />
                 <map name="goon" id="m2_34">
              <p class="my_2">I love myself</p>
              <area shape="rect" coords="45,74,582" href="#plata" alt="" />
              </map>
                  </h2>
                    <p class="my_2">Why this text text?</p>
              <p class="my_3">test text text</p>
                    <p class="my_2">test text text</p>
              <p class="my_3">test text text</p>
                </div>
                <p align="justify" class="justify_em">Yes</p>
              
              <!-- MAIN START -->
              
              <p class="my_2">I love myself</p>
              <p class="my_2">Why this text text?</p>
              <p class="my_3">test text text</p>
              <p class="my_2">test text text</p>
              <p class="my_3">test text text</p>
              
              <!-- MAIN FINAL -->
              
              <div align="center">
                    <table width="33" border="0">
                      <tr>
                        <td>
                    <h1 class="tre" itemprop="sfe">Text here</h1>
                        </td>
                      </tr>
                      <tr>
                        <td class="rest">Something, by Author</td>
                      </tr>
                    </table>
                  <h2 class="blast2"><img src="sfa.jpg" alt="hip" />
                 <map name="goon" id="m2_34">
              <p class="my_2">I love myself</p>
              <area shape="rect" coords="45,74,582" href="#plata" alt="" />
              </map>
                  </h2>
                    <p class="my_2">Why this text text?</p>
              <p class="my_3">test text text</p>
                    <p class="my_2">test text text</p>
              <p class="my_3">test text text</p>
                </div>
                <p align="justify" class="justify_em">Yes</p>
               
              <!-- MAIN START -->
              
              <p class="my_2">I love myself</p>
              <p class="my_2">Why this text text?</p>
              <p class="my_3">test text text</p>
              <p class="my_2">test text text</p>
              <p class="my_3">test text text</p>
              
              <!-- MAIN FINAL -->
              
              <div align="center">
                    <table width="33" border="0">
                      <tr>
                        <td>
                    <h1 class="tre" itemprop="sfe">Text here</h1>
                        </td>
                      </tr>
                      <tr>
                        <td class="rest">Something, by Author</td>
                      </tr>
                    </table>
                  <h2 class="blast2"><img src="sfa.jpg" alt="hip" />
                 <map name="goon" id="m2_34">
              <p class="my_2">I love myself</p>
              <area shape="rect" coords="45,74,582" href="#plata" alt="" />
              </map>
                  </h2>
                    <p class="my_2">Why this text text?</p>
              <p class="my_3">test text text</p>
                    <p class="my_2">test text text</p>
              <p class="my_3">test text text</p>
                </div>
                <p align="justify" class="justify_em">Yes</p>
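Outside Notepad++, a comparable result can be sketched in Python with a two-step approach: isolate each <!-- MAIN START --> ••••• <!-- MAIN FINAL --> section, then keep only its `<p class="...">...</p>` elements. This is not a port of the S/R above (Python's standard `re` module has no `\G`, `\R` or `\h`), and it differs in small ways: it writes `\n` line-breaks, it leaves any indentation in front of the opening marker untouched, and a paragraph must fit on one line. All names and the sample are made up.

```python
import re

# One whole section, markers included (DOTALL: the body may span many lines).
REGION = re.compile(r'(<!-- MAIN START -->)(.*?)(<!-- MAIN FINAL -->)', re.DOTALL)
# A paragraph carrying a class attribute right after '<p '.
PARAGRAPH = re.compile(r'<p class=".+?</p>')

def keep_paragraphs(html):
    """Inside each marked section, keep only the <p class=...>...</p> elements."""
    def rebuild(match):
        paragraphs = PARAGRAPH.findall(match.group(2))
        # Opening marker, blank line, one paragraph per line, blank line, closing marker.
        return '\n\n'.join([match.group(1), '\n'.join(paragraphs), match.group(3)])
    return REGION.sub(rebuild, html)

# Hypothetical miniature of the sample, with two paragraphs sharing a line.
sample = (
    'x\n'
    '<!-- MAIN START -->\n'
    '<div>\n'
    ' <p class="a">1</p>  <p class="b">2</p>\n'
    '</div>\n'
    '<p align="justify" class="j">Yes</p>\n'
    ' <!-- MAIN FINAL -->\n'
    'y'
)
print(keep_paragraphs(sample))
```

As in the expected text, the `<p align="justify" ...>` element is dropped, since it does not begin with `<p class=`.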
              

              Using the free-spacing mode, the search regex can be re-expressed as :

              (?xs-i)                          #  FREE-SPACING mode, regex DOT match ANY character and search is SENSITIVE to CASE
              (?:                              #  START of the 1st NON-CAPTURING group
                ^\h*                           #     Any LEADING BLANK characters, followed with ...
                (<!--[ ]MAIN[ ]START[ ]-->)    #     The string '<!-- MAIN START -->', STORED as group 1
                (?:\h*\R)+                     #     And followed with BLANK or EMPTY lines, in a NON-CAPTURING group
              |                                #   OR
              (?!\A)\G                         #     The EMPTY location RIGHT AFTER a previous MATCH
              )                                #  END of the 1st NON-CAPTURING group
              (?!->)                           #  If the TWO NEXT chars are DIFFERENT from the string '->'
              (?:                              #  START of the 2nd NON-CAPTURING group
                (?!<!--[ ]MAIN[ ]FINAL[ ]-->). #    If CURRENT character is NOT the BEGINNING of the string '<!-- MAIN FINAL -->'
              )                                #  END of the 2nd NON-CAPTURING group
              *?                               #  The SHORTEST, possibly EMPTY, range of ANY character, till... : See •, below
              (?:                              #  START of the 3rd NON-CAPTURING group
                ^\h*                           #  • Any LEADING BLANK characters, followed with ...
                (                              #    START of group 2
                  <p[ ]class=".+?</p>          #      The SHORTEST, NON EMPTY, range of characters between the strings '<p class="' and '</p>'
                  (\R)?                        #      And followed with an OPTIONAL line-break, STORED as group 3
                )                              #    END of group 2
              |                                #  OR
                ^(?:\h*\R)*\h*                 #  •  An OPTIONAL range of BLANK or EMPTY lines, followed with OPTIONAL HORIZONTAL BLANK chars
                (<!--[ ]MAIN[ ]FINAL[ ]-)      #     And followed with the string '<!-- MAIN FINAL -', STORED as group 4
              )                                #  END of the 3rd NON-CAPTURING group 
              

              Notes :

              • In this mode, any literal space char must be escaped with the \ character or written [ ] !

              • Following the same method, as previously described, we just search for the ending string <!-- MAIN FINAL - and the last two chars -> are inserted in the negative look-ahead (?!->)
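The first note can be checked quickly in Python, whose `re.VERBOSE` flag is that module's counterpart of the free-spacing mode. This only demonstrates Python's behaviour; that the `(?x)` mode in Notepad++ treats spaces identically is an assumption based on the note above.

```python
import re

target = '<!-- MAIN FINAL -->'

# Spaces written as [ ] or escaped with \ stay literal in free-spacing mode.
bracketed = re.compile(r'<!--[ ]MAIN[ ]FINAL[ ]-->', re.VERBOSE)
escaped = re.compile(r'<!--\ MAIN\ FINAL\ -->', re.VERBOSE)
# Plain spaces are dropped, so this pattern really means '<!--MAINFINAL-->'.
plain = re.compile(r'<!-- MAIN FINAL -->', re.VERBOSE)

print(bool(bracketed.search(target)))           # → True
print(bool(escaped.search(target)))             # → True
print(bool(plain.search(target)))               # → False
print(bool(plain.search('<!--MAINFINAL-->')))   # → True
```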

              Best Regards,

              guy038
