Regex: Delete everything that falls between the html comments except some tags

Hellena Crainicu

Hello, I have this html lines that starts with  and ending with 

<!-- MAIN START -->

<div align="center">
      <table width="33" border="0">
        <tr>
          <td>
      <h1 class="tre" itemprop="sfe">Text here</h1>
          </td>
        </tr>
        <tr>
          <td class="rest">Something, by Author</td>
        </tr>
      </table>
    <h2 class="blast2"><img src="sfa.jpg" alt="hip" />
   <map name="goon" id="m2_34">
<p class="my_2">I love myself</p>
<area shape="rect" coords="45,74,582" href="#plata" alt="" />
</map>
    </h2>
      <p class="my_2">Why this text text?</p>
<p class="my_3">test text text</p>
      <p class="my_2">test text text</p>
<p class="my_3">test text text</p>
  </div>
  <p align="justify" class="justify_em">Yes</p>
 
     <!-- MAIN FINAL -->

I want to remove everything from these comments, and keep only the html tags such as <p class=...</p>

The Output should be"

<!-- MAIN START -->

 <p class="my_2">I love myself</p>
 <p class="my_2">Why this text text?</p>
 <p class="my_3">test text text</p>
 <p class="my_2">test text text</p>
 <p class="my_3">test text text</p>
   
<!-- MAIN FINAL -->

My regex is not working:

Find: (<\!-- MAIN START -->).*(?!<p class=.*</p>).*(<\!-- MAIN FINAL -->)

Replace by: \1\2\3

Vasile Caraus

I don’t know if it’s possible with Regex and Notepad++, but it’s definitely possible with PoweShell:

$sourcedir = "C:\Folder1\"
 $resultsdir = "C:\Folder2\"
 Get-ChildItem -Path $sourcedir -Filter *.html | ForEach-Object{
     $output=@()
     $content = Get-Content -Path $_.FullName
     $start = $content | Where-Object {$_ -match '<!-- MAIN START -->'} 
     $final = $content | Where-Object {$_ -match '<!-- MAIN FINAL -->'} 
     for($i=0;$i -lt $content.Count;$i++){
         if(($i -gt $content.IndexOf($start)) -and ($i -lt $content.IndexOf($final))){
             if($content[$i] -notmatch '<p class='){
                 continue
             }
         }
         $output += $content[$i]
     }
     $output | Out-File -FilePath $resultsdir\$($_.name)
 }

PeterJones

@Hellena-Crainicu ,

Our regex guru @guy038 has provided a “generic regex” in this post which provides a generic way of finding specific text (FR) and replacing it with specific text (RR), but only when found between a start-of-region and end-of-region pair (BSR and ESR).

You would have to figure out the FR, RR, BSR, and ESR, and then plug those values into the generic expression…

The find expression will be the hardest, because you want to find any text that isn’t an HTML paragraph. I don’t immediately have an idea of how to do that for sure… The others are pretty straightforward:

FR = … unsure yet … it’s going to involve lookaheads
RR = since you want to replace the found text with nothing, you can just leave this blank
BSR = 
ESR =

My initial guess was that the FR was going to be something like ((?!<p.*?</p>).)+… unfortunately, that exact sequence won’t work because when it advances one step beyond the < of the <p..., it is matching again:

My next guess is that the generic would have to be modified somewhat, probably adding an or-condition next to the ESR… but I’m not sure how. Maybe @guy038 will be able to find time to experiment with this one.

Alternately, it would be pretty easy in some tool other than pure Notepad++. @Vasile-Caraus has provided an off-topic powershell answer. If you wanted to stick closer to Notepad++, you could use one of the scripting plugins like PythonScript to access the file currently open in Notepad++ with your script, but implement the start and end checks similar to what @Vasile-Caraus did, but in Python and using the PythonScript interface to access the contents of the document open in Notepad++.

Hellena Crainicu

thank you @Vasile-Caraus Works very well with powershell, but it would have been faster and easier with regex

guy038

Hello, @hellena-crainicu and All,

Before giving, in the second part of this post, the specific solution for @hellena-crainicu, here is a simple example to show you the difficulties I had to face !

Let’s start with this text :


---000000---DEF---
START---12345---6789---1111199999---DEF---STOP
---GHI---000
00000---
START---123---PQR---456---789STOP
---00000000---00---GHI---0000---
START987---AAA---654
---ZZZ---321---STOP
---0000---000000000

And let’s suppose that we want to rewrite all numbers, between the boundaries START and STOP, each on a new line

If, in addition, we want to add a line-break, after the START opening section, we need, from the generic regex, discussed before by @peterjones, to slightly modify this regex, as we search for two independent strings, simultaneously. It leads to the regex S/R :

(A) SEARCH   (?sx-i)(?: (START) | (?!\A)\G )  (?: (?!STOP). )*?  (\d+ (\R)? )

(A) REPLACE  (?1\1\r\n\r\n)\2(?3:\r\n)

and would change the initial text as :

---000000---DEF---
START

12345
6789
1111199999
---DEF---STOP
---GHI---000
00000---
START

123
456
789
STOP
---00000000---00---GHI---0000---
START

987
654
321
---STOP
---0000---000000000

As you can see :

The START boundary is clearly defined
The different numbers, located between START and STOP are correctly rewritten one per line and extra stuff is deleted
However, between the last number and the closing boundary STOP, some extra characters are still not deleted :-(

No problem, we may modify this S/R to include the search of STOP, too, within a non-capturing group, giving :

(B) SEARCH   (?sx-i)(?: (START) | (?!\A)\G )  (?: (?!STOP). )*?  (?: ( \d+ (\R)? ) | (STOP) )

(B) REPLACE  (?1\1\r\n\r\n)(?2\2(?3:\r\n))(?4\r\n\4)

And we will take the opportunity to add a line-break, right before the closing section STOP

Thus, we obtain :

---000000---DEF---
START

12345
6789
1111199999

STOP000
00000
123
456
789

STOP00000000
00
0000
987
654
321

STOP0000
000000000

Unfortunately, it seems that the 0 digits are also processed like the other numbers, although they are not part of a START •••••STOP region :-((

Indeed, after matching some stuff ending with STOP, the search process restarts immediately and considers the following characters as we have specified the (?s) modifier ! So, how to tell the regex engine, to directly jump to the next START boundary ?

I had the idea to only search for the beginning of the STOP string, for instance the string ST and add a negative look-behind (?!OP), executed once only, after the START string or location of the previous match

So :

First, extra chars before STOP as well as ST are changed as the string \r\nST
Now, the regex engine is located right before the OP string of the word STOP. However, due to the look-ahead (?!OP), it must advance of one position in order that the condition (?!OP) is true. As this new match do not start where the previous match ends, the \G assertion forces the failure of the match attempt !
Thus, the string OP and further stuff should not be modified and the new match would necessarily catch an other string START, so the beginning of an other allowed region !

(C) SEARCH   (?sx-i)(?: (START) | (?!\A)\G )  (?!OP) (?: (?!STOP). )*?  (?: ( \d+ (\R)? ) | (ST) )

(C) REPLACE  (?1\1\r\n\r\n)(?2\2(?3:\r\n))(?4\r\n\4)

After replacement, we get :

---000000---DEF---
START

12345
6789
1111199999

STOP
---GHI---000
00000---
START

123
456
789

STOP
---00000000---00---GHI---0000---
START

987
654
321

STOP
---0000---000000000

This time, it easy to see that the parts of text :

Before the first START boundary
After a STOP boundary and before a START boundary
After the last STOP boundary

Are not modified at all by the replacement, as expected !

Now, @hellena-crainicu, as promised, here is the regex S/R to achieve what you want :

SEARCH (?s)(?:^\h*()(?:\h*\R)+|(?!\A)\G)(?!->)(?:(?!).)*?(?:^\h*(<p class=".+?</p>(\R)?)|^(?:\h*\R)*\h*(<!-- MAIN FINAL -))

REPLACE (?1\1\r\n\r\n)\2(?3:\r\n)\4

You may test it against this sample text, below, containing two sections  ••••• , embedded into three other sections !

<div align="center">
      <table width="33" border="0">
        <tr>
          <td>
      <h1 class="tre" itemprop="sfe">Text here</h1>
          </td>
        </tr>
        <tr>
          <td class="rest">Something, by Author</td>
        </tr>
      </table>
    <h2 class="blast2"><img src="sfa.jpg" alt="hip" />
   <map name="goon" id="m2_34">
<p class="my_2">I love myself</p>
<area shape="rect" coords="45,74,582" href="#plata" alt="" />
</map>
    </h2>
      <p class="my_2">Why this text text?</p>
<p class="my_3">test text text</p>
      <p class="my_2">test text text</p>
<p class="my_3">test text text</p>
  </div>
  <p align="justify" class="justify_em">Yes</p>

			         <!-- MAIN START -->
   
		


<div align="center">
      <table width="33" border="0">
        <tr>
          <td>
      <h1 class="tre" itemprop="sfe">Text here</h1>
          </td>
        </tr>
        <tr>
          <td class="rest">Something, by Author</td>
        </tr>
      </table>
    <h2 class="blast2"><img src="sfa.jpg" alt="hip" />
   <map name="goon" id="m2_34">
<p class="my_2">I love myself</p>
<area shape="rect" coords="45,74,582" href="#plata" alt="" />
</map>
    </h2>
      <p class="my_2">Why this text text?</p>
<p class="my_3">test text text</p>      <p class="my_2">test text text</p>
<p class="my_3">test text text</p>
  </div>
  <p align="justify" class="justify_em">Yes</p>
 
     <!-- MAIN FINAL -->

<div align="center">
      <table width="33" border="0">
        <tr>
          <td>
      <h1 class="tre" itemprop="sfe">Text here</h1>
          </td>
        </tr>
        <tr>
          <td class="rest">Something, by Author</td>
        </tr>
      </table>
    <h2 class="blast2"><img src="sfa.jpg" alt="hip" />
   <map name="goon" id="m2_34">
<p class="my_2">I love myself</p>
<area shape="rect" coords="45,74,582" href="#plata" alt="" />
</map>
    </h2>
      <p class="my_2">Why this text text?</p>
<p class="my_3">test text text</p>
      <p class="my_2">test text text</p>
<p class="my_3">test text text</p>
  </div>
  <p align="justify" class="justify_em">Yes</p>
 
<!-- MAIN START -->

<div align="center">
      <table width="33" border="0">
        <tr>
          <td>
      <h1 class="tre" itemprop="sfe">Text here</h1>
          </td>
        </tr>
        <tr>
          <td class="rest">Something, by Author</td>
        </tr>
      </table>
    <h2 class="blast2"><img src="sfa.jpg" alt="hip" />
   <map name="goon" id="m2_34">
<p class="my_2">I love myself</p>
<area shape="rect" coords="45,74,582" href="#plata" alt="" />
</map>
    </h2>
      <p class="my_2">Why this text text?</p>
<p class="my_3">test text text</p>
      <p class="my_2">test text text</p>
<p class="my_3">test text text</p>
  </div>
  <p align="justify" class="justify_em">Yes</p>
 
     <!-- MAIN FINAL -->

<div align="center">
      <table width="33" border="0">
        <tr>
          <td>
      <h1 class="tre" itemprop="sfe">Text here</h1>
          </td>
        </tr>
        <tr>
          <td class="rest">Something, by Author</td>
        </tr>
      </table>
    <h2 class="blast2"><img src="sfa.jpg" alt="hip" />
   <map name="goon" id="m2_34">
<p class="my_2">I love myself</p>
<area shape="rect" coords="45,74,582" href="#plata" alt="" />
</map>
    </h2>
      <p class="my_2">Why this text text?</p>
<p class="my_3">test text text</p>
      <p class="my_2">test text text</p>
<p class="my_3">test text text</p>
  </div>
  <p align="justify" class="justify_em">Yes</p>

You should get the expected text :

<div align="center">
      <table width="33" border="0">
        <tr>
          <td>
      <h1 class="tre" itemprop="sfe">Text here</h1>
          </td>
        </tr>
        <tr>
          <td class="rest">Something, by Author</td>
        </tr>
      </table>
    <h2 class="blast2"><img src="sfa.jpg" alt="hip" />
   <map name="goon" id="m2_34">
<p class="my_2">I love myself</p>
<area shape="rect" coords="45,74,582" href="#plata" alt="" />
</map>
    </h2>
      <p class="my_2">Why this text text?</p>
<p class="my_3">test text text</p>
      <p class="my_2">test text text</p>
<p class="my_3">test text text</p>
  </div>
  <p align="justify" class="justify_em">Yes</p>

<!-- MAIN START -->

<p class="my_2">I love myself</p>
<p class="my_2">Why this text text?</p>
<p class="my_3">test text text</p>
<p class="my_2">test text text</p>
<p class="my_3">test text text</p>

<!-- MAIN FINAL -->

<div align="center">
      <table width="33" border="0">
        <tr>
          <td>
      <h1 class="tre" itemprop="sfe">Text here</h1>
          </td>
        </tr>
        <tr>
          <td class="rest">Something, by Author</td>
        </tr>
      </table>
    <h2 class="blast2"><img src="sfa.jpg" alt="hip" />
   <map name="goon" id="m2_34">
<p class="my_2">I love myself</p>
<area shape="rect" coords="45,74,582" href="#plata" alt="" />
</map>
    </h2>
      <p class="my_2">Why this text text?</p>
<p class="my_3">test text text</p>
      <p class="my_2">test text text</p>
<p class="my_3">test text text</p>
  </div>
  <p align="justify" class="justify_em">Yes</p>
 
<!-- MAIN START -->

<p class="my_2">I love myself</p>
<p class="my_2">Why this text text?</p>
<p class="my_3">test text text</p>
<p class="my_2">test text text</p>
<p class="my_3">test text text</p>

<!-- MAIN FINAL -->

<div align="center">
      <table width="33" border="0">
        <tr>
          <td>
      <h1 class="tre" itemprop="sfe">Text here</h1>
          </td>
        </tr>
        <tr>
          <td class="rest">Something, by Author</td>
        </tr>
      </table>
    <h2 class="blast2"><img src="sfa.jpg" alt="hip" />
   <map name="goon" id="m2_34">
<p class="my_2">I love myself</p>
<area shape="rect" coords="45,74,582" href="#plata" alt="" />
</map>
    </h2>
      <p class="my_2">Why this text text?</p>
<p class="my_3">test text text</p>
      <p class="my_2">test text text</p>
<p class="my_3">test text text</p>
  </div>
  <p align="justify" class="justify_em">Yes</p>

Using the free-spacing mode, the search regex can be re-expressed as :

(?xs-i)                          #  FREE-SPACING mode, regex DOT match ANY character and search is SENSITIVE to CASE
(?:                              #  START of the 1st NON-CAPTURING group
  ^\h*                           #     Any LEADING BLANK characters, followed with ...
  (<!--[ ]MAIN[ ]START[ ]-->)    #     The string '<!-- MAIN START -->', STORED as group 1
  (?:\h*\R)+                     #     And followed with BLANK or EMPTY lines, in a NON-CAPTURING group
|                                #   OR
(?!\A)\G                         #     The EMPTY location RIGHT AFTER a previous MATCH
)                                #  END of the 1st NON-CAPTURING group
(?!->)                           #  If the TWO NEXT chars are DIFFERENT from the string '->'
(?:                              #  START of the 2nd NON-CAPTURING group
  (?!<!--[ ]MAIN[ ]FINAL[ ]-->). #    If CURRENT character is NOT the BEGINNING of the string '<!-- MAIN FINAL -->'
)                                #  END of the 2nd NON-CAPTURING group
*?                               #  The SHORTEST, possibly EMPTY, range of ANY character, till... : See •, below
(?:                              #  START of the 3rd NON-CAPTURING group
  ^\h*                           #  • Any LEADING BLANK characters, followed with ...
  (                              #    START of group 2
    <p[ ]class=".+?</p>          #      The SHORTEST, NON EMPTY, range of characters between the strings '<p class="' and '</p>'
    (\R)?                        #      And followed with an OPTIONAL line-break, STORED as group 3
  )                              #    END of group 2
|                                #  OR
  ^(?:\h*\R)*\h*                 #  •  An OPTIONAL range of BLANK or EMPTY lines, followed with OPTIONAL HORIZONTAL BLANK chars
  (<!--[ ]MAIN[ ]FINAL[ ]-)      #     And followed with the string '<!-- MAIN FINAL -', STORED as group 4
)                                #  END of the 3rd NON-CAPTURING group

Notes :

In this mode, any literal space char must be escaped with the \ character or written [ ] !
Following the same method, as previously described, we just search for the ending string <!-- MAIN FINAL - and the last two chars -> are inserted in the negative look-ahead (?!->)

Best Regards,

guy038