Regular expression to delete specified text between HTML tags



  • Hi,
    Can you please help with a regular expression to delete text from an HTML document? In the example below, I would like to delete block 1 if block 1 contains the word “Done”.
    Block 2 should remain unchanged.

    I tried to write a regex that says something like this:

    match line with <TR>
    next line match <TD> and </TD>
    next line match “Done”
    next line match <TD> and </TD>
    next line match <TD> and </TD>
    next line match <TD> and </TD>
    next line match </TR>

    I want to delete all blocks that match this expression.

    I’ve been racking my brains for weeks.

    Thanks.

    block 1

    <TR>
    <TD>1</TD>
    <TD><FONT COLOR="#808080">Done</TD>
    <TD>SystemUtil</TD>
    <TD>17:27:07</TD>
    <TD>SystemUtil</TD>
    </TR>

    block 2

    <TR>
    <TD>1</TD>
    <TD><FONT COLOR="#009900">Passed</TD>
    <TD>Run “taskkill”,1</TD>
    <TD>17:27:07</TD>
    <TD>Run “taskkill”,1</TD>
    </TR>



  • @Lonnie-Hailey

    racking my brains for weeks

    Really? Maybe there is some nuance missing that you did not describe that would cause this “racking”?

    This “simplistic” approach would seem to do it:

    Find-what: (?-si)<TR>\R<TD>.+</TD>\R<TD>.*Done.*</TD>\R(?:<TD>.+</TD>\R){3}</TR>\R
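
    For anyone who wants to try the same deletion outside Notepad++, here is a minimal Python sketch. It is an adaptation rather than the expression above verbatim: Python's re module has no \R, so \r?\n stands in for it, and the leading (?-si) is left out because re does not accept it and its defaults (dot stops at line breaks, matching is case sensitive) already behave that way. The names DONE_ROW, SAMPLE and delete_done_rows are made up for illustration.

```python
import re

# Adapted pattern: \r?\n replaces \R, and (?-si) is omitted because
# those settings are already Python's defaults.
DONE_ROW = re.compile(
    r"<TR>\r?\n"
    r"<TD>.+</TD>\r?\n"
    r"<TD>.*Done.*</TD>\r?\n"      # second cell must contain "Done"
    r"(?:<TD>.+</TD>\r?\n){3}"     # three more single-line cells
    r"</TR>\r?\n"
)

# The two blocks from the question, without the forum indentation.
SAMPLE = (
    "<TR>\n"
    "<TD>1</TD>\n"
    '<TD><FONT COLOR="#808080">Done</TD>\n'
    "<TD>SystemUtil</TD>\n"
    "<TD>17:27:07</TD>\n"
    "<TD>SystemUtil</TD>\n"
    "</TR>\n"
    "<TR>\n"
    "<TD>1</TD>\n"
    '<TD><FONT COLOR="#009900">Passed</TD>\n'
    "<TD>Run “taskkill”,1</TD>\n"
    "<TD>17:27:07</TD>\n"
    "<TD>Run “taskkill”,1</TD>\n"
    "</TR>\n"
)


def delete_done_rows(html: str) -> str:
    """Remove every row that matches DONE_ROW, leaving other rows alone."""
    return DONE_ROW.sub("", html)


cleaned = delete_done_rows(SAMPLE)
print(cleaned)
```

    In Notepad++ itself the equivalent is to leave "Replace with" empty and use Replace All in Regular expression mode.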



  • Glory Be! That did the trick, but after executing it I found 2 more blocks that were left over. How do I get rid of these?

    Block 1
    ----------------------------------------------------------------------------------

    <TR>
    <TD>1</TD>
    <TD><FONT COLOR="#808080">Done</TD>
    <TD>WinObject:[ I-2400 : INPUT PORT 5 - NSR LIST

    NSR     C  B  A  U  K  L  M  V  W  X  Y        Z  DLINK C2/SGL   TYPE   
    
    0001:  00 01 01 00 04 00 00 00 00 00 00 00000000  GFP   GFP      C4    
    
    0002:  00 01 02 00 04 00 00 00 00 0 ]</TD>
    

    <TD>17:46:58</TD>
    <TD>Type “&lt__MicCtrlDwn&gt&lt__MicAltDwn&gtp&lt__MicAltUp&gt&lt__MicCtrlUp&gt”</TD>
    </TR>

    Block 2
    ----------------------------------------------------------------------------------

    <TR>
    <TD>1</TD>
    <TD><FONT COLOR="#808080">Done</TD>
    <TD>WinObject:[ I-2400 : GIGABIT ETHERNET STATISTICS

    ±-----------------------+ Output Port 14 (1 GbE) ±-----------------------+

    | |

    | -Tx Counts- ]</TD>
    <TD>17:47:49</TD>
    <TD>Type “&lt__MicCtrlDwn&gt&lt__MicAltDwn&gtp&lt__MicAltUp&gt&lt__MicCtrlUp&gt”</TD>
    </TR>

    Can you explain the regular expression to me when you get a chance?



  • @Lonnie-Hailey

    So the 2 new cases you show weren’t matched because the <TD>…</TD> cell just after the Done line spans multiple lines. The -s in the leading (?-si) prevents the . (dot) metacharacter from matching line-break characters, so .+ can never reach a </TD> that sits on a later line. The regular expression can be altered so that it matches these additional cases as well, but it gets more complicated:

    (?-si)<TR>\R<TD>.+</TD>\R<TD>.*Done.*</TD>\R<TD>(?:(?s).+?)</TD>\R(?:<TD>.+</TD>\R){2}</TR>\R
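
    To see the difference between the two expressions, here is a rough Python sketch that runs both against a row whose third cell spans several lines. As before it is an adaptation: \r?\n replaces \R, the leading (?-si) is dropped, and the inline (?s) is written as the scoped group (?s:...), which is how Python's re limits a flag to part of a pattern. The cell text is shortened from the example in the question, and the names ORIGINAL, REVISED, MULTILINE_BLOCK and SINGLE_LINE_BLOCK are illustrative only.

```python
import re

# First expression: every cell must sit on a single line.
ORIGINAL = re.compile(
    r"<TR>\r?\n<TD>.+</TD>\r?\n<TD>.*Done.*</TD>\r?\n"
    r"(?:<TD>.+</TD>\r?\n){3}</TR>\r?\n"
)

# Revised expression: the cell after "Done" may span lines.
REVISED = re.compile(
    r"<TR>\r?\n<TD>.+</TD>\r?\n<TD>.*Done.*</TD>\r?\n"
    r"<TD>(?s:.+?)</TD>\r?\n"      # dot matches line breaks only in here
    r"(?:<TD>.+</TD>\r?\n){2}</TR>\r?\n"
)

MULTILINE_BLOCK = (
    "<TR>\n"
    "<TD>1</TD>\n"
    '<TD><FONT COLOR="#808080">Done</TD>\n'
    "<TD>WinObject:[ I-2400 : GIGABIT ETHERNET STATISTICS\n"
    "\n"
    "| -Tx Counts- ]</TD>\n"
    "<TD>17:47:49</TD>\n"
    "<TD>Type</TD>\n"
    "</TR>\n"
)

SINGLE_LINE_BLOCK = (
    "<TR>\n"
    "<TD>1</TD>\n"
    '<TD><FONT COLOR="#808080">Done</TD>\n'
    "<TD>SystemUtil</TD>\n"
    "<TD>17:27:07</TD>\n"
    "<TD>SystemUtil</TD>\n"
    "</TR>\n"
)

# The first expression finds nothing in the multi-line row ...
original_hit = ORIGINAL.search(MULTILINE_BLOCK)
# ... while the revised one removes both kinds of row.
after_multiline = REVISED.sub("", MULTILINE_BLOCK)
after_single = REVISED.sub("", SINGLE_LINE_BLOCK)

print(original_hit is None, repr(after_multiline), repr(after_single))
```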

    Rather than explain it myself, I’ll let RegexBuddy explain it (I never got an answer from RB’s author about posting some of its output here, so I’m going to go with it being OK as it is a good advertisement for the program; here’s some subliminal messages: IF YOU USE REGEX, BUY A REGEXBUDDY LICENSE…IF YOU USE REGEX, BUY A REGEXBUDDY LICENSE…IF YOU USE REGEX, BUY A REGEXBUDDY LICENSE… OK, enough of that)…here’s the explanation:

    Use these options for the whole regular expression «(?-si)»
       (hyphen inverts the meaning of the letters that follow) «-»
       Dot doesn’t match line breaks «s»
       Case sensitive «i»
    Match the character string “<TR>” literally (case sensitive) «<TR>»
    Match a line break (carriage return and line feed pair, sole line feed, sole carriage return, vertical tab, form feed) «\R»
    Match the character string “<TD>” literally (case sensitive) «<TD>»
    Match any single character that is NOT a line break character (line feed, carriage return, form feed) «.+»
       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
    Match the character string “</TD>” literally (case sensitive) «</TD>»
    Match a line break (carriage return and line feed pair, sole line feed, sole carriage return, vertical tab, form feed) «\R»
    Match the character string “<TD>” literally (case sensitive) «<TD>»
    Match any single character that is NOT a line break character (line feed, carriage return, form feed) «.*»
       Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
    Match the character string “Done” literally (case sensitive) «Done»
    Match any single character that is NOT a line break character (line feed, carriage return, form feed) «.*»
       Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
    Match the character string “</TD>” literally (case sensitive) «</TD>»
    Match a line break (carriage return and line feed pair, sole line feed, sole carriage return, vertical tab, form feed) «\R»
    Match the character string “<TD>” literally (case sensitive) «<TD>»
    Match the regular expression below «(?:(?s).+?)»
       Use these options for the remainder of the group «(?s)»
          Dot matches line breaks «s»
       Match any single character «.+?»
          Between one and unlimited times, as few times as possible, expanding as needed (lazy) «+?»
    Match the character string “</TD>” literally (case sensitive) «</TD>»
    Match a line break (carriage return and line feed pair, sole line feed, sole carriage return, vertical tab, form feed) «\R»
    Match the regular expression below «(?:<TD>.+</TD>\R){2}»
       Exactly 2 times «{2}»
       Match the character string “<TD>” literally (case sensitive) «<TD>»
       Match any single character that is NOT a line break character (line feed, carriage return, form feed) «.+»
          Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
       Match the character string “</TD>” literally (case sensitive) «</TD>»
       Match a line break (carriage return and line feed pair, sole line feed, sole carriage return, vertical tab, form feed) «\R»
    Match the character string “</TR>” literally (case sensitive) «</TR>»
    Match a line break (carriage return and line feed pair, sole line feed, sole carriage return, vertical tab, form feed) «\R»
    
    Created with RegexBuddy
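
    One detail in that breakdown is worth trying out: the multi-line cell uses the lazy .+? rather than the greedy .+. With dot allowed to cross line breaks, a greedy .+ runs as far as it can and may swallow a following row that should stay. The Python sketch below shows this under the same adaptation assumptions (\r?\n for \R, a scoped (?s:...) group for the inline (?s)); HEAD, TAIL, LAZY, GREEDY and ROWS are hypothetical names and the row contents are made-up test data.

```python
import re

# Shared start and end of the revised expression.
HEAD = r"<TR>\r?\n<TD>.+</TD>\r?\n<TD>.*Done.*</TD>\r?\n<TD>"
TAIL = r"</TD>\r?\n(?:<TD>.+</TD>\r?\n){2}</TR>\r?\n"

LAZY = re.compile(HEAD + r"(?s:.+?)" + TAIL)    # as few characters as possible
GREEDY = re.compile(HEAD + r"(?s:.+)" + TAIL)   # as many characters as possible

# A "Done" row with a multi-line cell, followed by a "Passed" row.
ROWS = (
    "<TR>\n<TD>1</TD>\n<TD>Done</TD>\n<TD>line one\n\nline two</TD>\n"
    "<TD>17:46:58</TD>\n<TD>Type</TD>\n</TR>\n"
    "<TR>\n<TD>2</TD>\n<TD>Passed</TD>\n<TD>Run</TD>\n"
    "<TD>17:47:00</TD>\n<TD>Run</TD>\n</TR>\n"
)

lazy_result = LAZY.sub("", ROWS)      # only the "Done" row is removed
greedy_result = GREEDY.sub("", ROWS)  # the match runs on into the next row

print(repr(lazy_result))
print(repr(greedy_result))
```

    The lazy version stops at the first point where the rest of the row fits, which is why it is the safer choice for a Replace All over a whole file.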
    


  • You have been extremely helpful and you have gone beyond the call of duty in helping me resolve my issues.
    Is there anything I can do for you in this forum to show my appreciation?

    Thanks again



  • @Lonnie-Hailey said:

    Is there anything I can do for you in this forum to show my appreciation?

    HAHA, well I really don’t care about this, but about the only thing you could do is “upvote”…see where it says ^ 0 v on the right? Click on the ^ !

    :-D

