Search and replace an html tag block (table)



  • Hi,

    I have 200 html files in which there is an html table which I want to delete.
    That table starts with :
    <table cellpadding="1" cellspacing="0" border="0">
    and ends with :
    </table>

    That table is present twice in each file.
    There are other tables in the files (not starting with the same attributes).

    I have searched and tried a lot of things but can’t get a result.

    What works with Search (Regexp):
    <table(.*)</table>
    easy, works, but finds ALL tables in all files.

    <table cellpadding=\"1\"(.*?)cellspacing=\"0\"(.*?)border=\"0\">(.*?)
    better for my case, works, but it is not enough because it does not include the end tag, so it does not cover the whole table I want to replace.

    What does not work:
    I can’t get any results with something like
    <table cellpadding=\"1\"(.*?)cellspacing=\"0\"(.*?)border=\"0\">(.*)</table>
    or
    <table cellpadding=\"1\"(.*?)cellspacing=\"0\"(.*?)border=\"0\">(.*?)</table>
    I don’t understand why it does not work and what is wrong.



  • @Greg-Jeu

    So it seems the tables you want to delete have the three attributes with corresponding values you’ve stated, but they can also have more attribute/value pairs as well? Is this correct?

    I would suggest taking a cue from this thread and trying the following:

    Find what zone: (?-is)<table (?=.*cellpadding="1")(?=.*cellspacing="0")(?=.*border="0").+>(?s).*?</table>
    Replace with zone: Make sure it is empty
    Search mode: Regular expression
    Wrap around: ticked
    Action: Press Replace All button

    If this works fine for a single file, then extend it to a Replace in Files action…

    Of course, maybe this is more complicated; maybe I missed something…

    An assumption I made is that the <table .....> info all appears on one line as in your example…tough to tell if this will exactly work; more data might have helped…

    One comment to make is that in your trials you are escaping double quotes with the backslash (i.e. \"), but this is not necessary; double quotes are not special characters to the regular expression engine.
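    For anyone who would rather script the job than run Replace in Files, here is a minimal Python sketch of the same idea. It is an illustration, not part of the Notepad++ solution: the names `TABLE_RE`, `strip_tables` and `strip_tables_in_dir` are made up for this example, the UTF-8 encoding is an assumption about the files, and the inline `(?s)` switch is written as a scoped `(?s:...)` group because Python's `re` module does not accept a flag change in the middle of a pattern.

```python
import re
from pathlib import Path

# Python port of the "Find what" expression above. Python's defaults already
# correspond to (?-is): case-sensitive, and "." does not match a newline.
# The scoped group (?s:.*?) lets "." match newlines for the table body only.
TABLE_RE = re.compile(
    r'<table (?=.*cellpadding="1")(?=.*cellspacing="0")(?=.*border="0").+>'
    r'(?s:.*?)</table>'
)


def strip_tables(html: str) -> str:
    """Return html with every matching <table ...>...</table> block removed."""
    return TABLE_RE.sub("", html)


def strip_tables_in_dir(folder: str, encoding: str = "utf-8") -> int:
    """Rewrite each *.html file in folder in place; return how many changed.

    The encoding default is an assumption; adjust it to the real files.
    """
    changed = 0
    for path in Path(folder).glob("*.html"):
        original = path.read_text(encoding=encoding)
        cleaned = strip_tables(original)
        if cleaned != original:
            path.write_text(cleaned, encoding=encoding)
            changed += 1
    return changed


# Small in-memory check: one table to keep, two tables to delete.
sample = """<table class="keep">
<tr><td>stays</td></tr>
</table>
<table cellpadding="1" cellspacing="0" border="0">
<tr><td>goes</td></tr>
</table>
<p>text</p>
<table cellpadding="1" cellspacing="0" border="0">
<tr><td>goes too</td></tr>
</table>
"""
result = strip_tables(sample)
print(result)
```

    As with the Notepad++ version, this assumes the whole `<table .....>` opening tag sits on one line. Trying it on a copy of a single file first is the safer route.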



  • Hello, @greg-jeu, @scott-sumner and All,

    I slightly changed Scott’s search regex into this syntax :

    SEARCH (?-is)^\h*<table\x20(?=.*cellpadding="1")(?=.*cellspacing="0")(?=.*border="0").+>(?s).*?</table>

    In order to catch any <table>.......</table> block, even if this block is indented ;-))

    For instance, it will match the whole <table.....</table> block, below :

        <form.......>
            <td>
            <table cellpadding="1" cellspacing="0" border="0">
                bla bla blah
                blah blah
                bla bla bla
            </table>
            <br>
        </form>
    

    This regex is quite interesting because it shows how the look-around structures work ! In our case, we have 3 consecutive positive look-aheads ( the (?=......) syntax ) ! Let’s explain…


    • First, in that regex, the (?-is) syntax means that :

      • The search is sensitive to the case of letters ( -i meaning NON insensitive )

      • The search engine considers that any special dot character . represents a single standard character ( not an EOL one )

    • Then, ^\h* matches any optional consecutive range of horizontal blank character(s), as, for instance, a space or a tabulation, from beginning of line ( ^ )

    • Now, the part <table\x20 matches, obviously, the string <table followed with a space character

    • Then we have 3 positive look-ahead structures ( (?=......) ), which are conditions to verify :

      • An overall match needs that these 3 conditions are, simultaneously, TRUE

      • Each of them means : from current position till the end of the current line scanned :

        • Is the string cellpadding="1" present ?

        • Is the string cellspacing="0" present ?

        • Is the string border="0" present ?

      • The VERY IMPORTANT thing to understand is that the working location of the regex engine has NOT changed while evaluating, one after another, these three conditions. It is still right after the space character located after the <table string !

      • In other words :

        • The order of appearance of each block could have been different ! For instance, (?=.*border="0")(?=.*cellspacing="0")(?=.*cellpadding="1")

        • And the two blocks, below, would also have been matched !

            <table cellspacing="0" cellpadding="1" border="0">
                bla bla blah
                blah blah
                bla bla bla
            </table>
    
            <table border="0" cellpadding="1" cellspacing="0">
                bla bla blah
                blah blah
                bla bla bla
            </table>		
    
    • From above, we can deduce that the .+> part matches the longest range of standard characters, after the string <table with a space char, till the last > symbol located in the current line

    • Then, the (?s) modifier means that, from now on, the dot . character will match, absolutely, any character, even EOL chars !

    • So, the part .*? represents the smallest multi-lines range of characters, possibly empty, of any character, after the > symbol…

    • …Till the string </table>, due to the final part of the regex </table>
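    The point about the look-aheads not moving the working location can be checked with a short Python sketch. This is only an illustration using Python's `re` module rather than the Notepad++ engine; `HEAD_RE`, `lines` and `ends` are names invented for the example.

```python
import re

# Only the "<table " prefix and the three look-aheads from the regex above.
# Each look-ahead is a zero-width test, so all three scan from the same spot.
HEAD_RE = re.compile(
    r'<table\x20(?=.*cellpadding="1")(?=.*cellspacing="0")(?=.*border="0")'
)

lines = [
    '<table cellpadding="1" cellspacing="0" border="0">',
    '<table cellspacing="0" cellpadding="1" border="0">',
    '<table border="0" cellpadding="1" cellspacing="0">',
    '<table cellpadding="1" border="0">',  # cellspacing is missing
]

ends = []
for line in lines:
    m = HEAD_RE.search(line)
    # When a match exists, it ends right after "<table " (offset 7):
    # the look-aheads tested the rest of the line but consumed nothing.
    ends.append(m.end() if m else None)
    print(line, "->", ends[-1])
```

    The first three lines match whatever the attribute order is, and the match always ends at offset 7, right after the space. The fourth line fails because one of the three conditions is FALSE.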

    Best Regards,

    guy038

    P.S. :

    To force the regex engine to catch the 3 attributes cellpadding, cellspacing and border, in THAT order, exclusively, two solutions :

    • (?-is)^\h*<table\x20(?=.*cellpadding="1" cellspacing="0" border="0").+>(?s).*?</table> ( a single look-ahead )

    • (?-is)^\h*<table\x20(?=.*cellpadding="1"(?=.*cellspacing="0"(?=.*border="0"))).+>(?s).*?</table> ( 3 nested look-aheads )
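    The difference between these two ordered variants can be sketched in Python. This is an approximation of the Notepad++ regexes, under a few labeled assumptions: Python's `re` has no `\h`, so `[ \t]*` stands in for it; `re.MULTILINE` is used so that `^` matches at every line start; and `(?s)` is written as the scoped group `(?s:...)`. The names `SINGLE`, `NESTED` and `samples` are invented for the example.

```python
import re

# Variant 1: a single look-ahead. The three attributes must appear in this
# order AND be adjacent, separated by exactly one space each.
SINGLE = re.compile(
    r'^[ \t]*<table\x20(?=.*cellpadding="1" cellspacing="0" border="0").+>'
    r'(?s:.*?)</table>',
    re.MULTILINE,
)

# Variant 2: three nested look-aheads. The order is enforced, but other
# text may sit between the attributes.
NESTED = re.compile(
    r'^[ \t]*<table\x20'
    r'(?=.*cellpadding="1"(?=.*cellspacing="0"(?=.*border="0"))).+>'
    r'(?s:.*?)</table>',
    re.MULTILINE,
)

samples = {
    "in_order": '  <table cellpadding="1" cellspacing="0" border="0">\n  x\n  </table>',
    "with_extra": '<table cellpadding="1" id="t" cellspacing="0" border="0">\nx\n</table>',
    "reordered": '<table border="0" cellpadding="1" cellspacing="0">\nx\n</table>',
}

# For each sample: (matched by SINGLE?, matched by NESTED?)
results = {
    name: (bool(SINGLE.search(text)), bool(NESTED.search(text)))
    for name, text in samples.items()
}
print(results)
```

    Both variants reject the reordered tag. Only the nested form accepts a tag where an extra attribute sits between the three expected ones.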



  • wow! thank you very much!
    It works perfectly with scott-sumner’s solution.
    guy038’s only finds the first (of two) tables I’m looking for in the file.

    But your explanations are great and help me understand regexp better!

    Thank you very much.

