Search and replace an html tag block (table)
Greg Jeu
I have 200 html files in which there is an html table which I want to delete.
That table starts with :
<table cellpadding="1" cellspacing="0" border="0">
and ends with :
</table>
That table is present twice in each file.
There are other tables in the files (not starting with the same attributes).
I did a lot of searching and made many attempts, but I can’t get a result.
What works with Search (Regexp):
easy, works, but finds ALL tables in all files.
better for my case, works, but it is not enough because it has no end tag, so it does not include the whole table I want to replace.
What does not work:
I can’t get results with something like
I don’t understand why it does not work and what is wrong.
Scott Sumner
So it seems the tables you want to delete have the three attributes with the corresponding values you’ve stated, but they can have more attribute/value pairs as well? Is this correct?
I would suggest taking a cue from this thread and trying the following:
Find what zone:
Replace with zone: Make sure it is empty
Search mode: Regular expression
Wrap around: ticked
Action: Press Replace All button
If this works fine for a single file, then extend it to a Replace in Files action…
Of course, maybe this is more complicated; maybe I missed something…
An assumption I made is that the <table .....> info all appears on one line, as in your example… It is tough to tell if this will work exactly; more data might have helped…
One comment to make is that in your trials you are escaping double quotes with a backslash (i.e. \"), but this is not necessary; double quotes are not special characters to the regular expression engine.
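The point about quotes can be checked outside Notepad++ with a small sketch. It uses Python's re module, which is an assumption on my part (Notepad++ uses the Boost regex engine), but both engines treat the double quote as an ordinary character. The sample line is the opening tag from the first post.

```python
import re

line = '<table cellpadding="1" cellspacing="0" border="0">'

# The same pattern, written with and without backslashes before the quotes.
escaped = re.search(r'cellpadding=\"1\"', line)
plain = re.search(r'cellpadding="1"', line)

# Both forms find the same text at the same place.
print(escaped.span(), plain.span())
```

So the backslashes are harmless, just unnecessary.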
guy038
I slightly changed Scott’s search regex into this syntax :
(?-is)^\h*<table\x20(?=.*cellpadding="1")(?=.*cellspacing="0")(?=.*border="0").+>(?s).*?</table>
In order to catch any <table>.......</table> block, even if this block is indented ;-))
For instance, it will match the whole <table.....</table> contents, below :
<form.......>
    <td>
        <table cellpadding="1" cellspacing="0" border="0">
            bla bla blah blah blah bla bla bla
        </table>
        <br>
</form>
This regex is quite interesting because it shows how the look-around structures work ! In our case, we have 3 consecutive positive look-aheads ( (?=......) syntax ) ! Let’s explain…
First, in that regex, the (?-is) syntax means that :
The search is sensitive to the case of letters ( -i meaning NON insensitive )
The search engine considers that the special dot character . represents a single standard character ( not an EOL one )
^\h* matches any optional consecutive range of horizontal blank character(s), such as a space or a tabulation, from the beginning of a line ( ^ )
Now, the part <table\x20 matches, obviously, the string <table followed by a space character
Then we have 3 positive look-ahead structures ( (?=......) ), which are conditions to verify :
An overall match requires that these 3 conditions be true simultaneously
Each of them means : from the current position till the end of the current line scanned :
Is the string cellpadding="1" present ?
Is the string cellspacing="0" present ?
Is the string border="0" present ?
The VERY IMPORTANT thing to understand is that the working location of the regex engine has NOT changed while evaluating these three conditions, one after another. It is still right after the space character that follows the <table string !
In other words :
The order of appearance of each block could have been different ! For instance,
And the two blocks below would also have been matched !
<table cellspacing="0" cellpadding="1" border="0">
    bla bla blah blah blah bla bla bla
</table>

<table border="0" cellpadding="1" cellspacing="0">
    bla bla blah blah blah bla bla bla
</table>
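Both claims (the engine does not move while testing the look-aheads, and the attribute order is free) can be checked with a short sketch. It is written for Python's re module rather than Notepad++, so two things are assumptions of mine: [ \t]* stands in for \h*, which Python does not support, and only the opening-tag part of the regex is used.

```python
import re

# Opening-tag part only; [ \t]* replaces \h* (not available in Python's re).
opening = re.compile(
    r'^[ \t]*<table\x20'
    r'(?=.*cellpadding="1")(?=.*cellspacing="0")(?=.*border="0")'
)

lines = [
    '<table cellpadding="1" cellspacing="0" border="0">',
    '<table cellspacing="0" cellpadding="1" border="0">',
    '<table border="0" cellpadding="1" cellspacing="0">',
    '<table cellpadding="1" cellspacing="0">',  # border attribute missing
]

# Offset where each match ends, or None when one of the look-aheads fails.
ends = [m.end() if (m := opening.match(line)) else None for line in lines]
print(ends)
```

For the first three lines the match ends at the same offset, just after the `<table ` string, whatever the order of the attributes. The look-aheads only test what follows and consume nothing. The last line gives None because one condition is not met.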
From the above, we can deduce that the .+> part matches the longest range of standard characters, after the string <table with a space char, till a > symbol located in the current line
The (?s) modifier means that, from now on, the dot . character will match absolutely any character, even EOL chars !
So, the part .*? represents the smallest multi-line range of any characters, possibly empty, after the > symbol
… till the string </table>, due to the final part of the regex
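To see the parts described above working together, here is a minimal sketch in Python. It is a translation, not the Notepad++ regex itself, and it rests on these assumptions: [ \t]* replaces \h*, re.MULTILINE makes ^ match at each line start, the scoped group (?s:...) replaces the mid-pattern (?s) switch (Python does not allow that switch in the middle of a pattern), and the sample HTML is made up for the test.

```python
import re

# Python translation of the regex discussed in this thread (see assumptions above).
pattern = re.compile(
    r'^[ \t]*<table\x20'
    r'(?=.*cellpadding="1")(?=.*cellspacing="0")(?=.*border="0")'
    r'.+>(?s:.*?)</table>',
    re.MULTILINE,
)

# Hypothetical sample: one table to delete, one table to keep.
html = """<form>
  <td>
    <table cellpadding="1" cellspacing="0" border="0">
      bla bla blah
    </table>
    <table class="other">
      keep me
    </table>
    <br>
</form>
"""

# Replace every matching block with nothing, like Replace All with an empty zone.
cleaned = pattern.sub('', html)
print(cleaned)
```

Only the indented block carrying the three attributes is removed. The other table stays because its opening line fails the look-aheads.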
To force the regex engine to catch the 3 attributes cellpadding, cellspacing and border in THAT order, exclusively, there are two solutions :
(?-is)^\h*<table\x20(?=.*cellpadding="1" cellspacing="0" border="0").+>(?s).*?</table> ( a single look-ahead )
( 3 nested look-aheads )
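The difference between the two ordered variants can be sketched in Python. The single look-ahead is the one quoted in this thread, with [ \t]* standing in for \h* (an assumption, since Python has no \h). The nested form is only my own guess at what a regex with 3 nested look-aheads could look like; it is not taken from the thread. Only the opening-tag part is tested.

```python
import re

# Single look-ahead, as quoted in the thread: the three attributes must appear
# as one contiguous string, separated by single spaces.
single = re.compile(
    r'^[ \t]*<table\x20(?=.*cellpadding="1" cellspacing="0" border="0")'
)

# Hypothetical nested form (my reconstruction): each look-ahead starts where
# the previous attribute ended, so the order is fixed but gaps are allowed.
nested = re.compile(
    r'^[ \t]*<table\x20'
    r'(?=.*cellpadding="1"(?=.*cellspacing="0"(?=.*border="0")))'
)

in_order = '<table cellpadding="1" cellspacing="0" border="0">'
gapped = '<table cellpadding="1" class="x" cellspacing="0" border="0">'
shuffled = '<table border="0" cellpadding="1" cellspacing="0">'

for name, rx in (('single', single), ('nested', nested)):
    print(name, [bool(rx.match(s)) for s in (in_order, gapped, shuffled)])
```

Both forms reject the shuffled line. They differ on the gapped line: the single look-ahead needs the attributes side by side, while the nested guess accepts other attributes in between.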
Greg Jeu
But your explanations are great and help me understand regexp better!
Thank you very much.