Delete all lines except
-
Here is an example of two businesses, one with an email address and one without. Each one always begins with the same code and ends with the same code, so selecting for an email is easy by finding those with the @ symbol, but I need to remove all lines for the 2nd example and any subsequent business listings that do not have an email address, keeping all lines for each business that does have an email address.
Here’s the html code for a business listing with an email address.
<table align=“center” style=“width:95%;” border=“0”><tr><td align=‘left’ colspan=3><hr style=‘border: solid 1px black;’/>
<table align=‘left’ cellspacing=‘0’ cellpadding=‘3’ width=‘500’><tr><td align=‘left’ width=‘60%’ valign=‘top’>
<font color=‘#595f75’><strong>A & L INDUSTRIAL SERVICES</strong><br>Misty Martinez<br></font>
<font color=‘#595f75’>2910 East P Street<br>Deer Park, TX 77536</font> <a href=‘http://maps.google.com/?q=2910+East+P+Street%2C+Deer+Park%2C+TX+77536’ target=‘_blank’ style=“color: ##228dc1;”>Map</a></td>
<td width=‘40%’ align=‘right’ valign=‘top’>281 470-9805<br>FAX: 281 470-9899<br><a href=‘http://www.anlindustrial.com’ target=‘_blank’><font color=‘#228dc1’>www.anlindustrial.com</font></a><br><a href=‘mailto:misty.martinez@anlindustrial.com’><font color=‘#228dc1’>Email</font></a></td>
</tr><tr><td align=‘left’ colspan=3><span style=‘font-style: italic; font-weight: bold;’>
</span></td></tr>
</table></td></tr>
This business does **not** have an email address, so all lines for this business need to be deleted.
<tr><td align=‘left’ colspan=3><hr style=‘border: solid 1px black;’/>
<table align=‘left’ cellspacing=‘0’ cellpadding=‘3’ width=‘500’><tr><td align=‘left’ width=‘60%’ valign=‘top’>
<font color=‘#595f75’><strong>A Life to Live Animal Shelter & Adoption Center</strong><br>Megan Gonzales<br></font>
<font color=‘#595f75’>P.O. Box 873<br>Baytown, TX 77522</font></td>
<td width=‘40%’ align=‘right’ valign=‘top’>832 821-5420<br><a href=‘http://www.adopttosave.org’ target=‘_blank’><font color=‘#228dc1’>www.adopttosave.org</font></a></td>
</tr><tr><td align=‘left’ colspan=3><span style=‘font-style: italic; font-weight: bold;’>
</span></td></tr>
</table></td></tr>
Any tips and suggestions for the correct code are appreciated. I hope to learn something so I can help others too.
-
There is probably a way to generate a regular expression that will do what you want – this FAQ will give a good starting place for understanding regular expressions for fancy search/replace/delete actions.
But it’s going to be really hard to craft accurately – though there are a couple here who will probably try, even based on your limited examples – because of all the nestings and balances that need to be checked to ensure that the regex doesn’t delete multiple entries instead of just one, and doesn’t go from the middle of one to the middle of another. (Regular expressions are not well-suited for parsing HTML/XML and the like. You can sometimes do it for a limited circumstance, but it’s not a very general solution.)
If you really just had a few, I would just do it manually. But if you have an HTML file with large numbers of entries, I can see why you would want to automate the problem: I’m just not convinced that a regular expression from within Notepad++ is really the best plan. Especially if in another day or week or month, you’re going to want to do something similar, but with a tweaked requirement.
However, if you have an HTML file with a large number of entries, I somehow doubt it was originally generated manually in a text editor. My guess is that it came from a database. If you have access to the original program/website-backend/whatever that generated the HTML originally (or can contact a person who has access), it would be much easier to filter the data before it makes it into the HTML to begin with – and to regenerate the HTML from the source data, rather than post-processing.
If you are in the unfortunate circumstance where the HTML was generated “a long time ago”, and whoever or whatever generated the HTML is lost or gone or otherwise no longer available, I’m still not sure Notepad++ with regular expressions is your best option, especially if you are going to have to filter the data more than once in your life: the regex would have to be re-worked for every slight variation in what you wanted to filter, and tweaking the regex might be more maintenance hassle than the time it saves by eliminating the manual edit. Most programming languages – especially so-called “scripting” languages, like Perl, Python, or Lua, which are designed for fast development times to solve problems like this – have libraries/modules/packages available which can easily parse HTML and extract the data; once it’s parsed, it’s easily filtered, and then the remaining HTML can be spit back out – or, even better, the underlying data could be put into a new database, and then a script could be used to extract filtered data from that database to generate the HTML.
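As a rough illustration of the scripting route, here is a minimal Python sketch using only the standard library's `html.parser`. It assumes the straight-quoted markup from the example (an `<hr>` opening each listing, the business name in `<strong>`, and the email in a `mailto:` link). The names `ListingParser` and `listings_with_email` and the shortened `sample` string are made up for this illustration, and a real file may well need more care.

```python
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collect one record per business listing (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.listings = []       # one dict per listing, in document order
        self._in_strong = False  # True while inside <strong>...</strong>

    def handle_starttag(self, tag, attrs):
        if tag == "hr":
            # Assumption: every listing opens with an <hr> rule.
            self.listings.append({"name": "", "email": None})
        elif tag == "strong":
            self._in_strong = True
        elif tag == "a" and self.listings:
            href = dict(attrs).get("href") or ""
            if href.startswith("mailto:"):
                self.listings[-1]["email"] = href[len("mailto:"):]

    def handle_endtag(self, tag):
        if tag == "strong":
            self._in_strong = False

    def handle_data(self, data):
        # The business name is the text inside <strong>.
        if self._in_strong and self.listings:
            self.listings[-1]["name"] += data


def listings_with_email(html_text):
    """Return only the listings that carry a mailto: link."""
    parser = ListingParser()
    parser.feed(html_text)
    parser.close()
    return [item for item in parser.listings if item["email"]]


# Shortened, made-up variant of the two listings in the question.
sample = """<table><tr><td align='left' colspan=3><hr style='border: solid 1px black;'/>
<table><tr><td><font color='#595f75'><strong>A & L INDUSTRIAL SERVICES</strong><br>Misty Martinez<br></font></td>
<td><a href='mailto:misty.martinez@anlindustrial.com'>Email</a></td></tr>
</table></td></tr>
<tr><td align='left' colspan=3><hr style='border: solid 1px black;'/>
<table><tr><td><font color='#595f75'><strong>A Life to Live Animal Shelter & Adoption Center</strong><br>Megan Gonzales<br></font></td>
<td><a href='http://www.adopttosave.org'>www.adopttosave.org</a></td></tr>
</table></td></tr></table>"""

for item in listings_with_email(sample):
    print(item["name"], item["email"])
```

Once the data is in plain records like this, changing the filter (say, "has a fax number" instead of "has an email") is a one-line change rather than a new regex.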
If you have no access to the original generator of the HTML, and if you don’t have the programming skills (and no one else in your organization does, either) to extract and filter, then maybe a regex supplied by friendly helpers in this forum is your next remaining option. In which case, please help us help you.
-
Repost the data in this forum in a way that will not mess up the data: right now, all your single and double quote marks are “smart quotes”, and those aren’t valid HTML, so I am assuming that the forum “kindly” converted your data. There may be other markdown characters that get eliminated as well, that we cannot begin to guess at. The “?” next to the COMPOSE in the post-editing window will direct you to a help page for markdown formatting. Or this help-with-markdown post will show details of good ways to use it in the forum. But in short, to quote a block of text verbatim, you can either put blank lines before and after, and indent every line of the quote by 4 spaces or a tab (this can easily be done by putting the raw text in Notepad++, selecting it all, hitting TAB, COPY that selected-and-indented data, and pasting in the forum) – it will show up in a black text block in the PREVIEW pane to the right of the EDIT pane. Or, if you are worried about “editing” the raw data even that much, format it like:
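
    ```z
    <table><tr>... ...</tr></table>
    ```

That is, put a blank line, then a line holding three backticks followed by `z`, then the raw data, then a line holding just three backticks, then a blank line. This will render as a verbatim block containing `<table><tr>... ...</tr></table>`.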
-
Include more examples of both data that will stay and data that will be deleted; often times, it’s the things that don’t get edited that give more clues as to how to craft the regex than what does get edited.
-
If there is always some other separator HTML (which you have not shown), it might make it easier to define the beginning and ending of an entry based on that separator.
-
-
Thanks for taking the time for your extensive help. Many sites that allow posting of html provide a way to isolate the code so it is recognized as such but I did not see that option here.
I thought this would not be too difficult a problem to solve since every block of code begins and ends with the exact same code in this case and the exceptions I want to keep all will have the @ symbol somewhere in that block of code if there is an email address. If there is no @ symbol then all code between the start and end code would be deleted. How is that different than finding a single line and deleting it?
Regardless, I will follow your advice and try to structure my question better.
-
@Raymond-Lee-Fellers said:
since every block of code begins and ends with the exact same code
Well, in your example above this is NOT the case (the 2 cases don’t start out the same way, although they do end the same way). One could guess at what is missing, but you seem so sure it is all there. Unless I am just not seeing it somehow.
-
@Alan-Kilborn, well, if you fix smartquotes, assume nothing else was markdown-ed away, and ignore the initial `<table align="center" style="width:95%;" border="0">`, then both blocks start with `<tr><td align='left' colspan=3><hr style='border: solid 1px black;'/>` and end with `</table></td></tr>`.
I am just not good enough with finding long strings that don’t contain a specific subsequence to be able to come up with one that will work. But I see @guy038 is browsing this topic now, so I assume magic regex will soon be appearing…
-
Wait, when I say it that way, it was easier than I thought. Assuming the `@` sign is sufficient to mark it as having an email, then the following should work to find and delete the ones without an email:
- REGULAR EXPRESSION mode
- FIND = `(?s)<tr><td align='left' colspan=3><hr style='border: solid 1px black;'/>[^@]*?</table></td></tr>`
- REPLACE = `` (empty)
If you need a longer string than just the `@` to determine that it’s got an email, then I defer to Guy.
edit: yes, when I run it on:
<table align="center" style="width:95%;" border="0"><tr><td align='left' colspan=3><hr style='border: solid 1px black;'/> <table align='left' cellspacing='0' cellpadding='3' width='500'><tr><td align='left' width='60%' valign='top'> <font color='#595f75'><strong>A & L INDUSTRIAL SERVICES</strong><br>Misty Martinez<br></font> <font color='#595f75'>2910 East P Street<br>Deer Park, TX 77536</font> <a href='http://maps.google.com/?q=2910+East+P+Street%2C+Deer+Park%2C+TX+77536' target='_blank' style="color: ##228dc1;">Map</a></td> <td width='40%' align='right' valign='top'>281 470-9805<br>FAX: 281 470-9899<br><a href='http://www.anlindustrial.com' target='_blank'><font color='#228dc1'>www.anlindustrial.com</font></a><br><a href='mailto:misty.martinez@anlindustrial.com'><font color='#228dc1'>Email</font></a></td> </tr><tr><td align='left' colspan=3><span style='font-style: italic; font-weight: bold;'> </span></td></tr> </table></td></tr> <tr><td align='left' colspan=3><hr style='border: solid 1px black;'/> <table align='left' cellspacing='0' cellpadding='3' width='500'><tr><td align='left' width='60%' valign='top'> <font color='#595f75'><strong>A Life to Live Animal Shelter & Adoption Center</strong><br>Megan Gonzales<br></font> <font color='#595f75'>P.O. 
Box 873<br>Baytown, TX 77522</font></td> <td width='40%' align='right' valign='top'>832 821-5420<br><a href='http://www.adopttosave.org' target='_blank'><font color='#228dc1'>www.adopttosave.org</font></a></td> </tr><tr><td align='left' colspan=3><span style='font-style: italic; font-weight: bold;'> </span></td></tr> </table></td></tr> <tr><td align='left' colspan=3><hr style='border: solid 1px black;'/> <table align='left' cellspacing='0' cellpadding='3' width='500'><tr><td align='left' width='60%' valign='top'> <font color='#595f75'><strong>A & L INDUSTRIAL SERVICES</strong><br>Misty Martinez<br></font> <font color='#595f75'>2910 East P Street<br>Deer Park, TX 77536</font> <a href='http://maps.google.com/?q=2910+East+P+Street%2C+Deer+Park%2C+TX+77536' target='_blank' style="color: ##228dc1;">Map</a></td> <td width='40%' align='right' valign='top'>281 470-9805<br>FAX: 281 470-9899<br><a href='http://www.anlindustrial.com' target='_blank'><font color='#228dc1'>www.anlindustrial.com</font></a><br><a href='mailto:misty.martinez@anlindustrial.com'><font color='#228dc1'>Email</font></a></td> </tr><tr><td align='left' colspan=3><span style='font-style: italic; font-weight: bold;'> </span></td></tr> </table></td></tr> <tr><td align='left' colspan=3><hr style='border: solid 1px black;'/> <table align='left' cellspacing='0' cellpadding='3' width='500'><tr><td align='left' width='60%' valign='top'> <font color='#595f75'><strong>A & L INDUSTRIAL SERVICES</strong><br>Misty Martinez<br></font> <font color='#595f75'>2910 East P Street<br>Deer Park, TX 77536</font> <a href='http://maps.google.com/?q=2910+East+P+Street%2C+Deer+Park%2C+TX+77536' target='_blank' style="color: ##228dc1;">Map</a></td> <td width='40%' align='right' valign='top'>281 470-9805<br>FAX: 281 470-9899<br><a href='http://www.anlindustrial.com' target='_blank'><font color='#228dc1'>www.anlindustrial.com</font></a><br><a href='mailto:misty.martinez@anlindustrial.com'><font color='#228dc1'>Email</font></a></td> 
</tr><tr><td align='left' colspan=3><span style='font-style: italic; font-weight: bold;'> </span></td></tr> </table></td></tr> <tr><td align='left' colspan=3><hr style='border: solid 1px black;'/> <table align='left' cellspacing='0' cellpadding='3' width='500'><tr><td align='left' width='60%' valign='top'> <font color='#595f75'><strong>A Life to Live Animal Shelter & Adoption Center</strong><br>Megan Gonzales<br></font> <font color='#595f75'>P.O. Box 873<br>Baytown, TX 77522</font></td> <td width='40%' align='right' valign='top'>832 821-5420<br><a href='http://www.adopttosave.org' target='_blank'><font color='#228dc1'>www.adopttosave.org</font></a></td> </tr><tr><td align='left' colspan=3><span style='font-style: italic; font-weight: bold;'> </span></td></tr> </table></td></tr> <tr><td align='left' colspan=3><hr style='border: solid 1px black;'/> <table align='left' cellspacing='0' cellpadding='3' width='500'><tr><td align='left' width='60%' valign='top'> <font color='#595f75'><strong>A Life to Live Animal Shelter & Adoption Center</strong><br>Megan Gonzales<br></font> <font color='#595f75'>P.O. Box 873<br>Baytown, TX 77522</font></td> <td width='40%' align='right' valign='top'>832 821-5420<br><a href='http://www.adopttosave.org' target='_blank'><font color='#228dc1'>www.adopttosave.org</font></a></td> </tr><tr><td align='left' colspan=3><span style='font-style: italic; font-weight: bold;'> </span></td></tr> </table></td></tr> </table>
I get
<table align="center" style="width:95%;" border="0"><tr><td align='left' colspan=3><hr style='border: solid 1px black;'/> <table align='left' cellspacing='0' cellpadding='3' width='500'><tr><td align='left' width='60%' valign='top'> <font color='#595f75'><strong>A & L INDUSTRIAL SERVICES</strong><br>Misty Martinez<br></font> <font color='#595f75'>2910 East P Street<br>Deer Park, TX 77536</font> <a href='http://maps.google.com/?q=2910+East+P+Street%2C+Deer+Park%2C+TX+77536' target='_blank' style="color: ##228dc1;">Map</a></td> <td width='40%' align='right' valign='top'>281 470-9805<br>FAX: 281 470-9899<br><a href='http://www.anlindustrial.com' target='_blank'><font color='#228dc1'>www.anlindustrial.com</font></a><br><a href='mailto:misty.martinez@anlindustrial.com'><font color='#228dc1'>Email</font></a></td> </tr><tr><td align='left' colspan=3><span style='font-style: italic; font-weight: bold;'> </span></td></tr> </table></td></tr> <tr><td align='left' colspan=3><hr style='border: solid 1px black;'/> <table align='left' cellspacing='0' cellpadding='3' width='500'><tr><td align='left' width='60%' valign='top'> <font color='#595f75'><strong>A & L INDUSTRIAL SERVICES</strong><br>Misty Martinez<br></font> <font color='#595f75'>2910 East P Street<br>Deer Park, TX 77536</font> <a href='http://maps.google.com/?q=2910+East+P+Street%2C+Deer+Park%2C+TX+77536' target='_blank' style="color: ##228dc1;">Map</a></td> <td width='40%' align='right' valign='top'>281 470-9805<br>FAX: 281 470-9899<br><a href='http://www.anlindustrial.com' target='_blank'><font color='#228dc1'>www.anlindustrial.com</font></a><br><a href='mailto:misty.martinez@anlindustrial.com'><font color='#228dc1'>Email</font></a></td> </tr><tr><td align='left' colspan=3><span style='font-style: italic; font-weight: bold;'> </span></td></tr> </table></td></tr> <tr><td align='left' colspan=3><hr style='border: solid 1px black;'/> <table align='left' cellspacing='0' cellpadding='3' width='500'><tr><td align='left' 
width='60%' valign='top'> <font color='#595f75'><strong>A & L INDUSTRIAL SERVICES</strong><br>Misty Martinez<br></font> <font color='#595f75'>2910 East P Street<br>Deer Park, TX 77536</font> <a href='http://maps.google.com/?q=2910+East+P+Street%2C+Deer+Park%2C+TX+77536' target='_blank' style="color: ##228dc1;">Map</a></td> <td width='40%' align='right' valign='top'>281 470-9805<br>FAX: 281 470-9899<br><a href='http://www.anlindustrial.com' target='_blank'><font color='#228dc1'>www.anlindustrial.com</font></a><br><a href='mailto:misty.martinez@anlindustrial.com'><font color='#228dc1'>Email</font></a></td> </tr><tr><td align='left' colspan=3><span style='font-style: italic; font-weight: bold;'> </span></td></tr> </table></td></tr> </table>
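For anyone who would rather run the same search outside Notepad++, a rough Python equivalent is sketched below. It assumes straight quotes and the start/end markers quoted above; the names `START`, `END`, `NO_AT_BLOCK`, and `drop_blocks_without_at`, and the shortened `sample`, are invented for the illustration.

```python
import re

START = "<tr><td align='left' colspan=3><hr style='border: solid 1px black;'/>"
END = "</table></td></tr>"

# Same idea as the FIND expression: START, then the shortest run of
# non-@ characters, then END. [^@] also matches line breaks.
NO_AT_BLOCK = re.compile(re.escape(START) + r"[^@]*?" + re.escape(END))


def drop_blocks_without_at(html_text):
    """Delete every START...END block that contains no @ character."""
    return NO_AT_BLOCK.sub("", html_text)


# Shortened, made-up sample: one listing with an email, one without.
sample = (
    START + "\n<table><tr><td>A & L INDUSTRIAL SERVICES</td>\n"
    "<td><a href='mailto:misty.martinez@anlindustrial.com'>Email</a></td></tr>\n"
    + END + "\n"
    + START + "\n<table><tr><td>A Life to Live Animal Shelter</td>\n"
    "<td>832 821-5420</td></tr>\n"
    + END + "\n"
)

print(drop_blocks_without_at(sample))
```

A block that contains an `@` can never be matched from its own START, because `[^@]` cannot step over the `@` before reaching END.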
-
too slow for another edit: if you prefix and suffix the FIND string with `\R*`, you can get rid of the blank lines, too:
- FIND = `(?s)\R*<tr><td align='left' colspan=3><hr style='border: solid 1px black;'/>[^@]*?</table></td></tr>\R*`
-
@PeterJones said:
well, if you fix smartquotes, assume nothing else was markdown-ed away, and ignore the initial
Or better yet, you fix nothing, and just…move…on…
It is actually amazing the number of people here that make you want to figure out what they are trying to ask before you give them help.
-
Given that it’s probably the forum itself that clobbered the real quotes into smart quotes, and the little “?” isn’t super-obvious, being light grey and pretty small and a one-character-wide click, I’m willing to fix smartquotes for first-time posters (and since in his first thread, no one pointed out to Raymond how to properly format in these forums, I gave an extension).
But now that I’ve explained it, and pointed out multiple times to @Raymond-Lee-Fellers that he needs to apply Markdown formatting to force the blocks to be rendered unedited, I will expect any further clarifications or posts from him to be formatted better.
-
For my own curiosity (one of these days, I want this idiom to stick): starting with my FIND = `(?s)\R*<tr><td align='left' colspan=3><hr style='border: solid 1px black;'/>[^@]*?</table></td></tr>\R*` as the baseline, how would you fix `[^@]*?` to search for “any non-greedy sequence of characters that does not contain `mailto:`” instead of “any non-greedy sequence of characters that does not contain an `@` sign”?
-
@Alan-Kilborn oops, you’re right. The 1st example is different from the second one; however it appears that all subsequent blocks do begin and end with the same code. My apologies.
-
No problem…helping you out is the most important thing.
This may be the idiom you seek?:
((?!mailto:).)*
It is not the easiest thing to remember.
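To see the idiom in action, here is a small Python sketch (Python's `re` accepts the same lookahead syntax). It swaps `[^@]*?` for `(?:(?!mailto:).)*?`, so a listing that merely mentions an `@` somewhere but has no `mailto:` link is still removed. The marker strings come from this thread; the sample listings and the name `drop_blocks_without_mailto` are made up.

```python
import re

START = "<tr><td align='left' colspan=3><hr style='border: solid 1px black;'/>"
END = "</table></td></tr>"

# (?:(?!mailto:).)*? consumes one character at a time, but only while
# the text at the current position does not begin with "mailto:".
NO_MAILTO_BLOCK = re.compile(
    re.escape(START) + r"(?:(?!mailto:).)*?" + re.escape(END),
    re.DOTALL,  # plays the role of (?s): "." also matches line breaks
)


def drop_blocks_without_mailto(html_text):
    """Delete every START...END block that contains no mailto: link."""
    return NO_MAILTO_BLOCK.sub("", html_text)


# Made-up sample: the second listing has an @ sign but no email link.
sample = (
    START + "\n<table><tr><td>Has Email Inc.</td>\n"
    "<td><a href='mailto:info@example.com'>Email</a></td></tr>\n" + END + "\n"
    + START + "\n<table><tr><td>Find us @ the mall</td>\n"
    "<td>832 821-5420</td></tr>\n" + END + "\n"
)

print(drop_blocks_without_mailto(sample))
```

With the original `[^@]*?`, the second listing would have been kept because of the stray `@`.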
-
@PeterJones said:
(?s)<tr><td align='left' colspan=3><hr style='border: solid 1px black;'/>[^@]*?</table></td></tr>
Works perfectly. Didn’t have to fix anything. Your code works out of the box. Thank you.
-
Hi, @raymond-lee-fellers, @peterjones, @alan-kilborn and All,
First of all, Peter, I was just about to reply when I saw your solution. My solution is quite similar and just a bit less accurate than yours, as my opening boundary is simply `<tr><td align='left'`!
I’ve found the regex: `(?s)<tr><td align='left'[^@]+</table></td></tr>\R`
Regarding your question, Peter, the right answer is Alan’s! So, if you’re looking for the smallest area of characters, even on several lines, between the opening boundary `START` and the ending boundary `END`, in that exact case, which should not contain, for instance, the number `123`, the correct regex is: `(?s-i)START((?!123).)*?END`
Indeed, this regex means that, before each position reached by the regex engine, after the word START, it tests the negative look-ahead, i.e. it asks: in the next three characters, is there a `123` string? If NOT, the negative look-ahead is TRUE and then allows the regex engine to continue the process and to move to the next character.
Of course, if you want to select this same area of characters, which does contain the number `123`, this regex is simply: `(?s-i)START.*?123.*?END`
Test it, against the text, below :
......START123......................END................START123......................END..........
......START.........................END................START.........................END..........
......START...........1.............END................START...........1.............END..........
......START............2............END................START............2............END..........
......START.............3...........END................START.............3...........END..........
......START...........123...........END................START...........123...........END..........
......START...........12............END................START...........12............END..........
......START............23...........END................START............23...........END..........
......START...........1.3...........END................START...........1.3...........END..........
......START......................123END................START......................123END..........
You’ve certainly noticed that, if you look for areas containing the `123` number in my sample text, the best is to use the regex `(?-is)START.*?123.*?END`, which limits each match to a single-line range ;-))
Best regards,
guy038
P.S. :
Note that the regex `(?s-i)START((?!123).)*?END` could be rewritten, in a more complicated way, as: `(?s-i)START(((?!1)|(?=1(?!2))|(?=12(?!3))).)*?END`
This is just an application of Boolean algebra! Indeed, if we consider `3` consecutive chars, in order to match all cases different from the number “123”, we may use:

    NOT (123)  =  NOT 1   OR   ( 1 AND NOT 2 )   OR   ( 12 AND NOT 3 )
                    V                V                     V
                   ...              1..                   12.
                   .2.              1.3
                   ..3
                   .23
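The equivalence claimed in the P.S. can be spot-checked with a short Python sketch. Two assumptions: Python's `re` only allows switching flags off in the scoped `(?s-i:...)` form, so `re.DOTALL` stands in for the leading `(?s-i)` (matching is case-sensitive by default anyway), and the `sample` string is a shortened, made-up variant of the test text above. The names `TEMPERED`, `BOOLEAN`, and `areas` are hypothetical.

```python
import re

# START, then the shortest run of characters in which "123" never
# begins, then END.
TEMPERED = re.compile(r"START(?:(?!123).)*?END", re.DOTALL)

# The P.S. rewrite: NOT 1, OR (1 AND NOT 2), OR (12 AND NOT 3).
BOOLEAN = re.compile(
    r"START(?:(?:(?!1)|(?=1(?!2))|(?=12(?!3))).)*?END", re.DOTALL
)

# Shortened, made-up variant of the sample text.
sample = (
    "..START..1.3..END..\n"
    "..START..123..END..\n"
    "..START..12...END..\n"
    "..START.......END..\n"
    "..START....123END..\n"
)


def areas(pattern, text):
    """Return the full text of every match of pattern in text."""
    return [m.group(0) for m in pattern.finditer(text)]


print(areas(TEMPERED, sample))
print(areas(BOOLEAN, sample))
```

Both patterns should report the same areas on this sample: only the START...END ranges in which `123` never appears.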
-
@Raymond-Lee-Fellers : glad it worked
@guy038 : thanks for the details
@Alan-Kilborn : a negative lookahead as the start of the match segment. No wonder I cannot store it. I’ve bookmarked it instead.
Now back to studying Python (since two of the forum’s python experts are not actively helping here anymore, I need to up my PythonScript output)
-
since two of the forum’s python experts are not actively helping here anymore, I need to up my PythonScript output …
i really wished there was an easy way to lure both back somehow … for example with a complimentary free cake for all returning members 🥧🍰🎂 :D
i’ve still got the hope that someday maybe one or both will return for either historic, good times reasons, or 'cause you lured them back with future’s most ultimate py guru knowledge 👍
(ps: no pressure, i think your py is already pretty good and way better than eg. mine)
reader’s note:
this was slightly off topic, so i give my sincere apologies to everyone in advance.