Delete all rows of a text file except company names
-
This is a text file in HTML containing company information. I want to delete everything except the company names.
Every row containing a company name has this code before the name: <font color='#595f75'><strong>
And this code after the company name: </strong>
I tried using a regex find and replace that supposedly would do this, but it didn’t work, so I’m asking here for suggestions.
Below is an example of one company's HTML listing. There could be as many as 1,000 companies in each list, so automating this would be a big help.
<table align='left' cellspacing='0' cellpadding='3' width='500'><tr><td align='left' width='60%' valign='top'>
<font color='#595f75'><strong>A & L INDUSTRIAL SERVICES</strong><br>Misty Martinez<br></font>
<font color='#595f75'>2910 East P Street<br>Deer Park, TX 77536</font> <a href='http://maps.google.com/?q=2910+East+P+Street%2C+Deer+Park%2C+TX+77536' target='_blank' style="color: ##228dc1;">Map</a></td>
<td width='40%' align='right' valign='top'>281 470-9805<br>Fax: 281 470-9899<br><a href='http://www.anlindustrial.com' target='_blank'><font color='#228dc1'>www.anlindustrial.com</font></a><br><a href='mailto:misty.martinez@anlindustrial.com'><font color='#228dc1'>Email</font></a></td>
</tr><tr><td align='left' colspan=3><span style='font-style: italic; font-weight: bold;'>
</span></td></tr>
</table></td></tr>
Thanks for any suggestions. It’s probably simple for you but not for me.
-
Firstly, I can see from your example that at least one line has probably wrapped and now appears as two or more lines. Examples matter a lot: when we create a regex, knowing how the line REALLY appears is very important.
Can I therefore suggest you read the FAQ, specifically the posting called
“Request for Help without sufficient information to help you”.
That post explains how to present example data so that the markdown interpreter (which renders these posts) does NOT interfere with the formatting.
Terry
-
Actually, there is another way to delete the lines you don’t want. I’ll explain, as it seems you may have some regex knowledge already.
Under the Search menu there is a “Mark” option. Take the text you know exists on the company lines (it MUST NOT occur on any lines you want to delete, only on the ones that should remain) and enter it in the Mark “Find what” field. Tick the “Bookmark line” option and then click on “Mark All”. This marks all the lines you want to keep. From here you can use the Search menu, Bookmark (near the bottom), and select either “remove bookmarked lines” or “remove unmarked lines”. If you choose the first option, open another tab in NPP and paste them there.
I hope that helps.
Terry
-
@Raymond-Lee-Fellers
Sorry, slight mistake in my previous post: I meant to say you could use “cut bookmarked lines” and then paste in another tab in NPP. However, the easiest option is to use “remove unmarked lines”, which will leave the lines you DO want.
Terry
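For anyone who prefers a script, the bookmark approach above can be sketched in Python. This is only an illustration: the function name keep_marked_lines and the shortened sample text are assumptions, not part of Notepad++. Like “remove unmarked lines”, it keeps whole lines, so the surrounding tags are still present afterwards.

```python
# Marker text that appears only on the company-name lines (taken from the question).
MARKER = "<font color='#595f75'><strong>"


def keep_marked_lines(text: str, marker: str = MARKER) -> str:
    """Return only the lines of `text` that contain `marker`.

    Hypothetical helper mimicking "Mark All" + "remove unmarked lines".
    """
    kept = [line for line in text.splitlines() if marker in line]
    return "\n".join(kept)


# Shortened version of the sample listing from the question.
sample = (
    "<table align='left'><tr><td align='left'>\n"
    "<font color='#595f75'><strong>A & L INDUSTRIAL SERVICES</strong><br>Misty Martinez<br></font>\n"
    "<font color='#595f75'>2910 East P Street<br>Deer Park, TX 77536</font>\n"
    "</table>\n"
)

print(keep_marked_lines(sample))
```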
-
Hello, @raymond-lee-fellers, @terry-r and All,
So, Raymond, you would like to delete everything except the company names, which are located :
- After the string <font color='#595f75'><strong>
- Before the string </strong>
No problem at all with regular expressions ;-))
- Copy / Paste your html file in a new Notepad++ tab
- Open the Replace dialog ( Ctrl + H )
- SEARCH (?s).+?<font color='#595f75'><strong>((?-s).+?)</strong>|.+
- REPLACE \1\r\n ( or \1\n if you work with UNIX files )
- Tick the Wrap around option
- Select the Regular expression search mode
- Click on the Replace All button
Et voilà !
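The same search and replace can be tried outside Notepad++. Below is a minimal sketch in Python, under a few assumptions: Python's re module does not accept a bare (?-s) in mid-pattern, so the scoped form (?-s:...) is swapped in; the function name extract_names is made up; and the second company, ACME PIPE CO, is hypothetical sample data added only to show more than one match.

```python
import re

# Same idea as the Notepad++ regex; (?-s:...) replaces the bare (?-s) modifier,
# which Python's re module does not support in the middle of a pattern.
PATTERN = re.compile(
    r"(?s).+?<font color='#595f75'><strong>((?-s:.+?))</strong>|.+"
)


def extract_names(html: str) -> str:
    """Rewrite each match as group 1 plus a line break (UNIX-style \\n here)."""
    return PATTERN.sub(r"\1\n", html)


sample = (
    "<table>\n"
    "<font color='#595f75'><strong>A & L INDUSTRIAL SERVICES</strong><br>Misty Martinez<br></font>\n"
    "<td>281 470-9805</td>\n"
    # Hypothetical second company, not from the original post.
    "<font color='#595f75'><strong>ACME PIPE CO</strong><br></font>\n"
    "</table>\n"
)

print(extract_names(sample))
```

As in Notepad++, the text left over after the last company name is replaced by a single line break, so the output ends with one empty line.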
Notes :
- First, the global modifier (?s) means that, by default, the dot character will match any single char ( standard or EOL one )
- Then the part .+?<font color='#595f75'><strong> looks, from cursor position, for the smallest range, even on multi-lines, of any char till the literal string <font color='#595f75'><strong>
- Now, the part ((?-s).+?)</strong> tries to match the smallest range of standard characters, in a single line due to the (?-s) modifier, till the literal string </strong>. That range is stored as group 1, because of the parentheses
- If no more range <font color='#595f75'><strong>............</strong> can be found, the regex tries the second alternative, after the | symbol ( .+ ), which catches all the remaining chars till the very end of the file
- In replacement, any company name is rewritten, \1, followed by the EOL chars \r\n, and the remaining chars at the end of the file are simply replaced with a single line-break as, in that second alternative, the group 1 is not defined !
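To see the two alternatives at work, here is a small Python sketch that lists, for each successive match, which branch of the regex matched and what group 1 holds. It assumes Python's scoped form (?-s:...) in place of the bare (?-s) modifier, and describe_matches is a made-up helper name.

```python
import re

# Python variant of the regex above; (?-s:...) stands in for the bare (?-s).
PATTERN = re.compile(
    r"(?s).+?<font color='#595f75'><strong>((?-s:.+?))</strong>|.+"
)


def describe_matches(text: str) -> list:
    """Return (alternative, group 1) for each successive match in `text`."""
    report = []
    for m in PATTERN.finditer(text):
        if m.group(1) is not None:
            # First alternative: group 1 holds the company name.
            report.append(("first alternative", m.group(1)))
        else:
            # Second alternative (.+): group 1 is not defined.
            report.append(("second alternative", None))
    return report


sample = (
    "<table>\n"
    "<font color='#595f75'><strong>A & L INDUSTRIAL SERVICES</strong><br>Misty Martinez<br></font>\n"
    "</table>\n"
)

for alternative, name in describe_matches(sample):
    print(alternative, "->", name)
```

The last match comes from the second alternative with group 1 undefined, which is why the replacement leaves only a line break for the trailing text.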
Remark :
If you do not tick the Wrap around option, in order to run the regex S/R from the current location till the end of the file only, be sure that the cursor is at the very beginning of the current line before replacement !
Best Regards,
guy038
-
Thanks to everyone who helped with this question. Each answer contributed to the solution. Special thanks to guy038, who gave me a better understanding of how the code works and whose solution worked perfectly.
Ray Fellers