Search/Replace to delete between two words
-
Fellow Notepad++ Users,
Could you please help me with the following search-and-replace problem I am having?
I have a large dataset of newspaper articles from the Nexis database. In order to analyse them, I want to get rid of some of the repeating non-relevant lines of text which occur in every article. I have indicated the text I want removed between asterisks (*). I understand I need to leave the Replace field empty, and that some of the lines can simply be removed using Find and Replace. But for a few sections, e.g. between Section: and Body, there are a few variable words and numbers (like the word count of the article), which means I need some form of regular expression.
I have had a look at some expressions to delete between two strings but they don’t seem to register when I try them, please could someone provide me a (or a few if needed) regular expressions that would help me remove these bits?
Here is one example of the data I currently have (“before” data):
MPs' duty to challenge assisted dying claims; Letters to the Editor *The Daily Telegraph (London) December 3, 2024 Tuesday Edition 1, National Edition* *Copyright 2024 Telegraph Media Group Holdings Limited All Rights Reserved* *Section: LETTERS; Pg. 17 Length: 372 words Body* SIR - It is to be hoped that, as the Terminally Ill Adults (End of Life) Bill proceeds ("Patients may have to pay for assisted dying", report, December 2), MPs will subject its supporters' claims to the closest scrutiny, including the following. First, Dame Esther Rantzen claimed the Bill "offers everyone equal choice". However, while the Bill grants a choice to someone predicted to die within six months, even if they are not suffering, it denies one to those who are suffering gravely from an illness that will last six decades. If the Bill is enacted, such obviously arbitrary discrimination will be challenged in the courts. The slippery slope is logically inherent in the very arguments of "choice" and "relief of suffering". Second, Alicia Kearns MP objected to talk of "assisted suicide", suggesting "it comes from a very religious place". It comes from a very secular place: the Suicide Act 1961, which defines the crime of assisting suicide. It is, rather, the tendentious and misleading euphemism "assisted dying" that is objectionable, conflating euthanasia, physician-assisted suicide, palliative medicine and the withdrawal of treatment. To their credit, the Dutch talk plainly of euthanasia and physician-assisted suicide. Third, Kim Leadbeater MP claimed that no jurisdiction that started out with "terminal illness" has expanded its law. Not so. Colombia's law is no longer limited to the "terminal patient", and Canada no longer requires death to be "reasonably foreseeable". 
The former governor of Washington State (who had Parkinson's) admitted that he campaigned for an Oregontype law "as a first step" in the hope that other states would follow, "the nation's resistance will subside, the culture will shift and laws with more latitude will be passed". Finally, Ms Leadbeater claimed that her Bill has the strongest safeguards in the world. Yet MPs should be reminded that Sir James Munby, the distinguished former president of the Family Division of the High Court, concluded that, in respect of the proposed involvement of the judiciary, "the Leadbeater Bill falls lamentably short of providing adequate safeguards". Professor John Keown Kennedy Institute of Ethics Georgetown University Washington DC, United States *Load-Date: December 3, 2024* * * End of Document*
-
Assuming I’ve understood correctly that the asterisks in your example are not actually part of the text (putting them there, rather than doing separate before/after data, makes things more confusing), I can come up with an example regex or two for getting rid of some of the boiler-plate sections (with the REPLACE being left empty, unless otherwise noted):
- FIND =
(?s)The Daily Telegraph.*?National Edition\R
- FIND =
Copyright .*?All Rights Reserved\R
- FIND =
(?s)Section:.*?Body\R
- FIND =
Load-Date:.*\R
Notes:
- (?s) inside a regular expression is the same as having the ". matches newline" option turned on
- .*? says “match 0 or more, as few as possible”, which generally would prevent it from matching everything from The Daily Telegraph in the first article to National Edition in the last. However, to be safe, make sure you have a backup of your data before running any regex you are given
- \R will match the newline sequence at the end of the line, so it won’t leave an empty line when you do the replacement.
- I made the regex pretty generic, assuming that you won’t have text that’s similar to your boilerplate that is inside the stuff that you want to keep. They could be made a lot more specific (for example, only allowing different years but requiring the exact copyright text otherwise, or something). But I think it’s a sufficient starting point for you to start experimenting and learning.
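For anyone who wants to try the four searches outside Notepad++, here is a minimal Python sketch. It is a translation under stated assumptions, not what Notepad++ itself runs: Python's re module has no \R, so the sketch substitutes (?:\r\n|\n|\r); the sample text assumes each piece of boilerplate sits on its own line (the line breaks were lost in the example above); and the name strip_boilerplate is made up for illustration.

```python
import re

# Python's re module has no \R, so this stand-in matches CRLF, LF, or CR.
EOL = r"(?:\r\n|\n|\r)"

# The four FIND expressions from the post, in order, with \R swapped for EOL.
PATTERNS = [
    r"(?s)The Daily Telegraph.*?National Edition" + EOL,
    r"Copyright .*?All Rights Reserved" + EOL,
    r"(?s)Section:.*?Body" + EOL,
    r"Load-Date:.*" + EOL,
]


def strip_boilerplate(text: str) -> str:
    """Apply each search in turn with an empty replacement."""
    for pattern in PATTERNS:
        text = re.sub(pattern, "", text)
    return text


# Hypothetical line layout for one article (an assumption, see above).
SAMPLE = (
    "MPs' duty to challenge assisted dying claims; Letters to the Editor\n"
    "The Daily Telegraph (London)\n"
    "December 3, 2024 Tuesday\n"
    "Edition 1, National Edition\n"
    "Copyright 2024 Telegraph Media Group Holdings Limited All Rights Reserved\n"
    "Section: LETTERS; Pg. 17\n"
    "Length: 372 words\n"
    "Body\n"
    "SIR - It is to be hoped that ...\n"
    "Load-Date: December 3, 2024\n"
    "End of Document\n"
)

cleaned = strip_boilerplate(SAMPLE)
print(cleaned)
```

With this layout, only the headline, the article text, and the End of Document line should survive; none of the four searches targets End of Document, so it would need a search of its own.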
(It is technically possible to merge those all into one power-regex, but I don’t tend to do that; if this is a bulk action you will be taking often, I would recommend recording it as a macro with the individual steps, which is easier to implement and easier to understand if you later go back and want to change one or more of the searches inside the macro)
BTW: thank you for using the Template for Search/Replace Questions; using that formatting makes it easier to be certain of the data you have (though next time, I highly recommend keeping the “before” and “after” separate, rather than using asterisks, because there will always be some confusion as to whether any given asterisk in your example data is really there or is meant to be your “delete-me-indicator”).
-
@Alessandro-Pace said in Search/Replace to delete between two words:
Here is one example of the data I currently have (“before” data):
Whilst looking at your example I note that the variability of the data to be removed will very likely prove to be the major stumbling block. For example, I presume the first 3 lines (from The Daily Telegraph to Edition) will change greatly depending on the source. The only item that may help would be the date in the middle line as a focal point.
I also presume (from your post) that the * are inserted by you to identify those lines to be removed, and are not part of the actual text being edited.
I would head down the line of marking the lines to be removed, then possibly cutting out the unmarked lines and inserting into another tab/file. Then do the final check manually.
Terry
-
@Alessandro-Pace said in Search/Replace to delete between two words:
I have had a look at some expressions to delete between two strings but they don’t seem to register when I try them, please could someone provide me a (or a few if needed) regular expressions that would help me remove these bits?
As my solution is only marking the lines that may be fit for deletion I’ve combined all the 4 sections to be removed into the one regex.
Using the Mark function.
Find What:
(?-is)(.*\R\w+ \d+, \d{4}.*\R.*\R)|(^Copyright \d{4}.*\R)|(^Section(.*\R)*?Body\R)|(^Load-Date.*(\R.*)*?End of Document)
Tick "Bookmark line"; the search mode obviously is Regular expression. Click on Mark All.
At this point the lines marked will be highlighted and should have an icon (generally a blue sphere) in the left column (before the line starts). If you right-click on any icon (or in that column) you get an option to cut bookmarked lines (or indeed remove bookmarked lines). Which option you take is largely a personal choice. If you cut, you would then open a new tab and insert the lines removed. So now you have the file cut into hopefully “relevant lines to keep” and “irrelevant lines”, or just the relevant lines if you removed the bookmarked ones. Hopefully a quick manual check would identify if my solution has any problems. From there it would be a matter of adjusting the regex to suit. If no issues then you can just toss the irrelevant lines, leaving that which you seek.
Terry
-
@PeterJones Thank you so much these all worked perfectly!!