How to delete a duplicate paragraph at a particular place in multiple files

dr ramaanand

<H1…>Heading1</H1>
<H2…>Some text</H2>
<H2…>Different text</H2>
<H2…>Altogether different text</H2>
Some paragraphs here
</P> (or </ul>)
<P…><span…><b>Please E-mail us</b></span></P>
<H2…>Heading that should not be reproduced</H2>
Some paragraphs here
</ul> (or </P>)
<P…>We have</P>
<P …>Some text</P>
<P …>Different text</P>
<P …>Same text</P>
<P …>Same text</P>
<P…><b><span…>Please E-mail us</span></b></P>

dr ramaanand

For the above test string, putting (?s)\A.+?\K((<h2.+?</h2>\R)+).*\K(?=<p.*?</p>\R<p.*?Please\s*E-mail\s*us) in the Find field with the Regular expression search mode selected finds the paragraph just before the “Please E-mail us” paragraph, so I can remove it. In most files in the folder, that paragraph has the same text as the paragraph above it. In some files, however, it does not, so how do I avoid finding and removing it when the text differs? I believe the duplicate paragraph was added by Notepad++ during my previous find-and-replace exercise because of a bug.

dr ramaanand

@dr-ramaanand Please don’t tell me to do it on my own. I have tried and failed already.

Alan Kilborn

@dr-ramaanand said in How to delete a duplicate paragraph at a particular place in multiple files:

Please don’t tell me to do it on my own. I have tried and failed already.

Probably the best thing to do is to seek help on a site that specializes in regular-expression help.

dr ramaanand

@Alan-Kilborn I asked at www.regex101.com and they told me to put (?s)^(<p.*?<\/p>\R)(\1<p.*?Please\s*E-mail\s*us) in the Find field, select the Regular expression mode, put $2 in the Replace field and hit “Replace All”, and all the duplicate paragraphs disappeared.
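
For anyone who wants to try the same idea outside Notepad++, here is a minimal Python sketch of that find/replace. It is a port under assumptions, not the exact Notepad++ behaviour: Python's re module has no \R, so the line break is spelled out, and the flags assume that “Match case” is unticked and that ^ matches at the start of every line. The names NEWLINE, DUPLICATE_BEFORE_EMAIL and remove_duplicate_paragraph, as well as the sample strings, are made up for illustration.

```python
import re

# Python's re module has no \R, so a line break is written out explicitly.
NEWLINE = r"(?:\r\n|\n|\r)"

# Port of (?s)^(<p.*?<\/p>\R)(\1<p.*?Please\s*E-mail\s*us)
DUPLICATE_BEFORE_EMAIL = re.compile(
    r"^(<p.*?</p>" + NEWLINE + r")(\1<p.*?Please\s*E-mail\s*us)",
    re.DOTALL | re.MULTILINE | re.IGNORECASE,
)


def remove_duplicate_paragraph(text: str) -> str:
    """Keep only group 2, which drops the first copy of the repeated paragraph."""
    return DUPLICATE_BEFORE_EMAIL.sub(r"\2", text)


# Hypothetical sample: the "Same text" paragraph appears twice in a row.
with_duplicate = (
    "<P>Different text</P>\n"
    "<P>Same text</P>\n"
    "<P>Same text</P>\n"
    "<P><b><span>Please E-mail us</span></b></P>\n"
)

# Hypothetical sample: the paragraph before "Please E-mail us" is not a repeat.
without_duplicate = (
    "<P>Some text</P>\n"
    "<P>Different text</P>\n"
    "<P><b><span>Please E-mail us</span></b></P>\n"
)

print(remove_duplicate_paragraph(with_duplicate))
# The second sample has no repeated paragraph, so it comes back unchanged.
print(remove_duplicate_paragraph(without_duplicate) == without_duplicate)  # True
```

Because the backreference \1 has to match the previous paragraph exactly, files where the paragraph before “Please E-mail us” has different text are left alone. One thing to watch: with dot-matches-newline switched on, <p.*? can run across several paragraphs, so a repeated pair further up the file can also match as long as “Please E-mail us” appears somewhere after it.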

Alan Kilborn

@dr-ramaanand said in How to delete a duplicate paragraph at a particular place in multiple files:

and all the duplicate paragraphs disappeared.

So that’s good, right?
What you wanted?

dr ramaanand

@Alan-Kilborn Yes, and thanks for your time. Please keep this community going, as there are lots of people who will ask for solutions here (the Notepad++ community)!

Alan Kilborn

@dr-ramaanand

Our goal is to get you the best help available.
We can answer regex questions here, but the same/similar questions from the same poster get tiring as we are interested in much more diverse Notepad++ topics than just data conversion with regex.
So, if we can redirect you to a site where they are excited about regex, and only regex, well, we’ll do that.
I think maybe you’ve found a site for that now.
But I encourage you to learn to do it yourself – if someone else can write something that works, then so can you!

dr ramaanand

@Alan-Kilborn I have learnt quite a bit, but not everything, which is why I seek solutions here. Notepad++ has a “delete duplicate lines” feature for an open file, which is why I asked for a solution here first.

dr ramaanand

@Alan-Kilborn I can even explain the above. In that regex, (?s)^(<p.*?<\/p>\R)(\1<p.*?Please\s*E-mail\s*us), (?s) makes the dot match line breaks as well, so .*? can span several lines, and ^ anchors the match at the beginning of a line. (<p.*?<\/p>\R) is the first captured group: a paragraph from <p…</p> together with the line break after it (which is what the \R matches). The rest is the second captured group, in which \1 matches a duplicate of the first captured group, followed by <p and then the shortest run of text up to “Please E-mail us”. The \s* before and after “E-mail” matches any amount of whitespace, including none, so the words “Please E-mail us” are found whether they are all on the same line or on different lines.
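
The individual pieces can be checked in isolation. The following is a small Python sketch, not Notepad++ itself; Python's re module accepts (?s) and \s with the same meaning, and the sample string and variable names are made up for illustration.

```python
import re

# Hypothetical sample with the phrase spread over three lines.
html = "<P>Please\nE-mail\nus</P>"

# Without (?s), "." stops at a line break, so nothing is found.
dot_without_s = re.search(r"<p.*?us", html, re.IGNORECASE)
print(dot_without_s)  # None

# With (?s), "." also matches line breaks, so the match spans all three lines.
dot_with_s = re.search(r"(?s)<p.*?us", html, re.IGNORECASE)
print(dot_with_s.group(0) == "<P>Please\nE-mail\nus")  # True

# \s* matches zero or more whitespace characters, line breaks included.
phrase = re.compile(r"Please\s*E-mail\s*us")
same_line = phrase.search("Please E-mail us") is not None
split_lines = phrase.search("Please\nE-mail\r\nus") is not None
no_spaces = phrase.search("PleaseE-mailus") is not None
print(same_line, split_lines, no_spaces)  # True True True
```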
The $2 in the Replace field (of “Replace in files” in this case) writes back only the second captured group, so the first copy of the duplicated paragraph is dropped from the result.
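
For comparison, the same replacement over every file in a folder can be sketched in Python. This is only a rough sketch, not how Notepad++ does it: the function name clean_folder, the "*.html" filter and the UTF-8 encoding are assumptions for illustration, and the \R of the original regex is written out because Python's re module does not have it. Files are read and written as bytes so that the existing line endings are kept. Work on a copy of the folder first.

```python
import re
from pathlib import Path

# Python's re module has no \R, so a line break is written out explicitly.
NEWLINE = r"(?:\r\n|\n|\r)"

# Port of (?s)^(<p.*?<\/p>\R)(\1<p.*?Please\s*E-mail\s*us)
PATTERN = re.compile(
    r"^(<p.*?</p>" + NEWLINE + r")(\1<p.*?Please\s*E-mail\s*us)",
    re.DOTALL | re.MULTILINE | re.IGNORECASE,
)


def clean_folder(folder: str, glob: str = "*.html", encoding: str = "utf-8") -> int:
    """Rewrite each matching file in place and return how many files changed."""
    changed = 0
    for path in sorted(Path(folder).glob(glob)):
        # Decode from bytes to avoid any newline translation on read.
        original = path.read_bytes().decode(encoding)
        cleaned = PATTERN.sub(r"\2", original)
        if cleaned != original:
            # Only files that actually contained a duplicate are rewritten.
            path.write_bytes(cleaned.encode(encoding))
            changed += 1
    return changed
```

Calling clean_folder on a folder path returns the number of files that were rewritten; files without a repeated paragraph are not touched.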