Help how to find duplicate sentence

NASSER AL NAAMANI

Help: how do I find duplicate sentences? That is, how do I filter out a repeated sentence or phrase in a file opened in the notepad? Example:

abcde,abcde

PeterJones

@NASSER-AL-NAAMANI ,

You are way too vague for us to give a complete answer. You say “sentence”, but then use an example of the single word abcde repeated. Since we have no idea what characters are allowed in a sentence for your data, we would have to make our answer super-generic. And because we don’t know whether they are always separated by just one character (like your example comma), or if there could be dozens of other “sentences” in between, we’d have to allow a generic number of characters between them. The problem with a super-generic regex that allows lots of variation in the number of characters in between is that it is exceedingly inefficient, and if there are too many characters in your file, the regex engine can decide it’s out of memory/resources for its capture-and-backtrack algorithm, and stop before finding matches.

If each of your “sentences” were a separate line of text, it would be much easier – and in fact, Notepad++ already has “Remove duplicate lines” or “Remove consecutive duplicate lines” built into the Search > Line Operations menu.
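For the one-sentence-per-line case, the idea behind “Remove duplicate lines” can be sketched outside Notepad++ too. This is only a rough Python illustration, not how Notepad++ implements the feature; the function name is made up, and it assumes the first occurrence of each line is the one to keep.

```python
def remove_duplicate_lines(text: str) -> str:
    """Keep the first occurrence of every line and drop later repeats."""
    seen = set()
    kept = []
    for line in text.splitlines():
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return "\n".join(kept)


# "abcde" appears on lines 1 and 3; only the first copy survives.
print(remove_duplicate_lines("abcde\nfghij\nabcde"))
```

Because every line is compared as a whole, there is no ambiguity about where a “sentence” starts or ends, which is what makes this case easy.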

There’s also the matter of “filter out” – there are two ways to think of it: if there are two matching “sentences”, then “filter out” (delete) the second; or if there are two matching “sentences”, then “filter out” (delete) the first. Both of those variants involve difficulties of their own.

In general, the regex solution will involve the concepts of capture groups and backreferences, and either a lookahead assertion or (because true lookbehinds cannot be variable width) using the \K control flow to mimic a variable-width lookbehind.

To do the delete-the-first, you would use a lookahead. So it would end up being something like

FIND = (.*)(?=.*?\1)
REPLACE = <leave empty>
SEARCH MODE = Regular Expression
- the (.*) is the capture group, capturing just about anything of any length
- the (?=...) is the lookahead, which means the characters have to match, but the replacement will not affect anything in that lookahead
- the .*? inside the lookahead means “match anything in between”
- the \1 inside the lookahead means “match exactly the same text as was already matched in capture group 1”
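
To watch the lookahead version act on the example from the question, here is a small sketch that uses Python’s re module rather than Notepad++. That swap is an assumption worth flagging: Notepad++ uses a different regex engine (Boost), so results on larger inputs may not be identical, and delete_first is just an illustrative name.

```python
import re

# Same pattern as the FIND field above: capture something, then require
# the same text to show up again later on the line.
DELETE_FIRST = re.compile(r"(.*)(?=.*?\1)")


def delete_first(text: str) -> str:
    """Replace every match with nothing, like REPLACE ALL with an empty field."""
    return DELETE_FIRST.sub("", text)


print(delete_first("abcde,abcde"))  # prints ",abcde"
```

The first abcde is removed and the comma stays, because the comma itself never repeats.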

To do the delete-the-second, you would want a lookbehind; but since the number of characters could be any length, you have to use \K to “reset” instead:

FIND = (.*).*?\K\1
REPLACE = <leave empty>
SEARCH MODE = Regular Expression
- Must use REPLACE ALL; a series of single REPLACE actions will not work, because the regex uses \K
- you might have to run it multiple times, because a pass may already have advanced beyond the “first” of some matching pair when it made the previous replacement.
- the \K says “anything before the \K must match, but only stuff after the \K will be replaced by the REPLACE WITH text”
- the other concepts are the same
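
For comparison, the delete-the-second idea can be approximated in Python as well, with one caveat: Python’s re module has no \K. The sketch below is therefore a substitute technique, not the same regex: it captures the “in between” text in a second group and writes groups 1 and 2 back, so only the trailing repeat is dropped. The name delete_second is made up for illustration.

```python
import re

# Python's re has no \K, so group 2 captures the "in between" text and the
# replacement writes groups 1 and 2 back; only the trailing \1 is dropped.
DELETE_SECOND = re.compile(r"(.*)(.*?)\1")


def delete_second(text: str) -> str:
    """One pass, comparable to a single REPLACE ALL."""
    return DELETE_SECOND.sub(r"\1\2", text)


print(delete_second("abcde,abcde"))  # prints "abcde,"
```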

However, neither of those will actually work for you, because if your file was just the word tenet, either logic would see that the t is in your document twice, and delete one of them; then it would see that e was in the document twice, and delete one of those, too. So with the delete-the-first solution, you’d end up with net; if you used the delete-the-second solution, the first time you REPLACE ALL, it would delete the t at the end, to produce tene, and then running it a second time, it would delete the e at the end, to produce ten. Even worse, the document “tenet is a funny word, but it is a word.” would end up as “ten isafuyword,b.” using the delete-the-second, and “efny,butisa word.” using the delete-the-first.
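
The tenet case is easy to reproduce. The sketch below again assumes Python’s re module in place of Notepad++’s engine, and emulates \K by putting the captured text back, so treat it as an approximation of the two regexes rather than the exact Notepad++ behavior; both function names are illustrative.

```python
import re


def delete_first(text: str) -> str:
    # Lookahead version: remove text that occurs again later.
    return re.sub(r"(.*)(?=.*?\1)", "", text)


def delete_second(text: str) -> str:
    # No \K in Python's re: keep groups 1 and 2, drop the repeated \1.
    return re.sub(r"(.*)(.*?)\1", r"\1\2", text)


print(delete_first("tenet"))   # prints "net"
once = delete_second("tenet")
print(once)                    # prints "tene"
print(delete_second(once))     # prints "ten"
```

Nothing in either pattern says what a “sentence” is, so single letters count as repeated text just as much as whole phrases do.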

If that’s not the weird results you want, then you’d have to make some sort of real regex definition for a “sentence”. And that’s more difficult than you might think. And I am quite certain that, for any definition or regex that you think would reliably match a sentence, I could come up with an exception that your definition/regex doesn’t cover. But if, by some miracle, you could come up with a definition for a sentence or phrase that works for you in all edge cases, you would put it inside the first parentheses in one of the two regexes I showed above.

To give more examples of why you are likely to not find what you want:

Using the definition of “a ‘sentence or phrase’ is any whole word (set of word characters between word boundaries) followed by anything else”, then
- delete-the-first: FIND = (\b\w+\b.*)(?=.*?\1) , which would transform “tenet is a funny word, but it is a word.” into “tenet funny , but it is a word.”
- delete-the-second: FIND = (\b\w+\b.*).*?\K\1, which would result in “tenet is a funny word, but it .”
Using the definition of “a ‘sentence or phrase’ is two or more whole words (set of word characters between word boundaries) separated by only spaces”:
- delete-the-first: FIND = (\b\w+\b(?:\h+\b\w+\b)+)(?=.*?\1) , which would transform “tenet is a funny word, but it is a word.” into “tenet funny word, but it is a word.”
- delete-the-second: FIND = (\b\w+\b(?:\h+\b\w+\b)+).*?\K\1, which would result in “tenet is a funny word, but it word.”
- this deleted just the phrase “is a”, which meets your generic phrasing of “sentence or phrase”.
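
As a rough check of the “two or more whole words” definition, here is a Python approximation of the delete-the-second variant. Two substitutions are assumptions on my part: Python’s re supports neither \h nor \K, so [ \t] stands in for \h, and a second capture group plus a \1\2 replacement stands in for \K.

```python
import re

# Group 1: two or more whole words separated only by spaces or tabs.
# [ \t] replaces \h, and group 2 with the \1\2 replacement replaces \K,
# because Python's re supports neither.
PHRASE_TWICE = re.compile(r"(\b\w+\b(?:[ \t]+\b\w+\b)+)(.*?)\1")

text = "tenet is a funny word, but it is a word."
result = PHRASE_TWICE.sub(r"\1\2", text)
print(result)
```

In this sketch only the second “is a” is removed; the spaces that surrounded it both remain, so the output has two consecutive spaces between “it” and “word.”.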

I’m not sure anything much more complicated would truly get better, and I am sure you’d run across weird exceptions the more complicated you got.