Help how to find duplicate sentence
-
How can I find duplicate sentences? That is, how do I filter out a repeated sentence or phrase in a file opened in Notepad? For example:
abcde,abcde
-
You are way too vague for us to give a complete answer. You say “sentence”, but then use an example of the single word `abcde` repeated. Since we have no idea what characters are allowed in a sentence for your data, we would have to make our answer super-generic. And because we don’t know whether they are always separated by just one character (like your example comma), or whether there could be dozens of other “sentences” in between, we’d have to allow a generic number of characters between them. The problem with a super-generic regex that allows lots of variation in the number of characters in between is that it is exceedingly inefficient, and if there are too many characters in your file, the regex engine can decide it’s out of memory/resources for its capture-and-backtrack algorithm, and stop before finding matches.

If each of your “sentences” were a separate line of text, it would be much easier; in fact, Notepad++ already has “Remove duplicate lines” and “Remove consecutive duplicate lines” built into the Edit > Line Operations menu.
There’s also the matter of “filter out” – there are two ways to think of it: if there are two matching “sentences”, then “filter out” (delete) the second; or if there are two matching “sentences”, then “filter out” (delete) the first. Both of those variants involve difficulties of their own.
In general, the regex solution will involve the concepts of capture groups and backreferences, and either a lookahead assertion or (because true lookbehinds cannot be variable width) using the `\K` control flow to mimic a variable-width lookbehind.

To do the delete-the-first, you would use a lookahead. So it would end up being something like:
- FIND = `(.*)(?=.*?\1)`
- REPLACE = <leave empty>
- SEARCH MODE = Regular Expression

How it works:

- the `(.*)` is the capture group, capturing just about anything of any length
- the `(?=...)` is the lookahead, which means the characters have to match, but the replacement will not affect anything in that lookahead
- the `.*?` inside the lookahead means “match anything in between”
- the `\1` inside the lookahead means “match exactly the same text as was already matched in capture group 1”
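As a rough way to try the delete-the-first pattern outside the editor, here is a minimal Python sketch. It assumes Python's built-in `re` module behaves like Notepad++'s regex engine for this particular pattern, which seems to hold for this small input but is not guaranteed in general. The helper name `delete_first` is made up for illustration.

```python
import re

# Same pattern as the FIND box; Python's `re` stands in for Notepad++'s engine.
DELETE_FIRST = re.compile(r"(.*)(?=.*?\1)")

def delete_first(text: str) -> str:
    """Remove every run of text that occurs again later in the same line."""
    return DELETE_FIRST.sub("", text)

print(delete_first("abcde,abcde"))  # prints ",abcde"
```

On the question's own example, the first `abcde` is removed and the separator is left behind, so the comma would still need cleaning up separately.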
To do the delete-the-second, you would want a lookbehind; but since the number of characters could be any length, you have to use `\K` to “reset” instead:

- FIND = `(.*).*?\K\1`
- REPLACE = <leave empty>
- SEARCH MODE = Regular Expression
- Must use REPLACE ALL; cannot do a whole bunch of single REPLACE, because `\K` does not work with single-step REPLACE
  - you might have to run it multiple times, because it may have advanced beyond the “first” of some matching pair when it did the previous replacement.
- the `\K` says “anything before the `\K` must match, but only stuff after the `\K` will be replaced by the REPLACE WITH text”
- the other concepts are the same
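Python's built-in `re` module has no `\K`, so the sketch below only imitates delete-the-second: it captures the in-between text in a second group and writes both groups back, which drops the repeated copy. This is an approximation under that assumption, not what Notepad++ does internally, and the helper names `delete_second_pass` and `delete_second` are hypothetical. The loop stands in for pressing REPLACE ALL repeatedly until nothing changes.

```python
import re

# No \K in Python's `re`: keep group 1 and the gap (group 2), drop the repeat.
DELETE_SECOND = re.compile(r"(.*)(.*?)\1")

def delete_second_pass(text: str) -> str:
    """One REPLACE ALL pass: remove the later copy of each repeated run."""
    return DELETE_SECOND.sub(r"\1\2", text)

def delete_second(text: str, max_passes: int = 20) -> str:
    """Repeat passes until the text stops changing (or max_passes is hit)."""
    for _ in range(max_passes):
        updated = delete_second_pass(text)
        if updated == text:
            break
        text = updated
    return text

print(delete_second_pass("abcde,abcde"))  # prints "abcde,"
```

The pattern backtracks heavily, so this is only meant for short test strings, which mirrors the efficiency warning above.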
However, neither of those will actually work for you, because if your file was just the word `tenet`, either logic would see that the `t` is in your document twice, and delete one of them; then it would see that the `e` was in the document twice, and delete one of those, too. So with the delete-the-first solution, you’d end up with `net`; if you used the delete-the-second solution, the first time you REPLACE ALL, it would delete the `t` at the end, to produce `tene`, and then running it a second time, it would delete the `e` at the end, to produce `ten`. Even worse, the document `tenet is a funny word, but it is a word.` would end up with `ten isafuyword,b.` using the delete-the-second, and `efny,butisa word.` using the first.

If those weird results are not what you want, then you’d have to make some sort of real regex definition for a “sentence”. And that’s more difficult than you might think. I am quite certain that, for any definition or regex you think would reliably match a sentence, I could come up with an exception that your definition/regex doesn’t cover. But if by some miracle you could come up with a definition for a sentence or phrase that works for you in all edge cases, you would put it inside the first parentheses in one of the two regexes I showed above.
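The `tenet` behaviour described above can be reproduced with a short Python sketch. It assumes Python's built-in `re` module is a fair stand-in for Notepad++'s engine on this input, and because `re` has no `\K`, the delete-the-second pattern is imitated by capturing the in-between text and writing it back.

```python
import re

DELETE_FIRST = re.compile(r"(.*)(?=.*?\1)")
# Python's `re` lacks \K, so group 2 keeps the in-between text instead.
DELETE_SECOND = re.compile(r"(.*)(.*?)\1")

word = "tenet"

# delete-the-first: the leading "t" and "e" both recur later, so both go.
first_result = DELETE_FIRST.sub("", word)

# delete-the-second: one simulated REPLACE ALL pass at a time.
pass_one = DELETE_SECOND.sub(r"\1\2", word)
pass_two = DELETE_SECOND.sub(r"\1\2", pass_one)

print(first_result, pass_one, pass_two)  # prints "net tene ten"
```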
To give more examples of why you are likely to not find what you want:

- Using the definition of “a ‘sentence or phrase’ is any whole word (set of word characters between word boundaries) followed by anything else”, then
  - delete-the-first: FIND = `(\b\w+\b.*)(?=.*?\1)`, which would transform `tenet is a funny word, but it is a word.` into `tenet funny , but it is a word.`
  - delete-the-second: FIND = `(\b\w+\b.*).*?\K\1`, which would result in `tenet is a funny word, but it .`
- Using the definition of “a ‘sentence or phrase’ is two or more whole words (set of word characters between word boundaries) separated by only spaces”:
  - delete-the-first: FIND = `(\b\w+\b(?:\h+\b\w+\b)+)(?=.*?\1)`, which would transform `tenet is a funny word, but it is a word.` into `tenet  funny word, but it is a word.`
  - delete-the-second: FIND = `(\b\w+\b(?:\h+\b\w+\b)+).*?\K\1`, which would result in `tenet is a funny word, but it  word.`
    - this deleted just the phrase `is a`, which meets your generic phrasing of “sentence or phrase”.
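The two delete-the-first variants can be checked with a minimal Python sketch. It assumes Python's built-in `re` module matches the way Notepad++'s engine does for these patterns, and it swaps `\h` for `[ \t]` because `re` does not support `\h`. The constant names are made up for illustration.

```python
import re

text = "tenet is a funny word, but it is a word."

# Definition 1: a whole word followed by anything else.
WORD_THEN_ANYTHING = re.compile(r"(\b\w+\b.*)(?=.*?\1)")

# Definition 2: two or more whole words separated only by spaces or tabs.
# [ \t] replaces \h, which Python's `re` does not support.
TWO_OR_MORE_WORDS = re.compile(r"(\b\w+\b(?:[ \t]+\b\w+\b)+)(?=.*?\1)")

print(WORD_THEN_ANYTHING.sub("", text))  # prints "tenet funny , but it is a word."
print(TWO_OR_MORE_WORDS.sub("", text))   # prints "tenet  funny word, but it is a word."
```

The second result keeps two spaces where `is a` was removed, because the spaces on either side are not part of the match.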
I’m not sure anything much more complicated would truly get better, and I am sure you’d run across weird exceptions the more complicated you got.