Regex: Find Consecutive Duplicate Words (words that are repeated) in a particular tag
-
Hello. In the example below, a word is repeated: `our our`
<p class="bebe">My husband and I are transforming our our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
So, I need to find all tags such as `<p class="bebe">` that contain 2 or 3 repeated words. I found a regex for repeated words on Google, `\b(\w+)\b\s+\1\b`, and I want to integrate it into the tag `<p class="bebe"><\/p>`, like this:
<p class="bebe">(\b(\w+)\b\s+\1\b)<\/p>
It is not working. Can anyone help me?
-
@Vasile-Caraus said:
<p class="bebe">(\b(\w+)\b\s+\1\b)<\/p>
There are two issues that I can see with that regex.
- It doesn’t allow for any text between the `>` and the repeated words, or between the repeated words and the `<`. I would recommend using `.*?` just after the `>` and just before the `<` to match anything (but as little as possible) in those regions.
- The outer level of parentheses in `(\b(\w+)\b\s+\1\b)` changes the numbering of the groups relative to the regex `\b(\w+)\b\s+\1\b` you found online. You don’t need the outer parentheses, so just get rid of them.
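The group-numbering point can be checked outside Notepad++. The sketch below uses Python's built-in `re` module, which is my assumption and not something the thread uses; the thread is about Notepad++'s own regex engine, and the two do not behave identically. Python refuses to compile a backreference to a group that is still open, where another engine might simply fail to match.

```python
import re

text = "transforming our our long-neglected house"

# Regex as found online: group 1 is the word, and \1 refers back to it.
plain = re.compile(r"\b(\w+)\b\s+\1\b")
print(plain.search(text).group(0))  # our our

# Wrapped in an extra pair of parentheses, the word becomes group 2,
# and \1 now points at the outer group, which is still open at that point.
try:
    re.compile(r"(\b(\w+)\b\s+\1\b)")
    wrapped_error = None
except re.error as exc:
    wrapped_error = str(exc)
print(wrapped_error)

# If the outer parentheses were really wanted, the backreference
# would have to be renumbered to \2.
renumbered = re.compile(r"(\b(\w+)\b\s+\2\b)")
print(renumbered.search(text).group(2))  # our
```

Dropping the outer parentheses, as suggested above, avoids the renumbering altogether.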
<p class="bebe">.*?\b(\w+)\b\s+\1\b.*?<\/p>
might work for you (i.e., it matches that paragraph for me).

If your HTML paragraphs are multiline (have CR, LF, or CRLF newline characters in them), then either check the `☑ . matches newline` box, or use `(?s)` at the beginning of the regex.

However, I’m willing to bet another five minutes of my time now that “find all tags” wasn’t really your end goal. My guess is that your end goal is to find and then fix all those matches. In which case, I might do it using `\K` (in place of a lookbehind) and a lookahead, with something like:
- FIND = `<p class="bebe">.*?\b(\w+)\b\K\s+\1\b(?=.*?<\/p>)`
- REPLACE = `` (empty)
(the same multiline advice applies)
Here, it finds everything up to the first word, but the `\K` makes it not part of the “match”. Then it finds one or more whitespace characters, followed by the repeat of the word. Then it looks ahead to make sure there’s an end-of-HTML-paragraph tag somewhere in the future (but doesn’t keep the rest of the paragraph in the match). It then replaces the space and the second instance of the word with emptiness (i.e., deletes it).

This would take
<p class="bebe">My husband and I are transforming our our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="baba">My husband and I are transforming our our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
and convert it to
<p class="bebe">My husband and I are transforming our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="baba">My husband and I are transforming our our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
(it keeps the `our our` in the second paragraph because it’s `class="baba"`, not `class="bebe"`)

-----
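For readers who want to try the same find-and-replace in a script, here is a minimal sketch in Python's built-in `re` module. It is an adaptation and not the regex above: Python's `re` has no `\K`, so the text before the duplicate is captured and written back by the replacement. The shortened sample sentences are only for illustration.

```python
import re

html = (
    '<p class="bebe">My husband and I are transforming our our long-neglected house.</p>\n'
    '<p class="baba">My husband and I are transforming our our long-neglected house.</p>\n'
)

# Python's re has no \K, so the part that \K would discard is captured as
# group 1 and put back by the replacement. The word itself becomes group 2.
pattern = re.compile(r'(<p class="bebe">.*?\b(\w+)\b)\s+\2\b(?=.*?</p>)')

fixed = pattern.sub(r"\1", html)
print(fixed)
```

Because every match has to start at the opening tag, one pass of this sketch removes at most one duplicate per paragraph; running it again would pick up a second one.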
FYI: I often add this to my response in regex threads, unless I am sure the original poster has seen it before. Here is some helpful information for finding out more about regular expressions, and for formatting posts in this forum (especially quoting data) so that we can fully understand what you’re trying to ask:

This forum is formatted using Markdown, with a help link buried on the little grey `?` in the COMPOSE window/pane when writing your post. For more about how to use Markdown in this forum, please see @Scott-Sumner’s post in the “how to markdown code on this forum” topic, and my updates near the end. It is very important that you use these formatting tips (using single backtick marks around small snippets, and using code-quoting for pasting multiple lines from your example data files) because otherwise, the forum will change normal quotes (`""`) to curly “smart” quotes (`“”`), will change hyphens to dashes, and will sometimes hide asterisks (or if your text is `c:\folder\*.txt`, it will show up as `c:\folder*.txt`, missing the backslash). If you want to clearly communicate your text data to us, you need to properly format it.

If you have further search-and-replace (“matching”, “marking”, “bookmarking”, regular expression, “regex”) needs, study this FAQ and the documentation it points to. Before asking a new regex question, understand that for future requests, many of us will expect you to show what data you have (exactly), what data you want (exactly), what regex you already tried (to show that you’re showing effort), why you thought that regex would work (to prove it wasn’t just something randomly typed), and what data you’re getting with an explanation of why that result is wrong. When you show that effort, you’ll see us bend over backward to get things working for you. If you need help formatting, see the paragraph above.
Please note that for all regex and related queries, it is best if you are explicit about what needs to match, and what shouldn’t match, and have multiple examples of both in your example dataset. Often, what shouldn’t match helps define the regular expression as much or more than what should match.
-
nice answer, thank you !
-
@PeterJones said:
<p class="bebe">My husband and I are transforming our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="baba">My husband and I are transforming our our long-neglected, creaky old Victorian house into our favorite place on earth.</p>

Now I see there may be another case: DIACRITICS. Suppose I have 2 different words starting with the same letters, `our` and `ourş`:
<p class="bebe">My husband and I are transforming our ourş long-neglected, creaky old Victorian house into our favorite place on earth.</p>
Also, your regex is great, but in this case I need it not to find the words with symbols, only those that are strictly the same.
-
Got it. Some `\s+` should be added:
FIND = `<p class="bebe">.*?\b\s+(\w+)\b\K\s+\1\s+\b(?=.*?<\/p>)`
REPLACE = `` (empty)
-
@Vasile-Caraus said:
got it. Some \s+ should be add
Makes sense. The `\b` in the regex looks for a word boundary, which is usually the boundary between an alphanumeric and a space, but might also be the boundary between an alphanumeric and the `&` which starts the HTML entity. By requiring one or more spaces as well, you have told it that you want more than just a boundary, but a space-defined boundary.

Glad you were able to work it out.
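The difference between a plain `\b` and a space-defined boundary can be seen in a small Python sketch. Two things here are my assumptions: the second word is written as `our` plus the numeric entity `&#351;`, and a lookahead `(?=\s|$)` stands in for the extra `\s+`, since Python's `re` does not support `\K`.

```python
import re

# Assumption: the second word is "our" followed by an HTML entity.
text = "transforming our our&#351; long-neglected house"

# \b is satisfied between "r" and "&", so the pair is reported as a duplicate.
boundary_only = re.search(r"\b(\w+)\b\s+\1\b", text)
print(boundary_only.group(0))  # our our

# Requiring whitespace (or the end of the text) after the repeat rejects it.
space_delimited = re.search(r"\b(\w+)\s+\1(?=\s|$)", text)
print(space_delimited)  # None

# A genuine duplicate is still found.
genuine = re.search(r"\b(\w+)\s+\1(?=\s|$)", "our our house")
print(genuine.group(0))  # our our

# With the literal character instead of the entity, Python 3 treats "ş" as a
# word character, so even the plain \b version finds no boundary after "our".
literal = re.search(r"\b(\w+)\b\s+\1\b", "our ourş long")
print(literal)  # None
```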
-
Hello, @vasile-caraus, @peterjones and All,
Here is my attempt, which is able to match and delete all duplicate words, one at a time, in any line `<p class="bebe">..........</p>`
SEARCH
(?-s)(?:<p class="bebe">|\G).*?\h+((&\#\d+;|[\w'-])+)\h\K\h*\1[\h,;.]+(?=.*?</p>)
REPLACE
Leave EMPTY
If you prefer to use the free-spacing mode and in-line comments, here is the search regex:
(?x)                           # FREE-SPACING mode
(?-s)                          # The DOT represents a single STANDARD character
(?:<p[ ]class="bebe">|\G )     # The string <p class="bebe"> or the CURRENT position, in a NON-CAPTURING group
.*?                            # The SMALLEST range of STANDARD characters, ONLY
\h+                            # A NON null range of HORIZONTAL BLANK characters
( (&\#\d+;|[\w'-])+ )          # The string &#, followed with digit(s) + a SEMICOLON or a WORD character or a SINGLE QUOTE or a DASH
                               # possibly REPEATED, so a WORD stored as GROUP 1
\h                             # One HORIZONTAL BLANK character
\K                             # Everything ALREADY matched is DISCARDED
\h*                            # A range, possibly NULL of HORIZONTAL BLANK character(s)
\1                             # The DUPLICATE word
[\h,;.]+                       # Any NON null range of HORIZONTAL BLANK character or a COMMA or a SEMICOLON or a DOT, possibly REPEATED
(?=.*?</p>)                    # ONLY IF followed with the SMALLEST range of STANDARD characters + the STRING </p>) in the CURRENT line
Just test these two equivalent versions against the sample text below:
<p class="bebe">My husband and I are transforming our our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming our ourş long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming ourş our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming ourş ourş long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming ourş ourő long-neglected, creaky old Victorian house into our favorite place on earth.</p>
---
<p class="bebe">My husband and I are transforming our şour long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming şour our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming şour şour long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming şour őour long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming şour ourő long-neglected, creaky old Victorian house into our favorite place on earth.</p>
---
<p class="bebe">My husband and I are transforming our ourşour long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming ourşour our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming ourş ourşour long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming ourşour ourş long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming şour ourşour long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming ourşour şour long-neglected, creaky old Victorian house into our favorite place on earth.</p>
---
<p class="bebe">My husband and I are transforming ourşour ourşour long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming ourşour ourőour long-neglected, creaky old Victorian house into our favorite place on earth.</p>
---
<p class="bebe">My husband and I are transforming our long-neglected long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming our long-neglected long-neglected; creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming our long-neglected long-neglected. creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming our fisherman's fisherman's hut.</p>
---
<p class="bebe">My husband and I are transforming our our long-neglected, creaky old old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming our our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
Logically, only paragraphs `1`, `4`, `8`, `17`, and `19` to `24` (not counting the `---` separators) match!

So, after clicking the `Replace All` button only, it should give the following text:
<p class="bebe">My husband and I are transforming our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming our ourş long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming ourş our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming ourş long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming ourş ourő long-neglected, creaky old Victorian house into our favorite place on earth.</p>
---
<p class="bebe">My husband and I are transforming our şour long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming şour our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming şour long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming şour őour long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming şour ourő long-neglected, creaky old Victorian house into our favorite place on earth.</p>
---
<p class="bebe">My husband and I are transforming our ourşour long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming ourşour our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming ourş ourşour long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming ourşour ourş long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming şour ourşour long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming ourşour şour long-neglected, creaky old Victorian house into our favorite place on earth.</p>
---
<p class="bebe">My husband and I are transforming ourşour long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming ourşour ourőour long-neglected, creaky old Victorian house into our favorite place on earth.</p>
---
<p class="bebe">My husband and I are transforming our long-neglected creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming our long-neglected creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming our long-neglected creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming our fisherman's hut.</p>
---
<p class="bebe">My husband and I are transforming our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
<p class="bebe">My husband and I are transforming our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
As you noticed, in paragraph `23`, the two duplicate words `our` and `old` are both found and deleted ;-))

Cheers,
guy038
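For anyone who needs the same clean-up in a script instead of in Notepad++, here is a rough Python sketch. It is not a translation of the search regex above: Python's built-in `re` module lacks `\G`, `\K`, and `\h`, so the sketch swaps in a two-step approach (find each single-line `<p class="bebe">` paragraph, then remove duplicates inside it). The helper names `dedupe_paragraph` and `dedupe_bebe` are hypothetical, `[ \t]` is only an approximation of `\h`, and the sample lines are shortened versions of the test text.

```python
import re

# One "word": word characters, apostrophes, hyphens, or numeric entities like &#351;
WORD = r"(?:&#\d+;|[\w'-])+"

# A word preceded by a blank, then blanks, the same word again, and trailing
# blanks or , ; . punctuation. [ \t] is a rough stand-in for \h.
DUPLICATE = re.compile(r"(?<=[ \t])(" + WORD + r")[ \t]+\1[ \t,;.]+")

# Single-line <p class="bebe">...</p> paragraphs only (the dot does not cross newlines).
PARAGRAPH = re.compile(r'(<p class="bebe">)(.*?)(</p>)')


def dedupe_paragraph(match):
    """Remove consecutive duplicate words inside one matched paragraph."""
    inner = match.group(2)
    previous = None
    while previous != inner:  # repeat so that "our our our" also collapses
        previous = inner
        inner = DUPLICATE.sub(r"\1 ", inner)
    return match.group(1) + inner + match.group(3)


def dedupe_bebe(text):
    """Hypothetical helper: clean every bebe paragraph, leave everything else alone."""
    return PARAGRAPH.sub(dedupe_paragraph, text)


sample = "\n".join([
    '<p class="bebe">We are transforming our our long-neglected, creaky old old Victorian house.</p>',
    '<p class="bebe">We are transforming our our&#351; long-neglected house.</p>',
    '<p class="bebe">We are transforming our fisherman\'s fisherman\'s hut.</p>',
    '<p class="baba">We are transforming our our house.</p>',
])

cleaned = dedupe_bebe(sample).splitlines()
for line in cleaned:
    print(line)
```

As in the regex above, the duplicate must be followed by a blank, comma, semicolon, or dot, and that trailing punctuation is removed along with the repeated word.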