Regex: Find Consecutive Duplicate Words (words that are repeated) in a particular tag



  • Hello. In the example below, the word our is repeated (our our):

    <p class="bebe">My husband and I are transforming our our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    

    So, I need to find all tags such as <p class="bebe"> that contain 2 or 3 repeated words.

    I found on Google a regex for repeated words, \b(\w+)\b\s+\1\b, and I want to integrate it into the tag <p class="bebe"><\/p>, such as:

    <p class="bebe">(\b(\w+)\b\s+\1\b)<\/p>

    It is not working. Can anyone help me?



  • @Vasile-Caraus said:

    <p class="bebe">(\b(\w+)\b\s+\1\b)<\/p>

    There are two issues that I can see with that regex.

    1. It doesn’t allow for any text between the > and the repeated words, or between the repeated words and the <. I would recommend using .*? just after the > and just before the < to match anything (but as little as possible) in those regions.
    2. The outer level of parentheses in (\b(\w+)\b\s+\1\b) changes the numbering of the groups relative to the regex \b(\w+)\b\s+\1\b you found online. You don’t need the outer parentheses, so just get rid of them.

    <p class="bebe">.*?\b(\w+)\b\s+\1\b.*?<\/p> might work for you. (ie, it matches that paragraph for me)

    If your HTML paragraphs are multiline (have CR or LF or CRLF newline characters in them), then either check the ☑ . matches newline box, or use (?s) at the beginning of the regex.
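
For anyone who wants to try the two points above outside Notepad++, here is a minimal sketch using Python's standard `re` module. It is an assumption that Python behaves the same as Notepad++ for these simple constructs; the variable names are made up for illustration, and `re.DOTALL` stands in for the "`.` matches newline" checkbox.

```python
import re

html = ('<p class="bebe">My husband and I are transforming our our '
        'long-neglected, creaky old Victorian house into our favorite '
        'place on earth.</p>')

# No room for other text between the tags and the repeated word: no match.
tight = re.compile(r'<p class="bebe">\b(\w+)\b\s+\1\b</p>')

# Lazy .*? on both sides, and no outer parentheses, so \1 is the word itself.
fixed = re.compile(r'<p class="bebe">.*?\b(\w+)\b\s+\1\b.*?</p>')

print(tight.search(html))            # None
print(fixed.search(html).group(1))   # our

# A paragraph that contains a newline needs "dot matches newline".
multiline = html.replace(' long-neglected', '\nlong-neglected')
fixed_dotall = re.compile(fixed.pattern, re.DOTALL)

print(fixed.search(multiline))                      # None
print(fixed_dotall.search(multiline) is not None)   # True
```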

    However, I’m willing to bet another five minutes of my time now that “find all tags” wasn’t really your end goal. My guess is that your end goal is to find and then fix all those matches. In which case, I might do it using \K and a lookahead, with something like:

    • FIND = <p class="bebe">.*?\b(\w+)\b\K\s+\1\b(?=.*?<\/p>)
    • REPLACE = `` (empty)

    (the same multiline advice applies)

    Here, it matches everything up to and including the first instance of the word, but the \K makes that not part of the “match”. Then it finds one or more whitespace characters, followed by the repeat of the word. Then it looks ahead to make sure there’s an end-of-HTML-paragraph tag somewhere later (but doesn’t keep the rest of the paragraph in the match). It then replaces the whitespace and the second instance of the word with emptiness (ie, deletes them).

    This would take

    <p class="bebe">My husband and I are transforming our our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    <p class="baba">My husband and I are transforming our our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    

    and convert it to

    <p class="bebe">My husband and I are transforming our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    <p class="baba">My husband and I are transforming our our long-neglected, creaky old Victorian house into our favorite place on earth.</p>

    (it keeps the our our in the second paragraph because it’s class="baba", not class="bebe")
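
The same find-and-fix idea can be sketched in Python. Python's standard `re` module has no `\K`, so this sketch makes an assumption: a capture group holds everything that `\K` would have discarded, and the replacement writes that group back. The sample text is the two paragraphs above; the variable names are only illustrative.

```python
import re

text = (
    '<p class="bebe">My husband and I are transforming our our long-neglected, '
    'creaky old Victorian house into our favorite place on earth.</p>\n'
    '<p class="baba">My husband and I are transforming our our long-neglected, '
    'creaky old Victorian house into our favorite place on earth.</p>\n'
)

# Group 1 plays the role of the text before \K: it is matched, then put back.
# Group 2 is the word; the \s+\2 part is what actually gets deleted.
pattern = re.compile(r'(<p class="bebe">.*?\b(\w+)\b)\s+\2\b(?=.*?</p>)')

result = pattern.sub(r'\1', text)
print(result)
```

In this sketch each match has to start at the opening tag, so one pass removes only the first duplicate in each paragraph. If a paragraph can hold several, rerunning the substitution until nothing changes is one way to handle it.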

    -----
    FYI: I often add this to my response in regex threads, unless I am sure the original poster has seen it before. Here is some helpful information for finding out more about regular expressions, and for formatting posts in this forum (especially quoting data) so that we can fully understand what you’re trying to ask:

    This forum is formatted using Markdown, with a help link buried on the little grey ? in the COMPOSE window/pane when writing your post. For more about how to use Markdown in this forum, please see @Scott-Sumner’s post in the “how to markdown code on this forum” topic, and my updates near the end. It is very important that you use these formatting tips – using single backtick marks around small snippets, and using code-quoting for pasting multiple lines from your example data files – because otherwise, the forum will change normal quotes ("") to curly “smart” quotes (“”), will change hyphens to dashes, will sometimes hide asterisks (or if your text is c:\folder\*.txt, it will show up as c:\folder*.txt, missing the backslash). If you want to clearly communicate your text data to us, you need to properly format it.

    If you have further search-and-replace (“matching”, “marking”, “bookmarking”, regular expression, “regex”) needs, study this FAQ and the documentation it points to. Before asking a new regex question, understand that for future requests, many of us will expect you to show what data you have (exactly), what data you want (exactly), what regex you already tried (to show that you’re showing effort), why you thought that regex would work (to prove it wasn’t just something randomly typed), and what data you’re getting with an explanation of why that result is wrong. When you show that effort, you’ll see us bend over backward to get things working for you. If you need help formatting, see the paragraph above.

    Please note that for all regex and related queries, it is best if you are explicit about what needs to match, and what shouldn’t match, and have multiple examples of both in your example dataset. Often, what shouldn’t match helps define the regular expression as much or more than what should match.



  • nice answer, thank you !



  • @PeterJones said:

    <p class="bebe">My husband and I are transforming our our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    <p class="baba">My husband and I are transforming our our long-neglected, creaky old Victorian house into our favorite place on earth.</p>

    Now I see there may be another case: DIACRITICS. Suppose I have 2 different words that start with the same letters: our and our&#351;

    <p class="bebe">My husband and I are transforming our our&#351; long-neglected, creaky old Victorian house into our favorite place on earth.</p>

    Your regex is great, but in this case I need it to skip words followed by such symbols and to find only words that are strictly the same.



  • Got it. Some \s+ should be added:

    FIND = <p class="bebe">.*?\b\s+(\w+)\b\K\s+\1\s+\b(?=.*?<\/p>)
    REPLACE = `` (empty)



  • @Vasile-Caraus said:

    Got it. Some \s+ should be added

    Makes sense. The \b in the regex looks for a word boundary, which is usually the boundary between an alphanumeric and a space, but might also be the boundary between an alphanumeric and the & which starts the HTML entity. By requiring one or more spaces as well, you have told it that you want more than just a boundary: you want a space-defined boundary.
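
The boundary behaviour described above can be checked with a small Python sketch. Two things here are assumptions of mine, not part of the thread: the `strict` pattern uses a lookahead `(?=\s)` instead of a consumed `\s+`, and the capture-group replacement stands in for `\K`. The last two lines show why the lookahead may be worth considering: as far as I can tell, consuming the trailing whitespace and replacing with nothing also removes the blank before the next word.

```python
import re

sample = "transforming our our&#351; long-neglected"

# \b alone: a word boundary exists between "r" and "&", so "our our" matches
# even though the second word is really "our&#351;".
loose = re.compile(r"\b(\w+)\b\s+\1\b")

# Assumption: look ahead for whitespace instead of consuming it, so the blank
# before the next word is not part of what gets deleted.
strict = re.compile(r"\b(\w+)\b\s+\1(?=\s)")

print(loose.search(sample) is not None)    # True
print(strict.search(sample) is not None)   # False

# Side effect of consuming the trailing whitespace and replacing with nothing:
consuming = re.compile(r"(\b\w+\b)\s+\1\s+")
lookahead = re.compile(r"(\b\w+\b)\s+\1(?=\s)")

print(consuming.sub(r"\1", "our our long"))   # ourlong
print(lookahead.sub(r"\1", "our our long"))   # our long
```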

    Glad you were able to work it out.



  • Hello, @vasile-caraus, @peterjones and All,

    Here is my attempt, which is able to match and delete all duplicate words, one at a time, in any line <p class="bebe">..........</p>

    SEARCH (?-s)(?:<p class="bebe">|\G).*?\h+((&\#\d+;|[\w'-])+)\h\K\h*\1[\h,;.]+(?=.*?</p>)

    REPLACE Leave EMPTY

    If you prefer to use the free spacing mode and in-line comments, here is the search regex :

    (?x)                        # FREE-SPACING mode
    (?-s)                       # The DOT represents a single STANDARD character
    (?:<p[ ]class="bebe">|\G )  # The string <p class="bebe"> or the CURRENT position, in a NON-CAPTURING group
    .*?                         # The SMALLEST range of STANDARD characters, ONLY
    \h+                         # A NON null range of HORIZONTAL BLANK characters
    ( (&\#\d+;|[\w'-])+ )       # The string &#, followed with digit(s) + a SEMICOLON or a WORD character or a SINGLE QUOTE or a DASH
                                #   possibly REPEATED, so a WORD stored as GROUP 1
    \h                          # One HORIZONTAL BLANK character
    \K                          # Everything ALREADY matched is DISCARDED   
    \h*                         # A range, possibly NULL of HORIZONTAL BLANK character(s)
    \1                          # The DUPLICATE word
    [\h,;.]+                    # Any NON null range of HORIZONTAL BLANK characters or a COMMA or a SEMICOLON or a DOT, possibly REPEATED
    (?=.*?</p>)                 # ONLY IF followed with the SMALLEST range of STANDARD characters + the STRING </p> in the CURRENT line
    

    Just test these two equivalent versions against the sample text below:

    <p class="bebe">My husband and I are transforming our           our           long-neglected, creaky old Victorian house into our favorite place on earth.</p>
                                                                                    
    <p class="bebe">My husband and I are transforming our           our&#351;     long-neglected, creaky old Victorian house into our favorite place on earth.</p>
                                                                                    
    <p class="bebe">My husband and I are transforming our&#351;     our           long-neglected, creaky old Victorian house into our favorite place on earth.</p>
                                                                                    
    <p class="bebe">My husband and I are transforming our&#351;     our&#351;     long-neglected, creaky old Victorian house into our favorite place on earth.</p>
                                                                                    
    <p class="bebe">My husband and I are transforming our&#351;     our&#337;     long-neglected, creaky old Victorian house into our favorite place on earth.</p>
                                                                                    
    ---                                                                             
                                                                                    
    <p class="bebe">My husband and I are transforming  our           &#351;our     long-neglected, creaky old Victorian house into our favorite place on earth.</p>
                                                                                    
    <p class="bebe">My husband and I are transforming &#351;our     our           long-neglected, creaky old Victorian house into our favorite place on earth.</p>
                                                                                    
    <p class="bebe">My husband and I are transforming &#351;our     &#351;our     long-neglected, creaky old Victorian house into our favorite place on earth.</p>
                                                                                    
    <p class="bebe">My husband and I are transforming &#351;our     &#337;our     long-neglected, creaky old Victorian house into our favorite place on earth.</p>
                                                                                    
    <p class="bebe">My husband and I are transforming &#351;our     our&#337;     long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    
    ---
    
    <p class="bebe">My husband and I are transforming our           our&#351;our  long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    
    <p class="bebe">My husband and I are transforming our&#351;our  our           long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    
    
    <p class="bebe">My husband and I are transforming our&#351;     our&#351;our  long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    
    <p class="bebe">My husband and I are transforming our&#351;our  our&#351;     long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    
    
    <p class="bebe">My husband and I are transforming &#351;our     our&#351;our  long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    
    <p class="bebe">My husband and I are transforming our&#351;our  &#351;our     long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    
    ---
    
    <p class="bebe">My husband and I are transforming our&#351;our  our&#351;our  long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    
    <p class="bebe">My husband and I are transforming our&#351;our  our&#337;our  long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    
    ---
    
    <p class="bebe">My husband and I are transforming our long-neglected         long-neglected,  creaky old Victorian house into our favorite place on earth.</p>
    
    <p class="bebe">My husband and I are transforming our long-neglected         long-neglected;  creaky old Victorian house into our favorite place on earth.</p>
    
    <p class="bebe">My husband and I are transforming our long-neglected         long-neglected.  creaky old Victorian house into our favorite place on earth.</p>
    
    <p class="bebe">My husband and I are transforming our fisherman's            fisherman's      hut.</p>
    
    ---
    
    <p class="bebe">My husband and I are transforming our      our      long-neglected, creaky old      old   Victorian house into our favorite place on earth.</p>
    
    <p class="bebe">My husband and I are transforming our           our           long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    

    Logically, only the non-empty lines 1, 4, 8, 17, and 19 to 24 match!

    So, after one click on the Replace All button only, you should get the following text:

    <p class="bebe">My husband and I are transforming our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
                                                                                    
    <p class="bebe">My husband and I are transforming our           our&#351;     long-neglected, creaky old Victorian house into our favorite place on earth.</p>
                                                                                    
    <p class="bebe">My husband and I are transforming our&#351;     our           long-neglected, creaky old Victorian house into our favorite place on earth.</p>
                                                                                    
    <p class="bebe">My husband and I are transforming our&#351; long-neglected, creaky old Victorian house into our favorite place on earth.</p>
                                                                                    
    <p class="bebe">My husband and I are transforming our&#351;     our&#337;     long-neglected, creaky old Victorian house into our favorite place on earth.</p>
                                                                                    
    ---                                                                             
                                                                                    
    <p class="bebe">My husband and I are transforming  our           &#351;our     long-neglected, creaky old Victorian house into our favorite place on earth.</p>
                                                                                    
    <p class="bebe">My husband and I are transforming &#351;our     our           long-neglected, creaky old Victorian house into our favorite place on earth.</p>
                                                                                    
    <p class="bebe">My husband and I are transforming &#351;our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
                                                                                    
    <p class="bebe">My husband and I are transforming &#351;our     &#337;our     long-neglected, creaky old Victorian house into our favorite place on earth.</p>
                                                                                    
    <p class="bebe">My husband and I are transforming &#351;our     our&#337;     long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    
    ---
    
    <p class="bebe">My husband and I are transforming our           our&#351;our  long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    
    <p class="bebe">My husband and I are transforming our&#351;our  our           long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    
    
    <p class="bebe">My husband and I are transforming our&#351;     our&#351;our  long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    
    <p class="bebe">My husband and I are transforming our&#351;our  our&#351;     long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    
    
    <p class="bebe">My husband and I are transforming &#351;our     our&#351;our  long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    
    <p class="bebe">My husband and I are transforming our&#351;our  &#351;our     long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    
    ---
    
    <p class="bebe">My husband and I are transforming our&#351;our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    
    <p class="bebe">My husband and I are transforming our&#351;our  our&#337;our  long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    
    ---
    
    <p class="bebe">My husband and I are transforming our long-neglected creaky old Victorian house into our favorite place on earth.</p>
    
    <p class="bebe">My husband and I are transforming our long-neglected creaky old Victorian house into our favorite place on earth.</p>
    
    <p class="bebe">My husband and I are transforming our long-neglected creaky old Victorian house into our favorite place on earth.</p>
    
    <p class="bebe">My husband and I are transforming our fisherman's hut.</p>
    
    ---
    
    <p class="bebe">My husband and I are transforming our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    
    <p class="bebe">My husband and I are transforming our long-neglected, creaky old Victorian house into our favorite place on earth.</p>
    

    As you can see, in line 23, the two duplicate words our and old are both found and deleted ;-))
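
For readers who want to reproduce this outside Notepad++, here is a rough Python approximation of the search above. It is a sketch under several assumptions, not a faithful port: Python's standard `re` module has no `\h`, `\G` or `\K`, so `[ \t]` stands in for `\h`, a per-paragraph substitution stands in for `\G`, and a capture group that is written back stands in for `\K`. The names `PARA`, `DUP` and `dedupe` are made up for this example.

```python
import re

# One single-line paragraph of the wanted class: opening tag, body, closing tag.
PARA = re.compile(r'(<p class="bebe">)(.*?)(</p>)')

# Group 1 = blank(s) + word + one blank (the part before \K, written back).
# Group 2 = the word: entities like &#351;, word characters, quotes, dashes.
# The rest (extra blanks, the repeated word, trailing blanks or , ; .) is dropped.
DUP = re.compile(r"([ \t]+((?:&#\d+;|[\w'-])+)[ \t])[ \t]*\2[ \t,;.]+")

def dedupe(text):
    """Remove consecutive duplicate words inside <p class="bebe"> paragraphs."""
    def fix_paragraph(m):
        return m.group(1) + DUP.sub(r"\1", m.group(2)) + m.group(3)
    return PARA.sub(fix_paragraph, text)

line23 = ('<p class="bebe">My husband and I are transforming our      our      '
          'long-neglected, creaky old      old   Victorian house into our '
          'favorite place on earth.</p>')
print(dedupe(line23))
```

With the line 23 sample, both `our our` and `old old` are collapsed in one call, which is the behaviour the `\G` alternative gives in the original search. A word followed directly by an entity, such as `our our&#351;`, is left alone because the repeat must be followed by a blank, comma, semicolon or dot.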

    Cheers,

    guy038