Regex: Find words that are repeated across multiple lines



  • Hello. I have these lines of regex expressions, each of the form Regex_A|Regex_B (two regexes separated by |):

    (?s)((^.*)(<div class="entry-excerpt">)|(<!-- //.entry -->)(.*$))
    (?s)((^.*)(<ul class="smallThumb-mainList">)|(<div class="navigation">)(.*$))
    (?s)((^.*)(word_2)|(<!-- //.entry -->)(.*$))
    (?s)((^.*)(word_2)|(<!-- //.ambro34 -->)(.*$))
    

    I want to find all the words/regexes that are repeated before the | and all those that are repeated after the |.

    I tried a regex, but it doesn’t work very well: (?m)(.*)^(.*)\|(.*)(?=.*\1)



  • Basically, after the search and replace, I want only one instance of each of the following to remain:

    (?s)((^.*)(word_2) because it is repeated twice before the | (on lines 3 and 4)

    (<!-- //.entry -->)(.*$)) because it is repeated after the | (on lines 1 and 3)



  • Maybe a simple example will make this clearer:

    Word_1 | Word_2
    Word_3 | Word_2
    Word_4 | Word_5
    Word_4 | Word_6

    In this case, Word_4 and Word_2 are repeated, so after the search I want only those to remain.
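
    For readers who would rather script this outside Notepad++, the simple example can be sketched in Python. This is only an illustration under stated assumptions: each line holds exactly one | character, and the function name repeated_per_column is a hypothetical name, not something from this thread.

```python
from collections import Counter

def repeated_per_column(lines):
    """Return (repeated_left, repeated_right) for lines of the form 'A | B'."""
    left, right = [], []
    for line in lines:
        a, b = line.split("|", 1)  # assumption: exactly one '|' per line
        left.append(a.strip())
        right.append(b.strip())
    left_counts, right_counts = Counter(left), Counter(right)
    # Keep values seen more than once, in order of first appearance.
    rep_left = [w for w in dict.fromkeys(left) if left_counts[w] > 1]
    rep_right = [w for w in dict.fromkeys(right) if right_counts[w] > 1]
    return rep_left, rep_right

sample = ["Word_1 | Word_2", "Word_3 | Word_2", "Word_4 | Word_5", "Word_4 | Word_6"]
print(repeated_per_column(sample))  # (['Word_4'], ['Word_2'])
```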



  • @Vasile-Caraus

    As stated before here (https://notepad-plus-plus.org/community/topic/13248/regex-datetime) I think you’ve worn out everyone’s good nature (with the possible exception of @guy038) with your infinite regex questions. @MAPJe71 pointed out some good references for you to self-learn; that advice still holds. Sorry, but that’s the way I see it.



  • Hello, @Vasile-Caraus, @alan-kilborn and @MAPJe71

    First of all, @alan-kilborn and @MAPJe71, although I do understand your point of view and the advice that you give to @Vasile-Caraus, this present exercise seems interesting, all the same. You may simply think of it as a way to find, in a two-column table, any text which occurs more than once within a column!


    So @Vasile-Caraus, let’s go !

    To begin with, some statements and assumptions:

    • I’ll limit this topic to the general case of exactly two parts of text, separated by a single Vertical Line character ( Text_A|Text_B ), which, of course, covers the specific case of two regexes separated by the alternation symbol ( Regex_A|Regex_B )

    • For syntaxes such as Text_A|Text_B|Text_C, or longer, it would be more expensive!! Well, set your mind at ease, I’m joking :-))

    • Of course, these two parts of text must NOT contain the Vertical Line character ( | ) themselves!

    • I chose the Commercial At sign ( @ ) as a temporary character. If your regexes may contain this character, just choose another symbol, preferably one which is not a special regex character!

    • I’ll use the 12-line original text below:

    Text_0|Text_C
    Text_1|Text_2
    Text_4|Text_5
    Text_3|Text_2
    Text_4|Text_6
    Text_7|Text_8
    Text_9|Text_2
    Text_4|Text_5
    Text_7|Text_A
    Text_0|Text_B
    Text_2|Text_7
    Text_6|Text_7
    
    • Of course, the different NON-empty strings Text_? can be of any length!

    So :

    • Open a new tab

    • Copy/Paste the original text, above

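    • Hit the Backspace key to delete any End of Line character(s) at the end of the last line ( Line 12 ), so that no empty 13th line remains

    • Open the Replace dialog and select the Regular expression search mode

    • Then run the first regex S/R, below, with Replace All :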

    SEARCH (?=(\|))|$

    REPLACE @(?1A-:B-)@

    should produce the text :

    Text_0@A-@|Text_C@B-@
    Text_1@A-@|Text_2@B-@
    Text_4@A-@|Text_5@B-@
    Text_3@A-@|Text_2@B-@
    Text_4@A-@|Text_6@B-@
    Text_7@A-@|Text_8@B-@
    Text_9@A-@|Text_2@B-@
    Text_4@A-@|Text_5@B-@
    Text_7@A-@|Text_A@B-@
    Text_0@A-@|Text_B@B-@
    Text_2@A-@|Text_7@B-@
    Text_6@A-@|Text_7@B-@
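
    For anyone who prefers to script this step, here is a minimal Python sketch of the same idea. It is not part of the Notepad++ procedure, and the names tag_columns and marker are hypothetical. Python's re module has no conditional replacement like (?1A-:B-), so a callback plays that role here.

```python
import re

def tag_columns(text, marker="@"):
    """Add @A-@ before each '|' and @B-@ at the end of each line."""
    def repl(m):
        # Group 1 is set only when the empty match sits just before a '|'.
        if m.group(1):
            return f"{marker}A-{marker}"
        return f"{marker}B-{marker}"
    return re.sub(r"(?=(\|))|$", repl, text, flags=re.MULTILINE)

text = "Text_0|Text_C\nText_1|Text_2"
print(tag_columns(text))
# Text_0@A-@|Text_C@B-@
# Text_1@A-@|Text_2@B-@
```

    In this Python sketch, a line break at the very end of the text gives $ one extra place to match, which produces a stray @B-@ on its own. That is the kind of leftover the Backspace step is meant to avoid.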
    
    • Place the caret, on the first line, between the strings @A- and @|

    • Now, choose Edit > Column Editor…, or hit the ALT + C shortcut

    • Select the zone Number to Insert

    • Choose 1, as Initial number

    • Choose 1, in the Increase by field

    • Select the Dec format of numbers

    • Click on the OK button

    => A list of numbers, from 1 to 12, is inserted at the caret position

    • Now, move the caret, on the first line, between the strings @B- and the last @

    • Re-open the Column Editor, with the ALT + C shortcut

    • Hit the Enter key

    => The same list of numbers is inserted, before the last @, of each line :

    Text_0@A-1 @|Text_C@B-1 @
    Text_1@A-2 @|Text_2@B-2 @
    Text_4@A-3 @|Text_5@B-3 @
    Text_3@A-4 @|Text_2@B-4 @
    Text_4@A-5 @|Text_6@B-5 @
    Text_7@A-6 @|Text_8@B-6 @
    Text_9@A-7 @|Text_2@B-7 @
    Text_4@A-8 @|Text_5@B-8 @
    Text_7@A-9 @|Text_A@B-9 @
    Text_0@A-10@|Text_B@B-10@
    Text_2@A-11@|Text_7@B-11@
    Text_6@A-12@|Text_7@B-12@
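
    The numbering step can also be sketched in Python, as a rough equivalent of the two Column Editor passes. The function name number_tags is hypothetical, and the sketch assumes the strings @A-@ and @B-@ occur only as the tags added by the first S/R. Because it looks for the tags themselves, it does not depend on the tags lining up in the same column on every line.

```python
def number_tags(tagged_text):
    """Insert the 1-based line number after 'A-' and 'B-', padded to a common width."""
    lines = tagged_text.split("\n")
    width = len(str(len(lines)))  # e.g. 2 when there are 12 lines
    out = []
    for n, line in enumerate(lines, start=1):
        num = str(n).ljust(width)  # '1 ', '2 ', ..., '12'
        line = line.replace("@A-@", f"@A-{num}@")
        line = line.replace("@B-@", f"@B-{num}@")
        out.append(line)
    return "\n".join(out)

print(number_tags("Text_0@A-@|Text_C@B-@"))  # Text_0@A-1@|Text_C@B-1@
```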
    

    Then, with this second regex S/R :

    SEARCH \|

    REPLACE \r\n

    we get the one-column list, below :

    Text_0@A-1 @
    Text_C@B-1 @
    Text_1@A-2 @
    Text_2@B-2 @
    Text_4@A-3 @
    Text_5@B-3 @
    Text_3@A-4 @
    Text_2@B-4 @
    Text_4@A-5 @
    Text_6@B-5 @
    Text_7@A-6 @
    Text_8@B-6 @
    Text_9@A-7 @
    Text_2@B-7 @
    Text_4@A-8 @
    Text_5@B-8 @
    Text_7@A-9 @
    Text_A@B-9 @
    Text_0@A-10@
    Text_B@B-10@
    Text_2@A-11@
    Text_7@B-11@
    Text_6@A-12@
    Text_7@B-12@
    

    Now, let’s use the menu option Edit > Line Operations > Sort Lines Lexicographically Ascending

    We obtain the sorted text, below :

    Text_0@A-1 @
    Text_0@A-10@
    Text_1@A-2 @
    Text_2@A-11@
    Text_2@B-2 @
    Text_2@B-4 @
    Text_2@B-7 @
    Text_3@A-4 @
    Text_4@A-3 @
    Text_4@A-5 @
    Text_4@A-8 @
    Text_5@B-3 @
    Text_5@B-8 @
    Text_6@A-12@
    Text_6@B-5 @
    Text_7@A-6 @
    Text_7@A-9 @
    Text_7@B-11@
    Text_7@B-12@
    Text_8@B-6 @
    Text_9@A-7 @
    Text_A@B-9 @
    Text_B@B-10@
    Text_C@B-1 @
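
    The split and the sort, taken together, can be sketched in Python as follows. The function name split_and_sort is hypothetical, the sketch uses \n rather than \r\n, and it assumes that Python's default string ordering (by code point) agrees with the lexicographic sort for plain ASCII text like this.

```python
def split_and_sort(numbered_text):
    """Put each half of every 'A|B' line on its own line, then sort the lines."""
    one_column = numbered_text.replace("|", "\n").split("\n")
    return sorted(one_column)

sample = (
    "Text_4@A-3 @|Text_5@B-3 @\n"
    "Text_0@A-1 @|Text_C@B-1 @\n"
    "Text_0@A-10@|Text_B@B-10@"
)
for line in split_and_sort(sample):
    print(line)
```

    Because the padding space sorts before any digit, Text_0@A-1 @ lands before Text_0@A-10@, so identical texts from the same column end up on consecutive lines.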
    

    Then, the third regex S/R, below :

    SEARCH (^.+@.).+\R(?:\1.+\R)+|.+\R

    REPLACE ?1$0

    should delete any text which is unique in its column, and keep only the texts which occur several times in their column :

    Text_0@A-1 @
    Text_0@A-10@
    Text_2@B-2 @
    Text_2@B-4 @
    Text_2@B-7 @
    Text_4@A-3 @
    Text_4@A-5 @
    Text_4@A-8 @
    Text_5@B-3 @
    Text_5@B-8 @
    Text_7@A-6 @
    Text_7@A-9 @
    Text_7@B-11@
    Text_7@B-12@
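
    Here is a hedged Python sketch of what the third S/R does. Python's re module supports neither \R nor the conditional replacement ?1$0, so this version uses \n and a callback instead; the names PATTERN and keep_repeated are hypothetical. It assumes every line, including the last one, ends with \n.

```python
import re

# Alternative 1: a run of 2+ consecutive lines sharing the same 'text@A' or 'text@B' prefix.
# Alternative 2: any other single line.
PATTERN = re.compile(r"(^.+@.).+\n(?:\1.+\n)+|.+\n", re.MULTILINE)

def keep_repeated(sorted_text):
    """Keep the runs matched by alternative 1 and delete every other line."""
    return PATTERN.sub(lambda m: m.group(0) if m.group(1) else "", sorted_text)

sample = (
    "Text_0@A-1 @\n"
    "Text_0@A-10@\n"
    "Text_1@A-2 @\n"
    "Text_2@A-11@\n"
    "Text_2@B-2 @\n"
    "Text_2@B-4 @\n"
    "Text_3@A-4 @\n"
)
print(keep_repeated(sample), end="")
# Text_0@A-1 @
# Text_0@A-10@
# Text_2@B-2 @
# Text_2@B-4 @
```

    Group 1 captures the text plus the column letter, so Text_2@A-11@ is not merged with the Text_2@B lines: the same text in different columns does not count as a repeat.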
    

    Finally, use the fourth and last regex S/R, below :

    SEARCH (^(.+?)@B-|@A-)|\x20*@

    REPLACE ?1|(?2\2)\x20\x20\x20\x20\x20

    Notes :

    • You may replace any syntax \x20 with a single space character !

    • In the replacement regex, you may add more spaces, or replace the spaces with several tabulation characters

    This S/R displays the different texts :

    • With the syntax Text_?|, if this text was located BEFORE the Vertical Line symbol

    • With the syntax |Text_?, if this text was located AFTER the Vertical Line symbol

    • The number ending each line is the number of the original line where the string Text_? occurs, listed in sort order (increasing, in this example), so that you can easily locate this string!

    Text_0|     1
    Text_0|     10
    |Text_2     2
    |Text_2     4
    |Text_2     7
    Text_4|     3
    Text_4|     5
    Text_4|     8
    |Text_5     3
    |Text_5     8
    Text_7|     6
    Text_7|     9
    |Text_7     11
    |Text_7     12
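
    The final formatting can be sketched in Python too. This is only an illustration: format_report and its gap parameter are hypothetical names, and the sketch parses each tagged line directly rather than reproducing the fourth regex literally.

```python
import re

def format_report(kept_text, gap=" " * 5):
    """Turn 'Text@A-n @' into 'Text|     n' and 'Text@B-n @' into '|Text     n'."""
    out = []
    for line in kept_text.splitlines():
        m = re.fullmatch(r"(.+?)@([AB])-(\d+) *@", line)
        if not m:
            continue  # skip anything that is not a tagged line
        text, column, number = m.groups()
        # The '|' goes after the text for column A and before it for column B.
        label = f"{text}|" if column == "A" else f"|{text}"
        out.append(f"{label}{gap}{number}")
    return "\n".join(out)

print(format_report("Text_0@A-1 @\nText_0@A-10@\nText_2@B-2 @"))
# Text_0|     1
# Text_0|     10
# |Text_2     2
```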
    

    Best Regards,

    guy038

    P.S. :

    If any of the four S/R, above, seems a bit tricky, just tell me about it !



  • I tested it and it WORKS. I think I will use macros for this long series of regexes.

    Thanks, guy038. I believe you are my only friend around here. ;)

