Regex: Find words that are repeated in multiple lines

Vasile Caraus

Hello. I have these lines with regex expressions, separated by |, of the form Regex_A|Regex_B

(?s)((^.*)(<div class="entry-excerpt">)|(<!-- //.entry -->)(.*$))
(?s)((^.*)(<ul class="smallThumb-mainList">)|(<div class="navigation">)(.*$))
(?s)((^.*)(word_2)|(<!-- //.entry -->)(.*$))
(?s)((^.*)(word_2)|(<!-- //.ambro34 -->)(.*$))

I want to find all the words/regexes that are repeated before | and those that are repeated after |

I tried a regex, but it doesn't work very well: (?m)(.*)^(.*)\|(.*)(?=.*\1)

Vasile Caraus

Basically, after the search and replace, I want only one instance of each of these to remain:

(?s)((^.*)(word_2) because it is repeated 2 times before | (on lines 3 and 4)

(<!-- //.entry -->)(.*$)) because it is repeated after | (on lines 1 and 3)

Vasile Caraus

Maybe a simple example will be clearer:

Word_1 | Word_2
Word_3 | Word_2
Word_4 | Word_5
Word_4 | Word_6

In this case, Word_4 and Word_2 are repeated. So, after the search, I want only these to remain.
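
The requirement in this simple example can also be stated as a short script. The following is only a sketch in Python, which is not part of the original question; the variable names are hypothetical, and it assumes each line contains exactly one | separator.

```python
from collections import Counter

rows = [
    "Word_1 | Word_2",
    "Word_3 | Word_2",
    "Word_4 | Word_5",
    "Word_4 | Word_6",
]

# Split each row once on '|' and trim the surrounding spaces.
pairs = [tuple(part.strip() for part in row.split("|", 1)) for row in rows]

# Count every value separately for the left and the right column.
left_counts = Counter(left for left, _ in pairs)
right_counts = Counter(right for _, right in pairs)

# Keep only the values that occur more than once in their own column.
repeated_left = [word for word, n in left_counts.items() if n > 1]
repeated_right = [word for word, n in right_counts.items() if n > 1]

print(repeated_left)   # ['Word_4']
print(repeated_right)  # ['Word_2']
```

Counting per column keeps a word that appears once on each side (but never twice on the same side) out of the result, which matches the wording of the question.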

Alan Kilborn

@Vasile-Caraus

As stated before here (https://notepad-plus-plus.org/community/topic/13248/regex-datetime) I think you’ve worn out everyone’s good nature (with the possible exception of @guy038) with your infinite regex questions. @MAPJe71 pointed out some good references for you to self-learn; that advice still holds. Sorry, but that’s the way I see it.

guy038

Hello, @Vasile-Caraus, @alan-kilborn and @MapJe71

First of all, @alan-kilborn and @MapJe71, although I do understand your point of view and the advice that you give to @Vasile-Caraus, this exercise nevertheless seems interesting. You may simply consider that it lets you find, in a two-column table, any text which is repeated, one or more times, within each column!

So @Vasile-Caraus, let’s go !

To begin with, some statements and hypotheses :

I’ll limit this topic to the general case of two parts of text, only, separated with one Vertical Line character ( Text_A|Text_B ), which, of course, matches the sub-problem of two regexes, separated by the alternative symbol ( Regex_A|Regex_B )
For syntaxes, as Text_A|Text_B|Text_C or more, it would be more expensive !! Well, set your mind at ease, I’m joking :-))
Of course, these two parts of text do NOT contain the Vertical Line character ( | ), themselves !
I chose the Commercial At sign ( @ ) as a temporary character. If your regexes may contain this character, just choose another symbol, which, preferably, is not a special regex symbol!
I'll use the 12-line original text, below:

Text_0|Text_C
Text_1|Text_2
Text_4|Text_5
Text_3|Text_2
Text_4|Text_6
Text_7|Text_8
Text_9|Text_2
Text_4|Text_5
Text_7|Text_A
Text_0|Text_B
Text_2|Text_7
Text_6|Text_7

Of course, the different NON-null strings Text_? can have any size !

So :

Open a new tab
Copy/Paste the original text, above
Hit the Backspace key to remove any End of Line character(s) at the end of the last line ( Line 12 )
Open the Replace dialog and select the Regular expression search mode
Then the first regex S/R, below :

SEARCH (?=(\|))|$

REPLACE @(?1A-:B-)@

should produce the text :

Text_0@A-@|Text_C@B-@
Text_1@A-@|Text_2@B-@
Text_4@A-@|Text_5@B-@
Text_3@A-@|Text_2@B-@
Text_4@A-@|Text_6@B-@
Text_7@A-@|Text_8@B-@
Text_9@A-@|Text_2@B-@
Text_4@A-@|Text_5@B-@
Text_7@A-@|Text_A@B-@
Text_0@A-@|Text_B@B-@
Text_2@A-@|Text_7@B-@
Text_6@A-@|Text_7@B-@
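
For readers who want to check this first S/R outside Notepad++, here is a rough Python sketch; it is an illustration and not part of the procedure above. Python's re module has no conditional replacement syntax like (?1A-:B-), so a hypothetical helper function named tag plays that role.

```python
import re

# First three rows of the sample text, with no trailing newline
# (this mirrors the Backspace step above).
text = "Text_0|Text_C\nText_1|Text_2\nText_4|Text_5"

def tag(match):
    # Group 1 is set only when the lookahead found '|',
    # i.e. when the empty match sits at the end of the left part.
    return "@A-@" if match.group(1) else "@B-@"

# Both alternatives are zero-width: just before '|' or at the end of a line.
tagged = re.sub(r"(?=(\|))|$", tag, text, flags=re.MULTILINE)
print(tagged)
```

If the text ended with a newline, `$` would also match on the empty last line and add a stray @B-@ there, which is presumably why the last End of Line is removed first.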

Now, choose the Edit > Column Editor…, or hit the ALT + C shortcut
Select the zone Number to Insert
Choose 1, as Initial number
Choose 1, in the Increase by field
Select the Dec format of numbers
Place the caret, on the first line, between the strings @A- and @|
Click on the OK button

=> A list of numbers, between 1 and 12, is inserted at caret position

Now, move the caret, on the first line, between the strings @B- and the last @

Re-open the Column Editor, with the ALT + C shortcut
Hit the Enter key

=> The same list of numbers is inserted, before the last @, of each line :

Text_0@A-1 @|Text_C@B-1 @
Text_1@A-2 @|Text_2@B-2 @
Text_4@A-3 @|Text_5@B-3 @
Text_3@A-4 @|Text_2@B-4 @
Text_4@A-5 @|Text_6@B-5 @
Text_7@A-6 @|Text_8@B-6 @
Text_9@A-7 @|Text_2@B-7 @
Text_4@A-8 @|Text_5@B-8 @
Text_7@A-9 @|Text_A@B-9 @
Text_0@A-10@|Text_B@B-10@
Text_2@A-11@|Text_7@B-11@
Text_6@A-12@|Text_7@B-12@
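
The two Column Editor passes can be imitated with a small loop. This Python sketch is only an approximation: it assumes the numbers are padded with trailing spaces to a common width, as in the listing above, and the variable names are hypothetical.

```python
pairs = [
    "Text_0|Text_C", "Text_1|Text_2", "Text_4|Text_5", "Text_3|Text_2",
    "Text_4|Text_6", "Text_7|Text_8", "Text_9|Text_2", "Text_4|Text_5",
    "Text_7|Text_A", "Text_0|Text_B", "Text_2|Text_7", "Text_6|Text_7",
]

# Result of the first S/R: '@A-@' before the '|' and '@B-@' at the end of the line.
tagged = [pair.replace("|", "@A-@|") + "@B-@" for pair in pairs]

# Width of the largest line number (2 for 12 lines).
width = len(str(len(tagged)))

numbered = []
for n, line in enumerate(tagged, start=1):
    num = f"{n:<{width}}"  # left-aligned, padded with spaces on the right
    numbered.append(
        line.replace("@A-@", f"@A-{num}@").replace("@B-@", f"@B-{num}@")
    )

print("\n".join(numbered))
```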

Then, with that second regex S/R :

SEARCH \|

REPLACE \r\n

we get the one-column list, below :

Text_0@A-1 @
Text_C@B-1 @
Text_1@A-2 @
Text_2@B-2 @
Text_4@A-3 @
Text_5@B-3 @
Text_3@A-4 @
Text_2@B-4 @
Text_4@A-5 @
Text_6@B-5 @
Text_7@A-6 @
Text_8@B-6 @
Text_9@A-7 @
Text_2@B-7 @
Text_4@A-8 @
Text_5@B-8 @
Text_7@A-9 @
Text_A@B-9 @
Text_0@A-10@
Text_B@B-10@
Text_2@A-11@
Text_7@B-11@
Text_6@A-12@
Text_7@B-12@

Now, let’s use the menu option Edit > Line Operations > Sort lines Lexicographically Ascending

We obtain the sorted text, below :

Text_0@A-1 @
Text_0@A-10@
Text_1@A-2 @
Text_2@A-11@
Text_2@B-2 @
Text_2@B-4 @
Text_2@B-7 @
Text_3@A-4 @
Text_4@A-3 @
Text_4@A-5 @
Text_4@A-8 @
Text_5@B-3 @
Text_5@B-8 @
Text_6@A-12@
Text_6@B-5 @
Text_7@A-6 @
Text_7@A-9 @
Text_7@B-11@
Text_7@B-12@
Text_8@B-6 @
Text_9@A-7 @
Text_A@B-9 @
Text_B@B-10@
Text_C@B-1 @
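
The split on | followed by the sort can be sketched in Python as below. This is an illustration on three sample rows only. Python's sorted compares strings by code point, which gives the same order as the listing above for this ASCII sample; whether it agrees with the Notepad++ sort for other characters is an assumption.

```python
numbered = [
    "Text_1@A-2 @|Text_2@B-2 @",
    "Text_0@A-10@|Text_B@B-10@",
    "Text_0@A-1 @|Text_C@B-1 @",
]

# Second S/R: put each side of the '|' on its own line.
one_column = [cell for line in numbered for cell in line.split("|")]

# Lexicographic ascending sort; a space sorts before '0', so 'A-1 ' comes before 'A-10'.
sorted_cells = sorted(one_column)
print("\n".join(sorted_cells))
```

Because the column letter follows the text, identical texts from the same column end up on consecutive lines, which is what the next S/R relies on.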

Then, the third regex S/R, below :

SEARCH (^.+@.).+\R(?:\1.+\R)+|.+\R

REPLACE ?1$0

should delete any text that is unique in its column and keep only the texts that occur several times in their column:

Text_0@A-1 @
Text_0@A-10@
Text_2@B-2 @
Text_2@B-4 @
Text_2@B-7 @
Text_4@A-3 @
Text_4@A-5 @
Text_4@A-8 @
Text_5@B-3 @
Text_5@B-8 @
Text_7@A-6 @
Text_7@A-9 @
Text_7@B-11@
Text_7@B-12@
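
To see how this third S/R behaves, it can be reproduced roughly in Python. This is a sketch under assumptions: Python's re module does not support \R, so \n is used instead (Unix-style line endings, with a newline after every line, including the last), and the replacement ?1$0 is emulated by a hypothetical function named keep_repeated.

```python
import re

sorted_text = (
    "Text_0@A-1 @\n"
    "Text_0@A-10@\n"
    "Text_1@A-2 @\n"
    "Text_2@A-11@\n"
    "Text_2@B-2 @\n"
    "Text_2@B-4 @\n"
    "Text_3@A-4 @\n"
)

def keep_repeated(match):
    # Group 1 (text + '@' + column letter) is set only when at least two
    # consecutive lines share that prefix; such a block is kept unchanged.
    # A single line matched by the second alternative is deleted.
    return match.group(0) if match.group(1) else ""

pattern = r"(^.+@.).+\n(?:\1.+\n)+|.+\n"
filtered = re.sub(pattern, keep_repeated, sorted_text, flags=re.MULTILINE)
print(filtered, end="")
```

Note that Text_2@A-11@ is removed even though Text_2 also appears in column B: the captured prefix includes the column letter, so repetitions are only counted within one column.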

Finally, use the fourth and last regex S/R, below :

SEARCH (^(.+?)@B-|@A-)|\x20*@

REPLACE ?1|(?2\2)\x20\x20\x20\x20\x20

Notes :

You may replace any syntax \x20 with a single space character !
In the replacement regex, you may add more spaces, or replace the spaces with several tab characters

This S/R displays the different texts :

With the syntax Text_?|, if this text was located BEFORE the Vertical Line symbol
With the syntax |Text_?, if this text was located AFTER the Vertical Line symbol
The number ending each line is the number of the original line where the string Text_? occurs, listed in increasing order, so that the string is easy to locate!

Text_0|     1
Text_0|     10
|Text_2     2
|Text_2     4
|Text_2     7
Text_4|     3
Text_4|     5
Text_4|     8
|Text_5     3
|Text_5     8
Text_7|     6
Text_7|     9
|Text_7     11
|Text_7     12
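
The fourth S/R can be traced with the following Python sketch, again only as an illustration. The conditional replacement is emulated by a hypothetical function named fmt, since Python's re module has no equivalent of the ?1 ... (?2\2) syntax.

```python
import re

kept = "Text_0@A-1 @\nText_0@A-10@\nText_2@B-2 @"

def fmt(match):
    if match.group(1):
        # Either '@A-' was found (group 2 unset: the text stays before the '|'),
        # or '^text@B-' was found (group 2 holds the text, moved after the '|').
        return "|" + (match.group(2) or "") + " " * 5
    # Second alternative: optional spaces plus the closing '@' are removed.
    return ""

result = re.sub(r"(^(.+?)@B-|@A-)|\x20*@", fmt, kept, flags=re.MULTILINE)
print(result)
# Text_0|     1
# Text_0|     10
# |Text_2     2
```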

Best Regards,

guy038

P.S. :

If any of the four S/R, above, seems a bit tricky, just tell me about it !

Vasile Caraus

Tested it and it WORKS. I think I will use macros for this long sequence of regexes.

Thanks, guy038. I believe you are my only friend around here. ;)