Regex: Find words that are repeated on multiple lines
-
Hello. I have these lines of regex expressions, separated by |, of type Regex_A|Regex_B :
(?s)((^.*)(<div class="entry-excerpt">)|(<!-- //.entry -->)(.*$))
(?s)((^.*)(<ul class="smallThumb-mainList">)|(<div class="navigation">)(.*$))
(?s)((^.*)(word_2)|(<!-- //.entry -->)(.*$))
(?s)((^.*)(word_2)|(<!-- //.ambro34 -->)(.*$))
I want to find all the words/regexes that are repeated before | and those that are repeated after |
I tried a regex, but it doesn't work very well:
(?m)(.*)^(.*)\|(.*)(?=.*\1)
-
Basically, after the search and replace, I want only one instance of each of these to remain:
(?s)((^.*)(word_2)
because it is repeated twice before | (on lines 3 and 4), and
(<!-- //.entry -->)(.*$))
because it is repeated after | (on lines 1 and 3).
-
Maybe a simple example will be much better:
Word_1 | Word_2
Word_3 | Word_2
Word_4 | Word_5
Word_4 | Word_6
In this case, Word_4 and Word_2 are repeated. So, after the search, I want only these to remain.
-
As stated before here (https://notepad-plus-plus.org/community/topic/13248/regex-datetime) I think you’ve worn out everyone’s good nature (with the possible exception of @guy038) with your infinite regex questions. @MAPJe71 pointed out some good references for you to self-learn; that advice still holds. Sorry, but that’s the way I see it.
-
Hello, @Vasile-Caraus, @alan-kilborn and @MapJe71
First of all, @alan-kilborn and @MapJe71, although I do understand your point of view and the advice that you give to @Vasile-Caraus, this exercise seems interesting, nevertheless. You may simply consider that it lets you find, in a two-column table, any text which is repeated, one or more times, within each column !
So @Vasile-Caraus, let’s go !
To begin with, some statements and hypotheses :
-
I’ll limit this topic to the general case of only two parts of text, separated by one Vertical Line character (Text_A|Text_B), which, of course, matches the sub-problem of two regexes separated by the alternation symbol (Regex_A|Regex_B)
-
For syntaxes such as Text_A|Text_B|Text_C or more, it would be more expensive !! Well, set your mind at ease, I’m joking :-))
-
Of course, these two parts of text do NOT contain the Vertical Line character (|) themselves !
-
I chose the Commercial At sign (@) as a temporary character. If your regexes may contain this character, just choose another symbol which, preferably, is not a special regex symbol !
-
I’ll use the 12-line original text, below :
Text_0|Text_C
Text_1|Text_2
Text_4|Text_5
Text_3|Text_2
Text_4|Text_6
Text_7|Text_8
Text_9|Text_2
Text_4|Text_5
Text_7|Text_A
Text_0|Text_B
Text_2|Text_7
Text_6|Text_7
- Of course, the different NON-null strings Text_? can have any size !
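As an independent cross-check of which texts count as "repeated" in this sample, here is a minimal Python sketch. It is not part of the Notepad++ method; the helper name repeated_by_column is made up for illustration, and it assumes exactly one | per line, as stated in the hypotheses above.

```python
from collections import defaultdict

def repeated_by_column(lines):
    """Map (column, text) to its 1-based line numbers, keeping repeated texts only."""
    seen = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        left, right = line.split("|")  # assumption: exactly one '|' per line
        seen[("A", left)].append(number)   # column A = text before the bar
        seen[("B", right)].append(number)  # column B = text after the bar
    return {key: nums for key, nums in seen.items() if len(nums) > 1}

lines = """Text_0|Text_C
Text_1|Text_2
Text_4|Text_5
Text_3|Text_2
Text_4|Text_6
Text_7|Text_8
Text_9|Text_2
Text_4|Text_5
Text_7|Text_A
Text_0|Text_B
Text_2|Text_7
Text_6|Text_7""".splitlines()

result = repeated_by_column(lines)
for (column, text), numbers in sorted(result.items()):
    print(column, text, numbers)
```

Note that Text_2 on line 11 and Text_6 on lines 5 and 12 are not reported, because a text only counts as repeated when it occurs several times in the same column.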
So :
-
Open a new tab
-
Copy/Paste the original text, above
-
Hit the Backspace key to remove any End of Line character(s) at the end of the last line ( Line 12 )
-
Open the Replace dialog
-
Then the first regex S/R, below :
SEARCH
(?=(\|))|$
REPLACE
@(?1A-:B-)@
should produce the text :
Text_0@A-@|Text_C@B-@
Text_1@A-@|Text_2@B-@
Text_4@A-@|Text_5@B-@
Text_3@A-@|Text_2@B-@
Text_4@A-@|Text_6@B-@
Text_7@A-@|Text_8@B-@
Text_9@A-@|Text_2@B-@
Text_4@A-@|Text_5@B-@
Text_7@A-@|Text_A@B-@
Text_0@A-@|Text_B@B-@
Text_2@A-@|Text_7@B-@
Text_6@A-@|Text_7@B-@
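For readers who want to follow the logic of this first S/R outside Notepad++, here is a rough Python sketch. Python's re module has no conditional replacement such as (?1A-:B-), so a callback function plays that role. The name tag_columns and the two-line sample are assumptions for illustration only.

```python
import re

def tag_columns(text):
    """Insert @A-@ just before each '|' and @B-@ at the end of each line."""
    def repl(match):
        # Group 1 is set only when the empty match sits right before a '|'.
        return "@A-@" if match.group(1) else "@B-@"
    # Same search regex as above; MULTILINE makes $ match at every line end.
    return re.sub(r"(?=(\|))|$", repl, text, flags=re.MULTILINE)

sample = "Text_0|Text_C\nText_1|Text_2"  # no trailing End of Line, as required
tagged = tag_columns(sample)
print(tagged)
# Text_0@A-@|Text_C@B-@
# Text_1@A-@|Text_2@B-@
```

If the sample ended with an End of Line, $ would also match on the empty last line and add a stray @B-@ there, which is why the trailing End of Line is removed first.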
-
Now, choose Edit > Column Editor…, or hit the ALT + C shortcut
-
Select the zone Number to Insert
-
Choose 1, as Initial number
-
Choose 1, in the Increase by field
-
Select the Dec format of numbers
-
Place the caret, on the first line, between the strings @A- and @|
-
Click on the OK button
=> A list of numbers, between 1 and 12, is inserted at caret position
Now, move the caret, on the first line, between the strings @B- and the last @
-
Re-open the Column Editor, with the ALT + C shortcut
-
Hit the Enter key
=> The same list of numbers is inserted, before the last @ of each line :
Text_0@A-1 @|Text_C@B-1 @
Text_1@A-2 @|Text_2@B-2 @
Text_4@A-3 @|Text_5@B-3 @
Text_3@A-4 @|Text_2@B-4 @
Text_4@A-5 @|Text_6@B-5 @
Text_7@A-6 @|Text_8@B-6 @
Text_9@A-7 @|Text_2@B-7 @
Text_4@A-8 @|Text_5@B-8 @
Text_7@A-9 @|Text_A@B-9 @
Text_0@A-10@|Text_B@B-10@
Text_2@A-11@|Text_7@B-11@
Text_6@A-12@|Text_7@B-12@
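The two Column Editor passes can be imitated in a few lines of Python. This is only a sketch: number_lines is a made-up helper, and padding the numbers with trailing spaces simply copies what the numbered text above shows.

```python
def number_lines(lines):
    """Write the 1-based line number into the @A-@ and @B-@ tags of each line."""
    width = len(str(len(lines)))  # all numbers are padded to the widest one
    numbered = []
    for n, line in enumerate(lines, start=1):
        num = str(n).ljust(width)  # left-aligned, padded with trailing spaces
        line = line.replace("@A-@", "@A-" + num + "@")
        line = line.replace("@B-@", "@B-" + num + "@")
        numbered.append(line)
    return numbered

tagged = ["Text_0@A-@|Text_C@B-@"] * 12  # twelve dummy lines, for the padding
numbered = number_lines(tagged)
print(numbered[0])   # Text_0@A-1 @|Text_C@B-1 @
print(numbered[11])  # Text_0@A-12@|Text_C@B-12@
```

The same line number ends up on both sides of the bar, so it survives when the two columns are later split onto separate lines.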
Then, with that second regex S/R :
SEARCH
\|
REPLACE
\r\n
we get the one-column list, below :
Text_0@A-1 @
Text_C@B-1 @
Text_1@A-2 @
Text_2@B-2 @
Text_4@A-3 @
Text_5@B-3 @
Text_3@A-4 @
Text_2@B-4 @
Text_4@A-5 @
Text_6@B-5 @
Text_7@A-6 @
Text_8@B-6 @
Text_9@A-7 @
Text_2@B-7 @
Text_4@A-8 @
Text_5@B-8 @
Text_7@A-9 @
Text_A@B-9 @
Text_0@A-10@
Text_B@B-10@
Text_2@A-11@
Text_7@B-11@
Text_6@A-12@
Text_7@B-12@
Now, let’s use the menu option Edit > Line Operations > Sort lines Lexicographically Ascending
We obtain the sorted text, below :
Text_0@A-1 @
Text_0@A-10@
Text_1@A-2 @
Text_2@A-11@
Text_2@B-2 @
Text_2@B-4 @
Text_2@B-7 @
Text_3@A-4 @
Text_4@A-3 @
Text_4@A-5 @
Text_4@A-8 @
Text_5@B-3 @
Text_5@B-8 @
Text_6@A-12@
Text_6@B-5 @
Text_7@A-6 @
Text_7@A-9 @
Text_7@B-11@
Text_7@B-12@
Text_8@B-6 @
Text_9@A-7 @
Text_A@B-9 @
Text_B@B-10@
Text_C@B-1 @
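The split and sort steps can be sketched in Python as well. The helper name split_and_sort is hypothetical, and the sketch assumes that the Notepad++ sort compares plain character codes, which is how Python's sorted() compares strings.

```python
def split_and_sort(numbered_text):
    """Put each column entry on its own line, then sort the lines."""
    rows = numbered_text.replace("|", "\n").splitlines()
    return sorted(rows)  # plain code-point comparison

# Two numbered lines, in the same shape as the text produced so far
sample = "Text_0@A-10@|Text_C@B-10@\nText_0@A-1 @|Text_2@B-1 @"
rows = split_and_sort(sample)
for row in rows:
    print(row)
# Text_0@A-1 @
# Text_0@A-10@
# Text_2@B-1 @
# Text_C@B-10@
```

Because the column tag (@A- or @B-) follows the text directly, identical texts from the same column end up on consecutive lines, which is what the next S/R relies on.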
Then, the third regex S/R, below :
SEARCH
(^.+@.).+\R(?:\1.+\R)+|.+\R
REPLACE
?1$0
should delete any text which is unique in its column, and keep only the texts which occur several times in their column :
Text_0@A-1 @
Text_0@A-10@
Text_2@B-2 @
Text_2@B-4 @
Text_2@B-7 @
Text_4@A-3 @
Text_4@A-5 @
Text_4@A-8 @
Text_5@B-3 @
Text_5@B-8 @
Text_7@A-6 @
Text_7@A-9 @
Text_7@B-11@
Text_7@B-12@
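Here is a rough Python equivalent of this third S/R, for anyone who wants to test the idea outside Notepad++. Python's re module does not know \R, so the sketch uses \n instead, and a callback stands in for the conditional replacement ?1$0. The name keep_repeated is an assumption for illustration.

```python
import re

# Group 1 captures "text@A" or "text@B"; the backreference \1 then requires
# at least one following line that starts with the same text and column tag.
PATTERN = re.compile(r"(^.+@.).+\n(?:\1.+\n)+|.+\n", re.MULTILINE)

def keep_repeated(sorted_text):
    """Keep runs of lines sharing the same text and column, drop single lines."""
    if not sorted_text.endswith("\n"):
        sorted_text += "\n"  # the regex expects every line to end with a break
    return PATTERN.sub(lambda m: m.group(0) if m.group(1) else "", sorted_text)

sample = (
    "Text_0@A-1 @\n"
    "Text_0@A-10@\n"
    "Text_1@A-2 @\n"
    "Text_2@A-11@\n"
    "Text_2@B-2 @\n"
    "Text_2@B-4 @\n"
)
kept = keep_repeated(sample)
print(kept, end="")
# Text_0@A-1 @
# Text_0@A-10@
# Text_2@B-2 @
# Text_2@B-4 @
```

Text_2@A-11@ is dropped even though Text_2 occurs on other lines, because the captured prefix includes the column letter.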
Finally, use the fourth and last regex S/R, below :
SEARCH
(^(.+?)@B-|@A-)|\x20*@
REPLACE
?1|(?2\2)\x20\x20\x20\x20\x20
Notes :
-
You may replace any \x20 syntax with a single space character !
-
In the replacement regex, you may add more spaces, or replace the spaces with several tabulation characters
This S/R displays the different texts :
-
With the syntax Text_?| if this text was located BEFORE the Vertical Line symbol
-
With the syntax |Text_? if this text was located AFTER the Vertical Line symbol
-
The number ending each line is the number of the line, in the original text, where the string Text_? occurs, so that you can easily locate this string ! For each string, these numbers appear in the order produced by the sort
Text_0|     1
Text_0|     10
|Text_2     2
|Text_2     4
|Text_2     7
Text_4|     3
Text_4|     5
Text_4|     8
|Text_5     3
|Text_5     8
Text_7|     6
Text_7|     9
|Text_7     11
|Text_7     12
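The fourth S/R can also be tried in Python with a small sketch. Again, a callback replaces the conditional replacement syntax, which Python's re module does not support. The names format_match and final_report are hypothetical.

```python
import re

PATTERN = re.compile(r"(^(.+?)@B-|@A-)|\x20*@", re.MULTILINE)

def format_match(match):
    if match.group(1) is None:
        return ""  # trailing padding and final '@' are simply removed
    moved = match.group(2) or ""  # column-B text is re-written after the bar
    return "|" + moved + " " * 5  # five spaces, as in the replacement regex

def final_report(text):
    """Turn 'Text@A-n @' into 'Text|     n' and 'Text@B-n @' into '|Text     n'."""
    return PATTERN.sub(format_match, text)

sample = "Text_0@A-1 @\nText_0@A-10@\nText_2@B-2 @"
report = final_report(sample)
print(report)
# Text_0|     1
# Text_0|     10
# |Text_2     2
```

For a column-A line, only the @A- tag is replaced, so the text stays in front of the bar. For a column-B line, the whole start of the line is matched, so the bar can be written before the text.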
Best Regards,
guy038
P.S. :
If any of the four S/R, above, seems a bit tricky, just tell me about it !
-
-
Tested it and it WORKS. I think I will use macros for this long sequence of regexes.
Thanks, guy038. I believe you are my only friend around here. ;)