Proximity search

guy038

Hello, @iona-hine

So,@iona-hine, here are my clear instructions :-;))

To search for the string WordA, separated from an identical string WordA, by the GREATEST range of characters, containing between 0 and 50 words max, use the following regex :

(?si)(?<=\W)(WordA)(?:\W+\w+){0,50}\W+\1(?=\W)

To search for the string WordA, separated from an identical string WordA, by the SHORTEST range of characters, containing between 0 and 50 words max, use the following regex :

(?si)(?<=\W)(WordA)(?:\W+\w+){0,50}?\W+\1(?=\W)

To search for the string Word1, separated from another string Word2 ( or the opposite ) by the GREATEST range of characters, containing between 0 and 50 words max, use the following regex :

(?si)(?<=\W)(Word1)((?:\W+\w+){0,50}\W+)(Word2)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)

To search for the string Word1, separated from another string Word2 ( or the opposite ) by the SHORTEST range of characters, containing between 0 and 50 words max, use the following regex :

(?si)(?<=\W)(Word1)((?:\W+\w+){0,50}?\W+)(Word2)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)

Notes :
Of course, replace the strings WordA / Word1 and Word2 with whatever you like, as well as the number 50 of “gap” words, if necessary
The (?si) syntax, at the beginning of these regexes, sets two modifiers, which mean that :
- The dot ( . ) special character matches, absolutely, any single character ( standard or EOL one )
- The search will be performed in a case-insensitive way ( If you need a case-sensitive search, just use the syntax (?s-i) )
The part (?<=\W)(Word1) OR (?<=\W)(WordA) matches the strings Word1 or WordA, ONLY IF they are preceded by a non-Word character
The (?:\W+\w+){0,50}\W+ part, represents the greatest range of words, each preceded by non-Word character(s), up to 50, and followed by some non-Word character(s). Note that the sub-part (?:\W+\w+) is, itself, a non-capturing group !
If a question mark, ?, is added, right after the quantifier range {0,50}, the quantifier becomes lazy : the regex engine will look, instead, for the shortest range which can match the overall regex
The part (Word2)(?=\W) OR \1(?=\W) ( \1 stands for the string WordA ) looks for the string Word2 or WordA, ONLY IF it’s followed by a non-Word character
Finally, the part (?<=\W)(?3)(?2)(?1)(?=\W), located after the alternative symbol, in the last two regexes, tries to match the opposite form Word2........Word1.
We cannot use back-references and must use the called subpattern construction ( (?N), where N is the number of the group ). Indeed, when the regex engine matches the second part of the alternative, the back-references \1, \2 and \3 would not have been defined. Whereas :
- The called subpattern (?1) is identical to the string Word1
- The called subpattern (?2) is identical to the regex (?:\W+\w+){0,50}\W+
- The called subpattern (?3) is identical to the string Word2
Remember, also, that the parentheses, surrounding the look-around features, do not represent any group
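
For readers who want to experiment outside N++, the notes above can be sketched in Python. This is only an illustration under assumptions : Python's standard re module is a different engine from the Boost one used by N++ and has no (?1) called subpatterns, so the reversed order is written out in full. The helper name proximity and the sample sentence are made up for the example.

```python
import re

def proximity(word_a, word_b, max_gap=50, lazy=False):
    """Build a pattern matching word_a ... word_b or word_b ... word_a."""
    a, b = re.escape(word_a), re.escape(word_b)
    # Up to max_gap words, each preceded by non-word characters,
    # then at least one non-word character before the closing word.
    gap = r"(?:\W+\w+){0,%d}%s\W+" % (max_gap, "?" if lazy else "")
    # No (?3)(?2)(?1) in Python's re, so both orders are spelled out.
    body = "(?:%s%s%s|%s%s%s)" % (a, gap, b, b, gap, a)
    return re.compile(r"(?si)(?<=\W)" + body + r"(?=\W)")

# Hypothetical sample text, padded with a space at both ends.
text = " the cat saw a dog and the bird saw a fish "

greedy = [m.group() for m in proximity("the", "a").finditer(text)]
lazy = [m.group() for m in proximity("the", "a", lazy=True).finditer(text)]

print(greedy)  # one long range, up to the last "a"
print(lazy)    # two short ranges
```

The greedy form swallows the first short range into one long match, which is why the GREATEST and SHORTEST variants usually report different counts on the same file.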

You may test these regexes, with real examples, using the license.txt file, provided with any N++ release !

For instance, in the N++ v7.3.3 license.txt file, with a maximum of 50 words between :

To look for the shortest ranges, between two occurrences of the article the, whatever its case, use :

(?si)(?<=\W)(the)(?:\W+\w+){0,50}?\W+\1(?=\W)

=> 86 matches

To look for the greatest ranges, between two occurrences of the article a, whatever its case, use :

(?si)(?<=\W)(a)(?:\W+\w+){0,50}\W+\1(?=\W)

=> 16 matches

To look for the greatest ranges, between the article the and the article a ( or between a and the ), whatever their case, use :

(?si)(?<=\W)(the)((?:\W+\w+){0,50}\W+)(a)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)

=> 34 matches

To look for the shortest ranges, between the article the and the article a ( or between a and the ) , whatever their case, use :

(?si)(?<=\W)(the)((?:\W+\w+){0,50}?\W+)(a)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)

=> 49 matches

IMPORTANT :

I just realized that we must add a blank line at the very beginning AND another at the very end of the current file.

Indeed, if the string Word1 or WordA is located at the very beginning of the current file AND / OR the string Word2 or WordA is located at the very end of the current file, without any additional line-break, the look-arounds (?<=\W) and (?=\W) cannot be satisfied, so the overall regex does not match !
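
The boundary problem can be reproduced with a small Python sketch ( Python's re engine, not the N++ Boost one, so take it as an illustration only ). The relaxed pattern, which uses the negative look-arounds (?<!\w) and (?!\w), is an alternative that was not proposed in this thread : it is shown as an assumption about how the blank lines could be avoided, not as a tested N++ recipe.

```python
import re

# "the" sits at the very start AND at the very end of the text.
text = "the cat and the"

# Same shape as the regexes above: a non-word character is REQUIRED
# before the first word and after the last one.
strict = re.compile(r"(?si)(?<=\W)(the)(?:\W+\w+){0,50}?\W+\1(?=\W)")

# Assumed alternative: only require that NO word character is adjacent,
# which is also true at the very start and very end of the text.
relaxed = re.compile(r"(?si)(?<!\w)(the)(?:\W+\w+){0,50}?\W+\1(?!\w)")

no_padding = strict.search(text)                # fails: nothing before/after
padded = strict.search("\n" + text + "\n")      # works once padded
unpadded_relaxed = relaxed.search(text)         # works without padding

print(no_padding, padded.group(), unpadded_relaxed.group())
```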

Best Regards,

guy038

P.S. :

I forgot to add that, if you use, for instance, the Courier New font, a word character, which can be found with the simple regex \w, is any character from the list, below :

------------------------------------------------------------------ BASIC Word characters -----------------
0123456789  _

ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
------------------------------------------------------------------ LATIN Letters -------------------------
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝŸÞ  ĀĂĄĆĈĊČĎĐĒĔĖĘĚĜĞĠĢĤĦĨĪĬĮİĲĴĶĹĻĽĿŁŃŅŇŊŌŎŐŒŔŖŘŚŜŞŠŢŤŦŨŪŬŮŰŲŴŶŹŻŽ 
àáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿþ  āăąćĉċčďđēĕėęěĝğġģĥħĩīĭįiĳĵķĺļľŀłńņňŋōŏőœŕŗřśŝşšţťŧũūŭůűųŵŷźżž

ƏƠƯǍǏǑǓǕǗǙǛǺǼǾ  ẀẂẄ  ẠẢẤẦẨẪẬẮẰẲẴẶẸẺẼẾỀỂỄỆỈỊỌỎỐỒỔỖỘỚỜỞỠỢỤỦỨỪỬỮỰỲỴỶỸ
əơưǎǐǒǔǖǘǚǜǻǽǿ  ẁẃẅ  ạảấầẩẫậắằẳẵặẹẻẽếềểễệỉịọỏốồổỗộớờởỡợụủứừửữựỳỵỷỹ
------------------------------------------------------------------ Unique LATIN Letters ------------------
ß ĸ ŉ ſ ƒ ℓ ﬁ ﬂ
------------------------------------------------------------------ MISCELLANEOUS Symbols -----------------
¹ ² ³ ⁿ    ª º    µ Ω
------------------------------------------------------------------ GREEK Letters -------------------------
ΆΈΉΊΌΎΏ ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩΪΫ
άέήίόύώ αβγδεζηθικλμνξοπρστυφχψωϊϋ ΐΰ ς
------------------------------------------------------------------ CYRILLIC Letters ----------------------
ЁЂЃЄЅІЇЈЉЊЋЌЎЏ АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ ҐҒҖҚҜҢҮҰҲҸҺӘӨ
ёђѓєѕіїјљњћќўџ абвгдежзийклмнопрстуфхцчшщъыьэюя ґғҗқҝңүұҳҹһәө
------------------------------------------------------------------ HEBRAIC Letters -----------------------
אבגדהוזחטיךכלםמןנסעףפץצקרשת
פֿﭏﬠשׁשׂשּׁשּׂאַאָאּבּגּדּהּוּזּטּיּךּכּלּמּנּסּףּפּצּקּרּשּתּוֹבֿכֿ
------------------------------------------------------------------ ARABIC Letters ------------------------
ءآأؤإئابةتثجحخدذرزسشصضطظعغفقكلمنهوىي
ً ٌ ٍ َ ُ ِ ّ ْ 
٠١٢٣٤٥٦٧٨٩
ٰ ٱٴپچژڤکگیە
 ۤ
۰۱۲۳۴۵۶۷۸۹
ﭐﭑﭖﭗﭘﭙﭪﭫﭬﭭﭺﭻﭼﭽﮊﮋﮎﮏﮐﮑﮒﮓﮔﮕﯼﯽﯾﯿﰈﰉﰎﰒﰱﰲﰿﱀﱁﱂﱃﱄﱎﱏﱘﱙ
ﱞﱟﱠﱡﱢ
ﱪﱭﱮﱯﱰﱳﱴﱵﲎﲏﲑﲔﲜﲝﲞﲟﲠﲡﲢﲣﲤﲥﲦﲨﲪﲬﲰﳉﳊﳋﳌﳍﳎﳏﳐﳑﳒﳓﳔﳕﳘﳚﳛﳜﳝﴰﴼﴽﶈﷲﺀﺁﺂﺃﺄﺅﺆﺇﺈﺉﺊﺋﺌﺍﺎﺏﺐﺑﺒﺓﺔﺕﺖﺗﺘﺙﺚﺛﺜﺝﺞﺟﺠﺡﺢﺣﺤ
ﺥﺦﺧﺨﺩﺪﺫﺬﺭﺮﺯﺰﺱﺲﺳﺴﺵﺶﺷﺸﺹﺺﺻﺼﺽﺾﺿﻀﻁﻂﻃﻄﻅﻆﻇﻈﻉﻊﻋﻌﻍﻎﻏﻐﻑﻒﻓﻔﻕﻖﻗﻘﻙﻚﻛﻜﻝﻞﻟﻠﻡﻢﻣﻤﻥﻦﻧﻨﻩﻪﻫﻬﻭﻮﻯﻰﻱﻲﻳﻴﻵﻶﻷﻸﻹﻺﻻﻼ

So, a Non-Word character ( \W ) is any character, of the Courier New font, which does not belong to the list just above !
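
As a quick way to check whether a given character counts as a word character, here is a minimal Python sketch. Note the assumption : Python's \w follows Python's own Unicode rules, which may not coincide exactly with the list above for the N++ engine. The sample string is invented for the example.

```python
import re

# Hypothetical sample mixing accented, Greek and ASCII characters.
sample = "évènement, Ω_9 + ß!"

words = re.findall(r"\w+", sample)      # runs of word characters
non_words = re.findall(r"\W+", sample)  # runs of non-word characters

print(words)
print(non_words)
```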

Concerning the called subpattern (?N) regex syntax, you may, also, refer to the last part of the two posts, below :

https://notepad-plus-plus.org/community/topic/12948/2-search-strings-in-a-group-of-files-with-the-search-function/3

https://notepad-plus-plus.org/community/topic/13518/regular-expression-to-find-two-words-in-files-in-folder/3

Iona Hine

Thank you — for taking the time to explain the expression to me as well as providing the necessary answer. I have done my best to follow the explanation. If I can just ask for one clarification:

I ran the following Find in Files search:

(?si)(?<=\W)(word1)((?:\W+\w+){0,50}\W+)(word2)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)

and got the following result: 1908 hits in 185 files

I then inverted the search as a check:

(?si)(?<=\W)(word2)((?:\W+\w+){0,50}\W+)(word1)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)

and got 1946 hits in 181 files.

Could the difference be accounted for by the absence of initial and final blank lines?
(I’m searching a few hundred files so haven’t amended them. I’m using Notepad++ to investigate a much larger discrepancy, so the results are already an improvement. But it would help to be confident about where the difference lies.)

Iona Hine

I tried to unravel this for myself by checking through one file manually (with the help of Ctrl + F) and comparing my findings with the search result. This has left me even more confused.

Manually, I find that (checking the space of 50 words before and 50 words after in each case, noting that punctuation had already been removed):

word2 appears 17 times in the file

There are 15 instances of word1 preceding word2
There are 8 instances of word1 following word2.
And the sum is also true: there are 23 instances of word1 around word2

I also find that:

10 instances of word2 are preceded by at least one instance of word1.
7 instances of word2 are followed by at least one instance of word1.
5 instances of word2 have no proximate instances of word1.
12 instances of word2 have at least one proximate instance of word1
5 instances of word2 are both preceded and followed by at least one instance of word1.

Now, when I run the regex search (see above/below), I get a report of 11 hits in this file, regardless of the order in which I position word1 and word2. So what event is being reported?

“(?si)(?<=\W)(word1)((?:\W+\w+){0,50}\W+)(word2)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)”
“(?si)(?<=\W)(word2)((?:\W+\w+){0,50}\W+)(word1)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)”
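
One possible reason why a hit count differs from a manual count of pairs, sketched in Python under assumptions : regex engines generally report non-overlapping matches, so one long match can cover several word1 / word2 pairs that a manual check counts separately. Whether this fully explains the figures above is not established here. The words alpha and beta are placeholders, and the reversed order is written out because Python's re has no (?1) syntax.

```python
import re

text = " alpha alpha beta beta "

gap = r"(?:\W+\w+){0,50}\W+"
pattern = re.compile(
    r"(?si)(?<=\W)(?:alpha" + gap + "beta|beta" + gap + r"alpha)(?=\W)"
)

# What the search reports: non-overlapping matches only.
matches = [m.group() for m in pattern.finditer(text)]

# What a manual check counts: every alpha that precedes a beta.
alphas = [m.start() for m in re.finditer(r"\balpha\b", text)]
betas = [m.start() for m in re.finditer(r"\bbeta\b", text)]
pairs = sum(1 for a in alphas for b in betas if a < b)

print(len(matches), pairs)
```

Here the greedy search reports a single hit, while the manual count finds four alpha-before-beta pairs.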

Alan Kilborn

In a related but different direction, I was looking to automate a find-and-replace to add the blank lines mentioned at the top and bottom of the file(s).

I came up with this regex for the Find-what box: (?-m)^|\z, using a Replace-with of \r\n, and that does the job when I try it in RegexBuddy, but when trying same in Notepad++, I get different (and bad) results.

For example, if I start with this text:

abc
def
ghi

after the replace I end up with:

blank line
a
b
c
blank line
d
e
f
blank line
g
h
i
blank line

I expected to get (and I did get it in RB, with boost 1.54-1.57):

blank line
abc
def
ghi
blank line

Can anyone fill me in on what goes wrong with this find-n-replace in N++?
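
For comparison, the expected result can be reproduced with Python's re module ( a different engine, used here only as a reference ). In Python, \A and \Z play the role of (?-m)^ and \z in the regex above.

```python
import re

text = "abc\r\ndef\r\nghi"

# \A matches only at the very start, \Z only at the very end of the text.
result = re.sub(r"\A|\Z", "\r\n", text)

print(repr(result))
```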

guy038

Hello, @alan-kilborn,

Maybe I’ll give you a detailed answer, after I solve the Iona problem :-)). @iona-hine, I’m preparing my next post to you. Just wait about one hour !

In the meanwhile, Alan just try this simple S/R :

SEARCH (?s).+

REPLACE \r\n$0\r\n

Et voilà !
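
The same S/R can be tried in Python as a sketch. The assumption is only about notation : Python writes the whole match as \g<0> in the replacement, where N++ uses $0.

```python
import re

text = "abc\r\ndef\r\nghi"

# (?s).+ matches the whole text in one go, line-breaks included,
# so the replacement wraps it in a leading and a trailing line-break.
result = re.sub(r"(?s).+", r"\r\n\g<0>\r\n", text)

# An empty text gives no match, so nothing is added in that case.
empty_result = re.sub(r"(?s).+", r"\r\n\g<0>\r\n", "")

print(repr(result), repr(empty_result))
```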

Cheers,

guy038

guy038

Hi, @iona-hine,

Ah yes ! I didn’t think about switching the two words. I, immediately, verified with my test file : the license.txt file ! And, luckily, using the count command, in the Find dialog, I always found the same results, between the regexes :

(?si)(?<=\W)(Word1)((?:\W+\w+){0,50}\W+)(Word2)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)

AND

(?si)(?<=\W)(Word2)((?:\W+\w+){0,50}\W+)(Word1)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)

and, also, between the regexes :

(?si)(?<=\W)(Word1)((?:\W+\w+){0,50}?\W+)(Word2)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)

AND

(?si)(?<=\W)(Word2)((?:\W+\w+){0,50}?\W+)(Word1)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)

I tried with some couples of words ( the and a, you and it, you and the, a and you, the and it… ). And, each time, the number of occurrences found was identical for both regexes

I also repeated the test with a 750K file and the results are the same, whatever the form Word1........Word2 or Word2........Word1 was used !

I finally changed the range between the two boundaries, from {0,50} to {0,10} => Results OK, all cases.

So, Iona, I don’t know, exactly, why you got a difference !?

Could you, first, identify one of your files, which does not give the same results, in the two cases.
Then, send me an e-mail, with this attached file at :
Don’t forget to tell me about the real boundaries strings that you’re using, as Word1 and Word2 ! Thanks :-))

BTW, the absence of the initial and final blank lines doesn’t matter. It would be a problem, ONLY IF Word1 and/or Word2 were located at the very beginning and/or at the very end of the current file scanned !

Finally, at the current file location, reached by the regex engine, ( each character after another ! ) it tests the regex, in order to match, either, the range Word1.....Word2 OR the range Word2.....Word1. To be convinced, just copy this short part of the license.txt file, below, in a new tab :

1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and give any other recipients of the Program a copy of this License along with the Program.

You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee.

2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions:

Select, preferably, the option View > Word wrap, to split long lines

Now, using the Find dialog, and the two words the and a, of my previous post, in, for instance, the regex :

(?si)(?<=\W)(the)((?:\W+\w+){0,50}?\W+)(a)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)

OR

(?si)(?<=\W)(a)((?:\W+\w+){0,50}?\W+)(the)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)

If you click, successively, on the Find Next button, you’ll find 5 occurrences :

The first three ones are of the type the..........a
The last two ones are of the type a..........the
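
The claim that swapping the two words should not change the results can be checked with a small Python sketch. It is an illustration under assumptions : Python's re has no (?1) called subpatterns, so both orders are written out, and the helper name either_order is made up. At any position, only one branch can start ( the text begins either with the first word or with the second ), so both forms should find the same ranges.

```python
import re

def either_order(first, second, max_gap=50):
    """Shortest ranges first...second OR second...first."""
    f, s = re.escape(first), re.escape(second)
    gap = r"(?:\W+\w+){0,%d}?\W+" % max_gap
    return re.compile(
        r"(?si)(?<=\W)(?:%s%s%s|%s%s%s)(?=\W)" % (f, gap, s, s, gap, f)
    )

# One sentence of the license.txt excerpt above, padded with line-breaks.
text = (
    "\nYou may charge a fee for the physical act of transferring a copy, "
    "and you may at your option offer warranty protection in exchange "
    "for a fee.\n"
)

the_a = [m.group() for m in either_order("the", "a").finditer(text)]
a_the = [m.group() for m in either_order("a", "the").finditer(text)]

print(the_a, a_the)
```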

See you later,

Cheers,

guy038

Alan Kilborn

@guy038 said:

Maybe I’ll give you a detailed answer, after I solve the Iona problem :-)
In the meanwhile, Alan just try this simple S/R :

Sure, @guy038, the alternative regex you supplied accomplishes the goal, but the more interesting thing is why the regex I had does what it does within Notepad++…

guy038

Hello, @alan-kilborn,

The problem, Alan, that you raised, is due to the fact that the present N++ Boost regex engine, does NOT handle backward assertions properly !! It’s the case, for instance, for the syntaxes \A ( or (?-m)^ ), \b, \B, \< as well as some look-behind syntaxes :-((

Indeed, with your regex (?-m)^|\z and the sample text, below :

abc
def
ghi

When you hit the Find Next button, repeatedly, it matches, wrongly, 12 times, the zero-length string, instead of 2 times ( at the very beginning and at the very end of the file )

To get the right behaviour, like with the RegexBuddy software ( except for my work-around ), the ONLY way is to use the Francois-R Boyer regex code, which is a very good N++ implementation of the Boost regex library !

If you install the improved François-R Boyer version, of the BOOST regex engine, you’ll get some strong new regex features :

Search is performed in 32 bits code-points, so it can handle characters, over the BMP ( Basic Multilingual Plane ). An interesting feature for most Asiatic people !
It can manage NUL characters, both, in search and in replacement, too.
Look-behinds are correctly handled, even in case of OVERLAPPING, with the end of the previous match.
It can handle ALL the Universal Character Names ( UCN) of the UCS Transformation Format , from \x{0} to \x{7FFFFFFF}, particularly, all those of code-points over \x{FFFF}, which are outside the BMP.
The backward regex search isn’t stopped, on matching a character, with Unicode code-point over \x{00FF}
The case modifiers \u, \l, \U and \L, in replacement, do change any accented letter !!
Most of the time, the step by step replacement, often broken with the present version, works nicely with François-R Boyer version

Here is, below, a non-exhaustive list of issues, with the CURRENT regex engine, which DO NOT occur, with the François-R Boyer’s version :

Overlapping lookbehinds and matched strings are NOT correctly handled. For instance, giving the 20 characters subject string aaaabaaababbbaabbabb and SEARCH = (?<!a)ba*, we get 6 matches, but, unfortunately, 2 results are wrong. With the improved version of François, it’s all OK !
We can’t use the NUL character in replacement. For example, with the simple S/R : SEARCH = ABC and REPLACE = DEF\x00GHI, the result is the string DEF only :-(. The François’s version does insert the NUL character between the strings DEF and GHI !
BACKWARD assertions are NOT correctly supported. E.g. : SEARCH = \A. matches, successively, all the characters of the FIRST line. With the François’s version it only matches the FIRST character of the current file
It doesn’t search and replace characters, which are outside the Basic Multilingual Plane ( BMP ). For instance, in a full UTF-8 file ( with a BOM ), if SEARCH = \x{104A5}\x{20AC} and REPLACE = \x{A3}\x{10482}, the present regex engine answers Invalid regular expression !, whereas the François’s version does the replacement correctly ! ( Of course, your text font must handle these “Over BMP” characters )
Now, let’s suppose, for instance, the French subject string Un évènement, on a new line, and the simple SEARCH regex \w. After a click on the Find Next button, close the Replace dialog, and keep on searching some word characters, by hitting the F3 key. When you’re, about, at the end of the string, just go searching backwards, by hitting the SHIFT + F3 key. You’ll notice that it CAN’T go backwards, past the è character !!!. The François’s version works well, in both directions !
A last example : if you try to mark the matches of the simple SEARCH regex (?<=.)., the present regex engine marks any character, EVERY OTHER time. With the François’s version, it correctly finds all characters, except for the very first of each line !
With the François’s version, the S/R SEARCH = (.*) and REPLACE = \U\1\E does change any lower-case letter, even accented, into its associated upper-case letter !
François-R Boyer also created a new option SCFIND_REGEXP_LOCALEORDER, to get ranges of characters, in a locale order, NOT in Unicode order. For instance, the regex range (?-i)[A-B] would match all the following characters AÀÁÂÃÄÅĀĂĄǍǺẠẢẤẦẨẪẬẮẰẲẴẶǼB, in a true UTF-8 file, with a suitable font !
To end with, the François-R Boyer’s version could display the EXACT error messages, instead of the generic message Invalid regular expression. For instance, the regex (\d+ab would report the Unmatched marking parenthesis error message !
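
Three of the search examples from the list above can be cross-checked with Python's re module. This is only a reference sketch : Python is a different engine from both the stock N++ one and the François-R Boyer build, so it shows what the syntax is expected to match, not what either N++ build does.

```python
import re

# (?<!a)ba* on the 20-character subject string: each "b" that is NOT
# preceded by an "a", followed by any run of "a".
subject = "aaaabaaababbbaabbabb"
lookbehind_hits = [
    (m.start(), m.group()) for m in re.finditer(r"(?<!a)ba*", subject)
]

two_lines = "abc\ndef"

# \A. should match only the very first character of the whole text.
first_char = re.findall(r"\A.", two_lines)

# (?<=.). should match every character except the first one of each line.
after_first = re.findall(r"(?<=.).", two_lines)

print(lookbehind_hits)
print(first_char, after_first)
```

The four hits for (?<!a)ba* are consistent with the remark that 2 of the 6 results found by the current engine are wrong.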

VERY IMPORTANT : The Beta N++ regex code, of François-R Boyer, DOES NOT work with any version of N++ AFTER the 6.9.0 version ! So, just follow the few steps, below :

Download, first, the .zip archive of Notepad++ v6.9.0, from the link, below :

https://notepad-plus-plus.org/repository/6.x/6.9/npp.6.9.bin.zip

Extract all the contents, of this archive, in any folder, in order to get a local N++ installation :-))
Inside this new folder, rename the SciLexer.dll file as, for instance, SciLexer.690
Then, to get this Beta N++ regex code ( that has, BTW, NEVER been part of ANY official N++ release ) :
Download, from the link below, the modified SciLexer.dll file of François-R Boyer

http://sourceforge.net/projects/npppythonplugsq/files/Beta N%2B%2B regex code/

Copy this file, in the installation folder, along with the Notepad++.exe and the SciLexer.690 files
Download, too, the readme.txt file, for the infos
Restart Notepad++ and enjoy it !

Remark :

Remember that this modified SciLexer.dll file, built in May 2013, is, also, based on the old Scintilla version v2.2.7 !

Of course, I have been longing, ( for more than 3 years ! ), for this nice version to be fully integrated with, both, the latest version of N++ and Scintilla. Unfortunately, up to now, NO ONE seems interested in implementing the new regex code and, as my C++ skills are, unfortunately, rather, near 0, I’m just trying hard to be satisfied with the present bugged N++ regex code :-((

So, to conclude, in the meanwhile, I keep, in addition to the last N++ version, a local installation of N++ v6.9 , with the François-R Boyer modified version of SciLexer.dll, on my laptop, in order to see the correct search behaviour of most of the regexes and to perform, from time to time, special replacements ;-)

Best Regards,

guy038

Alan Kilborn

@guy038

Okay, @guy038…I guess I didn’t expect it to go THERE. :-D

I vaguely remember something about this “Boyer” thing before, probably in previous posts by you.

So…how do we bring Francois R Boyer out of retirement to get a new version of this that will work with a newer Notepad++? :-D I guess that isn’t an option or it would have been done already.

So…I have a pretty good C++ (15 years?) background. My first thought was that maybe I could do something. So last night I started looking at that code. I don’t have trouble with the C++ part, but I have problems with seeing how that code “fits in”. A usual approach for something like this is to use Beyond Compare to do some code differencing to see the changes from one version to another, then take those changes and attempt to apply them to, in this case, a codebase that has evolved (i.e., the current Notepad++ codebase). Unfortunately, in this case, it appears difficult to get a solid starting point.

Maybe others are more familiar with the background of this and are considering working on it? Or would be willing to help me figure out where it stands and how to take it forward?

It seems a pity that Notepad++'s regex engine is known to be deficient and no one cares to make it better.

Claudia Frank

@guy038

Unfortunately, up to now, NO ONE seems interested in implementing the new regex code

You wanna make me feel guilty, don’t you ;-)
I’m sorry, I totally forgot that I wanted to take a look and give it a try.
I can say, I gave it a try but failed terribly - but now I do have an excuse - no windows anymore :-)

Cheers
Claudia

Scott Sumner

So I am interested in helping out if I somehow can. I have a good deal of C++ experience, although I took a very quick look at the referenced code and sort of feel the same way about it that @Alan-Kilborn does… I guess @guy038 would have to “drive the bus” on this effort? Not saying he needs C++ experience to do this; some of my best managers ever didn’t understand code at all! Managers facilitate, and have a good understanding of the problem, and there is nobody better on “Notepad++ plus regex” than @guy038

And @Claudia-Frank , I have zero feeling that @guy038 was taking a shot at you! Just not in his nature! (I know you know this).
:)

guy038

Hello @scott-sumner and @claudia-frank,

Oh no, Claudia, please, don’t feel guilty at all ! Scott is totally right about this : I don’t want to “put pressure” on anyone ! I just would like to suggest the nice N++ improvement it would be, to use a stable regex engine implementation, with new features and without some unacceptable bugs of the current version :-))

On the contrary, Claudia, it’s rather me who should feel guilty for having neglected, for a long time, the testing of your Python script RegexTesterPro :-((

So, cool ! We, all, have plenty of things to do, all day long and, as the common proverb says, tomorrow is another day !

Cheers,

guy038

Vasile Caraus

hello friends, and Happy Easter !
Just read this topic, and guy038, I don’t understand 2 things on those regex of yours, like this one:

(?si)(?<=\W)(Word1)((?:\W+\w+){0,50}\W+)(Word2)(?=\W)|(?<=\W)(?3)(?2)(?1)(?=\W)

You forgot to mention what the second part of your regex, (?<=\W)(?3)(?2)(?1)(?=\W), means ?

You already know me, I ask a lot of question, to learn more. And if I don’t understand, I ask.