Regex: search the nearest words at a maximum distance of 6 words

Vasile Caraus

Hello again. I want to FIND THE NEAREST WORDS at a maximum distance of 6 words. For example:

bla bla WORD_A blah bla blah bla bla blah WORD_B blah blah…

So, between the two words there can be 6 other words, or seven, etc., depending on what I want.

Scott Sumner

What have you tried?

guy038

Hi, Vasile,

As a search cannot produce a discontinuous selection, I simply select the whole range of characters between WORD_A and WORD_B. That way, it’s quite easy to notice both the gap of six words and the two delimiter words !

So I propose four simple regexes :

(?<=WORD_A)(\h+\w+){6}\h+(?=WORD_B) matches the gap, between the two words, WORD_A / WORD_B, separated by 6 words, exactly
(?<=WORD_A)(\h+\w+){0,6}\h+(?=WORD_B) matches the gap, between the two words, WORD_A / WORD_B, separated by a maximum of 6 words
(?<=WORD_A)(\h+\w+){6,}\h+(?=WORD_B) matches the gap, between the two words, WORD_A / WORD_B, separated by a minimum of 6 words
(?<=WORD_A)(\h+\w+){6,12}\h+(?=WORD_B) matches the gap, between the two words, WORD_A / WORD_B, separated by a minimum of 6 words and a maximum of 12 words

Of course, you may change the different numbers, by any other value !
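
For readers who want to try these outside Notepad++, here is a minimal Python sketch of the four gap regexes. It is an adaptation, not the exact syntax above: Python's re module does not support \h, so [ \t] stands in for it (an assumption that covers only spaces and tabs). The helper names gap_regex and words_in_gap and the two sample strings are hypothetical.

```python
import re

def gap_regex(quantifier):
    # [ \t] stands in for \h, which Python's re module does not support.
    return re.compile(r"(?<=WORD_A)([ \t]+\w+)" + quantifier + r"[ \t]+(?=WORD_B)")

def words_in_gap(text, quantifier):
    # Return the list of words found in the gap, or None when nothing matches.
    m = gap_regex(quantifier).search(text)
    return m.group(0).split() if m else None

# Hypothetical sample strings with 6 and 8 words between the boundaries.
six = "bla WORD_A w1 w2 w3 w4 w5 w6 WORD_B bla"
eight = "bla WORD_A w1 w2 w3 w4 w5 w6 w7 w8 WORD_B bla"

print(words_in_gap(six, "{6}"))       # exactly 6 words
print(words_in_gap(six, "{0,6}"))     # at most 6 words
print(words_in_gap(eight, "{0,6}"))   # 8 words is too many, so no match
print(words_in_gap(eight, "{6,}"))    # at least 6 words
print(words_in_gap(eight, "{6,12}"))  # between 6 and 12 words
```

Only the quantifier changes between the four variants, which is why it is passed in as a string here.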

Cheers,

guy038

Vasile Caraus

Hello Guy, it works, but what about the case where I have 2 or more instances of WORD_A, and several instances of WORD_B? For example:

bla bla WORD_A_1 WORD_A_2 blah bla blah bla bla blah WORD_B_1 WORD_B_2 blah blah…

guy038

Hi Vasile,

If we consider the sample text, below :

bla bla WORD_A WORD_A WORD_A WORD_A WORD_A blah bla blah bla bla blah WORD_B WORD_B WORD_B WORD_B WORD_B blah blah…

where the boundaries ( WORD_A and WORD_B ) are repeated, and starting from the first of the four regexes described in my previous post, I would use :

(?<=WORD_A)(\h+(?!WORD_A|WORD_B)\w+){6}\h+(?=WORD_B)

If the quantifier is changed, for instance, into {5,9} or {0,7} or {4,}, the regex will still match, as each of these ranges includes the value 6 !

On the contrary, if you choose, for instance, the quantifiers {2,5} or {8} or {7,} or {0,2}, it will NOT match anything in the subject sentence !

Note :

The syntax (?!WORD_A|WORD_B), placed just before the \w+ regex ( = a single word ), is a negative look-ahead. It ensures that none of the six words found between the two boundaries begins with WORD_A or WORD_B !
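
The effect of the different quantifiers on the sample sentence can be checked with a short Python sketch. As an assumption, [ \t] replaces \h, which Python's re module does not support, and find_gap is a hypothetical helper name.

```python
import re

text = ("bla bla WORD_A WORD_A WORD_A WORD_A WORD_A blah bla blah bla bla blah "
        "WORD_B WORD_B WORD_B WORD_B WORD_B blah blah")

def find_gap(quantifier):
    # [ \t] replaces \h (not available in Python's re module).
    # The negative look-ahead rejects any gap word starting with a boundary.
    pattern = (r"(?<=WORD_A)([ \t]+(?!WORD_A|WORD_B)\w+)" + quantifier
               + r"[ \t]+(?=WORD_B)")
    m = re.search(pattern, text)
    return m.group(0).strip() if m else None

# The first four ranges include 6 and match; the last four do not.
for q in ("{6}", "{5,9}", "{0,7}", "{4,}", "{2,5}", "{8}", "{7,}", "{0,2}"):
    print(q, "->", find_gap(q))
```

Because a gap word may not start with a boundary, only the last WORD_A and the first WORD_B can delimit a match, so exactly six words are always selected when the regex matches at all.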

Cheers,

guy038

Vasile Caraus

I found a similar formula that works fine. In this case, WORD_A and WORD_B can each be a group of words, not just single words.

\bWORD_A\W+(?:\w+\W+){6}?WORD_B\b matches the whole range, from WORD_A to WORD_B included, when the two are separated by 6 words, exactly

\bWORD_A\W+(?:\w+\W+){0,6}?WORD_B\b matches the whole range, from WORD_A to WORD_B included, when the two are separated by a maximum of 6 words

\bWORD_A\W+(?:\w+\W+){6,}?WORD_B\b matches the whole range, from WORD_A to WORD_B included, when the two are separated by a minimum of 6 words

\bWORD_A\W+(?:\w+\W+){6,12}?WORD_B\b matches the whole range, from WORD_A to WORD_B included, when the two are separated by a minimum of 6 words and a maximum of 12

guy038

Hello, Vasile,

I would like to raise four points :

1) You’re right about adding the \b assertion, which ensures that the starting boundary is the beginning of a word and the ending boundary is an end of a word ! BTW, the expressions WORD_A and WORD_B should be renamed STRING_A and STRING_B :-))
2) It’s also safer to use the syntax \W+ which stands for any NON-word character(s) between words, instead of my syntax \h+ which, only, refers to horizontal blank characters ( Just think about the simple string “Word1—Word2” ! )

So, thanks to the \b assertions and the NON-word characters \W+, the matched expressions STRING_A and STRING_B are necessarily whole words :-))

3) Seemingly, you prefer to include the two boundaries STRING_A and STRING_B in the selection. I also noticed that you use a non-capturing group : for a length of six words, it probably makes no difference. But, generally speaking, it’s a good practice, as the regex engine does not need to store the value of the group. This can increase the S/R speed, significantly, in some cases :-))
4) Finally, your regex contains a lazy quantifier {...}?, to be sure that you’ll always get the shortest string which satisfies the whole regex, when using the {n,} or {n,m} quantifiers. However, the regex given in my previous post is not affected by laziness or greediness ! Indeed, as the words between STRING_A and STRING_B cannot be the boundaries themselves, the two syntaxes give the same results :-))

Therefore, combining your syntax with the negative look-ahead that I proposed in my last post finally gives the four new regexes, below :

\bSTRING_A\W+(?:(?!STRING_A|STRING_B)\w+\W+){6}STRING_B\b
\bSTRING_A\W+(?:(?!STRING_A|STRING_B)\w+\W+){0,6}STRING_B\b
\bSTRING_A\W+(?:(?!STRING_A|STRING_B)\w+\W+){6,}STRING_B\b
\bSTRING_A\W+(?:(?!STRING_A|STRING_B)\w+\W+){6,12}STRING_B\b
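
These four regexes use only \b, \w and \W, so they can be tried unchanged with Python's re module. The sketch below reports which numbers of in-between words each one accepts. The helpers span_regex, sample and matching_counts, and the generated test lines, are hypothetical.

```python
import re

def span_regex(quantifier):
    return re.compile(
        r"\bSTRING_A\W+(?:(?!STRING_A|STRING_B)\w+\W+)" + quantifier + r"STRING_B\b"
    )

def sample(n):
    # Hypothetical test line: repeated boundaries around n filler words,
    # with a comma and a dash among the separators to exercise \W+.
    fillers = " ".join("w%d" % i for i in range(1, n + 1))
    return "x STRING_A STRING_A, " + fillers + " - STRING_B STRING_B y"

def matching_counts(quantifier, limit=15):
    # Filler counts (0 .. limit-1) for which the regex finds a match.
    return [n for n in range(limit) if span_regex(quantifier).search(sample(n))]

for q in ("{6}", "{0,6}", "{6,}", "{6,12}"):
    print(q, "accepts", matching_counts(q))
```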

So, depending on the actual text scanned, either your regexes or my version will be preferable ! To get an idea of the differences in behaviour of these regexes, let’s consider the sample text, of 9 lines, below :

STRING_A STRING_A STRING_A STRING_A dfsdf sdfsdf dfgdfg xcvwfv xcvxcv STRING_B STRING_B STRING_B STRING_B STRING_B
STRING_A STRING_A STRING_A STRING_A dfsdf sdfsdf dfgdfg xcvwfv xcvxcv tyutyu STRING_B STRING_B STRING_B STRING_B STRING_B
STRING_A STRING_A STRING_A STRING_A dfsdf sdfsdf dfgdfg xcvwfv xcvxcv tyutyu vbcvbcv STRING_B STRING_B STRING_B STRING_B STRING_B
STRING_A STRING_A STRING_A STRING_A dfsdf sdfsdf dfgdfg xcvwfv xcvxcv tyutyu vbcvbcv ytyutyu STRING_B STRING_B STRING_B STRING_B STRING_B
STRING_A STRING_A STRING_A STRING_A dfsdf sdfsdf dfgdfg xcvwfv xcvxcv tyutyu vbcvbcv ytyutyu ozerdfj STRING_B STRING_B STRING_B STRING_B STRING_B
STRING_A STRING_A STRING_A STRING_A dfsdf sdfsdf dfgdfg xcvwfv xcvxcv tyutyu vbcvbcv ytyutyu ozerdfj dsfjqsd STRING_B STRING_B STRING_B STRING_B STRING_B
STRING_A STRING_A STRING_A STRING_A dfsdf sdfsdf dfgdfg xcvwfv xcvxcv tyutyu vbcvbcv ytyutyu ozerdfj dsfjqsd xcvuo STRING_B STRING_B STRING_B STRING_B STRING_B
STRING_A STRING_A STRING_A STRING_A dfsdf sdfsdf dfgdfg xcvwfv xcvxcv tyutyu vbcvbcv ytyutyu ozerdfj dsfjqsd xcvuo eroze STRING_B STRING_B STRING_B STRING_B STRING_B
STRING_A STRING_A STRING_A STRING_A dfsdf sdfsdf dfgdfg xcvwfv xcvxcv tyutyu vbcvbcv ytyutyu ozerdfj dsfjqsd xcvuo eroze dfodf STRING_B STRING_B STRING_B STRING_B STRING_B

and apply, successively, the fourth case ( the one with the {n,m} quantifier ) of your regex and of mine, with either the lazy or the greedy quantifier. That is to say, the four regexes :

\bSTRING_A\W+(?:\w+\W+){6,12}?STRING_B\b , with the lazy quantifier {6,12}?
\bSTRING_A\W+(?:\w+\W+){6,12}STRING_B\b , with the greedy quantifier {6,12}
\bSTRING_A\W+(?:(?!STRING_A|STRING_B)\w+\W+){6,12}?STRING_B\b , with the lazy quantifier {6,12}?
\bSTRING_A\W+(?:(?!STRING_A|STRING_B)\w+\W+){6,12}STRING_B\b , with the greedy quantifier {6,12}

Observe the different ranges of the selection, as well as the beginning and the end of each selection ! As for my two regexes, they give exactly the same results, due to the negative look-ahead feature !
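
The same comparison can be sketched in Python on the second line of the sample text ( the one with six words between the boundaries ). The names PLAIN, GUARDED and first_match are hypothetical, and the size of each selection is reported as a word count.

```python
import re

# Second line of the sample text: 4 x STRING_A, 6 other words, 5 x STRING_B.
line = ("STRING_A STRING_A STRING_A STRING_A "
        "dfsdf sdfsdf dfgdfg xcvwfv xcvxcv tyutyu "
        "STRING_B STRING_B STRING_B STRING_B STRING_B")

# %s is replaced by "?" (lazy) or by nothing (greedy).
PLAIN = r"\bSTRING_A\W+(?:\w+\W+){6,12}%sSTRING_B\b"
GUARDED = r"\bSTRING_A\W+(?:(?!STRING_A|STRING_B)\w+\W+){6,12}%sSTRING_B\b"

def first_match(template, lazy):
    # Return the text of the first match, or None when nothing matches.
    m = re.search(template % ("?" if lazy else ""), line)
    return m.group(0) if m else None

for name, template in (("plain", PLAIN), ("guarded", GUARDED)):
    for lazy in (True, False):
        found = first_match(template, lazy)
        print(name, "lazy" if lazy else "greedy", "->", len(found.split()), "words")
```

Without the look-ahead, the selection starts at the first STRING_A and ends at the first STRING_B ( lazy ) or at the fourth one ( greedy ). With the look-ahead, both variants select the same range, from the last STRING_A to the first STRING_B.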

Cheers,

guy038

Vasile Caraus

hello Guy. Yes, works. But I don’t understand what is the difference between “lazy quantifier” and “greedy quantifier” ?

Scott Sumner

Google is a useful tool, you’d be surprised at the amount of information you can obtain from it. For instance, here’s something on the topic that I quickly found that explains it pretty well:
http://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions

guy038

Hi Vasile,

Just consider the simple string “Vasile Caraus”. You won’t forget it, will you !!! Then :

The regex s.+a, with the greedy quantifier +, would match the string sile Cara
The regex s.+?a, with the lazy quantifier +?, would match the string sile Ca
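
These two searches behave the same way in Python's re module, so the difference can be checked with a minimal sketch ( the variable names are only illustrative ) :

```python
import re

name = "Vasile Caraus"

greedy = re.search(r"s.+a", name).group(0)  # .+ runs on to the LAST "a"
lazy = re.search(r"s.+?a", name).group(0)   # .+? stops at the FIRST "a" it can

print(greedy)  # sile Cara
print(lazy)    # sile Ca
```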

More seriously, let’s imagine this HTML code, split on four lines, below :

<td width="80">
  <a href="javascript:doit%20('Act_V_Next',1,1)"><font face="arial, verdana" size="1" color="#006699">
  <b>Suivant&gt;&gt;</b></font></a>
</td>

Type in, firstly, the regex <.+>, and click, several times, on the Find Next button…
Type in, secondly, the regex <.+?>, and click, several times, on the Find Next button…

In the second case, you always get individual correct tags !

In order to get the same behaviour, without a lazy quantifier, you could use the regex <[^>]+>. According to the page below :

http://www.regular-expressions.info/repeat.html

this third solution is even better, as the negated character class spares the regex engine most of the backtracking that the lazy quantifier requires !! Please read, in particular, the two sections “Laziness Instead of Greediness” and “An Alternative to Laziness”
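
The three regexes above can be compared on the same HTML code with a short Python sketch. It assumes that the dot does not match line breaks, which is the default in Python's re module; the variable names are only illustrative.

```python
import re

html = '''<td width="80">
  <a href="javascript:doit%20('Act_V_Next',1,1)"><font face="arial, verdana" size="1" color="#006699">
  <b>Suivant&gt;&gt;</b></font></a>
</td>'''

greedy = re.findall(r"<.+>", html)      # from the first < to the last > of each line
lazy = re.findall(r"<.+?>", html)       # one tag at a time
negated = re.findall(r"<[^>]+>", html)  # one tag at a time, no lazy quantifier

print(len(greedy), "greedy matches")
print(len(lazy), "lazy matches")
print(lazy == negated)
for tag in negated:
    print(tag)
```

On this sample, the greedy regex returns one match per line, whereas the lazy regex and the negated character class both return the eight individual tags.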

Look, also, at the chapter, section “How Possessive Quantifiers Work” , at :

http://www.regular-expressions.info/possessive.html

So, in short :

Add the ? meta-character, AFTER a greedy quantifier ( Default case ) to make this quantifier lazy
Add the + meta-character, AFTER a greedy quantifier ( Default case ) to make this quantifier possessive

And :

A greedy quantifier, first, tries to repeat the token as MANY times as possible, and gradually GIVES UP matches, as the engine BACKTRACKS, to find an OVERALL match.
A lazy quantifier, first, repeats the token as FEW times as required, and gradually EXPANDS the match, as the engine BACKTRACKS through the regex, to find an OVERALL match.
A possessive quantifier, first, tries to repeat the token as MANY times as possible , and :
- IF the REMAINDER, of the regex, can be matched => An OVERALL match is found
- IF the REMAINDER, of the regex, CANNOT be matched => The match attempt fails IMMEDIATELY, without trying any BACKTRACKING step, in order to get an OVERALL match !
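
The three behaviours can be observed on the tiny string aaa with a Python sketch. One assumption to note: instead of writing the possessive regex a++a directly ( which, as far as I know, Python only accepts from version 3.11 onwards ), the sketch emulates it with a look-ahead and a back-reference, a common substitute that the engine cannot backtrack into.

```python
import re

text = "aaa"

# Greedy: a+ first grabs "aaa", then gives one "a" back for the final "a".
greedy = re.search(r"a+a", text)

# Lazy: a+? first takes a single "a", and the final "a" matches right after it.
lazy = re.search(r"a+?a", text)

# Possessive (emulated): the look-ahead captures the longest run of "a",
# \1 consumes it, and nothing is ever given back, so the final "a" cannot match.
possessive = re.search(r"(?=(a+))\1a", text)

print(greedy.group(0))  # aaa
print(lazy.group(0))    # aa
print(possessive)       # None
```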

Best Regards,

guy038

P.S. :

When I first tried to post this reply, Askinet told me that it could not be added, as it was considered to be SPAM :-((

Luckily, I could get it to work, by adding, little by little, a few lines of my original reply and clicking, each time, on the blue Submit button !

I finally succeeded in posting my complete original reply ! At first, I thought that the problem came from the special characters, the links or the HTML code. But, as the present reply seems identical to my original one, except that it was built up in several steps, I don’t see what disturbed the Askinet site and led to the SPAM declaration !

Anyway, if such a problem occurs to you, just try to split your reply in some parts, putting them one after another, on our forum ! Click on the vertical three dots symbol, on the right part of the screen and choose the Edit option

Vasile Caraus

Nice answer, thanks for replying to me every time, Guy.

Bipulkumarsingh

But I want to ask: what if I need to check that they are NOT near each other, within the range {6,12}?