    Regex: search the nearest words at a maximum distance of 6 words

    • Scott Sumner

      What have you tried?

      • guy038

        Hi, Vasile,

        Since a search cannot produce a discontiguous selection, I simply select the whole range of characters between WORD_A and WORD_B. This makes it easy to see both the six-word gap and the two delimiter words!

        So I propose four simple regexes :

        • (?<=WORD_A)(\h+\w+){6}\h+(?=WORD_B) matches the gap, between the two words, WORD_A / WORD_B, separated by 6 words, exactly

        • (?<=WORD_A)(\h+\w+){0,6}\h+(?=WORD_B) matches the gap, between the two words, WORD_A / WORD_B, separated by a maximum of 6 words

        • (?<=WORD_A)(\h+\w+){6,}\h+(?=WORD_B) matches the gap, between the two words, WORD_A / WORD_B, separated by a minimum of 6 words

        • (?<=WORD_A)(\h+\w+){6,12}\h+(?=WORD_B) matches the gap, between the two words, WORD_A / WORD_B, separated by a minimum of 6 words and a maximum of 12 words

        Of course, you may replace these numbers with any other values!
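The four quantifier variants above can be tried outside Notepad++ with a small sketch. It assumes Python's `re` module as a stand-in for Notepad++'s regex engine; since `re` has no `\h`, the character class `[ \t]` is substituted for it. The helper name `gap_pattern` and the sample sentence are hypothetical, not taken from the thread.

```python
import re

def gap_pattern(quantifier):
    """Build the gap regex for a quantifier such as '{6}' or '{0,6}'."""
    # [ \t]+ replaces \h+ because Python's re module does not support \h.
    return re.compile(
        r"(?<=WORD_A)([ \t]+\w+)" + quantifier + r"[ \t]+(?=WORD_B)"
    )

# Hypothetical sample: exactly six words between the two boundaries.
TEXT = "bla WORD_A one two three four five six WORD_B bla"

exactly_six = gap_pattern("{6}").search(TEXT)
at_most_six = gap_pattern("{0,6}").search(TEXT)
at_least_seven = gap_pattern("{7,}").search(TEXT)

print(repr(exactly_six.group(0)))   # the gap only, boundaries excluded
print(repr(at_most_six.group(0)))
print(at_least_seven)               # no match: only six words in the gap
```

Because of the look-behind and look-ahead, only the gap is matched; the two boundary words stay outside the match.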

        Cheers,

        guy038

        • Vasile Caraus

          Hello Guy, it works, but what about the case where I have two or more instances of WORD_A and several instances of WORD_B? For example:

          bla bla WORD_A_1 WORD_A_2 blah bla blah bla bla blah WORD_B_1 WORD_B_1 blah blah…

          • guy038

            Hi Vasile,

            If we consider the sample text, below :

            bla bla WORD_A WORD_A WORD_A WORD_A WORD_A blah bla blah bla bla blah WORD_B WORD_B WORD_B WORD_B WORD_B blah blah…
            

            where the boundaries (WORD_A and WORD_B) are repeated, and starting from the first of the four regexes described in my previous post, I would use:

            (?<=WORD_A)(\h+(?!WORD_A|WORD_B)\w+){6}\h+(?=WORD_B)

            If the quantifier is changed to, for instance, {5,9} or {0,7} or {4,}, the regex still matches, as each of these ranges includes the value 6!

            On the contrary, if you choose, for instance, the quantifiers {2,5} or {8} or {7,} or {0,2}, it will NOT match anything in the sample text!

            Note :

            • The syntax (?!WORD_A|WORD_B), placed just before the \w+ regex ( = a single word ), is a negative look-ahead. It ensures that none of the six words found between the two boundaries begins with WORD_A or WORD_B, so a boundary can never be counted as one of the gap words!
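The quantifier claims above can be checked with a short sketch. It assumes Python's `re` module in place of Notepad++'s engine and, because `re` lacks `\h`, uses `[ \t]` instead. The helper name `guarded_gap` is hypothetical, and the trailing ellipsis of the sample text is dropped.

```python
import re

# Sample text from the post: five WORD_A, six ordinary words, five WORD_B.
SAMPLE = ("bla bla WORD_A WORD_A WORD_A WORD_A WORD_A "
          "blah bla blah bla bla blah "
          "WORD_B WORD_B WORD_B WORD_B WORD_B blah blah")

def guarded_gap(quantifier):
    """Gap regex whose words may not start with either boundary."""
    # [ \t]+ replaces \h+ because Python's re module does not support \h.
    return re.compile(
        r"(?<=WORD_A)([ \t]+(?!WORD_A|WORD_B)\w+)" + quantifier
        + r"[ \t]+(?=WORD_B)"
    )

# Quantifiers whose range includes 6 should match, the others should not.
matching = [q for q in ("{6}", "{5,9}", "{0,7}", "{4,}")
            if guarded_gap(q).search(SAMPLE)]
failing = [q for q in ("{2,5}", "{8}", "{7,}", "{0,2}")
           if guarded_gap(q).search(SAMPLE) is None]

print("match:", matching)
print("no match:", failing)
print(repr(guarded_gap("{6}").search(SAMPLE).group(0)))
```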

            Cheers,

            guy038

            • Vasile Caraus

              I found a similar formula that works fine. In this case, WORD_A and WORD_B can each be a group of words, not just a single word.

              \bWORD_A\W+(?:\w+\W+){6}?WORD_B\b matches the text from WORD_A to WORD_B, boundaries included, when the two are separated by exactly 6 words

              \bWORD_A\W+(?:\w+\W+){0,6}?WORD_B\b matches the text from WORD_A to WORD_B, boundaries included, when the two are separated by a maximum of 6 words

              \bWORD_A\W+(?:\w+\W+){6,}?WORD_B\b matches the text from WORD_A to WORD_B, boundaries included, when the two are separated by a minimum of 6 words

              \bWORD_A\W+(?:\w+\W+){6,12}?WORD_B\b matches the text from WORD_A to WORD_B, boundaries included, when the two are separated by a minimum of 6 words and a maximum of 12

              • guy038

                Hello, Vasile,

                I would like to raise four points :

                • 1) You’re right about adding the \b assertion, which ensures that the starting boundary is the beginning of a word and the ending boundary is an end of a word ! BTW, the expressions WORD_A and WORD_B should be renamed STRING_A and STRING_B :-))

                • 2) It’s also safer to use the syntax \W+ which stands for any NON-word character(s) between words, instead of my syntax \h+ which, only, refers to horizontal blank characters ( Just think about the simple string “Word1—Word2” ! )

                So, thanks to the \b assertions and the \W+ NON-word characters, the matched expressions STRING_A and STRING_B necessarily begin and end on word boundaries :-))

                • 3) Seemingly, you prefer to include the two boundaries STRING_A and STRING_B in the selection. I also noticed that you use a non-capturing group: for a six-word length, it probably makes no noticeable difference. But, generally speaking, it’s good practice, as the regex engine does not need to store the value of the group. This can increase the search/replace speed significantly, in some cases :-))

                • 4) Finally, your regex contains a lazy quantifier {...}?, to be sure that you’ll always get the shortest string that satisfies the whole regex, when using the {n,} or {n,m} quantifiers. However, the regex given in my previous post is not affected by laziness or greediness! Indeed, as the words between STRING_A and STRING_B cannot be the boundaries themselves, the two syntaxes give the same results :-))

                Therefore, combining your syntax with the negative look-ahead that I proposed in my last post finally gives the four new regexes below:

                • \bSTRING_A\W+(?:(?!STRING_A|STRING_B)\w+\W+){6}STRING_B\b

                • \bSTRING_A\W+(?:(?!STRING_A|STRING_B)\w+\W+){0,6}STRING_B\b

                • \bSTRING_A\W+(?:(?!STRING_A|STRING_B)\w+\W+){6,}STRING_B\b

                • \bSTRING_A\W+(?:(?!STRING_A|STRING_B)\w+\W+){6,12}STRING_B\b


                So, depending on the actual text scanned, either your regexes or my version will be preferable! To get an idea of the differences in behaviour between these regexes, let’s consider the sample text of 9 lines below:

                STRING_A STRING_A STRING_A STRING_A dfsdf sdfsdf dfgdfg xcvwfv xcvxcv STRING_B STRING_B STRING_B STRING_B STRING_B
                STRING_A STRING_A STRING_A STRING_A dfsdf sdfsdf dfgdfg xcvwfv xcvxcv tyutyu STRING_B STRING_B STRING_B STRING_B STRING_B
                STRING_A STRING_A STRING_A STRING_A dfsdf sdfsdf dfgdfg xcvwfv xcvxcv tyutyu vbcvbcv STRING_B STRING_B STRING_B STRING_B STRING_B
                STRING_A STRING_A STRING_A STRING_A dfsdf sdfsdf dfgdfg xcvwfv xcvxcv tyutyu vbcvbcv ytyutyu STRING_B STRING_B STRING_B STRING_B STRING_B
                STRING_A STRING_A STRING_A STRING_A dfsdf sdfsdf dfgdfg xcvwfv xcvxcv tyutyu vbcvbcv ytyutyu ozerdfj STRING_B STRING_B STRING_B STRING_B STRING_B
                STRING_A STRING_A STRING_A STRING_A dfsdf sdfsdf dfgdfg xcvwfv xcvxcv tyutyu vbcvbcv ytyutyu ozerdfj dsfjqsd STRING_B STRING_B STRING_B STRING_B STRING_B
                STRING_A STRING_A STRING_A STRING_A dfsdf sdfsdf dfgdfg xcvwfv xcvxcv tyutyu vbcvbcv ytyutyu ozerdfj dsfjqsd xcvuo STRING_B STRING_B STRING_B STRING_B STRING_B
                STRING_A STRING_A STRING_A STRING_A dfsdf sdfsdf dfgdfg xcvwfv xcvxcv tyutyu vbcvbcv ytyutyu ozerdfj dsfjqsd xcvuo eroze STRING_B STRING_B STRING_B STRING_B STRING_B
                STRING_A STRING_A STRING_A STRING_A dfsdf sdfsdf dfgdfg xcvwfv xcvxcv tyutyu vbcvbcv ytyutyu ozerdfj dsfjqsd xcvuo eroze dfodf STRING_B STRING_B STRING_B STRING_B STRING_B
                

                and apply, successively, the fourth case ( the one with the {n,m} quantifier ) of your regex and of mine, with either the lazy or the greedy quantifier. That is to say, the four regexes:

                • \bSTRING_A\W+(?:\w+\W+){6,12}?STRING_B\b , with the lazy quantifier {6,12}?

                • \bSTRING_A\W+(?:\w+\W+){6,12}STRING_B\b , with the greedy quantifier {6,12}

                • \bSTRING_A\W+(?:(?!STRING_A|STRING_B)\w+\W+){6,12}?STRING_B\b , with the lazy quantifier {6,12}?

                • \bSTRING_A\W+(?:(?!STRING_A|STRING_B)\w+\W+){6,12}STRING_B\b , with the greedy quantifier {6,12}

                Observe the different ranges of the selection, as well as the beginning and the end of each selection ! As for my two regexes, they give exactly the same results, due to the negative look-ahead feature !
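As a rough illustration of the comparison above, the sketch below runs the four regexes against the second line of the sample text only (the one with six ordinary words). It assumes Python's `re` module behaves like Notepad++'s engine for these constructs; the dictionary keys and the helper `first_match` are hypothetical names.

```python
import re

# Second line of the sample: 4 x STRING_A, six ordinary words, 5 x STRING_B.
LINE = ("STRING_A STRING_A STRING_A STRING_A "
        "dfsdf sdfsdf dfgdfg xcvwfv xcvxcv tyutyu "
        "STRING_B STRING_B STRING_B STRING_B STRING_B")

PATTERNS = {
    "plain, lazy": r"\bSTRING_A\W+(?:\w+\W+){6,12}?STRING_B\b",
    "plain, greedy": r"\bSTRING_A\W+(?:\w+\W+){6,12}STRING_B\b",
    "guarded, lazy":
        r"\bSTRING_A\W+(?:(?!STRING_A|STRING_B)\w+\W+){6,12}?STRING_B\b",
    "guarded, greedy":
        r"\bSTRING_A\W+(?:(?!STRING_A|STRING_B)\w+\W+){6,12}STRING_B\b",
}

def first_match(pattern, text=LINE):
    """Return the text of the first match, or None if there is none."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

results = {name: first_match(p) for name, p in PATTERNS.items()}
for name, matched in results.items():
    print(f"{name:16} -> {matched}")
```

On this line, the plain regexes start at the first STRING_A and differ in where they stop, whereas both guarded regexes select only the last STRING_A, the six words and the first STRING_B.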

                Cheers,

                guy038

                • Vasile Caraus

                  Hello Guy. Yes, it works. But I don’t understand the difference between a “lazy quantifier” and a “greedy quantifier”.

                  • Scott Sumner

                    Google is a useful tool, you’d be surprised at the amount of information you can obtain from it. For instance, here’s something on the topic that I quickly found that explains it pretty well:
                    http://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions

                    • guy038

                      Hi Vasile,

                      Just consider the simple string “Vasile Caraus”. You won’t forget it, will you!!! Then:

                      • The regex s.+a, with the greedy quantifier +, would match the string sile Cara

                      • The regex s.+?a, with the lazy quantifier +?, would match the string sile Ca
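The two matches above can be reproduced with a minimal sketch, assuming Python's `re` module treats greedy and lazy quantifiers the same way as Notepad++'s engine:

```python
import re

NAME = "Vasile Caraus"

# Greedy: .+ runs to the end of the string, then backs up to the LAST 'a'.
greedy = re.search(r"s.+a", NAME).group(0)

# Lazy: .+? takes one character, then grows only until the FIRST 'a' fits.
lazy = re.search(r"s.+?a", NAME).group(0)

print(greedy)  # sile Cara
print(lazy)    # sile Ca
```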


                      More seriously, let’s imagine this HTML code, split on four lines, below :

                      <td width="80">
                        <a href="javascript:doit%20('Act_V_Next',1,1)"><font face="arial, verdana" size="1" color="#006699">
                        <b>Suivant&gt;&gt;</b></font></a>
                      </td>
                      
                      • Type in, firstly, the regex <.+>, and click, several times, on the Find Next button…

                      • Type in, secondly, the regex <.+?>, and click, several times, on the Find Next button…

                      In the second case, you always get individual correct tags !

                      In order to get the same behaviour without a lazy quantifier, you could use the regex <[^>]+>. According to the page below:

                      http://www.regular-expressions.info/repeat.html

                      this third solution is even better, as the negated character class lets the regex engine find each well-formed tag without any backtracking!! Please read, in particular, the two sections “Laziness Instead of Greediness” and “An Alternative to Laziness”.

                      Have a look, also, at the section “How Possessive Quantifiers Work”, at:

                      http://www.regular-expressions.info/possessive.html
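The three tag regexes discussed above can be compared on the HTML sample with a short sketch. It assumes Python's `re` module, where `.` does not match a line break by default (comparable to leaving Notepad++'s ". matches newline" option unchecked), and uses `re.findall` in place of repeated clicks on Find Next.

```python
import re

# The four-line HTML sample from the post.
HTML = """<td width="80">
  <a href="javascript:doit%20('Act_V_Next',1,1)"><font face="arial, verdana" size="1" color="#006699">
  <b>Suivant&gt;&gt;</b></font></a>
</td>"""

greedy = re.findall(r"<.+>", HTML)      # from first '<' to LAST '>' per line
lazy = re.findall(r"<.+?>", HTML)       # stops at the first '>' it can use
negated = re.findall(r"<[^>]+>", HTML)  # same tags, no lazy quantifier

print(len(greedy), "greedy matches")
for tag in lazy:
    print(tag)
print(lazy == negated)
```

The greedy regex returns one match per line, so lines holding several tags come back as a single block, while the lazy and negated-class versions return each tag individually.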


                      So, in short :

                      • Add the ? meta-character, AFTER a greedy quantifier ( Default case ) to make this quantifier lazy

                      • Add the + meta-character, AFTER a greedy quantifier ( Default case ) to make this quantifier possessive

                      And :

                      • A greedy quantifier, first, tries to repeat the token as MANY times as possible, and gradually GIVES UP matches, as the engine BACKTRACKS, to find an OVERALL match.

                      • A lazy quantifier, first, repeats the token as FEW times as required, and gradually EXPANDS the match, as the engine BACKTRACKS through the regex, to find an OVERALL match.

                      • A possessive quantifier, first, tries to repeat the token as MANY times as possible , and :

                        • IF the REMAINDER, of the regex, can be matched => An OVERALL match is found

                        • IF the REMAINDER of the regex CANNOT be matched => The match attempt fails IMMEDIATELY, without trying any BACKTRACKING step to get an OVERALL match!
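The three behaviours summarized above can be sketched as follows. This assumes Python's `re` module as a stand-in for Notepad++'s engine; possessive quantifiers are only accepted by `re` from Python 3.11 onwards, so that part is guarded by a version check. The sample string is hypothetical.

```python
import re
import sys

# Hypothetical sample: two quoted strings followed by a stray character.
TEXT = '"abc" and "def"x'

# Greedy: .* grabs everything, then gives back 'x' so the final quote fits.
greedy = re.search(r'".*"', TEXT).group(0)

# Lazy: .*? expands one character at a time until a closing quote fits.
lazy = re.search(r'".*?"', TEXT).group(0)

possessive = "not tested"
negated_possessive = "not tested"
if sys.version_info >= (3, 11):
    # Possessive: .*+ grabs everything and never gives anything back,
    # so the closing quote of the regex can no longer be matched.
    m = re.search(r'".*+"', TEXT)
    possessive = m.group(0) if m else None
    # With a negated class, nothing needs to be given back, so it matches.
    negated_possessive = re.search(r'"[^"]*+"', TEXT).group(0)

print(greedy)
print(lazy)
print(possessive)
print(negated_possessive)
```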

                      Best Regards,

                      guy038

                      P.S. :

                      When I, first, wanted to add this reply, I was told, by Askinet that it could not add such a reply, which was considered as SPAM :-((

                      Luckily, I could get it to work, by adding, little by little, a few lines of my original reply and clicking, each time, on the blue Submit button !

                      I finally managed to post my complete original reply! At first, I thought that the filter was worried about the special characters, the links or the HTML code. But, as the contents of the present reply seem identical to my original contents, except that the reply was built up in several steps, I don’t see what disturbed the Askinet site and led to the SPAM declaration!

                      Anyway, if such a problem occurs to you, just try to split your reply in some parts, putting them one after another, on our forum ! Click on the vertical three dots symbol, on the right part of the screen and choose the Edit option

                      • Vasile Caraus

                        Nice answer. Thanks for replying to me every time, Guy.

                        • Bipulkumarsingh

                          But I want to ask: what if I need to check that they are NOT near each other, that is, not within the range {6,12}?

                          The Community of users of the Notepad++ text editor.