
    Find paragraph of X words containing multiple keywords ?

    22 Posts 4 Posters 5.1k Views
      Alan Kilborn @n_antiyou

      @n_antiyou

      You seem to be wanting to find multiple words within some proximity of each other? Here’s an example of something I used to find TWO words within so many characters of each other: https://community.notepad-plus-plus.org/post/60219

      Maybe that general idea could be extended to more words.

        guy038

        Hello, @n_antiyou, @peterjones and All

        Please, just skip to my next post, below. This one is now out of date !

        Unlike @peterjones, I will consider a continuous range of any character. So a single common word can be defined by the regex \W+\w+, which stands for a non-null range of non-word chars followed by a non-null range of word chars

        Now, in order to match a block of N words containing the words Word_1 and Word_2 and Word_3, use the following regex, written on several lines in free-spacing mode. (In this mode, a space char is ignored and must be coded as \x20 or preceded by a \ character.)

        (?xs-i)
        (?= ( \W+ \w+ ) {0,N-1} \W+ Word_1 )
        (?= ( \W+ \w+ ) {0,N-1} \W+ Word_2 )
        (?= ( \W+ \w+ ) {0,N-1} \W+ Word_3 )
        ( \W+ \w+ ) {N}
        

        So, if I take up your example, with a range of 30 words, we get the effective search regex :

        (?xs-i)
        (?= ( \W+ \w+ ) {0,29} \W+ DOG    )
        (?= ( \W+ \w+ ) {0,29} \W+ CAT    )
        (?= ( \W+ \w+ ) {0,29} \W+ RABBIT )
        ( \W+ \w+ ) {30}
        
        • Select all text from (?xs-i) till ( \W+ \w+ ) {30}

        • Open the Search dialog ( Ctrl + F )

          • Tick the Wrap around option

          • Select the Regular expression search mode

          • Click on the Find next button

        You may test that regex expression against the sample text below :

        word word word word word word word word word DOG
        word word word word word word word word word DOG
        word word word word word word word word word word
        
        word word word word word word word word word CAT
        word word word word word word word word word CAT
        word word word word word word word word word word
        
        word word word word word word word word word RABBIT
        word word word word word word word word word RABBIT
        word word word word word word word word word word
        
        word word word word word word word word word DOG
        word word word word word word word word word CAT
        word word word word word word word word word RABBIT¤
        
        DOG word word word word word word word word word
        CAT word word word word word word word word word
        RABBIT word word word word word word word word word¤
        
        word word word RABBIT word word word word word word
        word DOG word word word word word word word word
        word word word word word word word word CAT word¤
        
        RABBIT word word word word word DOG word word word
        word DOG word word word word CAT word word word
        word CAT word word word word word word word RABBIT¤
        
        word word word word word word word word word word
        word word word word word word word word word word
        word word word word word word word  CAT DOG RABBIT¤
        
        DOG RABBIT CAT word word word word word word word
        word word word word word word word word word word
        word word word word word word word word word word¤
        
        word word word word word word word word word word
        word word word word word word word word word DOG
        word word word word word word word word word DOG
        
        word word word word word word word word word word
        word word word word word word word word word CAT
        word word word word word word word word word CAT
        
        word word word word word word word word word word
        word word word word word word word word word RABBIT
        word word word word word word word word word RABBIT
        
        word word word word word word word word word word
        

        Now, if you want to match each word DOG, CAT or RABBIT, in these blocks only, we’ll need to temporarily add a character, not yet existing in the current file, at the end of these blocks

        In the following S/R, below, I chose the ¤ character. So :

        • Select all the text, below :
        (?xs-i)
        (?= ( \W+ \w+ ) {0,29} \W+ DOG    )
        (?= ( \W+ \w+ ) {0,29} \W+ CAT    )
        (?= ( \W+ \w+ ) {0,29} \W+ RABBIT )
        ( \W+ \w+ ) {30} \K
        
        • Open the Replace dialog ( Ctrl + H )

          • Type in ¤ in the Replace with: zone

          • Tick the Wrap around option

          • Select the Regular expression search mode

          • Click on the Replace All button ( Do not use the Replace button )


        Then, search or marking should be easy with the regex :

        SEARCH / MARK (?-i)(DOG|CAT|RABBIT)(?=(\W+\w+){0,29}¤)


        Finally, in order to delete the temporary ¤ character, use this trivial regex :

        • SEARCH ¤

        • REPLACE Leave EMPTY

        Best Regards,

        guy038

          Alan Kilborn @guy038

          @guy038:

          Nice techniques, here.

          Some notes:

          You may test that regex expression against the sample text below :

          It appears this text already has the ¤ character in it??

          Select all text from (?xs-i) till ( \W+ \w+ ) {30}

          It is a bit awkward to have to copy the regex specified on this site into a Notepad++ tab, and then select it again in order to press Ctrl+f on it.
          Suggest in the future to also provide in postings here a directly copyable version of the regex (meaning, a non (?x) version). Just a suggestion to avoid “losing” the newbies.

          Now, if you want to match each word DOG, CAT or RABBIT, in these blocks only, we’ll need to temporarily add a character, not yet existing in the current file, at the end of these blocks

          At first I thought this solution would match if only one of the 3 words were present (because of the use of “or”) but in reality it only matches if all 3 are present (an “and” search).
          As the “and” scenario was considered in the earlier half of the previous posting, I’m confused as to what this second half is showing…

            guy038

            Hi, @n_antiyou, @peterjones, @alan-kilborn and All

            Ah… Yes, Alan, you’re right about the ¤ character ! So I am reposting my whole previous reply, for better understanding and to provide new versions of the regexes !


            Unlike @peterjones, I will consider a continuous range of any character, even the EOL chars. The (?s) modifier is kept in the regexes below, although \W already matches line-break chars and (?s) only affects the dot.

            Now, we can associate any common word with its leading non-word char(s), whatever they are, even line-break(s). Therefore, it can be expressed with the regex \W+\w+, which stands for a non-null range of non-word chars followed by a non-null range of word chars
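To make the \W+\w+ "word unit" concrete, here is a minimal sketch in Python. Using Python's `re` module is an assumption made only for illustration (the thread itself works in the Notepad++ search dialog), and the sample strings are invented.

```python
import re

# One "word unit": a non-empty run of non-word chars, then a non-empty run of word chars.
WORD_UNIT = re.compile(r"\W+\w+")

# Punctuation, spaces and line-breaks all count as leading non-word chars.
sample = "\nDOG, cat;\r\n  rabbit"
units = WORD_UNIT.findall(sample)
print(units)

# A word with no non-word char in front of it (here the very first word)
# is not a unit on its own, so it is skipped.
no_leading = WORD_UNIT.findall("DOG cat")
print(no_leading)
```

The second call suggests why a block can only start where at least one non-word char (a space, a line-break, and so on) precedes its first word.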

            Then, in order to match a block of N words containing the words Word_1 and Word_2 and Word_3, at least once each, use the generic regex below, written on several lines in free-spacing mode (?x)

            Remember that, in this mode :

            • A space char is ignored and must be coded as \x20 or preceded by a \ character

            • Any text after a # character, up to the end of the line, is ignored too and is only a comment !

            • To match a literal # char, use, either, the \x23 or \# syntax !
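The three free-spacing rules above can be tried out with the sketch below. It assumes Python's `re.VERBOSE` flag as a stand-in for the (?x) modifier; the Notepad++ engine is a different one, so treat this only as an illustration of the same rules. The patterns are invented for the demonstration.

```python
import re

# re.VERBOSE plays the role of the (?x) free-spacing modifier.
spaced = re.compile(r"""
    DOG  \x20  CAT    # \x20 is a literal space; the other blanks are ignored
""", re.VERBOSE)

# A backslash followed by a space also stands for one literal space.
escaped = re.compile(r"DOG \  CAT", re.VERBOSE)

# \# is a literal '#'; an unescaped '#' starts a comment.
hash_literal = re.compile(r"a \# b  # trailing comment", re.VERBOSE)

print(bool(spaced.fullmatch("DOG CAT")))
print(bool(spaced.fullmatch("DOGCAT")))
print(bool(escaped.fullmatch("DOG CAT")))
print(bool(hash_literal.fullmatch("a#b")))
```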

            (?xs-i)
            (?= ( \W+ \w+ ) {0,N-1} \W+ Word_1 )
            (?=    (?1)     {0,N-1} \W+ Word_2 )
            (?=    (?1)     {0,N-1} \W+ Word_3 )
                   (?1)     {N}
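As a rough sketch of how the generic regex behaves, the snippet below rebuilds it for Python's standard `re` module. Two assumptions: `re` has no (?1) subroutine call, so the word unit is written out in full each time as a non-capturing group, and the helper name `build_block_regex` and the sample text are hypothetical.

```python
import re

def build_block_regex(keywords, n):
    """Python-`re` equivalent of the generic regex: a block of n words
    that contains every keyword at least once (case-sensitive)."""
    unit = r"(?:\W+\w+)"          # one word with its leading non-word chars
    lookaheads = "".join(
        rf"(?={unit}{{0,{n - 1}}}\W+{re.escape(word)})" for word in keywords
    )
    return re.compile(lookaheads + rf"{unit}{{{n}}}")

text = "\none two DOG three CAT four RABBIT five six seven"

# RABBIT is the 7th word, so a 7-word block starting at the top contains all three.
match = build_block_regex(["DOG", "CAT", "RABBIT"], 7).search(text)
print(repr(match.group()))

# DOG and RABBIT are 5 words apart, so no 4-word block can hold both.
too_small = build_block_regex(["DOG", "CAT", "RABBIT"], 4).search(text)
print(too_small)
```

As in the regex above, nothing is required after the keyword inside each look-ahead, so a keyword would also match as the beginning of a longer word (DOG inside DOGS).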
            

            So, if I take up your example, where N = 30 words, without using the free-spacing mode :

            • Open the Search dialog ( Ctrl + F )

              • SEARCH (?s-i)(?=(\W+\w+){0,29}\W+DOG)(?=(?1){0,29}\W+CAT)(?=(?1){0,29}\W+RABBIT)(?1){30}    Regex A

              • Tick the Wrap around option

              • Select the Regular expression search mode

              • Click on the Find next button

            You may test that regex expression against the sample text, below :

            Note, Alan, that regex A selects the whole block, not each keyword individually !

            word word word word word word word word word DOG
            word word word word word word word word word DOG
            word word word word word word word word word word
            
            word word word word word word word word word CAT
            word word word word word word word word word CAT
            word word word word word word word word word word
            
            word word word word word word word word word RABBIT
            word word word word word word word word word RABBIT
            word word word word word word word word word word
            
            word word word word word word word word word DOG
            word word word word word word word word word CAT
            word word word word word word word word word RABBIT
            
            DOG word word word word word word word word word
            CAT word word word word word word word word word
            RABBIT word word word word word word word word word
            
            word word word RABBIT word word word word word word
            word DOG word word word word word word word word
            word word word word word word word word CAT word
            
            RABBIT word word word word word DOG word word word
            word DOG word word word word CAT word word word
            word CAT word word word word word word word RABBIT
            
            word word word word word word word word word word
            word word word word word word word word word word
            word word word word word word word  CAT DOG RABBIT
            
            DOG RABBIT CAT word word word word word word word
            word word word word word word word word word word
            word word word word word word word word word word
            
            word word word word word word word word word word
            word word word word word word word word word DOG
            word word word word word word word word word DOG
            
            word word word word word word word word word word
            word word word word word word word word word CAT
            word word word word word word word word word CAT
            
            word word word word word word word word word word
            word word word word word word word word word RABBIT
            word word word word word word word word word RABBIT
            
            word word word word word word word word word word
            

            Now, if you want to match each word DOG, CAT or RABBIT of these blocks individually, we need a temporary character, not yet existing in the current file, which will be placed at the end of each of these blocks

            That way, we can be sure that the range between any found occurrence of CAT, DOG or RABBIT (included) and this ending anchor will not exceed 30 words !

            I chose the ¤ character as the anchor, but any single char that is not a regex meta-character would be OK, too

            • Open the Replace dialog ( Ctrl + H )

              • SEARCH (?s-i)(?=(\W+\w+){0,29}\W+DOG)(?=(?1){0,29}\W+CAT)(?=(?1){0,29}\W+RABBIT)(?1){30}\K    Regex B

              • REPLACE ¤ ( the anchor )

              • Tick the Wrap around option

              • Select the Regular expression search mode

              • Click on the Replace All button ( Do not use the Replace button )

            You should get this output :

            word word word word word word word word word DOG
            word word word word word word word word word DOG
            word word word word word word word word word word
            
            word word word word word word word word word CAT
            word word word word word word word word word CAT
            word word word word word word word word word word
            
            word word word word word word word word word RABBIT
            word word word word word word word word word RABBIT
            word word word word word word word word word word
            
            word word word word word word word word word DOG
            word word word word word word word word word CAT
            word word word word word word word word word RABBIT¤
            
            DOG word word word word word word word word word
            CAT word word word word word word word word word
            RABBIT word word word word word word word word word¤
            
            word word word RABBIT word word word word word word
            word DOG word word word word word word word word
            word word word word word word word word CAT word¤
            
            RABBIT word word word word word DOG word word word
            word DOG word word word word CAT word word word
            word CAT word word word word word word word RABBIT¤
            
            word word word word word word word word word word
            word word word word word word word word word word
            word word word word word word word  CAT DOG RABBIT¤
            
            DOG RABBIT CAT word word word word word word word
            word word word word word word word word word word
            word word word word word word word word word word¤
            
            word word word word word word word word word word
            word word word word word word word word word DOG
            word word word word word word word word word DOG
            
            word word word word word word word word word word
            word word word word word word word word word CAT
            word word word word word word word word word CAT
            
            word word word word word word word word word word
            word word word word word word word word word RABBIT
            word word word word word word word word word RABBIT
            
            word word word word word word word word word word
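The anchor-insertion step can be imitated outside Notepad++ with the sketch below. It assumes Python's `re` module, which has neither (?1) nor \K, so the word unit is spelled out and the replacement re-emits the matched block before adding the anchor. The function name `mark_blocks` and the shortened two-block sample are hypothetical.

```python
import re

UNIT = r"(?:\W+\w+)"   # one word plus its leading non-word characters

def mark_blocks(text, keywords, n, anchor="¤"):
    """Append `anchor` to every n-word block containing all keywords."""
    if anchor in text:
        raise ValueError("choose an anchor that does not occur in the text")
    lookaheads = "".join(
        rf"(?={UNIT}{{0,{n - 1}}}\W+{re.escape(word)})" for word in keywords
    )
    block = re.compile(lookaheads + rf"{UNIT}{{{n}}}")
    # No \K in `re`: re-emit the matched block, then add the anchor.
    return block.sub(lambda m: m.group() + anchor, text)

filler = " ".join(["word"] * 9)
text = (
    f"\n{filler} DOG\n{filler} DOG\n{filler} word\n"      # only DOG in this block
    f"\n{filler} DOG\n{filler} CAT\n{filler} RABBIT\n"    # all three keywords
)
marked = mark_blocks(text, ["DOG", "CAT", "RABBIT"], 30)
print(marked)
```

Only the second block receives the anchor, right after RABBIT, which mirrors the behaviour shown in the output above.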
            

            Then, against this modified text, above, the search or marking of any word DOG, CAT or RABBIT, within the concerned sections, can be performed with the regex :

            SEARCH / MARK (?-i)(DOG|CAT|RABBIT)(?=(\W+\w+){0,29}¤)    Regex C

            Alan, this time, each keyword inside the concerned blocks is matched individually by regex C ! Note that if you removed the anchor ¤ at the end of the look-ahead, the regex engine would find absolutely all the occurrences of the keywords DOG, CAT or RABBIT ! Not what we want ! Hence, the necessity of the anchor ;-))
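A small check of regex C, again assuming Python's `re` module purely for illustration (the group inside the look-ahead is written as non-capturing, which does not change what is matched). The two-block sample, already carrying one ¤ anchor, is invented.

```python
import re

# Regex C: a keyword followed by at most 29 more words and then the anchor.
REGEX_C = re.compile(r"(DOG|CAT|RABBIT)(?=(?:\W+\w+){0,29}¤)")

filler = " ".join(["word"] * 9)
marked = (
    f"\n{filler} DOG\n{filler} DOG\n{filler} word\n"      # block without anchor
    f"\n{filler} DOG\n{filler} CAT\n{filler} RABBIT¤\n"   # anchored block
)

inside_blocks = [m.group() for m in REGEX_C.finditer(marked)]
print(inside_blocks)

# Without the anchor condition, every occurrence is found.
everywhere = re.findall(r"DOG|CAT|RABBIT", marked)
print(len(everywhere))
```

The two DOG occurrences of the first block are more than 29 words away from the anchor, so only the three keywords of the anchored block are reported.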


            Finally, in order to delete the temporary ¤ character, use this trivial regex :

            • SEARCH ¤

            • REPLACE Leave EMPTY

            Best Regards,

            guy038

              Alan Kilborn @guy038

              @guy038

              Thank you for the clarifications.

                n_antiyou @guy038

                @guy038 This is it! It works! Ahahah

                I am not sure if I did it correctly. To be honest, I have understood about 10% of what you guys wrote. But I copied

                (?s-i)(?=(\W+\w+){0,29}\W+DOG)(?=(?1){0,29}\W+CAT)(?=(?1){0,29}\W+RABBIT)(?1){30}

                substituting DOG/RABBIT/CAT with 3 other keywords, and notepad++ found a portion of the text 30 words long containing all 3.

                This is exactly what I was looking for.

                Now, I have only 2 questions left:

                1) If I want to modify the size of the portion of the text I want to find ( lets say from 30 to 50 ), would it look like this:

                " (?s-i)(?=(\W+\w+){0,49}\W+guidato)(?=(?1){0,49}\W+parte)(?=(?1){0,49}\W+contributi)(?1){50} " ?

                2 ) What is the expression to achieve the same result, but with 4 and 5 keywords instead of 3? ( I know I could just ask what’s the pattern to follow to add more keywords… if you want you can write it down… I am scared that I won’t understand it tho )

                  n_antiyou

                  Oh, and one more ( sorry )

                  3 ) I see that notepad, when entering the expression and hitting "find", brings me to the proper place where the portion of the text is, and then highlights it in grey. Is there a way to make it also highlight the 3 keywords INSIDE, with a different color? ( any color, even the same color for all 3 keywords )
                  You may look at the picture I sent above as an example.

                    PeterJones @n_antiyou

                    @n_antiyou

                    1. Yep, that’s right.
                    2. each one of the (?=(\W+\w+){0,49}\W+guidato) terms applies to one of your required words. You’ll notice right now, there are three of those terms, each with one of your required words. You just need to add more of the same terms but with the new words.
                    3. with a different color? not in the same regular expression, sorry
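The pattern for adding more keywords or changing the block size can be sketched as a small string builder. This is only an illustration in Python: the function name `notepad_block_regex` is hypothetical, the keywords are assumed to contain no regex meta-characters, and the resulting string is meant to be pasted into the Notepad++ search field (Python's own `re` module cannot compile it, because of the (?1) syntax).

```python
def notepad_block_regex(keywords, n):
    """Return the search string, in the thread's syntax, for any keyword list."""
    if not keywords or n < 1:
        raise ValueError("need at least one keyword and n >= 1")
    first, *rest = keywords
    # The first look-ahead defines group 1; the others re-use it via (?1).
    parts = [rf"(?=(\W+\w+){{0,{n - 1}}}\W+{first})"]
    parts += [rf"(?=(?1){{0,{n - 1}}}\W+{word})" for word in rest]
    parts.append(rf"(?1){{{n}}}")
    return "(?s-i)" + "".join(parts)

fifty = notepad_block_regex(["guidato", "parte", "contributi"], 50)
print(fifty)

four_keywords = notepad_block_regex(["DOG", "CAT", "RABBIT", "HORSE"], 30)
print(four_keywords)
```

So one look-ahead is added per keyword, each look-ahead counts up to N-1 words, and the final quantifier is N.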
                      n_antiyou

                      Perfect, so assuming 6 keywords and 300 words as size, it should look like this:

                      (?s-i)(?=(\W+\w+){0,299}\W+word1)(?=(\W+\w+){0,299}\W+word2)(?=(\W+\w+){0,299}\W+word3)(?=(\W+\w+){0,299}\W+word4)(?=(\W+\w+){0,299}\W+word5)(?=(\W+\w+){0,299}\W+word6)(?1){300}

                      Correct?

                      I’m starting to think this is a bit too complex tho. Not the expression per se, since once I understand how it works I can make new ones on my own, but the process takes time. Aren’t there programs that do this kind of research with a friendly UI?
                      Maybe I could find people on fiverr to develop an extension of google chrome that does this kind of research, so that it would also work on a PDF without converting to txt altogether.

                        guy038

                        Hi, @n_antiyou,

                        Give me some minutes ! your last regex can be simplified ;-))

                        BR

                        guy038

                          Alan Kilborn @n_antiyou

                          @n_antiyou said in Find paragraph of X words containing multiple keywords ?:

                          Aren’t there programs that do this kind of research with a friendly UI?

                          Are there? I guess you’d have to go and find them then.

                          Maybe I could find people on fiverr to develop an extension of google chrome that does this kind of research, so that it would also work on a PDF without converting to txt altogether.

                          Are there people standing by just to do this sort of thing?
                          That’s nice if so.
                          Maybe they can field some of the oddball need regex questions we get asked here.

                            n_antiyou

                            There might be, there’s people that do all sorts of things on fiverr it seems. XD

                              guy038

                              Hello, @n_antiyou, @peterjones, @alan-kilborn and All

                              @n_antiyou,

                              For 6 keywords, you can use the single-line regex below, in free-spacing mode, which lets you place spaces anywhere within this long regex for better readability !

                              SEARCH / MARK (?xs-i) (?=(\W+ \w+){0,299} \W+ Word_1) (?=(?1){0,299} \W+ Word_2) (?=(?1){0,299} \W+ Word_3) (?=(?1){0,299} \W+ Word_4) (?=(?1){0,299} \W+ Word_5) (?=(?1){0,299} \W+ Word_6) (?1){300}


                              Now, you may use the Search > Mark All > Using #th style in order to highlight your keywords with a specific color. Note that, for your 6th keyword, you’ll have to cheat a bit by applying two successive highlightings to the same word ! Just try to mix two styles ;-))

                              Best Regards,

                              guy038

                              P.S. :

                              Note that the syntax (\W+ \w+), near the beginning of the regex, defines the group 1 containing the sub-regex \W+\w+, which is re-used, further on, thanks to the simple syntax (?1)
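To illustrate what "re-used" means here, the sketch below contrasts a back-reference with a re-applied sub-pattern. It assumes Python's standard `re` module, which has no (?1) syntax, so the subroutine call is imitated by writing the sub-regex \W+\w+ out a second time; the sample strings are invented.

```python
import re

# Back-reference \1: the second unit must repeat the TEXT captured by group 1.
backref = re.compile(r"(\W+\w+)\1")

# Imitation of (?1): the sub-PATTERN \W+\w+ is simply applied a second time.
subroutine_like = re.compile(r"(\W+\w+)(?:\W+\w+)")

different_words = " alpha beta"
repeated_words = " alpha alpha"

print(backref.search(different_words))                  # no identical repetition
print(backref.search(repeated_words).group())
print(subroutine_like.search(different_words).group())
```

This is why the look-aheads can walk over arbitrary, different words: (?1) re-runs the pattern of group 1 instead of demanding the same text again.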

                              You’ll find some links to improve yourself in regexes here !

                                guy038

                                Hi, @n_antiyou and All,

                                As I said in my previous post, you may mix some styles, from the 5 styles, available by default, to get other colors, in order to color all your keywords !

                                Refer to this post by @Claudia-Frank, who, unfortunately, is no longer active on this forum ! Her contribution was quite important and she provided a large number of excellent Python scripts, too ! Let’s wish her the best and good coding moments ;-))

                                https://community.notepad-plus-plus.org/post/27621


                                With the help of the NppQCP plugin ( Quick Color Plugin ), I built up a Word image which recapitulates the main style combinations, with significant colors and their RGB coordinates


                                [Image: table of the main style combinations, with their colors and RGB coordinates]

                                Best Regards,

                                guy038
