Find paragraph of X words containing multiple keywords ?

22 Posts 4 Posters 5.2k Views
    n_antiyou
    last edited by Apr 12, 2021, 9:09 PM

    Hey there. I was looking on the web for a way to search through a PDF with more advanced searches than simply CTRL+F and one keyword.
    I tried Acrobat Reader, but its boolean and proximity searches, while very close to what I need, aren’t exactly what I’m aiming for.
    My idea is to give some keywords (example: dog, cat, rabbit), and then give a size for the paragraph in words (example: 500 words).
    The program would then search through the whole PDF and find all paragraphs of that length containing the given keywords. In this case, a 500-word paragraph containing dog, cat and rabbit.

    I was told it can be done with Notepad++; even if I have to convert the PDF to txt, that’s not a problem. But I have no idea what expression I have to use.
    Would you mind helping?

      PeterJones @n_antiyou
      last edited by Apr 12, 2021, 9:29 PM

      @n_antiyou ,

      Well, you would definitely have to convert the PDF to txt, because Notepad++ is a text editor, and doesn’t understand PDF binary format.

      To search for any of dog, cat, rabbit, you would use the search dialog in regular expression mode, with (dog|cat|rabbit) as the search string.

      However, you were saying something about “size of paragraph” and “quantity of said words”. I’m not sure there’s any good way to do what you want in regular expressions. In a programming language, you could easily define restrictions like that. In regex, you might be able to work something out.

      For example, with the text

      this paragraph is a short dog
      
      this paragraph has more words and matches cat
      
      too short rabbit
      
      this paragraph has enough words and but doesn't have canines, felines, or bugs bunny
      
      this paragraph has more words and dog matches 
      
      this paragraph has more words and rabbit matches 
      

      There are three paragraphs that match, where “paragraph” is defined as a line of text that ends with a newline sequence. I used the restriction that the paragraph had to contain 7 or more words to match. If you wanted “500”, just change the 7 to 500.

      • FIND = (?-s)^(?=.*(dog|cat|rabbit))(\S+\h+){7,}.*$
        SEARCH MODE = regular expression

      This says: turn off “. matches newline”; start the match at the beginning of the line; the line must contain dog, cat, or rabbit somewhere. The line must also contain 7 or more “words”, where a word is defined as one or more non-whitespace characters followed by at least one horizontal space (space or tab), so a final word with nothing after it is not counted.
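For readers who want to try this outside Notepad++, here is a minimal Python sketch of the same line-based search. It is an approximation, not the Notepad++ engine: Python's `re` module has no `\h`, so `[ \t]` is assumed as a stand-in, and the name `matching_lines` is made up for this illustration.

```python
import re

# "." never matches a newline by default in Python, which mirrors (?-s).
# re.MULTILINE makes ^ and $ work per line; [ \t] replaces \h.
PATTERN = re.compile(
    r"^(?=.*(?:dog|cat|rabbit))(?:\S+[ \t]+){7,}.*$",
    re.MULTILINE,
)

SAMPLE = (
    "this paragraph is a short dog\n"
    "\n"
    "this paragraph has more words and matches cat\n"
    "\n"
    "too short rabbit\n"
    "\n"
    "this paragraph has enough words and but doesn't have canines, felines, or bugs bunny\n"
    "\n"
    "this paragraph has more words and dog matches \n"
    "\n"
    "this paragraph has more words and rabbit matches \n"
)


def matching_lines(text):
    """Return every line that has a keyword and at least 7 space-terminated words."""
    return [m.group(0) for m in PATTERN.finditer(text)]


if __name__ == "__main__":
    for line in matching_lines(SAMPLE):
        print(repr(line))
```

Run against the six sample lines, this should report the same three lines as the Notepad++ search: the ones ending in "matches cat", "dog matches" and "rabbit matches".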

      If your paragraph rule is really that a paragraph might contain newlines, and that paragraph breaks are when there are two newlines in a row, the search expression would have to change.

      If you want a better answer than that, you’ll have to ask a more-detailed question.

      ----

      Do you want regex search/replace help? Then please be patient and polite, show some effort, and be willing to learn; answer questions and requests for clarification that are made of you. All example text should be marked as literal text using the </> toolbar button or manual Markdown syntax. To make regex in red (and so they keep their special characters like *), use backticks, like `^.*?blah.*?\z`. Screenshots can be pasted from the clipboard to your post using Ctrl+V to show graphical items, but any text should be included as literal text in your post so we can easily copy/paste your data. Show the data you have and the text you want to get from that data; include examples of things that should match and be transformed, and things that don’t match and should be left alone; show edge cases and make sure your examples are as varied as your real data. Show the regex you already tried, and why you thought it should work; tell us what’s wrong with what you do get. Read the official NPP Searching / Regex docs and the forum’s Regular Expression FAQ. If you follow these guidelines, you’re much more likely to get helpful replies that solve your problem in the shortest number of tries.

      -----

      see also “FAQ Desk: Why Does My .docx File Look Like Junk In Notepad++”

        n_antiyou
        last edited by Apr 12, 2021, 9:33 PM

        Oh, sorry for not explaining what I mean by paragraph. It might be a language barrier, as English is not my main language. For paragraph, I do not mean one line of text. I mean a portion of the text. So if a page contains 30 lines of text and 3000 words, a paragraph of 500 words is just a piece of that page.

          n_antiyou
          last edited by Apr 12, 2021, 9:44 PM

          I tried your expression and it seems to work fine ( besides the fact it counts “paragraph” as a line ), but it only checks for 1 of the 3 keywords. I want it to find a paragraph that contains all 3 words, not just one.

            Alan Kilborn @n_antiyou
            last edited by Apr 12, 2021, 10:16 PM

            @n_antiyou

            For paragraph, I do not mean one line of text. I mean a portion of the text.

            You still haven’t really defined what a paragraph is (to you).
            A common (but not necessarily strict) definition is a section of text terminated by two (or maybe more) consecutive line-ending characters.

              n_antiyou
              last edited by n_antiyou Apr 12, 2021, 10:17 PM Apr 12, 2021, 10:16 PM

              @PeterJones
              First of all excuse me for making 3 posts in a row. I’m kinda new here.
              I made a picture in the hope of better explaining what I need: fadfadsfasdfasdfasdf.png
              This is a sample text. Each line is 10 words, but I am telling you this only to help us keep count of how many words there are. I don’t care about the lines per se.
              Now, I want the program to find a portion/paragraph of this text that is 30 words long and contains CAT, DOG and RABBIT.

              As you can see, I highlighted a portion of the text where this happens.
              The program should NOT pick lines such as the first 3:

              word word word word word word word word word DOG
              word word word word word word word word word DOG
              word word word word word word word word word word

              Because they only contain DOG, but don’t contain CAT or RABBIT.

              Is this easier to understand?

                n_antiyou
                last edited by n_antiyou Apr 12, 2021, 10:23 PM Apr 12, 2021, 10:21 PM

                As for dots, commas and stuff like that, I don’t care. The paragraph could even be:

                ,word word word. Word word word. Word. word word CAT.
                Word, DOG, word. Word. Word. Word word word word word,
                word word word word word RABBIT word word word. word wor

                Another example that makes it easy to understand is this:
                Imagine I have a book in PDF that talks about doctors A and B. Doctor B discovers a new research method called C.
                Now, A/B/C will be named numerous times in the PDF, across thousands of pages.

                But let’s assume I want to find the portion of the text where A talks about what he thinks of B and C; then the only option is a script similar to what I am looking for.

                  n_antiyou
                  last edited by n_antiyou Apr 12, 2021, 10:37 PM Apr 12, 2021, 10:37 PM

                  A native English speaker friend of mine suggests I should call it “a portion” or “a section” of the text, since a paragraph has precise punctuation and I don’t care about the punctuation.
                  I am basically just interested in finding a precise portion of a text based on size and keywords. Again, sorry.

                    Alan Kilborn @n_antiyou
                    last edited by Apr 13, 2021, 1:04 AM

                    @n_antiyou

                    You seem to be wanting to find multiple words within some proximity of each other? Here’s an example of something I used to find TWO words within so many characters of each other: https://community.notepad-plus-plus.org/post/60219

                    Maybe that general idea could be extended to more words.
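One way to extend the proximity idea to any number of words is to leave regex aside and slide a fixed-size window over the word list. The sketch below only illustrates that approach under stated assumptions: a word is any run of `\w` characters, keywords must match whole words case-sensitively, and `find_windows` is a hypothetical helper name, not something from Notepad++ or the linked post.

```python
import re


def find_windows(text, keywords, size):
    """Return (start, end) character offsets of non-overlapping runs of
    `size` consecutive words that contain every keyword at least once."""
    # Keep each word together with its position in the original text.
    words = [(m.group(0), m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    wanted = set(keywords)
    results = []
    i = 0
    while i + size <= len(words):
        window = words[i:i + size]
        if wanted <= {w for w, _, _ in window}:
            results.append((window[0][1], window[-1][2]))
            i += size  # resume after this block so results do not overlap
        else:
            i += 1
    return results


if __name__ == "__main__":
    sample = "a b DOG c CAT d e RABBIT f g"
    for start, end in find_windows(sample, ["DOG", "CAT", "RABBIT"], 6):
        print(sample[start:end])
```

The window is rebuilt at every step, which is simple rather than fast; for a book-sized file a running count of keyword hits would avoid rescanning each window.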

                      guy038
                      last edited by guy038 Apr 13, 2021, 2:31 PM Apr 13, 2021, 3:22 AM

                      Hello, @n_antiyou, @peterjones and All

                      Please, just skip to my next post, below. This one is now out of date !

                      Unlike @peterjones, I will consider a continuous range of any character. So a single common word can be defined by the regex \W+\w+, which stands for a non-null range of non-word chars followed by a non-null range of word chars

                      Now in order to match a block of N words containing the words Word_1 and Word_2 and Word_3, use the following regex, written with the free-spacing and multi-lines mode ( In this mode, a space char is not taken into account and must be coded as \x20 or preceded by a \ character )

                      (?xs-i)
                      (?= ( \W+ \w+ ) {0,N-1} \W+ Word_1 )
                      (?= ( \W+ \w+ ) {0,N-1} \W+ Word_2 )
                      (?= ( \W+ \w+ ) {0,N-1} \W+ Word_3 )
                      ( \W+ \w+ ) {N}
                      

                      So, if I take up your example, with a range of 30 words, we get the effective search regex :

                      (?xs-i)
                      (?= ( \W+ \w+ ) {0,29} \W+ DOG    )
                      (?= ( \W+ \w+ ) {0,29} \W+ CAT    )
                      (?= ( \W+ \w+ ) {0,29} \W+ RABBIT )
                      ( \W+ \w+ ) {30}
                      
                      • Select all text from (?xs-i) till ( \W+ \w+ ) {30}

                      • Open the Search dialog ( Ctrl + F )

                        • Tick the Wrap around option

                        • Select the Regular expression search mode

                        • Click on the Find next button

                      You may test that regex expression against the sample text below :

                      word word word word word word word word word DOG
                      word word word word word word word word word DOG
                      word word word word word word word word word word
                      
                      word word word word word word word word word CAT
                      word word word word word word word word word CAT
                      word word word word word word word word word word
                      
                      word word word word word word word word word RABBIT
                      word word word word word word word word word RABBIT
                      word word word word word word word word word word
                      
                      word word word word word word word word word DOG
                      word word word word word word word word word CAT
                      word word word word word word word word word RABBIT¤
                      
                      DOG word word word word word word word word word
                      CAT word word word word word word word word word
                      RABBIT word word word word word word word word word¤
                      
                      word word word RABBIT word word word word word word
                      word DOG word word word word word word word word
                      word word word word word word word word CAT word¤
                      
                      RABBIT word word word word word DOG word word word
                      word DOG word word word word CAT word word word
                      word CAT word word word word word word word RABBIT¤
                      
                      word word word word word word word word word word
                      word word word word word word word word word word
                      word word word word word word word  CAT DOG RABBIT¤
                      
                      DOG RABBIT CAT word word word word word word word
                      word word word word word word word word word word
                      word word word word word word word word word word¤
                      
                      word word word word word word word word word word
                      word word word word word word word word word DOG
                      word word word word word word word word word DOG
                      
                      word word word word word word word word word word
                      word word word word word word word word word CAT
                      word word word word word word word word word CAT
                      
                      word word word word word word word word word word
                      word word word word word word word word word RABBIT
                      word word word word word word word word word RABBIT
                      
                      word word word word word word word word word word
                      

                      Now, if you want to match each word DOG, CAT or RABBIT, in these blocks, only, we’ll need to temporarily add a character, not existing yet, in current file, at the end of these blocks

                      In the following S/R, below, I chose the ¤ character. So :

                      • Select all the text, below :
                      (?xs-i)
                      (?= ( \W+ \w+ ) {0,29} \W+ DOG    )
                      (?= ( \W+ \w+ ) {0,29} \W+ CAT    )
                      (?= ( \W+ \w+ ) {0,29} \W+ RABBIT )
                      ( \W+ \w+ ) {30} \K
                      
                      • Open the Replace dialog ( Ctrl + H )

                        • Type in ¤ in the Replace with: zone

                        • Tick the Wrap around option

                        • Select the Regular expression search mode

                        • Click on the Replace All button ( Do not use the Replace button )


                      Then, search or marking should be easy with the regex :

                      SEARCH / MARK (?-i)(DOG|CAT|RABBIT)(?=(\W+\w+){0,29}¤)


                      Finally, in order to delete the temporary ¤ character, use this trivial regex :

                      • SEARCH ¤

                      • REPLACE Leave EMPTY

                      Best Regards,

                      guy038

                        Alan Kilborn @guy038
                        last edited by Apr 13, 2021, 12:14 PM

                        @guy038:

                        Nice techniques, here.

                        Some notes:

                        You may test that regex expression against the sample text below :

                        It appears this text already has the ¤ character in it??

                        Select all text from (?xs-i) till ( \W+ \w+ ) {30}

                        It is a bit awkward to have to copy the regex specified on this site into a Notepad++ tab, and then select it again in order to press Ctrl+f on it.
                        Suggest in the future to also provide in postings here a directly copyable version of the regex (meaning, a non (?x) version). Just a suggestion to avoid “losing” the newbies.

                        Now, if you want to match each word DOG, CAT or RABBIT, in these blocks, only, we’ll need to temporary add a character, not existing yet, in current file, at the end of these blocks

                        At first I thought this solution would match if only one of the 3 words were present (because of the use of “or”) but in reality it only matches if all 3 are present (an “and” search).
                        As the “and” scenario was considered in the earlier half of the previous posting, I’m confused as to what this second half is showing…

                          guy038
                          last edited by Apr 13, 2021, 2:25 PM

                          Hi, @n_antiyou, @peterjones, @alan-kilborn and All

                          Ah… Yes, Alan, you’re right about the ¤ character ! So I repost all my previous reply, for a better understanding and to provide new versions of the regexes !


                          Unlike @peterjones, I will consider a continuous range of any character, even the EOL chars, using the (?s) modifier.

                          Now, we can associate any common word with its leading non-word char(s), whatever they are, even line-break(s). Therefore, it can be expressed with the regex \W+\w+, which stands for a non-null range of non-word chars followed by a non-null range of word chars

                          Then, in order to match a block of N words containing the words Word_1 and Word_2 and Word_3, at least once each, use the generic regex, below, written with the free-spacing and multi-lines mode (?x)

                          Remember that, in this mode :

                          • A space char is not taken into account and must be coded as \x20 or preceded by a \ character

                          • Any text after a first # character is not taken into account either; it is only a comment !

                          • To match a literal # char, use either the \x23 or the \# syntax !

                          (?xs-i)
                          (?= ( \W+ \w+ ) {0,N-1} \W+ Word_1 )
                          (?=    (?1)     {0,N-1} \W+ Word_2 )
                          (?=    (?1)     {0,N-1} \W+ Word_3 )
                                 (?1)     {N}
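The generic pattern above can also be generated rather than typed by hand. This is a hedged Python sketch: `build_block_regex` is a hypothetical helper, and because Python's `re` module does not support the `(?1)` subroutine call, the sketch repeats the `\W+\w+` sub-pattern in every term instead. The result is meant for Python, not for pasting into Notepad++.

```python
import re

WORD = r"\W+\w+"  # one "word": leading non-word chars, then word chars


def build_block_regex(keywords, n):
    """Compile a pattern matching a block of n words that contains every
    keyword at least once. Matching is case-sensitive, like (?-i)."""
    # One look-ahead per keyword: the keyword must start one of the n words.
    lookaheads = "".join(
        rf"(?=(?:{WORD}){{0,{n - 1}}}\W+{re.escape(k)})" for k in keywords
    )
    # (?s) is not needed here: the pattern has no ".", and \W already
    # matches line breaks.
    return re.compile(rf"{lookaheads}(?:{WORD}){{{n}}}")


if __name__ == "__main__":
    pattern = build_block_regex(["DOG", "CAT", "RABBIT"], 30)
    print(pattern.pattern)
```

Adding a fourth or fifth keyword is then just a longer list, and changing the block size is a single argument.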
                          

                          So, if I take up your example, where N = 30 words, without using the free-spacing mode :

                          • Open the Search dialog ( Ctrl + F )

                            • SEARCH (?s-i)(?=(\W+\w+){0,29}\W+DOG)(?=(?1){0,29}\W+CAT)(?=(?1){0,29}\W+RABBIT)(?1){30}    Regex A

                            • Tick the Wrap around option

                            • Select the Regular expression search mode

                            • Click on the Find next button

                          You may test that regex expression against the sample text, below :

                          Note, Alan, that this regex A selects the whole block and not each keyword individually !

                          word word word word word word word word word DOG
                          word word word word word word word word word DOG
                          word word word word word word word word word word
                          
                          word word word word word word word word word CAT
                          word word word word word word word word word CAT
                          word word word word word word word word word word
                          
                          word word word word word word word word word RABBIT
                          word word word word word word word word word RABBIT
                          word word word word word word word word word word
                          
                          word word word word word word word word word DOG
                          word word word word word word word word word CAT
                          word word word word word word word word word RABBIT
                          
                          DOG word word word word word word word word word
                          CAT word word word word word word word word word
                          RABBIT word word word word word word word word word
                          
                          word word word RABBIT word word word word word word
                          word DOG word word word word word word word word
                          word word word word word word word word CAT word
                          
                          RABBIT word word word word word DOG word word word
                          word DOG word word word word CAT word word word
                          word CAT word word word word word word word RABBIT
                          
                          word word word word word word word word word word
                          word word word word word word word word word word
                          word word word word word word word  CAT DOG RABBIT
                          
                          DOG RABBIT CAT word word word word word word word
                          word word word word word word word word word word
                          word word word word word word word word word word
                          
                          word word word word word word word word word word
                          word word word word word word word word word DOG
                          word word word word word word word word word DOG
                          
                          word word word word word word word word word word
                          word word word word word word word word word CAT
                          word word word word word word word word word CAT
                          
                          word word word word word word word word word word
                          word word word word word word word word word RABBIT
                          word word word word word word word word word RABBIT
                          
                          word word word word word word word word word word
                          

                          Now, if you want to match, individually, each word DOG, CAT or RABBIT, of these blocks, we need a temporary character, not yet present in the current file, which will be placed at the end of these blocks

                          By that means, we’ll be sure that the range, between any found occurrence of CAT, DOG or RABBIT included and this ending anchor, will not exceed 30 words !

                          I chose the ¤ character as the anchor, but any single char, not regex meta-character, would be OK, too

                          • Open the Replace dialog ( Ctrl + H )

                            • SEARCH (?s-i)(?=(\W+\w+){0,29}\W+DOG)(?=(?1){0,29}\W+CAT)(?=(?1){0,29}\W+RABBIT)(?1){30}\K    Regex B

                            • REPLACE ¤ ( the anchor )

                            • Tick the Wrap around option

                            • Select the Regular expression search mode

                            • Click on the Replace All button ( Do not use the Replace button )

                          You should get this output :

                          word word word word word word word word word DOG
                          word word word word word word word word word DOG
                          word word word word word word word word word word
                          
                          word word word word word word word word word CAT
                          word word word word word word word word word CAT
                          word word word word word word word word word word
                          
                          word word word word word word word word word RABBIT
                          word word word word word word word word word RABBIT
                          word word word word word word word word word word
                          
                          word word word word word word word word word DOG
                          word word word word word word word word word CAT
                          word word word word word word word word word RABBIT¤
                          
                          DOG word word word word word word word word word
                          CAT word word word word word word word word word
                          RABBIT word word word word word word word word word¤
                          
                          word word word RABBIT word word word word word word
                          word DOG word word word word word word word word
                          word word word word word word word word CAT word¤
                          
                          RABBIT word word word word word DOG word word word
                          word DOG word word word word CAT word word word
                          word CAT word word word word word word word RABBIT¤
                          
                          word word word word word word word word word word
                          word word word word word word word word word word
                          word word word word word word word  CAT DOG RABBIT¤
                          
                          DOG RABBIT CAT word word word word word word word
                          word word word word word word word word word word
                          word word word word word word word word word word¤
                          
                          word word word word word word word word word word
                          word word word word word word word word word DOG
                          word word word word word word word word word DOG
                          
                          word word word word word word word word word word
                          word word word word word word word word word CAT
                          word word word word word word word word word CAT
                          
                          word word word word word word word word word word
                          word word word word word word word word word RABBIT
                          word word word word word word word word word RABBIT
                          
                          word word word word word word word word word word
                          

                          Then, against this modified text, above, the search or marking of any word DOG, CAT or RABBIT, within the concerned sections, can be performed with the regex :

                          SEARCH / MARK (?-i)(DOG|CAT|RABBIT)(?=(\W+\w+){0,29}¤)    Regex C

                          Alan, this time, every keyword inside the concerned blocks is individually matched by regex C ! Note that if you removed the anchor ¤ at the end of the look-ahead, the regex engine would find absolutely all the occurrences of the keywords DOG, CAT or RABBIT, which is not what we expect ! Hence, the necessity of the anchor ;-))
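In a scripting language the temporary anchor is not needed, since the offsets of each matched block are available directly. Below is a minimal Python sketch of the Regex B plus Regex C idea, assuming the same definition of a word (`\W+\w+`); `keyword_spans` and its parameters are hypothetical names chosen for this example, and the `(?1)` call is expanded because Python's `re` does not support it.

```python
import re


def keyword_spans(text, keywords, n):
    """Return (start, end, keyword) for each keyword occurrence that lies
    inside an n-word block containing all the keywords."""
    word = r"\W+\w+"
    lookaheads = "".join(
        rf"(?=(?:{word}){{0,{n - 1}}}\W+{re.escape(k)})" for k in keywords
    )
    block_re = re.compile(rf"{lookaheads}(?:{word}){{{n}}}")
    keyword_re = re.compile("|".join(re.escape(k) for k in keywords))

    spans = []
    for block in block_re.finditer(text):
        # Restrict the keyword search to the block, which plays the role
        # of the anchor in the search/replace approach.
        for kw in keyword_re.finditer(text, block.start(), block.end()):
            spans.append((kw.start(), kw.end(), kw.group(0)))
    return spans


if __name__ == "__main__":
    sample = "\nDOG x x x x x x CAT x x\nx DOG x CAT x RABBIT x x"
    for start, end, kw in keyword_spans(sample, ["DOG", "CAT", "RABBIT"], 6):
        print(start, end, kw)
```

Keywords outside any qualifying block, such as the first DOG and CAT in the sample string, are left out of the result.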


                          Finally, in order to delete the temporary ¤ character, use this trivial regex :

                          • SEARCH ¤

                          • REPLACE Leave EMPTY

                          Best Regards,

                          guy038

                            Alan Kilborn @guy038
                            last edited by Apr 13, 2021, 2:44 PM

                            @guy038

                            Thank you for the clarifications.

                              n_antiyou @guy038
                              last edited by Apr 13, 2021, 2:58 PM

                              @guy038 This is it! It works! Ahahah

                              I am not sure if I did it correctly. To be honest, I have understood about 10% of what you guys wrote. But I copied

                              (?s-i)(?=(\W+\w+){0,29}\W+DOG)(?=(?1){0,29}\W+CAT)(?=(?1){0,29}\W+RABBIT)(?1){30}

                              substituting DOG/RABBIT/CAT with 3 other keywords, and Notepad++ found a portion of the text 30 words long containing all 3.

                              This is exactly what I was looking for.

                              Now, I have only 2 questions left:

                              1) If I want to modify the size of the portion of the text I want to find ( let’s say from 30 to 50 ), would it look like this:

                              " (?s-i)(?=(\W+\w+){0,49}\W+guidato)(?=(?1){0,49}\W+parte)(?=(?1){0,49}\W+contributi)(?1){50} " ?

                              2 ) What is the expression to achieve the same result, but with 4 and 5 keywords instead of 3? ( I know I could just ask what’s the pattern to follow to add more keywords… if you want you can write it down… I am scared that I won’t understand it tho )

                                n_antiyou
                                last edited by n_antiyou Apr 13, 2021, 3:03 PM Apr 13, 2021, 3:02 PM

                                Oh, and one more ( sorry )

                                3 ) I see that Notepad++, when I enter the expression and hit “Find”, brings me to the proper place where the portion of the text is, and then highlights it in grey. Is there a way to make it also highlight the 3 keywords INSIDE, with a different color? ( any color, even the same color for all 3 keywords )
                                You may look at the picture I sent above as an example.

                                  PeterJones @n_antiyou
                                  last edited by Apr 13, 2021, 3:04 PM

                                  @n_antiyou

                                  1. Yep, that’s right.
                                  2. each one of the (?=(\W+\w+){0,49}\W+guidato) terms applies to one of your required words. You’ll notice right now, there are three of those terms, each with one of your required words. You just need to add more of the same terms but with the new words.
                                  3. with a different color? not in the same regular expression, sorry
                                    n_antiyou
                                    last edited by Apr 13, 2021, 3:11 PM

                                    Perfect, so assuming 6 keywords and 300 words as size, it should look like this:

                                    (?s-i)(?=(\W+\w+){0,299}\W+word1)(?=(\W+\w+){0,299}\W+word2)(?=(\W+\w+){0,299}\W+word3)(?=(\W+\w+){0,299}\W+word4)(?=(\W+\w+){0,299}\W+word5)(?=(\W+\w+){0,299}\W+word6)(?1){300}

                                    Correct?

                                    I’m starting to think this is a bit too complex tho. Not the expression per se, since once I understand how it works I can make new ones on my own, but the process takes time. Aren’t there programs that do this kind of research with a friendly UI?
                                    Maybe I could find people on Fiverr to develop a Google Chrome extension that does this kind of research, so that it would also work on a PDF without converting to txt altogether.

                                      guy038
                                      last edited by Apr 13, 2021, 3:16 PM

                                      Hi, @n_antiyou,

                                      Give me some minutes ! your last regex can be simplified ;-))

                                      BR

                                      guy038

                                        Alan Kilborn @n_antiyou
                                        last edited by Apr 13, 2021, 3:17 PM

                                        @n_antiyou said in Find paragraph of X words containing multiple keywords ?:

                                        Aren’t there programs that do this kind of research with a friendly UI?

                                        Are there? I guess you’d have to go and find them then.

                                        Maybe I could find people on Fiverr to develop a Google Chrome extension that does this kind of research, so that it would also work on a PDF without converting to txt altogether.

                                        Are there people standing by just to do this sort of thing?
                                        That’s nice if so.
                                        Maybe they can field some of the oddball need regex questions we get asked here.

                                          n_antiyou
                                          last edited by Apr 13, 2021, 3:20 PM

                                          There might be, there’s people that do all sorts of things on fiverr it seems. XD
