    Regular expression to find two words in files in folder

    • Andrea Cappelli

      Hi,
      I need to look for two words in a set of files in a folder.
      Using
      (Word1) | (word2)
      I find files that contain either word1 or word2, but I need to find
      only the files that contain both word1 and word2, ignoring all other files.
      Thanks for your cooperation

      • Per Isakson

        Does (Word1)(?s:.*?)(word2)|(word2)(?s:.*?)(Word1) work?

        • guy038

          Hello, @andrea-cappelli,

          The regex given by @per-isakson is quite correct. However, in the general case where the two words Word1 and Word2 are located on different lines, a search with the Find in Files dialog does NOT display, in the Find Result panel, all the lines of the block beginning with Word1 and ending with Word2 ( or the opposite ), but ONLY the first line of each multi-line block. ( small bug ! )

          So, instead, you could use the regex (?si)(Word1)(?=.*?(Word2))|(?2)(?=.*?(?1)), which searches for either of the words Word1 OR Word2, in a case-insensitive way, provided that it is followed, further on, by the other word

          Notes :

          • The syntax (?si), at the beginning of the regex, sets modifiers which ensure that :

            • The dot ( . ) special character matches absolutely any single character ( standard or EOL )

            • The search is performed in a case-insensitive way ( If you need a case-sensitive search, just use the syntax (?s-i) )

          • Then (Word1) matches the string Word1, stored as group 1 due to the parentheses, ONLY IF it is followed, further on, by the nearest string Word2, which is stored as group 2 inside the “look-ahead” construction (?=.*?(Word2))

          • After the alternation symbol |, the part (?2)(?=.*?(?1)) represents the opposite case, where we’re searching for the string Word2 followed, further on, by the string Word1. It uses a specific regex construction (?N), where N is a group number, named a called subpattern. ( This atomic group is just a particular case of a recursive subpattern, located outside the parentheses to which it refers )
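For anyone who wants to try the look-ahead idea outside Notepad++, here is a minimal sketch in Python. It is an assumption-laden translation, not the regex above: Python's standard re module has no (?1) / (?2) subroutine calls, so each word is simply written out twice, and the helper name contains_both is made up for the illustration.

```python
import re

# Plain-string variant of the look-ahead regex. Python's re module does not
# support (?1)/(?2) subroutine calls, so both words are spelled out twice.
BOTH_WORDS = re.compile(r"(?si)Word1(?=.*?Word2)|Word2(?=.*?Word1)")


def contains_both(text: str) -> bool:
    """Return True if text holds Word1 and Word2, in either order."""
    return BOTH_WORDS.search(text) is not None


if __name__ == "__main__":
    samples = {
        "both, on different lines": "first word1\nmiddle\nlast WORD2\n",
        "both, reversed order": "Word2 first\nthen Word1\n",
        "only one word": "Word1 alone\n",
    }
    for label, text in samples.items():
        print(label, "->", contains_both(text))
```

The (?s) flag lets the dot cross line breaks, so the two words may sit anywhere in the text, which mirrors the multi-line case described in the notes.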


          @andrea-cappelli, if your two words, Word1 and Word2, are always both located on the same line, you could, preferably, use the simpler regex below, which searches for the smallest range of characters, on a single line, between the strings Word1 and Word2 OR between Word2 and Word1

          (?i-s)(Word1).*?(Word2)|(?2).*?(?1)

          Notes :

          • The (?i-s) modifiers ensure that :

            • The search is performed in a case-insensitive way

            • The dot matches a single standard character only, even if you previously checked the . matches newline option

          • If you need a case-sensitive search, replace the modifiers part with the syntax (?-is)


          After running these regexes, using the Find in Files dialog, you should get, in the Find result panel :

          • The absolute path of each file, containing the two words Word1 and Word2

          • Some lines, containing, either, Word1 or Word2 or both

          If you, simply, need the list of all these files, follow the method, below :

          • With a right mouse click in the Find result panel, choose the Select All option

          • Hit the Ctrl + C shortcut ( DO NOT use the context option Copy ! )

          • Paste the clipboard contents in a new tab, with the Ctrl + V shortcut

          • In this new tab, perform the simple S/R, below :

          SEARCH ^\t.+\R

          REPLACE Leave EMPTY
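The same clean-up can be sketched in Python, for the case where the copied Find result text is processed outside the editor. This is only an illustration: the sample text is a made-up approximation of the panel contents ( header line, indented file paths, tab-indented hit lines ), the paths are hypothetical, and the function name keep_file_lines is invented. Python's re has no \R, so the line break is written as \r?\n.

```python
import re


def keep_file_lines(find_result: str) -> str:
    """Delete every tab-indented hit line, as the S/R ^\\t.+\\R does."""
    # (?m) makes ^ match at the start of each line; \r?\n stands in for \R.
    return re.sub(r"(?m)^\t.+\r?\n", "", find_result)


# Hypothetical text copied from the Find result panel.
sample = (
    'Search "Word1" (3 hits in 2 files)\n'
    "  C:\\docs\\a.txt (2 hits)\n"
    "\tLine 3: some Word1 text\n"
    "\tLine 9: some Word2 text\n"
    "  C:\\docs\\b.txt (1 hit)\n"
    "\tLine 1: Word2 and Word1\n"
)

if __name__ == "__main__":
    print(keep_file_lines(sample), end="")
```

Only the header line and the file path lines remain, which is the file list asked for.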

          Best Regards,

          guy038

          P.S. :

          It’s very important to understand the fundamental difference between a subpattern, used as a subroutine and a back reference !!

          For instance, given the four-line text, below :

          123abc123
          123abc789
          789abc123
          789abc789
          

          The regex (\d+)abc\1, with the \1 back-reference, would match the first and fourth line, only. Indeed, the syntax \1 refers to the present value of the group 1

          Whereas the regex (\d+)abc(?1), with the (?1) called subpattern, would match all four lines ! Actually, this second regex syntax is, simply, identical to the regex (\d+)abc\d+ ;-))
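The difference can be checked quickly with a small Python sketch. Python's standard re module does not offer the (?1) syntax, so the second regex is written in its equivalent form (\d+)abc\d+ ; the variable names are only illustrative.

```python
import re

lines = ["123abc123", "123abc789", "789abc123", "789abc789"]

# \1 must repeat the exact text captured by group 1.
backref = re.compile(r"(\d+)abc\1")
# Equivalent of (\d+)abc(?1): the pattern \d+ is applied again, not its value.
reuse_pattern = re.compile(r"(\d+)abc\d+")

backref_hits = [s for s in lines if backref.fullmatch(s)]
pattern_hits = [s for s in lines if reuse_pattern.fullmatch(s)]

if __name__ == "__main__":
    print("back-reference :", backref_hits)
    print("subpattern     :", pattern_hits)
```

The back-reference keeps only the first and fourth lines, while the re-applied pattern accepts all four.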

          P.P.S. :

          After preparing this post, I found out that a previous, more detailed post, at the address below, discusses a very similar problem !!!

          https://notepad-plus-plus.org/community/topic/12948/2-search-strings-in-a-group-of-files-with-the-search-function

          • Andrea Cappelli

            Thank you very much for the explanation. The expression suits me in this form

            (?si)(Word1)(?=.?(Word2))|(?2)(?=.?(?1))

            Thanks again

            • Vasile Caraus

              Andrea Cappelli, I tested your last regex. I am glad to hear that it works, but I don’t know in exactly which case. I have a file with Word1 and Word2 on different lines, and your regex doesn’t work on it. So, tell us what your file looks like.

              guy038, about your very fine regex (?i-s)(Word1).*?(Word2)|(?2).*?(?1) : this selects everything from Word1 to Word2 on the same line. Perfect.

              Now, I changed your regex a little bit, so that I can select the entire line that contains Word1 and Word2.

              ^.*(?i-s)(Word1).*?(Word2)|(?2).*?(?1).*$ but it doesn’t work too well. It selects everything up to Word2, but not everything after it. Can you take a look?

              • Andrea Cappelli

                After some experimenting, the expression that gave me the best results is this

                (word1)(?s:.?)(word2)|(word2)(?s:.?)(word1)

                But I see that the discussion is getting interesting, so I’ll describe the whole problem.
                I have files like this
                …
                latest_meas_value(‘R’,“LY1-2”,81,81,0,0,“”,1)
                latest_meas_value(‘R’,“H-H”,85,85,0,0,“”,1)
                tankitem(“+ESSENZE FLOREALI (alt E)”,1,1)
                tankitem(“+Bryaconeel (Tabl)”,1,1)
                tankitem(“+Aesculus compositum (Drops)”,1,1)
                tankitem(“+Viscum compositum mite”,1,1)
                tankitem(“+Bryaconeel (Tabl)”,2,1)
                tankitem(“+Aesculus compositum (Drops)”,2,1)
                tankitem(“+Viscum compositum mite”,2,1)
                op_name(“OK”)
                starting_time(12,15)
                client_info(“NAME AND SURNAME”,“”,“JESI”,“”,“13.07.1926”,“”,“Clinic”)
                test_date(“8/10/1995”)
                memo_pad(“LONG DESCRIPTION\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n”,4,13)
                elapsed_time(1)

                In these files I need to look for
                JESI
                and
                2005
                JESI only in the “client_info” line and 2005 only in the “test_date” line
                Any ideas?

                • Vasile Caraus

                  and where are word1 and word2 in your text?

                  • Per Isakson

                    Why do you try to match what’s between word1 and word2 with (?s:.?), which only matches zero or one character? Replace it by (?s:.*?) and word1 and word2 by client_info and test_date, respectively. That will make your example work.

                    • Andrea Cappelli @Vasile Caraus

                      @Andrea-Cappelli said:

                      JESI

                      word1 = Jesi
                      word2 = 2005

                      • Per Isakson

                        Try (?x)(?ms)((test_date.*?2005).*?(client_info.*?JESI))|(\2.*?\1)

                        In case word1 and word2 are names of variables rather than values, I don’t know.
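Outside Notepad++, the same requirement ( JESI only on the client_info line, 2005 only on the test_date line ) could be checked with two separate line-bound searches. The Python sketch below is an assumption, not the regex above: the function name matches_record is invented, the sample records use straight quotes, and the match is case-insensitive because the thread writes both JESI and Jesi.

```python
import re

# The dot does not cross line breaks here, so each search stays on one line.
CLIENT_JESI = re.compile(r"(?im)^[ \t]*client_info\(.*JESI")
DATE_2005 = re.compile(r"(?im)^[ \t]*test_date\(.*2005")


def matches_record(text: str) -> bool:
    """True if JESI is on a client_info line AND 2005 is on a test_date line."""
    return bool(CLIENT_JESI.search(text)) and bool(DATE_2005.search(text))


# Hypothetical file contents, modelled on the lines quoted in the thread.
wanted = (
    'client_info("NAME AND SURNAME","","JESI","","13.07.1926","","Clinic")\n'
    'test_date("8/10/2005")\n'
)
unwanted = (
    'client_info("NAME AND SURNAME","","JESI","","13.07.1926","","Clinic")\n'
    'test_date("8/10/1995")\n'
)

if __name__ == "__main__":
    print(matches_record(wanted), matches_record(unwanted))
```

Using two searches avoids having to care about the order in which the two lines appear in the file.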

                        • guy038

                          Hi, @vasile-caraus,

                          Now, I realized that the regex, given in my previous post, (?i-s)(Word1).*?(Word2)|(?2).*?(?1) could be simplified !

                          Indeed, as I explained, we can’t use back-references, which are not defined if the regex engine chooses the second alternative ! But, when the boundaries Word1 and Word2 are not, themselves, regexes ( as, for instance \d+, a..z… ) but rather simple strings, we can use the simpler syntax below :

                          (?i-s)Word1.*?Word2|Word2.*?Word1


                          Secondly, to select any entire line ( with its EOL characters ) containing the two words Word1 and Word2, whatever their order, use the regex, below :

                          (?i-s)^.*(Word1.*Word2|Word2.*Word1).*\R
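A rough Python equivalent, for testing the whole-line idea outside Notepad++, is sketched below. It is an approximation: Python's re has no \R, so the sketch anchors on $ in multi-line mode and returns the lines without their EOL characters; the function name whole_lines is made up.

```python
import re

# (?i) ignores case, (?m) makes ^ and $ work per line; the dot stays on one line.
WHOLE_LINE = re.compile(r"(?im)^.*(?:Word1.*Word2|Word2.*Word1).*$")


def whole_lines(text: str) -> list[str]:
    """Return every full line that contains both words, in either order."""
    return [m.group(0) for m in WHOLE_LINE.finditer(text)]


sample = (
    "alpha Word1 beta Word2 gamma\n"
    "only Word1 here\n"
    "word2 then WORD1 end\n"
    "nothing\n"
)

if __name__ == "__main__":
    for line in whole_lines(sample):
        print(line)
```

The parentheses around the alternation are what make the leading ^.* and the trailing .* apply to both word orders.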

                          Best Regards,

                          guy038

                          P.S. :

                          As we’re rather dealing with exact words, we should use, instead of the two above, the regexes, below :

                          (?i-s)(?<=\W)Word1\W.*?\WWord2(?=\W)|(?<=\W)Word2\W.*?\WWord1(?=\W)

                          and

                          (?i-s)^.*\W(Word1\W.*\WWord2|Word2\W.*\WWord1)\W.*\R
