    finding files using reg-ex

    • Alan KilbornA
      Alan Kilborn @Dieter Zweigel
      last edited by Alan Kilborn

      @Dieter-Zweigel

      Is word order important?
      That is, must word1 appear before word2?

      Must the two words be on the same line?
      Or can they be anywhere in the file?

      Do you want to find all occurrences of this in a file?
      Or just the first is sufficient?

      Something that should work (until you “tighten up” your spec) is:

      find: (?s)(?=.*word1)(?=.*word2).*

      • Dieter ZweigelD
        Dieter Zweigel @Alan Kilborn
        last edited by

        @Alan-Kilborn No, word order is not important; the regex should find any order (and occurrence) of both words.

        • Alan KilbornA
          Alan Kilborn @Dieter Zweigel
          last edited by

          @Dieter-Zweigel

          So then I think what I already provided should work fine for you.
          Does it?

          • Dieter ZweigelD
            Dieter Zweigel @Alan Kilborn
            last edited by

            @Alan-Kilborn Unfortunately it does not work. The result contains all the files in the directory, independent of the occurrence of either word1 or word2.

            • Alan KilbornA
              Alan Kilborn @Dieter Zweigel
              last edited by

              @Dieter-Zweigel

              Hmm, well I just tried it again to verify, and for me it found only the files that contained both words.
              Not sure what would be going wrong for you with it.
              Sorry. :-(

              • Alan KilbornA
                Alan Kilborn @Dieter Zweigel
                last edited by Alan Kilborn

                @Dieter-Zweigel said in finding files using reg-ex:

                how can I find files that contain word1 AND word2

                Another technique for this “and” problem:

                Did you know that you can base a second search on the results of a first search?

                Here’s how:

                After you do the Find in Files for “word1”, right-click in the “Find result” window and select Find in these found results…
                You can then proceed to specify “word2” and the next search will be conducted only in the files found with your earlier search.
                The net result of the second search should be files that must contain both “word1” and “word2”.
                Caution: Be aware of your setting for Search only in found lines; for what you’ve specified, you want to untick that.
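The two-pass idea above can be sketched outside Notepad++ as well. The following is only an illustration of the logic, assuming plain-text files and using hypothetical helper names (`files_containing`, `files_with_both`) and made-up sample files; it is not how Notepad++ implements the feature.

```python
import re
import tempfile
from pathlib import Path


def files_containing(paths, word):
    """Return the paths whose text contains `word` (literal match)."""
    hits = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        if re.search(re.escape(word), text):
            hits.append(path)
    return hits


def files_with_both(paths, word1, word2):
    """Two-pass "and" search: pass 2 only looks at files found by pass 1."""
    first_pass = files_containing(paths, word1)
    return files_containing(first_pass, word2)


def demo():
    # Hypothetical sample files, created only for this illustration.
    samples = {
        "both.txt": "word2 on line one\nword1 on line two\n",
        "only1.txt": "just word1 here\n",
        "only2.txt": "just word2 here\n",
        "none.txt": "nothing relevant\n",
    }
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for name, content in samples.items():
            p = Path(tmp) / name
            p.write_text(content, encoding="utf-8")
            paths.append(p)
        return [p.name for p in files_with_both(paths, "word1", "word2")]


print(demo())
```

Because the second pass searches whole files rather than only the lines hit in the first pass, the two words may sit on different lines, which corresponds to leaving Search only in found lines unticked.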

                • Dieter ZweigelD
                  Dieter Zweigel
                  last edited by

                  I don’t know what went wrong with the first search on my original files. After that I created a test directory with some simple test files and the second search using your regex delivered the expected result.
                  Thank you very much for the suggestion to use “find in these results”. This method is much easier (though less elegant) and works fine.

                  • Alan KilbornA
                    Alan Kilborn @Dieter Zweigel
                    last edited by Alan Kilborn

                    @Dieter-Zweigel said in finding files using reg-ex:

                    I don’t know what went wrong with the first search on my original files

                    It could be that the regex I gave is causing an “overflow” with larger files.
                    This is a known Notepad++ problem where, on a single file, all text is deemed a “hit” when really an error message should be displayed.
                    With the info provided to this point, I can’t tell for certain if this is something you are encountering. My testing of it was certainly done on very small files that I quickly made, so the large file phenomenon, if that’s truly what is happening for you, would not have happened to me.

                    Here’s another possible regex to try:

                    (?s)(word1).+?(word2)|(?2).+?(?1)

                    I’d be interested to know if you have a different experience with that one, on your original fileset.

                    However, it seems like your immediate problem is solved with the other technique, and that is a “good thing”. :-)

                    • Dieter ZweigelD
                      Dieter Zweigel @Alan Kilborn
                      last edited by Dieter Zweigel

                      @Alan-Kilborn My original files are MS-Word .doc files of a size between 40 kB and 70 kB. I would not consider these files to be large, and apparently they are small enough for a normal search for only one word. Is the file size only a problem when using regular expressions?
                      Your second regex delivers correct results on both the original doc files and the test files (txt). I have to admit that I do not fully understand the expressions. However, I am very happy now, having two solutions for the problem. Thank you!

                      • Alan KilbornA
                        Alan Kilborn @Dieter Zweigel
                        last edited by

                        @Dieter-Zweigel said in finding files using reg-ex:

                        Is the file size only a problem when using regular expressions?

                        File size isn’t the problem, per se. The problem is that in a large file the two words could be far apart, causing the regex engine to have to do a lot of “work” and it can become “overloaded”.

                        In large files where the words are close together it should be no problem; small files should obviously be okay as well.

                        • guy038G
                          guy038
                          last edited by

                          Hello, @dieter-zweigel , @alan-kilborn and All,

                          @dieter-zweigel, you said :

                          I have to admit that I do not fully understand the expressions.

                          @alan-kilborn’s regex (?s)(?=.*word1)(?=.*word2).* may be described as follows:

                          • First, the (?s) syntax means that the regex dot symbol . matches any single character, even EOL chars!

                          • Then come two positive look-ahead structures (?=.....), each of which tests whether the regex after the = sign can match, starting from the current position

                            • From the beginning of the file, is there, further on, a string word1, after the greatest possible range, possibly empty, of any characters?

                            • After this first step, it’s important to understand that processing the first look-ahead (?=.*word1) has not changed the regex engine’s search position, which is still at the very beginning of the file!

                            • So, from the beginning of the file, is there, further on, a string word2, after the greatest possible range, possibly empty, of any characters?

                          • If the answer to these two questions is yes, then the regex engine matches, again from the very beginning, all the file contents with .* . However, note that, when the Find Result panel is involved, only the first physical line of each file, globally seen as a single line, is displayed. Safe behavior in case of huge files ;-))

                          And to search for files containing at least one string word1 OR one string word2, use this regex, with an alternation located inside the look-ahead:

                          (?s)(?=.*(word1|word2)).*
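The look-ahead behaviour described above can be tried in Python as a rough sketch. Python's `re` module is not the regex engine Notepad++ uses, so this only illustrates the logic; the function names `contains_both` and `contains_either` are hypothetical.

```python
import re

# (?s) lets "." cross line ends. Each look-ahead tests from the same
# position and consumes nothing, so the final ".*" starts at offset 0.
BOTH = re.compile(r"(?s)(?=.*word1)(?=.*word2).*")
EITHER = re.compile(r"(?s)(?=.*(word1|word2)).*")


def contains_both(text):
    # match() anchors at position 0, mirroring "from beginning of file".
    return BOTH.match(text) is not None


def contains_either(text):
    return EITHER.match(text) is not None


sample = "word2 comes first\nand word1 comes later\n"
m = BOTH.match(sample)
print(m.span() == (0, len(sample)))   # the whole text is the match
print(contains_both("only word1"))    # word2 is missing
print(contains_either("only word1"))  # one of the two is enough
```

Word order does not matter here because both look-aheads scan forward from the same starting position.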


                          Now, Alan, I did some tests with the simpler regex (?s)(?=.*AAA).* against the well-known license.txt file. This regex should select all file contents if the string AAA, whatever its case, exists, and should beep if no string AAA is found.

                          Unfortunately, I noticed that the search failed and selected all file contents, although this file obviously does not contain the AAA string. I then shortened this file, and the regex seems to work only up to a 13.5 kB file, with the expected message Find: Can't find the text "(?s)(?=.*AAA).*". Perhaps my weak configuration affects the results. Just test it on various files. The problem occurs when no match can be found!

                          It’s worth adding that this regex would work correctly if we were searching for word1 and word2 in each line of a file, and not in all file contents, with the regex (?-s)(?=.*word1)(?=.*word2).+ ;-))
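As a small sketch of the per-line variant, assuming Python's `re` module as a stand-in engine: in Python the dot does not match line ends by default, which plays the role of (?-s), so the leading flag is simply omitted. The sample text is made up for the illustration.

```python
import re

# Without DOTALL, "." stops at line ends, so both look-aheads
# have to succeed inside one and the same line.
SAME_LINE = re.compile(r"(?=.*word1)(?=.*word2).+")

text = (
    "word1 alone\n"
    "word2 then word1 together\n"
    "word2 alone\n"
    "word1 and word2 together\n"
)

matches = SAME_LINE.findall(text)
print(matches)
```

Only the two lines holding both words are reported; the lines with a single word are skipped even though the file as a whole contains both.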


                          So, @dieter-zweigel, I would advise you to preferably use @alan-kilborn’s second regex syntax, which is much faster and does not report wrong matches.

                          To end with, @dieter-zweigel, note that this regex (?s)(word1).+?(word2)|(?2).+?(?1) is a shortened syntax for :

                          (?s)(word1).+?(word2)|(word2).+?(word1). This form is easier to understand and almost obvious. Indeed, we are looking for a text :

                          • Containing the string word1 and, further on, the string word2

                          OR ( | )

                          • Containing the string word2 and, further on, the string word1

                          It’s important to realize that, although word1 and word2 are stored as groups 1 and 2, we cannot use the syntax (?s)(word1).+?(word2)|\2.*?\1, with back-references to these groups!

                          Do you see why? Well, when the first alternative is matched (word1.........word2), the back-references \1 and \2, although not used, do contain the strings word1 and word2. But when the first alternative fails (case word2........word1), the second alternative \2.*?\1 is tried. However, as groups 1 and 2 have not captured anything at that point, the back-references have nothing to refer to, and this part of the regex cannot match.

                          Conversely, with the (?1) and (?2) syntaxes, which are subroutine calls to the contents of groups 1 and 2, the syntax (?s)(word1).+?(word2)|(?2).+?(?1) is correct and can match the two cases. Note that subroutine calls are really interesting when the groups themselves contain regexes, possibly complex, instead of simple strings!
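The difference can be checked with a short Python sketch. Note the assumptions: Python's standard `re` module does not support the (?1) subroutine syntax, so the expanded form with the groups written out is used in its place, and in Python a back-reference to a group that has not captured simply fails to match.

```python
import re

# Back-reference variant: \2 and \1 need groups that have already captured.
BACKREF = re.compile(r"(?s)(word1).+?(word2)|\2.*?\1")
# Expanded variant, i.e. the (?2) and (?1) calls written out by hand.
EXPANDED = re.compile(r"(?s)(word1).+?(word2)|(word2).+?(word1)")

forward = "word1 ...\n... word2"   # word1 before word2
reverse = "word2 ...\n... word1"   # word2 before word1

for name, rx in (("backref", BACKREF), ("expanded", EXPANDED)):
    print(name,
          rx.search(forward) is not None,
          rx.search(reverse) is not None)
```

The back-reference variant only finds the word1-then-word2 order, while the expanded variant finds both orders.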

                          A simple example : given this text :

                          123---ABC---123
                          123---ABC---456
                          123---ABC---789
                          
                          456---ABC---123
                          456---ABC---456
                          456---ABC---789
                          
                          789---ABC---123
                          789---ABC---456
                          789---ABC---789
                          

                          See the difference between the regex (\d+)---ABC---\1 and the regex (\d+)---ABC---(?1), against that text :

                          • In the former, the back-reference \1 refers to the present value of the group 1

                          • In the latter, the subroutine call (?1) refers to regex contents of the group 1, so \d+

                          This means that the last regex is just identical to the regex \d+---ABC---\d+. Of course, a subroutine call can refer to a much more complex regex than \d+!
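To see the difference on the sample text, here is a minimal Python sketch. Since Python's standard `re` module has no (?1) subroutine syntax, the second pattern is written in its equivalent expanded form \d+---ABC---\d+; this substitution is an assumption of the sketch.

```python
import re

numbers = ["123", "456", "789"]
# Rebuild the nine sample lines: every left/right combination.
text = "\n".join(f"{a}---ABC---{b}" for a in numbers for b in numbers)

# Back-reference: the right side must repeat the text captured on the left.
backref = [m.group(0) for m in re.finditer(r"(\d+)---ABC---\1", text)]
# Expanded subroutine call: the right side only has to match \d+ again.
expanded = [m.group(0) for m in re.finditer(r"\d+---ABC---\d+", text)]

print(len(backref), backref)
print(len(expanded))
```

The back-reference version reports only the three lines whose two numbers are equal, while the expanded version reports all nine lines.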

                          Best Regards,

                          guy038

                          • Alan KilbornA
                            Alan Kilborn @guy038
                            last edited by

                            @guy038

                            Hi Guy, yes, here’s what happened when I answered the question originally:

                            I looked in my file of notes and I saw this example:

                            (?-s)(?=.*foo)(?=.*bar).*

                            So I copied that example in my response above, and blindly changed the leading (?-s) to (?s), after some quick testing on small data.

                            After the OP had problems with that, I looked a bit farther down in my notes file and found the note to use this one when the data is not necessarily on the same line:

                            (?s)(foo).+?(bar)|(?2).+?(?1)

                            So the conclusion I draw, is that it is great to have “notes”, but it is also smart to really read them before just grabbing a snippet and changing it even slightly, to then offer it as advice. :-(

                            • Alan KilbornA
                              Alan Kilborn
                              last edited by

                              Another variant on this general theme that is handy is finding two words, in either order, with a certain degree of proximity. This way I can find, in my notes, words that I may not know the exact phrasing for, but that I know are going to be there, and be close to each other, when I need to look up something.

                              So, say I want to find foo close to bar, say within 50 characters. Maybe bar occurs before foo, but maybe not. Here’s what I’d search for:

                              (?s)(foo)(.{0,50}?)(bar)|(?3)(?2)(?1)
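A quick way to get a feel for the proximity search is the Python sketch below. It assumes the expanded form foo.{0,50}?bar|bar.{0,50}?foo, because Python's standard `re` module does not support the (?3)(?2)(?1) subroutine calls; the helper name `near` is hypothetical.

```python
import re

# Expanded form of the proximity regex: either order, at most 50
# characters (line ends included, thanks to (?s)) between the two words.
NEAR = re.compile(r"(?s)foo.{0,50}?bar|bar.{0,50}?foo")


def near(text):
    return NEAR.search(text) is not None


print(near("foo is close to bar"))           # foo first
print(near("bar on one line\nfoo on next"))  # bar first, across a line end
print(near("foo" + "x" * 50 + "bar"))        # exactly 50 characters apart
print(near("foo" + "x" * 51 + "bar"))        # one character too far
```

Changing the 50 in both alternatives widens or narrows the window.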
