finding files using reg-ex
I want to find files in a directory that contain two (or more) specific words. Files containing word1 OR word2 are returned using | but how can I find files that contain word1 AND word2 ? I tried (word1)*.(word2) but that didn’t work.
Thanks for your help.
Is word order important?
That is, must word1 appear before word2?
Must the two words be on the same line?
Or can they be anywhere in the file?
Do you want to find all occurrences of this in a file?
Or just the first is sufficient?
Something that should work (until you “tighten up” your spec) is:
@Alan-Kilborn No, word order is not important; the regex should find any order (and occurrence) of both words.
So then I think what I already provided should work fine for you.
@Alan-Kilborn Unfortunately it does not work. The result contains all the files in the directory, independent of the occurrence of either word1 or word2.
Hmm, well I just tried it again to verify, and for me it found only the files that contained both words.
Not sure what would be going wrong for you with it.
how can I find files that contain word1 AND word2
Another technique for this “and” problem:
Did you know that you can base a second search on the results of a first search?
After you do the Find in Files for “word1”, right-click in the “Find result” window and select Find in these found results…
You can then proceed to specify “word2” and the next search will be conducted only in the files found with your earlier search.
The net result of the second search should be files that must contain both “word1” and “word2”.
Caution: Be aware of your setting for Search only in found lines; for what you’ve specified, you want that option unticked.
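For anyone who wants the same "search, then search within the results" idea outside Notepad++, here is a minimal Python sketch. It only illustrates the narrowing logic; the function names `files_containing` and `files_containing_all` and the sample file names are made up for this example, and it does a plain case-sensitive substring test rather than anything Notepad++ does internally.

```python
import tempfile
from pathlib import Path


def files_containing(paths, word):
    """Return the subset of paths whose text contains word (case-sensitive)."""
    hits = []
    for path in paths:
        # errors="ignore" keeps the sketch from failing on non-text bytes.
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        if word in text:
            hits.append(path)
    return hits


def files_containing_all(paths, words):
    """Narrow the candidate list once per word, like searching within found results."""
    candidates = list(paths)
    for word in words:
        candidates = files_containing(candidates, word)
    return candidates


# Small demonstration on throwaway files (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "both.txt": "first line has word1\nlater line has word2\n",
        "only1.txt": "just word1 here\n",
        "only2.txt": "just word2 here\n",
        "reversed.txt": "word2 comes first\nthen word1\n",
    }
    for name, content in samples.items():
        (Path(tmp) / name).write_text(content, encoding="utf-8")

    all_files = sorted(Path(tmp).glob("*.txt"))
    both_words = [p.name for p in files_containing_all(all_files, ["word1", "word2"])]

print(both_words)
```

Because each pass only looks at whole files, the words may be on different lines and in either order, which corresponds to leaving Search only in found lines unticked.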
I don’t know what went wrong with the first search on my original files. After that I created a test directory with some simple test files and the second search using your regex delivered the expected result.
Thank you very much for the suggestion to use “find in these results”. This method is much easier (though less elegant) and works fine.
I don’t know what went wrong with the first search on my original files
It could be that the regex I gave is causing an “overflow” with larger files.
This is a known Notepad++ problem where, on a single file, all text is deemed a “hit” when really an error message should be displayed.
With the info provided to this point, I can’t tell for certain if this is something you are encountering. My testing of it was certainly done on very small files that I quickly made, so the large file phenomenon, if that’s truly what is happening for you, would not have happened to me.
Here’s another possible regex to try:
I’d be interested to know if you have a different experience with that one, on your original fileset.
However, it seems like your immediate problem is solved with the other technique, and that is a “good thing”. :-)
@Alan-Kilborn My original files are MS-Word .doc files, between 40 kB and 70 kB in size. I would not consider these files large, and apparently they are small enough for a normal search for only one word. Is the file size only a problem when using regular expressions?
Your second regex delivers correct results on both the original doc and the testfiles (txt). I have to admit that I do not fully understand the expressions. However, I am very happy now, having two solutions for the problem. Thank you!
Is the file size only a problem when using regular expressions?
File size isn’t the problem, per se. The problem is that in a large file the two words could be far apart, forcing the regex engine to do a lot of “work”, and it can become “overloaded”.
In large files where the words are close together it should be no problem, and small files should be okay as well.
@dieter-zweigel, you said :
I have to admit that I do not fully understand the expressions.
@alan-kilborn’s regex (?s)(?=.*?word1)(?=.*?word2).* may be described as follows:
The (?s) syntax means that the regex dot symbol . represents any single character, even EOL chars!
Then come two positive look-ahead structures (?=.....), each of which tests whether the expression inside it can be matched from the current position, without moving that position.
First question: from the beginning of the file, is there, further on, a string word1, after any range of characters, possibly empty?
After this first step, it’s important to understand that processing the first look-ahead (?=.*?word1) has not changed the regex engine search position, which is still at the very beginning of the file!
So, second question: from the beginning of the file, is there, further on, a string word2, after any range of characters, possibly empty?
If the answer to these two questions is yes, then the regex engine matches, again from the very beginning, all the file contents with .*. However, note that in the Find Result panel only the first physical line of each file is displayed, because the whole file is matched as a single block. Safe behavior in case of huge files ;-))
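The look-ahead behavior described here can be tried outside Notepad++ too. Below is a minimal sketch using Python's standard `re` module, which is a different engine from the one in Notepad++ but handles (?s) and (?=...) the same way for this expression. The sample strings are invented for the illustration.

```python
import re

# Same shape as the expression discussed above; word1/word2 are placeholders.
both_words = re.compile(r"(?s)(?=.*?word1)(?=.*?word2).*")

# Hypothetical "file contents" standing in for real files.
samples = {
    "in order": "alpha word1\nbeta word2\n",
    "reversed": "alpha word2\nbeta word1\n",
    "only one": "alpha word1\nbeta\n",
}

results = {}
for label, text in samples.items():
    # match() tries the pattern at position 0 only, i.e. from the very
    # beginning of the "file". Both look-aheads are tested there, and
    # neither of them consumes any text.
    m = both_words.match(text)
    results[label] = None if m is None else m.span()

for label, span in results.items():
    print(label, span)
```

When both words are present, the reported span starts at 0 and covers the whole string, whatever the order of the words; when one word is missing, there is no match at all.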
And to search for files containing at least one of word1 or word2, use this regex, with an alternative located inside the look-ahead:
Now, Alan, I did some tests with the simpler regex (?s)(?=.*AAA).* against the well-known license.txt file. This regex should select all file contents if the string AAA, whatever its case, exists, and should beep if no AAA string is found.
Unfortunately, I noticed that the search crashed and selected all file contents, although this file obviously does not contain the AAA string. I then shortened this file, and the regex seems to work only once the file is down to 13.5 kB, with the expected message Find: Can't find the text "(?s)(?=.*AAA).*". Surely my weak configuration distorts the results. Just test it on various files. The problem occurs when no match can be found!
It’s worth adding that this regex would work correctly if we were searching for word1 and word2 in each line of a file, and not in all file contents, with the regex:
To end with, @dieter-zweigel, note that the regex (?s)(word1).+?(word2)|(?2).+?(?1) is a shortened syntax for (?s)(word1).+?(word2)|(word2).+?(word1). This form is easier to understand and almost obvious. Indeed, we are looking for a text:
- Containing the string word1 and, further on, the string word2, or
- Containing the string word2 and, further on, the string word1
It’s important to realize that, although word1 and word2 are stored as groups 1 and 2, we cannot use the syntax (?s)(word1).+?(word2)|\2.*?\1, with back-references to these groups!
Do you see why? Well, when the first alternative is matched (Word1.........Word2), the back-references \1 and \2, although not used, do contain the strings word1 and word2. But when the first alternative fails (case Word2........Word1), the second alternative \2.*?\1 is tried. However, as neither group has been set at that point, this part of the regex cannot match.
Conversely, with the (?1) and (?2) syntaxes, which are subroutine calls to the contents of groups 1 and 2, the syntax (?s)(word1).+?(word2)|(?2).+?(?1) is correct and can match the two cases. Note that subroutine calls are really interesting when the groups themselves contain regexes, possibly complex, instead of simple strings!
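The back-reference pitfall can be reproduced with a short Python sketch. One caveat: Python's standard `re` module has no (?1) subroutine calls, so the sketch compares the back-reference form against the fully spelled-out form instead. The two test strings are made up for the illustration.

```python
import re

# Second alternative relies on back-references to groups from the first one.
with_backrefs = re.compile(r"(?s)(word1).+?(word2)|\2.*?\1")
# Second alternative repeats the sub-patterns, which is what (?2).+?(?1)
# stands for in engines that support subroutine calls.
spelled_out = re.compile(r"(?s)(word1).+?(word2)|(word2).+?(word1)")

forward = "word1 ... word2"
reverse = "word2 ... word1"

backref_forward = with_backrefs.search(forward) is not None
backref_reverse = with_backrefs.search(reverse) is not None
spelled_forward = spelled_out.search(forward) is not None
spelled_reverse = spelled_out.search(reverse) is not None

print("back-references:", backref_forward, backref_reverse)
print("spelled out:    ", spelled_forward, spelled_reverse)
```

In the reversed case the first alternative has not matched, so groups 1 and 2 hold nothing and \2.*?\1 has nothing to compare against. The spelled-out form finds both orders.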
A simple example: given this text:
123---ABC---123
123---ABC---456
123---ABC---789
456---ABC---123
456---ABC---456
456---ABC---789
789---ABC---123
789---ABC---456
789---ABC---789
See the difference between the regex (\d+)---ABC---\1 and the regex (\d+)---ABC---(?1), against that text:
In the former, the back-reference \1 refers to the present value of the group 1
In the latter, the subroutine call (?1) refers to the regex contents of the group 1
This means that the last regex is just identical to the regex \d+---ABC---\d+. Of course, a subroutine call can refer to a much more complex regex than \d+
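To see the difference in numbers, here is a small Python sketch. Since Python's standard `re` module does not support (?1), the subroutine-call regex is written in its equivalent form \d+---ABC---\d+, as stated above. The text is rebuilt from the nine combinations of 123, 456 and 789, one per line.

```python
import re

# The nine sample lines: every pairing of 123, 456 and 789 around ---ABC---.
numbers = ("123", "456", "789")
text = "\n".join(f"{left}---ABC---{right}" for left in numbers for right in numbers)

# Back-reference: the second number must repeat the text captured by group 1.
backref = re.compile(r"(\d+)---ABC---\1")
# Equivalent of the subroutine-call form: the second number only has to
# satisfy the same sub-pattern \d+ again.
repeated = re.compile(r"\d+---ABC---\d+")

backref_hits = [m.group(0) for m in backref.finditer(text)]
repeated_hits = [m.group(0) for m in repeated.finditer(text)]

print(len(backref_hits), backref_hits)
print(len(repeated_hits))
```

The back-reference version keeps only the three lines where both numbers are identical, while the other version matches all nine lines.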
Hi Guy, yes, here’s what happened when I answered the question originally:
I looked in my file of notes and I saw this example:
So I copied that example in my response above, and blindly changed the leading modifier to (?s), after some quick testing on small data.
After the OP had problems with that, I looked a bit farther down in my notes file and found the note to use this one when the data is not necessarily on the same line:
So the conclusion I draw, is that it is great to have “notes”, but it is also smart to really read them before just grabbing a snippet and changing it even slightly, to then offer it as advice. :-(
Another variant on this general theme that is handy is finding two words, in either order, with a certain degree of proximity. This way I can find, in my notes, words that I may not know the exact phrasing for, but that I know are going to be there, and be close to each other, when I need to look up something.
So, say I want to find foo and bar, say within 50 characters of each other. Maybe bar comes after foo, but maybe not. Here’s what I’d search for:
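As an illustration of the proximity idea only (this is a sketch written for Python's `re` module, not necessarily the expression meant for Notepad++), one way to express "foo and bar, in either order, with at most 50 characters between them" is shown below. The pattern and the three sample strings are assumptions made for this example.

```python
import re

# Either order, with at most 50 characters of any kind (including line
# breaks, because of (?s)) between the two words.
near = re.compile(r"(?s)foo.{0,50}?bar|bar.{0,50}?foo")

close_text = "foo is mentioned, and bar follows shortly"
reversed_text = "bar first, then foo"
far_text = "foo" + "x" * 60 + "bar"  # 60 characters apart, beyond the limit

close_found = near.search(close_text) is not None
reversed_found = near.search(reversed_text) is not None
far_found = near.search(far_text) is not None

print(close_found, reversed_found, far_found)
```

The bounded quantifier {0,50} is what limits the distance, and it also keeps the engine from scanning the rest of a large file for the second word.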