finding files using reg-ex
-
Hi,
I want to find files in a directory that contain two (or more) specific words. Files containing word1 OR word2 are returned using | but how can I find files that contain word1 AND word2 ? I tried (word1)*.(word2) but that didn’t work.
Thanks for your help.
-
Is word order important?
That is, must word1 appear before word2?
Must the two words be on the same line?
Or can they be anywhere in the file?
Do you want to find all occurrences of this in a file?
Or just the first is sufficient?
Something that should work (until you “tighten up” your spec) is:
find:
(?s)(?=.*word1)(?=.*word2).*
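For readers who want to try the idea outside Notepad++, here is a minimal sketch of the same two-look-ahead pattern. It assumes Python's `re` module as a stand-in for Notepad++'s regex engine, and `contains_both` is a hypothetical helper name, not something from this thread.

```python
import re

# Assumption: Python's re module stands in for the Notepad++ regex engine.
# (?s) lets the dot match line endings, so each look-ahead scans the whole text.
BOTH_WORDS = re.compile(r"(?s)(?=.*word1)(?=.*word2).*")


def contains_both(text):
    """Return True when text contains word1 and word2, in any order."""
    # match() only tries position 0, which is where both look-aheads are tested.
    return BOTH_WORDS.match(text) is not None


print(contains_both("word1 on line one\nword2 on line two"))  # True
print(contains_both("word2 comes first, then word1"))         # True
print(contains_both("only word1 in this text"))               # False
```

Both look-aheads are evaluated at the same starting position, which is why word order does not matter.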
-
@Alan-Kilborn No, word order is not important; the regex should find any order (and occurrence) of both words.
-
So then I think what I already provided should work fine for you.
Does it?
-
@Alan-Kilborn Unfortunately it does not work. The result contains all the files in the directory, independent of the occurrence of either word1 or word2.
-
Hmm, well I just tried it again to verify, and for me it found only the files that contained both words.
Not sure what would be going wrong for you with it.
Sorry. :-(
-
@Dieter-Zweigel said in finding files using reg-ex:
how can I find files that contain word1 AND word2
Another technique for this “and” problem:
Did you know that you can base a second search on the results of a first search?
Here’s how:
After you do the Find in Files for “word1”, right-click in the “Find result” window and select Find in these found results…
You can then proceed to specify “word2” and the next search will be conducted only in the files found with your earlier search.
The net result of the second search should be files that must contain both “word1” and “word2”.
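The same two-pass idea can be sketched in a short script. This is only an illustration under assumptions: the helper names `files_containing` and `files_with_both` are hypothetical, the files are read as UTF-8 text, and a plain substring test stands in for Notepad++'s Find in Files.

```python
from pathlib import Path
import tempfile


def files_containing(paths, word):
    """Keep only the files whose text contains word (plain substring test)."""
    hits = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        if word in text:
            hits.append(Path(path))
    return hits


def files_with_both(directory, word1, word2):
    """First pass finds word1; second pass searches only those hits for word2."""
    candidates = sorted(p for p in Path(directory).iterdir() if p.is_file())
    first_pass = files_containing(candidates, word1)
    return files_containing(first_pass, word2)


if __name__ == "__main__":
    # Throwaway test files, created only for this demonstration.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "a.txt").write_text("word1 here\nand word2 there\n", encoding="utf-8")
        (Path(tmp) / "b.txt").write_text("only word1\n", encoding="utf-8")
        (Path(tmp) / "c.txt").write_text("only word2\n", encoding="utf-8")
        print([p.name for p in files_with_both(tmp, "word1", "word2")])  # ['a.txt']
```

Because the second pass only ever sees files that survived the first, no regex is needed at all.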
Caution: Be aware of your setting for Search only in found lines; for what you’ve specified, you want to untick that.
-
I don’t know what went wrong with the first search on my original files. After that I created a test directory with some simple test files and the second search using your regex delivered the expected result.
Thank you very much for the suggestion to use “find in these results”. This method is much easier (though less elegant) and works fine.
-
@Dieter-Zweigel said in finding files using reg-ex:
I don’t know what went wrong with the first search on my original files
It could be that the regex I gave is causing an “overflow” with larger files.
This is a known Notepad++ problem where, on a single file, all text is deemed a “hit” when really an error message should be displayed.
With the info provided to this point, I can’t tell for certain if this is something you are encountering. My testing of it was certainly done on very small files that I quickly made, so the large file phenomenon, if that’s truly what is happening for you, would not have happened to me.
Here’s another possible regex to try:
(?s)(word1).+?(word2)|(?2).+?(?1)
I’d be interested to know if you have a different experience with that one, on your original fileset.
However, it seems like your immediate problem is solved with the other technique, and that is a “good thing”. :-)
-
@Alan-Kilborn My original files are MS-Word .doc of a size between 40 kB and 70 kB. I would not consider these files being large - and apparently they are small enough for a normal search for only one word. Is the file size only a problem when using regular expressions?
Your second regex delivers correct results on both the original doc and the test files (txt). I have to admit that I do not fully understand the expressions. However, I am very happy now, having two solutions for the problem. Thank you!
-
@Dieter-Zweigel said in finding files using reg-ex:
Is the file size only a problem when using regular expressions?
File size isn’t the problem, per se. The problem is that in a large file the two words could be far apart, causing the regex engine to have to do a lot of “work” and it can become “overloaded”.
In large files where the words are close together it should be no problem; small files, obviously, should be okay as well.
-
Hello, @dieter-zweigel , @alan-kilborn and All,
@dieter-zweigel, you said :
I have to admit that I do not fully understand the expressions.
@alan-kilborn’s regex
(?s)(?=.*?word1)(?=.*?word2).*
may be described as follows:
- First, the (?s) syntax means that the regex dot symbol . represents any single character, even EOL chars!
- Then come two positive look-ahead structures (?=.....) which test if the regex expression after the = sign is true:
  - From the beginning of the file, is there, further on, a string word1, after the greatest range, possibly null, of any character?
  - After this first step, it’s important to understand that processing the first look-ahead (?=.*word1) has not changed the regex engine search position, which is still at the very beginning of the file!
  - So, from the beginning of the file, is there, further on, a string word2, after the greatest range, possibly null, of any character?
- If the answer to these two questions is yes, then the regex engine matches, again from the very beginning, all the file contents with .*
However, note that, when the Find Result panel is involved, only the first physical line of each file, globally seen as a single line, is displayed. Safe behavior in case of huge files ;-))
And to search for files containing at least 1 string word1 OR 1 string word2, use this regex, with an alternative located inside the look-ahead:
(?s)(?=.*(word1|word2)).*
Now, Alan, I did some tests with the simpler regex
(?s)(?=.*AAA).*
against the well-known license.txt file. This regex should select all file contents if the string AAA, whatever its case, exists, and should beep if no string AAA is found.
Unfortunately, I noticed that the search crashed and selected all file contents, although this file obviously does not contain the AAA string. I then shortened this file, and the regex seems to work only for a 13.5 kB file, with the expected message
Find: Can't find the text "(?s)(?=.*AAA).*"
Surely, my weak configuration corrupts the results. Just test it on various files. The problem occurs when no match can be found!
It’s worth adding that this regex would work correctly if we were searching for word1 and word2 in each line of a file, and not in all file contents, with the regex
(?-s)(?=.*word1)(?=.*word2).+
;-))
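The per-line variant can be sketched as follows, again assuming Python's `re` module rather than Notepad++. Two details here are assumptions of the sketch and not part of the regex above: Python has no global `(?-s)` switch (its dot already stops at line endings by default), and a `^` anchor with `re.MULTILINE` is added so that each line is tested once from its start. `lines_with_both` is a hypothetical helper name.

```python
import re

# Assumption: Python's re module stands in for the Notepad++ regex engine.
# Without re.DOTALL the dot never crosses a line ending, so both look-aheads
# are confined to the current line.
SAME_LINE = re.compile(r"^(?=.*word1)(?=.*word2).+", re.MULTILINE)


def lines_with_both(text):
    """Return every line of text that contains both word1 and word2."""
    return [match.group(0) for match in SAME_LINE.finditer(text)]


sample = "word1 and word2\nword1 only\nword2 then word1\nnothing here"
print(lines_with_both(sample))  # ['word1 and word2', 'word2 then word1']
```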
So, @dieter-zweigel, I would advise you to use, preferably, the second @alan-kilborn regex syntax, which is much faster and does not report wrong matches.
To end with, @dieter-zweigel, note that this regex
(?s)(word1).+?(word2)|(?2).+?(?1)
is a shortened syntax for:
(?s)(word1).+?(word2)|(word2).+?(word1)
This form is easier to understand and almost obvious. Indeed, we are looking for a text:
- Containing the string word1 and, further on, the string word2
OR ( | )
- Containing the string word2 and, further on, the string word1
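The expanded form can be tried directly in Python. This is a sketch under assumptions: Python's standard `re` module does not support the `(?1)` subroutine-call syntax, so only the longhand alternation is used, and `has_both_either_order` is a hypothetical helper name.

```python
import re

# Assumption: Python's re module stands in for the Notepad++ regex engine.
# Longhand form: word1 ... word2, OR word2 ... word1, across line endings.
EITHER_ORDER = re.compile(r"(?s)(word1).+?(word2)|(word2).+?(word1)")


def has_both_either_order(text):
    """Return True when both words occur with at least one character between them."""
    return EITHER_ORDER.search(text) is not None


print(has_both_either_order("word1 ...\n... word2"))  # True
print(has_both_either_order("word2 ...\n... word1"))  # True
print(has_both_either_order("word1 and word1 again"))  # False
```

One consequence of `.+?` is that at least one character must separate the two words, so a text consisting of just `word1word2` would not match.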
It’s important to realize that, although word1 and word2 are stored as groups 1 and 2, we cannot use the syntax
(?s)(word1).+?(word2)|\2.*?\1
with back-references to these groups!
Do you see why? Well, when the first alternative is matched (word1.........word2), the back-references \1 and \2, although not used, do contain the strings word1 and word2. But, when the first alternative fails (case word2........word1), the second alternative \2.*?\1 is tried. However, as no group is defined, this regex part is just invalid.
Conversely, with the (?1) and (?2) syntaxes, which are subroutine calls to the contents of groups 1 and 2, the syntax
(?s)(word1).+?(word2)|(?2).+?(?1)
is correct and can match the two cases. Note that subroutine calls are really interesting when groups themselves contain regexes, possibly complex, instead of simple strings!
A simple example: given this text:
123---ABC---123
123---ABC---456
123---ABC---789
456---ABC---123
456---ABC---456
456---ABC---789
789---ABC---123
789---ABC---456
789---ABC---789
See the difference between the regex
(\d+)---ABC---\1
and the regex
(\d+)---ABC---(?1)
against that text:
- In the former, the back-reference \1 refers to the present value of group 1
- In the latter, the subroutine call (?1) refers to the regex contents of group 1, so \d+
This means that the last regex is just identical to the regex
\d+---ABC---\d+
Of course, a subroutine call can refer to a much more complex regex than \d+!
Best Regards,
guy038
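The back-reference versus subroutine-call distinction described in this post can be checked with a small sketch. It assumes Python's `re` module, which has no `(?1)` syntax, so the subroutine-call regex is written in its equivalent longhand form \d+---ABC---\d+. The last part shows how Python treats a back-reference to a group that did not take part in the match: it simply never matches, which may differ from how Notepad++ reports it.

```python
import re

SAMPLE = """\
123---ABC---123
123---ABC---456
123---ABC---789
456---ABC---123
456---ABC---456
456---ABC---789
789---ABC---123
789---ABC---456
789---ABC---789
"""
lines = SAMPLE.splitlines()

# \1 must repeat the exact text captured by group 1.
backref = re.compile(r"(\d+)---ABC---\1")
# Longhand equivalent of (\d+)---ABC---(?1): the sub-pattern \d+ is reused,
# not the captured text.
expanded = re.compile(r"\d+---ABC---\d+")

backref_hits = [line for line in lines if backref.fullmatch(line)]
expanded_hits = [line for line in lines if expanded.fullmatch(line)]
print(len(backref_hits), len(expanded_hits))  # 3 9

# Back-references in the second alternative point at groups that are unset
# whenever the first alternative has failed, so that branch cannot match.
broken = re.compile(r"(?s)(word1).+?(word2)|\2.*?\1")
print(broken.search("word1 ... word2") is not None)  # True
print(broken.search("word2 ... word1") is not None)  # False
```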
-
-
Hi Guy, yes, here’s what happened when I answered the question originally:
I looked in my file of notes and I saw this example:
(?-s)(?=.*foo)(?=.*bar).*
So I copied that example in my response above, and blindly changed the leading (?-s) to (?s), after some quick testing on small data.
After the OP had problems with that, I looked a bit farther down in my notes file and found the note to use this one when the data is not necessarily on the same line:
(?s)(foo).+?(bar)|(?2).+?(?1)
So the conclusion I draw, is that it is great to have “notes”, but it is also smart to really read them before just grabbing a snippet and changing it even slightly, to then offer it as advice. :-(
-
Another variant on this general theme that is handy is finding two words, in either order, with a certain degree of proximity. This way I can find, in my notes, words that I may not know the exact phrasing for, but that I know are going to be there, and be close to each other, when I need to look up something.
So, say I want to find foo close to bar, say within 50 characters. Maybe bar occurs before foo, but maybe not. Here’s what I’d search for:
(?s)(foo)(.{0,50}?)(bar)|(?3)(?2)(?1)
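A rough Python equivalent of this proximity search is sketched below. It rests on assumptions: Python's `re` module lacks the `(?1)`-style subroutine calls, so the pattern is built in longhand, and `near` with its `max_gap` parameter is a hypothetical helper, not something from the thread.

```python
import re


def near(word_a, word_b, max_gap=50):
    """Compile a regex matching word_a and word_b, in either order,
    separated by at most max_gap characters (line endings included)."""
    a, b = re.escape(word_a), re.escape(word_b)
    gap = rf".{{0,{max_gap}}}?"  # e.g. .{0,50}? for the default gap
    # Longhand form of (?s)(foo)(.{0,50}?)(bar)|(?3)(?2)(?1)
    return re.compile(rf"(?s){a}{gap}{b}|{b}{gap}{a}")


pattern = near("foo", "bar")
print(pattern.search("foo, a few words, then bar") is not None)   # True
print(pattern.search("bar\nfoo") is not None)                     # True
print(pattern.search("foo" + "x" * 51 + "bar") is not None)       # False
```

Escaping the words with `re.escape` keeps the sketch safe when a search term contains regex metacharacters.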