finding files using reg-ex

Dieter Zweigel

Hi,
I want to find files in a directory that contain two (or more) specific words. Files containing word1 OR word2 are returned using | but how can I find files that contain word1 AND word2 ? I tried (word1)*.(word2) but that didn’t work.
Thanks for your help.

Alan Kilborn

@Dieter-Zweigel

Is word order important?
That is, must word1 appear before word2?

Must the two words be on the same line?
Or can they be anywhere in the file?

Do you want to find all occurrences of this in a file?
Or just the first is sufficient?

Something that should work (until you “tighten up” your spec) is:

find: (?s)(?=.*word1)(?=.*word2).*

Dieter Zweigel

@Alan-Kilborn No, word order is not important; the regex should find any order (and occurrence) of both words.

Alan Kilborn

@Dieter-Zweigel

So then I think what I already provided should work fine for you.
Does it?

Dieter Zweigel

@Alan-Kilborn Unfortunately it does not work. The result contains all the files in the directory, independent of the occurrence of either word1 or word2.

Alan Kilborn

@Dieter-Zweigel

Hmm, well I just tried it again to verify, and for me it found only the files that contained both words.
Not sure what would be going wrong for you with it.
Sorry. :-(

Alan Kilborn

@Dieter-Zweigel said in finding files using reg-ex:

how can I find files that contain word1 AND word2

Another technique for this “and” problem:

Did you know that you can base a second search on the results of a first search?

Here’s how:

After you do the Find in Files for “word1”, right-click in the “Find result” window and select Find in these found results…
You can then proceed to specify “word2” and the next search will be conducted only in the files found with your earlier search.
The net result of the second search should be files that must contain both “word1” and “word2”.
Caution: Be aware of your setting for Search only in found lines; for what you’ve specified, you want to untick that.
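
For anyone who wants to see the logic of this two-pass approach outside Notepad++, here is a minimal Python sketch. It is not how Notepad++ implements the feature; the helper name files_containing and the sample file names are hypothetical, and the "files" are just in-memory strings.

```python
# Hypothetical in-memory "directory": file name -> file contents.
files = {
    "a.txt": "this file mentions word1 only",
    "b.txt": "word2 comes first here, then word1",
    "c.txt": "this file mentions word2 only",
}

def files_containing(word, candidates):
    """Keep only the candidates whose text contains the given word."""
    return {name: text for name, text in candidates.items() if word in text}

# First pass: search everything for word1.
first_pass = files_containing("word1", files)
# Second pass: search only the files found by the first pass for word2.
second_pass = files_containing("word2", first_pass)

print(sorted(second_pass))
```

Only files that survive both passes remain, which is the "AND" behaviour asked for.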

Dieter Zweigel

I don’t know what went wrong with the first search on my original files. After that I created a test directory with some simple test files and the second search using your regex delivered the expected result.
Thank you very much for the suggestion to use “find in these results”. This method is much easier (though less elegant) and works fine.

Alan Kilborn

@Dieter-Zweigel said in finding files using reg-ex:

I don’t know what went wrong with the first search on my original files

It could be that the regex I gave is causing an “overflow” with larger files.
This is a known Notepad++ problem where, on a single file, all text is deemed a “hit” when really an error message should be displayed.
With the info provided to this point, I can’t tell for certain if this is something you are encountering. My testing of it was certainly done on very small files that I quickly made, so the large file phenomenon, if that’s truly what is happening for you, would not have happened to me.

Here’s another possible regex to try:

(?s)(word1).+?(word2)|(?2).+?(?1)

I’d be interested to know if you have a different experience with that one, on your original fileset.

However, it seems like your immediate problem is solved with the other technique, and that is a “good thing”. :-)

Dieter Zweigel

@Alan-Kilborn My original files are MS-Word .doc files of a size between 40 kB and 70 kB. I would not consider these files to be large, and apparently they are small enough for a normal search for only one word. Is the file size only a problem when using regular expressions?
Your second regex delivers correct results on both the original doc and the testfiles (txt). I have to admit that I do not fully understand the expressions. However, I am very happy now, having two solutions for the problem. Thank you!

Alan Kilborn

@Dieter-Zweigel said in finding files using reg-ex:

Is the file size only a problem when using regular expressions?

File size isn’t the problem, per se. The problem is that in a large file the two words could be far apart, causing the regex engine to have to do a lot of “work”, and it can become “overloaded”.

Large files where the words are close together should be no problem; small files, obviously, should be okay as well.

guy038

Hello, @dieter-zweigel , @alan-kilborn and All,

@dieter-zweigel, you said :

I have to admit that I do not fully understand the expressions.

@alan-kilborn’s regex (?s)(?=.*word1)(?=.*word2).* may be described as :

First, the (?s) syntax means that the regex dot symbol . represents any single character, even EOL chars !
Then come two positive look-ahead structures (?=.....), each of which tests whether the regex after the = sign can match from the current position
- From the beginning of the file, is there, further on, a string word1, after the greatest possible range, possibly empty, of any characters ?
- After this first step, it’s important to understand that processing the first look-ahead (?=.*word1) has not changed the regex engine’s search position, which is still at the very beginning of the file !
- So, from the beginning of the file, is there, further on, a string word2, after the greatest possible range, possibly empty, of any characters ?
If the answer to these two questions is yes, then the regex engine matches, again from the very beginning, all the file contents with .* . However, note that, when the Find Result panel is involved, only the first physical line of each file, globally seen as a single line, is displayed. Safe behavior in case of huge files ;-))
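
The look-ahead logic described above can be tried outside Notepad++ with a minimal Python sketch. Python's re module is a different engine from the one Notepad++ uses, so this only illustrates the pattern itself, not the large-file behaviour discussed in this thread. The function name contains_both and the sample file names are made up for the example.

```python
import re

# Same pattern as in the thread; re.DOTALL plays the role of (?s),
# so the dot also matches end-of-line characters.
BOTH_WORDS = re.compile(r"(?=.*word1)(?=.*word2).*", re.DOTALL)

def contains_both(text):
    # match() starts at position 0, like a search from the beginning of the file.
    # Both look-aheads are tested there without moving the search position.
    return BOTH_WORDS.match(text) is not None

# Hypothetical sample "files" held in memory.
samples = {
    "both.txt": "line with word2\nlater line with word1\n",
    "only_one.txt": "line with word1\nnothing else here\n",
    "neither.txt": "nothing relevant\n",
}

hits = [name for name, text in samples.items() if contains_both(text)]
print(hits)
```

Word order does not matter here, because each look-ahead scans from the same starting position.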

And to search for files containing, at least, 1 string word1 OR 1 string word2, use this regex, with an alternative located inside the look-ahead :

(?s)(?=.*(word1|word2)).*

Now, Alan, I did some tests with the simpler regex (?s)(?=.*AAA).* against the well-known license.txt file. This regex should select all file contents if the string AAA, whatever its case, exists, and should beep if no string AAA is found.

Unfortunately, I noticed that the search failed and selected all file contents, although this file obviously does not contain the AAA string. I then shortened the file, and the regex seems to work only for a 13.5 kB file, with the expected message Find: Can't find the text "(?s)(?=.*AAA).*". Surely my weak configuration affects the results. Just test it on various files. The problem occurs when no match can be found !

It’s worth adding that this kind of regex would work correctly if we were searching for word1 and word2 within each line of a file, and not in the whole file contents, with the regex (?-s)(?=.*word1)(?=.*word2).+ ;-))
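
As a rough illustration of the per-line variant, here is a small Python sketch. It assumes Python's re module, where the dot does not match a newline unless re.DOTALL is given, which corresponds to (?-s). The sample text is invented.

```python
import re

# No re.DOTALL flag: the dot stops at line ends, like (?-s) in the thread.
SAME_LINE = re.compile(r"(?=.*word1)(?=.*word2).+")

# Invented sample text: only lines 2 and 4 contain both words.
text = (
    "word1 only\n"
    "word2 then word1\n"
    "word2 only\n"
    "word1 and word2\n"
)

# No capture groups in the pattern, so findall() returns the full matched lines.
lines = SAME_LINE.findall(text)
print(lines)
```

Because each look-ahead can no longer run past the end of the line, the amount of text the engine has to scan per attempt stays small.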

So, @dieter-zweigel, I would advise you to use, preferably, the second @alan-kilborn regex syntax, which is much faster and does not report wrong matches.

To end with, @dieter-zweigel, note that this regex (?s)(word1).+?(word2)|(?2).+?(?1) is a shortened syntax for :

(?s)(word1).+?(word2)|(word2).+?(word1). This form is easier to understand and almost obvious. Indeed, we are looking for a text :

Containing the string word1 and, further on, the string word2

OR ( | )

Containing the string word2 and, further on, the string word1
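
The expanded form can be tested with a short Python sketch. Python's built-in re module has no (?1) subroutine-call syntax, so only the longer, spelled-out alternation is used here; the helper name first_span is made up for the example.

```python
import re

# Expanded form from the thread; re.DOTALL stands in for (?s).
EITHER_ORDER = re.compile(r"(word1).+?(word2)|(word2).+?(word1)", re.DOTALL)

def first_span(text):
    """Return the first stretch of text running from one word to the other, or None."""
    m = EITHER_ORDER.search(text)
    return m.group(0) if m else None

forward = first_span("aaa word1 bbb\nccc word2 ddd")   # word1 before word2, across lines
backward = first_span("aaa word2 bbb word1 ccc")        # word2 before word1
missing = first_span("aaa word1 bbb")                   # only one of the words

print(repr(forward), repr(backward), missing)
```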

It’s important to realize that, although word1 and word2 are stored as groups 1 and 2, we cannot use the syntax (?s)(word1).+?(word2)|\2.*?\1, with back-references to these groups !

Do you see why ? Well, when the first alternative is matched ( case word1.........word2 ), the back-references \1 and \2, although not used, do contain the strings word1 and word2. But when the first alternative fails ( case word2........word1 ), the second alternative \2.*?\1 is tried. However, as neither group has been set at that point, the back-references have nothing to refer to, so this part of the regex cannot match.

Conversely, with the (?1) and (?2) syntaxes, which are subroutine calls to the contents of groups 1 and 2, the syntax (?s)(word1).+?(word2)|(?2).+?(?1) is correct and can match both cases. Note that subroutine calls are really interesting when the groups themselves contain regexes, possibly complex, instead of simple strings !

A simple example : given this text :

123---ABC---123
123---ABC---456
123---ABC---789

456---ABC---123
456---ABC---456
456---ABC---789

789---ABC---123
789---ABC---456
789---ABC---789

See the difference between the regex (\d+)---ABC---\1 and the regex (\d+)---ABC---(?1), against that text :

In the former, the back-reference \1 refers to the present value of the group 1
In the latter, the subroutine call (?1) refers to regex contents of the group 1, so \d+

This means that the last regex is just identical to the regex \d+---ABC---\d+. Of course, a subroutine call can refer to a much more complex regex than \d+ !
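
The difference can be checked with a small Python sketch on the sample text above. Since Python's re module does not support (?1), the second pattern is written as the equivalent \d+---ABC---\d+.

```python
import re

# Sample text from the post above.
text = """123---ABC---123
123---ABC---456
123---ABC---789

456---ABC---123
456---ABC---456
456---ABC---789

789---ABC---123
789---ABC---456
789---ABC---789
"""

# Back-reference: \1 must repeat the exact digits captured by group 1.
same_value = [m.group(0) for m in re.finditer(r"(\d+)---ABC---\1", text)]

# Pattern repeated (what the subroutine call amounts to): any digits on both sides.
same_pattern = [m.group(0) for m in re.finditer(r"\d+---ABC---\d+", text)]

print(len(same_value), len(same_pattern))
```

The back-reference version only matches the three lines whose two numbers are identical, while the repeated pattern matches all nine lines.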

Best Regards,

guy038

Alan Kilborn

@guy038

Hi Guy, yes, here’s what happened when I answered the question originally:

I looked in my file of notes and I saw this example:

(?-s)(?=.*foo)(?=.*bar).*

So I copied that example in my response above, and blindly changed the leading (?-s) to (?s), after some quick testing on small data.

After the OP had problems with that, I looked a bit farther down in my notes file and found the note to use this one when the data is not necessarily on the same line:

(?s)(foo).+?(bar)|(?2).+?(?1)

So the conclusion I draw is that it is great to have “notes”, but it is also smart to really read them before just grabbing a snippet and changing it even slightly, to then offer it as advice. :-(

Alan Kilborn

Another variant on this general theme that is handy is finding two words, in either order, with a certain degree of proximity. This way I can find, in my notes, words that I may not know the exact phrasing for, but that I know are going to be there, and be close to each other, when I need to look up something.

So, say I want to find foo close to bar, say within 50 characters. Maybe bar occurs before foo, but maybe not. Here’s what I’d search for:

(?s)(foo)(.{0,50}?)(bar)|(?3)(?2)(?1)
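
For experimenting with the proximity idea outside Notepad++, here is a minimal Python sketch. Python's re module lacks subroutine calls, so the pattern is spelled out in both orders; the function name is_near is hypothetical.

```python
import re

# Spelled-out equivalent of the proximity regex: at most 50 characters
# (including line breaks, thanks to re.DOTALL) between the two words.
NEAR = re.compile(r"foo.{0,50}?bar|bar.{0,50}?foo", re.DOTALL)

def is_near(text):
    return NEAR.search(text) is not None

close_enough = is_near("foo" + "x" * 50 + "bar")   # gap of exactly 50 characters
too_far = is_near("foo" + "x" * 51 + "bar")        # gap of 51 characters
reversed_order = is_near("bar and then\nfoo")      # bar first, across a line break

print(close_enough, too_far, reversed_order)
```

Changing the 50 in both alternatives widens or narrows the allowed gap.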