2 search strings in a group of files with the search function
-
Hi,
I think that this topic has been discussed before, so I apologize if I repeat myself.
I need to search for 2 strings in a group of files.
I tried with
(Word1) | (word2)
but I would like each file to appear only once in the search results.
How should I do this?
Thanks for your cooperation -
Not sure what you are talking about. Maybe you could provide a screenshot or an example,
along the lines of "this is what I expect and this is what happens" ?
Cheers
Claudia -
Hello, Andrea,
What you would really like to do is not clearly defined in your post ! So I tried to guess ;-)
- Seemingly, you’re looking for either the string Word1 OR the string Word2, in several files
- As these files may contain many occurrences of the string Word1 AND/OR many occurrences of the string Word2, I supposed that you would prefer to get only ONE result per file, wouldn’t you ?
- But, as it’s likely that most of them contain the TWO strings Word1 and Word2, do you prefer :
- A) : ONE result per file, containing ONE of the two strings, whichever it is
- B) : TWO results per file, with a first line containing ONE of the strings and a second line containing the OTHER string
Anyway, I’ll give you the solution for the two cases A) and B) ;-))
Initially, when I built these two regexes, for cases A) and B), I tried them on a dozen or so files and, unfortunately, I noticed that, when the file being scanned is large, the regex might cause catastrophic backtracking, ending in a wrong match of the whole contents of this file :-((
As I have not been able, yet, to find other correct regexes which would avoid this search failure, I decided to take advantage of a new N++ feature, implemented since the 6.9.2 version : the new option Find in this finder…, available when you right-click inside the Find result panel !
So, I split the problem into two parts :
- Firstly, output, in the Find result panel, only the lines of each scanned file which contain the string Word1 and/or Word2
- Secondly, on that restricted list, use my original regexes to get the right results !
So, follow these preliminary steps, below :
- Open the Find in Files dialog ( Ctrl + Shift + F )
- Type, in the Find what: field, the simple regex
(?-i)Word1|Word2
- Type, in the Replace with: field, the regex
$0
( SECURITY ! )
- Type, in the Filters: field, your list of files to be scanned
- Type, in the Directory: field, the absolute path of the folder containing your files
- Select the Regular expression search mode
- Click on the Find All button
=> The Find result panel should appear, with all the matching lines, from all your files to be scanned !
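As a rough illustration of what this first pass produces, here is a minimal Python sketch. It is not how N++ works internally : the `files` dictionary and the `find_in_files` helper are hypothetical stand-ins for a real folder and for the Find in Files dialog, and Python’s `re` module takes the place of the N++ regex engine.

```python
import re

# Hypothetical in-memory "files" (name -> content), standing in for a real folder.
files = {
    "a.txt": "alpha\nWord1 here\nnothing\nWord2 there\nWord1 again\n",
    "b.txt": "only Word2 in this one\nfiller\n",
    "c.txt": "no match at all\n",
}

# Same alternation as in the Find what: field. Without re.IGNORECASE the
# search is case-sensitive, which is what (?-i) asks for.
PATTERN = re.compile(r"Word1|Word2")


def find_in_files(files):
    """Return (file name, line number, line text) for every matching line."""
    hits = []
    for name, content in files.items():
        for number, line in enumerate(content.splitlines(), start=1):
            if PATTERN.search(line):
                hits.append((name, number, line))
    return hits


hits = find_in_files(files)
for name, number, line in hits:
    print(f"{name}: line {number}: {line}")
```

The list `hits` plays the role of the restricted text of the Find result panel : a file with several matching lines still shows up several times at this stage.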
Notes :
- The (?-i) modifier forces the regex search to be performed in a case-sensitive way. Use, instead, the (?i) syntax, if you prefer to run the search in a case-insensitive way !
- Although we’re just searching for something, and not replacing anything, it’s a good habit to always put, in the Replace with: field, the form $0 , which stands for the complete current matched string. Indeed, just suppose that you clicked, by mistake, on the Replace in Files button and that you confirmed the replacement, by clicking on the OK button of the validation dialog : this S/R would simply replace any matched string with this same string itself :-)) Quite reassuring, isn’t it ?
Now, we are going to exploit the restricted text, of the Find result panel :
Case A) :
This regex search looks, for each file, for the last line, in the Find result panel, containing either the string Word1 or the string Word2, with that EXACT case :
- Right-click inside the Find result panel and choose the Find in this finder… option
- In the Find what: field, type
(?s-i).*\K(?:Word1|Word2)
- Check the Search only on found lines option
- Uncheck the Match whole word only option
- Select the Regular expression search mode
- Click on the Find All button
=> A second “Find result” panel appears, with the indication - Line Filter Mode: only display the filtered results. This new panel should contain only ONE line per file, with either the string Word1 or the string Word2 !
Notes :
- The first part, (?s-i).* , looks for any amount, even empty or multi-line, of any character ( standard or EOL ), up to the last occurrence, in the file, of the string Word1 or Word2, with its exact case, matched by the non-capturing group (?:...|...)
- Due to the \K syntax, the start of the regex match is reset and the regex engine just matches the string Word1 or Word2
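The effect of this regex can be imitated outside N++. The sketch below is only an approximation in Python, whose `re` module does not support `\K` : a capturing group takes its place, and the hypothetical `last_hit` helper reports where that group starts.

```python
import re

# The greedy .* (with (?s), so that it also crosses line breaks) runs to the
# end of the text, then backtracks to the LAST occurrence of either word.
# The capturing group stands in for \K, which Python's re does not know.
LAST_WORD = re.compile(r"(?s).*(Word1|Word2)")


def last_hit(text):
    """Return (offset, word) of the last Word1/Word2 in text, or None."""
    m = LAST_WORD.match(text)
    if m is None:
        return None
    return m.start(1), m.group(1)


sample = "Word1 here\nWord2 there\nWord1 again\n"
print(last_hit(sample))  # prints (23, 'Word1')
```

Only the third line of `sample` is reported, which corresponds to the single line per file expected in case A).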
Case B) :
This regex search looks for the TWO last lines, in the Find result panel, containing the string Word1 first, then the string Word2, OR the string Word2 first, then the string Word1, all with that EXACT case :
- Select, again, the MAIN Find result panel ( IMPORTANT )
- Right-click inside it and choose, again, the Find in this finder… option
- In the Find what: field, type the regex
(?s-i).*\K(?:(Word1)(?=.*(?2))|(Word2)(?=.*(?1)))|.*\K(?:(?1)|(?2))
- Check the Search only on found lines option
- Uncheck the Match whole word only option
- Select the Regular expression search mode
- Click on the Find All button
=> A third “Find result” panel appears, with the indication - Line Filter Mode: only display the filtered results. This new panel should contain TWO lines, per file, with, for each file :
- A first line, containing the string Word1 and a second line, containing the string Word2, in that exact case
OR
- A first line, containing the string Word2 and a second line, containing the string Word1, in that exact case
NOTES :
- If a scanned file contains ONLY the string Word1, or ONLY the string Word2, a single line, with that string, is output, too !
- This search regex is quite difficult to understand, because it uses some expressions, called subroutine calls (?n) , which refer to the groups 1 and 2. It is not easy to explain this regex correctly ! I started with the simpler regex, below :
(?s-i).*\K(?:Word1(?=.*Word2)|Word2(?=.*Word1))|.*\K(?:Word1|Word2)
- After matching the longest range of any character, which is forgotten, due to the \K syntax, the regex engine tries to find, either :
- The string Word1, ONLY IF it’s followed, further on, by the string Word2 ( case C )
OR
- The string Word2, ONLY IF it’s followed, further on, by the string Word1 ( case D )
- Then, after matching, again, the longest range of any character, which is reset, due to the \K form, the regex engine tries, this time, to find, either :
- The other string Word2 ( case C )
OR
- The other string Word1 ( case D )
The drawback of this regex
(?s-i).*\K(?:Word1(?=.*Word2)|Word2(?=.*Word1))|.*\K(?:Word1|Word2)
is that you must repeat the two strings to look for three times, to get the overall regex to work ! By using some subroutine calls, we need to enter these two strings ONCE only, instead of three times !
In short, the syntax (?n) of a subroutine call represents the exact pattern of group n, which can be located before, or after, its reference (?n) . So :
- (?1) is equivalent to the group 1, (Word1)
- (?2) is equivalent to the group 2, (Word2)
Therefore, from the original regex form
(?s-i).*\K(?:Word1(?=.*Word2)|Word2(?=.*Word1))|.*\K(?:Word1|Word2)
we may change the regex into this final one, below, which is as correct as the other one, although a bit more difficult to understand :-(. However, you’ll need to write the two strings to search for ONCE only !
(?s-i).*\K(?:(Word1)(?=.*(?2))|(Word2)(?=.*(?1)))|.*\K(?:(?1)|(?2))
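To check the logic of this regex, here is a small Python sketch. It is an emulation under assumptions, not the N++ search itself : Python’s `re` module supports neither `\K` nor subroutine calls, so the hypothetical `last_two` helper runs the two alternatives as two separate patterns, with capturing groups instead of `\K`, on one string which stands for the found lines of one file.

```python
import re


def last_two(text, w1="Word1", w2="Word2"):
    """Emulate the case B) search on one string; return a list of (offset, word)."""
    a, b = re.escape(w1), re.escape(w2)
    # First alternative: last occurrence of one word which still has the
    # OTHER word somewhere after it (look-ahead, as in the N++ regex).
    paired = re.compile(rf"(?s).*({a}(?=.*{b})|{b}(?=.*{a}))")
    # Second alternative: last occurrence of either word.
    either = re.compile(rf"(?s).*({a}|{b})")

    results = []
    start = 0
    m = paired.match(text)
    if m:
        results.append((m.start(1), m.group(1)))
        start = m.end(1)  # the next search resumes right after this match
    m = either.match(text, start)
    if m:
        results.append((m.start(1), m.group(1)))
    return results


both = "Word1 a\nWord2 b\nWord1 c\nWord2 d\nWord2 e\n"
print(last_two(both))                    # prints [(16, 'Word1'), (32, 'Word2')]
print(last_two("Word2 x\nWord2 y\n"))    # prints [(8, 'Word2')]
```

With both strings present, two positions come out, one per string. With a single string present, the first pattern fails and only the last occurrence of that string is reported.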
Best Regards,
guy038
Additional info :
1) Concerning the new Find in this finder… option of the Find result panel :
- If you do NOT check the Search only on found lines option, the search is performed on all the contents of the different files listed in the Find result panel
- If you check the Search only on found lines option, the search is performed ONLY on the lines listed in the Find result panel
2) The main difference between a subroutine call and a back-reference is that:
- A back-reference \n refers to the present value of the group n
- A subroutine call (?n) refers to the present template ( the pattern ) of the group n
So, if we consider these four lines :
123 ABC 123
123 ABC 789
789 ABC 123
789 ABC 789
The regex (\d+) ABC \1 would match lines 1 and 4, whereas the regex (\d+) ABC (?1) would match the four lines
In other words, the regexes (\d+) ABC (?1) and (\d+) ABC (\d+) are strictly equivalent
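The difference is easy to check in Python. Its `re` module does not support the `(?1)` syntax, so this sketch assumes the spelled-out form `(\d+) ABC (\d+)` in its place :

```python
import re

lines = ["123 ABC 123", "123 ABC 789", "789 ABC 123", "789 ABC 789"]

# \1 must repeat the very text that group 1 has just matched.
backref = re.compile(r"(\d+) ABC \1")
# Spelled-out stand-in for (\d+) ABC (?1): the pattern is reused, not the value.
template = re.compile(r"(\d+) ABC (\d+)")

backref_hits = [i for i, s in enumerate(lines, 1) if backref.fullmatch(s)]
template_hits = [i for i, s in enumerate(lines, 1) if template.fullmatch(s)]

print(backref_hits, template_hits)  # prints [1, 4] [1, 2, 3, 4]
```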
3) Note that, when a subroutine call is used INSIDE the group to which it refers, it becomes a recursive sub-pattern reference :
- For instance, in the regex (\{[^\{\}]+\}(?1)?) , the group 1 is the overall regex, which does contain its reference (?1) . So (?1) is a recursive sub-pattern reference
- But, in the regex (\{[^\{\}]+\})(?1)? , the group 1 is (\{[^\{\}]+\}) and its reference (?1) , located outside the group 1, is just a subroutine call
The second regex looks for either a single string {....} surrounded by curly braces, or two consecutive strings {....}{....} , whereas the first, recursive one, can go on matching further consecutive {....} strings
You may test these regexes with the example text below, which contains a well-balanced amount of curly braces :
{This}{is}{a small}{text}{{in order{to {test}{this}}}{{regex}}}
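Python’s `re` module supports neither subroutine calls nor recursion, so only the second regex can be sketched there, by writing out the pattern of group 1 a second time. This is an assumed equivalent, not the N++ syntax :

```python
import re

text = "{This}{is}{a small}{text}{{in order{to {test}{this}}}{{regex}}}"

# Stand-in for (\{[^\{\}]+\})(?1)? : one {....} block, optionally followed
# by a second block built from the same pattern.
pair = re.compile(r"(\{[^{}]+\})(?:\{[^{}]+\})?")

matches = [m.group(0) for m in pair.finditer(text)]
print(matches)
# prints ['{This}{is}', '{a small}{text}', '{test}{this}', '{regex}']
```

Each match covers at most two consecutive blocks, and a block whose contents hold another brace is never matched as a whole.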
-