Notepad++ Regex to find files having more than 2 words but should exclude certain word
-
I have the files below:
File 1
hjsakh
hasj
ashjh word1 sadhjhf
asgdga
ashsad
sj word3a sdha
ajsh
ashjh
word2a
File 2
hjsakh
hasj
ashjh word1 sadhjhf
asgdga
ashsad
word2a
sdhfj
sj word3b sdha
ajsh
ashjh
File 3
hjsakh
hasj
ashjh word1 sadhjhf
asgdga
ashsad
word2b
sdh
sj sdha
ajsh
ashjh
File 4
hjsakh
hasj
asgdga
ashsad
word2c
sdh
sj sdha
ajsh
ashjh
ashjh word1 sadhjhf
File 5
hjsakh
hasj
ashjh word1 sadhjhf
asgdga
ashsad
word3a
sdh
sj sdha
ajsh
ashjh
File 6
hjsakh
hasj
word3b
ashjh word1 sadhjhf
asgdga
ashsad
sdh
sj sdha
ajsh
ashjh
Looking for a regex which can find files that satisfy these criteria: (contains word1) & (contains word2a or word2b or word2c) & (does not contain word3a and word3b). In the above example it should find only File 3 and File 4.
-
@Jignesh-Waghela Regular expressions tend to be read from left to right and are intended to match patterns that are scanned left to right. As you want to match things where the words appear in any order, I would do this in several passes rather than constructing a rather long regexp that has all the possible orders of words and their conditions.
My passes were:
- Scan for files containing word1 and make a list of these. It turns out all of your files contain word1.
- Scan the list of files from step 1 for `(word2a|word2b|word2c)`. We are down to a second list that has files 1, 2, 3, and 4.
- It turns out that the not scanner, `(?!word3a|word3b)`, only works on lines, not files, even if you try `(?-s)(?!word3a|word3b)`. Thus I would scan list 2 for `(word3a|word3b)` and exclude those. This excludes files 1, 2, and also 5 and 6 but those are not in list 2. The remaining files are 3 and 4.
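
For what it's worth, the three passes above can be sketched outside Notepad++ as a small Python script. This is only an illustration under assumptions: the `FILES` dictionary and the `multi_pass` helper are hypothetical names, the file contents are abbreviated copies of the six samples from the question, and Python's `re` module stands in for Notepad++'s search.

```python
import re

# Abbreviated, hypothetical stand-ins for the six files in the question;
# some filler lines are dropped, the keyword lines are unchanged.
FILES = {
    "File 1": "hjsakh\nashjh word1 sadhjhf\nsj word3a sdha\nashjh\nword2a",
    "File 2": "hjsakh\nashjh word1 sadhjhf\nword2a\nsj word3b sdha\nashjh",
    "File 3": "hjsakh\nashjh word1 sadhjhf\nword2b\nsj sdha\nashjh",
    "File 4": "hjsakh\nword2c\nsj sdha\nashjh\nashjh word1 sadhjhf",
    "File 5": "hjsakh\nashjh word1 sadhjhf\nword3a\nsj sdha\nashjh",
    "File 6": "hjsakh\nword3b\nashjh word1 sadhjhf\nsj sdha\nashjh",
}


def multi_pass(files):
    """Return the file names that survive each of the three passes."""
    # Pass 1: keep files that contain word1.
    pass1 = [n for n, text in files.items() if re.search(r"word1", text)]
    # Pass 2: of those, keep files containing word2a, word2b or word2c.
    pass2 = [n for n in pass1 if re.search(r"(word2a|word2b|word2c)", files[n])]
    # Pass 3: of those, drop files containing word3a or word3b.
    pass3 = [n for n in pass2 if not re.search(r"(word3a|word3b)", files[n])]
    return pass1, pass2, pass3


list1, list2, list3 = multi_pass(FILES)
print(list1)
print(list2)
print(list3)
```

Each pass only narrows the previous list, so the order in which the words appear inside a file never matters.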
-
@Jignesh-Waghela said in Notepad++ Regex to find files having more than 2 words but should exclude certain word:
looking for regex which can find files which satisfy this criteria (contains word1) & (contains word2a or word2b or word2c) & (does not contain word3a and word3b). In above example it should find only File 3 and File 4
Try:
(?s)\A(?=.*(word2a|word2b|word2c))(?!.*(word3a|word3b)).*word1
Edit to add:
(?s)\A(?=.*?\b(word2a|word2b|word2c)\b)(?!.*?\b(word3a|word3b)\b).*?\bword1\b
is probably better. The `\b` assertions require word boundaries, so that `reword2a` or `word13` won’t count. Changing `.*` to `.*?` won’t change the results, but it might make the expressions more efficient and less likely to run into “expression too complex” failures.
-
@mkupper said in Notepad++ Regex to find files having more than 2 words but should exclude certain word:
It turns out that the not scanner, `(?!word3a|word3b)` only works on lines, not files, even if you try `(?-s)(?!word3a|word3b)`.
You’re misunderstanding what is going wrong. The expression you gave says, “Match null here if the following characters aren’t either `word3a` or `word3b`.” That matches at every position in a file except immediately before `word3a` or `word3b`. (Each line is listed once in the search results, but look at the number of hits.) What is needed is `(?s)\A(?!.*(word3a|word3b))`, which says, “Match null at the beginning of the document if the following characters don’t match any number of arbitrary characters followed by either `word3a` or `word3b`.”
When I tested the whole expression I got some erratic results when I had it match null, which is why I set it to match the string from the beginning up to the word `word1`.
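
The difference between the unanchored and the anchored lookahead can be checked with a rough Python sketch. Some assumptions: Python's `re` module is not the Boost engine that Notepad++ uses, although it accepts these particular expressions as written. The `FILES` dictionary is a hypothetical, abbreviated copy of the six sample files, and the variable names are made up for the example.

```python
import re

# Abbreviated, hypothetical stand-ins for the six files in the question.
FILES = {
    "File 1": "hjsakh\nashjh word1 sadhjhf\nsj word3a sdha\nashjh\nword2a",
    "File 2": "hjsakh\nashjh word1 sadhjhf\nword2a\nsj word3b sdha\nashjh",
    "File 3": "hjsakh\nashjh word1 sadhjhf\nword2b\nsj sdha\nashjh",
    "File 4": "hjsakh\nword2c\nsj sdha\nashjh\nashjh word1 sadhjhf",
    "File 5": "hjsakh\nashjh word1 sadhjhf\nword3a\nsj sdha\nashjh",
    "File 6": "hjsakh\nword3b\nashjh word1 sadhjhf\nsj sdha\nashjh",
}

# Unanchored: an empty match succeeds at every position that is not
# immediately followed by word3a or word3b.
unanchored = re.compile(r"(?!word3a|word3b)")
# Anchored: succeeds only at the start of the text, and only if neither
# word occurs anywhere after it.
anchored = re.compile(r"(?s)\A(?!.*(word3a|word3b))")
# The whole expression from the earlier post.
full = re.compile(
    r"(?s)\A(?=.*?\b(word2a|word2b|word2c)\b)(?!.*?\b(word3a|word3b)\b).*?\bword1\b"
)

sample = FILES["File 1"]  # contains word3a exactly once
unanchored_hits = len(unanchored.findall(sample))
print(unanchored_hits, "empty matches in a text of", len(sample), "characters")

# The anchored lookahead rejects File 1 and accepts File 3.
print(anchored.search(FILES["File 1"]) is None)
print(anchored.search(FILES["File 3"]) is not None)

matching = [name for name, text in FILES.items() if full.search(text)]
print(matching)
```

The unanchored pattern still "hits" in File 1 at every position except the one right before `word3a`, which is why it cannot be used to exclude a file, while the anchored forms give a single yes or no answer per file.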