Regex: Find Pages with One String but Not Another
-
Hello, @dick-adams-0, @coises, @mark-olson, @alan-kilborn and All,
Ah yes, indeed, @mark-olson, your solution works nicely.
But I can think of two other constructs, which are similar :
- SEARCH (?s-i)\A(?!.*audio)(?=.*music)
- SEARCH (?s-i)\A(?=.*music)(?!.*audio)
The magic of your solution, and of my versions too, is that the regex scans from the very beginning to the very end of the current file to verify the two assertions, simultaneously !
So, with my search versions :
- If the current file contains at least one word music and NO word audio anywhere, an empty string is detected (\A), with the yellow calltip ^ zero length match, which implies a TRUE match
- If the current file contains the word audio somewhere, whether the word music exists or NOT, nothing is detected, which implies NO match
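The two cases above can be checked outside Notepad++ with a minimal Python sketch. It assumes Python's standard `re` module, which accepts these lookaheads but rejects a global `(?-i)`, so only `(?s)` is kept; the helper name and the sample strings are invented for illustration.

```python
import re

# Python's `re` rejects a global (?-i), so only (?s) is kept here.
# Matching is case-sensitive by default, which is what (?-i) requests.
WANTED_ONLY = re.compile(r"(?s)\A(?!.*audio)(?=.*music)")

def has_music_without_audio(text: str) -> bool:
    """True when `text` contains "music" and no "audio" anywhere."""
    return WANTED_ONLY.search(text) is not None

# Hypothetical file contents, one per case discussed above.
samples = {
    "music only": "<p>some music here</p>",
    "both": "<audio src='a.mp3'></audio> music",
    "neither": "<p>nothing relevant</p>",
}
results = {name: has_music_without_audio(text) for name, text in samples.items()}
for name, verdict in results.items():
    print(f"{name}: {verdict}")
```

Only the "music only" sample should report a match, as a zero-length hit at the very start of the text.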
However, the @coises’s solution, which can also be expressed as
(?-i)audio(*COMMIT)(*FAIL)|(?s)music(?!.*audio)
is really clever and even better, as :
- It succeeds in getting the different matches of the music word, when NO word audio exists in the current file
- It does NOT match anything, as soon as one or several audio word(s) exist, anywhere, in the current file
Best Regards
guy038
To see the beauty of the @coises’s solution, refer to :
https://www.rexegg.com/backtracking-control-verbs.html#failthematch
-
-
@Coises said in Regex: Find Pages with One String but Not Another:
The proposed expression matches the third line of that file.
Maybe being more greedy helps this solution variant:
(?s-i)((audio.*music|music.*audio)(*SKIP)(*F))|music
But truly, I suppose your expression is better.
-
Hello, @dick-adams-0, @coises, @mark-olson, @alan-kilborn and All,
I’ve found a new solution to the problem, derived from the @coises’s one :
SEARCH / MARK (?s-i)(?=.*audio)(*COMMIT)(*FAIL)|music
It could be associated with the generic expression :
(?s-i)(?=.*What we do NOT want in current file)(*COMMIT)(*FAIL)|What we DO want in current file
VERY IMPORTANT :
- For this regex and the regexes in my previous post, you must tick the Wrap around option and, if you use the MARK dialog, don’t forget to check the Purge for each search option for correct tests !
BR
guy038
-
(?s-i)(?=.*audio)(*COMMIT)(*FAIL)|music
Even with Wrap around checkmarked, it can work incorrectly. Consider the text
audio music foo
with the caret somewhere after the a but before the m. Press Find Next. music is matched, even though it shouldn’t be, because audio is present.
This is because a “wrapped” Find Next will perform TWO internal searches if the caret is anywhere but the first position of the file. In the incorrect case cited, the FIRST [internal] search sees music but doesn’t see audio (it wouldn’t until the SECOND [internal] search), thus the hit. Note: This two-internal-search thing has been discussed previously on this forum.
So if one is going to use the idiom, don’t use it with Find Next or Replace, even with Wrap around checkmarked. All other types of searches (either file-level where Wrap around doesn’t matter, e.g. Find in Files, or file-level where it does matter, e.g. Mark, Replace All) should be OK with this.
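The caret effect described above can be illustrated with a small Python model. This is only a sketch under stated assumptions: Python's `re` has no backtracking control verbs, so the hypothetical `find_next` helper emulates the verb regex by hand, and it models just the first internal search (from the caret to the end of the text), not Notepad++'s actual implementation.

```python
import re

def find_next(text: str, caret: int = 0):
    """Emulate (?s-i)(?=.*audio)(*COMMIT)(*FAIL)|music for a search that
    starts at offset `caret` and runs to the end of `text`.

    Returns the offset of the match, or None when nothing matches.
    """
    # The lookahead only sees text from the starting position onward.
    # If "audio" is there, (*COMMIT)(*FAIL) aborts the whole attempt.
    if "audio" in text[caret:]:
        return None
    # Otherwise the first alternative fails at every position and the
    # second alternative, "music", is tried.
    hit = re.compile("music").search(text, caret)
    return hit.start() if hit else None

text = "audio music foo"
from_top = find_next(text, 0)    # search starting at the first position
from_caret = find_next(text, 2)  # caret after the "a", before the "m"
print(from_top, from_caret)
```

In this model the search from the top finds nothing, while the search from the caret reports music, because audio starts before the caret and is never seen.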
-
Thanks to all who took the time to look at this question. I was unfamiliar with backtracking control verbs, so I got an education reading through the answers.
I'm not sure how caret positions would affect the outcome. My use case is a batch search of multiple files, where the entire file is being searched.
That said, the backtracking control verb solution(s) saved me weeks (perhaps months!) of work, as I needed to search over 15,000 HTML files to find those on specific topics where I had not yet added a music player (the <audio> tag).
You guys are lifesavers—Thanks for the assist!
-
@Dick-Adams-0 said in Regex: Find Pages with One String but Not Another:
My use case is a batch search of multiple files, where the entire file is being searched.
Right, but the evolving discussion went in the direction of a general technique, not your specific need.
-
Hello, @dick-adams-0, @coises, @mark-olson, @alan-kilborn and All,
Alan, you’re right about this specific case. So, we would need an Always from beginning option too … Well, I think it’s more sensible to tell people that this kind of regex (?s-i)(?=.*audio)(*COMMIT)(*FAIL)|music does NOT work properly, when just using the Find Next and/or Replace button !
Now, see the power of the @coises’s regex :
- Put the following text in a new tab :
    Jon
    Susan
    Helen
    Nicole
    Andrew
    Alice
    Petr
    Mike
    Mary
    Margaret
    ob
- Open the Mark dialog (Ctrl + M)
- MARK (?s-i)(?=.*(?:Bob|Peter|John))(*COMMIT)(*FAIL)|Mary|Helen|Alice
- Check only the Purge for each search and Wrap around options
- Select the Regular expression search mode
- Click on the Mark All button
=> As the forbidden masculine surnames Bob, Peter and John are misspelled, all the searched feminine surnames are correctly highlighted
Now, as soon as you modify this original text into one of the forms below (one form per column) :
    John      Jon       Jon       John      John      Jon       John      Jon
    Susan     Susan     Susan     Susan     Susan     Susan     Susan     Susan
    Helen     Helen     Helen     Helen     Helen     Helen     Helen     Heen
    Nicole    Nicole    Nicole    Nicole    Nicole    Nicole    Nicole    Nicole
    Andrew    Andrew    Andrew    Andrew    Andrew    Andrew    Andrew    Andrew
    Alice     Alice     Alice     Alice     Alice     Alice     Alice     Alie
    Petr      Peter     Petr      Peter     Petr      Peter     Peter     Petr
    Mike      Mike      Mike      Mike      Mike      Mike      Mike      Mike
    Mary      Mary      Mary      Mary      Mary      Mary      Mary      ary
    Margaret  Margaret  Margaret  Margaret  Margaret  Margaret  Margaret  Margaret
    ob        ob        Bob       ob        Bob       Bob       Bob       ob
then a hit on the Mark All button will NOT mark any text, as expected, because there are always one, two or three forbidden masculine surnames in the list
In the rightmost case too, although the forbidden masculine surnames are misspelled, there is NO match either, simply because all the searched feminine surnames are misspelled too !
Really awesome !
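For readers who want to replay this test without Notepad++, here is a minimal Python sketch of the whole-file behaviour (the equivalent of Mark All with Wrap around checked). It is an emulation, not the regex itself: Python's `re` has no `(*COMMIT)` or `(*FAIL)`, so the hypothetical `mark_all` helper first checks for a forbidden name and only then collects the allowed ones.

```python
import re

FORBIDDEN = re.compile(r"Bob|Peter|John")
ALLOWED = re.compile(r"Mary|Helen|Alice")

def mark_all(text: str) -> list[str]:
    """Whole-file emulation: nothing is marked once a forbidden name occurs."""
    if FORBIDDEN.search(text):
        return []
    return ALLOWED.findall(text)

# Three of the forms discussed above, written on one line for brevity.
forms = {
    "original": "Jon Susan Helen Nicole Andrew Alice Petr Mike Mary Margaret ob",
    "one forbidden": "John Susan Helen Nicole Andrew Alice Petr Mike Mary Margaret ob",
    "all misspelled": "Jon Susan Heen Nicole Andrew Alie Petr Mike ary Margaret ob",
}
# Put one name per line, as in the Notepad++ tab, before searching.
marked = {label: mark_all("\n".join(names.split())) for label, names in forms.items()}
for label, names in marked.items():
    print(label, names)
```

Only the original form should yield marks (Helen, Alice and Mary); the other two yield none, for the two different reasons given above.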
Best Regards,
guy038
-
@guy038 said in Regex: Find Pages with One String but Not Another:
I think it’s more sensible to tell people that this kind of regex (?s-i)(?=.*audio)(*COMMIT)(*FAIL)|music does NOT work properly, when just using the Find Next and/or Replace button !
It’s sensible, but it is another detail to remember. And if you don’t remember it, you may get a wrong result (but move on to your next action thinking it is correct).
Your example with names is fine, but I don’t think it adds anything new to the technique.
BTW, I think instead of surnames, you should have said first names.
-
Hello, @dick-adams-0, @coises, @mark-olson, @alan-kilborn and All,
Alan, regarding my previous post, it obviously does not add anything new, but I wanted to show an example with SEVERAL allowed and forbidden first names
Sorry for my spelling error, but I’m a bit lost with the complexity of the American/English language, regarding the way of describing personal names !
So, although that is quite off-topic, just one example :
For instance, with the personal name Daniel James Sullivan and the names DJ and Dan, which of the words below would you use, in common language, to qualify the different words Daniel, James, Sullivan, DJ and Dan ?
- First name
- Forename
- Given name
- Proper name
- Baptismal name
- Middle name
- Second name
- Last name
- Surname
- Family name
- Nickname
Best Regards
guy038
-
Ha, well I’m no expert on it, but I’ll give you one American’s opinion:
- Daniel: first name, given name, proper name (probably), baptismal name (probably), forename (never heard that one before but possibly)
- James: middle name, second name (probably)
- Sullivan: last name, surname, family name
- DJ and Dan: nickname
- the whole thing taken together, Daniel James Sullivan: proper name
Hopefully this helps you in some way, for your Notepad++ regex work!
-
-
@guy038
I believe that I’ve come up with a more performant solution than your most recent suggestion.
Your most recent suggestion performs well if the forbidden words are present, but exhibits catastrophic backtracking if there are no forbidden words. You can test this by running it on a file with several thousand lines, and seeing what happens with and without forbidden words.
Here’s an improved version that is guaranteed to operate in linear time while also finding every match, based on your generic find-a-regex-between-two-regex-matches formula. As an added bonus, it doesn’t use backtracking verbs, which makes it portable to regex engines that don’t support backtracking verbs.
(?xs)(?:\A (?!.*(?:FORBIDDEN)) | (?!\A)\G ) .*?\K DESIRED
Plugging in Bob|Peter|John for FORBIDDEN, and (?:Mary|Helen|Alice) for DESIRED, we get:
(?s-i)(?:\A(?!.*(?:Bob|Peter|John))|(?!\A)\G).*?\K(?:Mary|Helen|Alice)
It works as follows:
- The BSR (begin search region) of this regex is just the start of the file, \A, followed by negative lookahead for the forbidden words (Bob, Peter, and John).
- There is no ESR (we want to search the entire file), so following the usual (?:BSR|(?!\A)\G), we just have .*?\K.
- The things we want to find come after the \K as usual.
I tested this on a 25 thousand line file, and verified that it quickly matches every line if no forbidden words are present, and quickly fails if a forbidden word is present.
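The two phases of this regex (one forbidden-word check at the start of the file, then a forward walk over the desired matches) can be sketched in Python. This is an emulation under stated assumptions: Python's standard `re` module supports neither `\G` nor `\K`, so the hypothetical `find_desired` helper performs the two phases in code rather than in a single pattern, and the sample text is synthetic.

```python
import re
from typing import Iterator

def find_desired(text: str, forbidden: str, desired: str) -> Iterator[re.Match]:
    """Yield every `desired` match, but only if `forbidden` never occurs."""
    # Phase 1: counterpart of \A(?!.*(?:FORBIDDEN)), evaluated once.
    if re.search(forbidden, text):
        return
    # Phase 2: counterpart of (?!\A)\G.*?\K DESIRED, walking forward.
    yield from re.finditer(desired, text)

FORBIDDEN = "Bob|Peter|John"
DESIRED = "Mary|Helen|Alice"

# Synthetic 10,000-line text, with and without one forbidden name.
clean = "Mary\nMike\n" * 5000
tainted = clean + "Peter\n"

clean_hits = sum(1 for _ in find_desired(clean, FORBIDDEN, DESIRED))
tainted_hits = sum(1 for _ in find_desired(tainted, FORBIDDEN, DESIRED))
print(clean_hits, tainted_hits)
```

Each phase is a single left-to-right pass over the text, which is the property the regex above aims for.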
-
Hi, @mark-olson and All,
Sorry for the delay. Over the last two days, I’ve been getting some fresh air on the ski slopes at Chamrousse, 35 minutes from Grenoble ! Of course, it was a bit crowded on Sunday, but yesterday, Monday, my friend Philippe and I had a great time ;-))
Let’s go back to our regex problems !
Your new solution works well, but ONLY IF the Wrap around option is checked before running this regex !
From your proposition, below :
(?s-i)(?:\A(?!.*(?:Bob|Peter|John))|(?!\A)\G).*?\K(?:Mary|Helen|Alice)
Let’s simplify this regex with just 1 forbidden first name, Peter, and 1 allowed first name, Alice, giving the similar regex :
(?s-i)(?:\A(?!.*Peter)|(?!\A)\G).*?\KAlice
Now, given the INPUT text below, pasted in a new tab :
    Susan
    Helen
    Nicole
    Andrew
    Alice

    Mike
    Mary
    Margaret
Let’s suppose that we use the Mark dialog with both the Purge for each search and Wrap around options checked
- After running this simplified search regex, we get, as expected, the first name Alice marked, because no forbidden masculine first name exists in this text
Why :
- First, the regex tries to match the part \A(?!.*Peter). As no forbidden first name exists, this part is true. Then, the regex tries to find a match of the part .*?\KAlice and, of course, we do get the Alice first name marked
- Now, let’s replace, in our text, the empty line between Alice and Mike by the forbidden first name Peter
- If we re-run the regex, we do get the expected 0 match in entire file result
Why :
- This time, from the beginning of the file, the regex “sees” the first name Peter, on the sixth line. So this regex part is false.
- Thus, it tries the second alternative (?!\A)\G, which is also false, because we are still at the very beginning of the file
- So, we immediately get the message Mark: 0 match in entire file
- Now, uncheck the Wrap around option
- Move to the very beginning of the new tab (Ctrl + Home)
- Running the regex again, you still get the correct result Mark : 0 matches from caret to end-of-file
- Finally, move the caret right before the word Helen (so, on the second line of the current file)
- Re-run the regex => the first name Alice is now marked, although the forbidden first name Peter exists in the current file
Why :
- Well, the search now starts at the caret, where \A cannot match. So the first regex part is false, whether or not the first name Peter is present on the sixth line
- Then, it tries the second alternative (?!\A)\G, which is, this time, true, because we are not at the very beginning of the file (we are on the second line)
- Thus, it tries the remaining part .*?\KAlice and we wrongly get the Alice first name marked !
Note that a similar issue appears, too, with my previous regex :
- Let’s start with our INPUT text, adding the forbidden first name Peter :
    Susan
    Helen
    Nicole
    Andrew
    Alice
    Peter
    Mike
    Mary
    Margaret
- We put the caret at the beginning of the Mike line (7th line)
- We use the Mark dialog, with the Purge for each search option checked BUT the Wrap around option UN-checked
- And the regex (?s-i)(?=.*(?:Bob|Peter|John))(*COMMIT)(*FAIL)|Mary|Helen|Alice
=> The Mary first name is marked, although the forbidden first name Peter is present :-((
Conclusion :
Whatever the regex used, in this specific case, we always need to check the Wrap around option to get the expected results
Best Regards,
guy038
-
@guy038
Looks good! I’d amend it to (?s-i)\A(?=.*(?:Bob|Peter|John))(*COMMIT)(*FAIL)|Mary|Helen|Alice, as this ensures that the check for the forbidden names is only done once at the beginning of the file, and thereby avoids the issue of bad performance on very large files.