    Regex: Find Pages with One String but Not Another

    regex
    18 Posts 5 Posters 2.5k Views
    • Dick Adams 0

      I’m trying to write a regex to find files that have one string but not another. In the example below, I’m looking for the word “music” without the word “audio” anywhere before or after it (hence the negative lookbehind & lookahead):

      (?s-i)(?<!audio.*)music(?!.*audio)
      

      NPP says this is an invalid regex, but I don’t understand why. Can anyone shed light on this?

      • Coises @Dick Adams 0

        @Dick-Adams-0 said in Regex: Find Pages with One String but Not Another:

        NPP says this is an invalid regex, but I don’t understand why.

        Lookbehinds must be fixed length.

        I think this:

        (?s-i)audio(*COMMIT)(*FAIL)|music(?!.*audio)

        will do what you want.
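On the fixed-length point above: the restriction is not unique to Notepad++. As a small illustration, assuming Python's standard `re` module as a stand-in (it is a different engine from the one Notepad++ uses, and the `(?s-i)` prefix is reduced to `(?s)` because Python handles inline flags differently), the original pattern is rejected at compile time, while a fixed-length lookbehind is accepted:

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Variable-length lookbehind (the ".*" inside it has no fixed width).
variable_lookbehind = r"(?s)(?<!audio.*)music(?!.*audio)"

# Fixed-length lookbehind compiles, but it only inspects the 5 characters
# immediately before "music", so it does not answer the original question.
fixed_lookbehind = r"(?s)(?<!audio)music(?!.*audio)"

print(compiles(variable_lookbehind))  # False
print(compiles(fixed_lookbehind))     # True
```

The lookahead half of the original pattern is fine in both engines; only the lookbehind needs a different approach.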

        • Mark Olson

          Why not just (?s-i)\A(?!.*audio).*music
          I’m not at my computer, so I can’t test it, but it should work.

          • Alan Kilborn

            Maybe a good application of the “what I don’t want skip fail OR what I do want” idiom found HERE:

            (?s-i)((audio.*?music|music.*?audio)(*SKIP)(*F))|music

            • Coises @Alan Kilborn

              @Alan-Kilborn said in Regex: Find Pages with One String but Not Another:

              Maybe a good application of the “what I don’t want skip fail OR what I do want” idiom found HERE:

              (?s-i)((audio.*?music|music.*?audio)(*SKIP)(*F))|music

              Consider this file:

              audio
              music
              music
              

              The proposed expression matches the third line of that file.

              • guy038

                Hello, @dick-adams-0, @coises, @mark-olson, @alan-kilborn and All,

                Ah yes, indeed, @mark-olson, your solution works nicely

                But I can think of two other, similar constructs:

                • SEARCH (?s-i)\A(?!.*audio)(?=.*music)

                • SEARCH (?s-i)\A(?=.*music)(?!.*audio)


                The magic of your solution, and of my versions too, is that the regex scans from the very beginning to the very end of the current file to verify both assertions at once!

                So, with my search versions:

                • If the current file contains at least one occurrence of music and NO occurrence of audio anywhere, an empty string is matched at \A, flagged by the yellow calltip ^ zero length match, which means a TRUE match

                • If the current file contains audio somewhere, whether or not music is present, nothing is matched, which means NO match
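These two lookahead constructs use no engine-specific features, so the same whole-file test can be tried outside Notepad++. A minimal sketch in Python's `re` module follows (an assumption for illustration; the function name `wanted_without_forbidden` is made up, and Python is case-sensitive by default, which corresponds to the `-i` in the thread's patterns):

```python
import re

# (?s) lets "." cross line breaks; \A pins the single attempt to the start.
PATTERN = re.compile(r"(?s)\A(?!.*audio)(?=.*music)")

def wanted_without_forbidden(text: str) -> bool:
    """True if text contains "music" and no "audio" anywhere."""
    return PATTERN.match(text) is not None

print(wanted_without_forbidden("intro\nmusic here\n"))             # True
print(wanted_without_forbidden("music here\n<audio src='x'>\n"))   # False
print(wanted_without_forbidden("nothing relevant\n"))              # False
```

The third call covers a case the list above leaves implicit: a file with neither word also yields no match, because the positive lookahead for music fails.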


                However, @coises’s solution, which can also be written (?-i)audio(*COMMIT)(*FAIL)|(?s)music(?!.*audio), is really clever and even better, because:

                • It finds every occurrence of music when audio does NOT occur in the current file

                • It does NOT match anything as soon as audio occurs, once or several times, anywhere in the current file

                Best Regards

                guy038

                To see the beauty of @coises’s solution, refer to:

                https://www.rexegg.com/backtracking-control-verbs.html#failthematch

                • Alan Kilborn @Coises

                  @Coises said in Regex: Find Pages with One String but Not Another:

                  The proposed expression matches the third line of that file.

                  Maybe being more greedy helps this solution variant:

                  (?s-i)((audio.*music|music.*audio)(*SKIP)(*F))|music

                  But truly, I suppose your expression is better.
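To compare the lazy and greedy variants outside Notepad++, here is a rough emulation in Python. Python's `re` has no `(*SKIP)(*F)`, so this sketch uses the common capture-group workaround instead: the unwanted alternative is matched and consumed without being captured, and only hits in group 1 are kept. It approximates the idiom for these particular patterns and is not the Notepad++ engine itself; the helper name `music_hits` is invented for the example.

```python
import re

def music_hits(text: str, unwanted: str) -> list[int]:
    """Offsets of "music" that remain after the unwanted spans are consumed."""
    pattern = re.compile(rf"(?s)(?:{unwanted})|(music)")
    return [m.start(1) for m in pattern.finditer(text) if m.group(1) is not None]

lazy = r"audio.*?music|music.*?audio"
greedy = r"audio.*music|music.*audio"

# Coises's counterexample: audio, then music twice.
text = "audio\nmusic\nmusic\n"
print(music_hits(text, lazy))    # [12]  (third line is wrongly reported)
print(music_hits(text, greedy))  # []

# A different ordering: music, audio, music.
text2 = "music\naudio\nmusic\n"
print(music_hits(text2, greedy))  # [12]  (still a false hit)
```

In this emulation the greedy variant handles the first file but still reports the trailing music in the second one, which is consistent with preferring the (*COMMIT)(*FAIL) form.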

                  • guy038

                    Hello, @dick-adams-0, @coises, @mark-olson, @alan-kilborn and All,

                    I’ve found a new solution to the problem, derived from @coises’s:

                    SEARCH / MARK (?s-i)(?=.*audio)(*COMMIT)(*FAIL)|music

                    It corresponds to the generic expression:

                    (?s-i)(?=.*What we do NOT want in current file)(*COMMIT)(*FAIL)|What we DO want in current file


                    VERY IMPORTANT :

                    • For this regex and the regexes in my previous post, you must tick the Wrap around option. If you use the Mark dialog, also check the Purge for each search option to get correct test results!

                    BR

                    guy038

                    • Alan Kilborn @guy038

                      @guy038

                      (?s-i)(?=.*audio)(*COMMIT)(*FAIL)|music

                      Even with Wrap around checkmarked, it can work incorrectly. Consider the text audio music foo with the caret somewhere after the a but before the m. Press Find Next. music is matched, even though it shouldn’t be, because audio is present.

                      This is because a “wrapped” Find Next will perform TWO internal searches if the caret is anywhere but the first position of the file. In the incorrect case cited, the FIRST [internal] search sees music but doesn’t see audio (it wouldn’t until the SECOND [internal] search), thus the hit. Note: This two-internal-search thing has been discussed previously on this forum.

                      So if one is going to use the idiom, don’t use it with Find Next or Replace, even with Wrap around checkmarked. All other types of searches (either file-level where Wrap around doesn’t matter, e.g. Find in Files, or file-level where it does matter, e.g. Mark, Replace All) should be OK with this.
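The two-internal-search behaviour can be modelled in a few lines of Python. This is a simplified sketch, not Notepad++ code: it assumes that a wrapped Find Next first searches from the caret to the end of the file and then from the start of the file to the caret, and that the lookahead cannot see outside the range being searched. Under those assumptions, the regex fails for a range whenever audio is visible in it, and otherwise hits the first music. The function names are hypothetical.

```python
from typing import Optional

def search_range(text: str, start: int, end: int) -> Optional[int]:
    """Model of (?s-i)(?=.*audio)(*COMMIT)(*FAIL)|music applied to text[start:end].

    If "audio" occurs anywhere in the range, the whole search fails;
    otherwise the offset of the first "music" in the range is returned.
    """
    segment = text[start:end]
    if "audio" in segment:
        return None
    pos = segment.find("music")
    return None if pos < 0 else start + pos

def find_next_wrapped(text: str, caret: int) -> Optional[int]:
    """Assumed wrap-around behaviour: caret..end first, then start..caret."""
    hit = search_range(text, caret, len(text))
    if hit is None:
        hit = search_range(text, 0, caret)
    return hit

text = "audio music foo"
print(find_next_wrapped(text, 0))  # None: audio is visible from the file start
print(find_next_wrapped(text, 2))  # 6: the first search no longer sees "audio"
```

With the caret inside audio, the first internal search only sees "dio music foo", so music is reported even though the file contains audio.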

                      • Dick Adams 0

                        Thanks to all who took the time to look at this question. I was unfamiliar with backtracking control verbs, so I got an education reading through the answers.

                        Not sure how caret positions would affect the outcome. My use case is a batch search of multiple files, where the entire file is being searched.

                        That said, the backtracking control verb solution(s) saved me weeks (perhaps months!) of work, as I needed to search over 15,000 HTML files to find those on specific topics where I had not yet added a music player (the <audio> tag).

                        You guys are lifesavers—Thanks for the assist!

                        • Alan Kilborn @Dick Adams 0

                          @Dick-Adams-0 said in Regex: Find Pages with One String but Not Another:

                          My use case is a batch search of multiple files, where the entire file is being searched.

                          Right, but the evolving discussion went in the direction of a general technique, not your specific need.

                          • guy038

                            Hello, @dick-adams-0, @coises, @mark-olson, @alan-kilborn and All,

                            Alan, you’re right about this specific case. So, we would need an Always from beginning option too … Well, I think it’s more sensible to tell people that this kind of regex (?s-i)(?=.*audio)(*COMMIT)(*FAIL)|music does NOT work properly, when just using the Find Next and/or Replace button !


                            Now, see the power of the @coises’s regex :

                            • Put the following text in a new tab
                            Jon
                            Susan
                            Helen
                            Nicole
                            Andrew
                            Alice
                            Petr
                            Mike
                            Mary
                            Margaret
                            ob
                            
                            • Open the Mark dialog ( Ctrl + M )

                            • MARK (?s-i)(?=.*(?:Bob|Peter|John))(*COMMIT)(*FAIL)|Mary|Helen|Alice

                            • Check only the Purge for each search and Wrap around options

                            • Select the Regular expression search mode

                            • Click on the Mark All button

                            => As the forbidden masculine surnames Bob, Peter and John are all misspelled, all the searched feminine surnames are correctly highlighted

                            Now, as soon as you modify this original text into one of the forms, below :

                            John             Jon              Jon              John            John             Jon              John             Jon
                            Susan            Susan            Susan            Susan           Susan            Susan            Susan            Susan
                            Helen            Helen            Helen            Helen           Helen            Helen            Helen            Heen
                            Nicole           Nicole           Nicole           Nicole          Nicole           Nicole           Nicole           Nicole
                            Andrew           Andrew           Andrew           Andrew          Andrew           Andrew           Andrew           Andrew
                            Alice            Alice            Alice            Alice           Alice            Alice            Alice            Alie
                            Petr             Peter            Petr             Peter           Petr             Peter            Peter            Petr
                            Mike             Mike             Mike             Mike            Mike             Mike             Mike             Mike
                            Mary             Mary             Mary             Mary            Mary             Mary             Mary             ary
                            Margaret         Margaret         Margaret         Margaret        Margaret         Margaret         Margaret         Margaret
                            ob               ob               Bob              ob              Bob              Bob              Bob              ob
                            

                            Then, in the first seven cases, a click on the Mark All button will NOT mark any text, as expected, because one, two or three forbidden masculine surnames are present in the list

                            In the rightmost case, although the forbidden masculine surnames are all misspelled, there is NO match either, simply because all the searched feminine surnames are misspelled too!

                            Really awesome !

                            Best Regards,

                            guy038

                            • Alan Kilborn @guy038

                              @guy038 said in Regex: Find Pages with One String but Not Another:

                              I think it’s more sensible to tell people that this kind of regex (?s-i)(?=.*audio)(*COMMIT)(*FAIL)|music does NOT work properly, when just using the Find Next and/or Replace button !

                              It’s sensible, but it is another detail to remember. And if you don’t remember it, you may get a wrong result (but move on to your next action thinking it is correct).


                              Your example with names is fine, but I don’t think it adds anything new to the technique.

                              BTW, I think instead of surnames, you should have said first names.

                              • guy038

                                Hello, @dick-adams-0, @coises, @mark-olson, @alan-kilborn and All,

                                Alan, regarding my previous post, it, obviously, does not add anything new, but I wanted to show an example with SEVERAL allowed and forbidden first names


                                Sorry, for my spelling error, but I’m a bit lost with the complexity of American/English languages, regarding the way of describing personal names !

                                So, although this is quite off-topic, just one example:

                                For instance, given the personal name Daniel James Sullivan and the names DJ and Dan, which of the terms below would you use, in everyday language, to describe the words Daniel, James, Sullivan, DJ and Dan?

                                First name     Forename      Given name       Proper name      Baptismal name
                                
                                Middle name
                                
                                Second name    Last name     Surname          Family name
                                
                                Nickname
                                

                                Best Regards

                                guy038

                                • Alan Kilborn @guy038

                                  @guy038

                                  Ha, well I’m no expert on it, but I’ll give you one American’s opinion:

                                  • Daniel: first name, given name, proper name (probably), baptismal name (probably), forename (never heard that one before but possibly)

                                  • James: middle name, second name (probably)

                                  • Sullivan: last name, surname, family name

                                  • DJ and Dan: nickname

                                  • the whole thing taken together Daniel James Sullivan: proper name

                                  Hopefully this helps you in some way, for your Notepad++ regex work!

                                  • Mark Olson

                                    @guy038
                                    I believe that I’ve come up with a more performant solution than your most recent suggestion.
                                    Your most recent suggestion performs well if the forbidden words are present, but exhibits catastrophic backtracking if there are no forbidden words. You can test this by running it on a file with several thousand lines, and seeing what happens with and without forbidden words.

                                    Here’s an improved version that is guaranteed to operate in linear time while also finding every match, based on your generic find-a-regex-between-two-regex-matches formula. As an added bonus, it doesn’t use backtracking verbs, which makes it portable to regex engines that don’t support backtracking verbs.

                                    (?xs)(?:\A (?!.*(?:FORBIDDEN)) | (?!\A)\G ) .*?\K DESIRED
                                    Plugging in Bob|Peter|John for FORBIDDEN, and (?:Mary|Helen|Alice) for DESIRED, we get:
                                    (?s-i)(?:\A(?!.*(?:Bob|Peter|John))|(?!\A)\G).*?\K(?:Mary|Helen|Alice)

                                    It works as follows:

                                    1. The BSR (begin search region) of this regex is just the start of the file, \A, followed by negative lookahead for the forbidden words (Bob, Peter, and John).
                                    2. There is no ESR (we want to search the entire file), so following the usual (?:BSR|(?!\A)\G), we just have .*?\K.
                                    3. The things we want to find come after the \K as usual.

                                    I tested this on a 25 thousand line file, and verified that it quickly matches every line if no forbidden words are present, and quickly fails if a forbidden word is present.
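Python's `re` module supports neither \G nor \K, so the regex above cannot be run there as written. As a rough analogue of the same strategy (check for the forbidden words once, then collect the desired words in a single pass), a sketch might look like the following. The function name and the sample data are chosen for illustration; this is not a translation of the regex.

```python
import re

FORBIDDEN = re.compile(r"Bob|Peter|John")
DESIRED = re.compile(r"Mary|Helen|Alice")

def desired_if_clean(text: str) -> list[str]:
    """Return all desired names, or an empty list if any forbidden name occurs."""
    if FORBIDDEN.search(text):       # one scan for the forbidden words
        return []
    return DESIRED.findall(text)     # one scan for the desired words

# The list from earlier in the thread, with the three male names misspelled.
clean = "Jon\nSusan\nHelen\nNicole\nAndrew\nAlice\nPetr\nMike\nMary\nMargaret\nob\n"
dirty = clean.replace("Petr", "Peter")

print(desired_if_clean(clean))  # ['Helen', 'Alice', 'Mary']
print(desired_if_clean(dirty))  # []
```

Each input is scanned at most twice regardless of how many desired words it contains, which mirrors the goal of checking the forbidden words only once.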

                                    • guy038

                                      Hi, @mark-olson and All,

                                      Sorry for the delay. Over the last two days, I’ve been getting some fresh air on the ski slopes at Chamrousse, 35 minutes from Grenoble! Of course, it was a bit crowded on Sunday, but yesterday, Monday, my friend Philippe and I had a great time ;-))


                                      Let’s go back to our regex problems !

                                      Your new solution works well, but ONLY IF the Wrap around option is checked before running the regex!


                                      Starting from your proposal, below:

                                      (?s-i)(?:\A(?!.*(?:Bob|Peter|John))|(?!\A)\G).*?\K(?:Mary|Helen|Alice)

                                      Let’s simplify this regex with just 1 forbidden first name Peter and 1 allowed first name Alice, giving the similar regex :

                                      (?s-i)(?:\A(?!.*Peter)|(?!\A)\G).*?\KAlice

                                      Now, given the INPUT text, below, pasted in a new tab :

                                      Susan
                                      Helen
                                      Nicole
                                      Andrew
                                      Alice
                                      
                                      Mike
                                      Mary
                                      Margaret
                                      

                                      Let’s suppose that we use the Mark dialog with, both, the Purge for each search and Wrap around options checked

                                      • After running this simplified search regex, we get, as expected, the first name Alice marked because no forbidden masculine first name exists in this text

                                      Why :

                                      • First, the regex tries to match the first alternative \A(?!.*Peter). As no forbidden first name exists, this part is true. Then, the regex tries to match the remaining part .*?\KAlice and, of course, we do get the first name Alice marked

                                      • Now, let’s replace, in our text, the empty line, between Alice and Mike , by the forbidden first name Peter

                                      • If we re-run the regex, we do get the expected 0 match in entire file result

                                      Why :

                                      • This time, from the beginning of the file, the regex “sees” the first name Peter on the sixth line, so the first alternative \A(?!.*Peter) is false.

                                      • Thus, it tries the second alternative (?!\A)\G which is also false, because we still are at the very beginning of file

                                      • So, we immediately get the message Mark: 0 match in entire file

                                      • Now, uncheck the Wrap around option

                                      • Move to the very beginning of the new tab ( Ctrl + Home )

                                      • Running again the regex, you still get the correct result Mark : 0 matches from caret to end-of-file


                                      • Finally, move the caret right before the word Helen ( so, on the second line of current file )

                                      • Re-run the regex => the first name Alice is now marked, although the forbidden first name Peter exists in current file

                                      Why :

                                      • Well, the regex still “sees” the first name Peter on the sixth line, so the first alternative \A(?!.*Peter) is false

                                      • Then, it tries the second alternative (?!\A)\G which is, this time, true, because we are not at the very beginning of file ( on the second line )

                                      • Thus, it tries the remaining part .*?\KAlice and we wrongly get the Alice first name marked !


                                      Note that a similar issue appears, too, with my previous regex :

                                      • Let’s start with our INPUT text, adding the forbidden first name Peter :
                                      Susan
                                      Helen
                                      Nicole
                                      Andrew
                                      Alice
                                      Peter
                                      Mike
                                      Mary
                                      Margaret
                                      
                                      • We put the caret at the beginning of the Mike line ( 7th line )

                                      • If I use the Mark dialog, with the Purge for each search option checked BUT the Wrap around option UN-checked

                                      • And the regex (?s-i)(?=.*(?:Bob|Peter|John))(*COMMIT)(*FAIL)|Mary|Helen|Alice

                                      => The Mary first name is marked, although the forbidden first name Peter is present :-((


                                      Conclusion :

                                      Whatever the regex used, in this specific case, we always need to check the Wrap around option to get the expected results

                                      Best Regards,

                                      guy038

                                      • Mark Olson @guy038

                                        @guy038
                                        Looks good! I’d amend it to (?s-i)\A(?=.*(?:Bob|Peter|John))(*COMMIT)(*FAIL)|Mary|Helen|Alice, as this ensures that the check for the forbidden names is only done once at the beginning of the file, and thereby avoids the issue of bad performance on very large files.
