Regex: Find all files that do not contain some words



  • I want to search and find all html files that do not contain the words: “Use API site scope”

    I tried this, (?!>)(Use API site scope).* but I believe it needs a different negative look-ahead



  • Hello, @vasile-caraus and All

    AFAIK, the right regex should be (?s)\A((?!Use API site scope).)+\z That is to say, from start to end of file, select all the text, ( like (?s)\A.+\z ), ONLY IF, at any position, inside the file, the string Use API site scope cannot be found, thanks to the negative look-ahead (?!Use API site scope), which is applied at each single position . of the file !
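The idea behind that regex can be tried outside Notepad++. Below is a minimal Python sketch, not the Notepad++ implementation: Python's re module is assumed instead of Boost, so \Z stands in for \z, and the helper name lacks_phrase is hypothetical.

```python
import re

# Tempered negative look-ahead: consume one character at a time,
# but only where the phrase does NOT start at that position.
# Python's re spells the end-of-string anchor \Z (Boost uses \z).
PATTERN = re.compile(r"(?s)\A((?!Use API site scope).)+\Z")

def lacks_phrase(text: str) -> bool:
    """Return True when the whole text matches, i.e. the phrase is absent."""
    return PATTERN.search(text) is not None

print(lacks_phrase("<html>\nnothing here\n</html>"))        # True
print(lacks_phrase("<html>\nUse API site scope\n</html>"))  # False
```

Note that, because of the + quantifier, an empty text does not match at all.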

    Unfortunately, this regex does not work. And after various tests, I think that it’s easier to search, first, for files containing the string Use API site scope and then, deduce all the files which do not contain that string !!


    I suppose that all your HTML files are gathered in a specific directory. If so :

    • Open the Find in Files dialog ( Ctrl + Shift + F )

    • Fill in your directory name, in the Directory: zone

    • Fill in *.htm, in the Filters: zone

    • Select the Regular expression search mode

    • SEARCH (?-i)Use API site scope and click on the Find All button

    => This first search writes, in the Find result panel, all the files, containing the string Use API site scope

    • SEARCH (?-s).\Z and click on the Find All button

    => This second search, of course, gets all the .htm? files, as it matches the last standard character of each file !

    • Select with Ctrl + A, the totality of the Find result panel

    • Copy this selection in the clipboard, with Ctrl + C

    • Now, open a new tab and paste that text with Ctrl + V

    • In that new tab, open the Replace dialog ( Ctrl + H )

    • Check the Wrap around option

    • Perform the regex S/R below :

    SEARCH (?-s)^[^\x20].+\R|(?-i)\x20\(\d+ hits?\)$

    REPLACE Leave EMPTY

    • Then, sort these absolute paths, with the menu option Edit > Line Operations > Sort lines Lexicographically Ascending

    • Add a line break after the last path

    • And, finally, run the last S/R, below :

    SEARCH (?-s)^(.+\R)\1+

    REPLACE Leave EMPTY

    Here we are ! You should get the list of all the files which do not contain the string Use API site scope :-))
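The last S/R above keeps only the paths that occur once in the sorted, combined list. Here is a small Python sketch of that same idea, assuming the two Find All results have already been reduced to plain lists of paths (the function name and the sample paths are hypothetical):

```python
from collections import Counter

def paths_seen_once(files_with_phrase, all_files):
    """Combine both lists and keep the paths that appear exactly once."""
    counts = Counter(list(files_with_phrase) + list(all_files))
    # A path listed by both searches contains the phrase, so it is dropped.
    return sorted(path for path, n in counts.items() if n == 1)

with_phrase = [r"C:\site\a.htm", r"C:\site\c.htm"]
everything = [r"C:\site\a.htm", r"C:\site\b.htm", r"C:\site\c.htm"]
print(paths_seen_once(with_phrase, everything))
```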

    Best Regards

    guy038



  • @guy038 said:

    (?s)\A((?!Use API site scope).)+\z

    BEST SOLUTION, thanks Guy038 !!



  • another solution will be:

    (?s)\A(?!.*?(?<!\w)(Use API site scope)(?!\w)).*
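The variant above adds (?<!\w) and (?!\w), so the phrase only counts when it is not glued to other word characters. A minimal Python sketch of that behaviour, assuming Python's re rather than Boost (the helper name is hypothetical):

```python
import re

# One look-ahead at the start of the text: fail if the phrase,
# not surrounded by word characters, occurs anywhere after it.
PATTERN = re.compile(r"(?s)\A(?!.*?(?<!\w)(Use API site scope)(?!\w)).*")

def lacks_whole_phrase(text: str) -> bool:
    """Return True when the phrase does not occur as a whole word."""
    return PATTERN.match(text) is not None

print(lacks_whole_phrase("line 1\nUse API site scope.\n"))  # False
print(lacks_whole_phrase("line 1\nUse API site scopes\n"))  # True
```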



  • Hi, @Vasile-caraus, and All,

    I found another workaround to get all the HTML files which do not contain the expression Use API site scope

    • As any HTML file ends with the tag </html>, my idea is, first, to add a = symbol ( any other should be OK, too ) at the very end of all the files which do contain the string Use API site scope

    • Secondly, we perform a Find All operation to detect all the files which do not have an equal sign at their very end !

    • Finally, we reverse the previous S/R to delete the extra = symbol in the HTML files


    So :

    • Open the Find in Files dialog ( Ctrl + Shift + F )

    • Fill in your directory name, in the Directory: zone

    • Fill in *.htm, in the Filters: zone

    • Select the Regular expression search mode

    • SEARCH (?s)(?=\A.*Use API site scope).*\K

    • REPLACE =

    • Click on the Replace in Files button and confirm the Are you sure… dialog, after checking

    => A = symbol is added at the very end of all files, containing the string Use API site scope

    • Now, change the search zone with the regex [^=]\K\z

    • Click on the Find All button

    => The Find Result panel should display the last line of all the files without a = symbol at their very end

    • Select, with Ctrl + A, the totality of the Find result panel

    • Copy this selection in the clipboard, with Ctrl + C

    • Then, open a new tab and paste that text with Ctrl + V

    • Now, we must delete the previous S/R modifications, on some HTML files, with the following S/R :

    • SEARCH =\z

    • REPLACE Leave EMPTY

    • Click on the Replace in Files button and confirm the Are you sure… dialog, after checking

    • Finally, in the new tab, where the Find result panel has been pasted, perform a last regex S/R :

    • SEARCH (?-s)^[^\x20].+\R|(?-i)\x20\(1 hit\)$

    • REPLACE Leave EMPTY

    • Click on the Replace All button

    => You should get the absolute path of all the files, which do not contain the expression Use API site scope. Et voilà !

    Cheers,

    guy038



  • @guy038

    I thought you might comment on @Vasile-Caraus 's “BEST SOLUTION” response where he uses \A. Previously you had said that doesn’t work, but apparently it worked for him and I set up a similar test and it worked for me. Maybe you meant to say to avoid \A because it sometimes has problems and can’t be counted on? I have found cases where it doesn’t work but I always give it a try in situations where I want it to work, and only back off to something else when it doesn’t. :-)



  • Hi @vasile-caraus, @Scott-sumner, and All

    Ah yes ! After further tests, I admit that I was wrong about my initial regex. It seems to work, as well as Vasile’s one, too !

    However, it’s important to point out that it works, mainly, because the Find All in files operation scans each file from its very beginning to its very end, and only that way !!

    Indeed, consider, for instance, the text of the change.log of the last 7.5.3 version, below :

    Notepad++ 7.5.3 bug-fixes:
    
    1.  Fix shell extension registration failure in installer.
    2.  Fix theme files installation failure in installer. 
    3.  Fix DSpellCheck incomplete installation in installer.
    
    
    Notepad++ 7.5.2 new features/enhancements & bug-fixes:
    
    1.  Fixed hanging issue while modifying JavaScript TAB settings.
    2.  Add DSpellCheck plugin into distribution.
    3.  Add version and other info into installer.
    4.  Fix an issue while installing a x64 version, x86 version (if it exists) is not removed - and vice versa.
    5.  Fix display glitch of certificate checking error message.
    6.  Remove unused/empty entries from shortcut mapper.
    7.  Add BaanC function list feature.
    8.  Add batch auto-completion into installer.
    
    
    Included plugins:
    
    1.  NppExport v0.2.8 (32-bit x86 only)
    2.  Converter 4.2.1
    3.  Mime Tool 2.1
    4.  DSpellCheck 1.3.2
    
    Updater (Installer only):
    
    * WinGup v4.2
    

    Now, add the string Use API site scope, at the end of the line :

    Notepad++ 7.5.2 new features/enhancements & bug-fixes:
    

    in order to get the new line :

    Notepad++ 7.5.2 new features/enhancements & bug-fixes Use API site scope:
    
    • Open the Find dialog and check the Wrap around option and the Regular expression mode

    • Move the caret to beginning of file

    • Type in the search regex (?s-i)\A((?!Use API site scope).)+\z and click on the Find Next button

    Logically, as the string Use API site scope exists in the file, NO match should be found. That’s the case. So, you could say : everything is OK !

    • Now, place the caret right before the Use API site scope string. A click on the Find Next button gives, again, No match. Right !

    • Then, place the caret right after the uppercase letter U. This time the search wrongly gets a match, from the caret location till the very end of file ! And if the caret is located at any position further on, you get a match although it shouldn’t !

    In theory, as I placed, on purpose, the \A assertion, meaning the very beginning of file, it’s obvious that the string Use API site scope should always be considered as found and there should be NO match at all, due to the negative look-ahead and despite, Scott, the Wrap around option which, internally, forces N++ to scan the file from its very beginning to its very end !!??


    I’m also disturbed because, in the Find result panel, it displays, only, the first line of files, which do not contain the string Use API site scope, although it logically matches all the file contents of the initial Change.log file, if you click on the Find Next button, as it does not contain any string Use API site scope ?!

    But, actually, it’s rather an advantage as the contents of the Find result panel are quite small :-))

    Best Regards,

    guy038



  • @guy038 :

    To say I fully understand what is going on here would be a mistake, but below are my thoughts on it. I’d appreciate being told how I am wrong! :-D


    …it’s important to point out that it works, mainly, because the Find All in files operation scans each file from its very beginning to its very end, and only that way!

    I agree. But then in your post you change it up and start talking about Find Next instead of one/all of the file-level searches (Find All … / Find in Files). And that’s where I start disagreeing. :-)


    place the caret right after the uppercase letter U. This time the search wrongly gets a match, from the caret location till the very end of file ! And if the caret is located at any position further on, you get a match although it shouldn’t

    I disagree. I think the match here (and at any caret positions farther downward in the file) is correct.

    So perhaps a key thing here is how \A functions when used with a search function that does NOT operate at the file level. Let’s take a simple example. Put the caret on the U in Use from @guy038 's example above (the change.log file). From there do a Find Next (downward direction) on the regex \A.. It matches the U even though \A is supposed to only match at the start-of-file. Why is this? Well, Find Next always starts at the caret, and the Boost regex function that eventually gets called knows nothing about the file, it just knows about a string of data to be searched–and that string in this case is whatever is at the caret position and extending through the end-of-file. Thus to the Boost function the U is the start of the string to be searched. So the \A assert matches and the . matches the U.
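That explanation can be modelled outside Notepad++. The Python sketch below is only an analogy, not the actual Notepad++ or Boost code path: it assumes the engine is handed the substring that starts at the caret, which is simulated here with a slice, and the sample text and helper name are hypothetical.

```python
import re

def search_from_caret(pattern, text, caret):
    """Model Find Next: the engine only sees the text from the caret on."""
    return re.search(pattern, text[caret:])

text = "bug-fixes Use API site scope:\nmore text\n"
on_u = text.index("Use")   # caret right before the letter U
after_u = on_u + 1         # caret right after the letter U

# \A matches at the start of whatever string the engine receives.
print(search_from_caret(r"\A.", text, on_u).group())  # U

tempered = r"(?s)\A((?!Use API site scope).)+\Z"
# Whole file, or caret right before the phrase: the phrase is visible.
print(search_from_caret(tempered, text, 0) is None)     # True
print(search_from_caret(tempered, text, on_u) is None)  # True
# Caret after the U: the slice starts with "se API ...", so it matches.
print(search_from_caret(tempered, text, after_u) is not None)  # True
```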


    …and despite, Scott, the Wrap around option which, internally, forces N++ to scan the file from its very beginning to its very end ???

    The forcing you mention is only for the Replace All case with Wrap around ticked. However, above you were talking about Find Next instead, which, when Wrap around is ticked and direction is downward, MUST search from the caret to end-of-file, and then, if a match is not found there, a second search is done from the beginning-of-file to the original user caret placement position. Note that two “internal” searches may be done for only one “user” search. Side note: The Replace-All with Wrap-around search behavior is discussed in more detail here.

    Note that, for the current discussion of Find Next, the Wrap around option could be ticked or not, it doesn’t matter or change anything relevant to the use of \A.


    I’m also disturbed because, in the Find result panel, it displays, only, the first line of files, which do not contain the string Use API site scope, although it logically matches all the file contents

    This does not disturb me at all. :-D
    Anytime there is a multi-line match during searching, only the first line gets written to the Find-result panel. When Use API site scope is not present in the file, the regex matches the whole file contents (as you said), which given a non-trivial file, would be multi-line data…and only the first line is properly put in the panel.



  • Hi, @scott-sumner,

    Sorry for replying so late !

    • About the last point of your post, you’re quite right ! I just forgot that this is simply the normal display of the Find result panel. For instance, the simple regex (?s).+, and a click on the Find All in All Opened Documents button, would display the line 1 of all the opened documents, only ! ( BTW, don’t forget the plus sign, because the similar search of (?s). => Message “Notepad++ doesn’t answer”, after a while !! )

    • About the forcing of the search boundaries to the very beginning and to the very end, you’re right too ! I simply mixed the two cases Replace All and Find Next, when the Wrap around option is checked. Obviously, their behaviours are not identical :-)

    • Now, when you say :

    Well, Find Next always starts at the caret, and the Boost regex function that eventually gets called knows nothing about the file, it just knows about a string of data to be searched–and that string in this case is whatever is at the caret position and extending through the end-of-file. Thus to the Boost function the U is the start of the string to be searched

    That is the key point ! Is that correct ? Just note that, in this part of the Boost 1.55.0 documentation, below :

    http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.buffer_boundaries

    it is said :

    • \A matches at the start of a buffer only (the same as \`).

    • \z matches at the end of a buffer only (the same as \').

    So, to my mind, the buffer contents should be the contents of any file scanned or of the current file ! Because, if what you say is correct, the \A assertion seems totally useless and is not the counterpart of the \z assertion, at all !!

    For instance, imagine that you add the four characters test, without any line-break, at the end of any current file. Then, move the caret at the middle of your current file and perform the S/R below :

    SEARCH (?-s).\z

    REPLACE $0<

    A click on the Replace All button ( or a click on the Find Next button, followed by a click on the Replace button ) correctly changes the last line test into the line test<, only !

    Now, with the same last line test, without any line-break, move, again, the caret, near the middle of the current file and perform this second S/R :

    SEARCH (?-s)\A.

    REPLACE >$0

    This time, a click, on the Replace All button, wrongly adds a > symbol, in front of any standard character, of the current line !!??


    On the contrary, note, Scott, that the François-R Boyer version, discussed in the post, below :

    https://notepad-plus-plus.org/community/topic/13513/proximity-search/13

    used with N++ v5.9.0, correctly adds the > in front of the first character of the current file, ONLY :-))

    Best Regards,

    guy038



  • the most simple solution:

    (?s)\A(?!.*(YOUR_WORDS).*$)
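For completeness, the same look-ahead can be applied to a whole folder outside Notepad++. This is only a sketch: Python's re and pathlib are assumed, the function name, the *.htm* glob and the UTF-8 decoding are my own assumptions, and re.escape is added so that the words are taken literally.

```python
import re
from pathlib import Path

def files_without(directory, words, glob="*.htm*"):
    """List the files under directory whose text does not contain words."""
    # Same shape as (?s)\A(?!.*(YOUR_WORDS).*$), with the words escaped.
    pattern = re.compile(r"(?s)\A(?!.*(" + re.escape(words) + r").*$)")
    result = []
    for path in sorted(Path(directory).rglob(glob)):
        if not path.is_file():
            continue
        # Assumption: files are UTF-8; undecodable bytes are replaced.
        text = path.read_text(encoding="utf-8", errors="replace")
        if pattern.match(text):
            result.append(path)
    return result
```

A call such as files_without("some_folder", "Use API site scope") would then return the matching paths, where "some_folder" is a placeholder for your own directory.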

