    2 search strings in a group of files with the search function

    • Andrea Cappelli

      Hi,
      I think this topic has been covered before, so I apologize if I am repeating it.
      I need to search for 2 strings in a group of files.
      I tried with
      (Word1) | (word2)
      but I would like each file to appear only once in the search results.
      How should I do this?
      Thanks for your help

      • Claudia Frank

        @Andrea-Cappelli

        I'm not sure what you mean. Could you provide a screenshot or an example,
        along the lines of "this is what I expect and this is what happens"?

        Cheers
        Claudia

        • guy038

          Hello, Andrea,

          What you would really like to do is not clearly defined in your post, so I tried to guess ;-)

          • Seemingly, you’re looking for either the string Word1 OR the string Word2, in several files

          • As these files may contain many occurrences of the string Word1 AND/OR many occurrences of the string Word2, you would prefer to get only ONE result per file, wouldn’t you?

          • But, as it’s likely that most of them contain BOTH strings Word1 and Word2, do you prefer:

            • A) : ONE result per file, containing ONE of the two strings, whichever it is

            • B) : TWO results per file, with a first line containing ONE of the strings and a second line containing the OTHER string

          Anyway, I’ll give you the solution for both cases A) and B) ;-))


          Initially, when I built the two regexes for cases A) and B), I tried them on a dozen or so files and, unfortunately, noticed that when the scanned file is large, the regex may cause catastrophic backtracking, ending in a wrong match of the full contents of that file :-((

          As I have not yet found other regexes that avoid this search failure, I decided to take advantage of an N++ feature available since version 6.9.2: the Find in this finder… option, which you get when you right-click inside the Find result panel!

          So, I split the problem into two parts:

          • First, output in the Find result panel only the lines of each scanned file which contain the string Word1 and/or Word2

          • Second, on that restricted list, use my original regexes to get the right results!


          So, follow these preliminary steps, below :

          • Open the Find in Files dialog ( Ctrl + Shift + F )

          • Type, in the Find what: field, the simple regex (?-i)Word1|Word2

          • Type, in the Replace with: field, the regex $0 (a SAFETY measure, see the notes below)

          • Type, in the Filters field, your list of files to be scanned

          • Type, in the Directory: field, the absolute path of the folder, containing your files

          • Click on the Find All button

          => The Find result panel should appear, with all the concerned lines, from all your files to be scanned !

          Notes :

          • The (?-i) modifier forces the regex search to be case-sensitive. Use the (?i) syntax instead if you prefer a case-insensitive search!

          • Although we’re just searching and not replacing anything, it’s a good habit to always enter $0, which stands for the complete current match. Indeed, suppose that you clicked the Replace in Files button by mistake and confirmed the replacement by clicking the OK button of the validation dialog: this S/R would simply replace each matched string with that same string :-)) Quite safe, isn’t it?
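For readers who want to see this first filtering step outside Notepad++, here is a minimal Python sketch of the same idea. It is only an illustration under assumptions: the function name `matching_lines` and the in-memory `sample` dictionary are hypothetical stand-ins for the files selected by the Filters and Directory fields, and Python's `re` module is not the regex engine that Notepad++ uses.

```python
import re

# Case-sensitive by default, mirroring (?-i)Word1|Word2.
# Pass re.IGNORECASE as a second argument to mimic (?i) instead.
PATTERN = re.compile(r"Word1|Word2")


def matching_lines(files):
    """Return (file name, line number, line) for each line containing a match.

    `files` maps a file name to its text (hypothetical test data,
    not something Notepad++ itself provides).
    """
    hits = []
    for name, text in files.items():
        for number, line in enumerate(text.splitlines(), start=1):
            if PATTERN.search(line):
                hits.append((name, number, line))
    return hits


sample = {
    "a.txt": "foo Word1\nbar\nWord2 baz\nword1 lower\n",
    "b.txt": "nothing here\n",
}
for hit in matching_lines(sample):
    print(hit)
```

The line `word1 lower` is skipped because the search is case-sensitive, just as with the (?-i) modifier.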


          Now, we are going to exploit the restricted text, of the Find result panel :

          Case A) :

          This regex search looks, for each file listed in the Find result panel, for the last line which contains either the string Word1 or the string Word2, in that EXACT case:

          • Right-click, inside the Find result panel, and choose the Find in this finder… option

          • In the Find what: field, type (?s-i).*\K(?:Word1|Word2)

          • Check the Search only on found lines option

          • Uncheck the Match whole word only option

          • Select the Regular expression search mode

          • Click on the Find All button

          => A second “Find result” panel appears, with the indication - Line Filter Mode: only display the filtered results. This new panel should contain, only, ONE line, per file, with, indifferently, the string Word1 or the string Word2 !

          Notes :

          • The first part, (?s-i).* , matches any amount of any character (standard or EOL), possibly empty or spanning several lines, up to the last occurrence, in the file, of the string Word1 or Word2, with its exact case, as listed in the non-capturing group (?:...|...)

          • Due to the \K syntax, the start of the match is reset at that point, so the regex engine reports only the string Word1 or Word2
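The "greedy .* up to the last occurrence" idea can be tried in Python as a rough sketch. Two assumptions apply: Python's standard `re` module does not support \K, so a capturing group marks the word instead, and the `re.DOTALL` flag plays the role of (?s). The function name `last_occurrence` is hypothetical.

```python
import re

# re.DOTALL plays the role of (?s); the capturing group stands in for \K,
# which Python's standard re module does not support.
LAST_HIT = re.compile(r".*(Word1|Word2)", re.DOTALL)


def last_occurrence(text):
    """Return (line number, matched word) of the last Word1/Word2, or None."""
    m = LAST_HIT.match(text)
    if m is None:
        return None
    # Count the line breaks located before the start of the captured word.
    line_number = text.count("\n", 0, m.start(1)) + 1
    return line_number, m.group(1)


print(last_occurrence("x Word1\ny Word2\nz Word1 again\ntail\n"))
```

Because .* is greedy, the engine first swallows the whole text and then backtracks, so the first word it can match is the last one in the text.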


          Case B) :

          This regex search looks for the last TWO lines, in the Find result panel, which contain either the string Word1 first, then the string Word2, OR the string Word2 first, then the string Word1, all in that EXACT case:

          • Select, again, the MAIN Find result panel ( IMPORTANT )

          • Right-click, inside, and choose, again, the Find in this finder… option

          • In the Find what: field, type the regex (?s-i).*\K(?:(Word1)(?=.*(?2))|(Word2)(?=.*(?1)))|.*\K(?:(?1)|(?2))

          • Check the Search only on found lines option

          • Uncheck the Match whole word only option

          • Select the Regular expression search mode

          • Click on the Find All button

          => A third “Find result” panel appears, with the indication - Line Filter Mode: only display the filtered results. This new panel should contain TWO lines, per file, with, for each file :

          • A first line, containing the string Word1 and a second line, containing the string Word2, in that exact case

          OR

          • A first line, containing the string Word2 and a second line, containing the string Word1, in that exact case

          NOTES :

          • If a scanned file contains ONLY the string Word1 or ONLY the string Word2, this single occurrence is also output!

          • This search regex is quite difficult to understand, because it uses expressions called subroutine calls, (?n), which refer to groups 1 and 2. It is not easy to explain correctly, so I started with the simpler regex below:

          (?s-i).*\K(?:Word1(?=.*Word2)|Word2(?=.*Word1))|.*\K(?:Word1|Word2)

          • After matching the longest possible range of any characters, which is discarded due to the \K syntax, the regex engine tries to find either:

            • The string Word1, ONLY IF it’s followed, further on, by the string Word2 ( case C )
              OR
            • The string Word2, ONLY IF it’s followed, further on, by the string Word1 ( case D )
          • Then, on the next search, after matching again the longest possible range of any characters, discarded due to the \K form, the regex engine tries, this time, to find either:

            • The other string Word2 ( case C )
              OR
            • The other string Word1 ( case D )

          The drawback of this regex (?s-i).*\K(?:Word1(?=.*Word2)|Word2(?=.*Word1))|.*\K(?:Word1|Word2) is that you must type each of the two strings three times to get the overall regex to work! By using subroutine calls, we need to enter each string ONCE only, instead of three times!

          In short, the syntax (?n) of a subroutine call stands for the exact pattern of group n, which can be located before or after its reference (?n). So:

          • (?1) is equivalent to the group 1, (Word1)

          • (?2) is equivalent to the group 2, (Word2)

          Therefore, from the original regex form (?s-i).*\K(?:Word1(?=.*Word2)|Word2(?=.*Word1))|.*\K(?:Word1|Word2), we can derive the final regex below, which is just as correct as the other one, although a bit more difficult to understand :-(. However, you only need to write each search string ONCE!

          (?s-i).*\K(?:(Word1)(?=.*(?2))|(Word2)(?=.*(?1)))|.*\K(?:(?1)|(?2))
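The two-step behaviour of the Case B regex (first the last word that is still followed by the other one, then the last remaining word) can be observed with a small Python sketch. This is an approximation under assumptions: Python's standard `re` module has neither \K nor the (?1) subroutine calls, so the sketch uses the spelled-out form of the regex and capturing groups, and the function name `last_pair` is hypothetical.

```python
import re

# Spelled-out form of the regex: Python's re has no \K and no (?1)/(?2).
# Group 1 or group 2 marks the word that \K would have isolated.
PAIR = re.compile(
    r".*(Word1(?=.*Word2)|Word2(?=.*Word1))"  # last word followed by the other one
    r"|.*(Word1|Word2)",                      # then the last remaining word
    re.DOTALL,
)


def last_pair(text):
    """Return [(line number, word), ...] for the successive matches."""
    results = []
    for m in PAIR.finditer(text):
        word = m.group(1) or m.group(2)
        start = m.end() - len(word)  # each match ends right after the word
        results.append((text.count("\n", 0, start) + 1, word))
    return results


print(last_pair("Word1 a\nWord2 b\nWord1 c\nWord2 d\n"))
print(last_pair("only Word1 here\nand Word1 again\n"))
```

On the first sample the function reports lines 3 and 4; on the second one, which contains Word1 only, it reports the single last occurrence, in line with the note above about files containing only one of the two strings.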

          Best Regards,

          guy038

          Additional info :

          1) Concerning the new Find in this finder option of the Search result panel :

          • If you do NOT check the Search only on found lines option, the search is performed on the whole contents of the different files listed in the Search result panel

          • If you check the Search only on found lines option, the search is performed ONLY on the lines listed in the Search result panel


          2) The main difference between a subroutine call and a back-reference is that:

          • A back-reference \n refers to the text currently matched by group n

          • A subroutine call (?n) refers to the pattern of group n

          So, if we consider these four lines :

          123 ABC 123
          123 ABC 789
          789 ABC 123
          789 ABC 789
          

          The regex (\d+) ABC \1 would match lines 1 and 4, whereas the regex (\d+) ABC (?1) would match all four lines

          In other words, the regexes (\d+) ABC (?1) and (\d+) ABC (\d+) match exactly the same text (the second one simply creates an additional group 2)
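The four sample lines can be checked with a short Python sketch. One assumption to keep in mind: Python's standard `re` module has no (?1) syntax, so the subroutine call is represented by its stated equivalent, (\d+) ABC (\d+).

```python
import re

lines = ["123 ABC 123", "123 ABC 789", "789 ABC 123", "789 ABC 789"]

# Back-reference: \1 must repeat the exact text captured by group 1.
backref = re.compile(r"(\d+) ABC \1")
# Stand-in for (\d+) ABC (?1): the pattern of group 1 is simply written again.
reused_pattern = re.compile(r"(\d+) ABC (\d+)")

backref_hits = [i for i, s in enumerate(lines, start=1) if backref.fullmatch(s)]
pattern_hits = [i for i, s in enumerate(lines, start=1) if reused_pattern.fullmatch(s)]

print(backref_hits)   # line numbers matched by the back-reference form
print(pattern_hits)   # line numbers matched by the re-used pattern form
```

The back-reference form accepts only the lines whose two numbers are identical, while the re-used pattern accepts any two numbers.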


          3) Note that when a subroutine call is used INSIDE the group to which it refers, it becomes a recursive sub-pattern reference:

          • For instance, in the regex (\{[^\{\}]+\}(?1)?), group 1 is the overall regex, which contains its own reference (?1). So (?1) is a recursive sub-pattern reference

          • But, in the regex (\{[^\{\}]+\})(?1)?, group 1 is (\{[^\{\}]+\}) and its reference (?1), located outside group 1, is just a subroutine call

          The second regex looks for either a single string {....} surrounded by curly braces or two consecutive strings {....}{....}. The first one, being recursive, can match any number of consecutive strings {....}

          You may test these regexes with the example text below, which contains a well-balanced set of curly braces:

           {This}{is}{a small}{text}{{in order{to {test}{this}}}{{regex}}}
          
          The Community of users of the Notepad++ text editor.