
    Notepad++ Regex to find files having more than 2 words but should exclude certain word

    4 Posts 3 Posters 430 Views
    • Jignesh Waghela

      I have the files below:
      File 1
      hjsakh
      hasj
      ashjh word1 sadhjhf
      asgdga
      ashsad
      sj word3a sdha
      ajsh
      ashjh
      word2a

      File 2
      hjsakh
      hasj
      ashjh word1 sadhjhf
      asgdga
      ashsad
      word2a
      sdhfj
      sj word3b sdha
      ajsh
      ashjh

      File 3
      hjsakh
      hasj
      ashjh word1 sadhjhf
      asgdga
      ashsad
      word2b
      sdh
      sj sdha
      ajsh
      ashjh

      File 4
      hjsakh
      hasj
      asgdga
      ashsad
      word2c
      sdh
      sj sdha
      ajsh
      ashjh
      ashjh word1 sadhjhf

      File 5
      hjsakh
      hasj
      ashjh word1 sadhjhf
      asgdga
      ashsad
      word3a
      sdh
      sj sdha
      ajsh
      ashjh

      File 6
      hjsakh
      hasj
      word3b
      ashjh word1 sadhjhf
      asgdga
      ashsad
      sdh
      sj sdha
      ajsh
      ashjh

      I am looking for a regex that finds files satisfying these criteria: (contains word1) & (contains word2a or word2b or word2c) & (contains neither word3a nor word3b). In the above example it should find only File 3 and File 4.

      • mkupper @Jignesh Waghela

        @Jignesh-Waghela Regular expressions are read from left to right and are intended to match patterns that are scanned left to right. As you want to match words that may appear in any order, I would do this in several passes rather than constructing a long regexp that covers all the possible orders of the words and their conditions.

        My passes were:

        1. Scan for files containing word1 and make a list of these. It turns out all of your files contain word1.
        2. Scan the list of files from step 1 for (word2a|word2b|word2c) - We are down to a second list that has files 1, 2, 3, and 4.
        3. It turns out that the not scanner, (?!word3a|word3b), only works on lines, not files, even if you try (?-s)(?!word3a|word3b). Thus I would scan list 2 for (word3a|word3b) and exclude those. This excludes files 1 and 2 (it would also exclude 5 and 6, but those are not in list 2). The remaining files are 3 and 4.
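        The three passes above can also be sketched outside Notepad++. The following is only an illustration in Python, not something from the original post: SAMPLE_FILES is a hypothetical, abbreviated stand-in for the six files in the question (only the relevant lines are kept), and multi_pass is a made-up helper name.

```python
import re

# Hypothetical, abbreviated stand-ins for the six sample files.
SAMPLE_FILES = {
    "File 1": "hasj\nashjh word1 sadhjhf\nsj word3a sdha\nword2a\n",
    "File 2": "ashjh word1 sadhjhf\nword2a\nsj word3b sdha\n",
    "File 3": "ashjh word1 sadhjhf\nword2b\nsj sdha\n",
    "File 4": "word2c\nsj sdha\nashjh word1 sadhjhf\n",
    "File 5": "ashjh word1 sadhjhf\nword3a\nsj sdha\n",
    "File 6": "word3b\nashjh word1 sadhjhf\nsj sdha\n",
}


def multi_pass(files):
    """Apply the three passes in order and return the surviving file names."""
    # Pass 1: keep files that contain word1.
    list1 = [name for name, text in files.items() if re.search(r"word1", text)]
    # Pass 2: of those, keep files that contain any of the word2 variants.
    list2 = [name for name in list1
             if re.search(r"word2a|word2b|word2c", files[name])]
    # Pass 3: drop any file that contains one of the excluded words.
    return [name for name in list2
            if not re.search(r"word3a|word3b", files[name])]


print(multi_pass(SAMPLE_FILES))
```

        Each pass is a plain search, so the order in which the words appear inside a file does not matter.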
        • Coises @Jignesh Waghela

          @Jignesh-Waghela said in Notepad++ Regex to find files having more than 2 words but should exclude certain word:

          I am looking for a regex that finds files satisfying these criteria: (contains word1) & (contains word2a or word2b or word2c) & (contains neither word3a nor word3b). In the above example it should find only File 3 and File 4.

          Try:

          (?s)\A(?=.*(word2a|word2b|word2c))(?!.*(word3a|word3b)).*word1

          Edit to add:

          (?s)\A(?=.*?\b(word2a|word2b|word2c)\b)(?!.*?\b(word3a|word3b)\b).*?\bword1\b

          is probably better. The \b assertions require word boundaries, so that reword2a or word13 won’t count. Changing .* to .*? won’t change which files match, but it might make the expressions more efficient and less likely to run into “expression too complex” failures.
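          For anyone who wants to try the word-boundary version outside Notepad++, here is a rough sketch using Python's re module. Notepad++ uses a different regex engine, so treat this as an approximation of the behaviour rather than a test of Notepad++ itself; file_matches and the sample strings are made up for illustration.

```python
import re

# The second expression from the post, unchanged.
PATTERN = re.compile(
    r"(?s)\A(?=.*?\b(word2a|word2b|word2c)\b)"
    r"(?!.*?\b(word3a|word3b)\b).*?\bword1\b"
)


def file_matches(text):
    """Return True if the whole-file text satisfies all three criteria."""
    # match() starts at position 0, which is also what \A requires.
    return PATTERN.match(text) is not None


print(file_matches("ashjh word1 sadhjhf\nword2b\nsj sdha\n"))     # wanted file
print(file_matches("ashjh word1 sadhjhf\nword2a\nsj word3b x\n"))  # has word3b
print(file_matches("ashjh word1 sadhjhf\nreword2a\n"))            # no whole word2a
```

          The two lookaheads only test the file from the start without consuming anything; the final .*?\bword1\b is what actually produces a non-empty match.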

          • Coises @mkupper

            @mkupper said in Notepad++ Regex to find files having more than 2 words but should exclude certain word:

            It turns out that the not scanner, (?!word3a|word3b) only works on lines, not files, even if you try (?-s)(?!word3a|word3b).

            You’re misunderstanding what is going wrong. The expression you gave says, “Match null here if the following characters aren’t either word3a or word3b.” That matches at every position in a file except immediately before word3a or word3b. (Each line is listed once in the search results, but look at the number of hits.) What is needed is (?s)\A(?!.*(word3a|word3b)), which says, “Match null at the beginning of the document if the following characters don’t match any number of arbitrary characters followed by either word3a or word3b.”
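            The difference is easy to see by counting hits. The sketch below uses Python's re module as a stand-in (Notepad++ uses a different engine, so this only illustrates the idea); the sample text and the helper names are hypothetical.

```python
import re

UNANCHORED = re.compile(r"(?!word3a|word3b)")
ANCHORED = re.compile(r"(?s)\A(?!.*(word3a|word3b))")


def unanchored_hit_positions(text):
    """Positions where the bare negative lookahead matches (null matches)."""
    return [m.start() for m in UNANCHORED.finditer(text)]


def anchored_ok(text):
    """True only if neither excluded word occurs anywhere in the text."""
    return ANCHORED.search(text) is not None


sample = "abc word3a def"
hits = unanchored_hit_positions(sample)
# The bare lookahead matches almost everywhere; it only fails at the one
# position immediately before "word3a" (index 4 in this sample).
print(len(hits), 4 in hits)
# The anchored form gives a single yes/no answer for the whole text.
print(anchored_ok(sample), anchored_ok("abc def"))
```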

            When I tested the whole expression I got some erratic results when I had it match null, which is why I set it to match the string from the beginning up to the word word1.
