    Delete the entire content of all files with less than 100 words

    25 Posts 6 Posters 1.4k Views
    • rodica FR
      rodica F @guy038
      last edited by

      @guy038 @Terry-R @Alan-Kilborn @Neil-Schipper

      Thank you all. It is always a challenge to discover regex solutions.

      By the way, I didn’t know the method with [[:punct:]]. Where can I find information about this regex method on the internet? I don’t know how to search for it…

      • Paul WormerP
        Paul Wormer @rodica F
        last edited by

        @rodica-f
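        See the Notepad++ User Manual: its chapter on searching covers regular expressions, including character classes such as [[:punct:]].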

        • rodica FR
          rodica F @guy038
          last edited by rodica F

          @guy038 said in Delete the entire content of all files with less than 100 words:
          (?s)\A[[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,98}[^[:space:]]+[[:space:]]*\z|\A[[:space:]]+\z

          I have one more question for @guy038. I want to use one of your GENERIC S/R for this case: I need to delete the content of a file that has less than 10 words between the sections <START> and <FINAL>

          <START>
          
          The first, thing to note when
          
          <FINAL>
          

          So, I tested all the GENERIC regex formulas you wrote a long time ago.

          BSR = <START>
          ESR = <FINAL>
          FR = (?s)\A[[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,10}[^[:space:]]+[[:space:]]*\z|\A[[:space:]]+\z

          REGEX:

          (?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\x20\K(FR)

          (?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\x20\KFR(?=\x20)

          (?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\x20\KFR

          (?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\x20\KFR(?=\x20)

          (?-i:BSR|\G(?!^))(?s:(?!ESR).)*?\K(?-i:FR)

          (?-i:BSR|(?!\A)\G)(?s:(?!ESR).)*?\K(?-i:FR)

          (?-i:BSR|(?!^)\G)(?s:(?!ESR).)*?\K(?-i:FR)

          (?-i:BSR|(?!\A)\G)(?s:(?!ESR).)*?\K(?-i:FR)

          It does not work in any of these cases. I get the same message in Find/Replace: “Cannot find the text…”

          • guy038G
            guy038
            last edited by guy038

            Hi, @rodica-f and All,

            EDIT : The regexes, below, are incomplete. See the correct solution in my next post

            You do not need to use these generic regexes at all !

            Simply, replace \A by <START> and \z by <FINAL> and, of course, change the value of the quantifier of the non-capturing group from 98 to 8, giving the functional regex S/R below :

            SEARCH (?s)<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,8}[^[:space:]]+[[:space:]]*<FINAL>|<START>[[:space:]]+<FINAL>

            REPLACE Leave EMPTY


            So, the general formula for deleting all file contents, if there are less than N words between the two boundaries <START> and <FINAL>, is :

            SEARCH (?s)<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,N-2}[^[:space:]]+[[:space:]]*<FINAL>|<START>[[:space:]]+<FINAL>

            REPLACE Leave EMPTY

            BR

            guy038

            • rodica FR
              rodica F @guy038
              last edited by

              @guy038 correct me if I’m wrong. The GENERIC formula in this case will be:

              (?s)BSR(FR)*ESR|BSR+ESR

              I think I’m wrong somewhere.

              • rodica FR
                rodica F @rodica F
                last edited by

                @guy038 by the way, I tested the formula you wrote for me.

                (?s)<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,8}[^[:space:]]+[[:space:]]*<FINAL>|<START>[[:space:]]+<FINAL>

                In the context below, it deletes only what is framed by <START> and <FINAL>.

                It does not delete the entire file, I mean the other words around that section.

                blah blah     blah
                
                
                <START>
                
                The first, thing to note when
                
                <FINAL>
                
                   blah blah
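
The behaviour described in this post can be reproduced outside Notepad++. The following is only a minimal Python sketch, assuming that `\s` and `\S` are acceptable stand-ins for `[[:space:]]` and `[^[:space:]]` (Python's `re` module has no POSIX bracket classes). The sample text is the one shown above.

```python
import re

# The search from the earlier post, translated for Python's re module.
# Assumption of this sketch: \s stands in for [[:space:]], \S for [^[:space:]].
INCOMPLETE = re.compile(
    r"<START>\s*(?:\S+\s+){0,8}\S+\s*<FINAL>|<START>\s+<FINAL>"
)

sample = (
    "blah blah     blah\n\n\n"
    "<START>\n\nThe first, thing to note when\n\n<FINAL>\n\n"
    "   blah blah\n"
)

# Replace with nothing, like an empty REPLACE field.
result = INCOMPLETE.sub("", sample)
print(repr(result))
```

Only the framed section disappears; the `blah` lines before and after it remain, because nothing in the pattern reaches the start or the end of the file.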
                
                • guy038G
                  guy038
                  last edited by guy038

                  Hello, @rodica-f and All,

                  Oh… Yes ! I was wrong about it ! The correct regex S/R is, of course :

                  SEARCH (?s)\A.*<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,8}[^[:space:]]+[[:space:]]*<FINAL>.*\z|\A.*<START>[[:space:]]+<FINAL>.*\z

                  REPLACE Leave EMPTY

                  And the general formula for deleting all file contents, if there are less than N words between the two boundaries <START> and <FINAL>, becomes :

                  SEARCH (?s)\A.*<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,N-2}[^[:space:]]+[[:space:]]*<FINAL>.*\z|\A.*<START>[[:space:]]+<FINAL>.*\z

                  REPLACE Leave EMPTY


                  This regex will delete all file contents in all these cases :

                  • If there is no non-space char ( 0 words ), and only some space chars => the regex is \A.*<START>[[:space:]]+<FINAL>.*\z ( the part after the | symbol )

                  • If there are several non-space chars ( one word ), possibly surrounded with space chars => quantifier = 0 and the regex becomes (?s)\A.*<START>[[:space:]]*[^[:space:]]+[[:space:]]*<FINAL>.*\z

                  • If there are several non-space chars followed with space chars, twice ( so two words ) => quantifier = 1 and the regex becomes (?s)\A.*<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+)[^[:space:]]+[[:space:]]*<FINAL>.*\z

                  • If there are several non-space chars followed with space chars, three times ( so three words ) => quantifier = 2 and the regex becomes (?s)\A.*<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+){2}[^[:space:]]+[[:space:]]*<FINAL>.*\z

                  and so on… till :

                  • If there are several non-space chars followed with space chars, nine times ( so nine words ) => quantifier = 8 and the regex becomes (?s)\A.*<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+){8}[^[:space:]]+[[:space:]]*<FINAL>.*\z
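
The cases listed above can be checked mechanically. This is only a sketch under stated assumptions: `build_search` and `make_sample` are hypothetical helper names, `\s` / `\S` replace `[[:space:]]` / `[^[:space:]]`, `re.DOTALL` plays the role of `(?s)`, and Python's `\Z` is used where the Notepad++ regex has `\z`.

```python
import re


def build_search(n: int) -> re.Pattern:
    """Hypothetical helper: match the whole text when fewer than n words
    sit between <START> and <FINAL> (general formula, quantifier {0,n-2})."""
    if n < 2:
        raise ValueError("the {0,N-2} quantifier needs N >= 2")
    words = rf"\s*(?:\S+\s+){{0,{n - 2}}}\S+\s*"
    return re.compile(
        rf"\A.*<START>{words}<FINAL>.*\Z|\A.*<START>\s+<FINAL>.*\Z",
        re.DOTALL,
    )


def make_sample(k: int) -> str:
    """Hypothetical test text with exactly k words between the markers."""
    body = " ".join(f"w{i}" for i in range(k))
    return f"blah blah\n<START>\n{body}\n<FINAL>\n   blah\n"


pattern = build_search(10)
# Word counts (0 to 12) for which the whole text would be deleted.
matched = [k for k in range(13) if pattern.search(make_sample(k))]
print(matched)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With N = 10, texts holding 0 to 9 words between the markers match in full, and texts holding 10 or more words do not match at all.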

                  Now, to answer your question, I would say :

                  SEARCH (?s)\A.*BSR(FR)ESR.*\z

                  where FR = [[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,N-2}[^[:space:]]+[[:space:]]*    OR    FR = [[:space:]]+ ( case no word )

                  Best Regards,

                  guy038

                  • rodica FR
                    rodica F @guy038
                    last edited by

                    @guy038 thank you very much !

                    • rodica FR
                      rodica F @rodica F
                      last edited by

                      @rodica-f

                      Delete the entire content of all files with less than 6 words

                      FIND:
                      \A(?i)[^\w+]*(?:[\w*]+[^\w*]+){0,5}(?:[\w*]+[^\w+]*)?\z

                      REPLACE: (LEAVE EMPTY)

                      • guy038G
                        guy038
                        last edited by guy038

                        Hi, @rodica-f and All,

                        I’m sorry to tell you that your last regex does not exactly meet the previous rules and is rather erroneous !

                        First, and just anecdotal, the (?i) modifier is useless as no letter or range of letters occurs in your regex

                        Secondly, this regex will delete all file contents if the file contains less than 7 words ( 0 to 6 words ), not less than 6 words, because the {0,5} group is followed by one more optional word

                        Thirdly, let’s consider this simple phrase :

                        let abc - xyz
                        

                        It contains 4 non-space expressions ( let, abc, - and xyz )

                        Your regex seems OK as it correctly selects all the text, which contains less than 7 words

                        Now, replace the - sign with a + sign :

                        let abc + xyz
                        

                        This time, your regex does not match anything although there are, still, 4 non-space expressions :((


                        Why does this behaviour occur ? Well, the different sub-expressions that you used in your regex are erroneous !

                        [^\w+]* means “find a char different from a word char and different from the + sign”, repeated from 0 to any

                        [\w*]+ means “find a word char or a * symbol”, repeated from 1 to any

                        [^\w*]+ means “find a char different from a word char and different from the * symbol”, repeated from 1 to any

                        So, an almost-correct solution would be \A[^\w]*(?:\w+[^\w]+){0,4}(?:\w+[^\w]*)?\z. However, note that it also matches a true empty file which does not need any replacement as already empty !!
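
The meaning of these bracket expressions can be verified one character at a time. A minimal Python sketch, assuming that Python's `re` module treats `+` and `*` inside brackets the same way as the Notepad++ regex engine, that is, as literal characters and not as quantifiers; `one_char` is a hypothetical helper name.

```python
import re


def one_char(cls: str, ch: str) -> bool:
    """True when the single character ch is accepted by the bracket class cls."""
    return re.fullmatch(cls, ch) is not None


print(one_char(r"[^\w+]", "+"))  # False: the + sign is excluded
print(one_char(r"[^\w+]", "-"))  # True
print(one_char(r"[\w*]", "*"))   # True: the * symbol counts as a "word" char
print(one_char(r"[^\w*]", "*"))  # False: the * symbol is excluded
print(one_char(r"[^\w*]", "+"))  # True
print(one_char(r"[^\w]", "+"))   # True: plain non-word class accepts any symbol
```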


                        Now, the important drawback of using word chars \w and non-word chars [^\w] is that any symbol met inside a word splits it and so increases the number of words ! For instance, see the difference between :

                        This is a simple example
                        

                        and :

                        This is a sim-ple example
                        

                        If I use my last “word” version \A[^\w]*(?:\w+[^\w]+){0,4}(?:\w+[^\w]*)?\z, it matches the text This is a simple example but not the text This is a sim-ple example ! Because, in the former case, it counts 5 words and, in the latter case, it counts 6 words

                        That’s why my previous version and @terry-r’s version, using non-space characters [[:^space:]] and space chars [[:space:]], seem more rigorous and practical ;-))
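
The difference between the two counting methods can be sketched in Python. This is an illustration under assumptions, not the Notepad++ search itself: `\Z` replaces `\z`, `\s` / `\S` replace `[[:space:]]` / `[^[:space:]]`, and the space-based pattern is the whole-file formula with N = 6, so its quantifier is {0,4}.

```python
import re

# "Word" version: symbols such as - split a word in two.
WORD_BASED = re.compile(r"\A[^\w]*(?:\w+[^\w]+){0,4}(?:\w+[^\w]*)?\Z")

# Space-based version for "less than 6 words": only whitespace separates words.
SPACE_BASED = re.compile(r"\A\s*(?:\S+\s+){0,4}\S+\s*\Z|\A\s+\Z")

plain = "This is a simple example\n"
hyphenated = "This is a sim-ple example\n"

for text in (plain, hyphenated):
    print(
        repr(text),
        "word-based:", WORD_BASED.search(text) is not None,
        "space-based:", SPACE_BASED.search(text) is not None,
    )
```

The word-based pattern matches only the first text, while the space-based pattern matches both, since each of them holds 5 space-separated words.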

                        Best Regards

                        guy038

                        • rodica FR
                          rodica F @guy038
                          last edited by

                          @guy038 said in Delete the entire content of all files with less than 100 words:

                          \A[^\w]*(?:\w+[^\w]+){0,4}(?:\w+[^\w]*)?\z

                          I am glad that, thanks to my regex, a quite good alternative method has been discovered.

                          thank you @guy038

                          The Community of users of the Notepad++ text editor.