Community
    • Login

    Delete the entire content of all files with less than 100 words

    Scheduled Pinned Locked Moved Help wanted · · · – – – · · ·
    25 Posts 6 Posters 1.4k Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • Alan KilbornA
      Alan Kilborn @rodica F
      last edited by

      @rodica-f said in Delete the entire content of all files with less than 100 words:

      it doesn’t seem to work. In addition, notepad ++ freezes for about 15 seconds

      Yea, I tested my solution on a very small data set, where it worked OK, but I see with something larger the regex engine has to work too hard so it gives up. Sorry, maybe Neil’s idea or someone else might have something to help you.

      1 Reply Last reply Reply Quote 0
      • rodica FR
        rodica F @rodica F
        last edited by rodica F

        This post is deleted!
        1 Reply Last reply Reply Quote 0
        • rodica FR
          rodica F
          last edited by

          thanks for the tip with \z (Seems to be a better with capital \Z )

          Anyway, I find another 2 solutions. But I will consider the case of 6 words instead of 100, to be easy to test.

          This regex will delete the content of the files with less than 6 words (you have to put 5 as for regex to count 6 )

          FIND: (?s)\A(.*?(\w+\s+){6}).*\Z
          REPLACE BY: LEAVE EMPTY

          This regex will delete the content of the files with more than 6 words (the same, you have to put 5 as for regex to count 6 )

          FIND: (?s)(.*?(\w+\s+){5,}).*\Z
          REPLACE BY: LEAVE EMPTY

          Neil SchipperN Alan KilbornA 2 Replies Last reply Reply Quote 0
          • Neil SchipperN
            Neil Schipper @rodica F
            last edited by

            None of our solutions tolerates non-word, non-space characters such as punctuation. A robust solution should probably make use of constructs like \W or [[:punct:]].

            rodica FR Terry RT 2 Replies Last reply Reply Quote 0
            • Alan KilbornA
              Alan Kilborn @rodica F
              last edited by Alan Kilborn

              @rodica-f said :

              This regex will delete the content of the files with less than 6 words
              FIND: (?s)\A(.*?(\w+\s+){6}).*\Z
              REPLACE BY: LEAVE EMPTY

              I don’t find that to be a true statement.

              This regex will delete the content of the files with more than 6 words

              OK…but the original spec was “less than” X words, not “more than”.

              1 Reply Last reply Reply Quote 0
              • rodica FR
                rodica F @Neil Schipper
                last edited by

                @neil-schipper can you please formulate a complete regex solution?

                Alan KilbornA 1 Reply Last reply Reply Quote 0
                • Alan KilbornA
                  Alan Kilborn @rodica F
                  last edited by Alan Kilborn

                  @rodica-f said in Delete the entire content of all files with less than 100 words:

                  can you please formulate a complete regex solution?

                  Yes, Neil please provide complete solution, taking into account every possible situation that we can’t know about, because we don’t know everything about OP’s data. :-)

                  Why are we even helping the notorious “Robin Cruise” anyway?

                  1 Reply Last reply Reply Quote 0
                  • Terry RT
                    Terry R @Neil Schipper
                    last edited by Terry R

                    @neil-schipper said in Delete the entire content of all files with less than 100 words:

                    None of our solutions tolerates non-word, non-space characters such as punctuation. A robust solution should probably make use of constructs like \W or [[:punct:]].

                    I find this problem very intriguing. So I set my mind adrift in the regex documentation because; as @Neil-Schipper pointed out; this will likely involve use of character classes, which is where I had also considered it must go.

                    It firstly involves what constitutes a word, most likely one or more “non-space” characters shown together. I fell upon a character class identifed as [[:space:]], and it’s opposite [^[:space:]].

                    So using @Alan-Kilborn regex I altered it to be:
                    FW(?s)\A([[:space:]]*([^[:space:]]+[[:space:]]+){98,}.+)|.+
                    RW:\1

                    I’m still not convinced I’m entirely there but I’ve put it up for public consumption. Maybe someone else wants to take it a bit further, refine it?

                    So the premise is, find more than x number of words first, followed by the remainder of the file. As this is captured, return it. If this is not possible then use the alternation code and select all of the file and as it is not captured don’t return it. Hence we delete the file content if not equal or greater than the x number we seek.

                    Terry

                    Actually now I’ve posted I can see straight away I don’t need {98,}, it can just be {98} as the following .+ takes care of the rest.

                    1 Reply Last reply Reply Quote 3
                    • guy038G
                      guy038
                      last edited by guy038

                      Hello, @rodica-f, @neil-schipper, @alan-kilborn, @terry-r and All,

                      @terry-r :

                      I found out a variant , based on your use of the [[:space:]] POSIX character class !

                      SEARCH (?s)\A[[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,98}[^[:space:]]+[[:space:]]*\z|\A[[:space:]]+\z

                      REPLACE Leave EMPTY

                      This regex S/R will delete any content of files containing less than 100 words OR even 0 non-space char followed with some [[:space:]] chars

                      Best Regards,

                      guy038

                      rodica FR 2 Replies Last reply Reply Quote 3
                      • rodica FR
                        rodica F @guy038
                        last edited by

                        @guy038 @Terry-R @Alan-Kilborn @Neil-Schipper

                        thank you all. It is always a challenge to discover regex solutions.

                        by the way, I didn’t know the method with [[:punct:]] Where can I find about this regex method on internet? I don’t know how to search about it…

                        Paul WormerP 1 Reply Last reply Reply Quote 0
                        • Paul WormerP
                          Paul Wormer @rodica F
                          last edited by

                          @rodica-f
                          Npp user manual

                          1 Reply Last reply Reply Quote 0
                          • rodica FR
                            rodica F @guy038
                            last edited by rodica F

                            @guy038 said in Delete the entire content of all files with less than 100 words:
                            (?s)\A[[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,98}[^[:space:]]+[[:space:]]*\z|\A[[:space:]]+\z

                            One more question I have for @guy038 I want to use one of your GENERIC S/R for this case. SO I need to delete the content of a file that have less then 10 words between section <START> and <FINAL>

                            <START>
                            
                            The first, thing to note when
                            
                            <FINAL>
                            

                            So, I test with all your GENERIC regex formulas you done a long time ago.

                            BSR = <START>
                            ESR = <FINAL>
                            FR = (?s)\A[[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,10}[^[:space:]]+[[:space:]]*\z|\A[[:space:]]+\z

                            REGEX:

                            (?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\x20\K(FR)

                            (?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\x20\KFR(?=\x20)

                            (?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\x20\KFR

                            (?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\x20\KFR(?=\x20)

                            (?-i:BSR|\G(?!^))(?s:(?!ESR).)*?\K(?-i:FR)

                            (?-i:BSR|(?!\A)\G)(?s:(?!ESR).)*?\K(?-i:FR)

                            (?-i:BSR|(?!^)\G)(?s:(?!ESR).)*?\K(?-i:FR)

                            (?-i:BSR|(?!\A)\G)(?s:(?!ESR).)*?\K(?-i:FR)

                            It is not working, in any of the cases. I get the same message on F/R: “Cannot find the text…”

                            1 Reply Last reply Reply Quote 0
                            • guy038G
                              guy038
                              last edited by guy038

                              Hi, @rodica-f and All,

                              EDIT : The regexes, below, are incomplete. See the correct solution in my next post

                              You do not need to use these generic regexes at all !

                              Simply, replace \A by <START> and \z by <FINAL> and, of course, change the value of the quantifier of the non-capturing group from 98 to 8, giving the functional regex S/R below :

                              SEARCH (?s)<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,8}[^[:space:]]+[[:space:]]*<FINAL>|<START>[[:space:]]+<FINAL>

                              REPLACE Leave EMPTY


                              So, the general formula for deleting all file contents, if there are less than N words between the two boundaries <START> and <FINAL>, is :

                              SEARCH (?s)<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,N-2}[^[:space:]]+[[:space:]]*<FINAL>|<START>[[:space:]]+<FINAL>

                              REPLACE Leave EMPTY

                              BR

                              guy038

                              rodica FR 1 Reply Last reply Reply Quote 1
                              • rodica FR
                                rodica F @guy038
                                last edited by

                                @guy038 correct me if I’m wrong. The GENERIC formula in this case will be:

                                (?s)BSR(FR)*ESR|BSR+ESR

                                I think I’m wrong somewhere.

                                rodica FR 1 Reply Last reply Reply Quote 0
                                • rodica FR
                                  rodica F @rodica F
                                  last edited by

                                  @guy038 by the way I test your generic formula you done for me.

                                  (?s)<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,8}[^[:space:]]+[[:space:]]*<FINAL>|<START>[[:space:]]+<FINAL>

                                  In the context below, delete only everything that is framed in <START> and <FINAL>

                                  But does not delete the entire file, I mean the other words around it.

                                  blah blah     blah
                                  
                                  
                                  <START>
                                  
                                  The first, thing to note when
                                  
                                  <FINAL>
                                  
                                     blah blah
                                  
                                  1 Reply Last reply Reply Quote 0
                                  • guy038G
                                    guy038
                                    last edited by guy038

                                    Hello, @rodica-f and All,

                                    Oh… Yes ! I was wrong about it ! The correct regex S/R is, of course :

                                    SEARCH (?s)\A.*<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,8}[^[:space:]]+[[:space:]]*<FINAL>.*\z|\A.*<START>[[:space:]]+<FINAL>.*\z

                                    REPLACE Leave EMPTY

                                    And the general formula for deleting all file contents, if there are less than N words between the two boundaries <START> and <FINAL>, becomes :

                                    SEARCH (?s)\A.*<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,N-2}[^[:space:]]+[[:space:]]*<FINAL>.*\z|\A.*<START>[[:space:]]+<FINAL>.*\z

                                    REPLACE Leave EMPTY


                                    This regex will delete all file contents in all these cases :

                                    • If there no non-space char ( 0 word ), and only some space chars => the regex is \A.*<START>[[:space:]]+<FINAL>.*\z ( the part after the | symbol )

                                    • If there are several non-space chars ( one word ), possibly surrounded with space chars => quantifier = 0 and the regex becomes (?s)\A.*<START>[[:space:]]*[^[:space:]]+[[:space:]]*<FINAL>.*\z

                                    • If there are several non-space chars followed with space chars, twice ( so two words) => quantifier = 1 and the regex becomes (?s)\A.*<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+)[^[:space:]]+[[:space:]]*<FINAL>.*\z

                                    • If there are several non-space chars followed with space chars, third times ( so three words) => quantifier = 2 and the regex becomes (?s)\A.*<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+){2}[^[:space:]]+[[:space:]]*<FINAL>.*\z

                                    and so on… till :

                                    • If there are several non-space chars followed with space chars, ninth times ( so nine words) => quantifier = 8 and the regex becomes (?s)\A.*<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+){8}[^[:space:]]+[[:space:]]*<FINAL>.*\z

                                    Now, to answer your question, I would say :

                                    SEARCH (?s)\A.*BSR(FR)ESR.*\z

                                    where FR = [[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,N-2}[^[:space:]]+[[:space:]]*    OR    FR = [[:space:]]+ ( case no word )

                                    Best Regards,

                                    guy038

                                    rodica FR 1 Reply Last reply Reply Quote 1
                                    • rodica FR
                                      rodica F @guy038
                                      last edited by

                                      @guy038 thank you very much !

                                      rodica FR 1 Reply Last reply Reply Quote 0
                                      • rodica FR
                                        rodica F @rodica F
                                        last edited by

                                        @rodica-f

                                        Delete the entire content of all files with less than 6 words

                                        FIND:
                                        \A(?i)[^\w+]*(?:[\w*]+[^\w*]+){0,5}(?:[\w*]+[^\w+]*)?\z

                                        REPLACE: (LEAVE EMPTY)

                                        1 Reply Last reply Reply Quote 0
                                        • guy038G
                                          guy038
                                          last edited by guy038

                                          Hi, @rodica-f and All,

                                          I sorry to tell you that your last regex does not meet exactly the previous rules and is rather erroneous !

                                          First, and just anecdotal, the (?i) modifier is useless as no range of letters occurs in your regex

                                          Secondly, this regex will delete all file contents if more than 0 word char and less than 7 word chars

                                          Thirdly, let’s consider this somple phrase :

                                          let abc - xyz
                                          

                                          It contains 4 non-space expressions ( let, abc, - and xyz )

                                          Your regex seems OK as it correctly select all text which contains less than 7 words

                                          Now, change the - sign by a + sign :

                                          let abc + xyz
                                          

                                          This time, your regex does not match anything although there are, still, 4 non-space expressions :((


                                          Why this behaviour occurs ? Well, the different sub-expressions, that you used in your regex, are erroneous !

                                          [^\w+]* means “find a a char different from a word char and different from the + sign”, repeated from 0 to any

                                          [\w*]+ means “find a word char or a * symbol”, repeated from 1 to any

                                          [^\w*]+ means “find a char different from a word char and different from the * symbol”, repeated from 1 to any

                                          So, an almost-correct solution would be \A[^\w]*(?:\w+[^\w]+){0,4}(?:\w+[^\w]*)?\z. However, note that it also matches a true empty file which does not need any replacement as already empty !!


                                          Now, the important drawback of using word chars \w and non-word chars [^\w], is that any symbol, met in text, will increase the number of words !. For instance, see the difference betwen :

                                          This is a simple example
                                          

                                          and :

                                          This is a sim-ple example
                                          

                                          If I use my last “word” version \A[^\w]*(?:\w+[^\w]+){0,4}(?:\w+[^\w]*)?\z, it matches the text This is a simple example and not the text This is a sim-ple example ! Because, in the former case, it counts 5 words and, in the later case, it counts 6 words

                                          That’s why my previous and @terry-r’s version, using non-space characters [[:^space:]] and space chars [[:space:]], seems more rigorous and practical ;-))

                                          Best Regards

                                          guy038

                                          rodica FR 1 Reply Last reply Reply Quote 3
                                          • rodica FR
                                            rodica F @guy038
                                            last edited by

                                            @guy038 said in Delete the entire content of all files with less than 100 words:

                                            \A[^\w]*(?:\w+[^\w]+){0,4}(?:\w+[^\w]*)?\z

                                            My joy is that, thanks to my regex, an alternative method has been discovered, quite good.

                                            thank you @guy038

                                            1 Reply Last reply Reply Quote 1
                                            • First post
                                              Last post
                                            The Community of users of the Notepad++ text editor.
                                            Powered by NodeBB | Contributors