Delete the entire content of all files with less than 100 words

guy038

Hello, @rodica-f and All,

Oh… Yes ! I was wrong about it ! The correct regex S/R is, of course :

SEARCH (?s)\A.*<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,8}[^[:space:]]+[[:space:]]*<FINAL>.*\z|\A.*<START>[[:space:]]+<FINAL>.*\z

REPLACE Leave EMPTY

And the general formula for deleting all file contents, if there are less than N words between the two boundaries <START> and <FINAL>, becomes :

SEARCH (?s)\A.*<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,N-2}[^[:space:]]+[[:space:]]*<FINAL>.*\z|\A.*<START>[[:space:]]+<FINAL>.*\z

REPLACE Leave EMPTY
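
For anyone who wants to check the general formula outside Notepad++, here is a minimal Python sketch. It is only an approximation of the Boost syntax above: Python's re module has no POSIX classes, so [[:space:]] is written \s, [^[:space:]] is written \S and \z is written \Z. The names build_search and sample are made up for this illustration.

```python
import re

def build_search(n: int) -> str:
    """Search pattern for 'fewer than n words between <START> and <FINAL>'."""
    if n < 2:
        raise ValueError("the {0,N-2} quantifier needs N >= 2")
    return (
        r"(?s)\A.*<START>\s*(?:\S+\s+){0,%d}\S+\s*<FINAL>.*\Z"  # 1 to N-1 words
        r"|\A.*<START>\s+<FINAL>.*\Z"                            # no word at all
    ) % (n - 2)

def sample(k: int) -> str:
    """Hypothetical file content with k words between the two boundaries."""
    words = " ".join("word%d" % i for i in range(k))
    return "header\n<START> " + words + " <FINAL>\nfooter\n"

# With N = 10, a file holding 9 words is emptied, a file holding 10 is kept.
emptied = re.sub(build_search(10), "", sample(9))
kept = re.sub(build_search(10), "", sample(10))
```

Here emptied is the empty string and kept is identical to sample(10), since the pattern must match the whole file or nothing.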

This regex will delete all file contents in all these cases :

If there is no non-space char ( 0 words ), and only some space chars => the regex is \A.*<START>[[:space:]]+<FINAL>.*\z ( the part after the | symbol )
If there are several non-space chars ( one word ), possibly surrounded with space chars => quantifier = 0 and the regex becomes (?s)\A.*<START>[[:space:]]*[^[:space:]]+[[:space:]]*<FINAL>.*\z
If there are several non-space chars followed by space chars, twice ( so two words ) => quantifier = 1 and the regex becomes (?s)\A.*<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+)[^[:space:]]+[[:space:]]*<FINAL>.*\z
If there are several non-space chars followed by space chars, three times ( so three words ) => quantifier = 2 and the regex becomes (?s)\A.*<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+){2}[^[:space:]]+[[:space:]]*<FINAL>.*\z

and so on… till :

If there are several non-space chars followed by space chars, nine times ( so nine words ) => quantifier = 8 and the regex becomes (?s)\A.*<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+){8}[^[:space:]]+[[:space:]]*<FINAL>.*\z
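
The cases listed above all follow the rule quantifier = number of words - 1. A small Python sketch can confirm that rule, under the assumption that \s, \S and \Z are acceptable stand-ins for [[:space:]], [^[:space:]] and \z. The helpers exact_words_pattern and sample are hypothetical names, not part of the original regex.

```python
import re

def exact_words_pattern(k: int) -> str:
    """Pattern for exactly k words (k >= 1) between <START> and <FINAL>."""
    # The group is repeated k-1 times, then one last word follows.
    return r"(?s)\A.*<START>\s*(?:\S+\s+){%d}\S+\s*<FINAL>.*\Z" % (k - 1)

def sample(k: int) -> str:
    """Hypothetical file content with k words between the two boundaries."""
    words = " ".join("word%d" % i for i in range(k))
    return "header\n<START> " + words + " <FINAL>\nfooter\n"

def matches(k_pattern: int, k_text: int) -> bool:
    """True if the pattern built for k_pattern words matches a k_text-word file."""
    return re.search(exact_words_pattern(k_pattern), sample(k_text)) is not None

# Each fixed quantifier accepts its own word count and rejects the neighbours.
results = {k: (matches(k, k - 1), matches(k, k), matches(k, k + 1))
           for k in (2, 3, 9)}
```

Every entry of results is (False, True, False): the pattern with quantifier k-1 only accepts a text of exactly k words.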

Now, to answer your question, I would say :

SEARCH (?s)\A.*BSR(FR)ESR.*\z

where FR = [[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,N-2}[^[:space:]]+[[:space:]]* OR FR = [[:space:]]+ ( case no word )
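
As an illustration of how BSR, FR and ESR fit together, here is a hedged Python sketch. The function name compose and the <body> ... </body> boundaries are assumptions chosen for the example, and \s, \S, \Z replace [[:space:]], [^[:space:]] and \z because Python's re module lacks the POSIX classes.

```python
import re

def compose(bsr: str, esr: str, n: int) -> "re.Pattern[str]":
    """Build (?s)\\A.*BSR(FR)ESR.*\\Z for 'fewer than n words' (n >= 2)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    fr_words = r"\s*(?:\S+\s+){0,%d}\S+\s*" % (n - 2)  # 1 to n-1 words
    fr_empty = r"\s+"                                   # case no word
    return re.compile(r"(?s)\A.*" + bsr + "(" + fr_words + "|" + fr_empty + ")"
                      + esr + r".*\Z")

# Hypothetical boundaries: BSR = <body> and ESR = </body>, with N = 4.
pattern = compose(r"<body>", r"</body>", 4)
three = "<html><body> one two three </body></html>"
four = "<html><body> one two three four </body></html>"
blank = "<html><body>   </body></html>"
```

With these inputs, pattern.sub("", three) and pattern.sub("", blank) both give an empty string, while four is left untouched.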

Best Regards,

guy038

rodica F

@guy038 thank you very much !

rodica F

@rodica-f

Delete the entire content of all files with less than 6 words

FIND:
\A(?i)[^\w+]*(?:[\w*]+[^\w*]+){0,5}(?:[\w*]+[^\w+]*)?\z

REPLACE: (LEAVE EMPTY)

guy038

Hi, @rodica-f and All,

I'm sorry to tell you that your last regex does not exactly meet the previous rules and is rather erroneous !

First, and just anecdotal, the (?i) modifier is useless as no range of letters occurs in your regex

Secondly, this regex will delete all file contents if the file contains less than 7 words ( from 0 to 6 words ), and not less than 6 words

Thirdly, let’s consider this simple phrase :

let abc - xyz

It contains 4 non-space expressions ( let, abc, - and xyz )

Your regex seems OK as it correctly selects all text which contains less than 7 words

Now, replace the - sign with a + sign :

let abc + xyz

This time, your regex does not match anything although there are, still, 4 non-space expressions :((

Why does this behaviour occur ? Well, the different sub-expressions, that you used in your regex, are erroneous !

[^\w+]* means “find a char different from a word char and different from the + sign”, repeated from 0 to any

[\w*]+ means “find a word char or a * symbol”, repeated from 1 to any

[^\w*]+ means “find a char different from a word char and different from the * symbol”, repeated from 1 to any
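
The key point is that, inside a [...] class, the + and * symbols are literal characters and not quantifiers. A quick Python check, assuming that Python's re module treats these three classes like the Notepad++ engine does for plain ASCII text (the helper name accepts is made up for this sketch) :

```python
import re

def accepts(char_class: str, ch: str) -> bool:
    """True if the single character ch belongs to the given character class."""
    return re.fullmatch(char_class, ch) is not None

checks = {
    "[^\\w+] accepts '-'": accepts(r"[^\w+]", "-"),   # True : neither word char nor +
    "[^\\w+] accepts '+'": accepts(r"[^\w+]", "+"),   # False : + is excluded
    "[\\w*] accepts 'a'": accepts(r"[\w*]", "a"),     # True : word char
    "[\\w*] accepts '*'": accepts(r"[\w*]", "*"),     # True : literal *
    "[\\w*] accepts '+'": accepts(r"[\w*]", "+"),     # False : not in the class
    "[^\\w*] accepts '+'": accepts(r"[^\w*]", "+"),   # True : only * is excluded
    "[^\\w*] accepts '*'": accepts(r"[^\w*]", "*"),   # False : * is excluded
}
```

So the quantifier that really applies is the one written after the closing bracket.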

So, an almost-correct solution would be \A[^\w]*(?:\w+[^\w]+){0,4}(?:\w+[^\w]*)?\z. However, note that it also matches a true empty file, which does not need any replacement as it is already empty !!
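
A short Python sketch of this “word” version, for testing purposes only. The sole change is \Z instead of \z, which is how Python's re module spells “very end of the text”. The name WORD_RE is just a label for this example.

```python
import re

# Same regex as above, with \Z standing in for \z.
WORD_RE = re.compile(r"\A[^\w]*(?:\w+[^\w]+){0,4}(?:\w+[^\w]*)?\Z")

five = "one two three four five\n"
six = "one two three four five six\n"

five_matches = WORD_RE.search(five) is not None    # True : less than 6 words
six_matches = WORD_RE.search(six) is not None      # False : 6 words
empty_matches = WORD_RE.search("") is not None     # True : the empty-file case
```

The last line shows the empty-file match mentioned above : harmless, but it is a match all the same.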

Now, the important drawback of using word chars \w and non-word chars [^\w], is that any symbol, met in text, will increase the number of words ! For instance, see the difference between :

This is a simple example

and :

This is a sim-ple example

If I use my last “word” version \A[^\w]*(?:\w+[^\w]+){0,4}(?:\w+[^\w]*)?\z, it matches the text This is a simple example and not the text This is a sim-ple example ! Because, in the former case, it counts 5 words and, in the latter case, it counts 6 words
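
The difference between the two counting methods can be reproduced in Python. Note that SPACE_RE below is my own whole-file analogue of the space-based approach, written with \s and \S because Python's re module has no POSIX classes. It is not a regex given earlier in this thread, so take it as an assumption.

```python
import re

# "Word" version from above, with \Z standing in for \z.
WORD_RE = re.compile(r"\A[^\w]*(?:\w+[^\w]+){0,4}(?:\w+[^\w]*)?\Z")
# Hypothetical space-based analogue : a word is any run of non-space chars.
SPACE_RE = re.compile(r"\A\s*(?:\S+\s+){0,4}(?:\S+\s*)?\Z")

plain = "This is a simple example"
hyphen = "This is a sim-ple example"

word_counts = (len(re.findall(r"\w+", plain)), len(re.findall(r"\w+", hyphen)))
space_counts = (len(plain.split()), len(hyphen.split()))

word_verdicts = (bool(WORD_RE.search(plain)), bool(WORD_RE.search(hyphen)))
space_verdicts = (bool(SPACE_RE.search(plain)), bool(SPACE_RE.search(hyphen)))
```

Here word_counts is (5, 6) and word_verdicts is (True, False), whereas space_counts is (5, 5) and space_verdicts is (True, True) : the hyphen only disturbs the \w based count.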

That’s why my previous version and @terry-r’s version, using non-space characters [[:^space:]] and space chars [[:space:]], seem more rigorous and practical ;-))

Best Regards

guy038

rodica F

@guy038 said in Delete the entire content of all files with less than 100 words:

\A[^\w]*(?:\w+[^\w]+){0,4}(?:\w+[^\w]*)?\z

My joy is that, thanks to my regex, an alternative method has been discovered, quite good.

thank you @guy038