Delete the entire content of all files with less than 100 words
-
Here’s a game plan:
Phase 1: prepend many-words files with a unique marker
- match <file start><anything, non-greedily><sequence of > 100 words, ie, use count specification {101,}><anything, greedily>
- replace with something like rodica does not want this file cleared out\r\n$0 or ZX$0
Phase 2: clear all few-words files
- match <negative look-behind: <file start (\A)><marker text>><anything, greedily>
- replace with nothing
Phase 3: remove markers from many-words files
- match <file start (\A)><marker text><anything, greedily, captured to group 1>
- replace with group 1 text
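The three phases above can be tried out on in-memory strings before touching real files. What follows is only a minimal Python sketch, not Notepad++ syntax: the marker ZX, the function name, and the use of \S and \s for "word" and "gap" are assumptions made for illustration, and Phase 2 is written here as a negative look-ahead placed after \A.

```python
import re

MARKER = "ZX"      # assumed marker; must not occur at the start of any real file
MIN_WORDS = 101    # "more than 100 words"

# Phase 1: at file start, look ahead for MIN_WORDS - 1 "word + gap" units
# followed by the first character of one more word.
MANY_WORDS = re.compile(r"\A(?=\s*(?:\S+\s+){%d}\S)" % (MIN_WORDS - 1))
# Phase 2: the whole file, but only when it does not start with the marker.
UNMARKED = re.compile(r"(?s)\A(?!%s).*" % re.escape(MARKER))
# Phase 3: the marker at file start.
MARKED = re.compile(r"\A%s" % re.escape(MARKER))

def three_phases(text):
    text = MANY_WORDS.sub(MARKER, text, count=1)  # mark many-words files
    text = UNMARKED.sub("", text)                 # clear few-words files
    return MARKED.sub("", text)                   # remove the marker again

many = " ".join(["word"] * 101)
few = " ".join(["word"] * 100)
print(three_phases(many) == many)
print(three_phases(few) == "")
```

Requiring at least one whitespace character after every counted word keeps the count unambiguous, which is why the sketch uses \S+\s+ rather than an optional separator.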
Can you take it from here?
Also, you don’t say if you want to do F&R on all loaded files v. all specified files on disk. If the former, and you wish to preserve file dates, you don’t need to save modified, non-empty files after F&R. If the latter, uncleared files will have undergone date modification.
- match <file start><anything, non-greedily><sequence of > 100 words, ie, use count specification
-
Something close to this might do it:
\A(\s*\w+){0,99}\s*\z
-
@alan-kilborn That looks very nice.
-
@alan-kilborn said in Delete the entire content of all files with less than 100 words:
\A(\s*\w+){0,99}\s*\z
Thank you. I tried your solution, but it doesn’t seem to work. In addition, Notepad++ freezes for about 15 seconds.
-
@rodica-f said in Delete the entire content of all files with less than 100 words:
it doesn’t seem to work. In addition, notepad ++ freezes for about 15 seconds
Yea, I tested my solution on a very small data set, where it worked OK, but I see with something larger the regex engine has to work too hard so it gives up. Sorry, maybe Neil’s idea or someone else might have something to help you.
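One plausible reason for the freeze (an assumption, not something confirmed in this thread) is that \s*\w+ lets a single word be split into several "words", so on a file with too many words the engine has a huge number of splits to try before it can give up. The ambiguity itself is easy to show on a tiny input. This is a Python sketch with made-up helper names, not Notepad++ syntax:

```python
import re

# Each repetition may take just part of a word because \s* can match nothing,
# so "abc" can be read as three "words": a, b, c.
ambiguous = re.compile(r"(\s*\w+){3}")
# Requiring at least one whitespace character between words removes the choice.
separated = re.compile(r"\s*(?:\S+\s+){2}\S+\s*")

def counts_as_three_words(pattern, text):
    return pattern.fullmatch(text) is not None

print(counts_as_three_words(ambiguous, "abc"))      # True
print(counts_as_three_words(separated, "abc"))      # False
print(counts_as_three_words(separated, "a b c"))    # True
```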
-
-
Thanks for the tip with \z (capital \Z seems to work better).
Anyway, I found two other solutions. I will use 6 words instead of 100 so they are easy to test.
This regex will delete the content of files with less than 6 words (you have to put 5 for the regex to count 6):
FIND:
(?s)\A(.*?(\w+\s+){6}).*\Z
REPLACE BY: LEAVE EMPTY
This regex will delete the content of files with more than 6 words (likewise, you have to put 5 for the regex to count 6):
FIND:
(?s)(.*?(\w+\s+){5,}).*\Z
REPLACE BY: LEAVE EMPTY
-
None of our solutions tolerates non-word, non-space characters such as punctuation. A robust solution should probably make use of constructs like \W or [[:punct:]].
-
@rodica-f said :
This regex will delete the content of the files with less than 6 words
FIND: (?s)\A(.*?(\w+\s+){6}).*\Z
REPLACE BY: LEAVE EMPTY
I don’t find that to be a true statement.
This regex will delete the content of the files with more than 6 words
OK…but the original spec was “less than” X words, not “more than”.
-
@neil-schipper can you please formulate a complete regex solution?
-
@rodica-f said in Delete the entire content of all files with less than 100 words:
can you please formulate a complete regex solution?
Yes, Neil please provide complete solution, taking into account every possible situation that we can’t know about, because we don’t know everything about OP’s data. :-)
Why are we even helping the notorious “Robin Cruise” anyway?
-
@neil-schipper said in Delete the entire content of all files with less than 100 words:
None of our solutions tolerates non-word, non-space characters such as punctuation. A robust solution should probably make use of constructs like \W or [[:punct:]].
I find this problem very intriguing. So I set my mind adrift in the regex documentation because, as @Neil-Schipper pointed out, this will likely involve character classes, which is where I had also thought it must go.
First, it depends on what constitutes a word: most likely one or more “non-space” characters in a row. I came upon a character class identified as [[:space:]], and its opposite [^[:space:]].
So using @Alan-Kilborn’s regex I altered it to be:
FW: (?s)\A([[:space:]]*([^[:space:]]+[[:space:]]+){98,}.+)|.+
RW: \1
I’m still not convinced I’m entirely there but I’ve put it up for public consumption. Maybe someone else wants to take it a bit further, refine it?
So the premise is: first find more than x words, followed by the remainder of the file. As this is captured, return it. If this is not possible, the alternation selects the whole file instead, and as that is not captured, nothing is returned. Hence we delete the file content if it does not have at least the x words we seek.
Terry
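The capture-or-discard premise can be sketched in Python as below. This is an illustration under assumptions, not the Notepad++ expression itself: \s and \S stand in for [[:space:]] and [^[:space:]], the function name and the units parameter are made up for the example, and a small threshold is used so it is easy to test.

```python
import re

def keep_if_enough_words(text, units=98):
    # First alternative: `units` or more "word + gap" pairs plus a remainder, captured.
    # Second alternative: anything else, not captured.
    pattern = re.compile(r"(?s)\A(\s*(?:\S+\s+){%d,}.+)|.+" % units)
    # Return the captured text when group 1 took part, otherwise nothing.
    return pattern.sub(lambda m: m.group(1) or "", text)

print(repr(keep_if_enough_words("a b c d", units=3)))  # kept: 'a b c d'
print(repr(keep_if_enough_words("a b c", units=3)))    # cleared: ''
```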
Actually now I’ve posted I can see straight away I don’t need {98,}, it can just be {98} as the following .+ takes care of the rest.
-
Hello, @rodica-f, @neil-schipper, @alan-kilborn, @terry-r and All,
@terry-r :
I found a variant, based on your use of the [[:space:]] POSIX character class!
SEARCH (?s)\A[[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,98}[^[:space:]]+[[:space:]]*\z|\A[[:space:]]+\z
REPLACE Leave EMPTY
This regex S/R will delete any content of files containing less than 100 words OR even 0 non-space chars followed by some [[:space:]] chars.
Best Regards,
guy038
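For anyone who wants to check the counting outside Notepad++, here is a rough Python translation. It rests on assumptions: Python's re module has no [[:space:]] class, so \s and \S are used instead, and Python's \Z (strict end of text) plays the role of \z. The function names are invented for the sketch.

```python
import re

def fewer_than(n):
    """Pattern for a whole text holding 1 to n-1 words, or only whitespace (n >= 2)."""
    return re.compile(r"(?s)\A\s*(?:\S+\s+){0,%d}\S+\s*\Z|\A\s+\Z" % (n - 2))

def clear_if_short(text, n=100):
    # Same effect as a replace-with-nothing: short texts become empty.
    return fewer_than(n).sub("", text)

short = " ".join(["word"] * 99)
long_enough = " ".join(["word"] * 100)
print(clear_if_short(short) == "")
print(clear_if_short(long_enough) == long_enough)
```

The quantifier upper bound is n - 2 because one more word is always matched by the trailing \S+ outside the group.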
-
@guy038 @Terry-R @Alan-Kilborn @Neil-Schipper
Thank you all. It is always a challenge to discover regex solutions.
By the way, I didn’t know the method with [[:punct:]]. Where can I find out about this regex method on the internet? I don’t know how to search for it…
-
@guy038 said in Delete the entire content of all files with less than 100 words:
(?s)\A[[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,98}[^[:space:]]+[[:space:]]*\z|\A[[:space:]]+\z
One more question for @guy038: I want to use one of your GENERIC S/R for this case. I need to delete the content of a file that has less than 10 words in the section between <START> and <FINAL>:
<START> The first, thing to note when <FINAL>
So, I tested all the GENERIC regex formulas you made a long time ago.
BSR = <START>
ESR = <FINAL>
FR = (?s)\A[[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,10}[^[:space:]]+[[:space:]]*\z|\A[[:space:]]+\z
REGEX:
(?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\x20\K(FR)
(?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\x20\KFR(?=\x20)
(?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\x20\KFR
(?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\x20\KFR(?=\x20)
(?-i:BSR|\G(?!^))(?s:(?!ESR).)*?\K(?-i:FR)
(?-i:BSR|(?!\A)\G)(?s:(?!ESR).)*?\K(?-i:FR)
(?-i:BSR|(?!^)\G)(?s:(?!ESR).)*?\K(?-i:FR)
(?-i:BSR|(?!\A)\G)(?s:(?!ESR).)*?\K(?-i:FR)
It is not working, in any of the cases. I get the same message on F/R: “Cannot find the text…”
-
Hi, @rodica-f and All,
EDIT : The regexes, below, are incomplete. See the correct solution in my next post
You do not need to use these generic regexes at all !
Simply replace \A by <START> and \z by <FINAL> and, of course, change the value of the quantifier of the non-capturing group from 98 to 8, giving the functional regex S/R below:
SEARCH (?s)<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,8}[^[:space:]]+[[:space:]]*<FINAL>|<START>[[:space:]]+<FINAL>
REPLACE Leave EMPTY
So, the general formula for deleting all file contents, if there are less than N words between the two boundaries <START> and <FINAL>, is:
SEARCH (?s)<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,N-2}[^[:space:]]+[[:space:]]*<FINAL>|<START>[[:space:]]+<FINAL>
REPLACE Leave EMPTY
BR
guy038
-
@guy038 correct me if I’m wrong. The GENERIC formula in this case will be:
(?s)BSR(FR)*ESR|BSR+ESR
I think I’m wrong somewhere.
-
@guy038 By the way, I tested the generic formula you made for me:
(?s)<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,8}[^[:space:]]+[[:space:]]*<FINAL>|<START>[[:space:]]+<FINAL>
In the context below, it deletes only what is framed by <START> and <FINAL>.
It does not delete the entire file, I mean the other words around it.
blah blah blah <START> The first, thing to note when <FINAL> blah blah
-
Hello, @rodica-f and All,
Oh… Yes ! I was wrong about it ! The correct regex S/R is, of course :
SEARCH
(?s)\A.*<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,8}[^[:space:]]+[[:space:]]*<FINAL>.*\z|\A.*<START>[[:space:]]+<FINAL>.*\z
REPLACE
Leave EMPTY
And the general formula for deleting all file contents, if there are less than N words between the two boundaries <START> and <FINAL>, becomes:
SEARCH (?s)\A.*<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,N-2}[^[:space:]]+[[:space:]]*<FINAL>.*\z|\A.*<START>[[:space:]]+<FINAL>.*\z
REPLACE Leave EMPTY
This regex will delete all file contents in all these cases:
- If there are no non-space chars (0 words), and only some space chars => the regex is \A.*<START>[[:space:]]+<FINAL>.*\z (the part after the | symbol)
- If there are several non-space chars (one word), possibly surrounded with space chars => quantifier = 0 and the regex becomes (?s)\A.*<START>[[:space:]]*[^[:space:]]+[[:space:]]*<FINAL>.*\z
- If there are several non-space chars followed by space chars, twice (so two words) => quantifier = 1 and the regex becomes (?s)\A.*<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+)[^[:space:]]+[[:space:]]*<FINAL>.*\z
- If there are several non-space chars followed by space chars, three times (so three words) => quantifier = 2 and the regex becomes (?s)\A.*<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+){2}[^[:space:]]+[[:space:]]*<FINAL>.*\z
and so on… till:
- If there are several non-space chars followed by space chars, nine times (so nine words) => quantifier = 8 and the regex becomes (?s)\A.*<START>[[:space:]]*(?:[^[:space:]]+[[:space:]]+){8}[^[:space:]]+[[:space:]]*<FINAL>.*\z
Now, to answer your question, I would say:
SEARCH (?s)\A.*BSR(FR)ESR.*\z
where FR = [[:space:]]*(?:[^[:space:]]+[[:space:]]+){0,N-2}[^[:space:]]+[[:space:]]*
OR FR = [[:space:]]+ (case no word)
Best Regards,
guy038
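The BSR / FR / ESR composition above can be checked on the sample line from this thread with a short Python sketch. It is only an approximation under stated assumptions: \s and \S replace the POSIX classes, Python's \Z stands in for \z, the two FR variants are combined with an alternation inside one group, and the function names are hypothetical.

```python
import re

def fewer_than_between(n, start="<START>", final="<FINAL>"):
    bsr, esr = re.escape(start), re.escape(final)
    fr_words = r"\s*(?:\S+\s+){0,%d}\S+\s*" % (n - 2)  # 1 to n-1 words
    fr_empty = r"\s+"                                   # no word at all
    return re.compile(
        r"(?s)\A.*%s(?:%s|%s)%s.*\Z" % (bsr, fr_words, fr_empty, esr)
    )

def clear_file_if_section_short(text, n=10):
    # Replace-with-nothing on the whole text when the section is too short.
    return fewer_than_between(n).sub("", text)

sample = "blah blah blah <START> The first, thing to note when <FINAL> blah blah"
longer = "blah <START> " + " ".join(["word"] * 12) + " <FINAL> blah"
print(clear_file_if_section_short(sample) == "")
print(clear_file_if_section_short(longer) == longer)
```

The sample section holds 6 words, so the whole text is cleared; the second text holds 12 words between the boundaries and is left alone.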
-