How to use RegEx to split by 5000 characters while preserving sentences?
-
I have articles inside txt files.
I wish to split each text file into chunks of about 5000 characters, but keep sentences whole by cutting at the last period.
How to identify the 1st match, 2nd match?
-
Hello, @nz-select and All,
If you don’t mind the approximation, as my method considers the EOL characters as standard ones ( => \r and/or \n counts for 1 char ), here are two regexes to find successive blocks of 5000 chars or so, ending with a period :
SEARCH
(?s).{1,5000}(\.\s+|\z)
will search for the largest area, ending with a period, with a size smaller than 5,000 characters
SEARCH
(?s).{1,5000}.*?(\.\s+|\z)
will search for the smallest area, ending with a period, with a size greater than 5,000 characters
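For anyone who would rather script the split outside Notepad++, here is a minimal Python sketch of the same two patterns. It is only an illustration: the names split_blocks, LARGEST_UNDER and SMALLEST_OVER are made up for this example, and the Boost anchor \z is written as \Z, which is how Python's re module spells "very end of the string".

```python
import re

# Same two patterns as above, with the block size left as a parameter.
# \Z is Python's spelling of "very end of the string" (Boost's \z).
LARGEST_UNDER = r"(?s).{1,%d}(?:\.\s+|\Z)"
SMALLEST_OVER = r"(?s).{1,%d}.*?(?:\.\s+|\Z)"


def split_blocks(text, limit=5000, pattern=LARGEST_UNDER):
    """Return the successive blocks of `text` matched by `pattern`."""
    regex = re.compile(pattern % limit)
    return [m.group(0) for m in regex.finditer(text)]


if __name__ == "__main__":
    sample = "One two. Three four. Five six. Seven."
    # A tiny limit makes the behaviour visible on a short string.
    print(split_blocks(sample, limit=12))
    # ['One two. ', 'Three four. ', 'Five six. ', 'Seven.']
    print(split_blocks(sample, limit=12, pattern=SMALLEST_OVER))
    # ['One two. Three four. ', 'Five six. Seven.']
```

One thing to keep in mind with the first pattern: if a single sentence is longer than the limit, no match can start at its first character, so the next block begins somewhere inside that sentence and the skipped characters end up in no block.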
Select the Regular expression search mode and tick the Wrap around option
To know how many blocks of 5,000 chars or so the current file contains, simply hit the Count button in the Find dialog
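The Count step can be approximated in a script too. The sketch below assumes Python and a hypothetical helper named count_blocks; it applies the first regex, with \z written as \Z for Python's re module.

```python
import re
from pathlib import Path

# First regex from the post, with \z written as \Z for Python's re module.
BLOCK = re.compile(r"(?s).{1,5000}(?:\.\s+|\Z)")


def count_blocks(path, encoding="utf-8"):
    """Count the blocks that the first regex finds in the file at `path`."""
    # Text-mode reading translates \r\n to \n, so every line ending counts
    # as one character, in line with the approximation stated earlier.
    text = Path(path).read_text(encoding=encoding)
    return sum(1 for _ in BLOCK.finditer(text))
```

Calling count_blocks("License.txt") would then play the role of the Count button, although the figure may differ slightly from what Notepad++ reports, since the two regex engines are not identical.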
Now, in order to find out the beginning of the Nth match, use these generic regexes :
SEARCH
(?s)(.{1,5000}(\.\s+|\z)){N-1}\K
-
SEARCH
(?s)(.{1,5000}.*?(\.\s+|\z)){N-1}\K
And, of course, replace the N-1 value with the appropriate integer !
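As a rough illustration of what the {N-1}\K form computes, here is a Python sketch. Python's re module has no \K, so this version swaps in a different technique: it anchors the repeated group at the start of the text and takes the end of that match as the caret position. The helper name nth_block_start is invented for the example, and \z is again written as \Z.

```python
import re


def nth_block_start(text, n, limit=5000):
    """Offset where the n-th block (1-based) begins, or None if no match."""
    if n < 1:
        raise ValueError("n must be >= 1")
    # (?: ... ){n-1} consumes the first n-1 blocks. The end of that match
    # is the position where \K would leave the caret in Notepad++.
    regex = re.compile(r"(?s)(?:.{1,%d}(?:\.\s+|\Z)){%d}" % (limit, n - 1))
    m = regex.match(text)
    return None if m is None else m.end()


if __name__ == "__main__":
    sample = "One two. Three four. Five six. Seven."
    for n in range(1, 7):
        print(n, nth_block_start(sample, n, limit=12))
    # 1 0
    # 2 9
    # 3 21
    # 4 31
    # 5 37   (zero-length position at the very end of the text)
    # 6 None
```

With this sample there are 4 blocks, so n = 5 lands on the zero-length position at the very end and n = 6 finds nothing. On other texts, a quantifier larger than the real number of blocks may still match after backtracking into shorter blocks, so the result is only meaningful while n stays within the block count.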
Remarks :
-
If a file contains N blocks in total and you’re using {N} as the quantifier, it finds a zero-length match at the very end of the current file
-
Don’t use a quantifier greater than the number N. And, for small files, the only valid quantifier, {1}, will always move to the very end of the file !
Best Regards,
guy038
P.S. :
I used the License.txt file to test these regexes ! ( 4 occurrences )