How to use RegEx to split at 5000 characters while preserving sentences?

  • I have articles inside txt files.

    I wish to split each text file into blocks of 5000 characters, while keeping sentences whole by ending each block at the last period.

    How can I identify the 1st match, the 2nd match, and so on?

  • Hello, @nz-select and All,

    If you don’t mind the approximation, as my method treats the EOL characters as ordinary ones ( => each \r and each \n counts for 1 char ), here are two regexes to find successive blocks of 5000 chars or so, ending with a period :

    • SEARCH (?s).{1,5000}(\.\s+|\z)    will search for the largest area, ending with a period, of about 5,000 characters at most ( the closing period and the whitespace after it are matched in addition to the 1 to 5,000 characters )

    • SEARCH (?s).{1,5000}.*?(\.\s+|\z)    will search for the smallest area, ending with a period, with a size greater than 5,000 characters ( the last block of the file may be shorter )

    • Select the Regular expression search mode and tick the Wrap around option

    To know how many blocks of 5,000 chars or so the current file contains, simply hit the Count button in the Find dialog
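    Outside Notepad++, the same splitting and counting can be sketched in Python. This is only a sketch under a few assumptions: Python's `re` module spells the end-of-text anchor `\Z` where the Notepad++ regex uses `\z`, the function name `split_blocks` is hypothetical, and the sample text is made up for the demonstration.

```python
import re

# Assumption: Python's `re` uses \Z for "very end of text" where the
# Notepad++ (Boost) regex uses \z. The groups are made non-capturing.
AT_MOST = re.compile(r"(?s).{1,5000}(?:\.\s+|\Z)")      # largest block of about 5000 chars
AT_LEAST = re.compile(r"(?s).{1,5000}.*?(?:\.\s+|\Z)")  # smallest block above 5000 chars


def split_blocks(text, pattern=AT_MOST):
    """Return the successive blocks matched by `pattern`, in order."""
    return [m.group(0) for m in pattern.finditer(text)]


if __name__ == "__main__":
    # Made-up sample: 600 sentences of 20 characters each (12,000 chars).
    sample = "This is a sentence. " * 600
    blocks = split_blocks(sample)
    # len(blocks) plays the role of the Count button.
    print(len(blocks), [len(b) for b in blocks])  # 3 [5000, 5000, 2000]
```

    If a stretch of more than 5,000 characters contains no period followed by whitespace, the first pattern cannot match there and the search resumes further on, so the blocks may not cover the whole text in that case.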

    Now, in order to find the beginning of the Nth match, use these generic regexes :

    • SEARCH (?s)(.{1,5000}(\.\s+|\z)){N-1}\K

    • SEARCH (?s)(.{1,5000}.*?(\.\s+|\z)){N-1}\K

    And, of course, replace N-1 with the appropriate integer ! For instance, to reach the beginning of the 3rd block, use the quantifier {2}
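    For comparison, here is a rough Python equivalent of locating the Nth block. Python's `re` module has no `\K`, so this sketch iterates over the matches instead of repeating the group N-1 times; the helper name `nth_block_start` is hypothetical, and `\Z` again stands in for `\z`.

```python
import re
from itertools import islice

# Same "largest block of about 5000 chars" pattern, in Python syntax.
BLOCK = re.compile(r"(?s).{1,5000}(?:\.\s+|\Z)")


def nth_block_start(text, n, pattern=BLOCK):
    """Return the offset where the n-th block (1-based) starts,
    or None when the text holds fewer than n blocks."""
    # Skip the first n-1 matches and take the next one, if any.
    match = next(islice(pattern.finditer(text), n - 1, None), None)
    return match.start() if match else None


if __name__ == "__main__":
    sample = "This is a sentence. " * 600  # made-up 12,000-char sample
    print(nth_block_start(sample, 2))  # 5000
```

    Unlike the `\K` regexes, which leave the caret at the very end of the file once all blocks are consumed, this sketch returns None when N exceeds the number of blocks.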

    Remarks :

    • If a file contains N blocks in total and you’re using {N} as quantifier, the regex gives a zero-length match at the very end of the current file

    • Don’t use a quantifier greater than N. And, for a small file which holds a single block, the only valid quantifier, {1}, will always move to the very end of the file !

    Best Regards,


    P.S. :

    I used the License.txt file to test these regexes ! ( 4 occurrences )
