How to use RegEx to split 5000 characters but preserving sentences?

NZ Select

I have articles inside txt files.

I wish to split each text file into chunks of 5000 characters, but keep sentences whole by ending each chunk at the last period.

How to identify the 1st match, 2nd match?

guy038

Hello, @nz-select and All,

If you don’t mind an approximation (my method counts each EOL character, \r or \n, as one ordinary character, so a CRLF line ending counts as two), here are two regexes to find successive blocks of 5,000 chars or so, each ending with a period:

SEARCH (?s).{1,5000}(\.\s+|\z) finds the largest block of up to about 5,000 characters that ends with a period
SEARCH (?s).{1,5000}.*?(\.\s+|\z) finds the smallest block of 5,000 characters or more that ends with a period
Select the Regular expression search mode and tick the Wrap around option

To find out how many blocks of 5,000 chars or so the current file contains, simply hit the Count button in the Find dialog
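For readers who want to try the first regex outside Notepad++, here is a minimal sketch in plain Python. It is an adaptation, not the Notepad++ regex itself: Python's re module has no \z, so \Z is used instead, and the function name split_at_periods is made up for this illustration.

```python
import re

def split_at_periods(text, limit=5000):
    """Return successive chunks of about `limit` characters, each ending after a period.

    Assumption: this mirrors (?s).{1,5000}(\\.\\s+|\\z) from the post, with \\Z
    standing in for \\z because Python's re module spells that anchor differently.
    """
    pattern = re.compile(r'(?s).{1,%d}(?:\.\s+|\Z)' % limit)
    # finditer resumes each search where the previous match ended, just like
    # pressing Find Next repeatedly in the Find dialog.
    return [m.group(0) for m in pattern.finditer(text)]
```

Calling len(split_at_periods(text)) gives the same kind of number as the Count button. A chunk can run slightly over the limit, because the period and the whitespace after it are matched on top of the 1 to `limit` characters.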

Now, in order to find out the beginning of the Nth match, use these generic regexes :

SEARCH (?s)(.{1,5000}(\.\s+|\z)){N-1}\K
SEARCH (?s)(.{1,5000}.*?(\.\s+|\z)){N-1}\K

And, of course, replace N-1 with the appropriate integer!

Remarks :

If a file contains N blocks in total and you use {N} as the quantifier, the regex produces a zero-length match at the very end of the current file
Don’t use a quantifier greater than N. And for small files that hold a single block, the only valid quantifier, {1}, always moves to the very end of the file!
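The "beginning of the Nth match" idea can be sketched in plain Python as well. Python's re module has neither \K nor \z, so this sketch anchors the repeated block at the start of the text and uses the end of the match as the position \K would report; nth_chunk_start is a hypothetical name used only here.

```python
import re

def nth_chunk_start(text, n, limit=5000):
    """Return the offset where the n-th block (1-based) starts, or None.

    Assumption: this imitates (?s)(.{1,5000}(\\.\\s+|\\z)){N-1}\\K by matching
    N-1 blocks from the start of the text and returning where that match ends.
    """
    if n < 1:
        raise ValueError("n must be 1 or greater")
    if n == 1:
        return 0  # zero preceding blocks: the first block starts at offset 0
    pattern = re.compile(r'(?s)(?:.{1,%d}(?:\.\s+|\Z)){%d}' % (limit, n - 1))
    m = pattern.match(text)
    return m.end() if m else None
```

As in the remarks above, asking for one block more than the file contains returns the end of the text. For larger values of n the regex engine may backtrack into shorter blocks, so the result is not meaningful.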

Best Regards,

guy038

P.S. :

I used the License.txt file to test these regexes ! ( 4 occurrences )

Brent Parker

@guy038 Hello, would there be a way to use this RegEx code in Notepad++ to automatically create new txt files based on the character limits?

I edited the RegEx code above a little so that it selects whole lines up to a maximum of 9000 characters, instead of only blocks ending with a period.

(?s).{1,9000}(\s+|\z)\n

I have a document that is 499,429 characters long, and when using the code above it gives me a count of 58. What would be the simplest way to automatically split/generate 58 separate text files based on the regex code within Notepad++?

Thank you!

PeterJones

@brent-parker ,

(?s).{1,9000}(\s+|\z)\n

Good job adapting it to your situation. As a gotcha, if your file doesn’t end in a newline character, it won’t grab the very last line in your file. The original regex split after a period followed by whitespace, which is why it used \.\s+; since you only want to split at line endings, I would suggest (?s).{1,9000}(\R|\z), where \R matches any line-ending sequence (such as \r\n or \n) and \z is the “match end of file” anchor that handles a last line without a trailing newline.
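The gotcha can be reproduced with a small plain-Python sketch. It rests on assumptions: Python's re module supports neither \R nor \z, so (?:\r\n|\n|\r) and \Z are substituted, and the names NEWLINE_REQUIRED, NEWLINE_OR_END and line_chunks are invented for the illustration.

```python
import re

# Stand-in for (?s).{1,N}(\s+|\z)\n : every chunk must end in a literal \n.
NEWLINE_REQUIRED = r'(?s).{1,%d}(?:\s+|\Z)\n'
# Stand-in for (?s).{1,N}(\R|\z) : a chunk may also end at the end of the text.
NEWLINE_OR_END = r'(?s).{1,%d}(?:\r\n|\n|\r|\Z)'

def line_chunks(text, limit, template=NEWLINE_OR_END):
    """Split text into chunks of whole lines, each about `limit` characters long."""
    return [m.group(0) for m in re.finditer(template % limit, text)]
```

With a CRLF text whose last line has no trailing newline, such as "aaa\r\nbbb\r\nccc" and a limit of 6, the NEWLINE_REQUIRED template never returns the final "ccc" line, while the default template returns it as its own chunk.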

What would be the simplest way to automatically split/generate 58 separate text files based on the regex code within Notepad++?

Notepad++ does not have a built-in way to automate splitting the file into chunks based on a regex search (or any other method).

If you are willing to install the PythonScript plugin, you could use a script like the following, which takes the <=9000-char chunks and writes each one to a new file:

# encoding=utf-8
"""in response to https://community.notepad-plus-plus.org/topic/19738/ , the 2022-May-10 add-on

set nChars to be the maximum "chunk" size, then run this script.
"""
from Npp import editor, notepad, console  # console is only needed if you uncomment the debug line below
import os.path

nChars = 9000

regex = r'(?s).{1,' + str(nChars) + r'}(\R|\z)'

counter = 0
originalFullname = notepad.getCurrentFilename()
originalPath, originalFile = os.path.split( originalFullname )
originalBase, originalExt = os.path.splitext( originalFile )

def withChunk(m):
    global counter
    counter += 1
    newPath = "{}\\{}_{:03d}{}".format(originalPath, originalBase, counter, originalExt)
    #console.write( "{}\tlength={}".format(newPath, len(m.group(0)))+"\n")

    notepad.new()
    editor.setText(m.group(0))
    notepad.saveAs(newPath)
    notepad.close()
    notepad.activateFile(originalFullname)

editor.research(regex, withChunk)

Install the PythonScript plugin (if not already installed) through the Plugins Admin
Plugins > Python Script > New Script, name = split-file-into-chunks-of-X-characters.py
Paste the script above into the new file, and save

To run:

Open the file you want to split
click Plugins > Python Script > scripts > split-file-into-chunks-of-X-characters.py

If you want to assign a keyboard shortcut, use Plugins > Python Script > Configure…, and Add the script to the left panel. Exit Notepad++ and restart. After that, you can use the Settings > Shortcut Mapper to assign the keystroke. From then on, that keystroke is equivalent to running the script from the Scripts menu.

(I tried the script with nChars = 900, and it split the source code for that script into two chunks: one that was 901 characters including the two bytes for the final CRLF (since the regex matches 1 to N characters followed by newline sequence), and one that was 186 characters including final CRLF.)
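For comparison, the same splitting can be sketched as a standalone Python function that runs without Notepad++. This is not the PythonScript version above: it reads and writes the files directly, substitutes (?:\r\n|\n|\r) and \Z for \R and \z (Python's re module lacks both), and split_file is a name chosen only for this example.

```python
import io
import os
import re

def split_file(path, n_chars=9000):
    """Write line-aligned chunks of `path` to base_001.ext, base_002.ext, ...

    Returns the list of files written, in order.
    """
    # newline='' keeps the original line endings (CRLF or LF) untouched.
    with io.open(path, encoding='utf-8', newline='') as source:
        text = source.read()

    folder, name = os.path.split(path)
    base, ext = os.path.splitext(name)
    pattern = re.compile(r'(?s).{1,%d}(?:\r\n|\n|\r|\Z)' % n_chars)

    written = []
    for counter, match in enumerate(pattern.finditer(text), start=1):
        new_name = u'{}_{:03d}{}'.format(base, counter, ext)
        new_path = os.path.join(folder, new_name)
        with io.open(new_path, 'w', encoding='utf-8', newline='') as target:
            target.write(match.group(0))
        written.append(new_path)
    return written
```

The file naming follows the script above (original base name, underscore, three-digit counter, original extension). The UTF-8 encoding is an assumption; adjust it to match your files.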

Brent Parker

@peterjones Thank you so much for the amazingly detailed and fast reply!! I just tried it and it worked beautifully! This is going to save me so much time!

Brent Parker

@peterjones hello, sorry to bother you again.

I had one last question about the provided script. If I wanted to edit the script so that it adds a predefined set of text before and after the copied chunk in each of the files as they are being written, how would I go about doing that?

Each file generated would look something like this :

<speak xmlns="http://www.w3.org/2001/10/synthesis"><voice name="en-GB"><prosody rate="-05%">

TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000
TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000

</prosody></voice></speak>

I’ve been attempting to read over the PythonScript Documentation this afternoon to try and figure out how to do it, but it’s a bit of a steep learning curve for someone who has never programmed before to be able to learn in one afternoon, lol. I’d assumed I could just toss in some “editor.addText” items (like below) to place it before and after it sets the copied text:

editor.addText(<speak xmlns="http://www.w3.org/2001/10/synthesis"><voice name="en-GB"><prosody rate="-05%">)
editor.setText(m.group(0))
editor.addText(</prosody></voice></speak>)

Unfortunately, not as simple as I’d hoped.

PeterJones

@brent-parker said in How to use RegEx to split 5000 characters but preserving sentences?:

Unfortunately, not as simple as I’d hoped

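As with most programming languages, you have to put quotes around literal strings. Fortunately, Python allows either single or double quotes, so wrapping the strings that contain double quotes in single quotes is easiest. Also note that editor.setText replaces the entire contents of the new document, so set the chunk first, then add the opening text at the start and append the closing text at the end:

editor.setText(m.group(0))
editor.documentStart()
editor.addText('<speak xmlns="http://www.w3.org/2001/10/synthesis"><voice name="en-GB"><prosody rate="-05%">')
editor.appendText('</prosody></voice></speak>')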
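Another way to get the layout shown in the question is to build the wrapped string in plain Python first and hand the result to the editor in a single call. The sketch below is only an illustration: the helper name wrap_chunk and the constants SSML_HEADER and SSML_FOOTER are made up, and the blank lines around the chunk follow the example file in the question.

```python
# Opening and closing text taken from the question; single quotes avoid
# having to escape the double quotes inside the strings.
SSML_HEADER = '<speak xmlns="http://www.w3.org/2001/10/synthesis"><voice name="en-GB"><prosody rate="-05%">'
SSML_FOOTER = '</prosody></voice></speak>'

def wrap_chunk(chunk, header=SSML_HEADER, footer=SSML_FOOTER):
    """Return the chunk with the header above it and the footer below it.

    The chunk's own trailing line ending is dropped so that exactly one
    blank line separates the text from the footer.
    """
    return header + '\n\n' + chunk.rstrip('\r\n') + '\n\n' + footer
```

Inside withChunk, a single editor.setText(wrap_chunk(m.group(0))) would then take the place of the separate calls that add the text piece by piece.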

For someone who has never programmed before, and just tried for an afternoon, good effort. I’m always happy to see when people in the forum put in the effort to learn, and give it a go, rather than just assuming that we’ll spoon feed them everything. When you or others show that effort, I feel much better about the effort I put into things like writing the script. So thank you, and I hope you continue to learn. For reference, the PythonScript documentation will only teach you about the interface between the plugin and Notepad++ itself; the default PythonScript plugin uses Python 2.7, and there are a bazillion websites out there which will give tutorials and go into the details of the Python programming language itself.

Brent Parker

@peterjones amazing, thank you so much!

I’m definitely interested in learning the scripting language and what these scripts can do. I’ve been primarily using Notepad++ with RegEx codes to format research articles/papers so that they’re friendly enough to listen to with text-to-speech software.

It’s surprisingly time-consuming to edit out the unnecessary bits, but being able to listen to the papers via TTS without it constantly pausing to read every footnote reference number and citation out loud is definitely helpful when I’m just trying to understand the topic of a paper while out on a walk. This will probably also be helpful when correcting and formatting OCR’d text from scanned PDFs of older works that have yet to be digitized.

Hopefully, I’ll be able to learn enough to be able to share some tips/tricks with others in the community one day.

Brent Parker

@brent-parker

For future reference, I’ll post the full updated script here. It creates the documents and adds extra text to the top and bottom of each generated document, in case anyone is interested in the future.

# encoding=utf-8
"""in response to https://community.notepad-plus-plus.org/topic/19738/ , the 2022-May-10 add-on

set nChars to be the maximum "chunk" size, replace ENTER YOUR TEXT HERE with your own text, then run this script.
"""
from Npp import editor, notepad, console  # console is only needed if you uncomment the debug line below
import os.path

nChars = 9000

regex = r'(?s).{1,' + str(nChars) + r'}(\R|\z)'

counter = 0
originalFullname = notepad.getCurrentFilename()
originalPath, originalFile = os.path.split( originalFullname )
originalBase, originalExt = os.path.splitext( originalFile )

def withChunk(m):
    global counter
    counter += 1
    newPath = "{}\\{}_{:03d}{}".format(originalPath, originalBase, counter, originalExt)
    #console.write( "{}\tlength={}".format(newPath, len(m.group(0)))+"\n")

    notepad.new()
    editor.setText(m.group(0))
    editor.documentStart()
    editor.addText('ENTER YOUR TEXT HERE\n\n')
    editor.appendText('ENTER YOUR TEXT HERE')
    notepad.saveAs(newPath)
    notepad.close()
    notepad.activateFile(originalFullname)

editor.research(regex, withChunk)