Splitting a text file into multiple text files at every blank line

Davey Clarke

I have a large text file (190,000 lines) of agricultural data that contains data points for each farm field, with the fields separated by blank lines. Ideally I would split this file into many smaller files, each containing the data from a single field.

So, in short, I want a new file to be started every time there is a blank line.

Cheers.

Jacob Currie

Hm, not sure what you're looking for within NPP. If I were you, I'd grab my favorite language and loop through your data with a separate script.

Won't write anything for ya, but here is some pseudocode if you want to give it a try…

original = get your file
fileLineReader = new reader(original)
curIndex = 2
filename = "Farm1.txt"
file = create filename

while (get next 'currentLine' of fileLineReader) {
    if (currentLine isn't a blank line) {
        // append to the current file
        file.append(currentLine)
    } else {
        // start a new file
        filename = "Farm" + curIndex + ".txt"
        file = create filename
        curIndex += 1
    }
}

I'd use PowerShell or batch; it should only take a little bit of time to google the correct syntax.
Once it hits a blank line, the file it's writing to changes: Farm1, then Farm2, 3, 4, 5, 6…
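For anyone who would rather use Python than PowerShell or batch, the pseudocode above can be sketched roughly as follows. This is only a minimal sketch: the function names `split_on_blank_lines` and `write_blocks` and the `Farm` prefix are made up for illustration, and it assumes that lines containing only whitespace count as blank and that consecutive blank lines should not produce empty files.

```python
import os


def split_on_blank_lines(text):
    """Return a list of text blocks, split wherever blank lines occur."""
    blocks = []
    current = []
    for line in text.splitlines():
        if line.strip() == "":
            # Blank line: close the current block, if it has any content.
            if current:
                blocks.append("\n".join(current) + "\n")
                current = []
        else:
            current.append(line)
    # Keep the last block when the text does not end with a blank line.
    if current:
        blocks.append("\n".join(current) + "\n")
    return blocks


def write_blocks(blocks, out_dir, prefix="Farm"):
    """Write each block to <out_dir>/<prefix>N.txt, numbering from 1."""
    paths = []
    for index, block in enumerate(blocks, 1):
        path = os.path.join(out_dir, "{}{}.txt".format(prefix, index))
        with open(path, "w") as f:
            f.write(block)
        paths.append(path)
    return paths
```

Reading the whole file into memory first and passing its contents to `split_on_blank_lines` should be fine for a file of 190,000 lines; `write_blocks` then creates Farm1.txt, Farm2.txt, and so on in the chosen directory.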

Scott Sumner

@Davey-Clarke

I don’t know if you are using 32-bit Notepad++ or the PythonScript plugin, but if you’re willing to use both then the following script will do the job. When run while the file to be split (e.g., …\myfile.txt) is active, it will produce two or more related files containing the post-split data (e.g., …\myfile_1.txt, …\myfile_2.txt, etc.).

I named this script SplitCurrentFileByBlankLine.py:

import os
import math

def SCFBBL__main():

    pathname = notepad.getCurrentFilename()
    if pathname.lower().startswith('new '): return  # must have a real file on disk

    #line_delim_regex = r'^\h*\R'  # match truly empty lines OR lines containing only whitespace
    line_delim_regex = r'^\R'  # match truly empty lines ONLY

    match_span_tuple_list = []

    def match_found(m):
        if m.start(1) != -1:
            # delimiter starts the file, followed by non-delimiter data, plus another delimiter  (will get at most one match of this type)
            match_span_tuple_list.append(m.span(1))
        elif m.start(4) != -1:
            # mid-file data plus delimiter (most matches will be of this type)
            match_span_tuple_list.append(m.span(4))
        elif m.start(7) != -1:
            # end of file where no delimiter follows data (will get at most one match of this type)
            match_span_tuple_list.append(m.span(7))

    editor.research(r'(?s)(?:(?:{D})+(?<g1>.+?)(?<g2>(?<g3>{D})+))' \
        '|' \
        '(?:(?<g4>.+?)(?<g5>(?<g6>{D})+))' \
        '|' \
        '(?<g7>.+?\z)'.format(D=line_delim_regex), match_found)

    num_files_to_create = len(match_span_tuple_list)
    if num_files_to_create < 2: return  # no need to split anything
    if num_files_to_create > 10:  # warn user if large # of files is going to be created
        answer = notepad.messageBox('There will be {} files created.\r\n\r\nCONTINUE ?'.format(num_files_to_create), '', MESSAGEBOXFLAGS.YESNO | MESSAGEBOXFLAGS.DEFBUTTON2)
        if answer != MESSAGEBOXFLAGS.RESULTYES: return

    (path_part, file_part) = pathname.rsplit(os.sep, 1)
    file_without_dot_ext = file_part; ext_wo_dot = ''
    try:
        (file_without_dot_ext, ext_wo_dot) = file_part.rsplit('.', 1)
    except ValueError: pass
    num_digits = len(str(num_files_to_create))  # digits needed for zero-padding; avoids float rounding in math.log (e.g., for 1000)
    out_file_path_str_format = '{base}_{{:0{d}}}'.format(base=file_without_dot_ext, d=num_digits)
    if len(ext_wo_dot) > 0: out_file_path_str_format += '.' + ext_wo_dot
    out_file_path_str_format = path_part + os.sep + out_file_path_str_format

    for (index, (match_start_pos, match_end_pos)) in enumerate(match_span_tuple_list, 1):  # number output files from 1 (myfile_1, myfile_2, ...)
        with open(out_file_path_str_format.format(index), 'wb') as f:
            f.write(editor.getTextRange(match_start_pos, match_end_pos))

SCFBBL__main()
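The output naming used by the script (a zero-padded number inserted before the extension) can be tried outside Notepad++ with a small standalone sketch. The function name `build_output_names` is hypothetical and not part of the script or of PythonScript; it assumes files are numbered from 1 and takes the padding width from the number of decimal digits in the file count.

```python
import os


def build_output_names(pathname, count):
    """Return `count` output names such as myfile_01.txt ... myfile_12.txt."""
    # Split "dir/myfile.txt" into ("dir/myfile", ".txt"); ext is "" if absent.
    root, ext = os.path.splitext(pathname)
    # Pad every number to the width of the largest one.
    width = len(str(count))
    return [
        "{root}_{n:0{width}d}{ext}".format(root=root, n=n, width=width, ext=ext)
        for n in range(1, count + 1)
    ]
```

For example, `build_output_names("myfile.txt", 3)` gives `myfile_1.txt`, `myfile_2.txt` and `myfile_3.txt`, while a count of 10 or more pads to two digits so the files sort in order.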