Splitting a text file into multiple text files at every blank line



  • I have a large text file (190,000 lines) that contains data points for each farm field (it's agricultural data). Each field's data is separated from the next by a blank line. Ideally I would split this file into many files, so that each one contains the data from a single field.

    So, in short, I want a new file to be started every time there is a blank line.

    Cheers.



  • Hm, not sure what you're looking for within NPP. If I were you, I'd grab my favorite language and loop through your data with a separate script.

    Won't write anything for ya, but here is some pseudocode if you want to give it a try…

    original = get your file
    fileLineReader = new reader(original)
    curIndex = 2
    filename = "Farm1.txt"
    file = create filename

    while (get next 'currentLine' of fileLineReader) {
        if (currentLine isn't blank line) {
            // append to current file
            file.append(currentLine)
        } else {
            // generate new file
            filename = "Farm" + curIndex + ".txt"
            file = create filename
            curIndex += 1
        }
    }

    I'd use PowerShell or batch; it should only take a little time to google the correct syntax.
    Once it hits a blank line, the file it's writing to changes: Farm1, then Farm2, 3, 4, 5, 6…
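
    For anyone who would rather start from something runnable, here is a minimal Python sketch of the same loop. The function name `split_on_blank_lines`, the `out_dir` and `prefix` parameters, and the UTF-8 encoding are assumptions made for illustration. It also differs from the pseudocode in two ways: whitespace-only lines count as blank, and a run of several blank lines is treated as one separator, so no empty files are created.

```python
import os

def split_on_blank_lines(source_path, out_dir, prefix="Farm"):
    """Write each blank-line-separated block of source_path to its own file.

    Returns the list of paths written, in order (Farm1.txt, Farm2.txt, ...).
    """
    written = []
    out_file = None
    try:
        with open(source_path, "r", encoding="utf-8") as src:
            for line in src:
                if line.strip() == "":
                    # Blank line: finish the current block, if one is open.
                    if out_file is not None:
                        out_file.close()
                        out_file = None
                    continue
                if out_file is None:
                    # First data line of a new block: open the next numbered file.
                    name = "{}{}.txt".format(prefix, len(written) + 1)
                    path = os.path.join(out_dir, name)
                    out_file = open(path, "w", encoding="utf-8")
                    written.append(path)
                out_file.write(line)
    finally:
        # Close the last block's file even if reading failed part-way.
        if out_file is not None:
            out_file.close()
    return written
```

    Calling `split_on_blank_lines("fields.txt", "out")` (both names hypothetical, and the `out` folder must already exist) would write out\Farm1.txt, out\Farm2.txt and so on, one per field.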



  • @Davey-Clarke

    I don’t know if you are using 32-bit Notepad++ or the PythonScript plugin, but if you’re willing to use both, the following script will do the job. Run it while the file to be split (e.g., …\myfile.txt) is active, and it will produce two or more related files containing the post-split data (e.g., …\myfile_1.txt, …\myfile_2.txt, etc.).

    I named this script SplitCurrentFileByBlankLine.py:

    import os
    import math
    
    def SCFBBL__main():
    
        pathname = notepad.getCurrentFilename()
        if pathname.lower().startswith('new '): return  # must have a real file on disk
    
        #line_delim_regex = r'^\h*\R'  # match truly empty lines OR lines containing only whitespace
        line_delim_regex = r'^\R'  # match truly empty lines ONLY
    
        match_span_tuple_list = []
    
        def match_found(m):
            if m.start(1) != -1:
                # delimiter starts the file, followed by non-delimiter data, plus another delimiter  (will get at most one match of this type)
                match_span_tuple_list.append(m.span(1))
            elif m.start(4) != -1:
                # mid-file data plus delimiter (most matches will be of this type)
                match_span_tuple_list.append(m.span(4))
            elif m.start(7) != -1:
                # end of file where no delimiter follows data (will get at most one match of this type)
                match_span_tuple_list.append(m.span(7))
    
        editor.research((r'(?s)(?:(?:{D})+(?<g1>.+?)(?<g2>(?<g3>{D})+))'
            r'|'
            r'(?:(?<g4>.+?)(?<g5>(?<g6>{D})+))'
            r'|'
            r'(?<g7>.+?\z)').format(D=line_delim_regex), match_found)
    
        num_files_to_create = len(match_span_tuple_list)
        if num_files_to_create < 2: return  # no need to split anything
        if num_files_to_create > 10:  # warn user if large # of files is going to be created
            answer = notepad.messageBox('There will be {} files created.\r\n\r\nCONTINUE ?'.format(num_files_to_create), '', MESSAGEBOXFLAGS.YESNO | MESSAGEBOXFLAGS.DEFBUTTON2)
            if answer != MESSAGEBOXFLAGS.RESULTYES: return
    
        (path_part, file_part) = pathname.rsplit(os.sep, 1)
        file_without_dot_ext = file_part; ext_wo_dot = ''
        try:
            (file_without_dot_ext, ext_wo_dot) = file_part.rsplit('.', 1)
        except ValueError: pass
        num_digits = len(str(num_files_to_create))  # digits needed for the largest file number (avoids float rounding in math.log)
        out_file_path_str_format = '{base}_{{:0{d}}}'.format(base=file_without_dot_ext, d=num_digits)
        if len(ext_wo_dot) > 0: out_file_path_str_format += '.' + ext_wo_dot
        out_file_path_str_format = path_part + os.sep + out_file_path_str_format
    
        for (index, (match_start_pos, match_end_pos)) in enumerate(match_span_tuple_list, 1):  # number output files from 1
            with open(out_file_path_str_format.format(index), 'wb') as f:
                f.write(editor.getTextRange(match_start_pos, match_end_pos))
    
    SCFBBL__main()
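
    The output-name logic in the script is the part most likely to be reused elsewhere, so here is a small standalone sketch of it that runs outside Notepad++. The helper name `numbered_output_paths` is hypothetical, and it uses `os.path.splitext` rather than the script's `rsplit('.', 1)`, so treat it as an illustration of the two-stage format template rather than a drop-in replacement. It assumes the file name itself contains no curly braces.

```python
import os

def numbered_output_paths(pathname, count):
    """Return `count` zero-padded sibling paths for pathname, numbered from 1."""
    folder, file_part = os.path.split(pathname)
    base, ext = os.path.splitext(file_part)  # ext keeps its leading dot, or is ''
    width = len(str(count))  # digits needed for the largest index
    # The doubled braces survive the first format() call as a literal
    # {:0N} placeholder, which the second format() call fills with the index.
    template = "{base}_{{:0{w}}}{ext}".format(base=base, w=width, ext=ext)
    return [os.path.join(folder, template.format(i)) for i in range(1, count + 1)]
```

    For a file named myfile.txt and 12 pieces, the template becomes `myfile_{:02}.txt`, so the names run from myfile_01.txt to myfile_12.txt and sort correctly in a directory listing.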
    
