Splitting a text file into multiple text files at every blank line
-
I have a large text file (190,000 lines) of agricultural data that contains data points for each farm field; the fields are separated by blank lines. Ideally I would split this file into many files, each containing the data from a single field.
So, in short, I want a new file to be started every time there is a blank line.
Cheers.
-
Hm, not sure what you’re looking for within NPP. If I were you, I’d grab my favorite language and loop through your data with a separate script.
Won’t write anything for ya, but here is some pseudocode if you want to give it a try…
original = get your file
fileLineReader = new reader(original)
curIndex = 2
filename = "Farm1.txt"
file = create filename
while (get next 'currentLine' of fileLineReader) {
    if (currentLine is not a blank line) {
        // append to current file
        file.append(currentLine)
    } else {
        // generate new file
        filename = "Farm" + curIndex + ".txt"
        file = create filename
        curIndex += 1
    }
}
I’d use PowerShell or batch; it should only take a little time to google the correct syntax.
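As a rough illustration of the loop above, a minimal Python sketch could look like the following. The function name `split_on_blank_lines`, the `Farm` prefix default, and the output directory argument are assumptions made for the example. It also differs from the pseudocode in two ways: whitespace-only lines count as blank, and a run of several blank lines is treated as one separator so no empty files are created.

```python
import os
import tempfile


def split_on_blank_lines(src_path, out_dir, prefix="Farm"):
    """Write each blank-line-separated block of src_path to its own file.

    Returns the list of file paths written, in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    out = None  # currently open output file, if any
    with open(src_path, "r") as src:
        for line in src:
            if line.strip() == "":
                # Blank line: finish the current block; the next data
                # line will open a new file.
                if out is not None:
                    out.close()
                    out = None
                continue
            if out is None:
                name = "{}{}.txt".format(prefix, len(written) + 1)
                path = os.path.join(out_dir, name)
                out = open(path, "w")
                written.append(path)
            out.write(line)
    if out is not None:
        out.close()
    return written


if __name__ == "__main__":
    # Small self-contained demo using a temporary directory.
    demo_dir = tempfile.mkdtemp()
    demo_src = os.path.join(demo_dir, "fields.txt")
    with open(demo_src, "w") as f:
        f.write("field A row 1\nfield A row 2\n\nfield B row 1\n")
    for p in split_on_blank_lines(demo_src, os.path.join(demo_dir, "out")):
        print(p)
```

The file is read line by line, so a 190,000-line input should not need to fit in memory all at once. The default text encoding of the platform is assumed; pass an explicit `encoding` to `open` if the data needs one.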
Once it hits a blank line, the file it’s writing to changes: Farm1, then Farm2, 3, 4, 5, 6…
-
I don’t know if you are using 32-bit Notepad++ or the PythonScript plugin, but if you’re willing to use both, then the following script will do the job. When run while the file to be split (e.g., …\myfile.txt) is active, it will produce two or more related files containing the post-split data (e.g., …\myfile_1.txt, …\myfile_2.txt, etc.).
I named this script SplitCurrentFileByBlankLine.py:

import os

def SCFBBL__main():

    pathname = notepad.getCurrentFilename()
    if pathname.lower().startswith('new '): return  # must have a real file on disk

    #line_delim_regex = r'^\h*\R'  # match truly empty lines OR lines containing only whitespace
    line_delim_regex = r'^\R'  # match truly empty lines ONLY

    match_span_tuple_list = []

    def match_found(m):
        if m.start(1) != -1:
            # delimiter starts the file, followed by non-delimiter data, plus another delimiter
            # (will get at most one match of this type)
            match_span_tuple_list.append(m.span(1))
        elif m.start(4) != -1:
            # mid-file data plus delimiter (most matches will be of this type)
            match_span_tuple_list.append(m.span(4))
        elif m.start(7) != -1:
            # end of file where no delimiter follows data (will get at most one match of this type)
            match_span_tuple_list.append(m.span(7))

    # adjacent string literals are joined before .format() is applied to the whole pattern
    editor.research(r'(?s)(?:(?:{D})+(?<g1>.+?)(?<g2>(?<g3>{D})+))'
        r'|'
        r'(?:(?<g4>.+?)(?<g5>(?<g6>{D})+))'
        r'|'
        r'(?<g7>.+?\z)'.format(D=line_delim_regex), match_found)

    num_files_to_create = len(match_span_tuple_list)
    if num_files_to_create < 2: return  # no need to split anything

    if num_files_to_create > 10:  # warn user if large # of files is going to be created
        answer = notepad.messageBox('There will be {} files created.\r\n\r\nCONTINUE ?'.format(num_files_to_create),
            '', MESSAGEBOXFLAGS.YESNO | MESSAGEBOXFLAGS.DEFBUTTON2)
        if answer != MESSAGEBOXFLAGS.RESULTYES: return

    (path_part, file_part) = pathname.rsplit(os.sep, 1)
    file_without_dot_ext = file_part; ext_wo_dot = ''
    try:
        (file_without_dot_ext, ext_wo_dot) = file_part.rsplit('.', 1)
    except ValueError:
        pass  # filename has no extension

    # zero-pad the file number to the width of the largest number
    # (digit count taken from the string form to avoid float rounding in math.log)
    num_digits = len(str(num_files_to_create))
    out_file_path_str_format = '{base}_{{:0{d}}}'.format(base=file_without_dot_ext, d=num_digits)
    if len(ext_wo_dot) > 0: out_file_path_str_format += '.' + ext_wo_dot
    out_file_path_str_format = path_part + os.sep + out_file_path_str_format

    # number the output files from 1 (myfile_1.txt, myfile_2.txt, ...)
    for (index, (match_start_pos, match_end_pos)) in enumerate(match_span_tuple_list, 1):
        with open(out_file_path_str_format.format(index), 'wb') as f:
            f.write(editor.getTextRange(match_start_pos, match_end_pos))

SCFBBL__main()
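
The two-stage format string the script uses for its output names can be tried outside Notepad++. Below is a small standalone sketch; `numbered_names` is a hypothetical helper written for illustration (it is not part of PythonScript), and it assumes numbering from 1 and a base name without `{` or `}` characters.

```python
def numbered_names(base, ext, count):
    """Build zero-padded names such as base_01.ext ... base_NN.ext."""
    digits = len(str(count))  # width of the largest file number
    # First format pass fills in base and width, leaving "{:0N}" for the number.
    pattern = "{base}_{{:0{d}}}".format(base=base, d=digits)
    if ext:
        pattern += "." + ext
    # Second format pass fills in each file number.
    return [pattern.format(i) for i in range(1, count + 1)]


if __name__ == "__main__":
    print(numbered_names("myfile", "txt", 3))
    print(numbered_names("myfile", "txt", 12)[:2])
```

Padding every number to the same width keeps the output files in numeric order when a file manager sorts them by name.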