    Splitting a text file into multiple text files at every blank line

    3 Posts 3 Posters 10.1k Views
    • Davey Clarke

      I have a large text file (190,000 lines) that contains data points for each farm field (it's agricultural data). Each field's data is separated from the next by a blank line. Ideally, I would split this file into many files, each containing the data for a single field.

      So, in short, I want a new file to start every time there is a blank line.

      Cheers.

      • Jacob Currie

        Hm, not sure what you're looking for within NPP. If I were you, I'd grab my favorite language and loop through your data with a separate script.

        Won't write anything for ya, but here is some pseudocode if you want to give it a try…

        original = get your file
        fileLineReader = new reader(original)
        curIndex = 2
        filename = "Farm1.txt"
        file = create filename

        while (get next 'currentLine' of fileLineReader) {
            if (currentLine isn't a blank line) {
                // append to the current file
                file.append(currentLine)
            } else {
                // start a new file
                filename = "Farm" + curIndex + ".txt"
                file = create filename
                curIndex += 1
            }
        }

        I'd use PowerShell or batch; it should only take a little time to google the correct syntax.
        Once it hits a blank line, the file it's writing to changes: Farm1, then Farm2, 3, 4, 5, 6…
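
The loop above could be sketched in Python roughly as follows. This is only a minimal sketch, not something from the original post: the function name `split_on_blank_lines`, its parameters, and the `Farm` prefix default are illustrative assumptions. Unlike the pseudocode, it treats whitespace-only lines as blank and skips runs of consecutive blank lines instead of creating empty files.

```python
import os


def split_on_blank_lines(src_path, out_dir, prefix="Farm"):
    """Write each blank-line-separated block of src_path to its own file.

    Hypothetical helper for illustration. Returns the created paths in order.
    """
    created = []
    out = None
    try:
        with open(src_path, "r") as src:
            for line in src:
                if line.strip() == "":
                    # Blank line: finish the current block, if one is open.
                    if out is not None:
                        out.close()
                        out = None
                    continue
                if out is None:
                    # First data line of a block: open the next numbered file.
                    name = "{}{}.txt".format(prefix, len(created) + 1)
                    path = os.path.join(out_dir, name)
                    out = open(path, "w")
                    created.append(path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return created
```

Calling `split_on_blank_lines("fields.txt", "out")` (both names are placeholders, and the output directory must already exist) would write `out/Farm1.txt`, `out/Farm2.txt`, and so on, one per block.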

        • Scott Sumner @Davey Clarke

          @Davey-Clarke

          I don’t know if you are using 32-bit Notepad++ or the Pythonscript plugin, but if you’re willing to do both then the following script will do the job. Run it while the file to be split (e.g., …\myfile.txt) is the active document, and it will produce two or more related files containing the post-split data (e.g., …\myfile_1.txt, …\myfile_2.txt, etc).

          I named this script SplitCurrentFileByBlankLine.py:

          import os
          import math
          
          def SCFBBL__main():
          
              pathname = notepad.getCurrentFilename()
              if pathname.lower().startswith('new '): return  # must have a real file on disk
          
              #line_delim_regex = r'^\h*\R'  # match truly empty lines OR lines containing only whitespace
              line_delim_regex = r'^\R'  # match truly empty lines ONLY
          
              match_span_tuple_list = []
          
              def match_found(m):
                  if m.start(1) != -1:
                      # delimiter starts the file, followed by non-delimiter data, plus another delimiter  (will get at most one match of this type)
                      match_span_tuple_list.append(m.span(1))
                  elif m.start(4) != -1:
                      # mid-file data plus delimiter (most matches will be of this type)
                      match_span_tuple_list.append(m.span(4))
                  elif m.start(7) != -1:
                      # end of file where no delimiter follows data (will get at most one match of this type)
                      match_span_tuple_list.append(m.span(7))
          
              editor.research(r'(?s)(?:(?:{D})+(?<g1>.+?)(?<g2>(?<g3>{D})+))' \
                  '|' \
                  '(?:(?<g4>.+?)(?<g5>(?<g6>{D})+))' \
                  '|' \
                  '(?<g7>.+?\z)'.format(D=line_delim_regex), match_found)
          
              num_files_to_create = len(match_span_tuple_list)
              if num_files_to_create < 2: return  # no need to split anything
              if num_files_to_create > 10:  # warn user if large # of files is going to be created
                  answer = notepad.messageBox('There will be {} files created.\r\n\r\nCONTINUE ?'.format(num_files_to_create), '', MESSAGEBOXFLAGS.YESNO | MESSAGEBOXFLAGS.DEFBUTTON2)
                  if answer != MESSAGEBOXFLAGS.RESULTYES: return
          
              (path_part, file_part) = pathname.rsplit(os.sep, 1)
              file_without_dot_ext = file_part; ext_wo_dot = ''
              try:
                  (file_without_dot_ext, ext_wo_dot) = file_part.rsplit('.', 1)
              except ValueError: pass
              num_digits = len(str(num_files_to_create))  # zero-padding width; avoids float rounding in math.log (e.g. log(1000, 10) < 3)
              out_file_path_str_format = '{base}_{{:0{d}}}'.format(base=file_without_dot_ext, d=num_digits)
              if len(ext_wo_dot) > 0: out_file_path_str_format += '.' + ext_wo_dot
              out_file_path_str_format = path_part + os.sep + out_file_path_str_format
          
              for (index, (match_start_pos, match_end_pos)) in enumerate(match_span_tuple_list):
                  with open(out_file_path_str_format.format(index), 'wb') as f:
                      f.write(editor.getTextRange(match_start_pos, match_end_pos))
          
          SCFBBL__main()
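
The nested braces in the script's filename format can be hard to read, so here is a small standalone illustration of how that two-stage `str.format` works. It is a sketch only: `numbered_name_format` and the sample values are hypothetical, and it runs in plain Python without Notepad++.

```python
def numbered_name_format(base, ext, count):
    """Return a format string such as 'myfile_{:02}.txt' for `count` files."""
    digits = len(str(count))  # width needed for zero-padding
    # '{{' and '}}' are literal braces, so this first .format() call
    # leaves a '{:0N}' placeholder behind for the file index.
    fmt = "{base}_{{:0{d}}}".format(base=base, d=digits)
    if ext:
        fmt += "." + ext
    return fmt


fmt = numbered_name_format("myfile", "txt", 12)
print(fmt)            # myfile_{:02}.txt
print(fmt.format(3))  # myfile_03.txt
```

Note that the script as posted passes the `enumerate` index, which starts at 0, into this format, so its first output file gets the suffix 0.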
          
          The Community of users of the Notepad++ text editor.