How to use RegEx to split 5000 characters but preserving sentence?

  • N
    NZ Select
    last edited by Jul 20, 2020, 8:28 AM

    I have articles inside txt files.

    I want to split each text file into chunks of about 5,000 characters, but keep sentences whole by ending each chunk at the last period that fits.

    How do I identify the 1st match, the 2nd match, and so on?

    • G
      guy038
      last edited by guy038 Jul 20, 2020, 11:19 AM Jul 20, 2020, 10:41 AM

      Hello, @nz-select and All,

      If you don’t mind an approximation (my method treats EOL characters as ordinary ones, so each \r or \n counts as 1 character), here are two regexes that find successive blocks of about 5,000 characters, each ending with a period:

      • SEARCH (?s).{1,5000}(\.\s+|\z)    finds the largest block of up to about 5,000 characters that ends with a period followed by whitespace, or at the end of the file

      • SEARCH (?s).{1,5000}.*?(\.\s+|\z)    finds the smallest block longer than 5,000 characters that ends with a period followed by whitespace, or at the end of the file

      • Select the Regular expression search mode and tick the Wrap around option

      To find out how many blocks of about 5,000 characters the current file contains, simply click the Count button in the Find dialog.
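For readers who want to try the same idea outside Notepad++, here is a minimal sketch using Python's standard `re` module. It is an approximation of the two regexes above, not the Notepad++ engine itself: `re` has no `\z`, so `\Z` is used instead, and `re.DOTALL` stands in for `(?s)`. The function name `split_at_sentences` and its `at_least` flag are made up for this illustration.

```python
import re

def split_at_sentences(text, limit=5000, at_least=False):
    """Return consecutive blocks of roughly `limit` characters, each ending after a period.

    at_least=False: largest block whose body is at most `limit` characters.
    at_least=True:  smallest block that runs past `limit` characters.
    """
    # Python's re has no \z, so \Z (end of string) stands in for it.
    tail = r"(?:\.\s+|\Z)"
    body = r".{1,%d}" % limit
    if at_least:
        body += r".*?"  # lazily extend to the next sentence end
    pattern = re.compile(body + tail, re.DOTALL)  # re.DOTALL plays the role of (?s)
    return [m.group(0) for m in pattern.finditer(text)]

sample = "One. Two. Three. Four."
print(split_at_sentences(sample, limit=10))
# ['One. Two. ', 'Three. ', 'Four.']
print(split_at_sentences(sample, limit=10, at_least=True))
# ['One. Two. Three. ', 'Four.']
```

The length of the returned list corresponds to what the Count button reports. As in the original regexes, a block can exceed the limit slightly, because the closing period and the whitespace after it are added on top of the `{1,limit}` body.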


      Now, to find the beginning of the Nth match, use these generic regexes:

      • SEARCH (?s)(.{1,5000}(\.\s+|\z)){N-1}\K

      • SEARCH (?s)(.{1,5000}.*?(\.\s+|\z)){N-1}\K

      Of course, replace N-1 with the appropriate integer. For example, use {2} to land at the beginning of the 3rd block.

      Remarks :

      • If a file contains N blocks in total and you use {N} as the quantifier, the regex produces a zero-length match at the very end of the file

      • Don’t use a quantifier greater than N. And for small files that hold a single block, the only valid quantifier, {1}, always moves to the very end of the file!
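Python's `re` module has no `\K`, so the counted-group trick above does not carry over directly. A rough equivalent, sketched here under the same substitutions (`\Z` for `\z`, `re.DOTALL` for `(?s)`), is to walk the matches and take the start offset of the Nth one. The helper name `nth_block_start` is hypothetical.

```python
import re
from itertools import islice

def nth_block_start(text, n, limit=5000):
    """Return the 0-based offset where the n-th block (1-based) begins, or None."""
    pattern = re.compile(r".{1,%d}(?:\.\s+|\Z)" % limit, re.DOTALL)
    # Skip the first n-1 matches and take the next one, if there is one.
    match = next(islice(pattern.finditer(text), n - 1, None), None)
    return match.start() if match else None

sample = "One. Two. Three. Four."
print(nth_block_start(sample, 2, limit=10))  # 10
print(nth_block_start(sample, 4, limit=10))  # None
```

Unlike the Notepad++ regex, which yields a zero-length match at the end of the file when the quantifier equals the number of blocks, this sketch simply returns `None` once `n` goes past the last block.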

      Best Regards,

      guy038

      P.S. :

      I tested these regexes on the License.txt file (4 occurrences).

      • B
        Brent Parker @guy038
        last edited by May 10, 2022, 6:07 PM

        @guy038 Hello, would there be a way to use this RegEx code in Notepad++ to automatically create new txt files based on the character limits?

        I edited the regex above a little so that it selects whole lines up to a maximum of 9000 characters, instead of only blocks ending with a period.

        (?s).{1,9000}(\s+|\z)\n

        I have a document that is 499,429 characters long, and when using the code above it gives me a count of 58. What would be the simplest way to automatically split/generate 58 separate text files based on the regex code within Notepad++?

        Thank you!

        • P
          PeterJones @Brent Parker
          last edited by May 10, 2022, 6:53 PM

          @brent-parker ,

          (?s).{1,9000}(\s+|\z)\n

          Good job adapting it to your situation. As a gotcha, if your file doesn’t end in a newline character, it won’t grab the very last line in your file. The original regex split after a period followed by whitespace, which is why it used \s+; since you only want to split on a newline, I would suggest (?s).{1,9000}(\R|\z), where \R matches a newline sequence such as \r\n or \n, and \z is the “match end of file” anchor that handles a last line with no trailing newline.
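To see the effect of the `(\R|\z)` ending in isolation, here is a small sketch in plain Python. It is only an approximation: Python's `re` supports neither `\R` nor `\z`, so the common newline sequences are spelled out and `\Z` is used for end of input. The function name `split_at_lines` is an assumption for this example.

```python
import re

def split_at_lines(text, limit=9000):
    """Return chunks of whole lines, each with a body of at most `limit` characters."""
    # Stand-in for (\R|\z): \r\n, \n, or \r, or else the end of the text.
    newline_or_end = r"(?:\r\n|\n|\r|\Z)"
    pattern = re.compile(r".{1,%d}" % limit + newline_or_end, re.DOTALL)
    return [m.group(0) for m in pattern.finditer(text)]

no_final_newline = "aaa\r\nbbb\r\nccc"
chunks = split_at_lines(no_final_newline, limit=10)
print(chunks)  # ['aaa\r\nbbb\r\n', 'ccc']
```

The last line `ccc` has no trailing newline but is still captured, because the end-of-text alternative matches there.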

          What would be the simplest way to automatically split/generate 58 separate text files based on the regex code within Notepad++?

          Notepad++ does not have a built-in way to automate splitting the file into chunks based on a regex search (or any other method).

          If you are willing to install the PythonScript plugin, one could write a script that takes the chunks of at most 9000 characters and writes them to new files:

          # encoding=utf-8
          """in response to https://community.notepad-plus-plus.org/topic/19738/ , the 2022-May-10 add-on
          
          set nChars to be the maximum "chunk" size, then run this script.
          """
          from Npp import editor, notepad
          import os.path
          
          nChars = 9000
          
          regex = r'(?s).{1,' + str(nChars) + r'}(\R|\z)'
          
          counter = 0
          originalFullname = notepad.getCurrentFilename()
          originalPath, originalFile = os.path.split( originalFullname )
          originalBase, originalExt = os.path.splitext( originalFile )
          
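          def withChunk(m):
              """Callback for editor.research: save one matched chunk as its own numbered file."""
              global counter
              counter += 1
              # e.g. <folder>\<name>_001<ext>, next to the original file
              newPath = os.path.join(originalPath, "{}_{:03d}{}".format(originalBase, counter, originalExt))
              # Debug aid; to use it, also add `console` to the `from Npp import ...` line
              #console.write( "{}\tlength={}".format(newPath, len(m.group(0)))+"\n")
          
              notepad.new()                           # open an empty tab
              editor.setText(m.group(0))              # fill it with the matched chunk
              notepad.saveAs(newPath)
              notepad.close()
              notepad.activateFile(originalFullname)  # return to the source file before the next match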
          
          editor.research(regex, withChunk)
          
          1. Install the PythonScript plugin (if not already installed) through Plugins Admin
          2. Plugins > Python Script > New Script, name = split-file-into-chunks-of-X-characters.py
          3. Paste the script above into the new file, and save

          To run:

          1. Open the file you want to split
          2. Click Plugins > Python Script > Scripts > split-file-into-chunks-of-X-characters.py

          If you want to assign a keyboard shortcut, use Plugins > Python Script > Configure…, and Add the script to the left panel. Exit Notepad++ and restart. After that, you can use the Settings > Shortcut Mapper to assign the keystroke. From then on, that keystroke is equivalent to running the script from the Scripts menu.

          (I tried the script with nChars = 900, and it split the source code for that script into two chunks: one that was 901 characters including the two bytes for the final CRLF (since the regex matches 1 to N characters followed by newline sequence), and one that was 186 characters including final CRLF.)
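For anyone who wants the same split without Notepad++, the following is a rough standalone sketch in ordinary Python 3. It is not the PythonScript version above: it reads the file directly, assumes UTF-8 encoding, and replaces `\R` and `\z` with spelled-out newlines and `\Z`. The function name `split_file` and the demo file names are invented for this example; the numbered naming scheme mirrors the one used in the script.

```python
import os
import re
import tempfile

def split_file(path, n_chars=9000, encoding="utf-8"):
    """Write line-aligned chunks of `path` next to it as base_001.ext, base_002.ext, ..."""
    folder, filename = os.path.split(path)
    base, ext = os.path.splitext(filename)
    # newline="" keeps CRLF/LF exactly as stored in the file
    with open(path, encoding=encoding, newline="") as src:
        text = src.read()
    pattern = re.compile(r".{1,%d}(?:\r\n|\n|\r|\Z)" % n_chars, re.DOTALL)
    written = []
    for counter, match in enumerate(pattern.finditer(text), start=1):
        new_path = os.path.join(folder, "{}_{:03d}{}".format(base, counter, ext))
        with open(new_path, "w", encoding=encoding, newline="") as dst:
            dst.write(match.group(0))
        written.append(new_path)
    return written

# Small demonstration in a temporary folder (tiny limit so the split is visible).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "doc.txt")
with open(demo_path, "w", encoding="utf-8", newline="") as f:
    f.write("aaa\nbbb\nccc")
created = split_file(demo_path, n_chars=5)
print([os.path.basename(p) for p in created])
# ['doc_001.txt', 'doc_002.txt', 'doc_003.txt']
```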

          • B
            Brent Parker @PeterJones
            last edited by May 10, 2022, 6:58 PM

            @peterjones Thank you so much for the amazingly detailed and fast reply!! I just tried it and it worked beautifully! This is going to save me so much time!

            • B
              Brent Parker @PeterJones
              last edited by Brent Parker May 10, 2022, 10:55 PM May 10, 2022, 10:54 PM

              @peterjones hello, sorry to bother you again.

              I had one last question about the provided script. If I wanted to edit the script so that it adds a predefined block of text before and after the copied chunk in each of the files as they are being written, how would I go about doing that?

              Each generated file would look something like this:

              <speak xmlns="http://www.w3.org/2001/10/synthesis"><voice name="en-GB"><prosody rate="-05%">

              TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000
              TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000

              </prosody></voice></speak>

              I’ve been attempting to read over the PythonScript Documentation this afternoon to try and figure out how to do it, but it’s a bit of a steep learning curve for someone who has never programmed before to be able to learn in one afternoon, lol. I’d assumed I could just toss in some “editor.addText” items (like below) to place it before and after it sets the copied text:

              editor.addText(<speak xmlns="http://www.w3.org/2001/10/synthesis"><voice name="en-GB"><prosody rate="-05%">)
              editor.setText(m.group(0))
              editor.addText(</prosody></voice></speak>)
              

              Unfortunately, not as simple as I’d hoped.

              • P
                PeterJones @Brent Parker
                last edited by May 10, 2022, 11:51 PM

                @brent-parker said in How to use RegEx to split 5000 characters but preserving sentence?:

                Unfortunately, not as simple as I’d hoped

                As with most programming languages, you have to put quotes around literal strings. Fortunately, Python allows using single quotes or double, so using single quotes around the strings that contain double-quotes is easiest:

                editor.setText(m.group(0))
                editor.documentStart()  # setText replaces the whole buffer, so add the wrapper text afterwards
                editor.addText('<speak xmlns="http://www.w3.org/2001/10/synthesis"><voice name="en-GB"><prosody rate="-05%">')
                editor.appendText('</prosody></voice></speak>')
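The quoting rule can also be tried in plain Python, away from the plugin. The sketch below builds the wrapped text as one string; `PREFIX`, `SUFFIX`, and `wrap_chunk` are names made up for this illustration, and the blank lines between the parts are an assumption based on the sample layout shown earlier in the thread.

```python
# Single quotes outside, so the double quotes inside need no escaping.
PREFIX = '<speak xmlns="http://www.w3.org/2001/10/synthesis"><voice name="en-GB"><prosody rate="-05%">'
SUFFIX = '</prosody></voice></speak>'

def wrap_chunk(chunk, prefix=PREFIX, suffix=SUFFIX):
    """Return the chunk with the wrapper text before and after it, separated by blank lines."""
    return prefix + "\n\n" + chunk + "\n\n" + suffix

wrapped = wrap_chunk("TEXT9000 TEXT9000")
print(wrapped)
```

Inside the PythonScript callback, the result could presumably be passed to `editor.setText(...)` in a single call instead of adding the pieces one by one.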
                

                For someone who has never programmed before, and just tried for an afternoon, good effort. I’m always happy to see when people in the forum put in the effort to learn, and give it a go, rather than just assuming that we’ll spoon feed them everything. When you or others show that effort, I feel much better about the effort I put into things like writing the script. So thank you, and I hope you continue to learn. For reference, the PythonScript documentation will only teach you about the interface between the plugin and Notepad++ itself; the default PythonScript plugin uses Python 2.7, and there are a bazillion websites out there which will give tutorials and go into the details of the Python programming language itself.

                • B
                  Brent Parker @PeterJones
                  last edited by May 11, 2022, 12:16 AM

                  @peterjones amazing, thank you so much!

                  I’m definitely interested in learning the scripting language and what these scripts can do. I’ve been primarily using Notepad++ with RegEx codes to format research articles/papers so that they’re friendly enough to listen to with text-to-speech software.

                  It’s surprisingly time-consuming to edit out the unnecessary bits. Still, being able to listen to papers via TTS, without it constantly pausing to read every footnote reference number and citation out loud, really helps when I’m trying to understand the topic of a paper while out on a walk. This will probably also be helpful when correcting and formatting OCR’d text from scanned PDFs of older works that have yet to be digitized.

                  Hopefully, I’ll be able to learn enough to be able to share some tips/tricks with others in the community one day.

                  • B
                    Brent Parker @Brent Parker
                    last edited by May 11, 2022, 1:10 AM

                    @brent-parker

                    For future reference, I’ll post the full updated script here. It creates the documents and adds extra text to the top and bottom of each generated document, in case anyone is interested later.

                    # encoding=utf-8
                    """in response to https://community.notepad-plus-plus.org/topic/19738/ , the 2022-May-10 add-on
                    
                    set nChars to be the maximum "chunk" size, replace ENTER YOUR TEXT HERE with your own text, then run this script.
                    """
                    from Npp import editor, notepad
                    import os.path
                    
                    nChars = 9000
                    
                    regex = r'(?s).{1,' + str(nChars) + r'}(\R|\z)'
                    
                    counter = 0
                    originalFullname = notepad.getCurrentFilename()
                    originalPath, originalFile = os.path.split( originalFullname )
                    originalBase, originalExt = os.path.splitext( originalFile )
                    
                    def withChunk(m):
                        global counter
                        counter += 1
                        newPath = "{}\\{}_{:03d}{}".format(originalPath, originalBase, counter, originalExt)
                        #console.write( "{}\tlength={}".format(newPath, len(m.group(0)))+"\n")
                    
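                        notepad.new()
                        editor.setText(m.group(0))                   # chunk first: setText replaces the whole buffer
                        editor.documentStart()                       # move the caret to the top of the new file
                        editor.addText('ENTER YOUR TEXT HERE\n\n')   # inserted at the caret, before the chunk
                        editor.appendText('ENTER YOUR TEXT HERE')    # added at the end, after the chunk
                        notepad.saveAs(newPath)
                        notepad.close()
                        notepad.activateFile(originalFullname)       # back to the source file for the next match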
                    
                    editor.research(regex, withChunk)
                    