    How to use RegEx to split 5000 characters but preserving sentence?

    • NZ Select

      I have articles inside txt files.

      I want to split each text file into chunks of at most 5000 characters, ending each chunk at the last period so that sentences stay whole.

      How do I identify the 1st match, the 2nd match, and so on?

      • guy038

        Hello, @nz-select and All,

        If you don’t mind an approximation (my method treats EOL characters as ordinary ones, so each \r or \n counts as 1 character), here are two regexes that find successive blocks of about 5,000 characters, each ending with a period:

        • SEARCH (?s).{1,5000}(\.\s+|\z)    finds the largest block of up to about 5,000 characters that ends with a period

        • SEARCH (?s).{1,5000}.*?(\.\s+|\z)    finds the smallest block of more than 5,000 characters that ends with a period

        • Select the Regular expression search mode and tick the Wrap around option

        To find out how many blocks of about 5,000 characters the current file contains, simply hit the Count button in the Find dialog
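For anyone who wants to try the first search outside Notepad++, here is a minimal sketch in plain Python. It assumes the standard re module rather than the Boost engine behind Notepad++, so \z is written as \Z. The names split_at_sentences and limit are made up for this illustration.

```python
import re


def split_at_sentences(text, limit=5000):
    """Return successive blocks of up to `limit` characters, each followed by
    a period plus whitespace, or ending at the end of the text."""
    # Assumption: Python's re module, where "end of text" is \Z, not \z.
    pattern = re.compile(r'(?s).{1,%d}(?:\.\s+|\Z)' % limit)
    return [m.group(0) for m in pattern.finditer(text)]


if __name__ == '__main__':
    blocks = split_at_sentences("One. Two. Three.", limit=6)
    print(len(blocks))  # same role as the Count button
    print(blocks)
```

The length of the returned list plays the role of the Count button. A sentence longer than the limit has no valid match at its start, so the search skips ahead past part of it; it is worth checking that the blocks join back into the original text.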


        Now, in order to find out the beginning of the Nth match, use these generic regexes :

        • SEARCH (?s)(.{1,5000}(\.\s+|\z)){N-1}\K

        • SEARCH (?s)(.{1,5000}.*?(\.\s+|\z)){N-1}\K

        And, of course, replace N-1 with the appropriate integer!

        Remarks:

        • If a file contains N blocks in total and you use {N} as the quantifier, the regex produces a zero-length match at the very end of the current file

        • Don’t use a quantifier greater than N. And, for small files, the only valid quantifier, {1}, will always move to the very end of the file!
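The same "where does the Nth block start" question can be sketched in plain Python. The standard re module has no \K, so this version simply walks the matches and reports the offset of the requested one. The function name nth_block_start is hypothetical, and \Z stands in for \z.

```python
import re


def nth_block_start(text, n, limit=5000):
    """Return the offset where the n-th block (1-based) begins, or None if
    the text holds fewer than n blocks."""
    # Assumption: Python's re module; \Z replaces \z and \K is not available.
    pattern = re.compile(r'(?s).{1,%d}(?:\.\s+|\Z)' % limit)
    for index, match in enumerate(pattern.finditer(text), start=1):
        if index == n:
            return match.start()
    return None
```

Unlike the {N} case described in the remarks, asking for a block beyond the last one returns None here instead of a zero-length match at the end of the file.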

        Best Regards,

        guy038

        P.S. :

        I used the License.txt file to test these regexes ! ( 4 occurrences )

        • Brent Parker @guy038

          @guy038 Hello, would there be a way to use this RegEx code in Notepad++ to automatically create new txt files based on the character limits?

          I edited the regex above slightly so that it selects whole lines, up to a maximum of 9000 characters, instead of only blocks ending with a period.

          (?s).{1,9000}(\s+|\z)\n

          I have a document that is 499,429 characters long, and when using the code above it gives me a count of 58. What would be the simplest way to automatically split/generate 58 separate text files based on the regex code within Notepad++?

          Thank you!

          • PeterJones @Brent Parker

            @brent-parker ,

            (?s).{1,9000}(\s+|\z)\n

            Good job adapting it to your situation. One gotcha: if your file doesn’t end in a newline character, that regex won’t grab the very last line in your file. The original regex used \.\s+ because the original poster wanted to split after a period followed by whitespace; since you only want to split on newlines, I would suggest (?s).{1,9000}(\R|\z), where \R matches a newline sequence such as \r\n or \n, and \z is the “match end of file” assertion that handles a last line without a trailing newline.
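The end-of-file gotcha is easy to reproduce in plain Python. This is only a sketch under assumptions: the standard re module has no \R or \z, so an explicit newline alternation and \Z stand in for them, and chunk_lines and allow_eof are names invented for the demonstration.

```python
import re

# Stand-in for \R, which Python's re module does not provide (assumption:
# only \r\n, \n and \r line endings matter here).
NEWLINE = r'(?:\r\n|\n|\r)'


def chunk_lines(text, limit, allow_eof=True):
    """Split text into chunks of whole lines, each holding at most `limit`
    characters before its closing newline."""
    # With allow_eof the chunk may also end at the end of the text, which is
    # what keeps a final line without a trailing newline from being dropped.
    tail = r'(?:%s|\Z)' % NEWLINE if allow_eof else NEWLINE
    pattern = re.compile(r'(?s).{1,%d}%s' % (limit, tail))
    return [m.group(0) for m in pattern.finditer(text)]


if __name__ == '__main__':
    sample = "aaa\nbbb\nccc"  # note: no newline after the last line
    print(chunk_lines(sample, 5, allow_eof=False))  # last line is missing
    print(chunk_lines(sample, 5, allow_eof=True))   # last line is kept
```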

            What would be the simplest way to automatically split/generate 58 separate text files based on the regex code within Notepad++?

            Notepad++ does not have a built-in way to automate splitting the file into chunks based on a regex search (or any other method).

            If you are willing to install the PythonScript plugin, a script can take the chunks of at most 9000 characters and write each one to a new file:

            # encoding=utf-8
            """in response to https://community.notepad-plus-plus.org/topic/19738/ , the 2022-May-10 add-on
            
            set nChars to be the maximum "chunk" size, then run this script.
            """
            from Npp import editor, notepad
            import os.path
            
            nChars = 9000
            
            regex = r'(?s).{1,' + str(nChars) + r'}(\R|\z)'
            
            counter = 0
            originalFullname = notepad.getCurrentFilename()
            originalPath, originalFile = os.path.split( originalFullname )
            originalBase, originalExt = os.path.splitext( originalFile )
            
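            def withChunk(m):
                """Callback for editor.research: save one matched chunk to its own file."""
                global counter
                counter += 1
                # e.g. <folder>\<name>_001<ext>, next to the original file
                newPath = os.path.join(originalPath, "{}_{:03d}{}".format(originalBase, counter, originalExt))
                #console.write( "{}\tlength={}".format(newPath, len(m.group(0)))+"\n")
            
                notepad.new()                           # open an empty tab
                editor.setText(m.group(0))              # fill it with the chunk
                notepad.saveAs(newPath)
                notepad.close()
                notepad.activateFile(originalFullname)  # return to the source file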
            
            editor.research(regex, withChunk)
            
            1. Install the PythonScript plugin (if not already installed) through the Plugins Admin
            2. Plugins > PythonScript > New script, name = split-file-into-chunks-of-X-characters.py
            3. Paste the script above into the new file, and save

            To run:

            1. Open the file you want to split
            2. Click Plugins > Python Script > scripts > split-file-into-chunks-of-X-characters.py

            If you want to assign a keyboard shortcut, use Plugins > Python Script > Configure…, and Add the script to the left panel. Exit Notepad++ and restart. After that, you can use the Settings > Shortcut Mapper to assign the keystroke. From then on, that keystroke is equivalent to running the script from the Scripts menu.

            (I tried the script with nChars = 900, and it split the source code for that script into two chunks: one that was 901 characters including the two bytes for the final CRLF (since the regex matches 1 to N characters followed by newline sequence), and one that was 186 characters including final CRLF.)
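For readers who would rather run the split outside Notepad++, the same idea can be sketched with the standard library only. This is an illustration, not the plugin script: split_file is a hypothetical name, the file is assumed to be UTF-8, and the newline alternation with \Z stands in for \R and \z, which Python's re module lacks.

```python
import io
import os
import re


def split_file(path, limit=9000):
    """Write successive chunks of `path` to numbered sibling files and
    return the list of paths created."""
    # newline='' keeps \r\n sequences exactly as they are in the file.
    with io.open(path, encoding='utf-8', newline='') as handle:
        text = handle.read()

    base, ext = os.path.splitext(path)
    # Up to `limit` characters, then a newline sequence or the end of text,
    # so a chunk can be slightly longer than `limit`.
    pattern = re.compile(r'(?s).{1,%d}(?:\r\n|\n|\r|\Z)' % limit)

    created = []
    for counter, match in enumerate(pattern.finditer(text), start=1):
        new_path = '{}_{:03d}{}'.format(base, counter, ext)
        with io.open(new_path, 'w', encoding='utf-8', newline='') as out:
            out.write(match.group(0))
        created.append(new_path)
    return created
```

Called as split_file('article.txt'), it would create article_001.txt, article_002.txt and so on beside the original, following the naming used by the plugin script.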

            • Brent Parker @PeterJones

              @peterjones Thank you so much for the amazingly detailed and fast reply!! I just tried it and it worked beautifully! This is going to save me so much time!

              • Brent Parker @PeterJones

                @peterjones hello, sorry to bother you again.

                I had one last question about the script. If I wanted to edit it so that it adds a predefined block of text before and after the copied chunk in each file as it is written, how would I go about doing that?

                Each file generated would look something like this :

                <speak xmlns="http://www.w3.org/2001/10/synthesis"><voice name="en-GB"><prosody rate="-05%">

                TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000
                TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000 TEXT9000

                </prosody></voice></speak>

                I’ve been attempting to read over the PythonScript Documentation this afternoon to try and figure out how to do it, but it’s a bit of a steep learning curve for someone who has never programmed before to be able to learn in one afternoon, lol. I’d assumed I could just toss in some “editor.addText” items (like below) to place it before and after it sets the copied text:

                editor.addText(<speak xmlns="http://www.w3.org/2001/10/synthesis"><voice name="en-GB"><prosody rate="-05%">)
                editor.setText(m.group(0))
                editor.addText(</prosody></voice></speak>)
                

                Unfortunately, not as simple as I’d hoped.

                • PeterJones @Brent Parker

                  @brent-parker said in How to use RegEx to split 5000 characters but preserving sentence?:

                  Unfortunately, not as simple as I’d hoped

                  As with most programming languages, you have to put quotes around literal strings. Fortunately, Python allows using single quotes or double, so using single quotes around the strings that contain double-quotes is easiest:

                  editor.addText('<speak xmlns="http://www.w3.org/2001/10/synthesis"><voice name="en-GB"><prosody rate="-05%">')
                  editor.setText(m.group(0))
                  editor.addText('</prosody></voice></speak>')
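One thing to watch for beyond the quoting: as far as I understand the Scintilla call, setText replaces the whole buffer, so anything added before it would be overwritten. A minimal sketch that sidesteps the call order is to build the complete string first. wrap_chunk, HEADER and FOOTER are hypothetical names for this illustration, not part of PythonScript.

```python
# Assumed header/footer, copied from the SSML example earlier in the thread.
HEADER = '<speak xmlns="http://www.w3.org/2001/10/synthesis"><voice name="en-GB"><prosody rate="-05%">'
FOOTER = '</prosody></voice></speak>'


def wrap_chunk(chunk, header=HEADER, footer=FOOTER):
    """Return the chunk with the header above it and the footer below it,
    each separated from the chunk by a blank line."""
    return header + '\n\n' + chunk + '\n\n' + footer


if __name__ == '__main__':
    print(wrap_chunk('TEXT9000 TEXT9000 TEXT9000'))
```

Inside withChunk, the result could then go into the new tab with a single editor.setText(wrap_chunk(m.group(0))) call.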
                  

                  For someone who has never programmed before, and just tried for an afternoon, good effort. I’m always happy to see when people in the forum put in the effort to learn, and give it a go, rather than just assuming that we’ll spoon feed them everything. When you or others show that effort, I feel much better about the effort I put into things like writing the script. So thank you, and I hope you continue to learn. For reference, the PythonScript documentation will only teach you about the interface between the plugin and Notepad++ itself; the default PythonScript plugin uses Python 2.7, and there are a bazillion websites out there which will give tutorials and go into the details of the Python programming language itself.

                  • Brent Parker @PeterJones

                    @peterjones amazing, thank you so much!

                    I’m definitely interested in learning the scripting language and what these scripts can do. I’ve been primarily using Notepad++ with RegEx codes to format research articles/papers so that they’re friendly enough to listen to with text-to-speech software.

                    It’s surprisingly time-consuming to edit out the unnecessary bits. Still, being able to listen to the papers via TTS, without it constantly pausing to read every footnote reference number and citation out loud, is definitely helpful when I’m just trying to understand the topic of a paper while out on a walk. This will probably also be helpful when trying to correct and format OCR’d text from scanned PDFs of older works that have yet to be digitized.

                    Hopefully, I’ll be able to learn enough to be able to share some tips/tricks with others in the community one day.

                    • Brent Parker @Brent Parker

                      @brent-parker

                      For future reference, I’ll post the full updated script here. It creates the documents and adds extra text to the top and bottom of each generated document, in case anyone is interested later.

                      # encoding=utf-8
                      """in response to https://community.notepad-plus-plus.org/topic/19738/ , the 2022-May-10 add-on
                      
                      set nChars to be the maximum "chunk" size, replace ENTER YOUR TEXT HERE with your own text, then run this script.
                      """
                      from Npp import editor, notepad
                      import os.path
                      
                      nChars = 9000
                      
                      regex = r'(?s).{1,' + str(nChars) + r'}(\R|\z)'
                      
                      counter = 0
                      originalFullname = notepad.getCurrentFilename()
                      originalPath, originalFile = os.path.split( originalFullname )
                      originalBase, originalExt = os.path.splitext( originalFile )
                      
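                      def withChunk(m):
                          """Callback for editor.research: save one chunk, with header and footer, to its own file."""
                          global counter
                          counter += 1
                          # e.g. <folder>\<name>_001<ext>, next to the original file
                          newPath = os.path.join(originalPath, "{}_{:03d}{}".format(originalBase, counter, originalExt))
                          #console.write( "{}\tlength={}".format(newPath, len(m.group(0)))+"\n")
                      
                          notepad.new()
                          editor.setText(m.group(0))                   # chunk first: setText replaces the whole buffer
                          editor.documentStart()                       # move the caret to the top
                          editor.addText('ENTER YOUR TEXT HERE\n\n')   # header, inserted at the caret
                          editor.appendText('ENTER YOUR TEXT HERE')    # footer, appended at the end
                          notepad.saveAs(newPath)
                          notepad.close()
                          notepad.activateFile(originalFullname)       # return to the source file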
                      
                      editor.research(regex, withChunk)
                      