    Eliminate excess verbiage from audio transcript

    • Mark D Worthen PsyD

      Fellow Notepad++ Users,

      I want to eliminate excess verbiage from Zoom audio transcripts. For unknown reasons, Zoom tends to divide a single sentence into three or four lines, each preceded by the name of the speaker.

      So that you know I did my homework, I browsed and searched the Notepad++ User Manual and I searched this Community.

      My technical know-how is limited. I will therefore not be offended if you explain it to me like I’m 5. ;-)

      Here is the data I currently have (“before” data):

      Dr Worthen: blah blah blah
      Dr Worthen: blah blah blah
      Dr Worthen: blah blah blah
      Dr Worthen: blah blah blah
      John Doe: blah blah blah
      John Doe: blah blah blah
      John Doe: blah blah blah
      John Doe: blah blah blah
      

      I would like to transform that data into the following (“after” data):

      Dr Worthen: blah blah blah blah blah blah blah blah blah
      John Doe: blah blah blah blah blah blah blah blah blah
      

      Many thanks,

      Mark

      • Mark Olson

        @Mark-D-Worthen-PsyD

        Fortunately, this can be done in Notepad++ without plugins, using only the find/replace form. The regular expressions required aren’t the simplest, but they’re also not incredibly hard to understand. You can read about regular expressions in Notepad++ here.

        Before you start, open the Replace tab of the find/replace form (Search->Replace... from the main menu, Ctrl+H using default keybindings).

        Next, make sure that the Wrap around box is checked, and select the Regular expression circle in the Search Mode field.

        Finally, when I say Replace all X with Y, I mean the following:

        1. Put X in the Find what field
        2. Put Y in the Replace with field
        3. Select Replace All

        Suppose you start with this file:

        Dr Worthen: blah1 blah2
        Dr Worthen: blah3 blah4
        Dr Worthen: blah5
        Dr Worthen: blah6 blah7 blah8
        John Doe: zjk1
        John Doe: zjk2 zjk3 zjk4 zjk5
        John Doe: zjk6 pouy1 zjk8
        John Doe: pouy2
        Bob Quzenheim: vbg
        Dr Worthen: nvbm
        Bob Quzenheim: jrrke
        Bob Quzenheim: bnbnm
        
        1. Replace all (?-s)^([^:]+)(?:.*\R\1)* with \x07${0}
          • This will add the BEL character (ASCII code 7) at the beginning of the first line of each sequence of lines with the same speaker.
          • I chose the BEL character because it does not naturally occur in normal text documents.
        2. Replace all (?-s)\R[^\x07\r\n][^:\r\n]*:(.*) with ${1}
          • This will find every line that doesn’t start with BEL (that is, every line except the first one in each speaker’s chunk), and remove the line break before that line together with the speaker’s name and colon, keeping the rest of the line.
        3. Replace all ^\x07 with nothing (leave the Replace with box empty)
          • This will remove the leading BEL characters we used as markers.

        You will be left with the following:

        Dr Worthen: blah1 blah2 blah3 blah4 blah5 blah6 blah7 blah8
        John Doe: zjk1 zjk2 zjk3 zjk4 zjk5 zjk6 pouy1 zjk8 pouy2
        Bob Quzenheim: vbg
        Dr Worthen: nvbm
        Bob Quzenheim: jrrke bnbnm
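
For anyone who would rather script this outside Notepad++, here is a minimal Python sketch of the same three replacements. It is an adaptation under stated assumptions, not the exact Notepad++ behaviour: Python's `re` module has no `\R` and no `(?-s)`, so the sketch normalises line endings to `\n` first and uses `re.MULTILINE` for `^`. The helper name `merge_speaker_lines` is made up for illustration.

```python
import re

BEL = "\x07"  # same marker character the steps above use


def merge_speaker_lines(text: str) -> str:
    """Join consecutive lines from the same speaker (hypothetical helper)."""
    # Assumption: normalise Windows line endings, since Python's re has no \R.
    text = text.replace("\r\n", "\n")
    # Step 1: put BEL in front of the first line of each run of same-speaker lines.
    text = re.sub(r"^([^:\n]+)(?:.*\n\1)*", BEL + r"\g<0>", text, flags=re.MULTILINE)
    # Step 2: for lines that do not start with BEL, drop the preceding
    # newline and the "Name:" prefix, keeping the rest of the line.
    text = re.sub(r"\n[^\x07\n][^:\n]*:(.*)", r"\1", text)
    # Step 3: remove the BEL markers again.
    return re.sub(r"^\x07", "", text, flags=re.MULTILINE)


sample = (
    "Dr Worthen: blah1 blah2\n"
    "Dr Worthen: blah3 blah4\n"
    "John Doe: zjk1\n"
    "John Doe: zjk2 zjk3\n"
    "Dr Worthen: nvbm\n"
)
print(merge_speaker_lines(sample))
```

As in the Notepad++ version, a speaker who comes back later (like Dr Worthen in the sample) starts a new line rather than being merged with their earlier block.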
        

        If you start recording a macro (Macro->Start Recording) immediately before step 1 and stop recording (Macro->Stop Recording) immediately after step 3, then save the macro (Macro->Save Current Recorded Macro), you can easily repeat this sequence of find/replace operations whenever you want.

          • Mark D Worthen PsyD @Mark Olson

          @Mark-Olson Wow! That worked like a charm. I really appreciate you explaining it all so clearly. Thank you!

          ~ Mark

          P.S. I will upvote your post as soon as I accumulate the needed 2 credits.

            • guy038

            Hello, @mark-d-worthen-psyd, @mark-olson and All,

            @mark-d-worthen-psyd, I didn’t answer earlier, as we were travelling back home. Holiday time is over :-(


            Just for information, I found a method which needs only two regex S/R!

            Let’s take @mark-olson’s INPUT text again:

            Dr Worthen: blah1 blah2
            Dr Worthen: blah3 blah4
            Dr Worthen: blah5
            Dr Worthen: blah6 blah7 blah8
            John Doe: zjk1
            John Doe: zjk2 zjk3 zjk4 zjk5
            John Doe: zjk6 pouy1 zjk8
            John Doe: pouy2
            Bob Quzenheim: vbg
            Dr Worthen: nvbm
            Bob Quzenheim: jrrke
            Bob Quzenheim: bnbnm
            

            With this first S/R, we simply add a line break after each block of consecutive lines that begin with the same string before the colon:

            SEARCH (?-s)^(.+:).+\R(\1.+\R)*

            REPLACE $0\r\n

            Giving this temporary text :

            Dr Worthen: blah1 blah2
            Dr Worthen: blah3 blah4
            Dr Worthen: blah5
            Dr Worthen: blah6 blah7 blah8
            
            John Doe: zjk1
            John Doe: zjk2 zjk3 zjk4 zjk5
            John Doe: zjk6 pouy1 zjk8
            John Doe: pouy2
            
            Bob Quzenheim: vbg
            
            Dr Worthen: nvbm
            
            Bob Quzenheim: jrrke
            Bob Quzenheim: bnbnm
            

            Then, with the second S/R :

            • It removes the line break and the following text up to a colon, if this line break does not begin the current line

            • It also removes any purely empty line

            SEARCH (?<=.)\R.+:|^\R    OR    (?<!^)\R.+:|^\R

            REPLACE Leave EMPTY

            And we get our expected OUTPUT text :

            Dr Worthen: blah1 blah2 blah3 blah4 blah5 blah6 blah7 blah8
            John Doe: zjk1 zjk2 zjk3 zjk4 zjk5 zjk6 pouy1 zjk8 pouy2
            Bob Quzenheim: vbg
            Dr Worthen: nvbm
            Bob Quzenheim: jrrke bnbnm
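
The same two-pass idea can be sketched in Python for use outside Notepad++. This is an adaptation under assumptions, not a drop-in equivalent: Python's `re` has no `\R` or `(?-s)`, so line endings are normalised to `\n`, `re.MULTILINE` is used for `^`, and a trailing line break is added if missing because the first pattern needs one after every line. The function name `merge_blocks_two_pass` is hypothetical.

```python
import re


def merge_blocks_two_pass(text: str) -> str:
    """Two-pass merge of same-speaker lines (hypothetical helper)."""
    # Assumption: work on \n line endings only, since Python's re has no \R.
    text = text.replace("\r\n", "\n")
    if not text.endswith("\n"):
        text += "\n"  # the first pattern expects a line break after each line
    # Pass 1: append an empty line after each block of lines sharing a "Name:" prefix.
    text = re.sub(r"^(.+:).+\n(?:\1.+\n)*", r"\g<0>\n", text, flags=re.MULTILINE)
    # Pass 2: delete a line break plus the following "Name:" when the break
    # follows a character on the same line, and delete the empty separator lines.
    return re.sub(r"(?<=.)\n.+:|^\n", "", text, flags=re.MULTILINE)


sample = (
    "Dr Worthen: blah1 blah2\n"
    "Dr Worthen: blah3 blah4\n"
    "John Doe: zjk1\n"
    "John Doe: zjk2 zjk3\n"
    "Dr Worthen: nvbm\n"
)
print(merge_blocks_two_pass(sample))
```

One caveat carried over from the patterns themselves: the greedy `.+:` runs to the last colon on a line, so spoken text that itself contains a colon (a time such as 3:30, for instance) would be cut at that colon.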
            

            Best Regards,

            guy038

            The Community of users of the Notepad++ text editor.