Eliminate excess verbiage from audio transcript

Mark D Worthen PsyD

Fellow Notepad++ Users,

I want to eliminate excess verbiage from Zoom audio transcripts. For unknown reasons, Zoom tends to divide a single sentence into three or four lines, each preceded by the name of the speaker.

So that you know I did my homework, I browsed and searched the Notepad++ User Manual and I searched this Community.

My technical know-how is limited. I will therefore not be offended if you explain it to me like I’m 5. ;-)

Here is the data I currently have (“before” data):

Dr Worthen: blah blah blah
Dr Worthen: blah blah blah
Dr Worthen: blah blah blah
Dr Worthen: blah blah blah
John Doe: blah blah blah
John Doe: blah blah blah
John Doe: blah blah blah
John Doe: blah blah blah

I would like to transform that data into the following (“after” data):

Dr Worthen: blah blah blah blah blah blah blah blah blah
John Doe: blah blah blah blah blah blah blah blah blah

Many thanks,

Mark

Mark Olson

@Mark-D-Worthen-PsyD

Fortunately, this can be done in Notepad++ without plugins, using only the find/replace form. The regular expressions required aren’t the simplest, but they’re also not incredibly hard to understand. You can read about regular expressions in Notepad++ here.

Before you start, open the Replace tab of the find/replace form (Search->Replace... from the main menu, Ctrl+H using default keybindings).

Next, make sure that the Wrap around box is checked, and select the Regular expression circle in the Search Mode field.

Finally, when I say Replace all X with Y, I mean the following:

Put X in the Find what field
Put Y in the Replace with field
Select Replace All

Suppose you start with this file:

Dr Worthen: blah1 blah2
Dr Worthen: blah3 blah4
Dr Worthen: blah5
Dr Worthen: blah6 blah7 blah8
John Doe: zjk1
John Doe: zjk2 zjk3 zjk4 zjk5
John Doe: zjk6 pouy1 zjk8
John Doe: pouy2
Bob Quzenheim: vbg
Dr Worthen: nvbm
Bob Quzenheim: jrrke
Bob Quzenheim: bnbnm

1. Replace all (?-s)^([^:]+)(?:.*\R\1)* with \x07${0}
- This will add the BEL character (ASCII code 7) at the beginning of the first line of each sequence of lines with the same speaker.
- I chose the BEL character because it does not naturally occur in normal text documents.
2. Replace all (?-s)\R[^\x07\r\n][^:\r\n]*:(.*) with ${1}
- This will find all lines that don’t start with BEL (so all lines except the speaker’s first in that chunk), and remove the newline before that line along with the speaker’s name and the colon.
3. Replace all ^\x07 with nothing (leave the Replace with box empty)
- This will remove the leading BEL characters we used as markers.

You will be left with the following:

Dr Worthen: blah1 blah2 blah3 blah4 blah5 blah6 blah7 blah8
John Doe: zjk1 zjk2 zjk3 zjk4 zjk5 zjk6 pouy1 zjk8 pouy2
Bob Quzenheim: vbg
Dr Worthen: nvbm
Bob Quzenheim: jrrke bnbnm
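
For anyone who would rather script this outside Notepad++, here is a minimal Python sketch of the same three replacements. It is an adaptation, not what Notepad++ runs internally: Python's re module has no \R, so \n stands in (this assumes Unix line endings), [^:]+ is tightened to [^:\n]+ so a speaker name cannot span lines, and the function name merge_speaker_lines is made up for illustration.

```python
import re

BEL = "\x07"  # marker character, same choice as in the steps above


def merge_speaker_lines(text: str) -> str:
    """Join consecutive lines spoken by the same speaker (assumes \\n line endings)."""
    # Step 1: put BEL in front of the first line of each run of same-speaker lines.
    text = re.sub(
        r"^([^:\n]+)(?:.*\n\1)*",
        lambda m: BEL + m.group(0),
        text,
        flags=re.MULTILINE,
    )
    # Step 2: for every line that does not start with BEL, drop the preceding
    # newline, the speaker name and the colon, keeping only the spoken text.
    text = re.sub(r"\n[^\x07\n][^:\n]*:(.*)", r"\1", text)
    # Step 3: remove the BEL markers again.
    return re.sub(r"^\x07", "", text, flags=re.MULTILINE)


if __name__ == "__main__":
    sample = (
        "Dr Worthen: blah1 blah2\n"
        "Dr Worthen: blah3 blah4\n"
        "John Doe: zjk1\n"
        "Dr Worthen: nvbm\n"
    )
    print(merge_speaker_lines(sample), end="")
```

Files saved with Windows line endings would need \r\n handled first (for example by opening the file in text mode, which normalizes line endings by default).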

If you start recording a macro (Macro->Start Recording) immediately before step 1 and stop recording (Macro->Stop Recording) immediately after step 3, then save the macro (Macro->Save Current Recorded Macro), you can easily repeat this sequence of find/replace operations whenever you want.

Mark D Worthen PsyD

@Mark-Olson Wow! That worked like a charm. I really appreciate you explaining it all so clearly. Thank you! ~ Mark P.S. I will upvote your post as soon as I accumulate the needed 2 credits.

guy038

Hello, @mark-d-worthen-psyd, @mark-olson and All,

@mark-d-worthen-psyd, I didn’t answer before, as we were travelling back home. Holiday time is over :-(

Just for information, I found a method which needs only two regex S/R!

Let’s take @mark-olson’s INPUT text again:

Dr Worthen: blah1 blah2
Dr Worthen: blah3 blah4
Dr Worthen: blah5
Dr Worthen: blah6 blah7 blah8
John Doe: zjk1
John Doe: zjk2 zjk3 zjk4 zjk5
John Doe: zjk6 pouy1 zjk8
John Doe: pouy2
Bob Quzenheim: vbg
Dr Worthen: nvbm
Bob Quzenheim: jrrke
Bob Quzenheim: bnbnm

With this first S/R, we simply add a line-break after each block of consecutive lines that begin with the same string before the colon:

SEARCH (?-s)^(.+:).+\R(\1.+\R)*

REPLACE $0\r\n

Giving this temporary text :

Dr Worthen: blah1 blah2
Dr Worthen: blah3 blah4
Dr Worthen: blah5
Dr Worthen: blah6 blah7 blah8

John Doe: zjk1
John Doe: zjk2 zjk3 zjk4 zjk5
John Doe: zjk6 pouy1 zjk8
John Doe: pouy2

Bob Quzenheim: vbg

Dr Worthen: nvbm

Bob Quzenheim: jrrke
Bob Quzenheim: bnbnm
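
As a rough cross-check outside Notepad++, this first S/R can be sketched in Python. Assumptions on my side: \n replaces \R (so Unix line endings), the text has to end with a line-break for the last block to be matched whole (the sketch appends one if it is missing), and the name add_block_separators is hypothetical.

```python
import re


def add_block_separators(text: str) -> str:
    """Insert an empty line after each run of lines sharing the same 'Speaker:' prefix."""
    if not text.endswith("\n"):
        text += "\n"  # the pattern expects every line, including the last, to end with \n
    # (.+:) captures the prefix up to the last colon on the line (greedy match);
    # (?:\1.+\n)* then takes the following lines that start with the same prefix.
    pattern = re.compile(r"^(.+:).+\n(?:\1.+\n)*", re.MULTILINE)
    # Equivalent of the replacement $0 plus a line-break.
    return pattern.sub(lambda m: m.group(0) + "\n", text)


if __name__ == "__main__":
    print(add_block_separators("Dr Worthen: a\nDr Worthen: b\nJohn Doe: c\n"), end="")
```

Because (.+:) is greedy, a colon inside the spoken text (a time such as 10:30, say) would end up in the captured prefix, so such lines would probably not be grouped with their neighbours.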

Then, with the second S/R:

It removes the line-break and the following text up to a colon, if this line-break does not begin the current line
It removes any truly empty line, as well

SEARCH (?<=.)\R.+:|^\R OR (?<!^)\R.+:|^\R

REPLACE Leave EMPTY

And we get our expected OUTPUT text :

Dr Worthen: blah1 blah2 blah3 blah4 blah5 blah6 blah7 blah8
John Doe: zjk1 zjk2 zjk3 zjk4 zjk5 zjk6 pouy1 zjk8 pouy2
Bob Quzenheim: vbg
Dr Worthen: nvbm
Bob Quzenheim: jrrke bnbnm
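
The second S/R can be sketched the same way in Python, taking the temporary text as input. Again \n stands in for \R (Unix line endings assumed), re.MULTILINE makes ^ match at the start of each line, and join_blocks is a made-up name, so treat this as an illustration only.

```python
import re


def join_blocks(temp_text: str) -> str:
    """Collapse each blank-line-separated block into one line per speaker."""
    # First alternative: a line-break that ends a non-empty line, plus the
    # 'Speaker:' prefix of the next line. (?<=.) fails right after a newline,
    # so the line-break of an empty line is never taken by this branch.
    # Second alternative: an empty line (line-break at the start of a line).
    pattern = re.compile(r"(?<=.)\n.+:|^\n", re.MULTILINE)
    return pattern.sub("", temp_text)


if __name__ == "__main__":
    temp = (
        "Dr Worthen: blah1 blah2\n"
        "Dr Worthen: blah3 blah4\n"
        "\n"
        "John Doe: zjk1\n"
        "\n"
    )
    print(join_blocks(temp), end="")
```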

Best Regards,

guy038