Eliminate excess verbiage from audio transcript
-
Fellow Notepad++ Users,
I want to eliminate excess verbiage from Zoom audio transcripts. For unknown reasons, Zoom tends to divide a single sentence into three or four lines, each preceded by the name of the speaker.
So that you know I did my homework, I browsed and searched the Notepad++ User Manual and I searched this Community.
My technical know-how is limited. I will therefore not be offended if you explain it to me like I’m 5. ;-)
Here is the data I currently have (“before” data):
Dr Worthen: blah blah blah
Dr Worthen: blah blah blah
Dr Worthen: blah blah blah
Dr Worthen: blah blah blah
John Doe: blah blah blah
John Doe: blah blah blah
John Doe: blah blah blah
John Doe: blah blah blah
I would like to transform that data into the following (“after” data):
Dr Worthen: blah blah blah blah blah blah blah blah blah
John Doe: blah blah blah blah blah blah blah blah blah
Many thanks,
Mark
-
Fortunately, this can be done in Notepad++ without plugins, using only the find/replace form. The regular expressions required aren’t the simplest, but they’re also not incredibly hard to understand. You can read about regular expressions in Notepad++ here.
Before you start, open the Replace tab of the find/replace form (Search->Replace... from the main menu, Ctrl+H using default keybindings).
Next, make sure that the Wrap around box is checked, and select the Regular expression circle in the Search Mode field.
Finally, when I say Replace all X with Y, I mean the following:
- Put X in the Find what field
- Put Y in the Replace with field
- Select Replace All
Suppose you start with this file:
Dr Worthen: blah1 blah2
Dr Worthen: blah3 blah4
Dr Worthen: blah5
Dr Worthen: blah6 blah7 blah8
John Doe: zjk1
John Doe: zjk2 zjk3 zjk4 zjk5
John Doe: zjk6 pouy1 zjk8
John Doe: pouy2
Bob Quzenheim: vbg
Dr Worthen: nvbm
Bob Quzenheim: jrrke
Bob Quzenheim: bnbnm
1. Replace all
   (?-s)^([^:]+)(?:.*\R\1)*
   with
   \x07${0}
   - This will add the BEL character (ASCII code 7) at the beginning of the first line of each sequence of lines with the same speaker.
   - I chose the BEL character because it does not naturally occur in normal text documents.
2. Replace all
   (?-s)\R[^\x07\r\n][^:\r\n]*:(.*)
   with
   ${1}
   - This will find all lines that don’t start with BEL (so all lines except the speaker’s first in that chunk), and remove the newline before that line and the name of the speaker.
3. Replace all
   ^\x07
   with nothing (leave the Replace with box empty)
   - This will remove the leading BEL characters we used as markers.
You will be left with the following:
Dr Worthen: blah1 blah2 blah3 blah4 blah5 blah6 blah7 blah8
John Doe: zjk1 zjk2 zjk3 zjk4 zjk5 zjk6 pouy1 zjk8 pouy2
Bob Quzenheim: vbg
Dr Worthen: nvbm
Bob Quzenheim: jrrke bnbnm
If you start recording a macro (Macro->Start Recording) immediately before step 1 and stop recording (Macro->Stop Recording) immediately after step 3, then save the macro (Macro->Save Current Recorded Macro), you can easily repeat this sequence of find replaces whenever you want.
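For anyone who wants to run the same three steps outside Notepad++, here is a minimal Python sketch of the idea. It is not part of the Notepad++ procedure above. It assumes the transcript uses plain `\n` line endings and contains no BEL characters, and the function name `merge_with_bel_markers` is made up for this illustration. Python's `re` module has no `\R`, so `\n` is used instead, and `re.MULTILINE` stands in for the way `^` matches at every line start in Notepad++.

```python
import re

BEL = "\x07"  # marker character; assumed not to occur in the transcript


def merge_with_bel_markers(text: str) -> str:
    """Merge consecutive lines from the same speaker using the three steps above."""
    # Step 1: put BEL in front of the first line of each run of lines
    # that start with the same speaker name.
    marked = re.sub(
        r"^([^:\n]+)(?:.*\n\1)*",
        lambda m: BEL + m.group(0),
        text,
        flags=re.MULTILINE,
    )
    # Step 2: for every line that does not start with BEL, drop the newline
    # before it and the "Speaker:" prefix, keeping only the rest of the line.
    joined = re.sub(r"\n[^\x07\n][^:\n]*:(.*)", r"\1", marked)
    # Step 3: remove the BEL markers again.
    return re.sub(r"^\x07", "", joined, flags=re.MULTILINE)


sample = (
    "Dr Worthen: blah1 blah2\n"
    "Dr Worthen: blah3 blah4\n"
    "John Doe: zjk1\n"
    "John Doe: zjk2 zjk3\n"
    "Dr Worthen: nvbm\n"
)
print(merge_with_bel_markers(sample))
```

Files saved with Windows line endings would need the `\r` characters handled first, for example by opening the file in text mode so Python translates them.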
-
@Mark-Olson Wow! That worked like a charm. I really appreciate you explaining it all so clearly. Thank you! ~ Mark P.S. I will upvote your post as soon as I accumulate the needed 2 credits.
-
Hello, @mark-d-worthen-psyd, @mark-olson and All,
@mark-d-worthen-psyd, I didn’t answer earlier, as we were travelling back home. Holiday time is over :-(
Just for information, I found a method that needs only two regex S/R!
Let’s take @mark-olson’s INPUT text again:
Dr Worthen: blah1 blah2
Dr Worthen: blah3 blah4
Dr Worthen: blah5
Dr Worthen: blah6 blah7 blah8
John Doe: zjk1
John Doe: zjk2 zjk3 zjk4 zjk5
John Doe: zjk6 pouy1 zjk8
John Doe: pouy2
Bob Quzenheim: vbg
Dr Worthen: nvbm
Bob Quzenheim: jrrke
Bob Quzenheim: bnbnm
With this first S/R, we simply add a line-break after each block of consecutive lines that begin with the same string before the colon:
SEARCH
(?-s)^(.+:).+\R(\1.+\R)*
REPLACE
$0\r\n
Giving this temporary text :
Dr Worthen: blah1 blah2
Dr Worthen: blah3 blah4
Dr Worthen: blah5
Dr Worthen: blah6 blah7 blah8

John Doe: zjk1
John Doe: zjk2 zjk3 zjk4 zjk5
John Doe: zjk6 pouy1 zjk8
John Doe: pouy2

Bob Quzenheim: vbg

Dr Worthen: nvbm

Bob Quzenheim: jrrke
Bob Quzenheim: bnbnm

Then, with the second S/R:
- It removes a line-break and the following text up to a colon, if that line-break does not begin the current line
- It removes any purely empty line, as well
SEARCH
(?<=.)\R.+:|^\R
OR
(?<!^)\R.+:|^\R
REPLACE
Leave EMPTY
And we get our expected OUTPUT text :
Dr Worthen: blah1 blah2 blah3 blah4 blah5 blah6 blah7 blah8
John Doe: zjk1 zjk2 zjk3 zjk4 zjk5 zjk6 pouy1 zjk8 pouy2
Bob Quzenheim: vbg
Dr Worthen: nvbm
Bob Quzenheim: jrrke bnbnm
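As a rough cross-check of the two S/R above, here is a small Python sketch of the same idea. It is only an approximation of the Notepad++ behaviour: it assumes `\n` line endings (Python's `re` has no `\R`), it assumes the text ends with a line-break so the last block is matched too, and the function name `merge_with_blank_line_separators` is invented for this example.

```python
import re


def merge_with_blank_line_separators(text: str) -> str:
    """Merge consecutive same-speaker lines using the two-step approach above."""
    # First S/R: append an empty line after each block of consecutive lines
    # that begin with the same "Speaker:" string.
    separated = re.sub(
        r"^(.+:).+\n(\1.+\n)*",
        r"\g<0>\n",
        text,
        flags=re.MULTILINE,
    )
    # Second S/R: delete a line-break plus the next "Speaker:" prefix when the
    # line-break follows a character, and delete the purely empty lines.
    return re.sub(r"(?<=.)\n.+:|^\n", "", separated, flags=re.MULTILINE)


sample = (
    "Bob Quzenheim: vbg\n"
    "Dr Worthen: nvbm\n"
    "Bob Quzenheim: jrrke\n"
    "Bob Quzenheim: bnbnm\n"
)
print(merge_with_blank_line_separators(sample))
```

Like the original search, `(.+:)` is greedy, so a line whose spoken text itself contains a colon would be treated differently. The sketch keeps that behaviour rather than changing it.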
Best Regards,
guy038