Eliminate excess verbiage from audio transcript
-
Fellow Notepad++ Users,
I want to eliminate excess verbiage from Zoom audio transcripts. For unknown reasons, Zoom tends to divide a single sentence into three or four lines, each preceded by the name of the speaker.
So that you know I did my homework, I browsed and searched the Notepad++ User Manual and I searched this Community.
My technical know-how is limited. I will therefore not be offended if you explain it to me like I’m 5. ;-)
Here is the data I currently have (“before” data):
Dr Worthen: blah blah blah
Dr Worthen: blah blah blah
Dr Worthen: blah blah blah
Dr Worthen: blah blah blah
John Doe: blah blah blah
John Doe: blah blah blah
John Doe: blah blah blah
John Doe: blah blah blah

I would like to transform that data into the following (“after” data):
Dr Worthen: blah blah blah blah blah blah blah blah blah
John Doe: blah blah blah blah blah blah blah blah blah

Many thanks,
Mark
-
Fortunately, this can be done in Notepad++ without plugins, using only the find/replace form. The regular expressions required aren’t the simplest, but they’re also not incredibly hard to understand. You can read about regular expressions in Notepad++ here.
Before you start, open the Replace tab of the find/replace form (Search->Replace... from the main menu, Ctrl+H using default keybindings).
Next, make sure that the Wrap around box is checked, and select the Regular expression circle in the Search Mode field.
Finally, when I say Replace all X with Y, I mean the following:
- Put X in the Find what field
- Put Y in the Replace with field
- Select Replace All
Suppose you start with this file:
Dr Worthen: blah1 blah2
Dr Worthen: blah3 blah4
Dr Worthen: blah5
Dr Worthen: blah6 blah7 blah8
John Doe: zjk1
John Doe: zjk2 zjk3 zjk4 zjk5
John Doe: zjk6 pouy1 zjk8
John Doe: pouy2
Bob Quzenheim: vbg
Dr Worthen: nvbm
Bob Quzenheim: jrrke
Bob Quzenheim: bnbnm

1. Replace all (?-s)^([^:]+)(?:.*\R\1)* with \x07${0}
   - This will add the BEL character (ASCII code 7) at the beginning of the first line of each sequence of lines with the same speaker.
   - I chose the BEL character because it does not naturally occur in normal text documents.
2. Replace all (?-s)\R[^\x07\r\n][^:\r\n]*:(.*) with ${1}
   - This will find all lines that don’t start with BEL (so all lines except the speaker’s first in that chunk), and remove the newline before that line and the name of the speaker.
3. Replace all ^\x07 with nothing (leave the Replace with box empty)
   - This will remove the leading BEL characters we used as markers.
You will be left with the following:
Dr Worthen: blah1 blah2 blah3 blah4 blah5 blah6 blah7 blah8
John Doe: zjk1 zjk2 zjk3 zjk4 zjk5 zjk6 pouy1 zjk8 pouy2
Bob Quzenheim: vbg
Dr Worthen: nvbm
Bob Quzenheim: jrrke bnbnm

If you start recording a macro (Macro->Start Recording) immediately before step 1 and stop recording (Macro->Stop Recording) immediately after step 3, then save the macro (Macro->Save Current Recorded Macro), you can easily repeat this sequence of find replaces whenever you want.
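For anyone who would rather script this outside Notepad++, the three steps above can be sketched in Python. This is only a rough port under stated assumptions, not something Notepad++ itself runs: Python's re module has no \R, so the sketch assumes LF line endings and uses \n instead; [^:\n] replaces [^:] so a speaker name cannot span lines; and the names merge_speaker_lines and SAMPLE are made up for illustration.

```python
import re

BEL = "\x07"  # same marker character as in the Notepad++ steps

# Hypothetical sample data: the input file from the post, with LF line endings.
SAMPLE = "\n".join([
    "Dr Worthen: blah1 blah2",
    "Dr Worthen: blah3 blah4",
    "Dr Worthen: blah5",
    "Dr Worthen: blah6 blah7 blah8",
    "John Doe: zjk1",
    "John Doe: zjk2 zjk3 zjk4 zjk5",
    "John Doe: zjk6 pouy1 zjk8",
    "John Doe: pouy2",
    "Bob Quzenheim: vbg",
    "Dr Worthen: nvbm",
    "Bob Quzenheim: jrrke",
    "Bob Quzenheim: bnbnm",
])


def merge_speaker_lines(text: str) -> str:
    """Merge consecutive lines by the same speaker (assumes LF line endings)."""
    # Step 1: put BEL in front of the first line of each run of lines
    # that start with the same speaker name.
    text = re.sub(r"^([^:\n]+)(?:.*\n\1)*", BEL + r"\g<0>", text, flags=re.M)
    # Step 2: for every line that does not start with BEL, drop the
    # preceding newline and the "Speaker:" prefix, keeping the rest.
    text = re.sub(r"\n[^\x07\n][^:\n]*:(.*)", r"\1", text)
    # Step 3: remove the BEL markers again.
    return re.sub(r"^\x07", "", text, flags=re.M)


print(merge_speaker_lines(SAMPLE))
```

Files saved with Windows (CRLF) line endings would need the \n parts adjusted, for example to \r?\n, before this sketch applies.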
-
@Mark-Olson Wow! That worked like a charm. I really appreciate you explaining it all so clearly. Thank you!
~ Mark
P.S. I will upvote your post as soon as I accumulate the needed 2 credits.
-
Hello, @mark-d-worthen-psyd, @mark-olson and All,
@mark-d-worthen-psyd, I didn’t answer before, as we were travelling back home. Holiday time is over :-(
Just for information, I found a method which needs only two regex S/R!
Let’s take @mark-olson’s INPUT text again:
Dr Worthen: blah1 blah2
Dr Worthen: blah3 blah4
Dr Worthen: blah5
Dr Worthen: blah6 blah7 blah8
John Doe: zjk1
John Doe: zjk2 zjk3 zjk4 zjk5
John Doe: zjk6 pouy1 zjk8
John Doe: pouy2
Bob Quzenheim: vbg
Dr Worthen: nvbm
Bob Quzenheim: jrrke
Bob Quzenheim: bnbnm

With this first S/R, we simply add a line-break after each block of lines beginning with the same string before the colon:
SEARCH (?-s)^(.+:).+\R(\1.+\R)*
REPLACE $0\r\n

Giving this temporary text:
Dr Worthen: blah1 blah2
Dr Worthen: blah3 blah4
Dr Worthen: blah5
Dr Worthen: blah6 blah7 blah8

John Doe: zjk1
John Doe: zjk2 zjk3 zjk4 zjk5
John Doe: zjk6 pouy1 zjk8
John Doe: pouy2

Bob Quzenheim: vbg

Dr Worthen: nvbm

Bob Quzenheim: jrrke
Bob Quzenheim: bnbnm

Then, with the second S/R:
- It removes the line-break and the following text up to a colon, if this line-break does not begin the current line
- It removes any pure empty line, as well
SEARCH (?<=.)\R.+:|^\R    OR    (?<!^)\R.+:|^\R
REPLACE Leave EMPTY

And we get our expected OUTPUT text:
Dr Worthen: blah1 blah2 blah3 blah4 blah5 blah6 blah7 blah8
John Doe: zjk1 zjk2 zjk3 zjk4 zjk5 zjk6 pouy1 zjk8 pouy2
Bob Quzenheim: vbg
Dr Worthen: nvbm
Bob Quzenheim: jrrke bnbnm

Best Regards,
guy038
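As a cross-check outside Notepad++, the two S/R above can be sketched in Python. This is a rough port under assumptions, not the exact Notepad++ behaviour: Python's re module has no \R, so \n is used and LF line endings are assumed (the replacement therefore adds \n rather than \r\n), and the names merge_two_pass and SAMPLE are hypothetical. In this port the text has to end with a line-break; otherwise the first pattern, which requires a line-break after every line of a block, cannot include the very last line in its block and that line stays unmerged.

```python
import re

# Hypothetical sample data: the INPUT text with LF line endings and a
# trailing line-break after the last line.
SAMPLE = "\n".join([
    "Dr Worthen: blah1 blah2",
    "Dr Worthen: blah3 blah4",
    "Dr Worthen: blah5",
    "Dr Worthen: blah6 blah7 blah8",
    "John Doe: zjk1",
    "John Doe: zjk2 zjk3 zjk4 zjk5",
    "John Doe: zjk6 pouy1 zjk8",
    "John Doe: pouy2",
    "Bob Quzenheim: vbg",
    "Dr Worthen: nvbm",
    "Bob Quzenheim: jrrke",
    "Bob Quzenheim: bnbnm",
]) + "\n"


def merge_two_pass(text: str) -> str:
    """Two-pass merge of same-speaker lines (assumes LF endings, trailing newline)."""
    # First S/R: add an empty line after each block of consecutive lines
    # that begin with the same "Speaker:" string.
    text = re.sub(r"^(.+:).+\n(\1.+\n)*", r"\g<0>\n", text, flags=re.M)
    # Second S/R: remove a line-break plus the following "Speaker:" prefix
    # when the line-break ends a non-empty line and is followed by text
    # containing a colon, and remove the pure empty lines added by the first pass.
    return re.sub(r"(?<=.)\n.+:|^\n", "", text, flags=re.M)


print(merge_two_pass(SAMPLE), end="")
```

Since .+: is greedy, a line whose spoken text itself contains a colon (a time such as 10:30, say) would lose everything up to the last colon in the second pass, so this sketch is best limited to transcripts where the only colon on a line follows the speaker name.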