Fix: VTT to TXT (text) from youtube-dl subtitles captions - duplicate lines problem
-
You can record this solution as a macro. Problem: if you run youtube-dl to fetch subtitles, you get a VTT file when no other format is available. As many know, this VTT (WebVTT) file from YouTube comes with duplicate lines, surrounded by HTML-style tags in angle brackets. I searched for fixes, but most involved a Python script and didn’t allow for batching, at least not easily. This solution lets you open the file, run the macro, then read and save the result.
-
Open the .vtt file, select all, then find and replace all using the regex “<.?>|</.?>”, replacing with nothing.
-
Then Edit > Line Operations > Remove Empty Lines (Containing Blank characters)
-
With everything still selected, Edit > Line Operations > Sort Lines As Decimals (Ascending)
-
Finally, Edit > Line Operations > Remove Consecutive Duplicate Lines
Note: This may leave behind lines that should have been excluded, namely those that start with dates, but these will be listed at the end of the file.
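For anyone who would rather script it, the same four steps can be roughly sketched in Python. This is only an approximation under stated assumptions: the tag pattern is assumed to be `<.*?>`, a plain string sort stands in for the Notepad++ decimal sort, and the function name `clean_vtt_lines` is made up for illustration.

```python
import re
import sys
from pathlib import Path

# Assumed tag pattern: anything between a "<" and the next ">".
TAG = re.compile(r"<.*?>")


def clean_vtt_lines(text: str) -> list[str]:
    """Approximate the four editor steps on the contents of one .vtt file."""
    # Step 1: replace every tag with nothing.
    stripped = [TAG.sub("", line) for line in text.splitlines()]
    # Step 2: drop lines that are empty or contain only blank characters.
    non_empty = [line for line in stripped if line.strip()]
    # Step 3: sort so identical lines end up next to each other.
    # Assumption: plain string sort, not the editor's decimal sort.
    ordered = sorted(non_empty)
    # Step 4: remove consecutive duplicate lines.
    result: list[str] = []
    for line in ordered:
        if not result or result[-1] != line:
            result.append(line)
    return result


if __name__ == "__main__":
    # Batch use: pass any number of .vtt paths on the command line.
    for name in sys.argv[1:]:
        src = Path(name)
        lines = clean_vtt_lines(src.read_text(encoding="utf-8"))
        src.with_suffix(".txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
```

As with the editor steps, the sort means the output is no longer in spoken order. It only guarantees that duplicates sit together so they can be removed.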
Fix for review. -
-
Thanks for sharing that.
I have a feeling your regex got munged by the forum, because `<.?>|</.?>` only looks for 0-or-1 character between the angle brackets; I assume the `.?` were really `.*?`.
For future reference, in this forum you can put ` marks around text (like a regex) to format it in red like my examples and to make sure the forum doesn’t munge the characters. Typing
`<.*?>|</.*?>`
will render as <.*?>|</.*?>, so that we can see the actual regex as you intended. It also saves us from wondering whether the “smart quotes” were part of the regex or were just used to separate the regex from the flow of your text.
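To illustrate why the missing asterisks matter, here is a small sketch in Python rather than Notepad++. The caption line is made up, and Python's `re` module is only assumed to behave like the Notepad++ regex engine for these two simple patterns.

```python
import re

# Made-up caption line in the style of YouTube's auto-generated VTT output.
sample = "<c>hello</c><00:00:01.500><c> world</c>"

munged = re.compile(r"<.?>|</.?>")      # pattern as it appeared in the post
intended = re.compile(r"<.*?>|</.*?>")  # pattern with the asterisks restored

# The munged pattern only removes tags with at most one character inside,
# so the inline timestamp tag survives.
print(munged.sub("", sample))    # hello<00:00:01.500> world

# The restored pattern removes every tag, whatever its length.
print(intended.sub("", sample))  # hello world
```

Since `<.*?>` already matches closing tags such as `</c>`, the second alternative is redundant, though harmless.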