Fix: VTT to TXT (text) from youtube-dl subtitles captions - duplicate lines problem

    PorAhiViene Pepe
    last edited by Mar 4, 2021, 10:18 PM

    You can record this solution as a macro. Problem: if you run youtube-dl for subtitles, you get a VTT file when no other format is available. As many know, this VTT (WebVTT) file from YouTube comes with duplicate lines, with markup tags in angle brackets scattered through the text. I searched for fixes, but most involved a Python script and didn't allow for batching, at least not easily. This solution lets you open the file, run the macro, then read and save the result.

    1. Open the .vtt file, select all, then find and replace all in regex mode: find “<.?>|</.?>”, replace with nothing.

    2. Then Edit > Line Operations > Remove Empty Lines (Containing Blank characters).

    3. With everything still selected, Edit > Line Operations > Sort Lines As Decimals (Ascending).

    4. Finally, Edit > Line Operations > Remove Consecutive Duplicate Lines.

    Note: This may leave behind some unwanted lines that start with dates, but these will be grouped at the end of the file.
    I am posting this fix for review.
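For readers who would rather script the cleanup, the same idea can be sketched in Python. This is a minimal sketch, not the macro above: it swaps the sort step for directly skipping cue-timing lines, so the caption order is preserved. The function name `vtt_to_text` and both pattern names are hypothetical, the tag pattern is the one suggested in the reply to this post, and the timing pattern assumes cue lines look like `00:00:01.000 --> 00:00:03.000`. Cue identifiers and NOTE blocks are not handled.

```python
import re

# Assumed tag pattern; matches opening and closing tags alike.
TAG_RE = re.compile(r"<.*?>")
# Assumed shape of a cue-timing line: [hh:]mm:ss.ttt followed by "-->".
TIMING_RE = re.compile(r"^\d{2,}:\d{2}(?::\d{2})?\.\d{3}\s+-->")


def vtt_to_text(vtt: str) -> str:
    """Strip tags, timing lines, and consecutive duplicates from VTT text."""
    kept = []
    seen_cue = False
    for raw in vtt.splitlines():
        if TIMING_RE.match(raw):
            seen_cue = True          # everything before the first cue is header
            continue
        if not seen_cue:
            continue                 # skip header lines such as "WEBVTT"
        line = TAG_RE.sub("", raw).strip()
        if not line:
            continue                 # drop empty or whitespace-only lines
        if kept and kept[-1] == line:
            continue                 # drop consecutive duplicate lines
        kept.append(line)
    return "\n".join(kept)
```

Because only consecutive duplicates are removed, a line that legitimately repeats later in the transcript is kept.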

      PeterJones @PorAhiViene Pepe
      last edited by Mar 4, 2021, 10:38 PM

      @PorAhiViene-Pepe ,

      Thanks for sharing that.

      I have a feeling your regex got munged by the forum, because <.?>|</.?> only looks for 0-or-1-character between the angle-brackets; I assume the . were really .*?

      For future reference, in this forum, you can put ` marks around text (like regex) to format it red like my examples and to make sure the forum doesn’t munge the characters. `<.*?>|</.*?>` will render as <.*?>|</.*?> , so that we can see the actual regex as you intended (and that has the benefit that nobody has to wonder whether the “smart quotes” were part of the regex, or just used to separate the regex from the flow of your text).
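The difference between the two patterns can be checked with a short Python sketch. The sample line is an illustrative assumption about what a tagged caption line looks like, not text taken from a real file.

```python
import re

munged = re.compile(r"<.?>|</.?>")      # 0 or 1 character between the brackets
intended = re.compile(r"<.*?>|</.*?>")  # any run of characters, non-greedy

# Hypothetical caption line with an inline timestamp tag and a <c> tag pair.
sample = "hello<00:00:00.500><c> world</c>"

print(munged.sub("", sample))    # hello<00:00:00.500> world
print(intended.sub("", sample))  # hello world
```

The munged pattern removes only the one-letter tags and leaves the longer timestamp tag in place. In the intended pattern the second alternative is redundant, since `<.*?>` already matches closing tags.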

      The Community of users of the Notepad++ text editor.