Community
    • Login

    Captions for video - Find and Replace across time stamps

    Scheduled Pinned Locked Moved Help wanted · · · – – – · · ·
    30 Posts 5 Posters 1.8k Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • MaximillianMM
      MaximillianM
      last edited by

      Hi, I’m performing a find and replace to correct incorrect word phrases that were auto captioned by YouTube.
      My challenge is when the phrase is split over the time stamp.

      Is there a way to search and then replace across the time stamp - perhaps using specific wildcards?

      For example, in the lines below I would like to search for the phrase “like a fire” but since “fire” is 3 lines down it isn’t found.

      0:00:17.680,0:00:20.400
      vaporize like a

      0:00:19.840,0:00:22.400
      fire

      Thanks for your help.

      Alan KilbornA PeterJonesP 2 Replies Last reply Reply Quote 0
      • Alan KilbornA
        Alan Kilborn @MaximillianM
        last edited by

        @MaximillianM

        You might try this to get a feel for what it can do:

        find: (?s)like a.*?fire
        search mode: Regular expression

        1 Reply Last reply Reply Quote 2
        • PeterJonesP
          PeterJones @MaximillianM
          last edited by

          @MaximillianM ,

          @Alan-Kilborn’s is a good starting point. But in edge cases, it might not do what you want:
          6f4f1bf2-2d3b-4675-8fe1-53a6b8dfc638-image.png

          I am going to use your example to show a procedure for “how to search for a phrase when separated by multiple types of separator sequences”, and in so developing, hopefully give you (and other readers) a good way to think about such problems.

          First, I start thinking about the simplest case: if I were searching for a phrase like a fire, I would just type that into the search field, and as long as the phrase was all on one physical line, that would be sufficient.

          To move on to the next generalization (allowing multiline match), I am going to introduce a few important regex concepts (the links point to the section in the manual that describes that feature).

          1. Instead of a space character, I will use \x20 (SPACE is ASCII 32, which is hex 0x20; \x20 is the way our regex engine represents that arbitrary character). This will make it easier to use a space in the regex, especially if it needs to be at the beginning or end of the regex

          2. The control character sequence \R will match any newline sequence (colloquially known as CR, LF, or CRLF).

          3. Using the search modifier syntax (?x-s), I set two options in the regex: (1) this enables the x option, which means if I have normal space characters, they will be ignored; this then requires using notation like \x20 for a literal space… but it also makes it easier to see the regex; (2) the -s portion of the option means that the . sequence will not match newline sequences (the same as turning off . matches newline; Alan’s post enabled . matches newline using (?s))

          4. If I want to “reuse” a regex to match something that meets the same sub-regex, I can use the subexpression control flow: so if I have a subexpression in parentheses earlier, I can reuse that same expression later using (?ℕ), where ℕ is the number of the subexpression.

          5. You can use | inside an expression or subexpression to be the logical “OR”.

          Back to the problem at hand. For the first generalization, I want to switch to using some of that notation, but still only using literal spaces as the separator between the words of a phrase:

          • (?x-s) like (\x20) a (?1) fire

          This uses the option to allow extra spaces to make it more readable, uses the \x20 for the literal space, and the (?1) to refere to the regex – so anyplace I use (?1), it is exactly equivalent to (\x20) . This expression still only matches the same as the original. A lot of extra work up front, but…

          Now we want to allow matching over multiple lines , so matching

          like a fire
          

          or

          like
          a
          fire
          

          To do this, we want to match the space \x20 OR a newline sequence \R.

          • (?x-s) like (\x20 | \R) a (?1) fire

          Now the separator expression allows a space or a newline sequence, and the (?1) is exactly equivalent to that. (At this point, some might have said “just use \s to match any horizontal or vertical whitespace”… but that won’t allow us to be as generic in the next step… and maybe you don’t want a TAB to be allowed between, only spaces or newlines)

          The next idea is to allow a timestamp line. You can specify the timestamp very loosely or very tightly, but I am going to say that a timestamp looks like zero or more newlines followed by one or two digits, followed by a colon, followed by anything else on the main timestamp line (including its newline). If you want to be more specific, you can add extra rules to it… making sure it only allows full timestamps with hms and/or the range of times, or whatever else distinguishes it from a similar line that shouldn’t be allowed. The expression \R* \d{1,2} : .*\R matches those three requirements. Adding it into our full expression, we want to say “the separator can be a space, a newline, or a timestamp”:

          • (?x-s) like (\x20 | \R | \R* \d{1,2} : .*\R) a (?1) fire

          If you’ve got a phrase with more words, just stick another (?1) between each word.

          Example data I used:

          0:00:17.680,0:00:20.400
          vaporize like a
          
          0:00:19.840,0:00:22.400
          fire
          
          0:00:17.680,0:00:20.400
          vaporize like a
          
          0:00:19.840,0:00:22.400
          phaser. He made fire instead.
          
          0:00:23.000,0:00:23.400
          like
          
          0:00:23.400,0:00:23.800
          a
          
          0:00:23.800,0:00:24.200
          fire
          
          0:00:34.500,0:00:67.800
          like
          a
          fire
          

          Hopefully this helps with the immediate problem, and with being able to generalize it in the future (do similar searches, but with different separators, or more separators, or more restrictive on one of the rules, or what have you).

          Good luck.

          ----

          Do you want regex search/replace help? Then please be patient and polite, show some effort, and be willing to learn; answer questions and requests for clarification that are made of you. All example text should be marked as literal text using the </> toolbar button or manual Markdown syntax. To make regex in red (and so they keep their special characters like *), use backticks, like `^.*?blah.*?\z`. Screenshots can be pasted from the clipboard to your post using Ctrl+V to show graphical items, but any text should be included as literal text in your post so we can easily copy/paste your data. Show the data you have and the text you want to get from that data; include examples of things that should match and be transformed, and things that don’t match and should be left alone; show edge cases and make sure you examples are as varied as your real data. Show the regex you already tried, and why you thought it should work; tell us what’s wrong with what you do get. Read the official NPP Searching / Regex docs and the forum’s Regular Expression FAQ. If you follow these guidelines, you’re much more likely to get helpful replies that solve your problem in the shortest number of tries.

          PeterJonesP 1 Reply Last reply Reply Quote 1
          • PeterJonesP
            PeterJones @PeterJones
            last edited by

            Sorry, I meant to include a screenshot of the last regex working on the example data.

            1174f2a0-3065-4b98-a7fe-8f88c57a5ee6-image.png

            MaximillianMM 1 Reply Last reply Reply Quote 2
            • MaximillianMM
              MaximillianM @PeterJones
              last edited by

              @PeterJones Wow! Thank you so much for the detailed explanation. No wonder I couldn’t figure it out. You really know your stuff and are amazing! :-)

              One follow-up question, Is there any way to adjust this to a find/replace that wouldn’t delete the time stamp across lines? So in the example.

              0:00:17.680,0:00:20.400
              vaporize like a

              0:00:19.840,0:00:22.400
              fire

              After finding the “like a fire” across multiple lines the replace would blank out the end of the first line and change the second line fire to liquefy.

              Desired result

              0:00:17.680,0:00:20.400
              vaporize

              0:00:19.840,0:00:22.400
              liquefy

              Thanks so much!

              astrosofistaA 1 Reply Last reply Reply Quote 0
              • astrosofistaA
                astrosofista @MaximillianM
                last edited by

                @MaximillianM, @PeterJones

                One solution could be as simple as adding a space before “like a” and grouping the second time stamp in @PeterJones’ regex. Also, in the replacement expression insert a reference to recently created group and the new string. As follows:

                Search: (?x-s) \x20like (\x20 | \R | \R* \d{1,2} : .*\R) a ((?1)) fire
                Replace: $2liquefy
                

                Hope this helps

                Alan KilbornA 1 Reply Last reply Reply Quote 1
                • Alan KilbornA
                  Alan Kilborn @astrosofista
                  last edited by

                  I think what the OP is looking for here is basically a way to search for a phrase which could have a timestamp inserted into it at any word position in the phrase, i.e., it won’t occur that the timestamp is inserted into the middle of the word.

                  Then the OP wants the ability to replace with the same sort of criterion.
                  And it appears that the replacement can wholly occur after the timestamp (the portion of the search match before the timestamp is removed) and things still work.

                  I think what might be best here is a script; the script would still use regular expressions, but it would allow the user to very naturally search for a simple phrase (user specifies “like a fire”) and replace it with a phrase (user specifies “liquify”) and the script takes care of the rest.

                  1 Reply Last reply Reply Quote 0
                  • guy038G
                    guy038
                    last edited by guy038

                    Hello, @MaximillianM, @peterjones, @alan-kilborn, @astrosofista and All,

                    Let’s expand the problem a bit and imagine this text :

                    0:00:17.680,0:00:20.400
                    The licenses for most software are designed to
                    
                    0:00:19.840,0:00:22.400
                    take away your freedom to share and change it.
                    

                    And that the OP want to :

                    • Find the string designed to take away

                    • Replace it with the string generally made to always suppress

                    Right now, the timestamp is located between the two strings designed to and take away

                    Now, let’s use the | symbol to represent the timestamp feature

                    Which replacement is expected by @MaximillianM ?

                    designed to | take away => generally | made to always suppress    ( Case A )
                    designed to | take away => generally made | to always suppress    ( Case B )
                    designed to | take away => generally made to | always suppress    ( Case C )
                    designed to | take away => generally made to always | suppress    ( Case D )
                    

                    To my mind, the more logical one would be Case B, as, before the timestamp, there would be as many words after the replacement than before the replacement !

                    …

                    Best Regards,

                    guy038

                    Alan KilbornA 1 Reply Last reply Reply Quote 0
                    • Alan KilbornA
                      Alan Kilborn @guy038
                      last edited by

                      @guy038 said in Captions for video - Find and Replace across time stamps:

                      To my mind, the more logical one would be Case B

                      But we really don’t know what the OP’s needs in this regard are.
                      I mean, well, if you want to have fun, go off and solve any problem you’d like. :-)

                      For me, I’ll wait to see if OP returns again, clarifies need, and expresses interest in the scripted solution.

                      The scripted solution I envision would prompt with an input box for the string to search for, then would prompt with an input box for the replacement. The script would run and use some variant of Peter’s solution behind the scenes.

                      1 Reply Last reply Reply Quote 0
                      • MaximillianMM
                        MaximillianM
                        last edited by

                        Thank you all so much! You are Great! This really helps.

                        I tested replacing the text after the second timestamp using $2 which works. Also adding it after the second timestamp probably is the best default option. The only real “error” could be if the replacement made it so that one line of text was completely blank.

                        I hadn’t known about the script function and have just used recorded macros. Macros do work, but once created macros can be but are not that easy to edit.

                        Since you have been so helpful. I’ll outline the full desired outcome which might be a script.

                        1-Find and replace a list of words. Words might be added or removed to this list in the future. My current list is about 90 words.

                        Most are just simple words so a 1 to 1 on the same line works.
                        For example “cuz” replace with “because” or “um” replaced with a blank. Ideally, this list could be seen and edited

                        WordToFind, Word to Replace
                        cuz, because
                        um, [blank]

                        2-Then phrases that could go across lines. The regular expressions ones.
                        Could be in the same list/table as above with the full code or a different list/table.
                        WordToFind-----------Word to Replace
                        (?x-s) \x20like (\x20 | \R | \R* \d{1,2} : .*\R) a ((?1)) fire-----------------$2liquefy

                        or since you really know your stuff. Maybe fancy and takes the phrase
                        “like a fire” and creates the code (?x-s) \x20like (\x20 | \R | \R* \d{1,2} : .*\R) a ((?1)) fire

                        Thanks again! You have really been helping me out :-)

                        Alan KilbornA 1 Reply Last reply Reply Quote 2
                        • Alan KilbornA
                          Alan Kilborn @MaximillianM
                          last edited by

                          @MaximillianM said in Captions for video - Find and Replace across time stamps:

                          “like a fire” and creates the code (?x-s) \x20like (\x20 | \R | \R* \d{1,2} : .*\R) a ((?1)) fire

                          Yes, something like that was what I had in mind.

                          Would your search “phrase” ever have TWO+ embedded timestamps in it? Or is that a case that we don’t need to consider…?

                          Let me absorb your specs and see if I can put together a reasonable demo.

                          MaximillianMM 1 Reply Last reply Reply Quote 1
                          • MaximillianMM
                            MaximillianM @Alan Kilborn
                            last edited by

                            @Alan-Kilborn Thanks! Across two timestamps is a possibility and would be great to consider for some potential longer phrases. You keep exceeding my expectations :-) Though if it is too complicated, across one is still really good.

                            1 Reply Last reply Reply Quote 0
                            • Alan KilbornA
                              Alan Kilborn
                              last edited by

                              @MaximillianM

                              So let’s start simple and slowly. :-)

                              Here’s a test PythonScript that can demo the functionality:

                              # -*- coding: utf-8 -*-
                              
                              from Npp import editor, notepad
                              
                              class T1(object):
                              
                                  def __init__(self):
                                      search_phrase = 'like a fire'
                                      while True:
                                          search_phrase = notepad.prompt('\r\nEnter search phrase and press OK to find next:', '', search_phrase)
                                          if search_phrase == None or len(search_phrase) == 0: return  # quit
                                          word_list = search_phrase.strip().split()
                                          regex = r'(?-is)(?(DEFINE)(\x20|\R|\R*\d{1,2}:.*\R))' + '(?1)'.join(word_list)
                                          matches = []
                                          editor.research(regex, lambda m: matches.append(m.span(0)), 0, editor.getCurrentPos(), editor.getLength(), 1)
                                          if len(matches) == 0:
                                              notepad.messageBox('No (more) matches', '')
                                              return
                                          else:
                                              (match_start, match_end) = matches[0]
                                              editor.scrollRange(match_end, match_start)
                                              editor.setSelection(match_end, match_start)
                              
                              if __name__ == '__main__': T1()
                              

                              When you run it, it will prompt you for your search phrase, like this:

                              Imgur

                              When you enter it and press OK, you will be shown your first match (it will become selected) and it will prompt again:

                              Imgur

                              It will continue moving downward in a file, until no more matches occur or you press Cancel.

                              Peter has some good instructions for getting started with the PythonScript plugin HERE.
                              See if you can get what I’ve shown working for you, and then we’ll talk about where to take it from here.

                              MaximillianMM 2 Replies Last reply Reply Quote 2
                              • MaximillianMM
                                MaximillianM @Alan Kilborn
                                last edited by

                                @Alan-Kilborn Thanks, yes, I was able to get it to work!

                                1 Reply Last reply Reply Quote 0
                                • MaximillianMM
                                  MaximillianM @Alan Kilborn
                                  last edited by

                                  @Alan-Kilborn Hi Alan, Thanks for your help on this project can you point me in the right direction for the next step of adding code to use a preset list of words to find and replace? Thanks again :-)

                                  Alan KilbornA 1 Reply Last reply Reply Quote 1
                                  • Alan KilbornA
                                    Alan Kilborn @MaximillianM
                                    last edited by

                                    @MaximillianM said in Captions for video - Find and Replace across time stamps:

                                    use a preset list of words to find and replace

                                    Sure; again let’s start small…

                                    Let’s just put the list inside the code, but maybe make it look like it is coming from a file (because you may want to go there, later).

                                    So I’d suggest a list like this:

                                    ::findable_you:replaceable_you
                                    :I can contain spaces:So I see
                                    

                                    where the format is:

                                    • one-character delimiter
                                    • find word/phrase
                                    • delimiter (as previously defined by first character)
                                    • replace word/phrase

                                    In Python we might do it like this:

                                    the_list = [
                                        ':findable_you:replaceable_you',
                                        ':I can contain spaces:So I see',
                                        ':look_for_me:really_want_to_be_you',
                                        '!simple!complex',
                                        '$fire$liquify',
                                    ]
                                    

                                    We need some code to process that list:

                                    # -*- coding: utf-8 -*-
                                    
                                    from Npp import editor
                                    
                                    class T2(object):
                                    
                                        def __init__(self):
                                    
                                            the_list = [
                                                ':findable_you:replaceable_you',
                                                ':I can contain spaces:So I see',
                                                ':look_for_me:really_want_to_be_you',
                                                '!simple!complex',
                                                '$fire$liquify',
                                            ]
                                    
                                            editor.beginUndoAction()
                                            for definition in the_list:
                                                delim = definition[0]
                                                (find_what, repl_with) = definition[1:].split(delim, 2)
                                                editor.replace(find_what, repl_with)
                                            editor.endUndoAction()
                                    
                                    if __name__ == '__main__': T2()
                                    
                                    MaximillianMM 1 Reply Last reply Reply Quote 2
                                    • MaximillianMM
                                      MaximillianM @Alan Kilborn
                                      last edited by

                                      @Alan-Kilborn Thanks! Yes, I like the idea of making it a separate file for ease of updates.

                                      I experimented with the code.

                                      What is the difference between using the ! or $ for the word in the list? They seem to do a similar find/replace in the example, at least in my small test. But I’m probably missing something.

                                      The next step seems to be adding the search over multiple lines code as from your earlier example. How can I do that?

                                      Thanks!

                                      Alan KilbornA 2 Replies Last reply Reply Quote 0
                                      • Alan KilbornA
                                        Alan Kilborn @MaximillianM
                                        last edited by

                                        @MaximillianM said in Captions for video - Find and Replace across time stamps:

                                        What is the difference between using the ! or $ for the word in the list? They seem to do a similar find/replace in the example, at least in my small test. But I’m probably missing something.

                                        So the “delimiter” variability is useful if the data itself contains the delimiter.

                                        Say we hardcoded the delimiter to be a colon (:).
                                        Then if you wanted to replace something like a:b with c:d it would be difficult.

                                        The way I defined it, you could just use a different delimiter for this case, e.g. !a:b!c:d.

                                        1 Reply Last reply Reply Quote 3
                                        • Alan KilbornA
                                          Alan Kilborn @MaximillianM
                                          last edited by

                                          @MaximillianM said in Captions for video - Find and Replace across time stamps:

                                          The next step seems to be adding the search over multiple lines code as from your earlier example. How can I do that?

                                          This is the point where I’m having trouble envisioning how it would work.
                                          I know you said something about it before, but I didn’t quite understand it.

                                          Would you put a special symbol in the replacement part that you’d want the timestamp to be replaced by?
                                          Maybe a more in-depth walk-through (example(s)) of what is wanted?

                                          I’m certainly willing to do it, or at least help you get started…

                                          MaximillianMM 1 Reply Last reply Reply Quote 2
                                          • MaximillianMM
                                            MaximillianM @Alan Kilborn
                                            last edited by

                                            @Alan-Kilborn Thanks again. I see you are helping many other people so I should have put a summary in to help you :-)

                                            The problem
                                            1-Simple Find and replace with a list of words (your most recent code does this)
                                            2-Find and replace multi-word string that goes across a timestamp
                                            Find “like a fire” replace with $2liquefy

                                            0:00:17.680,0:00:20.400
                                            vaporize like a

                                            0:00:19.840,0:00:22.400
                                            fire

                                            I would like to combine the most recent code with this one (minus the manual entry box so it could use the list as in your most recent code) to search across the time stamp.

                                            -- coding: utf-8 --

                                            from Npp import editor, notepad

                                            class T1(object):

                                            def __init__(self):
                                                search_phrase = 'like a fire'
                                                while True:
                                                    search_phrase = notepad.prompt('\r\nEnter search phrase and press OK to find next:', '', search_phrase)
                                                    if search_phrase == None or len(search_phrase) == 0: return  # quit
                                                    word_list = search_phrase.strip().split()
                                                    regex = r'(?-is)(?(DEFINE)(\x20|\R|\R*\d{1,2}:.*\R))' + '(?1)'.join(word_list)
                                                    matches = []
                                                    editor.research(regex, lambda m: matches.append(m.span(0)), 0, editor.getCurrentPos(), editor.getLength(), 1)
                                                    if len(matches) == 0:
                                                        notepad.messageBox('No (more) matches', '')
                                                        return
                                                    else:
                                                        (match_start, match_end) = matches[0]
                                                        editor.scrollRange(match_end, match_start)
                                                        editor.setSelection(match_end, match_start)
                                            

                                            if name == ‘main’: T1()

                                            Thanks again :-)

                                            Alan KilbornA 1 Reply Last reply Reply Quote 0
                                            • First post
                                              Last post
                                            The Community of users of the Notepad++ text editor.
                                            Powered by NodeBB | Contributors