    Captions for video - Find and Replace across time stamps

      MaximillianM @Alan Kilborn

      @Alan-Kilborn Thanks! Yes, I like the idea of making it a separate file for ease of updates.

      I experimented with the code.

      What is the difference between using the ! or $ for the word in the list? They seem to do a similar find/replace in the example, at least in my small test. But I’m probably missing something.

      The next step seems to be adding the search over multiple lines code as from your earlier example. How can I do that?

      Thanks!

        Alan Kilborn @MaximillianM

        @MaximillianM said in Captions for video - Find and Replace across time stamps:

        What is the difference between using the ! or $ for the word in the list? They seem to do a similar find/replace in the example, at least in my small test. But I’m probably missing something.

        So the “delimiter” variability is useful if the data itself contains the delimiter.

        Say we hardcoded the delimiter to be a colon (:).
        Then if you wanted to replace something like a:b with c:d it would be difficult.

        The way I defined it, you could just use a different delimiter for this case, e.g. !a:b!c:d.
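A minimal sketch of the delimiter idea, written for plain Python rather than PythonScript so it can be tried outside Notepad++. The helper names `parse_definition` and `apply_definitions` are hypothetical and not part of the script discussed here; the only rule taken from the thread is that the first character of each entry is the delimiter.

```python
def parse_definition(definition):
    """Split '<delim>find<delim>replace' into (find, replace)."""
    delim = definition[0]  # the first character chooses the delimiter
    find_what, repl_with = definition[1:].split(delim, 1)
    return find_what, repl_with


def apply_definitions(text, definitions):
    """Apply each literal find/replace pair to text, in list order."""
    for definition in definitions:
        find_what, repl_with = parse_definition(definition)
        text = text.replace(find_what, repl_with)
    return text


pair = parse_definition('!a:b!c:d')
updated = apply_definitions('ratio a:b, simple', ['!a:b!c:d', ':simple:complex'])
print(pair)
print(updated)
```

Because the colon appears in the data of the first entry, that entry uses `!` as its delimiter, while the second entry can keep using `:`.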

          Alan Kilborn @MaximillianM

          @MaximillianM said in Captions for video - Find and Replace across time stamps:

          The next step seems to be adding the search over multiple lines code as from your earlier example. How can I do that?

          This is the point where I’m having trouble envisioning how it would work.
          I know you said something about it before, but I didn’t quite understand it.

          Would you put a special symbol in the replacement part that you’d want the timestamp to be replaced by?
          Maybe a more in-depth walk-through (example(s)) of what is wanted?

          I’m certainly willing to do it, or at least help you get started…

            MaximillianM @Alan Kilborn

            @Alan-Kilborn Thanks again. I see you are helping many other people so I should have put a summary in to help you :-)

            The problem
            1-Simple Find and replace with a list of words (your most recent code does this)
            2-Find and replace multi-word string that goes across a timestamp
            Find “like a fire” replace with $2liquefy

            0:00:17.680,0:00:20.400
            vaporize like a

            0:00:19.840,0:00:22.400
            fire

            I would like to combine the most recent code with this one (minus the manual entry box so it could use the list as in your most recent code) to search across the time stamp.

            # -*- coding: utf-8 -*-

            from Npp import editor, notepad

            class T1(object):

                def __init__(self):
                    search_phrase = 'like a fire'
                    while True:
                        search_phrase = notepad.prompt('\r\nEnter search phrase and press OK to find next:', '', search_phrase)
                        if search_phrase is None or len(search_phrase) == 0: return  # quit
                        word_list = search_phrase.strip().split()
                        # group 1 is only defined here; (?1) reuses it between each pair of words
                        regex = r'(?-is)(?(DEFINE)(\x20|\R|\R*\d{1,2}:.*\R))' + '(?1)'.join(word_list)
                        matches = []
                        # search from the caret to the end of the document, first match only
                        editor.research(regex, lambda m: matches.append(m.span(0)), 0, editor.getCurrentPos(), editor.getLength(), 1)
                        if len(matches) == 0:
                            notepad.messageBox('No (more) matches', '')
                            return
                        else:
                            (match_start, match_end) = matches[0]
                            editor.scrollRange(match_end, match_start)
                            editor.setSelection(match_end, match_start)

            if __name__ == '__main__': T1()

            Thanks again :-)

              Alan Kilborn @MaximillianM

              @MaximillianM said in Captions for video - Find and Replace across time stamps:

              Find “like a fire” replace with $2liquefy
              0:00:17.680,0:00:20.400
              vaporize like a
              0:00:19.840,0:00:22.400
              fire

              Yes, you used this example before, but I didn’t fully understand it.
              I take it the $2 represents the bridged timestamp, and where it would appear in the replacement text.

              BTW, why $2?
              Is it because a search match could possibly bridge two timestamps?
              And $1 might possibly appear in the replace expression as well?
              Or $3 etc?

                MaximillianM @Alan Kilborn

                @Alan-Kilborn Hi, I’m a beginner and was just using the $2 that astrosofista suggested in this post so I don’t fully understand it.

                He suggested
                Search: (?x-s) \x20like (\x20 | \R | \R* \d{1,2} : .*\R) a ((?1)) fire
                Replace: $2liquefy

                If I don’t use the $2 before the replacement word then the timestamp is removed in the replacement. I tried $1 and $3 before the replacement word as a test and the time stamp was removed in both cases.

                Putting the replacement expression in the second part of the string (second timestamp) is preferable as it is the most likely scenario.

                Just one bridged time-stamp is the basic requirement, in the future I might look at across multiple time-stamps.

                There is a blank line between each timestamp/phrase block, as in the example below.
                $1, $2, $3, would not be present in the text so ok to use in our expression.

                Find “like a fire” replace (end of expression) with liquefy

                0:00:17.680,0:00:20.400
                vaporize like a

                0:00:19.840,0:00:22.400
                fire

                0:00:22.400,0:00:24.300
                next phrase

                Thanks :-)

                  Alan Kilborn @MaximillianM

                  @MaximillianM said in Captions for video - Find and Replace across time stamps:

                  I’m a beginner and was just using the $2 that astrosofista suggested in this post so I don’t fully understand it.

                  I’m not a beginner, but I don’t see how this is going to work in the bigger scheme of things. I mean, well, maybe I can see it if I squint at it, but I don’t have the desire/time to sort out the regexes needed down to the Nth level so that every situation is covered.

                  I think finding the matches is one level of difficulty (which has already been conquered), but replacing them introduces a whole new level of complexity. Even the single bridged timestamp can be nuanced when you really think about some of the examples that a generic replace could encounter.

                  I’m sorry if I misrepresented that I would do the whole solution for your “list” based replacement. My intent was to demo a few things to show what’s possible with scripting, not come up with a full-blown solution for some very specific data.

                  If someone else (@guy038 loves to do this sort of thing, or maybe @PeterJones since he got the original ball rolling) is willing to do it, I can certainly help put together the final script using the information. What is needed is a find/replace regex pair that would walk through a document doing the replacements desired.

                    guy038

                    Hello, @maximillianm, @peterjones, @alan-kilborn, @astrosofista and All,

                    @maximillianm, I assume that each timestamp always starts at the very beginning of a line, without any leading blank characters! If that’s not the case, just tell me!

                    Here is a generic regex which searches for any range of text containing one timestamp line, and replaces it with any range of text still containing the same timestamp:

                    SEARCH (\R+\d{1,2}:\d\d:\d\d\.\d{3},\d{1,2}:\d\d:\d\d\.\d{3}\R)(*F)|(?-i)Before_Find_Text((?1))After_Find_Text

                    REPLACE Before_Replace_Text\2After_Replace_Text

                    where :

                    • Before_Find_Text represents the text to search, located BEFORE the time-stamp line

                    • After_Find_Text represents the text to search, located AFTER the time-stamp line

                    • Before_Replace_Text represents the text to replace BEFORE the time-stamp line

                    • After_Replace_Text represents the text to replace AFTER the time-stamp line


                    First example :

                    Given your initial text :

                    0:00:17.680,0:00:20.400
                    vaporize like a
                    
                    0:00:19.840,0:00:22.400
                    fire
                    

                    And the expected result :

                    0:00:17.680,0:00:20.400
                    vaporize
                    
                    0:00:19.840,0:00:22.400
                    liquefy
                    

                    The different variable parts of the generic regex S/R are :

                    • Before_Find_Text = vaporize like a

                    • After_Find_Text = fire

                    • Before_Replace_Text = vaporize

                    • After_Replace_Text = liquefy

                    which gives the functional regex S/R :

                    SEARCH (\R+\d{1,2}:\d\d:\d\d\.\d{3},\d{1,2}:\d\d:\d\d\.\d{3}\R)(*F)|(?-i)vaporize like a((?1))fire

                    REPLACE vaporize\2liquefy
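For readers who want to try the first example outside Notepad++, here is a rough plain-Python emulation. It is a sketch under assumptions, not the regex above: Python’s `re` module has no `\R`, `(?1)` or `(*F)`, so the timestamp sub-pattern is simply pasted inline, and the timestamp therefore lands in group 1 instead of group 2. The function name `replace_across_timestamp` is hypothetical.

```python
import re

# Python's re has no \R, so line breaks are spelled out explicitly.
NL = r'(?:\r\n|\n|\r)'
# One or more line breaks, a full timestamp line, then its line break.
TS_LINE = NL + r'+\d{1,2}:\d\d:\d\d\.\d{3},\d{1,2}:\d\d:\d\d\.\d{3}' + NL


def replace_across_timestamp(text, before_find, after_find, before_repl, after_repl):
    """Replace text on both sides of one timestamp line, keeping the timestamp."""
    pattern = re.escape(before_find) + '(' + TS_LINE + ')' + re.escape(after_find)
    # A callback is used so the replacement texts are inserted literally.
    return re.sub(pattern, lambda m: before_repl + m.group(1) + after_repl, text)


captions = (
    '0:00:17.680,0:00:20.400\n'
    'vaporize like a\n'
    '\n'
    '0:00:19.840,0:00:22.400\n'
    'fire\n'
)
result = replace_across_timestamp(captions, 'vaporize like a', 'fire', 'vaporize', 'liquefy')
print(result)
```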


                    Second example :

                    Given this initial example, taken from my previous post :

                    0:00:17.680,0:00:20.400
                    The licenses for most software are designed to
                    
                    0:00:19.840,0:00:22.400
                    take away your freedom to share and change it.
                    

                    And the expected result :

                    0:00:17.680,0:00:20.400
                    The licenses for most software are generally made to
                    
                    0:00:19.840,0:00:22.400
                    always suppress your freedom to share and change it.
                    

                    The different variable parts of the generic regex S/R are, this time :

                    • Before_Find_Text = designed to

                    • After_Find_Text = take away

                    • Before_Replace_Text = generally made to

                    • After_Replace_Text = always suppress

                    which gives the functional regex S/R :

                    SEARCH (\R+\d{1,2}:\d\d:\d\d\.\d{3},\d{1,2}:\d\d:\d\d\.\d{3}\R)(*F)|(?-i)designed to((?1))take away

                    REPLACE generally made to\2always suppress


                    Third example :

                    Given, again, this initial example, taken from my previous post :

                    0:00:17.680,0:00:20.400
                    The licenses for most software are designed to
                    
                    0:00:19.840,0:00:22.400
                    take away your freedom to share and change it.
                    

                    And the expected result :

                    0:00:17.680,0:00:20.400
                    The licenses for most software are
                    
                    0:00:19.840,0:00:22.400
                    designed to suppress your freedom.
                    

                    The different variable parts of the generic regex S/R are, this time :

                    • Before_Find_Text = are designed to

                    • After_Find_Text = take away your freedom to share and change it

                    • Before_Replace_Text = are

                    • After_Replace_Text = designed to suppress your freedom

                    which gives the functional regex S/R :

                    SEARCH (\R+\d{1,2}:\d\d:\d\d\.\d{3},\d{1,2}:\d\d:\d\d\.\d{3}\R)(*F)|(?-i)are designed to((?1))take away your freedom to share and change it

                    REPLACE are\2designed to suppress your freedom


                    Fourth example :

                    Given this initial example :

                    0:00:17.680,0:00:20.400
                    The licenses for most software are designed to
                    
                    0:00:19.840,0:00:22.400
                    take away your freedom to share and change it.
                    

                    And the expected result :

                    0:00:17.680,0:00:20.400
                    The licenses for most software
                    
                    0:00:19.840,0:00:22.400
                    prevent you from sharing and changing it.
                    

                    The different variable parts of the generic regex S/R are, this time :

                    • Before_Find_Text = software are designed to

                    • After_Find_Text = take away your freedom to share and change

                    • Before_Replace_Text = software

                    • After_Replace_Text = prevent you from sharing and changing

                    which gives the functional regex S/R :

                    SEARCH (\R+\d{1,2}:\d\d:\d\d\.\d{3},\d{1,2}:\d\d:\d\d\.\d{3}\R)(*F)|(?-i)software are designed to((?1))take away your freedom to share and change

                    REPLACE software\2prevent you from sharing and changing


                    Notes :

                    • The first alternative of this search regex, (\R+\d{1,2}:\d\d:\d\d\.\d{3},\d{1,2}:\d\d:\d\d\.\d{3}\R)(*F), is never matched, due to the backtracking control verb (*F), which forces a failure of the match attempt.

                    • However, the regex (\R+\d{1,2}:\d\d:\d\d\.\d{3},\d{1,2}:\d\d:\d\d\.\d{3}\R), which would match any range of line-breaks followed by a complete timestamp line, is stored in group 1 for later use in the second alternative of the search regex

                    • As you can see, the timestamp 0:00:19.840,0:00:22.400 is kept after replacement because it’s stored in group 2 (the text matched by the subroutine call (?1), inside the regex part ((?1)))

                    • If you prefer a case-insensitive search, simply change the part (?-i) to (?i)

                    Best regards

                    guy038

                      Alan Kilborn

                      @guy038 said:

                      Before_Find_Text = vaporize like a
                      After_Find_Text = fire

                      Before_Find_Text = designed to
                      After_Find_Text = take away

                      Before_Find_Text = are designed to
                      After_Find_Text = take away your freedom to share and change it

                      Before_Find_Text = software are designed to
                      After_Find_Text = take away your freedom to share and change

                      But the OP wants to simply specify the following for each of those searches:

                      • vaporize like a fire
                      • designed to take away
                      • are designed to take away your freedom to share and change it
                      • software are designed to take away your freedom to share and change

                      In other words, transparency of the timestamp and where it occurs.
                      Thus, timestamp could occur at any one of the following points (denoted by TS):

                      • vaporizeTSlikeTSaTSfire
                      • designedTStoTStakeTSaway
                      • areTSdesignedTStoTStakeTSawayTSyourTSfreedomTStoTSshareTSandTSchangeTSit
                      • softwareTSareTSdesignedTStoTStakeTSawayTSyourTSfreedomTStoTSshareTSandTSchange

                      And, this problem has already been solved, by Peter, way above.
                      What is needed now is a replacement regex that works, for all cases of a generic substitution, to accompany the original regex scheme.
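To make the “transparent timestamp” idea concrete, here is a small plain-Python sketch of the search half only. It is an approximation under assumptions: Python’s `re` lacks `\R` and `(?1)`, so the separator pattern is written out in full and repeated between every pair of words. The names `SEP`, `build_search_regex` and `find_phrase` are hypothetical.

```python
import re

NL = r'(?:\r\n|\n|\r)'
TS = r'\d{1,2}:\d\d:\d\d\.\d{3},\d{1,2}:\d\d:\d\d\.\d{3}'
# A gap between two words: a timestamp line with its line breaks,
# a single line break, or a single space.
SEP = '(?:' + NL + '*' + TS + NL + '|' + NL + '|' + r'\x20' + ')'


def build_search_regex(phrase):
    """Join the words of phrase so a timestamp may sit in any gap."""
    words = [re.escape(word) for word in phrase.split()]
    return re.compile(SEP.join(words))


def find_phrase(text, phrase):
    """Return the matched text for every occurrence of phrase."""
    return [m.group(0) for m in build_search_regex(phrase).finditer(text)]


captions = (
    '0:00:17.680,0:00:20.400\n'
    'vaporize like a\n'
    '\n'
    '0:00:19.840,0:00:22.400\n'
    'fire\n'
)
print(find_phrase(captions, 'like a fire'))
print(find_phrase('burns like a fire', 'like a fire'))
```

The user only types `like a fire`; where the timestamp falls is decided by the data, not by the search phrase.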

                      Now, ok, it is fine if the original search regex mutates somewhat to meet this need, but the “spirit” of it needs to be retained.

                      And, it may be simpler than I think it is, truly. But what I don’t want to have happen is the usual – people put a lot of time into it, and then the reality of it is that a different problem was solved than what was wanted.

                        PeterJones @Alan Kilborn

                        @Alan-Kilborn said in Captions for video - Find and Replace across time stamps:

                        What is needed now is a replacement regex that works, for all cases of a generic substitution, to accompany the original regex scheme.

                        So, the problem with the expression developed by me and @astrosofista , if I’ve understood my catching up with this discussion, is that when you have the “vaporizeTSlikeTSaTSfire”, the last of the timestamps is in group 3 instead of group 2. Or if there are more timestamps, the one you want to keep is always only the final timestamp. (I assume always the final timestamp, because that makes the most sense to me from the way that the OP originally phrased it)

                        If I understand what you’re doing in your PythonScript, you are just building the regex based on joining the elements of word_list with (?1) (which doesn’t create that final capture group that @astrosofista added to make the replacement work better)

                        Could you instead join all elements except the last with (?1) and then manually append ((?1)) and word_list[n-1] to the end: '(?1)'.join(word_list[0:n-1]) + '((?1))' + word_list[n-1] – this would then have the group2 always be the final space or timestamp.

                        That would have worked, except I realized that if our phrase were in the subtitle file as “first twoTSmiddleTSlast two”, so it spans multiple timestamps, but might have multiple words per timestamp, then my put-the-group2-around-the-last-subroutine-call idea wouldn’t work: the (?1) call matches either a space or a timestamp, so it would throw away the final timestamp and just capture the final space, which isn’t what we want.

                        At this point, I’d be more tempted to build a regex that captured each of the spaces or TS ('((?1))'.join(word_list)), and use editor.rereplace with a callback rather than a literal replace. That way, in the callback, you can look at each element from m.group and use a space for all the separators except the very last timestamp, which should be kept. I don’t have the time right now to implement that… but I think that’s the direction I’d go at this point.
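The callback direction described above could look roughly like the following. This is a plain-Python sketch using `re.sub` as a stand-in for `editor.rereplace`, and it assumes the simplest policy: keep only the last bridged timestamp and put the whole replacement after it. `replace_phrase` and the helper constants are hypothetical names, and the separator pattern is spelled out because Python’s `re` has no `\R` or `(?1)`.

```python
import re

NL = r'(?:\r\n|\n|\r)'
TS = r'\d{1,2}:\d\d:\d\d\.\d{3},\d{1,2}:\d\d:\d\d\.\d{3}'
# Every gap between words is captured, so the callback can inspect it.
SEP = '(' + NL + '*' + TS + NL + '|' + NL + '|' + r'\x20' + ')'
TS_RE = re.compile(TS)


def replace_phrase(text, phrase, replacement):
    """Replace phrase wherever it occurs, keeping the last bridged timestamp."""
    words = [re.escape(word) for word in phrase.split()]
    regex = re.compile(SEP.join(words))

    def callback(m):
        kept = ''
        for sep in m.groups():
            if TS_RE.search(sep):
                kept = sep  # later timestamps overwrite earlier ones
        # The replacement goes after the kept timestamp line, if any.
        return kept + replacement

    return regex.sub(callback, text)


captions = (
    '0:00:17.680,0:00:20.400\n'
    'vaporize like a\n'
    '\n'
    '0:00:19.840,0:00:22.400\n'
    'fire\n'
)
result = replace_phrase(captions, 'like a fire', 'liquefy')
print(result)
```

Note that the space before the match is left in place, so the first caption line ends with a trailing space; earlier timestamps inside a multi-timestamp match are dropped by this policy.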

                        (Doing a search-and-replace where the match and replacement can go across multiple timestamps is a convoluted mess.)

                          Alan Kilborn @PeterJones

                          @PeterJones said in Captions for video - Find and Replace across time stamps:

                          (Doing a search-and-replace where the match and replacement can go across multiple timestamps is a convoluted mess.)

                          So, yes, that’s a convoluted mess.

                          But also there can be the situation where (potentially) all text connected to a timestamp is removed (e.g. a replace-with-nothing situation), and then an orphan timestamp is left sitting there in the file.

                          This could (logically) happen if the timestamp to be orphaned occurs ahead of the match data, or after it.

                          Of course, OP said nothing about “replace with nothing”, but in trying to make this a “generic” operation, that’s certainly a possibility with a “replace” algorithm.
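One way to at least detect the orphan case after a replace run is a simple line scan. The sketch below is plain Python and assumes the caption layout used in this thread (a timestamp line, caption text, a blank line); `find_orphan_timestamps` is a hypothetical helper, not part of any script posted here.

```python
import re

TS_LINE = re.compile(r'^\d{1,2}:\d\d:\d\d\.\d{3},\d{1,2}:\d\d:\d\d\.\d{3}$')


def find_orphan_timestamps(text):
    """Return timestamp lines that have no caption text before the next timestamp."""
    orphans = []
    pending = None  # timestamp line still waiting for caption text
    for line in text.splitlines():
        stripped = line.strip()
        if TS_LINE.match(stripped):
            if pending is not None:
                orphans.append(pending)
            pending = stripped
        elif stripped:
            pending = None  # caption text found for the pending timestamp
    if pending is not None:
        orphans.append(pending)
    return orphans


after_replace = (
    '0:00:17.680,0:00:20.400\n'
    '\n'
    '0:00:19.840,0:00:22.400\n'
    'liquefy\n'
)
print(find_orphan_timestamps(after_replace))
```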

                          I dislike reading too much into an OP’s need, but if this is not done we can end up, after some effort, with the aforementioned “solution is not what I wanted” problem.

                          I got thinking about this as I was playing around with some of the data originally postulated by Peter, above.

                            MaximillianM @Alan Kilborn

                            @Alan-Kilborn @guy038 Thank you so much for all your responses. I see that there are multiple ways to accomplish this, and that each solution solves one situation and might mess up another. So trying to cover all the options gets complicated fast.

                            You both have gone above and beyond and I really appreciate it. :-)

                            What I think would be most helpful now is creating a script using a list such as Alan suggested a bit back and then parse the list so I could use it in first a simple search and replace (word(s) on the same line) then one of the more complicated ones (words over multiple lines) like Alan or Guy suggested.

                            The first part was solved by Alan

                            Alan’s script
                            from Npp import editor

                            the_list = [
                                ':findable_you:replaceable_you',
                                ':I can contain spaces:So I see',
                                ':look_for_me:really_want_to_be_you',
                                '!simple!complex',
                                '$fire$liquify',
                            ]

                            editor.beginUndoAction()
                            for definition in the_list:
                                delim = definition[0]  # first character of each entry is its delimiter
                                (find_what, repl_with) = definition[1:].split(delim, 2)
                                editor.replace(find_what, repl_with)
                            editor.endUndoAction()
                            

                            Then I need a way to parse the list (or a separate list might be easier) so the scripted search/replace could handle what is the most common situation:

                            the_list = [
                                ':like a fire:vaporize',
                            ]

                            Search: (?x-s) \x20like (\x20 | \R | \R* \d{1,2} : .*\R) a ((?1)) fire
                            REPLACE vaporize on second line

                            With that my original problem would be solved and then I could experiment with Guy’s regex for the next scenario.

                            Thanks again. You really know your stuff and I so appreciate your support!

                              Alan Kilborn @MaximillianM

                              @MaximillianM

                              In the spirit of following through on something I’ve signed up for, I’ll put together something very basic on this, and then it can be evaluated by you and others. After that I’ll have to hand it off, and absolve myself of future development. :-) It may take me a bit of time, not because of the task itself but because of other demands currently biting at me, so check back here periodically.

                                Alan Kilborn

                                So in attempting to get something to work, I’ve come up with the following regular expression replacement (generated via code):

                                find: (?(DEFINE)(?<TS>\d{1,3}:\d{2}:\d{2}\.\d{3},\d{1,3}:\d{2}:\d{2}\.\d{3}))like(?<RE>\x20|\R|\R*(?<TSLINE>(?P>TS))\R)a(?P>RE)fire
                                repl: it so that it will \r\n\r\n$+{TSLINE}\r\nliquify

                                And running it on this data:

                                0:00:17.680,0:00:20.400
                                vaporize like a
                                
                                0:00:19.840,0:00:22.400
                                fire
                                
                                0:00:17.680,0:00:20.400
                                vaporize like a
                                
                                0:00:19.840,0:00:22.400
                                fire
                                

                                Should produce this result:

                                0:00:17.680,0:00:20.400
                                vaporize it so that it will 
                                
                                0:00:19.840,0:00:22.400
                                liquify
                                
                                0:00:17.680,0:00:20.400
                                vaporize it so that it will 
                                
                                0:00:19.840,0:00:22.400
                                liquify
                                

                                But it is ignoring my use of $+{TSLINE} in the replace expression and produces the following:

                                0:00:17.680,0:00:20.400
                                vaporize it so that it will
                                
                                
                                liquify
                                
                                0:00:17.680,0:00:20.400
                                vaporize it so that it will
                                
                                
                                liquify
                                

                                It is working fine in the Boost emulation mode for RegexBuddy, just not in Notepad++.

                                Any ideas?

                                I’m also open to someone changing the regexes I’m using.

                                Given that this is a general problem with a complicated regex situation between every word in a multiple word search string, I thought it better to switch to named groups instead of numbered groups, but as of yet I don’t feel confident with any approach.
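As a point of comparison only, here is how the same find/replace pair could be generated without subroutine calls, in plain Python. This does not diagnose the Notepad++ behavior; it is an untested assumption that captures made inside a subroutine call may not be visible to the replacement in that engine. The sketch gives every gap its own named group (`ts0`, `ts1`, ...) and references all of them in the replacement, relying on Python 3.5+ substituting an empty string for groups that did not participate. Like the replacement above, it assumes exactly one bridged timestamp. `build_find_and_repl` and `literal` are hypothetical names.

```python
import re

NL = r'(?:\r\n|\n|\r)'
TS = r'\d{1,3}:\d{2}:\d{2}\.\d{3},\d{1,3}:\d{2}:\d{2}\.\d{3}'


def literal(text):
    """Escape backslashes so text is taken literally in a re.sub template."""
    return text.replace('\\', '\\\\')


def build_find_and_repl(phrase, before_repl, after_repl):
    """Generate a find pattern and replacement template for phrase."""
    words = [re.escape(word) for word in phrase.split()]
    parts = [words[0]]
    ts_refs = []
    for i, word in enumerate(words[1:]):
        name = 'ts%d' % i  # one uniquely named group per gap
        parts.append(r'(?:\x20|' + NL + '*(?P<' + name + '>' + TS + ')' + NL + '|' + NL + ')')
        parts.append(word)
        ts_refs.append(r'\g<' + name + '>')
    find = ''.join(parts)
    # Unmatched groups expand to '' in Python 3.5+, so only the gap that
    # actually held a timestamp contributes text here.
    repl = literal(before_repl) + '\n\n' + ''.join(ts_refs) + '\n' + literal(after_repl)
    return find, repl


block = (
    '0:00:17.680,0:00:20.400\n'
    'vaporize like a\n'
    '\n'
    '0:00:19.840,0:00:22.400\n'
    'fire\n'
    '\n'
)
data = block * 2
find, repl = build_find_and_repl('like a fire', 'it so that it will ', 'liquify')
result = re.sub(find, repl, data)
print(result)
```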
