Regex replace only on the first group not the others



  • First, I’m not a regex expert, but here is the scenario.
    I have several .ASS subtitle files that I need to edit in a batch-like approach. Googling about it, I found that N++ can do multiple replaces using regex, which I believe is perfect for my case, so:

    The problem is that I have multiple lines among all my files like the following sample:

    Dialogue: 0,0:00:06.43,0:00:10.49,Song - Romaji,0,0,0,Subtitle Line
    Dialogue: 0,0:00:06.43,0:00:10.49,Song - Translation,0,0,0,Another Subtitle Line
    Dialogue: 0,0:02:37.90,0:02:40.36,Default,0,0,0,Some Default style line

    I need to replace ONLY the word “Dialogue:” on lines containing “Song - Romaji” OR “Song - Translation”.

    I tried the regex -> Dialogue:|(Song - (Romaji,|Translation,))
    But it’s matching the default lines too.
    I also tried -> ^(?=.*?\bDialogue\b)(?=.*?\bSong\b).*$
    It matches the correct target lines, but I couldn’t work out what to put in the Replace field.
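
    A minimal sketch of why the first attempt also hits the default lines: the top-level | makes “Dialogue:” a complete alternative on its own, so it matches wherever that word occurs. Python's re module is used here only as a stand-in for the Notepad++ engine, and the helper name find_matches is hypothetical.

```python
import re

# The first attempt from the question, copied as-is.
FIRST_ATTEMPT = re.compile(r"Dialogue:|(Song - (Romaji,|Translation,))")


def find_matches(line):
    """Return every substring of `line` that the first attempt matches."""
    return [m.group(0) for m in FIRST_ATTEMPT.finditer(line)]


song_line = "Dialogue: 0,0:00:06.43,0:00:10.49,Song - Romaji,0,0,0,Subtitle Line"
default_line = "Dialogue: 0,0:02:37.90,0:02:40.36,Default,0,0,0,Some Default style line"

# The song line yields two independent matches; the default line still yields one,
# because "Dialogue:" matches without any reference to the style name.
print(find_matches(song_line))
print(find_matches(default_line))
```

    The two halves of the alternation are never tied together, which is what the look-ahead in the replies below fixes.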



  • @Danilo-de-Queiroz

    from the infos given I assume the following regex should do the job.

    ^Dialogue(?=.*?Song - (?=Romaji|Translation).*$)
    

    You are looking for lines which start with the word Dialogue
    followed by, but not counted as part of the match,

    • any chars (as few as possible) followed by
    • Song - followed by (but again not counted) either
    • Romaji or
    • Translation followed by
    • any chars until end of line

    In replace put the string you want.

    Cheers
    Claudia



  • Hello, Danilo and Claudia,

    No problem, Danilo, regexes can do miracles, indeed !

    Claudia, as you, I thought about a look-ahead structure. I just shortened it, a bit ;-)

    Indeed, we don’t need a second look-ahead to verify that the Romaji or Translation strings are present. Nor do we need to test for the range, possibly empty, of standard characters up to the end of the line ( .*$ ). Neither changes the main test ( Does the string Song - Romaji OR the string Song - Translation exist, further on, in the current line ? ) anyway !

    So, my regex attempt would be :

    SEARCH (?-s)^Dialogue(?=.+Song - (Romaji|Translation))

    REPLACE Any Text you want

    Notes :

    • First, the modifier, (?-s), ensures that the dot special character matches a single standard character ONLY ( and not an EOL character )

    • The second part, Dialogue, is the string to match

    • The ending part, (?=.+Song - (Romaji|Translation)), called a positive look-ahead, is a condition which must be true in order to validate the overall regex !

    • The condition to test is : After the word Dialogue, is there, further on, a string Song - Romaji OR a string Song - Translation, in the current line ?

    • If that condition is true, the search match ( Dialogue ) is, then, replaced by the contents of the Replace with field

    So, Danilo, let’s consider the original text, of nine lines, below :

    Dialogue: 0,0:00:06.43,0:00:10.49,Song - Romaji,0,0,0,Subtitle Line
    Dialogue: 0,0:00:06.43,0:00:10.49,Song - Translation,0,0,0,Another Subtitle Line
    Dialogue: 0,0:02:37.90,0:02:40.36,Default,0,0,0,Some Default style line
    Dialogue: 0,0:00:06.43,0:00:10.49,Song - TEST,0,0,0,Subtitle Line
    Dialogue: 0,0:00:06.43,0:00:10.49,Song - Translation,0,0,0,Another Subtitle Line
    Dialogue: 0,0:02:37.90,0:02:40.36,Default,0,0,0,Some Default style line
    Dialogue: 0,0:00:06.43,0:00:10.49,Song - Romaji,0,0,0,Subtitle Line
    Dialogue: 0,0:00:06.43,0:00:10.49,Song - XXXXXXXX,0,0,0,Another Subtitle Line
    Dialogue: 0,0:02:37.90,0:02:40.36,Default,0,0,0,Some Default style line
    

    If the Replace with: field contains the string Test_001, then, after clicking on the Replace All button, you should obtain the changed text, below :

    Test_001: 0,0:00:06.43,0:00:10.49,Song - Romaji,0,0,0,Subtitle Line
    Test_001: 0,0:00:06.43,0:00:10.49,Song - Translation,0,0,0,Another Subtitle Line
    Dialogue: 0,0:02:37.90,0:02:40.36,Default,0,0,0,Some Default style line
    Dialogue: 0,0:00:06.43,0:00:10.49,Song - TEST,0,0,0,Subtitle Line
    Test_001: 0,0:00:06.43,0:00:10.49,Song - Translation,0,0,0,Another Subtitle Line
    Dialogue: 0,0:02:37.90,0:02:40.36,Default,0,0,0,Some Default style line
    Test_001: 0,0:00:06.43,0:00:10.49,Song - Romaji,0,0,0,Subtitle Line
    Dialogue: 0,0:00:06.43,0:00:10.49,Song - XXXXXXXX,0,0,0,Another Subtitle Line
    Dialogue: 0,0:02:37.90,0:02:40.36,Default,0,0,0,Some Default style line
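
    For anyone who wants to check this outside Notepad++, here is a minimal sketch of the same search and replace in Python. Some assumptions: Python's re module stands in for the Boost engine, re.MULTILINE is needed so that ^ matches at every line start, the (?-s) modifier is left out because Python's dot already excludes newlines unless re.DOTALL is set, and the name replace_prefix is hypothetical.

```python
import re

SAMPLE = """\
Dialogue: 0,0:00:06.43,0:00:10.49,Song - Romaji,0,0,0,Subtitle Line
Dialogue: 0,0:00:06.43,0:00:10.49,Song - Translation,0,0,0,Another Subtitle Line
Dialogue: 0,0:02:37.90,0:02:40.36,Default,0,0,0,Some Default style line
Dialogue: 0,0:00:06.43,0:00:10.49,Song - TEST,0,0,0,Subtitle Line
Dialogue: 0,0:00:06.43,0:00:10.49,Song - Translation,0,0,0,Another Subtitle Line
Dialogue: 0,0:02:37.90,0:02:40.36,Default,0,0,0,Some Default style line
Dialogue: 0,0:00:06.43,0:00:10.49,Song - Romaji,0,0,0,Subtitle Line
Dialogue: 0,0:00:06.43,0:00:10.49,Song - XXXXXXXX,0,0,0,Another Subtitle Line
Dialogue: 0,0:02:37.90,0:02:40.36,Default,0,0,0,Some Default style line
"""

# Only the word "Dialogue" is consumed; the look-ahead just checks the rest of the line.
PATTERN = re.compile(r"^Dialogue(?=.+Song - (Romaji|Translation))", re.MULTILINE)


def replace_prefix(text, replacement="Test_001"):
    """Replace the leading 'Dialogue' on lines whose style is one of the two song styles."""
    return PATTERN.sub(replacement, text)


print(replace_prefix(SAMPLE))
```

    Because the look-ahead consumes nothing, everything after the word Dialogue is left untouched, so the replacement string does not need any back-references.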
    

    Best Regards,

    guy038



  • Hello Danilo and guy038,

    you are right, second lookahead isn’t necessary, so the modified version is

    ^Dialogue(?=.*?Song - (Romaji|Translation).*$)
    

    which is a char less than your example ;-D
    and also a little bit faster ;-D

    regex((?-s)^Dialogue(?=.+Song - (Romaji|Translation))) -> took 0.010000 seconds
    regex(^Dialogue(?=.?Song - (Romaji|Translation).$)) -> took 0.008000 seconds

    Your turn ;-)

    Cheers
    Claudia



  • Hi, Claudia,

    First of all, due to the Markdown syntax, in our site, I suppose that two star symbols are missing, in your regex. So the exact regexes are :

    (?-s)^Dialogue(?=.+Song - (Romaji|Translation))      Me
    
    ^Dialogue(?=.*?Song - (Romaji|Translation).*$)       You
    

    Oh ! I’ve never thought about timing regex’s execution, yet ! So, I’ve lost for 0.002s only :-(( I’ll never recover after such an event !

    Out of curiosity, could you time this similar regex ^Dialogue(?=.+Song - (Romaji|Translation)) ? I just omitted the modifier (?-s), at the beginning. Of course, this implies that the . matches newline option must be unchecked, in the Replace dialog, before performing the S/R

    Two remarks :

    • I still think that the block .*$, at the end of your regex, is not necessary for knowing if, either, the string Song - Romaji OR Song - Translation occurs, in the current line !

    • As these two strings may be located, anywhere, after the word Dialogue, I don’t think, also, that the lazy quantifier *? is necessary, at the beginning of the look-ahead !

    Finally, your regex could be shortened to ^Dialogue(?=.*Song - (Romaji|Translation))

    which is quite similar to my regex syntax ^Dialogue(?=.+Song - (Romaji|Translation)), without the (?-s) modifier !
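
    A small sketch of what can go wrong when the dot is allowed to match newlines. The assumption here is that Python's re.DOTALL flag behaves like the . matches newline option for this purpose; the helper count_replacements is hypothetical.

```python
import re

PATTERN = r"^Dialogue(?=.+Song - (Romaji|Translation))"

# A Default line followed by a song line.
TEXT = (
    "Dialogue: 0,0:02:37.90,0:02:40.36,Default,0,0,0,Some Default style line\n"
    "Dialogue: 0,0:00:06.43,0:00:10.49,Song - Romaji,0,0,0,Subtitle Line\n"
)


def count_replacements(flags):
    """Count how many 'Dialogue' prefixes would be replaced under the given flags."""
    _, count = re.subn(PATTERN, "Test_001", TEXT, flags=flags)
    return count


dot_excludes_newline = count_replacements(re.MULTILINE)
dot_matches_newline = count_replacements(re.MULTILINE | re.DOTALL)

# With the dot crossing line ends, the look-ahead on the Default line
# finds "Song - Romaji" on the NEXT line, so that line is changed too.
print(dot_excludes_newline, dot_matches_newline)
```

    This is why the regex without (?-s) is only safe when the option is unchecked.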

    Cheers,

    guy038



  • @guy038 @Claudia-Frank
    Hi buddies, I didn’t test each of the expressions you mentioned, except the first one, which just did the job.
    Now I’m trying to understand each of these characters for future use. Is there any good doc you could point me to?
    Thanks for both your help :)



  • Hi, Danilo,

    First, don’t worry about the choice of the regex. Our regexes are quite similar, anyway ! It’s just the pleasure of discussing with Claudia !


    I just forgot to give you some information for improving your knowledge of regular expressions !

    Begin with that article, in N++ Wiki :

    http://docs.notepad-plus-plus.org/index.php/Regular_Expressions

    In addition, you’ll find good documentation, about the new Boost C++ Regex library, v1.55.0 ( similar to the PERL Regular Common Expressions, v1.48.0 ), used by Notepad++, since its 6.0 version, at the TWO addresses below :

    http://www.boost.org/doc/libs/1_48_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html

    http://www.boost.org/doc/libs/1_48_0/libs/regex/doc/html/boost_regex/format/boost_format_syntax.html

    • The FIRST link explains the syntax, of regular expressions, in the SEARCH part

    • The SECOND link explains the syntax, of regular expressions, in the REPLACEMENT part


    You may, also, look for valuable information, on the sites, below :

    http://www.regular-expressions.info

    http://www.rexegg.com

    http://perldoc.perl.org/perlre.html

    Be aware that, as any documentation, it may contain some errors ! Anyway, if you detected one, that’s good news : you’re improving ;-))



  • Hi to everyone,

    Guy, after doing some tests,
    I would say it doesn’t matter if using the (?-s) or not,
    because sometimes it was faster and sometimes not.
    Seems that external influence has much more significance.

    What can be said, though, is that the non-greedy regexes beat the greedy ones.

    Out of interest I did a test with two regexes which describe the pattern more concretely,
    as I thought this could be faster, but it turned out it wasn’t.
    I assume it is related to the fact that each char needs to be checked anyway.

    Results of that simple test (10,000 iterations for each regex)

    (?-s)^Dialogue(?=.+Song - (Romaji|Translation))       -> took 14.708000 seconds
    (?-s)^Dialogue(?=.*Song - (Romaji|Translation))       -> took 14.631000 seconds
    
    (?-s)^Dialogue(?=.+Song - (Romaji|Translation).+$)    -> took 15.046000 seconds
    (?-s)^Dialogue(?=.*Song - (Romaji|Translation).*$)    -> took 15.035000 seconds
    
    ^Dialogue(?=.+Song - (Romaji|Translation))            -> took 14.635000 seconds
    ^Dialogue(?=.*Song - (Romaji|Translation))            -> took 14.697000 seconds
    
    ^Dialogue(?=.+?Song - (Romaji|Translation))           -> took 13.568000 seconds    
    ^Dialogue(?=.*?Song - (Romaji|Translation))           -> took 13.575000 seconds
    
    ^Dialogue(?=.+Song - (Romaji|Translation).+$)         -> took 14.885000 seconds
    ^Dialogue(?=.*Song - (Romaji|Translation).*$)         -> took 14.947000 seconds
    
    ^Dialogue(?=.+?Song - (Romaji|Translation).+?$)       -> took 13.928000 seconds
    ^Dialogue(?=.*?Song - (Romaji|Translation).*?$)       -> took 13.972000 seconds
    
    ^Dialogue(?=: \d,\d\:\d\d\:\d\d\.\d\d,\d\:\d\d\:\d\d\.\d\d,Song - (Romaji|Translation)) -> took 16.183000 seconds
    ^Dialogue(?=: \d,\d\:\d{2}\:\d{2}\.\d{2},\d\:\d{2}\:\d{2}\.\d{2},Song - (Romaji|Translation)) -> took 20.889000 seconds
    

    Cheers
    Claudia



  • Hi, Danilo and Claudia,

    Claudia, your series of tests, on regexes, is very interesting, indeed ! We can deduce some facts :

    • 1) If a quantifier, with syntax Regex{n}, has a small value, it’s better to use the syntax RegexRegex… Regex than Regex{n} ! ( Refer to the time difference between your two last examples : the regex containing \d\d and the one containing \d{2} )

    • 2) When a range of text is NOT needed for the results of the regex, replace it with .* or .+ if some needed text is located after it in the regex ; ELSE don’t add it at all !

    • 3) When it does not matter, for the results of the regex, whether a lazy or a greedy quantifier is used, always prefer the lazy form !

    • 4) If a range of text cannot be a null length string, prefer the + quantifier ( idem {1,x} ) to the * quantifier ( idem {0,x} )
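
    Rules 2) to 4) only apply when the variants give the same results, so it is worth checking that first. A minimal sketch of such a check, assuming Python's re module as a stand-in for the Notepad++ engine and a hypothetical helper matched_line_numbers:

```python
import re

SAMPLE = (
    "Dialogue: 0,0:00:06.43,0:00:10.49,Song - Romaji,0,0,0,Subtitle Line\n"
    "Dialogue: 0,0:00:06.43,0:00:10.49,Song - Translation,0,0,0,Another Subtitle Line\n"
    "Dialogue: 0,0:02:37.90,0:02:40.36,Default,0,0,0,Some Default style line\n"
    "Dialogue: 0,0:00:06.43,0:00:10.49,Song - TEST,0,0,0,Subtitle Line\n"
)

# Greedy, lazy, + versus *, and with or without the trailing .*$ block.
CANDIDATES = [
    r"^Dialogue(?=.+Song - (Romaji|Translation))",
    r"^Dialogue(?=.*Song - (Romaji|Translation))",
    r"^Dialogue(?=.+?Song - (Romaji|Translation))",
    r"^Dialogue(?=.*?Song - (Romaji|Translation).*$)",
]


def matched_line_numbers(pattern, text):
    """Return the 1-based numbers of the lines on which `pattern` matches."""
    compiled = re.compile(pattern, re.MULTILINE)
    return [text.count("\n", 0, m.start()) + 1 for m in compiled.finditer(text)]


for pattern in CANDIDATES:
    print(matched_line_numbers(pattern, SAMPLE), pattern)
```

    If every candidate reports the same line numbers on representative data, the choice between them can be made on speed or readability alone.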


    Finally, Danilo, thanks to Claudia’s tests about timing, and according to the rules above, the fastest and shortest regex, for your case, seems to be :

    ^Dialogue(?=.+?Song - (Romaji|Translation))        ( 13.568000 seconds for 10,000 iterations )
    

    But, as this regex does not contain the (?-s) modifier, at the beginning, just be sure that the . matches newline option is not enabled, in the Replace dialog !

    Cheers,

    guy038



  • Hi Guy,

    I would still use the more descriptive version with (?-s), because there isn’t really a difference in timing

    (?-s)^Dialogue(?=.+Song - (Romaji|Translation))       -> took 14.708000 seconds
    (?-s)^Dialogue(?=.*Song - (Romaji|Translation))       -> took 14.631000 seconds
    
    ^Dialogue(?=.+Song - (Romaji|Translation))            -> took 14.635000 seconds
    ^Dialogue(?=.*Song - (Romaji|Translation))            -> took 14.697000 seconds
    

    but it has the advantage of setting the s switch explicitly - so you’re sure about what should be done.

    Cheers
    Claudia

