Regex replace only on the first group not the others

Danilo de Queiroz

First, I’m not a regex expert, but here is the scenario.
I have several .ASS subtitle files that I need to edit in batch. While googling, I found that Notepad++ can do multiple replacements using regex, which I believe is perfect for my case, so:

The problem is that I have multiple lines across all my files like the following sample:

Dialogue: 0,0:00:06.43,0:00:10.49,Song - Romaji,0,0,0,Subtitle Line
Dialogue: 0,0:00:06.43,0:00:10.49,Song - Translation,0,0,0,Another Subtitle Line
Dialogue: 0,0:02:37.90,0:02:40.36,Default,0,0,0,Some Default style line

I need to replace ONLY the word “Dialogue:” on lines containing “Song - Romaji” OR “Song - Translation”.

I tried the regex -> Dialogue:|(Song - (Romaji,|Translation,))
But it’s matching the Default lines too.
Also tried -> ^(?=.*?\bDialogue\b)(?=.*?\bSong\b).*$
It matches the correct target lines, but I couldn’t figure out what to put in the replace field.
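
Why the first attempt also hits the Default lines can be checked outside Notepad++. The sketch below uses Python's re module as a stand-in for Notepad++'s regex engine (an assumption; the two engines differ in details) on two lines from the sample above. The two branches of the alternation are independent, so "Dialogue:" alone is enough for a match on any line.

```python
import re

lines = [
    "Dialogue: 0,0:00:06.43,0:00:10.49,Song - Romaji,0,0,0,Subtitle Line",
    "Dialogue: 0,0:02:37.90,0:02:40.36,Default,0,0,0,Some Default style line",
]

# A|B matches wherever A matches OR wherever B matches; B does not restrict A.
attempt = re.compile(r"Dialogue:|(Song - (Romaji,|Translation,))")

# First match on each line: "Dialogue:" is found on the Default line as well.
first_matches = [attempt.search(line).group(0) for line in lines]
print(first_matches)

# On the song line, the pattern matches twice, once per branch.
song_line_matches = [m.group(0) for m in attempt.finditer(lines[0])]
print(song_line_matches)
```

What is needed instead is a way to match "Dialogue" only when the style text appears later on the same line, which is what a look-ahead expresses.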

Claudia Frank

@Danilo-de-Queiroz

From the info given, I assume the following regex should do the job.

^Dialogue(?=.*?Song - (?=Romaji|Translation).*$)

You are looking for lines which start with the word Dialogue
followed by, but not counted as part of the match,

any chars (as few as possible) followed by
Song - followed by (but again not counted) either
Romaji or
Translation followed by
any chars until the end of the line

In the replace field, put the string you want.
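
For anyone who wants to try this outside Notepad++, here is a minimal sketch using Python's re module (an assumption: Python's engine is not the one Notepad++ uses, but it handles this pattern the same way). The replacement word "Comment" is only a hypothetical example, and re.MULTILINE is needed so that ^ and $ work per line.

```python
import re

text = "\n".join([
    "Dialogue: 0,0:00:06.43,0:00:10.49,Song - Romaji,0,0,0,Subtitle Line",
    "Dialogue: 0,0:00:06.43,0:00:10.49,Song - Translation,0,0,0,Another Subtitle Line",
    "Dialogue: 0,0:02:37.90,0:02:40.36,Default,0,0,0,Some Default style line",
])

# Only "Dialogue" is consumed; everything inside (?=...) is just a check.
pattern = re.compile(
    r"^Dialogue(?=.*?Song - (?=Romaji|Translation).*$)",
    re.MULTILINE,
)

# "Comment" is a hypothetical replacement string, chosen for illustration.
result = pattern.sub("Comment", text)
print(result)
```

The first two lines should now start with "Comment:", while the Default line keeps its "Dialogue:" prefix.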

Cheers
Claudia

guy038

Hello, Danilo and Claudia,

No problem, Danilo, regexes can do miracles, indeed !

Claudia, like you, I thought about a look-ahead structure. I just shortened it a bit ;-)

Indeed, we don’t need a second look-ahead to verify that the Romaji or Translation string is present. We also don’t need to test for the (possibly empty) range of standard characters up to the end of the line ( .*$ ). Neither changes the main test ( Does the string Song - Romaji OR the string Song - Translation exist, further on, in the current line ? ) !

So, my regex attempt would be :

SEARCH (?-s)^Dialogue(?=.+Song - (Romaji|Translation))

REPLACE Any Text you want

Notes :

First, the modifier (?-s) ensures that the dot special character matches ONLY a single standard character ( and not an EOL character )
The second part, ^Dialogue, matches the string Dialogue at the beginning of a line
The ending part, (?=.+Song - (Romaji|Translation)), called a positive look-ahead, is a condition which must be true in order to validate the overall regex !
The condition to test is : after the word Dialogue, is there, further on in the current line, a string Song - Romaji OR a string Song - Translation ?
If that condition is true, the search match ( Dialogue ) is then replaced by the contents of the Replace with field
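
The point that only Dialogue is matched, while the look-ahead is merely tested, can be illustrated with a short sketch in Python's re module (an assumption: it stands in for the Notepad++ engine here). As far as I know, Python rejects a bare (?-s), so the modifier is omitted; Python's dot excludes newlines by default anyway.

```python
import re

line = "Dialogue: 0,0:00:06.43,0:00:10.49,Song - Romaji,0,0,0,Subtitle Line"

# Same regex as above, minus the (?-s) modifier (not needed in Python).
pattern = re.compile(r"^Dialogue(?=.+Song - (Romaji|Translation))")

m = pattern.search(line)
matched_text = m.group(0)   # the text that a replace would overwrite
span = m.span()             # start and end offsets of the match
captured = m.group(1)       # the group inside the look-ahead is still captured

print(matched_text, span, captured)
```

The match covers only the first eight characters, so a replacement leaves the rest of the line, including the Song - Romaji part, untouched.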

So, Danilo, let’s consider the original text, of nine lines, below :

Dialogue: 0,0:00:06.43,0:00:10.49,Song - Romaji,0,0,0,Subtitle Line
Dialogue: 0,0:00:06.43,0:00:10.49,Song - Translation,0,0,0,Another Subtitle Line
Dialogue: 0,0:02:37.90,0:02:40.36,Default,0,0,0,Some Default style line
Dialogue: 0,0:00:06.43,0:00:10.49,Song - TEST,0,0,0,Subtitle Line
Dialogue: 0,0:00:06.43,0:00:10.49,Song - Translation,0,0,0,Another Subtitle Line
Dialogue: 0,0:02:37.90,0:02:40.36,Default,0,0,0,Some Default style line
Dialogue: 0,0:00:06.43,0:00:10.49,Song - Romaji,0,0,0,Subtitle Line
Dialogue: 0,0:00:06.43,0:00:10.49,Song - XXXXXXXX,0,0,0,Another Subtitle Line
Dialogue: 0,0:02:37.90,0:02:40.36,Default,0,0,0,Some Default style line

If the Replace with: field contains the string Test_001, then, after clicking on the Replace All button, you should obtain the changed text, below :

Test_001: 0,0:00:06.43,0:00:10.49,Song - Romaji,0,0,0,Subtitle Line
Test_001: 0,0:00:06.43,0:00:10.49,Song - Translation,0,0,0,Another Subtitle Line
Dialogue: 0,0:02:37.90,0:02:40.36,Default,0,0,0,Some Default style line
Dialogue: 0,0:00:06.43,0:00:10.49,Song - TEST,0,0,0,Subtitle Line
Test_001: 0,0:00:06.43,0:00:10.49,Song - Translation,0,0,0,Another Subtitle Line
Dialogue: 0,0:02:37.90,0:02:40.36,Default,0,0,0,Some Default style line
Test_001: 0,0:00:06.43,0:00:10.49,Song - Romaji,0,0,0,Subtitle Line
Dialogue: 0,0:00:06.43,0:00:10.49,Song - XXXXXXXX,0,0,0,Another Subtitle Line
Dialogue: 0,0:02:37.90,0:02:40.36,Default,0,0,0,Some Default style line
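
The same result can be reproduced with a small sketch in Python's re module (an assumption: Python is only a stand-in for the Notepad++ engine, and the nine lines are rebuilt from their styles with the other fields simplified). The (?-s) modifier is left out because Python's dot does not match newlines unless re.DOTALL is set.

```python
import re

# The nine styles of the example, in order; other fields are simplified.
styles = [
    "Song - Romaji", "Song - Translation", "Default",
    "Song - TEST", "Song - Translation", "Default",
    "Song - Romaji", "Song - XXXXXXXX", "Default",
]
lines = [
    f"Dialogue: 0,0:00:06.43,0:00:10.49,{style},0,0,0,Subtitle Line"
    for style in styles
]

# re.MULTILINE makes ^ match at the start of every line of the text.
pattern = re.compile(r"^Dialogue(?=.+Song - (Romaji|Translation))", re.MULTILINE)
replaced = pattern.sub("Test_001", "\n".join(lines)).splitlines()

# 1-based numbers of the lines whose first word was replaced.
changed = [i + 1 for i, line in enumerate(replaced) if line.startswith("Test_001")]
print(changed)
```

Lines 1, 2, 5 and 7 are changed; the Default, Song - TEST and Song - XXXXXXXX lines are left alone.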

Best Regards,

guy038

Claudia Frank

Hello Danilo and guy038,

You are right, the second lookahead isn’t necessary, so the modified version is

^Dialogue(?=.*?Song - (Romaji|Translation).*$)

which is a char less than your example ;-D
and also a little bit faster ;-D

regex((?-s)^Dialogue(?=.+Song - (Romaji|Translation))) -> took 0.010000 seconds
regex(^Dialogue(?=.?Song - (Romaji|Translation).$)) -> took 0.008000 seconds

Your turn ;-)

Cheers
Claudia

guy038

Hi, Claudia,

First of all, due to the Markdown syntax on our site, I suppose that two star symbols are missing in the second regex of your timing output. So the exact regexes are :

(?-s)^Dialogue(?=.+Song - (Romaji|Translation))      Me

^Dialogue(?=.*?Song - (Romaji|Translation).*$)       You

Oh ! I had never thought about timing a regex’s execution ! So, I lost by only 0.002s :-(( I’ll never recover from such an event !

Out of curiosity, could you time this similar regex : ^Dialogue(?=.+Song - (Romaji|Translation)) ? I just omitted the (?-s) modifier at the beginning. Of course, this implies that the . matches newline option must be unchecked in the Replace dialog before performing the S/R

Two remarks :

I still think that the block .*$ at the end of your regex is not necessary for knowing whether the string Song - Romaji OR Song - Translation occurs in the current line !
As these two strings may be located anywhere after the word Dialogue, I also don’t think that the lazy quantifier *? is necessary at the beginning of the look-ahead !

Finally, your regex could be shortened to ^Dialogue(?=.*Song - (Romaji|Translation))

which is quite similar to my regex ^Dialogue(?=.+Song - (Romaji|Translation)), without the (?-s) modifier !
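
A quick way to convince oneself that these variants give the same result on Danilo's data is sketched below with Python's re module (an assumption: Python stands in for the Notepad++ engine, whose dot behaviour depends on the . matches newline option). The dictionary keys are just hypothetical labels.

```python
import re

text = "\n".join([
    "Dialogue: 0,0:00:06.43,0:00:10.49,Song - Romaji,0,0,0,Subtitle Line",
    "Dialogue: 0,0:00:06.43,0:00:10.49,Song - Translation,0,0,0,Another Subtitle Line",
    "Dialogue: 0,0:02:37.90,0:02:40.36,Default,0,0,0,Some Default style line",
])

# Hypothetical labels for the three variants discussed in this thread.
variants = {
    "lazy, with .*$": r"^Dialogue(?=.*?Song - (Romaji|Translation).*$)",
    "greedy .*":      r"^Dialogue(?=.*Song - (Romaji|Translation))",
    "greedy .+":      r"^Dialogue(?=.+Song - (Romaji|Translation))",
}

results = {
    name: re.sub(regex, "Test_001", text, flags=re.MULTILINE)
    for name, regex in variants.items()
}

# All three variants should produce exactly the same replaced text.
all_equal = len(set(results.values())) == 1
print(all_equal)
```

The variants differ only in how the engine walks through the line, not in which lines are finally changed.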

Cheers,

guy038

Danilo de Queiroz

@guy038 @Claudia-Frank
Hi buddies, I didn’t test each of the expressions you mentioned, except the first one, which just did the job.
Now I’m trying to understand each of these characters for future use. Is there any good doc you can point me to?
Thanks to both of you for your help :)

guy038

Hi, Danilo,

First, don’t worry about the choice of regex. Our regexes are quite similar anyway ! It was just the pleasure of discussing with Claudia !

I just forgot to give you some information for improving your knowledge of regular expressions !

Begin with this article in the N++ Wiki :

http://docs.notepad-plus-plus.org/index.php/Regular_Expressions

In addition, you’ll find good documentation about the new Boost C++ Regex library, v1.55.0 ( similar to the PERL Regular Common Expressions, v1.48.0 ), used by Notepad++ since its 6.0 version, at the TWO addresses below :

http://www.boost.org/doc/libs/1_48_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html

http://www.boost.org/doc/libs/1_48_0/libs/regex/doc/html/boost_regex/format/boost_format_syntax.html

The FIRST link explains the syntax of regular expressions in the SEARCH part
The SECOND link explains the syntax of the format string used in the REPLACEMENT part

You may also look for valuable information on the sites below :

http://www.regular-expressions.info

http://www.rexegg.com

http://perldoc.perl.org/perlre.html

Be aware that, like any documentation, these may contain some errors ! Anyway, if you detect one, that’s good news : you’re improving ;-))

Claudia Frank

Hi to everyone,

Guy, after doing some tests,
I would say it doesn’t matter whether you use (?-s) or not,
because sometimes it was faster and sometimes not.
It seems that external influences have much more significance.

What can be said, though, is that non-greedy regexes beat greedy ones.

Out of interest, I did a test with two regexes which describe the pattern more concretely,
as I thought this could be faster, but it turned out it wasn’t.
I assume it is related to the fact that each char needs to be checked anyway.

Results of that simple test (10,000 iterations per regex)

(?-s)^Dialogue(?=.+Song - (Romaji|Translation))       -> took 14.708000 seconds
(?-s)^Dialogue(?=.*Song - (Romaji|Translation))       -> took 14.631000 seconds

(?-s)^Dialogue(?=.+Song - (Romaji|Translation).+$)    -> took 15.046000 seconds
(?-s)^Dialogue(?=.*Song - (Romaji|Translation).*$)    -> took 15.035000 seconds

^Dialogue(?=.+Song - (Romaji|Translation))            -> took 14.635000 seconds
^Dialogue(?=.*Song - (Romaji|Translation))            -> took 14.697000 seconds

^Dialogue(?=.+?Song - (Romaji|Translation))           -> took 13.568000 seconds    
^Dialogue(?=.*?Song - (Romaji|Translation))           -> took 13.575000 seconds

^Dialogue(?=.+Song - (Romaji|Translation).+$)         -> took 14.885000 seconds
^Dialogue(?=.*Song - (Romaji|Translation).*$)         -> took 14.947000 seconds

^Dialogue(?=.+?Song - (Romaji|Translation).+?$)       -> took 13.928000 seconds
^Dialogue(?=.*?Song - (Romaji|Translation).*?$)       -> took 13.972000 seconds

^Dialogue(?=: \d,\d\:\d\d\:\d\d\.\d\d,\d\:\d\d\:\d\d\.\d\d,Song - (Romaji|Translation)) -> took 16.183000 seconds
^Dialogue(?=: \d,\d\:\d{2}\:\d{2}\.\d{2},\d\:\d{2}\:\d{2}\.\d{2},Song - (Romaji|Translation)) -> took 20.889000 seconds
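
The script behind these numbers is not shown here. For readers who want to run a similar comparison, a rough sketch with Python's timeit module follows. Everything in it is an assumption: Python's re is a different engine from the one used by Notepad++, the test text (the three sample lines repeated 100 times) and the iteration count are arbitrary, and the absolute timings will not match the ones above.

```python
import re
import timeit

sample = [
    "Dialogue: 0,0:00:06.43,0:00:10.49,Song - Romaji,0,0,0,Subtitle Line",
    "Dialogue: 0,0:00:06.43,0:00:10.49,Song - Translation,0,0,0,Another Subtitle Line",
    "Dialogue: 0,0:02:37.90,0:02:40.36,Default,0,0,0,Some Default style line",
]
text = ("\n".join(sample) + "\n") * 100  # arbitrary size for the test text

candidates = {
    "greedy": r"^Dialogue(?=.+Song - (Romaji|Translation))",
    "lazy":   r"^Dialogue(?=.+?Song - (Romaji|Translation))",
}

outputs, timings = {}, {}
for name, regex in candidates.items():
    compiled = re.compile(regex, re.MULTILINE)
    # Keep one output per regex to check that both do the same job.
    outputs[name] = compiled.sub("Test_001", text)
    # Total time, in seconds, for 200 full replace-all passes.
    timings[name] = timeit.timeit(
        lambda c=compiled: c.sub("Test_001", text), number=200
    )
    print(f"{name:6s} -> took {timings[name]:.6f} seconds")
```

As with the figures above, differences of a few percent between runs are likely to be noise from whatever else the machine is doing.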

Cheers
Claudia

guy038

Hi, Danilo and Claudia,

Claudia, your series of tests on regexes is very interesting, indeed ! We can deduce some facts :

1) If a quantifier with the syntax Regex{n} has a small value, it’s better to use the syntax RegexRegex… Regex than Regex{n} ! ( Refer to the time difference between your last two examples : the regex containing \d\d and the one containing \d{2} )
2) When a range of text is NOT needed for the results of the regex, replace it with .* or .+ if needed text is located after it in the regex, ELSE don’t add it at all !
3) When it does not matter whether you use a lazy quantifier or a greedy one for the results of the regex, always prefer the lazy form !
4) If a range of text cannot be a zero-length string, prefer the + quantifier ( same as {1,} ) to the * quantifier ( same as {0,} )

Finally, Danilo, thanks to Claudia’s timing tests, and according to the rules above, the fastest and shortest regex for your case seems to be :

^Dialogue(?=.+?Song - (Romaji|Translation))        ( 13.568000 seconds for 10,000 iterations )

But, as this regex does not contain the (?-s) modifier at the beginning, just be sure that the . matches newline option is not enabled in the Replace dialog !
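
What goes wrong when the dot is allowed to match newlines can be sketched in Python's re module, where the re.DOTALL flag plays roughly the role of the . matches newline option (an assumption: this is an analogy with a different engine, not a description of Notepad++ internals). With that flag set, the look-ahead can reach a Song line further down the file, so a Default line above it is changed too.

```python
import re

text = "\n".join([
    "Dialogue: 0,0:02:37.90,0:02:40.36,Default,0,0,0,Some Default style line",
    "Dialogue: 0,0:00:06.43,0:00:10.49,Song - Romaji,0,0,0,Subtitle Line",
])

regex = r"^Dialogue(?=.+?Song - (Romaji|Translation))"

# Dot stops at the end of the line: only the Song line is changed.
safe = re.sub(regex, "Test_001", text, flags=re.MULTILINE)

# Dot may cross line ends: the Default line "sees" the Song line below it.
unsafe = re.sub(regex, "Test_001", text, flags=re.MULTILINE | re.DOTALL)

print(safe.count("Test_001"), unsafe.count("Test_001"))
```

This is why either the option must stay unchecked or the regex must state the dot behaviour explicitly.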

Cheers,

guy038

Claudia Frank

Hi Guy,

I would still use the more descriptive version with (?-s), because there isn’t really a difference in speed

(?-s)^Dialogue(?=.+Song - (Romaji|Translation))       -> took 14.708000 seconds
(?-s)^Dialogue(?=.*Song - (Romaji|Translation))       -> took 14.631000 seconds

^Dialogue(?=.+Song - (Romaji|Translation))            -> took 14.635000 seconds
^Dialogue(?=.*Song - (Romaji|Translation))            -> took 14.697000 seconds

but it has the advantage of setting the s switch explicitly, so you’re sure about what will be done.

Cheers
Claudia