Regex alternation in both FIND & REPLACE strings

M Andre Z Eckenrode

(I hereby swear that the advice requested in this post is intended for use specifically in Notepad++.) :-)

Not wanting an excellent suggestion to go to waste, I took special notice of the recent post by @Alan-Kilborn in which he provided an example of using regex alternation in both FIND and REPLACE strings, which I’d never seen done before. I found an opportunity to give it a try here, but it didn’t quite work out the way I intended. The following example text shows what it looks like before my intended regex operation:

1. “Track 1" (Outtake)	3:24
2. “Track 2" (Alternate Version)	3:30
3. “Track 3" (Drone Mix)	2:51
4. “Track 4" (Alternate Version)	4:08
5. “Track 5" (Rehearsal)	4:34
6. “Track 6" (Alternate Version)	3:34
7. “Track 7" (String Mix)	4:29
8. “Track 8" (Rehearsal)	1:40
9. “Track 9" (Alternate Version)	4:50
10. “Track 10" (Rehearsal)	2:52
11. “Track 11" (Piano Mix)	6:09
12. “Track 12" (unlisted instrumental)	5:17

My regex expressions (with bold curly quotes added to help clarify the presence of leading and/or trailing spaces):

REPLACE: “((?1)(?2 [alt])(?3 [drone mix])(?4 [rehearsal])(?5 [string mix])(?6 [piano mix])(?7 [instrumental]))” — ”

My intended result:

1. “Track 1” — 3:24
2. “Track 2 [alt]” — 3:30
3. “Track 3 [drone mix]” — 2:51
4. “Track 4 [alt]” — 4:08
5. “Track 5 [rehearsal]” — 4:34
6. “Track 6 [alt]” — 3:34
7. “Track 7 [string mix]” — 4:29
8. “Track 8 [rehearsal]” — 1:40
9. “Track 9 [alt]” — 4:50
10. “Track 10 [rehearsal]” — 2:52
11. “Track 11 [piano mix]” — 6:09
12. “Track 12 [instrumental]” — 5:17

My actual result:

1. “Track 1 [alt]” — 3:24
2. “Track 2 [drone mix]” — 3:30
3. “Track 3 [rehearsal]” — 2:51
4. “Track 4 [drone mix]” — 4:08
5. “Track 5 [string mix]” — 4:34
6. “Track 6 [drone mix]” — 3:34
7. “Track 7 [piano mix]” — 4:29
8. “Track 8 [string mix]” — 1:40
9. “Track 9 [drone mix]” — 4:50
10. “Track 10 [string mix]” — 2:52
11. “Track 11 [instrumental]” — 6:09
12. “Macbeth” — 5:17

Is there a trick I’m missing?

M Andre Z Eckenrode

Correction — last line of actual result should have read:

12. “Track 12” — 5:17

Terry R

@M-Andre-Z-Eckenrode said in Regex alternation in both FIND & REPLACE strings:

Is there a trick I’m missing?

No trick. Look again at the ( at the start. You have 2 before you get to the first alternate Outtake. It is inside the 2nd set so it is the ?2 in the replacement. All the others are also 1 out from what you thought.

If you paste the Find what regex into a Notepad++ document as plain text and move the caret through it, Notepad++ will highlight (in red) the partner of each ( or ), which helps you figure out where you need to prune an unwanted set. Alternatively, add 1 to each ?# number in the Replace With expression. That might get you closer to what you want.

I’m not on a PC currently to test but think that should do it.

Terry

Terry R

The best description I can currently find is here
http://rexegg.com/regex-capture.html#groupnumbers
Basically you count left brackets from left to right as 1,2,3,… . There are however exceptions to that rule including branch resets (Very similar to the type of coding you used) so you do need to know how to count these correctly.

I think you assumed ?1 referred to the first alternate selection, whereas that will always be the selection within the first set of brackets. So if an alternate code exists at the end of a complex regex quite possibly the first alternate group might be something like ?8.
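
The counting rule can be checked outside Notepad++. The sketch below uses Python's re module as a stand-in, which is an assumption: Notepad++ uses the Boost engine, but both number plain capturing groups by their opening parenthesis, left to right. The pattern is a hypothetical, shortened version of the find string with only two alternatives and straight quotes.

```python
import re

# Hypothetical, shortened find string. The OUTER parentheses are capturing,
# so they become group 1 and the two alternatives become groups 2 and 3.
pattern = re.compile(r'" ((\(Outtake\))|(\(Alternate Version\)))\t')

def matched_groups(line):
    """Return the numbers of the groups that took part in the match."""
    m = pattern.search(line)
    return [n for n in range(1, pattern.groups + 1) if m.group(n) is not None]

outtake = matched_groups('1. "Track 1" (Outtake)\t3:24')
alternate = matched_groups('2. "Track 2" (Alternate Version)\t3:30')
print(outtake)    # group numbers set on the "(Outtake)" line
print(alternate)  # group numbers set on the "(Alternate Version)" line
```

Group 1 is set on every line and "(Outtake)" lands in group 2, which would explain why the (?2 [alt]) branch fired on the first track.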

The site I linked above I find gives a lot of great information but be aware there are many regular expression engines out there and this site does have some information which does NOT fit exactly to Notepad++ needs.

Terry

M Andre Z Eckenrode

@Terry-R said in Regex alternation in both FIND & REPLACE strings:

Look again at the ( at the start. You have 2 before you get to the first alternate Outtake. It is inside the 2nd set so it is the ?2 in the replacement. All the others are also 1 out from what you thought.

Ha! Right you are! Which leads me to another realization (read on…)

Alternatively change each ?# in the Replace With code adding 1 to the current number.

I’m sure that would work as well, but I think what I really should have done to begin with was make the outer () a non-capture group, so my find string should have started " (?:(\(Outtake\))|(\(Alternate Version\))|… which I’ve now verified does the job.
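
For readers who want to try the non-capturing fix outside Notepad++, here is a minimal Python sketch. It rests on several assumptions: Python's re module has no (?1...) conditional replacement, so a hypothetical helper function (replace_label) emulates it; the find string is cut down to three alternatives; and straight quotes plus an EM DASH (written as \u2014) are used in the output.

```python
import re

# Hypothetical three-alternative version of the corrected find string.
# (?: ... ) groups the alternatives WITHOUT taking a group number,
# so the alternatives are groups 1, 2 and 3.
FIND = re.compile(r'" (?:(\(Outtake\))|(\(Alternate Version\))|(\(Rehearsal\)))\t')

# Text inserted for each group number, in the spirit of (?1)(?2 [alt])...
SUFFIX = {1: '', 2: ' [alt]', 3: ' [rehearsal]'}

def replace_label(match):
    # match.lastindex is the number of the last group that matched; with
    # sibling (non-nested) groups that is the alternative that was taken.
    return SUFFIX[match.lastindex] + '" \u2014 '

def convert(line):
    return FIND.sub(replace_label, line)

print(convert('1. "Track 1" (Outtake)\t3:24'))
print(convert('2. "Track 2" (Alternate Version)\t3:30'))
print(convert('5. "Track 5" (Rehearsal)\t4:34'))
```

With the outer group made non-capturing, the group numbers line up with the order of the alternatives, so no renumbering is needed on the replace side.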

Some background: My first attempt at this was made without the outer (), but I found I had to do something different because my initial results ended up looking like this:

1. “Track 1” — 	3:24
2. “Track 2"  [alt]” — 	3:30
3. “Track 3"  [drone mix]” — 	2:51
4. “Track 4"  [alt]” — 	4:08
5. “Track 5"  [rehearsal]” — 	4:34
6. “Track 6"  [alt]” — 	3:34
7. “Track 7"  [string mix]” — 	4:29
8. “Track 8"  [rehearsal]” — 	1:40
9. “Track 9"  [alt]” — 	4:50
10. “Track 10"  [rehearsal]” — 	2:52
11. “Track 11"  [piano mix]” — 	6:09
12. “Track 12"  [instrumental]” — 5:17

…in which all the characters outside of the alternation sequence in my FIND string (" at the beginning, \t at the end) apparently got ignored, except in the first line where the " actually did get replaced (and I’m not clear on why that happened, but maybe one of you regex masters is?). So I tried adding the outer (), but neglected to make it non-capturing.

Thanks much for the advice and helpful references, Terry.

guy038

Hello @m-andre-z-eckenrode, @terry-r and All,

First, I noticed that your text is displayed, on our forum, with the quotation characters “ and ” ( of Unicode code-points \x{201C} and \x{201D} ). And I suppose that your real text, before or after the regex operation, contains the usual " character ( \x22 ), only !
Secondly, in the replacement part, seemingly, you use the EM DASH ( \x{2014} ) character, instead of the usual - char ( \x2d )

So this leads to the following regex S/R :

SEARCH (?x-si) "\x20\( (?: (Outtake) | (Alternate\x20Version) | (unlisted\x20instrumental) | (.+?) ) \)\t

REPLACE (?1:\x20[(?2alt)(?3instrumental)(?4\L\4)])"\x20\x{2014}\x20

So from the text :

1. "Track 1" (Outtake)	3:24
2. "Track 2" (Alternate Version)	3:30
3. "Track 3" (Drone Mix)	2:51
4. "Track 4" (Alternate Version)	4:08
5. "Track 5" (Rehearsal)	4:34
6. "Track 6" (Alternate Version)	3:34
7. "Track 7" (String Mix)	4:29
8. "Track 8" (Rehearsal)	1:40
9. "Track 9" (Alternate Version)	4:50
10. "Track 10" (Rehearsal)	2:52
11. "Track 11" (Piano Mix)	6:09
12. "Track 12" (unlisted instrumental)	5:17

We get the expected text, with usual " quotation and the EM Dash char — before each track’s duration :

1. "Track 1" — 3:24
2. "Track 2 [alt]" — 3:30
3. "Track 3 [drone mix]" — 2:51
4. "Track 4 [alt]" — 4:08
5. "Track 5 [rehearsal]" — 4:34
6. "Track 6 [alt]" — 3:34
7. "Track 7 [string mix]" — 4:29
8. "Track 8 [rehearsal]" — 1:40
9. "Track 9 [alt]" — 4:50
10. "Track 10 [rehearsal]" — 2:52
11. "Track 11 [piano mix]" — 6:09
12. "Track 12 [instrumental]" — 5:17
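
The same search and replace can be approximated in Python as a cross-check. This is only a sketch under stated assumptions: Python's re module is not the Boost engine used by Notepad++, it has no conditional replacement syntax, and it does not accept the (?x-si) prefix in that global form, so the prefix is dropped and a hypothetical function (replacement) stands in for the REPLACE expression.

```python
import re

# The search regex from the post, written on one line without (?x-si).
# Case-sensitive matching and a dot that stops at line breaks are already
# the defaults of Python's re module.
SEARCH = re.compile(
    r'"\x20\((?:(Outtake)|(Alternate\x20Version)|(unlisted\x20instrumental)|(.+?))\)\t'
)

def replacement(m):
    if m.group(1) is not None:      # (?1 ... : empty THEN part, label dropped
        return '" \u2014 '
    if m.group(2) is not None:      # (?2alt)
        label = 'alt'
    elif m.group(3) is not None:    # (?3instrumental)
        label = 'instrumental'
    else:                           # (?4\L\4): lowercase the captured text
        label = m.group(4).lower()
    return f' [{label}]" \u2014 '

tracks = [
    '1. "Track 1" (Outtake)\t3:24',
    '2. "Track 2" (Alternate Version)\t3:30',
    '3. "Track 3" (Drone Mix)\t2:51',
    '12. "Track 12" (unlisted instrumental)\t5:17',
]
converted = [SEARCH.sub(replacement, t) for t in tracks]
for line in converted:
    print(line)
```

The catch-all (.+?) alternative is what keeps the group count at four: any label that is not special-cased is simply lowercased.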

Remark that I use the free-spacing mode (?x) in order to separate the different parts of the search regex for a better comprehension

Next time, I could explain these regex syntaxes, if you feel difficulties to understand them ! Presently, I’ve got some other posts to reply to ;-))

Best Regards,

guy038

M Andre Z Eckenrode

@guy038 said in Regex alternation in both FIND & REPLACE strings:

I noticed that your text is displayed, on our forum, with the quotation characters “ and ” ( of Unicode code-points \x{201C} and \x{201D} ). And I suppose that your real text, before or after the regex operation, contains the usual " character ( \x22 ), only !

Actually, I prefer and routinely use typographical (curly) quotation marks “ ” and ‘ ’ in text I’m editing, except when the straight variety " " and ' ' are required for any computer code purposes. But the vast majority of my text editing is done as ANSI/Windows-1252 rather than Unicode, so those code points are usually decimal 145–148, or hex 91–94 for me.

in the replacement part, seemingly, you use the EM DASH ( \x{2014} ) character, instead of the usual - char ( \x2d )

Also intentional, and sometimes I also use the EN DASH, as in my previous paragraph (in Windows-1252 the EN DASH and EM DASH are decimal code points 150 and 151, or hex 96 and 97).
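
The Windows-1252 code points mentioned here can be confirmed with a few lines of Python, shown as a quick sketch. The only assumption is that Python's "cp1252" codec is an acceptable stand-in for what Notepad++ calls ANSI on a Windows-1252 system; the dictionary name punctuation is made up for the example.

```python
# Typographic characters discussed in the thread, keyed by Unicode name.
punctuation = {
    'LEFT SINGLE QUOTATION MARK': '\u2018',
    'RIGHT SINGLE QUOTATION MARK': '\u2019',
    'LEFT DOUBLE QUOTATION MARK': '\u201c',
    'RIGHT DOUBLE QUOTATION MARK': '\u201d',
    'EN DASH': '\u2013',
    'EM DASH': '\u2014',
}

# Each of these characters encodes to a single byte in Windows-1252.
cp1252_values = {name: ch.encode('cp1252')[0] for name, ch in punctuation.items()}

for name, value in cp1252_values.items():
    print(f'{name}: U+{ord(punctuation[name]):04X} -> decimal {value}, hex {value:02X}')
```

So a regex written with \x{201D} targets the Unicode code point, while a Windows-1252 file stores the same character as the single byte hex 94.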

So this leads to the following regex S/R :

SEARCH (?x-si) "\x20\( (?: (Outtake) | (Alternate\x20Version) | (unlisted\x20instrumental) | (.+?) ) \)\t

REPLACE (?1:\x20[(?2alt)(?3instrumental)(?4\L\4)])"\x20\x{2014}\x20

Of course, those only work for me if I’m using Unicode text, but I’m usually not, and, as noted above, my use of “ ” was intentional, though that would be easy enough to change in your suggested expressions.

Remark that I use the free-spacing mode (?x) in order to separate the different parts of the search regex for a better comprehension

Yes, noted. I’ll keep that in mind for possible future use.

Next time, I could explain these regex syntaxes, if you feel difficulties to understand them !

I think I’m good, but thanks! For the record, these are the working expressions I settled on (not counting the surrounding black bold “ ”):

REPLACE: “((?1)(?2 [alt])(?3 [drone mix])(?4 [rehearsal])(?5 [string mix])(?6 [piano mix])(?7 [instrumental]))” — ”

I am still curious as to why my first attempted expressions, detailed in my previous post above, replaced the " in only the first line (1. “Track 1" (Outtake) 3:24), if you or anybody else has any idea.

guy038

Hi, @m-andre-z-eckenrode, @terry-r and All,

@m-andre-z-eckenrode, many thanks for your clear and fast reply ! Ok, so you could refer to this link to get the main punctuation characters :

https://www.unicode.org/charts/PDF/U2000.pdf

They belong to the Unicode General Punctuation block ( range U+2000 - U+206F ). Of course, the most common ones are also present in the usual Windows encodings, from Windows-1250 through Windows-1258, and can be typed with the input method Alt + 0### ! Refer to :

https://en.wikipedia.org/wiki/Windows-1252#Character_set

Thus, just replace the classical " char with the ” character, or the syntax \x{201D}, in the search and replacement regexes !

Regarding the free-spacing mode ( (?x) ), 4 important points to remember :

- In this mode, space(s) and line-break(s) are irrelevant. So if you need :
  - To search for a space, use the forms \x20 or [ ] or escape the space char with the \ character
  - To search for a line-break, use, as usual, the \r\n, or \r or \n or the composite \R syntaxes
- In this mode, any # character begins a single-line comment zone. So, use the syntaxes \x23, \# or [#] to search for a literal # char
- In this mode, you may split your search regex over several lines for readability, commenting each line ! See below.
- Unfortunately, this mode cannot be used for the Replace zone
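
These rules can be tried out in Python, whose re.VERBOSE flag behaves in a closely analogous way. This is a sketch only, and the assumption is that Python's re module is a fair stand-in here; the \R syntax and the Replace zone restriction are specific to Notepad++ and are not covered.

```python
import re

X = re.VERBOSE  # Python's name for free-spacing mode, same idea as (?x)

# 1. Unescaped spaces in the pattern are ignored: 'a b' is really the regex "ab".
space_ignored = re.fullmatch(r'a b', 'ab', X) is not None
space_not_literal = re.fullmatch(r'a b', 'a b', X) is None

# 2. Three ways to require a literal space.
literal_space = [re.fullmatch(p, 'a b', X) is not None
                 for p in (r'a\x20b', r'a[ ]b', r'a\ b')]

# 3. An unescaped # starts a comment; escape it to match a real #.
comment_dropped = re.fullmatch(r'a  # trailing comment', 'a', X) is not None
literal_hash = [re.fullmatch(p, '#', X) is not None
                for p in (r'\x23', r'\#', r'[#]')]

# 4. The pattern may span several commented lines.
multi = re.compile(r'''
    "  \x20  \(      # a quote, a space and an opening parenthesis
    (Outtake)        # the label, stored in group 1
    \)  \t           # a closing parenthesis and a tab
''', X)
label = multi.search('1. "Track 1" (Outtake)\t3:24').group(1)

print(space_ignored, space_not_literal, literal_space, comment_dropped, literal_hash, label)
```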

(?x)                          # FREE-SPACING mode
(?-si)                        # Any DOT matches a SINGLE STANDARD char, ONLY and the search process is SENSITIVE to case
"  \x20  \(                   # A DOUBLE-QUOTATION mark , a SPACE char and an OPENING PARENTHESIS
(?:                           # Start of a NON-CAPTURING group
  (Outtake)                   #   String "Outtake", STORED in group 1
|                             # OR
  (Alternate\x20Version)      #   String "Alternate Version", STORED in group 2
|                             # OR
  (unlisted\x20instrumental)  #   String "unlisted instrumental", STORED in group 3
|                             # OR
  (.+?)                       #   The SMALLEST NON-NULL range of STANDARD characters, ... till an ENDING PARENTHESIS, STORED in group 4
                              #     ( It corresponds to the OTHER strings which JUST need to be REWRITTEN in LOWERCASE, during the REPLACEMENT phase )
)                             # End of the NON-CAPTURING group
\)  \t                        # An ENDING PARENTHESIS and a TABULATION char

Just select all the lines and open the Search or Replace dialog !

Although NOT allowed in the replacement zone, here is the logic of the replacement :

(?1                         # CONDITIONAL replacement, based on EXISTENCE of group 1
                            #   Part THEN, before the COLON, EMPTY => SUPPRESSION of the string "Outtake"
:                           # ELSE ( for ANY group DIFFERENT from group 1 or anything NOT in a group )
  \x20  [                   #   A SPACE char and an OPENING SQUARED bracket are rewritten
  (?2alt)                   #   CONDITIONAL replacement, based on EXISTENCE of group 2 => The string "Alternate Version" is changed into "alt"
  (?3instrumental)          #   CONDITIONAL replacement, based on EXISTENCE of group 3 => The string "unlisted instrumental" is changed into "instrumental"
  (?4\L\4)                  #   CONDITIONAL replacement, based on EXISTENCE of group 4 => Any OTHER string is just REWRITTEN in LOWERCASE, due to the \L MODIFIER
  ]                         #   an ENDING SQUARED bracket is rewritten
)                           # End of the first CONDITIONAL replacement
"  \x20  \x{2014}  \x20     # A DOUBLE-QUOTATION mark, a SPACE char, an EM_DASH char and a final SPACE char are rewritten

Allow me to repeat : the multi-lines replacement, right above, must NOT be used and is intended for explanations only !

You said :

I am still curious as to why my first attempted expressions, detailed in my previous post above

Well, let’s imagine the regex (ABC|DEF|XYZ)\d+. This regex is equivalent to the regex ABC\d+|DEF\d+|XYZ\d+ which searches for strings ABC, DEF or XYZ, each, followed by, at least, one digit

But, of course, if you omit the parentheses, the regex ABC|DEF|XYZ\d+ searches for the string ABC OR the string DEF OR the string XYZ followed by, at least, one digit
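
The difference between the two forms is easy to see in a short Python sketch (Python's re module is assumed as a stand-in; alternation has the lowest precedence there as well):

```python
import re

grouped = re.compile(r'(ABC|DEF|XYZ)\d+')    # \d+ applies to all three strings
ungrouped = re.compile(r'ABC|DEF|XYZ\d+')    # \d+ applies to XYZ only

samples = ['ABC12', 'DEF7', 'XYZ345']
grouped_hits = [grouped.search(s).group(0) for s in samples]
ungrouped_hits = [ungrouped.search(s).group(0) for s in samples]

print(grouped_hits)    # full strings, digits included
print(ungrouped_hits)  # digits are only included after XYZ
```

Applied to the first attempt, this is presumably why the leading " was consumed only with the first alternative and the trailing tab only with the last one.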

Now , let’s try to simplify your search regex :

Inside the non-capturing group each string is embedded with \( and \) => they could be placed outside the non-capturing group ( I use the \x20 syntax for a space, in order to avoid your black bold notation ! )

In my search regex, note that I did not take in account the strings different from Outtake, Alternate Version and unlisted instrumental, as all the other strings, caught with the (.+?) syntax, are simply rewritten in lowercase, due to the \L replacement syntax

Your replacement regex may be simplified, too : the outer parentheses are useless :

=> (?1)(?2 [alt])(?3 [drone mix])(?4 [rehearsal])(?5 [string mix])(?6 [piano mix])(?7 [instrumental])\x{201D}\x20\x{2014}\x20

This means that the part "\x20\x{2014}\x20, at the end, is rewritten whatever group matches the search regex. In other words, it’s like this pseudo syntax (?*\x{201D}\x20\x{2014}\x20), for any group, was possible !

Cheers,

guy038

M Andre Z Eckenrode

@guy038 said in Regex alternation in both FIND & REPLACE strings:

Ok, so you could refer to this link to get the main punctuation characters :

https://www.unicode.org/charts/PDF/U2000.pdf

Thanks. I’ll keep that in mind for if/when I ever need to use Unicode text on a large scale, though I’m not currently anticipating it will happen.

This means that the part "\x20\x{2014}\x20, at the end, is rewritten whatever group matches the search regex.

Ok, thanks for the detailed explanation and further suggestions as well. I will look for opportunities to apply them in future projects.