Replace character in capture group

M Andre Z Eckenrode

I could have sworn I’ve asked this question before, and/or perhaps previously researched it online and located some code somewhere that I was able to adapt into something that worked for me at the time, but I’m not finding my adapted code, nor relevant posts here, nor any code I think I’m likely to have adapted from, now. I have a bunch of data that looks like this:

"10001-word1-word2-word3"
"35058-word4-word5"
"41115-word6-word7-word8-word9-word10"
"47172-word11-word12"
"53229-word13-word14-word15"
"59286-word16-word17-word18-word19"
"65343-word20-word21-word22-word23-word24-word25"
"71400-word26-word27"
"77457-word28-word29-word30"

I want to end up with this:

word1 word2 word3
word4 word5
word6 word7 word8 word9 word10
word11 word12
word13 word14 word15
word16 word17 word18 word19
word20 word21 word22 word23 word24 word25
word26 word27
word28 word29 word30

So basically, just all of the word#s, with hyphens replaced by spaces. I get as far as "\d+-(.+?)", but that doesn’t replace the hyphens, of course. I see a few Stack Overflow topics here and here that appear to be tackling similar operations, but I fail to understand the proposed solutions sufficiently to make it possible for me to adapt them myself.

Also, can anybody tell me what particular settings are the best to use for testing regex intended for NPP at Regex101? I notice that \l works in NPP, and in NPP Python scripts, to match a lowercase letter, but not at Regex101 when I have it set to PCRE, in which case I have to change \l to [[:lower:]] or [a-z]. But if I switch Regex101 to Python, my code that otherwise worked no longer does, and I’m not sure what the issues are.

M Andre Z Eckenrode

Clarification: The data example provided in my first post does not represent ALL of the data in my file, and some of the other data contains hyphens, but none of the other data looks like "#####-word#-word#-word#", etc.

Terry R

@M-Andre-Z-Eckenrode said in Replace character in capture group:

and some of the other data contains hyphens, but none of the other data looks like “#####-word#-word#-word#”, etc.

I doubt anyone is going to (be able to) give you a solution when that statement above is given. To spend time on a solution we need ALL the types of data that you need to change, in much the same way as you portrayed the data examples above. A regex can be made for that, but if it doesn’t do the whole job it may be a waste of effort.

Terry

Ekopalypse

@M-Andre-Z-Eckenrode said in Replace character in capture group:

Also, can anybody tell me what particular settings are the best to use for testing regex intended for NPP at Regex101?

It is PCRE, but npp uses boost::regex, which is a different implementation (with some additions) from the one regex101 uses.
PythonScript also uses the boost::regex implementation if you use the research and rereplace methods,
but if you use import re and its methods, it uses the Python re engine.

Which means that the different regex engines use some common concepts, but “the devil is in the details”.
Just for the record, the PythonScript plugin has a regex tester example,
this could be used to test your regexes - I know others don’t like it but
I think it has the best intersection with “native” npp regex engine.

M Andre Z Eckenrode

@Terry-R said in Replace character in capture group:

we need ALL the types of data that you need to change

The example in my first post is the only type of data I need to change, at least at this time, and can be reliably matched by "\d+-(.+?)", as far as I have seen so far. Other examples of hyphens used throughout my data file are probably too numerous to try to give them all here, but none of the others I’ve made note of are sandwiched between "\d+- and ". I’m just hoping to be able to replace what’s matched with just my target words AND replace the hyphens with spaces at the same time.

@Ekopalypse said in Replace character in capture group:

the PythonScript plugin has a regex tester example

Anything to be said for it versus the separate Regex Trainer plugin? I’ve used the latter a bit, but found it to be not as informative as Regex101.

Ekopalypse

@M-Andre-Z-Eckenrode said in Replace character in capture group:

Anything to be said for it versus the separate Regex Trainer plugin? I’ve used the latter a bit, but found it to be not as informative as Regex101.

I wasn’t aware of this one - the script doesn’t do any explanation like the regex101 site. It just colors the text based on your regex instantly.

Ekopalypse

@M-Andre-Z-Eckenrode

The RegexTrainer seems to use the dotnet regex engine which, if I remember correctly,
is based on the java regex engine.

guy038

Hello, @m-andre-z-eckenrode, @ekopalypse, @terry-r and All,

@m-andre-z-eckenrode, I’m afraid that the sole decent regex tester is RegexBuddy from the Just Great Software company, created by Jan Goyvaerts.

Of course, this product is not free, but if you need regex support that you cannot get from tests in Notepad++, it’s The Reference product in the “regex world” !! Note that a price of about 35$ for such powerful software is not very expensive, considering its good return on investment ;-))

For instance, just have a look at all the regex flavors that RegexBuddy can handle : 264 flavors !! Refer to the list, below :

https://www.regexbuddy.com/compare.html#flavors

And I would say, that the closest flavor to use with N++ should be boost::wregex 1.66–1.73 as N++ is compiled with Boost v1.70

Now, let’s go back to your problem :

From your initial text :

"10001-word1-word2-word3"
"35058-word4-word5"
"41115-word6-word7-word8-word9-word10"
"47172-word11-word12"
"53229-word13-word14-word15"
"59286-word16-word17-word18-word19"
"65343-word20-word21-word22-word23-word24-word25"
"71400-word26-word27"
"77457-word28-word29-word30"

I suppose that the following regex S/R

SEARCH ^"\d+-|\G([\u\l]+\d+)((-)|")

REPLACE ?1\1?3\x20

just gives the expected output :

word1 word2 word3
word4 word5
word6 word7 word8 word9 word10
word11 word12
word13 word14 word15
word16 word17 word18 word19
word20 word21 word22 word23 word24 word25
word26 word27
word28 word29 word30

However, be aware of two points :

If, right before running this S/R, the cursor is on a line outside your block that begins with a double quote followed by some digits and a dash (for instance, "987772-), a match would occur and be deleted !
In the same way, due to the \G feature, if the cursor is outside your block, right before the string word123- or word37", a match would also occur and would be changed to word123 followed by a space char, or to word37

A simple solution, to avoid any problem, is to use the Wrap around option and insert an empty line to the very beginning of file
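For readers who would rather script this than run it from the Replace dialog, here is a minimal sketch of the general "replace a character inside a capture group" idea, written for Python's own re module (not the Boost engine that Notepad++ uses). It reuses the "\d+-(.+?)" pattern from the question and passes a callback as the replacement; the sample list and the names ITEM and hyphens_to_spaces are made up for illustration.

```python
import re

# Hypothetical sample: two lines from the question plus one unrelated line
# that contains hyphens and must be left alone.
lines = [
    '"10001-word1-word2-word3"',
    '"35058-word4-word5"',
    'some-other-data',
]

# The pattern from the question: quote, digits, hyphen, lazily captured words, quote.
ITEM = re.compile(r'"\d+-(.+?)"')

def hyphens_to_spaces(match):
    # Only the text captured by group 1 is kept, with its hyphens turned into spaces.
    return match.group(1).replace("-", " ")

result = [ITEM.sub(hyphens_to_spaces, line) for line in lines]
print(result)
```

Because the hyphen replacement happens inside the callback, hyphens outside the matched items are never touched.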

Best Regards

guy038

M Andre Z Eckenrode

@Ekopalypse & @guy038, thank you both for the information about the various regex testers.

@guy038 said in Replace character in capture group:

I suppose that the following regex S/R

SEARCH ^"\d+-|\G([\u\l]+\d+)((-)|")

REPLACE ?1\1?3\x20

just gives the expected output :

Works well in NPP here. Thanks much again. (Note: When I tried using this search pattern in the Regex Trainer: Expression panel, it reported an error parsing it. The Regex Tester that came as a sample script with Python Script looks like I need to study it a bit before I can effectively use it.)

A simple solution, to avoid any problem, is to use the Wrap around option and insert an empty line to the very beginning of file

Thanks for the warnings. It shouldn’t be a problem with my usage, though.

Makwana Prahlad

Hello,@M-Andre-Z-Eckenrode
Please follow these steps, To Replace a character in the capture group

Step 1:- Open Notepad++ with the file for replace
Step 2:- Replace menu Ctrl+H
Step 3:- or Find menu - Ctrl+F
Step 4:- Check the Regular expression (at the bottom)
Step 5:- Write in Find what
Step 6:- \d+
Step 7:- Replace with:X
Step 8:- ReplaceAll

I hope this information will be useful to you.
Thank you.

M Andre Z Eckenrode

@guy038:

As I wrote in my previous reply to your post, your suggested regex code works perfectly for my current needs. I’m not sure I’ll ever have another need for the same sort of operation (replacing all instances of a particular character within a capture group), but I’d like very much to be able to have a full understanding of how your regex code works, so that I could effectively apply it in some other situation, should the need ever arise. Towards that end, I’ve been generating different sets of sample data of my own invention, and attempting to adapt your code to operate on them. For example, I came up with this fictitious set of HTML code:

<span class="contributors">Writer – <a href="/contribs/001-John-Doe">J. Doe</a></span><span class="contributors">Producer – <a href="/contribs/002-Timothy-Smith">T. Smith</a></span><span class="contributors">Director – <a href="/contribs/003-Jane-Johnson">J. Johnson</a></span>

My attempt to adapt your regex to work on the code above:

FIND : – <a href="\/contribs\/\d+-|\G(\w+)((-)|")>.+?<\/a><\/span>
REPLACE: — ?1\1?3\x20\r\n

My desired result:

<span class="contributors">Writer — John Doe
<span class="contributors">Producer — Timothy Smith
<span class="contributors">Director — Jane Johnson

Actual result:

<span class="contributors">Writer — John-Doe">J. Doe</a></span><span class="contributors">Producer — Timothy-Smith">T. Smith</a></span><span class="contributors">Director — Jane-Johnson">J. Johnson</a></span>

If I’d gotten that to work so far, I next would have tried my hand at also removing the <span> tags at the beginning and swapping the two sets of remaining information before and after the m-dash:

John Doe — Writer
Timothy Smith — Producer
Jane Johnson — Director

Any chance you could break your regex code down and explain the various parts to me? Much of it I’m sure I already know, from other (simpler) operations I’ve done, but here’s all I know or have guessed so far:

^ = Beginning of line (or of text, if (?s) setting were used)

"\d+- = literal double-quote, followed by 1 or more digits, followed by literal hyphen

| = I know this as a divider between alternation expressions, but not sure why it’s used here in this location, since "\d+- isn’t part of an alternation sequence

\G = matches only at end of last match found, or at start of text being matched if no previous match found

([\u\l]+\d+) = Capture group 1: Any combination of one or more upper or lower case characters, followed by one or more digits

((-)|") = Capture group 2: Alternation sequence of a hyphen (capture group 3), OR a double quote mark, one of which follows each word matched by ([\u\l]+\d+)

The REPLACE expression is trickier, for me:

?1 = Not strictly familiar with this use, that I recall, though it looks somewhat like (?N), defined in the Boost v1.70.0 regex docs as a recursive execution of sub-expression N, but without the parentheses.

\1 = Capture group 1 (AKA sub-expression 1, I think)

?3 = Execution of sub-expression 3?

\x20 = space

Terry R

@M-Andre-Z-Eckenrode said in Replace character in capture group:

but here’s all I know or have guessed so far:

Unfortunately some of your guesses aren’t quite right. Might I suggest you plug this into the website:
https://regex101.com/
as that provides great explanations of all the sub-expressions.

As a starter you will see that the | is in fact the alternation symbol and yes, the "\d+- DOES form part of an alternate sub-expression.

Terry

guy038

Hi, @m-andre-z-eckenrode, @ekopalypse, @terry-r and All,

First I would like to apologize ! Indeed, in the example of your previous post, the different parts to search are consecutive. So the \G assertion, which searches from the location of the end of the previous match, is not needed at all !

So my previous S/R is :

SEARCH ^"\d+-|([\u\l]+\d+)((-)|")

REPLACE ?1\1?3\x20
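To make the replacement part easier to follow, here is a rough Python emulation of this S/R. It reads ?1 and ?3 as conditional tests ("if group N took part in the match"), which is how the Boost replacement format treats them. Python's re module has no such conditionals and no \u or \l classes, so the sketch uses a callback and [A-Za-z] instead; the names SEARCH and conditional_replace are invented for the example.

```python
import re

# Same two alternatives as the SEARCH above, with [A-Za-z] standing in for [\u\l].
SEARCH = re.compile(r'^"\d+-|([A-Za-z]+\d+)((-)|")', re.MULTILINE)

def conditional_replace(match):
    # First alternative ("12345-) matched: no groups took part, so emit nothing.
    if match.group(1) is None:
        return ""
    # Second alternative matched: keep the word (group 1) and add a space
    # only when it was followed by a hyphen (group 3), not by the closing quote.
    return match.group(1) + (" " if match.group(3) is not None else "")

text = '"10001-word1-word2-word3"\n"35058-word4-word5"'
converted = SEARCH.sub(conditional_replace, text)
print(converted)
```

Each line is consumed in several small matches: the leading "digits- prefix is dropped, and every word is rewritten together with the character that follows it.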

Now, in your recent example, the general idea is to match a complete range <span...... and to only extract pertinent parts that you want to keep in replacement and re-order them as you like !

I will use the free-spacing mode ( (?x) ) which generally helps to better understand complicated regexes . In this mode, the regex can be split over several lines.

Any line can be commented after a # symbol. To search for a literal # just escape it \#
Any space symbol is irrelevant so use the syntaxes \x20, [ ] or escape it with the \ symbol to search for a space char

Before, Just an example to grasp the nuance between greedy and lazy quantifiers :

Let’s suppose the regex ABC.+XYZ, with the greedy quantifier +, against the string 67890ABC123451234512345XYZ678906789067890XYZ12345 => It catches the string ABC123451234512345XYZ678906789067890XYZ, so the greatest non-null range of chars between the strings ABC and XYZ

Now, if we add a question mark right after the sign +, we get the regex ABC.+?XYZ, with the lazy quantifier +?. Thus, it would only match the string ABC123451234512345XYZ which is the smallest non-null range of chars between the strings ABC and XYZ
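The difference between the greedy .+ and the lazy .+? can be checked quickly with Python's re module, which behaves the same way as Notepad++ for this simple case. This is only an illustrative sketch; the variable names are arbitrary.

```python
import re

subject = "67890ABC123451234512345XYZ678906789067890XYZ12345"

# Greedy: .+ grabs as much as possible, so the match runs to the LAST XYZ.
greedy = re.search(r"ABC.+XYZ", subject).group()

# Lazy: .+? grabs as little as possible, so the match stops at the FIRST XYZ.
lazy = re.search(r"ABC.+?XYZ", subject).group()

print(greedy)
print(lazy)
```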

OK. So, the search regex can be written according to this form :

(?x)                              #  FREE-SPACING mode
(?-s)                             #  A DOT matches a SINGLE STANDARD char ( Not EOL chars )
  <span\x20class="contributors">  #    LITERAL string  <span class="contributors">
(                                 #  START of CAPTURING group 1 ( the PROFESSION )
  .+?                             #    SMALLEST NON-NULL range of STANDARD characters... till the string \x20–\x20<a
)                                 #  END of CAPTURING group 1
\x20–\x20<a                       #  LITERAL string SPACE + EN-Dash \x{2013} + SPACE + "<a"
.+?                               #  SMALLEST NON-NULL range of STANDARD characters... till a DASH punctuation sign
-                                 #  The LITERAL DASH punctuation sign
(                                 #  START of CAPTURING group 2 ( the COMPLETE name )
  .+?                             #    SMALLEST NON-NULL range of STANDARD characters... till the string ">
)                                 #  END of CAPTURING group 2
">                                #  LITERAL string  ">
.+?                               #  SMALLEST NON-NULL range of STANDARD characters... till the string </span>
</span>                           #  LITERAL string </span>

And written in a single line, it becomes :

SEARCH (?x-s)<span\x20class="contributors">(.+?)\x20–\x20<a.+?-(.+?)">.+?</span>

Unfortunately, this free-spacing mode is not available for the replacement regex syntax. So we still need to write :

REPLACEMENT \2 — \1\r\n which can be decomposed as :

\2        = The COMPLETE name ( Group 2 )
 —        = A SPACE char + an EM DASH char \x{2014} + a SPACE
\1        = The PROFESSION  ( Group 1 )
\r\n      = A LINE-BREAK

So, from your initial text :

<span class="contributors">Writer – <a href="/contribs/001-John-Doe">J. Doe</a></span><span class="contributors">Producer – <a href="/contribs/002-Timothy-Smith">T. Smith</a></span><span class="contributors">Director – <a href="/contribs/003-Jane-Johnson">J. Johnson</a></span>

After running the regex S/R, we get :

John-Doe — Writer
Timothy-Smith — Producer
Jane-Johnson — Director

Now, we just have to run this trivial regex S/R, to change any dash, between the forename and the name, with a space character

SEARCH -

REPLACE \x20

Here is your expected text :

John Doe — Writer
Timothy Smith — Producer
Jane Johnson — Director
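If a script is an option, both steps can be folded into one pass so that only the hyphens inside the captured name are changed. The following is a sketch for Python's re module, not for the Notepad++ Replace dialog. It uses the same search regex as above, with \u2013 for the EN dash and \u2014 for the EM dash; the third entry (Mary Anne Perry, id 004) is a made-up sample to show a name with more than two parts.

```python
import re

# Sample data on one line; the last entry is hypothetical.
html = (
    '<span class="contributors">Writer \u2013 <a href="/contribs/001-John-Doe">J. Doe</a></span>'
    '<span class="contributors">Producer \u2013 <a href="/contribs/002-Timothy-Smith">T. Smith</a></span>'
    '<span class="contributors">Director \u2013 <a href="/contribs/004-Mary-Anne-Perry">M. Perry</a></span>'
)

# Group 1 = the profession, group 2 = the complete hyphenated name.
CREDIT = re.compile(r'<span class="contributors">(.+?) \u2013 <a.+?-(.+?)">.+?</span>')

def reorder(match):
    # Hyphens are replaced only inside group 2, so other hyphens are untouched.
    name = match.group(2).replace("-", " ")
    return f"{name} \u2014 {match.group(1)}\n"

credits = CREDIT.sub(reorder, html)
print(credits)
```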

Now, in order to be fluent in regex matters, I’d like to advise you not to fixate on these ready-made regex examples from this forum and, instead, to start the "b-a-ba" with this excellent tutorial on regular expressions ( the reference !)

https://www.regular-expressions.info/

You’ll probably need half a month to get acquainted with it and, let’s say, four months to build up correct regexes, for a specific need, in a few minutes ! But it’s really worth it ;-))

Best Regards,

guy038

M Andre Z Eckenrode

@Terry-R said in Replace character in capture group:

Unfortunately some of your guesses aren’t quite right.

Figured that would turn out to be the case. :-)

Might I suggest you plug this into the website:
https://regex101.com/

I have fairly often used that site — in fact, I brought up the subject of my mixed successes with it in my first post for this topic thread — and concur that it’s often helpful and informative, but sometimes frustrating, at least for an amateur whose ambitions often exceed his understanding and abilities, like me. For the regex operations we’re discussing in this thread, Regex101 seems not very helpful at all with the substitution expressions. If I plug @guy038’s original suggested expressions (in response to my first post) into Regex101:

FIND: ^"\d+-|\G([\u\l]+\d+)((-)|")

REPLACE: ?1\1?3\x20

…I have to change [\u\l] to something else like [[:alpha:]] because PCRE via Regex101 apparently doesn’t recognize the former. And used there, the substitution expression results in:

?1?3 ?1word1?3 ?1word2?3 ?1word3?3 
?1?3 ?1word4?3 ?1word5?3 
?1?3 ?1word6?3 ?1word7?3 ?1word8?3 ?1word9?3 ?1word10?3

I don’t know if there are other ways of expressing it that are Regex101/PCRE-friendly.

@guy038 said in Replace character in capture group:

First I would like to apologize !

No apologies necessary! You’re way better at this than I am, and I appreciate your help (and everyone else’s)!

So the \G assertion, which searches from the location of the end of the previous match, is not needed at all !

Noted, and thanks for all the detailed explanations.

Now, we just have to run this trivial regex S/R, to change any dash, between the forename and the name, with a space character

I’m afraid that would be a less-than-ideal solution, but I think it’s my own fault for neglecting to provide adequate examples and explanation. In the fictitious example HTML code I provided, all the contributors had only first and last names, but of course in real life some people get referred to using three or more names — John David Hatch, Mary Anne Perry, etc. I was specifically trying to adapt your regex search/replace methods in ^"\d+-|\G([\u\l]+\d+)((-)|") and ?1\1?3\x20 to use with my made-up HTML, and would want it to also work if any persons had three or more names. Also, I assume that if I ever actually needed to operate on HTML similar to my example code, there might also be other hyphens, outside of the blocks of code I’d be targeting for manipulation, that need to be left alone. Again, I failed to mention these possibilities in my posts, even though I had them in my mind, and I apologize.

https://www.regular-expressions.info/

I have consulted that site on occasion as well.

Trying a modified tactic now… My data to be manipulated:

<p class="credits"><span class="contributors">Writer – <a href="/contribs/001-John-Doe">J. Doe</a>, <a href="/contribs/003-Jane-Johnson">J. Johnson</a></span><span class="contributors">Producer – <a href="/contribs/002-Timothy-Smith">T. Smith</a></span><span class="contributors">Director – <a href="/contribs/003-Jane-Johnson">J. Johnson</a></span></p>

The difference between the HTML immediately above and that which I’d posted here is that now there are two names/hyperlinks after “Writer”, so I’m looking to make this step of regex break the credit role/name(s) into one line per set, whether or not there are multiple names/hyperlinks given for a credit role.

FIND: (?:()|(<\/span>)\1|\2<\/p>)

REPLACE: (?1\t\1)(?2\2\r\n\t\1)(?3\2)

Desired result:

	<span class="contributors">Writer – <a href="/contribs/001-John-Doe">J. Doe</a>, <a href="/contribs/003-Jane-Johnson">J. Johnson</a></span>
	<span class="contributors">Producer – <a href="/contribs/002-Timothy-Smith">T. Smith</a></span>
	<span class="contributors">Director – <a href="/contribs/003-Jane-Johnson">J. Johnson</a></span>

Actual result:

	<span class="contributors">Writer – <a href="/contribs/001-John-Doe">J. Doe</a>, <a href="/contribs/003-Jane-Johnson">J. Johnson</a></span><span class="contributors">Producer – <a href="/contribs/002-Timothy-Smith">T. Smith</a></span><span class="contributors">Director – <a href="/contribs/003-Jane-Johnson">J. Johnson</a></span></p>

Looks like in both NPP and Regex101, only the first alternation expression () matches anything. No idea why the other two won’t. I can match any of them separately, but not as other than a first alternation expression.

If I had gotten this to work, my next, separate regex step would be to try to get to this:

	John Doe, Jane Johnson — writer
	Timothy Smith — producer
	Jane Johnson — director
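One possible way to reach that final form in a script is to work in two levels: first find each credit span, then collect every name inside it. This is only a sketch for Python's re module, under the assumption that the markup always follows the sample above; SPAN, NAME and credit_lines are names invented here, and the leading tab is left out.

```python
import re

html = (
    '<p class="credits">'
    '<span class="contributors">Writer \u2013 '
    '<a href="/contribs/001-John-Doe">J. Doe</a>, '
    '<a href="/contribs/003-Jane-Johnson">J. Johnson</a></span>'
    '<span class="contributors">Producer \u2013 '
    '<a href="/contribs/002-Timothy-Smith">T. Smith</a></span>'
    '<span class="contributors">Director \u2013 '
    '<a href="/contribs/003-Jane-Johnson">J. Johnson</a></span></p>'
)

# Outer level: the role, then everything up to the closing </span>.
SPAN = re.compile(r'<span class="contributors">(.+?) \u2013 (.+?)</span>')
# Inner level: the hyphenated name that follows the numeric id in each link.
NAME = re.compile(r'href="/contribs/\d+-([^"]+)"')

credit_lines = []
for role, links in SPAN.findall(html):
    names = [name.replace("-", " ") for name in NAME.findall(links)]
    credit_lines.append(f"{', '.join(names)} \u2014 {role.lower()}")

print("\n".join(credit_lines))
```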

M Andre Z Eckenrode

Ok, so it looks like I can use:

(?:()|(<\/span>)|<\/span><\/p>)

…but not:

(?:()|(<\/span>)\1|\2<\/p>)

…so I think I’ve learned that numbered backreferences used in alternation sequences are unique for each sequence. That wasn’t clear to me from the online docs for NPP and Boost Perl Regular Expression Syntax 1.70.0, but I guess makes sense now that I think about it. :-)

Alan Kilborn

@M-Andre-Z-Eckenrode said in Replace character in capture group:

…but not:

Not 100% sure because I haven’t followed the preceding in a super-detailed fashion, but maybe what you’re looking for is called a “subroutine call” and not a “backreference”?

The syntactical difference is:

\1 -> backreference
(?1) -> subroutine

See more in this excellent posting: https://community.notepad-plus-plus.org/post/56447

If I’m totally off-base, well, at least the “excellent posting” reference contains some otherwise good stuff. :-)

M Andre Z Eckenrode

@Alan-Kilborn said in Replace character in capture group:

maybe what you’re looking for is called a “subroutine call” and not a “backreference”?
See more in this excellent posting:

I don’t THINK I’m confusing the two (I’m actually trying to utilize both), though considering my track record with this particular exercise, it wouldn’t come as a complete shock to learn otherwise. But thanks in any case for the link to that truly informative post. I think I could, however, benefit from many working examples of usage in various situations.

As far as named capture groups go, I can’t get any of the syntaxes listed in the post and the online NPP doc to actually work in NPP. For example, given text ABCDEFGHIJKLMNOPQRSTUVWXYZ, and search expression ABC(?<Name>.+?)XYZ, I get the following:

Replacement Expression		Result
------------------------------------------
\g<Name>             	=	g<Name>
\g'Name'             	=	g'Name'
\g{Name}             	=	g{Name}

Equivalent results using \k. Do any of these actually work for anybody else?

Alan Kilborn

@M-Andre-Z-Eckenrode said in Replace character in capture group:

I can’t get any of the syntaxes

If I use this as the replace-with expression for your search-for expression and data:

find: ABC(?<Name>.+?)XYZ
repl: abc_$+{Name}_xyz
data to search: ABCDEFGHIJKLMNOPQRSTUVWXYZ

I obtain:

abc_DEFGHIJKLMNOPQRSTUVW_xyz

I tell you that because you were asking about “replacement expression”.

However, your examples show you were trying to use \g which I believe only works in the find expression. Example:

find: (?<Name>t...)ING\g<Name>

which would match:

data to search: testINGtest or testINGtrip

A similar but distinctly different example:

find: (?<Name>t...)ING(?&Name)

which would match:

data to search: testINGtest or tripINGtrip but not testINGtrip

PeterJones

@M-Andre-Z-Eckenrode ,

I can’t get any of the syntaxes listed … Replacement Expression

@Alan-Kilborn said in Replace character in capture group:

I believe only works in the find expression

You are correct.

And you weren’t the first person this week to not notice that the \g and \k syntaxes are in the search section, and not in the replacement section (which tried to be explicit that any syntax not mentioned in the replacement section was not valid in the replacement field, but has apparently failed).

Could you both look at the proposed capture groups and backreferences phrasing and substitution phrasing, and make sure that the updated sections make the distinction clearer?

—
Note to future readers: those “phrasing” links are to a temporary branch, and in the future, they will not work. https://npp-user-manual.org/docs/searching/ is the official location of the search documentation, and https://github.com/notepad-plus-plus/npp-usermanual/blob/master/content/docs/searching.md is the master github source for the document.

M Andre Z Eckenrode

@Alan-Kilborn said in Replace character in capture group:

repl: abc_$+{Name}_xyz
your examples show you were trying to use \g which I believe only works in the find expression.

Aha! Looks like that’s true in NPP, though \g<Name> actually DOES work in PCRE replacement expressions at Regex101.

Thanks for the education.
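As one more illustration of how replacement syntax differs between engines: Python's re module is another engine where \g<Name> is valid in the replacement string, although there the group itself has to be written as (?P<Name>...). A small sketch on the same test data:

```python
import re

text = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Python spells the named group (?P<Name>...) and refers to it as \g<Name>
# in the replacement; Notepad++ would use $+{Name} there instead.
result = re.sub(r"ABC(?P<Name>.+?)XYZ", r"abc_\g<Name>_xyz", text)
print(result)
```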