Replace character in capture group
-
As I wrote in my previous reply to your post, your suggested regex code works perfectly for my current needs. I’m not sure I’ll ever have another need for the same sort of operation (replacing all instances of a particular character within a capture group), but I’d very much like to fully understand how your regex code works, so that I could apply it effectively in some other situation, should the need ever arise. Towards that end, I’ve been generating different sets of sample data of my own invention, and attempting to adapt your code to operate on them. For example, I came up with this fictitious set of HTML code:
<span class="contributors">Writer – <a href="/contribs/001-John-Doe">J. Doe</a></span><span class="contributors">Producer – <a href="/contribs/002-Timothy-Smith">T. Smith</a></span><span class="contributors">Director – <a href="/contribs/003-Jane-Johnson">J. Johnson</a></span>
My attempt to adapt your regex to work on the code above:
FIND :
– <a href="\/contribs\/\d+-|\G(\w+)((-)|")>.+?<\/a><\/span>
REPLACE:— ?1\1?3\x20\r\n
My desired result:
<span class="contributors">Writer — John Doe
<span class="contributors">Producer — Timothy Smith
<span class="contributors">Director — Jane Johnson
Actual result:
<span class="contributors">Writer — John-Doe">J. Doe</a></span><span class="contributors">Producer — Timothy-Smith">T. Smith</a></span><span class="contributors">Director — Jane-Johnson">J. Johnson</a></span>
If I’d gotten that to work so far, I next would have tried my hand at also removing the <span> tags at the beginning and swapping the two sets of remaining information before and after the em dash:
John Doe — Writer
Timothy Smith — Producer
Jane Johnson — Director
Any chance you could break your regex code down and explain the various parts to me? Much of it I’m sure I already know, from other (simpler) operations I’ve done, but here’s all I know or have guessed so far:
^ = Beginning of line (or of text, if (?s) setting were used)
"\d+- = literal double-quote, followed by 1 or more digits, followed by literal hyphen
| = I know this as a divider between alternation expressions, but not sure why it’s used here in this location, since "\d+- isn’t part of an alternation sequence
\G = matches only at end of last match found, or at start of text being matched if no previous match found
([\u\l]+\d+) = Capture group 1: Any combination of one or more upper or lower case characters, followed by one or more digits
((-)|") = Capture group 2: Alternation sequence of a hyphen (capture group 3), OR a double quote mark, one of which follows each word matched by ([\u\l]+\d+)
The REPLACE expression is trickier, for me:
?1 = Not strictly familiar with this use, that I recall, though looks somewhat like (?N), defined in the Boost v1.70.0 regex docs as a recursive execution of sub-expression N, but without the parentheses.
\1 = Capture group 1 (AKA sub-expression 1, I think)
?3 = Execution of sub-expression 3?
\x20 = space
-
@M-Andre-Z-Eckenrode said in Replace character in capture group:
but here’s all I know or have guessed so far:
Unfortunately some of your guesses aren’t quite right. Might I suggest you plug this into the website:
https://regex101.com/
as that provides great explanations of all the sub-expressions.
As a starter you will see that the | is in fact the alternation symbol and yes, the \d+ DOES form part of an alternate sub-expression.
Terry
-
Hi, @m-andre-z-eckenrode, @ekopalypse, @terry-r and All,
First I would like to apologize ! Indeed, in the example of your previous post, the different parts to search are consecutive. So the \G assertion, which searches from the location of the end of the previous match, is not needed at all !
So my previous S/R becomes :
SEARCH
^"\d+-|([\u\l]+\d+)((-)|")
REPLACE
?1\1?3\x20
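For readers who want to experiment outside Notepad++, the behaviour of this S/R can be approximated in Python. This is only a sketch under stated assumptions: the sample line is invented (the data from the earlier post is not reproduced in this thread), [\u\l] is swapped for [A-Za-z] because Python's re module has no \u or \l classes, and the conditional replacement ?1\1?3\x20 is emulated with a callback, since Python has no such replacement syntax.

```python
import re

# Pattern adapted from the S/R above; [A-Za-z] stands in for Boost's [\u\l].
PATTERN = re.compile(r'^"\d+-|([A-Za-z]+\d+)((-)|")', re.MULTILINE)

def conditional_replace(match):
    r"""Emulate the replacement ?1\1?3\x20 with a Python callback."""
    if match.group(1) is None:
        # The first alternative matched ("digits-): groups 1-3 are unset, emit nothing.
        return ""
    # Group 1 matched: keep the word, add a space only if a hyphen (group 3) followed it.
    return match.group(1) + (" " if match.group(3) is not None else "")

# Hypothetical sample line, invented for illustration.
sample = '"001-word1-word2-word3"'
result = PATTERN.sub(conditional_replace, sample)
print(result)
```

The two branches on either side of the top-level | are tried independently at each position, which is why the leading quote and digits vanish while each word is kept.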
Now, in your recent example, the general idea is to match a complete range <span......</span> and to only extract the pertinent parts that you want to keep in the replacement and re-order them as you like !
I will use the free-spacing mode ( (?x) ), which generally helps to better understand complicated regexes. In this mode, the regex can be split over several lines.
- Any line can be commented after a # symbol. To search for a literal # just escape it : \#
- Any space symbol is irrelevant, so use the syntaxes \x20 , [ ] or escape it with the \ symbol to search for a space char
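The two rules above can be tried in Python, whose re.VERBOSE flag is roughly analogous to (?x). This is a small sketch, assuming Python's verbose mode is close enough to Boost's free-spacing mode for illustration; the pattern and the sample string are invented.

```python
import re

# Free-spacing (verbose) pattern: plain whitespace and "#" comments are ignored.
PATTERN = re.compile(r"""
    (\d{4})      # group 1: four digits
    \x20         # a literal space, written as \x20
    [ ]          # a literal space, written inside a character class
    \#           # a literal "#", escaped so it does not start a comment
    \ (\d+)      # an escaped space, then group 2: one or more digits
""", re.VERBOSE)

# Hypothetical sample: four digits, two spaces, "#", one space, digits.
match = PATTERN.search("2020  # 42")
print(match.groups())
```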
Before that, just an example to grasp the nuance between greedy and lazy quantifiers :
Let’s suppose the regex ABC.+XYZ , with the greedy quantifier + , against the string 67890ABC123451234512345XYZ678906789067890XYZ12345 => It catches the string ABC123451234512345XYZ678906789067890XYZ, so the greatest non-null range of chars between the strings ABC and XYZ
Now, if we add a question mark right after the sign + , we get the regex ABC.+?XYZ , with the lazy quantifier +? . Thus, it would only match the string ABC123451234512345XYZ, which is the smallest non-null range of chars between the strings ABC and XYZ
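The greedy/lazy difference is easy to check in Python's re module, which treats + and +? the same way for this case. A minimal sketch using the string from the example above:

```python
import re

text = "67890ABC123451234512345XYZ678906789067890XYZ12345"

# Greedy: .+ grabs as much as possible, so the match runs to the LAST "XYZ".
greedy = re.search(r"ABC.+XYZ", text).group()

# Lazy: .+? grabs as little as possible, so the match stops at the FIRST "XYZ".
lazy = re.search(r"ABC.+?XYZ", text).group()

print(greedy)
print(lazy)
```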
OK. So, the search regex can be written according to this form :
(?x)                              # FREE-SPACING mode
(?-s)                             # A DOT matches a SINGLE STANDARD char ( Not EOL chars )
<span\x20class="contributors">    # LITERAL string <span class="contributors">
(                                 # START of CAPTURING group 1 ( the PROFESSION )
    .+?                           # SMALLEST NON-NULL range of STANDARD characters... till the string \x20–\x20<a
)                                 # END of CAPTURING group 1
\x20–\x20<a                       # LITERAL string SPACE + EN-Dash \x{2013} + SPACE + "<a"
.+?                               # SMALLEST NON-NULL range of STANDARD characters... till a DASH punctuation sign
-                                 # The LITERAL DASH punctuation sign
(                                 # START of CAPTURING group 2 ( the COMPLETE name )
    .+?                           # SMALLEST NON-NULL range of STANDARD characters... till the string ">
)                                 # END of CAPTURING group 2
">                                # LITERAL string ">
.+?                               # SMALLEST NON-NULL range of STANDARD characters... till the string </span>
</span>                           # LITERAL string </span>
And written in a single line, it becomes :
SEARCH
(?x-s)<span\x20class="contributors">(.+?)\x20–\x20<a.+?-(.+?)">.+?</span>
Unfortunately, this free-spacing mode is not available for the replacement regex syntax. So we still need to write :
REPLACEMENT
\2 — \1\r\n
which can be decomposed as :
\2 = The COMPLETE name ( Group 2 )
 — = A SPACE char + an EM DASH char \x{2014} + a SPACE
\1 = The PROFESSION ( Group 1 )
\r\n = A LINE-BREAK
So, from your initial text :
<span class="contributors">Writer – <a href="/contribs/001-John-Doe">J. Doe</a></span><span class="contributors">Producer – <a href="/contribs/002-Timothy-Smith">T. Smith</a></span><span class="contributors">Director – <a href="/contribs/003-Jane-Johnson">J. Johnson</a></span>
After running the regex S/R, we get :
John-Doe — Writer
Timothy-Smith — Producer
Jane-Johnson — Director
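The same search and replacement can be reproduced in Python as a cross-check. This is a sketch, not the Notepad++ behaviour itself: it assumes Python's re.VERBOSE stands in for (?x), writes the EN DASH and EM DASH as \u2013 and \u2014, and uses \n rather than \r\n as the line break.

```python
import re

# Sample text from the post; \u2013 is the EN DASH between role and link.
html = (
    '<span class="contributors">Writer \u2013 <a href="/contribs/001-John-Doe">J. Doe</a></span>'
    '<span class="contributors">Producer \u2013 <a href="/contribs/002-Timothy-Smith">T. Smith</a></span>'
    '<span class="contributors">Director \u2013 <a href="/contribs/003-Jane-Johnson">J. Johnson</a></span>'
)

# No re.DOTALL flag, so "." does not match line breaks (the (?-s) part).
SEARCH = re.compile(r"""
    <span\x20class="contributors">   # literal opening tag
    (.+?)                            # group 1: the profession
    \x20\u2013\x20<a                 # space + EN DASH + space + "<a"
    .+?-                             # shortest run up to the first hyphen
    (.+?)                            # group 2: the complete name
    ">                               # literal ">
    .+?</span>                       # rest of the link, up to </span>
""", re.VERBOSE)

# Group 2, then space + EM DASH + space, then group 1, then a line break.
REPLACEMENT = "\\2 \u2014 \\1\n"

result = SEARCH.sub(REPLACEMENT, html)
print(result)
```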
Now, we just have to run this trivial regex S/R, to change any dash, between the forename and the name, with a space character
SEARCH
-
REPLACE
\x20
Here is your expected text :
John Doe — Writer
Timothy Smith — Producer
Jane Johnson — Director
Now, in order to be fluent in regex matters, I’d like to advise you not to fixate on these ready-made regex examples from this forum and, instead, to start with the "b-a-ba" (the ABCs), using this excellent tutorial on regular expressions ( the reference ! ) :
https://www.regular-expressions.info/
You’ll probably need half a month to get acquainted with it and, let’s say, four months to be able to build up correct regexes, for a specific need, in a few minutes ! But it’s really worth it ;-))
Best Regards,
guy038
-
-
@Terry-R said in Replace character in capture group:
Unfortunately some of your guesses aren’t quite right.
Figured that would turn out to be the case. :-)
Might I suggest you plug this into the website:
https://regex101.com/
I have fairly often used that site (in fact, I brought up the subject of my mixed successes with it in my first post for this topic thread) and concur that it’s often helpful and informative, but sometimes frustrating, at least for an amateur whose ambitions often exceed his understanding and abilities, like me. For the regex operations we’re discussing in this thread, Regex101 seems not very helpful at all with the substitution expressions. If I plug @guy038’s original suggested expressions (in response to my first post) into Regex101:
FIND:
^"\d+-|\G([\u\l]+\d+)((-)|")
REPLACE:
?1\1?3\x20
…I have to change [\u\l] to something else like [[:alpha:]] because PCRE via Regex101 apparently doesn’t recognize the former. And used there, the substitution expression results in:
?1?3 ?1word1?3 ?1word2?3 ?1word3?3 ?1?3 ?1word4?3 ?1word5?3 ?1?3 ?1word6?3 ?1word7?3 ?1word8?3 ?1word9?3 ?1word10?3
I don’t know if there are other ways of expressing it that are Regex101/PCRE-friendly.
@guy038 said in Replace character in capture group:
First I would like to apologize !
No apologies necessary! You’re way better at this than I am, and I appreciate your help (and everyone else’s)!
So the \G assertion, which searches from the location of the end of the previous match, is not needed at all !
Noted, and thanks for all the detailed explanations.
Now, we just have to run this trivial regex S/R, to change any dash, between the forename and the name, with a space character
I’m afraid that would be a less-than-ideal solution, but I think it’s my own fault for neglecting to provide adequate examples and explanation. In the fictitious example HTML code I provided, all the contributors had only first and last names, but of course in real life some people get referred to using three or more names (John David Hatch, Mary Anne Perry, etc.). I was specifically trying to adapt your regex search/replace methods in ^"\d+-|\G([\u\l]+\d+)((-)|") and ?1\1?3\x20 to use with my made-up HTML, and would want it to also work if any persons had three or more names. Also, I assume that if I ever actually needed to operate on HTML similar to my example code, there might also be other hyphens, outside of the blocks of code I’d be targeting for manipulation, that need to be left alone. Again, I failed to mention these possibilities in my posts, even though I had them in my mind, and I apologize.
I have consulted that site on occasion as well.
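Outside Notepad++, the requirement described above (change hyphens only inside the captured name, for any number of name parts, and leave all other hyphens alone) can be sketched with a replacement callback in Python. This is an illustration rather than a Notepad++ recipe; the sample span, including the hyphenated role "Co-Writer" and the three-part name, is invented.

```python
import re

# Hypothetical input: a three-part name, plus a hyphen in the role that must survive.
html = (
    '<span class="contributors">Co-Writer \u2013 '
    '<a href="/contribs/004-John-David-Hatch">J. D. Hatch</a></span>'
)

SEARCH = re.compile(
    r'<span class="contributors">(.+?) \u2013 '
    r'<a href="/contribs/\d+-([^"]+)">.+?</a></span>'
)

def swap_and_space(match):
    """Swap role and name; replace hyphens only inside the captured name."""
    role = match.group(1)
    name = match.group(2).replace("-", " ")
    return f"{name} \u2014 {role}\n"

result = SEARCH.sub(swap_and_space, html)
print(result)
```

Because the character replacement runs only on group 2, the hyphen in the role is untouched.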
Trying a modified tactic now… My data to be manipulated:
<p class="credits"><span class="contributors">Writer – <a href="/contribs/001-John-Doe">J. Doe</a>, <a href="/contribs/003-Jane-Johnson">J. Johnson</a></span><span class="contributors">Producer – <a href="/contribs/002-Timothy-Smith">T. Smith</a></span><span class="contributors">Director – <a href="/contribs/003-Jane-Johnson">J. Johnson</a></span></p>
The difference between the HTML immediately above and that which I’d posted here is that now there are two names/hyperlinks after “Writer”, so I’m looking to make this step of regex break the credit role/name(s) into one line per set, whether or not there are multiple names/hyperlinks given for a credit role.
FIND:
(?:<p class="credits">(<span class="contributors">)|(<\/span>)\1|\2<\/p>)
REPLACE:
(?1\t\1)(?2\2\r\n\t\1)(?3\2)
Desired result:
<span class="contributors">Writer – <a href="/contribs/001-John-Doe">J. Doe</a>, <a href="/contribs/003-Jane-Johnson">J. Johnson</a></span>
<span class="contributors">Producer – <a href="/contribs/002-Timothy-Smith">T. Smith</a></span>
<span class="contributors">Director – <a href="/contribs/003-Jane-Johnson">J. Johnson</a></span>
Actual result:
<span class="contributors">Writer – <a href="/contribs/001-John-Doe">J. Doe</a>, <a href="/contribs/003-Jane-Johnson">J. Johnson</a></span><span class="contributors">Producer – <a href="/contribs/002-Timothy-Smith">T. Smith</a></span><span class="contributors">Director – <a href="/contribs/003-Jane-Johnson">J. Johnson</a></span></p>
Looks like in both NPP and Regex101, only the first alternation expression <p class="credits">(<span class="contributors">) matches anything. No idea why the other two won’t. I can match any of them separately, but not as other than a first alternation expression.
If I had gotten this to work, my next, separate regex step would be to try to get to this:
John Doe, Jane Johnson — writer
Timothy Smith — producer
Jane Johnson — director
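For comparison only, the end result described above can be reached in Python by treating each span separately and collecting every link inside it. This is a sketch under assumptions: the helper name credit_lines and the two patterns are invented here, and it is not a translation of any Notepad++ search/replace from this thread.

```python
import re

html = (
    '<p class="credits">'
    '<span class="contributors">Writer \u2013 '
    '<a href="/contribs/001-John-Doe">J. Doe</a>, '
    '<a href="/contribs/003-Jane-Johnson">J. Johnson</a></span>'
    '<span class="contributors">Producer \u2013 '
    '<a href="/contribs/002-Timothy-Smith">T. Smith</a></span>'
    '<span class="contributors">Director \u2013 '
    '<a href="/contribs/003-Jane-Johnson">J. Johnson</a></span>'
    '</p>'
)

# Group 1: the role; group 2: everything up to the closing </span> (one or more links).
SPAN = re.compile(r'<span class="contributors">(.+?) \u2013 (.+?)</span>')
# Captures the hyphenated name that follows the numeric id in each href.
SLUG = re.compile(r'href="/contribs/\d+-([^"]+)"')

def credit_lines(text):
    """Return one 'names, EM DASH, role' line per contributors span."""
    lines = []
    for role, links in SPAN.findall(text):
        names = [slug.replace("-", " ") for slug in SLUG.findall(links)]
        lines.append(", ".join(names) + " \u2014 " + role.lower())
    return lines

lines = credit_lines(html)
print("\n".join(lines))
```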
-
Ok, so it looks like I can use:
(?:<p class="credits">(<span class="contributors">)|(<\/span>)<span class="contributors">|<\/span><\/p>)
…but not:
(?:<p class="credits">(<span class="contributors">)|(<\/span>)\1|\2<\/p>)
…so I think I’ve learned that numbered backreferences used in alternation sequences are unique for each sequence. That wasn’t clear to me from the online docs for NPP and Boost Perl Regular Expression Syntax 1.70.0, but I guess makes sense now that I think about it. :-)
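One plausible reading of this result, offered tentatively: a backreference can only succeed if its group took part in the current match attempt. \1 is set only when the first alternative matches, so when the second or third alternative is tried, the referenced group is unset and the branch fails. Python's re module behaves this way, as the sketch below shows with the same data; that Boost follows the same rule is an assumption here, although it is consistent with what was observed in NPP and Regex101.

```python
import re

html = (
    '<p class="credits">'
    '<span class="contributors">Writer \u2013 '
    '<a href="/contribs/001-John-Doe">J. Doe</a>, '
    '<a href="/contribs/003-Jane-Johnson">J. Johnson</a></span>'
    '<span class="contributors">Producer \u2013 '
    '<a href="/contribs/002-Timothy-Smith">T. Smith</a></span>'
    '<span class="contributors">Director \u2013 '
    '<a href="/contribs/003-Jane-Johnson">J. Johnson</a></span>'
    '</p>'
)

# Alternatives 2 and 3 refer to groups that are unset whenever those branches run.
with_backrefs = re.compile(
    r'(?:<p class="credits">(<span class="contributors">)|(</span>)\1|\2</p>)'
)

# Same idea with the literal text repeated instead of backreferences.
with_literals = re.compile(
    r'(?:<p class="credits">(<span class="contributors">)'
    r'|(</span>)<span class="contributors">|</span></p>)'
)

n_backrefs = len(list(with_backrefs.finditer(html)))
n_literals = len(list(with_literals.finditer(html)))
print(n_backrefs, n_literals)
```

With the literal version, the opening pair, both span boundaries, and the closing pair are all found; with the backreference version only the opening pair is.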
-
@M-Andre-Z-Eckenrode said in Replace character in capture group:
…but not:
Not 100% sure because I haven’t followed the preceding in a super-detailed fashion, but maybe what you’re looking for is called a “subroutine call” and not a “backreference”?
The syntactical difference is:
\1 -> backreference
(?1) -> subroutine call
See more in this excellent posting: https://community.notepad-plus-plus.org/post/56447
If I’m totally off-base, well, at least the “excellent posting” reference contains some otherwise good stuff. :-)
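To make the distinction concrete: a backreference must repeat the exact text a group captured, while a subroutine call re-runs the group's pattern, so it may match different text. Python's re module supports backreferences but not subroutine calls, so in this sketch the subroutine call is only imitated by writing the sub-pattern a second time; the sub-pattern and test strings are invented for illustration.

```python
import re

WORD = r"t..."  # hypothetical sub-pattern: "t" followed by any three characters

# Backreference: (?P=word) must repeat the exact text that group "word" captured.
backref = re.compile(rf"(?P<word>{WORD})ING(?P=word)")

# Stand-in for a subroutine call: the sub-pattern itself is applied a second time.
subroutine_like = re.compile(rf"(?P<word>{WORD})ING(?:{WORD})")

same_text = backref.fullmatch("testINGtest") is not None
different_text = backref.fullmatch("testINGtrip") is not None
rerun_pattern = subroutine_like.fullmatch("testINGtrip") is not None
print(same_text, different_text, rerun_pattern)
```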
-
@Alan-Kilborn said in Replace character in capture group:
maybe what you’re looking for is called a “subroutine call” and not a “backreference”?
See more in this excellent posting:
I don’t THINK I’m confusing the two (I’m actually trying to utilize both), though considering my track record with this particular exercise, it wouldn’t come as a complete shock to learn otherwise. But thanks in any case for the link to that truly informative post. I think I could, however, benefit from many working examples of usage in various situations.
As far as named capture groups go, I can’t get any of the syntaxes listed in the post and the online NPP doc to actually work in NPP. For example, given text ABCDEFGHIJKLMNOPQRSTUVWXYZ , and search expression ABC(?<Name>.+?)XYZ , I get the following:
Replacement Expression      Result
------------------------------------------
\g<Name>                  = g<Name>
\g'Name'                  = g'Name'
\g{Name}                  = g{Name}
Equivalent results using \k . Do any of these actually work for anybody else?
-
@M-Andre-Z-Eckenrode said in Replace character in capture group:
I can’t get any of the syntaxes
If I use this as the replace-with expression for your search-for expression and data:
find: ABC(?<Name>.+?)XYZ
repl: abc_$+{Name}_xyz
data to search: ABCDEFGHIJKLMNOPQRSTUVWXYZ
I obtain:
abc_DEFGHIJKLMNOPQRSTUVW_xyz
I tell you that because you were asking about “replacement expression”.
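Replacement syntax for named groups differs between engines, which is part of the confusion here. As a point of comparison only, Python's re module spells the named group (?P<Name>...) and refers to it as \g<Name> in the replacement string; the sketch below reproduces the same result with that engine's syntax and says nothing about what Notepad++ accepts.

```python
import re

text = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Python: named group is (?P<Name>...), and \g<Name> is valid in the replacement.
result = re.sub(r"ABC(?P<Name>.+?)XYZ", r"abc_\g<Name>_xyz", text)
print(result)
```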
However, your examples show you were trying to use \g which I believe only works in the find expression. Example:
find: (?<Name>t...)ING\g<Name>
which would match:
data to search: testINGtest or tripINGtrip but not testINGtrip
A similar but distinctly different example:
find: (?<Name>t...)ING(?&Name)
which would match:
data to search: testINGtest or testINGtrip
-
I can’t get any of the syntaxes listed … Replacement Expression
@Alan-Kilborn said in Replace character in capture group:
I believe only works in the find expression
You are correct.
And you weren’t the first person this week to not notice that the \g and \k syntaxes are in the search section, and not in the replacement section (which tried to be explicit that any syntax not mentioned in the replacement section was not valid in the replacement field, but has apparently failed).
Could you both look at the proposed capture groups and backreferences phrasing and substitution phrasing, and make sure that the updated sections make the distinction more clear?
—
Note to future readers: those “phrasing” links are to a temporary branch, and in the future, they will not work. https://npp-user-manual.org/docs/searching/ is the official location of the search documentation, and https://github.com/notepad-plus-plus/npp-usermanual/blob/master/content/docs/searching.md is the master github source for the document.
-
@Alan-Kilborn said in Replace character in capture group:
repl: abc_$+{Name}_xyz
your examples show you were trying to use \g which I believe only works in the find expression.
Aha! Looks like that’s true in NPP, though \g<Name> actually DOES work in PCRE replacement expressions at Regex101.
Thanks for the education.
-
DO NOT rely on regex101 for the more esoteric aspects of regex. Doing so, and then intending to use the results in Notepad++ will cause frustration. Sure, okay, for simple cases, but the caliber of stuff you have been discussing in this thread is going to be different in N++ and regex101.
-
@PeterJones said in Replace character in capture group:
Could you both look at the proposed capture groups and backreferences phrasing and substitution phrasing , and make sure that the updated sections makes the distinction more clear?
Looks good to me so far, though coming from a fairly green regex user like me, I’d take that with a grain of salt. :-)
On a tangent here, I’ve noticed, on occasion when doing find/replace operations, that the In selection checkbox was sometimes ghosted (not available to check or uncheck). I keep meaning to compile a list of the circumstances for presentation and inquiry in these forums sometime. I notice that in both official and proposed versions of the doc, there seems to be no mention of any limitations on when the In selection checkbox is available. There seem to be some known limitations (at least one of which is mentioned here). Maybe they should be added to the docs?
-
@Alan-Kilborn said in Replace character in capture group:
the caliber of stuff you have been discussing in this thread is going to be different in N++ and regex101.
I think I’ve already made it fairly clear, in my previous posts to this thread, that that’s what I’m finding to be the case.
-
@M-Andre-Z-Eckenrode said in Replace character in capture group:
I think I’ve already made it fairly clear, in my previous posts to this thread, that that’s what I’m finding to be the case.
Perhaps, but I get the feeling you might be holding on to regex101 a bit much. :-)
Plus, I’m kind of a late joiner to this thread; there’s a lot of content.
-
@M-Andre-Z-Eckenrode said in Replace character in capture group:
In selection checkbox was sometimes ghosted
In selection checkbox enabled condition: A single selection of one or more characters, that is NOT a column block selection.
Note that the checkbox’s appearance status can only be relied upon when you actually switch input focus to the find (family) window – upon activation the code runs a check to make sure you have the proper type of selection, and updates the checkbox and its state at that time.
-
@M-Andre-Z-Eckenrode said in Replace character in capture group:
Looks good to me so far
Thanks. Submitted PR #127. Hopefully, it will make it in before the next release of the npp-user-manual.org website.
-
@PeterJones said in Replace character in capture group:
Looks good to me so far
Looked fine to me as well.
Thanks for your fine attention to the manual.
I just need to read it more when I have trouble with things. :-)
-
Hello, @peterjones,
Sorry, I’ve just seen your post where you asked people to verify the N++ official documentation ! I’ll try to have a look, myself, very soon. It would be better to do it before the next release of the website !
But, as I said to Alan, at the moment, my TO DO list, concerning N++ or else, is getting much longer ;-))
Cheers,
guy038