Delete number strings in the middle of lines of data
-
@guy038 said in Delete number strings in the middle of lines of data:
(?(DEFINE)…)
It’s a nice construct. It is documented here for those that don’t know:
https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html
as
(?(DEFINE)never-executed-pattern) Defines a block of code that is never executed and matches no characters: this is usually used to define one or more named sub-expressions which are referred to from elsewhere in the pattern.
I don’t think it has a mention in the official Notepad++ docs, though.
It doesn’t mean a lot if you simply read it, but a lot of value is added with a concrete example such as that provided by @guy038.
One thing that I don’t like about it is that it consumes a capture group number. Wouldn’t it be better to work with named and not numbered groups? Indeed the docs say “…define one or more named sub-expressions…” so this would be equivalent for “my” regex (regex A) above:
(?x-si) ( ?(DEFINE) (?<ALAN> (?<![.\w+-]) [+-]?\d+(?:\.\d+)?(?:E[+-]?\d+)? (?![.\w]) ) ) (^(?P>ALAN)\h|(?P>ALAN)$)
But alas, even though I’ve used a group named ALAN above, it is equivalent to group #1, thus a possible equivalency use case could look like this:
(?x-si) ( ?(DEFINE) (?<ALAN> (?<![.\w+-]) [+-]?\d+(?:\.\d+)?(?:E[+-]?\d+)? (?![.\w]) ) ) (^(?1)\h|(?1)$)
Note that the difference is, even though I’ve named the group ALAN at “define” time, I refer to it as 1 when actually used.
So why is this a downside? Well, because it couples the left side (definition) with the right side (use). Maybe I have a library of definitions, that I want to largely ignore (except their names), and I’m wanting to write a regex I’m going to use to match some data; maybe in the regex I want to backrefer to my own capture group #1. Well, because of the coupling, group #1 would already be in use.
Ok, so maybe it is a slight downside that wouldn’t come up often, but, I just happened to encounter that scenario recently… :-)
Did this turn into a Boost regex forum accidentally, or what?!? So sorry…
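For comparison only, and not as a Notepad++ solution: Python’s built-in re module has neither (?(DEFINE)…) nor sub-routine calls, but the decoupling wished for above can be sketched there by keeping a definition in a plain string and interpolating it where needed. The names NUMBER, EDGE_NUMBER and KEY_VALUE are made up for this illustration, and [ \t] stands in for \h, which re does not support.

```python
import re

# Hypothetical "library" definition: the same number sub-pattern as ALAN,
# kept as a plain string so that it does not consume a capture group number.
NUMBER = r'(?<![.\w+-])[+-]?\d+(?:\.\d+)?(?:E[+-]?\d+)?(?![.\w])'

# Counterpart of (^(?P>ALAN)\h|(?P>ALAN)$), written without the outer group.
EDGE_NUMBER = re.compile(rf'^{NUMBER}[ \t]|{NUMBER}$', re.MULTILINE)

# A second regex reusing the definition; group #1 is still the user's own.
KEY_VALUE = re.compile(rf'(\w+)=({NUMBER})')

print(EDGE_NUMBER.groups)                         # capture groups used: 0
print(EDGE_NUMBER.sub('', '12.5 apples'))         # apples
print(KEY_VALUE.search('width=-3E+2').groups())   # ('width', '-3E+2')
```

The definition is substituted as text before compilation, so it is closer to a macro than to a true sub-routine call; recursion, for instance, is not possible this way.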
-
Hi, @alan-kilborn and All,
Yes, Alan, I agree with you that named groups should not be numbered by the regex engine, so that the user would refer to them only by name, as backreferences, in the search and/or replacement!
However, the .NET regex engine has an intelligent way to get the best of both worlds! Indeed, the .NET regex engine first scans all the unnamed groups, numbering them from 1, then re-scans the regex and numbers all the named groups, continuing after the greatest number used by the unnamed groups ;-))
In the old version, below, of the Regular-Expressions manual by Jan Goyvaerts (creator of the Regular-expressions.info site),
https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf
it is said, on pages 36-37, under Names and Numbers for Capturing Groups:
Here is where things get a bit ugly. Python and PCRE treat named capturing groups just like unnamed capturing groups, and number both kinds from left to right, starting with one. The regex (a)(?P<x>b)(c)(?P<y>d) matches abcd as expected. If you do a search-and-replace with this regex and the replacement \1\2\3\4, you will get abcd. All four groups were numbered from left to right, from one till four. Easy and logical.
Things are quite a bit more complicated with the .NET framework. The regex (a)(?<x>b)(c)(?<y>d) again matches abcd. However, if you do a search-and-replace with $1$2$3$4 as the replacement, you will get acbd. Probably not what you expected.
The .NET framework does number named capturing groups from left to right, but numbers them after all the unnamed groups have been numbered. So the unnamed groups (a) and (c) get numbered first, from left to right, starting at one. Then the named groups (?<x>b) and (?<y>d) get their numbers, continuing from the unnamed groups, in this case: three.
To make things simple, when using .NET’s regex support, just assume that named groups do not get numbered at all, and reference them by name exclusively.
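The Python half of the quoted passage is easy to check with the standard re module. This small sketch only covers Python; it says nothing about how Boost or .NET behave.

```python
import re

pattern = re.compile(r'(a)(?P<x>b)(c)(?P<y>d)')

# Named groups x and y receive ordinary numbers, in left-to-right order.
print(dict(pattern.groupindex))          # {'x': 2, 'y': 4}

# Numbered references therefore reach the named groups as well.
print(pattern.sub(r'\1\2\3\4', 'abcd'))  # abcd
```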
But, with the Boost regex engine of Notepad++, we have to make do with the usual numbering of the groups, which just does one regex scan and numbers any group, named or not, one after the other!
Best Regards,
guy038
-
Maybe getting really off-topic now, but with the “DEFINE” stuff it got me thinking about a similar “problem” I have. I say “problem” because it is nothing I can’t work around, but I’m wondering if there is a better solution.
Consider:
search: (?-i)(Xxx)|(XXX)|(Yyy)
replace: (?1Zzz)(?2ZZZ)(?3Www)
This would convert this text:
The quick Xxx Yyy jumped over the lazy XXX
into
The quick Zzz Www jumped over the lazy ZZZ
So please don’t consider the wrong problem. What I have is a simplified example of something more complicated, and the above is just for illustration.
What I’d like to do is to NOT have to specify the capitalized version of ZZZ in the replace, but rather use the Zzz text without respecifying it (important!) in combination with a \U option.
So in pseudo-regex, because I know this won’t work, without even trying it:
replace: (?1Zzz)(?2\U${1}\E)(?3Www)
So I was just wondering if you had any thoughts on this. TIA. :-)
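To pin down the behavior being asked for, here is a rough sketch of it outside Notepad++, using a replacement callback in Python’s re module. It is not a Boost replacement string and does not answer the question for the Replace dialog; the names REPLACEMENTS, UPPERCASED and pick_replacement are invented for the example.

```python
import re

# Each replacement text is specified exactly once.
REPLACEMENTS = {'Xxx': 'Zzz', 'Yyy': 'Www'}
# Hypothetical rule: XXX reuses the replacement of Xxx, upper-cased.
UPPERCASED = {'XXX': 'Xxx'}

def pick_replacement(match):
    """Return the replacement for one match, upper-casing where requested."""
    text = match.group(0)
    if text in UPPERCASED:
        return REPLACEMENTS[UPPERCASED[text]].upper()
    return REPLACEMENTS[text]

pattern = re.compile(r'Xxx|XXX|Yyy')
result = pattern.sub(pick_replacement, 'The quick Xxx Yyy jumped over the lazy XXX')
print(result)   # The quick Zzz Www jumped over the lazy ZZZ
```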
-
Hi, @alan-kilborn,
Your replacement cannot work because, when the search regex matches the string XXX, due to the different alternatives, group 2 is the only group defined, anyway :-((
In addition, seemingly, you’re not interested in group 1 itself, but only in the replacement string of this group, so that you would like something like (?2\UREPLACEMENT of (\1)\E) !!
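The first point, that only the group of the matching alternative is defined, can be observed in other engines too. As a hedged illustration with Python’s re module (a different engine from the Boost one in Notepad++, and case-sensitive by default, like (?-i)):

```python
import re

pattern = re.compile(r'(Xxx)|(XXX)|(Yyy)')
match = pattern.search('XXX')

# Only the second alternative matched, so groups 1 and 3 are unset.
print(match.groups())   # (None, 'XXX', None)
```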
Let’s imagine the text sample, below, which is used in all subsequent tests:
Xxx
XXX
XXX---Xxx
Then, with the regex S/R:
SEARCH  (?x-i) ^(Xxx)$ | ^(XXX)$ | (\2---\1)
Groups :        1         2        3
REPLACE \r\nGroup 1 >\1<\r\nGroup 2 >\2<\r\nGroup 3 >\3<\r\n
We get:
Group 1 >Xxx<
Group 2 ><
Group 3 ><
Group 1 ><
Group 2 >XXX<
Group 3 ><
XXX---Xxx
As explained above, the search regex does match the Xxx and XXX strings but fails to find the XXX---Xxx string because, when trying the 3rd alternative, the groups \1 and \2 are not defined.
OK, let’s try another syntax, using sub-routine calls (?n), where n is a group number:
SEARCH  (?x-i) ^(Xxx)$ | ^(XXX)$ | ((?2)---(?1))
Groups :        1         2        3
REPLACE \r\nGroup 1 >\1<\r\nGroup 2 >\2<\r\nGroup 3 >\3<\r\n
Text turns into:
Group 1 >Xxx<
Group 2 ><
Group 3 ><
Group 1 ><
Group 2 >XXX<
Group 3 ><
Group 1 ><
Group 2 ><
Group 3 >XXX---Xxx<
This time, the result is better because, when matching the string XXX---Xxx with the alternative ((?2)---(?1)), the sub-routine calls re-run the patterns of groups 1 and 2, which lie outside the alternative matched, in the same way as with the (DEFINE) syntax!
However, we don’t get the groups 1 and 2, individually.
Let’s use yet another syntax, where each sub-routine call (?n), n being a group number, is itself embedded in parentheses, giving ((?n)):
SEARCH  (?x-i) ^(Xxx)$ | ^(XXX)$ | ((?2))---((?1))
Groups :        1         2        3        4
REPLACE \r\nGroup 1 >\1<\r\nGroup 2 >\2<\r\nGroup 3 >\3<\r\nGroup 4 >\4<\r\n
Just note that the 3rd alternative is not, itself, embedded between parentheses. After execution, we’re left with:
Group 1 >Xxx<
Group 2 ><
Group 3 ><
Group 4 ><
Group 1 ><
Group 2 >XXX<
Group 3 ><
Group 4 ><
Group 1 ><
Group 2 ><
Group 3 >XXX<
Group 4 >Xxx<
Ah! Now, when the regex engine tries the 3rd alternative, it does match the string XXX---Xxx and, in the replacement, we note that groups 3 and 4 (which are identical to groups 2 and 1, respectively, themselves not part of the present match) are both defined :-))
So, using a more natural example, below:
SEARCH  (?x-i) ^(Xxx)$ | ^(XXX)$ | ((?2))---((?1))
Groups :        1         2        3        4
REPLACE (?1ABC)(?2DEF)(?3Group 1 = \4 and Group 2 = \3)
The sample text:
Xxx
XXX
XXX---Xxx
is changed into:
ABC
DEF
Group 1 = Xxx and Group 2 = XXX
However, there’s still a problem as, in your example, you would like to refer to the replacement part of a group which does not participate in the overall match, anyway! More complicated…
We must find a way:
- To match and capture the string XXX
- To capture the string ZZZ, in the same alternative, although the string ZZZ would not be part of the overall match
Still searching!
Best Regards,
guy038
-
-
@guy038 I was working on a paper when I noticed I had to replace everything after a (space)
example: 2020-04-10 21,25,25
I found the PDF, but I’m just dumb.
How do I remove, with a regular expression, every part like 21,25,25, so everything after the year-month-day?
And sorry if I broke a few rules. Editor: Notepad++
-
@cracksoft Edit: I may be on the right track. I just found, in the PDF you provided in this post, that space = \s, if I’m not wrong?
-
Hello, @cracksoft, and All,
If I fully understood your needs, you would like to delete the part after a date, which, I suppose, is the hour part ?
If so :
- SEARCH (?-s)(?<=\d{4}-\d\d-\d\d)\h+.{8}
- REPLACE Leave EMPTY
- Select the Regular expression search mode
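For anyone who wants to try the same idea outside Notepad++, here is a minimal sketch with Python’s re module. It is an adaptation, not the Boost regex itself: [ \t]+ stands in for \h+, which re does not support, and (?-s) is dropped because the dot already excludes line breaks by default in Python.

```python
import re

# A date in YYYY-MM-DD form must precede the match (fixed-width lookbehind);
# the blanks and the next 8 characters are then removed.
pattern = re.compile(r'(?<=\d{4}-\d\d-\d\d)[ \t]+.{8}')

print(pattern.sub('', '2020-04-10 21,25,25'))   # 2020-04-10
```

Lines with fewer than 8 characters after the blanks are left untouched, since .{8} requires exactly eight characters.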
Best Regards,
guy038
P.S. :
For regex documentation, follow this link :
https://community.notepad-plus-plus.org/topic/15765/faq-desk-where-to-find-regex-documentation
-
-
@guy038 said in Delete number strings in the middle of lines of data:
(?-s)(?<=\d{4}-\d\d-\d\d)\h+.{8}
So this long thing (?-s)(?<=\d{4}-\d\d-\d\d)\h+.{8} is 6 number ?
Still, thanks, it works! You spared me a few hours of selecting and deleting ^^
-
Hi, @cracksoft, and All,
You said:
So this long thing (?-s)(?<=\d{4}-\d\d-\d\d)\h+.{8} is 6 number ?
I don’t know what you mean, exactly!?
The regex (?-s)(?<=\d{4}-\d\d-\d\d)\h+.{8} deletes the blank characters and the next 8 characters, when preceded by a date in the YYYY-MM-DD format. No more, no less :-)
BR
guy038