Delete number strings in the middle of lines of data
-
@guy038 said in Delete number strings in the middle of lines of data:
(?(DEFINE)…)
It’s a nice construct. It is documented here for those that don’t know:
https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html
as
(?(DEFINE)never-executed-pattern) Defines a block of code that is never executed and matches no characters: this is usually used to define one or more named sub-expressions which are referred to from elsewhere in the pattern.
I don’t think it has a mention in the official Notepad++ docs, though.
It doesn’t mean a lot if you simply read it, but a lot of value is added with a concrete example such as that provided by @guy038
One thing that I don’t like about it is that it consumes a capture group number. Wouldn’t it be better to work with named and not numbered groups? Indeed the docs say “…define one or more named sub-expressions…” so this would be equivalent for “my” regex (regex A) above:
(?x-si) ( ?(DEFINE) (?<ALAN> (?<![.\w+-]) [+-]?\d+(?:\.\d+)?(?:E[+-]?\d+)? (?![.\w]) ) ) (^(?P>ALAN)\h|(?P>ALAN)$)
But alas, even though I’ve used a group named ALAN above, it is equivalent to group #1, thus an equivalent regex could look like this:
(?x-si) ( ?(DEFINE) (?<ALAN> (?<![.\w+-]) [+-]?\d+(?:\.\d+)?(?:E[+-]?\d+)? (?![.\w]) ) ) (^(?1)\h|(?1)$)
Note that the difference is, even though I’ve named the group ALAN at “define” time, I refer to it as 1 when actually used.
So why is this a downside? Well, because it couples the left side (definition) with the right side (use). Maybe I have a library of definitions that I want to largely ignore (except their names), and I’m wanting to write a regex I’m going to use to match some data. Maybe in that regex I want to backrefer to my own capture group #1. Well, because of the coupling, group #1 would already be in use.
Ok, so maybe it is a slight downside that wouldn’t come up often, but, I just happened to encounter that scenario recently… :-)
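For readers scripting outside Notepad++, one possible way around the coupling described above is to keep the reusable piece as a plain string and splice it into the final pattern, so that it never claims a group number. The sketch below assumes Python's re module, not the Boost engine: re has no (?(DEFINE)…) and no \h, so [ \t] stands in for \h, and the names NUMBER and line_pattern are hypothetical.

```python
import re

# Hypothetical "library" fragment: the number pattern from the post,
# kept as a plain string with no capturing parentheses of its own.
NUMBER = r"(?<![.\w+-])[+-]?\d+(?:\.\d+)?(?:E[+-]?\d+)?(?![.\w])"

# The fragment is spliced in textually, so group 1 stays free for the
# caller: here \1 must repeat the first word after the number.
line_pattern = re.compile(rf"^(\w+)[ \t]+{NUMBER}[ \t]+\1$")

print(line_pattern.groups)  # how many capture groups the pattern has
print(bool(line_pattern.match("abc -12.5E3 abc")))
print(bool(line_pattern.match("abc -12.5E3 abd")))
```

This is textual substitution done by the host language, not a regex feature, so it does not help inside the Notepad++ Find dialog itself.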
Did this turn into a Boost regex forum accidentally, or what?!? So sorry…
-
Hi, @alan-kilborn and All,
Yes, Alan, I agree with you that named groups should not be numbered by the regex engine and that, thus, the user should refer to them only by name, as backreferences, in search and/or replacement !
However, the .NET regex engine has an intelligent way to have the best of both worlds ! Indeed, the .NET regex engine scans all unnamed groups first, numbering them from 1, then re-scans the regex and numbers all the named groups, continuing after the greatest number used for unnamed groups ;-))
In the old version, below, of the Regular-Expressions manual, by Jan Goyvaerts ( creator of the Regular-expressions.info site ),
https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf
it is said, on pages 36-37, in the section Names and Numbers for Capturing Groups :
Here is where things get a bit ugly. Python and PCRE treat named capturing groups just like unnamed capturing groups, and number both kinds from left to right, starting with one. The regex
(a)(?P<x>b)(c)(?P<y>d)
matches abcd as expected. If you do a search-and-replace with this regex and the replacement \1\2\3\4, you will get abcd. All four groups were numbered from left to right, from one till four. Easy and logical.
Things are quite a bit more complicated with the .NET framework. The regex
(a)(?<x>b)(c)(?<y>d)
again matches abcd. However, if you do a search-and-replace with $1$2$3$4 as the replacement, you will get acbd. Probably not what you expected.
The .NET framework does number named capturing groups from left to right, but numbers them after all the unnamed groups have been numbered. So the unnamed groups (a) and (c) get numbered first, from left to right, starting at one. Then the named groups (?<x>b) and (?<y>d) get their numbers, continuing from the unnamed groups, in this case: three.
To make things simple, when using .NET’s regex support, just assume that named groups do not get numbered at all, and reference them by name exclusively.
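The Python half of the quoted passage can be checked directly. The following is a minimal sketch with Python's re module, using the regex and replacement from the quote; the variable name pattern is just for illustration, and the .NET behaviour is not reproduced here.

```python
import re

# Python numbers named and unnamed groups together, from left to right.
pattern = re.compile(r"(a)(?P<x>b)(c)(?P<y>d)")

# All four groups in numeric order rebuild the original text.
print(pattern.sub(r"\1\2\3\4", "abcd"))

# groupindex maps each group name to the number it received.
print(dict(pattern.groupindex))
```

The mapping printed by groupindex shows that the named groups x and y received ordinary numbers in the left-to-right sequence.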
But, with the Boost regex engine of Notepad++, we have to make do with the usual numbering of the groups, which just does one regex scan and numbers any group, named or not, one after the other !
Best Regards,
guy038
-
Maybe getting really off-topic now, but the “DEFINE” stuff got me thinking about a similar “problem” I have. I say “problem” because it is nothing I can’t work around, but I’m wondering if there is a better solution.
Consider:
search: (?-i)(Xxx)|(XXX)|(Yyy)
replace: (?1Zzz)(?2ZZZ)(?3Www)
This would convert this text:
The quick Xxx Yyy jumped over the lazy XXX
into
The quick Zzz Www jumped over the lazy ZZZ
So please don’t consider the wrong problem. What I have is a simplified example of something more complicated, and the above is just for illustration.
What I’d like to do is to NOT have to specify the capitalized version ZZZ in the replace, but rather use the Zzz text without respecifying it (important!) in combination with a \U option.
So in pseudo-regex, because I know this won’t work, without even trying it:
replace: (?1Zzz)(?2\U${1}\E)(?3Www)
So I was just wondering if you had any thoughts on this. TIA. :-)
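Outside Notepad++, the intent of the pseudo-regex above can be expressed with a replacement function instead of a replacement string. The sketch below uses Python's re module, which is a different technique from Boost's conditional replacement syntax and says nothing about whether Boost can do it; the BASE table and the replace function are hypothetical names chosen for this illustration.

```python
import re

# Hypothetical table: one base replacement per search word.
BASE = {"Xxx": "Zzz", "Yyy": "Www"}

# Python's re is case-sensitive by default, like (?-i) in the post.
pattern = re.compile(r"(Xxx)|(XXX)|(Yyy)")

def replace(match: re.Match) -> str:
    if match.group(2) is not None:
        # XXX: reuse the replacement of Xxx, upper-cased, without retyping it.
        return BASE["Xxx"].upper()
    return BASE[match.group(0)]

text = "The quick Xxx Yyy jumped over the lazy XXX"
result = pattern.sub(replace, text)
print(result)
```

The upper-cased text is derived from the single "Zzz" entry, which is the "without respecifying it" part of the request.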
-
Hi, @alan-kilborn,
Your replacement cannot work because, when the search regex matches the string XXX, due to the different alternatives, group 2 is the only group defined, anyway :-((
In addition, seemingly, you’re not interested in group 1 itself, but only in the replacement string of this group, so that you would like something like (?2\U REPLACEMENT of (\1) \E) !!
Let’s imagine the text sample below, which is used in all subsequent tests :
Xxx
XXX
XXX---Xxx
Then with the regex S/R :
SEARCH (?x-i) ^(Xxx)$ | ^(XXX)$ | (\2---\1)
Groups : 1 2 3
REPLACE \r\nGroup 1 >\1<\r\nGroup 2 >\2<\r\nGroup 3 >\3<\r\n
We get :
Group 1 >Xxx<
Group 2 ><
Group 3 ><
Group 1 ><
Group 2 >XXX<
Group 3 ><
XXX---Xxx
As explained above, the search regex does match the Xxx and XXX strings but fails to find the string XXX---Xxx because, when trying the 3rd alternative, the groups \1 and \2 are not defined
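The same failure can be observed outside Notepad++. The following is a minimal sketch with Python's re module, used here only as a stand-in for the Boost engine; like Boost in the test above, it makes a backreference to a group that did not take part in the match fail. The names sample, pattern and matches are illustrative.

```python
import re

sample = "Xxx\nXXX\nXXX---Xxx"

# Same three alternatives as in the post; MULTILINE makes ^ and $ work per line.
pattern = re.compile(r"^(Xxx)$|^(XXX)$|(\2---\1)", re.MULTILINE)

# Collect every overall match found in the sample text.
matches = [m.group(0) for m in pattern.finditer(sample)]
print(matches)
```

The third line never appears in the list, because \2 and \1 refer to groups that are unset whenever the 3rd alternative is tried.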
OK, let’s try another syntax, using sub-routine calls (?N), where N is a group number :
SEARCH (?x-i) ^(Xxx)$ | ^(XXX)$ | ((?2)---(?1))
Groups : 1 2 3
REPLACE \r\nGroup 1 >\1<\r\nGroup 2 >\2<\r\nGroup 3 >\3<\r\n
Text turns into :
Group 1 >Xxx<
Group 2 ><
Group 3 ><
Group 1 ><
Group 2 >XXX<
Group 3 ><
Group 1 ><
Group 2 ><
Group 3 >XXX---Xxx<
This time, the result is better as, when matching the string XXX---Xxx with the alternative ((?2)---(?1)), the regex engine re-uses the sub-patterns of groups 1 and 2, which lie outside the alternative matched, in the same way as with the (DEFINE) syntax !
However, we don’t get the groups 1 and 2, individually
Let’s use, again, another syntax, where any sub-routine call (?N) is itself embedded in parentheses, so ((?N)) :
SEARCH (?x-i) ^(Xxx)$ | ^(XXX)$ | ((?2))---((?1))
Groups : 1 2 3 4
REPLACE \r\nGroup 1 >\1<\r\nGroup 2 >\2<\r\nGroup 3 >\3<\r\nGroup 4 >\4<\r\n
Just note that the 3rd alternative is not itself embedded between parentheses. After execution, we’re left with :
Group 1 >Xxx<
Group 2 ><
Group 3 ><
Group 4 ><
Group 1 ><
Group 2 >XXX<
Group 3 ><
Group 4 ><
Group 1 ><
Group 2 ><
Group 3 >XXX<
Group 4 >Xxx<
Ah!.. now, when the regex engine tries the 3rd alternative, it does match the string XXX---Xxx and, in replacement, we note that groups 3 and 4 ( which are identical to groups 2 and 1, respectively, not part of the present match ) are both defined :-))
So, using a more natural example, below :
SEARCH (?x-i) ^(Xxx)$ | ^(XXX)$ | ((?2))---((?1))
Groups : 1 2 3 4
REPLACE (?1ABC)(?2DEF)(?3Group 1 = \4 and Group 2 = \3)
The sample text :
Xxx
XXX
XXX---Xxx
is changed into :
ABC
DEF
Group 1 = Xxx and Group 2 = XXX
However, there’s still a problem : in your example, you would like to refer to the replacement part of a group which does not participate in the overall match at all ! More complicated…
We must find a way :
- To match and capture the string XXX
- To capture the string ZZZ, in the same alternative, although the string ZZZ would not be part of the overall match
Still searching !
Best Regards,
guy038
-
-
@guy038 I was working on a paper when I noticed I had to replace everything after a space,
for example: 2020-04-10 21,25,25
I found the PDF, but I’m just dumb.
How do I remove every string like 21,25,25, so everything after the year-month-day?
And sorry if I broke a few rules. Editor: Notepad++
-
@cracksoft Edit: I may be on the right track. I just found, in the PDF you provided in this post, that space = \s, if I’m not wrong?
-
Hello, @cracksoft, and All,
If I fully understood your needs, you would like to delete the part after a date, which, I suppose, is the hour part ?
If so :
- SEARCH (?-s)(?<=\d{4}-\d\d-\d\d)\h+.{8}
- REPLACE Leave EMPTY
- Select the Regular expression search mode
Best Regards,
guy038
P.S. :
For regex documentation, follow this link :
https://community.notepad-plus-plus.org/topic/15765/faq-desk-where-to-find-regex-documentation
-
-
@guy038 said in Delete number strings in the middle of lines of data:
(?-s)(?<=\d{4}-\d\d-\d\d)\h+.{8}
So this long thing (?-s)(?<=\d{4}-\d\d-\d\d)\h+.{8} is 6 number ?
Still, thanks, it works! You saved me a few hours of selecting and deleting ^^
-
Hi, @cracksoft, and All,
You said :
So this long thing (?-s)(?<=\d{4}-\d\d-\d\d)\h+.{8} is 6 number ?
I don’t know what you mean, exactly !?
The regex
(?-s)(?<=\d{4}-\d\d-\d\d)\h+.{8}
matches the blank characters and the next 8 characters, when preceded by a date in the YYYY-MM-DD format. With an empty replacement, that match is deleted. No more, no less :-)
BR
guy038
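For anyone who wants to try this search/replace outside Notepad++, here is a minimal sketch with Python's re module. Two things are assumptions of the sketch rather than part of the original regex: [ \t] stands in for Boost's \h, which re does not have, and the leading (?-s) is dropped because the dot already stops at line ends by default in re. The names pattern and line are illustrative.

```python
import re

# Look behind for a YYYY-MM-DD date, then match the blanks plus the next 8 characters.
pattern = re.compile(r"(?<=\d{4}-\d\d-\d\d)[ \t]+.{8}")

line = "2020-04-10 21,25,25"

# An empty replacement deletes the matched part and keeps the date.
print(pattern.sub("", line))
```

The date itself sits in a look-behind, so it is required for the match but is not part of what gets deleted.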