Delete number strings in the middle of lines of data

Alan Kilborn

@guy038 said in Delete number strings in the middle of lines of data:

(?(DEFINE)…)

It’s a nice construct. It is documented here for those that don’t know:

https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html

as

(?(DEFINE)never-exectuted-pattern) Defines a block of code that is never executed and matches no characters: this is usually used to define one or more named sub-expressions which are referred to from elsewhere in the pattern.

I don’t think it has a mention in the official Notepad++ docs, though.

It doesn’t mean a lot if you simply read it, but a concrete example such as the one provided by @guy038 adds a lot of value.

One thing that I don’t like about it is that it consumes a capture group number. Wouldn’t it be better to work with named and not numbered groups? Indeed the docs say “…define one or more named sub-expressions…” so this would be equivalent for “my” regex (regex A) above:

(?x-si)    (?(DEFINE)    (?<ALAN>    (?<![.\w+-]) [+-]?\d+(?:\.\d+)?(?:E[+-]?\d+)? (?![.\w])    )    )        (^(?P>ALAN)\h|(?P>ALAN)$)

But alas, even though I’ve used a group named ALAN above, it is still group #1, so an equivalent regex could look like this:

(?x-si)    (?(DEFINE)    (?<ALAN>    (?<![.\w+-]) [+-]?\d+(?:\.\d+)?(?:E[+-]?\d+)? (?![.\w])    )    )        (^(?1)\h|(?1)$)

Note that the difference is, even though I’ve named the group ALAN at “define” time, I refer to it as 1 when actually used.

So why is this a downside? Well, because it couples the left side (definition) with the right side (use). Maybe I have a library of definitions, that I want to largely ignore (except their names), and I’m wanting to write a regex I’m going to use to match some data–maybe in the regex I want to backrefer to my own capture group #1. Well, because of the coupling, group #1 would already be in use.
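The coupling can be reproduced outside Notepad++. Below is a minimal sketch in Python, under stated assumptions: Python's `re` module has neither (?(DEFINE)…) nor subroutine calls, so the "library" here is just a string spliced into the pattern, and `[ \t]` stands in for `\h`. It is not what Boost does; it only illustrates that a named group takes a number, while a non-capturing splice does not.

```python
import re

# The number pattern from the regex above, kept as a plain string "library" entry.
NUM = r"(?<![.\w+-])[+-]?\d+(?:\.\d+)?(?:E[+-]?\d+)?(?![.\w])"

# Defined as a named group, it still consumes a group number: ALAN is group 1.
named = re.compile(rf"(?P<ALAN>{NUM})")
alan_number = named.groupindex["ALAN"]

# Spliced into a non-capturing group, it consumes no number at all,
# so group 1 stays free for whoever uses the library.
spliced = re.compile(rf"(?:^{NUM}[ \t]|{NUM}$)", re.M)
free_groups = spliced.groups

# A number at the start of a line followed by a blank, or a number at the end.
matches = spliced.findall("12 abc 3.5E+2")
print(alan_number, free_groups, matches)
```

The price of the splice is that the definition is repeated in the compiled pattern each time it is used, which is exactly what (?(DEFINE)…) avoids.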

Ok, so maybe it is a slight downside that wouldn’t come up often, but, I just happened to encounter that scenario recently… :-)

Did this turn into a Boost regex forum accidentally, or what?!? So sorry…

guy038

Hi, @alan-kilborn and All,

Yes, Alan, I agree with you that named groups should not be numbered by the regex engine and that, consequently, the user should refer to them only by name, as backreferences, in the search and/or replacement!

However, the .NET regex engine has an intelligent way to get the best of both worlds! Indeed, it first scans all the unnamed groups, numbering them from 1, then re-scans the regex and numbers the named groups, continuing from the highest number used by the unnamed groups ;-))

In the old version, linked below, of the Regular-Expressions manual by Jan Goyvaerts (creator of the Regular-expressions.info site),

https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf

it says, on pages 36-37:

Names and Numbers for Capturing Groups :

Here is where things get a bit ugly. Python and PCRE treat named capturing groups just like unnamed capturing groups, and number both kinds from left to right, starting with one. The regex (a)(?P<x>b)(c)(?P<y>d) matches abcd as expected. If you do a search-and-replace with this regex and the replacement \1\2\3\4, you will get abcd. All four groups were numbered from left to right, from one till four. Easy and logical.

Things are quite a bit more complicated with the .NET framework. The regex (a)(?<x>b)(c)(?<y>d) again matches abcd. However, if you do a search-and-replace with $1$2$3$4 as the replacement, you will get acbd. Probably not what you expected.

The .NET framework does number named capturing groups from left to right, but numbers them after all the unnamed groups have been numbered. So the unnamed groups (a) and (c) get numbered first, from left to right, starting at one. Then the named groups (?<x>b) and (?<y>d) get their numbers, continuing from the unnamed groups, in this case: three.

To make things simple, when using .NET’s regex support, just assume that named groups do not get numbered at all, and reference them by name exclusively.

But, with the Boost regex engine of Notepad++, we have to make do with the usual numbering: the engine scans the regex once and numbers every group, named or not, one after the other!
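Since the quoted manual names Python as an engine that numbers every group from left to right, the behaviour is easy to check there. A small sketch; the last line only emulates the .NET numbering by hand (an assumption for illustration, no .NET code is run):

```python
import re

pattern = re.compile(r"(a)(?P<x>b)(c)(?P<y>d)")

# Named groups get ordinary left-to-right numbers: x is 2, y is 4.
numbers = dict(pattern.groupindex)

# Replacement \1\2\3\4 therefore rebuilds the text in its original order.
python_result = pattern.sub(r"\1\2\3\4", "abcd")

# .NET would number a=1, c=2, x=3, y=4, so its $1$2$3$4 corresponds to
# \1\3\2\4 in Python's numbering.
dotnet_like = pattern.sub(r"\1\3\2\4", "abcd")

print(numbers, python_result, dotnet_like)
```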

Best Regards,

guy038

Alan Kilborn

@guy038

Maybe getting really off-topic now, but the “DEFINE” stuff got me thinking about a similar “problem” I have. I say “problem” because it is nothing I can’t work around, but I’m wondering if there is a better solution.

Consider:

search: (?-i)(Xxx)|(XXX)|(Yyy)
replace: (?1Zzz)(?2ZZZ)(?3Www)

This would convert this text: The quick Xxx Yyy jumped over the lazy XXX into The quick Zzz Www jumped over the lazy ZZZ

So please don’t consider the wrong problem. What I have is a simplified example of something more complicated, and the above is just for illustration.

What I’d like to do is to NOT have to specify the capitalized version of ZZZ in the replace, but rather use the Zzz text without respecifying it (important!) in combination with a \U option.

So in pseudo-regex, because I know this won’t work, without even trying it:

replace: (?1Zzz)(?2\U${1}\E)(?3Www)

So I was just wondering if you had any thoughts on this. TIA. :-)
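For comparison only, the intent is simple to express where the replacement can be a function. A sketch in Python, which sidesteps the Boost replacement syntax rather than solving the Notepad++ question; the `BASE` table and `repl` helper are hypothetical names introduced for the illustration:

```python
import re

# Replacement text is written once per "base" group; group 2 reuses group 1's.
BASE = {1: "Zzz", 3: "Www"}

def repl(m):
    """Return the replacement for whichever alternative matched."""
    if m.lastindex == 2:
        # Upper-cased variant of group 1's replacement, not re-specified.
        return BASE[1].upper()
    return BASE[m.lastindex]

text = "The quick Xxx Yyy jumped over the lazy XXX"
result = re.sub(r"(Xxx)|(XXX)|(Yyy)", repl, text)
print(result)
```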

guy038

Hi, @alan-kilborn,

Your replacement cannot work because, when the search regex matches the string XXX, group 2 is the only group defined, as each alternative has its own group :-((

In addition, it seems that you’re not interested in group 1 itself, but only in the replacement string for this group, so you would like something like (?2\UREPLACEMENT of (\1)\E) !!

Let’s imagine the text sample, below, which is used in all subsequent tests :

Xxx
XXX
XXX---Xxx

Then with the regex S/R :

SEARCH    (?x-i)   ^(Xxx)$ | ^(XXX)$ | (\2---\1)
Groups :            1         2        3

REPLACE   \r\nGroup 1 >\1<\r\nGroup 2 >\2<\r\nGroup 3 >\3<\r\n

We get :

Group 1 >Xxx<
Group 2 ><
Group 3 ><


Group 1 ><
Group 2 >XXX<
Group 3 ><

XXX---Xxx

As explained above, the search regex does match the Xxx and XXX strings but fails to match XXX---Xxx because, when the 3rd alternative is tried, groups 1 and 2 are not defined, so the backreferences \2 and \1 cannot match.
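The same effect can be observed in Python's `re` module, where a backreference to a group that took no part in the match also fails. A minimal sketch, assuming `re.M` as the equivalent of per-line ^ and $; Python reports an undefined group as `None` where the replacement above shows an empty string:

```python
import re

sample = "Xxx\nXXX\nXXX---Xxx"

# Same three alternatives as in the search above.
pattern = re.compile(r"^(Xxx)$|^(XXX)$|(\2---\1)", re.M)

# One tuple of (group 1, group 2, group 3) per match.
captures = [m.groups() for m in pattern.finditer(sample)]
print(captures)
```

Only two matches are found: the third line is skipped because \2 and \1 refer to groups that are unset while the third alternative is being tried.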

OK, let’s try another syntax, using subroutine calls (?n), where n is a group number :

SEARCH    (?x-i)   ^(Xxx)$ | ^(XXX)$ | ((?2)---(?1))
Groups :            1         2        3

REPLACE   \r\nGroup 1 >\1<\r\nGroup 2 >\2<\r\nGroup 3 >\3<\r\n

Text turns into :

Group 1 >Xxx<
Group 2 ><
Group 3 ><


Group 1 ><
Group 2 >XXX<
Group 3 ><


Group 1 ><
Group 2 ><
Group 3 >XXX---Xxx<

This time, the result is better: when matching the string XXX---Xxx with the alternative ((?2)---(?1)), the subroutine calls (?2) and (?1) re-run the patterns of groups 2 and 1, which sit outside the matched alternative, instead of relying on their captured text!

However, we don’t get the contents of groups 1 and 2 individually.

Let’s try yet another syntax, where each subroutine call (?n) is itself wrapped in parentheses, so ((?n)) :

SEARCH    (?x-i)   ^(Xxx)$ | ^(XXX)$ | ((?2))---((?1))
Groups :            1         2        3        4 

REPLACE   \r\nGroup 1 >\1<\r\nGroup 2 >\2<\r\nGroup 3 >\3<\r\nGroup 4 >\4<\r\n

Note that the 3rd alternative, as a whole, is no longer wrapped in parentheses. After execution, we’re left with :

Group 1 >Xxx<
Group 2 ><
Group 3 ><
Group 4 ><


Group 1 ><
Group 2 >XXX<
Group 3 ><
Group 4 ><


Group 1 ><
Group 2 ><
Group 3 >XXX<
Group 4 >Xxx<

Ah! Now, when the regex engine tries the 3rd alternative, it does match the string XXX---Xxx and, in the replacement, we note that groups 3 and 4 (which hold the same text as groups 2 and 1 would, although those take no part in the present match) are both defined :-))

So, using a more natural example, below :

SEARCH    (?x-i)   ^(Xxx)$ | ^(XXX)$ | ((?2))---((?1))
Groups :            1         2        3        4 

REPLACE   (?1ABC)(?2DEF)(?3Group 1 = \4 and Group 2 = \3)

The sample text :

Xxx
XXX
XXX---Xxx

is changed into :

ABC
DEF
Group 1 = Xxx and Group 2 = XXX
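For readers who want to replay this outside Notepad++, here is a rough Python equivalent. Two assumptions are made, because Python's `re` has neither subroutine calls nor conditional replacements: ((?2)) and ((?1)) are emulated by splicing the sub-pattern text into new groups, and the (?1…)(?2…)(?3…) replacement is emulated with a function. `G1`, `G2` and `repl` are hypothetical names.

```python
import re

G1 = r"Xxx"  # pattern of group 1
G2 = r"XXX"  # pattern of group 2

# Groups 3 and 4 re-use the patterns of groups 2 and 1, as ((?2)) and ((?1)) do.
pattern = re.compile(rf"^({G1})$|^({G2})$|({G2})---({G1})", re.M)

def repl(m):
    """Emulate (?1ABC)(?2DEF)(?3Group 1 = \\4 and Group 2 = \\3)."""
    if m.group(1) is not None:
        return "ABC"
    if m.group(2) is not None:
        return "DEF"
    return f"Group 1 = {m.group(4)} and Group 2 = {m.group(3)}"

sample = "Xxx\nXXX\nXXX---Xxx"
result = pattern.sub(repl, sample)
print(result)
```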

However, there’s still a problem: in your example, you would like to refer to the replacement part of a group which does not participate in the overall match at all! More complicated…

We must find a way :

To match and capture the string XXX
To capture the string ZZZ, in the same alternative, although the string ZZZ would not be part of the overall match

Still searching !

Best Regards,

guy038

cracksoft

@guy038 I was working on a paper when I noticed I had to replace everything after a space.
Example: 2020-04-10 21,25,25

I found the PDF, but I’m just dumb.
How do I remove, with a regular expression, every part like 21,25,25,
so everything after the year-month-day?
And sorry if I broke a few rules. Editor: Notepad++

cracksoft

@cracksoft Edit: I may be on the right track. I just found, in the PDF you provided in this post, that space = \s, if I’m not wrong?

guy038

Hello, @cracksoft, and All,

If I fully understood your needs, you would like to delete the part after a date, which, I suppose, is the time part?

If so :

SEARCH (?-s)(?<=\d{4}-\d\d-\d\d)\h+.{8}
REPLACE Leave EMPTY
Select the Regular expression search mode

Best Regards,

guy038

P.S. :

For regex documentation, follow this link :

https://community.notepad-plus-plus.org/topic/15765/faq-desk-where-to-find-regex-documentation

cracksoft

@guy038 said in Delete number strings in the middle of lines of data:

(?-s)(?<=\d{4}-\d\d-\d\d)\h+.{8}

So this long thing (?-s)(?<=\d{4}-\d\d-\d\d)\h+.{8} is 6 number ?
Still, thanks, it works! You saved me a few hours of selecting and deleting ^^

guy038

Hi, @cracksoft, and All,

You said :

So this long thing (?-s)(?<=\d{4}-\d\d-\d\d)\h+.{8} is 6 number ?

I don’t know what you mean, exactly !?

The regex (?-s)(?<=\d{4}-\d\d-\d\d)\h+.{8} deletes the blank characters and the next 8 characters, when they are preceded by a date in the YYYY-MM-DD format. No more, no less :-)
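The behaviour is easy to check outside Notepad++ too. A small sketch in Python, with two adaptations that are my assumptions: `[ \t]` replaces `\h`, which Python's `re` does not know, and (?-s) is dropped because the dot already stops at line breaks by default there:

```python
import re

# Blanks plus the next 8 characters, only when a YYYY-MM-DD date comes just before.
time_after_date = re.compile(r"(?<=\d{4}-\d\d-\d\d)[ \t]+.{8}")

line = "2020-04-10 21,25,25"
cleaned = time_after_date.sub("", line)

# Anything beyond the 8 characters is left alone.
longer = time_after_date.sub("", "2020-04-10 21,25,25 rest")

print(repr(cleaned), repr(longer))
```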

BR

guy038