Can RegEx do the work for me?

Thomas Karlsson

Hi all,

I have this big transaction file. The structure in the file could be describe as two blocks. Both blocks has x numbers of “in-lines”. Each block starts with a BEGIN segment and ends with END segment.

The thing is that I would like to create some functionality to constantly exclude/delete block 2, regardless what information it contain. Tried to play around a bit with RegEx in Notpad++ (v6.5) but no success so far.

So thought I drop the question here among all you talent people to help me further.

[BLOCK 1]
BEGIN|JUNE_CUST_123|1|UPC|SW|PREFIX0000123456|NS
INLINE| 123456|||||||||1573||||||||||||
INLINE|2019-06-23|$_HolderAcctID|1994|Client_No 1134||1134
INLINE|PRE19007801|NSP070|ZZ||||0
INLINE|NSP123|1234||||
INLINE|NSP123|MI| 123456|9|1.000||||||||||||||||||||
INLINE|NSP123| 123456|1677
INLINE|1234|5678
END
[BLOCK 2]
BEGIN|JUNE_CUST_123|2|END|SW|PREFIX0000123456|NS
INLINE|2019-06-23|$_HolderAcctID|1994|Client_No 1134||1134
INLINE|ZZ|||||
INLINE|||||123456789| |00002|00002| |||||||
INLINE|1688|1134
END
[BLOCK 1]
BEGIN|.…
….
….
END
[BLOCK 2]
BEGIN|.…
….
….
END
[BLOCK 1]
BEGIN|.…
….
….
END
[BLOCK 2]
BEGIN|.…
….
….
END
and so on….

As I said - tried to used RegEx but get stuck as the very same row is included in both block one as well as block two. Numer INLINE nr 2 in block one and INLINE nr 1 in block 2.

Thanks in advance
Thomas

Ekopalypse

@Thomas-Karlsson

if I understood correctly, what about
find what:(?s)(BEGIN.*?\REND\R)(BEGIN.*?\REND\R)
replace with: $1

Alan Kilborn

This is one of those cases where I had to see the solution to understand what the problem really was! :)

Ekopalypse

@Alan-Kilborn

:-D well, then please explain to me because I’m not sure if this is really correct what I’ve posted?

Alan Kilborn

@Ekopalypse

I think you are correct, at least, but it sure would be better for posters to say:

This is what I have:

foo
foo
foo

And this is what I need when I’m done:

bar
bar
bar

And BTW you don’t have to come up with a regex for that.

Thomas Karlsson

Thanks for your answers, appreciated.

Not sure if you are trying to help or if you just making fun of my question?

So, regex or not. Can I somehow read through 1000 and 1000 rows of these 2 blocks and bookmark/exclude/delete Block 2 using Notepad++

Thanks

Alan Kilborn

@Thomas-Karlsson

Nobody’s making fun.

And Yup, just use Eko’s regex replacement.

Ekopalypse

@Thomas-Karlsson

did you try my regex? I thought it should do what you want. It is just not 100% clear if we understood your request correctly in first place.

Thomas Karlsson

Thanks a lot @Ekopalypse & @Alan Kilborn

I’ll see. Sorry for misinterpreted your answers. My bad!

Tried your regex Eko. But it seems to just delete the word BEGIN in both Block1 and Block 2.

Ekopalypse

@Thomas-Karlsson

I assume that the block of data looks like this.
(by the way this is formatted by using ~~~ data ~~~)

BEGIN|JUNE_CUST_123|1|UPC|SW|PREFIX0000123456|NS
INLINE| 123456|||||||||1573||||||||||||
INLINE|2019-06-23|$_HolderAcctID|1994|Client_No 1134||1134
INLINE|PRE19007801|NSP070|ZZ||||0
INLINE|NSP123|1234||||
INLINE|NSP123|MI| 123456|9|1.000||||||||||||||||||||
INLINE|NSP123| 123456|1677
INLINE|1234|5678
END
BEGIN|JUNE_CUST_123|2|END|SW|PREFIX0000123456|NS
INLINE|2019-06-23|$_HolderAcctID|1994|Client_No 1134||1134
INLINE|ZZ|||||
INLINE|||||123456789| |00002|00002| |||||||
INLINE|1688|1134
END
BEGIN| ...

Is this the case? Or is [Block X] really part of the data?

PeterJones

@Thomas-Karlsson ,

Joining the conversation: as the others have hinted at, it helps if:

example data is succinct, without extraneous information, but long enough to show what you want.
you give us distinct blocks of what you want before and after the search/replace
you format the data in a way that it renders properly in the forum. for more on this, see my boilerplate below

-----
FYI: here is some helpful information for finding out more about regular expressions, and for formatting posts in this forum (especially quoting data) so that we can fully understand what you’re trying to ask:

This forum is formatted using Markdown, with a help link buried on the little grey ? in the COMPOSE window/pane when writing your post. For more about how to use Markdown in this forum, please see @Scott-Sumner’s post in the “how to markdown code on this forum” topic, and my updates near the end. It is very important that you use these formatting tips – using single backtick marks around small snippets, and using code-quoting for pasting multiple lines from your example data files – because otherwise, the forum will change normal quotes ("") to curly “smart” quotes (“”), will change hyphens to dashes, will sometimes hide asterisks (or if your text is c:\folder\*.txt, it will show up as c:\folder*.txt, missing the backslash). If you want to clearly communicate your text data to us, you need to properly format it.

If you have further search-and-replace (“matching”, “marking”, “bookmarking”, regular expression, “regex”) needs, study this FAQ and the documentation it points to. Before asking a new regex question, understand that for future requests, many of us will expect you to show what data you have (exactly), what data you want (exactly), what regex you already tried (to show that you’re showing effort), why you thought that regex would work (to prove it wasn’t just something randomly typed), and what data you’re getting with an explanation of why that result is wrong. When you show that effort, you’ll see us bend over backward to get things working for you. If you need help formatting, see the paragraph above.

Please note that for all regex and related queries, it is best if you are explicit about what needs to match, and what shouldn’t match, and have multiple examples of both in your example dataset. Often, what shouldn’t match helps define the regular expression as much or more than what should match.

Thomas Karlsson

Hi guys,

Thanks for all you help and tips in this matters - appreciated.

Yes @Ekopalypse, the data blocks looks like this:

Block one

BEGIN|JUNE_CUST_123|1|UPC|SW|PREFIX0000123456|NS
INLINE|   123456|||||||||1573||||||||||||
INLINE|2019-06-23|$_HolderAcctID|1994|Client_No 1134||1134
INLINE|PRE19007801|NSP070|ZZ||||0
INLINE|NSP123|1234||||
INLINE|NSP123|MI|   123456|9|1.000||||||||||||||||||||
INLINE|NSP123|   123456|1677
INLINE|1234|5678
END

Block two

BEGIN|JUNE_CUST_123|2|END|SW|PREFIX0000123456|NS
INLINE|2019-06-23|$_HolderAcctID|1994|Client_No 1134||1134
INLINE|ZZ|||||
INLINE|||||123456789| |00002|00002| |||||||
INLINE|1688|1134
END

The two Blocks are very similar in it’s structure. But one with more data/rows between the BEGIN and END segments. This Block is the one I call Block One. The other one, what I describe as Block Two has also this BEGIN and END segment but less data/row in between.

The Blocks, one and two, then repeats it self (filled with different data of course) with up to 10.000 times. The goal here is to create some kind of logic that can mark-up/exclude/delete the data in Block Two.
Thinking if the first row in each block could be used e.g

if row begins with:

BEGIN|JUNE_CUST_123|1|UPC|....

read, regardless, until the word END (last word in the block) and keep the block.

if row begins with:

BEGIN|JUNE_CUST_123|2|END|SW|....

read, regardless, until the word END (last word in the block) and markup/exclude/delete the block.

Not sure if any of this make any sense.

Thanks in advance.

Ekopalypse

@Thomas-Karlsson

but this is what my regex does.

Sure you used the exact regex I’ve posted?

PeterJones

@Thomas-Karlsson,

Thanks for adding more detail, and adding the formatting… That really helps your post be more understandable.

@Ekopalypse said:

but this is what my regex does.

Maybe my interpretation is slightly different than yours. You seem to think that @Thomas-Karlsson always expects blocks one and two to be alternating, and that block one will always come immediately before block two. If that’s so, then yours would work.

I looked more into the data, and decided that it’s the |1| vs |2| that determines whether it’s block one or block two, so I would search for (?s)BEGIN\|JUNE_CUST_123\|2\|END.*?END$\R* and replace with nothing. This shows what it looks like with “mark” rather than search/replace:

(Actually, that image was taken before I added the \R* on the end, because when I switched to Replace mode, it left a blank line, which the \R* at the end gets rid of.)

Ekopalypse

@PeterJones

You seem to think that @Thomas-Karlsson always expects blocks one and two to be alternating,

yes, that’s exactly what I thought is the case :-)

PeterJones

@Ekopalypse said:

yes, that’s exactly what I thought is the case :-)

Easily understandable; that’s probably the more natural interpretation. On this forum (and similar), I just tend to assume that the OP isn’t clear in requirements, so I often try to look for more subtle clues, like the |1| vs |2|, in case the OP has accidentally implied more than intended. Sometimes, this means that my solution does not work for the OP… but sometimes it means mine does work when other, more literal interpretations, do not.

Since the OP implied yours wasn’t working (though hasn’t answered your direct question), maybe my interpretation will be the lucky one this time. :-)

If not, @Thomas-Karlsson will have to be more explicit about whether the blocks always alternate, or whether there are ever non-block rows between the two, or other such things.

Thomas Karlsson

@Ekopalypse and @PeterJones - a big BIG thanks for all your help and support.

Eko - your regex was marked up each block, both one and tow so it did the job but at the same time not.

PeterJones - your regex just nailed it and I’m soooo happy!! So impressed. You have know idea how much time I will save now using your regex. Previously I have search and deleted all of theses 1000 and 1000 rows manually.

You have made my day, week, year :) Thanks and thanks in million!

Cheers from happy Scandinavia guy :)