Wanting to Format 2000000 Lines of Text
-
Howdy and good Day to you all,
I have a problem that I am hoping that someone will have a solution to, or something to point me in the right direction.
I have a report about 2 000 000 Lines long, its a outputted formatted report from a *nux based access control system, that the client would like changed from what I’ve given them, to something that can be brought into excel or something similar.
The issue is as follows, the outputted text, can have 1/2/3/4 lines per incident, which means that all I’ve come up with so far is to go through and delete out a lot of the text, I’ve been at it for a few hours, and I am sure there is an easier way to do this.
They havent used TAB’s or Commas to delim the fields, rather they have used spaces… Why? I dont know, perhaps its something to do with the export procedure from the *nux box to the windows terminal, I am unsure (Sadly I cannot get back to the terminal to see if there is something else I can get, but when I exported it a few months ago (yes the client got back to me and said they want it in something they can view etc) the only output format available through the terminal was this text file.
So the basic structure is
Line 1:
Line 2:
Line 3:
Line 4:The basic Fields are
Field 1 - Incident Number (194xxxx —> Onwards)
Field 2 - Date
Field 3 - Time
Field 4 - The Access Controls Time (So the time in the equipment as apposed to field 2/3 which is the time on the terminal)
Field 5 - The verbose output from the transaction, which is split over 1/2/3 lines (sometimes 4)or
1947728 Sun Nov 26 06:57:54 2017 (CH: Nov 26 07:00:57)ENTRY GRANTED to USER
6123: Michelle Coghlan at Main Entry Gate
1947729 Sun Nov 26 06:59:32 2017 (CH: Nov 26 07:02:35)ENTRY GRANTED to USER
6123: Michelle Coghlan at ACUTE CARE AT HOME
1947730 Sun Nov 26 06:59:33 2017 (CH: Nov 26 07:02:36)ENTRY GRANTED to USER
6123: Michelle Coghlan at ACUTE CARE AT HOME
1947743 Sun Nov 26 06:59:42 2017 (CH: Nov 26 07:02:38)ENTRY GRANTED to USER
6123: Michelle Coghlan at ACUTE CARE AT HOME
1947744 Sun Nov 26 07:00:38 2017 (CH: Nov 26 07:03:42)ENTRY GRANTED to USER
6512: Stephen Knynenburg at Main Entry Gate
1947745 Sun Nov 26 07:00:39 2017 (CH: Nov 26 07:03:43)ENTRY GRANTED to USER
6512: Stephen Knynenburg at Main Entry Gate
1947746 Sun Nov 26 07:01:58 2017 (CH: Nov 26 07:05:02)ENTRY GRANTED to USER
6512: Stephen Knynenburg at ACUTE CARE AT
HOMEI’ve tried a few things, but basically all I’ve come up with is removing the (spaces) on the second and third line, then going to the end of each line and deleting the cr/lf and repeating…
I am really hoping someone can help with this…
Thanks!
-
You’ve given a pretty good example of what you are starting with, but maybe you should give an example (using the same data) as what you want to end up with as well (instead of just loosely describing it: “all I’ve come up with…”).
You say the results of your effort should be “brought into Excel or…similar”, so I’d suggest that you also verify that the example transformed data actually imports into Excel cleanly and as you expect.
At that point probably someone here will jump in and offer some helpful advice on how to transform your data.
-
AND!
Every 80-90 Lines of Text there is a page header as well that I would like to remove, I can simply run CTRL-H and replace the text and make white space and delete it, bit is there a nicer way, or did I hit in on the head the first time?
What would like to end up with is either
1947729 Sun Nov 26 06:59:32 2017 (CH: Nov 26 07:02:35)ENTRY GRANTED to USER6123: Michelle Coghlan at ACUTE CARE AT HOME
1947730 Sun Nov 26 06:59:33 2017 (CH: Nov 26 07:02:36)ENTRY GRANTED to USER6123: Michelle Coghlan at ACUTE CARE AT HOME
1947743 Sun Nov 26 06:59:42 2017 (CH: Nov 26 07:02:38)ENTRY GRANTED to USER6123: Michelle Coghlan at ACUTE CARE AT HOME
1947744 Sun Nov 26 07:00:38 2017 (CH: Nov 26 07:03:42)ENTRY GRANTED to USER6512: Stephen Knynenburg at Main Entry Gate
1947745 Sun Nov 26 07:00:39 2017 (CH: Nov 26 07:03:43)ENTRY GRANTED to USER6512: Stephen Knynenburg at Main Entry Gate
1947746 Sun Nov 26 07:01:58 2017 (CH: Nov 26 07:05:02)ENTRY GRANTED to USER 6512: Stephen Knynenburg at ACUTE CARE AT HOMEAll on one line, rather than being spread over multiple lines OR
1947729,Sun Nov 26 06:59:32 2017,(CH: Nov 26 07:02:35),ENTRY GRANTED to USER6123: Michelle Coghlan at ACUTE CARE AT HOME
1947730,Sun Nov 26 06:59:33 2017,(CH: Nov 26 07:02:36),ENTRY GRANTED to USER6123: Michelle Coghlan at ACUTE CARE AT HOME
1947743,Sun Nov 26 06:59:42 2017,(CH: Nov 26 07:02:38),ENTRY GRANTED to USER6123: Michelle Coghlan at ACUTE CARE AT HOME
1947744,Sun Nov 26 07:00:38 2017,(CH: Nov 26 07:03:42),ENTRY GRANTED to USER6512: Stephen Knynenburg at Main Entry Gate
1947745,Sun Nov 26 07:00:39 2017,(CH: Nov 26 07:03:43),ENTRY GRANTED to USER6512: Stephen Knynenburg at Main Entry Gate
1947746,Sun Nov 26 07:01:58 2017,(CH: Nov 26 07:05:02),ENTRY GRANTED to USER 6512: Stephen Knynenburg at ACUTE CARE AT HOMESomething that is delimited so I can get some columns going with excel, or such product.
Thanks!
-
Essentially I would like to find a way to remove all the mutliple lines that belong to the same lines data, so that its easier to read for the client, and they can import it directly into an excel
eg current if you where to pull it into excel you get ( For : 1947729 Sun Nov 26 06:59:32 2017 (CH: Nov 26 07:02:35)ENTRY GRANTED to USER6123: Michelle Coghlan at ACUTE CARE AT HOME)
(The output from the TXT File is as follows
1947729 Sun Nov 26 06:59:32 2017 (CH: Nov 26 07:02:35)ENTRY GRANTED to USER
6123: Michelle Coghlan at ACUTE CARE AT HOME
)Line 1: Cell 1 Cell 2 Cell 3 Cell 4
Line 2: Cell 5
Line 3: Cell 6 (sometimes)
Line 4: Cell 7 (Sometimes) (Depends on the length of the verbose output and how many characters it has.
Example
Line 1: Cell 1 1947729 Cell 2 Sun Nov 26 06:59:32 2017 Cell 3 (CH: Nov 26 07:02:35) Cell 4 ENTRY GRANTED to USER6
Line 2: Cell 5 123: Michelle Coghlan at ACUTE CARE AT HOMENow when I pull the raw txt file into excel I get lines like this
-
If I were you, I’d do it in several steps.
First, remove your page header. Since you didn’t provide any detail, you’re on your own on that. Hopefully you know what to do with this.
Second, combine your multi-line data into a single line. Basically what you want to do here is to look at how a line starts and if it doesn’t start with a 7-digit number followed by a day of the week, then you want to combine that line with the following line. Here’s what I’d do:
Find what zone:
(?-is)^(\d{7} (?:Sun|Mon|Tue).+)\R(?!\d{7} (?:Sun|Mon|Tue))(.+)
Replace with zone:\1\2
Search mode: Regular expression
Wrap around: ticked
Action: Press Replace All button (do this multiple times, I guess up to 4 or 5 based upon your original problem statement, until all of the broken lines are combinedNote: I left out the remaining days of the week, you can easily fill those in between
|
characters. Exercise for the reader.Third, reformat the data for Excel purposes. That part isn’t so hard, so again, left as exercise for the reader.
-
Hello, @mark-allen, and All,
This morning ( in France ), I succeeded to find a correct regex which should do what your want to. But, I was away from home, this afternoon ! This regex changes each multi-lines transaction into a single line transaction and, in addition, separates the different fields with commas ;-))
Of course, I needed some time to figure out all the problem ! And, as Scott said, in his last post, we need to know the structure of your page header in order to build the correct final regex ! In the meanwhile, I supposed that your file does not contain any header
In your first post, yous said :
the outputted text, can have 1/2/3/4 lines per incident
and
Field 5 - The verbose output from the transaction, which is split over 1/2/3 lines (sometimes 4)
So, I imagined the simple test text , below, that you’ll paste in a new tab :
1947728 Sun Nov 26 06:57:54 2017 (CH: Nov 26 07:00:57)<Output_1> <Output_2> <Output_3> <Output_4> 1941234 ......... 1947728 Sun Nov 26 06:57:54 2017 (CH: Nov 26 07:00:57)<Output_1> <Output_2> <Output_3> 1941234 ......... 1947728 Sun Nov 26 06:57:54 2017 (CH: Nov 26 07:00:57)<Output_1> <Output_2> 1941234 .........
As you can notice, I supposed that the verbose output is always split in, at least,
2
lines, and sometimes in3
or4
lines ( Hence, no verbose output in a single line only ! ). If I’m wrong about it, just tell in order to modify the regex ;-)
Now, open the Replace dialog
Insert the regex
(?-s)(^194\d{4})\x20(.+)\x20(\(.+\))(.+)\R(.+)(?:\R(.+)(?:\R(.+))??)??(?=\R194\d{4}\x20|\Z)
, in the Find what: zoneInsert the regex
\1,\2,\3,\4\5\6\7
, in the Replace with: zoneTick the
Wrap around
optionClick, once, on the
Replace All
button or several times on theReplace
buttonWe get the following text :
1947728,Sun Nov 26 06:57:54 2017,(CH: Nov 26 07:00:57),<Output_1><Output_2><Output_3><Output_4> 1941234 ......... 1947728,Sun Nov 26 06:57:54 2017,(CH: Nov 26 07:00:57),<Output_1><Output_2><Output_3> 1941234 ......... 1947728,Sun Nov 26 06:57:54 2017,(CH: Nov 26 07:00:57),<Output_1><Output_2> 1941234 .........
So the important question is : Do you expect such a text ?
Secondly, using your list of transactions, also given in your first post :
1947728 Sun Nov 26 06:57:54 2017 (CH: Nov 26 07:00:57)ENTRY GRANTED to USER 6123: Michelle Coghlan at Main Entry Gate 1947729 Sun Nov 26 06:59:32 2017 (CH: Nov 26 07:02:35)ENTRY GRANTED to USER 6123: Michelle Coghlan at ACUTE CARE AT HOME 1947730 Sun Nov 26 06:59:33 2017 (CH: Nov 26 07:02:36)ENTRY GRANTED to USER 6123: Michelle Coghlan at ACUTE CARE AT HOME 1947743 Sun Nov 26 06:59:42 2017 (CH: Nov 26 07:02:38)ENTRY GRANTED to USER 6123: Michelle Coghlan at ACUTE CARE AT HOME 1947744 Sun Nov 26 07:00:38 2017 (CH: Nov 26 07:03:42)ENTRY GRANTED to USER 6512: Stephen Knynenburg at Main Entry Gate 1947745 Sun Nov 26 07:00:39 2017 (CH: Nov 26 07:03:43)ENTRY GRANTED to USER 6512: Stephen Knynenburg at Main Entry Gate 1947746 Sun Nov 26 07:01:58 2017 (CH: Nov 26 07:05:02)ENTRY GRANTED to USER 6512: Stephen Knynenburg at ACUTE CARE AT HOME
This same regex S/R produced :
1947728,Sun Nov 26 06:57:54 2017,(CH: Nov 26 07:00:57),ENTRY GRANTED to USER6123: Michelle Coghlan at Main Entry Gate 1947729,Sun Nov 26 06:59:32 2017,(CH: Nov 26 07:02:35),ENTRY GRANTED to USER6123: Michelle Coghlan at ACUTE CARE AT HOME 1947730,Sun Nov 26 06:59:33 2017,(CH: Nov 26 07:02:36),ENTRY GRANTED to USER6123: Michelle Coghlan at ACUTE CARE AT HOME 1947743,Sun Nov 26 06:59:42 2017,(CH: Nov 26 07:02:38),ENTRY GRANTED to USER6123: Michelle Coghlan at ACUTE CARE AT HOME 1947744,Sun Nov 26 07:00:38 2017,(CH: Nov 26 07:03:42),ENTRY GRANTED to USER6512: Stephen Knynenburg at Main Entry Gate 1947745,Sun Nov 26 07:00:39 2017,(CH: Nov 26 07:03:43),ENTRY GRANTED to USER6512: Stephen Knynenburg at Main Entry Gate 1947746,Sun Nov 26 07:01:58 2017,(CH: Nov 26 07:05:02),ENTRY GRANTED to USER6512: Stephen Knynenburg at ACUTE CARE ATHOME
May be, you noticed that the last line, after replacement, ends with at ACUTE CARE ATHOME, where the words AT and HOME are attached. Quite logical as there is, only, a line-break between …at ACUTE CARE AT and the last line HOME
Then, in a new Excel page, I inserted that text as External data, indicating that the fields were comma separated and saved it as a common
.xls
file, without any problem :-)
Finally, in your first post, you also said :
I have a report about 2 000 000 Lines long,
So, I tried to determine if this regex S/R could be done in a sensible time, without any bug or catastrophic backtracking of the N++ regex engine !
I duplicated your
15
lines text, below :1947728 Sun Nov 26 06:57:54 2017 (CH: Nov 26 07:00:57)ENTRY GRANTED to USER 6123: Michelle Coghlan at Main Entry Gate 1947729 Sun Nov 26 06:59:32 2017 (CH: Nov 26 07:02:35)ENTRY GRANTED to USER 6123: Michelle Coghlan at ACUTE CARE AT HOME 1947730 Sun Nov 26 06:59:33 2017 (CH: Nov 26 07:02:36)ENTRY GRANTED to USER 6123: Michelle Coghlan at ACUTE CARE AT HOME 1947743 Sun Nov 26 06:59:42 2017 (CH: Nov 26 07:02:38)ENTRY GRANTED to USER 6123: Michelle Coghlan at ACUTE CARE AT HOME 1947744 Sun Nov 26 07:00:38 2017 (CH: Nov 26 07:03:42)ENTRY GRANTED to USER 6512: Stephen Knynenburg at Main Entry Gate 1947745 Sun Nov 26 07:00:39 2017 (CH: Nov 26 07:03:43)ENTRY GRANTED to USER 6512: Stephen Knynenburg at Main Entry Gate 1947746 Sun Nov 26 07:01:58 2017 (CH: Nov 26 07:05:02)ENTRY GRANTED to USER 6512: Stephen Knynenburg at ACUTE CARE AT HOME
68,000
times ! So, I was left with a file of58,412,000
bytes, containing1,020,000
lines. Then running the above regex S/R, on my old (2005 ) Windows XP laptop ( with only 1GB of RAM ! ) the476,000
replacements took69s
, getting a file of57,800,000
bytes, containing476,000
lines ;-))Consequently, I suppose that your file with
2,000,000
lines should be, correctly, handled, with a recent laptop and OS !So, just tell us about your header and it’ll be only a matter of minutes to get the job done ;-))
See you later,
Best Regards,
guy038
-
Morning All! (Or Afternoon or Evening!)
Scott! You are correct I didnt give you the header!
This is the header, which is static throughout, the only thing that changes is the Page #
This is the header, I can go through and essentially remove it manually or use CTRL-H and replace the Text with a “space”
Door User Activity Reported on Fri Mar 2 09:50:00 2018
Initiated by Master from Main Office Page 6292
Messia Plains AmbulatoryI’ve tried your expression, and sadly it didnt work for me,
In the Find What Line :(?-is)^(\d{7} (?:Sun|Mon|Tue|Wed|Thu|Fri|Sat).+)\R(?!\d{7} (?:Sun|Mon|Tue|Wed|Thu|Fri|Sat))(.+)
In the Replace with: \1\2Each time it has come up with Zero Lines Replaced
I made the sample size smaller, just 1 Page, and the same issue, made it just one line, and he same issue (Zero Lines replaced), I thought perhaps i might have an outdated version, and I’ve updated to the latest version effective today (7.5.6 32 Bit) I have no plug-ins enabled or installed
I then thought perhaps the data because it only has monday (Mon) in it, was breaking the expression (I know, I doubted it as well, but its a bit above my head, but wanted to try!)
I removed Tuesday-Sunday, and left monday only, Still the same result.
So at a bit of a loss with that one, I HAVE to be missing something there or some plig in or something for that… :(
I’ve put up 4 pages of the text so you can see the formatting
Thanks for your response as well!!
I’ve tried your expression as well, and I get the same response, it tells me that there are zero changes, I’ve hit the button many times just to ensure I am not the reason for this
One thing I’ve noticed is that there is a Space before the first 7 digit character, will that upset things?
Space before first 7 Digit Number
What you have outputted is exactly what I am after
1947728,Sun Nov 26 06:57:54 2017,(CH: Nov 26 07:00:57),ENTRY GRANTED to USER6123: Michelle Coghlan at Main Entry Gate
1947729,Sun Nov 26 06:59:32 2017,(CH: Nov 26 07:02:35),ENTRY GRANTED to USER6123: Michelle Coghlan at ACUTE CARE AT HOME
1947730,Sun Nov 26 06:59:33 2017,(CH: Nov 26 07:02:36),ENTRY GRANTED to USER6123: Michelle Coghlan at ACUTE CARE AT HOME
1947743,Sun Nov 26 06:59:42 2017,(CH: Nov 26 07:02:38),ENTRY GRANTED to USER6123: Michelle Coghlan at ACUTE CARE AT HOME
1947744,Sun Nov 26 07:00:38 2017,(CH: Nov 26 07:03:42),ENTRY GRANTED to USER6512: Stephen Knynenburg at Main Entry Gate
1947745,Sun Nov 26 07:00:39 2017,(CH: Nov 26 07:03:43),ENTRY GRANTED to USER6512: Stephen Knynenburg at Main Entry Gate
1947746,Sun Nov 26 07:01:58 2017,(CH: Nov 26 07:05:02),ENTRY GRANTED to USER6512: Stephen Knynenburg at ACUTE CARE ATHOMEOne Line, with comma’s that can be brought into excel nice and neat!
I will keep playing, but any thoughts would be appreciated! and I will be sure to save this away if asked again!
Mark
nb. Sorry dont know the syntax to nicely put things into code box’s
-
The simplest way to represent your data accurately here is by pasting it between a line containing ```z and a line containing ```
Thus, if you do the following:
```z
(copy and paste your real data HERE)
```It (some of what I think your real data is) will render as:
1947729 Sun Nov 26 06:59:32 2017 (CH: Nov 26 07:02:35)ENTRY GRANTED to USER 6123: Michelle Coghlan at ACUTE CARE AT HOME 1947729 Sun Nov 26 06:59:32 2017 (CH: Nov 26 07:02:35)ENTRY GRANTED to USER 6123: Michelle Coghlan at ACUTE CARE AT HOME
And really, accurately representing your data is the only way to get good advice. Otherwise, responders are guessing, and when you try their advice you get Replace: No occurrence was found. Then when you say, “Oh, that’s not really what I meant…” and attempt to clarify, the helpers have lost interest and have moved on to more interesting pursuits…
Note: You can also indent your entire dataset by 4 spaces and it will render like the following, but I find the above method easier (and the colors are often wacky in what follows).
1947729 Sun Nov 26 06:59:32 2017 (CH: Nov 26 07:02:35)ENTRY GRANTED to USER 6123: Michelle Coghlan at ACUTE CARE AT HOME 1947729 Sun Nov 26 06:59:32 2017 (CH: Nov 26 07:02:35)ENTRY GRANTED to USER 6123: Michelle Coghlan at ACUTE CARE AT HOME
Also note: Screenshots aren’t going to cut it. Nobody is going to retype your data from a screenshot to test out a proposed match expression.
-
I forgot to mention that your “Sample Text” link is good in that it has copyable text, but it is bad in that it doesn’t preserve the leading spaces (as shown in some of the screenshot versions) on the lines! :-(
-
agh! Cant Win :) But this is the data
1947727 Sun Nov 26 00:44:12 2017 (CH: Nov 26 00:47:14)ENTRY GRANTED to USER 1313: State Government Security at Main Entry Gate 1947728 Sun Nov 26 06:57:54 2017 (CH: Nov 26 07:00:57)ENTRY GRANTED to USER 6123: Michelle Coghlan at Main Entry Gate 1947729 Sun Nov 26 06:59:32 2017 (CH: Nov 26 07:02:35)ENTRY GRANTED to USER 6123: Michelle Coghlan at ACUTE CARE AT HOME 1947730 Sun Nov 26 06:59:33 2017 (CH: Nov 26 07:02:36)ENTRY GRANTED to USER 6123: Michelle Coghlan at ACUTE CARE AT HOME 1947743 Sun Nov 26 06:59:42 2017 (CH: Nov 26 07:02:38)ENTRY GRANTED to USER 6123: Michelle Coghlan at ACUTE CARE AT HOME 1947744 Sun Nov 26 07:00:38 2017 (CH: Nov 26 07:03:42)ENTRY GRANTED to USER 6512: Stephen Knynenburg at Main Entry Gate 1947745 Sun Nov 26 07:00:39 2017 (CH: Nov 26 07:03:43)ENTRY GRANTED to USER 6512: Stephen Knynenburg at Main Entry Gate 1947746 Sun Nov 26 07:01:58 2017 (CH: Nov 26 07:05:02)ENTRY GRANTED to USER 6512: Stephen Knynenburg at ACUTE CARE AT HOME 1947747 Sun Nov 26 07:27:03 2017 (CH: Nov 26 07:30:07)ENTRY GRANTED to USER 6123: Michelle Coghlan at Population Health Rear Entry 1947748 Sun Nov 26 08:08:34 2017 (CH: Nov 26 08:11:39)ENTRY GRANTED to USER 6512: Stephen Knynenburg at ACUTE CARE AT HOME 1947749 Sun Nov 26 09:08:22 2017 (CH: Nov 26 09:11:27)ENTRY GRANTED to USER 6512: Stephen Knynenburg at Population Health Rear Entry 1947750 Sun Nov 26 09:43:14 2017 (CH: Nov 26 09:46:19)ENTRY GRANTED to USER 6512: Stephen Knynenburg at Population Health Rear Entry 1947751 Sun Nov 26 09:46:48 2017 (CH: Nov 26 09:49:53)ENTRY GRANTED to USER 6005: Rose Gorman at Main Entry Gate 1947752 Sun Nov 26 09:49:50 2017 (CH: Nov 26 09:52:55)ENTRY GRANTED to USER 6005: Rose Gorman at ACUTE CARE AT HOME 1947753 Sun Nov 26 09:56:07 2017 (CH: Nov 26 09:59:11)ENTRY GRANTED to USER 6513: Sheela Tuazon at Main Entry Gate 1947754 Sun Nov 26 09:58:25 2017 (CH: Nov 26 10:01:29)ENTRY GRANTED to USER 6513: Sheela Tuazon at ACUTE CARE AT HOME 1947755 Sun Nov 26 10:01:56 2017 (CH: Nov 26 10:05:00)ENTRY GRANTED to USER 6513: Sheela Tuazon at Population Health Rear Entry 1947756 Sun Nov 26 10:08:53 2017 (CH: Nov 26 10:11:58)ENTRY GRANTED to USER 6512: Stephen Knynenburg at Population Health Rear Entry Door User Activity Reported on Fri Mar 2 09:50:00 2018 Initiated by Master from Main Office Page 2 Messia Plains Ambulatory ================================================================================
-
Hello, @mark-allen, @scott-sumner and All,
Ah ! Of course, it’s easy to understand why my regex could not match any string ;-)) Indeed, when regular expressions are involved, even a single space, added or deleted in text, is of importance !
From your recent picture, I noticed four particularities, in your list :
-
Any incident number is separated from beginning of line by a space character
-
The verbose output zones are, both :
-
Preceded by
1
or more space character(s) -
Followed by
1
space character
-
-
Each header block :
-
Begins with the Form Feed character ( the ASCII page-breaking control character, of Unicode value
\x{000C}
, which can be matched, in regular expressions, with the\f
syntax -
Ends with several equal signs (
=
)
-
Refer to https://en.wikipedia.org/wiki/Page_break#Form_feed
With all these elements, it’s not difficult to modify my previous regex, to be close to your real text ;-))
So again, let’s begin with this short example :
1947728 Sun Nov 26 06:57:54 2017 (CH: Nov 26 07:00:57)<Output_1> <Output_2> <Output_3> <Output_4> 1947728 Sun Nov 26 06:57:54 2017 (CH: Nov 26 07:00:57)<Output_1> <Output_2> <Output_3> Door User Activity Reported on Fri Mar 2 09:50:00 2018 Initiated by Master from Main Office Page 2 Messia Plains Ambulatory ================================================================================ 1947728 Sun Nov 26 06:57:54 2017 (CH: Nov 26 07:00:57)<Output_1> <Output_2> 1941234 .........
Using my new regex, below :
SEARCH
(?-s)^\x20(194\d{4})\x20(.+)\x20(\(.+\))(.+)\x20\R\x20+(.+)\x20(?:\R\x20+(.+)\x20(?:\R\x20+(.+)\x20)??)??(?=\Z|\R(\f|\x20194\d{4}\x20))|(?s)\f.*?===\R
REPLACE
?1\1,\2,\3,\4\5\6\7
We get the following text :
1947728,Sun Nov 26 06:57:54 2017,(CH: Nov 26 07:00:57),<Output_1><Output_2><Output_3><Output_4> 1947728,Sun Nov 26 06:57:54 2017,(CH: Nov 26 07:00:57),<Output_1><Output_2><Output_3> 1947728,Sun Nov 26 06:57:54 2017,(CH: Nov 26 07:00:57),<Output_1><Output_2> 1941234 .........
Secondly, given your list of transactions, below, of your LAST post, where I simply added :
-
One space before the first incident number, which was missing
-
A header block, after the
5th
and11th
transactions
1947727 Sun Nov 26 00:44:12 2017 (CH: Nov 26 00:47:14)ENTRY GRANTED to USER 1313: State Government Security at Main Entry Gate 1947728 Sun Nov 26 06:57:54 2017 (CH: Nov 26 07:00:57)ENTRY GRANTED to USER 6123: Michelle Coghlan at Main Entry Gate 1947729 Sun Nov 26 06:59:32 2017 (CH: Nov 26 07:02:35)ENTRY GRANTED to USER 6123: Michelle Coghlan at ACUTE CARE AT HOME 1947730 Sun Nov 26 06:59:33 2017 (CH: Nov 26 07:02:36)ENTRY GRANTED to USER 6123: Michelle Coghlan at ACUTE CARE AT HOME 1947743 Sun Nov 26 06:59:42 2017 (CH: Nov 26 07:02:38)ENTRY GRANTED to USER 6123: Michelle Coghlan at ACUTE CARE AT HOME Door User Activity Reported on Fri Mar 2 09:50:00 2018 Initiated by Master from Main Office Page 26 Messia Plains Ambulatory ================================================================================ 1947744 Sun Nov 26 07:00:38 2017 (CH: Nov 26 07:03:42)ENTRY GRANTED to USER 6512: Stephen Knynenburg at Main Entry Gate 1947745 Sun Nov 26 07:00:39 2017 (CH: Nov 26 07:03:43)ENTRY GRANTED to USER 6512: Stephen Knynenburg at Main Entry Gate 1947746 Sun Nov 26 07:01:58 2017 (CH: Nov 26 07:05:02)ENTRY GRANTED to USER 6512: Stephen Knynenburg at ACUTE CARE AT HOME 1947747 Sun Nov 26 07:27:03 2017 (CH: Nov 26 07:30:07)ENTRY GRANTED to USER 6123: Michelle Coghlan at Population Health Rear Entry 1947748 Sun Nov 26 08:08:34 2017 (CH: Nov 26 08:11:39)ENTRY GRANTED to USER 6512: Stephen Knynenburg at ACUTE CARE AT HOME 1947749 Sun Nov 26 09:08:22 2017 (CH: Nov 26 09:11:27)ENTRY GRANTED to USER 6512: Stephen Knynenburg at Population Health Rear Entry Door User Activity Reported on Fri Mar 2 09:50:00 2018 Initiated by Master from Main Office Page 27 Messia Plains Ambulatory ================================================================================ 1947750 Sun Nov 26 09:43:14 2017 (CH: Nov 26 09:46:19)ENTRY GRANTED to USER 6512: Stephen Knynenburg at Population Health Rear Entry 1947751 Sun Nov 26 09:46:48 2017 (CH: Nov 26 09:49:53)ENTRY GRANTED to USER 6005: Rose Gorman at Main Entry Gate 1947752 Sun Nov 26 09:49:50 2017 (CH: Nov 26 09:52:55)ENTRY GRANTED to USER 6005: Rose Gorman at ACUTE CARE AT HOME 1947753 Sun Nov 26 09:56:07 2017 (CH: Nov 26 09:59:11)ENTRY GRANTED to USER 6513: Sheela Tuazon at Main Entry Gate 1947754 Sun Nov 26 09:58:25 2017 (CH: Nov 26 10:01:29)ENTRY GRANTED to USER 6513: Sheela Tuazon at ACUTE CARE AT HOME 1947755 Sun Nov 26 10:01:56 2017 (CH: Nov 26 10:05:00)ENTRY GRANTED to USER 6513: Sheela Tuazon at Population Health Rear Entry 1947756 Sun Nov 26 10:08:53 2017 (CH: Nov 26 10:11:58)ENTRY GRANTED to USER 6512: Stephen Knynenburg at Population Health Rear Entry Door User Activity Reported on Fri Mar 2 09:50:00 2018 Initiated by Master from Main Office Page 28 Messia Plains Ambulatory ================================================================================
After replacement, the new regex S/R gives :
1947727,Sun Nov 26 00:44:12 2017,(CH: Nov 26 00:47:14),ENTRY GRANTED to USER1313: State Government Security at MainEntry Gate 1947728,Sun Nov 26 06:57:54 2017,(CH: Nov 26 07:00:57),ENTRY GRANTED to USER6123: Michelle Coghlan at Main Entry Gate 1947729,Sun Nov 26 06:59:32 2017,(CH: Nov 26 07:02:35),ENTRY GRANTED to USER6123: Michelle Coghlan at ACUTE CARE AT HOME 1947730,Sun Nov 26 06:59:33 2017,(CH: Nov 26 07:02:36),ENTRY GRANTED to USER6123: Michelle Coghlan at ACUTE CARE AT HOME 1947743,Sun Nov 26 06:59:42 2017,(CH: Nov 26 07:02:38),ENTRY GRANTED to USER6123: Michelle Coghlan at ACUTE CARE AT HOME 1947744,Sun Nov 26 07:00:38 2017,(CH: Nov 26 07:03:42),ENTRY GRANTED to USER6512: Stephen Knynenburg at Main Entry Gate 1947745,Sun Nov 26 07:00:39 2017,(CH: Nov 26 07:03:43),ENTRY GRANTED to USER6512: Stephen Knynenburg at Main Entry Gate 1947746,Sun Nov 26 07:01:58 2017,(CH: Nov 26 07:05:02),ENTRY GRANTED to USER6512: Stephen Knynenburg at ACUTE CARE ATHOME 1947747,Sun Nov 26 07:27:03 2017,(CH: Nov 26 07:30:07),ENTRY GRANTED to USER6123: Michelle Coghlan at Population HealthRear Entry 1947748,Sun Nov 26 08:08:34 2017,(CH: Nov 26 08:11:39),ENTRY GRANTED to USER6512: Stephen Knynenburg at ACUTE CARE ATHOME 1947749,Sun Nov 26 09:08:22 2017,(CH: Nov 26 09:11:27),ENTRY GRANTED to USER6512: Stephen Knynenburg at PopulationHealth Rear Entry 1947750,Sun Nov 26 09:43:14 2017,(CH: Nov 26 09:46:19),ENTRY GRANTED to USER6512: Stephen Knynenburg at PopulationHealth Rear Entry 1947751,Sun Nov 26 09:46:48 2017,(CH: Nov 26 09:49:53),ENTRY GRANTED to USER6005: Rose Gorman at Main Entry Gate 1947752,Sun Nov 26 09:49:50 2017,(CH: Nov 26 09:52:55),ENTRY GRANTED to USER6005: Rose Gorman at ACUTE CARE AT HOME 1947753,Sun Nov 26 09:56:07 2017,(CH: Nov 26 09:59:11),ENTRY GRANTED to USER6513: Sheela Tuazon at Main Entry Gate 1947754,Sun Nov 26 09:58:25 2017,(CH: Nov 26 10:01:29),ENTRY GRANTED to USER6513: Sheela Tuazon at ACUTE CARE AT HOME 1947755,Sun Nov 26 10:01:56 2017,(CH: Nov 26 10:05:00),ENTRY GRANTED to USER6513: Sheela Tuazon at Population HealthRear Entry 1947756,Sun Nov 26 10:08:53 2017,(CH: Nov 26 10:11:58),ENTRY GRANTED to USER6512: Stephen Knynenburg at PopulationHealth Rear Entry
Voilà !
Quite sure that it’s what you expect to, don’t you ?
Cheers,
guy038
P.S. :
-
I’ll give some explanations on my regex S/R, in a next post, if the resulting replacement is OK for you !
-
I also assume that the last non-blank line, of your file, is, either :
-
The line, with equal signs, of a header block, followed by a Windows line-break (
\r\n
) -
A verbose output line, possibly followed with any range of
\r
,\n
or\f
character(s)
-
-
-
Hello All!
Sorry All! Was away from the desk for a few days, BUT I ran it and it works fine, for the first 30 pages, then after then first 30 pages it does this
1949977,Tue Nov 28 11:48:52 2017,(CH: Nov 28 11:52:11),ENTRY GRANTED to USER6295: Rachael Downie at AMH Rear Entry 1949978,Tue Nov 28 11:48:55 2017,(CH: Nov 28 11:52:13),ENTRY GRANTED to USER40: Authorised Student Card 17 at CH RearEntry 1949979,Tue Nov 28 11:49:06 2017,(CH: Nov 28 11:52:25),ENTRY GRANTED to USER10885: Anne Volling at CH Rear Entry 1949980,Tue Nov 28 11:49:14 2017,(CH: Nov 28 11:52:33),ENTRY GRANTED to USER6048: Elizabeth Apro at Basement comms Room 1949981,Tue Nov 28 11:49:29 2017,(CH: Nov 28 11:52:48),ENTRY GRANTED to USER5990: Tracy Brighouse at AMH Reception 1949982,Tue Nov 28 11:49:35 2017,(CH: Nov 28 11:52:54),ENTRY GRANTED to USER6130: Narelle Broad at OH Front Entry 1949983,Tue Nov 28 11:50:23 2017,(CH: Nov 28 11:53:41),ENTRY GRANTED to USER6048: Elizabeth Apro at Plant Room to CommsRoom door 1949984,Tue Nov 28 11:50:36 2017,(CH: Nov 28 11:53:54),ENTRY GRANTED to USER5990: Tracy Brighouse at AMH Front Entry 1949985,Tue Nov 28 11:50:40 2017,(CH: Nov 28 11:53:58),ENTRY GRANTED to USER5990: Tracy Brighouse at AMH Reception 1949986,Tue Nov 28 11:51:03 2017,(CH: Nov 28 11:54:22),ENTRY GRANTED to USER10885: Anne Volling at CH Rear Entry 1949987,Tue Nov 28 11:51:09 2017,(CH: Nov 28 11:54:28),ENTRY GRANTED to USER9828: Rosalinda Whittmaack at AMH Rear Entry 1949988,Tue Nov 28 11:51:32 2017,(CH: Nov 28 11:54:51),ENTRY GRANTED to USER5990: Tracy Brighouse at AMH Rear Entry 1949989,Tue Nov 28 11:51:51 2017,(CH: Nov 28 11:55:10),ENTRY GRANTED to USER6048: Elizabeth Apro at Staff Entry GroundEast 1949990,Tue Nov 28 11:54:54 2017,(CH: Nov 28 11:58:13),ENTRY GRANTED to USER6295: Rachael Downie at AMH Rear Entry 1949991,Tue Nov 28 11:55:00 2017,(CH: Nov 28 11:58:18),ENTRY GRANTED to USER6295: Rachael Downie at AMH Consult B08 1949992,Tue Nov 28 11:55:05 2017,(CH: Nov 28 11:58:24),ENTRY GRANTED to USER11235: Kristi Davies at Admin Entry 1949993,Tue Nov 28 11:55:08 2017,(CH: Nov 28 11:58:27),ENTRY GRANTED to USER6295: Rachael Downie at AMH Rear Entry 1949994,Tue Nov 28 11:55:10 2017,(CH: Nov 28 11:58:28),ENTRY GRANTED to USER6375: Ahmed Mansour at AMH Reception 1949995,Tue Nov 28 11:57:23 2017,(CH: Nov 28 12:00:41),ENTRY GRANTED to USER10593: Emma Phelan at AMH Rear Entry 1949996,Tue Nov 28 11:57:47 2017,(CH: Nov 28 12:01:06),ENTRY GRANTED to USER6127: Daniel Nguyen at AMH Reception 1949997,Tue Nov 28 11:57:49 2017,(CH: Nov 28 12:01:07),ENTRY GRANTED to USER10593: Emma Phelan at AMH Clinic B15 1949998,Tue Nov 28 11:57:56 2017,(CH: Nov 28 12:01:15),ENTRY GRANTED to USER7656: Angela McCoy at A & P Rear Entry 1949999 Tue Nov 28 11:58:30 2017 (CH: Nov 28 12:01:48)ENTRY GRANTED to USER 7656: Angela McCoy at Admin Entry 1950000 Tue Nov 28 11:58:43 2017 (CH: Nov 28 12:02:01)ENTRY GRANTED to USER 9886: Janet Meade at Main Entry Gate 1950001 Tue Nov 28 11:58:59 2017 (CH: Nov 28 12:02:17)ENTRY GRANTED to USER 10041: Christina Davidson at A & P Foyer Entry 1950002 Tue Nov 28 11:59:06 2017 (CH: Nov 28 12:02:24)ENTRY GRANTED to USER 7656: Angela McCoy at Gym Internal Entry 1950003 Tue Nov 28 11:59:22 2017 (CH: Nov 28 12:02:40)ENTRY GRANTED to USER 10041: Christina Davidson at A & P Rear Entry 1950004 Tue Nov 28 12:00:21 2017 (CH: Nov 28 12:03:39)ENTRY GRANTED to USER 9886: Janet Meade at Staff Entry Ground East 1950005 Tue Nov 28 12:00:40 2017 (CH: Nov 28 12:03:58)ENTRY GRANTED to USER 6296: Peta Picton at Admin Entry 1950006 Tue Nov 28 12:00:52 2017 (CH: Nov 28 12:04:10)ENTRY GRANTED to USER 6048: Elizabeth Apro at CH Rear Entry 1950007 Tue Nov 28 12:01:29 2017 (CH: Nov 28 12:04:47)ENTRY GRANTED to USER 6296: Peta Picton at A & P Foyer Entry 1950008 Tue Nov 28 12:01:42 2017 (CH: Nov 28 12:05:00)ENTRY GRANTED to USER 10184: Jasmin-marie Lim at Gym Internal Entry 1950009 Tue Nov 28 12:02:02 2017 (CH: Nov 28 12:05:20)ENTRY GRANTED to USER 10184: Jasmin-marie Lim at A & P Foyer Entry 1950010 Tue Nov 28 12:02:27 2017 (CH: Nov 28 12:05:45)ENTRY GRANTED to USER 6016: Rhonda Rainbird at Conference External Entry 1950011 Tue Nov 28 12:02:29 2017 (CH: Nov 28 12:05:47)ENTRY GRANTED to USER 10184: Jasmin-marie Lim at A & P Rear Entry 1950012 Tue Nov 28 12:02:35 2017 (CH: Nov 28 12:05:53)ENTRY GRANTED to USER 6123: Michelle Coghlan at ACUTE CARE AT HOME 1950013 Tue Nov 28 12:02:52 2017 (CH: Nov 28 12:06:11)ENTRY GRANTED to USER 6016: Rhonda Rainbird at OH Front Entry 1950014 Tue Nov 28 12:02:52 2017 (CH: Nov 28 12:06:11)ENTRY GRANTED to USER 6286: Laura Fitzalan at CH Rear Entry 1950015 Tue Nov 28 12:03:05 2017 (CH: Nov 28 12:06:24)ENTRY GRANTED to USER 6048: Elizabeth Apro at Admin Entry 1950016 Tue Nov 28 12:03:10 2017 (CH: Nov 28 12:06:29)ENTRY GRANTED to USER 6103: Helen McGregor at Staff Entry Ground East 1950017 Tue Nov 28 12:03:41 2017 (CH: Nov 28 12:06:59)ENTRY GRANTED to USER 7656: Angela McCoy at Gym Internal Entry 1950018 Tue Nov 28 12:03:57 2017 (CH: Nov 28 12:07:16)ENTRY GRANTED to USER 10041: Christina Davidson at Staff Entry Ground East 1950019 Tue Nov 28 12:04:12 2017 (CH: Nov 28 12:07:31)ENTRY GRANTED to USER 10041: Christina Davidson at Basement comms Room
[Output 1(https://ibb.co/cNXgexl)
Repeating the process doesnt yield any changes, changing
(?-s)^\x20(194\d{4})\x20(.+)\x20((.+))(.+)\x20\R\x20+(.+)\x20(?:\R\x20+(.+)\x20(?:\R\x20+(.+)\x20)??)??(?=\Z|\R(\f|\x20194\d{4}\x20))|(?s)\f.*?===\R
to
(?-s)^\x20(195\d{4})\x20(.+)\x20((.+))(.+)\x20\R\x20+(.+)\x20(?:\R\x20+(.+)\x20(?:\R\x20+(.+)\x20)??)??(?=\Z|\R(\f|\x20194\d{4}\x20))|(?s)\f.*?===\R
For what I am guessing is it so filter the 195(next 4 digits) normally blanks the rest of the document and you end up with a single line of text somtimes :(
I copied 10 pages from the 1950001 to 195xxxx and re-ran the expression, with the 194 changed to 195, and it would normally only change the last line of the page, sometimes get two lines merged into one line
SO CLOSE! Any thoughts? I could email you the source file if you like?
-
Hello, @mark-allen, @scott-sumner and All,
Ah, sorry ! I was really wrong about the Incident Number description… I should have read carefully, because you said, in your first post :
Field 1 - Incident Number (194xxxx —> Onwards)
And also :
I have a report about 2 000 000 Lines long,
So as each transaction is, at least, split in two lines, this means that your file could contain, roughly,
1,000,000
transactions, or a bit less !. As the first number is1940000
=> The last Incident Number should be nearby2940000
So I changed the regex, matching the Incident Number, to
[12]\d{6})
. Note that this regex definition, of an Incident Number, is present, twice, in the overall regex :-
At its beginning
(?-s)^\x20([12]\d{6})\x20(.+)\x20......
-
In the look-ahead part
(?=\Z|\R(\f|\x20[12]\d{6}\x20))
I, also, changed the regex matching the Header, to
(?s)\f.*?===\R\R
, as there is a pure blank line, after the=====
line !
Now, the last version of the regex is :
SEARCH
(?-s)^\x20([12]\d{6})\x20(.+)\x20(\(.+\))(.+)\x20\R\x20+(.+)\x20(?:\R\x20+(.+)\x20(?:\R\x20+(.+)\x20)??)??(?=\Z|\R(\f|\x20[12]\d{6}\x20))|(?s)\f.*?===\R\R
REPLACE
?1\1,\2,\3,\4\5\6\7
Running that regex S/R against your attached file out.txt (
14,399,825
bytes and202,315
lines ) left me with the expected comma separated file, of9,698,482
bytes, containing82,068
linesVoilà :-))
Here is, below, some notes about this regex S/R :
(?-s) # The DOT regex character matches ONLY a SINGLE STANDARD character ( NOT the EOL ones ) ^\x20([12]\d{6})\x20 # An 'Incident Number' ([12]\d{6}), as Group 1, PREDEDED and FOLLOWED with a SPACE character \x20 (.+)\x20 # The Date and Time (.+), as Group 2, which ENDS with a SPACE character \x20 (\(.+\)) # The 'Access Control Time' (\(.+\)), as Group 3, surrounded with PARENTHESES, which are ESCAPED to be considered as LITTERALS (.+)\x20\R # The FIRST 'Verbose Output' zone (.+), as Group 4, FOLLOWED with a SPACE character \x20 and a LINE-BREAK \R \x20+(.+)\x20 # The SECOND 'Verbose Output' zone (.+), as Group 5, NEXT line, PRECEDED with SPACE characters \x20+ and FOLLOWED with a SPACE character \x20 (?:\R\x20+(.+)\x20(?:\R\x20+(.+)\x20)??)?? # The part, between the OUTER parentheses, (?: and )??, may be ASBENT or PRESENT, with priority to ABSENCE # It represents the possible THIRD and FOURTH 'Verbose Output' zones (.+), as Groups 6 and 7 : # - PRECEDED with a LINE-BREAK \R and some SPACE characters \x20+ # - FOLLOWED with a SPACE character \x20 # # Again, the part, between the INNER parentheses, (?: and )??, may be ABSENT or PRESENT, with priority to ABSENCE # It represents the possible FOURTH 'Verbose Output' zone (.+), as Group 7 : # - PRECEDED with a LINE-BREAK \R and some SPACE characters \x20+ # - FOLLOWED with a SPACE character \x20 (?=\Z|\R(\f|\x20[12]\d{6}\x20)) # IF followed with, EITHER : ( LOOK-AHEAD syntax ) # - \Z that is to say SOME \r,\n or \f characters, at the VERY END of file # - a LINE-BREAK \R, followed with, EITHER : # - A CONTROL character \f, beginning a HEADER block # - An INCIDENT number, beginning with '1' or '2', [12]\d{6}, PRECEDED and FOLLOWED with a SPACE character \x20 | # OR (?s) # The DOT regex character matches, ABSOLUTELY, ANY character \f.*?===\R\R # The SHORTEST range of ANY character *.?, EVEN on SEVERAL lines, between : # - A FORM-FEED character \f # - THREE EQUAL signs, FOLLOWED with 2 LINE-BREAKS \R\R # # => The COMPLETE HEADER block, with LINE-BREAKS
To end with, here is this same regex S/R, displayed in Free-spacing mode, with a minimum of comments ! You may run this syntax, as well. Just select all the text, below, and paste it, in the Find what: zone of the Replace dialog :-))
(?x) # FRE-SPACING regex mode # # IMPORTANT : In this mode, any \R syntax must be REPLACED with \r\n # --------- (?-s) # The DOT regex character matches ONLY a SINGLE STANDARD character ( NOT the EOL ones ) ^\x20([12]\d{6})\x20 # Group 1 ( Incident Number ) (.+)\x20 # Group 2 ( Date and Time ) (\(.+\)) # Group 3 ( Access Control Time ) (.+)\x20\r\n # Group 4 ( 1ST Verbose Output ) \x20+(.+)\x20 # Group 5 ( 2ND Verbose Output ) (?:\r\n\x20+(.+)\x20(?:\r\n\x20+(.+)\x20)??)?? # Group 6 and 7 ( 3RD and 4TH Verbose Output ) (?=\Z|\r\n(\f|\x20[12]\d{6}\x20)) # IF the VERY END of file or IF a LINE-BREAK followed with an INCIDENT NUMBER or a HEADER | # OR (?s) # The DOT regex character matches, ABSOLUTELY, ANY character \f.*?===\r\n\r\n # The COMPLETE HEADER block, with its LINE-BREAKS
Remember that the Find what: zone cannot store more than
2046
characters !Cheers,
guy038
-
-
Mate
Thank you so so so Much for your help with this, I truly mean that, it worked perfectly, If its okay, I am going to distribute this way to the supplier of the product so we can use it to make better reports from this program!!!
Truly Thank you heaps, and thanks for Explaination as well, I should be able to use this in the future to write up my own rules…
Again thanks heaps!!!
Mark