Help Parsing Datafile and removing unwanted strings
-
Hello,
I am tasked with parsing some very large datafiles that look like the examples below. I am new to RegEx and Notepad++ and will get there eventually but for now I just need help getting over this hurdle. The desired output is what I’m looking to accomplish. Help anyone?
Provided Data
þRecordIDþþFullPathþ
þREC00000001þþ\Network Share\Mike Ren Files\C\Users\Mike Ren\Documents\Outlook Files\MikeR.ost\Root - Mailbox\IPM_SUBTREE\Inboxþ
þREC00000002þþ\Teams\Users\John Doe\AppData\Local\Microsoft\Outlook\JohnDoe.ost\Root - Mailbox\IPM_SUBTREE\Inboxþ
þREC00000003þþ\Network Share\John Doe Old Laptop #1\D\Users\JohnDoe\AppData\Local\Microsoft\Outlook\outlook.ost\Root - Mailbox\IPM_SUBTREE\Inbox\Finance\FW: FY12 Audit Reportþ
þREC00000004þþ\Network Share\Jane Smith Data\C\Users\JaneSmith\Documents\Outlook Files\archive.pst\Top of Outlook data file\Inboxþ
þREC00000005þþ\Backups\Mary Larson\HDD1234\E\Back-up (4.16.2018)\archive.pst\Top of Outlook data file\Inbox\Billing - West Regionþ
Desired Output
þRecordIDþþEmail ContainerþþEmail Folder Pathþ
þREC00000001þþMikeR.ostþþRoot - Mailbox\IPM_SUBTREE\Inboxþ
þREC00000002þþJohnDoe.ostþþRoot - Mailbox\IPM_SUBTREE\Inboxþ
þREC00000003þþoutlook.ostþþRoot - Mailbox\IPM_SUBTREE\Inbox\Finance\FW: FY12 Audit Reportþ
þREC00000004þþarchive.pstþþTop of Outlook data file\Inboxþ
þREC00000005þþarchive.pstþþTop of Outlook data file\Inbox\Billing - West Regionþ
-
Hello, @frederick-dalmeida, and All,
As you did not provide a lot of information, and in order to match your text exactly, I assumed the following points:
- Regarding the header line:
  - It begins with the string RecordID, surrounded by the þ character and followed by <DC4>þ, which should remain after replacement
  - It ends with the word FullPath, followed by the þ character
- Regarding the record lines:
  - Each begins with the string þREC, followed by 8 digits and the string þ<DC4>þ, all unchanged after replacement
  - The part beginning with either the string Root - Mailbox or the string Top of Outlook data file, up to the end of the line, must also be unchanged after replacement
  - The range between the two \ characters placed just before either of those two strings must be rewritten in the output, followed by the string þ<DC4>þ
So:
- First, back up the file(s) concerned by this S/R! (IMPORTANT)
- Open the Replace dialog (Ctrl + H)
- Select the Regular expression search mode
- Tick the Wrap around option
- Fill the Search what: and Replace with: zones with the regexes below (copy/paste):
SEARCH
(?-is)^\x{FE}REC\d{8}\xFE\x14\xFE\K.+\\(.+?)\\(?=(?:Root - Mailbox|Top of Outlook data file))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)
REPLACE
(?1\1\xFE\x14\xFE)(?2Email\x20Container\xFE\x14\x{FE}Email\x20Folder\x20)
- Click on the Replace All button exclusively (do not use the Replace one!)
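For anyone who would rather script this than run it inside Notepad++, here is a minimal Python sketch of the same search/replace. It is not part of the answer above: Python's re module has no \K and no conditional (?1...) replacement syntax, so the sketch uses ordinary capture groups instead. The names QUOTE, SEP, RECORD and convert_line are illustrative only, and the sketch assumes the fields really are separated by þ<DC4>þ (U+00FE, U+0014, U+00FE) as described above.

```python
import re

QUOTE = "\u00fe"                  # þ, the field qualifier
SEP = QUOTE + "\u0014" + QUOTE    # þ<DC4>þ between fields (assumed from the thread)

OLD_HEADER = f"{QUOTE}RecordID{SEP}FullPath{QUOTE}"
NEW_HEADER = f"{QUOTE}RecordID{SEP}Email Container{SEP}Email Folder Path{QUOTE}"

# Record line: þREC########þ<DC4>þ ...\<container>\<top folder>...þ
RECORD = re.compile(
    r"^(\u00feREC\d{8}\u00fe\u0014\u00fe)"                # 1: record ID field + separator, kept
    r".+\\([^\\]+)\\"                                     # 2: last path segment before the top folder
    r"((?:Root - Mailbox|Top of Outlook data file).*)$"   # 3: folder path through the closing þ
)


def convert_line(line: str) -> str:
    """Convert one line (without its line ending) to the desired layout."""
    if line == OLD_HEADER:
        return NEW_HEADER
    match = RECORD.match(line)
    if match is None:
        return line  # anything unrecognized is left untouched
    record_id, container, folder = match.groups()
    return f"{record_id}{container}{SEP}{folder}"
```

As in the Notepad++ regex, the greedy .+ makes the container the last path segment in front of the top folder name, and lines that do not match are passed through unchanged.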
Best Regards,
guy038
-
-
@guy038
Thanks for the reply. You captured the structure correctly. However, the regex that you provided only replaces the values in the header row and does not perform the search/replace on the body of the data. Thanks!
-
Hi, @frederick-dalmeida, and All,
Ah, OK! Probably, the body part of your initial text is indented, as shown below:
þRecordIDþþFullPathþ
    þREC00000001þþ\Network Share\Mike Ren Files\C\Users\Mike Ren\Documents\Outlook Files\MikeR.ost\Root - Mailbox\IPM_SUBTREE\Inboxþ
    þREC00000002þþ\Teams\Users\John Doe\AppData\Local\Microsoft\Outlook\JohnDoe.ost\Root - Mailbox\IPM_SUBTREE\Inboxþ
    þREC00000003þþ\Network Share\John Doe Old Laptop #1\D\Users\JohnDoe\AppData\Local\Microsoft\Outlook\outlook.ost\Root - Mailbox\IPM_SUBTREE\Inbox\Finance\FW: FY12 Audit Reportþ
    þREC00000004þþ\Network Share\Jane Smith Data\C\Users\JaneSmith\Documents\Outlook Files\archive.pst\Top of Outlook data file\Inboxþ
    þREC00000005þþ\Backups\Mary Larson\HDD1234\E\Back-up (4.16.2018)\archive.pst\Top of Outlook data file\Inbox\Billing - West Regionþ
In that case, just add the regex \h* after the ^ symbol at the beginning of the search regex, which becomes:
SEARCH
(?-is)^\h*\x{FE}REC\d{8}\xFE\x14\xFE\K.+\\(.+?)\\(?=(?:Root - Mailbox|Top of Outlook data file))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)
The indentation will be kept after the replacement :-)
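The same idea can be sketched in Python as a whole-text substitution, which is closer to what Replace All does. This is an illustration under assumptions, not the method used in the thread: Python's re has no \h, so [ \t]* stands in for it (covering spaces and tabs only), the leading whitespace is captured and written back instead of being skipped with \K, and the names RECORD and replace_all are made up for the example.

```python
import re

SEP = "\u00fe\u0014\u00fe"  # þ<DC4>þ between fields (assumed from the thread)

RECORD = re.compile(
    r"^([ \t]*\u00feREC\d{8}\u00fe\u0014\u00fe)"      # 1: optional indent + record ID field
    r".+\\([^\\\n]+)\\"                               # 2: container, last segment before the top folder
    r"(?=Root - Mailbox|Top of Outlook data file)",   # top folder is only looked at, so it stays in place
    re.MULTILINE,                                     # let ^ match at the start of every line
)


def replace_all(text: str) -> str:
    """Rewrite every record line in text, keeping any indentation."""
    # A function is used as the replacement so the DC4 control character
    # does not have to be written as an escape inside a template string.
    return RECORD.sub(lambda m: m.group(1) + m.group(2) + SEP, text)
```

Because the top folder sits in a lookahead, only the part before it is replaced and the rest of the line, including the closing þ, is left as it was.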
Cheers,
guy038
-
@guy038
That did the trick. Thank you!!
P.S. I later found that there were variations in the RecordIDs, as well as in the email container types and top folders, so I made some adjustments to account for those variations:
SEARCH
(?-is)^\x{FE}.*?\xFE\x14\xFE\K.+\(.+?)\(?=(?:Root - |Top of |EAD2EF20-))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)
Again, thank you so much for the help! You are a life saver.
-
Hello, @frederick-dalmeida,
Probably, your regex should have this form:
(?-is)^\x{FE}.*?\xFE\x14\xFE\K.+\\(.+?)\\(?=(?:Root - |Top of |EAD2EF20-))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)
or maybe, if we care about possible indentation:
(?-is)^\h*\x{FE}.*?\xFE\x14\xFE\K.+\\(.+?)\\(?=(?:Root - |Top of |EAD2EF20-))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)
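Since the files are described as very large, a script that streams them line by line may be handy. The following Python sketch generalizes the regex above under stated assumptions: the top-folder prefixes are passed in as a parameter (the defaults are the three from this thread), the first field may be any text without þ, and optional leading spaces or tabs are kept. The names build_record_pattern and convert_lines are hypothetical, and nothing here comes from Notepad++ itself.

```python
import re
from typing import Iterable, Iterator

QUOTE = "\u00fe"                  # þ, the field qualifier
SEP = QUOTE + "\u0014" + QUOTE    # þ<DC4>þ between fields (assumed from the thread)
OLD_HEADER = f"{QUOTE}RecordID{SEP}FullPath{QUOTE}"
NEW_HEADER = f"{QUOTE}RecordID{SEP}Email Container{SEP}Email Folder Path{QUOTE}"


def build_record_pattern(markers: Iterable[str]) -> re.Pattern:
    """Compile a record-line pattern for the given top-folder prefixes."""
    alternatives = "|".join(re.escape(marker) for marker in markers)
    return re.compile(
        r"^([ \t]*\u00fe[^\u00fe]*\u00fe\u0014\u00fe)"  # 1: indent + first field + separator
        r".+\\([^\\]+)\\"                               # 2: container, last segment before the marker
        rf"((?:{alternatives}).*)$"                     # 3: folder path through the closing þ
    )


def convert_lines(
    lines: Iterable[str],
    markers: Iterable[str] = ("Root - ", "Top of ", "EAD2EF20-"),
) -> Iterator[str]:
    """Yield converted lines one at a time, preserving line endings."""
    pattern = build_record_pattern(markers)
    for line in lines:
        body = line.rstrip("\r\n")
        ending = line[len(body):]
        if body == OLD_HEADER:
            body = NEW_HEADER
        else:
            match = pattern.match(body)
            if match:
                body = f"{match.group(1)}{match.group(2)}{SEP}{match.group(3)}"
        yield body + ending
```

Any iterable of lines works, including an open file handle, so the whole file never has to sit in memory. The encoding to open the file with is not stated in this thread and has to match the actual data.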
For documentation about regular expressions, see here
As you managed to adapt my regex to your needs, I suppose that you correctly understood its syntax :-) But I don’t mind giving you additional information, if necessary !
Best Regards,
guy038