Help Parsing Datafile and removing unwanted strings
-
Hello,
I am tasked with parsing some very large data files that look like the examples below. I am new to RegEx and Notepad++ and will get there eventually, but for now I just need help getting over this hurdle. The desired output is what I'm looking to accomplish. Help, anyone?
Provided Data
þRecordIDþþFullPathþ
þREC00000001þþ\Network Share\Mike Ren Files\C\Users\Mike Ren\Documents\Outlook Files\MikeR.ost\Root - Mailbox\IPM_SUBTREE\Inboxþ
þREC00000002þþ\Teams\Users\John Doe\AppData\Local\Microsoft\Outlook\JohnDoe.ost\Root - Mailbox\IPM_SUBTREE\Inboxþ
þREC00000003þþ\Network Share\John Doe Old Laptop #1\D\Users\JohnDoe\AppData\Local\Microsoft\Outlook\outlook.ost\Root - Mailbox\IPM_SUBTREE\Inbox\Finance\FW: FY12 Audit Reportþ
þREC00000004þþ\Network Share\Jane Smith Data\C\Users\JaneSmith\Documents\Outlook Files\archive.pst\Top of Outlook data file\Inboxþ
þREC00000005þþ\Backups\Mary Larson\HDD1234\E\Back-up (4.16.2018)\archive.pst\Top of Outlook data file\Inbox\Billing - West Regionþ
Desired Output
þRecordIDþþEmail ContainerþþEmail Folder Pathþ
þREC00000001þþMikeR.ostþþRoot - Mailbox\IPM_SUBTREE\Inboxþ
þREC00000002þþJohnDoe.ostþþRoot - Mailbox\IPM_SUBTREE\Inboxþ
þREC00000003þþoutlook.ostþþRoot - Mailbox\IPM_SUBTREE\Inbox\Finance\FW: FY12 Audit Reportþ
þREC00000004þþarchive.pstþþTop of Outlook data file\Inboxþ
þREC00000005þþarchive.pstþþTop of Outlook data file\Inbox\Billing - West Regionþ
-
Hello, @frederick-dalmeida, and All,
As you did not provide a lot of information, I made a few assumptions in order to match your text exactly :
- Regarding the Header line :
  - It begins with the string RecordID, surrounded by the þ character and followed by <DC4>þ, which should remain after replacement
  - It ends with the word FullPath, followed by the þ character
- Regarding the Record lines :
  - Each one begins with the string þREC, followed by 8 digits and the string þ<DC4>þ ( unchanged after replacement )
  - The part beginning with either the string Root - Mailbox or the string Top of Outlook data file, up to the end of the line, must also be unchanged after replacement
  - The range between the two \ placed just before those two strings ( the email container name ) must be rewritten, followed by the string þ<DC4>þ, and the path before it is dropped
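Before running the S/R, it may be worth checking that the file really follows the layout assumed above, in particular that the invisible DC4 character ( \x14 ) sits between the two þ. The following is only a minimal Python sketch of such a check, not part of the Notepad++ procedure; the helper name classify is hypothetical :

```python
import re

# Assumed layout, taken from the points listed above: fields are wrapped in
# þ (U+00FE) and separated by the DC4 control character (U+0014).
HEADER_RE = re.compile(r"^\xfeRecordID\xfe\x14\xfeFullPath\xfe$")
RECORD_RE = re.compile(
    r"^\xfeREC\d{8}\xfe\x14\xfe"                        # þREC + 8 digits + þ<DC4>þ
    r".*\\(?:Root - Mailbox|Top of Outlook data file)"  # one of the two top folders
    r".*\xfe$"                                          # rest of the path + closing þ
)


def classify(line: str) -> str:
    """Return 'header', 'record' or 'other' for one line (hypothetical helper)."""
    if HEADER_RE.match(line):
        return "header"
    if RECORD_RE.match(line):
        return "record"
    return "other"
```

A line reported as other ( for instance one where the DC4 character is missing ) would not be touched by a regex built on these assumptions.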
So :
- First, back up the file(s) concerned by this S/R ! ( IMPORTANT )
- Open the Replace dialog ( Ctrl + H )
- Select the Regular expression search mode
- Tick the Wrap around option
- Fill the Search what: and Replace with: zones with the regexes below ( Copy/Paste ) :
SEARCH
(?-is)^\x{FE}REC\d{8}\xFE\x14\xFE\K.+\\(.+?)\\(?=(?:Root - Mailbox|Top of Outlook data file))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)
REPLACE
(?1\1\xFE\x14\xFE)(?2Email\x20Container\xFE\x14\x{FE}Email\x20Folder\x20)
- Click on the Replace All button, exclusively ( Do not use the Replace one ! )
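For anyone who would rather script this than use the Replace dialog, here is a rough Python counterpart of the S/R above. It is a sketch under assumptions, not the Notepad++ regex itself : Python's re module has no \K and no conditional replacement, so the part to keep is captured in group 1 instead, and the header and the records are handled by two separate patterns. The names SEP, HEADER_RE, RECORD_RE and convert are made up for the example.

```python
import re

SEP = "\xfe\x14\xfe"  # þ<DC4>þ, the field separator assumed in this thread

# Header: keep "þRecordIDþ<DC4>þ" (group 1) and replace the word "Full".
HEADER_RE = re.compile(r"^(\xfeRecordID\xfe\x14\xfe)Full(?=Path\xfe$)")

# Records: keep the ID field (group 1), drop the leading path, and capture
# the last path component before the top folder (group 2, the container).
RECORD_RE = re.compile(
    r"^(\xfeREC\d{8}\xfe\x14\xfe)"
    r".+\\(.+?)\\"
    r"(?=Root - Mailbox|Top of Outlook data file)"
)


def convert(line: str) -> str:
    """Rewrite one line; lines matching neither pattern are returned unchanged."""
    line = HEADER_RE.sub(
        lambda m: m.group(1) + "Email Container" + SEP + "Email Folder ", line
    )
    return RECORD_RE.sub(lambda m: m.group(1) + m.group(2) + SEP, line)
```

Applied line by line, convert should turn a FullPath record into the RecordID, the container name and the folder path, each separated by þ<DC4>þ. Replacement functions are used rather than replacement strings so that the þ and DC4 characters do not need any escaping.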
Best Regards,
guy038
-
-
@guy038
Thanks for the reply. You captured the structure correctly. The regex that you provided only replaces the values in the header row and does not perform the update (search/replace) on the body of the data. Thanks!
-
Hi, @frederick-dalmeida, and All,
Ah, OK ! Probably, the body part of your initial text is indented, as shown below :
þRecordIDþþFullPathþ
    þREC00000001þþ\Network Share\Mike Ren Files\C\Users\Mike Ren\Documents\Outlook Files\MikeR.ost\Root - Mailbox\IPM_SUBTREE\Inboxþ
    þREC00000002þþ\Teams\Users\John Doe\AppData\Local\Microsoft\Outlook\JohnDoe.ost\Root - Mailbox\IPM_SUBTREE\Inboxþ
    þREC00000003þþ\Network Share\John Doe Old Laptop #1\D\Users\JohnDoe\AppData\Local\Microsoft\Outlook\outlook.ost\Root - Mailbox\IPM_SUBTREE\Inbox\Finance\FW: FY12 Audit Reportþ
    þREC00000004þþ\Network Share\Jane Smith Data\C\Users\JaneSmith\Documents\Outlook Files\archive.pst\Top of Outlook data file\Inboxþ
    þREC00000005þþ\Backups\Mary Larson\HDD1234\E\Back-up (4.16.2018)\archive.pst\Top of Outlook data file\Inbox\Billing - West Regionþ
In that case, just add the regex \h* after the ^ symbol, at the beginning of the search regex, which becomes :
SEARCH
(?-is)^\h*\x{FE}REC\d{8}\xFE\x14\xFE\K.+\\(.+?)\\(?=(?:Root - Mailbox|Top of Outlook data file))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)
The indentation will be kept, after the replacement :-)
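In a script, the same idea could look like the sketch below. It is only an approximation : Python's re module does not know \h, so the class [ \t] ( space or tab ) stands in for it, and the leading blanks are part of group 1 so that they are written back. The names INDENTED_RE and convert_text are hypothetical.

```python
import re

INDENTED_RE = re.compile(
    r"^([ \t]*\xfeREC\d{8}\xfe\x14\xfe)"  # optional indentation + ID field, kept
    r".+\\(.+?)\\"                         # path dropped, container captured
    r"(?=Root - Mailbox|Top of Outlook data file)",
    re.MULTILINE,  # let ^ match at the start of every line of the text
)


def convert_text(text: str) -> str:
    """Rewrite every record line of a multi-line string, keeping indentation."""
    return INDENTED_RE.sub(
        lambda m: m.group(1) + m.group(2) + "\xfe\x14\xfe", text
    )
```

Because . never matches a line break here, each record is still processed on its own line, whether it is indented or not.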
Cheers,
guy038
-
@guy038
That did the trick. Thank you!!
P.S. I later found that there were variations in the RecordIDs and some other variations in the email container types and Top Folders, so I made some adjustments to account for those variations:
SEARCH
(?-is)^\x{FE}.*?\xFE\x14\xFE\K.+\(.+?)\(?=(?:Root - |Top of |EAD2EF20-))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)
Again, thank you so much for the help! You are a life saver.
-
Hello, @frederick-dalmeida,
Probably, your regex should have this form, with each literal \ written as \\ :
(?-is)^\x{FE}.*?\xFE\x14\xFE\K.+\\(.+?)\\(?=(?:Root - |Top of |EAD2EF20-))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)
or maybe, if we care about possible indentation :
(?-is)^\h*\x{FE}.*?\xFE\x14\xFE\K.+\\(.+?)\\(?=(?:Root - |Top of |EAD2EF20-))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)
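As a cross-check outside Notepad++, the loosened regex can be approximated in Python as sketched below. This is an illustration under assumptions, not an exact port : the kept part is captured in group 1 because re has no \K, [ \t]* replaces \h*, and the names GENERAL_RE and convert_record, as well as the sample values in the usage note, are made up.

```python
import re

SEP = "\xfe\x14\xfe"  # þ<DC4>þ

GENERAL_RE = re.compile(
    r"^([ \t]*\xfe.*?\xfe\x14\xfe)"   # any first field, optional indentation, kept
    r".+\\(.+?)\\"                     # path dropped, container captured
    r"(?=Root - |Top of |EAD2EF20-)"   # looser top-folder prefixes from the thread
)


def convert_record(line: str) -> str:
    """Rewrite one record line; lines that do not match are returned unchanged."""
    return GENERAL_RE.sub(lambda m: m.group(1) + m.group(2) + SEP, line, count=1)
```

With a made-up record such as þDOC-42þ<DC4>þ\Share\box.pst\Top of Personal Folders\Sentþ, the function should keep DOC-42, box.pst and the folder path. The header line contains no \ at all, so this pattern leaves it alone and the header still needs its own replacement.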
For documentation about regular expressions, see here
As you managed to adapt my regex to your needs, I suppose that you correctly understood its syntax :-) But I don’t mind giving you additional information, if necessary !
Best Regards,
guy038