Help Parsing Datafile and removing unwanted strings



  • Hello,
    I am tasked with parsing some very large datafiles that look like the examples below. I am new to RegEx and Notepad++ and will get there eventually, but for now I just need help getting over this hurdle. The desired output below is what I’m looking to accomplish. Can anyone help?

    Provided Data
    þRecordIDþþFullPathþ
    þREC00000001þþ\Network Share\Mike Ren Files\C\Users\Mike Ren\Documents\Outlook Files\MikeR.ost\Root - Mailbox\IPM_SUBTREE\Inboxþ
    þREC00000002þþ\Teams\Users\John Doe\AppData\Local\Microsoft\Outlook\JohnDoe.ost\Root - Mailbox\IPM_SUBTREE\Inboxþ
    þREC00000003þþ\Network Share\John Doe Old Laptop #1\D\Users\JohnDoe\AppData\Local\Microsoft\Outlook\outlook.ost\Root - Mailbox\IPM_SUBTREE\Inbox\Finance\FW: FY12 Audit Reportþ
    þREC00000004þþ\Network Share\Jane Smith Data\C\Users\JaneSmith\Documents\Outlook Files\archive.pst\Top of Outlook data file\Inboxþ
    þREC00000005þþ\Backups\Mary Larson\HDD1234\E\Back-up (4.16.2018)\archive.pst\Top of Outlook data file\Inbox\Billing - West Regionþ

    Desired Output
    þRecordIDþþEmail ContainerþþEmail Folder Pathþ
    þREC00000001þþMikeR.ostþþRoot - Mailbox\IPM_SUBTREE\Inboxþ
    þREC00000002þþJohnDoe.ostþþRoot - Mailbox\IPM_SUBTREE\Inboxþ
    þREC00000003þþoutlook.ostþþRoot - Mailbox\IPM_SUBTREE\Inbox\Finance\FW: FY12 Audit Reportþ
    þREC00000004þþarchive.pstþþTop of Outlook data file\Inboxþ
    þREC00000005þþarchive.pstþþTop of Outlook data file\Inbox\Billing - West Regionþ



  • Hello, @frederick-dalmeida, and All,

    As you did not provide a lot of information, and in order to match your text exactly, I made a few assumptions:

    • Regarding the Header line :

      • It begins with the string RecordID, surrounded by þ characters and followed by <DC4>þ, all of which should remain after the replacement

      • It ends with the word FullPath, followed by the þ character

    • Regarding the Record lines :

      • Each one begins with the string þREC, followed by 8 digits and the string þ<DC4>þ, all unchanged after the replacement

      • The part beginning with either the string Root - Mailbox or the string Top of Outlook data file, up to the end of the line, must also remain unchanged after the replacement

      • The text between the two \ characters located just before either of those two strings (the email container) must be kept and followed by the string þ<DC4>þ; the rest of the path before it is dropped
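
    For anyone who wants to check these assumptions outside Notepad++, the structure listed above can be sketched in Python. This is only an illustration: the helper name split_record is hypothetical, and the sketch assumes the field separator really is þ<DC4>þ (characters FE 14 FE).

```python
import re

SEP = "\xfe\x14\xfe"  # assumed field separator: þ, DC4 (0x14), þ

# One record line: þREC + 8 digits, the separator, a path whose last segment
# before the top folder is the container, then the top folder up to the final þ.
RECORD = re.compile(
    r"^\xfe(REC\d{8})\xfe\x14\xfe"
    r".+\\(.+?)\\((?:Root - Mailbox|Top of Outlook data file).*)\xfe$"
)

def split_record(line):
    """Return (record id, email container, email folder path), or None if no match."""
    m = RECORD.match(line)
    return m.groups() if m else None

line = ("\xfeREC00000004" + SEP
        + r"\Network Share\Jane Smith Data\C\Users\JaneSmith\Documents"
        + r"\Outlook Files\archive.pst\Top of Outlook data file\Inbox"
        + "\xfe")
print(split_record(line))
```

    A line that does not follow the assumed layout simply yields None, which makes it easy to spot records that need a different rule.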


    So :

    • First, back up the file(s) affected by this search/replace! ( IMPORTANT )

    • Open the Replace dialog ( Ctrl + H )

    • Select the Regular expression search mode

    • Tick the Wrap around option

    Fill in the Search what: and Replace with: fields with the regexes below ( Copy/Paste ) :

    SEARCH (?-is)^\x{FE}REC\d{8}\xFE\x14\xFE\K.+\\(.+?)\\(?=(?:Root - Mailbox|Top of Outlook data file))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)

    REPLACE (?1\1\xFE\x14\xFE)(?2Email\x20Container\xFE\x14\x{FE}Email\x20Folder\x20)

    • Click on the Replace All button only ( Do not use the Replace button ! )
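
    If the files ever need to be processed in a script instead, roughly the same search/replace can be sketched in Python. Python's built-in re module has neither \K nor the (?1...) conditional replacement, so this sketch captures the prefix in a group and writes it back. The function name convert and the two-pattern split are my own choices, not part of the Notepad++ recipe.

```python
import re

SEP = "\xfe\x14\xfe"  # assumed separator: þ, DC4 (0x14), þ

# Record lines: group 1 is the ID prefix (kept), group 2 is the container.
# The lookahead leaves everything from the top folder onwards untouched.
RECORD = re.compile(
    r"^(\xfeREC\d{8}\xfe\x14\xfe).+\\(.+?)\\"
    r"(?=Root - Mailbox|Top of Outlook data file)",
    re.MULTILINE,
)
# Header line: the single FullPath column becomes two columns.
HEADER = re.compile(r"^(\xfeRecordID\xfe\x14\xfe)FullPath(?=\xfe$)", re.MULTILINE)

def convert(text):
    text = HEADER.sub(
        lambda m: m.group(1) + "Email Container" + SEP + "Email Folder Path", text
    )
    return RECORD.sub(lambda m: m.group(1) + m.group(2) + SEP, text)

sample = "\n".join([
    "\xfeRecordID" + SEP + "FullPath\xfe",
    "\xfeREC00000001" + SEP
    + r"\Network Share\Mike Ren Files\C\Users\Mike Ren\Documents"
    + r"\Outlook Files\MikeR.ost\Root - Mailbox\IPM_SUBTREE\Inbox"
    + "\xfe",
])
print(convert(sample))
```

    When reading a real file, the encoding matters: opening it with encoding="latin-1" would map byte FE to þ, but that is an assumption about how the datafile was written.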

    Best Regards,

    guy038



  • @guy038
    Thanks for the reply. You captured the structure correctly. The Regex that you provided only replaces the values in the header row and does not perform the update (search/replace) on the body of the data. Thanks!



  • Hi, @frederick-dalmeida, and All,

    Ah, OK ! Probably, the body part of your initial text is indented, as shown below :

    þRecordIDþþFullPathþ
        þREC00000001þþ\Network Share\Mike Ren Files\C\Users\Mike Ren\Documents\Outlook Files\MikeR.ost\Root - Mailbox\IPM_SUBTREE\Inboxþ
        þREC00000002þþ\Teams\Users\John Doe\AppData\Local\Microsoft\Outlook\JohnDoe.ost\Root - Mailbox\IPM_SUBTREE\Inboxþ
        þREC00000003þþ\Network Share\John Doe Old Laptop #1\D\Users\JohnDoe\AppData\Local\Microsoft\Outlook\outlook.ost\Root - Mailbox\IPM_SUBTREE\Inbox\Finance\FW: FY12 Audit Reportþ
        þREC00000004þþ\Network Share\Jane Smith Data\C\Users\JaneSmith\Documents\Outlook Files\archive.pst\Top of Outlook data file\Inboxþ
        þREC00000005þþ\Backups\Mary Larson\HDD1234\E\Back-up (4.16.2018)\archive.pst\Top of Outlook data file\Inbox\Billing - West Regionþ
    

    In that case, just add the regex \h*, after the ^ symbol, at the beginning of the search regex, which becomes :

    SEARCH (?-is)^\h*\x{FE}REC\d{8}\xFE\x14\xFE\K.+\\(.+?)\\(?=(?:Root - Mailbox|Top of Outlook data file))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)

    The indentation will be kept, after the replacement :-)
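
    The same idea can be sketched in Python, where [ \t]* stands in for \h* (Python's re module has no \h, and [ \t] covers only spaces and tabs, so it is an approximation). The name convert_record is hypothetical.

```python
import re

# [ \t]* plays the role of \h* here; it sits inside group 1, so the
# indentation is written back unchanged by the replacement.
RECORD = re.compile(
    r"^([ \t]*\xfeREC\d{8}\xfe\x14\xfe).+\\(.+?)\\"
    r"(?=Root - Mailbox|Top of Outlook data file)",
    re.MULTILINE,
)

def convert_record(line):
    return RECORD.sub(lambda m: m.group(1) + m.group(2) + "\xfe\x14\xfe", line)

indented = ("    \xfeREC00000002\xfe\x14\xfe"
            + r"\Teams\Users\John Doe\AppData\Local\Microsoft\Outlook"
            + r"\JohnDoe.ost\Root - Mailbox\IPM_SUBTREE\Inbox"
            + "\xfe")
print(repr(convert_record(indented)))
```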

    Cheers,

    guy038



  • @guy038
    That did the trick. Thank you!!

    P.S.
    I later found that there were variations in the RecordIDs, as well as in the email container types and top folders, so I adjusted the regex to account for them:
    SEARCH (?-is)^\x{FE}.*?\xFE\x14\xFE\K.+\(.+?)\(?=(?:Root - |Top of |EAD2EF20-))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)

    Again, Thank you so much for the help! You are a life saver.



  • Hello, @frederick-dalmeida,

    Your regex should probably have this form, with the doubled backslashes \\ restored :

    (?-is)^\x{FE}.*?\xFE\x14\xFE\K.+\\(.+?)\\(?=(?:Root - |Top of |EAD2EF20-))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)

    or maybe, if we care about possible indentation :

    (?-is)^\h*\x{FE}.*?\xFE\x14\xFE\K.+\\(.+?)\\(?=(?:Root - |Top of |EAD2EF20-))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)
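
    As a rough Python counterpart of this generalized regex, the list of top-folder prefixes can be passed in as data, so that new variations only require a new list entry. The helper build_pattern and the sample record below are made up for illustration; only the three prefixes come from the regex above.

```python
import re

def build_pattern(top_folder_prefixes):
    # Hypothetical helper: builds the record regex from a list of prefixes.
    alternatives = "|".join(re.escape(p) for p in top_folder_prefixes)
    return re.compile(
        r"^([ \t]*\xfe.*?\xfe\x14\xfe).+\\(.+?)\\(?=" + alternatives + ")",
        re.MULTILINE,
    )

PATTERN = build_pattern(["Root - ", "Top of ", "EAD2EF20-"])

def convert(text):
    # Keep the (possibly indented) ID prefix and the container, drop the path.
    return PATTERN.sub(lambda m: m.group(1) + m.group(2) + "\xfe\x14\xfe", text)

# Made-up record with a non-REC identifier and an EAD2EF20- top folder.
sample = ("\xfeDOC-42\xfe\x14\xfe"
          + r"\Share\Box 7\mail.pst\EAD2EF20-1234\Inbox"
          + "\xfe")
print(repr(convert(sample)))
```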


    For documentation about regular expressions, see here

    As you managed to adapt my regex to your needs, I suppose that you correctly understood its syntax :-) But I don’t mind giving you additional information, if necessary !

    Best Regards,

    guy038

