    Help Parsing Datafile and removing unwanted strings

    • Frederick Dalmeida

      Hello,
      I am tasked with parsing some very large datafiles that look like the examples below. I am new to RegEx and Notepad++ and will get there eventually but for now I just need help getting over this hurdle. The desired output is what I’m looking to accomplish. Help anyone?

      Provided Data
      þRecordIDþþFullPathþ
      þREC00000001þþ\Network Share\Mike Ren Files\C\Users\Mike Ren\Documents\Outlook Files\MikeR.ost\Root - Mailbox\IPM_SUBTREE\Inboxþ
      þREC00000002þþ\Teams\Users\John Doe\AppData\Local\Microsoft\Outlook\JohnDoe.ost\Root - Mailbox\IPM_SUBTREE\Inboxþ
      þREC00000003þþ\Network Share\John Doe Old Laptop #1\D\Users\JohnDoe\AppData\Local\Microsoft\Outlook\outlook.ost\Root - Mailbox\IPM_SUBTREE\Inbox\Finance\FW: FY12 Audit Reportþ
      þREC00000004þþ\Network Share\Jane Smith Data\C\Users\JaneSmith\Documents\Outlook Files\archive.pst\Top of Outlook data file\Inboxþ
      þREC00000005þþ\Backups\Mary Larson\HDD1234\E\Back-up (4.16.2018)\archive.pst\Top of Outlook data file\Inbox\Billing - West Regionþ

      Desired Output
      þRecordIDþþEmail ContainerþþEmail Folder Pathþ
      þREC00000001þþMikeR.ostþþRoot - Mailbox\IPM_SUBTREE\Inboxþ
      þREC00000002þþJohnDoe.ostþþRoot - Mailbox\IPM_SUBTREE\Inboxþ
      þREC00000003þþoutlook.ostþþRoot - Mailbox\IPM_SUBTREE\Inbox\Finance\FW: FY12 Audit Reportþ
      þREC00000004þþarchive.pstþþTop of Outlook data file\Inboxþ
      þREC00000005þþarchive.pstþþTop of Outlook data file\Inbox\Billing - West Regionþ

      • guy038

        Hello, @frederick-dalmeida, and All,

        As you did not provide a lot of information, I made a few assumptions in order to match your text exactly:

        • Regarding the Header line :

          • Begins with the string RecordID, surrounded by the þ character and followed by <DC4>þ, which should remain after replacement

          • Ends with the word FullPath, followed by the þ character

        • Regarding the Record lines :

          • Begins with the string þREC, followed by 8 digits and the string þ<DC4>þ (unchanged after replacement)

          • The part beginning with either the string Root - Mailbox or the string Top of Outlook data file, up to the end of the line, must also remain unchanged after replacement

          • The text between the two \ characters placed just before either of those strings (the email container name) must be kept and followed by the string þ<DC4>þ; the part of the path before it is dropped


        So :

        • First, back up the file(s) concerned by this search/replace! (IMPORTANT)

        • Open the Replace dialog ( Ctrl + H )

        • Select the Regular expression search mode

        • Tick the Wrap around option

        Fill in the Search what: and Replace with: fields with the regexes below (copy/paste):

        SEARCH (?-is)^\x{FE}REC\d{8}\xFE\x14\xFE\K.+\\(.+?)\\(?=(?:Root - Mailbox|Top of Outlook data file))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)

        REPLACE (?1\1\xFE\x14\xFE)(?2Email\x20Container\xFE\x14\x{FE}Email\x20Folder\x20)

        • Click on the Replace All button only (do not use the Replace button!)
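
For anyone who would rather script the same transformation outside Notepad++, here is a minimal Python sketch of the idea, not the exact regex above. Python's `re` module has neither `\K` nor the `(?1...)` conditional replacement, so this version captures the part to keep in group 1 instead. The names `SEP`, `RECORD`, `HEADER` and `split_container` are made up for this illustration, and the sketch assumes the separator really is þ<DC4>þ as supposed in this post.

```python
import re

# Assumed field separator: þ (U+00FE), DC4 (U+0014), þ (U+00FE)
SEP = "\xfe\x14\xfe"

# Record lines: group 1 is the ID prefix to keep, group 2 the container name.
# The lookahead leaves the folder path (from "Root - Mailbox" or
# "Top of Outlook data file" onward) in place.
RECORD = re.compile(
    r"^(\xfeREC\d{8}\xfe\x14\xfe).+\\(.+?)\\"
    r"(?=Root - Mailbox|Top of Outlook data file)",
    re.MULTILINE,
)

# Header line: only the word "Full" in front of "Path" is replaced.
HEADER = re.compile(r"^(\xfeRecordID\xfe\x14\xfe)Full(?=Path\xfe$)", re.MULTILINE)


def split_container(text: str) -> str:
    """Rewrite every record line and the header line of `text`."""
    # Replacement functions avoid any backslash-escaping issues in the output.
    text = RECORD.sub(lambda m: m.group(1) + m.group(2) + SEP, text)
    return HEADER.sub(
        lambda m: m.group(1) + "Email Container" + SEP + "Email Folder ", text
    )
```

When reading a real file, the encoding passed to `open()` has to decode the þ byte as U+00FE (for example latin-1 or cp1252); which one applies to these datafiles is not stated in this thread.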

        Best Regards,

        guy038

        • Frederick Dalmeida

          @guy038
          Thanks for the reply. You captured the structure correctly. The Regex that you provided only replaces the values in the header row and does not perform the update (search/replace) on the body of the data. Thanks!

          • guy038

            Hi, @frederick-dalmeida, and All,

            Ah, OK! The body of your initial text is probably indented, as shown below:

            þRecordIDþþFullPathþ
                þREC00000001þþ\Network Share\Mike Ren Files\C\Users\Mike Ren\Documents\Outlook Files\MikeR.ost\Root - Mailbox\IPM_SUBTREE\Inboxþ
                þREC00000002þþ\Teams\Users\John Doe\AppData\Local\Microsoft\Outlook\JohnDoe.ost\Root - Mailbox\IPM_SUBTREE\Inboxþ
                þREC00000003þþ\Network Share\John Doe Old Laptop #1\D\Users\JohnDoe\AppData\Local\Microsoft\Outlook\outlook.ost\Root - Mailbox\IPM_SUBTREE\Inbox\Finance\FW: FY12 Audit Reportþ
                þREC00000004þþ\Network Share\Jane Smith Data\C\Users\JaneSmith\Documents\Outlook Files\archive.pst\Top of Outlook data file\Inboxþ
                þREC00000005þþ\Backups\Mary Larson\HDD1234\E\Back-up (4.16.2018)\archive.pst\Top of Outlook data file\Inbox\Billing - West Regionþ
            

            In that case, just add \h* after the ^ symbol at the beginning of the search regex, which becomes:

            SEARCH (?-is)^\h*\x{FE}REC\d{8}\xFE\x14\xFE\K.+\\(.+?)\\(?=(?:Root - Mailbox|Top of Outlook data file))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)

            The indentation will be kept after the replacement :-)
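
To illustrate why the indentation survives, here is a small Python sketch of the indented case. Python's `re` has no `\h` escape, so the sketch uses `[ \t]*` as a rough stand-in (spaces and tabs only), and it keeps the indentation by including it in group 1. The names `INDENTED_RECORD` and `rewrite_records` are hypothetical.

```python
import re

# Assumed field separator: þ<DC4>þ
SEP = "\xfe\x14\xfe"

# Same idea as the search regex above, with optional leading blanks.
# [ \t]* stands in for \h*, which Python's re module does not support.
INDENTED_RECORD = re.compile(
    r"^([ \t]*\xfeREC\d{8}\xfe\x14\xfe).+\\(.+?)\\"
    r"(?=Root - Mailbox|Top of Outlook data file)",
    re.MULTILINE,
)


def rewrite_records(text: str) -> str:
    """Keep indentation + record ID (group 1) and the container name (group 2)."""
    return INDENTED_RECORD.sub(lambda m: m.group(1) + m.group(2) + SEP, text)
```

Because the leading blanks are part of group 1, they are written back unchanged, which mirrors what `\K` does in the Notepad++ regex: everything matched before it is left untouched.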

            Cheers,

            guy038

            • Frederick Dalmeida

              @guy038
              That did the trick. Thank you!!

              ps.
              I later found that there were variations in the RecordIDs and some other variations in the email container types and Top Folders so I made some adjustments to account for those variations:
              SEARCH (?-is)^\x{FE}.*?\xFE\x14\xFE\K.+\(.+?)\(?=(?:Root - |Top of |EAD2EF20-))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)

              Again, Thank you so much for the help! You are a life saver.

              • guy038

                Hello, @frederick-dalmeida,

                Your regex should probably have this form, with each literal backslash written as \\ :

                (?-is)^\x{FE}.*?\xFE\x14\xFE\K.+\\(.+?)\\(?=(?:Root - |Top of |EAD2EF20-))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)

                or maybe, if we need to allow for possible indentation:

                (?-is)^\h*\x{FE}.*?\xFE\x14\xFE\K.+\\(.+?)\\(?=(?:Root - |Top of |EAD2EF20-))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)
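
As a rough way to check the generalized pattern against sample lines, here is a Python sketch that splits one line into its three fields instead of doing a replacement. It is only an approximation of the regexes above: `[ \t]*` replaces `\h*`, named groups replace `\K` and the lookahead, and the names `LINE` and `parse_line` are hypothetical.

```python
import re

# Generalized line pattern: any record ID, then þ<DC4>þ, then the path.
# The folder part must start with one of the prefixes used in the regex above.
LINE = re.compile(
    r"^[ \t]*\xfe(?P<record_id>.*?)\xfe\x14\xfe"
    r".+\\(?P<container>.+?)\\"
    r"(?P<folder>(?:Root - |Top of |EAD2EF20-).*)\xfe$"
)


def parse_line(line: str):
    """Return (record_id, container, folder_path), or None if the line does not match."""
    m = LINE.match(line)
    if m is None:
        return None
    return m.group("record_id"), m.group("container"), m.group("folder")
```

The header line contains no backslash, so `parse_line` returns `None` for it, which makes it easy to spot lines that the pattern does not cover before running Replace All on a large file.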


                For documentation about regular expressions, see here

                As you managed to adapt my regex to your needs, I suppose that you understood its syntax correctly :-) But I don’t mind giving you additional information if necessary!

                Best Regards,

                guy038

                The Community of users of the Notepad++ text editor.