    Help Parsing Datafile and removing unwanted strings

    • Frederick Dalmeida

      Hello,
      I am tasked with parsing some very large datafiles that look like the examples below. I am new to RegEx and Notepad++ and will get there eventually but for now I just need help getting over this hurdle. The desired output is what I’m looking to accomplish. Help anyone?

      Provided Data
      þRecordIDþþFullPathþ
      þREC00000001þþ\Network Share\Mike Ren Files\C\Users\Mike Ren\Documents\Outlook Files\MikeR.ost\Root - Mailbox\IPM_SUBTREE\Inboxþ
      þREC00000002þþ\Teams\Users\John Doe\AppData\Local\Microsoft\Outlook\JohnDoe.ost\Root - Mailbox\IPM_SUBTREE\Inboxþ
      þREC00000003þþ\Network Share\John Doe Old Laptop #1\D\Users\JohnDoe\AppData\Local\Microsoft\Outlook\outlook.ost\Root - Mailbox\IPM_SUBTREE\Inbox\Finance\FW: FY12 Audit Reportþ
      þREC00000004þþ\Network Share\Jane Smith Data\C\Users\JaneSmith\Documents\Outlook Files\archive.pst\Top of Outlook data file\Inboxþ
      þREC00000005þþ\Backups\Mary Larson\HDD1234\E\Back-up (4.16.2018)\archive.pst\Top of Outlook data file\Inbox\Billing - West Regionþ

      Desired Output
      þRecordIDþþEmail ContainerþþEmail Folder Pathþ
      þREC00000001þþMikeR.ostþþRoot - Mailbox\IPM_SUBTREE\Inboxþ
      þREC00000002þþJohnDoe.ostþþRoot - Mailbox\IPM_SUBTREE\Inboxþ
      þREC00000003þþoutlook.ostþþRoot - Mailbox\IPM_SUBTREE\Inbox\Finance\FW: FY12 Audit Reportþ
      þREC00000004þþarchive.pstþþTop of Outlook data file\Inboxþ
      þREC00000005þþarchive.pstþþTop of Outlook data file\Inbox\Billing - West Regionþ

      • guy038

        Hello, @frederick-dalmeida, and All,

        As you did not provide a lot of information, I made a few assumptions in order to match your text exactly:

        • Regarding the Header line :

          • Begins with the string RecordID, surrounded by the þ character and followed by <DC4>þ, which should remain after replacement

          • Ends with the word FullPath, followed by the þ character

        • Regarding the Record lines :

          • Begins with the string þREC, followed by 8 digits and the string þ<DC4>þ (unchanged after replacement)

          • The part beginning with either the string Root - Mailbox or the string Top of Outlook data file, up to the end of the line, must also remain unchanged after replacement

          • The path component between the two \ characters placed immediately before either of those strings must be kept and followed by the string þ<DC4>þ; everything before it in the path is deleted
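
Before running any replacement, it may help to check that a file really follows these assumptions. The following is a minimal Python sketch, not part of the original answer: `classify` is a hypothetical helper, and the patterns simply restate the header and record assumptions listed above (þ is U+00FE, <DC4> is U+0014).

```python
import re

# þ is U+00FE; <DC4> is the control character U+0014.
# Both patterns restate the assumptions from the post; they are not a general parser.
HEADER_RE = re.compile(r"^\xfeRecordID\xfe\x14\xfeFullPath\xfe$")
RECORD_RE = re.compile(r"^\xfeREC\d{8}\xfe\x14\xfe.+\xfe$")


def classify(line: str) -> str:
    """Return 'header', 'record' or 'other' for one line (hypothetical helper)."""
    if HEADER_RE.match(line):
        return "header"
    if RECORD_RE.match(line):
        return "record"
    return "other"
```

Any line reported as 'other' (for example, one with leading whitespace or a different ID format) would not be touched by a regex built on these assumptions.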


        So :

        • First, back up the file(s) concerned by this search/replace! (IMPORTANT)

        • Open the Replace dialog ( Ctrl + H )

        • Select the Regular expression search mode

        • Tick the Wrap around option

        Fill in the Search what: and Replace with: fields with the regexes below (copy/paste):

        SEARCH (?-is)^\x{FE}REC\d{8}\xFE\x14\xFE\K.+\\(.+?)\\(?=(?:Root - Mailbox|Top of Outlook data file))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)

        REPLACE (?1\1\xFE\x14\xFE)(?2Email\x20Container\xFE\x14\x{FE}Email\x20Folder\x20)

        • Click on the Replace All button only (do not use the Replace button!)
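
For very large files it can be convenient to script the same transformation outside Notepad++. Below is a rough Python sketch of the search/replace above, under the same assumptions about the data. Python's `re` module supports neither `\K` nor the conditional `(?1...)` replacement syntax, so this version captures the record prefix in a group instead and handles the header line as a plain string comparison. The names `convert_line`, `HEADER_OLD` and `HEADER_NEW` are made up for illustration.

```python
import re

SEP = "\u00fe\u0014\u00fe"  # þ<DC4>þ field separator

RECORD_RE = re.compile(
    r"^(\xfeREC\d{8}\xfe\x14\xfe)"   # group 1: record ID and separator, kept as-is
    r".+\\(.+?)\\"                   # group 2: last path component before the marker
    r"(?=Root - Mailbox|Top of Outlook data file)"  # marker is only looked at, not consumed
)

HEADER_OLD = "\u00feRecordID" + SEP + "FullPath\u00fe"
HEADER_NEW = "\u00feRecordID" + SEP + "Email Container" + SEP + "Email Folder Path\u00fe"


def convert_line(line: str) -> str:
    """Rewrite one header or record line; return any other line unchanged."""
    if line == HEADER_OLD:
        return HEADER_NEW
    # Replace "prefix + leading path + container + backslash" by "prefix + container + separator".
    return RECORD_RE.sub(lambda m: m.group(1) + m.group(2) + SEP, line, count=1)
```

One possible use is to read the file line by line, strip the line ending, pass each line through `convert_line`, and write the result to a new file, which leaves the original untouched.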

        Best Regards,

        guy038

        • Frederick Dalmeida

          @guy038
          Thanks for the reply. You captured the structure correctly. The Regex that you provided only replaces the values in the header row and does not perform the update (search/replace) on the body of the data. Thanks!

          • guy038

            Hi, @frederick-dalmeida, and All,

            Ah, OK ! Probably, the body part of your initial text is indented, as shown below :

            þRecordIDþþFullPathþ
                þREC00000001þþ\Network Share\Mike Ren Files\C\Users\Mike Ren\Documents\Outlook Files\MikeR.ost\Root - Mailbox\IPM_SUBTREE\Inboxþ
                þREC00000002þþ\Teams\Users\John Doe\AppData\Local\Microsoft\Outlook\JohnDoe.ost\Root - Mailbox\IPM_SUBTREE\Inboxþ
                þREC00000003þþ\Network Share\John Doe Old Laptop #1\D\Users\JohnDoe\AppData\Local\Microsoft\Outlook\outlook.ost\Root - Mailbox\IPM_SUBTREE\Inbox\Finance\FW: FY12 Audit Reportþ
                þREC00000004þþ\Network Share\Jane Smith Data\C\Users\JaneSmith\Documents\Outlook Files\archive.pst\Top of Outlook data file\Inboxþ
                þREC00000005þþ\Backups\Mary Larson\HDD1234\E\Back-up (4.16.2018)\archive.pst\Top of Outlook data file\Inbox\Billing - West Regionþ
            

            In that case, just add the regex \h*, after the ^ symbol, at the beginning of the search regex, which becomes :

            SEARCH (?-is)^\h*\x{FE}REC\d{8}\xFE\x14\xFE\K.+\\(.+?)\\(?=(?:Root - Mailbox|Top of Outlook data file))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)

            The indentation will be kept, after the replacement :-)
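
To illustrate why the anchored pattern misses indented lines, here is a small Python sketch (an illustration only, not the Notepad++ regex itself). Python's `re` has no `\h` escape, so `[ \t]*` is used here as an assumed stand-in for "optional horizontal whitespace"; the names `STRICT`, `TOLERANT` and `indented` are hypothetical.

```python
import re

# Same record prefix as in the search regex, with and without leading whitespace allowed.
STRICT = re.compile(r"^\xfeREC\d{8}\xfe\x14\xfe")
TOLERANT = re.compile(r"^[ \t]*\xfeREC\d{8}\xfe\x14\xfe")

# A record line indented by four spaces.
indented = "    \u00feREC00000002\u00fe\u0014\u00fe\\Teams\\JohnDoe.ost\\Root - Mailbox\\Inbox\u00fe"

strict_hit = STRICT.match(indented) is not None      # ^ must be followed directly by þ
tolerant_hit = TOLERANT.match(indented) is not None  # leading blanks are allowed

print("strict matches:", strict_hit)
print("tolerant matches:", tolerant_hit)
```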

            Cheers,

            guy038

            • Frederick Dalmeida

              @guy038
              That did the trick. Thank you!!

              ps.
              I later found that there were variations in the RecordIDs and some other variations in the email container types and Top Folders so I made some adjustments to account for those variations:
              SEARCH (?-is)^\x{FE}.*?\xFE\x14\xFE\K.+\(.+?)\(?=(?:Root - |Top of |EAD2EF20-))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)

              Again, Thank you so much for the help! You are a life saver.

              • guy038

                Hello, @frederick-dalmeida,

                Your regex should probably have this form:

                (?-is)^\x{FE}.*?\xFE\x14\xFE\K.+\\(.+?)\\(?=(?:Root - |Top of |EAD2EF20-))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)

                or maybe, if we need to allow for possible indentation:

                (?-is)^\h*\x{FE}.*?\xFE\x14\xFE\K.+\\(.+?)\\(?=(?:Root - |Top of |EAD2EF20-))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)
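
If more variations turn up, one option is to keep the markers in a list and build the pattern from it. The Python sketch below is only an illustration under that assumption: `MARKERS`, `build_record_regex` and `convert` are hypothetical names, `[ \t]*` stands in for `\h*`, and a capture group replaces `\K` because Python's `re` does not support it.

```python
import re

# Marker prefixes taken from the adjusted regex; extend the list as new variants appear.
MARKERS = ["Root - ", "Top of ", "EAD2EF20-"]


def build_record_regex(markers):
    """Compile a record pattern whose lookahead accepts any of the given markers."""
    alternation = "|".join(re.escape(m) for m in markers)
    return re.compile(
        r"^([ \t]*\xfe.*?\xfe\x14\xfe)"   # group 1: optional indentation, any ID, separator
        r".+\\(.+?)\\"                    # group 2: path component just before the marker
        r"(?=" + alternation + r")"
    )


def convert(line, regex):
    """Keep group 1 and the container, insert the separator, leave the rest of the line."""
    sep = "\u00fe\u0014\u00fe"
    return regex.sub(lambda m: m.group(1) + m.group(2) + sep, line, count=1)
```

Using `re.escape` on each marker means a marker containing regex metacharacters is still matched literally.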


                For documentation about regular expressions, see here

                As you managed to adapt my regex to your needs, I suppose that you correctly understood its syntax :-) But I don’t mind giving you additional information, if necessary !

                Best Regards,

                guy038

