Help Parsing Datafile and removing unwanted strings

    Frederick Dalmeida
    posted Jun 8, 2018, 3:28 PM

    Hello,
    I am tasked with parsing some very large datafiles that look like the examples below. I am new to RegEx and Notepad++ and will get there eventually but for now I just need help getting over this hurdle. The desired output is what I’m looking to accomplish. Help anyone?

    Provided Data
    þRecordIDþþFullPathþ
    þREC00000001þþ\Network Share\Mike Ren Files\C\Users\Mike Ren\Documents\Outlook Files\MikeR.ost\Root - Mailbox\IPM_SUBTREE\Inboxþ
    þREC00000002þþ\Teams\Users\John Doe\AppData\Local\Microsoft\Outlook\JohnDoe.ost\Root - Mailbox\IPM_SUBTREE\Inboxþ
    þREC00000003þþ\Network Share\John Doe Old Laptop #1\D\Users\JohnDoe\AppData\Local\Microsoft\Outlook\outlook.ost\Root - Mailbox\IPM_SUBTREE\Inbox\Finance\FW: FY12 Audit Reportþ
    þREC00000004þþ\Network Share\Jane Smith Data\C\Users\JaneSmith\Documents\Outlook Files\archive.pst\Top of Outlook data file\Inboxþ
    þREC00000005þþ\Backups\Mary Larson\HDD1234\E\Back-up (4.16.2018)\archive.pst\Top of Outlook data file\Inbox\Billing - West Regionþ

    Desired Output
    þRecordIDþþEmail ContainerþþEmail Folder Pathþ
    þREC00000001þþMikeR.ostþþRoot - Mailbox\IPM_SUBTREE\Inboxþ
    þREC00000002þþJohnDoe.ostþþRoot - Mailbox\IPM_SUBTREE\Inboxþ
    þREC00000003þþoutlook.ostþþRoot - Mailbox\IPM_SUBTREE\Inbox\Finance\FW: FY12 Audit Reportþ
    þREC00000004þþarchive.pstþþTop of Outlook data file\Inboxþ
    þREC00000005þþarchive.pstþþTop of Outlook data file\Inbox\Billing - West Regionþ

      guy038
      posted Jun 8, 2018, 9:33 PM; last edited by guy038 Jun 8, 2018, 9:36 PM

      Hello, @frederick-dalmeida, and All,

      As you did not provide much information, and in order to match your text exactly, I assumed the following points:

      • Regarding the Header line :

        • Begins with the string RecordID, surrounded by the þ character and followed by <DC4>þ, which should remain after replacement

        • Ends with the word FullPath, followed by the þ character

      • Regarding the Record lines :

        • Begins with the string þREC, followed by 8 digits and the string þ<DC4>þ, all unchanged after replacement

        • The part beginning with either the string Root - Mailbox or the string Top of Outlook data file, up to the end of the line, must also remain unchanged after replacement

        • The text between the two \ characters just before either of those strings (the container file name) must be kept and followed by the string þ<DC4>þ, while the rest of the path before it is dropped
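
As a rough illustration of the layout assumed above (not part of the Notepad++ solution itself), here is a minimal Python sketch that splits one line into its fields. It assumes every field is wrapped in þ (U+00FE) and that fields are separated by a single DC4 (U+0014) character; the name `split_fields` is hypothetical.

```python
QUOTE = "\u00fe"      # þ, assumed to wrap every field
SEPARATOR = "\u0014"  # DC4 control character, assumed to sit between fields

def split_fields(line):
    """Return the field values of one þ-quoted, DC4-separated line."""
    # Remove surrounding whitespace (indentation, newline), then split on DC4.
    parts = line.strip().split(SEPARATOR)
    # Drop the þ wrapper around each field.
    return [part.strip(QUOTE) for part in parts]

header = "\u00feRecordID\u00fe\u0014\u00feFullPath\u00fe"
print(split_fields(header))  # ['RecordID', 'FullPath']
```

If a line splits into anything other than two fields, the assumptions above probably do not hold for that file.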


      So :

      • First, back up the file(s) concerned by this search/replace! (IMPORTANT)

      • Open the Replace dialog ( Ctrl + H )

      • Select the Regular expression search mode

      • Tick the Wrap around option

      Fill in the Search what: and Replace with: fields with the regexes below (copy/paste):

      SEARCH (?-is)^\x{FE}REC\d{8}\xFE\x14\xFE\K.+\\(.+?)\\(?=(?:Root - Mailbox|Top of Outlook data file))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)

      REPLACE (?1\1\xFE\x14\xFE)(?2Email\x20Container\xFE\x14\x{FE}Email\x20Folder\x20)

      • Click the Replace All button only (do not use the Replace button!)
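
For readers who would rather script the same rewrite, here is a minimal Python sketch of the search/replace above. It is an approximation, not the Notepad++ method: Python's `re` module has no `\K` and no conditional `(?1...)` replacement, so the part to keep is captured in group 1 and written back by a callback. The names `convert_line`, `RECORD`, `HEADER` and `DELIM` are hypothetical.

```python
import re

DELIM = "\u00fe\u0014\u00fe"  # þ<DC4>þ between two fields

# Group 1 plays the role of everything before \K in the original regex.
RECORD = re.compile(
    r"^(\xfeREC\d{8}\xfe\x14\xfe)"   # record ID field and delimiter (kept)
    r".+\\(.+?)\\"                    # leading path (dropped), container name (group 2)
    r"(?=Root - Mailbox|Top of Outlook data file)"
)
HEADER = re.compile(r"^(\xfeRecordID\xfe\x14\xfe)Full(?=Path\xfe$)")

def convert_line(line):
    """Apply the header rewrite or the record rewrite to a single line."""
    line = HEADER.sub(
        lambda m: m.group(1) + "Email Container" + DELIM + "Email Folder ", line
    )
    return RECORD.sub(lambda m: m.group(1) + m.group(2) + DELIM, line)

sample = (
    "\u00feREC00000004\u00fe\u0014\u00fe\\Network Share\\Jane Smith Data\\C\\Users"
    "\\JaneSmith\\Documents\\Outlook Files\\archive.pst"
    "\\Top of Outlook data file\\Inbox\u00fe"
)
print(convert_line(sample))
```

Because `.+` is greedy, the last `\name\` that is directly followed by one of the two marker strings is taken as the container, so a folder deeper in the path that happens to start with the same words would be picked instead.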

      Best Regards,

      guy038

        Frederick Dalmeida
        posted Jun 11, 2018, 6:17 PM

        @guy038
        Thanks for the reply. You captured the structure correctly. The Regex that you provided only replaces the values in the header row and does not perform the update (search/replace) on the body of the data. Thanks!

          guy038
          posted Jun 11, 2018, 7:29 PM; last edited by guy038 Jun 11, 2018, 7:31 PM

          Hi, @frederick-dalmeida, and All,

          Ah, OK! The body of your original text is probably indented, as shown below:

          þRecordIDþþFullPathþ
              þREC00000001þþ\Network Share\Mike Ren Files\C\Users\Mike Ren\Documents\Outlook Files\MikeR.ost\Root - Mailbox\IPM_SUBTREE\Inboxþ
              þREC00000002þþ\Teams\Users\John Doe\AppData\Local\Microsoft\Outlook\JohnDoe.ost\Root - Mailbox\IPM_SUBTREE\Inboxþ
              þREC00000003þþ\Network Share\John Doe Old Laptop #1\D\Users\JohnDoe\AppData\Local\Microsoft\Outlook\outlook.ost\Root - Mailbox\IPM_SUBTREE\Inbox\Finance\FW: FY12 Audit Reportþ
              þREC00000004þþ\Network Share\Jane Smith Data\C\Users\JaneSmith\Documents\Outlook Files\archive.pst\Top of Outlook data file\Inboxþ
              þREC00000005þþ\Backups\Mary Larson\HDD1234\E\Back-up (4.16.2018)\archive.pst\Top of Outlook data file\Inbox\Billing - West Regionþ
          

          In that case, just add \h* after the ^ symbol at the beginning of the search regex, which becomes:

          SEARCH (?-is)^\h*\x{FE}REC\d{8}\xFE\x14\xFE\K.+\\(.+?)\\(?=(?:Root - Mailbox|Top of Outlook data file))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)

          The indentation will be kept after the replacement :-)
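
The same idea can be sketched in Python, with two caveats: Python's `re` module does not support `\h`, so `[ \t]*` is used as a narrower stand-in (spaces and tabs only), and the indentation has to sit inside the captured group so that it is written back. The names `INDENTED_RECORD` and `convert_indented` are hypothetical.

```python
import re

DELIM = "\u00fe\u0014\u00fe"  # þ<DC4>þ between two fields

# [ \t]* stands in for \h*; it is part of group 1, so the indentation is kept.
INDENTED_RECORD = re.compile(
    r"^([ \t]*\xfeREC\d{8}\xfe\x14\xfe)"
    r".+\\(.+?)\\"
    r"(?=Root - Mailbox|Top of Outlook data file)"
)

def convert_indented(line):
    """Rewrite a record line, keeping any leading spaces or tabs."""
    return INDENTED_RECORD.sub(lambda m: m.group(1) + m.group(2) + DELIM, line)
```

Lines without indentation still match, since `[ \t]*` also accepts an empty run.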

          Cheers,

          guy038

            Frederick Dalmeida
            posted Jun 11, 2018, 11:32 PM

            @guy038
            That did the trick. Thank you!!

            ps.
            I later found that there were variations in the RecordIDs and some other variations in the email container types and Top Folders so I made some adjustments to account for those variations:
            SEARCH (?-is)^\x{FE}.*?\xFE\x14\xFE\K.+\(.+?)\(?=(?:Root - |Top of |EAD2EF20-))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)

            Again, Thank you so much for the help! You are a life saver.

              guy038
              posted Jun 12, 2018, 12:48 AM; last edited by guy038 Jun 12, 2018, 12:52 AM

              Hello, @frederick-dalmeida,

              Your regex should probably have this form (in your post, each \\ shows up as a single \):

              (?-is)^\x{FE}.*?\xFE\x14\xFE\K.+\\(.+?)\\(?=(?:Root - |Top of |EAD2EF20-))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)

              or maybe, if we care about possible indentation:

              (?-is)^\h*\x{FE}.*?\xFE\x14\xFE\K.+\\(.+?)\\(?=(?:Root - |Top of |EAD2EF20-))|^\x{FE}RecordID\xFE\x14\xFE\K(Full)(?=Path\xFE$)
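
If the list of top-folder prefixes keeps growing, one option is to build the pattern from a list instead of editing the regex by hand. The Python sketch below shows that idea under the same assumptions as the regexes above (þ-wrapped fields, DC4 separator, optional space or tab indentation); `build_record_pattern` and `convert` are hypothetical names, and `[ \t]*` is a stand-in for `\h*`, which Python's `re` does not support.

```python
import re

DELIM = "\u00fe\u0014\u00fe"  # þ<DC4>þ between two fields

def build_record_pattern(markers):
    """Compile a record regex whose lookahead accepts any of the given prefixes."""
    # re.escape keeps characters such as '-' or '.' in a marker literal.
    alternatives = "|".join(re.escape(marker) for marker in markers)
    return re.compile(
        r"^([ \t]*\xfe.*?\xfe\x14\xfe)"  # any record ID field plus delimiter (kept)
        r".+\\(.+?)\\"                    # leading path (dropped), container name
        rf"(?=(?:{alternatives}))"
    )

def convert(line, pattern):
    """Rewrite one record line; lines that do not match are returned unchanged."""
    return pattern.sub(lambda m: m.group(1) + m.group(2) + DELIM, line)

pattern = build_record_pattern(["Root - ", "Top of ", "EAD2EF20-"])
```

The header line is left alone by this pattern because it contains no backslash, so it still needs the separate header rewrite.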


              For documentation about regular expressions, see here

              As you managed to adapt my regex to your needs, I suppose that you understood its syntax correctly :-) But I don’t mind giving you additional information if necessary!

              Best Regards,

              guy038
