Community
    • Login

    Convert XML data to tabular csv data or tab separated without a script

    Scheduled Pinned Locked Moved Help wanted · · · – – – · · ·
    14 Posts 8 Posters 15.3k Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • CoisesC
      Coises
      last edited by

      The example you show has the same four column-elements in the same order in every row-element. Is that true of the real data?

      Bas de ReuverB 1 Reply Last reply Reply Quote 1
      • wonkawillyW
        wonkawilly
        last edited by

        If the schema is constant I can suggest to use find and replace multiple times and create a macro saving the whole procedure in the meanwhile so you have it for future uses

        1 Reply Last reply Reply Quote 1
        • Mark OlsonM
          Mark Olson
          last edited by

          PythonScript makes working with XML data quite easy, because Python’s standard library has great XML support.

          Here’s a pretty generic script. I’ve only tested it on the sample data you provided.

          # based on https://community.notepad-plus-plus.org/topic/24563/convert-xml-data-to-tabular-csv-data-or-tab-separated-without-a-script
          # References:
          #   * https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree
          #   * https://docs.python.org/3/library/csv.html
          #   * https://docs.python.org/3/library/io.html#io.StringIO
          #   * https://npppythonscript.sourceforge.net/docs/latest/intro.html#notifications
          
          import csv
          from io import StringIO
          from xml.etree import ElementTree as ET
          
          from Npp import *
          
          
          def xml_to_json(text):
              root = ET.fromstring(text)
              return [{e.tag: e.text for e in child} for child in root]
                          
          
          def xml_to_csv(text):
              js = xml_to_json(text)
              if not js:
                  return ''
              fieldnames = list(js[0])
              strio = StringIO()
              writer = csv.DictWriter(strio, fieldnames=fieldnames)
              writer.writeheader()
              for row in js:
                  writer.writerow(row)
              csv_text = strio.getvalue()
              return csv_text
              
          
          if __name__ == '__main__':
              text = editor.getText()
              csv_text = xml_to_csv(text)
              notepad.new()
              editor.setText(csv_text)
          

          When run on

          <ChildTaskStatus>
            <LoadSubStatus>
              <TaskStatus>RanToCompletion</TaskStatus>
              <Template>NL_CM_0_1_1</Template>
              <ItemsTotal>41219</ItemsTotal>
              <ElapsedTime>00:00:04</ElapsedTime>
            </LoadSubStatus>
            <LoadSubStatus>
              <TaskStatus>RanToCompletion</TaskStatus>
              <Template>NL_CM_18_7_1</Template>
              <ItemsTotal>168930</ItemsTotal>
              <ElapsedTime>00:00:53</ElapsedTime>
            </LoadSubStatus>
            <LoadSubStatus>
              <TaskStatus>RanToCompletion</TaskStatus>
              <Template>NL_CM_12_8_1</Template>
              <ItemsTotal>231122</ItemsTotal>
              <ElapsedTime>00:00:43</ElapsedTime>
            </LoadSubStatus>
          </ChildTaskStatus>
          

          I got

          TaskStatus,Template,ItemsTotal,ElapsedTime
          RanToCompletion,NL_CM_0_1_1,41219,00:00:04
          RanToCompletion,NL_CM_18_7_1,168930,00:00:53
          RanToCompletion,NL_CM_12_8_1,231122,00:00:43
          
          1 Reply Last reply Reply Quote 5
          • Mark OlsonM Mark Olson referenced this topic on
          • Bas de ReuverB
            Bas de Reuver @Coises
            last edited by

            @Coises Technically it could be in a different order, but it’s practically always the same order. However, it does sometimes have missing tags, like the ElapsedTime can be missing for a certain error status.

            CoisesC 1 Reply Last reply Reply Quote 0
            • Mark OlsonM
              Mark Olson
              last edited by

              Given that some tags are sometimes missing, a more robust solution than my original script would be this:

              # based on https://community.notepad-plus-plus.org/topic/24563/convert-xml-data-to-tabular-csv-data-or-tab-separated-without-a-script
              # References:
              #   * https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree
              #   * https://docs.python.org/3/library/csv.html
              #   * https://docs.python.org/3/library/io.html#io.StringIO
              #   * https://npppythonscript.sourceforge.net/docs/latest/intro.html#notifications
              
              import csv
              from io import StringIO
              from xml.etree import ElementTree as ET
              
              from Npp import *
              
              
              def xml_to_json(text):
                  root = ET.fromstring(text)
                  fieldnames = sorted(set(e.tag for child in root for e in child),
                      key=lambda x: x.lower())
                  js = [{e.tag: e.text for e in child} for child in root]
                  return fieldnames, js
                              
              
              def xml_to_csv(text):
                  fieldnames, js = xml_to_json(text)
                  if not js:
                      return ''
                  strio = StringIO()
                  writer = csv.DictWriter(strio, fieldnames=fieldnames)
                  writer.writeheader()
                  for row in js:
                      writer.writerow(row)
                  csv_text = strio.getvalue()
                  return csv_text
                  
              
              if __name__ == '__main__':
                  text = editor.getText()
                  csv_text = xml_to_csv(text)
                  notepad.new()
                  editor.setText(csv_text)
              

              This version finds all the distinct child tag names so that even if some children are missing tags, the final table is guaranteed to have a column for each child tag.

              Test data:

              <ChildTaskStatus>
                <LoadSubStatus>
                  <TaskStatus>RanToCompletion</TaskStatus>
                  <Template>NL_CM_0_1_1</Template>
                  <ItemsTotal>41219</ItemsTotal>
                  <ElapsedTime>00:00:04</ElapsedTime>
                </LoadSubStatus>
                <LoadSubStatus>
                  <TaskStatus>RanToCompletion</TaskStatus>
                  <Template>NL_CM_18_7_1</Template>
                  <ItemsTotal>168930</ItemsTotal>
                  <ElapsedTime>00:00:53</ElapsedTime>
                </LoadSubStatus>
                <LoadSubStatus>
                  <TaskStatus>RanToCompletion</TaskStatus>
                  <Template>NL_CM_12_8_1</Template>
                  <ItemsTotal>231122</ItemsTotal>
                  <ElapsedTime>00:00:43</ElapsedTime>
                </LoadSubStatus>
                <LoadSubStatus>
                  <TaskStatus>RanToCompletion</TaskStatus>
                  <bar>eorneoren</bar>
                  <Template>NL_CM_12_8_1</Template>
                  <ItemsTotal>231122</ItemsTotal>
                  <ElapsedTime>00:00:43</ElapsedTime>
                </LoadSubStatus>
                <LoadSubStatus>
                  <TaskStatus>RanToCompletion</TaskStatus>
                  <Template>NL_CM_12_8_1</Template>
                  <ItemsTotal>231122</ItemsTotal>
                  <baz>roenroen</baz>
                  <ElapsedTime>00:00:43</ElapsedTime>
                </LoadSubStatus>
              </ChildTaskStatus>
              

              Result:

              bar,baz,ElapsedTime,ItemsTotal,TaskStatus,Template
              ,,00:00:04,41219,RanToCompletion,NL_CM_0_1_1
              ,,00:00:53,168930,RanToCompletion,NL_CM_18_7_1
              ,,00:00:43,231122,RanToCompletion,NL_CM_12_8_1
              eorneoren,,00:00:43,231122,RanToCompletion,NL_CM_12_8_1
              ,roenroen,00:00:43,231122,RanToCompletion,NL_CM_12_8_1
              
              1 Reply Last reply Reply Quote 2
              • Mark OlsonM Mark Olson referenced this topic on
              • CoisesC
                Coises @Bas de Reuver
                last edited by

                @Bas-de-Reuver said in Convert XML data to tabular csv data or tab separated without a script:

                @Coises Technically it could be in a different order, but it’s practically always the same order. However, it does sometimes have missing tags, like the ElapsedTime can be missing for a certain error status.

                OK. The reason I asked was that if they were always the same, it would be feasible to transform the file with regular expressions. With order changing and/or tags missing (no matter how rarely), scripting solutions become far more attractive.

                1 Reply Last reply Reply Quote 1
                • dinkumoilD
                  dinkumoil @Bas de Reuver
                  last edited by

                  @Bas-de-Reuver

                  Converting your XML file to an CSV file could also be done with the help of the XML Tools plugin and its XML transformation feature.

                  1. Install XML Tools plugin via built-in PluginsAdmin.
                  2. Save the following code as TransformToCSV.xsl with character encoding UTF-8.
                  <?xml version="1.0" encoding="UTF-8"?>
                  
                  <xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
                  
                    <xsl:output
                      method="text"
                      omit-xml-declaration="yes"
                      indent="no"
                      encoding="utf-8"
                    />
                  
                    <xsl:strip-space
                      elements="*"
                    />
                    
                    <xsl:template match="/ChildTaskStatus/LoadSubStatus">
                      <xsl:if test="position()=1">
                        <xsl:text>TaskStatus;Template;ItemsTotal;ElapsedTime&#xD;&#xA;</xsl:text>
                      </xsl:if>
                      <xsl:value-of select="concat(TaskStatus, ';', Template, ';', ItemsTotal, ';', ElapsedTime, '&#xD;', '&#xA;')"/>
                    </xsl:template>
                  
                  </xsl:transform>
                  
                  1. Load your XML file with Notepad++ and ensure it’s the active file.
                  2. Navigate to (menu) Plugins -> XML Tools -> XSL Transformation
                  3. In the dialog box popping up, use the ... button to select the XSL file you saved in step 2.
                  4. Click Transform button.

                  A new tab will be opened that contains the transformed file.

                  Please note: If the character encoding of your source XML file is not UTF-8, you have to set the correct encoding in line 9 of the XSL file.

                  1 Reply Last reply Reply Quote 3
                  • guy038G
                    guy038
                    last edited by

                    Hello, @bas-de-reuver, @datatraveller1, @mark-olson, @coises, @wonkawilly, @dinkumoil and *All,

                    Just for info, if you always get the same number of columns per line, here is a regex solution :


                    So, from your INPUT text :

                    <ChildTaskStatus>
                      <LoadSubStatus>
                        <TaskStatus>RanToCompletion</TaskStatus>
                        <Template>NL_CM_0_1_1</Template>
                        <ItemsTotal>41219</ItemsTotal>
                        <ElapsedTime>00:00:04</ElapsedTime>
                      </LoadSubStatus>
                      <LoadSubStatus>
                        <TaskStatus>RanToCompletion</TaskStatus>
                        <Template>NL_CM_18_7_1</Template>
                        <ItemsTotal>168930</ItemsTotal>
                        <ElapsedTime>00:00:53</ElapsedTime>
                      </LoadSubStatus>
                      <LoadSubStatus>
                        <TaskStatus>RanToCompletion</TaskStatus>
                        <Template>NL_CM_12_8_1</Template>
                        <ItemsTotal>231122</ItemsTotal>
                        <ElapsedTime>00:00:43</ElapsedTime>
                      </LoadSubStatus>
                    </ChildTaskStatus>
                    

                    This first regex S/R only keeps the key items, on consecutive lines

                    SEARCH (?-s)^\h*(?:</?ChildTaskStatus>|</?LoadSubStatus>)\R|\h*</?.+?>

                    REPLACE Leave EMPTY

                    Would give this TEMPORARY text :

                    RanToCompletion
                    NL_CM_0_1_1
                    41219
                    00:00:04
                    RanToCompletion
                    NL_CM_18_7_1
                    168930
                    00:00:53
                    RanToCompletion
                    NL_CM_12_8_1
                    231122
                    00:00:43
                    

                    Then, this second regex S/R put the items in a table :

                    SEARCH (?-s)(.+:.+)(\R)|(.+)\R

                    REPLACE (?1|\1 |\2)(?3|\3\x20)

                    Would give the TEMPORARY text :

                    |RanToCompletion |NL_CM_0_1_1 |41219 |00:00:04 |
                    |RanToCompletion |NL_CM_18_7_1 |168930 |00:00:53 |
                    |RanToCompletion |NL_CM_12_8_1 |231122 |00:00:43 |
                    

                    Finally, this third regex S/R justify the ItemsTotal values on the right

                    SEARCH (?-s)^(\|.+\|)(\d+)(\x20)+

                    REPLACE \1\3\2

                    And you get your expected OUTPUT text :

                    |RanToCompletion |NL_CM_0_1_1 | 41219|00:00:04 |
                    |RanToCompletion |NL_CM_18_7_1 | 168930|00:00:53 |
                    |RanToCompletion |NL_CM_12_8_1 | 231122|00:00:43 |
                    

                    Best Regards,

                    guy038

                    1 Reply Last reply Reply Quote 2
                    • wonkawillyW
                      wonkawilly
                      last edited by wonkawilly

                      The following should work as well: (you need PythonScript Plugin in order to be able to run the script)

                      The data:

                      <ChildTaskStatus>
                        <LoadSubStatus>
                          <TaskStatus>TaskStatus</TaskStatus>
                          <Template>Template</Template>
                          <ItemsTotal>ItemsTotal</ItemsTotal>
                          <ElapsedTime>ElapsedTime</ElapsedTime>
                        </LoadSubStatus>
                        <LoadSubStatus>
                          <TaskStatus>RanToCompletion</TaskStatus>
                          <Template>NL_CM_0_1_1</Template>
                          <ItemsTotal>41219</ItemsTotal>
                          <ElapsedTime>00:00:04</ElapsedTime>
                        </LoadSubStatus>
                        <LoadSubStatus>
                          <TaskStatus>RanToCompletion</TaskStatus>
                          <Template>NL_CM_18_7_1</Template>
                          <ItemsTotal>168930</ItemsTotal>
                          <ElapsedTime>00:00:53</ElapsedTime>
                        </LoadSubStatus>
                        <LoadSubStatus>
                          <TaskStatus>RanToCompletion</TaskStatus>
                          <Template>NL_CM_12_8_1</Template>
                          <ItemsTotal>231122</ItemsTotal>
                          <ElapsedTime>00:00:43</ElapsedTime>
                        </LoadSubStatus>
                      </ChildTaskStatus>
                      

                      The Python script:

                      search1 = "<ChildTaskStatus>\r\n"
                      search2 = "</ChildTaskStatus>"
                      search3 = "  <LoadSubStatus>\r\n"
                      search4 = "  </LoadSubStatus>\r\n"
                      search5 = "    <TaskStatus>"
                      search6 = "</TaskStatus>\r\n"
                      search7 = "    <Template>"
                      search8 = "</Template>\r\n"
                      search9 = "    <ItemsTotal>"
                      search10 = "</ItemsTotal>\r\n"
                      search11 = "    <ElapsedTime>"
                      search12 = "</ElapsedTime>\r\n"
                      
                      replace1 = ""
                      replace2 = ""
                      replace3 = ""
                      replace4 = ""
                      replace5 = ""
                      replace6 = "\t"
                      replace7 = ""
                      replace8 = "\t"
                      replace9 = ""
                      replace10 = "\t"
                      replace11 = ""
                      replace12 = "\r\n"
                      
                      
                      
                      srItems = \
                      	[
                      		[search1, replace1],
                      		[search2, replace2],
                      		[search3, replace3],
                      		[search4, replace4],
                      		[search5, replace5],
                      		[search6, replace6],
                      		[search7, replace7],
                      		[search8, replace8],
                      		[search9, replace9],
                      		[search10, replace10],
                      		[search11, replace11],
                      		[search12, replace12],
                      	]
                      
                      
                      def replaceAll(sri):
                      	editor.rereplace(sri[0], sri[1])
                      
                      
                      for sri in srItems:
                      	replaceAll(sri)
                      

                      Situation before running the script:

                      7c57bf22-779c-4b43-ae16-97966e79f5db-image.png

                      Result after running the script:

                      a4f4e0d9-b5f3-41e1-ba0c-9abf15002b90-image.png

                      The script of course can be further simplified and also you can replace the \t with comma or semicolon or any other symbol you like/need to achieve your purpose.

                      1 Reply Last reply Reply Quote 0
                      • Mark OlsonM Mark Olson referenced this topic on
                      • Kristijan BertaK
                        Kristijan Berta
                        last edited by

                        For converting XML to CSV, I have written a blog post that outlines comprehensive methods, including flattening simple XML and normalizing complex, nested XML structures. It discusses multiple approaches, such as using Flexter for complex cases. Additionally, if you’re dealing with sensitive data, the Paranoid tool can help anonymize data before conversion. This makes it a practical solution for handling XML that contains private or confidential information while still achieving the necessary format transformation.

                        If you want to find out more contact me or send me a direct message

                        1 Reply Last reply Reply Quote 0
                        • First post
                          Last post
                        The Community of users of the Notepad++ text editor.
                        Powered by NodeBB | Contributors