Community
    • Login

    Convert XML data to tabular csv data or tab separated without a script

    Scheduled Pinned Locked Moved Help wanted · · · – – – · · ·
    14 Posts 8 Posters 15.3k Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • Bas de ReuverB
      Bas de Reuver
      last edited by

      Is there an easy way to convert (part of) XML data back to tabular data, so comma separated values or tab separated, using Notepad++? I sometimes get XML log files, and for readability and easy compare it would be a lot easier to extract the data just as a flat table, for example in Excel.

      I’ve looked at the XMLtools plug-in and other plug-ins but couldn’t find any option. So for example the data is like this:

      <ChildTaskStatus>
        <LoadSubStatus>
          <TaskStatus>RanToCompletion</TaskStatus>
          <Template>NL_CM_0_1_1</Template>
          <ItemsTotal>41219</ItemsTotal>
          <ElapsedTime>00:00:04</ElapsedTime>
        </LoadSubStatus>
        <LoadSubStatus>
          <TaskStatus>RanToCompletion</TaskStatus>
          <Template>NL_CM_18_7_1</Template>
          <ItemsTotal>168930</ItemsTotal>
          <ElapsedTime>00:00:53</ElapsedTime>
        </LoadSubStatus>
        <LoadSubStatus>
          <TaskStatus>RanToCompletion</TaskStatus>
          <Template>NL_CM_12_8_1</Template>
          <ItemsTotal>231122</ItemsTotal>
          <ElapsedTime>00:00:43</ElapsedTime>
        </LoadSubStatus>
      </ChildTaskStatus>
      

      And I would like to extract the data to just this:

      |TaskStatus      |Template     |ItemsTotal |ElapsedTime |
      +----------------+-------------+-----------+------------+
      |RanToCompletion |NL_CM_0_1_1  |      41219|00:00:04    |
      |RanToCompletion |NL_CM_18_7_1 |     168930|00:00:53    |
      |RanToCompletion |NL_CM_12_8_1 |     231122|00:00:43    |
      

      Is there any way to do this easily without having to write a script for each different instance? I understand I would have to at least specify the XML tags for the four columns, but I mean is there a option or plugin to do this?

      datatraveller1D dinkumoilD 2 Replies Last reply Reply Quote 2
      • datatraveller1D
        datatraveller1 @Bas de Reuver
        last edited by

        @Bas-de-Reuver I use an online tool for converting XML to CSV but I assume this is not an option.

        Bas de ReuverB 1 Reply Last reply Reply Quote 0
        • Bas de ReuverB
          Bas de Reuver @datatraveller1
          last edited by

          @datatraveller1 it’s not an option in this case due to the XML containing privacy data, but I’m curious which online tool you’re referring to?

          Anyway I also asked the question on superuser.com and someone suggested simply opening the file in Excel which handles random XML files pretty well. It just didn’t occur to me to open an XML file in Excel because quite often it can’t even handle a simple csv file. 😆

          1 Reply Last reply Reply Quote 2
          • datatraveller1D
            datatraveller1
            last edited by

            @Bas-de-Reuver said in Convert XML data to tabular csv data or tab separated without a script:

            @datatraveller1 it’s not an option in this case due to the XML containing privacy data, but I’m curious which online tool you’re referring to?

            Anyway I also asked the question on superuser.com and someone suggested simply opening the file in Excel which handles random XML files pretty well. It just didn’t occur to me to open an XML file in Excel because quite often it can’t even handle a simple csv file. 😆

            I use this tool: https://conversiontools.io/convert/xml-to-csv, which seems to be a good solution if there is no private data, but the Excel solution also looks promising.

            1 Reply Last reply Reply Quote 1
            • CoisesC
              Coises
              last edited by

              The example you show has the same four column-elements in the same order in every row-element. Is that true of the real data?

              Bas de ReuverB 1 Reply Last reply Reply Quote 1
              • wonkawillyW
                wonkawilly
                last edited by

                If the schema is constant I can suggest to use find and replace multiple times and create a macro saving the whole procedure in the meanwhile so you have it for future uses

                1 Reply Last reply Reply Quote 1
                • Mark OlsonM
                  Mark Olson
                  last edited by

                  PythonScript makes working with XML data quite easy, because Python’s standard library has great XML support.

                  Here’s a pretty generic script. I’ve only tested it on the sample data you provided.

                  # based on https://community.notepad-plus-plus.org/topic/24563/convert-xml-data-to-tabular-csv-data-or-tab-separated-without-a-script
                  # References:
                  #   * https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree
                  #   * https://docs.python.org/3/library/csv.html
                  #   * https://docs.python.org/3/library/io.html#io.StringIO
                  #   * https://npppythonscript.sourceforge.net/docs/latest/intro.html#notifications
                  
                  import csv
                  from io import StringIO
                  from xml.etree import ElementTree as ET
                  
                  from Npp import *
                  
                  
                  def xml_to_json(text):
                      root = ET.fromstring(text)
                      return [{e.tag: e.text for e in child} for child in root]
                                  
                  
                  def xml_to_csv(text):
                      js = xml_to_json(text)
                      if not js:
                          return ''
                      fieldnames = list(js[0])
                      strio = StringIO()
                      writer = csv.DictWriter(strio, fieldnames=fieldnames)
                      writer.writeheader()
                      for row in js:
                          writer.writerow(row)
                      csv_text = strio.getvalue()
                      return csv_text
                      
                  
                  if __name__ == '__main__':
                      text = editor.getText()
                      csv_text = xml_to_csv(text)
                      notepad.new()
                      editor.setText(csv_text)
                  

                  When run on

                  <ChildTaskStatus>
                    <LoadSubStatus>
                      <TaskStatus>RanToCompletion</TaskStatus>
                      <Template>NL_CM_0_1_1</Template>
                      <ItemsTotal>41219</ItemsTotal>
                      <ElapsedTime>00:00:04</ElapsedTime>
                    </LoadSubStatus>
                    <LoadSubStatus>
                      <TaskStatus>RanToCompletion</TaskStatus>
                      <Template>NL_CM_18_7_1</Template>
                      <ItemsTotal>168930</ItemsTotal>
                      <ElapsedTime>00:00:53</ElapsedTime>
                    </LoadSubStatus>
                    <LoadSubStatus>
                      <TaskStatus>RanToCompletion</TaskStatus>
                      <Template>NL_CM_12_8_1</Template>
                      <ItemsTotal>231122</ItemsTotal>
                      <ElapsedTime>00:00:43</ElapsedTime>
                    </LoadSubStatus>
                  </ChildTaskStatus>
                  

                  I got

                  TaskStatus,Template,ItemsTotal,ElapsedTime
                  RanToCompletion,NL_CM_0_1_1,41219,00:00:04
                  RanToCompletion,NL_CM_18_7_1,168930,00:00:53
                  RanToCompletion,NL_CM_12_8_1,231122,00:00:43
                  
                  1 Reply Last reply Reply Quote 5
                  • Mark OlsonM Mark Olson referenced this topic on
                  • Bas de ReuverB
                    Bas de Reuver @Coises
                    last edited by

                    @Coises Technically it could be in a different order, but it’s practically always the same order. However, it does sometimes have missing tags, like the ElapsedTime can be missing for a certain error status.

                    CoisesC 1 Reply Last reply Reply Quote 0
                    • Mark OlsonM
                      Mark Olson
                      last edited by

                      Given that some tags are sometimes missing, a more robust solution than my original script would be this:

                      # based on https://community.notepad-plus-plus.org/topic/24563/convert-xml-data-to-tabular-csv-data-or-tab-separated-without-a-script
                      # References:
                      #   * https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree
                      #   * https://docs.python.org/3/library/csv.html
                      #   * https://docs.python.org/3/library/io.html#io.StringIO
                      #   * https://npppythonscript.sourceforge.net/docs/latest/intro.html#notifications
                      
                      import csv
                      from io import StringIO
                      from xml.etree import ElementTree as ET
                      
                      from Npp import *
                      
                      
                      def xml_to_json(text):
                          root = ET.fromstring(text)
                          fieldnames = sorted(set(e.tag for child in root for e in child),
                              key=lambda x: x.lower())
                          js = [{e.tag: e.text for e in child} for child in root]
                          return fieldnames, js
                                      
                      
                      def xml_to_csv(text):
                          fieldnames, js = xml_to_json(text)
                          if not js:
                              return ''
                          strio = StringIO()
                          writer = csv.DictWriter(strio, fieldnames=fieldnames)
                          writer.writeheader()
                          for row in js:
                              writer.writerow(row)
                          csv_text = strio.getvalue()
                          return csv_text
                          
                      
                      if __name__ == '__main__':
                          text = editor.getText()
                          csv_text = xml_to_csv(text)
                          notepad.new()
                          editor.setText(csv_text)
                      

                      This version finds all the distinct child tag names so that even if some children are missing tags, the final table is guaranteed to have a column for each child tag.

                      Test data:

                      <ChildTaskStatus>
                        <LoadSubStatus>
                          <TaskStatus>RanToCompletion</TaskStatus>
                          <Template>NL_CM_0_1_1</Template>
                          <ItemsTotal>41219</ItemsTotal>
                          <ElapsedTime>00:00:04</ElapsedTime>
                        </LoadSubStatus>
                        <LoadSubStatus>
                          <TaskStatus>RanToCompletion</TaskStatus>
                          <Template>NL_CM_18_7_1</Template>
                          <ItemsTotal>168930</ItemsTotal>
                          <ElapsedTime>00:00:53</ElapsedTime>
                        </LoadSubStatus>
                        <LoadSubStatus>
                          <TaskStatus>RanToCompletion</TaskStatus>
                          <Template>NL_CM_12_8_1</Template>
                          <ItemsTotal>231122</ItemsTotal>
                          <ElapsedTime>00:00:43</ElapsedTime>
                        </LoadSubStatus>
                        <LoadSubStatus>
                          <TaskStatus>RanToCompletion</TaskStatus>
                          <bar>eorneoren</bar>
                          <Template>NL_CM_12_8_1</Template>
                          <ItemsTotal>231122</ItemsTotal>
                          <ElapsedTime>00:00:43</ElapsedTime>
                        </LoadSubStatus>
                        <LoadSubStatus>
                          <TaskStatus>RanToCompletion</TaskStatus>
                          <Template>NL_CM_12_8_1</Template>
                          <ItemsTotal>231122</ItemsTotal>
                          <baz>roenroen</baz>
                          <ElapsedTime>00:00:43</ElapsedTime>
                        </LoadSubStatus>
                      </ChildTaskStatus>
                      

                      Result:

                      bar,baz,ElapsedTime,ItemsTotal,TaskStatus,Template
                      ,,00:00:04,41219,RanToCompletion,NL_CM_0_1_1
                      ,,00:00:53,168930,RanToCompletion,NL_CM_18_7_1
                      ,,00:00:43,231122,RanToCompletion,NL_CM_12_8_1
                      eorneoren,,00:00:43,231122,RanToCompletion,NL_CM_12_8_1
                      ,roenroen,00:00:43,231122,RanToCompletion,NL_CM_12_8_1
                      
                      1 Reply Last reply Reply Quote 2
                      • Mark OlsonM Mark Olson referenced this topic on
                      • CoisesC
                        Coises @Bas de Reuver
                        last edited by

                        @Bas-de-Reuver said in Convert XML data to tabular csv data or tab separated without a script:

                        @Coises Technically it could be in a different order, but it’s practically always the same order. However, it does sometimes have missing tags, like the ElapsedTime can be missing for a certain error status.

                        OK. The reason I asked was that if they were always the same, it would be feasible to transform the file with regular expressions. With order changing and/or tags missing (no matter how rarely), scripting solutions become far more attractive.

                        1 Reply Last reply Reply Quote 1
                        • dinkumoilD
                          dinkumoil @Bas de Reuver
                          last edited by

                          @Bas-de-Reuver

                          Converting your XML file to an CSV file could also be done with the help of the XML Tools plugin and its XML transformation feature.

                          1. Install XML Tools plugin via built-in PluginsAdmin.
                          2. Save the following code as TransformToCSV.xsl with character encoding UTF-8.
                          <?xml version="1.0" encoding="UTF-8"?>
                          
                          <xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
                          
                            <xsl:output
                              method="text"
                              omit-xml-declaration="yes"
                              indent="no"
                              encoding="utf-8"
                            />
                          
                            <xsl:strip-space
                              elements="*"
                            />
                            
                            <xsl:template match="/ChildTaskStatus/LoadSubStatus">
                              <xsl:if test="position()=1">
                                <xsl:text>TaskStatus;Template;ItemsTotal;ElapsedTime&#xD;&#xA;</xsl:text>
                              </xsl:if>
                              <xsl:value-of select="concat(TaskStatus, ';', Template, ';', ItemsTotal, ';', ElapsedTime, '&#xD;', '&#xA;')"/>
                            </xsl:template>
                          
                          </xsl:transform>
                          
                          1. Load your XML file with Notepad++ and ensure it’s the active file.
                          2. Navigate to (menu) Plugins -> XML Tools -> XSL Transformation
                          3. In the dialog box popping up, use the ... button to select the XSL file you saved in step 2.
                          4. Click Transform button.

                          A new tab will be opened that contains the transformed file.

                          Please note: If the character encoding of your source XML file is not UTF-8, you have to set the correct encoding in line 9 of the XSL file.

                          1 Reply Last reply Reply Quote 3
                          • guy038G
                            guy038
                            last edited by

                            Hello, @bas-de-reuver, @datatraveller1, @mark-olson, @coises, @wonkawilly, @dinkumoil and *All,

                            Just for info, if you always get the same number of columns per line, here is a regex solution :


                            So, from your INPUT text :

                            <ChildTaskStatus>
                              <LoadSubStatus>
                                <TaskStatus>RanToCompletion</TaskStatus>
                                <Template>NL_CM_0_1_1</Template>
                                <ItemsTotal>41219</ItemsTotal>
                                <ElapsedTime>00:00:04</ElapsedTime>
                              </LoadSubStatus>
                              <LoadSubStatus>
                                <TaskStatus>RanToCompletion</TaskStatus>
                                <Template>NL_CM_18_7_1</Template>
                                <ItemsTotal>168930</ItemsTotal>
                                <ElapsedTime>00:00:53</ElapsedTime>
                              </LoadSubStatus>
                              <LoadSubStatus>
                                <TaskStatus>RanToCompletion</TaskStatus>
                                <Template>NL_CM_12_8_1</Template>
                                <ItemsTotal>231122</ItemsTotal>
                                <ElapsedTime>00:00:43</ElapsedTime>
                              </LoadSubStatus>
                            </ChildTaskStatus>
                            

                            This first regex S/R only keeps the key items, on consecutive lines

                            SEARCH (?-s)^\h*(?:</?ChildTaskStatus>|</?LoadSubStatus>)\R|\h*</?.+?>

                            REPLACE Leave EMPTY

                            Would give this TEMPORARY text :

                            RanToCompletion
                            NL_CM_0_1_1
                            41219
                            00:00:04
                            RanToCompletion
                            NL_CM_18_7_1
                            168930
                            00:00:53
                            RanToCompletion
                            NL_CM_12_8_1
                            231122
                            00:00:43
                            

                            Then, this second regex S/R put the items in a table :

                            SEARCH (?-s)(.+:.+)(\R)|(.+)\R

                            REPLACE (?1|\1 |\2)(?3|\3\x20)

                            Would give the TEMPORARY text :

                            |RanToCompletion |NL_CM_0_1_1 |41219 |00:00:04 |
                            |RanToCompletion |NL_CM_18_7_1 |168930 |00:00:53 |
                            |RanToCompletion |NL_CM_12_8_1 |231122 |00:00:43 |
                            

                            Finally, this third regex S/R justify the ItemsTotal values on the right

                            SEARCH (?-s)^(\|.+\|)(\d+)(\x20)+

                            REPLACE \1\3\2

                            And you get your expected OUTPUT text :

                            |RanToCompletion |NL_CM_0_1_1 | 41219|00:00:04 |
                            |RanToCompletion |NL_CM_18_7_1 | 168930|00:00:53 |
                            |RanToCompletion |NL_CM_12_8_1 | 231122|00:00:43 |
                            

                            Best Regards,

                            guy038

                            1 Reply Last reply Reply Quote 2
                            • wonkawillyW
                              wonkawilly
                              last edited by wonkawilly

                              The following should work as well: (you need PythonScript Plugin in order to be able to run the script)

                              The data:

                              <ChildTaskStatus>
                                <LoadSubStatus>
                                  <TaskStatus>TaskStatus</TaskStatus>
                                  <Template>Template</Template>
                                  <ItemsTotal>ItemsTotal</ItemsTotal>
                                  <ElapsedTime>ElapsedTime</ElapsedTime>
                                </LoadSubStatus>
                                <LoadSubStatus>
                                  <TaskStatus>RanToCompletion</TaskStatus>
                                  <Template>NL_CM_0_1_1</Template>
                                  <ItemsTotal>41219</ItemsTotal>
                                  <ElapsedTime>00:00:04</ElapsedTime>
                                </LoadSubStatus>
                                <LoadSubStatus>
                                  <TaskStatus>RanToCompletion</TaskStatus>
                                  <Template>NL_CM_18_7_1</Template>
                                  <ItemsTotal>168930</ItemsTotal>
                                  <ElapsedTime>00:00:53</ElapsedTime>
                                </LoadSubStatus>
                                <LoadSubStatus>
                                  <TaskStatus>RanToCompletion</TaskStatus>
                                  <Template>NL_CM_12_8_1</Template>
                                  <ItemsTotal>231122</ItemsTotal>
                                  <ElapsedTime>00:00:43</ElapsedTime>
                                </LoadSubStatus>
                              </ChildTaskStatus>
                              

                              The Python script:

                              search1 = "<ChildTaskStatus>\r\n"
                              search2 = "</ChildTaskStatus>"
                              search3 = "  <LoadSubStatus>\r\n"
                              search4 = "  </LoadSubStatus>\r\n"
                              search5 = "    <TaskStatus>"
                              search6 = "</TaskStatus>\r\n"
                              search7 = "    <Template>"
                              search8 = "</Template>\r\n"
                              search9 = "    <ItemsTotal>"
                              search10 = "</ItemsTotal>\r\n"
                              search11 = "    <ElapsedTime>"
                              search12 = "</ElapsedTime>\r\n"
                              
                              replace1 = ""
                              replace2 = ""
                              replace3 = ""
                              replace4 = ""
                              replace5 = ""
                              replace6 = "\t"
                              replace7 = ""
                              replace8 = "\t"
                              replace9 = ""
                              replace10 = "\t"
                              replace11 = ""
                              replace12 = "\r\n"
                              
                              
                              
                              srItems = \
                              	[
                              		[search1, replace1],
                              		[search2, replace2],
                              		[search3, replace3],
                              		[search4, replace4],
                              		[search5, replace5],
                              		[search6, replace6],
                              		[search7, replace7],
                              		[search8, replace8],
                              		[search9, replace9],
                              		[search10, replace10],
                              		[search11, replace11],
                              		[search12, replace12],
                              	]
                              
                              
                              def replaceAll(sri):
                              	editor.rereplace(sri[0], sri[1])
                              
                              
                              for sri in srItems:
                              	replaceAll(sri)
                              

                              Situation before running the script:

                              7c57bf22-779c-4b43-ae16-97966e79f5db-image.png

                              Result after running the script:

                              a4f4e0d9-b5f3-41e1-ba0c-9abf15002b90-image.png

                              The script of course can be further simplified and also you can replace the \t with comma or semicolon or any other symbol you like/need to achieve your purpose.

                              1 Reply Last reply Reply Quote 0
                              • Mark OlsonM Mark Olson referenced this topic on
                              • Kristijan BertaK
                                Kristijan Berta
                                last edited by

                                For converting XML to CSV, I have written a blog post that outlines comprehensive methods, including flattening simple XML and normalizing complex, nested XML structures. It discusses multiple approaches, such as using Flexter for complex cases. Additionally, if you’re dealing with sensitive data, the Paranoid tool can help anonymize data before conversion. This makes it a practical solution for handling XML that contains private or confidential information while still achieving the necessary format transformation.

                                If you want to find out more contact me or send me a direct message

                                1 Reply Last reply Reply Quote 0
                                • First post
                                  Last post
                                The Community of users of the Notepad++ text editor.
                                Powered by NodeBB | Contributors