Convert XML data to tabular csv data or tab separated without a script

Bas de Reuver

@datatraveller1 it’s not an option in this case due to the XML containing privacy data, but I’m curious which online tool you’re referring to?

Anyway I also asked the question on superuser.com and someone suggested simply opening the file in Excel which handles random XML files pretty well. It just didn’t occur to me to open an XML file in Excel because quite often it can’t even handle a simple csv file. 😆

datatraveller1

@Bas-de-Reuver said in Convert XML data to tabular csv data or tab separated without a script:

@datatraveller1 it’s not an option in this case due to the XML containing privacy data, but I’m curious which online tool you’re referring to?

Anyway I also asked the question on superuser.com and someone suggested simply opening the file in Excel which handles random XML files pretty well. It just didn’t occur to me to open an XML file in Excel because quite often it can’t even handle a simple csv file. 😆

I use this tool: https://conversiontools.io/convert/xml-to-csv, which seems to be a good solution if there is no private data, but the Excel solution also looks promising.

Coises

The example you show has the same four column-elements in the same order in every row-element. Is that true of the real data?

wonkawilly

If the schema is constant I can suggest to use find and replace multiple times and create a macro saving the whole procedure in the meanwhile so you have it for future uses

Mark Olson

PythonScript makes working with XML data quite easy, because Python’s standard library has great XML support.

Here’s a pretty generic script. I’ve only tested it on the sample data you provided.

# based on https://community.notepad-plus-plus.org/topic/24563/convert-xml-data-to-tabular-csv-data-or-tab-separated-without-a-script
# References:
#   * https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree
#   * https://docs.python.org/3/library/csv.html
#   * https://docs.python.org/3/library/io.html#io.StringIO
#   * https://npppythonscript.sourceforge.net/docs/latest/intro.html#notifications

import csv
from io import StringIO
from xml.etree import ElementTree as ET

from Npp import *


def xml_to_json(text):
    root = ET.fromstring(text)
    return [{e.tag: e.text for e in child} for child in root]
                

def xml_to_csv(text):
    js = xml_to_json(text)
    if not js:
        return ''
    fieldnames = list(js[0])
    strio = StringIO()
    writer = csv.DictWriter(strio, fieldnames=fieldnames)
    writer.writeheader()
    for row in js:
        writer.writerow(row)
    csv_text = strio.getvalue()
    return csv_text
    

if __name__ == '__main__':
    text = editor.getText()
    csv_text = xml_to_csv(text)
    notepad.new()
    editor.setText(csv_text)

When run on

<ChildTaskStatus>
  <LoadSubStatus>
    <TaskStatus>RanToCompletion</TaskStatus>
    <Template>NL_CM_0_1_1</Template>
    <ItemsTotal>41219</ItemsTotal>
    <ElapsedTime>00:00:04</ElapsedTime>
  </LoadSubStatus>
  <LoadSubStatus>
    <TaskStatus>RanToCompletion</TaskStatus>
    <Template>NL_CM_18_7_1</Template>
    <ItemsTotal>168930</ItemsTotal>
    <ElapsedTime>00:00:53</ElapsedTime>
  </LoadSubStatus>
  <LoadSubStatus>
    <TaskStatus>RanToCompletion</TaskStatus>
    <Template>NL_CM_12_8_1</Template>
    <ItemsTotal>231122</ItemsTotal>
    <ElapsedTime>00:00:43</ElapsedTime>
  </LoadSubStatus>
</ChildTaskStatus>

I got

TaskStatus,Template,ItemsTotal,ElapsedTime
RanToCompletion,NL_CM_0_1_1,41219,00:00:04
RanToCompletion,NL_CM_18_7_1,168930,00:00:53
RanToCompletion,NL_CM_12_8_1,231122,00:00:43

Bas de Reuver

@Coises Technically it could be in a different order, but it’s practically always the same order. However, it does sometimes have missing tags, like the ElapsedTime can be missing for a certain error status.

Mark Olson

Given that some tags are sometimes missing, a more robust solution than my original script would be this:

# based on https://community.notepad-plus-plus.org/topic/24563/convert-xml-data-to-tabular-csv-data-or-tab-separated-without-a-script
# References:
#   * https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree
#   * https://docs.python.org/3/library/csv.html
#   * https://docs.python.org/3/library/io.html#io.StringIO
#   * https://npppythonscript.sourceforge.net/docs/latest/intro.html#notifications

import csv
from io import StringIO
from xml.etree import ElementTree as ET

from Npp import *


def xml_to_json(text):
    root = ET.fromstring(text)
    fieldnames = sorted(set(e.tag for child in root for e in child),
        key=lambda x: x.lower())
    js = [{e.tag: e.text for e in child} for child in root]
    return fieldnames, js
                

def xml_to_csv(text):
    fieldnames, js = xml_to_json(text)
    if not js:
        return ''
    strio = StringIO()
    writer = csv.DictWriter(strio, fieldnames=fieldnames)
    writer.writeheader()
    for row in js:
        writer.writerow(row)
    csv_text = strio.getvalue()
    return csv_text
    

if __name__ == '__main__':
    text = editor.getText()
    csv_text = xml_to_csv(text)
    notepad.new()
    editor.setText(csv_text)

This version finds all the distinct child tag names so that even if some children are missing tags, the final table is guaranteed to have a column for each child tag.

Test data:

<ChildTaskStatus>
  <LoadSubStatus>
    <TaskStatus>RanToCompletion</TaskStatus>
    <Template>NL_CM_0_1_1</Template>
    <ItemsTotal>41219</ItemsTotal>
    <ElapsedTime>00:00:04</ElapsedTime>
  </LoadSubStatus>
  <LoadSubStatus>
    <TaskStatus>RanToCompletion</TaskStatus>
    <Template>NL_CM_18_7_1</Template>
    <ItemsTotal>168930</ItemsTotal>
    <ElapsedTime>00:00:53</ElapsedTime>
  </LoadSubStatus>
  <LoadSubStatus>
    <TaskStatus>RanToCompletion</TaskStatus>
    <Template>NL_CM_12_8_1</Template>
    <ItemsTotal>231122</ItemsTotal>
    <ElapsedTime>00:00:43</ElapsedTime>
  </LoadSubStatus>
  <LoadSubStatus>
    <TaskStatus>RanToCompletion</TaskStatus>
    <bar>eorneoren</bar>
    <Template>NL_CM_12_8_1</Template>
    <ItemsTotal>231122</ItemsTotal>
    <ElapsedTime>00:00:43</ElapsedTime>
  </LoadSubStatus>
  <LoadSubStatus>
    <TaskStatus>RanToCompletion</TaskStatus>
    <Template>NL_CM_12_8_1</Template>
    <ItemsTotal>231122</ItemsTotal>
    <baz>roenroen</baz>
    <ElapsedTime>00:00:43</ElapsedTime>
  </LoadSubStatus>
</ChildTaskStatus>

Result:

bar,baz,ElapsedTime,ItemsTotal,TaskStatus,Template
,,00:00:04,41219,RanToCompletion,NL_CM_0_1_1
,,00:00:53,168930,RanToCompletion,NL_CM_18_7_1
,,00:00:43,231122,RanToCompletion,NL_CM_12_8_1
eorneoren,,00:00:43,231122,RanToCompletion,NL_CM_12_8_1
,roenroen,00:00:43,231122,RanToCompletion,NL_CM_12_8_1

Coises

@Bas-de-Reuver said in Convert XML data to tabular csv data or tab separated without a script:

@Coises Technically it could be in a different order, but it’s practically always the same order. However, it does sometimes have missing tags, like the ElapsedTime can be missing for a certain error status.

OK. The reason I asked was that if they were always the same, it would be feasible to transform the file with regular expressions. With order changing and/or tags missing (no matter how rarely), scripting solutions become far more attractive.

dinkumoil

@Bas-de-Reuver

Converting your XML file to an CSV file could also be done with the help of the XML Tools plugin and its XML transformation feature.

Install XML Tools plugin via built-in PluginsAdmin.
Save the following code as TransformToCSV.xsl with character encoding UTF-8.

<?xml version="1.0" encoding="UTF-8"?>

<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output
    method="text"
    omit-xml-declaration="yes"
    indent="no"
    encoding="utf-8"
  />

  <xsl:strip-space
    elements="*"
  />
  
  <xsl:template match="/ChildTaskStatus/LoadSubStatus">
    <xsl:if test="position()=1">
      <xsl:text>TaskStatus;Template;ItemsTotal;ElapsedTime&#xD;&#xA;</xsl:text>
    </xsl:if>
    <xsl:value-of select="concat(TaskStatus, ';', Template, ';', ItemsTotal, ';', ElapsedTime, '&#xD;', '&#xA;')"/>
  </xsl:template>

</xsl:transform>

Load your XML file with Notepad++ and ensure it’s the active file.
Navigate to (menu) Plugins -> XML Tools -> XSL Transformation
In the dialog box popping up, use the ... button to select the XSL file you saved in step 2.
Click Transform button.

A new tab will be opened that contains the transformed file.

Please note: If the character encoding of your source XML file is not UTF-8, you have to set the correct encoding in line 9 of the XSL file.

guy038

Hello, @bas-de-reuver, @datatraveller1, @mark-olson, @coises, @wonkawilly, @dinkumoil and *All,

Just for info, if you always get the same number of columns per line, here is a regex solution :

So, from your INPUT text :

<ChildTaskStatus>
  <LoadSubStatus>
    <TaskStatus>RanToCompletion</TaskStatus>
    <Template>NL_CM_0_1_1</Template>
    <ItemsTotal>41219</ItemsTotal>
    <ElapsedTime>00:00:04</ElapsedTime>
  </LoadSubStatus>
  <LoadSubStatus>
    <TaskStatus>RanToCompletion</TaskStatus>
    <Template>NL_CM_18_7_1</Template>
    <ItemsTotal>168930</ItemsTotal>
    <ElapsedTime>00:00:53</ElapsedTime>
  </LoadSubStatus>
  <LoadSubStatus>
    <TaskStatus>RanToCompletion</TaskStatus>
    <Template>NL_CM_12_8_1</Template>
    <ItemsTotal>231122</ItemsTotal>
    <ElapsedTime>00:00:43</ElapsedTime>
  </LoadSubStatus>
</ChildTaskStatus>

This first regex S/R only keeps the key items, on consecutive lines

SEARCH (?-s)^\h*(?:</?ChildTaskStatus>|</?LoadSubStatus>)\R|\h*</?.+?>

REPLACE Leave EMPTY

Would give this TEMPORARY text :

RanToCompletion
NL_CM_0_1_1
41219
00:00:04
RanToCompletion
NL_CM_18_7_1
168930
00:00:53
RanToCompletion
NL_CM_12_8_1
231122
00:00:43

Then, this second regex S/R put the items in a table :

SEARCH (?-s)(.+:.+)(\R)|(.+)\R

REPLACE (?1|\1 |\2)(?3|\3\x20)

Would give the TEMPORARY text :

|RanToCompletion |NL_CM_0_1_1 |41219 |00:00:04 |
|RanToCompletion |NL_CM_18_7_1 |168930 |00:00:53 |
|RanToCompletion |NL_CM_12_8_1 |231122 |00:00:43 |

Finally, this third regex S/R justify the ItemsTotal values on the right

SEARCH (?-s)^(\|.+\|)(\d+)(\x20)+

REPLACE \1\3\2

And you get your expected OUTPUT text :

|RanToCompletion |NL_CM_0_1_1 | 41219|00:00:04 |
|RanToCompletion |NL_CM_18_7_1 | 168930|00:00:53 |
|RanToCompletion |NL_CM_12_8_1 | 231122|00:00:43 |

Best Regards,

guy038

wonkawilly

The following should work as well: (you need PythonScript Plugin in order to be able to run the script)

The data:

<ChildTaskStatus>
  <LoadSubStatus>
    <TaskStatus>TaskStatus</TaskStatus>
    <Template>Template</Template>
    <ItemsTotal>ItemsTotal</ItemsTotal>
    <ElapsedTime>ElapsedTime</ElapsedTime>
  </LoadSubStatus>
  <LoadSubStatus>
    <TaskStatus>RanToCompletion</TaskStatus>
    <Template>NL_CM_0_1_1</Template>
    <ItemsTotal>41219</ItemsTotal>
    <ElapsedTime>00:00:04</ElapsedTime>
  </LoadSubStatus>
  <LoadSubStatus>
    <TaskStatus>RanToCompletion</TaskStatus>
    <Template>NL_CM_18_7_1</Template>
    <ItemsTotal>168930</ItemsTotal>
    <ElapsedTime>00:00:53</ElapsedTime>
  </LoadSubStatus>
  <LoadSubStatus>
    <TaskStatus>RanToCompletion</TaskStatus>
    <Template>NL_CM_12_8_1</Template>
    <ItemsTotal>231122</ItemsTotal>
    <ElapsedTime>00:00:43</ElapsedTime>
  </LoadSubStatus>
</ChildTaskStatus>

The Python script:

search1 = "<ChildTaskStatus>\r\n"
search2 = "</ChildTaskStatus>"
search3 = "  <LoadSubStatus>\r\n"
search4 = "  </LoadSubStatus>\r\n"
search5 = "    <TaskStatus>"
search6 = "</TaskStatus>\r\n"
search7 = "    <Template>"
search8 = "</Template>\r\n"
search9 = "    <ItemsTotal>"
search10 = "</ItemsTotal>\r\n"
search11 = "    <ElapsedTime>"
search12 = "</ElapsedTime>\r\n"

replace1 = ""
replace2 = ""
replace3 = ""
replace4 = ""
replace5 = ""
replace6 = "\t"
replace7 = ""
replace8 = "\t"
replace9 = ""
replace10 = "\t"
replace11 = ""
replace12 = "\r\n"



srItems = \
	[
		[search1, replace1],
		[search2, replace2],
		[search3, replace3],
		[search4, replace4],
		[search5, replace5],
		[search6, replace6],
		[search7, replace7],
		[search8, replace8],
		[search9, replace9],
		[search10, replace10],
		[search11, replace11],
		[search12, replace12],
	]


def replaceAll(sri):
	editor.rereplace(sri[0], sri[1])


for sri in srItems:
	replaceAll(sri)

Situation before running the script:

Result after running the script:

The script of course can be further simplified and also you can replace the \t with comma or semicolon or any other symbol you like/need to achieve your purpose.

Kristijan Berta

For converting XML to CSV, I have written a blog post that outlines comprehensive methods, including flattening simple XML and normalizing complex, nested XML structures. It discusses multiple approaches, such as using Flexter for complex cases. Additionally, if you’re dealing with sensitive data, the Paranoid tool can help anonymize data before conversion. This makes it a practical solution for handling XML that contains private or confidential information while still achieving the necessary format transformation.

If you want to find out more contact me or send me a direct message