Convert XML data to tabular csv data or tab separated without a script
-
@datatraveller1 it’s not an option in this case due to the XML containing privacy data, but I’m curious which online tool you’re referring to?
Anyway I also asked the question on superuser.com and someone suggested simply opening the file in Excel which handles random XML files pretty well. It just didn’t occur to me to open an XML file in Excel because quite often it can’t even handle a simple csv file. 😆
-
@Bas-de-Reuver said in Convert XML data to tabular csv data or tab separated without a script:
@datatraveller1 it’s not an option in this case due to the XML containing privacy data, but I’m curious which online tool you’re referring to?
Anyway I also asked the question on superuser.com and someone suggested simply opening the file in Excel which handles random XML files pretty well. It just didn’t occur to me to open an XML file in Excel because quite often it can’t even handle a simple csv file. 😆
I use this tool: https://conversiontools.io/convert/xml-to-csv, which seems to be a good solution if there is no private data, but the Excel solution also looks promising.
-
The example you show has the same four column-elements in the same order in every row-element. Is that true of the real data?
-
If the schema is constant I can suggest to use find and replace multiple times and create a macro saving the whole procedure in the meanwhile so you have it for future uses
-
PythonScript makes working with XML data quite easy, because Python’s standard library has great XML support.
Here’s a pretty generic script. I’ve only tested it on the sample data you provided.
# based on https://community.notepad-plus-plus.org/topic/24563/convert-xml-data-to-tabular-csv-data-or-tab-separated-without-a-script # References: # * https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree # * https://docs.python.org/3/library/csv.html # * https://docs.python.org/3/library/io.html#io.StringIO # * https://npppythonscript.sourceforge.net/docs/latest/intro.html#notifications import csv from io import StringIO from xml.etree import ElementTree as ET from Npp import * def xml_to_json(text): root = ET.fromstring(text) return [{e.tag: e.text for e in child} for child in root] def xml_to_csv(text): js = xml_to_json(text) if not js: return '' fieldnames = list(js[0]) strio = StringIO() writer = csv.DictWriter(strio, fieldnames=fieldnames) writer.writeheader() for row in js: writer.writerow(row) csv_text = strio.getvalue() return csv_text if __name__ == '__main__': text = editor.getText() csv_text = xml_to_csv(text) notepad.new() editor.setText(csv_text)
When run on
<ChildTaskStatus> <LoadSubStatus> <TaskStatus>RanToCompletion</TaskStatus> <Template>NL_CM_0_1_1</Template> <ItemsTotal>41219</ItemsTotal> <ElapsedTime>00:00:04</ElapsedTime> </LoadSubStatus> <LoadSubStatus> <TaskStatus>RanToCompletion</TaskStatus> <Template>NL_CM_18_7_1</Template> <ItemsTotal>168930</ItemsTotal> <ElapsedTime>00:00:53</ElapsedTime> </LoadSubStatus> <LoadSubStatus> <TaskStatus>RanToCompletion</TaskStatus> <Template>NL_CM_12_8_1</Template> <ItemsTotal>231122</ItemsTotal> <ElapsedTime>00:00:43</ElapsedTime> </LoadSubStatus> </ChildTaskStatus>
I got
TaskStatus,Template,ItemsTotal,ElapsedTime RanToCompletion,NL_CM_0_1_1,41219,00:00:04 RanToCompletion,NL_CM_18_7_1,168930,00:00:53 RanToCompletion,NL_CM_12_8_1,231122,00:00:43
-
-
@Coises Technically it could be in a different order, but it’s practically always the same order. However, it does sometimes have missing tags, like the ElapsedTime can be missing for a certain error status.
-
Given that some tags are sometimes missing, a more robust solution than my original script would be this:
# based on https://community.notepad-plus-plus.org/topic/24563/convert-xml-data-to-tabular-csv-data-or-tab-separated-without-a-script # References: # * https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree # * https://docs.python.org/3/library/csv.html # * https://docs.python.org/3/library/io.html#io.StringIO # * https://npppythonscript.sourceforge.net/docs/latest/intro.html#notifications import csv from io import StringIO from xml.etree import ElementTree as ET from Npp import * def xml_to_json(text): root = ET.fromstring(text) fieldnames = sorted(set(e.tag for child in root for e in child), key=lambda x: x.lower()) js = [{e.tag: e.text for e in child} for child in root] return fieldnames, js def xml_to_csv(text): fieldnames, js = xml_to_json(text) if not js: return '' strio = StringIO() writer = csv.DictWriter(strio, fieldnames=fieldnames) writer.writeheader() for row in js: writer.writerow(row) csv_text = strio.getvalue() return csv_text if __name__ == '__main__': text = editor.getText() csv_text = xml_to_csv(text) notepad.new() editor.setText(csv_text)
This version finds all the distinct child tag names so that even if some children are missing tags, the final table is guaranteed to have a column for each child tag.
Test data:
<ChildTaskStatus> <LoadSubStatus> <TaskStatus>RanToCompletion</TaskStatus> <Template>NL_CM_0_1_1</Template> <ItemsTotal>41219</ItemsTotal> <ElapsedTime>00:00:04</ElapsedTime> </LoadSubStatus> <LoadSubStatus> <TaskStatus>RanToCompletion</TaskStatus> <Template>NL_CM_18_7_1</Template> <ItemsTotal>168930</ItemsTotal> <ElapsedTime>00:00:53</ElapsedTime> </LoadSubStatus> <LoadSubStatus> <TaskStatus>RanToCompletion</TaskStatus> <Template>NL_CM_12_8_1</Template> <ItemsTotal>231122</ItemsTotal> <ElapsedTime>00:00:43</ElapsedTime> </LoadSubStatus> <LoadSubStatus> <TaskStatus>RanToCompletion</TaskStatus> <bar>eorneoren</bar> <Template>NL_CM_12_8_1</Template> <ItemsTotal>231122</ItemsTotal> <ElapsedTime>00:00:43</ElapsedTime> </LoadSubStatus> <LoadSubStatus> <TaskStatus>RanToCompletion</TaskStatus> <Template>NL_CM_12_8_1</Template> <ItemsTotal>231122</ItemsTotal> <baz>roenroen</baz> <ElapsedTime>00:00:43</ElapsedTime> </LoadSubStatus> </ChildTaskStatus>
Result:
bar,baz,ElapsedTime,ItemsTotal,TaskStatus,Template ,,00:00:04,41219,RanToCompletion,NL_CM_0_1_1 ,,00:00:53,168930,RanToCompletion,NL_CM_18_7_1 ,,00:00:43,231122,RanToCompletion,NL_CM_12_8_1 eorneoren,,00:00:43,231122,RanToCompletion,NL_CM_12_8_1 ,roenroen,00:00:43,231122,RanToCompletion,NL_CM_12_8_1
-
-
@Bas-de-Reuver said in Convert XML data to tabular csv data or tab separated without a script:
@Coises Technically it could be in a different order, but it’s practically always the same order. However, it does sometimes have missing tags, like the ElapsedTime can be missing for a certain error status.
OK. The reason I asked was that if they were always the same, it would be feasible to transform the file with regular expressions. With order changing and/or tags missing (no matter how rarely), scripting solutions become far more attractive.
-
Converting your XML file to an CSV file could also be done with the help of the XML Tools plugin and its XML transformation feature.
- Install XML Tools plugin via built-in PluginsAdmin.
- Save the following code as
TransformToCSV.xsl
with character encoding UTF-8.
<?xml version="1.0" encoding="UTF-8"?> <xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:output method="text" omit-xml-declaration="yes" indent="no" encoding="utf-8" /> <xsl:strip-space elements="*" /> <xsl:template match="/ChildTaskStatus/LoadSubStatus"> <xsl:if test="position()=1"> <xsl:text>TaskStatus;Template;ItemsTotal;ElapsedTime
</xsl:text> </xsl:if> <xsl:value-of select="concat(TaskStatus, ';', Template, ';', ItemsTotal, ';', ElapsedTime, '
', '
')"/> </xsl:template> </xsl:transform>
- Load your XML file with Notepad++ and ensure it’s the active file.
- Navigate to
(menu) Plugins -> XML Tools -> XSL Transformation
- In the dialog box popping up, use the
...
button to select the XSL file you saved in step 2. - Click
Transform
button.
A new tab will be opened that contains the transformed file.
Please note: If the character encoding of your source XML file is not UTF-8, you have to set the correct encoding in line 9 of the XSL file.
-
Hello, @bas-de-reuver, @datatraveller1, @mark-olson, @coises, @wonkawilly, @dinkumoil and *All,
Just for info, if you always get the same number of columns per line, here is a regex solution :
So, from your INPUT text :
<ChildTaskStatus> <LoadSubStatus> <TaskStatus>RanToCompletion</TaskStatus> <Template>NL_CM_0_1_1</Template> <ItemsTotal>41219</ItemsTotal> <ElapsedTime>00:00:04</ElapsedTime> </LoadSubStatus> <LoadSubStatus> <TaskStatus>RanToCompletion</TaskStatus> <Template>NL_CM_18_7_1</Template> <ItemsTotal>168930</ItemsTotal> <ElapsedTime>00:00:53</ElapsedTime> </LoadSubStatus> <LoadSubStatus> <TaskStatus>RanToCompletion</TaskStatus> <Template>NL_CM_12_8_1</Template> <ItemsTotal>231122</ItemsTotal> <ElapsedTime>00:00:43</ElapsedTime> </LoadSubStatus> </ChildTaskStatus>
This first regex S/R only keeps the key items, on consecutive lines
SEARCH
(?-s)^\h*(?:</?ChildTaskStatus>|</?LoadSubStatus>)\R|\h*</?.+?>
REPLACE
Leave EMPTY
Would give this TEMPORARY text :
RanToCompletion NL_CM_0_1_1 41219 00:00:04 RanToCompletion NL_CM_18_7_1 168930 00:00:53 RanToCompletion NL_CM_12_8_1 231122 00:00:43
Then, this second regex S/R put the items in a table :
SEARCH
(?-s)(.+:.+)(\R)|(.+)\R
REPLACE
(?1|\1 |\2)(?3|\3\x20)
Would give the TEMPORARY text :
|RanToCompletion |NL_CM_0_1_1 |41219 |00:00:04 | |RanToCompletion |NL_CM_18_7_1 |168930 |00:00:53 | |RanToCompletion |NL_CM_12_8_1 |231122 |00:00:43 |
Finally, this third regex S/R justify the
ItemsTotal
values on the rightSEARCH
(?-s)^(\|.+\|)(\d+)(\x20)+
REPLACE
\1\3\2
And you get your expected OUTPUT text :
|RanToCompletion |NL_CM_0_1_1 | 41219|00:00:04 | |RanToCompletion |NL_CM_18_7_1 | 168930|00:00:53 | |RanToCompletion |NL_CM_12_8_1 | 231122|00:00:43 |
Best Regards,
guy038
-
The following should work as well: (you need PythonScript Plugin in order to be able to run the script)
The data:
<ChildTaskStatus> <LoadSubStatus> <TaskStatus>TaskStatus</TaskStatus> <Template>Template</Template> <ItemsTotal>ItemsTotal</ItemsTotal> <ElapsedTime>ElapsedTime</ElapsedTime> </LoadSubStatus> <LoadSubStatus> <TaskStatus>RanToCompletion</TaskStatus> <Template>NL_CM_0_1_1</Template> <ItemsTotal>41219</ItemsTotal> <ElapsedTime>00:00:04</ElapsedTime> </LoadSubStatus> <LoadSubStatus> <TaskStatus>RanToCompletion</TaskStatus> <Template>NL_CM_18_7_1</Template> <ItemsTotal>168930</ItemsTotal> <ElapsedTime>00:00:53</ElapsedTime> </LoadSubStatus> <LoadSubStatus> <TaskStatus>RanToCompletion</TaskStatus> <Template>NL_CM_12_8_1</Template> <ItemsTotal>231122</ItemsTotal> <ElapsedTime>00:00:43</ElapsedTime> </LoadSubStatus> </ChildTaskStatus>
The Python script:
search1 = "<ChildTaskStatus>\r\n" search2 = "</ChildTaskStatus>" search3 = " <LoadSubStatus>\r\n" search4 = " </LoadSubStatus>\r\n" search5 = " <TaskStatus>" search6 = "</TaskStatus>\r\n" search7 = " <Template>" search8 = "</Template>\r\n" search9 = " <ItemsTotal>" search10 = "</ItemsTotal>\r\n" search11 = " <ElapsedTime>" search12 = "</ElapsedTime>\r\n" replace1 = "" replace2 = "" replace3 = "" replace4 = "" replace5 = "" replace6 = "\t" replace7 = "" replace8 = "\t" replace9 = "" replace10 = "\t" replace11 = "" replace12 = "\r\n" srItems = \ [ [search1, replace1], [search2, replace2], [search3, replace3], [search4, replace4], [search5, replace5], [search6, replace6], [search7, replace7], [search8, replace8], [search9, replace9], [search10, replace10], [search11, replace11], [search12, replace12], ] def replaceAll(sri): editor.rereplace(sri[0], sri[1]) for sri in srItems: replaceAll(sri)
Situation before running the script:
Result after running the script:
The script of course can be further simplified and also you can replace the \t with comma or semicolon or any other symbol you like/need to achieve your purpose.
-
-
For converting XML to CSV, I have written a blog post that outlines comprehensive methods, including flattening simple XML and normalizing complex, nested XML structures. It discusses multiple approaches, such as using Flexter for complex cases. Additionally, if you’re dealing with sensitive data, the Paranoid tool can help anonymize data before conversion. This makes it a practical solution for handling XML that contains private or confidential information while still achieving the necessary format transformation.
If you want to find out more contact me or send me a direct message