• Login
Community
  • Login

Convert XML data to tabular csv data or tab separated without a script

Scheduled Pinned Locked Moved Help wanted · · · – – – · · ·
14 Posts 8 Posters 15.4k Views
Loading More Posts
  • Oldest to Newest
  • Newest to Oldest
  • Most Votes
Reply
  • Reply as topic
Log in to reply
This topic has been deleted. Only users with topic management privileges can see it.
  • B
    Bas de Reuver
    last edited by Jun 12, 2023, 10:36 AM

    Is there an easy way to convert (part of) XML data back to tabular data, so comma separated values or tab separated, using Notepad++? I sometimes get XML log files, and for readability and easy compare it would be a lot easier to extract the data just as a flat table, for example in Excel.

    I’ve looked at the XMLtools plug-in and other plug-ins but couldn’t find any option. So for example the data is like this:

    <ChildTaskStatus>
      <LoadSubStatus>
        <TaskStatus>RanToCompletion</TaskStatus>
        <Template>NL_CM_0_1_1</Template>
        <ItemsTotal>41219</ItemsTotal>
        <ElapsedTime>00:00:04</ElapsedTime>
      </LoadSubStatus>
      <LoadSubStatus>
        <TaskStatus>RanToCompletion</TaskStatus>
        <Template>NL_CM_18_7_1</Template>
        <ItemsTotal>168930</ItemsTotal>
        <ElapsedTime>00:00:53</ElapsedTime>
      </LoadSubStatus>
      <LoadSubStatus>
        <TaskStatus>RanToCompletion</TaskStatus>
        <Template>NL_CM_12_8_1</Template>
        <ItemsTotal>231122</ItemsTotal>
        <ElapsedTime>00:00:43</ElapsedTime>
      </LoadSubStatus>
    </ChildTaskStatus>
    

    And I would like to extract the data to just this:

    |TaskStatus      |Template     |ItemsTotal |ElapsedTime |
    +----------------+-------------+-----------+------------+
    |RanToCompletion |NL_CM_0_1_1  |      41219|00:00:04    |
    |RanToCompletion |NL_CM_18_7_1 |     168930|00:00:53    |
    |RanToCompletion |NL_CM_12_8_1 |     231122|00:00:43    |
    

    Is there any way to do this easily without having to write a script for each different instance? I understand I would have to at least specify the XML tags for the four columns, but I mean is there a option or plugin to do this?

    D D 2 Replies Last reply Jun 12, 2023, 2:18 PM Reply Quote 2
    • D
      datatraveller1 @Bas de Reuver
      last edited by Jun 12, 2023, 2:18 PM

      @Bas-de-Reuver I use an online tool for converting XML to CSV but I assume this is not an option.

      B 1 Reply Last reply Jun 12, 2023, 2:33 PM Reply Quote 0
      • B
        Bas de Reuver @datatraveller1
        last edited by Jun 12, 2023, 2:33 PM

        @datatraveller1 it’s not an option in this case due to the XML containing privacy data, but I’m curious which online tool you’re referring to?

        Anyway I also asked the question on superuser.com and someone suggested simply opening the file in Excel which handles random XML files pretty well. It just didn’t occur to me to open an XML file in Excel because quite often it can’t even handle a simple csv file. 😆

        1 Reply Last reply Reply Quote 2
        • D
          datatraveller1
          last edited by Jun 12, 2023, 2:56 PM

          @Bas-de-Reuver said in Convert XML data to tabular csv data or tab separated without a script:

          @datatraveller1 it’s not an option in this case due to the XML containing privacy data, but I’m curious which online tool you’re referring to?

          Anyway I also asked the question on superuser.com and someone suggested simply opening the file in Excel which handles random XML files pretty well. It just didn’t occur to me to open an XML file in Excel because quite often it can’t even handle a simple csv file. 😆

          I use this tool: https://conversiontools.io/convert/xml-to-csv , which seems to be a good solution if there is no private data, but the Excel solution also looks promising.

          1 Reply Last reply Reply Quote 1
          • C
            Coises
            last edited by Jun 12, 2023, 5:04 PM

            The example you show has the same four column-elements in the same order in every row-element. Is that true of the real data?

            B 1 Reply Last reply Jun 13, 2023, 1:11 PM Reply Quote 1
            • W
              wonkawilly
              last edited by Jun 12, 2023, 8:09 PM

              If the schema is constant I can suggest to use find and replace multiple times and create a macro saving the whole procedure in the meanwhile so you have it for future uses

              1 Reply Last reply Reply Quote 1
              • M
                Mark Olson
                last edited by Jun 12, 2023, 10:22 PM

                PythonScript makes working with XML data quite easy, because Python’s standard library has great XML support.

                Here’s a pretty generic script. I’ve only tested it on the sample data you provided.

                # based on https://community.notepad-plus-plus.org/topic/24563/convert-xml-data-to-tabular-csv-data-or-tab-separated-without-a-script
                # References:
                #   * https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree
                #   * https://docs.python.org/3/library/csv.html
                #   * https://docs.python.org/3/library/io.html#io.StringIO
                #   * https://npppythonscript.sourceforge.net/docs/latest/intro.html#notifications
                
                import csv
                from io import StringIO
                from xml.etree import ElementTree as ET
                
                from Npp import *
                
                
                def xml_to_json(text):
                    root = ET.fromstring(text)
                    return [{e.tag: e.text for e in child} for child in root]
                                
                
                def xml_to_csv(text):
                    js = xml_to_json(text)
                    if not js:
                        return ''
                    fieldnames = list(js[0])
                    strio = StringIO()
                    writer = csv.DictWriter(strio, fieldnames=fieldnames)
                    writer.writeheader()
                    for row in js:
                        writer.writerow(row)
                    csv_text = strio.getvalue()
                    return csv_text
                    
                
                if __name__ == '__main__':
                    text = editor.getText()
                    csv_text = xml_to_csv(text)
                    notepad.new()
                    editor.setText(csv_text)
                

                When run on

                <ChildTaskStatus>
                  <LoadSubStatus>
                    <TaskStatus>RanToCompletion</TaskStatus>
                    <Template>NL_CM_0_1_1</Template>
                    <ItemsTotal>41219</ItemsTotal>
                    <ElapsedTime>00:00:04</ElapsedTime>
                  </LoadSubStatus>
                  <LoadSubStatus>
                    <TaskStatus>RanToCompletion</TaskStatus>
                    <Template>NL_CM_18_7_1</Template>
                    <ItemsTotal>168930</ItemsTotal>
                    <ElapsedTime>00:00:53</ElapsedTime>
                  </LoadSubStatus>
                  <LoadSubStatus>
                    <TaskStatus>RanToCompletion</TaskStatus>
                    <Template>NL_CM_12_8_1</Template>
                    <ItemsTotal>231122</ItemsTotal>
                    <ElapsedTime>00:00:43</ElapsedTime>
                  </LoadSubStatus>
                </ChildTaskStatus>
                

                I got

                TaskStatus,Template,ItemsTotal,ElapsedTime
                RanToCompletion,NL_CM_0_1_1,41219,00:00:04
                RanToCompletion,NL_CM_18_7_1,168930,00:00:53
                RanToCompletion,NL_CM_12_8_1,231122,00:00:43
                
                1 Reply Last reply Reply Quote 5
                • M Mark Olson referenced this topic on Jun 12, 2023, 10:22 PM
                • B
                  Bas de Reuver @Coises
                  last edited by Jun 13, 2023, 1:11 PM

                  @Coises Technically it could be in a different order, but it’s practically always the same order. However, it does sometimes have missing tags, like the ElapsedTime can be missing for a certain error status.

                  C 1 Reply Last reply Jun 13, 2023, 4:55 PM Reply Quote 0
                  • M
                    Mark Olson
                    last edited by Jun 13, 2023, 3:04 PM

                    Given that some tags are sometimes missing, a more robust solution than my original script would be this:

                    # based on https://community.notepad-plus-plus.org/topic/24563/convert-xml-data-to-tabular-csv-data-or-tab-separated-without-a-script
                    # References:
                    #   * https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree
                    #   * https://docs.python.org/3/library/csv.html
                    #   * https://docs.python.org/3/library/io.html#io.StringIO
                    #   * https://npppythonscript.sourceforge.net/docs/latest/intro.html#notifications
                    
                    import csv
                    from io import StringIO
                    from xml.etree import ElementTree as ET
                    
                    from Npp import *
                    
                    
                    def xml_to_json(text):
                        root = ET.fromstring(text)
                        fieldnames = sorted(set(e.tag for child in root for e in child),
                            key=lambda x: x.lower())
                        js = [{e.tag: e.text for e in child} for child in root]
                        return fieldnames, js
                                    
                    
                    def xml_to_csv(text):
                        fieldnames, js = xml_to_json(text)
                        if not js:
                            return ''
                        strio = StringIO()
                        writer = csv.DictWriter(strio, fieldnames=fieldnames)
                        writer.writeheader()
                        for row in js:
                            writer.writerow(row)
                        csv_text = strio.getvalue()
                        return csv_text
                        
                    
                    if __name__ == '__main__':
                        text = editor.getText()
                        csv_text = xml_to_csv(text)
                        notepad.new()
                        editor.setText(csv_text)
                    

                    This version finds all the distinct child tag names so that even if some children are missing tags, the final table is guaranteed to have a column for each child tag.

                    Test data:

                    <ChildTaskStatus>
                      <LoadSubStatus>
                        <TaskStatus>RanToCompletion</TaskStatus>
                        <Template>NL_CM_0_1_1</Template>
                        <ItemsTotal>41219</ItemsTotal>
                        <ElapsedTime>00:00:04</ElapsedTime>
                      </LoadSubStatus>
                      <LoadSubStatus>
                        <TaskStatus>RanToCompletion</TaskStatus>
                        <Template>NL_CM_18_7_1</Template>
                        <ItemsTotal>168930</ItemsTotal>
                        <ElapsedTime>00:00:53</ElapsedTime>
                      </LoadSubStatus>
                      <LoadSubStatus>
                        <TaskStatus>RanToCompletion</TaskStatus>
                        <Template>NL_CM_12_8_1</Template>
                        <ItemsTotal>231122</ItemsTotal>
                        <ElapsedTime>00:00:43</ElapsedTime>
                      </LoadSubStatus>
                      <LoadSubStatus>
                        <TaskStatus>RanToCompletion</TaskStatus>
                        <bar>eorneoren</bar>
                        <Template>NL_CM_12_8_1</Template>
                        <ItemsTotal>231122</ItemsTotal>
                        <ElapsedTime>00:00:43</ElapsedTime>
                      </LoadSubStatus>
                      <LoadSubStatus>
                        <TaskStatus>RanToCompletion</TaskStatus>
                        <Template>NL_CM_12_8_1</Template>
                        <ItemsTotal>231122</ItemsTotal>
                        <baz>roenroen</baz>
                        <ElapsedTime>00:00:43</ElapsedTime>
                      </LoadSubStatus>
                    </ChildTaskStatus>
                    

                    Result:

                    bar,baz,ElapsedTime,ItemsTotal,TaskStatus,Template
                    ,,00:00:04,41219,RanToCompletion,NL_CM_0_1_1
                    ,,00:00:53,168930,RanToCompletion,NL_CM_18_7_1
                    ,,00:00:43,231122,RanToCompletion,NL_CM_12_8_1
                    eorneoren,,00:00:43,231122,RanToCompletion,NL_CM_12_8_1
                    ,roenroen,00:00:43,231122,RanToCompletion,NL_CM_12_8_1
                    
                    1 Reply Last reply Reply Quote 2
                    • M Mark Olson referenced this topic on Jun 13, 2023, 3:04 PM
                    • C
                      Coises @Bas de Reuver
                      last edited by Jun 13, 2023, 4:55 PM

                      @Bas-de-Reuver said in Convert XML data to tabular csv data or tab separated without a script:

                      @Coises Technically it could be in a different order, but it’s practically always the same order. However, it does sometimes have missing tags, like the ElapsedTime can be missing for a certain error status.

                      OK. The reason I asked was that if they were always the same, it would be feasible to transform the file with regular expressions. With order changing and/or tags missing (no matter how rarely), scripting solutions become far more attractive.

                      1 Reply Last reply Reply Quote 1
                      • D
                        dinkumoil @Bas de Reuver
                        last edited by Jun 13, 2023, 6:16 PM

                        @Bas-de-Reuver

                        Converting your XML file to an CSV file could also be done with the help of the XML Tools plugin and its XML transformation feature.

                        1. Install XML Tools plugin via built-in PluginsAdmin.
                        2. Save the following code as TransformToCSV.xsl with character encoding UTF-8.
                        <?xml version="1.0" encoding="UTF-8"?>
                        
                        <xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
                        
                          <xsl:output
                            method="text"
                            omit-xml-declaration="yes"
                            indent="no"
                            encoding="utf-8"
                          />
                        
                          <xsl:strip-space
                            elements="*"
                          />
                          
                          <xsl:template match="/ChildTaskStatus/LoadSubStatus">
                            <xsl:if test="position()=1">
                              <xsl:text>TaskStatus;Template;ItemsTotal;ElapsedTime&#xD;&#xA;</xsl:text>
                            </xsl:if>
                            <xsl:value-of select="concat(TaskStatus, ';', Template, ';', ItemsTotal, ';', ElapsedTime, '&#xD;', '&#xA;')"/>
                          </xsl:template>
                        
                        </xsl:transform>
                        
                        1. Load your XML file with Notepad++ and ensure it’s the active file.
                        2. Navigate to (menu) Plugins -> XML Tools -> XSL Transformation
                        3. In the dialog box popping up, use the ... button to select the XSL file you saved in step 2.
                        4. Click Transform button.

                        A new tab will be opened that contains the transformed file.

                        Please note: If the character encoding of your source XML file is not UTF-8, you have to set the correct encoding in line 9 of the XSL file.

                        1 Reply Last reply Reply Quote 3
                        • guy038G
                          guy038
                          last edited by Jun 14, 2023, 11:22 PM

                          Hello, @bas-de-reuver, @datatraveller1, @mark-olson, @coises, @wonkawilly, @dinkumoil and *All,

                          Just for info, if you always get the same number of columns per line, here is a regex solution :


                          So, from your INPUT text :

                          <ChildTaskStatus>
                            <LoadSubStatus>
                              <TaskStatus>RanToCompletion</TaskStatus>
                              <Template>NL_CM_0_1_1</Template>
                              <ItemsTotal>41219</ItemsTotal>
                              <ElapsedTime>00:00:04</ElapsedTime>
                            </LoadSubStatus>
                            <LoadSubStatus>
                              <TaskStatus>RanToCompletion</TaskStatus>
                              <Template>NL_CM_18_7_1</Template>
                              <ItemsTotal>168930</ItemsTotal>
                              <ElapsedTime>00:00:53</ElapsedTime>
                            </LoadSubStatus>
                            <LoadSubStatus>
                              <TaskStatus>RanToCompletion</TaskStatus>
                              <Template>NL_CM_12_8_1</Template>
                              <ItemsTotal>231122</ItemsTotal>
                              <ElapsedTime>00:00:43</ElapsedTime>
                            </LoadSubStatus>
                          </ChildTaskStatus>
                          

                          This first regex S/R only keeps the key items, on consecutive lines

                          SEARCH (?-s)^\h*(?:</?ChildTaskStatus>|</?LoadSubStatus>)\R|\h*</?.+?>

                          REPLACE Leave EMPTY

                          Would give this TEMPORARY text :

                          RanToCompletion
                          NL_CM_0_1_1
                          41219
                          00:00:04
                          RanToCompletion
                          NL_CM_18_7_1
                          168930
                          00:00:53
                          RanToCompletion
                          NL_CM_12_8_1
                          231122
                          00:00:43
                          

                          Then, this second regex S/R put the items in a table :

                          SEARCH (?-s)(.+:.+)(\R)|(.+)\R

                          REPLACE (?1|\1 |\2)(?3|\3\x20)

                          Would give the TEMPORARY text :

                          |RanToCompletion |NL_CM_0_1_1 |41219 |00:00:04 |
                          |RanToCompletion |NL_CM_18_7_1 |168930 |00:00:53 |
                          |RanToCompletion |NL_CM_12_8_1 |231122 |00:00:43 |
                          

                          Finally, this third regex S/R justify the ItemsTotal values on the right

                          SEARCH (?-s)^(\|.+\|)(\d+)(\x20)+

                          REPLACE \1\3\2

                          And you get your expected OUTPUT text :

                          |RanToCompletion |NL_CM_0_1_1 | 41219|00:00:04 |
                          |RanToCompletion |NL_CM_18_7_1 | 168930|00:00:53 |
                          |RanToCompletion |NL_CM_12_8_1 | 231122|00:00:43 |
                          

                          Best Regards,

                          guy038

                          1 Reply Last reply Reply Quote 2
                          • W
                            wonkawilly
                            last edited by wonkawilly Jun 15, 2023, 9:50 AM Jun 15, 2023, 9:48 AM

                            The following should work as well: (you need PythonScript Plugin in order to be able to run the script)

                            The data:

                            <ChildTaskStatus>
                              <LoadSubStatus>
                                <TaskStatus>TaskStatus</TaskStatus>
                                <Template>Template</Template>
                                <ItemsTotal>ItemsTotal</ItemsTotal>
                                <ElapsedTime>ElapsedTime</ElapsedTime>
                              </LoadSubStatus>
                              <LoadSubStatus>
                                <TaskStatus>RanToCompletion</TaskStatus>
                                <Template>NL_CM_0_1_1</Template>
                                <ItemsTotal>41219</ItemsTotal>
                                <ElapsedTime>00:00:04</ElapsedTime>
                              </LoadSubStatus>
                              <LoadSubStatus>
                                <TaskStatus>RanToCompletion</TaskStatus>
                                <Template>NL_CM_18_7_1</Template>
                                <ItemsTotal>168930</ItemsTotal>
                                <ElapsedTime>00:00:53</ElapsedTime>
                              </LoadSubStatus>
                              <LoadSubStatus>
                                <TaskStatus>RanToCompletion</TaskStatus>
                                <Template>NL_CM_12_8_1</Template>
                                <ItemsTotal>231122</ItemsTotal>
                                <ElapsedTime>00:00:43</ElapsedTime>
                              </LoadSubStatus>
                            </ChildTaskStatus>
                            

                            The Python script:

                            search1 = "<ChildTaskStatus>\r\n"
                            search2 = "</ChildTaskStatus>"
                            search3 = "  <LoadSubStatus>\r\n"
                            search4 = "  </LoadSubStatus>\r\n"
                            search5 = "    <TaskStatus>"
                            search6 = "</TaskStatus>\r\n"
                            search7 = "    <Template>"
                            search8 = "</Template>\r\n"
                            search9 = "    <ItemsTotal>"
                            search10 = "</ItemsTotal>\r\n"
                            search11 = "    <ElapsedTime>"
                            search12 = "</ElapsedTime>\r\n"
                            
                            replace1 = ""
                            replace2 = ""
                            replace3 = ""
                            replace4 = ""
                            replace5 = ""
                            replace6 = "\t"
                            replace7 = ""
                            replace8 = "\t"
                            replace9 = ""
                            replace10 = "\t"
                            replace11 = ""
                            replace12 = "\r\n"
                            
                            
                            
                            srItems = \
                            	[
                            		[search1, replace1],
                            		[search2, replace2],
                            		[search3, replace3],
                            		[search4, replace4],
                            		[search5, replace5],
                            		[search6, replace6],
                            		[search7, replace7],
                            		[search8, replace8],
                            		[search9, replace9],
                            		[search10, replace10],
                            		[search11, replace11],
                            		[search12, replace12],
                            	]
                            
                            
                            def replaceAll(sri):
                            	editor.rereplace(sri[0], sri[1])
                            
                            
                            for sri in srItems:
                            	replaceAll(sri)
                            

                            Situation before running the script:

                            7c57bf22-779c-4b43-ae16-97966e79f5db-image.png

                            Result after running the script:

                            a4f4e0d9-b5f3-41e1-ba0c-9abf15002b90-image.png

                            The script of course can be further simplified and also you can replace the \t with comma or semicolon or any other symbol you like/need to achieve your purpose.

                            1 Reply Last reply Reply Quote 0
                            • M Mark Olson referenced this topic on Jun 20, 2023, 6:22 AM
                            • Kristijan BertaK
                              Kristijan Berta
                              last edited by Nov 13, 2024, 1:12 PM

                              For converting XML to CSV, I have written a blog post that outlines comprehensive methods, including flattening simple XML and normalizing complex, nested XML structures. It discusses multiple approaches, such as using Flexter for complex cases. Additionally, if you’re dealing with sensitive data, the Paranoid tool can help anonymize data before conversion. This makes it a practical solution for handling XML that contains private or confidential information while still achieving the necessary format transformation.

                              If you want to find out more contact me or send me a direct message

                              1 Reply Last reply Reply Quote 0
                              2 out of 14
                              • First post
                                2/14
                                Last post
                              The Community of users of the Notepad++ text editor.
                              Powered by NodeBB | Contributors