XML Split

Craig McFarlane

Hi guys, just wanted to see if there is a way to do this with Notepad++, or if someone else has any suggestions. I have a massive XML file that contains, for example, the sections below. I have copied out just two sections so you can see.

Is there a way that I can export, say, all of Game1 out to a new XML file called Game1, and then do the same for Game2?

Thanks in advance.

<game name="Game1>
<description>1</description>
<cloneof></cloneof>
<crc></crc>
<manufacturer>1</manufacturer>
<year>1</year>
<genre>Action</genre>
<rating>2</rating>
<enabled>Yes</enabled>
</game>
<game name=Game2>
<description>2)</description>
<cloneof></cloneof>
<crc></crc>
<manufacturer>2</manufacturer>
<year>2010</year>
<genre>F2</genre>
<rating>2</rating>
<enabled>Yes</enabled>
</game>

PeterJones

Notepad++ focuses on a single file at a time.

If you really only had two games (and if the XML in those games were valid and consistent), you could make 2 copies of your original file, and do a reasonably simple regex similar to

FIND = (?s-i).*?(^<game name="Game1">.*?</game>).*
REPLACE = $1
MODE = regular expression

Run that once in each file: in the first file, you would have Game1 in your FIND; in the second file, Game2 (and so forth for any further sections).

(Note: this won’t work with the data you showed as-is, because you had a quote before but not after Game1, and no quotes around Game2. I just assumed some reasonable/consistent data for making my example regex)
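For anyone who would rather test the idea outside Notepad++, here is a minimal Python sketch of the same regex approach: it pulls one named game block out of the text. It assumes the cleaned-up, consistently quoted data that the regex above assumes; `extract_game` and the sample string are hypothetical names made up for illustration. Python's `re` needs `re.MULTILINE` for `^` to match at the start of each line and `re.DOTALL` for `.` to cross line breaks.

```python
import re


def extract_game(text, name):
    """Return the first <game name="NAME">...</game> block, or None if absent."""
    pattern = re.compile(
        r'^<game name="' + re.escape(name) + r'">.*?</game>',
        re.DOTALL | re.MULTILINE,  # '.' spans newlines, '^' matches at line starts
    )
    match = pattern.search(text)
    return match.group(0) if match else None


# Hypothetical sample, shaped like the corrected data assumed above.
sample = (
    '<game name="Game1">\n<description>1</description>\n</game>\n'
    '<game name="Game2">\n<description>2</description>\n</game>\n'
)

print(extract_game(sample, "Game2"))
```

Because the game name goes through `re.escape`, names containing regex metacharacters such as `$` or parentheses are matched literally.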

If you’ve got more than a handful of games, you probably don’t want to manually cycle through each file, tweaking the regex each time. This then enters the realm of programming languages. You could do it in Notepad++, using a plugin like PythonScript or LuaScript (which you can install from the Plugins > Plugins Admin). But most of what you’d be doing in the program would have nothing to do with Notepad++… at which point, you could do it outside Notepad++ using your favorite programming language with its XML-parsing library as easily (or more easily) than you could inside the Notepad++ environment. And this isn’t really a general programming forum.
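As a rough illustration of the XML-parsing-library route, here is a sketch using Python's standard `xml.etree.ElementTree`. It assumes the file is well-formed XML with the `game` elements inside a single root element. The `menu` root tag, the `split_games` function, and the sample data are assumptions for illustration only, since the thread does not show the real root element.

```python
import xml.etree.ElementTree as ET


def split_games(xml_text):
    """Map each game's name attribute to that element serialized as a string."""
    root = ET.fromstring(xml_text)
    return {
        # tostring() also emits trailing whitespace after the element, so strip it.
        game.get("name"): ET.tostring(game, encoding="unicode").strip()
        for game in root.iter("game")
    }


# Hypothetical wrapper element and data; not taken from the real file.
sample = """<menu>
  <game name="Game1"><description>1</description></game>
  <game name="Game2"><description>2</description></game>
</menu>"""

for name, fragment in split_games(sample).items():
    print(name, "->", fragment)
```

Each fragment could then be written to its own file. A parser re-serializes the XML, so the output may not be byte-for-byte identical to the input, and it will refuse input that is not well-formed.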

It would help if you could give a more accurate example, with appropriate quotes around the attribute values (or confirm that it really varies from matched quotes around some game names to no quotes around others), and also confirm whether it’s always “Game” followed by one or more digits, or whether the game name can be anything. (Make sure the example text is formatted as described in the footnote below, so that real quotes don’t accidentally turn into smart quotes.) If you can do that for us, and if you are willing to install the PythonScript plugin, one of us might be willing to hack together an example script that would go through your input file line by line, compare each line to the start-of-game tag, extract the game name, start a new output file of that name, and write everything up to the next start-of-game tag or the EOF into that output file, even though it’s primarily a programming challenge, not a Notepad++ challenge.

----

Do you want regex search/replace help? Then please be patient and polite, show some effort, and be willing to learn; answer questions and requests for clarification that are made of you. All example text should be marked as plain text using the </> toolbar button or manual Markdown syntax; screenshots can be pasted from the clipboard to your post using Ctrl+V. Show the data you have and the text you want to get from that data; include examples of things that should match and be transformed, and things that don’t match and should be left alone; show edge cases and make sure your examples are as varied as your real data. Show the regex you already tried, and why you thought it should work; tell us what’s wrong with what you do get… Read the official NPP Searching / Regex docs and the forum’s Regular Expression FAQ. If you follow these guidelines, you’re much more likely to get helpful replies that solve your problem in the shortest number of tries.

Alan Kilborn

@Craig-McFarlane said in XML Split:

massive XML file that contains for example the below sections

Hard to tell if you are underrepresenting the data you actually have.

For example, are we talking thousands(?) of “games”, each as short as the 2 you show, or are there a lot of “game1” sections interspersed through the file, or…?

The problem is, someone might come up with a solution, only to have you say “no no no, that’s not what I have, so that solution won’t work”. So it might be best to elaborate a bit more on what you have.

Apologies if what you have described perfectly outlines your data and your need.

Craig McFarlane

Thanks for the replies so far, chaps. Here are the first 5 entries copied out of the main XML file so you can see the correct formatting.

	<game name="$1,000,000 Pyramid, The (USA)" index="true" image="$">
		<description>$1,000,000 Pyramid, The (USA)</description>
		<cloneof></cloneof>
		<crc></crc>
		<manufacturer>Ubi Soft Entertainment</manufacturer>
		<year>2011</year>
		<genre>Quiz</genre>
		<rating>Other - NR (Not Rated)</rating>
		<enabled>Yes</enabled>
	</game>
	<game name="007 - Quantum of Solace (USA)" index="true" image="0">
		<description>007 - Quantum of Solace (USA)</description>
		<cloneof></cloneof>
		<crc></crc>
		<manufacturer>Activision</manufacturer>
		<year>2008</year>
		<genre>Action</genre>
		<rating>Other - NR (Not Rated)</rating>
		<enabled>Yes</enabled>
	</game>
	<game name="10 Minute Solution (USA)" index="true" image="1">
		<description>10 Minute Solution (USA)</description>
		<cloneof></cloneof>
		<crc></crc>
		<manufacturer>Activision</manufacturer>
		<year>2010</year>
		<genre>Fitness</genre>
		<rating>Other - NR (Not Rated)</rating>
		<enabled>Yes</enabled>
	</game>
	<game name="101-in-1 Party Megamix (USA)" index="" image="">
		<description>101-in-1 Party Megamix (USA)</description>
		<cloneof></cloneof>
		<crc></crc>
		<manufacturer>Atlus</manufacturer>
		<year>2009</year>
		<genre>Party</genre>
		<rating>Other - NR (Not Rated)</rating>
		<enabled>Yes</enabled>
	</game>
	<game name="101-in-1 Sports Party Megamix (USA)" index="" image="">
		<description>101-in-1 Sports Party Megamix (USA)</description>
		<cloneof></cloneof>
		<crc></crc>
		<manufacturer>Atlus</manufacturer>
		<year>2011</year>
		<genre>Party</genre>
		<rating>Other - NR (Not Rated)</rating>
		<enabled>Yes</enabled>
	</game>

Alan Kilborn

So with Notepad++ itself, you can’t “iterate a search over the game names”, and then do something with each one. At this point your best bet may be to turn to a scripting plugin, or even to take it external to Notepad++ and just go with an independent programming language.

It would not be that hard to script a solution where each game is exported to a file named after the game (the hardest part, and it isn’t that hard, would be filtering out characters that appear in a game name but can’t be used in a filename).
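A minimal sketch of that filtering step in Python, assuming the files are written on Windows, where the characters `< > : " / \ | ? *` and ASCII control characters are not allowed in file names. The `safe_filename` function and the second example name are hypothetical, made up for illustration.

```python
import re

# Characters Windows rejects in file names, plus ASCII control characters.
_INVALID = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_filename(game_name, replacement="_"):
    """Replace characters that cannot appear in a Windows file name."""
    cleaned = _INVALID.sub(replacement, game_name)
    # Windows also disallows names that end in a dot or a space.
    cleaned = cleaned.rstrip(". ")
    return cleaned or "unnamed"


print(safe_filename("007 - Quantum of Solace (USA)"))  # already valid, unchanged
print(safe_filename("AC/DC Live: Rock Band"))          # hypothetical name with / and :
```

This does not handle reserved device names such as `CON` or `NUL`, and two different game names could collapse to the same file name after replacement, so a real script may want to check for collisions.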

Craig McFarlane

Ok. Thanks mate. Best explore the scripting solution.

Alan Kilborn

@Craig-McFarlane

So, just to take you a little bit down the script-writing path, although we are not a script-writing service here, I could envision something like the following (using PythonScript plugin) for your data:

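matches = []
# Collect (span, name) for every <game ...>...</game> block; group 1 is the name attribute.
editor.research(r'(?s)^\h+<game name="(.*?)".*?</game>', lambda m: matches.append((m.span(0), m.group(1))))
for span, filename in matches:
    # The name is used as-is, so a name containing characters that are
    # illegal in filenames (such as / or :) will make open() fail here.
    with open(filename + '.xml', 'wb') as f:
        f.write(editor.getTextRange(span[0], span[1]) + '\r\n')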

Craig McFarlane

Thanks mate. Will explore this further.

guy038

Hello, @craig-mcfarlane, @alan-kilborn, @peterjones and All,

Another solution, besides Alan’s Python script, would be to use the gawk utility. If you don’t know this powerful software, simply follow the steps at the beginning of my recent post, below :

https://community.notepad-plus-plus.org/topic/19186/copy-search-and-replace-between-2-html-files/11

Then :

Place all the gawk files ( 5 files ) and your big XML file in the same folder
In a DOS console window, type in and run the following command :

gawk -F\x22 "/game name/ { n= $2 } ; { print > n\".xml\" }" Big_File.xml

Voilà !

Notes :

The field separator is the double quote ( \x22 )
Then if the string game name exists in a line, it stores the contents of field 2 in variable n ( the name of the file )
In all cases, it appends the entire current line to the file named n.xml

So, after processing, you should have, in your folder, as many files as there are game name tags in your large XML file, each named, without the extension, after the value of its name attribute ( $2 parameter )
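For readers who do not have gawk at hand, the same line-by-line logic can be sketched in Python. This is only an illustration under the same assumptions as the command above (the game name is the first double-quoted value on any line containing `game name`); `split_by_game` and the sample lines are hypothetical. One deliberate difference: lines that appear before the first game tag are skipped here rather than written anywhere.

```python
import os
import tempfile


def split_by_game(lines, out_dir):
    """Write each line to <name>.xml, where name is the most recent game name seen."""
    seen = set()
    out = None
    try:
        for line in lines:
            if "game name" in line:
                # Field 2 when splitting on double quotes, like $2 with -F\x22.
                name = line.split('"')[1]
                if out:
                    out.close()
                # Start a fresh file the first time a name appears, append afterwards.
                mode = "a" if name in seen else "w"
                seen.add(name)
                out = open(os.path.join(out_dir, name + ".xml"), mode, encoding="utf-8")
            if out:  # lines before the first game tag are skipped
                out.write(line)
    finally:
        if out:
            out.close()
    return sorted(seen)


# Hypothetical sample lines shaped like the data shown earlier in the thread.
sample = [
    '\t<game name="Game1" index="true" image="1">\n',
    '\t\t<description>1</description>\n',
    '\t</game>\n',
    '\t<game name="Game2" index="" image="">\n',
    '\t\t<description>2</description>\n',
    '\t</game>\n',
]

with tempfile.TemporaryDirectory() as folder:
    print(split_by_game(sample, folder))
```

As with the gawk command, a game name that contains a character not allowed in file names (a `/`, for instance) would need to be cleaned up before it is used as a file name.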

Best Regards,

guy038

Craig McFarlane

Thanks mate. I appreciate you taking the time to reply. I’ve not heard of the software but will check it out. Thanks again mate. Always great to learn something new.

Craig McFarlane

@guy038 said in XML Split:

gawk -F\x22 "/game name/ { n= $2 } ; { print > n\".xml\" }"

Mate. Just wanted to say. Top man. Thank you so much for A) taking the time to reply with a constructive and very helpful reply, and B) you learn something new every day, and C) what you suggested worked like a dream. Had one issue, as one of the names had a / in it. No drama, worked around it.

Thank you so much