Need help with regex for XML removal
-
This or something similar has potential to work:
Find:
(?s-i)<machine.+?(?:(type="gambling".+?</machine>\R)|(?:</machine>\R))
Replace:
?{1}:$0
Or maybe it’s just a variant on yours that will still be too complex for the regex engine.
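For readers who want to try the idea outside Notepad++, here is a minimal Python sketch of the same search with the conditional replacement emulated by a callback. It is an approximation, not the Notepad++ behavior itself: Python's re module has neither \R nor the ?{1}:$0 replacement syntax, so the NL alternation, the function name drop_gambling and the three tiny sample sections are all assumptions made for illustration.

```python
import re

# Python's re has no \R, so an explicit newline alternation stands in for it (assumption).
NL = r"(?:\r\n|\n|\r)"
PATTERN = re.compile(
    r'<machine.+?(?:(type="gambling".+?</machine>' + NL + r")|(?:</machine>" + NL + r"))",
    re.DOTALL,  # same role as (?s); re is case-sensitive by default, like (?-i)
)

def drop_gambling(text: str) -> str:
    """Mimic the replacement ?{1}:$0 with a callback."""
    def repl(m: re.Match) -> str:
        # Group 1 took part in the match -> replace the section with nothing.
        # Otherwise put the whole match ($0) back unchanged.
        return "" if m.group(1) is not None else m.group(0)
    return PATTERN.sub(repl, text)

# Hypothetical, shortened sample sections.
plain = '<machine name="a">\n<input>\n</input>\n</machine>\n'
gambling = '<machine name="b">\n<input>\n<control type="gambling"/>\n</input>\n</machine>\n'
other = '<machine name="c">\n<input>\n<control type="other"/>\n</input>\n</machine>\n'

result = drop_gambling(plain + gambling + other)
print(result)
```

Only the section containing type="gambling" is removed; the other two sections are matched by the second alternative and put back as they were.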
-
@Piotr-Stefański said in Need help with regex for XML removal:
it fails completely
As in it claims to find nothing, or as in it shows “Invalid regular expression” at the bottom?
If the latter, please hover over the three dots to the right of that message and tell us if it says something about the expression being too complex.
-
@Piotr-Stefański said in Need help with regex for XML removal:
I’m trying to find a regex that would allow me to remove all XML instances on <machine …> … </machine> that contain <control type="gambling" … />.
For what it’s worth:
If I were trying to solve this problem for myself, I’d break it into a couple of steps instead of trying to write a single, very clever regular expression.
If it is safe to assume that <control type="gambling" never occurs outside of a <machine …>…</machine> pair, then I would first do:
Find what :
<control type="gambling".*?</machine>
Replace with : </machine DELETE>
Search Mode: Regular expression
. matches newline: checked
Then I’d do:
Find what :
<machine .*?</machine(*SKIP) DELETE>\R
Replace with : (empty)
Search Mode: Regular expression
. matches newline: checked
-
@Coises said in Need help with regex for XML removal:
I’d break it into a couple of steps instead of trying to write a single, very clever regular expression.
Ha, yea. The use of (*SKIP) pretty much makes it just as clever. :-)
Referencing HERE we have a good explanation:
if at current position in string, the regex engine can match the part before (*SKIP) but cannot match the part after (*SKIP), the regex engine discards any further search and the current match attempt just fails. So the regex engine must advance to the location where the zero-width (*SKIP) verb occurs, for a new match attempt
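To make that description concrete, here is a small Python sketch that emulates the effect of (*SKIP) for this one pattern. Python's re module does not support backtracking control verbs, so the loop below is an emulation under stated assumptions, not the engine's real mechanism: the pattern is split by hand at the (*SKIP) position, NL stands in for \R, and the function name delete_marked and the sample text are made up.

```python
import re

NL = r"(?:\r\n|\n|\r)"  # stand-in for \R (assumption)
BEFORE_SKIP = re.compile(r"<machine .*?</machine", re.DOTALL)
AFTER_SKIP = re.compile(r" DELETE>" + NL)

def delete_marked(text: str) -> str:
    """Emulate replacing <machine .*?</machine(*SKIP) DELETE>\\R with nothing."""
    kept = []
    pos = 0
    while True:
        head = BEFORE_SKIP.search(text, pos)
        if head is None:
            kept.append(text[pos:])
            return "".join(kept)
        tail = AFTER_SKIP.match(text, head.end())
        if tail is not None:
            # Both parts matched: drop the whole section.
            kept.append(text[pos:head.start()])
            pos = tail.end()
        else:
            # The part after (*SKIP) failed: no backtracking into .*?,
            # the next attempt starts where (*SKIP) was reached.
            kept.append(text[pos:head.end()])
            pos = head.end()

# Hypothetical, shortened sample: only section "b" carries the marker.
marked = (
    '<machine name="a">\n</machine>\n'
    '<machine name="b">\n</machine DELETE>\n'
    '<machine name="c">\n</machine>\n'
)

with_skip = delete_marked(marked)

# Same pattern without the verb: .*? is free to run on into section "b".
naive = re.sub(r"<machine .*?</machine DELETE>" + NL, "", marked, flags=re.DOTALL)

print(with_skip)
print(naive)
```

With the emulated (*SKIP), only section "b" disappears. Without it, the lazy .*? starting in section "a" keeps extending until it finds the marker, so "a" is deleted as well.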
-
Hello, @piotr-stefański, @alan-kilborn, @coises and All,
@piotr-stefański, instead of the regex
(?s)<(machine)\x20((?!<\1).)+?type="gambling".+?</\1>
preferably use this one:
- SEARCH (?s)<machine\x20(?:(?!<machine).)+?type="gambling".+?</machine>
- REPLACE Leave EMPTY
This new version does not store any capture group. Maybe you’ll get better results ;-))
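As a quick way to check this group-free regex outside Notepad++, here is a minimal Python sketch. The syntax used here is also accepted by Python's re module, but the engine differs from the Boost engine in Notepad++, so complexity limits will not be reproduced. The function name and the shortened sample sections are assumptions.

```python
import re

# (?:(?!<machine).)+? advances one character at a time and refuses to step
# onto the start of another <machine, so a match can never span two sections.
GAMBLING_MACHINE = re.compile(
    r'<machine\x20(?:(?!<machine).)+?type="gambling".+?</machine>',
    re.DOTALL,
)

def remove_gambling_machines(text: str) -> str:
    return GAMBLING_MACHINE.sub("", text)

# Hypothetical, shortened sample sections.
plain = '<machine name="a">\n<input>\n</input>\n</machine>\n'
gambling = '<machine name="b">\n<input>\n<control type="gambling"/>\n</input>\n</machine>\n'
other = '<machine name="c">\n<input>\n<control type="other"/>\n</input>\n</machine>\n'

result = remove_gambling_machines(plain + gambling + other)
print(result)
```

Note that this regex stops at </machine>, so the line break that followed the deleted section stays behind as an empty line.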
Now, let’s start, for example, with this INPUT text:
<machine name="3bagflnz" sourcefile="aristocrat/aristmk4.cpp" cloneof="3bagflvt" romof="3bagflvt" sampleof="3bagflvt">
    <description>3 Bags Full (3VXFC5345, New Zealand)</description>
    <year>1996</year>
    <manufacturer>Aristocrat</manufacturer>
    <rom name="2vas004.u59" merge="2vas004.u59" size="8192" crc="84226547" sha1="df9c2c01a7ac4d930c06a8c4863853ddb1a2adbe" region="maincpu" offset="2000"/>
    <device_ref name="mc6809e"/>
    <sample name="tick"/>
    <input players="1" coins="2">
    </input>
    <driver status="good" emulation="good" savestate="unsupported"/>
</machine>
<machine name="3bagflnz" sourcefile="aristocrat/aristmk4.cpp" cloneof="3bagflvt" romof="3bagflvt" sampleof="3bagflvt">
    <description>3 Bags Full (3VXFC5345, New Zealand)</description>
    <year>1996</year>
    <manufacturer>Aristocrat</manufacturer>
    <rom name="2vas004.u59" merge="2vas004.u59" size="8192" crc="84226547" sha1="df9c2c01a7ac4d930c06a8c4863853ddb1a2adbe" region="maincpu" offset="2000"/>
    <device_ref name="mc6809e"/>
    <sample name="tick"/>
    <input players="1" coins="2">
        <control type="gambling" buttons="21"/>
    </input>
    <driver status="good" emulation="good" savestate="unsupported"/>
</machine>
<machine name="3bagflnz" sourcefile="aristocrat/aristmk4.cpp" cloneof="3bagflvt" romof="3bagflvt" sampleof="3bagflvt">
    <description>3 Bags Full (3VXFC5345, New Zealand)</description>
    <year>1996</year>
    <manufacturer>Aristocrat</manufacturer>
    <rom name="2vas004.u59" merge="2vas004.u59" size="8192" crc="84226547" sha1="df9c2c01a7ac4d930c06a8c4863853ddb1a2adbe" region="maincpu" offset="2000"/>
    <device_ref name="mc6809e"/>
    <sample name="tick"/>
    <input players="1" coins="2">
        <control type="other" buttons="21"/>
    </input>
    <driver status="good" emulation="good" savestate="unsupported"/>
</machine>
<machine name="3bagflnz" sourcefile="aristocrat/aristmk4.cpp" cloneof="3bagflvt" romof="3bagflvt" sampleof="3bagflvt">
    <description>3 Bags Full (3VXFC5345, New Zealand)</description>
    <year>1996</year>
    <manufacturer>Aristocrat</manufacturer>
    <rom name="2vas004.u59" merge="2vas004.u59" size="8192" crc="84226547" sha1="df9c2c01a7ac4d930c06a8c4863853ddb1a2adbe" region="maincpu" offset="2000"/>
    <device_ref name="mc6809e"/>
    <sample name="tick"/>
    <input players="1" coins="2">
        <control type="gambling" buttons="21"/>
    </input>
    <driver status="good" emulation="good" savestate="unsupported"/>
</machine>
With @coises’s method, we first run his first regex S/R:
- SEARCH (?s)<control type="gambling".*?</machine>
- REPLACE </machine DELETE>
To get the temporary text below:
<machine name="3bagflnz" sourcefile="aristocrat/aristmk4.cpp" cloneof="3bagflvt" romof="3bagflvt" sampleof="3bagflvt">
    <description>3 Bags Full (3VXFC5345, New Zealand)</description>
    <year>1996</year>
    <manufacturer>Aristocrat</manufacturer>
    <rom name="2vas004.u59" merge="2vas004.u59" size="8192" crc="84226547" sha1="df9c2c01a7ac4d930c06a8c4863853ddb1a2adbe" region="maincpu" offset="2000"/>
    <device_ref name="mc6809e"/>
    <sample name="tick"/>
    <input players="1" coins="2">
    </input>
    <driver status="good" emulation="good" savestate="unsupported"/>
</machine>
<machine name="3bagflnz" sourcefile="aristocrat/aristmk4.cpp" cloneof="3bagflvt" romof="3bagflvt" sampleof="3bagflvt">
    <description>3 Bags Full (3VXFC5345, New Zealand)</description>
    <year>1996</year>
    <manufacturer>Aristocrat</manufacturer>
    <rom name="2vas004.u59" merge="2vas004.u59" size="8192" crc="84226547" sha1="df9c2c01a7ac4d930c06a8c4863853ddb1a2adbe" region="maincpu" offset="2000"/>
    <device_ref name="mc6809e"/>
    <sample name="tick"/>
    <input players="1" coins="2">
        </machine DELETE>
<machine name="3bagflnz" sourcefile="aristocrat/aristmk4.cpp" cloneof="3bagflvt" romof="3bagflvt" sampleof="3bagflvt">
    <description>3 Bags Full (3VXFC5345, New Zealand)</description>
    <year>1996</year>
    <manufacturer>Aristocrat</manufacturer>
    <rom name="2vas004.u59" merge="2vas004.u59" size="8192" crc="84226547" sha1="df9c2c01a7ac4d930c06a8c4863853ddb1a2adbe" region="maincpu" offset="2000"/>
    <device_ref name="mc6809e"/>
    <sample name="tick"/>
    <input players="1" coins="2">
        <control type="other" buttons="21"/>
    </input>
    <driver status="good" emulation="good" savestate="unsupported"/>
</machine>
<machine name="3bagflnz" sourcefile="aristocrat/aristmk4.cpp" cloneof="3bagflvt" romof="3bagflvt" sampleof="3bagflvt">
    <description>3 Bags Full (3VXFC5345, New Zealand)</description>
    <year>1996</year>
    <manufacturer>Aristocrat</manufacturer>
    <rom name="2vas004.u59" merge="2vas004.u59" size="8192" crc="84226547" sha1="df9c2c01a7ac4d930c06a8c4863853ddb1a2adbe" region="maincpu" offset="2000"/>
    <device_ref name="mc6809e"/>
    <sample name="tick"/>
    <input players="1" coins="2">
        </machine DELETE>
Then, with his second regex S/R:
- SEARCH (?s)<machine .*?</machine(*SKIP) DELETE>\R
- REPLACE Leave EMPTY
We end up with our expected OUTPUT text:
<machine name="3bagflnz" sourcefile="aristocrat/aristmk4.cpp" cloneof="3bagflvt" romof="3bagflvt" sampleof="3bagflvt">
    <description>3 Bags Full (3VXFC5345, New Zealand)</description>
    <year>1996</year>
    <manufacturer>Aristocrat</manufacturer>
    <rom name="2vas004.u59" merge="2vas004.u59" size="8192" crc="84226547" sha1="df9c2c01a7ac4d930c06a8c4863853ddb1a2adbe" region="maincpu" offset="2000"/>
    <device_ref name="mc6809e"/>
    <sample name="tick"/>
    <input players="1" coins="2">
    </input>
    <driver status="good" emulation="good" savestate="unsupported"/>
</machine>
<machine name="3bagflnz" sourcefile="aristocrat/aristmk4.cpp" cloneof="3bagflvt" romof="3bagflvt" sampleof="3bagflvt">
    <description>3 Bags Full (3VXFC5345, New Zealand)</description>
    <year>1996</year>
    <manufacturer>Aristocrat</manufacturer>
    <rom name="2vas004.u59" merge="2vas004.u59" size="8192" crc="84226547" sha1="df9c2c01a7ac4d930c06a8c4863853ddb1a2adbe" region="maincpu" offset="2000"/>
    <device_ref name="mc6809e"/>
    <sample name="tick"/>
    <input players="1" coins="2">
        <control type="other" buttons="21"/>
    </input>
    <driver status="good" emulation="good" savestate="unsupported"/>
</machine>
However, note that this second S/R could also be written without any control verb, using the following S/R:
- SEARCH (?s)<machine (?:(?!</machine).)*?</machine DELETE>\R
- REPLACE Leave EMPTY
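The two-step method can be sketched in Python as follows. Since Python's re module has no (*SKIP), the second step uses the verb-free variant just shown. NL is an assumed stand-in for \R, and the function names and shortened sample sections are hypothetical.

```python
import re

NL = r"(?:\r\n|\n|\r)"  # stand-in for \R (assumption)

def mark_gambling(text: str) -> str:
    # Step 1: from the gambling control to the end of its section,
    # leave only a marked closing tag.
    return re.sub(r'<control type="gambling".*?</machine>', "</machine DELETE>",
                  text, flags=re.DOTALL)

def delete_marked(text: str) -> str:
    # Step 2: delete every section whose closing tag carries the marker.
    # (?:(?!</machine).)*? cannot cross an unmarked closing tag.
    return re.sub(r"<machine (?:(?!</machine).)*?</machine DELETE>" + NL, "",
                  text, flags=re.DOTALL)

# Hypothetical, shortened sample sections.
plain = '<machine name="a">\n<input>\n</input>\n</machine>\n'
gambling = '<machine name="b">\n<input>\n<control type="gambling"/>\n</input>\n</machine>\n'
other = '<machine name="c">\n<input>\n<control type="other"/>\n</input>\n</machine>\n'

temporary = mark_gambling(plain + gambling + other)
result = delete_marked(temporary)
print(result)
```

Keeping the two steps as separate functions makes it easy to inspect the temporary text before anything is deleted.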
REMARK:
Note the fundamental difference between these two regex S/R syntaxes:
- SEARCH (?s)<machine .*?</machine(*SKIP) DELETE>\R
- REPLACE Leave EMPTY
and
- SEARCH (?s)<machine .*?</machine\K DELETE>\R
- REPLACE Leave EMPTY
In the first case:
- IF the \x20DELETE>\R string is found, this whole section is deleted
- IF the \x20DELETE>\R string is NOT found, as the backtracking process cannot occur, this whole section is just ignored
But, in the second case:
- IF the \x20DELETE>\R string is found, only the part \x20DELETE>\R is deleted, due to the \K syntax
- IF the \x20DELETE>\R string is NOT found, NO replacement occurs, due to the \K syntax
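The \K behavior can be approximated in Python, which has no \K, by capturing everything before the \K position and putting it back in the replacement. This is only a sketch under that assumption, with NL standing in for \R and a made-up two-section sample. For contrast, the whole-section deletion is shown with the verb-free regex, since (*SKIP) is not available either.

```python
import re

NL = r"(?:\r\n|\n|\r)"  # stand-in for \R (assumption)

# Hypothetical, shortened sample: section "b" carries the marker.
marked = (
    '<machine name="a">\n</machine>\n'
    '<machine name="b">\n</machine DELETE>\n'
)

# \K emulation: group 1 holds what \K would discard from the match,
# so re-inserting it removes only the " DELETE>" + line-break part.
k_like = re.sub(r"(<machine .*?</machine) DELETE>" + NL, r"\1",
                marked, flags=re.DOTALL)

# Whole-section deletion, for comparison.
whole = re.sub(r"<machine (?:(?!</machine).)*?</machine DELETE>" + NL, "",
               marked, flags=re.DOTALL)

print(repr(k_like))
print(repr(whole))
```

In the \K-style run the text of both sections survives and only the marker, the > and the line break go away, whereas the other run removes section "b" entirely.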
Now, a third alternative would be to simply use the generic regex <What I don't want>(*SKIP)(*F)|<What I want>. Indeed:
- We do NOT want the <machine .......</machine> sections which do not contain the type="gambling" string; these sections are ignored
- We DO want the <machine .......</machine> sections which contain the type="gambling" string, in order to delete these specific sections with the S/R
This leads to the functional regex S/R:
- SEARCH (?s)<machine (?:(?!type="gambling").)+?</machine>(*SKIP)(*F)|<machine .+?</machine>
- REPLACE Leave EMPTY
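Python's re module has no (*SKIP)(*F), but a common workaround gives the same result here: let the unwanted alternative match without a group, capture the wanted alternative, and decide in a callback. The sketch below assumes that workaround; the function name and the shortened sample sections are hypothetical.

```python
import re

PATTERN = re.compile(
    r'<machine (?:(?!type="gambling").)+?</machine>'  # what we do not want: no group
    r"|(<machine .+?</machine>)",                      # what we want: group 1
    re.DOTALL,
)

def remove_gambling(text: str) -> str:
    # Sections matched by the first alternative are put back untouched,
    # which plays the role of (*SKIP)(*F); group 1 matches are deleted.
    return PATTERN.sub(lambda m: m.group(0) if m.group(1) is None else "", text)

# Hypothetical, shortened sample sections.
plain = '<machine name="a">\n<input>\n</input>\n</machine>\n'
gambling = '<machine name="b">\n<input>\n<control type="gambling"/>\n</input>\n</machine>\n'
other = '<machine name="c">\n<input>\n<control type="other"/>\n</input>\n</machine>\n'

result = remove_gambling(plain + gambling + other)
print(result)
```

As with the search regex above, the match ends at </machine>, so the line break after a deleted section remains in the text.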
And, starting with the INPUT text again, we would obtain, once more, our expected OUTPUT :
<machine name="3bagflnz" sourcefile="aristocrat/aristmk4.cpp" cloneof="3bagflvt" romof="3bagflvt" sampleof="3bagflvt">
    <description>3 Bags Full (3VXFC5345, New Zealand)</description>
    <year>1996</year>
    <manufacturer>Aristocrat</manufacturer>
    <rom name="2vas004.u59" merge="2vas004.u59" size="8192" crc="84226547" sha1="df9c2c01a7ac4d930c06a8c4863853ddb1a2adbe" region="maincpu" offset="2000"/>
    <device_ref name="mc6809e"/>
    <sample name="tick"/>
    <input players="1" coins="2">
    </input>
    <driver status="good" emulation="good" savestate="unsupported"/>
</machine>
<machine name="3bagflnz" sourcefile="aristocrat/aristmk4.cpp" cloneof="3bagflvt" romof="3bagflvt" sampleof="3bagflvt">
    <description>3 Bags Full (3VXFC5345, New Zealand)</description>
    <year>1996</year>
    <manufacturer>Aristocrat</manufacturer>
    <rom name="2vas004.u59" merge="2vas004.u59" size="8192" crc="84226547" sha1="df9c2c01a7ac4d930c06a8c4863853ddb1a2adbe" region="maincpu" offset="2000"/>
    <device_ref name="mc6809e"/>
    <sample name="tick"/>
    <input players="1" coins="2">
        <control type="other" buttons="21"/>
    </input>
    <driver status="good" emulation="good" savestate="unsupported"/>
</machine>
Best Regards,
guy038
-
-
@Alan-Kilborn said in Need help with regex for XML removal:
Find:
(?s-i)<machine.+?(?:(type="gambling".+?</machine>\R)|(?:</machine>\R))
Replace:
?{1}:$0
I feel somewhat slighted as @guy038 avoided commenting on my proposed solution. :-(
Mine has a bit of symmetry with:
What_I_don’t_want(*SKIP)(*F)|What_I_want
because I also used an | to specify non-wanted vs. wanted sections.
If a non-wanted section (according to the OP’s definition of what he doesn’t want) appears, I captured it into group1 and then my replacement replaces that whole <machine>…</machine> section with nothing (because there is nothing between the } and the :), otherwise it replaces that section with itself via $0 (thus, keeping it).
-
Thank you everyone, this is a real treasure trove of info.
I’ll experiment with all of the above.
Again, huge thanks!
-
Hi, @alan-kilborn, @piotr-stefański, @coises and All,
@alan-kilborn, I’m rather disappointed that you thought I’d intentionally omitted to comment on your solution :-(
As, at this stage, @coises had found a solution that used control verbs, and, what’s more, you had given him a glowing review, I just focused on his solution !
No, I simply didn’t notice your response. Sorry for this “faux pas” !
So, allow me to use the Free spacing mode again! Thus, your regex S/R can be expressed as:
SEARCH  : (?xs-i) <machine .+? (?: ( type="gambling" .+? </machine> \R ) | </machine> \R )
                                   ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
                                                  Group 1
REPLACE : ?1:$0
And, indeed, you’ve found a very powerful solution because, if we use ?1$0: as the replacement, it inverts the logic and deletes ONLY the sections which do not contain the type="gambling" string!
So, if we assume that:
- BSR = <machine(?:\x20|>)
- ESR = </machine>\R
- FR = type="gambling"
This leads to the generic regex S/R below:
SEARCH (?s-i)BSR.+?(?:(FR.+?ESR)|ESR)
REPLACE RR
And:
- IF RR = ?1:$0, the complete BSR....ESR sections, WITH the FR string, are deleted
- IF RR = ?1$0:, the complete BSR....ESR sections, WITHOUT the FR string, are deleted
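The generic template and its two RR variants can be sketched in Python like this. It is an illustration under assumptions, not the Notepad++ mechanism: the conditional replacement is emulated with a callback, NL stands in for \R, and the function name filter_sections and its delete_with_fr switch are hypothetical.

```python
import re

NL = r"(?:\r\n|\n|\r)"  # stand-in for \R (assumption)

def filter_sections(text: str, bsr: str, esr: str, fr: str,
                    delete_with_fr: bool = True) -> str:
    """Build BSR.+?(?:(FR.+?ESR)|ESR) and emulate the two RR variants.

    delete_with_fr=True  behaves like RR = ?1:$0
    delete_with_fr=False behaves like RR = ?1$0:
    """
    pattern = re.compile(f"{bsr}.+?(?:({fr}.+?{esr})|{esr})", re.DOTALL)

    def repl(m: re.Match) -> str:
        has_fr = m.group(1) is not None
        return "" if has_fr == delete_with_fr else m.group(0)

    return pattern.sub(repl, text)

BSR = r"<machine(?:\x20|>)"
ESR = r"</machine>" + NL
FR = r'type="gambling"'

# Hypothetical, shortened sample sections.
plain = '<machine name="a">\n<input>\n</input>\n</machine>\n'
gambling = '<machine name="b">\n<input>\n<control type="gambling"/>\n</input>\n</machine>\n'
other = '<machine name="c">\n<input>\n<control type="other"/>\n</input>\n</machine>\n'
text = plain + gambling + other

without_gambling = filter_sections(text, BSR, ESR, FR, delete_with_fr=True)
only_gambling = filter_sections(text, BSR, ESR, FR, delete_with_fr=False)
print(without_gambling)
print(only_gambling)
```

The only capturing group in the built pattern is the one around FR.+?ESR, so any group added inside BSR, ESR or FR in this sketch should be non-capturing.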
Thus, Alan, I’m going to add a new BLOG post about this powerful and simple method, soon ;-))
Best Regards,
guy038
-
-
@guy038 Before we get too excited about any of our proposed solutions, I hope we hear more from @Piotr-Stefański, the original poster.
When I tried copying his example data, duplicating and creating a variant with a different control type, and then making many copies of each all in one file, his original regular expression worked.
Unless he reports back to us that one or more of our proposed solutions worked on his actual data, or unless someone else manages to construct an example on which his expression fails and one or more of our solutions works, we don’t know that we have solved anything.
We don’t even know what the original problem was. I’m guessing the “complexity” message, which for some reason seems to be cropping up a lot lately; but he has not confirmed that.
-
@guy038 :
Interesting. I hadn’t thought of it as any sort of “general” solution to a problem!
[ But, really, there already were the makings of a general solution to half of the problem, from you (ref. HERE) ]
A couple of notes:
Note 1:
IF RR = ?1$0:, the complete BSR…ESR sections, WITHOUT the FR string, are deleted
In this variant of the replace expression, the : isn’t necessary, thus:
IF RR = ?1$0, the complete…
Note 2:
Since the overall regex uses group1, it isn’t available in the BSR, ESR, FR and even the RR subexpressions.
Thus a user of this would have to keep in mind that, if he is using further grouping inside these expressions, he has to think in terms of group2 and above.
This is definitely unlike another templated regex solution I use a lot, ref. HERE, where the user does not have to keep this in mind.