How to match all content between two XML tags except if a certain tag occurs between them?
-
I have an XML file used to define test cases for a specific program. I’d like to find all test cases with at least 8 step changes (a step change is characterized by a specific set of tags (<step> etc.), see second test case in example below) and remove the last 3 step changes. There may be any tags between the first 5 step changes but the match must stay within a single test case.
A simplified version of this is to be able to find “[start of test case] followed by [any text except “end of test case”] followed by [next step]”. Which is what I’m trying to solve below (unsuccessfully).
Example text to search in:
<unit-test name="3d.">
    <units>
        <multiset>
            <set action="commit" parameter="variant_field" value="variant1"/>
            <set action="commit" parameter="variant_field2" value="variant2"/>
        </multiset>
        <assert-param-value>
            <parameter>type_field</parameter>
            <value>type1</value>
            <operation>=</operation>
        </assert-param-value>
        <commit>
            <parameter>type_field2</parameter>
            <value>myType</value>
            <accept>false</accept>
        </commit>
        <assert-param-hidden>
            <parameter>type_field5</parameter>
            <hidden>true</hidden>
        </assert-param-hidden>
    </units>
</unit-test>
<unit-test name="3e.">
    <units>
        <assert-attribute-value>
            <attribute>part.subpart.subsubpart.block[3].anotherPart.type</attribute>
            <value>type7</value>
            <operation>=</operation>
        </assert-attribute-value>
        <step>
            <stepName>next</stepName>
        </step>
        <commit>
            <parameter>part_field</parameter>
            <value>part1</value>
            <accept>false</accept>
        </commit>
    </units>
</unit-test>
The following expression successfully matches “[start of test case] followed by [any text] followed by [next step]”, but since I left out <except “end of test case”>, it crosses the test case border if I search from the beginning of the above text.
Note that I use the “. matches newline” setting in the search dialog.
(<unit-test.*?)( *<step>\r\n\ *<stepName>next</stepName>\r\n\ *</step>)
I tried adding a negative lookahead (?!</units>) (strictly speaking, </units> is the second-last tag, but it only occurs at the end of each test case, so it should work just as well as </unit-test> would):
(<unit-test.(?!</units>)*?)( *<step>\r\n\ *<stepName>next</stepName>\r\n\ *</step>)
…but that is an invalid expression according to np++.
This being the first time I use negative lookaheads (or any assertions), I tried rearranging the above expression so that the negative assertion comes after the full .*? instead of between the . and the *?:
(<unit-test.*?(?!</units>))( *<step>\r\n\ *<stepName>next</stepName>\r\n\ *</step>)
…but that matches across the test case border, so it fails to solve the problem.
If anyone could shed some light on how I could solve this, I’d be very happy.
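For reference, the two attempts above differ from the usual idiom only in where the lookahead sits. In the first, the quantifier ends up attached to the lookahead itself, which is presumably why it is rejected; in the second, the lookahead is tested only once, at the spot where the lazy run stops. The common form places the lookahead in front of the dot and repeats the pair: (?:(?!</units>).)*? . Below is a minimal sketch of that difference using Python's re module rather than Notepad++; the shortened sample and the use of \s* instead of \r\n\ * are simplifications made for illustration.

```python
import re

# Two tiny test cases; only the second one contains a <step> block.
SAMPLE = """<unit-test name="3d.">
  <units>
    <commit/>
  </units>
</unit-test>
<unit-test name="3e.">
  <units>
    <step>
      <stepName>next</stepName>
    </step>
  </units>
</unit-test>
"""

STEP = r"<step>\s*<stepName>next</stepName>\s*</step>"

# Plain lazy dot: nothing stops it from running past </units>
# into the next test case.
naive = re.compile(r"(<unit-test.*?)(" + STEP + ")", re.DOTALL)

# Lookahead placed before the dot: a character is consumed only
# if "</units>" does not start at that position.
tempered = re.compile(
    r"(<unit-test(?:(?!</units>).)*?)(" + STEP + ")", re.DOTALL
)

naive_match = naive.search(SAMPLE)
tempered_match = tempered.search(SAMPLE)

print("</units>" in naive_match.group(1))     # True
print("</units>" in tempered_match.group(1))  # False
```

With the lookahead in front of the dot, the first test case cannot produce a match at all, so the search moves on and the match begins at the second <unit-test>.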
-
@Elias-Mossholm said in How to match all content between two XML tags except if a certain tag occurs between them?:
I’d like to find all test cases with at least 8 step changes (a step change is characterized by a specific set of tags (<step> etc.), see second test case in example below) and remove the last 3 step changes.
Maybe I’m just not understanding…
You say “at least 8 step changes” but then you show sample data that has only one step (defined by <step>...</step>)?
I suppose it can be solved from the description, but it is odd that your sample data would not be affected by the solution.
-
You say “at least 8 step changes” but then you show sample data that has only one step (defined by <step>…</step>)?
Maybe that was a bit unclear.
The number of step changes I need to have in the match should be easy to specify with a {n,n} quantifier or by just repeating the same segment n times. I’ve already built regex strings that match that correctly.
The problem I haven’t been able to solve is how to only find matches that don’t include the end tag </units>.
-
Hello, @elias-mossholm, @alan-kilborn and All,
Here is a regex which selects all contents of any <unit-test name="xxxx">.......</unit-test> block which contains ONLY between N and M <step>............</step> block(s), like below:
<step>
    <stepName>XXXX</stepName>
</step>
or
<step><stepName>YYYY</stepName></step>
SEARCH
(?s)^\h*<unit-test(?:(((?!</?unit-test|</?step>).)+?)<step>(?1)</step>){N,M}(?1)</unit-test>\R
Of course, you must replace the N and M variables with the appropriate integers: {2,4}, {1,}, {0,3}, {2} or even {0}!
You may try this regex against the sample text below:
<unit-test name="3d.">
    <!-- 1 BLOCK -->
    <units>
        <multiset>
            <set action="commit" parameter="variant_field" value="variant1"/>
            <set action="commit" parameter="variant_field2" value="variant2"/>
        </multiset>
        <assert-param-value>
            <parameter>type_field</parameter>
            <value>type1</value>
            <operation>=</operation>
        </assert-param-value>
        <commit>
            <parameter>type_field2</parameter>
            <value>myType</value>
            <accept>false</accept>
        </commit>
        <step><stepName>AAAA</stepName></step>
        <assert-param-hidden>
            <parameter>type_field5</parameter>
            <hidden>true</hidden>
        </assert-param-hidden>
    </units>
</unit-test>
<unit-test name="3e.">
    <!-- 4 BLOCKS -->
    <units>
        <assert-attribute-value>
            <attribute>part.subpart.subsubpart.block[3].anotherPart.type</attribute>
            <value>type7</value>
            <operation>=</operation>
        </assert-attribute-value>
        <step>
            <stepName>BBBB</stepName>
        </step>
        <commit>
            <parameter>part_field</parameter>
            <value>part1</value>
            <accept>false</accept>
        </commit>
        <assert-attribute-value>
            <attribute>part.subpart.subsubpart.block[3].anotherPart.type</attribute>
            <value>type7</value>
            <operation>=</operation>
        </assert-attribute-value>
        <step>
            <stepName>CCCC</stepName>
        </step>
        <commit>
            <parameter>part_field</parameter>
            <value>part1</value>
            <accept>false</accept>
        </commit>
        <assert-attribute-value>
            <attribute>part.subpart.subsubpart.block[3].anotherPart.type</attribute>
            <value>type7</value>
            <operation>=</operation>
        </assert-attribute-value>
        <step>
            <stepName>DDDD</stepName>
        </step>
        <commit>
            <parameter>part_field</parameter>
            <value>part1</value>
            <accept>false</accept>
        </commit>
        <assert-attribute-value>
            <attribute>part.subpart.subsubpart.block[3].anotherPart.type</attribute>
            <value>type7</value>
            <operation>=</operation>
        </assert-attribute-value>
        <step>
            <stepName>EEEE</stepName>
        </step>
        <commit>
            <parameter>part_field</parameter>
            <value>part1</value>
            <accept>false</accept>
        </commit>
    </units>
</unit-test>
<unit-test name="3f.">
    <!-- 0 BLOCK -->
    <units>
        <multiset>
            <set action="commit" parameter="variant_field" value="variant1"/>
            <set action="commit" parameter="variant_field2" value="variant2"/>
        </multiset>
        <commit>
            <parameter>type_field2</parameter>
            <value>myType</value>
            <accept>false</accept>
        </commit>
        <assert-param-hidden>
            <parameter>type_field5</parameter>
            <hidden>true</hidden>
        </assert-param-hidden>
    </units>
</unit-test>
<unit-test name="3g.">
    <!-- 2 consecutive BLOCKS -->
    <units>
        <multiset>
            <set action="commit" parameter="variant_field" value="variant1"/>
            <set action="commit" parameter="variant_field2" value="variant2"/>
        </multiset>
        <step>
            <stepName>FFFF</stepName>
        </step>
        <step>
            <stepName>GGGG</stepName>
        </step>
        <assert-param-value>
            <parameter>type_field</parameter>
            <value>type1</value>
            <operation>=</operation>
        </assert-param-value>
        <commit>
            <parameter>type_field2</parameter>
            <value>myType</value>
            <accept>false</accept>
        </commit>
        <assert-param-hidden>
            <parameter>type_field5</parameter>
            <hidden>true</hidden>
        </assert-param-hidden>
    </units>
</unit-test>
- Before each <units> line, I inserted an XML comment, where I noted the number of <step>......</step> blocks of each <unit-test name="xxxx">......</unit-test> block of my example
- This regex, quite complex, can be decomposed, using the free-spacing mode (?x), as:
(?xs)                                 # FREE-SPACING and SINGLE-LINE modes
^\h*<unit-test                        # String "<unit-test", preceded with some HORIZONTAL BLANK character(s)
(?:                                   # Beginning of a NON-CAPTURING group
    (((?!</?unit-test|</?step>).)+?)  # SHORTEST NON-0 range of any char, NOT crossing "</?unit-test" nor "</?step>", and stored as GROUP 1
    <step>                            # ...till the STRING "<step>"
    (?1)                              # CALL of the regex SUB-ROUTINE, stored in GROUP 1, so the regex : ((?!</?unit-test|</?step>).)+?
    </step>                           # ...till the STRING "</step>"
){N,M}                                # DESIRED number of "<step>...</step>" ranges, between N and M, in a SINGLE "<unit-test...</unit-test>" block
(?1)                                  # CALL of the regex SUB-ROUTINE, stored in GROUP 1, so the regex : ((?!</?unit-test|</?step>).)+?
</unit-test>\R                        # STRING "</unit-test>" with its LINE-BREAK
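For readers who want to try the same structure outside Notepad++, here is a rough Python sketch. Python's built-in re module does not offer the (?1) sub-routine call, so the shared sub-pattern is kept in a string (named FILLER here) and pasted in three times. The function name, the capture of the name attribute, the non-capturing form of the filler, the omission of the trailing \R and the shortened sample are all choices made for this illustration, not part of the regex above.

```python
import re

# Shortest non-empty run of characters that never starts a
# <unit-test, </unit-test, <step> or </step> string.
FILLER = r"(?:(?!</?unit-test|</?step>).)+?"


def unit_tests_with_steps(text, n, m):
    """Return the name of each <unit-test> holding between n and m <step> blocks."""
    quantifier = "{%d,%d}" % (n, m)
    pattern = re.compile(
        r'^[ \t]*<unit-test name="([^"]*)"'
        + "(?:" + FILLER + "<step>" + FILLER + "</step>)" + quantifier
        + FILLER + "</unit-test>",
        re.DOTALL | re.MULTILINE,
    )
    return [match.group(1) for match in pattern.finditer(text)]


SAMPLE = """<unit-test name="a">
  <units>
    <commit/>
  </units>
</unit-test>
<unit-test name="b">
  <units>
    <step><stepName>AAAA</stepName></step>
    <commit/>
  </units>
</unit-test>
<unit-test name="c">
  <units>
    <step>
      <stepName>BBBB</stepName>
    </step>
    <step>
      <stepName>CCCC</stepName>
    </step>
  </units>
</unit-test>
"""

print(unit_tests_with_steps(SAMPLE, 1, 2))  # ['b', 'c']
print(unit_tests_with_steps(SAMPLE, 0, 0))  # ['a']
print(unit_tests_with_steps(SAMPLE, 2, 2))  # ['c']
```

Because the filler can never step over a <step> or </step> string, a test case with more than m blocks cannot match at all, which is what gives the "ONLY between N and M" behaviour.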
Remark: just note that, in order to shorten the overall regex, the part ((?!</?unit-test|</?step>).)+?, stored as group 1, and which represents the shortest non-null range of any char, not crossing the </?unit-test string nor the </?step> string, is re-used two times, thanks to the sub-routine call syntax (?1)!
Best Regards,
guy038
-
-
Thank you @guy038!
-
Hi @guy038
Is there a way to find below pattern with Regex?
Log file
<Text> ns:="https://www.example.com"
    <Error>
        <id>ex8359693589435834583985934583495</id>
        <ErrorItem>
            <id>slak;jdk;asjdklasjdklasjdfhkldj;sfjdsf</id>
            <code>404</code>
            <description>External> failed messages multiple line of detials </description>
            <reference>/</reference>
        </ErrorItem>
    </Error>
    <InformationLog>
        <cccpInformation>
            <description>External> failed messages multiple line of detials 2 </description>
            <Place>
                <id>988475748848758478545</id>
            </Place>
        </cccpInformation>
    </InformationLog>
</Text>
Basically, I want to capture only the <description> tag inside <ErrorItem></ErrorItem> and nothing else. The logs also contain <description> tags at other levels of the document.
I can achieve some basic matching using something like (?s)<ErrorItem>(.*?)<\/description> but it will select everything inside <ErrorItem> </ErrorItem>
-
@DesAWSume said in How to match all content between two XML tags except if a certain tag occurs between them?:
I only want to capture the <description> tag only in <ErrorItem></ErrorItem>
See HERE.
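One possible approach, sketched here in Python and not necessarily what the linked answer proposes, reuses the lookahead-before-dot idea from earlier in this thread: walk forward from <ErrorItem> without ever crossing </ErrorItem>, then capture only the text between <description> and </description>. The log below is a shortened, partly made-up version of the one in the question (the ids and description text are placeholders), and LOG, PATTERN and descriptions are illustrative names.

```python
import re

# Abbreviated stand-in for the log in the question.
LOG = """<Text>
  <Error>
    <id>ex8359</id>
    <ErrorItem>
      <id>item-1</id>
      <code>404</code>
      <description>External> failed messages
multiple lines of details
      </description>
      <reference>/</reference>
    </ErrorItem>
  </Error>
  <InformationLog>
    <cccpInformation>
      <description>External> failed messages 2</description>
    </cccpInformation>
  </InformationLog>
</Text>
"""

PATTERN = re.compile(
    r"<ErrorItem>"
    r"(?:(?!</ErrorItem>).)*?"       # move forward, but never past </ErrorItem>
    r"<description>"
    r"((?:(?!</description>).)*)"    # group 1: the description text only
    r"</description>",
    re.DOTALL,
)

# Keep only the captured description, trimmed of surrounding whitespace.
descriptions = [m.group(1).strip() for m in PATTERN.finditer(LOG)]
print(len(descriptions))  # 1
print(descriptions[0])
```

The <description> under <cccpInformation> is skipped because the pattern has to start at an <ErrorItem> and may not leave it before reaching <description>.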