How to match all content between two XML tags except if a certain tag occurs between them?



  • I have an XML file used to define test cases for a specific program. I’d like to find all test cases with at least 8 step changes (a step change is characterized by a specific set of tags (<step> etc.), see second test case in example below) and remove the last 3 step changes. There may be any tags between the first 5 step changes but the match must stay within a single test case.

    A simplified version of this is to find “[start of test case] followed by [any text except “end of test case”] followed by [next step]”. That simplified version is what I’m trying (unsuccessfully) to solve below.

    Example text to search in:

    <unit-test name="3d.">
        <units>
            <multiset>
                <set action="commit" parameter="variant_field" value="variant1"/>
                <set action="commit" parameter="variant_field2" value="variant2"/>
            </multiset>
            <assert-param-value>
                <parameter>type_field</parameter>
                <value>type1</value>
                <operation>=</operation>
            </assert-param-value>
            <commit>
                <parameter>type_field2</parameter>
                <value>myType</value>
                <accept>false</accept>
            </commit>
            <assert-param-hidden>
                <parameter>type_field5</parameter>
                <hidden>true</hidden>
            </assert-param-hidden>
        </units>
    </unit-test>
    <unit-test name="3e.">
        <units>
            <assert-attribute-value>
                <attribute>part.subpart.subsubpart.block[3].anotherPart.type</attribute>
                <value>type7</value>
                <operation>=</operation>
            </assert-attribute-value>
            <step>
                <stepName>next</stepName>
            </step>
            <commit>
                <parameter>part_field</parameter>
                <value>part1</value>
                <accept>false</accept>
            </commit>
        </units>
    </unit-test>
    

    The following expression successfully matches “[start of test case] followed by [any text] followed by [next step]”, but since it leaves out the [except “end of test case”] condition, it crosses the test case border if I search from the beginning of the above text.
    Note that I use the “. matches newline” setting in the search dialog.

    (<unit-test.*?)( *<step>\r\n\ *<stepName>next</stepName>\r\n\ *</step>)
    

    I tried adding a negative lookahead (?!</units>) (strictly speaking </units> is the second-last tag, but it only occurs at the end of each test case so it should work just as well as </unit-test> would):

    (<unit-test.(?!</units>)*?)( *<step>\r\n\ *<stepName>next</stepName>\r\n\ *</step>)
    

    …but that is an invalid expression according to Notepad++.

    This being the first time I have used negative lookaheads (or any assertions), I tried rearranging the above expression so that the negative assertion comes after the full .*? instead of between the . and the *?:

    (<unit-test.*?(?!</units>))( *<step>\r\n\ *<stepName>next</stepName>\r\n\ *</step>)
    

    …but that matches across the test case border, so it fails to solve the problem.

    If anyone could shed some light on how I could solve this, I’d be very happy.



  • @Elias-Mossholm said in How to match all content between two XML tags except if a certain tag occurs between them?:

    I’d like to find all test cases with at least 8 step changes (a step change is characterized by a specific set of tags (<step> etc.), see second test case in example below) and remove the last 3 step changes.

    Maybe I’m just not understanding…
    You say “at least 8 step changes” but then you show sample data that has only one step (defined by <step>...</step>)?
    I suppose it can be solved from the description, but it is odd that your sample data would not be affected by the solution.



  • You say “at least 8 step changes” but then you show sample data that has only one step (defined by <step>…</step>)?

    Maybe that was a bit unclear.

    The number of step changes I need in the match should be easy to specify with a {n,n} quantifier or by just repeating the same segment n times. I’ve already built regex strings that match that correctly.

    The problem I haven’t been able to solve is how to find only matches that don’t include the end tag </units>.



  • Hello, @elias-mossholm, @alan-kilborn and All,

    Here is a regex that selects the entire contents of any <unit-test name="xxxx">.......</unit-test> block containing between N and M <step>............</step> blocks, and no more than that, like the ones below:

            <step>
                <stepName>XXXX</stepName>
            </step>
    

    or

            <step><stepName>YYYY</stepName></step>
    

    SEARCH (?s)^\h*<unit-test(?:(((?!</?unit-test|</?step>).)+?)<step>(?1)</step>){N,M}(?1)</unit-test>\R

    Of course, you must replace the N and M placeholders with the appropriate integers: {2,4}, {1,}, {0,3}, {2} or even {0}! For the “at least 8 step changes” case in the question, that would be {8,}.

    You may try this regex against the sample text below:

    <unit-test name="3d.">
    <!--                        1 BLOCK -->
        <units>
            <multiset>
                <set action="commit" parameter="variant_field" value="variant1"/>
                <set action="commit" parameter="variant_field2" value="variant2"/>
            </multiset>
            <assert-param-value>
                <parameter>type_field</parameter>
                <value>type1</value>
                <operation>=</operation>
            </assert-param-value>
            <commit>
                <parameter>type_field2</parameter>
                <value>myType</value>
                <accept>false</accept>
            </commit>
            <step><stepName>AAAA</stepName></step>
            <assert-param-hidden>
                <parameter>type_field5</parameter>
                <hidden>true</hidden>
            </assert-param-hidden>
        </units>
    </unit-test>
    <unit-test name="3e.">
    <!--                        4 BLOCKS -->
        <units>
            <assert-attribute-value>
                <attribute>part.subpart.subsubpart.block[3].anotherPart.type</attribute>
                <value>type7</value>
                <operation>=</operation>
            </assert-attribute-value>
            <step>
                <stepName>BBBB</stepName>
            </step>
            <commit>
                <parameter>part_field</parameter>
                <value>part1</value>
                <accept>false</accept>
            </commit>
            <assert-attribute-value>
                <attribute>part.subpart.subsubpart.block[3].anotherPart.type</attribute>
                <value>type7</value>
                <operation>=</operation>
            </assert-attribute-value>
            <step>
                <stepName>CCCC</stepName>
            </step>
            <commit>
                <parameter>part_field</parameter>
                <value>part1</value>
                <accept>false</accept>
            </commit>
            <assert-attribute-value>
                <attribute>part.subpart.subsubpart.block[3].anotherPart.type</attribute>
                <value>type7</value>
                <operation>=</operation>
            </assert-attribute-value>
            <step>
                <stepName>DDDD</stepName>
            </step>
            <commit>
                <parameter>part_field</parameter>
                <value>part1</value>
                <accept>false</accept>
            </commit>
            <assert-attribute-value>
                <attribute>part.subpart.subsubpart.block[3].anotherPart.type</attribute>
                <value>type7</value>
                <operation>=</operation>
            </assert-attribute-value>
            <step>
                <stepName>EEEE</stepName>
            </step>
            <commit>
                <parameter>part_field</parameter>
                <value>part1</value>
                <accept>false</accept>
            </commit>
        </units>
    </unit-test>
    <unit-test name="3f.">
    <!--                        0 BLOCK -->
        <units>
            <multiset>
                <set action="commit" parameter="variant_field" value="variant1"/>
                <set action="commit" parameter="variant_field2" value="variant2"/>
            </multiset>
            <commit>
                <parameter>type_field2</parameter>
                <value>myType</value>
                <accept>false</accept>
            </commit>
            <assert-param-hidden>
                <parameter>type_field5</parameter>
                <hidden>true</hidden>
            </assert-param-hidden>
        </units>
    </unit-test>
    <unit-test name="3g.">
    <!--                        2 consecutive BLOCKS -->
        <units>
            <multiset>
                <set action="commit" parameter="variant_field" value="variant1"/>
                <set action="commit" parameter="variant_field2" value="variant2"/>
            </multiset>
            <step>
                <stepName>FFFF</stepName>
            </step>
            <step>
                <stepName>GGGG</stepName>
            </step>
            <assert-param-value>
                <parameter>type_field</parameter>
                <value>type1</value>
                <operation>=</operation>
            </assert-param-value>
            <commit>
                <parameter>type_field2</parameter>
                <value>myType</value>
                <accept>false</accept>
            </commit>
            <assert-param-hidden>
                <parameter>type_field5</parameter>
                <hidden>true</hidden>
            </assert-param-hidden>
        </units>
    </unit-test>
    

    • Before each <units> line, I inserted an XML comment noting how many <step>......</step> blocks the enclosing <unit-test name="xxxx">......</unit-test> block of my example contains

    • This regex is quite complex, but it can be decomposed, using the free-spacing mode (?x), as follows:

    (?xs)                             # FREE-SPACING and SINGLE-LINE modes
    ^\h*<unit-test                    # String "<unit-test", preceded by OPTIONAL HORIZONTAL BLANK character(s)
    (?:                               # Beginning of a NON-CAPTURING group
    (((?!</?unit-test|</?step>).)+?)  # SHORTEST NON-0 range of any char, NOT crossing "</?unit-test" nor "</?step>", and stored as GROUP 1
    <step>                            # ...till the STRING "<step>"
    (?1)                              # CALL of the regex SUB-ROUTINE, stored in GROUP 1, so the regex :  ((?!</?unit-test|</?step>).)+?
    </step>                           # ...till the STRING "</step>"
    ){N,M}                            # DESIRED number of "<step>...</step>" ranges, between N and M, in a SINGLE "<unit-test...</unit-test>" block
    (?1)                              # CALL of the regex SUB-ROUTINE, stored in GROUP 1, so the regex :  ((?!</?unit-test|</?step>).)+?
    </unit-test>\R                    # STRING "</unit-test>" with its LINE-BREAK
    

    Remark: to shorten the overall regex, the part ((?!</?unit-test|</?step>).)+?, stored as group 1 and representing the shortest non-empty range of characters that crosses neither a </?unit-test string nor a </?step> string, is re-used twice, thanks to the sub-routine call syntax (?1)!
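
    For anyone who wants to try the same idea outside Notepad++, here is a minimal Python sketch. It is only an approximate port, under a few assumptions: Python's built-in re module has no (?1) sub-routine calls and no \h or \R escapes, so the group 1 sub-pattern is pasted in three times from a string (called GAP here), \h* becomes [ \t]* and \R becomes an explicit line-break alternation. The names GAP, build_pattern, matching_names and the shortened SAMPLE text are hypothetical, chosen for this illustration only.

```python
import re

# Tempered dot: shortest non-empty run of characters that never begins a
# <unit-test, </unit-test, <step> or </step> tag (the job of group 1 above).
GAP = r"(?:(?!</?unit-test|</?step>).)+?"


def build_pattern(n, m):
    """Compile a pattern for a whole <unit-test> block holding n..m <step> blocks.

    Pass m=None for an open upper bound, i.e. the {n,} quantifier.
    """
    upper = "" if m is None else str(m)
    return re.compile(
        r"^[ \t]*<unit-test"
        r"(?:" + GAP + r"<step>" + GAP + r"</step>){" + str(n) + "," + upper + "}"
        + GAP + r"</unit-test>(?:\r\n|\n|\r)",
        re.DOTALL | re.MULTILINE,  # "." crosses line breaks, "^" matches at line starts
    )


# Shortened, hypothetical sample: blocks "a", "b" and "c" hold 1, 2 and 0 steps.
SAMPLE = (
    '<unit-test name="a">\n'
    '    <units>\n'
    '        <step><stepName>AAAA</stepName></step>\n'
    '    </units>\n'
    '</unit-test>\n'
    '<unit-test name="b">\n'
    '    <units>\n'
    '        <step>\n'
    '            <stepName>BBBB</stepName>\n'
    '        </step>\n'
    '        <commit/>\n'
    '        <step>\n'
    '            <stepName>CCCC</stepName>\n'
    '        </step>\n'
    '    </units>\n'
    '</unit-test>\n'
    '<unit-test name="c">\n'
    '    <units>\n'
    '        <commit/>\n'
    '    </units>\n'
    '</unit-test>\n'
)


def matching_names(n, m):
    """Return the name attribute of every block in SAMPLE with n..m steps."""
    names = []
    for match in build_pattern(n, m).finditer(SAMPLE):
        names.append(re.search(r'name="([^"]*)"', match.group(0)).group(1))
    return names


print(matching_names(2, 4))     # blocks with 2 to 4 steps
print(matching_names(1, None))  # blocks with at least 1 step
```

    Because the tempered dot refuses to step onto a <step> or </unit-test tag, a match can neither absorb an extra step nor run into the next test case, which is what keeps each match inside a single block.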

    Best Regards,

    guy038



  • Thank you @guy038!

