    How to match all content between two XML tags except if a certain tag occurs between them?

    • Elias Mossholm

      I have an XML file used to define test cases for a specific program. I’d like to find all test cases with at least 8 step changes (a step change is characterized by a specific set of tags (<step> etc.), see second test case in example below) and remove the last 3 step changes. There may be any tags between the first 5 step changes but the match must stay within a single test case.

      A simplified version of this is to find “[start of test case] followed by [any text except “end of test case”] followed by [next step]”, which is what I’m trying to solve below (unsuccessfully).

      Example text to search in:

      <unit-test name="3d.">
          <units>
              <multiset>
                  <set action="commit" parameter="variant_field" value="variant1"/>
                  <set action="commit" parameter="variant_field2" value="variant2"/>
              </multiset>
              <assert-param-value>
                  <parameter>type_field</parameter>
                  <value>type1</value>
                  <operation>=</operation>
              </assert-param-value>
              <commit>
                  <parameter>type_field2</parameter>
                  <value>myType</value>
                  <accept>false</accept>
              </commit>
              <assert-param-hidden>
                  <parameter>type_field5</parameter>
                  <hidden>true</hidden>
              </assert-param-hidden>
          </units>
      </unit-test>
      <unit-test name="3e.">
          <units>
              <assert-attribute-value>
                  <attribute>part.subpart.subsubpart.block[3].anotherPart.type</attribute>
                  <value>type7</value>
                  <operation>=</operation>
              </assert-attribute-value>
              <step>
                  <stepName>next</stepName>
              </step>
              <commit>
                  <parameter>part_field</parameter>
                  <value>part1</value>
                  <accept>false</accept>
              </commit>
          </units>
      </unit-test>
      

      The following expression successfully matches “[start of test case] followed by [any text] followed by [next step]”, but since I left out <except “end of test case”>, it crosses the test case border if I search from the beginning of the above text.
      Note that I use the “. matches newline” setting in the search dialog.

      (<unit-test.*?)( *<step>\r\n\ *<stepName>next</stepName>\r\n\ *</step>)
      

      I tried adding a negative lookahead (?!</units>) (strictly speaking </units> is the second-last tag, but it only occurs at the end of each test case so it should work just as well as </unit-test> would):

      (<unit-test.(?!</units>)*?)( *<step>\r\n\ *<stepName>next</stepName>\r\n\ *</step>)
      

      …but that is an invalid expression according to np++.

      This being the first time I use negative lookaheads (or any assertions), I tried rearranging the above expression so that the negative assertion comes after the full .*? instead of between the . and the *?:

      (<unit-test.*?(?!</units>))( *<step>\r\n\ *<stepName>next</stepName>\r\n\ *</step>)
      

      …but that matches across the test case border, so it fails to solve the problem.

      If anyone could shed some light on how I could solve this, I’d be very happy.
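
A note on the two attempts above: in (?!</units>)*? the quantifier is applied to the lookahead itself, which is presumably why the expression is rejected, and in .*?(?!</units>) the lookahead is tested only once, at the spot where the lazy run ends. The usual way around this is to test the lookahead before every single character: (?:(?!</units>).)*?. The following is only a minimal sketch of that idea, written for Python's re module rather than the Notepad++ search dialog; the shortened sample text and the names STEP, crossing and tempered are assumptions made up for the illustration.

```python
import re

# Two tiny test cases; only the second one contains a <step> block.
xml = (
    '<unit-test name="3d.">\n'
    '    <units>\n'
    '        <commit/>\n'
    '    </units>\n'
    '</unit-test>\n'
    '<unit-test name="3e.">\n'
    '    <units>\n'
    '        <step>\n'
    '            <stepName>next</stepName>\n'
    '        </step>\n'
    '    </units>\n'
    '</unit-test>\n'
)

STEP = r'<step>\s*<stepName>next</stepName>\s*</step>'

# Plain lazy dot: free to run past </units> into the next test case.
crossing = re.compile(r'(<unit-test.*?)(' + STEP + r')', re.DOTALL)

# Tempered dot: the lookahead is checked before each character,
# so the run can never step onto the start of "</units>".
tempered = re.compile(r'(<unit-test(?:(?!</units>).)*?)(' + STEP + r')', re.DOTALL)

m_cross = crossing.search(xml)
m_temp = tempered.search(xml)

print(m_cross.group(1).count('<unit-test'))  # 2: the match spans both test cases
print(m_temp.group(1).count('<unit-test'))   # 1: the match starts at test case 3e.
```

With the tempered version, the search simply fails on the first test case and starts again at the next <unit-test, which is the behaviour asked for.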

      • Alan Kilborn @Elias Mossholm

        @Elias-Mossholm said in How to match all content between two XML tags except if a certain tag occurs between them?:

        I’d like to find all test cases with at least 8 step changes (a step change is characterized by a specific set of tags (<step> etc.), see second test case in example below) and remove the last 3 step changes.

        Maybe I’m just not understanding…
        You say “at least 8 step changes” but then you show sample data that has only one step (defined by <step>...</step>)?
        I suppose it can be solved from the description, but it is odd that your sample data would not be affected by the solution.

        • Elias Mossholm

          You say “at least 8 step changes” but then you show sample data that has only one step (defined by <step>…</step>)?

          Maybe that was a bit unclear.

          The number of step changes I need to have in the match should be easy to specify with a {n,n} quantifier or by just repeating the same segment n times. I’ve already built regex strings that match that correctly.

          The problem I haven’t been able to solve is how to find only matches that don’t include the end tag </units>.
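
For the overall goal (test cases with at least 8 step changes, last 3 removed), here is a rough sketch of one way it could be done outside the editor, in Python. It assumes that a "step change" is just one <step>...</step> element, which may be narrower than the real set of tags; the names trim_steps, make_case, TEST_CASE and STEP are hypothetical, and the sample data is generated only for the demonstration.

```python
import re

# One whole test case; the tempered dot cannot run past </unit-test>.
TEST_CASE = re.compile(r'<unit-test\b(?:(?!</unit-test>).)*</unit-test>', re.DOTALL)
# One step block, with the indentation before it and the line break after it.
STEP = re.compile(r'[ \t]*<step>(?:(?!</step>).)*</step>[ \t]*\r?\n?', re.DOTALL)

def trim_steps(text, min_steps=8, drop=3):
    """Remove the last `drop` step blocks from every test case
    that contains at least `min_steps` of them."""
    def fix(case_match):
        case = case_match.group(0)
        steps = list(STEP.finditer(case))
        if drop <= 0 or len(steps) < min_steps:
            return case
        # Delete from the end so that earlier offsets stay valid.
        for s in reversed(steps[len(steps) - drop:]):
            case = case[:s.start()] + case[s.end():]
        return case
    return TEST_CASE.sub(fix, text)

def make_case(name, n):
    """Build a made-up test case holding n step blocks."""
    steps = ''.join(
        '        <step>\n            <stepName>s%d</stepName>\n        </step>\n' % i
        for i in range(1, n + 1)
    )
    return '<unit-test name="%s">\n    <units>\n%s    </units>\n</unit-test>\n' % (name, steps)

sample = make_case('short', 2) + make_case('long', 8)
result = trim_steps(sample)
print(result.count('<step>'))  # 7: 2 untouched + 8 reduced to 5
```

Matching each test case first and counting inside it keeps the "stay within one test case" rule separate from the counting, so neither regex has to do both jobs.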

          • guy038

            Hello, @elias-mossholm, @alan-kilborn and All,

            Here is a regex which selects the whole contents of any <unit-test name="xxxx">.......</unit-test> block that contains ONLY between N and M <step>............</step> block(s), in either of the forms below:

                    <step>
                        <stepName>XXXX</stepName>
                    </step>
            

            or

                    <step><stepName>YYYY</stepName></step>
            

            SEARCH (?s)^\h*<unit-test(?:(((?!</?unit-test|</?step>).)+?)<step>(?1)</step>){N,M}(?1)</unit-test>\R

            Of course, you must replace the {N,M} quantifier with the appropriate integers: {2,4}, {1,}, {0,3}, {2} or even {0}!

            You may try this regex against the sample text below:

            <unit-test name="3d.">
            <!--                        1 BLOCK -->
                <units>
                    <multiset>
                        <set action="commit" parameter="variant_field" value="variant1"/>
                        <set action="commit" parameter="variant_field2" value="variant2"/>
                    </multiset>
                    <assert-param-value>
                        <parameter>type_field</parameter>
                        <value>type1</value>
                        <operation>=</operation>
                    </assert-param-value>
                    <commit>
                        <parameter>type_field2</parameter>
                        <value>myType</value>
                        <accept>false</accept>
                    </commit>
                    <step><stepName>AAAA</stepName></step>
                    <assert-param-hidden>
                        <parameter>type_field5</parameter>
                        <hidden>true</hidden>
                    </assert-param-hidden>
                </units>
            </unit-test>
            <unit-test name="3e.">
            <!--                        4 BLOCKS -->
                <units>
                    <assert-attribute-value>
                        <attribute>part.subpart.subsubpart.block[3].anotherPart.type</attribute>
                        <value>type7</value>
                        <operation>=</operation>
                    </assert-attribute-value>
                    <step>
                        <stepName>BBBB</stepName>
                    </step>
                    <commit>
                        <parameter>part_field</parameter>
                        <value>part1</value>
                        <accept>false</accept>
                    </commit>
                    <assert-attribute-value>
                        <attribute>part.subpart.subsubpart.block[3].anotherPart.type</attribute>
                        <value>type7</value>
                        <operation>=</operation>
                    </assert-attribute-value>
                    <step>
                        <stepName>CCCC</stepName>
                    </step>
                    <commit>
                        <parameter>part_field</parameter>
                        <value>part1</value>
                        <accept>false</accept>
                    </commit>
                    <assert-attribute-value>
                        <attribute>part.subpart.subsubpart.block[3].anotherPart.type</attribute>
                        <value>type7</value>
                        <operation>=</operation>
                    </assert-attribute-value>
                    <step>
                        <stepName>DDDD</stepName>
                    </step>
                    <commit>
                        <parameter>part_field</parameter>
                        <value>part1</value>
                        <accept>false</accept>
                    </commit>
                    <assert-attribute-value>
                        <attribute>part.subpart.subsubpart.block[3].anotherPart.type</attribute>
                        <value>type7</value>
                        <operation>=</operation>
                    </assert-attribute-value>
                    <step>
                        <stepName>EEEE</stepName>
                    </step>
                    <commit>
                        <parameter>part_field</parameter>
                        <value>part1</value>
                        <accept>false</accept>
                    </commit>
                </units>
            </unit-test>
            <unit-test name="3f.">
            <!--                        0 BLOCK -->
                <units>
                    <multiset>
                        <set action="commit" parameter="variant_field" value="variant1"/>
                        <set action="commit" parameter="variant_field2" value="variant2"/>
                    </multiset>
                    <commit>
                        <parameter>type_field2</parameter>
                        <value>myType</value>
                        <accept>false</accept>
                    </commit>
                    <assert-param-hidden>
                        <parameter>type_field5</parameter>
                        <hidden>true</hidden>
                    </assert-param-hidden>
                </units>
            </unit-test>
            <unit-test name="3g.">
            <!--                        2 consecutive BLOCKS -->
                <units>
                    <multiset>
                        <set action="commit" parameter="variant_field" value="variant1"/>
                        <set action="commit" parameter="variant_field2" value="variant2"/>
                    </multiset>
                    <step>
                        <stepName>FFFF</stepName>
                    </step>
                    <step>
                        <stepName>GGGG</stepName>
                    </step>
                    <assert-param-value>
                        <parameter>type_field</parameter>
                        <value>type1</value>
                        <operation>=</operation>
                    </assert-param-value>
                    <commit>
                        <parameter>type_field2</parameter>
                        <value>myType</value>
                        <accept>false</accept>
                    </commit>
                    <assert-param-hidden>
                        <parameter>type_field5</parameter>
                        <hidden>true</hidden>
                    </assert-param-hidden>
                </units>
            </unit-test>
            

            • Before each <units> line, I inserted an XML comment noting the number of <step>......</step> blocks in that <unit-test name="xxxx">......</unit-test> block of my example

            • This fairly complex regex can be broken down, using the free-spacing mode (?x), as follows:

            (?xs)                             # FREE-SPACING and SINGLE-LINE modes
            ^\h*<unit-test                    # String "<unit-test", preceded with some HORIZONTAL BLANK character(s)
            (?:                               # Beginning of a NON-CAPTURING group
            (((?!</?unit-test|</?step>).)+?)  # SHORTEST NON-0 range of any char, NOT crossing "</?unit-test" nor "</?step>", and stored as GROUP 1
            <step>                            # ...till the STRING "<step>"
            (?1)                              # CALL of the regex SUB-ROUTINE, stored in GROUP 1, so the regex :  ((?!</?unit-test|</?step>).)+?
            </step>                           # ...till the STRING "</step>"
            ){N,M}                            # DESIRED number of "<step>...</step>" ranges, between N and M, in a SINGLE "<unit-test...</unit-test>" block
            (?1)                              # CALL of the regex SUB-ROUTINE, stored in GROUP 1, so the regex :  ((?!</?unit-test|</?step>).)+?
            </unit-test>\R                    # STRING "</unit-test>" with its LINE-BREAK
            

            Remark: to shorten the overall regex, the part ((?!</?unit-test|</?step>).)+?, stored as group 1, which represents the shortest non-empty range of any characters that crosses neither the </?unit-test string nor the </?step> string, is re-used two times, thanks to the subroutine call syntax (?1)!
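
For anyone who wants to try the same idea outside Notepad++, here is a hedged sketch of a Python port. Python's standard re module has no (?1) subroutine calls and no \h or \R, so in this version the group 1 sub-pattern is simply pasted in wherever (?1) appeared, \h becomes [ \t] and \R becomes an explicit alternation. The helper names step_count_regex, case and names, and the shortened sample, are assumptions made up for this illustration.

```python
import re

def step_count_regex(n, m=''):
    """Build a Python `re` version of the pattern for n to m step blocks.
    Leave m empty for "n or more"."""
    # Inlined copy of group 1: shortest non-empty run of characters that
    # never starts an opening or closing <unit-test or <step> tag.
    gap = r'(?:(?!</?unit-test|</?step>).)+?'
    quantifier = '{%s,%s}' % (n, m)
    pattern = (
        r'^[ \t]*<unit-test'
        r'(?:' + gap + r'<step>' + gap + r'</step>)' + quantifier
        + gap + r'</unit-test>(?:\r\n|\n|\r)'
    )
    return re.compile(pattern, re.DOTALL | re.MULTILINE)

def case(name, steps):
    """Build a made-up test case with one single-line step block per name in `steps`."""
    body = ''.join('        <step><stepName>%s</stepName></step>\n' % s for s in steps)
    return ('<unit-test name="%s">\n    <units>\n        <commit/>\n%s    </units>\n</unit-test>\n'
            % (name, body))

# Same step counts as in the sample text: 1, 4, 0 and 2 blocks.
sample = (case('3d', ['A']) + case('3e', ['B', 'C', 'D', 'E'])
          + case('3f', []) + case('3g', ['F', 'G']))

def names(n, m=''):
    """Names of the test cases selected by the n..m pattern."""
    matched = ''.join(x.group(0) for x in step_count_regex(n, m).finditer(sample))
    return re.findall(r'name="([^"]+)"', matched)

print(names(2, 4))  # ['3e', '3g']
print(names(0, 0))  # ['3f']
print(names(1))     # ['3d', '3e', '3g']
```

Building the pattern from a gap string plays the same role as the (?1) call: the sub-pattern is written once and reused three times.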

            Best Regards,

            guy038

            • Elias Mossholm

              Thank you @guy038!

              • DesAWSume @guy038

                Hi @guy038

                Is there a way to find the pattern below with a regex?

                Log file

                <Text>
                    ns:="https://www.example.com"
                    <Error>
                        <id>ex8359693589435834583985934583495</id>
                        <ErrorItem>
                            <id>slak;jdk;asjdklasjdklasjdfhkldj;sfjdsf</id>
                            <code>404</code>
                            <description>External>  failed messages multiple line of detials </description>
                            <reference>/</reference>
                        </ErrorItem>
                    </Error>
                    <InformationLog>
                        <cccpInformation>
                            <description>External>  failed messages multiple line of detials 2 </description>
                            <Place>
                                <id>988475748848758478545</id>
                            </Place>
                        </cccpInformation>
                    </InformationLog>
                </Text>
                

                Basically, I want to capture only the <description> tag inside
                <ErrorItem></ErrorItem>

                and nothing else. The logs also contain <description> tags at other levels of the tag hierarchy.

                I can achieve some basic matching using something like

                (?s)<ErrorItem>(.*?)<\/description>
                

                but it selects everything from <ErrorItem> up to </description>, not just the <description> element inside <ErrorItem> </ErrorItem>

                • Alan Kilborn @DesAWSume

                  @DesAWSume said in How to match all content between two XML tags except if a certain tag occurs between them?:

                  I want to capture only the <description> tag inside
                  <ErrorItem></ErrorItem>

                  See HERE.
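
For the <description> question, one possible approach is the same tempered-dot idea used earlier in this thread: walk from <ErrorItem> to the first <description> without ever stepping over </ErrorItem>, and put only the description text in a capture group. The sketch below is in Python's re module, not the Notepad++ dialog; the shortened log and the name IN_ERROR_ITEM are assumptions made up for the illustration, and it assumes every <description> inside an <ErrorItem> has its closing tag.

```python
import re

# Shortened, made-up log with one wanted and one unwanted <description>.
log = '''<Text>
    <Error>
        <id>ex1</id>
        <ErrorItem>
            <id>item1</id>
            <code>404</code>
            <description>first failure,
 spread over two lines</description>
            <reference>/</reference>
        </ErrorItem>
    </Error>
    <InformationLog>
        <cccpInformation>
            <description>not wanted</description>
        </cccpInformation>
    </InformationLog>
</Text>
'''

IN_ERROR_ITEM = re.compile(
    r'<ErrorItem>'
    r'(?:(?!</ErrorItem>).)*?'               # stay inside this <ErrorItem>
    r'<description>(.*?)</description>',     # group 1: the description text only
    re.DOTALL,
)

# With a single capture group, findall returns just the group 1 texts.
descriptions = IN_ERROR_ITEM.findall(log)
print(len(descriptions))  # 1
```

The <description> under <InformationLog> is never reached, because the pattern can only start at an <ErrorItem> tag.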

                  The Community of users of the Notepad++ text editor.