How to match all content between two XML tags except if a certain tag occurs between them?

  • Elias Mossholm
    Jun 11, 2020, 2:51 PM

    I have an XML file used to define test cases for a specific program. I’d like to find all test cases with at least 8 step changes (a step change is characterized by a specific set of tags (<step> etc.), see second test case in example below) and remove the last 3 step changes. There may be any tags between the first 5 step changes but the match must stay within a single test case.

    A simplified version of this is to find “[start of test case] followed by [any text except “end of test case”] followed by [next step]”. That simplified version is what I’m trying to solve below (unsuccessfully).

    Example text to search in:

    <unit-test name="3d.">
        <units>
            <multiset>
                <set action="commit" parameter="variant_field" value="variant1"/>
                <set action="commit" parameter="variant_field2" value="variant2"/>
            </multiset>
            <assert-param-value>
                <parameter>type_field</parameter>
                <value>type1</value>
                <operation>=</operation>
            </assert-param-value>
            <commit>
                <parameter>type_field2</parameter>
                <value>myType</value>
                <accept>false</accept>
            </commit>
            <assert-param-hidden>
                <parameter>type_field5</parameter>
                <hidden>true</hidden>
            </assert-param-hidden>
        </units>
    </unit-test>
    <unit-test name="3e.">
        <units>
            <assert-attribute-value>
                <attribute>part.subpart.subsubpart.block[3].anotherPart.type</attribute>
                <value>type7</value>
                <operation>=</operation>
            </assert-attribute-value>
            <step>
                <stepName>next</stepName>
            </step>
            <commit>
                <parameter>part_field</parameter>
                <value>part1</value>
                <accept>false</accept>
            </commit>
        </units>
    </unit-test>
    

    The following expression successfully matches “[start of test case] followed by [any text] followed by [next step]”, but since I left out <except “end of test case”>, it crosses the test case border if I search from the beginning of the above text.
    Note that I use “. matches newline” setting in the search dialog.

    (<unit-test.*?)( *<step>\r\n\ *<stepName>next</stepName>\r\n\ *</step>)
    

    I tried adding a negative lookahead (?!</units>) (strictly speaking </units> is the second-last tag, but it only occurs at the end of each test case so it should work just as well as </unit-test> would):

    (<unit-test.(?!</units>)*?)( *<step>\r\n\ *<stepName>next</stepName>\r\n\ *</step>)
    

    …but that is an invalid expression according to np++.

    This being the first time I use negative lookaheads (or any assertions), I tried rearranging the above expression so that the negative assertion comes after the full .*? instead of between the . and the *?:

    (<unit-test.*?(?!</units>))( *<step>\r\n\ *<stepName>next</stepName>\r\n\ *</step>)
    

    …but that matches across the test case border, so it fails to solve the problem.

    If anyone could shed some light on how I could solve this, I’d be very happy.

    • Alan Kilborn @Elias Mossholm
      Jun 11, 2020, 3:09 PM

      @Elias-Mossholm said in How to match all content between two XML tags except if a certain tag occurs between them?:

      I’d like to find all test cases with at least 8 step changes (a step change is characterized by a specific set of tags (<step> etc.), see second test case in example below) and remove the last 3 step changes.

      Maybe I’m just not understanding…
      You say “at least 8 step changes” but then you show sample data that has only one step (defined by <step>...</step>)?
      I suppose it can be solved from the description, but it is odd that your sample data would not be affected by the solution.

      • Elias Mossholm
        Jun 16, 2020, 8:06 AM

        You say “at least 8 step changes” but then you show sample data that has only one step (defined by <step>…</step>)?

        Maybe that was a bit unclear.

        The number of step changes I need in the match should be easy to specify with a {n,n} quantifier or by just repeating the same segment n times. I’ve already built regex strings that match that correctly.

        The problem I haven’t been able to solve is how to find only matches that don’t include the end tag </units>.

        • guy038
          Jun 16, 2020, 11:38 AM, last edited by guy038 Jun 16, 2020, 5:57 PM

          Hello, @elias-mossholm, @alan-kilborn and All,

          Here is a regex that selects the whole contents of any <unit-test name="xxxx">.......</unit-test> block containing between N and M <step>............</step> blocks, and no more than M, such as:

                  <step>
                      <stepName>XXXX</stepName>
                  </step>
          

          or

                  <step><stepName>YYYY</stepName></step>
          

          SEARCH (?s)^\h*<unit-test(?:(((?!</?unit-test|</?step>).)+?)<step>(?1)</step>){N,M}(?1)</unit-test>\R

          Of course, you must replace N and M with the appropriate integers, giving a quantifier such as {2,4}, {1,}, {0,3}, {2} or even {0}.
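To see what the different quantifiers select, here is a minimal Python sketch of the same idea. It is an approximation, not the Notepad++ regex itself: Python's standard `re` module has no `(?1)` subroutine call, `\h` or `\R`, so the group 1 fragment is kept in a string and pasted in where needed, `[ \t]` stands in for `\h`, and an explicit alternation stands in for `\R`. The names `FILLER`, `unit_test_pattern`, `matching_names` and the shortened `SAMPLE` are made up for this illustration.

```python
import re

# Lazy, non-empty run of characters that never steps onto a unit-test or
# step tag (the role played by group 1 in the Notepad++ regex).
FILLER = r"(?:(?!</?unit-test|</?step>).)+?"


def unit_test_pattern(quantifier):
    """Build the pattern for a given quantifier string such as '{2,4}'."""
    step_block = FILLER + "<step>" + FILLER + "</step>"
    return re.compile(
        r"^[ \t]*<unit-test(?:" + step_block + ")" + quantifier
        + FILLER + r"</unit-test>(?:\r\n|\n|\r)",
        re.DOTALL | re.MULTILINE,
    )


# Shortened, hypothetical sample: one, two and zero <step> blocks.
SAMPLE = """\
<unit-test name="a">
    <units>
        <step><stepName>AAAA</stepName></step>
        <commit/>
    </units>
</unit-test>
<unit-test name="b">
    <units>
        <step>
            <stepName>BBBB</stepName>
        </step>
        <step>
            <stepName>CCCC</stepName>
        </step>
    </units>
</unit-test>
<unit-test name="c">
    <units>
        <commit/>
    </units>
</unit-test>
"""


def matching_names(quantifier, text=SAMPLE):
    """Return the name attribute of every unit-test block that matches."""
    names = []
    for match in unit_test_pattern(quantifier).finditer(text):
        names.append(re.search(r'name="([^"]*)"', match.group(0)).group(1))
    return names


for q in ("{1}", "{2}", "{0}", "{1,}", "{0,3}"):
    print(q, matching_names(q))
```

With this sample, `{1}` selects only block `a`, `{2}` only block `b`, `{0}` only block `c`, `{1,}` selects `a` and `b`, and `{0,3}` selects all three.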

          You can try the search regex against the sample text below:

          <unit-test name="3d.">
          <!--                        1 BLOCK -->
              <units>
                  <multiset>
                      <set action="commit" parameter="variant_field" value="variant1"/>
                      <set action="commit" parameter="variant_field2" value="variant2"/>
                  </multiset>
                  <assert-param-value>
                      <parameter>type_field</parameter>
                      <value>type1</value>
                      <operation>=</operation>
                  </assert-param-value>
                  <commit>
                      <parameter>type_field2</parameter>
                      <value>myType</value>
                      <accept>false</accept>
                  </commit>
                  <step><stepName>AAAA</stepName></step>
                  <assert-param-hidden>
                      <parameter>type_field5</parameter>
                      <hidden>true</hidden>
                  </assert-param-hidden>
              </units>
          </unit-test>
          <unit-test name="3e.">
          <!--                        4 BLOCKS -->
              <units>
                  <assert-attribute-value>
                      <attribute>part.subpart.subsubpart.block[3].anotherPart.type</attribute>
                      <value>type7</value>
                      <operation>=</operation>
                  </assert-attribute-value>
                  <step>
                      <stepName>BBBB</stepName>
                  </step>
                  <commit>
                      <parameter>part_field</parameter>
                      <value>part1</value>
                      <accept>false</accept>
                  </commit>
                  <assert-attribute-value>
                      <attribute>part.subpart.subsubpart.block[3].anotherPart.type</attribute>
                      <value>type7</value>
                      <operation>=</operation>
                  </assert-attribute-value>
                  <step>
                      <stepName>CCCC</stepName>
                  </step>
                  <commit>
                      <parameter>part_field</parameter>
                      <value>part1</value>
                      <accept>false</accept>
                  </commit>
                  <assert-attribute-value>
                      <attribute>part.subpart.subsubpart.block[3].anotherPart.type</attribute>
                      <value>type7</value>
                      <operation>=</operation>
                  </assert-attribute-value>
                  <step>
                      <stepName>DDDD</stepName>
                  </step>
                  <commit>
                      <parameter>part_field</parameter>
                      <value>part1</value>
                      <accept>false</accept>
                  </commit>
                  <assert-attribute-value>
                      <attribute>part.subpart.subsubpart.block[3].anotherPart.type</attribute>
                      <value>type7</value>
                      <operation>=</operation>
                  </assert-attribute-value>
                  <step>
                      <stepName>EEEE</stepName>
                  </step>
                  <commit>
                      <parameter>part_field</parameter>
                      <value>part1</value>
                      <accept>false</accept>
                  </commit>
              </units>
          </unit-test>
          <unit-test name="3f.">
          <!--                        0 BLOCK -->
              <units>
                  <multiset>
                      <set action="commit" parameter="variant_field" value="variant1"/>
                      <set action="commit" parameter="variant_field2" value="variant2"/>
                  </multiset>
                  <commit>
                      <parameter>type_field2</parameter>
                      <value>myType</value>
                      <accept>false</accept>
                  </commit>
                  <assert-param-hidden>
                      <parameter>type_field5</parameter>
                      <hidden>true</hidden>
                  </assert-param-hidden>
              </units>
          </unit-test>
          <unit-test name="3g.">
          <!--                        2 consecutive BLOCKS -->
              <units>
                  <multiset>
                      <set action="commit" parameter="variant_field" value="variant1"/>
                      <set action="commit" parameter="variant_field2" value="variant2"/>
                  </multiset>
                  <step>
                      <stepName>FFFF</stepName>
                  </step>
                  <step>
                      <stepName>GGGG</stepName>
                  </step>
                  <assert-param-value>
                      <parameter>type_field</parameter>
                      <value>type1</value>
                      <operation>=</operation>
                  </assert-param-value>
                  <commit>
                      <parameter>type_field2</parameter>
                      <value>myType</value>
                      <accept>false</accept>
                  </commit>
                  <assert-param-hidden>
                      <parameter>type_field5</parameter>
                      <hidden>true</hidden>
                  </assert-param-hidden>
              </units>
          </unit-test>
          

          • Before each <units> line, I inserted an XML comment noting the number of <step>......</step> blocks in that <unit-test name="xxxx">......</unit-test> block of my example

          • This rather complex regex can be decomposed, using the free-spacing mode (?x), as follows:

          (?xs)                             # FREE-SPACING and SINGLE-LINE modes
          ^\h*<unit-test                    # String "<unit-test", preceded with some HORIZONTAL BLANK character(s)
          (?:                               # Beginning of a NON-CAPTURING group
          (((?!</?unit-test|</?step>).)+?)  # SHORTEST NON-0 range of any char, NOT crossing "</?unit-test" nor "</?step>", and stored as GROUP 1
          <step>                            # ...till the STRING "<step>"
          (?1)                              # CALL of the regex SUB-ROUTINE, stored in GROUP 1, so the regex :  ((?!</?unit-test|</?step>).)+?
          </step>                           # ...till the STRING "</step>"
          ){N,M}                            # DESIRED number of "<step>...</step>" ranges, between N and M, in a SINGLE "<unit-test...</unit-test>" block
          (?1)                              # CALL of the regex SUB-ROUTINE, stored in GROUP 1, so the regex :  ((?!</?unit-test|</?step>).)+?
          </unit-test>\R                    # STRING "</unit-test>" with its LINE-BREAK
          

          Remark: to shorten the overall regex, the part ((?!</?unit-test|</?step>).)+?, stored as group 1, which represents the shortest non-empty range of any characters that crosses neither a </?unit-test string nor a </?step> string, is reused twice thanks to the subroutine call syntax (?1).
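The key point of that group is where the negative lookahead sits: it is tested before every single character the run consumes. A lookahead placed once after `.*?`, as in the last attempt from the question, only guards one position, so the lazy run is still free to pass `</units>`. A small Python sketch of the difference, assuming a shortened sample and the hypothetical names `after_run`, `per_char` and `matched_test_names`:

```python
import re

# Hypothetical, shortened sample: the first test case has no step,
# the second one contains the "next" step.
TEXT = (
    '<unit-test name="3d.">\n<units>\n<commit/>\n</units>\n</unit-test>\n'
    '<unit-test name="3e.">\n<units>\n<step>\n<stepName>next</stepName>\n'
    '</step>\n</units>\n</unit-test>\n'
)

STEP = r"<step>\s*<stepName>next</stepName>\s*</step>"

# Lookahead tested once, after the lazy run: it only checks that the text
# right before the step does not start with </units>.
after_run = re.compile(r"<unit-test.*?(?!</units>)" + STEP, re.DOTALL)

# Lookahead tested before each consumed character: the run stops at </units>.
per_char = re.compile(r"<unit-test(?:(?!</units>).)*?" + STEP, re.DOTALL)


def matched_test_names(pattern, text=TEXT):
    """For each match, list the names of the test cases it touches."""
    return [
        re.findall(r'<unit-test name="([^"]*)"', m.group(0))
        for m in pattern.finditer(text)
    ]


print(matched_test_names(after_run))  # one match spanning 3d. and 3e.
print(matched_test_names(per_char))   # one match inside 3e. only
```

The first pattern starts in test case `3d.` and runs into `3e.`, while the second one can only start in `3e.`, because no match beginning in `3d.` can get past its `</units>`.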

          Best Regards,

          guy038

          • Elias Mossholm
            Jun 18, 2020, 3:56 PM

            Thank you @guy038!

            • DesAWSume @guy038
              Nov 22, 2022, 12:48 AM, last edited by DesAWSume Nov 22, 2022, 12:50 AM

              Hi @guy038

              Is there a way to find the pattern below with a regex?

              Log file

              <Text>
                  ns:="https://www.example.com"
                  <Error>
                      <id>ex8359693589435834583985934583495</id>
                      <ErrorItem>
                          <id>slak;jdk;asjdklasjdklasjdfhkldj;sfjdsf</id>
                          <code>404</code>
                          <description>External>  failed messages multiple line of detials </description>
                          <reference>/</reference>
                      </ErrorItem>
                  </Error>
                  <InformationLog>
                      <cccpInformation>
                          <description>External>  failed messages multiple line of detials 2 </description>
                          <Place>
                              <id>988475748848758478545</id>
                          </Place>
                      </cccpInformation>
                  </InformationLog>
              </Text>
              

              Basically, I only want to capture the <description> tag only in
              <ErrorItem></ErrorItem>

              and nothing else. The logs also contain <description> tags at other levels of the tag hierarchy.

              I can achieve some basic matching using something like

              (?s)<ErrorItem>(.*?)<\/description>
              

              but it will select everything inside <ErrorItem> </ErrorItem>

              • Alan Kilborn @DesAWSume
                Nov 22, 2022, 1:37 AM

                @DesAWSume said in How to match all content between two XML tags except if a certain tag occurs between them?:

                I only want to capture the <description> tag only in
                <ErrorItem></ErrorItem>

                See HERE.
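Independently of the linked post, whose content is not reproduced here, one possible approach is the same per-character lookahead used earlier in this thread: start at `<ErrorItem>`, refuse to pass `</ErrorItem>`, and capture only the text of the `<description>` element. The following is a rough Python sketch under those assumptions. The log excerpt is a shortened, hypothetical version of the one in the question, the names `ERROR_ITEM_DESCRIPTION` and `error_descriptions` are invented for the example, and only the first `<description>` of each `<ErrorItem>` is captured.

```python
import re

# Shortened, hypothetical log excerpt with a description at two levels.
LOG = """\
<Text>
    <Error>
        <ErrorItem>
            <id>abc</id>
            <code>404</code>
            <description>first failure</description>
        </ErrorItem>
    </Error>
    <InformationLog>
        <cccpInformation>
            <description>informational only</description>
        </cccpInformation>
    </InformationLog>
</Text>
"""

# From <ErrorItem>, lazily skip ahead without passing </ErrorItem>, then
# capture the description text (also without passing </ErrorItem>).
ERROR_ITEM_DESCRIPTION = re.compile(
    r"<ErrorItem>(?:(?!</ErrorItem>).)*?"
    r"<description>((?:(?!</ErrorItem>).)*?)</description>",
    re.DOTALL,
)


def error_descriptions(text):
    """Return the description text found inside each <ErrorItem> block."""
    return [m.group(1).strip() for m in ERROR_ITEM_DESCRIPTION.finditer(text)]


print(error_descriptions(LOG))
```

Here only the description inside `<ErrorItem>` is returned; the one under `<cccpInformation>` is ignored because no `<ErrorItem>` opening tag precedes it.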

                The Community of users of the Notepad++ text editor.