
    extract XMl with regex

    31 Posts 5 Posters 3.2k Views
    • PeterJones

      @vijay-S said in extract XMl with regex:

      Can you provide the solution asap
      as Its very urgent.

      BTW: I hadn’t had a chance to respond to this earlier, but such a request is considered exceedingly rude in any help forum I’ve ever visited.

      That compounds with the fact that you have shown no effort: guy038 provides you with an answer that works (or comes as close as he can guess, given the inaccurate or incomplete information you provide), and then you change the rules without attempting to modify what he has already given you; and he replies with an update, and this keeps repeating; at some point, you are going to wear out even his patience. I recommend a change in tactics before you’ve burned all bridges here.

      • guy038

        Hello, @vijay-s,

        Oh my God ! I misunderstood your request from the very beginning :-(( Now I see what you want :

        You would like to select the whole of any main <ns:name> ... </ns:name> block, ONLY IF it meets ALL the conditions below, in this priority order :

        • It contains a tag and value <ns:locationevent>yyyy</ns:locationevent>

        • It contains a tag and value <ID>123</ID>

        • It contains a tag and value <ns:name>Future</ns:name>

        • It contains, at least, one tag and value <ns:name>def</ns:name>, BEFORE the <ns:Coverage> .... </ns:Coverage> block

        • It contains, at least, one tag and value <ns:name>def</ns:name>, INSIDE the <ns:Coverage> .... </ns:Coverage> block

        Additional rule : Tags are case-sensitive and values are case-insensitive
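
For readers who would rather script the check than run several S/R passes, the five conditions listed above can be sketched in Python with the standard xml.etree.ElementTree module. This is only a sketch under assumptions that are not in the thread: the namespace URI urn:example:ns is made up (the posted XML never declares the ns: prefix), each main block is assumed to be well-formed once wrapped in a root element, and the helper names q, text_is and satisfies_all are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "urn:example:ns"  # hypothetical URI: the thread never declares the ns: prefix


def q(tag):
    """Qualified ElementTree name for a tag written as ns:tag in the thread."""
    return "{%s}%s" % (NS, tag)


def text_is(elem, value):
    """Compare an element's text with value, ignoring case (values are case-insensitive)."""
    return (elem.text or "").strip().lower() == value.lower()


def has(elems, tag, value):
    """True if any element in elems has this exact tag and this value."""
    return any(e.tag == tag and text_is(e, value) for e in elems)


def satisfies_all(block):
    """Return True if one main <ns:name> block meets the five conditions."""
    root = ET.fromstring('<wrap xmlns:ns="%s">%s</wrap>' % (NS, block))
    main = root[0]
    in_order = list(main.iter())          # document order, main block first
    coverage = main.find(q("Coverage"))   # direct child of the main block
    if coverage is None:
        return False
    before = in_order[1:in_order.index(coverage)]  # everything before <ns:Coverage>
    inside = list(coverage.iter())                 # <ns:Coverage> and its descendants
    return (
        has(in_order, q("locationevent"), "yyyy")  # condition 1
        and has(in_order, "ID", "123")             # condition 2
        and has(in_order, q("name"), "Future")     # condition 3
        and has(before, q("name"), "def")          # condition 4
        and has(inside, q("name"), "def")          # condition 5
    )


GOOD = """<ns:name>
  <ns:locationevent>YYYY</ns:locationevent>
  <ns:Action><ns:name>Future</ns:name></ns:Action>
  <ns:Action><ns:name>def</ns:name></ns:Action>
  <ID>123</ID>
  <ns:Coverage>
    <ns:Action><ns:name>DEF</ns:name></ns:Action>
  </ns:Coverage>
</ns:name>"""

print(satisfies_all(GOOD))
print(satisfies_all(GOOD.replace("<ID>123</ID>", "<ID>1234</ID>")))
```

Because a parser sees the element tree rather than raw text, the order in which the five tests are written does not matter here, unlike with the chained regexes.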

        Have I formulated everything correctly ?


        If so, from your last example found above and from the one in post :

        https://community.notepad-plus-plus.org/post/49516

        I tried to build a realistic example, covering all the types of text, giving :

        <!----------------  INITIAL TEXT --------------------->
        <ns:name>
            <ns:location>asfsafs</ns:location>
            <ns:locationevent>zzzz</ns:locationevent>
            <ns:locations>
                <ns:location1>
                    <ns:locationphase>
                        <ns:Action>
                            <ns:name>Prior</ns:name>
                        </ns:Action>
                        <ns:Status>Completed</ns:Status>
                    </ns:locationphase>
                </ns:location1>
                <ns:location1>
                    <ns:locationphase>
                        <ns:Action>
                            <ns:name>Current</ns:name>
                        </ns:Action>
                        <ns:Status>Completed</ns:Status>
                    </ns:locationphase>
                </ns:location1>
                <ns:location1>
                    <ns:locationphase>
                        <ns:Action>
                            <ns:name>Future</ns:name>
                        </ns:Action>
                        <ns:Status>Pending</ns:Status>
                    </ns:locationphase>
                </ns:location1>
            </ns:locations>
            <ns:Action>
                <ns:name>abc</ns:name>
            </ns:Action>
            <ns:Action>
                <ns:name>ghy</ns:name>
            </ns:Action>
            <ID>123</ID>
            <ns:Coverage>
                <ns:Action>
                    <ns:name>deg</ns:name>
                </ns:Action>
                <ns:Action>
                    <ns:name>def</ns:name>
                </ns:Action>
            </ns:Coverage>
        </ns:name>
        
        
        <ns:name>
            <ns:location>asfsafs</ns:location>
            <ns:locationevent>yyyy</ns:locationevent>
            <ns:locations>
                <ns:location1>
                    <ns:locationphase>
                        <ns:Action>
                            <ns:name>Prior</ns:name>
                        </ns:Action>
                        <ns:Status>Completed</ns:Status>
                    </ns:locationphase>
                </ns:location1>
                <ns:location1>
                    <ns:locationphase>
                        <ns:Action>
                            <ns:name>Current</ns:name>
                        </ns:Action>
                        <ns:Status>Completed</ns:Status>
                    </ns:locationphase>
                </ns:location1>
                <ns:location1>
                    <ns:locationphase>
                        <ns:Action>
                            <ns:name>Future</ns:name>
                        </ns:Action>
                        <ns:Status>Pending</ns:Status>
                    </ns:locationphase>
                </ns:location1>
            </ns:locations>
            <ns:Action>
                <ns:name>abc</ns:name>
            </ns:Action>
            <ns:Action>
                <ns:name>def</ns:name>
            </ns:Action>
            <ID>1234</ID>
            <ns:Coverage>
                <ns:Action>
                    <ns:name>deg</ns:name>
                </ns:Action>
                <ns:Action>
                    <ns:name>ddd</ns:name>
                </ns:Action>
            </ns:Coverage>
        </ns:name>
        
        <ns:name>
            <ns:location>asfsafs</ns:location>
            <ns:locationevent>yyyy</ns:locationevent>
            <ns:locations>
                <ns:location1>
                    <ns:locationphase>
                        <ns:Action>
                            <ns:name>Prior</ns:name>
                        </ns:Action>
                        <ns:Status>Completed</ns:Status>
                    </ns:locationphase>
                </ns:location1>
                <ns:location1>
                    <ns:locationphase>
                        <ns:Action>
                            <ns:name>Current</ns:name>
                        </ns:Action>
                        <ns:Status>Completed</ns:Status>
                    </ns:locationphase>
                </ns:location1>
                <ns:location1>
                    <ns:locationphase>
                        <ns:Action>
                            <ns:name>Future</ns:name>
                        </ns:Action>
                        <ns:Status>Pending</ns:Status>
                    </ns:locationphase>
                </ns:location1>
            </ns:locations>
            <ns:Action>
                <ns:name>abc</ns:name>
            </ns:Action>
            <ns:Action>
                <ns:name>def</ns:name>
            </ns:Action>
            <ID>123</ID>
            <ns:Coverage>
                <ns:Action>
                    <ns:name>deg</ns:name>
                </ns:Action>
                <ns:Action>
                    <ns:name>def</ns:name>
                </ns:Action>
            </ns:Coverage>
        </ns:name>
        

        IMPORTANT :

        • I have changed, again, the logic of the regexes. This time, if the condition contained in a regex is true, the regex appends a digit, as a marker, to the current main <ns:name> .... </ns:name> block

        • For all the search/replace ( S/R ) operations below, click on the Replace All button, exclusively ( Do not use the Replace button. But you may use the Find Next button to see the different matches )

        • And, as usual, tick the Wrap around option and select the Regular expression search mode


        So, if you apply these 5 regexes successively, in this order, each one appends its own digit right after the ending tag </ns:name> of every main block which satisfies its condition ( and after any digits already added )

        • Regex 1 :

          • SEARCH (?s-i)^\h*<ns:name>.*?<ns:locationevent>(?i:yyyy)</ns:locationevent>.+?</ns:Coverage>\R\h*</ns:name>\d*\K

          • REPLACE 1

        • Regex 2 :

          • SEARCH (?s-i)^\h*<ns:name>.*?<ID>123</ID>.+?</ns:Coverage>\R\h*</ns:name>\d*\K

          • REPLACE 2

        • Regex 3 :

          • SEARCH (?s-i)^\h*<ns:name>.*?<ns:name>(?i:Future)</ns:name>.+?</ns:Coverage>\R\h*</ns:name>\d*\K

          • REPLACE 3

        • Regex 4 :

          • SEARCH (?s-i)^\h*<ns:name>.*?<ns:name>(?i:def)</ns:name>.+?<ns:Coverage>.+?</ns:Coverage>\R\h*</ns:name>\d*\K

          • REPLACE 4

        • Regex 5 :

          • SEARCH (?s-i)^\h*<ns:name>.*?<ns:Coverage>.+?<ns:name>(?i:def)</ns:name>.+?</ns:Coverage>\R\h*</ns:name>\d*\K

          • REPLACE 5

        You should get this temporary text :

        <!----------------  OUTPUT TEXT --------------------->
        <ns:name>
            <ns:location>asfsafs</ns:location>
            <ns:locationevent>zzzz</ns:locationevent>
            <ns:locations>
                <ns:location1>
                    <ns:locationphase>
                        <ns:Action>
                            <ns:name>Prior</ns:name>
                        </ns:Action>
                        <ns:Status>Completed</ns:Status>
                    </ns:locationphase>
                </ns:location1>
                <ns:location1>
                    <ns:locationphase>
                        <ns:Action>
                            <ns:name>Current</ns:name>
                        </ns:Action>
                        <ns:Status>Completed</ns:Status>
                    </ns:locationphase>
                </ns:location1>
                <ns:location1>
                    <ns:locationphase>
                        <ns:Action>
                            <ns:name>Future</ns:name>
                        </ns:Action>
                        <ns:Status>Pending</ns:Status>
                    </ns:locationphase>
                </ns:location1>
            </ns:locations>
            <ns:Action>
                <ns:name>abc</ns:name>
            </ns:Action>
            <ns:Action>
                <ns:name>ghy</ns:name>
            </ns:Action>
            <ID>123</ID>
            <ns:Coverage>
                <ns:Action>
                    <ns:name>deg</ns:name>
                </ns:Action>
                <ns:Action>
                    <ns:name>def</ns:name>
                </ns:Action>
            </ns:Coverage>
        </ns:name>235
        
        
        <ns:name>
            <ns:location>asfsafs</ns:location>
            <ns:locationevent>yyyy</ns:locationevent>
            <ns:locations>
                <ns:location1>
                    <ns:locationphase>
                        <ns:Action>
                            <ns:name>Prior</ns:name>
                        </ns:Action>
                        <ns:Status>Completed</ns:Status>
                    </ns:locationphase>
                </ns:location1>
                <ns:location1>
                    <ns:locationphase>
                        <ns:Action>
                            <ns:name>Current</ns:name>
                        </ns:Action>
                        <ns:Status>Completed</ns:Status>
                    </ns:locationphase>
                </ns:location1>
                <ns:location1>
                    <ns:locationphase>
                        <ns:Action>
                            <ns:name>Future</ns:name>
                        </ns:Action>
                        <ns:Status>Pending</ns:Status>
                    </ns:locationphase>
                </ns:location1>
            </ns:locations>
            <ns:Action>
                <ns:name>abc</ns:name>
            </ns:Action>
            <ns:Action>
                <ns:name>def</ns:name>
            </ns:Action>
            <ID>1234</ID>
            <ns:Coverage>
                <ns:Action>
                    <ns:name>deg</ns:name>
                </ns:Action>
                <ns:Action>
                    <ns:name>ddd</ns:name>
                </ns:Action>
            </ns:Coverage>
        </ns:name>134
        
        <ns:name>
            <ns:location>asfsafs</ns:location>
            <ns:locationevent>yyyy</ns:locationevent>
            <ns:locations>
                <ns:location1>
                    <ns:locationphase>
                        <ns:Action>
                            <ns:name>Prior</ns:name>
                        </ns:Action>
                        <ns:Status>Completed</ns:Status>
                    </ns:locationphase>
                </ns:location1>
                <ns:location1>
                    <ns:locationphase>
                        <ns:Action>
                            <ns:name>Current</ns:name>
                        </ns:Action>
                        <ns:Status>Completed</ns:Status>
                    </ns:locationphase>
                </ns:location1>
                <ns:location1>
                    <ns:locationphase>
                        <ns:Action>
                            <ns:name>Future</ns:name>
                        </ns:Action>
                        <ns:Status>Pending</ns:Status>
                    </ns:locationphase>
                </ns:location1>
            </ns:locations>
            <ns:Action>
                <ns:name>abc</ns:name>
            </ns:Action>
            <ns:Action>
                <ns:name>def</ns:name>
            </ns:Action>
            <ID>123</ID>
            <ns:Coverage>
                <ns:Action>
                    <ns:name>deg</ns:name>
                </ns:Action>
                <ns:Action>
                    <ns:name>def</ns:name>
                </ns:Action>
            </ns:Coverage>
        </ns:name>12345
        

        You certainly noticed that, after replacement, the 3 main <ns:name> ....</ns:name> blocks end as below :

        ...
        </ns:name>235
        ...
        ...
        </ns:name>134
        ...
        ...
        </ns:name>12345
        

        The number after </ns:name> lists all the conditions which are TRUE for that block


        Now, it’s elementary ! We just have to :

        • Delete any main <ns:name> ....</ns:name> block which does not satisfy all the conditions, i.e. does not have the string 12345 after </ns:name>

        • Delete the string 12345 after the ending tag of all the blocks which do satisfy all the 5 conditions

        This can be done with the following S/R :

        SEARCH (?s-i)^\h*<ns:name>((?!</ns:Coverage>).)+?</ns:Coverage>\R\h*</ns:name>(?!12345)\d*\R|</ns:name>\K12345

        REPLACE Leave EMPTY

        And you’ll get your expected text :

        
        
        
        <ns:name>
            <ns:location>asfsafs</ns:location>
            <ns:locationevent>yyyy</ns:locationevent>
            <ns:locations>
                <ns:location1>
                    <ns:locationphase>
                        <ns:Action>
                            <ns:name>Prior</ns:name>
                        </ns:Action>
                        <ns:Status>Completed</ns:Status>
                    </ns:locationphase>
                </ns:location1>
                <ns:location1>
                    <ns:locationphase>
                        <ns:Action>
                            <ns:name>Current</ns:name>
                        </ns:Action>
                        <ns:Status>Completed</ns:Status>
                    </ns:locationphase>
                </ns:location1>
                <ns:location1>
                    <ns:locationphase>
                        <ns:Action>
                            <ns:name>Future</ns:name>
                        </ns:Action>
                        <ns:Status>Pending</ns:Status>
                    </ns:locationphase>
                </ns:location1>
            </ns:locations>
            <ns:Action>
                <ns:name>abc</ns:name>
            </ns:Action>
            <ns:Action>
                <ns:name>def</ns:name>
            </ns:Action>
            <ID>123</ID>
            <ns:Coverage>
                <ns:Action>
                    <ns:name>deg</ns:name>
                </ns:Action>
                <ns:Action>
                    <ns:name>def</ns:name>
                </ns:Action>
            </ns:Coverage>
        </ns:name>
        

        Remark :

        With this final regex, you could, instead, just keep all the blocks which satisfy the conditions, let’s say, 1, 3 and 4

        In this specific case, the S/R would become :

        SEARCH (?s-i)^\h*<ns:name>((?!</ns:Coverage>).)+?</ns:Coverage>\R\h*</ns:name>(?!134)\d*\R|</ns:name>\K134

        REPLACE Leave EMPTY
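
The tag-then-filter idea can also be sketched outside Notepad++. The Python below is an illustration, not a translation of the Boost regexes above: Python's re module has no \K, \h or \R, so the sketch first cuts the text into main blocks (assuming, as in the sample text, that the outer <ns:name> and </ns:name> tags sit alone on their lines and that each block holds a single <ns:Coverage> element), then tests each condition inside one block at a time. The names BLOCK, CONDITIONS, tag, keep_blocks and make_block are hypothetical, and keeping every block whose tag contains at least the wanted digits is a choice made for this sketch.

```python
import re

# A main block starts and ends on a line holding only the outer tag.
BLOCK = re.compile(
    r"^[ \t]*<ns:name>[ \t]*\r?\n.*?^[ \t]*</ns:name>[ \t]*\r?$",
    re.S | re.M,
)

# One pattern per condition, searched inside a single block only.
CONDITIONS = {
    "1": r"<ns:locationevent>(?i:yyyy)</ns:locationevent>",
    "2": r"<ID>123</ID>",
    "3": r"<ns:name>(?i:Future)</ns:name>",
    "4": r"<ns:name>(?i:def)</ns:name>.*?<ns:Coverage>",
    "5": r"<ns:Coverage>.*?<ns:name>(?i:def)</ns:name>.*?</ns:Coverage>",
}


def tag(block):
    """Return the digits of the conditions that hold for this block, e.g. '134'."""
    return "".join(d for d, p in CONDITIONS.items() if re.search(p, block, re.S))


def keep_blocks(text, wanted="12345"):
    """Keep the main blocks whose tag contains every wanted digit."""
    return [b for b in BLOCK.findall(text) if set(wanted) <= set(tag(b))]


def make_block(event, outside, ident, inside):
    """Build a shortened main block shaped like the sample text."""
    return (
        "<ns:name>\n"
        "  <ns:locationevent>%s</ns:locationevent>\n"
        "  <ns:Action>\n    <ns:name>Future</ns:name>\n  </ns:Action>\n"
        "  <ns:Action>\n    <ns:name>%s</ns:name>\n  </ns:Action>\n"
        "  <ID>%s</ID>\n"
        "  <ns:Coverage>\n    <ns:Action>\n      <ns:name>%s</ns:name>\n"
        "    </ns:Action>\n  </ns:Coverage>\n"
        "</ns:name>\n" % (event, outside, ident, inside)
    )


SAMPLE = (
    make_block("zzzz", "ghy", "123", "def")
    + make_block("yyyy", "def", "1234", "ddd")
    + make_block("yyyy", "def", "123", "def")
)

print([tag(b) for b in BLOCK.findall(SAMPLE)])
print(len(keep_blocks(SAMPLE)), len(keep_blocks(SAMPLE, "134")))
```

On this shortened sample the three tags come out as 235, 134 and 12345, the same values as in the temporary text above, so only the last block survives the default filter.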

        Best Regards,

        guy038

        P.S. : When we get a complete solution, I’ll try to explain the different regexes :-))

        • A Former User

          Thanks guy038
          
          I will also try from my end.
          
          But I want to clarify the requirement. Your understanding of the conditions below is correct.
          You would like to pick the totality of any main <ns:name> ... </ns:name> block, ONLY IF it respects ALL the below conditions, in this priority order :
          
          It contains a tag and value <ns:locationevent>yyyy</ns:locationevent>
          
          It contains a tag and value <ID>123</ID>
          
          It contains a tag and value <ns:name>Future</ns:name> with <ns:Status>Pending</ns:Status> ( please add this Status condition too )
          
          It contains, at least, one tag and value <ns:name>def</ns:name>, BEFORE the <ns:Coverage> .... </ns:Coverage> block
          
          It contains, at least, one tag and value <ns:name>def</ns:name>, INSIDE the <ns:Coverage> .... </ns:Coverage> block
          
          
          
          
          **Requirement**: I want to pick only the XML which matches the given conditions 1-5 from a list of files (each file is a log file which contains other text too)
          
          I need only one command that I will use in the Find in Files option to get the expected XML.
          
          • Alan Kilborn @A Former User

            @vijay-S

            If this is just going to be a continue-to-sponge-off-of-Guy type of thread, maybe it is best to take it offline into private emails. I know that Guy has given out his email address in past postings, so maybe he will this time as well.

            • A Former User

              guy038

              Can you please provide your email address?

              • guy038

                Hello, @vijay-s,

                You said :

                I need only one command that I will use in the Find in Files option to get the expected XML.

                I’m really sorry but I cannot ! Even when I tried to concatenate these 5 regexes into a single one, with the free-spacing regex mode, I got erroneous results, just because the process is ordered !


                To explain this fact, consider the simple S/R below, which tries to search for 2 conditions, simultaneously, and adds, right after the ending tag </ns:name> the letter A if the string abcd is found OR the letter B if the string efgh is found :

                SEARCH (?s)<ns:name>.+?(?:(abcd)|(efgh)).+?</ns:name>\l*\K

                REPLACE (?1A)(?2B)

                Against this sample text, below :

                <ns:name>
                   This text contains, both, strings "efgh" and "abcd"
                </ns:name>
                <ns:name>
                   This text contains, both, strings "efgh" and "abcd"
                </ns:name>
                

                Even if you click several times on the Replace All button, you’ll just find letters B, after </ns:name>, because, when scanning the sample text from left to right, the regex engine meets the efgh string first !

                Now, let’s suppose you run this first S/R :

                SEARCH (?s)<ns:name>.+?abcd.+?</ns:name>\l*\K

                REPLACE A

                Then process this second S/R :

                SEARCH (?s)<ns:name>.+?efgh.+?</ns:name>\l*\K

                REPLACE B

                You get, as expected, the string AB, after </ns:name>, meaning that the two conditions are true for each block !
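
The same behaviour can be reproduced with Python's re module, as a rough sketch. Python has no \K and no \l, so the whole match is re-emitted with the letter appended and [A-Z]* stands in for \l* ; these substitutions, and the names SAMPLE and append_letters, are assumptions of this sketch and not part of the S/R above.

```python
import re

# Two identical blocks, each containing "efgh" before "abcd".
SAMPLE = (
    "<ns:name>\n"
    '   This text contains, both, strings "efgh" and "abcd"\n'
    "</ns:name>\n"
) * 2


def append_letters(match):
    """Re-emit the match, then A if group 1 (abcd) took part, B if group 2 (efgh) did."""
    return (
        match.group(0)
        + ("A" if match.group(1) else "")
        + ("B" if match.group(2) else "")
    )


# Single pass with an alternation: the lazy .+? stops at whichever string comes first.
one_pass = re.sub(
    r"(?s)<ns:name>.+?(?:(abcd)|(efgh)).+?</ns:name>[A-Z]*",
    append_letters,
    SAMPLE,
)

# Two passes, one condition each: the second pass appends after the first letter.
step_a = re.sub(r"(?s)(<ns:name>.+?abcd.+?</ns:name>[A-Z]*)", r"\g<1>A", SAMPLE)
two_pass = re.sub(r"(?s)(<ns:name>.+?efgh.+?</ns:name>[A-Z]*)", r"\g<1>B", step_a)

print(one_pass)
print(two_pass)
```

In the single pass only one branch of the alternation can take part in each match, so each block receives a single letter; the two passes test the conditions independently.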


                Thus, your problem seems beyond the scope of regexes and needs to be solved with a scripting language or an XML parsing tool !

                Best Regards

                guy038

                P.S. :

                In my multi-regex solution, I found yet another logic error. So, after correction and considering your last requirement, I ended up with the 5 S/R below :

                • SEARCH (?s-i)^\h*<ns:name>((?!</ns:Coverage>).)+?<ns:locationevent>(?i:yyyy)</ns:locationevent>.+?</ns:Coverage>\R\h*</ns:name>\d*\K

                • REPLACE 1

                • SEARCH (?s-i)^\h*<ns:name>((?!</ns:Coverage>).)+?<ID>123</ID>.+?</ns:Coverage>\R\h*</ns:name>\d*\K

                • REPLACE 2

                • SEARCH (?s-i)^\h*<ns:name>((?!</ns:Coverage>).)+?<ns:name>(?i:Future)</ns:name>.+?<ns:Status>(?i:Pending)</ns:Status>.+?</ns:Coverage>\R\h*</ns:name>\d*\K

                • REPLACE 3

                • SEARCH (?s-i)^\h*<ns:name>((?!</ns:Coverage>).)+?<ns:name>(?i:def)</ns:name>.+?<ns:Coverage>.+?</ns:Coverage>\R\h*</ns:name>\d*\K

                • REPLACE 4

                • SEARCH (?s-i)^\h*<ns:name>((?!</ns:Coverage>).)+?<ns:Coverage>.+?<ns:name>(?i:def)</ns:name>.+?</ns:Coverage>\R\h*</ns:name>\d*\K

                • REPLACE 5

                And the last regex, which deletes all main <ns:name> .....</ns:name> blocks, which do not satisfy these 5 conditions, remains identical :

                SEARCH (?s-i)^\h*<ns:name>((?!</ns:Coverage>).)+?</ns:Coverage>\R\h*</ns:name>(?!12345)\d*\R|</ns:name>\K12345

                REPLACE Leave EMPTY

                • A Former User

                  Hi,
                  
                  Thanks for your help.
                  
                  For the following xml,
                  
                  
                  <ns:Input>
                  <ns:location>asfsafs</ns:location>
                  <ns:locationevent>xxxx</ns:locationevent>
                   <ns:Action>
                  <ns:name>abc</ns:name>
                  </ns:Action>
                  <ns:Action>
                  <ns:name>ghy</ns:name>
                  </ns:Action>
                  <ns:Coverage>
                  <ns:Action>
                  <ns:name>deg</ns:name>
                  </ns:Action>
                  </ns:Coverage>
                  </ns:locationevent>
                  <ns:PPLID>121</ns:PPLID
                  </ns:Input>
                  
                  <ns:Input>
                  <ns:location>asfsafs</ns:location>
                  <ns:locationevent>yyyy</ns:locationevent>
                    <ns:Action>
                  <ns:name>abc</ns:name>
                  </ns:Action>
                  <ns:Action>
                  <ns:name>def</ns:name>
                  </ns:Action>
                  <ns:Coverage>
                  <ns:Action>
                  <ns:name>deg</ns:name>
                  </ns:Action>
                  </ns:Coverage>
                  <ns:PPLID>124</ns:PPLID
                  </ns:Input>
                  
                  
                  <ns:Input>
                  <ns:location>asfsafs</ns:location>
                  <ns:locationevent>yyyy</ns:locationevent>
                   <ns:Action>
                  <ns:name>abc</ns:name>
                  </ns:Action>
                  <ns:Action>
                  <ns:name>def</ns:name>
                  </ns:Action>
                  <ns:Coverage>
                  <ns:Action>
                  <ns:name>def</ns:name>
                  </ns:Action>
                  </ns:Coverage>
                  <ns:PPLID>123</ns:PPLID>
                  </ns:Input>
                  I found the command to pick the xml which should match the following conditions
                  <ns:Input>..<ns:locationevent>yyyy</ns:locationevent>..<ns:Action>..<ns:name>def</ns:name>..<ns:Coverage>..<ns:Action>..<ns:name>def</ns:name>..<ns:PPLID>124<ns:PPLID>..</ns:Input>
                  
                  if I use the below command
                  
                  (?s)<ns:Input>((?!</ns:Input>).)*?<ns:locationevent>yyyy</ns:locationevent.*?<ns:Action>.*?<ns:name>def</ns:name>.*?<ns:Coverage>.*?<ns:Action>.*?<ns:name>def</ns:name>.*?124.*?</ns:Input>
                  
                  it does not find the second XML block, which should match.
                  
                  Whereas if I use the command below
                  
                  (?s)<ns:Input>((?!</ns:Input>).)*?<ns:locationevent>yyyy</ns:locationevent.*?<ns:Action>.*?<ns:name>def</ns:name>.*?<ns:Coverage>.*?<ns:Action>.*?<ns:name>def</ns:name>.*?123.*?</ns:Input>
                  
                  It selects both second and third. In this case it should pick only the third. Can you check on this?
                  
                  • guy038

                    Hi, @vijay-s,

                    From your last post, I see that you, again, changed the general layout of your text :

                    • The main <ns:name> .... </ns:name> blocks seem to be replaced with main <ns:Input> .... </ns:Input> ones

                    • The <ns:locations>....</ns:locations> part is absent

                    • You added <ns:PPLID>xxx</ns:PPLID> elements between the lines </ns:Coverage> and </ns:Input>

                    • You added another condition, as you want to search for a particular 124 value, in <ns:PPLID>.....</ns:PPLID>

                    Moreover, you tried to build a single regex taking all your conditions into account, simultaneously, although I explained, in my previous post, that this approach will not work in the general case with the regexes that I presented.


                    So, once and for all, could you, please :

                    • Give us a text, which recapitulates ALL possible cases, found in your real data ( I cannot guess it, obviously ! )

                    • Explain ALL the conditions required, in order to consider any main XML block as correct

                    Beware that all your requirements may exceed the power of regular expressions and would need other tools !!


                    Just consider all the time wasted by giving, each time, only a part of the whole problem !!

                    When requirements are well defined and all cases well identified, generally, most of the job is done ;-))

                    BR

                    guy038

                    P.S. : Regexes are very sensitive to text. Even, one additional space character, somewhere, may prevent a regular expression from matching an expected piece of text !

                    • A Former User

                      
                      Hi,
                      
                      Fixing the latest problem will fix all the others; I will take care of those. Please let me know if it can be fixed. For the given XML,
                      
                      I need to pick the XML for Below are the conditions <ns:Input>..<ns:locationevent>yyyy</ns:locationevent>..<ns:Action>..<ns:name>def</ns:name>..<ns:Coverage>..<ns:Action>..<ns:name>def</ns:name>..<ns:PPLID>124<ns:PPLID>..</ns:Input>  -- which is the second occurence of the given XML
                      
                      I need to pick the XML for Below are the conditions <ns:Input>..<ns:locationevent>yyyy</ns:locationevent>..<ns:Action>..<ns:name>def</ns:name>..<ns:Coverage>..<ns:Action>..<ns:name>def</ns:name>..<ns:PPLID>123<ns:PPLID>..</ns:Input>  --which is the third occurence of the given XML
                      
                      
                      
                      • Peter Brand

                        Wow, that’s a lot of work. A simpler and much more robust approach would be to use XSLT to transform your XML document.

                        • guy038

                          Hi, @vijay-s and All,

                          Ah …, this time, we get something more coherent ;-))


                          But, first, still a few corrections. In your penultimate post, some lines of your XML are malformed !

                          ...
                          <ns:PPLID>121</ns:PPLID
                          ...
                          ...
                          <ns:PPLID>124</ns:PPLID
                          

                          Of course, the ending > symbol is missing in these two lines

                          On the other hand, in your last post you said :

                          I need to pick the XML for Below are the conditions <ns:Input>..<ns:locationevent>yyyy</ns:locationevent>..<ns:Action>..<ns:name>def</ns:name>..<ns:Coverage>..<ns:Action>..<ns:name>def</ns:name>..<ns:PPLID>124<ns:PPLID>..</ns:Input> -- which is the second occurence of the given XML

                          But unfortunately, given your example, the second block <ns:Input> .... </ns:Input> contains the part :

                          <ns:Coverage>
                          <ns:Action>
                          <ns:name>deg</ns:name>
                          </ns:Action>
                          </ns:Coverage>
                          

                          And, obviously, it cannot match the regex, as the string def is required in the <ns:name> .... </ns:name> element inside the <ns:Coverage> block !
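
The same sample also shows why the 123 variant of the posted regex selected the second and third blocks together: only its first gap is guarded by ((?!</ns:Input>).)*? , so the later .*? gaps are free to run past </ns:Input> into the next block. Here is a minimal Python sketch of this, with every gap guarded for comparison. The helper names make_block and build are hypothetical, the sample is a shortened version of the posted XML, and Python's re module stands in for Notepad++'s Boost engine.

```python
import re


def make_block(event, outside, inside, pplid):
    """Build one shortened <ns:Input> block shaped like the posted sample."""
    return (
        "<ns:Input>\n"
        "<ns:locationevent>%s</ns:locationevent>\n"
        "<ns:Action>\n<ns:name>%s</ns:name>\n</ns:Action>\n"
        "<ns:Coverage>\n<ns:Action>\n<ns:name>%s</ns:name>\n</ns:Action>\n</ns:Coverage>\n"
        "<ns:PPLID>%s</ns:PPLID>\n"
        "</ns:Input>\n" % (event, outside, inside, pplid)
    )


# Same three cases as the posted XML: the second block has deg inside Coverage.
SAMPLE = (
    make_block("xxxx", "ghy", "deg", "121")
    + make_block("yyyy", "def", "deg", "124")
    + make_block("yyyy", "def", "def", "123")
)

LOOSE = r".*?"                           # may cross </ns:Input>
GUARDED = r"(?:(?!</ns:Input>).)*?"      # stops at the end of the current block


def build(gap, pplid):
    """Chain the required pieces in order, separated by the chosen kind of gap."""
    steps = [
        "<ns:locationevent>yyyy</ns:locationevent>",
        "<ns:Action>", "<ns:name>def</ns:name>",
        "<ns:Coverage>", "<ns:Action>", "<ns:name>def</ns:name>",
        pplid, "</ns:Input>",
    ]
    # The first gap is guarded in both variants, as in the posted regex.
    pattern = "<ns:Input>" + GUARDED + steps[0] + "".join(gap + s for s in steps[1:])
    return re.compile(pattern, re.S)


loose = build(LOOSE, "123").search(SAMPLE)
guarded = build(GUARDED, "123").search(SAMPLE)
print(loose.group(0).count("<ns:Input>"))    # blocks covered by the loose match
print(guarded.group(0).count("<ns:Input>"))  # blocks covered by the guarded match
print(build(LOOSE, "124").search(SAMPLE))    # the 124 variant on the same sample
```

With the loose gaps the 123 match starts in the second block and ends in the third, while the guarded version matches the third block alone; the 124 variant finds nothing either way, because the second block has deg, not def, inside <ns:Coverage>.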

                          So, for your last post to be logical, I suppose that the definitive correct sample text is ( pppfff ! ) :

                          <ns:Input>
                          <ns:location>asfsafs</ns:location>
                          <ns:locationevent>xxxx</ns:locationevent>
                           <ns:Action>
                          <ns:name>abc</ns:name>
                          </ns:Action>
                          <ns:Action>
                          <ns:name>ghy</ns:name>
                          </ns:Action>
                          <ns:Coverage>
                          <ns:Action>
                          <ns:name>deg</ns:name>
                          </ns:Action>
                          </ns:Coverage>
                          </ns:locationevent>
                          <ns:PPLID>121</ns:PPLID>    <!-- ENDING symbol > ADDED -->
                          </ns:Input>
                          
                          <ns:Input>
                          <ns:location>asfsafs</ns:location>
                          <ns:locationevent>yyyy</ns:locationevent>
                            <ns:Action>
                          <ns:name>abc</ns:name>
                          </ns:Action>
                          <ns:Action>
                          <ns:name>def</ns:name>
                          </ns:Action>
                          <ns:Coverage>
                          <ns:Action>
                          <ns:name>def</ns:name>      <!-- CHANGED from 'deg' -->
                          </ns:Action>
                          </ns:Coverage>
                          <ns:PPLID>124</ns:PPLID>    <!-- ENDING symbol > ADDED -->
                          </ns:Input>
                          
                          
                          <ns:Input>
                          <ns:location>asfsafs</ns:location>
                          <ns:locationevent>yyyy</ns:locationevent>
                           <ns:Action>
                          <ns:name>abc</ns:name>
                          </ns:Action>
                          <ns:Action>
                          <ns:name>def</ns:name>
                          </ns:Action>
                          <ns:Coverage>
                          <ns:Action>
                          <ns:name>def</ns:name>
                          </ns:Action>
                          </ns:Coverage>
                          <ns:PPLID>123</ns:PPLID>
                          </ns:Input>
                          

                          Now, I succeeded in building a single regex which covers all your conditions. But note that this regex implicitly supposes that :

                          • The part <ns:locationevent> ..... </ns:locationevent> appears first, with the chosen value, in the <ns:Input> ..... </ns:Input> block

                          • Then, a part <ns:name> ..... </ns:name>, OUTSIDE a <ns:Coverage> .... </ns:Coverage> block, is present

                          • Then, a <ns:Coverage> ..... </ns:Coverage> block is present

                          • Then, a part <ns:name> ..... </ns:name>, INSIDE that <ns:Coverage> .... </ns:Coverage> block, with the chosen value, is present

                          • Finally a part <ns:PPLID> ..... </ns:PPLID> is present, before the main ending tag </ns:Input>

                          and ONLY in that order ( I insist on this fact ). So, for instance, if a <ns:Coverage> .... </ns:Coverage> block is placed right after the main starting tag <ns:Input>, the regex, below, will NOT match anything !!
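If that fixed order is too strict for your data, one way around it, outside Notepad++, is to test each condition separately on every block. Here is a minimal Python sketch of that idea. It is not the regex of this post : the names `block_qualifies` and `select_blocks` are made up for the illustration, and it assumes that the <ns:Input> blocks are never nested.

```python
import re

# Assumption: <ns:Input> blocks are not nested, so a lazy match up to the
# first closing tag isolates exactly one block.
BLOCK = re.compile(r"<ns:Input>.*?</ns:Input>", re.DOTALL)
COVERAGE = re.compile(r"<ns:Coverage>.*?</ns:Coverage>", re.DOTALL)

# Tag names keep their exact case; only the values ignore case.
EVENT = re.compile(r"<ns:locationevent>(?i:yyyy)</ns:locationevent>")
NAME_DEF = re.compile(r"<ns:name>(?i:def)</ns:name>")
PPLID = re.compile(r"<ns:PPLID>(?:123|124)</ns:PPLID>")


def block_qualifies(block):
    """Check every condition on one block, in no particular order."""
    coverage_parts = COVERAGE.findall(block)
    outside_coverage = COVERAGE.sub("", block)
    return bool(
        EVENT.search(block)
        and PPLID.search(block)
        and NAME_DEF.search(outside_coverage)
        and any(NAME_DEF.search(part) for part in coverage_parts)
    )


def select_blocks(text):
    """Return the <ns:Input> blocks which satisfy all the conditions."""
    return [block for block in BLOCK.findall(text) if block_qualifies(block)]


# A block whose parts come in an unusual order, and a copy with a wrong PPLID.
reordered = (
    "<ns:Input>"
    "<ns:Coverage><ns:Action><ns:name>def</ns:name></ns:Action></ns:Coverage>"
    "<ns:PPLID>123</ns:PPLID>"
    "<ns:Action><ns:name>DEF</ns:name></ns:Action>"
    "<ns:locationevent>yyyy</ns:locationevent>"
    "</ns:Input>"
)
rejected = reordered.replace("<ns:PPLID>123", "<ns:PPLID>121")
selected = select_blocks(reordered + "\n" + rejected)
```

With this approach, the first block is selected although its <ns:Coverage> part comes first, and the second one is rejected because of its PPLID value.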


                          Now that the example text is correct and the assumptions have been made, the construction of a regular expression is fairly easy ! I’m using the free-spacing mode, for readability

                          Refer to the link, below, for additional information on that mode :

                          https://www.regular-expressions.info/freespacing.html

                          So, here is my final regex, with a lot of comments !

                          (?x)                                            #  DEFAULT behavior : FREE-SPACING mode ( SPACE char IRRELEVANT and # begins COMMENT zone )
                          (?s)                                            #  DEFAULT behavior : the DOT stands for ANY SINGLE character ( STANDARD and EOL chars )
                          (?-i)                                           #  DEFAULT behavior : search SENSIBLE to CASE
                          
                          <ns:Input>                                      #  START of regex, with this EXACT case
                          (                                               #  START of Group 1  ( RE-USED, further on, as a SUBROUTINE CALL = (?1) )
                          ((?!</ns:Input>).)*?                            #  SHORTEST range of characters, even NULL, NOT CONTAINING the string '</ns:Input>'
                          )                                               #  End of Group 1
                          
                          <ns:locationevent>(?i:yyyy)</ns:locationevent>  #  FIRST condition  ( part 'yyyy' NOT sensible to CASE )
                          (?1)                                            #  Regex standing for GROUP 1
                          
                          <ns:Action>                                     #  with that EXACT case
                          (?1)                                            #  Regex standing for GROUP 1
                          <ns:name>(?i:def)</ns:name>                     #  SECOND condition ( part 'def' NOT sensible to CASE )
                          (?1)                                            #  Regex standing for GROUP 1
                          
                          <ns:Coverage>                                   #  THIRD condition, with that EXACT case
                           (?1)                                           #  Regex standing for GROUP 1
                          <ns:Action>                                     #  with that EXACT case
                          (?1)                                            #  Regex standing for GROUP 1
                          
                          <ns:name>(?i:def)</ns:name>                     #  FOURTH condition ( part 'def' NOT sensible to CASE )
                          (?1)                                            #  Regex standing for GROUP 1
                          
                          <ns:PPLID>(?i:124|123)</ns:PPLID>               #  FIFTH condition ( ALTERNATIVE '123|124' NOT sensible to CASE )
                          (?1)                                            #  Regex standing for GROUP 1
                          </ns:Input>                                     #  END of REGEX, with that EXACT case
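
If you want to test the same logic outside Notepad++, here is a rough Python port of the regex above. Take it as a sketch only : Python's `re` module has no (?1) subroutine call, so the Group 1 sub-pattern is simply pasted in wherever the regex writes (?1), and the names `GAP`, `INPUT_BLOCK` and `make_block` are invented for this example.

```python
import re

# Equivalent of Group 1: the shortest run of characters, possibly empty,
# which never steps over the string '</ns:Input>'.
GAP = r"(?:(?!</ns:Input>).)*?"

INPUT_BLOCK = re.compile(
    "<ns:Input>" + GAP
    + "<ns:locationevent>(?i:yyyy)</ns:locationevent>" + GAP   # FIRST condition
    + "<ns:Action>" + GAP
    + "<ns:name>(?i:def)</ns:name>" + GAP                      # SECOND condition
    + "<ns:Coverage>" + GAP                                    # THIRD condition
    + "<ns:Action>" + GAP
    + "<ns:name>(?i:def)</ns:name>" + GAP                      # FOURTH condition
    + "<ns:PPLID>(?i:124|123)</ns:PPLID>" + GAP                # FIFTH condition
    + "</ns:Input>",
    re.DOTALL,  # same role as (?s): the dot also matches EOL characters
)


def make_block(event, outer_name, inner_name, pplid):
    """Build one sample <ns:Input> block, shaped like the sample text above."""
    return (
        "<ns:Input>\n"
        "<ns:location>asfsafs</ns:location>\n"
        f"<ns:locationevent>{event}</ns:locationevent>\n"
        "<ns:Action>\n<ns:name>abc</ns:name>\n</ns:Action>\n"
        f"<ns:Action>\n<ns:name>{outer_name}</ns:name>\n</ns:Action>\n"
        "<ns:Coverage>\n<ns:Action>\n"
        f"<ns:name>{inner_name}</ns:name>\n"
        "</ns:Action>\n</ns:Coverage>\n"
        f"<ns:PPLID>{pplid}</ns:PPLID>\n"
        "</ns:Input>\n"
    )


SAMPLE = (
    make_block("xxxx", "ghy", "deg", "121")
    + make_block("yyyy", "def", "def", "124")
    + make_block("yyyy", "def", "def", "123")
)

matches = [m.group(0) for m in INPUT_BLOCK.finditer(SAMPLE)]
```

On this sample, only the second and third blocks are returned, which is the behavior expected from the Notepad++ regex.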
                          

                          So the road map is :

                          • Start Notepad++ ( your N++ version must be 7.8 or higher : Press the F1 key to verify )

                          • Open the Mark dialog ( Search > Mark... menu option )

                          • Copy/paste all the free-spacing regex, above, in the Find what: zone = (?x)................</ns:Input>

                          • Tick the Bookmark line option

                          • Tick the Purge for each search option

                          • Tick the Wrap around option

                          • Select the Regular expression search mode

                          • Click, once, on the Mark All button

                          => Normally, all lines of the main <ns:Input> ... </ns:Input> blocks, which satisfy all the conditions, should be bookmarked

                          Now :

                          • Run the menu option Search > Bookmark > Copy Bookmarked lines

                          • Open a new tab ( Ctrl + N )

                          • Paste all the bookmarked lines ( Ctrl + V )
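
For a scripted equivalent of the bookmark-and-copy steps above, a small Python sketch could look like this. The function name `bookmarked_lines` is invented, and I assume that every line touched by a match counts as bookmarked, which is what the steps above expect. The demo pattern is a simple stand-in, not the full regex.

```python
import re


def bookmarked_lines(text, pattern):
    """Return all the lines touched by at least one match of the pattern."""
    lines = text.splitlines(keepends=True)

    # Offset of the first character of each line, to map match spans to lines.
    starts = []
    offset = 0
    for line in lines:
        starts.append(offset)
        offset += len(line)

    keep = set()
    for match in pattern.finditer(text):
        for index, start in enumerate(starts):
            end = start + len(lines[index])
            # The line is kept when its span overlaps the span of the match.
            if start < match.end() and match.start() < end:
                keep.add(index)

    return "".join(lines[index] for index in sorted(keep))


demo = "<root>\n<ns:Input>\n<ns:PPLID>123</ns:PPLID>\n</ns:Input>\n</root>\n"
demo_pattern = re.compile(r"<ns:Input>.*?</ns:Input>", re.DOTALL)
copied = bookmarked_lines(demo, demo_pattern)
```

The result holds the three lines of the <ns:Input> block, without the surrounding <root> lines, ready to be written to a new file.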

                          REMARK :

                          • Note that the part ((?!</ns:Input>).)*? represents the shortest range, even null, of any characters, not containing the string </ns:Input>. It is re-used, further on in the regex, as (?1)

                          • Indeed, we cannot use the simple syntax .*?, with the lazy quantifier *?, because, in case a condition is not satisfied in a <ns:Input> .... </ns:Input> block, the match must not run past the end of this main block and continue into the next <ns:Input> .... </ns:Input> block in order to find a possible match ;-))
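
To illustrate the remark above, here is a tiny Python comparison between the simple .*? syntax and the ((?!</ns:Input>).)*? syntax. The two-block text is invented and reduced to a single condition on the PPLID value.

```python
import re

# Two blocks: only the second one holds the wanted PPLID value.
TEXT = (
    "<ns:Input><ns:PPLID>121</ns:PPLID></ns:Input>\n"
    "<ns:Input><ns:PPLID>124</ns:PPLID></ns:Input>\n"
)

# Simple syntax: even lazy, the dot is free to run over '</ns:Input>'.
naive = re.compile(
    r"<ns:Input>.*?<ns:PPLID>124</ns:PPLID>.*?</ns:Input>", re.DOTALL
)

# Tempered syntax: a character is consumed only if '</ns:Input>' does not
# begin at the current position.
gap = r"(?:(?!</ns:Input>).)*?"
tempered = re.compile(
    "<ns:Input>" + gap + "<ns:PPLID>124</ns:PPLID>" + gap + "</ns:Input>",
    re.DOTALL,
)

naive_match = naive.search(TEXT).group(0)
tempered_match = tempered.search(TEXT).group(0)
```

Here, `naive_match` begins in the first block and swallows both blocks, whereas `tempered_match` is the second block only.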

                          Best Regards,

                          guy038

                          P.S. :

                          Surprisingly, when you select all this free-spacing regex, to paste it in the Find what: zone, you notice that it contains 2,103 characters, which seems beyond the maximum of chars ( 2,046 ) !!??

                          But I did verify that the entire free-spacing regex is taken into account, by testing it against a main block whose closing tag lacks the ending > symbol :

                          <ns:Input>
                          ...
                          ...
                          ...
                          </ns:Input
                          

                          As expected, no match occurs for this main block !

                          • ?
                            A Former User
                            last edited by

                            @guy038 said in extract XMl with regex:

                            (?x) # DEFAULT behavior : FREE-SPACING mode ( SPACE char IRRELEVANT and # begins COMMENT zone )
                            (?s) # DEFAULT behavior : the DOT stands for ANY SINGLE character ( STANDARD and EOL chars )
                            (?-i) # DEFAULT behavior : search SENSIBLE to CASE

                            <ns:Input> # START of regex, with this EXACT case
                            ( # START of Group 1 ( RE-USED, further on, as a SUBROUTINE CALL = (?1) )
                            ((?!</ns:Input>).)*? # SHORTEST range of characters, even NULL, NOT CONTAINING the string ‘</ns:Input>’
                            ) # End of Group 1

                            <ns:locationevent>(?i:yyyy)</ns:locationevent> # FIRST condition ( part ‘yyyy’ NOT sensible to CASE )
                            (?1) # Regex standing for GROUP 1

                            <ns:Action> # with that EXACT case
                            (?1) # Regex standing for GROUP 1
                            <ns:name>(?i:def)</ns:name> # SECOND condition ( part ‘def’ NOT sensible to CASE )
                            (?1) # Regex standing for GROUP 1

                            <ns:Coverage> # THIRD condition, with that EXACT case
                            (?1) # Regex standing for GROUP 1
                            <ns:Action> # with that EXACT case
                            (?1) # Regex standing for GROUP 1

                            <ns:name>(?i:def)</ns:name> # FOURTH condition ( part ‘def’ NOT sensible to CASE )
                            (?1) # Regex standing for GROUP 1

                            <ns:PPLID>(?i:124|123)</ns:PPLID> # FIFTH condition ( ALTERNATIVE ‘123|124’ NOT sensible to CASE )
                            (?1) # Regex standing for GROUP 1
                            </ns:Input>

                            Thanks a lot. It works like a Charm!!! 
                            
                            • ?
                              A Former User
                              last edited by

                              @guy038
                              
                              I tried the above regex in Notepad++ 7.3.3 and it didn't work.

                              I need a regex that works in 7.3.3. Is there any other way to accomplish this?
                              
                              
                              • guy038G
                                guy038
                                last edited by

                                Hello, @vijay-s,

                                As I still have a local N++ 7.3.3 version on my laptop, it was very easy to verify that the regex does work there, under the assumptions listed above. For instance, I verified that blocks with values other than def, or with values other than 123|124, were not selected by the regex, as expected !

                                So, I suppose that your input text has, again, a different layout than before !?

                                Best Regards,

                                guy038

                                • ?
                                  A Former User
                                  last edited by

                                  It works in Notepad++ 7.3.3 if the expected XML is small. If it is big (around 1000 lines), then it selects the whole file instead of the expected XML. But the same thing works in 7.8…8

                                  • Alan KilbornA
                                    Alan Kilborn
                                    last edited by

                                    if it is big … then it selects the whole file instead of the expected…

                                    Sounds like a familiar bug.

                                    • ?
                                      A Former User
                                      last edited by

                                      
                                      Is there any update on this?
                                      
                                      • Alan KilbornA
                                        Alan Kilborn @A Former User
                                        last edited by

                                        @vijay-S

                                        What update are you expecting?

                                        • ?
                                          A Former User
                                          last edited by

                                          @vijay-S said in extract XMl with regex:

                                          It works in Notepad++ 7.3.3 if the expected XML is small. If it is big (around 1000 lines), then it selects the whole file instead of the expected XML. But the same thing works in 7.8…8

                                          The regex works in 7.8.8 not in 7.3.3 in case if the selected xml is big
                                          
                                          • PeterJonesP
                                            PeterJones
                                            last edited by PeterJones

                                            @vijay-S ,

                                            Please stop marking most of your normal discussion as “plaintext” or “code”. That </> CODE button (or manually using the ``` lines before and after) is used to highlight text that you need to keep raw – like code, or example text for your data – it is not meant to format every paragraph of your discussion. It makes it really hard to read.

                                            As proof, here’s my last paragraph in CODE mode; notice how hard it is to read?

                                            Please stop marking most of your normal discussion as "plaintext" or "code".  That `</> CODE`  button is used to highlight text that you need to keep raw -- like code, or example text for your data -- it is not meant to format every paragraph of your discussion.  It makes it really hard to read.
                                            

                                            Don’t get me wrong: It’s great for example text – so keep using it for when you are asking about certain text that you are trying to work with. But don’t use it for your normal conversation paragraphs.

                                            Back to your clarification:

                                            The regex works in 7.8.8

                                            There is no such version as 7.8.8 (at least, not yet); v7.8.2 has been released, and there is a release-candidate for v7.8.3. I will assume you mean v7.8.2, since that was the newest when this conversation started.

                                            The regex works in ~~7.8.8~~ 7.8.2 not in 7.3.3 in case if the selected xml is big

                                            Regarding there being a bug in v7.3.3 that isn’t present in v7.8.2: What do you expect? Do you expect a bugfix version of v7.3.3? The version number is incremented as bugs are fixed or features are improved. If v7.3.3 has a bug that you need fixed, you need to move to a newer version that has the bug fixed; you have already admitted that the feature works in newer versions. So if you need a version with the bug fixed, use the version with the bug fixed. If you don’t need a version with the bug fixed, feel free to stick with the old v7.3.3; either way, don’t complain that the bug still exists in the old version when you know it’s fixed in a newer version.
