extract XMl with regex

Peter Brand

Wow, that’s a lot of work. A simpler approach, and one that is much more robust, would be to use XSLT to transform your XML document.
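For illustration only, here is a rough Python sketch of the same parser-based idea (it is not XSLT itself). It assumes the `ns` prefix is bound to a made-up URI, `urn:example:ns`, and that the `<ns:Input>` blocks sit under a hypothetical `<root>` element; neither detail comes from the thread, so adjust both to the real document. The values tested (`yyyy`, `def`, `123`/`124`) are the ones discussed in this thread.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: the real namespace URI is unknown; this one is invented.
NS = {"ns": "urn:example:ns"}

# ASSUMPTION: a wrapping <root> element that declares the prefix.
DOC = """\
<root xmlns:ns="urn:example:ns">
  <ns:Input>
    <ns:locationevent>xxxx</ns:locationevent>
    <ns:Action><ns:name>def</ns:name></ns:Action>
    <ns:Coverage><ns:Action><ns:name>def</ns:name></ns:Action></ns:Coverage>
    <ns:PPLID>121</ns:PPLID>
  </ns:Input>
  <ns:Input>
    <ns:locationevent>yyyy</ns:locationevent>
    <ns:Action><ns:name>abc</ns:name></ns:Action>
    <ns:Action><ns:name>def</ns:name></ns:Action>
    <ns:Coverage><ns:Action><ns:name>def</ns:name></ns:Action></ns:Coverage>
    <ns:PPLID>124</ns:PPLID>
  </ns:Input>
</root>
"""


def texts(parent, path):
    """Lower-cased text of every element found at `path` below `parent`."""
    return {(el.text or "").strip().lower() for el in parent.findall(path, NS)}


def wanted(block):
    """True when one <ns:Input> block satisfies all the conditions."""
    return bool(
        "yyyy" in texts(block, "ns:locationevent")
        and "def" in texts(block, "ns:Action/ns:name")            # outside Coverage
        and "def" in texts(block, "ns:Coverage/ns:Action/ns:name")  # inside Coverage
        and texts(block, "ns:PPLID") & {"123", "124"}
    )


root = ET.fromstring(DOC)
selected = [block for block in root.findall("ns:Input", NS) if wanted(block)]
pplids = [block.findtext("ns:PPLID", namespaces=NS) for block in selected]
print(pplids)
```

Because the parser checks each element by its path, the result of this sketch does not depend on the order of the child elements inside a block.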

guy038

Hi, @vijay-s and All,

Ah …, this time, we get something more coherent ;-))

But, first, still a few corrections. In your penultimate post, some lines of your XML are malformed !

...
<ns:PPLID>121</ns:PPLID
...
...
<ns:PPLID>124</ns:PPLID

Of course, the ending > symbol is missing in these two lines

On the other hand, in your last post you said :

I need to pick the XML for the conditions below: <ns:Input>…<ns:locationevent>yyyy</ns:locationevent>…<ns:Action>…<ns:name>def</ns:name>…<ns:Coverage>…<ns:Action>…<ns:name>def</ns:name>…<ns:PPLID>124</ns:PPLID>…</ns:Input> – which is the second occurrence of the given XML

But unfortunately, given your example, the second block <ns:Input> .... </ns:Input> contains the part :

<ns:Coverage>
<ns:Action>
<ns:name>deg</ns:name>
</ns:Action>
</ns:Coverage>

And, obviously, it cannot match the regex as the string def is required in <ns:name> .... </ns:name> block !

So, for your last post to make sense, I suppose that the definitive correct sample text is ( pppfff ! ) :

<ns:Input>
<ns:location>asfsafs</ns:location>
<ns:locationevent>xxxx</ns:locationevent>
 <ns:Action>
<ns:name>abc</ns:name>
</ns:Action>
<ns:Action>
<ns:name>ghy</ns:name>
</ns:Action>
<ns:Coverage>
<ns:Action>
<ns:name>deg</ns:name>
</ns:Action>
</ns:Coverage>
</ns:locationevent>
<ns:PPLID>121</ns:PPLID>    <!-- ENDING symbol > ADDED -->
</ns:Input>

<ns:Input>
<ns:location>asfsafs</ns:location>
<ns:locationevent>yyyy</ns:locationevent>
  <ns:Action>
<ns:name>abc</ns:name>
</ns:Action>
<ns:Action>
<ns:name>def</ns:name>
</ns:Action>
<ns:Coverage>
<ns:Action>
<ns:name>def</ns:name>      <!-- BEFORE deg -->
</ns:Action>
</ns:Coverage>
<ns:PPLID>124</ns:PPLID>    <!-- ENDING symbol > ADDED -->
</ns:Input>


<ns:Input>
<ns:location>asfsafs</ns:location>
<ns:locationevent>yyyy</ns:locationevent>
 <ns:Action>
<ns:name>abc</ns:name>
</ns:Action>
<ns:Action>
<ns:name>def</ns:name>
</ns:Action>
<ns:Coverage>
<ns:Action>
<ns:name>def</ns:name>
</ns:Action>
</ns:Coverage>
<ns:PPLID>123</ns:PPLID>
</ns:Input>

Now, I succeeded in building a single regex which catches all your conditions. But note that this regex implicitly supposes that :

The part <ns:locationevent> ..... </ns:locationevent> appears first, with the chosen value, in the <ns:Input> ..... </ns:Input> block
Then, a part <ns:name> ..... </ns:name>, OUTSIDE a <ns:Coverage> .... </ns:Coverage> block, is present
Then, a <ns:Coverage> ..... </ns:Coverage> block, with the chosen value, is present
Then, a part <ns:name> ..... </ns:name>, INSIDE a <ns:Coverage> .... </ns:Coverage> block, with the chosen value, is present
Finally a part <ns:PPLID> ..... </ns:PPLID> is present, before the main ending tag </ns:Input>

and ONLY in that order ( I insist on this fact ). So, for instance, if a <ns:Coverage> .... </ns:Coverage> block is placed right after the main starting tag <ns:Input>, the regex, below, will NOT match anything !!

Now that the example text is correct and the assumptions have been made, the construction of a regular expression is fairly easy ! I’m using the free-spacing mode, for readability

Refer to the link, below, for additional information on that mode :

https://www.regular-expressions.info/freespacing.html

So, here is my final regex, with a lot of comments !

(?x)                                            #  DEFAULT behavior : FREE-SPACING mode ( SPACE char IRRELEVANT and # begins COMMENT zone )
(?s)                                            #  DEFAULT behavior : the DOT stands for ANY SINGLE character ( STANDARD and EOL chars )
(?-i)                                           #  DEFAULT behavior : search SENSIBLE to CASE

<ns:Input>                                      #  START of regex, with this EXACT case
(                                               #  START of Group 1  ( RE-USED, further on, as a SUBROUTINE CALL = (?1) )
((?!</ns:Input>).)*?                            #  SHORTEST range of characters, even NULL, NOT CONTAINING the string '</ns:Input>'
)                                               #  End of Group 1

<ns:locationevent>(?i:yyyy)</ns:locationevent>  #  FIRST condition  ( part 'yyyy' NOT sensible to CASE )
(?1)                                            #  Regex standing for GROUP 1

<ns:Action>                                     #  with that EXACT case
(?1)                                            #  Regex standing for GROUP 1
<ns:name>(?i:def)</ns:name>                     #  SECOND condition ( part 'def' NOT sensible to CASE )
(?1)                                            #  Regex standing for GROUP 1

<ns:Coverage>                                   #  THIRD condition, with that EXACT case
 (?1)                                           #  Regex standing for GROUP 1
<ns:Action>                                     #  with that EXACT case
(?1)                                            #  Regex standing for GROUP 1

<ns:name>(?i:def)</ns:name>                     #  FOURTH condition ( part 'def' NOT sensible to CASE )
(?1)                                            #  Regex standing for GROUP 1

<ns:PPLID>(?i:124|123)</ns:PPLID>               #  FIFTH condition ( ALTERNATIVE '123|124' NOT sensible to CASE )
(?1)                                            #  Regex standing for GROUP 1
</ns:Input>                                     #  END of REGEX, with that EXACT case
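For anyone who wants to try the same logic outside Notepad++, here is a minimal Python sketch. Python's built-in `re` module has no `(?1)` subroutine call, so the Group 1 sub-pattern is spliced in as a plain string between the literal parts. The helper name `build_pattern`, its parameters and the shortened sample text are illustrative only.

```python
import re

# Same idea as Group 1: the shortest run of characters, possibly empty,
# that never crosses the string '</ns:Input>'.
GAP = r"(?:(?!</ns:Input>).)*?"


def build_pattern(event="yyyy", name="def", pplids=("123", "124")):
    """Hypothetical helper: joins the literal parts with GAP, in the fixed order."""
    ids = "|".join(re.escape(p) for p in pplids)
    parts = [
        "<ns:Input>",
        rf"<ns:locationevent>(?i:{re.escape(event)})</ns:locationevent>",  # 1st condition
        "<ns:Action>",
        rf"<ns:name>(?i:{re.escape(name)})</ns:name>",                     # 2nd condition
        "<ns:Coverage>",                                                   # 3rd condition
        "<ns:Action>",
        rf"<ns:name>(?i:{re.escape(name)})</ns:name>",                     # 4th condition
        rf"<ns:PPLID>(?:{ids})</ns:PPLID>",                                # 5th condition
        "</ns:Input>",
    ]
    # re.DOTALL plays the role of (?s); matching is case-sensitive by default.
    return re.compile(GAP.join(parts), re.DOTALL)


# Shortened sample: the first block fails on 'xxxx', the second one qualifies.
SAMPLE = """\
<ns:Input>
<ns:locationevent>xxxx</ns:locationevent>
<ns:Action><ns:name>def</ns:name></ns:Action>
<ns:Coverage><ns:Action><ns:name>def</ns:name></ns:Action></ns:Coverage>
<ns:PPLID>121</ns:PPLID>
</ns:Input>
<ns:Input>
<ns:locationevent>yyyy</ns:locationevent>
<ns:Action><ns:name>abc</ns:name></ns:Action>
<ns:Action><ns:name>def</ns:name></ns:Action>
<ns:Coverage><ns:Action><ns:name>def</ns:name></ns:Action></ns:Coverage>
<ns:PPLID>124</ns:PPLID>
</ns:Input>
"""

matches = build_pattern().findall(SAMPLE)
for block in matches:
    print(block)
```

The pattern contains no capturing group, so `findall` returns each matching `<ns:Input>` block as a whole string.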

So the road map is :

Start Notepad++ ( your N++ version must be 7.8 or higher : Press the F1 key to verify )
Open the Mark dialog ( Search > Mark... menu option )
Copy/paste all the free-spacing regex, above, in the Find what: zone = (?x)................</ns:Input>
Tick the Bookmark line option
Tick the Purge for each search option
Tick the Wrap around option
Select the Regular expression search mode
Click, once, on the Mark All button

=> Normally, all lines of the main <ns:Input> ... </ns:Input> blocks, which satisfy all the conditions, should be bookmarked

Now :

Run the menu option Search > Bookmark > Copy Bookmarked lines
Open a new tab ( Ctrl + N )
Paste all the bookmarked lines ( Ctrl + V )

REMARK :

Note that the part ((?!</ns:Input>).)*? represents the shortest range, even null, of any characters, not containing the string </ns:Input>, which must be re-used, further on in the regex, as (?1)
Indeed, we cannot use the simple syntax .*?, with the lazy quantifier *?. When a condition is not met in a <ns:Input> .... </ns:Input> block, .*? would run past the end of this main block and continue into the next <ns:Input> .... </ns:Input> block in order to get a possible match, and this overlap is exactly what we must avoid ;-))
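To make this remark concrete, here is a small Python comparison (a sketch with made-up two-block text, not the full regex): the plain lazy `.*?` starts in the first block and spills into the second one, while the version with the negative look-ahead stays inside a single block.

```python
import re

# Two blocks: only the second one contains the wanted name 'def'.
TEXT = (
    "<ns:Input><ns:name>abc</ns:name></ns:Input>"
    "<ns:Input><ns:name>def</ns:name></ns:Input>"
)

# Plain lazy quantifier: free to cross '</ns:Input>' to find a match.
naive = re.compile(
    r"<ns:Input>.*?<ns:name>def</ns:name>.*?</ns:Input>", re.DOTALL
)

# Each step is guarded by a look-ahead, so the range can never
# contain the string '</ns:Input>'.
gap = r"(?:(?!</ns:Input>).)*?"
guarded = re.compile(
    "<ns:Input>" + gap + "<ns:name>def</ns:name>" + gap + "</ns:Input>",
    re.DOTALL,
)

naive_match = naive.search(TEXT).group(0)
guarded_match = guarded.search(TEXT).group(0)

print(naive_match == TEXT)  # the naive match swallows both blocks
print(guarded_match)        # only the second block
```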

Best Regards,

guy038

P.S. :

Surprisingly, when you select all this free-spacing regex, to paste it in the Find what: zone, you notice that it contains 2,103 characters, which seems beyond the maximum of chars ( 2,046 ) !!??

But I did verify that the whole free-spacing regex is taken into account, using a main block without the ending > symbol

<ns:Input>
...
...
...
</ns:Input

As expected, no match occurs for this main block !
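If the length limit ever does become a problem, one possible workaround would be to strip the comments and the layout whitespace before pasting. Below is a minimal Python sketch with a hypothetical helper name. It assumes that the pattern has no literal `#`, no escaped whitespace and no whitespace inside a character class, which holds for the regex above but not for free-spacing regexes in general. The leading `(?x)` is left in place, which is harmless once the spaces are gone.

```python
import re


def compact_free_spacing(pattern: str) -> str:
    """Hypothetical helper: collapse a free-spacing regex onto one line.

    ASSUMPTIONS: no literal '#', no escaped whitespace and no whitespace
    inside character classes anywhere in `pattern`.
    """
    pieces = []
    for line in pattern.splitlines():
        code = line.split("#", 1)[0]             # drop the trailing comment
        pieces.append(re.sub(r"\s+", "", code))  # drop the layout whitespace
    return "".join(pieces)


verbose = """\
(?x)                      #  FREE-SPACING mode
<ns:Input>                #  START of regex
( ((?!</ns:Input>).)*? )  #  Group 1
</ns:Input>               #  END of regex
"""

compact = compact_free_spacing(verbose)
print(compact)
print(len(verbose), len(compact))
```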

A Former User

Thanks a lot. It works like a Charm!!!

A Former User

@guy038

I tried the above regex in Notepad++ 7.3.3 and it didn't work.

I need a regex which works in 7.3.3. Is there any other way to accomplish this?

guy038

Hello, @vijay-s,

As I still have a local N++ 7.3.3 version, on my laptop, it was very easy to verify that the regex did work, assuming the hypotheses. For instance, I did verify that blocks, with values other than def or values other than 123|124 were not selected by the regex, as expected !

So, I suppose that your input text has, again, a different layout than before !?

Best Regards,

guy038

A Former User

It works in Notepad++ 7.3.3 if the expected XML is small. If it is big (about 1000 lines), then it selects the whole file instead of the expected XML. But the same thing works in 7.8…8

Alan Kilborn

if it is big … then it selects the whole file instead of the expected…

Sounds like a familiar bug.

A Former User


Is there any update on this?

Alan Kilborn

@vijay-S

What update are you expecting?

A Former User

The regex works in 7.8.8 not in 7.3.3 in case if the selected xml is big

PeterJones

@vijay-S ,

Please stop marking most of your normal discussion as “plaintext” or “code”. That </> CODE button (or manually using the ``` lines before and after) is used to highlight text that you need to keep raw – like code, or example text for your data – it is not meant to format every paragraph of your discussion. It makes it really hard to read.

As proof, here’s my last paragraph in CODE mode; notice how hard it is to read?

Please stop marking most of your normal discussion as "plaintext" or "code".  That `</> CODE`  button is used to highlight text that you need to keep raw -- like code, or example text for your data -- it is not meant to format every paragraph of your discussion.  It makes it really hard to read.

Don’t get me wrong: It’s great for example text – so keep using it for when you are asking about certain text that you are trying to work with. But don’t use it for your normal conversation paragraphs.

Back to your clarification:

The regex works in 7.8.8

There is no such version as 7.8.8 (at least, not yet); v7.8.2 has been released, and there is a release-candidate for v7.8.3. I will assume you mean v7.8.2, since that was the newest when this conversation started.

The regex works in ~~7.8.8~~ 7.8.2 not in 7.3.3 in case if the selected xml is big

Regarding there being a bug in v7.3.3 that isn’t present in v7.8.2: What do you expect? Do you expect a bugfix version of v7.3.3? The version number is incremented as bugs are fixed or features are improved. If v7.3.3 has a bug that you need fixed, you need to move to a newer version that has the bug fixed; you have already admitted that the feature works in newer versions. So if you need a version with the bug fixed, use the version with the bug fixed. If you don’t need a version with the bug fixed, feel free to stick with the old v7.3.3; either way, don’t complain that the bug still exists in the old version when you know it’s fixed in a newer version.