Regex: Select and delete the content of tags from xml file with skiping other tags

Robin Cruise

Hello, I have this rss.xml file. I want to use a regex to delete only those <url>…</url> sections that contain a link like https://my-website.com/stamina-art(number)

<url>
  <loc>https://my-website.com/en/wild-one.html</loc>
  <lastmod>2018-11-30T17:19:37+00:00</lastmod>
 <changefreq>weekly</changefreq>
  <priority>0.6400</priority>
</url>
<url>
  <loc>https://my-website.com/stamina-art60/en/wild-two.html</loc>
  <lastmod>2018-11-30T17:19:37+00:00</lastmod>
 <changefreq>weekly</changefreq>
  <priority>0.6400</priority>
</url>
<url>
  <loc>https://my-website.com/stamina-art20/en/wild-three.html</loc>
  <lastmod>2018-11-30T17:19:37+00:00</lastmod>
 <changefreq>weekly</changefreq>
  <priority>0.6400</priority>
</url>
<url>
  <loc>https://my-website.com/en/wild-four.html</loc>
  <lastmod>2018-11-30T17:19:37+00:00</lastmod>
 <changefreq>weekly</changefreq>
  <priority>0.6400</priority>
</url>
<url>
  <loc>https://my-website.com/en/wild-five.html</loc>
  <lastmod>2018-11-30T17:19:37+00:00</lastmod>
 <changefreq>weekly</changefreq>
  <priority>0.6400</priority>
</url>

The Output should be:

<url>
  <loc>https://my-website.com/en/wild-one.html</loc>
  <lastmod>2018-11-30T17:19:37+00:00</lastmod>
 <changefreq>weekly</changefreq>
  <priority>0.6400</priority>
</url>
<url>
  <loc>https://my-website.com/en/wild-four.html</loc>
  <lastmod>2018-11-30T17:19:37+00:00</lastmod>
 <changefreq>weekly</changefreq>
  <priority>0.6400</priority>
</url>
<url>
  <loc>https://my-website.com/en/wild-five.html</loc>
  <lastmod>2018-11-30T17:19:37+00:00</lastmod>
 <changefreq>weekly</changefreq>
  <priority>0.6400</priority>
</url>

I made a regex, but I don’t know why it selects/deletes more than I want:

FIND: (?:<url>).*?(?=https://my-website.com/stamina-art\d+).*?(html.*?</url>)

Alan Kilborn

@Robin-Cruise said in Regex: Select and delete the content of tags from xml file with skiping other tags:

I made a regex, but I don’t know why it selects/deletes more than I want:
FIND: (?:<url>).*?(?=https://my-website.com/stamina-art\d+).*?(html.*?</url>)

Actually, that regex doesn’t match any of your text.
Can you show us a real one that does match, and more than you intend?

Robin Cruise

@Alan-Kilborn My regex also matches the first <url>…</url> section (which I don’t want). I forgot to mention that the ". matches newline" option is ticked.

guy038

Hi, @robin-cruise, @alan-kilborn and All,

@Robin-cruise, I’m going to show you the different steps to get the right regex S/R. Note that I’ll use the free-spacing mode (?x) for better readability !

First, let’s start with the simple regex, below, which searches for an entire area <url>•••••</url> :
(?xs-i) <url> .+? </url> \R
- Note that we use the (?s) modifier as we need to match a multi-line area. We also use the (?-i) modifier to match </url> only and not, for instance, </UrL> or </URL>

Now, as we need to match an entire section <url>•••••</url> ONLY IF it contains the string https://my-website.com/stamina-art followed by a number, the first idea is to use the usual regex :
(?xs-i) <url> .+? https://my-website.com/stamina-art\d+/ .+? </url> \R
- Almost identical to your own attempt, it does not work as expected, as shown in your picture, because it matches the first two sections <url>•••••</url> together ! Indeed, logically, the first part of the regex ( (?s-i)<url>.+?https://my-website.com/stamina-art\d+/ ) matches from the first string <url> found to the nearest string matching https://my-website.com/stamina-art\d+/, even when that string belongs to a later section.
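
The over-match described above can be reproduced outside Notepad++. The following is only a rough sketch in Python, not part of the original answer: Python's re module has no \R, so \r?\n is used instead, the flags re.VERBOSE and re.DOTALL stand in for (?x) and (?s), and the dots of the host name are escaped. The shortened SAMPLE text is an assumption made for illustration.

```python
import re

# Shortened, hypothetical version of the rss.xml sample from the question
SAMPLE = (
    "<url>\n"
    "  <loc>https://my-website.com/en/wild-one.html</loc>\n"
    "</url>\n"
    "<url>\n"
    "  <loc>https://my-website.com/stamina-art60/en/wild-two.html</loc>\n"
    "</url>\n"
)

# Naive pattern: nothing stops the first .+? from running past a </url> tag
naive = re.compile(
    r"<url> .+? https://my-website\.com/stamina-art\d+/ .+? </url> \r?\n",
    re.VERBOSE | re.DOTALL,
)

match = naive.search(SAMPLE)
overmatched = match.group(0)

# Count how many <url> sections ended up inside the single match
sections_swallowed = overmatched.count("<url>")
print(sections_swallowed)
```

The match starts at the first <url> of the file, so the unrelated wild-one section is swallowed together with the stamina-art one.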

As we see that this range of chars crosses the </url> ending tag, a possible approach is to verify, at each position before the address, that we are not about to cross a </url> string, with the regex :
(?xs-i) <url> ( (?!</url>) .)+? https://my-website.com/stamina-art\d+/ .+? </url>\R
- Bingo ! This time, it selects only sections 2 and 3, the <url>•••••</url> sections that contain the https://my-website.com/stamina-art string
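
As a hedged illustration of this negative look-ahead idea, here is a small Python sketch. It is an adaptation, not the Notepad++ regex itself: \R is replaced by \r?\n, the flags re.VERBOSE and re.DOTALL replace (?x) and (?s), the group is made non-capturing, and the three-section SAMPLE is a shortened, hypothetical input.

```python
import re

# Shortened, hypothetical input: only the middle section should go
SAMPLE = (
    "<url>\n"
    "  <loc>https://my-website.com/en/wild-one.html</loc>\n"
    "</url>\n"
    "<url>\n"
    "  <loc>https://my-website.com/stamina-art60/en/wild-two.html</loc>\n"
    "</url>\n"
    "<url>\n"
    "  <loc>https://my-website.com/en/wild-four.html</loc>\n"
    "</url>\n"
)

# Each char consumed before the address is first checked by (?!</url>),
# so the match can never run past the end of its own section
tempered = re.compile(
    r"<url> (?: (?!</url>) . )+? "
    r"https://my-website\.com/stamina-art\d+/ .+? </url> \r?\n",
    re.VERBOSE | re.DOTALL,
)

cleaned = tempered.sub("", SAMPLE)
print(cleaned)
```

Starting from the first <url>, the look-ahead blocks the match at the first </url>, so the search fails there and restarts at the next section.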

However, if we notice that every URL address comes right after the <url> opening tag and the <loc> tag, a simpler regex is possible :
(?xs-i) <url> \s+ <loc> https://my-website.com/stamina-art\d+/ .+? </url> \R
- Indeed, between the <url> and https://my-website.com/stamina-art parts, we just have the regex \s+<loc>, which is restrictive enough to prevent an overly long match !
- And the regex part .+?, with the lazy quantifier +?, will match the smallest range of chars, after the address, up to the first </url> ending tag found

So my final solution is :

SEARCH (?xs-i) <url> \s+ <loc> https://my-website.com/stamina-art\d+/ .+? </url> \R

REPLACE Leave EMPTY
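
For readers who want to run the same clean-up in a script rather than in the Notepad++ Replace dialog, here is a minimal sketch in Python. It rests on several assumptions: Python's re module does not support \R, so \r?\n is used; re.VERBOSE and re.DOTALL replace the (?x) and (?s) modifiers; matching is case-sensitive by default, which plays the role of (?-i). The names remove_stamina_sections and make_section are hypothetical helpers, and the sample sections are shortened.

```python
import re

STAMINA_SECTION = re.compile(
    r"""
    <url> \s+ <loc>                           # opening tags, only whitespace between them
    https://my-website\.com/stamina-art\d+/   # the address to filter on
    .+?                                       # lazily, up to the nearest closing tag
    </url> \r?\n                              # closing tag and its line break
    """,
    re.VERBOSE | re.DOTALL,
)


def remove_stamina_sections(xml_text: str) -> str:
    """Delete every <url> section whose <loc> points to stamina-art<number>."""
    return STAMINA_SECTION.sub("", xml_text)


def make_section(loc: str) -> str:
    """Build a shortened <url> section for the demo."""
    return (
        "<url>\n"
        f"  <loc>{loc}</loc>\n"
        "  <lastmod>2018-11-30T17:19:37+00:00</lastmod>\n"
        "</url>\n"
    )


urls = [
    "https://my-website.com/en/wild-one.html",
    "https://my-website.com/stamina-art60/en/wild-two.html",
    "https://my-website.com/stamina-art20/en/wild-three.html",
    "https://my-website.com/en/wild-four.html",
    "https://my-website.com/en/wild-five.html",
]
sample = "".join(make_section(u) for u in urls)
result = remove_stamina_sections(sample)
print(result)
```

Replacing with an empty string mirrors the "REPLACE Leave EMPTY" setting. For real sitemap work, an XML parser would usually be more robust than a regex, but that is outside the scope of this thread.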

Hope this helps !

Best Regards,

guy038

Robin Cruise

thank you @guy038

Robin Cruise

@guy038 said in Regex: Select and delete the content of tags from xml file with skiping other tags:

(?xs-i)

@guy038 What exactly does (?xs-i) do in the regex example? Isn’t it the same as (?s) ?

Alan Kilborn

@Robin-Cruise

I forgot to mention that the ". matches newline" option is ticked.

Ah, okay. That’s an important part of it.

What exactly does (?xs-i) do in the regex example? Isn’t it the same as (?s)

It isn’t the same, but (?xs-i) will do the (?s) functionality as well as other things.

It works like this:

(?turnOnOptions-turnOffOptions)

The options of value are:

x : free-spacing mode (embedded spaces are ignored)
s : single-line mode (bad name); turning it on is equivalent to ticking ". matches newline", turning it off is the same as unticking it
i : ignore case

If you turn “on” ignore case, then case will be insignificant (both a and A will match if you use a in your expression).
If you turn “off” ignore case, then you are saying case matters (a is different from A)
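
To see these options in action outside Notepad++, here is a small, hedged Python sketch. Python's re module is not the Boost engine used by N++: as far as I know, it accepts a "turn off" option only in the scoped form (?-i:…), not as a leading (?xs-i), so the sketch shows each option being turned on at the start of the pattern, plus one scoped turn-off. The text value is a made-up example.

```python
import re

text = "<URL>\nx\n</URL>"

# s : without it, the dot stops at line breaks; with it, the dot crosses them
dot_default = re.search(r"<URL>.+</URL>", text) is not None
dot_single_line = re.search(r"(?s)<URL>.+</URL>", text) is not None

# i : without it, case matters; with it, <url> also matches <URL>
case_sensitive = re.search(r"<url>", text) is not None
case_ignored = re.search(r"(?i)<url>", text) is not None

# x : embedded spaces in the pattern are ignored
free_spacing = re.search(r"(?x) < URL >", text) is not None

# Scoped turn-off: ignore case globally, except inside the (?-i:...) group
scoped_off = re.search(r"(?i)<(?-i:url)>", text) is not None

print(dot_default, dot_single_line, case_sensitive, case_ignored,
      free_spacing, scoped_off)
```

Each variable simply records whether the pattern found a match, which makes the effect of a single option easy to compare side by side.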

guy038

@robin-cruise,

In my previous post, I said, at the beginning :

Note that I’ll use the free-spacing mode (?x) for a better readability

And, two lines after I said :

Note that we use the (?s) modifier as we need to match a multi-line area. We also use the (?-i) modifier to match </url> only and not, for instance, </UrL> or </URL>

Now, as always, everything is in the fucking manual ! So, first, go to the official N++ documentation :

https://npp-user-manual.org/docs/searching/#search-modifiers

And you may have a look at these two HTML pages, below, on the regular-expressions.info site :

https://www.regular-expressions.info/modifiers.html

https://www.regular-expressions.info/freespacing.html

Note that the N++ Boost regex engine just handles the four modifiers i, s, x and m and their negative counterparts -i, -s, -x and -m !

Enjoy your reading !

Afterwards, you should be able to understand the (?xs-i) group of in-line modifiers, which is definitely different from the shorter (?s) syntax !

BR

guy038

Alan Kilborn

@guy038 said in Regex: Select and delete the content of tags from xml file with skiping other tags:

Now, as always, everything is in the fucking manual ! So, first, go to the official N++ documentation

Whoa, whoa, WHOA!
Could it be that the ever-patient @guy038 is getting tired of all the repetitive regex stuff here?? Say it isn’t so.
I get tired of it as well…so tired…but I haven’t yet been driven to use the alternate definition of RTFM (the normal definition being Read That Fine Manual).

Seriously, though, when I look at that section of the manual, it reads more like a formula than a specific application (and that’s ok).

In the manual we have:

(?enable-disable)

which, well, isn’t exactly crystal clear for those that don’t already know what it means.

So my attempt at explaining the (?.. : …) construct was meant to help clarify.
I don’t know if I succeeded.