Regex: Select and delete the content of tags from an XML file while skipping other tags

  • R
    Robin Cruise
    last edited by Jun 22, 2021, 2:19 PM

    Hello, I have this rss.xml file. I want to use a regex to delete only those <url>...</url> blocks that contain a link like https://my-website.com/stamina-art(number)

    <url>
      <loc>https://my-website.com/en/wild-one.html</loc>
      <lastmod>2018-11-30T17:19:37+00:00</lastmod>
     <changefreq>weekly</changefreq>
      <priority>0.6400</priority>
    </url>
    <url>
      <loc>https://my-website.com/stamina-art60/en/wild-two.html</loc>
      <lastmod>2018-11-30T17:19:37+00:00</lastmod>
     <changefreq>weekly</changefreq>
      <priority>0.6400</priority>
    </url>
    <url>
      <loc>https://my-website.com/stamina-art20/en/wild-three.html</loc>
      <lastmod>2018-11-30T17:19:37+00:00</lastmod>
     <changefreq>weekly</changefreq>
      <priority>0.6400</priority>
    </url>
    <url>
      <loc>https://my-website.com/en/wild-four.html</loc>
      <lastmod>2018-11-30T17:19:37+00:00</lastmod>
     <changefreq>weekly</changefreq>
      <priority>0.6400</priority>
    </url>
    <url>
      <loc>https://my-website.com/en/wild-five.html</loc>
      <lastmod>2018-11-30T17:19:37+00:00</lastmod>
     <changefreq>weekly</changefreq>
      <priority>0.6400</priority>
    </url>
    

    The Output should be:

    <url>
      <loc>https://my-website.com/en/wild-one.html</loc>
      <lastmod>2018-11-30T17:19:37+00:00</lastmod>
     <changefreq>weekly</changefreq>
      <priority>0.6400</priority>
    </url>
    <url>
      <loc>https://my-website.com/en/wild-four.html</loc>
      <lastmod>2018-11-30T17:19:37+00:00</lastmod>
     <changefreq>weekly</changefreq>
      <priority>0.6400</priority>
    </url>
    <url>
      <loc>https://my-website.com/en/wild-five.html</loc>
      <lastmod>2018-11-30T17:19:37+00:00</lastmod>
     <changefreq>weekly</changefreq>
      <priority>0.6400</priority>
    </url>
    

    I made a regex, but I don’t know why it selects/deletes more than I want:

    FIND: (?:<url>).*?(?=https://my-website.com/stamina-art\d+).*?(html.*?</url>)

    • A
      Alan Kilborn @Robin Cruise
      last edited by Jun 22, 2021, 2:48 PM

      @Robin-Cruise said in Regex: Select and delete the content of tags from xml file with skiping other tags:

      I made a regex, but I don’t know why it selects/deletes more than I want:
      FIND: (?:<url>).*?(?=https://my-website.com/stamina-art\d+).*?(html.*?</url>)

      Actually, that regex doesn’t match any of your text.
      Can you show us a real one that does match, and more than you intend?

      • R
        Robin Cruise
        last edited by Robin Cruise Jun 22, 2021, 8:42 PM Jun 22, 2021, 8:40 PM

        @Alan-Kilborn My regex also matches the first <url>...</url> block (which I don’t want), but I forgot to mention the “. matches newline” option

        [Screenshot of the search result]

        • G
          guy038
          last edited by guy038 Jun 23, 2021, 10:55 AM Jun 22, 2021, 10:35 PM

          Hi, @robin-cruise, @alan-kilborn and All,

          @Robin-cruise I’m going to show you the different steps to get the right regex S/R. Note that I’ll use the free-spacing mode (?x) for better readability!

          • First, let’s start with the simple regex, below, which searches for an entire area <url>•••••</url> :
            (?xs-i) <url> .+? </url> \R

            • Note that we use the (?s) modifier as we need to match a multi-line area. We also use the (?-i) modifier to match </url> and not, for instance, </UrL> or <URL>

          • Now, as we need to match an entire section <url>•••••</url> ONLY IF it contains the string https://my-website.com/stamina-art followed by a number, the first idea is to use the usual regex :
            (?xs-i) <url> .+? https://my-website.com/stamina-art\d+/ .+? </url> \R

            • Almost identical to your own attempt, it does not work as expected: as shown in your picture, it matches the first two sections <url>•••••</url> as a single block. Indeed, logically, the first part of the regex ( (?s-i)<url>.+?https://my-website.com/stamina-art\d+/ ) matches from the first string <url> found to the nearest string https://my-website.com/stamina-art\d+/, even if that string lies in a later section.

          • As this range of chars crosses the </url> ending tag, a possible approach is to verify that, at each position, no </url> string is met, with the regex :
            (?xs-i) <url> ( (?!</url>) .)+? https://my-website.com/stamina-art\d+/ .+? </url>\R

            • Bingo! This time, it selects only sections 2 and 3, the <url>•••••</url> sections that contain the https://my-website.com/stamina-art string

          • However, if we notice that every URL address comes immediately after the <url> opening tag and the <loc> tag, a simpler regex is possible :
            (?xs-i) <url> \s+ <loc> https://my-website.com/stamina-art\d+/ .+? </url> \R

            • Indeed, between the <url> and https://my-website.com/stamina-art parts, we just have the regex \s+<loc>, which is restrictive enough to prevent an overly long match!

            • And the regex part .+?, with the lazy quantifier +?, will match the smallest range of chars after the address, up to the first </url> ending tag found
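The three steps above can be reproduced outside Notepad++ with a small Python sketch. This is only an illustration under a few assumptions: the sample is shortened, the `section` helper and the pattern names are made up for the example, the dots in the host name are escaped, `\r?\n` stands in for `\R` (which Python's `re` module does not provide), and a non-capturing group replaces the capturing one so that `findall` returns whole matches.

```python
import re


def section(path):
    """Build one shortened <url> section (hypothetical helper for this demo)."""
    return (
        "<url>\n"
        f"  <loc>https://my-website.com/{path}</loc>\n"
        "  <priority>0.6400</priority>\n"
        "</url>\n"
    )


SAMPLE = "".join(
    section(p)
    for p in [
        "en/wild-one.html",
        "stamina-art60/en/wild-two.html",
        "stamina-art20/en/wild-three.html",
        "en/wild-four.html",
    ]
)

# Step 2: lazy .+? still runs across </url> until it reaches a stamina-art URL.
NAIVE = r"(?xs) <url> .+? https://my-website\.com/stamina-art\d+/ .+? </url> \r?\n"

# Step 3: each char is only consumed if </url> does not start at that position.
TEMPERED = (
    r"(?xs) <url> (?: (?!</url>) . )+? "
    r"https://my-website\.com/stamina-art\d+/ .+? </url> \r?\n"
)

# Step 4: only whitespace and <loc> are allowed between <url> and the URL.
ANCHORED = (
    r"(?xs) <url> \s+ <loc> "
    r"https://my-website\.com/stamina-art\d+/ .+? </url> \r?\n"
)

naive = re.findall(NAIVE, SAMPLE)
tempered = re.findall(TEMPERED, SAMPLE)
anchored = re.findall(ANCHORED, SAMPLE)

for name, matches in [("naive", naive), ("tempered", tempered), ("anchored", anchored)]:
    print(name, [m.count("<url>") for m in matches])
```

Each printed list shows how many `<url>` opening tags every match swallowed, which makes the over-long first match of the naive pattern easy to spot.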


          So my final solution is :

          SEARCH (?xs-i) <url> \s+ <loc> https://my-website.com/stamina-art\d+/ .+? </url> \R

          REPLACE Leave EMPTY
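For anyone who wants to run the same search/replace from a script instead of the Notepad++ dialog, here is a minimal Python sketch of it. It is an adaptation, not the Notepad++ regex itself: the function name `remove_stamina_art_entries` and the shortened sample are hypothetical, the flags are passed as `re.VERBOSE | re.DOTALL` rather than as `(?xs-i)`, and `\r?\n` replaces `\R`, so a final block with no trailing line break would not be removed.

```python
import re

# Python's re is case-sensitive by default, so no equivalent of -i is needed.
STAMINA_URL_BLOCK = re.compile(
    r"""
    <url> \s+ <loc>                           # opening tags, whitespace only in between
    https://my-website\.com/stamina-art\d+/   # the unwanted URL prefix
    .+?                                       # lazily, as few chars as possible, up to
    </url> \r?\n                              # the nearest closing tag and its line break
    """,
    re.VERBOSE | re.DOTALL,
)


def remove_stamina_art_entries(xml_text: str) -> str:
    """Delete every <url> block whose <loc> points into a stamina-art<number> folder."""
    return STAMINA_URL_BLOCK.sub("", xml_text)


sample = (
    "<url>\n  <loc>https://my-website.com/en/wild-one.html</loc>\n</url>\n"
    "<url>\n  <loc>https://my-website.com/stamina-art60/en/wild-two.html</loc>\n</url>\n"
    "<url>\n  <loc>https://my-website.com/en/wild-four.html</loc>\n</url>\n"
)

cleaned = remove_stamina_art_entries(sample)
print(cleaned)
```

Replacing with an empty string mirrors the "REPLACE Leave EMPTY" setting above.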

          Hope this helps !

          Best Regards,

          guy038

          • R
            Robin Cruise @guy038
            last edited by Jun 23, 2021, 6:28 AM

            thank you @guy038

            • R
              Robin Cruise
              last edited by Jun 23, 2021, 6:36 AM

              @guy038 said in Regex: Select and delete the content of tags from xml file with skiping other tags:

              (?xs-i)

              @guy038 What exactly does (?xs-i) do in the example regex? Is it not the same as (?s)?

              • A
                Alan Kilborn @Robin Cruise
                last edited by Jun 23, 2021, 11:53 AM

                @Robin-Cruise

                I forgot to mention the .matches newline

                Ah, okay. That’s an important part of it.


                What exactly does (?xs-i) do in the example regex? Is it not the same as (?s)?

                It isn’t the same, but (?xs-i) will do the (?s) functionality as well as other things.

                It works like this:

                (?turnOnOptions-turnOffOptions)

                The options of value are:

                x : free-spacing mode (embedded spaces are ignored)
                s : single-line mode (bad name); turning it on is equivalent to ticking . matches newline, turning it off is the same as unticking it
                i : ignore case

                If you turn “on” ignore case, then case will be insignificant (both a and A will match if you use a in your expression).
                If you turn “off” ignore case, then you are saying case matters (a is different from A)
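The effect of each option can be checked with a short Python sketch. Two caveats, both assumptions of this example and not statements about Notepad++: Python's `re` module is used as a stand-in for the Boost engine, and Python does not accept the `(?xs-i)` form at the start of a pattern, so turning an option off is shown with the scoped form `(?xs-i: ... )`. The helper `found` and the sample text are made up for the demonstration.

```python
import re

TEXT = "<url>\nsome content\n</url>"


def found(pattern, text, flags=0):
    """Return True if the pattern matches somewhere in the text."""
    return re.search(pattern, text, flags) is not None


# s: without it the dot stops at a line break, with it the dot crosses lines.
dot_without_s = found(r"<url>.+</url>", TEXT)
dot_with_s = found(r"(?s)<url>.+</url>", TEXT)

# x: spaces inside the pattern are ignored, so it can be laid out for reading.
spaced_with_x = found(r"(?xs) <url> .+ </url>", TEXT)

# i: with it, <URL> in the pattern also matches <url> in the text.
upper_with_i = found(r"(?is)<URL>.+</URL>", TEXT)
upper_without_i = found(r"(?s)<URL>.+</URL>", TEXT)

# Turning an option off: ignore case is switched on for the whole search via
# the flags argument, then switched off again inside the scoped group.
scoped_off = found(r"(?xs-i: <URL> .+ </URL> )", TEXT, re.IGNORECASE)

print(dot_without_s, dot_with_s, spaced_with_x)
print(upper_with_i, upper_without_i, scoped_off)
```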

                • G
                  guy038
                  last edited by guy038 Jun 23, 2021, 12:10 PM Jun 23, 2021, 12:06 PM

                  @robin-cruise,

                  In my previous post, I said, at the beginning :

                  Note that I’ll use the free-spacing mode (?x) for better readability

                  And, two lines later, I said :

                  • Note that we use the (?s) modifier as we need to match a multi-line area. We also use the (?-i) modifier to match </url> and not, for instance, </UrL> or <URL>

                  Now, as always, everything is in the fucking manual ! So, first, go to the official N++ documentation :

                  https://npp-user-manual.org/docs/searching/#search-modifiers

                  And you may have a look at these two HTML pages, below, on the regular-expressions.info site :

                  https://www.regular-expressions.info/modifiers.html

                  https://www.regular-expressions.info/freespacing.html

                  Note that the N++ Boost regex engine just handles the four modifiers i, s, x and m and their negative counterparts -i, -s, -x and -m !


                  Enjoy your reading !

                  Afterwards, you should be able to understand the (?xs-i) group of in-line modifiers, which is definitely different from the shorter (?s) syntax !

                  BR

                  guy038

                  • A
                    Alan Kilborn @guy038
                    last edited by Alan Kilborn Jun 23, 2021, 12:55 PM Jun 23, 2021, 12:54 PM

                    @guy038 said in Regex: Select and delete the content of tags from xml file with skiping other tags:

                    Now, as always, everything is in the fucking manual ! So, first, go to the official N++ documentation

                    Whoa, whoa, WHOA!
                    Could it be that the ever-patient @guy038 is getting tired of all the repetitive regex stuff here?? Say it isn’t so.
                    I get tired of it as well…so tired…but I haven’t yet been driven to use the alternate definition of RTFM (the normal definition being Read That Fine Manual).

                    Seriously, though, when I look at that section of the manual, it reads more like a formula than a specific application (and that’s ok).

                    In the manual we have:

                    (?enable-disable)

                    which, well, isn’t exactly crystal clear for those that don’t already know what it means.

                    So my attempt, in explaining the (?.. : …) construct was to help clarify.
                    I don’t know if I succeeded.

                    The Community of users of the Notepad++ text editor.