    Regex: Select and delete the content of tags from xml file with skiping other tags

    • Robin Cruise

      Hello, I have this rss.xml file. I want to use a regex to delete only those blocks, from <url> to </url>, that contain a link like https://my-website.com/stamina-art(number)

      <url>
        <loc>https://my-website.com/en/wild-one.html</loc>
        <lastmod>2018-11-30T17:19:37+00:00</lastmod>
       <changefreq>weekly</changefreq>
        <priority>0.6400</priority>
      </url>
      <url>
        <loc>https://my-website.com/stamina-art60/en/wild-two.html</loc>
        <lastmod>2018-11-30T17:19:37+00:00</lastmod>
       <changefreq>weekly</changefreq>
        <priority>0.6400</priority>
      </url>
      <url>
        <loc>https://my-website.com/stamina-art20/en/wild-three.html</loc>
        <lastmod>2018-11-30T17:19:37+00:00</lastmod>
       <changefreq>weekly</changefreq>
        <priority>0.6400</priority>
      </url>
      <url>
        <loc>https://my-website.com/en/wild-four.html</loc>
        <lastmod>2018-11-30T17:19:37+00:00</lastmod>
       <changefreq>weekly</changefreq>
        <priority>0.6400</priority>
      </url>
      <url>
        <loc>https://my-website.com/en/wild-five.html</loc>
        <lastmod>2018-11-30T17:19:37+00:00</lastmod>
       <changefreq>weekly</changefreq>
        <priority>0.6400</priority>
      </url>
      

      The Output should be:

      <url>
        <loc>https://my-website.com/en/wild-one.html</loc>
        <lastmod>2018-11-30T17:19:37+00:00</lastmod>
       <changefreq>weekly</changefreq>
        <priority>0.6400</priority>
      </url>
      <url>
        <loc>https://my-website.com/en/wild-four.html</loc>
        <lastmod>2018-11-30T17:19:37+00:00</lastmod>
       <changefreq>weekly</changefreq>
        <priority>0.6400</priority>
      </url>
      <url>
        <loc>https://my-website.com/en/wild-five.html</loc>
        <lastmod>2018-11-30T17:19:37+00:00</lastmod>
       <changefreq>weekly</changefreq>
        <priority>0.6400</priority>
      </url>
      

      I made a regex, but I don’t know why it selects/deletes more than I want:

      FIND: (?:<url>).*?(?=https://my-website.com/stamina-art\d+).*?(html.*?</url>)

      • Alan Kilborn @Robin Cruise

        @Robin-Cruise said in Regex: Select and delete the content of tags from xml file with skiping other tags:

        I made a regex, but I don’t know why it selects/deletes more than I want:
        FIND: (?:<url>).*?(?=https://my-website.com/stamina-art\d+).*?(html.*?</url>)

        Actually, that regex doesn’t match any of your text.
        Can you show us a real one that does match, and more than you intend?

        • Robin Cruise

          @Alan-Kilborn My regex also matches the first <url></url> (which I don’t want), but I forgot to mention the “. matches newline” option

          [Screenshot: 2f3ce5fe-4338-433e-bd0d-4c639cc61baa-image.png]

          • guy038

            Hi, @robin-cruise, @alan-kilborn and All,

            @Robin-cruise, I’m going to show you the different steps to get the right regex S/R. Note that I’ll use the free-spacing mode (?x) for better readability !

            • First, let’s start with the simple regex, below, which searches for an entire area <url>•••••</url> :
              (?xs-i) <url> .+? </url> \R

              • Note that we use the (?s) modifier as we need to match a multi-line area. We also use the (?-i) modifier to match </url> and not, for instance, </UrL> or <URL>

            • Now, as we need to match an entire section <url>•••••</url> ONLY IF it contains the string https://my-website.com/stamina-art followed by a number, the first idea is to use the usual regex :
              (?xs-i) <url> .+? https://my-website.com/stamina-art\d+/ .+? </url> \R

              • Almost identical to your own attempt, it does not work as expected : as shown in your picture, its first match spans the first two sections <url>•••••</url> ! Indeed, logically, the first part of the regex ( (?s-i)<url>.+?https://my-website.com/stamina-art\d+/ ) matches from the first string <url> found to the nearest string matching https://my-website.com/stamina-art\d+/, and that string lies in the second section.

            • As we see that this range of chars crosses the </url> ending tag, a possible approach is to verify, at each position between <url> and the address, that no </url> string begins there, with the regex :
              (?xs-i) <url> ( (?!</url>) .)+? https://my-website.com/stamina-art\d+/ .+? </url>\R

              • Bingo ! This time, it selects only the <url>•••••</url> sections 2 and 3, which contain the https://my-website.com/stamina-art string

            • However, if we notice that every URL address comes right after the <url> opening tag and the <loc> tag, a simpler regex is possible :
              (?xs-i) <url> \s+ <loc> https://my-website.com/stamina-art\d+/ .+? </url> \R

              • Indeed, between the <url> and https://my-website.com/stamina-art parts, we just have the regex \s+<loc>, which is restrictive enough to prevent a wrongly long match !

              • And the regex part .+?, with the lazy quantifier +?, will match the smallest range of chars after the address, up to the first </url> ending tag found
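
The difference between the three attempts above can also be checked outside Notepad++. The following is a minimal Python sketch and not part of the original answer. It assumes Python's re module instead of the N++ Boost engine, so \R is approximated by \r?\n and (?-i) is dropped because re is case-sensitive by default. The shortened sample and the names SAMPLE, NAIVE, TEMPERED, ANCHORED and matched_pages are made up for illustration.

```python
import re

# Shortened version of the sample from the question (illustrative only).
SAMPLE = """\
<url>
  <loc>https://my-website.com/en/wild-one.html</loc>
  <priority>0.6400</priority>
</url>
<url>
  <loc>https://my-website.com/stamina-art60/en/wild-two.html</loc>
  <priority>0.6400</priority>
</url>
<url>
  <loc>https://my-website.com/stamina-art20/en/wild-three.html</loc>
  <priority>0.6400</priority>
</url>
<url>
  <loc>https://my-website.com/en/wild-four.html</loc>
  <priority>0.6400</priority>
</url>
"""

# Step 2: lazy match from <url> to the nearest stamina-art address.
NAIVE = r"(?xs) <url> .+? https://my-website.com/stamina-art\d+/ .+? </url> \r?\n"

# Step 3: same, but no </url> may start between <url> and the address.
TEMPERED = r"(?xs) <url> ( (?!</url>) . )+? https://my-website.com/stamina-art\d+/ .+? </url> \r?\n"

# Step 4: the address must come right after <url> and <loc>.
ANCHORED = r"(?xs) <url> \s+ <loc> https://my-website.com/stamina-art\d+/ .+? </url> \r?\n"


def matched_pages(pattern: str) -> list[list[str]]:
    """For each match, list the page names (wild-xxx) it swallowed."""
    return [re.findall(r"wild-\w+", m.group(0)) for m in re.finditer(pattern, SAMPLE)]


print(matched_pages(NAIVE))     # [['wild-one', 'wild-two'], ['wild-three']]
print(matched_pages(TEMPERED))  # [['wild-two'], ['wild-three']]
print(matched_pages(ANCHORED))  # [['wild-two'], ['wild-three']]
```

The first line of output shows the overlong match: the wanted wild-one section is swallowed together with wild-two.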


            So my final solution is :

            SEARCH (?xs-i) <url> \s+ <loc> https://my-website.com/stamina-art\d+/ .+? </url> \R

            REPLACE Leave EMPTY
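
For anyone who wants to run the same search/replace from a script, here is a hedged Python sketch of the final solution. It is an adaptation and not the N++ regex itself: it assumes Python's re module, replaces \R with \r?\n, and omits (?-i) since re is case-sensitive unless told otherwise. The names STAMINA_SECTION, remove_stamina_sections and SAMPLE are hypothetical.

```python
import re

# Port of the final SEARCH expression; flags replace the inline (?xs).
STAMINA_SECTION = re.compile(
    r"""
    <url> \s+ <loc>                          # <loc> directly follows <url>
    https://my-website.com/stamina-art\d+/   # address of an unwanted section
    .+? </url>                               # shortest run up to the closing tag
    \r?\n                                    # the line break after </url>
    """,
    re.VERBOSE | re.DOTALL,
)


def remove_stamina_sections(xml_text: str) -> str:
    """Delete every <url>...</url> section whose <loc> is a stamina-art link."""
    return STAMINA_SECTION.sub("", xml_text)


SAMPLE = (
    "<url>\n  <loc>https://my-website.com/en/wild-one.html</loc>\n</url>\n"
    "<url>\n  <loc>https://my-website.com/stamina-art60/en/wild-two.html</loc>\n</url>\n"
    "<url>\n  <loc>https://my-website.com/en/wild-four.html</loc>\n</url>\n"
)

# Prints the wild-one and wild-four sections only.
print(remove_stamina_sections(SAMPLE))
```

Replacing with an empty string mirrors the "REPLACE Leave EMPTY" setting of the N++ dialog.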

            Hope this helps !

            Best Regards,

            guy038

            • Robin Cruise @guy038

              thank you @guy038

              • Robin Cruise

                @guy038 said in Regex: Select and delete the content of tags from xml file with skiping other tags:

                (?xs-i)

                @guy038 What exactly does (?xs-i) do in the regex example? Isn’t it the same as (?s)?

                • Alan Kilborn @Robin Cruise

                  @Robin-Cruise

                  I forgot to mention the .matches newline

                  Ah, okay. That’s an important part of it.


                  What exactly does (?xs-i) do in the regex example? Isn’t it the same as (?s)?

                  It isn’t the same, but (?xs-i) will do the (?s) functionality as well as other things.

                  It works like this:

                  (?turnOnOptions-turnOffOptions)

                  The options of value are:

                  x : free-spacing mode (embedded spaces are ignored)
                  s : single-line mode (bad name), turning on is equivalent to . matches newline ticked, turning off is same as unticked
                  i : ignore case

                  If you turn “on” ignore case, then case will be insignificant (both a and A will match if you use a in your expression).
                  If you turn “off” ignore case, then you are saying case matters (a is different from A)
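
The effect of each option can be tried in isolation. The following is a small sketch under stated assumptions: it uses Python's re module and not the N++ Boost engine, and the helper matches and the string TEXT are invented for this example. To my knowledge, re only accepts the on-off form when it is scoped to a group, (?on-off:...), so that is the form used for the last case.

```python
import re


def matches(pattern: str, text: str, flags: int = 0) -> bool:
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text, flags) is not None


TEXT = "<url>\nsome content\n</url>"

# s : "." also matches line breaks (same idea as ". matches newline" ticked)
print(matches(r"<url>.+</url>", TEXT))      # False
print(matches(r"(?s)<url>.+</url>", TEXT))  # True

# x : unescaped spaces inside the pattern are ignored
print(matches(r"(?xs) <url> .+ </url>", TEXT))  # True

# i : ignore case
print(matches(r"(?s)<URL>.+</URL>", TEXT))   # False
print(matches(r"(?si)<URL>.+</URL>", TEXT))  # True

# on-off form: s turned on, i turned off inside the group,
# even though ignore case is requested for the whole search
print(matches(r"(?s-i:<URL>.+</URL>)", TEXT, re.IGNORECASE))  # False
```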

                  • guy038

                    @robin-cruise,

                    In my previous post, I said, at the beginning :

                    Note that I’ll use the free-spacing mode (?x) for better readability

                    And, two lines after I said :

                    • Note that we use the (?s) modifier as we need to match a multi-line area. We also use the (?-i) modifier to match </url> and not, for instance, </UrL> or <URL>

                    Now, as always, everything is in the fucking manual ! So, first, go to the official N++ documentation :

                    https://npp-user-manual.org/docs/searching/#search-modifiers

                    And you may have a look at these two HTML pages, below, on the regular-expressions.info site :

                    https://www.regular-expressions.info/modifiers.html

                    https://www.regular-expressions.info/freespacing.html

                    Note that the N++ Boost regex engine just handles the four modifiers i, s, x and m and their negative counterparts -i, -s, -x and -m !


                    Enjoy your reading !

                    Afterwards, you should be able to understand the (?xs-i) group of in-line modifiers, which is definitely different from the shorter (?s) syntax !

                    BR

                    guy038

                    • Alan Kilborn @guy038

                      @guy038 said in Regex: Select and delete the content of tags from xml file with skiping other tags:

                      Now, as always, everything is in the fucking manual ! So, first, go to the official N++ documentation

                      Whoa, whoa, WHOA!
                      Could it be that the ever-patient @guy038 is getting tired of all the repetitive regex stuff here?? Say it isn’t so.
                      I get tired of it as well…so tired…but I haven’t yet been driven to use the alternate definition of RTFM (the normal definition being Read That Fine Manual).

                      Seriously, though, when I look at that section of the manual, it reads more like a formula than a specific application (and that’s ok).

                      In the manual we have:

                      (?enable-disable)

                      which, well, isn’t exactly crystal clear for those that don’t already know what it means.

                      So my attempt, in explaining the (?.. : …) construct was to help clarify.
                      I don’t know if I succeeded.
