    Regex: How to find a duplicate tag on consecutive lines?

    • Robin Cruise

      Hi, in the code below you will see that the closing tag is repeated on two consecutive lines, which is a problem:

        </url>
       </url>
      

      The code:

      <url>
        <loc>https://my-website.com/en/example-love.html</loc>
        <lastmod>2018-11-30T17:19:37+00:00</lastmod>
       <changefreq>weekly</changefreq>
        <priority>0.6400</priority>
      </url>
        </url>
        <url>
        <loc>https://my-website.com/en/my-cat-is-here.html</loc>
        <lastmod>2018-11-30T17:19:37+00:00</lastmod>
       <changefreq>weekly</changefreq>
        <priority>0.6400</priority>
      </url>
      

      I want to find the rss.xml files that contain a duplicate </url> on consecutive lines.

      I don’t know why my regex is not working:

      FIND: (?-i:</url>|(?!\A)\G)(?s-i:(?!<url>).)*?\K(?-i:(?!</url>))
      Replace by: (Leave empty)

      • guy038

        Hi, @robin-cruise,

        The correct regex is :

        (?-i:</url>|(?!\A)\G)(?s-i:(?!<url>).)*?\K(?-i:</url>)

        Indeed, as your search regex ends with \K(?-i:(?!</url>)), you are looking for an empty string that is not followed by </url> with this exact case. Not what you expected !
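The difference between the two endings can be checked outside Notepad++. What follows is a minimal sketch using Python's `re` module, which is an assumption of this note rather than something used in the thread. Python's `re` has no `\K`, so only the final part of each regex is compared, and the `sample` string is made up for illustration:

```python
import re

# Hypothetical two-line sample with a duplicated closing tag.
sample = "</url>\n</url>"

# Ending of the original attempt: a negative lookahead consumes nothing,
# so the match is a zero-length position that is NOT followed by </url>.
empty = re.search(r"(?!</url>)", sample)

# Ending of the corrected regex: the literal closing tag itself is matched.
tag = re.search(r"</url>", sample)

print(repr(empty.group(0)), empty.span())
print(repr(tag.group(0)), tag.span())
```

The lookahead version reports an empty match, so there is nothing for an empty replacement to delete.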


        Now, why do you bother to use this complicated regex syntax ?

        Simply use this shorter regex S/R :

        SEARCH (</url>)\s+\K\1

        REPLACE Leave EMPTY
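For readers who want to try the same idea in a script, here is a rough Python sketch. It is an adaptation, not the Notepad++ regex itself: Python's `re` does not support `\K`, so the first tag and the whitespace are captured and written back instead. The function names and the `sample` string are hypothetical.

```python
import re

# Case-sensitive by default. The whitespace is captured together with the
# first tag so that only the duplicate tag is dropped, which is the effect
# \K has in the Notepad++ version.
DUPLICATE_CLOSE = re.compile(r"(</url>\s+)</url>")

def has_duplicate_close(xml_text: str) -> bool:
    """Return True if a </url> is followed, after whitespace, by another </url>."""
    return DUPLICATE_CLOSE.search(xml_text) is not None

def drop_duplicate_close(xml_text: str) -> str:
    """Remove the second </url> of each duplicated pair, keeping the whitespace."""
    return DUPLICATE_CLOSE.sub(r"\1", xml_text)

sample = (
    "<url>\n"
    "  <loc>https://my-website.com/en/example-love.html</loc>\n"
    "</url>\n"
    "  </url>\n"
    "  <url>\n"
    "  <loc>https://my-website.com/en/my-cat-is-here.html</loc>\n"
    "</url>\n"
)

cleaned = drop_duplicate_close(sample)
print(has_duplicate_close(sample), has_duplicate_close(cleaned))
```

As in Notepad++, only the duplicate tag is removed; the whitespace around it stays in place.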

        Best Regards,

        guy038

        • Alan Kilborn @guy038

          @guy038 said in Regex: How to find a duplicate tag on consecutive lines?:

          Now, why do you bother to use this complicated regex syntax ?

          I suspect the reason for that is the OP just pasted some regex because our policy to get (repeat) help here is to show-us-what-you-tried-first.

          And currently, OP is our champion of write-my-regex-for-me.

          So it has been paste-anything-because-that-ticks-the-box-and-now-I-will-get-the-help-I-need.

          • guy038

            Hello, @alan-kilborn and All,

            Alan, I’m not fooled and I know for a fact that some people are just showing us a regular expression, loosely deduced from a previous regex, already posted on the forum, serving as an alibi for a supposed search !

            But, well: I had the opportunity, in a single short reply, both to correct the OP’s version and to give the solution to Robin’s question, correctly expressed !

            BR

            guy038

            • Robin Cruise @guy038

              @guy038 said in Regex: How to find a duplicate tag on consecutive lines?:

              (</url>)\s+\K\1

              Hello @guy038. If I run this search and replace, I am left with an empty line between the tags, such as:

              </url>
                
                <url>
              

              How can I modify the two regexes you wrote so that no empty line is left after the replacement?

              • guy038

                Hi, @robin-cruise,

                Yes, I made two little mistakes but you could have found the problem by yourself !

                • First, I forgot the (?-i) syntax at the very beginning of the regex, which ensures that the whole process is case sensitive !

                • Secondly, my previous regex (</url>)\s+\K\1 first searches for the </url> string, according to the match case option, then some regex space class characters ( so any non-empty run of space, tab, and/or End of Line chars ), then the </url> string again, according to the match case option, too. But I forgot to search for the line-break chars after the second </url>, which can be achieved with the \R syntax

                Thus, the exact regex to use is (?-i)(</url>)\s+\K\1\R and, of course, click on the Replace All button, only ( not on the Replace one, due to the presence of the \K construction )
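The effect of also consuming the line break can be sketched in Python. This is an approximation under stated assumptions: Python's `re` has neither `\K` nor `\R`, so a capture group written back with `\1` plays the role of `\K`, and `(?:\r\n|\n|\r)` stands in for `\R` for the common line endings only. The function name and `sample` are hypothetical.

```python
import re

# (</url>\s+) keeps the first tag and the whitespace after it (the role of \K);
# (?:\r\n|\n|\r) is a stand-in for \R, which Python's re does not provide.
DUPLICATE_CLOSE_LINE = re.compile(r"(</url>\s+)</url>(?:\r\n|\n|\r)")

def drop_duplicate_close_line(xml_text: str) -> str:
    """Remove the duplicate </url> together with the line break that follows it."""
    return DUPLICATE_CLOSE_LINE.sub(r"\1", xml_text)

sample = "<url>\n</url>\n  </url>\n  <url>\n</url>\n"
cleaned = drop_duplicate_close_line(sample)
print(cleaned)
```

No blank line remains, although the indentation that preceded the removed tag is now joined to the indentation of the next `<url>` line.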

                Cheers,

                guy038

                • Robin Cruise

                  @guy038 said in Regex: How to find a duplicate tag on consecutive lines?:

                  (?-i)(</url>)\s+\K\1\R

                  Thanks. But can you tell me when I can use \1 in the FIND field? I usually use it in the Replace field, not in FIND.

                  OK, I know that \1 refers to the first group, (</url>). But if I run a regex search such as (<loc>).*\1, it doesn’t work; it seems that Notepad++ cannot find the text.

                  I also tried (<loc>).*|\1\R

                  So, when exactly can I use \1 or \2 in the FIND field?

                  • guy038

                    Hello, @Robin-Cruise and All,

                    Ah, interesting point !

                    Let’s work against this simple line, pasted in a new tab :

                    ----abc12345abc----def67890def----abc12345def----def67890abc----
                    
                    • The regex (?-i)(abc)\d+\1|(def)\d+\2 correctly matches the strings abc12345abc and def67890def :

                      • The group 1 is the string abc, re-used in the search regex as the back-reference \1, in its first alternative (abc)\d+\1

                      • The group 2 is the string def, re-used in the search regex as the back-reference \2, in its second alternative (def)\d+\2

                      • Each back-reference, \1 and \2, refers to a group defined inside the same alternative, so it is always set when that alternative is tried

                    • The almost identical regex (?-i)(abc)\d+\1|(def)\d+\1 just matches the string abc12345abc ! Why ?

                      • The group 1 is the string abc, re-used in the search regex as the back-reference \1, in its first alternative (abc)\d+\1

                      • The group 2 is the string def, not re-used in this search regex

                      • When trying the second alternative (def)\d+\1, the regex engine has no value for the back-reference \1, because group 1 ( the string abc ) is only set by the first alternative. Thus, this second alternative will never match !

                    • However, note that :

                      • The regex (?i)(def)\d+\1, used on its own, matches the string def67890def as expected

                      • The regex (?i)(def)\d+\2, used on its own, outputs the message Invalid regular expression, as no group 2 is defined in this regex

                    • Now, let’s use the \K syntax in the regex (?-i)(abc)\d+\1|(def)\d+\2.+\Kabc\d+\2

                      • The group 1 is the string abc, re-used in the search regex as the back-reference \1, in its first alternative (abc)\d+\1

                      • The group 2 is the string def, re-used in the search regex as the back-reference \2, in its second alternative (def)\d+\2.+\Kabc\d+\2

                      • The first alternative, (abc)\d+\1, as above, matches the string abc12345abc

                      • The second alternative, (def)\d+\2.+\Kabc\d+\2 first matches the def67890def string and defines the group 2 ( def ) then, further on, after reset of search by the \K feature, matches the string abc12345def because the groups defined, before \K, are still defined after \K !

                    • Finally, let’s consider the regex (?-i)(abc)\d+\1|(def)\d+\2|abc\d+\2|def\d+\1, composed of 4 alternatives

                      • Only the first two alternatives match, successively, the two strings abc12345abc and def67890def

                      • In the last two alternatives, the back-references \2 and \1 refer to groups that are not set when the third / fourth alternative is executed !

                    • Now, if, instead of the back-references \1 and \2, we use the subroutine calls (?1) and (?2), which stand for the exact regexes in groups 1 and 2 ( so the strings abc and def ), we get the regex (?-i)(abc)\d+\1|(def)\d+\2|abc\d+(?2)|def\d+(?1) which does match the four strings : abc12345abc, def67890def, abc12345def and def67890abc

                      • As you can see, subroutine calls (?n), where n is the group number, are resolved by the regex engine before executing any alternative and, thus, are defined in any part of the overall regex where they occur. You may even create a regex containing a subroutine call to a group which comes later ! For instance, the regex (?1):(?1):(\d\d) would match the string 03:05:45 or 11:52:17. Just compare with the regex \1:\1:(\d\d), which is totally invalid !
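The back-reference cases above can be reproduced with Python's `re` module, offered here as a rough cross-check rather than a statement about Notepad++'s Boost engine. Two assumptions apply: Python's `re` has no `\K` and no `(?1)` subroutine calls, so the `\K` case is skipped and the subroutine calls are emulated by repeating the sub-patterns by hand. The helper `matches` is hypothetical.

```python
import re

LINE = "----abc12345abc----def67890def----abc12345def----def67890abc----"

def matches(pattern: str) -> list:
    """Return the full text of every match of `pattern` in LINE."""
    return [m.group(0) for m in re.finditer(pattern, LINE)]

# Each back-reference sits in the alternative that defines its group.
both_defined = matches(r"(abc)\d+\1|(def)\d+\2")

# \1 in the second alternative refers to a group that takes no part in
# that alternative, so the alternative cannot succeed.
one_unset = matches(r"(abc)\d+\1|(def)\d+\1")

# Stand-in for (?2) and (?1): the sub-patterns def and abc are repeated
# by hand, since Python's re has no subroutine calls.
emulated_calls = matches(r"(abc)\d+\1|(def)\d+\2|abc\d+(?:def)|def\d+(?:abc)")

# A back-reference to a group that does not exist is rejected at compile time.
try:
    re.compile(r"(def)\d+\2")
    missing_group_error = None
except re.error as exc:
    missing_group_error = str(exc)

print(both_defined, one_unset, emulated_calls, missing_group_error, sep="\n")
```

Writing the sub-pattern twice is only practical for short groups; engines with real subroutine calls avoid that duplication.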

                    Back to your questions, @robin-cruise, you said :

                    But if I make a regex search, something like (<loc>).*\1 it doesn’t work, seems that notepad++ cannot find the text.

                    Sorry, but it does work !

                    Try the regex (?s)(<loc>).*\1 against this text :

                    <loc>
                    bla
                    bla
                    blah
                    <loc>
                    

                    As for the (<loc>).*|\1\R regex, obviously, only the first alternative can match. Indeed, the second alternative \1\R contains a back-reference \1 to group 1, which is not defined in THIS alternative !

                    Finally, your (<loc>).*|\1\R regex is simply equivalent to one of these two forms, depending on whether the . matches newline option is ticked :

                    • (?s)(<loc>).*

                    • (?-s)(<loc>).*
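Both points can be checked with a short Python sketch, given here as an illustration under assumptions rather than as Notepad++ behaviour: in Python's `re` the dot never matches a line break unless `(?s)` is given, and `(?:\r\n|\n|\r)` is used as a stand-in for `\R`, which Python's `re` does not support.

```python
import re

TEXT = "<loc>\nbla\nbla\nblah\n<loc>\n"

# Without (?s), "." stops at line breaks, so the second <loc> is out of reach.
single_line = re.search(r"(<loc>).*\1", TEXT)

# With (?s), "." also matches line breaks and the back-reference is found.
dot_all = re.search(r"(?s)(<loc>).*\1", TEXT)

# The second alternative can never match, because group 1 is unset whenever
# it is tried, so the alternation behaves like its first alternative alone.
with_dead_branch = [
    m.group(0) for m in re.finditer(r"(<loc>).*|\1(?:\r\n|\n|\r)", TEXT)
]
first_only = [m.group(0) for m in re.finditer(r"(<loc>).*", TEXT)]

print(single_line, repr(dot_all.group(0)), with_dead_branch == first_only)
```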

                    Best Regards,

                    guy038

                    • Robin Cruise @guy038

                      @guy038 great answer, thanks !

                      The Community of users of the Notepad++ text editor.