
    Regex: How to find a duplicate tag on consecutive lines?

    • Robin Cruise

      Hi, in the code below, a closing tag is repeated on two consecutive lines, which is a problem:

        </url>
       </url>
      

      The code:

      <url>
        <loc>https://my-website.com/en/example-love.html</loc>
        <lastmod>2018-11-30T17:19:37+00:00</lastmod>
       <changefreq>weekly</changefreq>
        <priority>0.6400</priority>
      </url>
        </url>
        <url>
        <loc>https://my-website.com/en/my-cat-is-here.html</loc>
        <lastmod>2018-11-30T17:19:37+00:00</lastmod>
       <changefreq>weekly</changefreq>
        <priority>0.6400</priority>
      </url>
      

      I want to find which rss.xml file contains a duplicate </url> on consecutive lines.

      I don’t know why my regex is not working:

      FIND: (?-i:</url>|(?!\A)\G)(?s-i:(?!<url>).)*?\K(?-i:(?!</url>))
      Replace by: (Leave empty)

      • guy038

        Hi, @robin-cruise,

        The correct regex is :

        (?-i:</url>|(?!\A)\G)(?s-i:(?!<url>).)*?\K(?-i:</url>)

        Indeed, your search regex ends with \K(?-i:(?!</url>)), which means that you’re looking for an empty string that is not followed by </url> (with this exact case). Not what you expected!
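
The difference between the two endings can be reproduced outside Notepad++. Below is a minimal sketch using Python's `re` module as a stand-in for the Boost engine (an assumption of this example; `\K` is left out because `re` does not support it). The `sample` string and variable names are made up for illustration.

```python
import re

# Hypothetical sample: a closing tag duplicated on consecutive lines.
sample = "</url>\n</url>\n<url>"

# A negative lookahead consumes no characters, so the match is an empty string.
lookahead_only = re.search(r"(?!</url>)", sample)

# Ending the pattern with the literal tag matches real text instead.
literal = re.search(r"</url>\s+(</url>)", sample)

print(repr(lookahead_only.group(0)))
print(repr(literal.group(1)))
```

The first search succeeds, but with nothing selected, which is why replacing its match with an empty string changes nothing.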


        Now, why do you bother to use this complicated regex syntax ?

        Simply use this shorter regex S/R :

        SEARCH (</url>)\s+\K\1

        REPLACE Leave EMPTY

        Best Regards,

        guy038

        • Alan Kilborn @guy038

          @guy038 said in Regex: How to find a duplicate tag on consecutive lines?:

          Now, why do you bother to use this complicated regex syntax ?

          I suspect the reason for that is the OP just pasted some regex because our policy to get (repeat) help here is to show-us-what-you-tried-first.

          And currently, OP is our champion of write-my-regex-for-me.

          So it has been paste-anything-because-that-ticks-the-box-and-now-I-will-get-the-help-I-need.

          • guy038

            Hello, @alan-kilborn and All,

            Alan, I’m not fooled: I know for a fact that some people just show us a regular expression, loosely derived from a regex already posted on the forum, as an alibi for a supposed attempt!

            But, well, a short reply gave me the opportunity both to correct the OP’s version and to give a properly expressed solution to Robin’s question!

            BR

            guy038

            • Robin Cruise @guy038

              @guy038 said in Regex: How to find a duplicate tag on consecutive lines?:

              (</url>)\s+\K\1

              Hello @guy038, if I run this search and replace, I get an empty line between the tags, such as:

              </url>
                
                <url>
              

              How can I modify the two regexes you gave so that no empty line is left after the replacement?

              • guy038

                Hi, @robin-cruise,

                Yes, I made two little mistakes but you could have found the problem by yourself !

                • First, I forgot the (?-i) syntax at the very beginning of the regex, which ensures that the whole search is case-sensitive!

                • Secondly, my previous regex (</url>)\s+\K\1 first searches for the </url> string, according to the match case option, then one or more whitespace characters (any non-empty run of space, tab and/or end-of-line characters), then the </url> string again, also according to the match case option. But I forgot to match the line-break characters after the second </url>, which can be done with the \R syntax.

                Thus, the exact regex to use is (?-i)(</url>)\s+\K\1\R and, of course, click on the Replace All button only (not on the Replace button, because of the \K construction).
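
For readers who want to check the effect outside Notepad++, here is a rough Python sketch of the same replacement. Python's `re` module supports neither `\K` nor `\R`, so both are emulated here: the part before `\K` becomes a capture group that is put back, and `\R` becomes an explicit line-break alternation. The `xml` fragment and the names `pattern` and `cleaned` are assumptions made for this illustration.

```python
import re

# Hypothetical fragment modelled on the sitemap in the question.
xml = (
    "  <priority>0.6400</priority>\n"
    "</url>\n"
    "</url>\n"
    "<url>\n"
    "  <loc>https://my-website.com/en/my-cat-is-here.html</loc>\n"
)

# Group 1 keeps the first </url> and the whitespace after it (the part before \K).
# The second </url> and its line break (the part after \K) are dropped.
pattern = re.compile(r"(</url>\s+)</url>(?:\r\n|\n|\r)")
cleaned = pattern.sub(r"\1", xml)

print(cleaned)
```

Because the line break after the duplicate tag is part of the match, no empty line is left behind.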

                Cheers,

                guy038

                • Robin Cruise

                  @guy038 said in Regex: How to find a duplicate tag on consecutive lines?:

                  (?-i)(</url>)\s+\K\1\R

                  Thanks. But can you tell me when I can use \1 in the FIND field? I usually use it in the replace field, not in FIND.

                  ok, I know that \1 refers to the first bracket (</url>). But if I make a regex search, something like (<loc>).*\1 it doesn’t work, seems that notepad++ cannot find the text.

                  I also tried (<loc>).*|\1\R

                  So, when exactly can I use \1 or \2 in the FIND field?

                  • guy038

                    Hello, @Robin-Cruise and All,

                    Ah, interesting point !

                    Let’s work against this simple line, pasted in a new tab :

                    ----abc12345abc----def67890def----abc12345def----def67890abc----
                    
                    • The regex (?-i)(abc)\d+\1|(def)\d+\2 correctly matches the strings abc12345abc and def67890def :

                      • The group 1 is the string abc, re-used in the search regex as the back-reference \1, in its first alternative (abc)\d+\1

                      • The group 2 is the string def, re-used in the search regex as the back-reference \2, in its second alternative (def)\d+\2

                      • Each back-reference, \1 and \2, refers to a group that is defined within its own alternative

                    • The almost identical regex (?-i)(abc)\d+\1|(def)\d+\1 matches only the string abc12345abc ! Why ?

                      • The group 1 is the string abc, re-used in the search regex as the back-reference \1, in its first alternative (abc)\d+\1

                      • The group 2 is the string def, not re-used in this search regex

                      • When trying the second alternative (def)\d+\1, the regex engine does not know the back-reference \1 which refers to group 1 ( the string abc ), defined in the first alternative. Thus, this second alternative will never match !

                    • However, note that :

                      • The regex (?i)(def)\d+\1, only, would normally match the string def67890def

                      • The regex (?i)(def)\d+\2, only, outputs the message Invalid regular expression, as no group 2 is defined in this regex

                    • Now, let’s use the \K syntax in the regex (?-i)(abc)\d+\1|(def)\d+\2.+\Kabc\d+\2

                      • The group 1 is the string abc, re-used in the search regex as the back-reference \1, in its first alternative (abc)\d+\1

                      • The group 2 is the string def, re-used in the search regex as the back-reference \2, in its second alternative (def)\d+\2.+\Kabc\d+\2

                      • The first alternative, (abc)\d+\1, as above, matches the string abc12345abc

                      • The second alternative, (def)\d+\2.+\Kabc\d+\2 first matches the def67890def string and defines the group 2 ( def ) then, further on, after reset of search by the \K feature, matches the string abc12345def because the groups defined, before \K, are still defined after \K !

                    • Finally, let’s consider the regex (?-i)(abc)\d+\1|(def)\d+\2|abc\d+\2|def\d+\1, composed of 4 alternatives

                      • Only the first two alternatives match, successively, the two strings abc12345abc and def67890def

                      • In the last two alternatives, the back-references \2 and \1 refer to a group that is not defined when the third or fourth alternative is executed !

                    • Now, if, instead of the back-references \1 and \2, we use the subroutine calls (?1) and (?2), which stand for the exact regexes in groups 1 and 2 ( so the strings abc and def ), we get the regex (?-i)(abc)\d+\1|(def)\d+\2|abc\d+(?2)|def\d+(?1) which does match the four strings : abc12345abc, def67890def, abc12345def and def67890abc

                      • As you can see, subroutine calls, written (?n) where n is the group number, are resolved by the regex engine before it executes any alternative and, thus, are defined in any part of the overall regex where they occur. You may even create a regex containing a subroutine call to a group which comes later ! For instance, the regex (?1):(?1):(\d\d) would match the string 03:05:45 or 11:52:17. Just compare with the regex \1:\1:(\d\d), which is totally invalid !
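
The alternation cases above can be checked with a short script. This is only a sketch in Python, whose `re` module treats a back-reference to a non-participating group the same way (the alternative fails) but has no `(?1)` subroutine calls and no `\K`. The subroutine case is therefore emulated by reusing the sub-pattern text, which is an assumption of this example and not how Notepad++ does it; the helper `matches` and the names `g1` and `g2` are hypothetical.

```python
import re

line = "----abc12345abc----def67890def----abc12345def----def67890abc----"


def matches(pattern):
    """Return every non-overlapping match of pattern in the test line."""
    return [m.group(0) for m in re.finditer(pattern, line)]


# Each back-reference sits in the same alternative as its group.
both = matches(r"(abc)\d+\1|(def)\d+\2")

# \1 in the second alternative refers to a group that did not take part
# in the current attempt, so that alternative can never match.
first_only = matches(r"(abc)\d+\1|(def)\d+\1")

# Stand-in for (?1) and (?2): the sub-patterns are reused as text.
g1, g2 = "abc", "def"
all_four = matches(rf"({g1})\d+\1|({g2})\d+\2|{g1}\d+(?:{g2})|{g2}\d+(?:{g1})")

print(both)
print(first_only)
print(all_four)
```

The key distinction is that a back-reference repeats the text a group captured, while a subroutine call repeats the group's pattern, so only the latter is usable in an alternative where the group itself never ran.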

                    Back to your questions, @robin-cruise, you said :

                    But if I make a regex search, something like (<loc>).*\1 it doesn’t work, seems that notepad++ cannot find the text.

                    Sorry, but it does work !

                    Try the regex (?s)(<loc>).*\1 against this text :

                    <loc>
                    bla
                    bla
                    blah
                    <loc>
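
The role of (?s) here can be seen in a small Python sketch (Python's `re` is used as a stand-in for the Notepad++ engine, and the variable names are made up for this example). Without the single-line modifier, the dot stops at the first line break, so the back-reference is never reached.

```python
import re

text = "<loc>\nbla\nbla\nblah\n<loc>\n"

# Without (?s), "." does not match line breaks, so \1 cannot be reached.
single_line = re.search(r"(<loc>).*\1", text)

# With (?s), "." also matches line breaks and the back-reference is found.
multi_line = re.search(r"(?s)(<loc>).*\1", text)

print(single_line)
print(repr(multi_line.group(0)))
```

In Notepad++ the same switch is also available through the ". matches newline" option of the Find dialog.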
                    

                    As for the (<loc>).*|\1\R regex, obviously, only the first alternative can match. Indeed, the second alternative \1\R contains a back-reference \1 to group 1, which is not defined in THIS alternative !

                    Finally, your (<loc>).*|\1\R regex is simply equivalent to one of the two forms below, depending on whether the dot is allowed to match line breaks :

                    • (?s)(<loc>).*

                    • (?-s)(<loc>).*

                    Best Regards,

                    guy038

                    • Robin Cruise @guy038

                      @guy038 great answer, thanks !
