    replace words between tags with regular expression

    • Robin Cruise

      Good day. I have this HTML code:

      <My Tag>
      	<div class="searchField">
                  <div align="right">
      
                    <a href="/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
      
      <My Tag>
      

      I want to replace all <a href="/ with <a href="https://link.ca/ between <My Tag><My Tag>

      My solution is below, but it is not very good:

      SEARCH: <My Tag>(?s)(.*)(<a href="/).*?(?s)<My Tag>

      REPLACE BY: <My Tag>\1<a href="https://link.ca/\2<My Tag>

      • carypt

        I see that your search expression does not account for the EOL characters (“carriage return” and “new line”), which are non-printable. You can display them via the menu View > Show Symbol > Show All Characters.

        I can get the first part, up to <a href="/, to match, so my not-so-clever search would be: (<My Tag>\r\n)(.*\r\n)(.*\r\n)(\r\n)(.*)(<a href="/) with the replacement: $1$2$3$4$5<a href="https://link.ca/ . It could be wrong, so use it with care.

        • Alan Kilborn @Robin Cruise

          @Robin-Cruise

          If your solution works, is there reason for concern about it?
          Or would you just like someone to comment on how it could be better?

          @carypt

          The issue has nothing to do with “end-of-lines”, except that in your proposed solution, you turned it into something that involved end-of-lines. Probably best to refrain from offering solutions if your solution is going to be off-track, or even more complicated than prior ones proposed.

          • carypt

            @Alan-Kilborn I don't know, Alan. In the regex tester plugin I couldn't get Robin's search expression to match, so I varied it until it worked. But it looks poor.

            • guy038

              Hello, @robin-cruise, @alan-kilborn, @carypt and All

              First, thanks for trying to find a regex solution by yourself !

              Now, let’s start with this sample, which contains two sections <My Tag>.....<My Tag> and one section <My Old Tag>...........<My Old Tag>

              <My Tag>
              	<div class="searchField">
                          <div align="right">
              
                            <a href="/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
              <My Tag>
              
              <My Old Tag>
              	<div class="searchField">
                          <div align="right">
              
                            <a href="/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
              <My Old Tag>
              
              <My Tag>
              	<div class="searchField">
                          <div align="right">
              
                            <a href="/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
              <My Tag>
              

              BTW, shouldn't the closing tags be </My Tag> and </My Old Tag> ? But this does not matter for the rest of this post !

              Your search regex <My Tag>(?s)(.*)(<a href="/).*?(?s)<My Tag> does not work as expected for several reasons :

              • First, it matches the complete range from the first <My Tag> string to the last <My Tag> string, so all 3 sections together, instead of one section only. To correct this behaviour, just add a ? right after the first .*, in order to search for the smallest range of chars instead of the greatest !

              => <My Tag>(?s)(.*?)(<a href="/).*?(?s)<My Tag>

              • Secondly, I suppose that you wrongly defined your group 2. Indeed, I think that it’s the range between <a href="/ and the closing tag <My Tag> which should be stored as group 2. Note also that the second (?s) modifier is useless, as it is already set ! And it is better to place the first (?s) at the beginning of the regex, for readability !

              So, your regex S/R is, now :

              SEARCH (?s)<My Tag>(.*?)<a href="/(.*?)<My Tag>

              REPLACE <My Tag>\1<a href="https://link.ca/\2<My Tag>

              If we run this regex S/R against our sample text, we notice that only the <a href="/website-1.html"> string of each good section <My Tag>......<My Tag> is changed. And…, after several clicks on the Replace All button, all the links of these sections are replaced, but if you keep on clicking, then the parts <a href="/website...."> of the wrong <My Old Tag>.......<My Old Tag> section are also modified :-((
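This multi-pass behaviour can be reproduced outside Notepad++. Below is a minimal sketch that uses Python's `re` module as a stand-in for the regex engine of Notepad++ (an assumption: for this particular pattern, both should behave alike). The shortened sample text and the names `text`, `pattern`, `replacement` and `passes` are made up for the illustration. Each `sub()` call plays the role of one click on Replace All.

```python
import re

# Shortened, hypothetical sample: good section, old section, good section
text = (
    '<My Tag>\n'
    '  <a href="/1.html">1</a> <a href="/2.html">2</a>\n'
    '<My Tag>\n'
    '<My Old Tag>\n'
    '  <a href="/old.html">old</a>\n'
    '<My Old Tag>\n'
    '<My Tag>\n'
    '  <a href="/3.html">3</a>\n'
    '<My Tag>\n'
)

pattern = re.compile(r'(?s)<My Tag>(.*?)<a href="/(.*?)<My Tag>')
replacement = r'<My Tag>\1<a href="https://link.ca/\2<My Tag>'

# passes[0] is the original text, passes[n] the text after n "Replace All" clicks
passes = [text]
for _ in range(3):
    passes.append(pattern.sub(replacement, passes[-1]))

for number, result in enumerate(passes[1:], start=1):
    print(f"pass {number}: old section modified:",
          'https://link.ca/old.html' in result)
```

With this sample, the first pass changes only one link per good section, the second pass finishes the good sections, and the third pass reaches into the <My Old Tag> section, because nothing in the pattern stops the first `.*?` from crossing a section boundary.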


              So we cannot go on this way ! Globally, the correct scheme is :

              • To search, first, for a <My Tag> string

              • To catch any range of any characters till the nearest string <a href="/

              • Do the appropriate replacement

              • Re-start the search, immediately from the next character, with the \G assertion and…

              • Search, again, for any range of any characters till the nearest string <a href="/

              • Do the appropriate replacement

              And so on…

              However, so that no replacement occurs inside a wrong section such as <My Old Tag>.......<My Old Tag>, we must add one condition : no < symbol at the beginning of a line may occur anywhere in the range of chars which precedes the string <a href="/ ! This can be achieved with the regex ((?!^<).)+?
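The effect of this guard can be checked in isolation. Here is a small sketch with Python's `re` module (an assumption: it is used only as a stand-in for the Notepad++ engine; `(?m)` is added so that `^` matches at every line start, which the regex of this post relies on). The sample text and the names `unguarded` and `guarded` are hypothetical.

```python
import re

# Hypothetical sample: the good section has no "/..." link left to change
text = (
    '<My Tag>\n'
    '  <a href="https://link.ca/1.html">1</a>\n'
    '<My Tag>\n'
    '<My Old Tag>\n'
    '  <a href="/old.html">old</a>\n'
    '<My Old Tag>\n'
)

# Plain lazy dot: free to run across section boundaries
unguarded = re.compile(r'(?s)<My Tag>.+?<a href="/')

# Guarded dot: refuses to consume a < located at the beginning of a line
guarded = re.compile(r'(?ms)<My Tag>(?:(?!^<).)+?<a href="/')

print(unguarded.search(text) is not None)  # True: it ran on into the old section
print(guarded.search(text) is not None)    # False: stopped at the line starting with <
```

Note that the guard only works because, as in the sample of this post, the lines inside a section are indented and the section markers are the only lines beginning with <.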

              Now, the initial string to change is <a href="/ and the final string should be <a href="https://link.ca/

              This is strictly equivalent to saying that the empty location between the " character and the / symbol must be replaced with the string https://link.ca. This empty location can be obtained with the \K syntax
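Python's standard `re` module does not support `\K`, but the same "empty location" idea can be sketched there with a fixed-width look-behind in its place (a swapped-in technique, not the regex of this post). The sample string and the name `empty_location` are made up, and this sketch does not restrict the change to the <My Tag> sections.

```python
import re

sample = '<a href="/fr/website-2.html">fr</a> <a href="website-3.html">en</a>'

# Zero-length match: preceded by <a href=" and followed by /
empty_location = re.compile(r'(?<=<a href=")(?=/)')

# Replacing an empty match amounts to inserting the string at that location
result = empty_location.sub('https://link.ca', sample)
print(result)
```

Only the first link is changed, since the second one has no / right after the opening quote.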

              Finally, my regex S/R, using the free-spacing mode (?x) is :

              SEARCH (?xs) ( <My[ ]Tag> | \G ) ((?!^<).)+? <a[ ]href=" \K (?=/)

              REPLACE https://link.ca

              Note that, because of the \K syntax, you must use the Replace All button exclusively, and not the step-by-step Replace button

              All replacements are done in all the good sections <My Tag>......<My Tag>, giving the expected text :

              <My Tag>
              	<div class="searchField">
                          <div align="right">
              
                            <a href="https://link.ca/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="https://link.ca/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="https://link.ca/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="https://link.ca/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="https://link.ca/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="https://link.ca/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="https://link.ca/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="https://link.ca/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="https://link.ca/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
              <My Tag>
              
              <My Old Tag>
              	<div class="searchField">
                          <div align="right">
              
                            <a href="/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
              <My Old Tag>
              
              <My Tag>
              	<div class="searchField">
                          <div align="right">
              
                            <a href="https://link.ca/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="https://link.ca/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="https://link.ca/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="https://link.ca/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="https://link.ca/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="https://link.ca/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="https://link.ca/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="https://link.ca/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="https://link.ca/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
              <My Tag>
              

              Notes :

              • Without the free-spacing mode the search regex becomes (?s)(<My Tag>|\G)((?!^<).)+?<a href="\K(?=/)

              • The first alternative <My Tag> occurs first, and once only, per section and the second alternative \G all the other times

              • As said above, the part ((?!^<).)+? represents the smallest range of any chars, which does not contain a < symbol at beginning of a line, till… the string <a href="

              • Then, the \K syntax resets the starting point of the reported match, discarding everything matched so far

              • As the remainder of the regex is only the look-ahead (?=/), the final match is just this empty location which, IF followed with a / symbol, is simply replaced with the string https://link.ca
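For readers who want the same result in a script: Python's standard `re` module supports neither `\G` nor `\K`, so the sketch below swaps in a different, two-step technique (match each whole section first, then fix the links inside it with a callback). The function name `absolutize`, its `base` parameter and the shortened sample are assumptions made for the illustration. It also assumes that the <My Tag> markers pair up in order, as in the sample of this post.

```python
import re

# A section runs from one <My Tag> marker to the next one
SECTION = re.compile(r'(?s)<My Tag>.*?<My Tag>')

def absolutize(text, base='https://link.ca'):
    """Prefix root-relative links with base, inside <My Tag> sections only."""
    def fix_section(match):
        # Plain string replacement, limited to the text of one section
        return match.group(0).replace('<a href="/', '<a href="' + base + '/')
    return SECTION.sub(fix_section, text)

sample = (
    '<My Tag>\n'
    '  <a href="/1.html">1</a> <a href="3.html">3</a>\n'
    '<My Tag>\n'
    '<My Old Tag>\n'
    '  <a href="/old.html">old</a>\n'
    '<My Old Tag>\n'
    '<My Tag>\n'
    '  <a href="/2.html">2</a>\n'
    '<My Tag>\n'
)
print(absolutize(sample))
```

A single call changes every root-relative link of the two good sections and leaves both the <My Old Tag> section and the relative link 3.html untouched.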


              You’ll probably notice that the part :

              <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>
              

              has not been changed ! But it’s quite logical, because it begins with <a href="website-3.html"> and not with <a href="/website-3.html"> !

              Best Regards,

              guy038

              • carypt

                Sorry, I must be blind.

                • carypt

                  I see the search expression <My Tag>.*?\K|(<a href="/)* matching all the <a href="/ strings in the Mark dialog, but it does not work in the Replace dialog (it behaves totally differently). Also, @guy038's search expression isn't marking anything in the Mark dialog. Why is that? Is there no way to check the matching without trying blindly?

                  Also, I would say the regex trainer plugin isn't working correctly, so I'm leaving it out. As an excuse for my earlier post: I did not read the original text well and overlooked many search matches.

                  And I want to thank @guy038 for his detailed explanations. :)

                  • Robin Cruise

                    @guy038 said in replace words between tags with regular expression:

                    https://link.ca

                    works fine, thank you @guy038

                    Another case. Suppose instead of <My Tag> … <My Tag> I have a comment such as <!-- BEGIN --> … <!-- BEGIN -->

                    I changed your regex a little bit, but I believe I made a mistake.

                    SEARCH: (?xs) (<\!-- BEGIN --> | \G ) ((?!^<).)+? <a[ ]href=" \K (?=/)

                    REPLACE: https://link.ca

                    what did I do wrong here?

                    • guy038

                      Hi, @robin-cruise and All,

                      Ah…, I immediately understood the problem !

                      When you use the free-spacing mode, with a leading (?x) modifier or with a (?x....) syntax such as (?xs-i) :

                      • Any ordinary space character is ignored and is not part of the overall regex

                      • Any text located after the first # character of a line is considered a comment and is not part of the overall regex either

                      Thus, you must respect two rules :

                      • Any literal space character, to search for, must be represented with one of these three syntaxes, below :

                        • A backslash char \ right before that specific space char

                        • The [ ] syntax, that is to say a space char between square brackets, representing a character class feature

                        • The escape syntaxes \x20 or \x{20} or \x{0020}

                      • Any literal sharp character #, to search for, must be represented with one of the three syntaxes, below :

                        • A backslash char \ right before that specific # char ( => \# )

                        • The [#] syntax, that is to say a sharp char between square brackets, representing a character class feature

                        • The escape syntaxes \x23 or \x{23} or \x{0023}

                      For instance, let’s imagine that you want to match three space chars, surrounded by # characters, with a regex. You have the choice between all these syntaxes :

                      
                      - WITHOUT the FREE-SPACING mode :
                      
                      #   #
                      
                      #\x20{3}#
                      
                      #[ ][ ][ ]#
                      
                      - WITH the FREE-SPACING mode :
                      
                      (?x)  \x23\ \ \ \x23      # ESCAPED SPACE char and HEXADECIMAL ESCAPE of # 
                      
                      (?x)  \#[ ][ ][ ]\#       # ESCAPED SHARP char and SPACE in a CHARACTER CLASS
                      
                      (?x)  [#]\x20\x20{2}[#]   # SHARP char in a CHARACTER CLASS and HEXADECIMAL ESCAPE of SPACE chars
                      ...
                      ...
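
These free-spacing rules can be tried out with Python's `re` module, whose verbose mode `(?x)` treats whitespace and `#` the same way for the cases shown here (an assumption: Python is only a stand-in for the Notepad++ engine). The names `target`, `patterns`, `results` and `naive` are made up for this sketch.

```python
import re

target = "#   #"  # three spaces between two sharp characters

patterns = [
    r"(?x)  \x23\ \ \ \x23      # escaped spaces, hex escape for the sharp char",
    r"(?x)  \#[ ][ ][ ]\#       # escaped sharp char, spaces in character classes",
    r"(?x)  [#]\x20\x20{2}[#]   # sharp char in a class, hex escapes for spaces",
]

# Each free-spacing variant must match the whole target string
results = [bool(re.fullmatch(p, target)) for p in patterns]
print(results)

# Unprotected spaces are dropped in free-spacing mode: this is really \#\#
naive = re.fullmatch(r"(?x) \#   \#", target)
print(naive)
```

The three protected variants all match, whereas the last pattern, with its bare spaces, only describes two adjacent sharp characters and so does not match the target.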
                      

                      Now, let’s go back to your new regex. I suppose that you’ve already guessed the problem ;-)) Yes, this is because of the space characters which surround the word BEGIN ! Note also that the ! char is not a special regex character here, so the backslash before it is unnecessary. So, the correct regex should be expressed as :

                      • (?xs) (<!--\ BEGIN\ --> | \G ) ((?!^<).)+? <a[ ]href=" \K (?=/)

                      Or

                      • (?xs) (<!--[ ]BEGIN[ ]--> | \G ) ((?!^<).)+? <a[ ]href=" \K (?=/)

                      Or

                      • (?xs) (<!--\x20BEGIN\x20--> | \G ) ((?!^<).)+? <a[ ]href=" \K (?=/)

                      And, without the free-spacing mode :

                      • (?s)(<!-- BEGIN -->|\G)((?!^<).)+?<a href="\K(?=/)
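
The difference between the regex of the question and the corrected ones can be seen on the marker part alone. Here is a minimal sketch with Python's `re` module (an assumption: its verbose mode is used as a stand-in for the free-spacing mode of Notepad++; the `\G` and `\K` parts are left out because Python's `re` does not support them). The names `marker`, `broken` and `fixed` are hypothetical.

```python
import re

marker = "<!-- BEGIN -->"

# As written in the question: the bare spaces are ignored in (?x) mode
broken = re.compile(r"(?x) (<\!-- BEGIN -->)")

# Spaces protected by character classes
fixed = re.compile(r"(?x) (<!--[ ]BEGIN[ ]-->)")

print(broken.search(marker))          # None
print(broken.search("<!--BEGIN-->"))  # a match: the regex contains no spaces at all
print(fixed.search(marker))           # a match
```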

                      You may even use this layout, where I’m using the free-spacing mode with numerous comments and non-capturing groups :

                      (?xs-i)                    # FREE-SPACING mode - DOT means ANY char, even EOL chars - Search SENSITIVE to CASE ( NON-INSENSITIVE ! )
                      (?:                        # FIRST NON-CAPTURING group to DEFINE a group of ALTERNATIVES
                        <!--[ ]BEGIN[ ]-->       #   FIRST alternative : the string <!-- BEGIN --> with this EXACT case
                      |                          # The ALTERNATION regex symbol
                        \G                       #   SECOND alternative : the \G assertion which forces that the NEXT match begins RIGHT AFTER the PREVIOUS one
                      )                          # END of the FIRST NON-CAPTURING group
                      (?:                        # SECOND NON-CAPTURING group to define a SINGLE REPEATED char
                        (?!^<).                  #   ANY char, even an EOL char, IF this char is NOT an OPENING ANGLE bracket at BEGINNING of a line
                                                 #    ...That is to say that the regex engine MUST NOT enter a NEW section while doing a MATCH ATTEMPT
                      )+?                        # END of the SECOND NON-CAPTURING group, REPEATED from 1 to MORE, the MINIMUM of times till...
                      <a[ ]href="                # The LITERAL string <a href="
                      \K                         # RESETS the regex engine LOCATION and CANCELS matches, so far => the PRESENT match is ONLY the EMPTY string...
                      (?=/)                      # IF FOLLOWED with a SLASH character
                      

                      Now, @robin-cruise, follow these steps :

                      • Select all the text from (?xs-i) till SLASH character

                      • Open the Replace dialog ( Ctrl + H )

                      • Type in https://link.ca in the Replace with field

                      • Select the Regular expression search mode

                      • Click on the Replace All button

                      Here you are ;-))

                      Cheers,

                      guy038

                      • Robin Cruise

                        thank you !
