
replace words between tags with regular expression

    Robin Cruise
    last edited by Dec 25, 2020, 5:22 PM

    good day. I have this html code:

    <My Tag>
    	<div class="searchField">
                <div align="right">
    
                  <a href="/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
    
    <My Tag>
    

    I want to replace all <a href="/ with <a href="https://link.ca/ between <My Tag><My Tag>

    my solution is , but not very good:

    SEARCH: <My Tag>(?s)(.*)(<a href="/).*?(?s)<My Tag>

    REPLACE BY: <My Tag>\1<a href="https://link.ca/\2<My Tag>

      carypt
      last edited by Dec 25, 2020, 8:09 PM

      i see you dont concern the eol (“carriage return” and “new line”) non printable signs in your search expression .(menu-view-show symbol - show all characters).

      i can get the first part til <a href="/ match , so my not clever search would be : (<My Tag>\r\n)(.*\r\n)(.*\r\n)(\r\n)(.*)(<a href="/) and replace : $1$2$3$4$5<a href="https://link.ca/ could be wrong , use with care

        Alan Kilborn @Robin Cruise
        last edited by Dec 25, 2020, 8:25 PM

        @Robin-Cruise

        If your solution works, is there reason for concern about it?
        Or would you just like someone to comment on how it could be better?

        @carypt

        The issue has nothing to do with “end-of-lines”, except that in your proposed solution, you turned it into something that involved end-of-lines. Probably best to refrain from offering solutions if your solution is going to be off-track, or even more complicated than prior ones proposed.

          carypt
          last edited by Dec 25, 2020, 9:01 PM

          @Alan-Kilborn idk alan , in regex tester plugin i couldnt get robins search phrase match , so i varied it into working . but its looking poor .

            guy038
            last edited by guy038 Dec 26, 2020, 10:04 AM Dec 25, 2020, 9:43 PM

            Hello, @robin-cruise, @alan-kilborn, @carypt and All

            First, thanks for trying to find out a regex solution by yourself !

            Now, let’s start with this sample, which contains two sections <My Tag>.....<My Tag> and one section <My Old Tag>...........<My Old Tag>

            <My Tag>
            	<div class="searchField">
                        <div align="right">
            
                          <a href="/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
            <My Tag>
            
            <My Old Tag>
            	<div class="searchField">
                        <div align="right">
            
                          <a href="/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
            <My Old Tag>
            
            <My Tag>
            	<div class="searchField">
                        <div align="right">
            
                          <a href="/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
            <My Tag>
            

            BTW, shouldn't the closing tags be </My Tag> and </My Old Tag> ? This does not matter for the rest of this post, though !

            Your search regex <My Tag>(?s)(.*)(<a href="/).*?(?s)<My Tag> does not work as expected for several reasons :

            • First, it matches the complete range from the first <My Tag> string to the last <My Tag> string, so all 3 sections together, instead of one section only. To correct this behaviour, just add a ? right after the first .*, in order to search for the smallest range of chars, instead of the greatest !

            => <My Tag>(?s)(.*?)(<a href="/).*?(?s)<My Tag>

            • Secondly, I suppose that you wrongly defined your group 2. Indeed, I think that it’s the range between <a href="/ and the closing tag <My Tag> which should be stored as group 2. Note also that the second (?s) modifier is useless, as this mode is already set ! And it is better to place the first (?s) syntax at the beginning of the regex, for readability !
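
The greedy versus lazy difference described in the first point can also be checked outside Notepad++. The following is only a minimal sketch, assuming Python's built-in re module and a shortened, hypothetical SAMPLE text with the same three-section layout. The (?s) modifier is moved to the front of both patterns, because recent Python versions reject inline global flags in the middle of a pattern.

```python
import re

# Hypothetical, shortened stand-in for the sample text (same section layout).
SAMPLE = (
    '<My Tag>\n'
    '  <a href="/website-1.html">ro</a> <a href="/fr/website-2.html">fr</a>\n'
    '<My Tag>\n'
    '\n'
    '<My Old Tag>\n'
    '  <a href="/website-1.html">ro</a>\n'
    '<My Old Tag>\n'
    '\n'
    '<My Tag>\n'
    '  <a href="/website-1.html">ro</a>\n'
    '<My Tag>\n'
)

# Same regex twice; the only difference is the ? after the first .*
GREEDY = r'(?s)<My Tag>(.*)(<a href="/).*?<My Tag>'
LAZY = r'(?s)<My Tag>(.*?)(<a href="/).*?<My Tag>'

greedy_match = re.search(GREEDY, SAMPLE).group(0)
lazy_match = re.search(LAZY, SAMPLE).group(0)

# Does the matched text run across the <My Old Tag> section?
greedy_crosses_old = '<My Old Tag>' in greedy_match
lazy_crosses_old = '<My Old Tag>' in lazy_match
print(greedy_crosses_old, lazy_crosses_old)
```

In this sketch the greedy match swallows all three sections, while the lazy one stops at the closing tag of the first section.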

            So, your regex S/R is, now :

            SEARCH (?s)<My Tag>(.*?)<a href="/(.*?)<My Tag>

            REPLACE <My Tag>\1<a href="https://link.ca/\2<My Tag>

            If we run this regex S/R against our sample text, we notice that only the first link, <a href="/website-1.html">, of each good section <My Tag>......<My Tag> is changed. After several more clicks on the Replace All button, the remaining links of these sections are replaced too, but if you keep going, the parts <a href="/website...."> of the wrong <My Old Tag>.......<My Old Tag> section are also modified :-((
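
This behaviour can be reproduced with a small sketch. It assumes Python's built-in re module, a shortened, hypothetical SAMPLE text, and that one call to re.subn behaves roughly like one click on Replace All (a single left-to-right pass over non-overlapping matches). The helper old_section is hypothetical, too.

```python
import re

# Hypothetical, shortened stand-in for the sample text.
SAMPLE = (
    '<My Tag>\n'
    '  <a href="/website-1.html">ro</a> <a href="/fr/website-2.html">fr</a>\n'
    '<My Tag>\n'
    '\n'
    '<My Old Tag>\n'
    '  <a href="/website-1.html">ro</a> <a href="/fr/website-2.html">fr</a>\n'
    '<My Old Tag>\n'
    '\n'
    '<My Tag>\n'
    '  <a href="/website-1.html">ro</a> <a href="/fr/website-2.html">fr</a>\n'
    '<My Tag>\n'
)

SEARCH = r'(?s)<My Tag>(.*?)<a href="/(.*?)<My Tag>'
REPLACE = r'<My Tag>\1<a href="https://link.ca/\2<My Tag>'


def old_section(text):
    """Return the text between the two <My Old Tag> markers."""
    return text.split('<My Old Tag>')[1]


# Three successive passes, each one working on the result of the previous one.
pass1, count1 = re.subn(SEARCH, REPLACE, SAMPLE)
pass2, count2 = re.subn(SEARCH, REPLACE, pass1)
pass3, count3 = re.subn(SEARCH, REPLACE, pass2)

print(count1, 'link.ca' in old_section(pass1))
print(count2, 'link.ca' in old_section(pass2))
print(count3, 'link.ca' in old_section(pass3))
```

With this sample, the first two passes each change one link per good section. On the third pass the good sections have no <a href="/ left, so the lazy .*? runs past the closing <My Tag> and reaches a link of the <My Old Tag> section.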


            So we cannot go on this way ! Globally, the correct scheme is :

            • To search, first, for a <My Tag> string

            • To catch any range of any characters till the nearest string <a href="/

            • Do the appropriate replacement

            • Re-start the search, immediately from the next character, with the \G assertion and…

            • Search, again, for any range of any characters till the nearest string <a href="/

            • Do the appropriate replacement

            And so on…

            However, so that no replacement occurs inside a wrong section such as <My Old Tag>.......<My Old Tag>, we must add one condition : no < symbol at the beginning of a line may occur anywhere in the range of chars which is followed with the string <a href="/ ! This can be achieved with the regex ((?!^<).)+?
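
To illustrate what the part (?!^<). allows and forbids, here is a minimal sketch, assuming Python's built-in re module. Two things differ from the Notepad++ regex and are assumptions of this sketch: the (?m) flag is added because Python's ^ only matches at line starts in multiline mode, and a greedy + is used instead of +? simply to show the longest range the condition permits. The text variable is a hypothetical fragment.

```python
import re

# (?m): ^ matches at every line start (needed in Python);
# (?s): the dot also matches line breaks.
TEMPERED = re.compile(r'(?ms)(?:(?!^<).)+')

# Hypothetical fragment: two indented lines, then a line starting with <
text = (
    '  <div class="searchField">\n'
    '  <a href="/website-1.html">ro</a>\n'
    '<My Old Tag>\n'
    '  <a href="/website-1.html">ro</a>\n'
)

# Consume chars from the start until a < sits at the beginning of a line.
run = TEMPERED.match(text).group(0)
print(repr(run))
```

The < of the indented <div ...> and <a ...> tags is consumed, because it is not at the beginning of a line; the run stops right before <My Old Tag>.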

            Now, the initial string to change is <a href="/ and the final string should be <a href="https://link.ca/

            This is strictly equivalent to saying that the empty location between the " character and the / symbol must be replaced with the string https://link.ca. This empty location can be obtained with the \K syntax

            Finally, my regex S/R, using the free-spacing mode (?x) is :

            SEARCH (?xs) ( <My[ ]Tag> | \G ) ((?!^<).)+? <a[ ]href=" \K (?=/)

            REPLACE https://link.ca

            Note that, because of the \K syntax, you must click on the Replace All button, exclusively

            All replacements are done in all the good sections <My Tag>......<My Tag>, giving the expected text :

            <My Tag>
            	<div class="searchField">
                        <div align="right">
            
                          <a href="https://link.ca/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="https://link.ca/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="https://link.ca/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="https://link.ca/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="https://link.ca/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="https://link.ca/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="https://link.ca/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="https://link.ca/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="https://link.ca/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
            <My Tag>
            
            <My Old Tag>
            	<div class="searchField">
                        <div align="right">
            
                          <a href="/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
            <My Old Tag>
            
            <My Tag>
            	<div class="searchField">
                        <div align="right">
            
                          <a href="https://link.ca/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="https://link.ca/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="https://link.ca/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="https://link.ca/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="https://link.ca/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="https://link.ca/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="https://link.ca/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="https://link.ca/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="https://link.ca/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
            <My Tag>
            

            Notes :

            • Without the free-spacing mode the search regex becomes (?s)(<My Tag>|\G)((?!^<).)+?<a href="\K(?=/)

            • The first alternative <My Tag> occurs first, and once only, per section and the second alternative \G all the other times

            • As said above, the part ((?!^<).)+? represents the smallest range of any chars, which does not contain a < symbol at beginning of a line, till… the string <a href="

            • Then, the \K syntax resets the starting point of the match to the current location, so everything matched so far is dropped from the final match

            • As the rest of the regex is only the look-ahead (?=/), the final match is just this empty location and, IF it is followed with a / symbol, the replacement string https://link.ca is simply inserted there


            You'll probably notice that the part :

            <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>
            

            has not been changed ! But it’s quite logical, because it begins with <a href="website-3.html"> and not with <a href="/website-3.html"> !
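
For readers who want the same result in a script : Python's built-in re module supports neither \G nor \K, so the sketch below swaps in a different, two-step technique (first isolate each <My Tag>......<My Tag> section, then fix the links inside it). It is not the regex above, only an approximation under these assumptions : the tags are well paired, and the names SECTION, ROOT_HREF, prefix_root_links and the shortened SAMPLE are hypothetical.

```python
import re

# Step 1: one <My Tag> ... <My Tag> section (opening tag, body, closing tag).
SECTION = re.compile(r'(?s)(<My Tag>)(.*?)(<My Tag>)')
# Step 2: an <a href=" that is immediately followed by a / symbol.
ROOT_HREF = re.compile(r'(<a href=")(?=/)')


def prefix_root_links(text, base='https://link.ca'):
    """Insert base before root-relative hrefs, only inside <My Tag> sections."""
    def fix_section(section):
        body = ROOT_HREF.sub(lambda m: m.group(1) + base, section.group(2))
        return section.group(1) + body + section.group(3)
    return SECTION.sub(fix_section, text)


# Hypothetical, shortened stand-in for the sample text.
SAMPLE = (
    '<My Tag>\n'
    '  <a href="/website-1.html">ro</a> <a href="/fr/website-2.html">fr</a>'
    ' <a href="website-3.html">en</a>\n'
    '<My Tag>\n'
    '\n'
    '<My Old Tag>\n'
    '  <a href="/website-1.html">ro</a> <a href="/fr/website-2.html">fr</a>\n'
    '<My Old Tag>\n'
    '\n'
    '<My Tag>\n'
    '  <a href="/website-1.html">ro</a> <a href="/fr/website-2.html">fr</a>\n'
    '<My Tag>\n'
)

result = prefix_root_links(SAMPLE)
print(result)
```

Because <My Old Tag> never matches the literal <My Tag>, that section is skipped, and the website-3.html link stays as it is since its href does not begin with a / symbol.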

            Best Regards,

            guy038

              carypt
              last edited by Dec 25, 2020, 10:55 PM

              sry , must be blind .

                carypt
                last edited by carypt Dec 26, 2020, 6:59 AM Dec 26, 2020, 6:57 AM

                i see the search phrase : <My Tag>.*?\K|(<a href="/)* matching all the <a href="/ in mark-window , but does not work in search/replace-window (giving totally different behavior) , also @guy038 search phrase isnt marking any in mark-window . why is it ? is there no way to control the matching without trying blindly ?

                also i would say the regex trainer plugin isnt working correctly , so i leave it out . to my excuse from before , i wasnt reading well the original text , overread many search matches .

                aaand i want to thank @guy038 for his detailed explainings . )

                  Robin Cruise
                  last edited by Robin Cruise Dec 26, 2020, 9:40 AM Dec 26, 2020, 9:40 AM

                  @guy038 said in replace words between tags with regular expression:

                  https://link.ca

                  works fine, thank you @guy038

                  Another case. Suppose instead of <My Tag> … <My Tag> I have a comment such as <!-- BEGIN --> … <!-- BEGIN -->

                  I change a little bit your regex, but I believe I made a mistake.

                  SEARCH: (?xs) (<\!-- BEGIN --> | \G ) ((?!^<).)+? <a[ ]href=" \K (?=/)

                  REPLACE: https://link.ca

                  what did I do wrong here?

                    guy038
                    last edited by guy038 Dec 26, 2020, 11:43 AM Dec 26, 2020, 11:32 AM

                    Hi, @robin-cruise and All,

                    Ah…, I immediately understood the problem !

                    When you use the free-spacing mode, with a first modifier (?x), or with a (?x....) syntax, like, for instance, (?xs-i) :

                    • Any ordinary space character is ignored and is not part of the overall regex

                    • Any text located after the first # character of a line is considered a comment and is not part of the overall regex, either

                    Thus, you must respect two rules :

                    • Any literal space character, to search for, must be represented with one of these three syntaxes, below :

                      • A backslash char \ right before that specific space char

                      • The [ ] syntax, that is to say a space char between square brackets, representing a character class feature

                      • The escape syntaxes \x20 or \x{20} or \x{0020}

                    • Any literal sharp character #, to search for, must be represented with one of the three syntaxes, below :

                      • A backslash char \ right before that specific # char ( => \# )

                      • The [#] syntax, that is to say a sharp char between square brackets, representing a character class feature

                      • The escape syntaxes \x23 or \x{23} or \x{0023}

                    For instance, let’s imagine that you want to match three space chars, surrounded by # characters, with a regex expression, you have the choice between all these syntaxes :

                    
                    - WITHOUT the FREE-SPACING mode :
                    
                    #   #
                    
                    #\x20{3}#
                    
                    #[ ][ ][ ]#
                    
                    - WITH the FREE-SPACING mode :
                    
                    (?x)  \x23\ \ \ \x23      # ESCAPED SPACE char and HEXADECIMAL ESCAPE of # 
                    
                    (?x)  \#[ ][ ][ ]\#       # ESCAPED SHARP char and SPACE in a CHARACTER CLASS
                    
                    (?x)  [#]\x20\x20{2}[#]   # SHARP char in a CHARACTER CLASS and HEXADECIMAL ESCAPE of SPACE chars
                    ...
                    ...
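
These variants can be tried in a script as well. The sketch below assumes Python's built-in re module, whose (?x) verbose mode applies the same two rules to these particular patterns; the variable names are hypothetical and the trailing comments are shortened.

```python
import re

TARGET = '#   #'  # a sharp char, three spaces, a sharp char

# Without the free-spacing mode.
PLAIN = [
    r'#   #',
    r'#\x20{3}#',
    r'#[ ][ ][ ]#',
]

# With the free-spacing mode; the trailing # comments are part of each pattern.
FREE_SPACING = [
    r'(?x)  \x23\ \ \ \x23      # escaped spaces, hex escape for the sharp',
    r'(?x)  \#[ ][ ][ ]\#       # escaped sharp, spaces in character classes',
    r'(?x)  [#]\x20\x20{2}[#]   # sharp in a character class, hex escapes',
]

all_match = all(re.fullmatch(p, TARGET) for p in PLAIN + FREE_SPACING)

# Unprotected, the first # starts a comment and nothing is left of the regex.
naive = re.fullmatch(r'(?x) #   #', TARGET)
print(all_match, naive)
```

All six protected patterns match the target string, while the unprotected (?x) #   # is reduced to an empty regex and cannot match it.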
                    

                    Now, let’s go back to your new regex. I suppose that you’ve already guessed the problem ;-)) Yes, this is because of the space characters which surround the word BEGIN ! Note also that the ! char is not a special regex char outside look-around syntaxes such as (?!....), so it does not need to be escaped. So, the correct regex should be expressed as :

                    • (?xs) (<!--\ BEGIN\ --> | \G ) ((?!^<).)+? <a[ ]href=" \K (?=/)

                    Or

                    • (?xs) (<!--[ ]BEGIN[ ]--> | \G ) ((?!^<).)+? <a[ ]href=" \K (?=/)

                    Or

                    • (?xs) (<!--\x20BEGIN\x20--> | \G ) ((?!^<).)+? <a[ ]href=" \K (?=/)

                    And, without the free-spacing mode :

                    • (?s)(<!-- BEGIN -->|\G)((?!^<).)+?<a href="\K(?=/)
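
The effect of the unprotected spaces around BEGIN can be checked in isolation. This is only a sketch, assuming Python's built-in re module and testing just the <!-- BEGIN --> part of each regex, since \G and \K are not available there; the variable names are hypothetical.

```python
import re

TARGET = '<!-- BEGIN -->'

# The original attempt: in free-spacing mode the two plain spaces are ignored.
broken = re.compile(r'(?x) <\!-- BEGIN -->')

# The corrected spellings of the same literal string.
fixed = [
    re.compile(r'(?x) <!--\ BEGIN\ -->'),
    re.compile(r'(?x) <!--[ ]BEGIN[ ]-->'),
    re.compile(r'(?x) <!--\x20BEGIN\x20-->'),
    re.compile(r'<!-- BEGIN -->'),  # without the free-spacing mode
]

broken_matches_target = broken.fullmatch(TARGET) is not None
broken_matches_squeezed = broken.fullmatch('<!--BEGIN-->') is not None
fixed_all_match = all(p.fullmatch(TARGET) for p in fixed)
print(broken_matches_target, broken_matches_squeezed, fixed_all_match)
```

The original attempt only matches the text without spaces, <!--BEGIN-->, which is why no replacement occurred; each corrected spelling matches the real comment.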

                    You may even use this layout, where I’m using the free-spacing mode with numerous comments and non-capturing groups :

                    (?xs-i)                    # FREE-SPACING mode - DOT means ANY char, even EOL chars - Search SENSITIVE to CASE ( NON-INSENSITIVE ! )
                    (?:                        # FIRST NON-CAPTURING group to DEFINE a group of ALTERNATIVES
                      <!--[ ]BEGIN[ ]-->       #   FIRST alternative : the string <!-- BEGIN --> with this EXACT case
                    |                          # The ALTERNATION regex symbol
                      \G                       #   SECOND alternative : the \G assertion which forces that the NEXT match begins RIGHT AFTER the PREVIOUS one
                    )                          # END of the FIRST NON-CAPTURING group
                    (?:                        # SECOND NON-CAPTURING group to define a SINGLE REPEATED char
                      (?!^<).                  #   ANY char, even an EOL char, IF this char is NOT an OPENING ANGLE bracket at BEGINNING of a line
                                               #    ...That is to say that the regex engine MUST NOT enter a NEW section while doing a MATCH ATTEMPT
                    )+?                        # END of the SECOND NON-CAPTURING group, REPEATED from 1 to MORE, the MINIMUM of times till...
                    <a[ ]href="                # The LITERAL string <a href="
                    \K                         # RESETS the regex engine LOCATION and CANCELS matches, so far => the PRESENT match is ONLY the EMPTY string...
                    (?=/)                      # IF FOLLOWED with a SLASH character
                    

                    Now, @robin-cruise, follow these steps :

                    • Select all the text from (?xs-i) till SLASH character

                    • Open the Replace dialog ( Ctrl + H )

                    • Type in https://link.ca in the Replace with field

                    • Select the Regular expression search mode

                    • Click on the Replace All button

                    Here you are ;-))

                    Cheers,

                    guy038

                      Robin Cruise
                      last edited by Dec 26, 2020, 11:42 AM

                      thank you !

                      The Community of users of the Notepad++ text editor.