replace words between tags with regular expression



  • Good day. I have this HTML code:

    <My Tag>
    	<div class="searchField">
                <div align="right">
    
                  <a href="/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
    
    <My Tag>
    

    I want to replace every <a href="/ with <a href="https://link.ca/ between <My Tag> and <My Tag>.

    My solution is below, but it is not very good:

    SEARCH: <My Tag>(?s)(.*)(<a href="/).*?(?s)<My Tag>

    REPLACE BY: <My Tag>\1<a href="https://link.ca/\2<My Tag>



  • I see you don't account for the EOL ("carriage return" and "new line") non-printable characters in your search expression (menu View > Show Symbol > Show All Characters).

    I can only get the first part, up to <a href="/, to match, so my not-so-clever search would be: (<My Tag>\r\n)(.*\r\n)(.*\r\n)(\r\n)(.*)(<a href="/) and replace: $1$2$3$4$5<a href="https://link.ca/ . It could be wrong, so use it with care.



  • @Robin-Cruise

    If your solution works, is there reason for concern about it?
    Or would you just like someone to comment on how it could be better?

    @carypt

    The issue has nothing to do with “end-of-lines”, except that in your proposed solution, you turned it into something that involved end-of-lines. Probably best to refrain from offering solutions if your solution is going to be off-track, or even more complicated than prior ones proposed.



  • @Alan-Kilborn I don't know, Alan. In the regex tester plugin I couldn't get Robin's search expression to match, so I varied it until it worked. But it looks poor.



  • Hello, @robin-cruise, @alan-kilborn, @carypt and All

    First, thanks for trying to find a regex solution by yourself !

    Now, let’s start with this sample, which contains two sections <My Tag>.....<My Tag> and one section <My Old Tag>...........<My Old Tag>

    <My Tag>
    	<div class="searchField">
                <div align="right">
    
                  <a href="/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
    <My Tag>
    
    <My Old Tag>
    	<div class="searchField">
                <div align="right">
    
                  <a href="/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
    <My Old Tag>
    
    <My Tag>
    	<div class="searchField">
                <div align="right">
    
                  <a href="/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
    <My Tag>
    

    BTW, shouldn't the closing tags be </My Tag> and </My Old Tag> ? But this does not matter for the rest of this post !

    Your search regex <My Tag>(?s)(.*)(<a href="/).*?(?s)<My Tag> does not work as expected for several reasons :

    • First, it matches the complete range from the first <My Tag> string to the last <My Tag> string, so all 3 sections together, instead of one section only. To correct this behaviour, just add a ? right after the first .*, in order to search for the smallest range of chars instead of the greatest !

    => <My Tag>(?s)(.*?)(<a href="/).*?(?s)<My Tag>

    • Secondly, I suppose that you wrongly defined your group 2. Indeed, I think that it is the range between <a href="/ and the closing tag <My Tag> which should be stored as group 2. Note also that the second (?s) modifier is useless, as the mode is already set ! And it is better to place the first (?s) at the beginning of the regex, for a better understanding !

    So, your regex S/R is, now :

    SEARCH (?s)<My Tag>(.*?)<a href="/(.*?)<My Tag>

    REPLACE <My Tag>\1<a href="https://link.ca/\2<My Tag>

    If we run this regex S/R against our sample text, we notice that only the first <a href="/website-1.html"> string of each good section <My Tag>......<My Tag> is changed. After several repeated clicks on the Replace All button, all the links of these sections are replaced, but if you keep going, the parts <a href="/website...."> of the wrong <My Old Tag>.......<My Old Tag> section are also modified :-((
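
    To see this behaviour outside Notepad++, here is a minimal sketch using Python's re module. This is an assumption made for illustration only, since Notepad++ uses its own regex engine. The shortened sample text and the variable names are hypothetical. Each call to re.sub plays the role of one click on the Replace All button.

```python
import re

# Hypothetical, shortened sample: two good sections around one old section.
sample = (
    '<My Tag>\n  <a href="/1.html">a</a> <a href="/2.html">b</a>\n<My Tag>\n'
    '\n'
    '<My Old Tag>\n  <a href="/3.html">c</a>\n<My Old Tag>\n'
    '\n'
    '<My Tag>\n  <a href="/4.html">d</a> <a href="/5.html">e</a>\n<My Tag>\n'
)

# Greedy .* runs to the LAST <a href="/ ; lazy .*? stops at the FIRST one.
greedy = re.search(r'(?s)<My Tag>(.*)<a href="/', sample)
lazy = re.search(r'(?s)<My Tag>(.*?)<a href="/', sample)

search = r'(?s)<My Tag>(.*?)<a href="/(.*?)<My Tag>'
replace = r'<My Tag>\1<a href="https://link.ca/\2<My Tag>'

# Pass 1 changes only the first link of each good section.
pass1, changed = re.subn(search, replace, sample)
# Pass 2 finishes the good sections; pass 3 reaches into <My Old Tag>.
pass2 = re.sub(search, replace, pass1)
pass3 = re.sub(search, replace, pass2)
```

    In this sketch, the link inside <My Old Tag> survives two passes and is rewritten by the third one, which is the unwanted side effect described above.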


    So we cannot go on this way ! Globally, the correct scheme is :

    • To search, first, for a <My Tag> string

    • To catch any range of any characters till the nearest string <a href="/

    • Do the appropriate replacement

    • Re-start the search, immediately from the next character, with the \G assertion and…

    • Search, again, for any range of any characters till the nearest string <a href="/

    • Do the appropriate replacement

    And so on…

    However, so that no replacement occurs inside a wrong section such as <My Old Tag>.......<My Old Tag>, we must add one condition : the range of chars which precedes the string <a href="/ must not contain any < symbol at the beginning of a line ! This can be achieved with the regex ((?!^<).)+?
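
    The effect of this construct can be checked with a small sketch in Python's re module (an assumption for illustration; the sample text and names are hypothetical). In Python, ^ only matches at every line start when the re.M flag is set, and the dot only matches EOL chars with re.S. The greedy + is used here only to show how far the range is allowed to extend.

```python
import re

# Hypothetical text: the "<" of "<div>" is mid-line,
# the "<" of "<My Old Tag>" is at the beginning of a line.
text = 'first line\n  indented <div>\n<My Old Tag>\n  more text\n'

# Each char is accepted only if it is not a "<" located at a line start.
run = re.match(r'(?:(?!^<).)+', text, flags=re.S | re.M)
allowed = run.group()

print(repr(allowed))
```

    The range stops right before <My Old Tag>, while the mid-line < of <div> is accepted. This is why the technique relies on the inner lines of each section being indented.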

    Now, the initial string to change is <a href="/ and the final string should be <a href="https://link.ca/

    This is strictly equivalent to saying that the empty location between the " character and the / symbol must be replaced with the string https://link.ca. This empty location can be obtained with the \K syntax
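
    As a rough illustration of replacing an empty location, here is a minimal sketch in Python's re module, which has no \K syntax. It swaps in a different technique, a look-behind, to reach the same zero-width position. The one-line sample and the names are hypothetical.

```python
import re

# Hypothetical one-line sample: one root-relative link and one relative link.
line = '<a href="/fr/website-2.html">fr</a> <a href="website-3.html">en</a>'

# Zero-width match: right after <a href=" and right before a "/".
empty_spot = re.compile(r'(?<=<a href=")(?=/)')

# Nothing is consumed, so the replacement string is simply inserted.
result = empty_spot.sub('https://link.ca', line)
```

    Only the first link gets the prefix, because the second href value does not start with a / symbol.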

    Finally, my regex S/R, using the free-spacing mode (?x) is :

    SEARCH (?xs) ( <My[ ]Tag> | \G ) ((?!^<).)+? <a[ ]href=" \K (?=/)

    REPLACE https://link.ca

    Note that, because of the \K syntax, you must use the Replace All button exclusively, and not the step-by-step Replace button

    All replacements are done in all the good sections <My Tag>......<My Tag>, giving the expected text :

    <My Tag>
    	<div class="searchField">
                <div align="right">
    
                  <a href="https://link.ca/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="https://link.ca/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="https://link.ca/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="https://link.ca/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="https://link.ca/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="https://link.ca/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="https://link.ca/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="https://link.ca/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="https://link.ca/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
    <My Tag>
    
    <My Old Tag>
    	<div class="searchField">
                <div align="right">
    
                  <a href="/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
    <My Old Tag>
    
    <My Tag>
    	<div class="searchField">
                <div align="right">
    
                  <a href="https://link.ca/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>&nbsp; <a href="https://link.ca/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>&nbsp; <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>&nbsp; <a href="https://link.ca/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>&nbsp; <a href="https://link.ca/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>&nbsp; <a href="https://link.ca/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>&nbsp; <a href="https://link.ca/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>&nbsp; <a href="https://link.ca/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>&nbsp; <a href="https://link.ca/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>&nbsp; <a href="https://link.ca/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
    <My Tag>
    

    Notes :

    • Without the free-spacing mode the search regex becomes (?s)(<My Tag>|\G)((?!^<).)+?<a href="\K(?=/)

    • The first alternative <My Tag> occurs first, and once only, per section and the second alternative \G all the other times

    • As said above, the part ((?!^<).)+? represents the smallest range of any chars, which does not contain a < symbol at beginning of a line, till… the string <a href="

    • Then, the \K syntax resets the starting point of the reported match, so everything matched so far is discarded from the final match

    • As the remaining of the regex is only the look-ahead (?=/), this means that this empty location, IF followed with a / symbol, is simply changed with the replacement string https://link.ca


    You'll probably notice that the part :

    <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>
    

    has not been changed ! But this is quite logical, because it begins with <a href="website-3.html"> and not with <a href="/website-3.html"> !
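
    For readers who want to script the same scheme outside Notepad++, here is a minimal sketch in Python's re module, which supports neither \G nor \K. Both are therefore emulated by hand, which is an assumption of this sketch and not the regex above. The names OPENER, STEP and add_base, as well as the shortened sample, are hypothetical.

```python
import re

OPENER = re.compile(r'<My Tag>')
# One step: the shortest run of chars that never includes a "<" at the start
# of a line, then <a href=" , then a "/" that is checked but not consumed.
STEP = re.compile(r'(?:(?!^<).)+?<a href="(?=/)', re.S | re.M)


def add_base(text, base='https://link.ca'):
    """Insert `base` after <a href=" inside <My Tag> sections only."""
    out = []
    copied = 0  # text[:copied] has already been appended to `out`
    pos = 0
    while (opener := OPENER.search(text, pos)) is not None:
        pos = opener.end()
        # Emulates \G: each step must begin exactly where the last one ended.
        while (step := STEP.match(text, pos)) is not None:
            pos = step.end()
            out.append(text[copied:pos])
            out.append(base)  # emulates \K: only the empty spot is filled
            copied = pos
    out.append(text[copied:])
    return ''.join(out)


# Hypothetical, shortened sample; link lines are indented as in the thread.
sample = (
    '<My Tag>\n  <a href="/1.html">a</a> <a href="2.html">b</a>'
    ' <a href="/3.html">c</a>\n<My Tag>\n'
    '\n'
    '<My Old Tag>\n  <a href="/4.html">d</a>\n<My Old Tag>\n'
    '\n'
    '<My Tag>\n  <a href="/5.html">e</a>\n<My Tag>\n'
)
result = add_base(sample)
```

    In this sketch, links 1, 3 and 5 get the prefix, link 2 is skipped because it has no leading / symbol, and link 4 is left alone because it sits inside <My Old Tag>.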

    Best Regards,

    guy038



  • Sorry, I must be blind.



  • I see the search expression <My Tag>.*?\K|(<a href="/)* matching all the <a href="/ strings in the Mark dialog, but it does not work in the Replace dialog (giving totally different behavior). Also, @guy038's search expression isn't marking anything in the Mark dialog. Why is that? Is there no way to check the matching without trying blindly?

    Also, I would say the regex trainer plugin isn't working correctly, so I'll leave it out. As my excuse for earlier: I wasn't reading the original text well, and I overlooked many search matches.

    And I want to thank @guy038 for his detailed explanations :)



  • @guy038 said in replace words between tags with regular expression:

    https://link.ca

    works fine, thank you @guy038

    Another case. Suppose instead of <My Tag><My Tag> I have a comment such as <!-- BEGIN --><!-- BEGIN -->

    I changed your regex a little, but I believe I made a mistake.

    SEARCH: (?xs) (<\!-- BEGIN --> | \G ) ((?!^<).)+? <a[ ]href=" \K (?=/)

    REPLACE: https://link.ca

    What did I do wrong here?



  • Hi, @robin-cruise and All,

    Ah…, I immediately understood the problem !

    When you use the free-spacing mode, with a first modifier (?x), or with a (?x....) syntax, like, for instance, (?xs-i) :

    • Any ordinary space character is ignored and is not part of the overall regex

    • Any text located after a first # character, up to the end of the line, is considered a comment and is not part of the overall regex, either

    Thus, you must respect two rules :

    • Any literal space character, to search for, must be represented with one of these three syntaxes, below :

      • A backslash char \ right before that specific space char

      • The [ ] syntax, that is to say a space char between square brackets, which forms a character class

      • The escape syntaxes \x20 or \x{20} or \x{0020}

    • Any literal hash character #, to search for, must be represented with one of the three syntaxes, below :

      • A backslash char \ right before that specific # char ( => \# )

      • The [#] syntax, that is to say a hash char between square brackets, which forms a character class

      • The escape syntaxes \x23 or \x{23} or \x{0023}

    For instance, let’s imagine that you want to match three space chars, surrounded by # characters, with a regex. You have the choice between all these syntaxes :

    
    - WITHOUT the FREE-SPACING mode :
    
    #   #
    
    #\x20{3}#
    
    #[ ][ ][ ]#
    
    - WITH the FREE-SPACING mode :
    
    (?x)  \x23\ \ \ \x23      # ESCAPED SPACE char and HEXADECIMAL ESCAPE of # 
    
    (?x)  \#[ ][ ][ ]\#       # ESCAPED SHARP char and SPACE in a CHARACTER CLASS
    
    (?x)  [#]\x20\x20{2}[#]   # SHARP char in a CHARACTER CLASS and HEXADECIMAL ESCAPE of SPACE chars
    ...
    ...
    

    Now, let’s go back to your new regex. I suppose that you’ve already guessed the problem ;-)) Yes, it is because of the space characters which surround the word BEGIN : in free-spacing mode, they are ignored ! Note also that the ! char is not a special char on its own, so it does not need a \ before it. So, the correct regex should be expressed as :

    • (?xs) (<!--\ BEGIN\ --> | \G ) ((?!^<).)+? <a[ ]href=" \K (?=/)

    Or

    • (?xs) (<!--[ ]BEGIN[ ]--> | \G ) ((?!^<).)+? <a[ ]href=" \K (?=/)

    Or

    • (?xs) (<!--\x20BEGIN\x20--> | \G ) ((?!^<).)+? <a[ ]href=" \K (?=/)

    And, without the free-spacing mode :

    • (?s)(<!-- BEGIN -->|\G)((?!^<).)+?<a href="\K(?=/)
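
    The two free-spacing rules can be checked with a minimal sketch in Python's re module, whose (?x) mode treats unprotected spaces and # characters in the same way. Using Python here is an assumption for illustration, and all the variable names are hypothetical. The \x{20} form is left out because Python does not accept it.

```python
import re

# Three spaces surrounded by "#" chars, written three ways in free-spacing mode.
patterns = [
    r'(?x)  \x23\ \ \ \x23      # escaped spaces, hexadecimal escape of "#"',
    r'(?x)  \#[ ][ ][ ]\#       # escaped "#", spaces in character classes',
    r'(?x)  [#]\x20\x20{2}[#]   # "#" in a character class, hexadecimal spaces',
]
three_spaces_ok = all(re.fullmatch(p, '#   #') for p in patterns)

# Unprotected, the first "#" starts a comment, so the pattern is empty.
naive = re.fullmatch(r'(?x) #   #', '#   #')

comment = '<!-- BEGIN -->'
# Unprotected spaces are dropped: this really searches for <!--BEGIN-->.
broken = r'(?x) <!-- BEGIN -->'
fixed = r'(?x) <!--[ ]BEGIN[ ]-->'

broken_hits_comment = re.search(broken, comment) is not None
broken_hits_squeezed = re.search(broken, '<!--BEGIN-->') is not None
fixed_hits_comment = re.search(fixed, comment) is not None
```

    In this sketch, the unprotected pattern only finds the squeezed string <!--BEGIN-->, which is why the search failed on the real comment.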

    You may even use this layout, where I’m using the free-spacing mode with numerous comments and non-capturing groups :

    (?xs-i)                    # FREE-SPACING mode - DOT means ANY char, even EOL chars - Search SENSITIVE to CASE ( NON-INSENSITIVE ! )
    (?:                        # FIRST NON-CAPTURING group to DEFINE a group of ALTERNATIVES
      <!--[ ]BEGIN[ ]-->       #   FIRST alternative : the string <!-- BEGIN --> with this EXACT case
    |                          # The ALTERNATION regex symbol
      \G                       #   SECOND alternative : the \G assertion which forces that the NEXT match begins RIGHT AFTER the PREVIOUS one
    )                          # END of the FIRST NON-CAPTURING group
    (?:                        # SECOND NON-CAPTURING group to define a SINGLE REPEATED char
      (?!^<).                  #   ANY char, even an EOL char, IF this char is NOT an OPENING ANGLE bracket at BEGINNING of a line
                               #    ...That is to say that the regex engine MUST NOT enter a NEW section while doing a MATCH ATTEMPT
    )+?                        # END of the SECOND NON-CAPTURING group, REPEATED from 1 to MORE, the MINIMUM of times till...
    <a[ ]href="                # The LITERAL string <a href="
    \K                         # RESETS the regex engine LOCATION and CANCELS matches, so far => the PRESENT match is ONLY the EMPTY string...
    (?=/)                      # IF FOLLOWED with a SLASH character
    

    Now, @robin-cruise, follow these steps :

    • Select all the text from (?xs-i) through SLASH character

    • Open the Replace dialog ( Ctrl + H )

    • Type in https://link.ca in the Replace with field

    • Select the Regular expression search mode

    • Click on the Replace All button

    Here you are ;-))

    Cheers,

    guy038



  • thank you !

