replace words between tags with regular expression
-
Good day. I have this HTML code:
<My Tag>
    <div class="searchField">
        <div align="right">
            <a href="/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>
            <a href="/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>
            <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>
            <a href="/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>
            <a href="/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>
            <a href="/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>
            <a href="/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>
            <a href="/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>
            <a href="/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>
            <a href="/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
<My Tag>
I want to replace every
<a href="/
with
<a href="https://link.ca/
but only between <My Tag> and <My Tag>.
My solution is below, but it is not very good:
SEARCH:
<My Tag>(?s)(.*)(<a href="/).*?(?s)<My Tag>
REPLACE BY:
<My Tag>\1<a href="https://link.ca/\2<My Tag>
-
I see that your search expression does not account for the EOL characters (“carriage return” and “new line”), which are non-printing. You can display them with View > Show Symbol > Show All Characters.
I can only get the first part, up to
<a href="/
to match, so my not-so-clever search would be:
(<My Tag>\r\n)(.*\r\n)(.*\r\n)(\r\n)(.*)(<a href="/)
and the replacement:
$1$2$3$4$5<a href="https://link.ca/
This could be wrong, so use it with care.
-
If your solution works, is there reason for concern about it?
Or would you just like someone to comment on how it could be better?
The issue has nothing to do with “end-of-lines”, except that in your proposed solution, you turned it into something that involved end-of-lines. Probably best to refrain from offering solutions if your solution is going to be off-track, or even more complicated than prior ones proposed.
-
@Alan-Kilborn I don’t know, Alan. In the regex tester plugin I couldn’t get Robin’s search phrase to match, so I varied it until it worked. But it does look poor.
-
Hello, @robin-cruise, @alan-kilborn, @carypt and All
First, thanks for trying to find a regex solution by yourself!
Now, let’s start with this sample, which contains two <My Tag>.....<My Tag> sections and one <My Old Tag>...........<My Old Tag> section:
<My Tag>
    <div class="searchField">
        <div align="right">
            <a href="/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>
            <a href="/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>
            <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>
            <a href="/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>
            <a href="/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>
            <a href="/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>
            <a href="/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>
            <a href="/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>
            <a href="/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>
            <a href="/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
<My Tag>
<My Old Tag>
    <div class="searchField">
        <div align="right">
            <a href="/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>
            <a href="/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>
            <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>
            <a href="/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>
            <a href="/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>
            <a href="/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>
            <a href="/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>
            <a href="/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>
            <a href="/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>
            <a href="/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
<My Old Tag>
<My Tag>
    <div class="searchField">
        <div align="right">
            <a href="/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>
            <a href="/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>
            <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>
            <a href="/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>
            <a href="/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>
            <a href="/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>
            <a href="/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>
            <a href="/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>
            <a href="/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>
            <a href="/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
<My Tag>
BTW, shouldn’t the closing tags be </My Tag> and </My Old Tag>? But this does not matter for the rest of this post!
Your search regex
<My Tag>(?s)(.*)(<a href="/).*?(?s)<My Tag>
does not work as expected, for several reasons:
- First, it grabs the complete range from the first <My Tag> string to the last <My Tag> string, so all 3 sections together instead of one section only. To correct this behaviour, just add a ? right after the first .* in order to match the smallest range of chars instead of the greatest!
=>
<My Tag>(?s)(.*?)(<a href="/).*?(?s)<My Tag>
- Secondly, I suppose that you wrongly defined your group 2. Indeed, I think that it is the range between <a href="/ and the closing tag <My Tag> which should be stored as group 2. Note also that the second (?s) modifier is useless, as the mode is already set! And it is better to place the first (?s) at the beginning of the regex, for better readability!
So, your regex S/R is, now :
SEARCH
(?s)<My Tag>(.*?)<a href="/(.*?)<My Tag>
REPLACE
<My Tag>\1<a href="https://link.ca/\2<My Tag>
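For readers who want to see the effect of that extra ? outside Notepad++, here is a minimal sketch in Python. The two-section sample text is made up for the illustration, and re.DOTALL stands in for the (?s) modifier; Python's re engine is not the one Notepad++ uses, so take it as an analogy only.

```python
import re

# Hypothetical two-section sample; only the <My Tag> marker comes from the thread.
text = (
    '<My Tag>\n    <a href="/one.html">\n<My Tag>\n'
    '<My Tag>\n    <a href="/two.html">\n<My Tag>\n'
)

# Greedy .* runs to the end of the text, then backtracks to the LAST <a href="/
greedy = re.search(r'<My Tag>(.*)<a href="/', text, re.DOTALL)

# Lazy .*? stops at the FIRST <a href="/ after the opening <My Tag>
lazy = re.search(r'<My Tag>(.*?)<a href="/', text, re.DOTALL)

print(len(greedy.group(1)), len(lazy.group(1)))
```

The greedy group swallows the whole first section and part of the second, while the lazy group holds only the whitespace before the first link.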
If we run this regex S/R against our sample text, we notice that only the
<a href="/website-1.html">
string of each good <My Tag>......<My Tag> section is changed. And, after several repeated clicks on the Replace All button, the remaining links of these sections are replaced too, but if you keep going, the <a href="/website...."> parts of the wrong <My Old Tag>.......<My Old Tag> section are also modified :-((
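The "one link per pass" behaviour can be reproduced with a small Python sketch. The shortened section below is a hypothetical stand-in for the real sample, and each call to sub() plays the role of one click on Replace All.

```python
import re

# Hypothetical, shortened section with two links to rewrite.
section = (
    '<My Tag>\n'
    '    <a href="/website-1.html">\n'
    '    <a href="/fr/website-2.html">\n'
    '<My Tag>\n'
)

pattern = re.compile(r'<My Tag>(.*?)<a href="/(.*?)<My Tag>', re.DOTALL)
replacement = r'<My Tag>\1<a href="https://link.ca/\2<My Tag>'

# Each pass consumes the whole section but rewrites a single <a href="/ in it.
once = pattern.sub(replacement, section)
twice = pattern.sub(replacement, once)

print(once.count("https://link.ca/"), twice.count("https://link.ca/"))
```

Because the match spans the entire section, only one link can be changed per match, which is why repeated passes are needed with this approach.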
So we cannot go on this way! Globally, the correct scheme is:
- To search, first, for a <My Tag> string
- To catch any range of any characters till the nearest <a href="/ string
- Do the appropriate replacement
- Re-start the search, immediately from the next character, with the \G assertion and…
- Search, again, for any range of any characters till the nearest <a href="/ string
- Do the appropriate replacement
And so on…
However, in order that no replacement occurs when inside a wrong section such as <My Old Tag>.......<My Old Tag>, we must add one condition: no < symbol at the beginning of a line may occur anywhere in the range of chars which is followed by the <a href="/ string! This can be achieved with the regex ((?!^<).)+?
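As a rough illustration of how that guarded dot behaves, here is a Python sketch. The sample text is hypothetical, and re.MULTILINE plus re.DOTALL are my assumption for reproducing the "^ matches at each line start" and (?s) behaviour; this is not the Notepad++ engine itself.

```python
import re

# Hypothetical sample: a good section opener followed by a "wrong" one.
text = (
    '<My Tag>\n'
    '    <a href="/good.html">\n'
    '<My Old Tag>\n'
    '    <a href="/old.html">\n'
)

# (?!^<). consumes any char unless it is a < sitting at the start of a line.
# re.MULTILINE lets ^ match at every line start; re.DOTALL lets . match newlines.
run = re.search(r'<My Tag>((?:(?!^<).)+)', text, re.MULTILINE | re.DOTALL)
covered = run.group(1)

print(repr(covered))
```

The run of characters stops right before <My Old Tag>, since that < is the first one found at the beginning of a line; the indented <a ...> tags do not stop it.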
Now, the initial string to change is <a href="/ and the final string should be <a href="https://link.ca/
This is strictly equivalent to saying that the empty location between the =" string and the / symbol must be replaced with the string https://link.ca. This empty location can be obtained with the \K syntax.
Finally, my regex S/R, using the free-spacing mode (?x), is:
SEARCH
(?xs) ( <My[ ]Tag> | \G ) ((?!^<).)+? <a[ ]href=" \K (?=/)
REPLACE
https://link.ca
Note that, because of the \K syntax, you must use the Replace All button exclusively.
All replacements are done in all the good <My Tag>......<My Tag> sections, giving the expected text:
<My Tag>
    <div class="searchField">
        <div align="right">
            <a href="https://link.ca/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>
            <a href="https://link.ca/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>
            <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>
            <a href="https://link.ca/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>
            <a href="https://link.ca/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>
            <a href="https://link.ca/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>
            <a href="https://link.ca/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>
            <a href="https://link.ca/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>
            <a href="https://link.ca/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>
            <a href="https://link.ca/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
<My Tag>
<My Old Tag>
    <div class="searchField">
        <div align="right">
            <a href="/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>
            <a href="/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>
            <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>
            <a href="/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>
            <a href="/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>
            <a href="/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>
            <a href="/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>
            <a href="/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>
            <a href="/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>
            <a href="/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
<My Old Tag>
<My Tag>
    <div class="searchField">
        <div align="right">
            <a href="https://link.ca/website-1.html"><img src="index_files/flag_lang_ro.jpg" title="ro" alt="ro" width="28" height="19" /></a>
            <a href="https://link.ca/fr/website-2.html"><img src="index_files/flag_lang_fr.jpg" title="fr" alt="fr" width="28" height="19" /></a>
            <a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>
            <a href="https://link.ca/es/website-4.html"><img src="index_files/flag_lang_es.jpg" title="es" alt="es" width="28" height="19" /></a>
            <a href="https://link.ca/pt/website-5.html"><img src="index_files/flag_lang_pt.jpg" title="pt" alt="pt" width="28" height="19" /></a>
            <a href="https://link.ca/ar/website-6.html"><img src="index_files/flag_lang_ae.jpg" width="28" height="19" title="ar" alt="ar" /></a>
            <a href="https://link.ca/zh/website-7.html"><img src="index_files/flag_lang_zh.jpg" width="28" height="19" title="zh" alt="zh" /></a>
            <a href="https://link.ca/hi/website-8.html"><img src="index_files/flag_lang_hi.jpg" width="28" height="19" title="hi" alt="hi" /></a>
            <a href="https://link.ca/de/website-9.html"><img src="index_files/flag_lang_de.jpg" width="28" height="19" title="de" alt="de" /></a>
            <a href="https://link.ca/ru/website-10.html"><img src="index_files/flag_lang_ru.jpg" width="28" height="19" title="ru" alt="ru" /></a></div>
<My Tag>
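If the same job ever has to be scripted outside Notepad++, note that Python's standard re module supports neither \G nor \K, so the regex above cannot be reused as is. A possible substitute, sketched below, is a two-step approach: match each <My Tag>...<My Tag> section, then fix the links inside it with a callback. This is a different technique from the one in the post, and the function name prefix_links, its base parameter and the shortened sample are all hypothetical.

```python
import re

def prefix_links(text: str, base: str = "https://link.ca") -> str:
    """Prefix root-relative links, but only inside <My Tag>...<My Tag> sections."""
    # Lazy match, so each opening <My Tag> pairs with the nearest following one.
    section = re.compile(r"<My Tag>.*?<My Tag>", re.DOTALL)

    def fix(match: re.Match) -> str:
        # Plain string replacement is enough once the section is isolated.
        return match.group(0).replace('<a href="/', f'<a href="{base}/')

    return section.sub(fix, text)

# Shortened, hypothetical version of the three-section sample.
sample = (
    '<My Tag>\n    <a href="/website-1.html">\n    <a href="website-3.html">\n<My Tag>\n'
    '<My Old Tag>\n    <a href="/website-1.html">\n<My Old Tag>\n'
    '<My Tag>\n    <a href="/fr/website-2.html">\n<My Tag>\n'
)
result = prefix_links(sample)
print(result)
```

Unlike the \G-based regex, this sketch relies on the tags coming in pairs, so it would need adapting if a section were left unclosed.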
Notes:
- Without the free-spacing mode, the search regex becomes (?s)(<My Tag>|\G)((?!^<).)+?<a href="\K(?=/)
- The first alternative <My Tag> is used first, and once only per section, and the second alternative \G all the other times
- As said above, the part ((?!^<).)+? represents the smallest range of any chars which does not contain a < symbol at the beginning of a line, till… the <a href=" string
- Then, the \K syntax resets the start of the match to the current location and discards everything matched so far
- As the remainder of the regex is only the look-ahead (?=/), this means that this empty location, IF followed by a / symbol, is simply replaced with the string https://link.ca
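The idea of "replacing an empty location" can be tried in Python as well. Since Python's re has no \K, this sketch swaps in a look-behind to get the same zero-width match; the one-line sample is hypothetical.

```python
import re

# Hypothetical one-line sample with one root-relative and one relative link.
line = '<a href="/fr/website-2.html"> <a href="website-3.html">'

# (?<=...) and (?=/) both match without consuming characters,
# so the overall match is the empty string between the quote and the slash.
gap = re.compile(r'(?<=<a href=")(?=/)')
fixed = gap.sub("https://link.ca", line)

print(fixed)
```

Only the first link gains the prefix, because the look-ahead (?=/) fails on the link that does not start with a slash.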
You’ll probably notice that the part:
<a href="website-3.html"><img src="index_files/flag_lang_en.jpg" title="en" alt="en" width="28" height="19" /></a>
has not been changed! But this is quite logical, because it begins with <a href="website-3.html"> and not with <a href="/website-3.html">!
Best Regards,
guy038
-
Sorry, I must be blind.
-
I see that the search phrase
<My Tag>.*?\K|(<a href="/)*
matches all the <a href="/ strings in the Mark window, but it does not work in the Search/Replace window (giving totally different behavior). Also, @guy038’s search phrase isn’t marking anything in the Mark window. Why is that? Is there no way to control the matching without trying blindly?
Also, I would say the regex trainer plugin isn’t working correctly, so I am leaving it out. As an excuse for my earlier post: I didn’t read the original text well and overlooked many search matches.
And I want to thank @guy038 for his detailed explanations. :)
-
@guy038 said in replace words between tags with regular expression:
Works fine, thank you @guy038
Another case. Suppose that, instead of <My Tag>…<My Tag>, I have a comment such as <!-- BEGIN -->…<!-- BEGIN -->
I changed your regex a little bit, but I believe I made a mistake.
SEARCH:
(?xs) (<\!-- BEGIN --> | \G ) ((?!^<).)+? <a[ ]href=" \K (?=/)
REPLACE:
https://link.ca
What did I do wrong here?
-
Hi, @robin-cruise and All,
Ah…, I immediately understood the problem!
When you use the free-spacing mode, with a leading (?x) modifier or with a (?x....) syntax like, for instance, (?xs-i):
- Any usual space character is not part of the overall regex
- Any text located after a first # character is considered a comment and is not part of the overall regex either
Thus, you must respect two rules:
- Any literal space character to search for must be represented with one of the three syntaxes below:
    - A backslash \ right before that specific space char
    - The [ ] syntax, that is to say a space char between square brackets, forming a character class
    - The escape syntaxes \x20 or \x{20} or \x{0020}
- Any literal sharp character # to search for must be represented with one of the three syntaxes below:
    - A backslash \ right before that specific # char ( => \# )
    - The [#] syntax, that is to say a sharp char between square brackets, forming a character class
    - The escape syntaxes \x23 or \x{23} or \x{0023}
For instance, let’s imagine that you want to match three space chars, surrounded by # characters, with a regex. You have the choice between all these syntaxes:
- WITHOUT the FREE-SPACING mode :
#   #
#\x20{3}#
#[ ][ ][ ]#
- WITH the FREE-SPACING mode :
(?x) \x23\ \ \ \x23          # ESCAPED SPACE char and HEXADECIMAL ESCAPE of #
(?x) \#[ ][ ][ ]\#           # ESCAPED SHARP char and SPACE in a CHARACTER CLASS
(?x) [#]\x20\x20{2}[#]       # SHARP char in a CHARACTER CLASS and HEXADECIMAL ESCAPE of SPACE chars
...
...
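These spellings can be checked quickly in Python, whose re.VERBOSE flag follows similar free-spacing rules (an assumption of this sketch: the Notepad++ engine and Python's re agree on these particular escapes, which is the case for the three forms tested here).

```python
import re

target = "#   #"  # three spaces between two sharp characters

# The three free-spacing spellings from the post, without the leading (?x)
# since the flag is passed separately.
patterns = [
    r"\x23\ \ \ \x23",
    r"\#[ ][ ][ ]\#",
    r"[#]\x20\x20{2}[#]",
]
results = [bool(re.fullmatch(p, target, re.VERBOSE)) for p in patterns]

# Written naively, the first # starts a comment and the pattern becomes empty.
naive = re.fullmatch(r"#   #", target, re.VERBOSE)

print(results, naive)
```

All three escaped spellings match the target, while the naive one does not, because everything from the first # onward is treated as a comment.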
Now, let’s go back to your new regex. I suppose that you’ve already guessed the problem ;-)) Yes, it is because of the space characters which surround the word BEGIN! Note also that the ! char is not a special regex character here, so the backslash in <\!-- is not needed. So, the correct regex should be expressed as:
(?xs) (<!--\ BEGIN\ --> | \G ) ((?!^<).)+? <a[ ]href=" \K (?=/)
Or
(?xs) (<!--[ ]BEGIN[ ]--> | \G ) ((?!^<).)+? <a[ ]href=" \K (?=/)
Or
(?xs) (<!--\x20BEGIN\x20--> | \G ) ((?!^<).)+? <a[ ]href=" \K (?=/)
And, without the free-spacing mode :
(?s)(<!-- BEGIN -->|\G)((?!^<).)+?<a href="\K(?=/)
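The failure itself is easy to reproduce in Python with re.VERBOSE, which also drops unescaped spaces from the pattern. The snippet below only tests the <!-- BEGIN --> part, on a hypothetical two-line sample, since \G and \K are not available in Python's re.

```python
import re

# Hypothetical sample: a BEGIN comment followed by one link.
text = '<!-- BEGIN -->\n    <a href="/website-1.html">'

# In free-spacing mode the spaces vanish, so this really searches for <!--BEGIN-->
broken = re.search(r"<!-- BEGIN -->", text, re.VERBOSE)

# A space inside a character class survives free-spacing mode.
fixed = re.search(r"<!--[ ]BEGIN[ ]-->", text, re.VERBOSE)

# Without free-spacing mode the literal spaces work as written.
plain = re.search(r"<!-- BEGIN -->", text)

print(broken, fixed.group(0), plain.group(0))
```

The first search finds nothing, while the other two both match the comment.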
You may even use this layout, where I’m using the free-spacing mode with numerous comments and non-capturing groups:
(?xs-i)                 # FREE-SPACING mode - DOT means ANY char, even EOL chars - Search SENSITIVE to CASE ( NON-INSENSITIVE ! )
(?:                     # FIRST NON-CAPTURING group to DEFINE a group of ALTERNATIVES
  <!--[ ]BEGIN[ ]-->    # FIRST alternative : the string <!-- BEGIN --> with this EXACT case
  |                     # The ALTERNATION regex symbol
  \G                    # SECOND alternative : the \G assertion which forces that the NEXT match begins RIGHT AFTER the PREVIOUS one
)                       # END of the FIRST NON-CAPTURING group
(?:                     # SECOND NON-CAPTURING group to define a SINGLE REPEATED char
  (?!^<).               # ANY char, even an EOL char, IF this char is NOT an OPENING ANGLE bracket at BEGINNING of a line
                        # ...That is to say that the regex engine MUST NOT enter a NEW section while doing a MATCH ATTEMPT
)+?                     # END of the SECOND NON-CAPTURING group, REPEATED from 1 to MORE, the MINIMUM of times till...
<a[ ]href="             # The LITERAL string <a href="
\K                      # RESETS the regex engine LOCATION and CANCELS matches, so far => the PRESENT match is ONLY the EMPTY string...
(?=/)                   # IF FOLLOWED with a SLASH character
Now, @robin-cruise, follow these steps:
- Select all the text from (?xs-i) till SLASH character
- Open the find dialog ( Ctrl + F )
- Type in https://link.ca in the Replace with field
- Select the Regular expression search mode
- Click on the Replace All button
Here you are ;-))
Cheers,
guy038
-
-
Thank you!