Regex: Delete only one instance of a string between two html tags (double quotes)

Robin Cruise

Hello. I have some HTML tags, for example:

<meta name="description" content="......"/>

As you can see, there are 2 double quotes around the content of the tag: one at the start of the content and one at the end.

But in the example below, I have one (or I can have multiple) extra double quotes, apart from the two basic ones. How can I delete those extra double quotes?

<meta name="description" content="Kiel vi rilatigas vian juĝvaloron "al la kredoj esprimitaj de aliaj se vi ne pretas elporti la kostojn de misinterpretado de la " cirkonstancoj en kiuj okazas evento?"/>

I tried to use an old generic regex that @guy038 made:

(?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?-si:FR)

will become:

FIND: (?-si:<meta name="description" content="|(?!\A)\G)(?s-i:(?!"/>).)*?\K(?-si:")

REPLACE BY: (leave empty)

The problem is that this solution deletes all the double quotes after the first one, including the last one, which should not have been deleted.

Neil Schipper

@robin-cruise

Hard to understand.

Try: "This regex produced the output output1, but the output I want is output2."

Robin Cruise

@neil-schipper only those 2 double quotes (around the dots) are important and must be kept:

<meta name="description" content="......"/>

guy038

Hi @robin-Cruise, @neil-schipper and All,

Given this text :

<meta name="description" content="Kiel vi rilatigas vian juĝvaloron "al la kredoj esprimitaj de aliaj se vi ne pretas elporti la kostojn de misinterpretado de la " cirkonstancoj en kiuj okazas evento?"/>

You used this regex :

FIND: (?-si:<meta name="description" content="|(?!\A)\G)(?s-i:(?!"/>).)*?\K(?-si:")

So, after finding and deleting the two unwanted " characters, then, due to the \G feature, the regex continues from that point and first matches the remaining range cirkonstancoj en kiuj okazas evento?

When it reads the last char ? of that range, the (?!"/>) condition is still true, because the look-ahead is tested at the position of the ? char, where the text is ?"/> and not "/>. So, due to the \K syntax, it wrongly selects the last " char !
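
The behaviour described above can be reproduced outside Notepad++. Here is a minimal Python sketch of it. Python's standard re module has neither \G nor \K, so the chaining is emulated with pattern.match(text, pos). The sample line is shortened, and the names LINE, BSR, STEP and deleted_offsets are only illustrative, not part of the original regex.

```python
import re

# Shortened version of the sample line (an assumption, for readability).
LINE = '<meta name="description" content="Kiel vi "al la kredoj de la " cirkonstancoj okazas evento?"/>'

BSR = '<meta name="description" content="'

# Same core as the original regex: tempered dot, then the quote to delete.
STEP = re.compile(r'(?s:(?!"/>).)*?"')

def deleted_offsets(text):
    """Return the offsets of every quote the original regex would delete."""
    pos = text.index(BSR) + len(BSR)   # the first attempt starts right after BSR
    hits = []
    while True:
        m = STEP.match(text, pos)      # emulates \G: the match must start at pos
        if m is None:
            break
        hits.append(m.end() - 1)       # emulates \K: only the final quote is kept
        pos = m.end()
    return hits

hits = deleted_offsets(LINE)
print(hits)
print(LINE[hits[-1]:])                 # text starting at the last deleted quote
```

Three offsets are reported, and the last one is the closing quote of the tag: the tempered dot only refuses to step onto a position where "/> begins, so it happily consumes the ? char and leaves the final " to be matched.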

This case is special because the string to find is part of the ESR region too. The rule should be :

In any single line containing the <meta name="description" content=" string, delete every subsequent double quote that does not end the tag, i.e. that is not followed by />. This gives this simple regex :

SEARCH (?-si:<meta name="description" content="|(?!\A)\G).*?\K"(?!/>)

REPLACE Leave EMPTY

Note that I keep only the (?-si) modifiers (no single-line mode, not case-insensitive) at the beginning of the regex and did not use any modifier afterwards, whereas you used this syntax .....(?s-i:(?!"/>).)*?.....

Then, each time, the .*? represents the range of text to skip before catching the " char, and the ESR region becomes the final negative look-ahead (?!/>)
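
For readers who want to check the rule outside Notepad++, here is a rough Python equivalent. It is not a translation of the regex itself (the standard re module lacks \G and \K). Instead it applies the same rule line by line: after the first BSR on a line, drop every quote that is not followed by />. The sample line is shortened, and the helper name strip_extra_quotes is hypothetical.

```python
import re

BSR = '<meta name="description" content="'
EXTRA_QUOTE = re.compile(r'"(?!/>)')     # a quote that does not end the tag

def strip_extra_quotes(text):
    """Per line: after the first BSR, drop every quote not followed by />."""
    out = []
    for line in text.split('\n'):
        head, sep, tail = line.partition(BSR)
        if sep:                          # BSR is present on this line
            tail = EXTRA_QUOTE.sub('', tail)
        out.append(head + sep + tail)
    return '\n'.join(out)

# Shortened version of the sample line (an assumption, for readability).
LINE = '<meta name="description" content="Kiel vi "al la kredoj de la " cirkonstancoj okazas evento?"/>'
fixed = strip_extra_quotes(LINE)
print(fixed)
```

Lines that do not contain the BSR string are returned unchanged, which matches the restriction that this S/R only acts on lines containing <meta name="description" content="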

Now, the above regex S/R works only for lines containing <meta name="description" content=". Below is a regex which finds any stray double quote located between the usual " boundaries of an attribute value, in an HTML or XML file :

SEARCH (?<!=\x20)(?<!=)"(?!>|/>|\x20>|\x20/>|\?>|\x20\?>|\x20\w+=)

Normally, this case should occur only in comments !
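
This second regex uses only look-behinds and look-aheads, so it can also be tried as-is with Python's standard re module. A minimal sketch, assuming the shortened sample line below (the names STRAY_QUOTE, LINE and cleaned are illustrative only):

```python
import re

# The second regex, unchanged: a quote that is neither preceded by "=" or "= "
# nor followed by something that closes the tag or starts a new attribute.
STRAY_QUOTE = re.compile(
    r'(?<!=\x20)(?<!=)"(?!>|/>|\x20>|\x20/>|\?>|\x20\?>|\x20\w+=)'
)

# Shortened version of the sample line (an assumption, for readability).
LINE = '<meta name="description" content="Kiel vi "al la kredoj de la " cirkonstancoj okazas evento?"/>'

cleaned = STRAY_QUOTE.sub('', LINE)
print(cleaned)
```

On this line, only the two quotes inside the content value are removed. The quote after description survives because it is followed by a space, a word and an = sign, and the last one survives because it is followed by />.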

Best Regards,

guy038

Robin Cruise

thank you @guy038

(?-si:<meta name="description" content="|(?!\A)\G).*?\K"(?!/>)

So, I extracted a new generic regex from your regex above.

This is the generic regex for search and replace:

(?-si:BSR|(?!\A)\G).*?\KFR(?!ESR)
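
To show how the generic form maps back to the concrete regex, here is a small Python sketch that only builds the search string. It does not run it, since \G and \K are not available in Python's standard re module. The names GENERIC and build_regex are hypothetical, and the three parts are assumed to be already valid regex fragments (special characters escaped by hand where needed).

```python
# Generic template: BSR = begin search region, FR = find regex, ESR = end search region.
GENERIC = r'(?-si:{bsr}|(?!\A)\G).*?\K{fr}(?!{esr})'

def build_regex(bsr, fr, esr):
    """Fill the generic template; the three parts must already be valid regex."""
    return GENERIC.format(bsr=bsr, fr=fr, esr=esr)

search = build_regex('<meta name="description" content="', '"', '/>')
print(search)
```

The printed string is the one to paste into the Find what field, with the Replace with field left empty.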

For the second regex you made, I also tried to extract the generic form, but I can’t figure it out…

guy038

Hi, @robin-cruise,

Regarding the first regex, your equivalent generic regex is correct.
However, we cannot find any generic regex related to my second regex ! Indeed, it just finds any double-quote character when :
- Some characters, before the " char, do not occur ( (?<!=\x20)(?<!=) )
AND
- Some characters, after the " char, do not occur ( (?!>|/>|\x20>|\x20/>|\?>|\x20\?>|\x20\w+=) )

BR

guy038

Robin Cruise

@guy038

I tried myself to find a generic form of your regex. It works well, except that it doesn’t work for " (double quotes), because that character is repeated in the tag construction. I changed those extra quotes in the content of the tags to a word, like “BOOM”, and it finds/replaces it well between the start and ending tags.

These are the generic regexes for your second solution. The short and long versions are almost the same. They do the same thing: find and replace only between the start and ending tags.

(?<!=\x20)(?<!=)FR(?!>|ESR|\x20>|\x20/>|\?>|\x20\?>|BSR)

OR

(?<!=\x20)(?<!=)FR(?!>|ESR|\x20>|\x20/>|\?>|\x20\?>|\x20BSR)

OR

(?<!=\x20)(?<!=)FR(?!>|ESR|\x20\?>|\x20BSR)