Regex: Find and Delete duplicate apostrophe on a html tag

Robin Cruise

hello. I have this html tag:

<meta name="description" content="I love my mother" but I love my sister" more than I can say"/>

As you can see, there are 4 apostrophes (double quotes) in the content. There should be only 2: one at the beginning, content=", and one at the end, "/>

I must find all tags that contain other apostrophes besides those 2 in the content section.

I made a regex, but it is not very good. Maybe you can help me:

FIND: (?-s)(<meta name="description" content=")(*?\K.*"(?s))"/>
REPLACE BY: \1\2

Alan Kilborn

@Robin-Cruise

At first glance this seems to be the typical “replace only within delimiters” problem.

But applying the technique shown HERE doesn’t seem to work.
Maybe @guy038 could comment on that and if possible amend the generic solution so that it does work?

guy038

Hello, @robin-cruise, @alan-kilborn and All,

As @alan-kilborn said, we can use this generic regex :

(?s-i:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?s-i:FR)

Note that the negative look-ahead is tested at every position BEFORE the FR regex being searched for !

So, in the event that the FR zone ( " ) is located right before the ESR zone ( /> ), you must add a negative look-ahead (?!ESR) after FR, giving this general syntax :

(?s-i:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?s-i:FR)(?!ESR)

In our case, we have :

BSR, Beginning Search-region Regex, is the regex \x20content="
ESR, Ending Search-region Regex, is the regex />
FR, Find Regex, is the regex "
RR, Replacement Regex is the EMPTY string

So, the real regex is (?s-i:\x20content="|(?!\A)\G)(?s-i:(?!/>).)*?\K(?s-i:")(?!/>). Now :

In the first non-capturing group, the s modifier is useless as no . exists in that group
In the second non-capturing group, the i modifier is useless as neither the string /> nor the dot . refers to a letter
In the third non-capturing group, the s and i modifiers are useless, and so is the non-capturing group itself. Indeed, we just search for a double quote " !
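
To make the substitution of BSR, ESR and FR into the generic template concrete, here is a minimal Python sketch that only assembles the regex text. The function name `build_region_regex` and its `guard_fr` parameter are made up for this illustration. The resulting string is meant for the editor's search dialog, not for Python's own `re` module, which has no `\G` or `\K`.

```python
def build_region_regex(bsr: str, esr: str, fr: str, guard_fr: bool = True) -> str:
    """Fill the generic BSR / ESR / FR template quoted in this thread.

    bsr, esr and fr are regex fragments, inserted verbatim (no escaping).
    guard_fr appends the extra (?!ESR) needed when FR can sit right before ESR.
    """
    regex = rf'(?s-i:{bsr}|(?!\A)\G)(?s-i:(?!{esr}).)*?\K(?s-i:{fr})'
    if guard_fr:
        regex += rf'(?!{esr})'
    return regex


# Hypothetical helper applied to the values used in this thread
print(build_region_regex(r'\x20content="', '/>', '"'))
# → (?s-i:\x20content="|(?!\A)\G)(?s-i:(?!/>).)*?\K(?s-i:")(?!/>)
```

The printed text is the unsimplified regex given above, before the useless modifiers are dropped.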

Finally, our practical regex S/R can be simplified as :

SEARCH (?-i:\x20content="|(?!\A)\G)(?s:(?!/>).)*?\K"(?!/>)

REPLACE Leave EMPTY

You may test this regex S/R against the sample text below :

<meta name="description" content="I" love my mother" but I love my sister" more "/than I can sa"y"/>

<meta name="descrip
tion" content="I" love my 
mother" but I love my sis
ter" more "/than I ca
n sa"
y"/>

<meta name="descrip
tion" content=""I love my 
mother" but I love my sis
ter" more "/than I ca
n sa"
y""/>

<meta name="description" content=""I love my mother" but I love my sister" more "/than I can sa"y""/>

After a Replace All action, you should get the expected text :

<meta name="description" content="I love my mother but I love my sister more /than I can say"/>

<meta name="descrip
tion" content="I love my 
mother but I love my sis
ter more /than I ca
n sa
y"/>

<meta name="descrip
tion" content="I love my 
mother but I love my sis
ter more /than I ca
n sa
y"/>

<meta name="description" content="I love my mother but I love my sister more /than I can say"/>

IMPORTANT :

In case you just want to find the different occurrences, systematically move the caret to the very beginning of the current file before searching
In case of replacement, do not use the Replace button ( step by step replacement ); use Replace All instead
This regex S/R works fine, even in case of multi-line tags !
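
For readers who want to try the same idea outside the editor, here is a minimal Python sketch. It is not the regex above: Python's `re` has no `\G` or `\K`, so this swaps in a different technique, matching the whole region once and cleaning it in a callback. It assumes the region runs from ` content="` to the first `"/>`, and the function name is invented for the example.

```python
import re

# Region: from ` content="` (the thread's BSR) up to the first `"/>`.
REGION = re.compile(r'(\x20content=")(.*?)("/>)', re.DOTALL)


def strip_inner_quotes(html: str) -> str:
    """Delete every double quote between the two delimiters of content="..."/>."""
    def fix(m: re.Match) -> str:
        # Keep the opening and closing delimiters, clean only the middle part.
        return m.group(1) + m.group(2).replace('"', '') + m.group(3)
    return REGION.sub(fix, html)


sample = ('<meta name="description" content="I" love my mother" but I love '
          'my sister" more "/than I can sa"y"/>')
print(strip_inner_quotes(sample))
# → <meta name="description" content="I love my mother but I love my sister more /than I can say"/>
```

Because of `re.DOTALL`, the region may span several lines, as in the multi-line samples above.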

Best Regards,

guy038

Robin Cruise

@guy038 said in Regex: Find and Delete duplicate apostrophe on a html tag:

(?-i:\x20content="|(?!\A)\G)(?s:(?!/>).)*?\K"(?!/>)

Hello @guy038, your regex is almost right, but you must consider one other case:

<meta http-equiv="X-UA-Compatible" content="IE=edge">

In this case, your regex deletes exactly the last apostrophe, and I don’t want that. The first and the last apostrophe must stay. Also, if I change your regex to target one single meta tag, as below, it no longer deletes all the apostrophes in the content section:

<meta name="description"(?-i:\x20content="|(?!\A)\G)(?s:(?!/>).)*?\K"(?!/>)
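
The first problem reported here can be reproduced with a small Python emulation of the simplified search regex. This is only a sketch: it walks the text character by character instead of using `\G` and `\K` (which Python's `re` lacks), and the function name is invented. The assumed rules are the ones of that regex: start after ` content="`, stop at `/>`, delete any `"` not directly followed by `/>`.

```python
def emulate_region_delete(text: str, bsr: str = ' content="', esr: str = '/>') -> str:
    """Emulate: (?-i:\\x20content="|(?!\\A)\\G)(?s:(?!/>).)*?\\K"(?!/>) with an EMPTY replacement."""
    out = []
    i = 0
    inside = False  # True between a BSR match and the next ESR
    while i < len(text):
        if not inside and text.startswith(bsr, i):
            out.append(bsr)
            i += len(bsr)
            inside = True
        elif inside and text.startswith(esr, i):
            out.append(esr)
            i += len(esr)
            inside = False
        elif inside and text[i] == '"' and not text.startswith(esr, i + 1):
            i += 1  # a quote inside the region, not right before ESR: deleted
        else:
            out.append(text[i])
            i += 1
    return ''.join(out)


print(emulate_region_delete('<meta http-equiv="X-UA-Compatible" content="IE=edge">'))
# → <meta http-equiv="X-UA-Compatible" content="IE=edge>
```

The tag ends with `">` rather than `"/>`, so its closing quote is not protected. Since no `/>` ends the region, the scan would also continue into whatever follows the tag until a `/>` is met.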

Alan Kilborn

@guy038 said in Regex: Find and Delete duplicate apostrophe on a html tag:

So, in the event that the FR zone ( " ) is located right before the ESR zone ( /> ), you must add a negative look-ahead (?!ESR) after FR, giving this general syntax :
(?s-i:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?s-i:FR)(?!ESR)

I suppose that is the reason that the earlier version of the generic solution didn’t work when I tried it.

guy038

Hello, @robin-cruise, @alan-kilborn and All,

Pheeew ! I worked very hard and I must have fired up the regular expression engine ;-))

I managed to get a general regex which is able to match any double-quote, in non-allowed locations, within any attribute value of an HTML file ! Of course, this regex leaves the comment zones alone !

Thus, the following S/R :

SEARCH (?is)<!--.+?-->(*SKIP)(*F)|(?:="|>"|(?!\A)\G)(?:[^<>"])*?\K"(?!\s*[,;<>])(?!\s*/>)(?!\s+[a-z][^<>="]+=\s*")

REPLACE Leave EMPTY

With the free-spacing mode, it can be expressed as below :

(?xis)                      # FREE-SPACING mode, search INSENSITIVE to CASE and DOT matches ANY char, LINE-BREAKS included
  <!--.+?-->(*SKIP)(*F)     #   Any matched COMMENT zone, MULTI-lines or NOT, is CANCELLED and WORKING location moves RIGHT AFTER that zone
|                           # OR
  (?:                       #   START of 1st NON-CAPTURING group ( BSR )
      ="                    #       ="
    |                       #     OR
      >"                    #       >"
    |                       #     OR
      (?!\A)\G              #       END of PREVIOUS search, if current location is NOT at the VERY BEGINNING of file
  )                         #   END NON-CAPTURING 1st group
                            #
  (?:                       #   START of 2nd NON-CAPTURING group ( ESR )
    [^<>"]                  #     ANY char DIFFERENT from <  and  > and "
  )*?                       #   END NON-CAPTURING 2nd group, defining the SHORTEST range, possibly EMPTY, of chars as ABOVE 
                            #
  \K                        #   Match, so far, CANCELLED and location UPDATED
  "                         #   A NON allowed DOUBLE-QUOTE ( FR ) So the EXPECTED match
                            #
                            #   The regexes, inside the NEGATIVE look-aheads, below, define the CORRECT locations of a DOUBLE-QUOTE
                            #     so the NEGATIVE look-aheads restrict the MATCH of a DOUBLE-QUOTE to NON-ALLOWED locations, ONLY
                            #
  (?!\s*[,;<>])             #   If NOT FOLLOWED with possible BLANK chars FOLLOWED with a , or ; or < or > char
  (?!\s*/>)                 #   AND if NOT FOLLOWED with possible BLANK chars FOLLOWED with the /> string
  (?!\s+[a-z][^<>="]+=\s*") #   AND if NOT FOLLOWED with ( BLANK char(s), a LETTER, some chars DIFFERENT from < and > and = and "
                            #     FOLLOWED with a = char, possible BLANK chars and a " char ) Case of a " char FOLLOWED with an ATTRIBUTE
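
The comment-skipping part can be illustrated on its own in Python. Python's `re` does not support `(*SKIP)(*F)`, so this sketch swaps in a common workaround: the comment is captured by the first alternative and handed back unchanged by a callback. The target pattern here is deliberately naive (any double quote), only to show the skipping, and the function name is invented.

```python
import re

# Alternative 1 captures a whole comment; alternative 2 is the char to delete.
COMMENT_OR_QUOTE = re.compile(r'(<!--.*?-->)|"', re.DOTALL)


def delete_quotes_outside_comments(html: str) -> str:
    # A matched comment is returned as is; a bare quote is replaced by nothing.
    return COMMENT_OR_QUOTE.sub(lambda m: m.group(1) or '', html)


print(delete_quotes_outside_comments('<!-- "keep" --> say "hi"'))
# → <!-- "keep" --> say hi
```

As with `(*SKIP)(*F)`, once a comment has been consumed the scan resumes right after it, so nothing inside the comment can be matched by the second alternative.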

You can test it against this HTML example, below, containing :

A lot of " , in non-allowed locations, of course, which should be matched by the search regex
Probably, a non-regular HTML syntax. Anyway it’s just for testing !

<html>
  <head>
    <meta charset="UTF-8">
    <meta name=""key"words" content="HTML, CSS, JavaScript""/>
    <meta name=""des"cr
	ip"tion" content="Free Web tu"torials"">
    <!-- <meta name="viewport" content="wi
    dth=device-"width, ini"tial-scale=1.0" -->
    <!-- <a href=""https://www.w3"schools.com/html"/"">"Visit our" HTML tu"torial"</a>  -->
    <meta name="author" content="John Doe">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content=""width=de""vice-w"idth, i
	nit"ial-" sc"/"ale=1.0"">
  </head>
  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph</p>
    <a href=""https://www.w3"schools.com/html"/"">"Visit our" HTML tu"torial"</a>
    <a href=""https:/
	/www.w3"schools.com/
	html"/
	"">"Visit our" H
	TML tu"torial"</a>
    <form action="/action_pa"ge.php">
      <label for=""fna"me">"First "name:"</label>
      <input type="te"xt" id="fname" name=""fna"me"><br><br>
    </form> 
  </body>
</html>

After replacement ( 36 occurrences ), we get :

<html>
  <head>
    <meta charset="UTF-8">
    <meta name="keywords" content="HTML, CSS, JavaScript"/>
    <meta name="descr
	iption" content="Free Web tutorials">
    <!-- <meta name="viewport" content="wi
    dth=device-"width, ini"tial-scale=1.0" -->
    <!-- <a href=""https://www.w3"schools.com/html"/"">"Visit our" HTML tu"torial"</a>  -->
    <meta name="author" content="John Doe">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, i
	nitial- sc/ale=1.0">
  </head>
  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph</p>
    <a href="https://www.w3schools.com/html/">"Visit our HTML tutorial"</a>
    <a href="https:/
	/www.w3schools.com/
	html/
	">"Visit our H
	TML tutorial"</a>
    <form action="/action_page.php">
      <label for="fname">"First name:"</label>
      <input type="text" id="fname" name="fname"><br><br>
    </form> 
  </body>
</html>

IMPORTANT :

This search regex works properly only with the v7.9.1 release and later
In case you just want to find or mark the different occurrences, systematically move the caret to the very beginning of the current file before searching / marking
In case of replacement, do not use the Replace button ( step by step replacement ); use Replace All instead

Now, Robin, it’s up to you to find out other cases, that I have not considered yet ;-))

Best Regards

guy038

Robin Cruise

@guy038 yes, it works. But you must also mention the <meta> tag in your regex, or else it will also delete other double quotes. For example, an HTML file usually also contains some JavaScript, and your regex will delete some double quotes there that must not be deleted.

Try your regex on this script, and it will ruin it:

(?is)<!--.+?-->(*SKIP)(*F)|(?:="|>"|(?!\A)\G)(?:[^<>"])*?\K"(?!\s*[,;<>])(?!\s*/>)(?!\s+[a-z][^<>="]+=\s*")

<script LANGUAGE="JavaScript">
function emailCheck() {
	if (document.Form.nume.value=="") {
		alert("Please enter your full name!");
		document.Form.nume.focus();
		return false
	}
	
	if(document.Form.nume.value.indexOf(' ', 0) == -1){
		alert("Please enter your full name!");
		document.Form.nume.focus();
		return false;
	}

	if (document.Form.email.value=="") {
		alert("Please enter your email!");
		document.Form.email.focus();
		return false;
	}
	if(document.Form.email.value.indexOf('@', 0) == -1){
		alert("Invalid email address!");
		document.Form.email.focus();
		return false;
	}
	if(document.Form.email.value.indexOf('.', 0) == -1){
		alert("Invalid email address!");
		document.Form.email.focus();
		return false;
	}
	if (document.Form.varsta.value=="") {
		alert("Please mention your age!");
		document.Form.varsta.focus();
		return false;
	}
}
</script>

guy038

Hi, @robin-cruise, @alan-kilborn and All,

I did consider JavaScript nested within an HTML file with, for instance, this small piece of text :

var d=window,e="length",h="",k="__duration__",l="function";function m(c){return document.getElementById(c)}

So, in the present regex, the locations of double-quote, right before a comma and right before a semicolon are skipped because of the negative look-ahead (?!\s*[,;<>])

Now, with your JavaScript example, we must also consider the right syntax ("••••••••"). This means that we have to :

Add the \(" in the BSR region
Change the negative look-ahead (?!\s*[,;<>]) to (?!\s*[,;<>)]) in the second part of the ESR region
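
The effect of the second change can be checked in isolation, since Python's `re` supports plain look-aheads. This small sketch tests only the first negative look-ahead, before and after adding `)` to the character class. The test line and the constant names are made up for the example.

```python
import re

OLD_GUARD = re.compile(r'"(?!\s*[,;<>])')    # look-ahead from the previous version
NEW_GUARD = re.compile(r'"(?!\s*[,;<>)])')   # same, with ")" added to the class

line = 'alert("Hi!");'

# Offsets of the double quotes that each look-ahead lets through
print([m.start() for m in OLD_GUARD.finditer(line)])  # → [6, 10]
print([m.start() for m in NEW_GUARD.finditer(line)])  # → [6]
```

With the old class, the closing quote at offset 10 passes the look-ahead and would be a deletion candidate. With the new class only the opening quote at offset 6 passes, and in the full regex that one is consumed by the `\(\s*"` alternative of the BSR, so it is never the matched part.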

In addition, I supposed that, in the BSR region, the =, > and ( characters may be separated from the double-quote with some blank characters

All in all, this gives this new regex version :

SEARCH / MARK :

(?is)<!--.+?-->(*SKIP)(*F)|(?:=\s*"|>\s*"|\(\s*"|(?!\A)\G)(?:[^<>"])*?\K"(?!\s*[,;<>)])(?!\s*/>)(?!\s+[a-z][^<>="]+=\s*")

REPLACE Leave EMPTY

Reminder : If you just want to find/mark the possible non-allowed ", in an HTML file, remember to always place the caret at the very beginning of file, before processing !!

Particularly, in script parts, the regex may wrongly match some locations of double-quote, which, in fact, are totally legal !

So, in order to simplify the problem, we could consider that all the JavaScript parts are beyond the scope of the current parsing attempt and treat them in the same way as comments, skipping any <script•••••>•••••••••</script> section !
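
One way to picture that simplification, outside the editor, is to split the text into protected sections (comments and script blocks) and everything else, and to apply a fixer only to the latter. This is a hedged Python sketch, not a regex for the editor: the names `PROTECTED` and `apply_outside_protected` are invented, and the fixer passed in the example is a deliberately crude stand-in for a real quote-fixing routine.

```python
import re
from typing import Callable

# Comments and <script>...</script> sections are left alone (assumed patterns).
PROTECTED = re.compile(r'(<!--.*?-->|<script\b.*?</script>)',
                       re.DOTALL | re.IGNORECASE)


def apply_outside_protected(html: str, fixer: Callable[[str], str]) -> str:
    parts = PROTECTED.split(html)
    # With one capturing group, odd indexes hold the protected sections.
    return ''.join(p if i % 2 else fixer(p) for i, p in enumerate(parts))


html = '<p class="a"b"><script>alert("x");</script>'
# Crude stand-in fixer: drop every double quote outside protected sections
print(apply_outside_protected(html, lambda s: s.replace('"', '')))
# → <p class=ab><script>alert("x");</script>
```

The quotes inside the script block survive whatever the fixer does, which is the point of skipping those sections.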

BR

guy038

P.S. :

I have very basic notions about HTML and none about Javascript ! So, I just consider the HTML text as pure text to be parsed with regular expressions. Sorry for this limitation !

Robin Cruise

super answer, @guy038 Thanks