Regex: Find and Delete duplicate apostrophe on a html tag
-
Hello. I have this HTML tag:
<meta name="description" content="I love my mother" but I love my sister" more than I can say"/>
As you can see, there are 4 apostrophes (double quotes) in the content. There should be only 2: one at the beginning, content=", and one at the end, "/>
I must find all tags that contain other apostrophes besides those 2 in the content section.
I made a regex, but it is not very good. Maybe you can help me:
FIND:
(?-s)(<meta name="description" content=")(*?\K.*"(?s))"/>
REPLACE BY: \1\2
-
-
Hello, @robin-cruise, @alan-kilborn and All,
As @alan-kilborn said, we can use this generic regex :
(?s-i:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?s-i:FR)
Note that the negative look-ahead (?!ESR) is only tested at each position BEFORE the FR regex to search for !
So, in the event that the FR zone ( " ) is located right before the ESR zone ( /> ), you must add a negative look-ahead (?!ESR) after FR, giving this general syntax :
(?s-i:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?s-i:FR)(?!ESR)
In our case, we have :
- BSR, Beginning Search-region Regex, is the regex \x20content="
- ESR, Ending Search-region Regex, is the regex />
- FR, Find Regex, is the regex "
- RR, Replacement Regex, is the EMPTY string
So, the real regex is :
(?s-i:\x20content="|(?!\A)\G)(?s-i:(?!/>).)*?\K(?s-i:")(?!/>)
Now :
- In the first non-capturing group, the s modifier is useless, as no dot . exists in that group
- In the second non-capturing group, the i modifier is useless, as neither the string /> nor the dot . refers to a letter
- In the third non-capturing group, the s and i modifiers are useless, and so is the non-capturing group itself. Indeed, we just search for a double quote " !
Finally, our practical regex S/R can be simplified as :
SEARCH
(?-i:\x20content="|(?!\A)\G)(?s:(?!/>).)*?\K"(?!/>)
REPLACE
Leave EMPTY
You may test this regex S/R against the sample text below :
<meta name="description" content="I" love my mother" but I love my sister" more "/than I can sa"y"/>
<meta name="descrip tion" content="I" love my mother" but I love my sis ter" more "/than I ca n sa" y"/>
<meta name="descrip tion" content=""I love my mother" but I love my sis ter" more "/than I ca n sa" y""/>
<meta name="description" content=""I love my mother" but I love my sister" more "/than I can sa"y""/>
After a Replace All action, you should get the expected text :
<meta name="description" content="I love my mother but I love my sister more /than I can say"/>
<meta name="descrip tion" content="I love my mother but I love my sis ter more /than I ca n sa y"/>
<meta name="descrip tion" content="I love my mother but I love my sis ter more /than I ca n sa y"/>
<meta name="description" content="I love my mother but I love my sister more /than I can say"/>
IMPORTANT :
- In case you just want to find the different occurrences, systematically move the caret to the very beginning of the current file, before searching
- In case of replacement, do not use the Replace button ( step by step replacement ) ; use Replace All
- This regex S/R works fine, even in case of multi-line tags !
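For readers who want to reproduce this clean-up outside Notepad++, here is a minimal Python sketch. It is not the \G-based regex above: Python's standard re module has no \G or \K, so the sketch swaps in a different technique (capture the whole content value, then strip the quotes inside it with a callback). The names CONTENT_ATTR and strip_inner_quotes are hypothetical, and the sketch assumes, as above, that the value starts at content=" and that the tag ends with "/>.

```python
import re

# Assumption: the value starts right after ' content="' and the tag ends with '"/>'.
# DOTALL lets the value span several lines (multi-line tags).
CONTENT_ATTR = re.compile(r'(\x20content=")(.*?)("/>)', re.DOTALL)


def strip_inner_quotes(html):
    """Remove every double quote found strictly inside content="..."/>."""
    def fix(match):
        # Keep the opening and closing delimiters, clean only the value.
        return match.group(1) + match.group(2).replace('"', '') + match.group(3)
    return CONTENT_ATTR.sub(fix, html)


sample = ('<meta name="description" content="I" love my mother" '
          'but I love my sister" more "/than I can sa"y"/>')
print(strip_inner_quotes(sample))
```

The lazy .*? stops at the first "/> it meets, so each tag is handled on its own, and the quotes of the name attribute are never touched because the match only starts at content=".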
Best Regards,
guy038
-
-
@guy038 said in Regex: Find and Delete duplicate apostrophe on a html tag:
(?-i:\x20content="|(?!\A)\G)(?s:(?!/>).)*?\K"(?!/>)
Hello, @guy038. Your regex is almost good, but you must consider one other case:
<meta http-equiv="X-UA-Compatible" content="IE=edge">
In this case, your regex deletes the last apostrophe, and I don't want that: the first and the last apostrophe must remain. Also, if I update your regex to target one single meta tag, it still does not delete all the apostrophes from the content section:
<meta name="description"(?-i:\x20content="|(?!\A)\G)(?s:(?!/>).)*?\K"(?!/>)
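One possible way to cover both endings ( "/> and "> ) is sketched below in Python. This is only an illustration, not a Notepad++ regex: it uses a capture-and-callback approach instead of \G and \K, which Python's standard re module does not provide. The names CONTENT_VALUE and keep_outer_quotes are hypothetical, and the sketch assumes that content is the last attribute of the tag and that the value contains no > character.

```python
import re

# Assumptions: 'content' is the LAST attribute of the tag and its value holds no '>'.
# The closing delimiter is a quote followed by optional blanks, an optional '/', then '>'.
CONTENT_VALUE = re.compile(r'(\x20content=")(.*?)("\s*/?>)', re.DOTALL)


def keep_outer_quotes(html):
    """Delete quotes inside the content value, keeping the first and last one."""
    def fix(match):
        return match.group(1) + match.group(2).replace('"', '') + match.group(3)
    return CONTENT_VALUE.sub(fix, html)


print(keep_outer_quotes('<meta http-equiv="X-UA-Compatible" content="IE=edge">'))
print(keep_outer_quotes('<meta name="description" content="I love my mother" '
                        'but I love my sister" more than I can say"/>'))
```

If content were followed by another attribute, the lazy range would run on to the quote before > and damage that attribute, so the assumption above matters.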
-
@guy038 said in Regex: Find and Delete duplicate apostrophe on a html tag:
So, in the event that the FR zone ( " ) is located right before the ESR zone ( /> ), you must add a negative look-ahead (?!ESR) after FR, giving this general syntax :
(?s-i:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?s-i:FR)(?!ESR)
I suppose that is the reason that the earlier version of the generic solution didn’t work when I tried it.
-
Hello, @robin-cruise, @alan-kilborn and All,
Pheeew ! I worked very hard and I must have fired up the regular expression engine ;-))
I succeeded in getting a general regex which is able to match any double quote, in non-allowed locations of any attribute value, in an HTML file ! Of course, this regex does not process the comment zones, which are skipped !
Thus, the following S/R :
SEARCH
(?is)<!--.+-->(*SKIP)(*F)|(?:="|>"|(?!\A)\G)(?:[^<>"])*?\K"(?!\s*[,;<>])(?!\s*/>)(?!\s+[a-z][^<>="]+=\s*")
REPLACE
Leave EMPTY
In free-spacing mode, it can be expressed as below :
(?xis)                       # FREE-SPACING mode, search INSENSITIVE to CASE and DOT matches ANY SINGLE char, EOL chars included
<!--.+-->(*SKIP)(*F)         # Any matched COMMENT zone, MULTI-lines or NOT, is CANCELLED and WORKING location moves RIGHT AFTER that zone
|                            # OR
(?:                          # START of 1st NON-CAPTURING group ( BSR )
    ="                       # ="
  |                          # OR
    >"                       # >"
  |                          # OR
    (?!\A)\G                 # END of PREVIOUS search, if current location is NOT at the VERY BEGINNING of file
)                            # END of 1st NON-CAPTURING group
#
(?:                          # START of 2nd NON-CAPTURING group ( ESR )
    [^<>"]                   # ANY char DIFFERENT from < and > and "
)*?                          # END of 2nd NON-CAPTURING group, defining the SHORTEST range, possibly EMPTY, of chars as ABOVE
#
\K                           # Match, so far, CANCELLED and location UPDATED
"                            # A NON allowed DOUBLE-QUOTE ( FR ). So the EXPECTED match
#
# The regexes, inside the NEGATIVE look-aheads, below, define the CORRECT locations of a DOUBLE-QUOTE
# so the NEGATIVE look-aheads restrict the MATCH of a DOUBLE-QUOTE to NON-ALLOWED locations, ONLY
#
(?!\s*[,;<>])                # If NOT FOLLOWED with possible BLANK chars FOLLOWED with a , or ; or < or > char
(?!\s*/>)                    # AND if NOT FOLLOWED with possible BLANK chars FOLLOWED with the /> string
(?!\s+[a-z][^<>="]+=\s*")    # AND if NOT FOLLOWED with ( BLANK char(s), a LETTER, some chars DIFFERENT from < and > and = and "
                             # FOLLOWED with a = char, possible BLANK chars and a " char ) Case of a " char FOLLOWED with an ATTRIBUTE
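The three negative look-aheads that end the regex can be tried in isolation. The Python sketch below only covers that look-ahead part, not the BSR, \G or \K logic, which the standard re module cannot express. The names STRAY_QUOTE and looks_stray are hypothetical. Opening quotes located right after = are consumed by the BSR part in the full regex, so they should not be passed to this helper.

```python
import re

# A quote is reported as stray when NONE of the legal "closing" contexts follows it:
#   - optional blanks, then one of , ; < >
#   - optional blanks, then />
#   - blanks, then the start of another attribute (name, =, optional blanks, quote)
STRAY_QUOTE = re.compile(
    r'"(?!\s*[,;<>])(?!\s*/>)(?!\s+[a-z][^<>="]+=\s*")',
    re.IGNORECASE,
)


def looks_stray(text, pos):
    """Return True if the double quote at index pos is not in a legal closing location."""
    return STRAY_QUOTE.match(text, pos) is not None


tag = '<meta name="des"cription" content="x">'
print(looks_stray(tag, tag.index('"c')))         # quote in the middle of the value
print(looks_stray(tag, tag.index('" content')))  # quote followed by another attribute
print(looks_stray(tag, tag.rindex('"')))         # quote followed by >
```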
You can test it against the HTML example below, containing :
- A lot of " characters, in non-allowed locations, of course, which should be matched by the search regex
- Probably, a non-regular HTML syntax. Anyway, it’s just for testing !
<html>
<head>
<meta charset="UTF-8">
<meta name=""key"words" content="HTML, CSS, JavaScript""/>
<meta name=""des"cr ip"tion" content="Free Web tu"torials"">
<!-- <meta name="viewport" content="wi dth=device-"width, ini"tial-scale=1.0" -->
<!-- <a href=""https://www.w3"schools.com/html"/"">"Visit our" HTML tu"torial"</a> -->
<meta name="author" content="John Doe">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content=""width=de""vice-w"idth, i nit"ial-" sc"/"ale=1.0"">
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph</p>
<a href=""https://www.w3"schools.com/html"/"">"Visit our" HTML tu"torial"</a>
<a href=""https:/ /www.w3"schools.com/ html"/ "">"Visit our" H TML tu"torial"</a>
<form action="/action_pa"ge.php">
<label for=""fna"me">"First "name:"</label>
<input type="te"xt" id="fname" name=""fna"me"><br><br>
</form>
</body>
</html>
After replacement ( 36 occurrences ), we get :
<html>
<head>
<meta charset="UTF-8">
<meta name="keywords" content="HTML, CSS, JavaScript"/>
<meta name="descr iption" content="Free Web tutorials">
<!-- <meta name="viewport" content="wi dth=device-"width, ini"tial-scale=1.0" -->
<!-- <a href=""https://www.w3"schools.com/html"/"">"Visit our" HTML tu"torial"</a> -->
<meta name="author" content="John Doe">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, i nitial- sc/ale=1.0">
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph</p>
<a href="https://www.w3schools.com/html/">"Visit our HTML tutorial"</a>
<a href="https:/ /www.w3schools.com/ html/ ">"Visit our H TML tutorial"</a>
<form action="/action_page.php">
<label for="fname">"First name:"</label>
<input type="text" id="fname" name="fname"><br><br>
</form>
</body>
</html>
IMPORTANT :
- This search regex works properly with the v7.9.1 release and later, only
- In case you just want to find or mark the different occurrences, systematically move the caret to the very beginning of the current file, before searching / marking
- In case of replacement, do not use the Replace button ( step by step replacement )
Now, Robin, it’s up to you to find out other cases, that I have not considered yet ;-))
Best Regards
guy038
-
-
@guy038 Yes, it works. But you must also mention the <meta> tag in your regex, or else it will also delete other double quotes. For example, HTML files usually also contain some JavaScript, and the regex you made will delete some double quotes there that must not be deleted.
Try your regex on this script, and it will ruin it:
(?is)<!--.+-->(*SKIP)(*F)|(?:="|>"|(?!\A)\G)(?:[^<>"])*?\K"(?!\s*[,;<>])(?!\s*/>)(?!\s+[a-z][^<>="]+=\s*")
<script LANGUAGE="JavaScript">
function emailCheck() {
  if (document.Form.nume.value=="") {
    alert("Please enter your full name!");
    document.Form.nume.focus();
    return false
  }
  if(document.Form.nume.value.indexOf(' ', 0) == -1){
    alert("Please enter your full name!");
    document.Form.nume.focus();
    return false;
  }
  if (document.Form.email.value=="") {
    alert("Please enter your email!");
    document.Form.email.focus();
    return false;
  }
  if(document.Form.email.value.indexOf('@', 0) == -1){
    alert("Invalid email address!");
    document.Form.email.focus();
    return false;
  }
  if(document.Form.email.value.indexOf('.', 0) == -1){
    alert("Invalid email address!");
    document.Form.email.focus();
    return false;
  }
  if (document.Form.varsta.value=="") {
    alert("Please mention your age!");
    document.Form.varsta.focus();
    return false;
  }
}
</script>
-
Hi, @robin-cruise, @alan-kilborn and All,
I did consider the JavaScript nested within an HTML file with, for instance, this small part of text :
var d=window,e="length",h="",k="__duration__",l="function";function m(c){return document.getElementById(c)}
So, in the present regex, the double quotes located right before a comma or right before a semicolon are skipped because of the negative look-ahead (?!\s*[,;<>])
Now, with your JavaScript example, we must also consider the right syntax ("••••••••"). This means that we have to :
- Add the \(" alternative in the BSR region
- Change the negative look-ahead (?!\s*[,;<>]) to (?!\s*[,;<>)]) in the second part of the ESR region
In addition, I supposed that, in the BSR region, the = , > and ( characters may be separated from the double quote by some blank characters
All in all, this gives this new regex version :
SEARCH / MARK :
(?is)<!--.+-->(*SKIP)(*F)|(?:=\s*"|>\s*"|\(\s*"|(?!\A)\G)(?:[^<>"])*?\K"(?!\s*[,;<>)])(?!\s*/>)(?!\s+[a-z][^<>="]+=\s*")
REPLACE
Leave EMPTY
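To see what adding ) to the first look-ahead changes, the look-ahead part of the previous and of the new regex can be compared on one line of the script. This is a minimal Python sketch limited to the look-aheads (no BSR, \G or \K, which Python's standard re module does not support). The names OLD_TAIL, NEW_TAIL and flagged are hypothetical.

```python
import re

# Look-ahead part of the previous regex: a quote followed by ) is still reported.
OLD_TAIL = re.compile(r'"(?!\s*[,;<>])(?!\s*/>)(?!\s+[a-z][^<>="]+=\s*")', re.IGNORECASE)
# Look-ahead part of the new regex: ) is added to the first character class.
NEW_TAIL = re.compile(r'"(?!\s*[,;<>)])(?!\s*/>)(?!\s+[a-z][^<>="]+=\s*")', re.IGNORECASE)


def flagged(tail, text, pos):
    """Return True if the quote at index pos would be reported by this look-ahead set."""
    return tail.match(text, pos) is not None


line = 'if (document.Form.nume.value=="") {'
closing = line.rindex('"')  # the quote right before the closing parenthesis

print(flagged(OLD_TAIL, line, closing))
print(flagged(NEW_TAIL, line, closing))
```

With the previous look-aheads, the legal closing quote of =="" is reported, which is how the script gets damaged. With the new ones, it is left alone.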
Reminder : If you just want to find/mark the possible non-allowed " characters in an HTML file, remember to always place the caret at the very beginning of the file, before processing !!
Particularly, in script parts, the regex may wrongly match some double quotes which, in fact, are totally legal !
So, in order to simplify the problem, we could consider that all the JavaScript parts are beyond the scope of the current parsing attempt and treat them in the same way as comments, skipping any <script•••••>•••••••••</script> section !
BR
guy038
P.S. :
I have very basic notions of HTML and none of JavaScript ! So, I just consider the HTML text as pure text to be parsed with regular expressions. Sorry for this limitation !
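The idea of treating <script> sections like comments can be sketched in Python as follows. This is only an illustration under assumptions: the names PROTECTED and apply_outside_protected are hypothetical, the transform argument stands for whatever quote clean-up is applied to the rest of the file, and the sketch uses lazy quantifiers so that each comment or script is skipped individually.

```python
import re

# Zones that must be left untouched: HTML comments and <script ...>...</script> sections.
PROTECTED = re.compile(r'(<!--.*?-->|<script\b.*?</script>)', re.DOTALL | re.IGNORECASE)


def apply_outside_protected(html, transform):
    """Apply transform to the text located outside comments and script sections."""
    parts = PROTECTED.split(html)
    # With one capturing group, split() puts the protected zones at odd indexes.
    return ''.join(part if index % 2 else transform(part)
                   for index, part in enumerate(parts))


# str.upper is just a visible stand-in for a real clean-up function.
demo = 'a<!-- b -->c<script>d</script>e'
print(apply_outside_protected(demo, str.upper))
```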
-
Super answer, @guy038. Thanks!