Regex: Find and Delete duplicate apostrophe on a html tag
-
Hello. I have this HTML tag:
<meta name="description" content="I love my mother" but I love my sister" more than I can say"/>
As you can see, there are 4 apostrophes (double quotes) in the content. There should be only 2: one at the beginning, content=" and one at the end, "/>
I must find all the tags that contain other apostrophes besides those 2 in the content section.
I made a regex, but it is not good enough. Maybe you can help me:
FIND:
(?-s)(<meta name="description" content=")(*?\K.*"(?s))"/>
REPLACE BY:
\1\2
-
Hello, @robin-cruise, @alan-kilborn and All,
As @alan-kilborn said, we can use this generic regex :
(?s-i:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?s-i:FR)
Note that the negative look-ahead is tested at every position BEFORE the FR regex to search for !
So, in the event that the FR zone ( " ) is located right before the ESR zone ( /> ), you must add a negative look-ahead (?!ESR) after FR, giving this general syntax :
(?s-i:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?s-i:FR)(?!ESR)
In our case, we have :
- BSR, Beginning Search-region Regex, is the regex \x20content="
- ESR, Ending Search-region Regex, is the regex />
- FR, Find Regex, is the regex "
- RR, Replacement Regex, is the EMPTY string
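As a rough illustration, filling the generic template with the three values listed above is plain string substitution. The sketch below uses Python only to assemble the pattern text. The helper name build_region_regex is made up for this example, and the resulting pattern is meant for the Notepad++ search box, not for Python's own re module, which does not support \K or \G.

```python
def build_region_regex(bsr: str, esr: str, fr: str) -> str:
    """Assemble the generic pattern text from its three parts (hypothetical helper)."""
    start = f"(?s-i:{bsr}|(?!\\A)\\G)"     # region start, or end of the previous match
    inside = f"(?s-i:(?!{esr}).)*?"        # shortest run of chars that never enters ESR
    target = f"\\K(?s-i:{fr})(?!{esr})"    # drop the match so far, then FR not followed by ESR
    return start + inside + target


# Values taken from the post: BSR = \x20content="   ESR = />   FR = "
pattern = build_region_regex(r'\x20content="', "/>", '"')
print(pattern)
```

The trailing (?!ESR) is included because, in this case, the closing double quote sits right before the /> that ends the region.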
So, the real regex is :
(?s-i:\x20content="|(?!\A)\G)(?s-i:(?!/>).)*?\K(?s-i:")(?!/>)
Now :
- In the first non-capturing group, the s modifier is useless, as there is no . in that group
- In the second non-capturing group, the i modifier is useless, as neither the string /> nor the dot . refers to a letter
- In the third non-capturing group, the s and i modifiers are useless, and so is the non-capturing group itself. Indeed, we just search for a double quote " !
Finally, our practical regex S/R can be simplified as :
SEARCH
(?-i:\x20content="|(?!\A)\G)(?s:(?!/>).)*?\K"(?!/>)
REPLACE
Leave EMPTY
You may test this regex S/R against the sample text below :
<meta name="description" content="I" love my mother" but I love my sister" more "/than I can sa"y"/>
<meta name="descrip tion" content="I" love my mother" but I love my sis ter" more "/than I ca n sa" y"/>
<meta name="descrip tion" content=""I love my mother" but I love my sis ter" more "/than I ca n sa" y""/>
<meta name="description" content=""I love my mother" but I love my sister" more "/than I can sa"y""/>
After a Replace All action, you should get the expected text :
<meta name="description" content="I love my mother but I love my sister more /than I can say"/>
<meta name="descrip tion" content="I love my mother but I love my sis ter more /than I ca n sa y"/>
<meta name="descrip tion" content="I love my mother but I love my sis ter more /than I ca n sa y"/>
<meta name="description" content="I love my mother but I love my sister more /than I can say"/>
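For readers who want to check the expected text outside Notepad++, here is a minimal Python sketch of the same clean-up. It is not a port of the S/R above: Python's re module has no \K or \G, so the sketch swaps in a different technique, capturing each region from content=" up to the first "/> and removing the inner double quotes in a callback. The names REGION and strip_inner_quotes are invented for the example.

```python
import re

# One region = ' content="' ... up to the first '"/>' (lazy; DOTALL lets it span lines).
REGION = re.compile(r'(\x20content=")(.*?)("/>)', re.DOTALL)


def strip_inner_quotes(text: str) -> str:
    """Remove every double quote between the opening and closing quote of a region."""
    def fix(match):
        opening, value, closing = match.groups()
        return opening + value.replace('"', "") + closing
    return REGION.sub(fix, text)


sample = ('<meta name="description" content=""I love my mother" '
          'but I love my sister" more "/than I can sa"y""/>')
print(strip_inner_quotes(sample))
```

Unlike the Notepad++ regex, this sketch only touches a region that really ends with "/>, so a tag closed by a plain > with no later "/> is left alone.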
IMPORTANT :
- In case you just want to find the different occurrences, always move the caret to the very beginning of the current file before searching
- In case of replacement, do not use the Replace button ( step by step replacement ); use Replace All
- This regex S/R works fine, even in case of multi-line tags !
Best Regards,
guy038
-
-
@guy038 said in Regex: Find and Delete duplicate apostrophe on a html tag:
(?-i:\x20content="|(?!\A)\G)(?s:(?!/>).)*?\K"(?!/>)
Hello, @guy038. Your regex is almost good, but you must consider one other case:
<meta http-equiv="X-UA-Compatible" content="IE=edge">
In this case, your regex will delete exactly the last apostrophe, and I don't want that. The first and the last apostrophe must stay in place. Also, if I change your regex to target one single meta tag, by putting <meta name="description" in front of it, it no longer deletes all the extra apostrophes from the content section:
<meta name="description"(?-i:\x20content="|(?!\A)\G)(?s:(?!/>).)*?\K"(?!/>)
-
@guy038 said in Regex: Find and Delete duplicate apostrophe on a html tag:
So, in the event that the FR zone ( " ) is located right before the ESR zone ( /> ), you must add a negative look-ahead (?!ESR) after FR, giving this general syntax :
(?s-i:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?s-i:FR)(?!ESR)
I suppose that is the reason the earlier version of the generic solution didn't work when I tried it.
-
Hello, @robin-cruise, @alan-kilborn and All,
Pheeew ! I worked very hard and I must have fired up the regular expression engine ;-))
I managed to get a general regex which is able to match any double-quote, in non-allowed locations of any attribute value, in an HTML file ! Of course, this regex leaves the comment zones out of the search !
Thus, the following S/R :
SEARCH
(?is)<!--.+-->(*SKIP)(*F)|(?:="|>"|(?!\A)\G)(?:[^<>"])*?\K"(?!\s*[,;<>])(?!\s*/>)(?!\s+[a-z][^<>="]+=\s*")
REPLACE
Leave EMPTY
With the free-spacing mode, it can be expressed as below :
(?xis)                       # FREE-SPACING mode, search INSENSITIVE to CASE and DOT matches ANY SINGLE char, EOL chars included
<!--.+-->(*SKIP)(*F)         # Any matched COMMENT zone, MULTI-lines or NOT, is CANCELLED and WORKING location moves RIGHT AFTER that zone
|                            # OR
(?:                          # START of 1st NON-CAPTURING group ( BSR )
    ="                       #     ="
  |                          #   OR
    >"                       #     >"
  |                          #   OR
    (?!\A)\G                 #     END of PREVIOUS search, if current location is NOT at the VERY BEGINNING of file
)                            # END NON-CAPTURING 1st group
#
(?:                          # START of 2nd NON-CAPTURING group ( ESR )
    [^<>"]                   #     ANY char DIFFERENT from < and > and "
)*?                          # END NON-CAPTURING 2nd group, defining the SHORTEST range, possibly EMPTY, of chars as ABOVE
#
\K                           # Match, so far, CANCELLED and location UPDATED
"                            # A NON allowed DOUBLE-QUOTE ( FR ) So the EXPECTED match
#
# The regexes, inside the NEGATIVE look-aheads, below, define the CORRECT locations of a DOUBLE-QUOTE
# so the NEGATIVE look-aheads restrict the MATCH of a DOUBLE-QUOTE to NON-ALLOWED locations, ONLY
#
(?!\s*[,;<>])                # If NOT FOLLOWED with possible BLANK chars FOLLOWED with a , or ; or < or > char
(?!\s*/>)                    # AND if NOT FOLLOWED with possible BLANK chars FOLLOWED with the /> string
(?!\s+[a-z][^<>="]+=\s*")    # AND if NOT FOLLOWED with ( BLANK char(s), a LETTER, some chars DIFFERENT from < and > and = and "
                             # FOLLOWED with a = char, possible BLANK chars and a " char ) Case of a " char FOLLOWED with an ATTRIBUTE
You can test it against this HTML example, below, containing :
- A lot of ", in non-allowed locations, of course, which should be matched by the search regex
- Probably, a non-regular HTML syntax. Anyway it's just for testing !
<html>
<head>
<meta charset="UTF-8">
<meta name=""key"words" content="HTML, CSS, JavaScript""/>
<meta name=""des"cr ip"tion" content="Free Web tu"torials"">
<!-- <meta name="viewport" content="wi dth=device-"width, ini"tial-scale=1.0" -->
<!-- <a href=""https://www.w3"schools.com/html"/"">"Visit our" HTML tu"torial"</a> -->
<meta name="author" content="John Doe">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content=""width=de""vice-w"idth, i nit"ial-" sc"/"ale=1.0"">
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph</p>
<a href=""https://www.w3"schools.com/html"/"">"Visit our" HTML tu"torial"</a>
<a href=""https:/ /www.w3"schools.com/ html"/ "">"Visit our" H TML tu"torial"</a>
<form action="/action_pa"ge.php">
<label for=""fna"me">"First "name:"</label>
<input type="te"xt" id="fname" name=""fna"me"><br><br>
</form>
</body>
</html>
After replacement ( 36 occurrences ), we get :
<html>
<head>
<meta charset="UTF-8">
<meta name="keywords" content="HTML, CSS, JavaScript"/>
<meta name="descr iption" content="Free Web tutorials">
<!-- <meta name="viewport" content="wi dth=device-"width, ini"tial-scale=1.0" -->
<!-- <a href=""https://www.w3"schools.com/html"/"">"Visit our" HTML tu"torial"</a> -->
<meta name="author" content="John Doe">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, i nitial- sc/ale=1.0">
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph</p>
<a href="https://www.w3schools.com/html/">"Visit our HTML tutorial"</a>
<a href="https:/ /www.w3schools.com/ html/ ">"Visit our H TML tutorial"</a>
<form action="/action_page.php">
<label for="fname">"First name:"</label>
<input type="text" id="fname" name="fname"><br><br>
</form>
</body>
</html>
IMPORTANT :
- This search regex works properly with the v7.9.1 release and later, only
- In case you just want to find or mark the different occurrences, always move the caret to the very beginning of the current file before searching / marking
- In case of replacement, do not use the Replace button ( step by step replacement )
Now, Robin, it's up to you to find other cases that I have not considered yet ;-))
Best Regards
guy038
-
-
@guy038 yes, it works. But you must also mention the <meta> tag in your regex, or else your regex will also delete other double quotes. For example, an HTML file may also contain some JavaScript, and the regex you made will delete some double quotes that must not be deleted.
Try your regex on this script; it will ruin it:
(?is)<!--.+-->(*SKIP)(*F)|(?:="|>"|(?!\A)\G)(?:[^<>"])*?\K"(?!\s*[,;<>])(?!\s*/>)(?!\s+[a-z][^<>="]+=\s*")
<script LANGUAGE="JavaScript">
function emailCheck() {
  if (document.Form.nume.value=="") {
    alert("Please enter your full name!");
    document.Form.nume.focus();
    return false
  }
  if(document.Form.nume.value.indexOf(' ', 0) == -1){
    alert("Please enter your full name!");
    document.Form.nume.focus();
    return false;
  }
  if (document.Form.email.value=="") {
    alert("Please enter your email!");
    document.Form.email.focus();
    return false;
  }
  if(document.Form.email.value.indexOf('@', 0) == -1){
    alert("Invalid email address!");
    document.Form.email.focus();
    return false;
  }
  if(document.Form.email.value.indexOf('.', 0) == -1){
    alert("Invalid email address!");
    document.Form.email.focus();
    return false;
  }
  if (document.Form.varsta.value=="") {
    alert("Please mention your age!");
    document.Form.varsta.focus();
    return false;
  }
}
</script>
-
Hi, @robin-cruise, @alan-kilborn and All,
I did consider the nested JavaScript parts, within an HTML file, with, for instance, this small piece of text :
var d=window,e="length",h="",k="__duration__",l="function";function m(c){return document.getElementById(c)}
So, in the present regex, the locations of a double-quote right before a comma and right before a semicolon are skipped because of the negative look-ahead (?!\s*[,;<>])
Now, with your JavaScript example, we must also consider the valid syntax ("••••••••"). This means that we have to :
- Add the \(" alternative to the BSR region
- Change the negative look-ahead (?!\s*[,;<>]) to (?!\s*[,;<>)]) in the second part of the ESR region
In addition, I supposed that, in the BSR region, the =, > and ( characters may be separated from the double-quote by some blank characters
All in all, this gives this new regex version :
SEARCH / MARK :
(?is)<!--.+-->(*SKIP)(*F)|(?:=\s*"|>\s*"|\(\s*"|(?!\A)\G)(?:[^<>"])*?\K"(?!\s*[,;<>)])(?!\s*/>)(?!\s+[a-z][^<>="]+=\s*")
REPLACE
Leave EMPTY
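The effect of adding ) to the first negative look-ahead can be checked in isolation. The sketch below is Python and tests only the quote-plus-look-ahead fragment, not the full regex, which Python's re module could not run because of \K, \G and (*SKIP)(*F). The names OLD_CHECK, NEW_CHECK and flagged_positions are invented for the example.

```python
import re

OLD_CHECK = re.compile(r'"(?!\s*[,;<>])')     # look-ahead before the change
NEW_CHECK = re.compile(r'"(?!\s*[,;<>)])')    # look-ahead with ) added

line = 'alert("Please enter your full name!");'


def flagged_positions(pattern, text):
    """Offsets of the double quotes that the look-ahead does NOT treat as allowed."""
    return [m.start() for m in pattern.finditer(text)]


old_hits = flagged_positions(OLD_CHECK, line)
new_hits = flagged_positions(NEW_CHECK, line)
print(old_hits, new_hits)
```

With the old look-ahead both quotes are flagged; with the new one the closing quote before ) is no longer flagged. The opening quote is still flagged by this fragment alone, which is presumably why the \(" alternative is also added to the BSR region, so that this quote is consumed as a region start.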
Reminder : If you just want to find/mark the possible non-allowed ", in an HTML file, remember to always place the caret at the very beginning of the file, before processing !!
Particularly, in script parts, the regex may wrongly match some locations of double-quote, which, in fact, are totally legal !
So, in order to simplify the problem, we could consider that all the JavaScript parts are beyond the current parsing attempt and treat them in the same way as comments, skipping any <script•••••>•••••••••</script> section !
BR
guy038
P.S. :
I have very basic notions about HTML and none about JavaScript ! So, I just consider the HTML text as pure text to be parsed with regular expressions. Sorry for this limitation !
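The suggestion to treat <script> sections like comments can be sketched in Python. This is an illustration under stated assumptions, not a port of the regex above: Python's re module has no (*SKIP)(*F), so the sketch swaps in an alternation whose first branch matches the zones to skip and a callback that returns them unchanged. To stay short, the target branch only handles the content=" attribute; the names PATTERN and strip_inner_quotes are invented.

```python
import re

# The zones to leave untouched come first in the alternation, so they win
# whenever the scan reaches their start; the last branch is the real target.
PATTERN = re.compile(
    r'(?P<skip><!--.*?-->|<script\b.*?</script>)'
    r'|(?P<open>\scontent=")(?P<value>[^<>]*?)(?P<close>"\s*/?>)',
    re.IGNORECASE | re.DOTALL,
)


def strip_inner_quotes(html: str) -> str:
    """Drop double quotes inside content="..." values, outside comments and scripts."""
    def fix(m):
        if m.group("skip") is not None:
            return m.group(0)                  # comment or script zone: keep as is
        return m.group("open") + m.group("value").replace('"', "") + m.group("close")
    return PATTERN.sub(fix, html)


sample = (
    '<!-- <meta name="viewport" content="wi"dth"> -->\n'
    '<meta name="viewport" content=""width=de""vice-w"idth"">\n'
    '<script>alert("Please enter your full name!");</script>\n'
)
print(strip_inner_quotes(sample))
```

The sketch uses the lazy .*? so that each comment or script zone ends at its first closing marker and two separate zones are never merged into one.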
-
Super answer, @guy038. Thanks!