Regex: Find and Delete duplicate apostrophe on a html tag



  • Hello. I have this HTML tag:

    <meta name="description" content="I love my mother" but I love my sister" more than I can say"/>

    As you can see, there are 4 apostrophes (double quotes) in the content. There should be only 2: one at the beginning, content=", and one at the end, "/>

    I must find all tags that contain other apostrophes besides those 2 in the content section.

    I made a regex, but it is not very good. Maybe you can help me:

    FIND: (?-s)(<meta name="description" content=")(*?\K.*"(?s))"/>
    REPLACE BY: \1\2



  • @Robin-Cruise

    At first glance this seems to be the typical “replace only within delimiters” problem.

    But applying the technique shown HERE doesn’t seem to work.
    Maybe @guy038 could comment on that and if possible amend the generic solution so that it does work?



  • Hello, @robin-cruise, @alan-kilborn and All,

    As @alan-kilborn said, we can use this generic regex :

    (?s-i:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?s-i:FR)

    Note that the negative look-ahead is tested at every position BEFORE the FR regex to search for !

    So, in the event that the FR zone ( " ) is located right before the ESR zone ( /> ), you must add a negative look-ahead (?!ESR) after FR, giving this general syntax :

    (?s-i:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?s-i:FR)(?!ESR)


    In our case, we have :

    • BSR, Beginning Search-region Regex, is the regex \x20content="

    • ESR, Ending Search-region Regex, is the regex />

    • FR, Find Regex, is the regex "

    • RR, Replacement Regex is the EMPTY string
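
    As a hedged illustration (not part of the original recipe), plugging these parts into the generic template can be sketched in Python. The function name build_region_regex and its guard_after_fr parameter are hypothetical. The function only assembles the pattern text to paste into the search field; Python's own re module does not support \G or \K, so the pattern is not run here.

```python
def build_region_regex(bsr: str, esr: str, fr: str, guard_after_fr: bool = True) -> str:
    """Fill the generic template with the BSR, ESR and FR sub-regexes.

    This only builds the pattern string; it does not compile or run it.
    """
    pattern = rf'(?s-i:{bsr}|(?!\A)\G)(?s-i:(?!{esr}).)*?\K(?s-i:{fr})'
    if guard_after_fr:
        # Extra look-ahead for the case where FR sits right before ESR
        pattern += rf'(?!{esr})'
    return pattern


# The three parts used in this thread
print(build_region_regex(r'\x20content="', '/>', '"'))
```

    Passing guard_after_fr=False gives the first form of the template, without the trailing look-ahead.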

    So, the real regex is (?s-i:\x20content="|(?!\A)\G)(?s-i:(?!/>).)*?\K(?s-i:")(?!/>). Now :

    • In the first non-capturing group, the s modifier is useless as no . exists in that group

    • In the second non-capturing group, the i modifier is useless as neither the string /> nor the dot . refers to a letter

    • In the third non-capturing group, the s and i modifiers are useless, and so is the non-capturing group itself. Indeed, we just search for a double quote " !

    Finally, our practical regex S/R can be simplified as :

    SEARCH (?-i:\x20content="|(?!\A)\G)(?s:(?!/>).)*?\K"(?!/>)

    REPLACE Leave EMPTY

    You may test this regex S/R against the sample text below :

    <meta name="description" content="I" love my mother" but I love my sister" more "/than I can sa"y"/>
    
    <meta name="descrip
    tion" content="I" love my 
    mother" but I love my sis
    ter" more "/than I ca
    n sa"
    y"/>
    
    <meta name="descrip
    tion" content=""I love my 
    mother" but I love my sis
    ter" more "/than I ca
    n sa"
    y""/>
    
    <meta name="description" content=""I love my mother" but I love my sister" more "/than I can sa"y""/>
    

    After a Replace All action, you should get the expected text :

    <meta name="description" content="I love my mother but I love my sister more /than I can say"/>
    
    <meta name="descrip
    tion" content="I love my 
    mother but I love my sis
    ter more /than I ca
    n sa
    y"/>
    
    <meta name="descrip
    tion" content="I love my 
    mother but I love my sis
    ter more /than I ca
    n sa
    y"/>
    
    <meta name="description" content="I love my mother but I love my sister more /than I can say"/>
    

    IMPORTANT :

    • In case you just want to find the different occurrences, systematically move the caret to the very beginning of the current file, before searching

    • In case of replacement, do not use the Replace button ( step by step replacement )

    • This regex S/R works fine, even in case of multi-line tags !
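
    For readers who want to try the same idea outside the editor, here is a minimal Python sketch. It is an approximation, not a port: Python's re module has no \G or \K, so the sketch swaps in a different technique, matching each whole region from ' content="' to the first '/>' and cleaning it in a callback. The names REGION and strip_inner_quotes are hypothetical.

```python
import re

# Region from ' content="' up to the first '/>' (DOTALL so tags may span lines)
REGION = re.compile(r'( content=")(.*?)(/>)', re.DOTALL)


def strip_inner_quotes(text):
    """Delete every double quote inside each region, except one sitting right before '/>'."""

    def fix(match):
        body = match.group(2)
        if body.endswith('"'):
            # Keep the closing quote that is immediately followed by '/>'
            body = body[:-1].replace('"', '') + '"'
        else:
            body = body.replace('"', '')
        return match.group(1) + body + match.group(3)

    return REGION.sub(fix, text)


sample = '<meta name="description" content="I" love my mother" but I love my sister" more "/than I can sa"y"/>'
print(strip_inner_quotes(sample))
```

    Unlike the editor regex, this sketch leaves the text untouched when no '/>' follows the opening ' content="'.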

    Best Regards,

    guy038



  • @guy038 said in Regex: Find and Delete duplicate apostrophe on a html tag:

    (?-i:\x20content="|(?!\A)\G)(?s:(?!/>).)*?\K"(?!/>)

    Hello, @guy038. Your regex is almost right, but you must consider one other case:

    <meta http-equiv="X-UA-Compatible" content="IE=edge">

    In this case, your regex deletes the last apostrophe, and I don’t want that. The first and the last apostrophe must stay in place. Also, if I update your regex to target one single meta tag, it does not delete all the apostrophes from the content section either:

    <meta name="description"(?-i:\x20content="|(?!\A)\G)(?s:(?!/>).)*?\K"(?!/>)
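
    One way to picture the restriction to a single tag is the following Python sketch. It is only an illustration under stated assumptions: the tag is written exactly as name="description" followed by content=", it is not split in the middle of those words, and the value ends at the first double quote followed by > or />. The names DESCRIPTION_TAG and clean_description are hypothetical.

```python
import re

# Only the description meta tag; the value ends at the first quote followed by '>' or '/>'
DESCRIPTION_TAG = re.compile(
    r'(<meta\s+name="description"\s+content=")(.*?)("\s*/?>)',
    re.IGNORECASE | re.DOTALL,
)


def clean_description(html):
    """Remove stray double quotes from the content value of the description meta tag only."""
    return DESCRIPTION_TAG.sub(
        lambda m: m.group(1) + m.group(2).replace('"', '') + m.group(3),
        html,
    )


page = (
    '<meta http-equiv="X-UA-Compatible" content="IE=edge">\n'
    '<meta name="description" content="I love my mother" but I love my sister" more than I can say"/>'
)
print(clean_description(page))
```

    The http-equiv tag is left alone because the pattern never starts matching there.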



  • @guy038 said in Regex: Find and Delete duplicate apostrophe on a html tag:

    So, in the event that the FR zone ( " ) is located right before the ESR zone ( /> ), you must add a negative look-ahead (?!ESR) after FR, giving this general syntax :
    (?s-i:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?s-i:FR)(?!ESR)

    I suppose that is the reason that the earlier version of the generic solution didn’t work when I tried it.





  • Hello, @robin-cruise, @alan-kilborn and All,

    Pheeew ! I worked very hard and I must have fired up the regular expression engine ;-))

    I managed to get a general regex that can match any double quote located in a non-allowed position of any attribute value in an HTML file ! Of course, this regex leaves the comment zones alone !

    Thus, the following S/R :

    SEARCH (?is)<!--.+-->(*SKIP)(*F)|(?:="|>"|(?!\A)\G)(?:[^<>"])*?\K"(?!\s*[,;<>])(?!\s*/>)(?!\s+[a-z][^<>="]+=\s*")

    REPLACE Leave EMPTY


    With the free-spacing mode, it can be expressed as below :

    (?xis)                      # FREE-SPACING mode, search INSENSITIVE to CASE and DOT matches ANY SINGLE char, EOL chars INCLUDED
      <!--.+-->(*SKIP)(*F)      #   Any matched COMMENT zone, MULTI-lines or NOT, is CANCELLED and WORKING location moves RIGHT AFTER that zone
    |                           # OR
      (?:                       #   START of 1st NON-CAPTURING group ( BSR )
          ="                    #       ="
        |                       #     OR
          >"                    #       >"
        |                       #     OR
          (?!\A)\G              #       END of PREVIOUS search, if current location is NOT at the VERY BEGINNING of file
      )                         #   END NON-CAPTURING 1st group
                                #
      (?:                       #   START of 2nd NON-CAPTURING group ( ESR )
        [^<>"]                  #     ANY char DIFFERENT from <  and  > and "
      )*?                       #   END NON-CAPTURING 2nd group, defining the SHORTEST range, possibly EMPTY, of chars as ABOVE 
                                #
      \K                        #   Match, so far, CANCELLED and location UPDATED
      "                         #   A NON allowed DOUBLE-QUOTE ( FR ) So the EXPECTED match
                                #
                                #   The regexes, inside the NEGATIVE look-aheads, below, define the CORRECT locations of a DOUBLE-QUOTE
                                #     so the NEGATIVE look-aheads restrict the MATCH of a DOUBLE-QUOTE to NON-ALLOWED locations, ONLY
                                #
      (?!\s*[,;<>])             #   If NOT FOLLOWED with possible BLANK chars FOLLOWED with a , or ; or < or > char
      (?!\s*/>)                 #   AND if NOT FOLLOWED with possible BLANK chars FOLLOWED with the /> string
      (?!\s+[a-z][^<>="]+=\s*") #   AND if NOT FOLLOWED with ( BLANK char(s), a LETTER, some chars DIFFERENT from < and > and = and "
                                #     FOLLOWED with a = char, possible BLANK chars and a " char ) Case of a " char FOLLOWED with an ATTRIBUTE
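
    The role of the three negative look-aheads can be checked in isolation with a small Python sketch. This is an illustration only: it tests what follows a given double quote, and does not reproduce the BSR, \G and \K parts of the search, which Python's re module does not support. The names ALLOWED_AFTER_QUOTE and quote_is_in_allowed_position are hypothetical.

```python
import re

# Contexts that may legally follow a double quote, copied from the three look-aheads
ALLOWED_AFTER_QUOTE = re.compile(
    r'''\s*[,;<>]                # possible blanks, then one of , ; < >
      | \s*/>                    # possible blanks, then />
      | \s+[a-z][^<>="]+=\s*"    # blanks, a letter, more chars, = and the next opening quote
    ''',
    re.IGNORECASE | re.VERBOSE,
)


def quote_is_in_allowed_position(text, index):
    """Return True if the double quote at text[index] is followed by an allowed context."""
    if text[index] != '"':
        raise ValueError('index must point at a double quote')
    return ALLOWED_AFTER_QUOTE.match(text, index + 1) is not None


tag = '<meta name="key"words" content="HTML">'
for i, ch in enumerate(tag):
    if ch == '"':
        print(i, quote_is_in_allowed_position(tag, i))
```

    Opening quotes (right after =) also report False here, because the look-aheads alone cannot tell them apart. In the search regex they are never candidates, since the BSR part consumes them before \K.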
    

    You can test it against this HTML example, below, containing :

    • A lot of " , in non-allowed locations, of course, which should be matched by the search regex

    • Probably, a non-regular HTML syntax. Anyway it’s just for testing !

    <html>
      <head>
        <meta charset="UTF-8">
        <meta name=""key"words" content="HTML, CSS, JavaScript""/>
        <meta name=""des"cr
    	ip"tion" content="Free Web tu"torials"">
        <!-- <meta name="viewport" content="wi
        dth=device-"width, ini"tial-scale=1.0" -->
        <!-- <a href=""https://www.w3"schools.com/html"/"">"Visit our" HTML tu"torial"</a>  -->
        <meta name="author" content="John Doe">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">
        <meta name="viewport" content=""width=de""vice-w"idth, i
    	nit"ial-" sc"/"ale=1.0"">
      </head>
      <body>
        <h1>My First Heading</h1>
        <p>My first paragraph</p>
        <a href=""https://www.w3"schools.com/html"/"">"Visit our" HTML tu"torial"</a>
        <a href=""https:/
    	/www.w3"schools.com/
    	html"/
    	"">"Visit our" H
    	TML tu"torial"</a>
        <form action="/action_pa"ge.php">
          <label for=""fna"me">"First "name:"</label>
          <input type="te"xt" id="fname" name=""fna"me"><br><br>
        </form> 
      </body>
    </html> 
    

    After replacement ( 36 occurrences ), we get :

    <html>
      <head>
        <meta charset="UTF-8">
        <meta name="keywords" content="HTML, CSS, JavaScript"/>
        <meta name="descr
    	iption" content="Free Web tutorials">
        <!-- <meta name="viewport" content="wi
        dth=device-"width, ini"tial-scale=1.0" -->
        <!-- <a href=""https://www.w3"schools.com/html"/"">"Visit our" HTML tu"torial"</a>  -->
        <meta name="author" content="John Doe">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">
        <meta name="viewport" content="width=device-width, i
    	nitial- sc/ale=1.0">
      </head>
      <body>
        <h1>My First Heading</h1>
        <p>My first paragraph</p>
        <a href="https://www.w3schools.com/html/">"Visit our HTML tutorial"</a>
        <a href="https:/
    	/www.w3schools.com/
    	html/
    	">"Visit our H
    	TML tutorial"</a>
        <form action="/action_page.php">
          <label for="fname">"First name:"</label>
          <input type="text" id="fname" name="fname"><br><br>
        </form> 
      </body>
    </html> 
    

    IMPORTANT :

    • This search regex works properly only with the v7.9.1 release and later

    • In case you just want to find or mark the different occurrences, systematically move the caret to the very beginning of the current file, before searching / marking

    • In case of replacement, do not use the Replace button ( step by step replacement )

    Now, Robin, it’s up to you to find out other cases, that I have not considered yet ;-))

    Best Regards

    guy038



  • @guy038 yes, it works. But you must also mention the <meta> tag in your regex, or else it will also delete other double quotes. For example, HTML files often contain some JavaScript too, and the regex you made will delete some double quotes there that must not be deleted.

    Try your regex on this script, and it will ruin it:

    (?is)<!--.+-->(*SKIP)(*F)|(?:="|>"|(?!\A)\G)(?:[^<>"])*?\K"(?!\s*[,;<>])(?!\s*/>)(?!\s+[a-z][^<>="]+=\s*")

    <script LANGUAGE="JavaScript">
    function emailCheck() {
    	if (document.Form.nume.value=="") {
    		alert("Please enter your full name!");
    		document.Form.nume.focus();
    		return false
    	}
    	
    	if(document.Form.nume.value.indexOf(' ', 0) == -1){
    		alert("Please enter your full name!");
    		document.Form.nume.focus();
    		return false;
    	}
    
    	if (document.Form.email.value=="") {
    		alert("Please enter your email!");
    		document.Form.email.focus();
    		return false;
    	}
    	if(document.Form.email.value.indexOf('@', 0) == -1){
    		alert("Invalid email address!");
    		document.Form.email.focus();
    		return false;
    	}
    	if(document.Form.email.value.indexOf('.', 0) == -1){
    		alert("Invalid email address!");
    		document.Form.email.focus();
    		return false;
    	}
    	if (document.Form.varsta.value=="") {
    		alert("Please mention your age!");
    		document.Form.varsta.focus();
    		return false;
    	}
    }
    </script>
    


  • Hi, @robin-cruise, @alan-kilborn and All,

    I did consider JavaScript nested within an HTML file, with, for instance, this small piece of text :

    var d=window,e="length",h="",k="__duration__",l="function";function m(c){return document.getElementById(c)}
    

    So, in the present regex, the locations of double-quote, right before a comma and right before a semicolon are skipped because of the negative look-ahead (?!\s*[,;<>])

    Now, with your JavaScript example, we must also consider the right syntax ("••••••••"). This means that we have to :

    • Add the \(" in the BSR region

    • Change the negative look-ahead (?!\s*[,;<>]) to (?!\s*[,;<>)]) in the second part of the ESR region

    In addition, I supposed that, in the BSR region, the =, > and ( characters may be separated from the double-quote with some blank characters

    All in all, this gives this new regex version :

    SEARCH / MARK :

    (?is)<!--.+-->(*SKIP)(*F)|(?:=\s*"|>\s*"|\(\s*"|(?!\A)\G)(?:[^<>"])*?\K"(?!\s*[,;<>)])(?!\s*/>)(?!\s+[a-z][^<>="]+=\s*")

    REPLACE Leave EMPTY


    Reminder : If you just want to find/mark the possible non-allowed ", in an HTML file, remember to always place the caret at the very beginning of file, before processing !!

    Particularly, in script parts, the regex may wrongly match some locations of double-quote, which, in fact, are totally legal !

    So, in order to simplify the problem, we could consider that all the JavaScript parts are outside the scope of the current parsing attempt and treat them in the same way as comments, skipping any <script•••••>•••••••••</script> section !
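
    A rough Python sketch of that "skip the protected zones" idea follows. It is an assumption-laden illustration: Python's re module has no (*SKIP)(*F), so the sketch swaps in the usual substitute, where the protected zones are matched first in a capturing group and put back unchanged. The names PROTECTED and sub_outside_protected are hypothetical, the replacement is treated as literal text, and the demo pattern (a double quote stuck between two word characters) is only an example, not the full search regex of this thread.

```python
import re

# Zones to leave untouched: comments and script sections (lazy, so each zone is matched separately)
PROTECTED = r'<!--.*?-->|<script\b.*?</script>'


def sub_outside_protected(pattern, repl, text):
    """Replace matches of pattern with the literal repl, except inside protected zones."""
    combined = re.compile(f'({PROTECTED})|(?:{pattern})', re.IGNORECASE | re.DOTALL)

    def keep_or_replace(match):
        if match.group(1) is not None:
            # A protected zone was matched: put it back unchanged
            return match.group(1)
        return repl

    return combined.sub(keep_or_replace, text)


html = """<script>var s = 'say"hi';</script><p>tu"torial</p><!-- a"b -->"""
print(sub_outside_protected(r'(?<=\w)"(?=\w)', '', html))
```

    Note that the IGNORECASE and DOTALL flags also apply to the pattern passed in, which is a simplification of this sketch.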

    BR

    guy038

    P.S. :

    I have very basic notions about HTML and none about Javascript ! So, I just consider the HTML text as pure text to be parsed with regular expressions. Sorry for this limitation !



  • super answer, @guy038 Thanks

