    Regex: Find and Delete duplicate apostrophe on a html tag

      Robin Cruise

      Hello. I have this HTML tag:

      <meta name="description" content="I love my mother" but I love my sister" more than I can say"/>

      As you can see, the content attribute contains 4 double quotes (what I call apostrophes). There should be only 2: one at the beginning, content=", and one at the end, "/>

      I need to find every tag whose content section contains double quotes other than those 2.

      I wrote a regex, but it does not work well. Maybe you can help me:

      FIND: (?-s)(<meta name="description" content=")(*?\K.*"(?s))"/>
      REPLACE BY: \1\2

        Alan Kilborn @Robin Cruise

        @Robin-Cruise

        At first glance this seems to be the typical “replace only within delimiters” problem.

        But applying the technique shown HERE doesn’t seem to work.
        Maybe @guy038 could comment on that and if possible amend the generic solution so that it does work?

          guy038

          Hello, @robin-cruise, @alan-kilborn and All,

          As @alan-kilborn said, we can use this generic regex :

          (?s-i:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?s-i:FR)

          Note that the negative look-ahead is tested at every position before the place where the FR regex is searched for !

          So, in the event that the FR zone ( " ) is located right before the ESR zone ( /> ), you must add a negative look-ahead (?!ESR) after FR, giving this general syntax :

          (?s-i:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?s-i:FR)(?!ESR)


          In our case, we have :

          • BSR, Beginning Search-region Regex, is the regex \x20content="

          • ESR, Ending Search-region Regex, is the regex />

          • FR, Find Regex, is the regex "

          • RR, Replacement Regex is the EMPTY string
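As a minimal sketch of this substitution step, the Python snippet below fills the generic template with the pieces listed above. The helper name build_region_regex is hypothetical and not part of the original post. Python is used only to assemble the pattern string: its standard re module supports neither \G nor \K, so the result is meant to be pasted into the editor's search field, not compiled in Python.

```python
def build_region_regex(bsr: str, esr: str, fr: str) -> str:
    """Fill the generic 'replace only inside a region' template.

    bsr, esr and fr are regex fragments (not literal text); they are
    inserted verbatim, so any escaping is the caller's responsibility.
    """
    return (
        rf"(?s-i:{bsr}|(?!\A)\G)"   # start of a region, or end of the previous match
        rf"(?s-i:(?!{esr}).)*?"     # shortest run of chars that do not begin ESR
        rf"\K(?s-i:{fr})"           # forget what was matched so far, then match FR
        rf"(?!{esr})"               # FR must not sit right before ESR
    )


pattern = build_region_regex(bsr=r'\x20content="', esr="/>", fr='"')
print(pattern)
```

The printed string is the unsimplified form of the regex for this case, before any useless modifiers are trimmed.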

          So, the real regex is (?s-i:\x20content="|(?!\A)\G)(?s-i:(?!/>).)*?\K(?s-i:")(?!/>). Now :

          • In the first non-capturing group, the s modifier is useless as no . exists in that group

          • In the second non-capturing group, the i modifier is useless, as neither the string /> nor the dot . refers to a letter

          • In the third non-capturing group, the s and i modifiers are useless, and so is the non-capturing group itself. Indeed, we just search for a double quote " !

          Finally, our practical regex S/R can be simplified as :

          SEARCH (?-i:\x20content="|(?!\A)\G)(?s:(?!/>).)*?\K"(?!/>)

          REPLACE Leave EMPTY

          You may test this regex S/R against the sample text below :

          <meta name="description" content="I" love my mother" but I love my sister" more "/than I can sa"y"/>
          
          <meta name="descrip
          tion" content="I" love my 
          mother" but I love my sis
          ter" more "/than I ca
          n sa"
          y"/>
          
          <meta name="descrip
          tion" content=""I love my 
          mother" but I love my sis
          ter" more "/than I ca
          n sa"
          y""/>
          
          <meta name="description" content=""I love my mother" but I love my sister" more "/than I can sa"y""/>
          

          After a Replace All action, you should get the expected text :

          <meta name="description" content="I love my mother but I love my sister more /than I can say"/>
          
          <meta name="descrip
          tion" content="I love my 
          mother but I love my sis
          ter more /than I ca
          n sa
          y"/>
          
          <meta name="descrip
          tion" content="I love my 
          mother but I love my sis
          ter more /than I ca
          n sa
          y"/>
          
          <meta name="description" content="I love my mother but I love my sister more /than I can say"/>
          

          IMPORTANT :

          • In case you just want to find the different occurrences, always move the caret to the very beginning of the current file before searching

          • In case of replacement, do not use the Replace button ( step by step replacement ); use Replace All instead

          • This regex S/R works fine, even in case of multi-line tags !
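For readers who want to reproduce the result outside the editor, here is a rough Python approximation. It is not the \G / \K technique itself (Python's standard re module supports neither), but a swapped-in approach: match each region from content=" up to the first /> and clean it in a callback. The function name strip_inner_quotes is hypothetical, and edge cases (for example a second content=" inside the same region) may behave differently from the regex S/R above.

```python
import re

# Region: from ' content="' up to (not including) the first '/>' or the end of text.
REGION = re.compile(r'(\x20content=")((?:(?!/>).)*)', re.S)


def strip_inner_quotes(text: str) -> str:
    """Delete every double quote inside a region, except one sitting right before '/>'."""

    def fix(m: re.Match) -> str:
        opener, body = m.group(1), m.group(2)
        # Keep the final quote only when it is immediately followed by '/>'.
        keep_last = body.endswith('"') and m.string.startswith("/>", m.end())
        inner = body[:-1] if keep_last else body
        return opener + inner.replace('"', "") + ('"' if keep_last else "")

    return REGION.sub(fix, text)


sample = ('<meta name="description" content=""I love my mother" but I love '
          'my sister" more "/than I can sa"y""/>')
print(strip_inner_quotes(sample))
```

On the single-line samples above this prints the same cleaned tag as the Replace All action. The re.S flag lets the region span line breaks, so the multi-line samples are handled the same way.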

          Best Regards,

          guy038

            Robin Cruise

            @guy038 said in Regex: Find and Delete duplicate apostrophe on a html tag:

            (?-i:\x20content="|(?!\A)\G)(?s:(?!/>).)*?\K"(?!/>)

            Hello @guy038, your regex is almost right, but you must consider one other case:

            <meta http-equiv="X-UA-Compatible" content="IE=edge">

            In this case, your regex deletes the last double quote, and I do not want that. The first and the last double quotes must stay in place. Also, if I restrict your regex to a single meta tag by prefixing it with the tag name, it no longer deletes all the extra double quotes from the content section:

            <meta name="description"(?-i:\x20content="|(?!\A)\G)(?s:(?!/>).)*?\K"(?!/>)
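A small check of why the closing quote is hit in the X-UA-Compatible example: that tag ends with "> instead of "/>, so the look-ahead (?!/>) does not protect its last double quote. The sketch below tests only that look-ahead with Python's re module; the variable names are illustrative.

```python
import re

tag = '<meta http-equiv="X-UA-Compatible" content="IE=edge">'
closing = tag.rindex('"')                  # position of the final double quote

not_before_esr = re.compile(r'"(?!/>)')    # FR followed by the (?!ESR) look-ahead
hit = not_before_esr.match(tag, closing) is not None
print(hit)
```

The check reports a match, meaning the final quote counts as one to delete. For a tag that ends with "/> the same check reports no match.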

              Alan Kilborn

              @guy038 said in Regex: Find and Delete duplicate apostrophe on a html tag:

              So, in the event that the FR zone ( " ) is located right before the ESR zone ( /> ), you must add a negative look-ahead (?!ESR) after FR, giving this general syntax :
              (?s-i:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?s-i:FR)(?!ESR)

              I suppose that is the reason that the earlier version of the generic solution didn’t work when I tried it.

                  guy038

                  Hello, @robin-cruise, @alan-kilborn and All,

                  Pheeew ! I worked very hard and I must have fired up the regular expression engine ;-))

                  I managed to build a general regex which is able to match any double quote located in a non-allowed position of any attribute value, in an HTML file ! Of course, this regex leaves the comment zones alone : they are skipped !

                  Thus, the following S/R :

                  SEARCH (?is)<!--.+-->(*SKIP)(*F)|(?:="|>"|(?!\A)\G)(?:[^<>"])*?\K"(?!\s*[,;<>])(?!\s*/>)(?!\s+[a-z][^<>="]+=\s*")

                  REPLACE Leave EMPTY
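One property of the comment-skipping part is worth checking before relying on it. Under (?s), the .+ in <!--.+--> is greedy, so it runs from the first <!-- to the last --> in the file, and anything lying between two separate comments is skipped as well. The test sample in this post is not affected, because its two comments are adjacent. A lazy .+? is one possible adjustment; it is a suggestion here, not something the post itself uses. The sketch below shows the difference with Python's re module, whose greedy and lazy quantifiers behave the same way for this pattern.

```python
import re

html = '<!-- one --><meta name=""x""><!-- two -->'

greedy = re.findall(r'<!--.+-->', html, flags=re.S)    # one match spanning both comments
lazy = re.findall(r'<!--.+?-->', html, flags=re.S)     # one match per comment

print(greedy)
print(lazy)
```

With the greedy form, the meta tag between the two comments is part of the match and would therefore be skipped along with them.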


                  With the free-spacing mode, it can be expressed as below :

                  (?xis)                      # FREE-SPACING mode, search INSENSITIVE to CASE and DOT matches ANY char, EOL chars INCLUDED
                    <!--.+-->(*SKIP)(*F)      #   Any matched COMMENT zone, MULTI-lines or NOT, is CANCELLED and WORKING location moves RIGHT AFTER that zone
                  |                           # OR
                    (?:                       #   START of 1st NON-CAPTURING group ( BSR )
                        ="                    #       ="
                      |                       #     OR
                        >"                    #       >"
                      |                       #     OR
                        (?!\A)\G              #       END of PREVIOUS search, if current location is NOT at the VERY BEGINNING of file
                    )                         #   END NON-CAPTURING 1st group
                                              #
                    (?:                       #   START of 2nd NON-CAPTURING group ( ESR )
                      [^<>"]                  #     ANY char DIFFERENT from <  and  > and "
                    )*?                       #   END NON-CAPTURING 2nd group, defining the SHORTEST range, possibly EMPTY, of chars as ABOVE 
                                              #
                    \K                        #   Match, so far, CANCELLED and location UPDATED
                    "                         #   A NON allowed DOUBLE-QUOTE ( FR ) So the EXPECTED match
                                              #
                                              #   The regexes, inside the NEGATIVE look-aheads, below, define the CORRECT locations of a DOUBLE-QUOTE
                                              #     so the NEGATIVE look-aheads restrict the MATCH of a DOUBLE-QUOTE to NON-ALLOWED locations, ONLY
                                              #
                    (?!\s*[,;<>])             #   If NOT FOLLOWED with possible BLANK chars FOLLOWED with a , or ; or < or > char
                    (?!\s*/>)                 #   AND if NOT FOLLOWED with possible BLANK chars FOLLOWED with the /> string
                    (?!\s+[a-z][^<>="]+=\s*") #   AND if NOT FOLLOWED with ( BLANK char(s), a LETTER, some chars DIFFERENT from < and > and = and "
                                              #     FOLLOWED with a = char, possible BLANK chars and a " char ) Case of a " char FOLLOWED with an ATTRIBUTE
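To make the role of the three negative look-aheads concrete, here is a small Python sketch that tests them in isolation. It takes the text that follows a double quote and reports whether that quote is in an allowed location. The name is_legal_position is hypothetical, and the sketch covers only the look-ahead part: the BSR alternatives, \G, \K and the comment skipping are left out.

```python
import re

# The three "legal follower" patterns, taken from inside the negative look-aheads.
LEGAL_FOLLOWERS = [
    re.compile(r'\s*[,;<>]'),                     # blanks, then , or ; or < or >
    re.compile(r'\s*/>'),                         # blanks, then />
    re.compile(r'\s+[a-z][^<>="]+=\s*"', re.I),   # blanks, then the next attribute="
]


def is_legal_position(following: str) -> bool:
    """True when the text right after a double quote marks that quote as allowed."""
    return any(p.match(following) for p in LEGAL_FOLLOWERS)


for following in ['>', '/>', ' content="HTML', 'words" content=']:
    print(repr(following), is_legal_position(following))
```

The first three followers are legal (end of tag, end of self-closing tag, start of the next attribute). The last one, taken from name=""key"words", is not, so the quote before it is one the search regex should match.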
                  

                  You can test it against this HTML example, below, containing :

                  • A lot of " , in non-allowed locations, of course, which should be matched by the search regex

                  • Probably, a non-regular HTML syntax. Anyway it’s just for testing !

                  <html>
                    <head>
                      <meta charset="UTF-8">
                      <meta name=""key"words" content="HTML, CSS, JavaScript""/>
                      <meta name=""des"cr
                  	ip"tion" content="Free Web tu"torials"">
                      <!-- <meta name="viewport" content="wi
                      dth=device-"width, ini"tial-scale=1.0" -->
                      <!-- <a href=""https://www.w3"schools.com/html"/"">"Visit our" HTML tu"torial"</a>  -->
                      <meta name="author" content="John Doe">
                      <meta http-equiv="X-UA-Compatible" content="IE=edge">
                      <meta name="viewport" content=""width=de""vice-w"idth, i
                  	nit"ial-" sc"/"ale=1.0"">
                    </head>
                    <body>
                      <h1>My First Heading</h1>
                      <p>My first paragraph</p>
                      <a href=""https://www.w3"schools.com/html"/"">"Visit our" HTML tu"torial"</a>
                      <a href=""https:/
                  	/www.w3"schools.com/
                  	html"/
                  	"">"Visit our" H
                  	TML tu"torial"</a>
                      <form action="/action_pa"ge.php">
                        <label for=""fna"me">"First "name:"</label>
                        <input type="te"xt" id="fname" name=""fna"me"><br><br>
                      </form> 
                    </body>
                  </html> 
                  

                  After replacement ( 36 occurrences ), we get :

                  <html>
                    <head>
                      <meta charset="UTF-8">
                      <meta name="keywords" content="HTML, CSS, JavaScript"/>
                      <meta name="descr
                  	iption" content="Free Web tutorials">
                      <!-- <meta name="viewport" content="wi
                      dth=device-"width, ini"tial-scale=1.0" -->
                      <!-- <a href=""https://www.w3"schools.com/html"/"">"Visit our" HTML tu"torial"</a>  -->
                      <meta name="author" content="John Doe">
                      <meta http-equiv="X-UA-Compatible" content="IE=edge">
                      <meta name="viewport" content="width=device-width, i
                  	nitial- sc/ale=1.0">
                    </head>
                    <body>
                      <h1>My First Heading</h1>
                      <p>My first paragraph</p>
                      <a href="https://www.w3schools.com/html/">"Visit our HTML tutorial"</a>
                      <a href="https:/
                  	/www.w3schools.com/
                  	html/
                  	">"Visit our H
                  	TML tutorial"</a>
                      <form action="/action_page.php">
                        <label for="fname">"First name:"</label>
                        <input type="text" id="fname" name="fname"><br><br>
                      </form> 
                    </body>
                  </html> 
                  

                  IMPORTANT :

                  • This search regex works properly only with the v7.9.1 release and later

                  • In case you just want to find or mark the different occurrences, always move the caret to the very beginning of the current file before searching / marking

                  • In case of replacement, do not use the Replace button ( step by step replacement ); use Replace All instead

                  Now, Robin, it’s up to you to find other cases that I have not considered yet ;-))

                  Best Regards

                  guy038

                    Robin Cruise

                    @guy038 Yes, it works. But you must also anchor your regex to the <meta> tag, or else it will delete other double quotes as well. For example, HTML files often contain some JavaScript too, and your regex will delete some double quotes there that must not be deleted.

                    Try your regex on this script; it will ruin it:

                    (?is)<!--.+-->(*SKIP)(*F)|(?:="|>"|(?!\A)\G)(?:[^<>"])*?\K"(?!\s*[,;<>])(?!\s*/>)(?!\s+[a-z][^<>="]+=\s*")

                    <script LANGUAGE="JavaScript">
                    function emailCheck() {
                    	if (document.Form.nume.value=="") {
                    		alert("Please enter your full name!");
                    		document.Form.nume.focus();
                    		return false
                    	}
                    	
                    	if(document.Form.nume.value.indexOf(' ', 0) == -1){
                    		alert("Please enter your full name!");
                    		document.Form.nume.focus();
                    		return false;
                    	}
                    
                    	if (document.Form.email.value=="") {
                    		alert("Please enter your email!");
                    		document.Form.email.focus();
                    		return false;
                    	}
                    	if(document.Form.email.value.indexOf('@', 0) == -1){
                    		alert("Invalid email address!");
                    		document.Form.email.focus();
                    		return false;
                    	}
                    	if(document.Form.email.value.indexOf('.', 0) == -1){
                    		alert("Invalid email address!");
                    		document.Form.email.focus();
                    		return false;
                    	}
                    	if (document.Form.varsta.value=="") {
                    		alert("Please mention your age!");
                    		document.Form.varsta.focus();
                    		return false;
                    	}
                    }
                    </script>
                    
                      guy038

                      Hi, @robin-cruise, @alan-kilborn and All,

                      I did consider JavaScript embedded in an HTML file, with, for instance, this small piece of text :

                      var d=window,e="length",h="",k="__duration__",l="function";function m(c){return document.getElementById(c)}
                      

                      So, in the present regex, the locations of double-quote, right before a comma and right before a semicolon are skipped because of the negative look-ahead (?!\s*[,;<>])

                      Now, with your JavaScript example, we must also consider the valid syntax ("••••••••"). This means that we have to :

                      • Add \(" to the BSR region

                      • Change the negative look-ahead (?!\s*[,;<>]) to (?!\s*[,;<>)]) in the second part of the ESR region

                      In addition, I assumed that, in the BSR region, the =, > and ( characters may be separated from the double quote by some blank characters

                      All in all, this gives this new regex version :

                      SEARCH / MARK :

                      (?is)<!--.+-->(*SKIP)(*F)|(?:=\s*"|>\s*"|\(\s*"|(?!\A)\G)(?:[^<>"])*?\K"(?!\s*[,;<>)])(?!\s*/>)(?!\s+[a-z][^<>="]+=\s*")

                      REPLACE Leave EMPTY
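The effect of adding ) to the follower class can be checked on one line of the script posted earlier in the thread. The sketch below exercises only the first negative look-ahead, in its previous and its new form, with Python's re module. The variable names are illustrative, and the rest of the regex (BSR alternatives, \G, \K) is left out.

```python
import re

line = 'if (document.Form.nume.value=="") {'
closing = line.rindex('"')                     # second quote of "", right before ')'

old_check = re.compile(r'"(?!\s*[,;<>])')      # follower class of the previous version
new_check = re.compile(r'"(?!\s*[,;<>)])')     # follower class with ')' added

matched_by_old = old_check.match(line, closing) is not None
matched_by_new = new_check.match(line, closing) is not None
print(matched_by_old, matched_by_new)
```

The previous form treats the closing quote of "" as a quote to delete, while the new form leaves it alone.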


                      Reminder : If you just want to find/mark the possible non-allowed ", in an HTML file, remember to always place the caret at the very beginning of the file before processing !!

                      In script parts particularly, the regex may wrongly match some double quotes which are, in fact, totally legal !

                      So, in order to simplify the problem, we could consider that all the JavaScript parts are out of scope for the current parsing attempt and treat them in the same way as comments, skipping any <script•••••>•••••••••</script> section !
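The idea of treating script sections like comments can be sketched in Python as follows. This is an assumption about how such skipping could be done outside the editor, not the regex from this thread: the text is split on comment and script sections, and a caller-supplied fix is applied only to the pieces in between. The names SKIPPED and apply_outside_skipped are hypothetical.

```python
import re
from typing import Callable

# Sections to leave untouched: HTML comments and <script ...>...</script> blocks.
SKIPPED = re.compile(r'(<!--.*?-->|<script\b.*?</script>)', re.S | re.I)


def apply_outside_skipped(html: str, fix: Callable[[str], str]) -> str:
    """Apply fix() to every stretch of text lying outside the skipped sections."""
    parts = SKIPPED.split(html)
    # With a single capturing group, split() alternates: outside, skipped, outside, ...
    return "".join(fix(p) if i % 2 == 0 else p for i, p in enumerate(parts))


demo = 'a"b<script>x = "1";</script>c"d'
print(apply_outside_skipped(demo, lambda s: s.replace('"', "")))
```

Here the toy fix deletes every double quote, and the quotes inside the script block survive. In practice fix would be whatever quote-cleaning routine is applied to the markup itself.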

                      BR

                      guy038

                      P.S. :

                      I have very basic notions of HTML and none of JavaScript ! So, I just consider the HTML text as pure text to be parsed with regular expressions. Sorry for this limitation !

                        Robin Cruise

                        Super answer, @guy038. Thanks!

                        The Community of users of the Notepad++ text editor.