Regex: Find and Delete duplicate apostrophe on a html tag

  • R
    Robin Cruise
    last edited by Robin Cruise Apr 23, 2021, 7:55 AM Apr 23, 2021, 7:53 AM

    hello. I have this html tag:

    <meta name="description" content="I love my mother" but I love my sister" more than I can say"/>

    As you can see, there are 4 double quotes in the content value. There should be only 2: the opening one in content=" and the closing one in "/>

    I need to find all tags that contain any double quotes in the content section other than those 2.

    I wrote a regex, but it does not work well. Maybe you can help me:

    FIND: (?-s)(<meta name="description" content=")(*?\K.*"(?s))"/>
    REPLACE BY: \1\2

    • A
      Alan Kilborn @Robin Cruise
      last edited by Apr 23, 2021, 12:08 PM

      @Robin-Cruise

      At first glance this seems to be the typical “replace only within delimiters” problem.

      But applying the technique shown HERE doesn’t seem to work.
      Maybe @guy038 could comment on that and if possible amend the generic solution so that it does work?

      • G
        guy038
        last edited by Apr 24, 2021, 1:52 AM

        Hello, @robin-cruise, @alan-kilborn and All,

        As @alan-kilborn said, we can use this generic regex :

        (?s-i:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?s-i:FR)

        Note that the negative look-ahead (?!ESR) is tested at every position BEFORE the FR regex to search for !

        So, in the event that the FR zone ( " ) is located right before the ESR zone ( /> ), you must add a negative look-ahead (?!ESR) after FR, giving this general syntax :

        (?s-i:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?s-i:FR)(?!ESR)


        In our case, we have :

        • BSR, Beginning Search-region Regex, is the regex \x20content="

        • ESR, Ending Search-region Regex, is the regex />

        • FR, Find Regex, is the regex "

        • RR, Replacement Regex is the EMPTY string

        So, the real regex is (?s-i:\x20content="|(?!\A)\G)(?s-i:(?!/>).)*?\K(?s-i:")(?!/>). Now :

        • In the first non-capturing group, the s modifier is useless as no . exists in that group

        • In the second non-capturing group, the i modifier is useless as neither the string /> nor the dot . refers to a letter

        • In the third non-capturing group, the s and i modifiers are useless, as is the non-capturing group itself. Indeed, we just search for a double quote " !
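The substitution of BSR, ESR and FR into the generic template can be sketched in a few lines of Python. This is only an illustration: the helper name `build_region_regex` is hypothetical, and the function merely assembles the pattern text. Python's standard `re` module does not support `\K` or `\G`, so the resulting string is meant to be pasted into the Notepad++ Find what field, not compiled in Python.

```python
def build_region_regex(bsr: str, esr: str, fr: str) -> str:
    """Fill the generic 'replace only within a region' template.

    bsr: Beginning Search-region Regex
    esr: Ending Search-region Regex
    fr:  Find Regex (what to match inside the region)

    The trailing (?!ESR) is the extra look-ahead needed when FR may sit
    right before ESR, as described in the post above.
    """
    return rf'(?s-i:{bsr}|(?!\A)\G)(?s-i:(?!{esr}).)*?\K(?s-i:{fr})(?!{esr})'


# Values used in this thread: BSR = \x20content="   ESR = />   FR = "
pattern = build_region_regex(r'\x20content="', '/>', '"')
print(pattern)
```

With the thread's values, the printed pattern is the unsimplified "real regex" quoted above; the simplification of the modifiers is then done by hand.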

        Finally, our practical regex S/R can be simplified as :

        SEARCH (?-i:\x20content="|(?!\A)\G)(?s:(?!/>).)*?\K"(?!/>)

        REPLACE Leave EMPTY

        You may test this regex S/R against the sample text below :

        <meta name="description" content="I" love my mother" but I love my sister" more "/than I can sa"y"/>
        
        <meta name="descrip
        tion" content="I" love my 
        mother" but I love my sis
        ter" more "/than I ca
        n sa"
        y"/>
        
        <meta name="descrip
        tion" content=""I love my 
        mother" but I love my sis
        ter" more "/than I ca
        n sa"
        y""/>
        
        <meta name="description" content=""I love my mother" but I love my sister" more "/than I can sa"y""/>
        

        After a Replace All action, you should get the expected text :

        <meta name="description" content="I love my mother but I love my sister more /than I can say"/>
        
        <meta name="descrip
        tion" content="I love my 
        mother but I love my sis
        ter more /than I ca
        n sa
        y"/>
        
        <meta name="descrip
        tion" content="I love my 
        mother but I love my sis
        ter more /than I ca
        n sa
        y"/>
        
        <meta name="description" content="I love my mother but I love my sister more /than I can say"/>
        

        IMPORTANT :

        • In case you just want to find the different occurrences, always move the caret to the very beginning of the current file before searching

        • In case of replacement, do not use the Replace button ( step by step replacement )

        • This regex S/R works fine, even in case of multi-line tags !
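For readers who want to check the same replacement outside Notepad++, here is a minimal Python sketch. It does not use the `\G` / `\K` technique (Python's standard `re` module lacks both); instead it swaps in a two-step approach: capture the whole region between ` content="` and `/>`, then delete every double quote inside it except one sitting at the very end. The function name `strip_inner_quotes` is made up for this illustration.

```python
import re

# Region: starts right after ' content="' and runs up to (not including) the next '/>'.
# re.DOTALL lets the region span several lines, like the (?s:...) group in the thread.
REGION = re.compile(r'(\x20content=")(.*?)(?=/>)', re.DOTALL)


def strip_inner_quotes(text: str) -> str:
    def fix(match: re.Match) -> str:
        body = match.group(2)
        # Remove every " that is not the last character of the region,
        # i.e. every " that is not immediately followed by '/>'.
        return match.group(1) + re.sub(r'"(?!\Z)', '', body)

    return REGION.sub(fix, text)


sample = '<meta name="description" content="I" love my mother" but I love my sister" more "/than I can sa"y"/>'
print(strip_inner_quotes(sample))
```

Like the regex S/R above, this sketch assumes that every region is closed by `/>`; a tag closed by a plain `>` would extend the region up to the next `/>` in the file.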

        Best Regards,

        guy038

        • R
          Robin Cruise
          last edited by Robin Cruise Apr 24, 2021, 6:01 AM Apr 24, 2021, 5:59 AM

          @guy038 said in Regex: Find and Delete duplicate apostrophe on a html tag:

          (?-i:\x20content="|(?!\A)\G)(?s:(?!/>).)*?\K"(?!/>)

          Hello @guy038, your regex is almost right, but you must consider one other case:

          <meta http-equiv="X-UA-Compatible" content="IE=edge">

          In this case, your regex deletes the closing double quote, and I don’t want that. The first and the last double quote must stay in place. Also, if I prefix your regex with one specific meta tag, it no longer deletes all the extra double quotes in the content section:

          <meta name="description"(?-i:\x20content="|(?!\A)\G)(?s:(?!/>).)*?\K"(?!/>)
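One way to keep the closing quote of a tag that ends with a plain `>` is to treat `">` and `"/>` alike as the end of the value. The Python sketch below illustrates that idea only; it is not the regex from this thread, the name `clean_content_attr` is hypothetical, and it assumes that `content` is the last attribute of the tag and that the value contains no `<` or `>`.

```python
import re

# group 1: the opening  content="
# group 2: the attribute value (as short as possible, no < or >)
# group 3: the closing quote plus the end of the tag, either "> or "/>
CONTENT_ATTR = re.compile(r'(\x20content=")([^<>]*?)("\s*/?>)')


def clean_content_attr(html: str) -> str:
    def fix(match: re.Match) -> str:
        # Only quotes strictly inside the value are removed;
        # the first and last quote are kept via groups 1 and 3.
        return match.group(1) + match.group(2).replace('"', '') + match.group(3)

    return CONTENT_ATTR.sub(fix, html)


print(clean_content_attr('<meta http-equiv="X-UA-Compatible" content="IE=edge">'))
print(clean_content_attr('<meta name="description" content="I love my mother" but I love my sister" more than I can say"/>'))
```

If another attribute follows `content`, this sketch would wrongly merge it into the value, so it should only be run on tags that match the stated assumption.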

          • A
            Alan Kilborn
            last edited by Apr 24, 2021, 1:52 PM

            @guy038 said in Regex: Find and Delete duplicate apostrophe on a html tag:

            So, in the event that the FR zone ( " ) is located right before the ESR zone ( /> ), you must add a negative look-ahead (?!ESR) after FR, giving this general syntax :
            (?s-i:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?s-i:FR)(?!ESR)

            I suppose that is the reason that the earlier version of the generic solution didn’t work when I tried it.

              • G
                guy038
                last edited by guy038 Apr 24, 2021, 10:34 PM Apr 24, 2021, 10:04 PM

                Hello, @robin-cruise, @alan-kilborn and All,

                Pheeew ! I worked very hard and I must have fired up the regular expression engine ;-))

                I managed to build a general regex which is able to match any double quote, in non-allowed locations of any attribute value, in an HTML file ! Of course, this regex leaves the comment zones alone : they are skipped !

                Thus, the following S/R :

                SEARCH (?is)<!--.+-->(*SKIP)(*F)|(?:="|>"|(?!\A)\G)(?:[^<>"])*?\K"(?!\s*[,;<>])(?!\s*/>)(?!\s+[a-z][^<>="]+=\s*")

                REPLACE Leave EMPTY


                With the free-spacing mode, it can be expressed as below :

                (?xis)                      # FREE-SPACING mode, search INSENSITIVE to CASE and DOT matches ANY SINGLE char, EOL chars INCLUDED
                  <!--.+-->(*SKIP)(*F)      #   Any matched COMMENT zone, MULTI-lines or NOT, is CANCELLED and WORKING location moves RIGHT AFTER that zone
                |                           # OR
                  (?:                       #   START of 1st NON-CAPTURING group ( BSR )
                      ="                    #       ="
                    |                       #     OR
                      >"                    #       >"
                    |                       #     OR
                      (?!\A)\G              #       END of PREVIOUS search, if current location is NOT at the VERY BEGINNING of file
                  )                         #   END NON-CAPTURING 1st group
                                            #
                  (?:                       #   START of 2nd NON-CAPTURING group ( ESR )
                    [^<>"]                  #     ANY char DIFFERENT from <  and  > and "
                  )*?                       #   END NON-CAPTURING 2nd group, defining the SHORTEST range, possibly EMPTY, of chars as ABOVE 
                                            #
                  \K                        #   Match, so far, CANCELLED and location UPDATED
                  "                         #   A NON allowed DOUBLE-QUOTE ( FR ) So the EXPECTED match
                                            #
                                            #   The regexes, inside the NEGATIVE look-aheads, below, define the CORRECT locations of a DOUBLE-QUOTE
                                            #     so the NEGATIVE look-aheads restrict the MATCH of a DOUBLE-QUOTE to NON-ALLOWED locations, ONLY
                                            #
                  (?!\s*[,;<>])             #   If NOT FOLLOWED with possible BLANK chars FOLLOWED with a , or ; or < or > char
                  (?!\s*/>)                 #   AND if NOT FOLLOWED with possible BLANK chars FOLLOWED with the /> string
                  (?!\s+[a-z][^<>="]+=\s*") #   AND if NOT FOLLOWED with ( BLANK char(s), a LETTER, some chars DIFFERENT from < and > and = and "
                                            #     FOLLOWED with a = char, possible BLANK chars and a " char ) Case of a " char FOLLOWED with an ATTRIBUTE
                

                You can test it against this HTML example, below, containing :

                • A lot of " , in non-allowed locations, of course, which should be matched by the search regex

                • Probably, a non-regular HTML syntax. Anyway it’s just for testing !

                <html>
                  <head>
                    <meta charset="UTF-8">
                    <meta name=""key"words" content="HTML, CSS, JavaScript""/>
                    <meta name=""des"cr
                	ip"tion" content="Free Web tu"torials"">
                    <!-- <meta name="viewport" content="wi
                    dth=device-"width, ini"tial-scale=1.0" -->
                    <!-- <a href=""https://www.w3"schools.com/html"/"">"Visit our" HTML tu"torial"</a>  -->
                    <meta name="author" content="John Doe">
                    <meta http-equiv="X-UA-Compatible" content="IE=edge">
                    <meta name="viewport" content=""width=de""vice-w"idth, i
                	nit"ial-" sc"/"ale=1.0"">
                  </head>
                  <body>
                    <h1>My First Heading</h1>
                    <p>My first paragraph</p>
                    <a href=""https://www.w3"schools.com/html"/"">"Visit our" HTML tu"torial"</a>
                    <a href=""https:/
                	/www.w3"schools.com/
                	html"/
                	"">"Visit our" H
                	TML tu"torial"</a>
                    <form action="/action_pa"ge.php">
                      <label for=""fna"me">"First "name:"</label>
                      <input type="te"xt" id="fname" name=""fna"me"><br><br>
                    </form> 
                  </body>
                </html> 
                

                After replacement ( 36 occurrences ), we get :

                <html>
                  <head>
                    <meta charset="UTF-8">
                    <meta name="keywords" content="HTML, CSS, JavaScript"/>
                    <meta name="descr
                	iption" content="Free Web tutorials">
                    <!-- <meta name="viewport" content="wi
                    dth=device-"width, ini"tial-scale=1.0" -->
                    <!-- <a href=""https://www.w3"schools.com/html"/"">"Visit our" HTML tu"torial"</a>  -->
                    <meta name="author" content="John Doe">
                    <meta http-equiv="X-UA-Compatible" content="IE=edge">
                    <meta name="viewport" content="width=device-width, i
                	nitial- sc/ale=1.0">
                  </head>
                  <body>
                    <h1>My First Heading</h1>
                    <p>My first paragraph</p>
                    <a href="https://www.w3schools.com/html/">"Visit our HTML tutorial"</a>
                    <a href="https:/
                	/www.w3schools.com/
                	html/
                	">"Visit our H
                	TML tutorial"</a>
                    <form action="/action_page.php">
                      <label for="fname">"First name:"</label>
                      <input type="text" id="fname" name="fname"><br><br>
                    </form> 
                  </body>
                </html> 
                

                IMPORTANT :

                • This search regex works properly with the v7.9.1 release and later only

                • In case you just want to find or mark the different occurrences, always move the caret to the very beginning of the current file before searching / marking

                • In case of replacement, do not use the Replace button ( step by step replacement )
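The search logic above can be approximated outside Notepad++ with a small hand-written scanner. The following Python sketch is an emulation under stated assumptions, not the thread's regex itself: Python's standard `re` module has no `\G`, `\K` or `(*SKIP)(*F)`, so the chaining and the comment skipping are done with an explicit loop. The comment pattern is also made lazy (`.+?`) here so that each comment is skipped on its own. All names (`COMMENT`, `ANCHOR`, `STEP`, `remove_stray_quotes`) are hypothetical.

```python
import re

# A comment zone, copied to the output untouched (role of (*SKIP)(*F)).
COMMENT = re.compile(r'<!--.+?-->', re.DOTALL)
# Start of a search region (the BSR alternatives =" and >").
ANCHOR = re.compile(r'="|>"')
# One chained step (role of \G): a run of chars other than < > ",
# then a " that is NOT in an allowed location (the three look-aheads).
STEP = re.compile(
    r'[^<>"]*"(?!\s*[,;<>])(?!\s*/>)(?!\s+[a-z][^<>="]+=\s*")',
    re.IGNORECASE,
)


def remove_stray_quotes(text: str) -> str:
    out = []
    pos = 0
    while pos < len(text):
        comment = COMMENT.match(text, pos)
        if comment:
            out.append(comment.group())
            pos = comment.end()
            continue
        anchor = ANCHOR.match(text, pos)
        if anchor:
            out.append(anchor.group())
            pos = anchor.end()
            # Keep matching from the end of the previous match.
            while True:
                step = STEP.match(text, pos)
                if not step:
                    break
                out.append(step.group()[:-1])  # keep the run, drop the stray "
                pos = step.end()
            continue
        out.append(text[pos])
        pos += 1
    return ''.join(out)


print(remove_stray_quotes('<meta name=""key"words" content="HTML, CSS, JavaScript""/>'))
```

The loop makes the role of each part visible: `ANCHOR` opens a region, `STEP` is repeated as long as it matches right where the previous step ended, and the first allowed double quote or `<` / `>` character ends the chain.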

                Now, Robin, it’s up to you to find out other cases, that I have not considered yet ;-))

                Best Regards

                guy038

                • R
                  Robin Cruise
                  last edited by Apr 25, 2021, 5:46 AM

                  @guy038 Yes, it works. But you must also mention the <meta> tag in your regex, or else it will also delete other double quotes. For example, HTML files often contain some JavaScript too, and your regex will delete some double quotes there that must not be deleted.

                  Try your regex on this script, and it will ruin it:

                  (?is)<!--.+-->(*SKIP)(*F)|(?:="|>"|(?!\A)\G)(?:[^<>"])*?\K"(?!\s*[,;<>])(?!\s*/>)(?!\s+[a-z][^<>="]+=\s*")

                  <script LANGUAGE="JavaScript">
                  function emailCheck() {
                  	if (document.Form.nume.value=="") {
                  		alert("Please enter your full name!");
                  		document.Form.nume.focus();
                  		return false
                  	}
                  	
                  	if(document.Form.nume.value.indexOf(' ', 0) == -1){
                  		alert("Please enter your full name!");
                  		document.Form.nume.focus();
                  		return false;
                  	}
                  
                  	if (document.Form.email.value=="") {
                  		alert("Please enter your email!");
                  		document.Form.email.focus();
                  		return false;
                  	}
                  	if(document.Form.email.value.indexOf('@', 0) == -1){
                  		alert("Invalid email address!");
                  		document.Form.email.focus();
                  		return false;
                  	}
                  	if(document.Form.email.value.indexOf('.', 0) == -1){
                  		alert("Invalid email address!");
                  		document.Form.email.focus();
                  		return false;
                  	}
                  	if (document.Form.varsta.value=="") {
                  		alert("Please mention your age!");
                  		document.Form.varsta.focus();
                  		return false;
                  	}
                  }
                  </script>
                  
                  • G
                    guy038
                    last edited by guy038 Apr 25, 2021, 10:15 AM Apr 25, 2021, 8:48 AM

                    Hi, @robin-cruise, @alan-kilborn and All,

                    I did consider JavaScript embedded within an HTML file with, for instance, this small piece of text :

                    var d=window,e="length",h="",k="__duration__",l="function";function m(c){return document.getElementById(c)}
                    

                    So, in the present regex, the locations of double-quote, right before a comma and right before a semicolon are skipped because of the negative look-ahead (?!\s*[,;<>])

                    Now, with your JavaScript example, we must also consider the right syntax ("••••••••"). This means that we have to :

                    • Add the \(" in the BSR region

                    • Change the negative look-ahead (?!\s*[,;<>]) to (?!\s*[,;<>)]) in the second part of the ESR region

                    In addition, I supposed that, in the BSR region, the =, > and ( characters may be separated from the double-quote with some blank characters

                    All in all, this gives this new regex version :

                    SEARCH / MARK :

                    (?is)<!--.+-->(*SKIP)(*F)|(?:=\s*"|>\s*"|\(\s*"|(?!\A)\G)(?:[^<>"])*?\K"(?!\s*[,;<>)])(?!\s*/>)(?!\s+[a-z][^<>="]+=\s*")

                    REPLACE Leave EMPTY
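The effect of adding `)` to the first negative look-ahead can be checked in isolation. The Python sketch below tests only that look-ahead on one line of the script from this thread; it deliberately leaves out the rest of the regex, and the names `OLD_GUARD`, `NEW_GUARD` and `unguarded_quotes` are invented for the illustration.

```python
import re

# The first negative look-ahead, before and after the change.
OLD_GUARD = re.compile(r'"(?!\s*[,;<>])')
NEW_GUARD = re.compile(r'"(?!\s*[,;<>)])')


def unguarded_quotes(guard: re.Pattern, line: str) -> list:
    """Offsets of the double quotes that this look-ahead alone does not protect."""
    return [m.start() for m in guard.finditer(line)]


line = 'alert("Please enter your full name!");'
print(unguarded_quotes(OLD_GUARD, line))
print(unguarded_quotes(NEW_GUARD, line))
```

With the old look-ahead both quotes are candidates; with the new one the closing quote before `)` is protected. The opening quote still passes the look-ahead, but in the full regex it is consumed as part of the new `\(\s*"` alternative of the BSR, so it is never the quote that gets matched.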


                    Reminder : If you just want to find/mark the possible non-allowed ", in an HTML file, remember to always place the caret at the very beginning of file, before processing !!

                    In script parts especially, the regex may wrongly match some double quotes which are, in fact, totally legal !

                    So, in order to simplify the problem, we could consider that all the JavaScript parts are out of scope for the current parsing attempt and treat them in the same way as comments, skipping any <script•••••>•••••••••</script> section !
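The idea of skipping script sections in the same way as comments can be sketched in Python as follows. This is an illustration of the skipping principle only, with hypothetical names (`PROTECTED`, `apply_outside`): the text is split into protected zones (comments and script blocks), which are copied unchanged, and the remaining pieces, which are passed to whatever fixing function is wanted.

```python
import re

# Zones that must be left untouched: HTML comments and <script>...</script> blocks.
PROTECTED = re.compile(
    r'<!--.*?-->|<script\b.*?</script\s*>',
    re.DOTALL | re.IGNORECASE,
)


def apply_outside(text: str, fix) -> str:
    """Apply fix() to every stretch of text lying outside the protected zones."""
    out = []
    last = 0
    for zone in PROTECTED.finditer(text):
        out.append(fix(text[last:zone.start()]))  # unprotected piece
        out.append(zone.group())                  # protected zone, unchanged
        last = zone.end()
    out.append(fix(text[last:]))
    return ''.join(out)


html = '<p>"a"</p><script>alert("hi");</script><!-- "c" --><p>"b"</p>'
# Deliberately crude fix, just to show which parts are touched.
print(apply_outside(html, lambda piece: piece.replace('"', '')))
```

In a real run, the lambda would be replaced by a function that removes only the non-allowed double quotes; the point here is that nothing inside a script block or a comment is ever handed to it.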

                    BR

                    guy038

                    P.S. :

                    I have only very basic notions of HTML and none of JavaScript ! So, I just consider the HTML text as pure text to be parsed with regular expressions. Sorry for this limitation !

                    • R
                      Robin Cruise
                      last edited by Apr 25, 2021, 10:19 AM

                      super answer, @guy038 Thanks

                      The Community of users of the Notepad++ text editor.