    Regex: Delete only one instance of a string between two html tags (double quotes)

    • Robin Cruise

      Hello. I have some HTML tags, for example:

      <meta name="description" content="......"/>

      As you can see, the content value is delimited by two double quotes: one at the start of the content and one at the end.

      But in the example below, I have one or more extra double quotes inside the content, apart from the two delimiting ones. How can I delete those extra double quotes?

      <meta name="description" content="Kiel vi rilatigas vian juĝvaloron "al la kredoj esprimitaj de aliaj se vi ne pretas elporti la kostojn de misinterpretado de la " cirkonstancoj en kiuj okazas evento?"/>
      

      I tried to use an old generic regex that @guy038 made:

      (?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?-si:FR)

      which becomes:

      FIND: (?-si:<meta name="description" content="|(?!\A)\G)(?s-i:(?!"/>).)*?\K(?-si:")

      REPLACE BY: (leave empty)

      The problem is that this solution deletes all the double quotes except the first one, including the last one, which should not have been deleted.

      • Neil Schipper @Robin Cruise

        @robin-cruise

        Hard to understand.

        Try: "This regex produced the output output1, but the output I want is output2."

        • Robin Cruise @Neil Schipper

          @neil-schipper Only those 2 double quotes (around the dots) are important:

          <meta name="description" content="......"/>

          • guy038

            Hi @robin-Cruise, @neil-schipper and All,

            Given this text :

            <meta name="description" content="Kiel vi rilatigas vian juĝvaloron "al la kredoj esprimitaj de aliaj se vi ne pretas elporti la kostojn de misinterpretado de la " cirkonstancoj en kiuj okazas evento?"/>
            

            You used this regex :

            FIND: (?-si:<meta name="description" content="|(?!\A)\G)(?s-i:(?!"/>).)*?\K(?-si:")

            So, after finding and deleting the two unwanted " characters, the regex continues from that point, due to the \G feature, and first skips over the remaining range cirkonstancoj en kiuj okazas evento?

            When reading the last char ? of that range, the (?!"/>) condition is still true, because that look-ahead is tested only at each skipped character and never at the " matched after \K. So the regex wrongly selects the last " char!
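To make this failure visible outside Notepad++, here is a minimal Python sketch. Python's standard `re` module has neither \G nor \K, so this is only an emulation under assumptions: `Pattern.match(text, pos)` stands in for \G and a capturing group stands in for the text after \K. The names `STEP` and `selected_quotes` are hypothetical.

```python
import re

TAG = (
    '<meta name="description" content="Kiel vi rilatigas vian juĝvaloron "al la kredoj '
    'esprimitaj de aliaj se vi ne pretas elporti la kostojn de misinterpretado de la " '
    'cirkonstancoj en kiuj okazas evento?"/>'
)
BSR = '<meta name="description" content="'

# Tempered dot from the original search: every skipped character is checked
# against "/> but the quote captured at the end is not checked at all.
STEP = re.compile(r'(?:(?!"/>).)*?(")', re.S)


def selected_quotes(text):
    """Offsets of the quotes the original search would select (emulation only)."""
    pos = text.index(BSR) + len(BSR)  # start right after the BSR
    hits = []
    while True:
        m = STEP.match(text, pos)  # anchored at pos, standing in for \G
        if m is None:
            break
        hits.append(m.start(1))  # group 1 stands in for what follows \K
        pos = m.end()
    return hits


hits = selected_quotes(TAG)
print(hits)
```

With the sample tag, the emulation reports three offsets: the two stray quotes and then the closing quote of the tag, which is the one that should have been kept.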

            This case is special because the string to find (FR) is also part of the ESR region. The rule should be:

            In single lines containing the <meta name="description" content=" string, delete any subsequent double quote that does not end the tag. This gives this simple regex:

            SEARCH (?-si:<meta name="description" content="|(?!\A)\G).*?\K"(?!/>)

            REPLACE Leave EMPTY
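For readers who want to apply the same rule from a script, here is a rough Python sketch. It is not a port of the search above: since Python's `re` lacks \G and \K, it swaps in a different technique, capturing the content of the tag and stripping the quotes in a replacement callback. The names `META` and `strip_inner_quotes` are hypothetical, and the sample line is a shortened version of the one in this thread.

```python
import re

# Group 1: the opening part, group 2: the content, group 3: the closing "/>
META = re.compile(r'(<meta name="description" content=")(.*?)("/>)')


def strip_inner_quotes(line):
    """Remove every double quote inside the description content of one line."""
    def fix(m):
        return m.group(1) + m.group(2).replace('"', '') + m.group(3)
    return META.sub(fix, line)


line = '<meta name="description" content="vian juĝvaloron "al la " cirkonstancoj?"/>'
print(strip_inner_quotes(line))
```

One limitation of this sketch: the lazy `.*?` stops at the first `"/>`, so a stray quote immediately followed by `/>` inside the content would be taken for the end of the tag.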

            Note that I keep only the "no single line" and "not insensitive" modifiers (?-si) at the beginning of the regex and do not use any modifier afterwards, whereas you used this syntax: .....(?s-i:(?!"/>).)*?.....

            Then, each time, the .*? represents the range of text to skip before catching the " char, and the ESR region becomes the final negative look-ahead (?!/>)


            Now, the above regex S/R works only for lines containing <meta name="description" content=". Below is a regex which finds any double quote located between the usual " boundaries, in an HTML or XML file:

            SEARCH (?<!=\x20)(?<!=)"(?!>|/>|\x20>|\x20/>|\?>|\x20\?>|\x20\w+=)

            Normally, this case should occur only in comments !
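This second search uses only fixed-width look-behinds and a look-ahead, so it should also run unchanged in Python's standard `re` module. The sketch below tries it on a shortened version of the sample tag and on two well-formed tags; the names `STRAY_QUOTE` and `remove_stray_quotes` and the sample strings are made up for illustration.

```python
import re

# Same pattern as the SEARCH above: a quote that is neither preceded by "=" or
# "= " nor followed by something that looks like the end of an attribute value.
STRAY_QUOTE = re.compile(
    r'(?<!=\x20)(?<!=)"(?!>|/>|\x20>|\x20/>|\?>|\x20\?>|\x20\w+=)'
)


def remove_stray_quotes(text):
    """Delete every double quote that the pattern classifies as stray."""
    return STRAY_QUOTE.sub('', text)


samples = [
    '<meta name="description" content="vian "al la " evento?"/>',
    '<a href="x" title="y">',
    '<?xml version="1.0" encoding="UTF-8"?>',
]
for s in samples:
    print(remove_stray_quotes(s))
```

Only the first sample changes; the two well-formed tags come back untouched. Note that the pattern knows nothing about tag boundaries, so quotes in ordinary text content, such as `<p>He said "hi"</p>`, match as well.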

            Best Regards,

            guy038

            • Robin Cruise @guy038

              Thank you, @guy038

              (?-si:<meta name="description" content="|(?!\A)\G).*?\K"(?!/>)

              So, I extracted a new generic regex for search and replace from your regex above:

              (?-si:BSR|(?!\A)\G).*?\KFR(?!ESR)
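As a rough illustration of what this generic form does on a single line, here is a Python sketch. It does not use \G or \K, which Python's `re` lacks; it swaps in a simpler technique that locates the first BSR and then deletes each later FR not followed by ESR. The function name `delete_fr` is hypothetical, BSR, FR and ESR are treated as literal strings (an assumption), and the input is assumed to be one line.

```python
import re


def delete_fr(line, bsr, fr, esr):
    """Delete each FR found after the first BSR on a line, unless ESR follows it."""
    start = line.find(bsr)
    if start == -1:
        return line  # no BSR on this line: leave it unchanged
    cut = start + len(bsr)
    # FR guarded by a negative look-ahead on ESR, as in the generic regex
    guarded_fr = re.compile(re.escape(fr) + '(?!' + re.escape(esr) + ')')
    return line[:cut] + guarded_fr.sub('', line[cut:])


print(delete_fr('<b>x BOOM y BOOM</b>', '<b>', 'BOOM', '</b>'))
print(delete_fr('<meta name="description" content="a "b" c"/>',
                '<meta name="description" content="', '"', '/>'))
```

In both calls the last FR survives because it is immediately followed by the ESR, which is the behaviour the look-ahead (?!ESR) is meant to give.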

              For the second regex you made, I also tried to extract a generic version, but I can’t figure it out…

              • guy038

                Hi, @robin-cruise,

                • Regarding the first regex, your equivalent generic regex is correct.

                • However, we cannot derive any generic regex from my second regex! Indeed, it just finds any double-quote character when:

                  • Some characters do not occur before the " char: (?<!=\x20)(?<!=)

                  AND

                  • Some characters do not occur after the " char: (?!>|/>|\x20>|\x20/>|\?>|\x20\?>|\x20\w+=)

                BR

                guy038

                • Robin Cruise @guy038

                  @guy038

                  I tried to find a generic version of your regex myself. It works well, except for " (double quotes), because that character is also part of the tag syntax. When I replace those extra quotes in the content of the tags with a word, like “BOOM”, it finds and replaces it correctly between the start and end tags.

                  These are the generic regexes for your second solution. The short and long versions are almost the same and do the same thing: they find and replace correctly between the start and end tags.

                  (?<!=\x20)(?<!=)FR(?!>|ESR|\x20>|\x20/>|\?>|\x20\?>|BSR)

                  OR

                  (?<!=\x20)(?<!=)FR(?!>|ESR|\x20>|\x20/>|\?>|\x20\?>|\x20BSR)

                  OR

                  (?<!=\x20)(?<!=)FR(?!>|ESR|\x20\?>|\x20BSR)

                  The Community of users of the Notepad++ text editor.