Need help replacing HTML tags! Any help welcome.

    Girlswhocode312
    last edited by Oct 23, 2017, 9:31 PM

    I need help with replacing and removing a series of HTML tags in 500+ Excel cells.

    • There are HTML tags that I need to replace with new tag names and there are tags that I would like to remove altogether.
    • The text I need to format is case sensitive, and in some cases, the tags contain capitalization. I need to replace all capitalized tags to lower-case lettering.
    • Lastly, some tags have additional formatting included within the tag, but I would like to simplify them to their most basic tag naming convention. An asterisk is noted in some tags to indicate wildcard HTML tags where I would like to replace all variations of that tag that start the way I indicated.

    There are variations of the following tags but here are just a few examples:

    Find and Replace:
    Find <DIV style…> and replace with <DIV>
    Find <STRONG> and replace with <b>
    Find <SPAN lang*> and replace with <span>

    Remove Altogether:
    <FONT Style…> and </FONT>
    <tbody*> and </tbody>

      guy038
      last edited by guy038 Oct 24, 2017, 12:50 PM Oct 24, 2017, 12:48 PM

      Hello, @carolin-marschke,

      Reading your post carefully, I noticed a small contradiction ! You said :

      I need to replace all capitalized tags to lower-case lettering.

      And, further on, the example :

      Find <DIV style…> and replace with <DIV>

      So, I suppose that the replacement should be <div> !

      Anyway, for all your text manipulations, here are some appropriate regexes to run, after opening the Replace dialog and selecting the Regular expression search mode :


      • For changes, as :

      Find <DIV style…> and replace with <DIV>
      Find <SPAN lang*> and replace with <span>

      Use the regex S/R : SEARCH = (?i-s)<(SPAN|DIV).*?> and REPLACE = <\L\1>

      Notes :

      • Search is case-insensitive ( i ) and the dot matches standard characters only, not line breaks ( -s )

      • It looks for the tag SPAN or DIV, in any case, stored as group 1 ( <(SPAN|DIV) ), possibly followed by some characters, up to the nearest closing > character ( .*?> ), on the same line

      • The replacement just rewrites the group 1 tag name ( \1 ), in lower case, because of the preceding \L syntax
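
For anyone who wants to script the same change outside Notepad++, here is a minimal Python sketch of this search/replace. It is only an analogue, not the Notepad++ behaviour itself: Python's re module has no \L operator in replacement strings, so a callback lowercases group 1 instead, and the name simplify_tags is made up for illustration.

```python
import re

# Same search as above: "<", then SPAN or DIV in any case, then the
# shortest run of characters up to the next ">" on the same line.
# In Python, "." does not match "\n" unless re.DOTALL is set.
TAG = re.compile(r"(?i)<(span|div).*?>")


def simplify_tags(text: str) -> str:
    """Drop attributes from opening SPAN/DIV tags and lowercase the tag name."""
    # Python has no \L in replacement strings, so lowercase via a callback.
    return TAG.sub(lambda m: "<" + m.group(1).lower() + ">", text)


if __name__ == "__main__":
    print(simplify_tags('<DIV style="color:red"><SPAN lang=EN-US>text</SPAN></DIV>'))
```

As with the regex above, closing tags such as </DIV> are left untouched, because the pattern requires the tag name immediately after the < character.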


      • For changes, as :

      Find <STRONG> and replace with <b>

      Use, for instance, the regex S/R : SEARCH = (?i)(<strong>)|(<em>) and REPLACE = (?1<b>)(?2<i>)

      Notes :

      • This simply replaces one string with another. The general syntax is :

      SEARCH = (?i)(Word1)|(Word2)|(Word3)........|(Wordn)

      REPLACE = (?1subst1)(?2subst2)(?3subst3)........(?nsubstn)
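
The conditional replacement syntax (?1...)(?2...) is specific to the regex engine used by Notepad++. As a rough Python equivalent of the same idea, the sketch below uses a lookup table instead of numbered groups. The SUBSTITUTIONS table and the name replace_words are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical mapping for illustration; keys are the lowercase search strings.
SUBSTITUTIONS = {"<strong>": "<b>", "<em>": "<i>"}

# Build Word1|Word2|... from the keys, escaping any regex meta-characters.
PATTERN = re.compile(
    "|".join(re.escape(key) for key in SUBSTITUTIONS),
    re.IGNORECASE,
)


def replace_words(text: str) -> str:
    """Replace each matched word with its substitute, ignoring case."""
    # Lowercase the matched text so it can be looked up in the table.
    return PATTERN.sub(lambda m: SUBSTITUTIONS[m.group(0).lower()], text)


if __name__ == "__main__":
    print(replace_words("<STRONG>bold</STRONG> and <em>italic</em>"))
```

Only the strings listed in the table are replaced, so closing tags stay as they are unless entries such as "</strong>" are added as well.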


      • For deletions, as :

      <FONT Style…> and </FONT>
      <tbody*> and </tbody>

      Use the regex S/R : SEARCH = (?is)<(FONT|tbody).+?</\1>\R? and REPLACE = EMPTY

      Notes :

      • Search is case-insensitive ( i ) and the dot matches any character, standard or EOL ( s )

      • It looks for tags FONT or tbody, in any case, stored as group 1 ( <(FONT|tbody) ), up to the nearest closing tag IDENTICAL to group 1 ( </\1> ), possibly followed by an EOL sequence ( \R? )

      • As the replacement zone is empty, the whole block <Tag...>.......</Tag>, which most of the time spans several lines, is simply deleted, including the text between the two tags
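
Here is a hedged Python sketch of this deletion, for comparison. Python's re module does not support \R, so the optional line break is spelled out by hand. The second pattern, TAGS_ONLY, is not from the regex above: it is an assumed alternative for the case where only the tags themselves should go and the text between them should stay. Both function names are made up for illustration.

```python
import re

# Block deletion, as in the regex above. Python's re has no \R,
# so the optional trailing line break is written out explicitly.
BLOCK = re.compile(
    r"<(font|tbody).+?</\1>(?:\r\n|\n|\r)?",
    re.IGNORECASE | re.DOTALL,
)

# Assumed alternative: match only an opening or closing FONT/tbody tag.
TAGS_ONLY = re.compile(r"</?(?:font|tbody)\b[^>]*>", re.IGNORECASE)


def delete_blocks(text: str) -> str:
    """Remove each <Tag...>...</Tag> block, content included."""
    return BLOCK.sub("", text)


def strip_tags_only(text: str) -> str:
    """Remove the tags but keep the text between them."""
    return TAGS_ONLY.sub("", text)


if __name__ == "__main__":
    sample = 'a\n<FONT Style="x">hi\nthere</FONT>\nb'
    print(repr(delete_blocks(sample)))
    print(repr(strip_tags_only(sample)))
```

Which of the two is wanted depends on whether the cell text inside the FONT or tbody tags must be preserved.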


      Hoping that this first attempt will be useful to you !

      Best Regards,

      guy038

        Girlswhocode312
        last edited by Oct 24, 2017, 5:29 PM

        Hello @guy038 ,

        Thank you very much for your response. Also, you are correct in catching my mistake. Thanks for pointing that out. I will use your suggestions and get back to you shortly to let you know if those worked for me.

        Best,

        Carolin

          Girlswhocode312
          last edited by Girlswhocode312 Oct 25, 2017, 6:43 PM Oct 25, 2017, 6:42 PM

          @guy038 ,

          Hello! Thank you again for your help. I am now stumbling upon an issue with the following tag (any possible variations of it after the first few characters):

          <?xml *>

          There may be variations of strings that begin with this lettering. I tried to replace these by the following search function:

          (?i)<?xml[^>]*>

          But once replaced, the following remained in my script:
          <?

          How do you recommend I remove all instances of " <? " ?

            guy038
            last edited by Oct 25, 2017, 10:11 PM

            Hi, @carolin-marschke,

            I suppose that you're speaking about the XML declaration, as below, which generally starts an XML file !

            <?xml version="1.0" encoding="UTF-8" ?>

            Your trouble comes from the question mark character ( ? ), which is considered, in the regex universe, as a special character, also called a meta-character. So, by default, these characters are NOT simple literal characters ! In your regex, <? means "an optional < character", so the match begins at xml and the leading <? is left behind.


            There are two different sets of meta-characters :

            • Those, which are recognized anywhere in the pattern, EXCEPT within square brackets :

              • \ General escape character, with several uses

              • ^ Start of a line ( or a file )

              • $ End of a line ( or a file )

              • . Match any character, except new-line ones ( by default )

              • [ Start a class definition or a class range

              • | Start of an alternative branch

              • ( Start of a sub-pattern group

              • ) End of a sub-pattern group

              • { Start a Min / Max range of a quantifier

              • } End of a Min / Max range of a quantifier

              • * 0 or more times, the preceding character or group

              • +

                • 1 or more times, the preceding character or group

                • Possessive behaviour of the quantifiers *, + and ?

              • ?

                • 0 or 1 time, the preceding character or group

                • Meaning extender, for groups or conditions, (?....)

                • Minimizer of the quantifiers *, + and ?

            • Those, which are recognized within square brackets ( character class ), EXCLUSIVELY :

              • \ General escape character

              • ^ Negate the class, if first character of the class

              • - Character range indicator

              • [: Start of a POSIX character class, if followed by regular POSIX syntax

              • :] End of a POSIX character class

            So, Carolin, if you need to search for any of the above characters, as a literal, you must escape it with the backslash character \

            Therefore, your regex must be rewritten as : (?i)<\?xml[^>]*>, with a \, right before the ? character !
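
The effect of the missing backslash can be reproduced outside Notepad++. The short Python sketch below is only an illustration with Python's re module, which treats ? the same way here; the variable names are made up. It also shows that, inside square brackets, ? is an ordinary character.

```python
import re

declaration = '<?xml version="1.0" encoding="UTF-8" ?>'

# Unescaped: "<?" means "an optional <", so the match starts at "xml"
# and the leading "<?" is never part of it.
unescaped = re.sub(r"(?i)<?xml[^>]*>", "", declaration)

# Escaped: "\?" is a literal question mark, so the whole tag matches.
escaped = re.sub(r"(?i)<\?xml[^>]*>", "", declaration)

# Inside square brackets "?" is not a meta-character, so [?] is literal too.
bracketed = re.sub(r"(?i)<[?]xml[^>]*>", "", declaration)

print(repr(unescaped))  # '<?'
print(repr(escaped))    # ''
print(repr(bracketed))  # ''
```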

            However, the regex (?-is)<\?xml.+?> gives better results ! Indeed, due to the -s modifier, any dot matches standard characters only, and the -i modifier makes the search case-sensitive. So, after matching the literal string <?xml, with that exact case ( <\?xml ), it looks for the shortest non-empty range of characters ( .+? ), on the current line, up to a closing > symbol

            Assuming that the single XML line <?xml version="1.0" encoding="UTF-8" ?> were split into four lines, as below :

            <?xml ver
            sion="1.0" enc
            oding="UT
            F-8" ?>
            

            My regex doesn't match this incorrect text, unlike your regex, because [^>] also matches line breaks ! To get the same behaviour, you should change your regex to (?i)<\?xml[^>\r\n]*>. This time, the syntax [^>\r\n] matches any character different from the > character AND different, also, from any EOL character :-))
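
The difference between the three patterns on the split text can be checked with a small Python sketch. Assumptions to keep in mind: Python's re module does not accept the (?-is) prefix, so the first pattern simply relies on Python's defaults (case-sensitive, and the dot does not match "\n"), and the sample uses "\n" line breaks only. The variable names are illustrative.

```python
import re

# The declaration wrongly split over four lines, as in the example above.
split_declaration = '<?xml ver\nsion="1.0" enc\noding="UT\nF-8" ?>'

# Dot-based pattern: by default Python's dot does not match "\n".
dot_based = re.compile(r"<\?xml.+?>")

# Negated class: [^>] matches line breaks too, so it spans all four lines.
class_based = re.compile(r"(?i)<\?xml[^>]*>")

# Negated class that also excludes EOL characters: confined to one line.
line_bound = re.compile(r"(?i)<\?xml[^>\r\n]*>")

for name, pattern in [
    ("dot_based", dot_based),
    ("class_based", class_based),
    ("line_bound", line_bound),
]:
    print(name, "matches:", pattern.search(split_declaration) is not None)
```

On a correct, single-line declaration all three patterns match; they only differ when the text between <?xml and > contains a line break.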

            Cheers,

            guy038

            The Community of users of the Notepad++ text editor.