    Need help replacing HTML tags! Any help welcome.

    • Girlswhocode312

      I need help replacing and removing a series of HTML tags in 500+ Excel cells.

      • There are HTML tags that I need to replace with new tag names and there are tags that I would like to remove altogether.
      • The text I need to format is case sensitive, and in some cases, the tags contain capitalization. I need to replace all capitalized tags to lower-case lettering.
      • Lastly, some tags have additional formatting included within the tag, but I would like to simplify them to their most basic tag naming convention. An asterisk is noted in some tags to indicate wildcard HTML tags where I would like to replace all variations of that tag that start the way I indicated.

      There are variations of the following tags but here are just a few examples:

      Find and Replace:
      Find <DIV style…> and replace with <DIV>
      Find <STRONG> and replace with <b>
      Find <SPAN lang*> and replace with <span>

      Remove Altogether:
      <FONT Style…> and </FONT>
      <tbody*> and </tbody>

      • guy038

        Hello, @carolin-marschke,

        Reading your post carefully, I noticed a small contradiction! You said:

        I need to replace all capitalized tags to lower-case lettering.

        And, further on, the example :

        Find <DIV style…> and replace with <DIV>

        So, I suppose that the replacement should be <div>!

        Anyway, for all your text manipulations, here are some appropriate regexes to run, after opening the Replace dialog and selecting the Regular expression search mode:


        • For changes such as:

        Find <DIV style…> and replace with <DIV>
        Find <SPAN lang*> and replace with <span>

        Use the regex S/R : SEARCH = (?i-s)<(SPAN|DIV).*?> and REPLACE = <\L\1>

        Notes :

        • The search is case-insensitive ( i ) and the dot matches standard characters only, not line breaks ( -s )

        • It looks for the tag SPAN or DIV, in any case, stored as group 1 ( <(SPAN|DIV) ), possibly followed by some characters, up to the nearest closing > character ( .*?> ), on the same line

        • In the replacement, it just rewrites the group 1 tag name ( \1 ), in lower case, because of the preceding \L syntax
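
For readers who want to try the same substitution outside Notepad++, here is a minimal Python sketch. It is a translation under assumptions, not the method from the post: Python's `re` module has no `\L` replacement operator and does not accept the bare `(?i-s)` form, so a replacement function lower-cases group 1, and Python's default dot behaviour (no line-break matching) stands in for `-s`. The function name `lowercase_simple_tags` and the sample string are hypothetical.

```python
import re

# Python's dot does not match "\n" unless re.DOTALL is set,
# which plays the role of the -s modifier in the Notepad++ regex.
TAG_WITH_ATTRS = re.compile(r"(?i)<(SPAN|DIV).*?>")

def lowercase_simple_tags(text: str) -> str:
    """Rewrite <SPAN ...> / <DIV ...> opening tags as bare lower-case tags."""
    # Emulates the replacement <\L\1>: keep only the tag name, lower-cased.
    return TAG_WITH_ATTRS.sub(lambda m: "<" + m.group(1).lower() + ">", text)

sample = '<DIV style="color:red">Hi <SPAN lang=EN-US>there</SPAN></DIV>'
result = lowercase_simple_tags(sample)
print(result)
```

Closing tags such as `</SPAN>` begin with `</`, so this pattern leaves them untouched and they keep their original case.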


        • For changes such as:

        Find <STRONG> and replace with <b>

        Use, for instance, the regex S/R : SEARCH = (?i)(<strong>)|(<em>) and REPLACE = (?1<b>)(?2<i>)

        Notes :

        • This simply replaces one string with another. The general syntax is:

        SEARCH = (?i)(Word1)|(Word2)|(Word3)........|(Wordn)

        REPLACE = (?1subst1)(?2subst2)(?3subst3)........(?nsubstn)
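
The conditional replacement syntax `(?1...)(?2...)` is specific to the Notepad++ regex engine. As a rough sketch of the same idea in Python (an assumption for illustration; `replace_by_group` and `SUBSTITUTES` are hypothetical names), one capture group per search string can be mapped to its substitute with a replacement function:

```python
import re

# One capture group per search string, mirroring (Word1)|(Word2)|...
PATTERN = re.compile(r"(?i)(<strong>)|(<em>)")

# Index i holds the substitute for group i + 1, mirroring (?1<b>)(?2<i>).
SUBSTITUTES = ["<b>", "<i>"]

def replace_by_group(text: str) -> str:
    """Replace each match with the substitute tied to the group that matched."""
    # m.lastindex is the number of the capture group that matched (1 or 2 here).
    return PATTERN.sub(lambda m: SUBSTITUTES[m.lastindex - 1], text)

sample = "<STRONG>bold</STRONG> and <em>italic</em>"
result = replace_by_group(sample)
print(result)
```

Only the opening tags listed in the pattern are replaced; covering `</strong>` and `</em>` would need two further alternatives and two further substitutes.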


        • For deletions such as:

        <FONT Style…> and </FONT>
        <tbody*> and </tbody>

        Use the regex S/R : SEARCH = (?is)<(FONT|tbody).+?</\1>\R? and REPLACE = EMPTY

        Notes :

        • The search is case-insensitive ( i ) and the dot matches any character, standard or EOL ( s )

        • It looks for the tag FONT or tbody, in any case, stored as group 1 ( <(FONT|tbody) ), up to the nearest closing tag with the SAME name as group 1 ( </\1> ), possibly followed by an EOL sequence ( \R? )

        • As the replacement zone is empty, the whole block <Tag...>.......</Tag>, which most of the time spans several lines, is simply deleted
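
The deletion can be tried in Python with the sketch below. It rests on two assumptions: Python's `re` module lacks `\R`, so `(?:\r\n|\n|\r)?` is used as a rough stand-in, and the names `BLOCK` and `delete_blocks` are hypothetical.

```python
import re

# (?is): case-insensitive, and the dot also matches line breaks.
# \1 must repeat the tag name captured by group 1 (compared case-insensitively).
BLOCK = re.compile(r"(?is)<(FONT|tbody).+?</\1>(?:\r\n|\n|\r)?")

def delete_blocks(text: str) -> str:
    """Delete each <FONT ...>...</FONT> or <tbody ...>...</tbody> block."""
    return BLOCK.sub("", text)

sample = 'keep<FONT Style="x">gone\nalso gone</font>\nafter'
result = delete_blocks(sample)
print(repr(result))
```

As the notes above say, the text enclosed between the opening and closing tags is deleted together with the tags themselves.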


        I hope this first attempt will be useful to you!

        Best Regards,

        guy038

        • Girlswhocode312

          Hello @guy038 ,

          Thank you very much for your response. Also, you are correct in catching my mistake. Thanks for pointing that out. I will use your suggestions and get back to you shortly to let you know if those worked for me.

          Best,

          Carolin

          • Girlswhocode312

            @guy038 ,

            Hello! Thank you again for your help. I am now stumbling upon an issue with the following tag (any possible variations of it after the first few characters):

            <?xml *>

            There may be variations of strings that begin with this lettering. I tried to replace these by the following search function:

            (?i)<?xml[^>]*>

            But once replaced, the following remained in my script:
            <?

            How do you recommend I remove all instances of " <? " ?

            • guy038

              Hi, @carolin-marschke,

              I suppose that you're talking about the XML declaration, shown below, which generally starts an XML file:

              <?xml version="1.0" encoding="UTF-8" ?>

              Your trouble comes from the question mark character ( ? ), which is a special character in the regex world, also called a meta-character. By default, meta-characters are NOT treated as simple literal characters!


              There are two different sets of meta-characters :

              • Those, which are recognized anywhere in the pattern, EXCEPT within square brackets :

                • \ General escape character, with several uses

                • ^ Start of a line ( or a file )

                • $ End of a line ( or a file )

                • . Match any character, except new-line ones ( by default )

                • [ Start a class definition or a class range

                • | Start of an alternative branch

                • ( Start of a sub-pattern group

                • ) End of a sub-pattern group

                • { Start a Min / Max range of a quantifier

                • } End a Min / Max range of a quantifier

                • * 0 or more times, the preceding character or group

                • +

                  • 1 or more times, the preceding character or group

                  • Possessive behaviour of the quantifiers *, + and ?

                • ?

                  • 0 or 1 time, the preceding character or group

                  • Meaning extender, for groups or conditions, right after an opening parenthesis, as in (?....)

                  • Minimizer of the quantifiers *, + and ?

              • Those, which are recognized within square brackets ( character class ), EXCLUSIVELY :

                • \ General escape character

                • ^ Negate the class, if first character of the class

                • - Character range indicator

                • [: Start of a POSIX character class, if followed by regular POSIX syntax

                • :] End of a POSIX character class

              So, Carolin, if you need to search for any of the above characters, as a literal, you must escape it with the backslash character \

              Therefore, your regex must be rewritten as : (?i)<\?xml[^>]*>, with a \, right before the ? character !
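
The effect of the missing backslash can be reproduced with a short Python sketch. Python's `re` module is used here only as a convenient stand-in for the Notepad++ engine (an assumption; the two engines differ in other respects), and the variable names are hypothetical.

```python
import re

declaration = '<?xml version="1.0" encoding="UTF-8" ?>'

# Unescaped: "<?" means "an optional <", so the match can only start at "xml"
# and the leading "<?" is left behind after the replacement.
leftover = re.sub(r"(?i)<?xml[^>]*>", "", declaration)

# Escaped: "\?" is a literal question mark, so the whole declaration matches.
cleaned = re.sub(r"(?i)<\?xml[^>]*>", "", declaration)

# re.escape() adds the needed backslashes to a literal string automatically.
escaped_prefix = re.escape("<?xml")

print(repr(leftover), repr(cleaned), escaped_prefix)
```

The first substitution leaves `<?` behind, which is the leftover reported earlier in the thread; the second removes the whole declaration.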

              However, the regex (?-is)<\?xml.+?> gives better results! Because of the -s modifier, the dot matches standard characters only, and the -i modifier makes the search case-sensitive. So, after matching the literal string <?xml, in that exact case ( <\?xml ), it looks for the shortest non-empty range of characters ( .+? ) on the current line, up to a closing > symbol

              Suppose that the single XML line <?xml version="1.0" encoding="UTF-8" ?> were split across four lines, as below:

              <?xml ver
              sion="1.0" enc
              oding="UT
              F-8" ?>
              

              My regex doesn't match this malformed text, whereas yours does, because [^>] also matches line breaks! To get the same behaviour as mine, you should change your regex to (?i)<\?xml[^>\r\n]*>. This time, the class [^>\r\n] matches any character other than > AND other than the CR and LF line-ending characters :-))
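
The three behaviours can be compared side by side in Python. This is a sketch under assumptions: Python's `re` does not accept the bare `(?-is)` form, so the first pattern simply relies on Python's defaults (case-sensitive, dot not matching line breaks), and the variable names are hypothetical.

```python
import re

# The declaration split across four lines, as in the example above.
split_text = '<?xml ver\nsion="1.0" enc\noding="UT\nF-8" ?>'

# The dot stops at line breaks, so no ">" is reachable on the first line.
single_line_only = re.search(r"<\?xml.+?>", split_text)

# [^>] also consumes line breaks, so the four lines match as one block.
crosses_lines = re.search(r"(?i)<\?xml[^>]*>", split_text)

# Excluding \r and \n from the class restores the single-line behaviour.
class_without_eol = re.search(r"(?i)<\?xml[^>\r\n]*>", split_text)

print(single_line_only, crosses_lines is not None, class_without_eol)
```

Only the middle pattern finds a match on the split text; the other two searches return `None`.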

              Cheers,

              guy038

              The Community of users of the Notepad++ text editor.