    Need help replacing HTML tags! Any help welcome.

    • Girlswhocode312

      I need help replacing and removing a series of HTML tags in 500+ Excel cells.

      • There are HTML tags that I need to replace with new tag names and there are tags that I would like to remove altogether.
      • The text I need to format is case sensitive, and in some cases, the tags contain capitalization. I need to replace all capitalized tags to lower-case lettering.
      • Lastly, some tags have additional formatting included within the tag, but I would like to simplify them to their most basic tag naming convention. An asterisk is noted in some tags to indicate wildcard HTML tags where I would like to replace all variations of that tag that start the way I indicated.

      There are variations of the following tags but here are just a few examples:

      Find and Replace:
      Find <DIV style…> and replace with <DIV>
      Find <STRONG> and replace with <b>
      Find <SPAN lang*> and replace with <span>

      Remove Altogether:
      <FONT Style…> and </FONT>
      <tbody*> and </tbody>

      • guy038

        Hello, @carolin-marschke,

        Reading your post carefully, I noticed a small contradiction! You said:

        I need to replace all capitalized tags to lower-case lettering.

        And, further on, the example :

        Find <DIV style…> and replace with <DIV>

        So, I suppose that the replacement should be <div>!

        Anyway, for all your text manipulations, here are some appropriate regexes to run, after opening the Replace dialog and checking the Regular expression search mode:


        • For changes such as:

        Find <DIV style…> and replace with <DIV>
        Find <SPAN lang*> and replace with <span>

        Use the regex S/R : SEARCH = (?i-s)<(SPAN|DIV).*?> and REPLACE = <\L\1>

        Notes :

        • Search is case-insensitive ( i ) and the dot matches any single character except line-break characters ( -s )

        • It looks for the tag SPAN or DIV, in any case, stored as group 1 ( <(SPAN|DIV) ), possibly followed by some characters, up to the nearest closing > character ( .*?> ), on the same line

        • In replacement, it just rewrites the group 1 tag ( \1 ), in lower case, due to the preceding \L syntax
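
For readers who want to try the same idea outside Notepad++, here is a minimal Python sketch of this search/replace. It is an illustration only, not part of the original answer: Python's `re` module has no `\L` operator, so a replacement function lowercases group 1 instead. The names `TAG_RE` and `simplify_tags` and the sample string are hypothetical.

```python
import re

# Same search as above: (?i-s)<(SPAN|DIV).*?>
# re.IGNORECASE stands in for (?i); without re.DOTALL the dot
# does not match line breaks, which corresponds to (-s).
TAG_RE = re.compile(r"<(SPAN|DIV).*?>", re.IGNORECASE)

def simplify_tags(text: str) -> str:
    """Drop attributes from opening span/div tags and lowercase the tag name."""
    # Python has no \L in replacement strings, so lowercase group 1 by hand.
    return TAG_RE.sub(lambda m: "<" + m.group(1).lower() + ">", text)

sample = '<DIV style="color:red"><SPAN lang="en">Hi</SPAN></DIV>'
print(simplify_tags(sample))  # <div><span>Hi</SPAN></DIV>
```

Note that the pattern only matches opening tags, so closing tags such as `</SPAN>` keep their original case and would need a separate replacement.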


        • For changes such as:

        Find <STRONG> and replace with <b>

        Use, for instance, the regex S/R : SEARCH = (?i)(<strong>)|(<em>) and REPLACE = (?1<b>)(?2<i>)

        Notes :

        • Simple replacement of one string by another. The general syntax is:

        SEARCH = (?i)(Word1)|(Word2)|(Word3)........|(Wordn)

        REPLACE = (?1subst1)(?2subst2)(?3subst3)........(?nsubstn)
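
The conditional replacement syntax `(?1subst1)(?2subst2)` is specific to the Notepad++ regex engine. As a rough Python equivalent, the sketch below builds one capturing group per search word and picks the substitute from the number of the group that matched. The mapping (including the closing-tag pairs) and the names `REPLACEMENTS`, `PATTERN` and `replace_tags` are assumptions made for illustration.

```python
import re

# Hypothetical mapping; the closing-tag pairs are added for illustration.
REPLACEMENTS = {
    "<strong>": "<b>",
    "</strong>": "</b>",
    "<em>": "<i>",
    "</em>": "</i>",
}

# Builds (<strong>)|(</strong>)|(<em>)|(</em>), one group per search word.
PATTERN = re.compile(
    "|".join("(" + re.escape(word) + ")" for word in REPLACEMENTS),
    re.IGNORECASE,
)
SUBSTITUTES = list(REPLACEMENTS.values())

def replace_tags(text: str) -> str:
    """Replace each matched word with the substitute of its group number."""
    # m.lastindex is the 1-based number of the group that matched.
    return PATTERN.sub(lambda m: SUBSTITUTES[m.lastindex - 1], text)

print(replace_tags("<STRONG>bold</STRONG> and <Em>italic</Em>"))
# <b>bold</b> and <i>italic</i>
```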


        • For deletions such as:

        <FONT Style…> and </FONT>
        <tbody*> and </tbody>

        Use the regex S/R : SEARCH = (?is)<(FONT|tbody).+?</\1>\R? and REPLACE = EMPTY

        Notes :

        • Search is case-insensitive ( i ) and the dot matches any character, standard or EOL ( s )

        • It looks for the tag FONT or tbody, in any case, stored as group 1 ( <(FONT|tbody) ), up to the nearest closing tag IDENTICAL to group 1 ( </\1> ), possibly followed by EOL character(s) ( \R? )

        • As the replacement zone is empty, the block <Tag...>.......</Tag>, which most of the time spans several lines, is simply deleted
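
A minimal Python sketch of this deletion, for illustration only. Python's `re` does not support `\R`, so the optional trailing line break is spelled out as an alternation; `BLOCK_RE`, `delete_blocks` and the sample text are hypothetical names and data. As in the regex above, everything between the opening and closing tag is removed along with the tags themselves.

```python
import re

# Same idea as (?is)<(FONT|tbody).+?</\1>\R?
# re.DOTALL stands in for (?s); (?:\r\n|\r|\n)? replaces \R?,
# which Python's re module does not support.
BLOCK_RE = re.compile(
    r"<(FONT|tbody).+?</\1>(?:\r\n|\r|\n)?",
    re.IGNORECASE | re.DOTALL,
)

def delete_blocks(text: str) -> str:
    """Delete each <FONT ...>...</FONT> or <tbody ...>...</tbody> block."""
    return BLOCK_RE.sub("", text)

sample = 'keep\n<FONT style="x">drop\nthis</FONT>\nkeep too'
print(repr(delete_blocks(sample)))  # 'keep\nkeep too'
```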


        Hoping that this first attempt will be useful to you !

        Best Regards,

        guy038

        • Girlswhocode312

          Hello @guy038 ,

          Thank you very much for your response. Also, you are correct in catching my mistake. Thanks for pointing that out. I will use your suggestions and get back to you shortly to let you know if those worked for me.

          Best,

          Carolin

          • Girlswhocode312

            @guy038 ,

            Hello! Thank you again for your help. I am now stumbling upon an issue with the following tag (any possible variations of it after the first few characters):

            <?xml *>

            There may be variations of strings that begin with this lettering. I tried to replace these by the following search function:

            (?i)<?xml[^>]*>

            But once replaced, the following remained in my script:
            <?

            How do you recommend I remove all instances of " <? " ?

            • guy038

              Hi, @carolin-marschke,

              I suppose that you're speaking about the tag below, which generally starts an XML file:

              <?xml version="1.0" encoding="UTF-8" ?>

              Your troubles come from the Question mark character ( ? ), which is considered, in the regex universe, as a special character, also called a meta-character. So, by default, these characters are NOT simple literal characters !


              There are two different sets of meta-characters :

              • Those, which are recognized anywhere in the pattern, EXCEPT within square brackets :

                • \ General escape character, with several uses

                • ^ Start of a line ( or a file )

                • $ End of a line ( or a file )

                • . Match any character, except new-line ones ( by default )

                • [ Start a class definition or a class range

                • | Start of an alternative branch

                • ( Start of a sub-pattern group

                • ) End of a sub-pattern group

                • { Start a Min / Max range of a quantifier

                • } End of a Min / Max range of a quantifier

                • * 0 or more times, the preceding character or group

                • +

                  • 1 or more times, the preceding character or group

                  • Possessive behaviour of the quantifiers *, + and ?

                • ?

                  • 0 or 1 time, the preceding character or group

                  • Meaning extender for groups or conditions, as in (?....)

                  • Minimizer of the quantifiers *, + and ?

              • Those, which are recognized within square brackets ( character class ), EXCLUSIVELY :

                • \ General escape character

                • ^ Negate the class, if first character of the class

                • - Character range indicator

                • [: Start of a POSIX character class, if followed by regular POSIX syntax

                • :] End of a POSIX character class

              So, Carolin, if you need to search for any of the above characters, as a literal, you must escape it with the backslash character \

              Therefore, your regex must be rewritten as : (?i)<\?xml[^>]*>, with a \, right before the ? character !
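
To see why the unescaped version leaves `<?` behind, here is a small Python sketch (an illustration, assuming Python's `re` treats `?` the same way here, which it does for this pattern). In `<?xml`, the `?` makes the `<` optional, so the match starts at `xml` and the two leading characters are never consumed. The variable names are hypothetical.

```python
import re

line = '<?xml version="1.0" encoding="UTF-8" ?>'

# Unescaped: "<?" means "an optional <", so the match begins at "xml".
unescaped = re.compile(r"<?xml[^>]*>", re.IGNORECASE)

# Escaped: "\?" is a literal question mark, so "<?xml" is matched as text.
escaped = re.compile(r"<\?xml[^>]*>", re.IGNORECASE)

print(repr(unescaped.sub("", line)))  # '<?'
print(repr(escaped.sub("", line)))    # ''

# re.escape() adds the needed backslashes when the literal text
# comes from a variable rather than being typed by hand.
built = re.compile(re.escape("<?xml") + r"[^>]*>", re.IGNORECASE)
print(repr(built.sub("", line)))
```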

              However, the regex (?-is)<\?xml.+?> gives better results! Because of the -s modifier, the dot matches non-EOL characters only, and because of the -i modifier, the search is case-sensitive. So, after matching the literal string <?xml, with that exact case ( <\?xml ), it looks for the shortest non-empty range of characters ( .+? ) on the current line, up to a closing > symbol.

              Suppose that the single XML line <?xml version="1.0" encoding="UTF-8" ?> were split into four lines, as below:

              <?xml ver
              sion="1.0" enc
              oding="UT
              F-8" ?>
              

              My regex doesn’t match this incorrect text, unlike yours! To get the same behaviour, you should change your regex to (?i)<\?xml[^>\r\n]*>. This time, the syntax [^>\r\n] matches any character other than > and other than any EOL character :-))
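
The difference between the three variants can be checked with a short Python sketch. This is only an illustration under the assumption that Python's `re` behaves like the Notepad++ engine for these particular patterns; the variable names are hypothetical.

```python
import re

# The declaration wrongly split over four lines, as in the example above.
split_text = '<?xml ver\nsion="1.0" enc\noding="UT\nF-8" ?>\n'

# [^>] also matches line breaks, so this pattern runs across lines.
crosses_lines = re.compile(r"<\?xml[^>]*>", re.IGNORECASE)

# The dot (without DOTALL) stops at a line break, like (?-s).
dot_version = re.compile(r"<\?xml.+?>")

# Excluding \r and \n from the class keeps the match on one line.
single_line = re.compile(r"<\?xml[^>\r\n]*>", re.IGNORECASE)

print(bool(crosses_lines.search(split_text)))  # True
print(bool(dot_version.search(split_text)))    # False
print(bool(single_line.search(split_text)))    # False
```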

              Cheers,

              guy038

              The Community of users of the Notepad++ text editor.