
    how to remove empty spaces from a particular tag (regular expression)

    • Robin Cruise

      Good day. I wrote this regular expression to remove the extra whitespace from a particular tag that contains many spaces and tabs:

      (?-s)(\G(?!^)|<p\s+class="oyric">)((?!</p>).)*?\K\s\s+

      but it does not work well.

      <p class="oyric">  Laurie Strode comes to   her final confrontation		 with Michael Myers, the   masked figure  who has haunted her 	  since she narrowly escaped.  </p>
      

      Output should be:

      <p class="oyric">Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped.</p>
      
      • guy038

        Hi, @Robin-cruise and All,

        You were not far from the right solution ! The way to replace something :

        • In a particular tag section, such as <p>........</p>

        • In a particular tag section with a particular class name, such as <p class="test">Bla bla blah</p>

        has already been discussed in these posts :

        https://notepad-plus-plus.org/community/topic/15058/regex-remove-particular-words-from-tags-in-several-text-pages/10

        https://notepad-plus-plus.org/community/topic/15058/regex-remove-particular-words-from-tags-in-several-text-pages/12


        So, the general regex solution, when you want to perform a search/replacement ONLY in an area located between two particular boundaries, is :

        SEARCH (?-s)(\G|BR)((?!ER).)*?\KSR        OR        (?s)(\G|BR)((?!ER).)*?\KSR

        REPLACE RR

        where :

        • BR ( Beginning Regex ) is the regex which defines the start of the zone for the search/replacement

        • ER ( Ending Regex ) is the regex which defines the end of the zone for the search/replacement

        • SR ( Search Regex ) is the regex which defines the expression to search for, inside any defined zone

        • RR ( Replace Regex ) is the regex which defines the expression replacing the Search Regex match, inside any defined zone

        In your case :

        BR = <p class="oyric">

        ER = </p>

        SR = ((?<=>)\h+|\h+(?=<|\h))

        RR = Nothing

        Notes :

        • SR is a search for either of two alternatives, separated with the | symbol and surrounded by parentheses, because the alternation symbol | has the lowest priority

          • (?<=>)\h+ which tries to match any non-empty range of horizontal blank chars ( spaces or tabulations ), if preceded by the > symbol

          • \h+(?=<|\h) which tries to match any non-empty range of horizontal blank chars ( spaces or tabulations ), if followed by either the < symbol or a final horizontal blank character

        • As all the matched blank characters have to be deleted, the replacement zone is just empty

        • First, the regex tries to find the string <p class="oyric">, followed by the shortest range, even empty, of characters, .*?, up to the search regex explained above, with the condition that the string </p> must not occur at any position of this range

        • Due to the \K syntax, the regex engine resets its working location and forgets everything matched so far. So the final match is simply the part described above, ((?<=>)\h+|\h+(?=<|\h)) ( SR )

        • After this first match, it can only match the zero-length assertion \G, followed again by another shortest range, even empty, of characters, exactly as above !

        • When the regex engine gets past the ending boundary </p>, the \G assertion cannot be verified anymore and the only way to match something else is to find another <p class="oyric"> string, further on !

        • If you are only interested in single-line ranges BR.........ER, use the (?-s) modifier at the beginning of the search regex

        • If you may have some multi-line ranges BR.........ER, use the (?s) modifier at the beginning of the search regex
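For anyone who wants to try the BR / ER / SR / RR idea outside Notepad++, here is a minimal Python sketch. It is an assumption-laden stand-in, not the regex above: Python's standard `re` module has no \G, \K or \h, so the sketch swaps in a two-step technique (find each BR...ER zone, then run SR only inside it) and writes \h as [ \t]. The names `ZONE` and `replace_in_zones` are made up for the illustration.

```python
import re

# Assumption: Python's standard `re` has no \G, \K or \h, so the zone logic
# is emulated in two steps; this is NOT the Notepad++ regex itself.
BR = r'<p class="oyric">'   # Beginning Regex
ER = r'</p>'                # Ending Regex
SR = r'(?<=>)[ \t]+|[ \t]+(?=<|[ \t])'   # Search Regex, with \h written as [ \t]
RR = ''                     # Replace Regex: nothing

# Step 1: one zone is BR, the shortest possible run of characters, then ER.
ZONE = re.compile(BR + r'.*?' + ER, re.DOTALL)  # DOTALL plays the role of (?s)


def replace_in_zones(text: str) -> str:
    """Step 2: apply SR -> RR inside each zone and leave everything else as is."""
    return ZONE.sub(lambda zone: re.sub(SR, RR, zone.group(0)), text)


sample = '<p class="oyric">  a   b\t c  </p> <p class="Tag_1">  a   b  </p>'
print(replace_in_zones(sample))
```

Because the replacement only ever sees the text of a matched zone, anything outside a <p class="oyric">...</p> block is left untouched by construction.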


        So, Robin, let’s imagine the sample text, below :

        <p class="oyric">  Laurie Strode comes to   her final confrontation      with Michael Myers, the   masked figure  who has haunted her     since she narrowly escaped.  </p>
        
        <p class="Tag_1">  Laurie Strode comes to   her final confrontation      with Michael Myers, the   masked figure  who has haunted her     since she narrowly escaped.  </p>
        
        <p class="oyric">            Laurie Strode         comes to her final confrontation with Michael     Myers, the masked figure  who has              haunted her since she            narrowly escaped.</p>
        
        <p class="Tag_2">bla    blah     blah   </p>
        
        <p class="oyric">  This is    a test </p>   <p class="Tag_3">bla       blah   blah </p>  <p class="oyric">  The    final   test   </p>
        
        
        <p class="oyric">  Laurie Strode comes to   her final
         confrontation      
         with Michael Myers, the   masked
         figure  who has haunted her     since she
         narrowly escaped.  </p>
        
        <p class="Tag_2">bla
            blah     
        ....blah   </p>
        
        <p class="oyric">     This is    an           
             other  test to verify    if the      regex
                   is correct         </p>
        

        Using the following regex S/R :

        SEARCH (?s)(\G|<p class="oyric">)((?!</p>).)*?\K((?<=>)\h+|\h+(?=<|\h))

        REPLACE Leave EMPTY

        You should get the expected text, below :

        <p class="oyric">Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped.</p>
        
        <p class="Tag_1">  Laurie Strode comes to   her final confrontation      with Michael Myers, the   masked figure  who has haunted her     since she narrowly escaped.  </p>
        
        <p class="oyric">Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped.</p>
        
        <p class="Tag_2">bla    blah     blah   </p>
        
        <p class="oyric">This is a test</p>   <p class="Tag_3">bla       blah   blah </p>  <p class="oyric">The final test</p>
        
        
        <p class="oyric">Laurie Strode comes to her final
         confrontation 
         with Michael Myers, the masked
         figure who has haunted her since she
         narrowly escaped.</p>
        
        <p class="Tag_2">bla
            blah     
        ....blah   </p>
        
        <p class="oyric">This is an 
         other test to verify if the regex
         is correct</p>
        

        Notes :

        • It’s easy to verify that blank characters have been removed ONLY in the <p class="oyric">..........</p> areas, whether they are single-line areas or multi-line blocks

        • However, note that if a line ends with blank chars and the next line also begins with blank characters, this regex S/R will keep one blank char both at the end of that line and at the beginning of the next line !

        • Finally, beware that if words are separated with a mix of space and tabulation characters, only the final blank character will be kept !

        For instance, from the text :

        <p class="oyric">	   		Test	  	    #1  	    </p>  ( Last BLANK char = SPACE      ,before #1 )
        
        <p class="oyric">   	    Test  	  		#2		  	</p>  ( Last BLANK char = TABULATION, before #2 )
        

        You’ll obtain :

        <p class="oyric">Test #1</p>   ( SPACE      char between Test and #1 )
        
        <p class="oyric">Test	#2</p> ( TABULATION char between Test and #2 )
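The "only the final blank character is kept" behavior can be checked quickly with a small Python sketch. It applies only the SR part to a fragment, with \h written as [ \t] (an assumption, since Python's `re` has no \h); the helper name `squeeze` is made up for the illustration.

```python
import re

# SR from the post, with \h approximated as [ \t] (assumption).
SR = re.compile(r'(?<=>)[ \t]+|[ \t]+(?=<|[ \t])')


def squeeze(fragment: str) -> str:
    """Delete every SR match in a single <p class="oyric">...</p> fragment."""
    return SR.sub('', fragment)


# Mixed run whose last blank is a SPACE, then one whose last blank is a TAB.
print(repr(squeeze('<p class="oyric">\t Test \t #1 \t</p>')))
print(repr(squeeze('<p class="oyric"> \tTest \t\t#2\t </p>')))
```

The greedy \h+ must leave one blank for the look-ahead to see, so whichever character ends the run is the one that survives.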
        

        Best Regards,

        guy038

        • Robin Cruise

          GREAT ! thank you very much ;)

          • Robin Cruise

            SEARCH (?s)(\G|<p class="oyric">)((?!</p>).)*?\K((?<=>)\h+|\h+(?=<|\h))

            REPLACE Leave EMPTY


            By the way, there is a small problem in your regex, guy038. I have just noticed it.

            It seems that your regex also selects spaces outside the specified tag, and alters
            all my other lines.

            See a screenshot:

            https://snag.gy/fRX1ZO.jpg

            or check the regex on this code: https://regex101.com/r/dDBYSk/1

            See what happens after “Replace all”. You will see that all tags with spaces before and after them are modified, not only the particular tag I want (<p class="oyric">).

            • rinku singh

              I’m thinking about building Swiss File Knife plugins.
              http://stahlworks.com/dev/swiss-file-knife.html

              • guy038

                Hello, @Robin-cruise and All,

                Ah yes ! My regex wasn’t accurate enough ! And worse, my formulation of the general regex solution was not exact either :-((


                So, the general regex solution, when you want to perform a search/replacement in a specific area only, is :

                SEARCH (?-s)(\G|BR)((?!ER).)*?\KSR        OR        (?s)(\G|BR)((?!ER).)*?\KSR

                REPLACE RR

                where :

                • BR ( Beginning Regex ) is the regex which defines both the start of that specific area and the starting point for a possible Search Regex match

                • ER ( Excluded Regex ) is the regex which defines the characters and/or strings that are forbidden from the Beginning Regex position up to the next Search Regex match. It implicitly defines a zone where the Search Regex may occur

                • SR ( Search Regex ) is the regex which defines the expression to search for, provided that both the Beginning Regex and the Excluded Regex conditions are TRUE

                • RR ( Replace Regex ) is the regex which defines the expression replacing the Search Regex match


                In your case, we must look for unnecessary blank characters in a <..........> area, without any < or > inside. Hence, the excluded chars are simply the two symbols < and >

                Now, inside that area, possibly multi-line, we’ll look for either :

                • Case A : All blank characters at the beginning of lines, in case of a correct area split over several lines

                • Case B : All blank characters at the end of lines, in case of a correct area split over several lines

                • Case C : All blank characters right after the > symbol

                • Case D : All blank characters right before the </p> ending tag

                • Case E : All ranges of at least two blank characters, followed by a character which is neither a blank char nor a < symbol

                These 5 cases correspond to the different alternatives of the SR search regex, separated with the | symbol

                So, we have :

                BR = <p class="oyric">

                ER = <|>

                SR = (?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))

                RR = (?1$0)(?2\x20)

                Remarks :

                • The assertion \G is considered true at the current position of the caret, during the first run of the regex. So, if you move the cursor inside the leading spaces of a line, on purpose, before running the regex S/R, it could wrongly match the remaining leading spaces located after the cursor position

                • So, in order to avoid these matches, I added, for case E, the restrictive condition (\h{2,})(?=[^<\h]), which must be true after the matched blank range !

                • Regarding the replacement :

                  • If case A occurs, we must keep the leading spaces, stored in group 1. So, we rewrite the entire match => (?1$0)

                  • If cases B, C or D occur, we need to delete all these blank chars => Nothing is rewritten

                  • If case E occurs, we just replace all the matched blank chars, stored in group 2, with a single space character => (?2\x20)


                Finally, we get this new regex S/R :

                SEARCH (?s)(?:\G|<p class="oyric">)(?:(?!<|>).)*?\K(?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))

                REPLACE (?1$0)(?2\x20)

                which should avoid the side-effects of my first attempt ;-))
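For readers more at home in a scripting language, the five cases and the conditional replacement can be mirrored in a short Python sketch. This is only an approximation under stated assumptions: Python's `re` has no \G, \K, \h or conditional replacement syntax, so the zone is found first with a separate pattern, \h is written as [ \t], lines are assumed to end with \n, and a callback (here called `rr`, a made-up name, like `ZONE` and `clean`) plays the role of (?1$0)(?2\x20).

```python
import re

# Assumptions: \h is written as [ \t], lines end with \n, and a zone is an
# "oyric" paragraph whose text contains no < or > (mirroring ER = <|>).
ZONE = re.compile(r'<p class="oyric">[^<>]*</p>')

SR = re.compile(
    r'(^[ \t]+)'                   # case A: leading blanks of a line (group 1)
    r'|[ \t]+$'                    # case B: trailing blanks of a line
    r'|(?<=>)[ \t]+'               # case C: blanks right after >
    r'|[ \t]+(?=</p>)'             # case D: blanks right before </p>
    r'|([ \t]{2,})(?=[^< \t])',    # case E: run of 2+ blanks inside the text (group 2)
    re.MULTILINE,
)


def rr(match: re.Match) -> str:
    # Callback standing in for the conditional replacement of the post.
    if match.group(1) is not None:   # case A: rewrite the match unchanged
        return match.group(0)
    if match.group(2) is not None:   # case E: collapse to one space
        return ' '
    return ''                        # cases B, C, D: delete


def clean(text: str) -> str:
    """Apply SR/rr inside each zone only; text outside zones is untouched."""
    return ZONE.sub(lambda zone: SR.sub(rr, zone.group(0)), text)


sample = '<p class="oyric">  This is    a \n   test  </p> <p class="x">  b   c  </p>'
print(repr(clean(sample)))
```

Note one difference from the Notepad++ version: because `ZONE` requires the whole paragraph to be free of < and >, a paragraph containing a nested tag is skipped entirely by this sketch.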

                Beware

                • If you decide, before performing this regex S/R, to move the cursor, on purpose, inside the <.........> area of a <p> tag with a class different from oyric, it will also match all the additional blank characters of that <.........> zone. Can’t do anything about this !

                • Luckily, once the caret is located after that first <.........> zone, the behavior of the regex is, again, as expected :-))

                Cheers,

                guy038

                PS :

                I said above, regarding the ER ( Excluded Regex ), that it implicitly defines a zone where the Search Regex may occur

                Just an example to explain this notion. Let’s consider these 3 simple regexes :

                • [^<>]+(?=\h) , which searches for the longest range of chars other than < and >, if followed by a blank char

                • [^<]+(?=\h) , which searches for the longest range of chars other than <, if followed by a blank char

                • [^>]+(?=\h) , which searches for the longest range of chars other than >, if followed by a blank char

                Here are the results, below, with each range of chars underlined with - and the blank char underlined with ^

                                                    REGEX  [^<>]+(?=\h)
                
                <p class="oyric">This     is    a   test  </p>   <p class="Tag_3">bla     blahh </p>  <p class="oyric">    a  test</p>
                 -               ------------------------^    --^ -^              -------------^    -^ -^              ------^
                
                
                
                                                    REGEX  [^<]+(?=\h)
                
                <p class="oyric">This     is    a   test  </p>   <p class="Tag_3">bla     blahh </p>  <p class="oyric">    a  test</p>
                 ----------------------------------------^ -----^ -----------------------------^ ----^ ----------------------^
                
                
                
                                                    REGEX  [^>]+(?=\h)
                
                <p class="oyric">This     is    a   test  </p>   <p class="Tag_3">bla     blahh </p>  <p class="oyric">    a  test</p>
                --^              ------------------------^    -----^              -------------^    ----^              ------^
                

                As we use the greedy quantifier +, it is easy to visualize the complete zones where it is allowed to look for a blank character, underlined with ^, in each case ;-))
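The same comparison can be reproduced on a shorter string with a few lines of Python. This is just an illustrative sketch: \h is approximated as [ \t] because Python's `re` has no \h, and the sample string is made up rather than taken from the diagrams.

```python
import re

# Assumption: \h is approximated as [ \t]; the sample string is hypothetical.
sample = '<b class="x">a  b </b>'

patterns = (r'[^<>]+(?=[ \t])', r'[^<]+(?=[ \t])', r'[^>]+(?=[ \t])')

# For each pattern, list every range of allowed chars that ends right before a blank.
results = {pattern: re.findall(pattern, sample) for pattern in patterns}

for pattern, ranges in results.items():
    print(pattern, ranges)
```

Excluding both symbols confines each range to one side of a tag boundary, while excluding only < or only > lets the range run across the other symbol.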

                • Robin Cruise

                  well done, thank you very much

                  • Neculai I. Fantanaru

                    @guy038 said:

                    (?1$0)(?2\x20)

                    Also, it can be replaced with (?{2}$1 ), such as:

                    SEARCH: (?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))

                    REPLACE BY: (?{2}$1 )

                    • Neculai I. Fantanaru

                      @guy038 I just reviewed this post, because I like it and a post today reminded me of the same thing.

                      SEARCH (?s)(?:\G|<p class="oyric">)(?:(?!<|>).)*?\K(?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))

                      REPLACE (?1$0)(?2\x20)

                      What if I have another tag inside, like <em> ?

                      Then @Robin Cruise’s scenario becomes:

                      <p class="oyric"> Laurie Strode comes to her final confrontation <em> with Michael Myers, the masked figure who has haunted her </em>since she narrowly escaped. </p>

                      In this case, your regex does not remove the empty spaces, because of those <em> tags.

                        The Community of users of the Notepad++ text editor.