    how to remove empty spaces from a particular tag (regular expression)

    • Robin Cruise

      Good day. I wrote this regular expression to remove the extra whitespace from a particular tag that contains many spaces and tabs:

      (?-s)(\G(?!^)|<p\s+class="oyric">)((?!</p>).)*?\K\s\s+ but it does not work well

      <p class="oyric">  Laurie Strode comes to   her final confrontation		 with Michael Myers, the   masked figure  who has haunted her 	  since she narrowly escaped.  </p>
      

      Output should be:

      <p class="oyric">Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped.</p>
      
      • guy038

        Hi, @Robin-cruise and All,

        You were not very far from the right solution! The way to replace something:

        • In a particular tag section, as <p>........</p>

        • In a particular tag section, with a particular class name, as <p class="test">Bla bla blah</p>

        has already been discussed in these posts :

        https://notepad-plus-plus.org/community/topic/15058/regex-remove-particular-words-from-tags-in-several-text-pages/10

        https://notepad-plus-plus.org/community/topic/15058/regex-remove-particular-words-from-tags-in-several-text-pages/12


        So the general regex solution, when you want to perform a search/replacement ONLY in an area located between two particular boundaries, is:

        SEARCH (?-s)(\G|BR)((?!ER).)*?\KSR        OR        (?s)(\G|BR)((?!ER).)*?\KSR

        REPLACE RR

        where :

        • BR ( Beginning Regex ) is the regex which defines the start of the defined zone, for the search/replacement

        • ER ( Ending Regex ) is the regex which defines the end of the defined zone, for the search/replacement

        • SR ( Search Regex ) is the regex which defines the expression to search, in any defined zone

        • RR ( Replace Regex ) is the regex which defines the expression replacing the Search Regex, in any defined zone

        In your case :

        BR = <p class="oyric">

        ER = </p>

        SR = ((?<=>)\h+|\h+(?=<|\h))

        RR = Nothing

        Notes :

        • SR is a search of any of the two alternatives, separated with the | symbol, and surrounded by parentheses, because of the lowest priority of the alternation symbol |

          • (?<=>)\h+ which tries to match any non-empty range of horizontal blank chars ( spaces or tabulations ), if preceded by the > symbol

          • \h+(?=<|\h) which tries to match any non-empty range of horizontal blank chars ( spaces or tabulations ), if followed by either the < symbol or another horizontal blank character

        • As all these blank characters matched have to be deleted, the replacement zone is just empty

        • First, the regex tries to find the string <p class="oyric">, followed by the shortest range of characters, possibly empty, ((?!</p>).)*?, up to the search regex explained above, with the condition that the string </p> must not be located at any position of this range

        • Due to the \K syntax, the regex engine resets its working location and forgets any previous match. So the final match is simply the part described above, ((?<=>)\h+|\h+(?=<|\h)) ( SR )

        • After this first match, it can only match the zero-length assertion \G, followed again by another shortest range, possibly empty, and so on, as just above

        • When the regex engine skips the ending boundary </p>, the \G assertion cannot be verified anymore, and the only way to match something else is to grab a <p class="oyric"> string again, further on

        • If you are only interested in single-line ranges BR.........ER, use the (?-s) modifier at the beginning of the search regex

        • If you may have some multi-line ranges BR.........ER, use the (?s) modifier at the beginning of the search regex
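
        For readers who want to try the zone idea outside Notepad++: Python's standard re module has no \G, \K or \h, so the sketch below swaps in a different, two-step technique (find each BR.........ER zone first, then run SR on the zone contents only). The helper name replace_in_zones and its parameters are made up for illustration, and [ \t] is only an approximation of \h.

```python
import re

def replace_in_zones(text, begin, end, search, repl):
    """Apply `search` -> `repl` only to the text between `begin` and `end`.

    Hypothetical helper: all four patterns are regex strings, and the
    boundaries themselves are copied through unchanged.
    """
    zone = re.compile(
        f"(?P<br>{begin})(?P<body>.*?)(?P<er>{end})",
        re.DOTALL,  # like (?s): a zone may span several lines
    )

    def fix_zone(match):
        body = re.sub(search, repl, match.group("body"))
        return match.group("br") + body + match.group("er")

    return zone.sub(fix_zone, text)


# [ \t] approximates \h; \A and \Z mark the two edges of one zone body.
SR = r"\A[ \t]+|[ \t]+(?=[ \t]|\Z)"

sample = (
    '<p class="oyric">  This is    a test </p>   '
    '<p class="Tag_3">bla       blah   blah </p>  '
    '<p class="oyric">  The    final   test   </p>'
)
cleaned = replace_in_zones(sample, '<p class="oyric">', "</p>", SR, "")
print(cleaned)
```

        Because SR only ever sees the contents of one zone, the lookbehind on > and the lookahead on < from the Notepad++ version become \A and \Z here. Anything outside a <p class="oyric"> zone is never passed to SR.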


        So, Robin, let’s imagine the sample text, below :

        <p class="oyric">  Laurie Strode comes to   her final confrontation      with Michael Myers, the   masked figure  who has haunted her     since she narrowly escaped.  </p>
        
        <p class="Tag_1">  Laurie Strode comes to   her final confrontation      with Michael Myers, the   masked figure  who has haunted her     since she narrowly escaped.  </p>
        
        <p class="oyric">            Laurie Strode         comes to her final confrontation with Michael     Myers, the masked figure  who has              haunted her since she            narrowly escaped.</p>
        
        <p class="Tag_2">bla    blah     blah   </p>
        
        <p class="oyric">  This is    a test </p>   <p class="Tag_3">bla       blah   blah </p>  <p class="oyric">  The    final   test   </p>
        
        
        <p class="oyric">  Laurie Strode comes to   her final
         confrontation      
         with Michael Myers, the   masked
         figure  who has haunted her     since she
         narrowly escaped.  </p>
        
        <p class="Tag_2">bla
            blah     
        ....blah   </p>
        
        <p class="oyric">     This is    an           
             other  test to verify    if the      regex
                   is correct         </p>
        

        Using the following regex S/R :

        SEARCH (?s)(\G|<p class="oyric">)((?!</p>).)*?\K((?<=>)\h+|\h+(?=<|\h))

        REPLACE Leave EMPTY

        You should get the expected text, below :

        <p class="oyric">Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped.</p>
        
        <p class="Tag_1">  Laurie Strode comes to   her final confrontation      with Michael Myers, the   masked figure  who has haunted her     since she narrowly escaped.  </p>
        
        <p class="oyric">Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped.</p>
        
        <p class="Tag_2">bla    blah     blah   </p>
        
        <p class="oyric">This is a test</p>   <p class="Tag_3">bla       blah   blah </p>  <p class="oyric">The final test</p>
        
        
        <p class="oyric">Laurie Strode comes to her final
         confrontation 
         with Michael Myers, the masked
         figure who has haunted her since she
         narrowly escaped.</p>
        
        <p class="Tag_2">bla
            blah     
        ....blah   </p>
        
        <p class="oyric">This is an 
         other test to verify if the regex
         is correct</p>
        

        Notes :

        • It’s easy to verify that blank characters have been removed ONLY in the <p class="oyric">..........</p> areas, whether they are single-line areas or multi-line blocks

        • However, note that if a line ends with blank chars and the next line also begins with blank characters, this regex S/R will keep one blank char both at the end of this line and at the beginning of the next line !

        • Finally, beware that, if words are separated with a mix of space and tabulation characters, only the final blank character will be kept !

        For instance from text :

        <p class="oyric">	   		Test	  	    #1  	    </p>  ( Last BLANK char = SPACE      ,before #1 )
        
        <p class="oyric">   	    Test  	  		#2		  	</p>  ( Last BLANK char = TABULATION, before #2 )
        

        You’ll obtain :

        <p class="oyric">Test #1</p>   ( SPACE      char between Test and #1 )
        
        <p class="oyric">Test	#2</p> ( TABULATION char between Test and #2 )
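
        The reason only the last blank survives can be seen in isolation. A minimal Python sketch, assuming [ \t] as a stand-in for \h (the variable names are illustrative only):

```python
import re

# [ \t] stands in for \h. The run is matched greedily, then the engine
# gives back exactly one character so that the lookahead can succeed.
squeeze = re.compile(r"[ \t]+(?=[ \t])")

space_last = "Test\t  \t    #1"   # last blank before #1 is a space
tab_last = "Test  \t  \t\t#2"     # last blank before #2 is a tab

kept_space = squeeze.sub("", space_last)
kept_tab = squeeze.sub("", tab_last)

# Alternative: replace the whole run, so the survivor is always a space.
normalized = re.sub(r"[ \t]+", " ", tab_last)

print(repr(kept_space), repr(kept_tab), repr(normalized))
```

        If a tab between words is not acceptable, replacing each whole run with one space, as in the last substitution, is one possible way around it.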
        

        Best Regards,

        guy038

        • Robin Cruise

          GREAT ! thank you very much ;)

          • Robin Cruise

            SEARCH (?s)(\G|<p class="oyric">)((?!</p>).)*?\K((?<=>)\h+|\h+(?=<|\h))

            REPLACE Leave EMPTY


            By the way, there is a small problem in your regex, guy038. I have only just noticed it.

            It seems that your regex also selects spaces outside the specified tag, and so it disturbs
            all my other lines.

            See a print screen:

            https://snag.gy/fRX1ZO.jpg

            or check the regex against this code: https://regex101.com/r/dDBYSk/1

            See what happens after “Replace all”. You will see that all tags with spaces before and after them are modified, not only the particular tag I want ( <p class="oyric"> ).

            • rinku singh

              I’m thinking about building a Swiss File Knife plugin.
              http://stahlworks.com/dev/swiss-file-knife.html

              • guy038

                Hello, @Robin-cruise and All,

                Ah yes ! My regex wasn’t accurate enough ! And worse, my formulation of the general regex solution was not exact either :-((


                So, the general regex solution, when you want to perform a Search/Replacement in a specific area, only, is :

                SEARCH (?-s)(\G|BR)((?!ER).)*?\KSR        OR        (?s)(\G|BR)((?!ER).)*?\KSR

                REPLACE RR

                where :

                • BR ( Beginning Regex ) is the regex which defines both the start of that specific area and the start for a possible Search Regex match

                • ER ( Excluded Regex ) is the regex which defines the characters and/or strings forbidden from the Beginning Regex position until the next Search Regex match. It implicitly defines a zone where the Search Regex may occur

                • SR ( Search Regex ) is the regex which defines the expression to search for, if both the Beginning Regex and the Excluded Regex conditions are TRUE

                • RR ( Replace Regex ) is the regex which defines the expression replacing the Search Regex


                In your case, we must look for unnecessary blank characters in a <..........> area that contains neither < nor >. Hence, the excluded chars are simply the two symbols < and >

                Now, inside that area, possibly multi-lines, we’ll look for either:

                • Case A : All blank characters at the beginning of lines, in case of a correct area split over several lines

                • Case B : All blank characters at the end of lines, in case of a correct area split over several lines

                • Case C : All blank characters, right after the > symbol

                • Case D : All blank characters, right before the </p> ending tag

                • Case E : All ranges of at least two blank characters, followed by a character that is neither a blank char nor a < symbol

                These 5 cases correspond to the different alternatives of the SR search regex, separated with the | symbol

                So, we have :

                BR = <p class="oyric">

                ER = <|>

                SR = (?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))

                RR = (?1$0)(?2\x20)

                Remarks :

                • The assertion \G is considered true ( at the current caret position ) during the first run of the regex. So, if you move the cursor inside the leading spaces of a line, on purpose, before running the regex S/R, it could wrongly match the remaining leading spaces located after the cursor position

                • So, in order to avoid these matches, I added, for case E, the restrictive condition (\h{2,})(?=[^<\h]), which must be true after the matched blank range !

                • Regarding the replacement :

                  • If case A occurs, we must keep the leading spaces, stored in group 1. So we rewrite the entire match => (?1$0)

                  • If case B, C or D occurs, we need to delete all these blank chars => Nothing is rewritten

                  • If case E occurs, we just replace all the matched blank chars, stored in group 2, with a single space character => (?2\x20)


                Finally, we get this new regex S/R :

                SEARCH (?s)(?:\G|<p class="oyric">)(?:(?!<|>).)*?\K(?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))

                REPLACE (?1$0)(?2\x20)

                which should avoid the side-effects of my first attempt ;-))
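
                Python's re module has no conditional replacement syntax like (?1$0)(?2\x20), so the sketch below expresses the same three-way choice with a replacement function. It covers only the SR / RR part: the zone restriction done by \G, BR and ER is left out on purpose, so as written it would touch every tag in the text. The names SR and rewrite are illustrative, and [ \t] approximates \h.

```python
import re

# The five alternatives of SR, in the same order as cases A to E.
SR = re.compile(
    r"(^[ \t]+)"                  # case A: leading blanks   (group 1)
    r"|[ \t]+$"                   # case B: trailing blanks
    r"|(?<=>)[ \t]+"              # case C: blanks right after >
    r"|[ \t]+(?=</p>)"            # case D: blanks right before </p>
    r"|([ \t]{2,})(?=[^< \t])",   # case E: inner run of 2+  (group 2)
    re.MULTILINE,                 # ^ and $ work line by line
)

def rewrite(match):
    """Stand-in for the replacement (?1$0)(?2\\x20)."""
    if match.group(1) is not None:   # case A: keep the match as it is
        return match.group(0)
    if match.group(2) is not None:   # case E: collapse to one space
        return " "
    return ""                        # cases B, C, D: delete

sample = '<p class="oyric">  This is    a  \n   test  </p>'
result = SR.sub(rewrite, sample)
print(repr(result))
```

                The leading blanks of the second line come back unchanged ( case A ), while every other run is either deleted or reduced to a single space.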

                Beware

                • If you decide, before performing this regex S/R, to move the cursor, on purpose, inside the <.........> area of a <p> tag, with a class different from oyric, it will, also, match all the additional blank characters of that <.........> zone. Can’t do anything about this !

                • Luckily, once the caret is located after that first zone <.........>, the behavior of the regex is, again, as expected :-))

                Cheers,

                guy038

                PS :

                I say, above, regarding the ER ( Excluded Regex ), that it, implicitly, defines a zone, where the Search Regex may occur

                Just an example to explain this notion. Let’s consider these 3 simple regexes :

                • [^<>]+(?=\h) , which searches for the longest range of chars other than < and >, if followed by a blank char

                • [^<]+(?=\h) , which searches for the longest range of chars other than <, if followed by a blank char

                • [^>]+(?=\h) , which searches for the longest range of chars other than >, if followed by a blank char

                Here is, below, the results, with any range of chars, underlined with - and the blank char, underlined with ^

                                                    REGEX  [^<>]+(?=\h)
                
                <p class="oyric">This     is    a   test  </p>   <p class="Tag_3">bla     blahh </p>  <p class="oyric">    a  test</p>
                 -               ------------------------^    --^ -^              -------------^    -^ -^              ------^
                
                
                
                                                    REGEX  [^<]+(?=\h)
                
                <p class="oyric">This     is    a   test  </p>   <p class="Tag_3">bla     blahh </p>  <p class="oyric">    a  test</p>
                 ----------------------------------------^ -----^ -----------------------------^ ----^ ----------------------^
                
                
                
                                                    REGEX  [^>]+(?=\h)
                
                <p class="oyric">This     is    a   test  </p>   <p class="Tag_3">bla     blahh </p>  <p class="oyric">    a  test</p>
                --^              ------------------------^    -----^              -------------^    ----^              ------^
                

                As we use the greedy quantifier +, it is easy to visualize the complete zones where it is allowed to look for a blank character, underlined with ^, in each case ;-))
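
                The same comparison can be reproduced on a shorter line. A minimal Python sketch, assuming [ \t] in place of \h; the helper name zones and the sample line are invented for this illustration:

```python
import re

def zones(char_class, line):
    """Return every greedy range of `char_class` chars followed by a blank."""
    pattern = re.compile(char_class + r"+(?=[ \t])")
    return [m.group(0) for m in pattern.finditer(line)]

line = '<p class="x">one  two </p>'

no_brackets = zones("[^<>]", line)   # neither < nor > allowed in a range
no_open = zones("[^<]", line)        # only < is excluded
no_close = zones("[^>]", line)       # only > is excluded

print(no_brackets)
print(no_open)
print(no_close)
```

                With [^<>] the ranges stay on one side of each bracket, whereas [^<] lets a single range run from the tag name straight through > into the text.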

                • Robin Cruise

                  well done, thank you very much

                  • Neculai I. Fantanaru

                    @guy038 said:

                    (?1$0)(?2\x20)

                    It can also be replaced with (?{2}$1 ), such as:

                    SEARCH: (?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))

                    REPLACE BY: (?{2}$1 )

                    • Neculai I. Fantanaru

                      @guy038 I just reviewed this post, because I like it and a post from today reminded me of it.

                      SEARCH (?s)(?:\G|<p class="oyric">)(?:(?!<|>).)*?\K(?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))

                      REPLACE (?1$0)(?2\x20)

                      What if I have another tag inside, like <em> ?

                      Then @Robin Cruise's scenario becomes:

                      <p class="oyric"> Laurie Strode comes to her final confrontation <em> with Michael Myers, the masked figure who has haunted her </em>since she narrowly escaped. </p>

                      In this case, your regex does not remove the extra spaces, because of the <em> tags.
