how to remove empty spaces from a particular tag (regular expression)

  • R
    Robin Cruise
    last edited by Robin Cruise Oct 30, 2018, 7:35 AM Oct 30, 2018, 7:34 AM

    Good day. I made this regular expression to remove the extra spaces and tabs inside one particular tag:

    (?-s)(\G(?!^)|<p\s+class="oyric">)((?!</p>).)*?\K\s\s+

    but it does not work well. Sample input:

    <p class="oyric">  Laurie Strode comes to   her final confrontation		 with Michael Myers, the   masked figure  who has haunted her 	  since she narrowly escaped.  </p>
    

    Output should be:

    <p class="oyric">Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped.</p>
    
    • G
      guy038
      last edited by guy038 Oct 30, 2018, 7:40 PM Oct 30, 2018, 12:40 PM

      Hi, @Robin-cruise and All,

      You were not far from the right solution! How to replace something:

      • inside a particular tag section, such as <p>........</p>

      • inside a particular tag section with a particular class name, such as <p class="test">Bla bla blah</p>

      has already been discussed in these posts:

      https://notepad-plus-plus.org/community/topic/15058/regex-remove-particular-words-from-tags-in-several-text-pages/10

      https://notepad-plus-plus.org/community/topic/15058/regex-remove-particular-words-from-tags-in-several-text-pages/12


      So, the general regex solution for performing a search/replacement ONLY in an area located between two particular boundaries is:

      SEARCH (?-s)(\G|BR)((?!ER).)*?\KSR        OR        (?s)(\G|BR)((?!ER).)*?\KSR

      REPLACE RR

      where :

      • BR ( Beginning Regex ) is the regex which defines the start of the zone where the search/replacement applies

      • ER ( Ending Regex ) is the regex which defines the end of that zone

      • SR ( Search Regex ) is the regex which defines the expression to search for inside any such zone

      • RR ( Replace Regex ) is the expression which replaces each Search Regex match inside any such zone

      In your case :

      BR = <p class="oyric">

      ER = </p>

      SR = ((?<=>)\h+|\h+(?=<|\h))

      RR = Nothing

      Notes :

      • SR is an alternation of two branches, separated by the | symbol and wrapped in parentheses, because alternation has the lowest precedence

        • (?<=>)\h+ matches any non-empty run of horizontal blank characters ( spaces or tabs ), if preceded by the > symbol

        • \h+(?=<|\h) matches any non-empty run of horizontal blank characters ( spaces or tabs ), if followed by either the < symbol or another horizontal blank character

      • As all the matched blank characters have to be deleted, the replacement zone is simply empty

      • First, the regex looks for the string <p class="oyric">, followed by the shortest run of characters, possibly empty, ((?!</p>).)*?, up to a match of the search regex explained above, with the condition that the string </p> must not occur at any position of that run

      • Due to the \K syntax, the regex engine resets the start of the match and discards everything matched so far. So the final match is simply the part described above, ((?<=>)\h+|\h+(?=<|\h)) ( SR )

      • After this first match, the regex can only match the zero-length assertion \G, followed again by the shortest possible run, even empty, of characters not containing </p>, up to the next SR match

      • Once the regex engine has passed the ending boundary </p>, \G can no longer be satisfied, and the only way to match something else is to find another <p class="oyric"> string further on

      • If you are only interested in single-line ranges BR.........ER, use the (?-s) modifier at the beginning of the search regex

      • If you may have some multi-line ranges BR.........ER, use the (?s) modifier at the beginning of the search regex
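The \G and \K constructs are not available in every regex engine. As a rough cross-check outside Notepad++, the same zone-restricted replacement can be sketched in Python, whose standard re module supports neither \G, \K nor \h. This sketch swaps in a different technique: it first matches each whole BR...ER zone, then runs SR only on the matched text. The name replace_in_zones and the [ \t] stand-in for \h are assumptions made for this illustration, and unlike the Notepad++ regex it skips a zone whose ER never appears.

```python
import re

def replace_in_zones(text, br, er, sr, rr, dotall=True):
    """Apply sr -> rr only inside zones that start with br and end with er."""
    flags = re.DOTALL if dotall else 0          # plays the role of (?s) / (?-s)
    # A zone is BR, the shortest run of characters containing no ER, then ER.
    zone = re.compile(f"{br}(?:(?!{er}).)*?{er}", flags)
    # SR runs on the whole zone text, so its look-arounds still see the
    # closing ">" of BR and the opening "<" of ER.
    return zone.sub(lambda m: re.sub(sr, rr, m.group(0)), text)

# [ \t] replaces \h, which Python's re module does not provide.
SR = r"(?<=>)[ \t]+|[ \t]+(?=[ \t<])"

sample = '<p class="oyric">  a   b\t c  </p> <p class="x">  a   b </p>'
print(replace_in_zones(sample, r'<p class="oyric">', r"</p>", SR, ""))
```

Only the first paragraph is cleaned; the second one, which has another class, is left untouched.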


      So, Robin, let’s imagine the sample text, below :

      <p class="oyric">  Laurie Strode comes to   her final confrontation      with Michael Myers, the   masked figure  who has haunted her     since she narrowly escaped.  </p>
      
      <p class="Tag_1">  Laurie Strode comes to   her final confrontation      with Michael Myers, the   masked figure  who has haunted her     since she narrowly escaped.  </p>
      
      <p class="oyric">            Laurie Strode         comes to her final confrontation with Michael     Myers, the masked figure  who has              haunted her since she            narrowly escaped.</p>
      
      <p class="Tag_2">bla    blah     blah   </p>
      
      <p class="oyric">  This is    a test </p>   <p class="Tag_3">bla       blah   blah </p>  <p class="oyric">  The    final   test   </p>
      
      
      <p class="oyric">  Laurie Strode comes to   her final
       confrontation      
       with Michael Myers, the   masked
       figure  who has haunted her     since she
       narrowly escaped.  </p>
      
      <p class="Tag_2">bla
          blah     
      ....blah   </p>
      
      <p class="oyric">     This is    an           
           other  test to verify    if the      regex
                 is correct         </p>
      

      Using the following regex S/R :

      SEARCH (?s)(\G|<p class="oyric">)((?!</p>).)*?\K((?<=>)\h+|\h+(?=<|\h))

      REPLACE Leave EMPTY

      You should get the expected text, below :

      <p class="oyric">Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped.</p>
      
      <p class="Tag_1">  Laurie Strode comes to   her final confrontation      with Michael Myers, the   masked figure  who has haunted her     since she narrowly escaped.  </p>
      
      <p class="oyric">Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped.</p>
      
      <p class="Tag_2">bla    blah     blah   </p>
      
      <p class="oyric">This is a test</p>   <p class="Tag_3">bla       blah   blah </p>  <p class="oyric">The final test</p>
      
      
      <p class="oyric">Laurie Strode comes to her final
       confrontation 
       with Michael Myers, the masked
       figure who has haunted her since she
       narrowly escaped.</p>
      
      <p class="Tag_2">bla
          blah     
      ....blah   </p>
      
      <p class="oyric">This is an 
       other test to verify if the regex
       is correct</p>
      

      Notes :

      • It’s easy to verify that blank characters have been removed ONLY in the <p class="oyric">..........</p> areas, whether they are single-line areas or multi-line blocks

      • However, note that if a line ends with blank characters and the next line also begins with blank characters, this regex S/R keeps one blank character both at the end of the first line and at the beginning of the next line!

      • Finally, beware that if words are separated by a mix of space and tab characters, only the last blank character of each run is kept!

      For instance, from the text:

      <p class="oyric">	   		Test	  	    #1  	    </p>  ( Last BLANK char = SPACE      ,before #1 )
      
      <p class="oyric">   	    Test  	  		#2		  	</p>  ( Last BLANK char = TABULATION, before #2 )
      

      You’ll obtain :

      <p class="oyric">Test #1</p>   ( SPACE      char between Test and #1 )
      
      <p class="oyric">Test	#2</p> ( TABULATION char between Test and #2 )
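The "last blank character is kept" behaviour follows from the look-ahead in \h+(?=<|\h): the engine has to give back one blank so that the look-ahead can see it. A small Python sketch, assuming [ \t] as a stand-in for \h, reproduces this on two shortened fragments. The helper names squeeze and squeeze_to_space are hypothetical, and squeeze_to_space is only one possible variant for always ending up with a plain space, not something proposed in this thread.

```python
import re

# [ \t] stands in for \h, which Python's re module does not provide.
SR = re.compile(r"(?<=>)[ \t]+|[ \t]+(?=[ \t<])")

def squeeze(fragment):
    """Delete what SR matches; the last blank of each inner run survives."""
    return SR.sub("", fragment)

def squeeze_to_space(fragment):
    """Hypothetical variant: always leave a plain space between words."""
    trimmed = re.sub(r"(?<=>)[ \t]+|[ \t]+(?=<)", "", fragment)
    return re.sub(r"[ \t]+", " ", trimmed)

print(repr(squeeze('<p class="oyric">\t Test\t  #1 \t</p>')))    # run ends with a space
print(repr(squeeze('<p class="oyric"> Test \t#2</p>')))          # run ends with a tab
print(repr(squeeze_to_space('<p class="oyric"> Test \t#2</p>')))
```

Note that squeeze_to_space also collapses blank runs inside the tag itself, so it is only suitable for a fragment that has already been isolated.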
      

      Best Regards,

      guy038

      • R
        Robin Cruise
        last edited by Oct 30, 2018, 2:49 PM

        GREAT ! thank you very much ;)

        • R
          Robin Cruise
          last edited by Robin Cruise Nov 10, 2018, 9:23 AM Nov 10, 2018, 9:21 AM

          SEARCH (?s)(\G|<p class="oyric">)((?!</p>).)*?\K((?<=>)\h+|\h+(?=<|\h))

          REPLACE Leave EMPTY


          By the way, there is a small problem in your regex, guy038, which I have only just discovered.

          It seems that your regex also selects spaces outside the specified tag, and disturbs
          all my other lines.

          See a screenshot:

          https://snag.gy/fRX1ZO.jpg

          or check the regex on this sample: https://regex101.com/r/dDBYSk/1

          See what happens after "Replace all". You will see that all tags with spaces before and after are modified, not only the particular tag I want (<p class="oyric">).

          • rinku singhR
            rinku singh
            last edited by Nov 10, 2018, 10:04 AM

            I’m thinking about building Swiss File Knife plugins.
            http://stahlworks.com/dev/swiss-file-knife.html

            • G
              guy038
              last edited by guy038 Nov 11, 2018, 12:34 PM Nov 10, 2018, 4:35 PM

              Hello, @Robin-cruise and All,

              Ah yes! My regex wasn’t accurate enough! And worse, my formulation of the general regex solution was not exact either :-((


              So, the general regex solution for performing a search/replacement in a specific area only is:

              SEARCH (?-s)(\G|BR)((?!ER).)*?\KSR        OR        (?s)(\G|BR)((?!ER).)*?\KSR

              REPLACE RR

              where :

              • BR ( Beginning Regex ) is the regex which defines both the start of that specific area and the starting point for a possible Search Regex match

              • ER ( Excluded Regex ) is the regex which defines the characters and/or strings forbidden between the Beginning Regex position and the next Search Regex match. It implicitly defines a zone where the Search Regex may occur

              • SR ( Search Regex ) is the regex which defines the expression to search for, provided that both the Beginning Regex and the Excluded Regex conditions hold

              • RR ( Replace Regex ) is the expression which replaces the Search Regex match


              In your case, we must look for unnecessary blank characters in a <..........> area, without any < or > inside. Hence, the excluded characters are simply the two symbols < and >

              Now, inside that area, which may span several lines, we look for any of:

              • Case A : All blank characters at the beginning of lines, when a valid area is split over several lines

              • Case B : All blank characters at the end of lines, when a valid area is split over several lines

              • Case C : All blank characters right after the > symbol

              • Case D : All blank characters right before the </p> ending tag

              • Case E : All runs of at least two blank characters, followed by a character which is neither a blank character nor a < symbol

              These 5 cases correspond to the different alternatives of the SR search regex, separated with the | symbol

              So, we have :

              BR = <p class="oyric">

              ER = <|>

              SR = (?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))

              RR = (?1$0)(?2\x20)

              Remarks :

              • The assertion \G is considered true at the current caret position during the first run of the regex. So, if you deliberately move the cursor inside the leading spaces of a line before running the regex S/R, it could wrongly match the remaining leading spaces located after the cursor position

              • So, in order to avoid these matches, I added, for case E, the restrictive condition (\h{2,})(?=[^<\h]), whose look-ahead must be true right after the matched blank range!

              • Regarding the replacement :

                • If case A occurs, we must keep the leading spaces, stored in group 1. So we rewrite the entire match: (?1$0)

                • If case B, C or D occurs, we need to delete all these blank characters => nothing is rewritten

                • If case E occurs, we just replace all the matched blank characters, stored in group 2, with a single space character => (?2\x20)
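The conditional replacement (?1$0)(?2\x20) is specific to the Boost engine used by Notepad++. As an illustration only, the same three-way decision can be sketched in Python with a replacement callback. This is a swapped-in technique, not the regex above: [ \t] stands in for \h, re.MULTILINE is assumed so that ^ and $ work per line, the name rr is hypothetical, and the sketch is applied to one already-isolated fragment, so it does not reproduce the \G / \K zone logic.

```python
import re

# The five SR alternatives, in the same order: A | B | C | D | E.
SR = re.compile(
    r"(^[ \t]+)"                 # case A: leading blanks      (group 1)
    r"|[ \t]+$"                  # case B: trailing blanks
    r"|(?<=>)[ \t]+"             # case C: blanks right after ">"
    r"|[ \t]+(?=</p>)"           # case D: blanks right before "</p>"
    r"|([ \t]{2,})(?=[^< \t])",  # case E: inner run of 2+     (group 2)
    re.MULTILINE,
)

def rr(m):
    if m.group(1) is not None:   # (?1$0): keep the indentation as it is
        return m.group(0)
    if m.group(2) is not None:   # (?2\x20): collapse the run to one space
        return " "
    return ""                    # cases B, C, D: delete the blanks

fragment = '<p class="oyric">  This   is\n   a  test   \n  end </p>'
print(SR.sub(rr, fragment))
```

The indentation of the second and third lines is preserved, while every other superfluous blank is removed or collapsed.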


              Finally, we get this new regex S/R :

              SEARCH (?s)(?:\G|<p class="oyric">)(?:(?!<|>).)*?\K(?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))

              REPLACE (?1$0)(?2\x20)

              which should avoid the side-effects of my first attempt ;-))

              Beware

              • If you decide, before performing this regex S/R, to move the cursor on purpose inside the <.........> area of a <p> tag with a class other than oyric, it will also match all the additional blank characters of that <.........> zone. Nothing can be done about this!

              • Luckily, once the caret is located after that first zone <.........>, the behavior of the regex is, again, as expected :-))
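The caret dependence comes from \G, so an engine without \G cannot reproduce it. For comparison, here is a hedged Python sketch of the whole search/replace that uses another technique: it first records the spans where an SR match is allowed to start (the text after <p class="oyric"> up to the next < or >), then filters every SR match against those spans. The name clean_oyric, the [ \t] stand-in for \h and the use of re.MULTILINE are assumptions of this sketch, and it has not been checked against every edge case of the Notepad++ regex.

```python
import re

BR_ZONE = re.compile(r'<p class="oyric">([^<>]*)')   # BR, then no "<" or ">"
SR = re.compile(
    r"(^[ \t]+)|[ \t]+$|(?<=>)[ \t]+|[ \t]+(?=</p>)|([ \t]{2,})(?=[^< \t])",
    re.MULTILINE,
)

def clean_oyric(text):
    # Offsets where an SR match may begin: group 1 of every BR_ZONE match.
    zones = [(m.start(1), m.end(1)) for m in BR_ZONE.finditer(text)]

    def rr(m):
        if not any(lo <= m.start() < hi for lo, hi in zones):
            return m.group(0)                 # outside every zone: untouched
        if m.group(1) is not None:
            return m.group(0)                 # case A: keep leading blanks
        return " " if m.group(2) is not None else ""

    return SR.sub(rr, text)

line = ('<p class="oyric">  The    final   test   </p>'
        '   <p class="Tag_3">bla       blah   blah </p>')
print(clean_oyric(line))
```

Because no match position depends on where a cursor happens to be, the result is the same on every run.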

              Cheers,

              guy038

              PS :

              I said above, regarding the ER ( Excluded Regex ), that it implicitly defines a zone where the Search Regex may occur

              Here is an example to explain this notion. Let’s consider these 3 simple regexes :

              • [^<>]+(?=\h) , which searches for the longest run of characters other than < and >, if followed by a blank character

              • [^<]+(?=\h) , which searches for the longest run of characters other than <, if followed by a blank character

              • [^>]+(?=\h) , which searches for the longest run of characters other than >, if followed by a blank character

              Below are the results, with each matched run of characters underlined with - and the blank character that follows underlined with ^

                                                  REGEX  [^<>]+(?=\h)
              
              <p class="oyric">This     is    a   test  </p>   <p class="Tag_3">bla     blahh </p>  <p class="oyric">    a  test</p>
               -               ------------------------^    --^ -^              -------------^    -^ -^              ------^
              
              
              
                                                  REGEX  [^<]+(?=\h)
              
              <p class="oyric">This     is    a   test  </p>   <p class="Tag_3">bla     blahh </p>  <p class="oyric">    a  test</p>
               ----------------------------------------^ -----^ -----------------------------^ ----^ ----------------------^
              
              
              
                                                  REGEX  [^>]+(?=\h)
              
              <p class="oyric">This     is    a   test  </p>   <p class="Tag_3">bla     blahh </p>  <p class="oyric">    a  test</p>
              --^              ------------------------^    -----^              -------------^    ----^              ------^
              

              As we use the greedy quantifier +, it is easy to visualize the complete zones where a blank character, underlined with ^, may be looked for in each case ;-))
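These three regexes use no Notepad++-specific syntax apart from \h, so they can be tried in Python with [ \t] in its place. The short sample string and the helper name runs are made up for this sketch; the point is only to show how the excluded characters change where each run may start and stop.

```python
import re

sample = '<p id="a">ab  cd </p> <i>x y</i>'

def runs(char_class):
    """Longest runs of the given class that are followed by a blank."""
    return re.findall(char_class + r"+(?=[ \t])", sample)

print(runs(r"[^<>]"))   # neither "<" nor ">" may be crossed
print(runs(r"[^<]"))    # ">" may be crossed, "<" may not
print(runs(r"[^>]"))    # "<" may be crossed, ">" may not
```

With both symbols excluded, a run can never leave the text between two tags or the inside of a tag, which is what makes <|> usable as the Excluded Regex.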

              • R
                Robin Cruise
                last edited by Nov 10, 2018, 5:53 PM

                well done, thank you very much

                • Neculai I. FantanaruN
                  Neculai I. Fantanaru
                  last edited by Neculai I. Fantanaru Dec 3, 2018, 1:41 PM Dec 3, 2018, 1:40 PM

                  @guy038 said:

                  (?1$0)(?2\x20)

                  Alternatively, it can be replaced with (?{2}$1 ), such as:

                  SEARCH: (?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))

                  REPLACE BY: (?{2}$1 )

                  • Neculai I. FantanaruN
                    Neculai I. Fantanaru
                    last edited by Dec 29, 2018, 9:41 AM

                    @guy038 I am just reviewing this post, because I like it and a post from today reminded me of it.

                    SEARCH (?s)(?:\G|<p class="oyric">)(?:(?!<|>).)*?\K(?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))

                    REPLACE (?1$0)(?2\x20)

                    What if I have another tag inside, like <em>?

                    So, @Robin Cruise’s scenario becomes:

                    <p class="oyric"> Laurie Strode comes to her final confrontation <em> with Michael Myers, the masked figure who has haunted her </em>since she narrowly escaped. </p>

                    In this case, your regex does not remove the empty spaces, because of the <em> tags.
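The reason can be made visible with a small Python sketch. It assumes, as in the regex quoted in this post, that the zone is <p class="oyric"> followed only by characters other than < and >, and it uses a shortened version of the sample. It merely shows where the zone stops and does not propose a corrected regex.

```python
import re

# BR followed by the characters allowed by the excluded regex <|>
ZONE = re.compile(r'<p class="oyric">[^<>]*')

# Shortened version of the sample above
sample = ('<p class="oyric"> Laurie Strode comes to her final confrontation '
          '<em> with Michael Myers </em>since she narrowly escaped. </p>')

zone = ZONE.match(sample).group(0)
rest = sample[len(zone):]

print(repr(zone))   # ends right before <em>
print(repr(rest))   # everything from <em> onwards lies outside the zone
```

Every blank located after the opening <em> tag, including the one before </p>, is therefore out of reach for the search regex.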

                    • Neculai I. FantanaruN
                      Neculai I. Fantanaru
                      last edited by Jan 2, 2019, 11:26 AM

                      This post is deleted!