How to remove empty spaces from a particular tag (regular expression)



  • Good day. I wrote this regular expression to remove the extra spaces and tabs inside a particular tag, but it does not work well:

    (?-s)(\G(?!^)|<p\s+class="oyric">)((?!</p>).)*?\K\s\s+

    <p class="oyric">  Laurie Strode comes to   her final confrontation		 with Michael Myers, the   masked figure  who has haunted her 	  since she narrowly escaped.  </p>
    

    Output should be:

    <p class="oyric">Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped.</p>


  • Hi, @Robin-cruise and All,

    You were not very far from the right solution ! The way to replace something :

    • In a particular tag section, as <p>........</p>

    • In a particular tag section, with a particular class name, as <p class="test">Bla bla blah</p>

    has already been discussed in these posts :

    https://notepad-plus-plus.org/community/topic/15058/regex-remove-particular-words-from-tags-in-several-text-pages/10

    https://notepad-plus-plus.org/community/topic/15058/regex-remove-particular-words-from-tags-in-several-text-pages/12


    So, the general regex solution, when you want to perform a Search/Replacement, ONLY in an area, which is located between two particular boundaries, is :

    SEARCH (?-s)(\G|BR)((?!ER).)*?\KSR        OR        (?s)(\G|BR)((?!ER).)*?\KSR

    REPLACE RR

    where :

    • BR ( Beginning Regex ) is the regex which defines the start of the defined zone, for the search/replacement

    • ER ( Ending Regex ) is the regex which defines the end of the defined zone, for the search/replacement

    • SR ( Search Regex ) is the regex which defines the expression to search, in any defined zone

    • RR ( Replace Regex ) is the regex which defines the expression replacing the Search Regex, in any defined zone

    In your case :

    BR = <p class="oyric">

    ER = </p>

    SR = ((?<=>)\h+|\h+(?=<|\h))

    RR = Nothing

    Notes :

    • SR is a search of any of the two alternatives, separated with the | symbol, and surrounded by parentheses, because of the lowest priority of the alternation symbol |

      • (?<=>)\h+ which tries to match any non-null range of horizontal blank chars ( spaces or tabulations ), if preceded with the > symbol

      • \h+(?=<|\h) which tries to match any non-null range of horizontal blank chars ( spaces or tabulations ), if followed by either the < symbol or a final horizontal blank character

    • As all these blank characters matched have to be deleted, the replacement zone is just empty

    • First, the regex tries to find the string <p class="oyric">, followed by the shortest range, even null, of characters, .*?, till the search regex, explained above, with the condition that the string </p> must not be located at any position of this range

    • Due to the \K syntax, the regex engine resets its working location and forgets any previous match. So the final match is, simply, the part described above ((?<=>)\h+|\h+(?=<|\h)) ( SR )

    • After this first match, it can only match the zero-length assertion \G, followed, again, with a possible other shortest range, even null … … … as just above !

    • When the regex engine skips the ending boundary </p>, the \G cannot be verified anymore and the only way to match something else is to grab, again, a <p class="oyric"> string, further on !

    • If you are only interested in single-line ranges BR.........ER, use the (?-s) modifier, at beginning of the search regex

    • If you may have some multi-lines ranges BR.........ER, use the (?s) modifier, at beginning of the search regex
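
    The idea of restricting a replacement to a zone can also be sketched outside Notepad++. The Python below is only an illustration under stated assumptions: the standard re module supports neither \G nor \K, so it swaps in a different technique (find each zone first, then run the replacement on the zone body with a callback). The helper name replace_in_zone, the group names and the simplified search regex [ \t]{2,} are hypothetical and are not part of the regex above.

```python
import re


def replace_in_zone(text, begin, end, search, repl):
    """Apply search -> repl only between `begin` and the next `end`.

    Hypothetical helper: a two-step stand-in for the \\G ... \\K idiom,
    which Python's standard `re` module does not support.
    """
    zone = re.compile(
        "(?P<zone_begin>" + begin + ")(?P<zone_body>.*?)(?P<zone_end>" + end + ")",
        re.DOTALL,  # like (?s): a zone may span several lines
    )

    def fix(match):
        # Only the text between the two boundaries is rewritten.
        body = re.sub(search, repl, match.group("zone_body"))
        return match.group("zone_begin") + body + match.group("zone_end")

    return zone.sub(fix, text)


sample = '<p class="oyric">a   b</p> <p class="Tag_1">c   d</p>'
result = replace_in_zone(sample, r'<p class="oyric">', r"</p>", r"[ \t]{2,}", " ")
print(result)
```

    Because the zone body is handed to the inner replacement as a separate string, look-behinds in the search regex cannot see the opening boundary. That is one reason this is only a rough equivalent of the single-pass regex.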


    So, Robin, let’s imagine the sample text, below :

    <p class="oyric">  Laurie Strode comes to   her final confrontation      with Michael Myers, the   masked figure  who has haunted her     since she narrowly escaped.  </p>
    
    <p class="Tag_1">  Laurie Strode comes to   her final confrontation      with Michael Myers, the   masked figure  who has haunted her     since she narrowly escaped.  </p>
    
    <p class="oyric">            Laurie Strode         comes to her final confrontation with Michael     Myers, the masked figure  who has              haunted her since she            narrowly escaped.</p>
    
    <p class="Tag_2">bla    blah     blah   </p>
    
    <p class="oyric">  This is    a test </p>   <p class="Tag_3">bla       blah   blah </p>  <p class="oyric">  The    final   test   </p>
    
    
    <p class="oyric">  Laurie Strode comes to   her final
     confrontation      
     with Michael Myers, the   masked
     figure  who has haunted her     since she
     narrowly escaped.  </p>
    
    <p class="Tag_2">bla
        blah     
    ....blah   </p>
    
    <p class="oyric">     This is    an           
         other  test to verify    if the      regex
               is correct         </p>
    

    Using the following regex S/R :

    SEARCH (?s)(\G|<p class="oyric">)((?!</p>).)*?\K((?<=>)\h+|\h+(?=<|\h))

    REPLACE Leave EMPTY

    You should get the expected text, below :

    <p class="oyric">Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped.</p>
    
    <p class="Tag_1">  Laurie Strode comes to   her final confrontation      with Michael Myers, the   masked figure  who has haunted her     since she narrowly escaped.  </p>
    
    <p class="oyric">Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped.</p>
    
    <p class="Tag_2">bla    blah     blah   </p>
    
    <p class="oyric">This is a test</p>   <p class="Tag_3">bla       blah   blah </p>  <p class="oyric">The final test</p>
    
    
    <p class="oyric">Laurie Strode comes to her final
     confrontation 
     with Michael Myers, the masked
     figure who has haunted her since she
     narrowly escaped.</p>
    
    <p class="Tag_2">bla
        blah     
    ....blah   </p>
    
    <p class="oyric">This is an 
     other test to verify if the regex
     is correct</p>
    

    Notes :

    • It’s easy to verify that blank characters have been removed, ONLY in all areas <p class="oyric">..........</p>, whether they are single-line areas or multi-line blocks

    • However, note that if a line ends with blank chars and the next line begins, also, with blank characters, this regex S/R will keep one blank char, both at the end of this line and at the beginning of the next line !

    • Finally, beware that, if words are separated with a mix of space and tabulation characters, only the final blank character of the range will be kept !

    For instance from text :

    <p class="oyric">	   		Test	  	    #1  	    </p>  ( Last BLANK char = SPACE      ,before #1 )
    
    <p class="oyric">   	    Test  	  		#2		  	</p>  ( Last BLANK char = TABULATION, before #2 )
    

    You’ll obtain :

    <p class="oyric">Test #1</p>   ( SPACE      char between Test and #1 )
    
    <p class="oyric">Test	#2</p> ( TABULATION char between Test and #2 )
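
    This "last blank char wins" behavior can be reproduced with a small Python sketch. It is only an illustration: the search regex SR is applied directly to the two sample lines (without the \G ... \K zone part), and [ \t] stands in for \h, on the assumption that the text contains only spaces and tabs as horizontal blanks.

```python
import re

# SR alone, with [ \t] standing in for \h (assumption: only spaces and tabs occur).
SR = re.compile(r"(?<=>)[ \t]+|[ \t]+(?=[< \t])")

line1 = '<p class="oyric">\t   \t\tTest\t  \t    #1  \t    </p>'
line2 = '<p class="oyric">   \t    Test  \t  \t\t#2\t\t  \t</p>'

# The second alternative matches every blank that is followed by another
# blank, so only the last blank before "#1" / "#2" survives.
out1 = SR.sub("", line1)
out2 = SR.sub("", line2)

print(repr(out1))
print(repr(out2))
```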
    

    Best Regards,

    guy038



  • GREAT ! thank you very much ;)



  • SEARCH (?s)(\G|<p class="oyric">)((?!</p>).)*?\K((?<=>)\h+|\h+(?=<|\h))

    REPLACE Leave EMPTY


    By the way, there is a little problem in your regex, guy038. I have just discovered it.

    It seems that your regex also selects the spaces outside the specified tag, and disturbs
    all my other lines.

    See a print screen:

    https://snag.gy/fRX1ZO.jpg

    or check the regex on this code in Notepad++: https://regex101.com/r/dDBYSk/1

    See what happens after “Replace all”. You will see that all tags with spaces before and after are modified, not only the particular tag I want (<p class="oyric">).



  • I’m thinking about Swiss File Knife plugins to build.
    http://stahlworks.com/dev/swiss-file-knife.html



  • Hello, @Robin-cruise and All,

    Ah Yes ! My regex wasn’t accurate enough ! And worse, my formulation of the general regex solution was not exact either :-((


    So, the general regex solution, when you want to perform a Search/Replacement in a specific area, only, is :

    SEARCH (?-s)(\G|BR)((?!ER).)*?\KSR        OR        (?s)(\G|BR)((?!ER).)*?\KSR

    REPLACE RR

    where :

    • BR ( Beginning Regex ) is the regex which defines both the start of that specific area and the start for a possible Search Regex match

    • ER ( Excluded Regex ) is the regex which defines the characters and/or strings forbidden, from the Beginning Regex position until a next Search Regex match. It, implicitly, defines a zone, where the Search Regex may occur

    • SR ( Search Regex ) is the regex which defines the expression to search, if both the Beginning Regex and the Excluded Regex are TRUE

    • RR ( Replace Regex ) is the regex which defines the expression replacing the Search Regex


    In your case, we must look for unnecessary blank characters, in a <..........> area, without any < nor > inside. Hence, the excluded chars are , simply, the two symbols < and >

    Now, inside that area, possibly multi-lines, we’ll look for either:

    • Case A : All blank characters, at beginning of lines, in case of a correct area, split in several lines

    • Case B : All blank characters, at end of lines, in case of a correct area, split in several lines

    • Case C : All blank characters, right after the > symbol

    • Case D : All blank characters, right before the </p> ending tag

    • Case E : All ranges of, at least, two blank characters, followed by a character which is neither a blank char nor a < symbol

    These 5 cases correspond to the different alternatives of the SR search regex, separated with the | symbol

    So, we have :

    BR = <p class="oyric">

    ER = <|>

    SR = (?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))

    RR = (?1$0)(?2\x20)

    Remarks :

    • The assertion \G is considered as true ( current position of caret ), during the first run of the regex. So, if you move the cursor inside the leading spaces of a line, on purpose, before running the regex S/R, it could wrongly match the remaining leading spaces, located after the cursor position

    • So, in order to avoid these matches, I added, for case E, the restrictive condition (\h{2,})(?=[^<\h]), which must be true, after the blank matched range !

    • Regarding the replacement :

      • If case A occurs, we must keep the leading spaces, stored in group 1. So, we rewrite the entire match (?1$0)

      • If cases B, C or D occur, we need to delete all these blank chars => Nothing is rewritten

      • If case E occurs, we just replace all the blank chars matched, stored in group2 with a single space character => (?2\x20)
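
    The conditional replacement (?1$0)(?2\x20) can be mimicked with a callback in Python. This is a minimal sketch under stated assumptions, not the Notepad++ S/R itself: the \G ... \K zone logic is left out, the sample string is assumed to be a single <p class="oyric"> area already, [ \t] stands in for \h, and the names rewrite, zone and cleaned are hypothetical.

```python
import re

# The five alternatives of SR, with [ \t] standing in for \h.
SR = re.compile(
    r"(^[ \t]+)"                 # case A: leading blanks of a line (group 1)
    r"|[ \t]+$"                  # case B: trailing blanks of a line
    r"|(?<=>)[ \t]+"             # case C: blanks right after >
    r"|[ \t]+(?=</p>)"           # case D: blanks right before </p>
    r"|([ \t]{2,})(?=[^< \t])",  # case E: two or more blanks between words (group 2)
    re.MULTILINE,
)


def rewrite(match):
    if match.group(1) is not None:   # case A: rewrite the whole match, like (?1$0)
        return match.group(0)
    if match.group(2) is not None:   # case E: a single space, like (?2\x20)
        return " "
    return ""                        # cases B, C and D: delete


zone = '<p class="oyric">  This is    a \n   second  line  </p>'
cleaned = SR.sub(rewrite, zone)
print(repr(cleaned))
```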


    Finally, we get this new regex S/R :

    SEARCH (?s)(?:\G|<p class="oyric">)(?:(?!<|>).)*?\K(?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))

    REPLACE (?1$0)(?2\x20)

    which should avoid the side-effects of my first attempt ;-))

    Beware

    • If you decide, before performing this regex S/R, to move the cursor, on purpose, inside the <.........> area of a <p> tag, with a class different from oyric, it will, also, match all the additional blank characters of that <.........> zone. Can’t do anything about this !

    • Luckily, once the caret is located after that first zone <.........>, the behavior of the regex is, again, as expected :-))

    Cheers,

    guy038

    PS :

    I say, above, regarding the ER ( Excluded Regex ), that it, implicitly, defines a zone, where the Search Regex may occur

    Just an example to explain this notion. Let’s consider these 3 simple regexes :

    • [^<>]+(?=\h) , which searches the greatest range of chars, different from < and >, if followed by a blank char

    • [^<]+(?=\h) , which searches the greatest range of chars, different from <, if followed by a blank char

    • [^>]+(?=\h) , which searches the greatest range of chars, different from >, if followed by a blank char

    Here is, below, the results, with any range of chars, underlined with - and the blank char, underlined with ^

                                        REGEX  [^<>]+(?=\h)
    
    <p class="oyric">This     is    a   test  </p>   <p class="Tag_3">bla     blahh </p>  <p class="oyric">    a  test</p>
     -               ------------------------^    --^ -^              -------------^    -^ -^              ------^
    
    
    
                                        REGEX  [^<]+(?=\h)
    
    <p class="oyric">This     is    a   test  </p>   <p class="Tag_3">bla     blahh </p>  <p class="oyric">    a  test</p>
     ----------------------------------------^ -----^ -----------------------------^ ----^ ----------------------^
    
    
    
                                        REGEX  [^>]+(?=\h)
    
    <p class="oyric">This     is    a   test  </p>   <p class="Tag_3">bla     blahh </p>  <p class="oyric">    a  test</p>
    --^              ------------------------^    -----^              -------------^    ----^              ------^
    

    As we use the greedy quantifier +, it is easy to visualize the complete zones where it is allowed to look for a blank character, underlined with ^, in each case ;-))
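
    The same comparison can be run in Python on a shorter string. This is just an illustrative sketch: [ \t] stands in for \h, and the sample string and variable names are made up for the example.

```python
import re

sample = '<p class="x">ab cd </p>'

patterns = [
    r"[^<>]+(?=[ \t])",  # neither < nor > allowed in the range
    r"[^<]+(?=[ \t])",   # only < is excluded
    r"[^>]+(?=[ \t])",   # only > is excluded
]

# For each regex, list the ranges that end right before a blank character.
results = {pattern: re.findall(pattern, sample) for pattern in patterns}

for pattern, found in results.items():
    print(pattern, found)
```

    Excluding both symbols keeps each range inside either the tag or the text. Excluding only one of them lets the range run across the other symbol.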



  • well done, thank you very much



  • @guy038 said:

    (?1$0)(?2\x20)

    Alternatively, it can be replaced with (?{2}$1 ), such as:

    SEARCH: (?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))

    REPLACE BY: (?{2}$1 )



  • @guy038 I just reviewed this post, because I like it and a post today reminded me of the same thing.

    SEARCH (?s)(?:\G|<p class="oyric">)(?:(?!<|>).)*?\K(?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))

    REPLACE (?1$0)(?2\x20)

    What if I have another tag inside, like <em> ?

    Then @Robin Cruise’s scenario becomes:

    <p class="oyric"> Laurie Strode comes to her final confrontation <em> with Michael Myers, the masked figure who has haunted her </em>since she narrowly escaped. </p>

    In this case, your regex does not remove the empty spaces, because of those <em> tags.


