how to remove empty spaces from a particular tag (regular expression)

Robin Cruise

Good day. I made this regular expression to remove the extra spaces and tabs from a particular tag that contains many of them:

(?-s)(\G(?!^)|<p\s+class="oyric">)((?!</p>).)*?\K\s\s+ but it does not work well

<p class="oyric">  Laurie Strode comes to   her final confrontation		 with Michael Myers, the   masked figure  who has haunted her 	  since she narrowly escaped.  </p>

Output should be:

<p class="oyric">Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped.</p>

guy038

Hi, @Robin-cruise and All,

You were not very far from the right solution ! The way to replace something :

In a particular tag section, such as <tag>........</tag>
In a particular tag section, with a particular class name, such as <p class="oyric">Bla bla blah</p>

has already been discussed in these posts :

https://notepad-plus-plus.org/community/topic/15058/regex-remove-particular-words-from-tags-in-several-text-pages/10

https://notepad-plus-plus.org/community/topic/15058/regex-remove-particular-words-from-tags-in-several-text-pages/12

So, the general regex solution, when you want to perform a Search/Replacement ONLY in an area located between two particular boundaries, is :

SEARCH (?-s)(\G|BR)((?!ER).)*?\KSR OR (?s)(\G|BR)((?!ER).)*?\KSR

REPLACE RR

where :

BR ( Beginning Regex ) is the regex which defines the start of the zone for the search/replacement
ER ( Ending Regex ) is the regex which defines the end of the zone for the search/replacement
SR ( Search Regex ) is the regex which defines the expression to search for, inside any such zone
RR ( Replace Regex ) is the expression which replaces each Search Regex match, inside any such zone

In your case :

BR = <p class="oyric">

ER = </p>

SR = ((?<=>)\h+|\h+(?=<|\h))

RR = Nothing

Notes :

SR is a search for either of two alternatives, separated with the | symbol and surrounded by parentheses, because the alternation symbol | has the lowest priority :
- (?<=>)\h+ which matches any non-empty range of horizontal blank chars ( spaces or tabulations ), if preceded by the > symbol
- \h+(?=<|\h) which matches any non-empty range of horizontal blank chars ( spaces or tabulations ), if followed by either the < symbol or a final horizontal blank character
As all the blank characters matched have to be deleted, the replacement zone is just empty
First, the regex tries to find the string <p class="oyric">, followed by the shortest range, possibly empty, of characters, .*?, up to a match of the search regex explained above, with the condition that the string </p> must not be located at any position of this range
Due to the \K syntax, the regex engine resets its working location and forgets everything matched so far. So the final match is simply the part described above, ((?<=>)\h+|\h+(?=<|\h)) ( SR )
After this first match, it can only match the zero-length assertion \G, followed, again, by the shortest possible range, possibly empty, and so on, as just above !
Once the regex engine has gone past the ending boundary </p>, the \G assertion cannot be verified anymore, and the only way to match something else is to find another <p class="oyric"> string, further on !
If you are only interested in single-line ranges BR.........ER, use the (?-s) modifier at the beginning of the search regex
If you may have some multi-line ranges BR.........ER, use the (?s) modifier at the beginning of the search regex
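For readers who want to try the same idea outside Notepad++, here is a minimal Python sketch. It is not a translation of the regex above: Python's standard `re` module supports neither \G, \K nor \h, so the sketch swaps in a different technique, a two-step replacement with a callback. The function name `replace_in_zone` is hypothetical, `[ \t]` is used as a rough stand-in for \h, and the start and end tags are assumed to be `<p class="oyric">` and `</p>`.

```python
import re

def replace_in_zone(text, begin, end, search, repl):
    """Apply search -> repl only between a `begin` and the nearest `end`."""
    # Lazy match from a begin marker to the nearest end marker. DOTALL lets
    # a zone span several lines, like the (?s) variant of the thread's regex.
    zone = re.compile(f"(?:{begin}).*?(?:{end})", re.DOTALL)
    inner = re.compile(search)
    # The callback receives the whole zone, markers included, so the
    # lookbehind on '>' and the lookahead on '<' still see the tags.
    return zone.sub(lambda m: inner.sub(repl, m.group(0)), text)

# Rough equivalent of SR, with [ \t] standing in for \h.
SR = r"(?<=>)[ \t]+|[ \t]+(?=[< \t])"

sample = (
    '<p class="oyric">  This is    a test </p>   '
    '<p class="Tag_3">bla       blah   blah </p>'
)
cleaned = replace_in_zone(sample, r'<p class="oyric">', r"</p>", SR, "")
print(cleaned)
# <p class="oyric">This is a test</p>   <p class="Tag_3">bla       blah   blah </p>
```

Only the oyric paragraph is cleaned; the Tag_3 paragraph and the blanks between the two tags are left alone.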

So, Robin, let’s imagine the sample text, below :

<p class="oyric">  Laurie Strode comes to   her final confrontation      with Michael Myers, the   masked figure  who has haunted her     since she narrowly escaped.  </p>

<p class="Tag_1">  Laurie Strode comes to   her final confrontation      with Michael Myers, the   masked figure  who has haunted her     since she narrowly escaped.  </p>

<p class="oyric">            Laurie Strode         comes to her final confrontation with Michael     Myers, the masked figure  who has              haunted her since she            narrowly escaped.</p>

<p class="Tag_2">bla    blah     blah   </p>

<p class="oyric">  This is    a test </p>   <p class="Tag_3">bla       blah   blah </p>  <p class="oyric">  The    final   test   </p>


<p class="oyric">  Laurie Strode comes to   her final
 confrontation      
 with Michael Myers, the   masked
 figure  who has haunted her     since she
 narrowly escaped.  </p>

<p class="Tag_2">bla
    blah     
....blah   </p>

<p class="oyric">     This is    an           
     other  test to verify    if the      regex
           is correct         </p>

Using the following regex S/R :

SEARCH (?s)(\G|<p class="oyric">)((?!</p>).)*?\K((?<=>)\h+|\h+(?=<|\h))

REPLACE Leave EMPTY

You should get the expected text, below :

<p class="oyric">Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped.</p>

<p class="Tag_1">  Laurie Strode comes to   her final confrontation      with Michael Myers, the   masked figure  who has haunted her     since she narrowly escaped.  </p>

<p class="oyric">Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped.</p>

<p class="Tag_2">bla    blah     blah   </p>

<p class="oyric">This is a test</p>   <p class="Tag_3">bla       blah   blah </p>  <p class="oyric">The final test</p>


<p class="oyric">Laurie Strode comes to her final
 confrontation 
 with Michael Myers, the masked
 figure who has haunted her since she
 narrowly escaped.</p>

<p class="Tag_2">bla
    blah     
....blah   </p>

<p class="oyric">This is an 
 other test to verify if the regex
 is correct</p>

Notes :

It’s easy to verify that blank characters have been removed ONLY in the <p class="oyric">..........</p> areas, whether they are single-line areas or multi-line blocks
However, note that if a line ends with blank chars and the next line also begins with blank characters, this regex S/R will keep one blank char both at the end of this line and at the beginning of the next line !
Finally, beware that, if words are separated with a mix of space and tabulation characters, only the final blank character will be kept !

For instance from text :

<p class="oyric">	   		Test	  	    #1  	    </p>  ( Last BLANK char = SPACE      ,before #1 )

<p class="oyric">   	    Test  	  		#2		  	</p>  ( Last BLANK char = TABULATION, before #2 )

You’ll obtain :

<p class="oyric">Test #1</p>   ( SPACE      char between Test and #1 )

<p class="oyric">Test	#2</p> ( TABULATION char between Test and #2 )
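The "only the final blank character is kept" behaviour comes from the \h+(?=<|\h) alternative: the greedy \h+ has to give back one blank so that the look-ahead can succeed. A small Python sketch can reproduce it, assuming `[ \t]` as a stand-in for \h (which the standard `re` module does not support); the helper name `squeeze` is hypothetical.

```python
import re

# [ \t] approximates \h; the two alternatives mirror the SR of the thread.
SR = re.compile(r"(?<=>)[ \t]+|[ \t]+(?=[< \t])")

def squeeze(fragment):
    """Delete every blank run matched by SR inside a text fragment."""
    return SR.sub("", fragment)

print(repr(squeeze("Test\t  \t    #1")))   # 'Test #1'   (last blank was a space)
print(repr(squeeze("Test  \t  \t\t#2")))   # 'Test\t#2'  (last blank was a tab)
print(repr(squeeze("her  \n  since")))     # 'her \n since'
```

The third call shows the line-break case from the notes: one blank survives on each side of the line break.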

Best Regards,

guy038

Robin Cruise

GREAT ! thank you very much ;)

Robin Cruise

SEARCH (?s)(\G|<p class="oyric">)((?!</p>).)*?\K((?<=>)\h+|\h+(?=<|\h))

REPLACE Leave EMPTY

By the way, there is a little problem in your regex, guy038, which I have only just discovered.

It seems that your regex also selects spaces outside the specified tag and disturbs all my other lines.

See a screenshot:

https://snag.gy/fRX1ZO.jpg

or check the regex against this code: https://regex101.com/r/dDBYSk/1

See what happens after “Replace all”. You will see that all tags with spaces before and after are modified, not only the particular tag I want.

rinku singh

I’m thinking about building Swiss File Knife plugins.
http://stahlworks.com/dev/swiss-file-knife.html

guy038

Hello, @Robin-cruise and All,

Ah Yes ! My regex wasn’t accurate enough ! And worse, my formulation of the general regex solution was not exact either :-((

So, the general regex solution, when you want to perform a Search/Replacement in a specific area only, is :

SEARCH (?-s)(\G|BR)((?!ER).)*?\KSR OR (?s)(\G|BR)((?!ER).)*?\KSR

REPLACE RR

where :

BR ( Beginning Regex ) is the regex which defines both the start of that specific area and the starting point for a possible Search Regex match
ER ( Excluded Regex ) is the regex which defines the characters and/or strings that are forbidden between the Beginning Regex position and the next Search Regex match. It implicitly defines a zone where the Search Regex may occur
SR ( Search Regex ) is the regex which defines the expression to search for, if both the Beginning Regex and the Excluded Regex conditions are TRUE
RR ( Replace Regex ) is the expression which replaces each Search Regex match
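To make the template concrete, here is a small Python sketch that only assembles the Notepad++ search string from its three parts. It does not run the regex (Python's `re` module supports neither \G nor \K); the function name `build_search` and its `multiline_zone` parameter are hypothetical.

```python
def build_search(br, er, sr, multiline_zone=True):
    """Assemble the search string (\\G|BR)((?!ER).)*?\\KSR with its modifier."""
    # (?s) lets the dot cross line breaks, (?-s) keeps each zone on one line.
    flag = "(?s)" if multiline_zone else "(?-s)"
    return f"{flag}(\\G|{br})((?!{er}).)*?\\K{sr}"

print(build_search("BR", "ER", "SR"))
# (?s)(\G|BR)((?!ER).)*?\KSR
print(build_search("BR", "ER", "SR", multiline_zone=False))
# (?-s)(\G|BR)((?!ER).)*?\KSR
```

The resulting string is meant to be pasted into the Find what field with the Regular expression search mode selected.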

In your case, we must look for unnecessary blank characters in the text area of the tag, that is, between the > symbol which ends the starting tag and the < symbol which begins the ending tag, without any < nor > inside. Hence, the excluded chars are simply the two symbols < and >

Now, inside that area, possibly multi-line, we’ll look for either :

Case A : All blank characters at the beginning of lines, when a correct area is split over several lines
Case B : All blank characters at the end of lines, when a correct area is split over several lines
Case C : All blank characters right after the > symbol of the starting tag
Case D : All blank characters right before the </p> ending tag
Case E : All ranges of at least two blank characters, followed by a character which is neither a blank char nor a < symbol

These 5 cases correspond to the different alternatives of the SR search regex, separated with the | symbol

So, we have :

BR = <p class="oyric">

ER = <|>

SR = (?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))

RR = (?1$0)(?2\x20)

Remarks :

The assertion \G is considered as true ( current position of the caret ) during the first run of the regex. So, if you deliberately move the cursor inside the leading spaces of a line before running the regex S/R, it could wrongly match the remaining leading spaces located after the cursor position
So, in order to avoid these matches, I added, for case E, the restrictive condition (?=[^<\h]), which must be true right after the range of blanks matched by (\h{2,}) !
Regarding the replacement :
- If case A occurs, we must keep the leading spaces, stored in group 1. So we rewrite the entire match => (?1$0)
- If case B, C or D occurs, we need to delete all these blank chars => Nothing is rewritten
- If case E occurs, we just replace all the blank chars matched, stored in group 2, with a single space character => (?2\x20)
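The conditional replacement (?1$0)(?2\x20) can be mimicked in Python with a replacement callback. This is only a sketch under several assumptions: `[ \t]` approximates \h, `re.MULTILINE` reproduces the line-based ^ and $ of Notepad++, the ending tag is assumed to be `</p>`, and the zone is found in a separate first step because \G and \K are not available in the standard `re` module. The names `rr`, `ZONE` and `clean` are hypothetical.

```python
import re

SR = re.compile(
    r"(^[ \t]+)"                 # case A: leading blanks, captured in group 1
    r"|[ \t]+$"                  # case B: trailing blanks
    r"|(?<=>)[ \t]+"             # case C: blanks right after '>'
    r"|[ \t]+(?=</p>)"           # case D: blanks right before the ending tag
    r"|([ \t]{2,})(?=[^< \t])",  # case E: inner run, captured in group 2
    re.MULTILINE,
)

def rr(m):
    if m.group(1) is not None:   # like (?1$0): keep the indentation as is
        return m.group(0)
    if m.group(2) is not None:   # like (?2\x20): collapse to a single space
        return " "
    return ""                    # cases B, C and D: delete

# Only blocks whose text contains no '<' or '>' are touched, mirroring ER = <|>.
ZONE = re.compile(r'<p class="oyric">[^<>]*</p>')

def clean(text):
    return ZONE.sub(lambda m: SR.sub(rr, m.group(0)), text)

sample = ('<p class="oyric">  This is    an   \n    other  test   </p>\n'
          '<p class="Tag_2">bla    blah </p>')
print(clean(sample))
```

On this sample, the indentation of the second line is kept, the inner runs become single spaces, and the Tag_2 paragraph is untouched.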

Finally, we get this new regex S/R :

SEARCH (?s)(?:\G|<p class="oyric">)(?:(?!<|>).)*?\K(?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))

REPLACE (?1$0)(?2\x20)

which should avoid the side-effects of my first attempt ;-))

Beware

If you deliberately move the cursor, before performing this regex S/R, inside the text area of a tag with a class different from oyric, the regex will also match all the additional blank characters of that zone. Can’t do anything about this !
Luckily, once the caret is located after that first zone, the behavior of the regex is, again, as expected :-))

Cheers,

guy038

PS :

I said, above, regarding the ER ( Excluded Regex ), that it implicitly defines a zone where the Search Regex may occur

Just an example to explain this notion. Let’s consider these 3 simple regexes :

[^<>]+(?=\h) , which searches for the longest range of chars other than < and >, if followed by a blank char
[^<]+(?=\h) , which searches for the longest range of chars other than <, if followed by a blank char
[^>]+(?=\h) , which searches for the longest range of chars other than >, if followed by a blank char

Here are, below, the results, with each range of chars underlined with - and the blank char underlined with ^

                                    REGEX  [^<>]+(?=\h)

<p class="oyric">This     is    a   test  </p>   <p class="Tag_3">bla     blahh </p>  <p class="oyric">    a  test</p>
 -               ------------------------^    --^ -^              -------------^    -^ -^              ------^



                                    REGEX  [^<]+(?=\h)

<p class="oyric">This     is    a   test  </p>   <p class="Tag_3">bla     blahh </p>  <p class="oyric">    a  test</p>
 ----------------------------------------^ -----^ -----------------------------^ ----^ ----------------------^



                                    REGEX  [^>]+(?=\h)

<p class="oyric">This     is    a   test  </p>   <p class="Tag_3">bla     blahh </p>  <p class="oyric">    a  test</p>
--^              ------------------------^    -----^              -------------^    ----^              ------^

As we use the greedy quantifier +, it is easy to visualize the complete zones where it is allowed to look for a blank character, underlined with ^, in each case ;-))
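The same comparison can be reproduced with a few lines of Python, as a rough check. The sketch assumes `[ \t]` in place of \h (not supported by the standard `re` module) and uses a shorter, made-up sample line; the helper name `zones` is hypothetical.

```python
import re

line = '<p class="oyric">This   is </p> <p class="Tag_3">bla </p>'

def zones(pattern):
    """Return every range matched by `pattern` in the sample line."""
    return [m.group(0) for m in re.finditer(pattern, line)]

print(zones(r"[^<>]+(?=[ \t])"))
# ['p', 'This   is', 'p', 'bla']
print(zones(r"[^<]+(?=[ \t])"))
# ['p class="oyric">This   is', '/p>', 'p class="Tag_3">bla']
```

Excluding both < and > confines each range to the inside of a tag or to the text between two tags, whereas excluding only < lets a range run from inside the starting tag into the text that follows it.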

Robin Cruise

well done, thank you very much

Neculai I. Fantanaru

@guy038 said:

(?1$0)(?2\x20)

Also, it can be replaced with (?{2}$1 ), such as:

SEARCH: (?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))

REPLACE BY: (?{2}$1 )

Neculai I. Fantanaru

@guy038 I have just reviewed this post, because I like it and a post from today reminded me of it.

SEARCH (?s)(?:\G|<p class="oyric">)(?:(?!<|>).)*?\K(?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))

REPLACE (?1$0)(?2\x20)

What if, in case I have another tag, like 

So, @Robin Cruise's scenario becomes:

 Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped. 

So, In this case, your regex does not remove empty spaces because of those 
