how to remove empty spaces from a particular tag (regular expression)
-
good day, I made this regular expression to remove empty spaces from a particular tag that has many empty spaces and tabs:
(?-s)(\G(?!^)|<p\s+class="oyric">)((?!</p>).)*?\K\s\s+
but it does not work well on this text:
<p class="oyric"> Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped. </p>
Output should be:
<p class="oyric">Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped.</p>
-
Hi, @Robin-cruise and All,
You were not very far from the right solution ! The way to replace something :
-
In a particular tag section, as
<p>........</p>
-
In a particular tag section, with a particular class name, as
<p class="test">Bla bla blah</p>
has already been discussed in these posts :
So, the general regex solution, when you want to perform a Search/Replacement, ONLY in an area, which is located between two particular boundaries, is :
SEARCH
(?-s)(\G|BR)((?!ER).)*?\KSR
OR
(?s)(\G|BR)((?!ER).)*?\KSR
REPLACE
RR
where :
-
BR ( Beginning Regex ) is the regex which defines the start of the defined zone, for the search/replacement
-
ER ( Ending Regex ) is the regex which defines the end of the defined zone, for the search/replacement
-
SR ( Search Regex ) is the regex which defines the expression to search, in any defined zone
-
RR ( Replace Regex ) is the regex which defines the expression replacing the Search Regex, in any defined zone
In your case :
BR =
<p class="oyric">
ER =
</p>
SR =
((?<=>)\h+|\h+(?=<|\h))
RR =
Nothing
Notes :
-
SR is a search for either of two alternatives, separated by the
|
symbol and surrounded by parentheses, because alternation has the lowest priority :
-
(?<=>)\h+
which tries to match any non-empty range of horizontal blank chars ( spaces or tabulations ), if preceded by the
>
symbol
-
\h+(?=<|\h)
which tries to match any non-empty range of horizontal blank chars ( spaces or tabulations ), if followed by either the
<
symbol or a final horizontal blank character
-
As all the matched blank characters have to be deleted, the replacement zone is just empty
-
First, the regex tries to find the string
<p class="oyric">
followed by the shortest range of characters, even an empty one (
.*?
), up to the search regex explained above, with the condition that the string
</p>
must not be located at any position of this range
-
Due to the
\K
syntax, the regex engine resets its working location and forgets everything matched so far. So the final match is simply the part described above,
((?<=>)\h+|\h+(?=<|\h))
( SR )
-
After this first match, it can only match the zero-length assertion
\G
followed, again, by a possibly empty shortest range, as just above !
-
When the regex engine skips the ending boundary
</p>
the
\G
assertion cannot be verified anymore, and the only way to match something else is to grab, again, a
<p class="oyric">
string, further on !
-
If you are only interested in single-line ranges
BR.........ER
use the
(?-s)
modifier at the beginning of the search regex
-
If you may have some multi-line ranges
BR.........ER
use the
(?s)
modifier at the beginning of the search regex
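For readers who want to try the "replace only between two boundaries" idea outside Notepad++, here is a minimal Python sketch. It is not the `\G` ... `\K` technique itself: Python's standard `re` module supports neither `\G`, `\K` nor `\h`, so the sketch swaps in a two-pass approach (first isolate each BR...ER zone, then apply SR inside it) and uses `[ \t]` as an approximation of `\h`. The helper name `sub_between` and the short sample are assumptions made up for illustration.

```python
import re

def sub_between(text, begin, end, search, repl=""):
    """Apply `search` -> `repl` only inside zones that start at `begin`
    and stop at the first `end` (hypothetical helper, two-pass emulation)."""
    # A zone is: begin, then any chars that do not start `end`, then `end` if present.
    zone = re.compile(f"(?:{begin})(?:(?!{end}).)*(?:{end})?", re.DOTALL)
    inner = re.compile(search)
    # Substituting over the whole zone (boundaries included) lets the
    # lookarounds of `search` still see the '>' of BR and the '<' of ER.
    return zone.sub(lambda m: inner.sub(repl, m.group(0)), text)

BR = r'<p class="oyric">'
ER = r'</p>'
SR = r'(?<=>)[ \t]+|[ \t]+(?=<|[ \t])'   # [ \t] stands in for \h

sample = ('<p class="oyric">  Laurie  Strode.  </p>\n'
          '<p class="Tag_1">  bla  blah  </p>')
cleaned = sub_between(sample, BR, ER, SR)
print(cleaned)
# <p class="oyric">Laurie Strode.</p>
# <p class="Tag_1">  bla  blah  </p>
```

Because the zone is isolated first, text outside any `<p class="oyric">` ... `</p>` range is never handed to the search regex at all.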
So, Robin, let’s imagine the sample text, below :
<p class="oyric"> Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped. </p>
<p class="Tag_1"> Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped. </p>
<p class="oyric"> Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped.</p>
<p class="Tag_2">bla blah blah </p>
<p class="oyric"> This is a test </p>
<p class="Tag_3">bla blah blah </p>
<p class="oyric"> The final test </p>
<p class="oyric"> Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped. </p>
<p class="Tag_2">bla blah ....blah </p>
<p class="oyric"> This is an other test to verify if the regex is correct </p>
Using the following regex S/R :
SEARCH
(?s)(\G|<p class="oyric">)((?!</p>).)*?\K((?<=>)\h+|\h+(?=<|\h))
REPLACE
Leave EMPTY
You should get the expected text, below :
<p class="oyric">Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped.</p>
<p class="Tag_1"> Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped. </p>
<p class="oyric">Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped.</p>
<p class="Tag_2">bla blah blah </p>
<p class="oyric">This is a test</p>
<p class="Tag_3">bla blah blah </p>
<p class="oyric">The final test</p>
<p class="oyric">Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped.</p>
<p class="Tag_2">bla blah ....blah </p>
<p class="oyric">This is an other test to verify if the regex is correct</p>
Notes :
-
It’s easy to verify that blank characters have been removed ONLY in the areas
<p class="oyric">..........</p>
whether they are single-line areas or multi-line blocks
-
However, note that if a line ends with blank chars and the next line also begins with blank characters, this regex S/R will keep one blank char both at the end of this line and at the beginning of the next line !
-
Finally, beware that, if words are separated by a mix of space and tabulation characters, only the final blank character will be kept !
For instance, from the text :
<p class="oyric"> Test #1 </p> ( Last BLANK char = SPACE, before #1 )
<p class="oyric"> Test #2 </p> ( Last BLANK char = TABULATION, before #2 )
You’ll obtain :
<p class="oyric">Test #1</p> ( SPACE char between Test and #1 )
<p class="oyric">Test #2</p> ( TABULATION char between Test and #2 )
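The "only the final blank character is kept" behavior can be reproduced with a small Python sketch. It applies the SR part alone to a single fragment, with `[ \t]` standing in for `\h` (Python's `re` has no `\h`); the function name and the two sample strings, with explicit `\t` escapes, are assumptions chosen to make the mix of blanks visible.

```python
import re

# SR from the post, with [ \t] approximating \h
SR = re.compile(r'(?<=>)[ \t]+|[ \t]+(?=<|[ \t])')

def strip_blanks(fragment):
    """Delete every blank run matched by SR (hypothetical helper)."""
    return SR.sub("", fragment)

# Blanks between "Test" and "#1" are: space, tab, space (last one is a SPACE)
space_last = strip_blanks('<p class="oyric"> Test \t #1 </p>')
# Blanks between "Test" and "#2" are: space, space, tab (last one is a TAB)
tab_last = strip_blanks('<p class="oyric"> Test  \t#2 </p>')

print(repr(space_last))  # '<p class="oyric">Test #1</p>'
print(repr(tab_last))    # '<p class="oyric">Test\t#2</p>'
```

The second alternative only matches blanks that are followed by another blank (or by `<`), so the last blank of each run between two words is never part of a match and survives as-is.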
Best Regards,
guy038
-
-
GREAT ! thank you very much ;)
-
SEARCH
(?s)(\G|<p class="oyric">)((?!</p>).)*?\K((?<=>)\h+|\h+(?=<|\h))
REPLACE
Leave EMPTY
By the way, guy038, I have just discovered a small problem in your regex.
It seems that your regex also selects spaces outside the specified tag and disturbs
all my other lines. You can check the regex against this code: https://regex101.com/r/dDBYSk/1
See what happens after “Replace all”. You will see that all tags with spaces before and after are modified, not only the particular tag I want (<p class="oyric">).
-
I’m thinking about building Swiss File Knife plugins.
http://stahlworks.com/dev/swiss-file-knife.html
-
Hello, @Robin-cruise and All,
Ah yes ! My regex wasn’t accurate enough ! And worse, my formulation of the general regex solution was not exact either :-((
So, the general regex solution, when you want to perform a Search/Replacement in a specific area, only, is :
SEARCH
(?-s)(\G|BR)((?!ER).)*?\KSR
OR
(?s)(\G|BR)((?!ER).)*?\KSR
REPLACE
RR
where :
-
BR ( Beginning Regex ) is the regex which defines both the start of that specific area and the starting point for a possible Search Regex match
-
ER ( Excluded Regex ) is the regex which defines the characters and/or strings that are forbidden between the Beginning Regex position and the next Search Regex match. It implicitly defines a zone where the Search Regex may occur
-
SR ( Search Regex ) is the regex which defines the expression to search for, provided that both the Beginning Regex and the Excluded Regex conditions are TRUE
-
RR ( Replace Regex ) is the regex which defines the expression replacing the Search Regex
In your case, we must look for unnecessary blank characters in a
<..........>
area, without any
<
or
>
inside. Hence, the excluded chars are simply the two symbols
<
and
>
Now, inside that area, possibly spanning several lines, we’ll look for either :
-
Case A : All blank characters at the beginning of lines, when a correct area is split over several lines
-
Case B : All blank characters at the end of lines, when a correct area is split over several lines
-
Case C : All blank characters right after the
>
symbol
-
Case D : All blank characters right before the
</p>
ending tag
-
Case E : All ranges of at least two blank characters, followed by a character that is neither a blank char nor a
<
symbol
These
5
cases correspond to the different alternatives of the SR search regex, separated by the
|
symbol
So, we have :
BR =
<p class="oyric">
ER =
<|>
SR =
(?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))
RR =
(?1$0)(?2\x20)
Remarks :
-
The assertion
\G
is considered true at the current position of the caret during the first run of the regex. So, if you move the cursor inside the leading spaces of a line, on purpose, before running the regex S/R, it could wrongly match the remaining leading spaces located after the cursor position
-
So, in order to avoid these matches, I added, for case E, the restrictive condition
(\h{2,})(?=[^<\h])
which must be true after the matched range of blanks !
-
Regarding the replacement :
-
If case A occurs, we must keep the leading spaces, stored in group
1
So we rewrite the entire match with
(?1$0)
-
If case B, C or D occurs, we need to delete all these blank chars =>
Nothing
is rewritten
-
If case E occurs, we just replace all the matched blank chars, stored in group
2
with a single space character =>
(?2\x20)
-
Finally, we get this new regex S/R :
SEARCH
(?s)(?:\G|<p class="oyric">)(?:(?!<|>).)*?\K(?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))
REPLACE
(?1$0)(?2\x20)
which should avoid the side-effects of my first attempt ;-))
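As a rough illustration of how the five cases and the conditional replacement fit together, here is a Python sketch. It is an emulation under stated assumptions, not the Notepad++ regex itself: Python's `re` lacks `\G`, `\K`, `\h` and Boost's `(?1...)` conditional replacement, so the zone (text after `<p class="oyric">` containing neither `<` nor `>`) is captured first and a callback picks the replacement per case. All names (`TAG_ZONE`, `INNER`, `clean_oyric`) are made up for this example, and `[ \t]` approximates `\h`.

```python
import re

# Opening tag, then the zone without '<' or '>', then '</p>' if it follows directly
TAG_ZONE = re.compile(r'(<p class="oyric">)([^<>]*)(</p>)?')
# Case A (group 1) | case B | case E (group 2), tried in that order
INNER = re.compile(r'(^[ \t]+)|[ \t]+(?=\r?\n)|([ \t]{2,})(?=[^< \t])',
                   re.MULTILINE)

def _inner_repl(m):
    if m.group(1) is not None:   # case A: leading blanks are rewritten as-is
        return m.group(0)
    if m.group(2) is not None:   # case E: a run of 2+ blanks becomes one space
        return " "
    return ""                    # case B: trailing blanks are deleted

def _zone_repl(m):
    opening, zone, closing = m.group(1), m.group(2), m.group(3) or ""
    zone = zone.lstrip(" \t")          # case C: blanks right after '>'
    if closing:
        zone = zone.rstrip(" \t")      # case D: blanks right before '</p>'
    return opening + INNER.sub(_inner_repl, zone) + closing

def clean_oyric(text):
    return TAG_ZONE.sub(_zone_repl, text)

single = clean_oyric('<p class="oyric">  Laurie   Strode  comes.   </p>\n'
                     '<p class="Tag_1">  bla   blah  </p>')
multi = clean_oyric('<p class="oyric">  First  line   \n    second   line  </p>')
print(single)
print(multi)
```

In this sketch the zone ends at the next `<` or `>`, so only the text directly following the opening tag is cleaned, and paragraphs with another class are left untouched.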
Beware
-
If you decide, before performing this regex S/R, to move the cursor, on purpose, inside the
<.........>
area of a
<p>
tag with a class different from
oyric
it will also match all the additional blank characters of that
<.........>
zone. Can’t do anything about this !
-
Luckily, once the caret is located after that first
<.........>
zone, the behavior of the regex is, again, as expected :-))
Cheers,
guy038
PS :
I say, above, regarding the ER ( Excluded Regex ), that it, implicitly, defines a zone, where the Search Regex may occur
Just an example to explain this notion. Let’s consider these
3
simple regexes :
-
[^<>]+(?=\h)
which searches for the longest range of chars other than
<
and
>
if followed by a blank char
-
[^<]+(?=\h)
which searches for the longest range of chars other than
<
if followed by a blank char
-
[^>]+(?=\h)
which searches for the longest range of chars other than
>
if followed by a blank char
Here are, below, the results, with each range of chars underlined with - and the blank char underlined with ^
[ Diagram : the three sample lines below, repeated under each of the regexes [^<>]+(?=\h) , [^<]+(?=\h) and [^>]+(?=\h) , with the matched ranges and the following blank chars underlined ]
<p class="oyric">This is a test </p>
<p class="Tag_3">bla blahh </p>
<p class="oyric"> a test</p>
As we use the greedy quantifier
+
it is easy to visualize the complete zones where it is allowed to look for a blank character, underlined with ^ in each case ;-))
-
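The zones defined by each character class can also be listed programmatically. The following Python sketch runs the three regexes on one of the sample lines, with `[ \t]` standing in for `\h` (not available in Python's `re`); the function name `ranges_before_blank` is an assumption for illustration.

```python
import re

SAMPLE = '<p class="oyric">This is a test </p>'

def ranges_before_blank(char_class, text=SAMPLE):
    """Return every greedy run of `char_class` that is followed by a blank."""
    return re.findall(char_class + r'+(?=[ \t])', text)

no_angle = ranges_before_blank(r'[^<>]')   # neither '<' nor '>' allowed
no_open = ranges_before_blank(r'[^<]')     # only '<' is excluded
no_close = ranges_before_blank(r'[^>]')    # only '>' is excluded

print(no_angle)   # ['p', 'This is a test']
print(no_open)    # ['p class="oyric">This is a test']
print(no_close)   # ['<p', 'This is a test']
```

Excluding both symbols confines each run to the inside of a tag or to the text between two tags, which is what makes `<|>` usable as the Excluded Regex.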
-
well done, thank you very much
-
@guy038 said:
(?1$0)(?2\x20)
It can also be replaced with:
(?{2}$1 )
such as:
SEARCH:
(?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))
REPLACE BY:
(?{2}$1 )
-
@guy038 I have just reviewed this post, because I like it and a post from today reminded me of the same thing.
SEARCH
(?s)(?:\G|<p class="oyric">)(?:(?!<|>).)*?\K(?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))
REPLACE
(?1$0)(?2\x20)
What if I have another tag inside, like
<em>
So, the @Robin-cruise scenario becomes:
<p class="oyric"> Laurie Strode comes to her final confrontation <em> with Michael Myers, the masked figure who has haunted her </em>since she narrowly escaped. </p>
So, in this case, your regex does not remove the empty spaces, because of those
<em>
tags.