How to remove empty spaces from a particular tag (regular expression)
-
Good day. I made this regular expression to remove empty spaces from a particular tag that contains many spaces and tabs:
(?-s)(\G(?!^)|<p\s+class="oyric">)((?!</p>).)*?\K\s\s+
but it does not work well. Sample text:
<p class="oyric"> Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped. </p>
Output should be:
<p class="oyric">Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped.</p>
-
Hi, @Robin-cruise and All,
You were not very far from the right solution! The way to replace something:
- In a particular tag section, such as <p>........</p>
- In a particular tag section with a particular class name, such as <p class="test">Bla bla blah</p>
has already been discussed in earlier posts.
So, the general regex solution, when you want to perform a search/replacement ONLY in an area located between two particular boundaries, is:
SEARCH (?-s)(\G|BR)((?!ER).)*?\KSR OR (?s)(\G|BR)((?!ER).)*?\KSR
REPLACE RR
where:
- BR (Beginning Regex) is the regex which defines the start of the defined zone for the search/replacement
- ER (Ending Regex) is the regex which defines the end of the defined zone for the search/replacement
- SR (Search Regex) is the regex which defines the expression to search for, in any defined zone
- RR (Replace Regex) is the regex which defines the expression replacing the Search Regex, in any defined zone
In your case:
BR = <p class="oyric">
ER = </p>
SR = ((?<=>)\h+|\h+(?=<|\h))
RR = Nothing
Notes:
- SR is a search for either of two alternatives, separated with the | symbol and surrounded by parentheses, because the alternation symbol | has the lowest priority:
  - (?<=>)\h+ tries to match any non-empty range of horizontal blank chars (spaces or tabulations), if preceded by the > symbol
  - \h+(?=<|\h) tries to match any non-empty range of horizontal blank chars (spaces or tabulations), if followed by either the < symbol or a final horizontal blank character
- As all these matched blank characters have to be deleted, the replacement zone is just empty
- First, the regex tries to find the string <p class="oyric">, followed by the shortest range, even empty, of characters, .*?, up to the search regex explained above, with the condition that the string </p> must not be located at any position of this range
- Due to the \K syntax, the regex engine resets its working location and forgets everything matched so far. So the final match is simply the part described above, ((?<=>)\h+|\h+(?=<|\h)) (SR)
- After this first match, it can only match the zero-length assertion \G, followed, again, by another possible shortest range, even empty … as just above!
- When the regex engine skips the ending boundary </p>, the \G cannot be verified anymore and the only way to match something else is to grab, again, a <p class="oyric"> string, further on!
- If you are only interested in single-line ranges BR.........ER, use the (?-s) modifier at the beginning of the search regex
- If you may have some multi-line ranges BR.........ER, use the (?s) modifier at the beginning of the search regex
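For readers who want to try the BR / ER / SR / RR idea outside the editor, here is a minimal Python sketch. It is not the regex above: Python's standard `re` module has no `\G`, `\K` or `\h`, so this swaps in a different technique, assuming it is acceptable to first locate each BR...ER zone and then keep only the SR matches that fall inside a zone. The function name `replace_in_zones` is hypothetical, and `[ \t]` stands in for `\h`.

```python
import re

def replace_in_zones(text, br, er, sr, rr):
    """Apply the SR -> RR replacement only between a BR match and the next ER match."""
    # A zone is the text after a BR match, up to (not including) the next ER
    # match, or up to the end of the text when no ER follows.
    zone_re = re.compile(rf"(?:{br})(.*?)(?=(?:{er})|\Z)", re.DOTALL)
    zones = [m.span(1) for m in zone_re.finditer(text)]

    def substitute(match):
        # SR is searched in the whole text, so its lookarounds still see the
        # surrounding < and >; matches outside every zone are put back as-is.
        inside = any(start <= match.start() and match.end() <= end
                     for start, end in zones)
        return match.expand(rr) if inside else match.group(0)

    return re.sub(sr, substitute, text)

BR = r'<p class="oyric">'
ER = r'</p>'
SR = r'(?<=>)[ \t]+|[ \t]+(?=<|[ \t])'   # [ \t] replaces \h

sample = ('<p class="oyric">  Laurie   Strode  </p>\n'
          '<p class="Tag_1">  bla   blah  </p>')
cleaned = replace_in_zones(sample, BR, ER, SR, "")
print(cleaned)
```

Only the `oyric` paragraph is cleaned; the `Tag_1` paragraph keeps its blanks because its SR matches lie outside every zone.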
So, Robin, let's imagine the sample text below:
<p class="oyric"> Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped. </p>
<p class="Tag_1"> Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped. </p>
<p class="oyric"> Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped.</p>
<p class="Tag_2">bla blah blah </p>
<p class="oyric"> This is a test </p>
<p class="Tag_3">bla blah blah </p>
<p class="oyric"> The final test </p>
<p class="oyric"> Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped. </p>
<p class="Tag_2">bla blah ....blah </p>
<p class="oyric"> This is an other test to verify if the regex is correct </p>
Using the following regex S/R:
SEARCH (?s)(\G|<p class="oyric">)((?!</p>).)*?\K((?<=>)\h+|\h+(?=<|\h))
REPLACE Leave EMPTY
You should get the expected text below:
<p class="oyric">Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped.</p>
<p class="Tag_1"> Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped. </p>
<p class="oyric">Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped.</p>
<p class="Tag_2">bla blah blah </p>
<p class="oyric">This is a test</p>
<p class="Tag_3">bla blah blah </p>
<p class="oyric">The final test</p>
<p class="oyric">Laurie Strode comes to her final confrontation with Michael Myers, the masked figure who has haunted her since she narrowly escaped.</p>
<p class="Tag_2">bla blah ....blah </p>
<p class="oyric">This is an other test to verify if the regex is correct</p>
Notes:
- It's easy to verify that blank characters have been removed ONLY in the <p class="oyric">..........</p> areas, whether they are single-line areas or multi-line blocks
- However, note that if a line ends with blank chars and the next line also begins with blank characters, this regex S/R will keep one blank char both at the end of this line and at the beginning of the next line!
- Finally, beware that if words are separated with a mix of space and tabulation characters, only the final blank character will be kept!
For instance, from the text:
<p class="oyric"> Test #1 </p> ( Last BLANK char = SPACE, before #1 )
<p class="oyric"> Test #2 </p> ( Last BLANK char = TABULATION, before #2 )
you'll obtain:
<p class="oyric">Test #1</p> ( SPACE char between Test and #1 )
<p class="oyric">Test #2</p> ( TABULATION char between Test and #2 )
Best Regards,
guy038
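The remark about mixed spaces and tabulations can be checked in isolation with a small Python sketch. It applies only the SR part of the first regex, with `[ \t]` standing in for `\h` (an assumption, since Python's `re` does not know `\h`), to two short strings made up for the illustration.

```python
import re

# SR from the first regex S/R, with [ \t] in place of \h
SR = r"(?<=>)[ \t]+|[ \t]+(?=<|[ \t])"

# TAB then SPACE between the words: the TAB is followed by a blank, so it is
# deleted, and the final SPACE stays.
space_last = re.sub(SR, "", "Test\t #1")

# SPACE then TAB between the words: this time the final TAB stays.
tab_last = re.sub(SR, "", "Test \t#2")

print(repr(space_last), repr(tab_last))
```

In both cases every blank that is followed by another blank is removed, so only the last one of the run survives.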
-
-
GREAT ! thank you very much ;)
-
SEARCH (?s)(\G|<p class="oyric">)((?!</p>).)*?\K((?<=>)\h+|\h+(?=<|\h))
REPLACE Leave EMPTY
By the way, there is a little problem in your regex, guy038. I have just discovered it.
It seems that your regex also selects spaces outside the specified tag, and disturbs all my other lines.
Check the regex on this code in notepad: https://regex101.com/r/dDBYSk/1
See what happens after "Replace all". You will see that all tags with spaces before and after are modified, not only the particular tag I want (<p class="oyric">).
-
I'm thinking about swiss file knife plugins to build.
http://stahlworks.com/dev/swiss-file-knife.html
-
Hello, @Robin-cruise and All,
Ah yes! My regex wasn't accurate enough! And worse, my formulation of the general regex solution was not exact either :-((
So, the general regex solution, when you want to perform a search/replacement in a specific area only, is:
SEARCH (?-s)(\G|BR)((?!ER).)*?\KSR OR (?s)(\G|BR)((?!ER).)*?\KSR
REPLACE RR
where:
- BR (Beginning Regex) is the regex which defines both the start of that specific area and the start for a possible Search Regex match
- ER (Excluded Regex) is the regex which defines the characters and/or strings forbidden from the Beginning Regex position until the next Search Regex match. It implicitly defines a zone where the Search Regex may occur
- SR (Search Regex) is the regex which defines the expression to search for, if both the Beginning Regex and the Excluded Regex conditions are TRUE
- RR (Replace Regex) is the regex which defines the expression replacing the Search Regex
In your case, we must look for unnecessary blank characters in a <..........> area, without any < nor > inside. Hence, the excluded chars are simply the two symbols < and >
Now, inside that area, possibly multi-line, we'll look for either:
- Case A: All blank characters at the beginning of lines, in case of a correct area split over several lines
- Case B: All blank characters at the end of lines, in case of a correct area split over several lines
- Case C: All blank characters right after the > symbol
- Case D: All blank characters right before the </p> ending tag
- Case E: All ranges of at least two blank characters, not followed by either a blank char or a < symbol
These 5 cases correspond to the different alternatives of the SR search regex, separated with the | symbol
So, we have:
BR = <p class="oyric">
ER = <|>
SR = (?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))
RR = (?1$0)(?2\x20)
Remarks:
- The assertion \G is considered true (at the current position of the caret) during the first run of the regex. So, if you purposely move the cursor inside the leading spaces of a line before running the regex S/R, it could wrongly match the remaining leading spaces located after the cursor position
- So, in order to avoid these matches, I added, for case E, the restrictive condition (\h{2,})(?=[^<\h]), whose look-ahead must be true after the matched range of blanks!
- Regarding the replacement:
  - If case A occurs, we must keep the leading spaces, stored in group 1. So we rewrite the entire match => (?1$0)
  - If case B, C or D occurs, we need to delete all these blank chars => nothing is rewritten
  - If case E occurs, we just replace all the matched blank chars, stored in group 2, with a single space character => (?2\x20)
Finally, we get this new regex S/R:
SEARCH (?s)(?:\G|<p class="oyric">)(?:(?!<|>).)*?\K(?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))
REPLACE (?1$0)(?2\x20)
which should avoid the side-effects of my first attempt ;-))
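The five cases and the conditional replacement can be tried in Python with the sketch below. It is an emulation under stated assumptions, not the regex above: Python's `re` has no `\G`, `\K`, `\h` or `(?1...)` replacement syntax, so the zone (the characters after the opening tag, up to the first < or >) is located first, `[ \t]` replaces `\h`, and a callback plays the role of RR. The names `ZONE`, `SR` and `clean` are hypothetical.

```python
import re

BLANK = r"[ \t]"  # stand-in for \h

# Zone: everything after the opening tag that contains neither < nor > (ER = <|>)
ZONE = re.compile(r'<p class="oyric">([^<>]*)')

SR = re.compile(
    rf"(^{BLANK}+)"                     # case A: leading blanks of a line (group 1)
    rf"|{BLANK}+$"                      # case B: trailing blanks of a line
    rf"|(?<=>){BLANK}+"                 # case C: blanks right after >
    rf"|{BLANK}+(?=</p>)"               # case D: blanks right before </p>
    rf"|({BLANK}{{2,}})(?=[^< \t])",    # case E: 2+ blanks between words (group 2)
    re.MULTILINE,
)

def clean(text):
    zones = [m.span(1) for m in ZONE.finditer(text)]

    def rr(match):
        inside = any(start <= match.start() and match.end() <= end
                     for start, end in zones)
        if not inside:
            return match.group(0)   # outside any oyric zone: leave untouched
        if match.group(1) is not None:
            return match.group(0)   # case A: rewrite the entire match, like (?1$0)
        if match.group(2) is not None:
            return " "              # case E: a single space, like (?2\x20)
        return ""                   # cases B, C, D: delete

    return SR.sub(rr, text)

one_line = ('<p class="oyric">  Laurie   Strode  </p>\n'
            '<p class="Tag_1">  bla   blah  </p>')
multi_line = '<p class="oyric">  A\n    B   C  \n</p>'
print(clean(one_line))
print(clean(multi_line))
```

Because SR is searched in the whole text and only filtered by position, its look-behind and look-ahead still see the surrounding > and </p>, which a search restricted to the extracted zone text would lose.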
Beware:
- If you decide, before performing this regex S/R, to move the cursor, on purpose, inside the <.........> area of a <p> tag with a class different from oyric, it will also match all the additional blank characters of that <.........> zone. Can't do anything about this!
- Luckily, once the caret is located after that first <.........> zone, the behavior of the regex is, again, as expected :-))
Cheers,
guy038
PS:
I say, above, regarding the ER (Excluded Regex), that it implicitly defines a zone where the Search Regex may occur.
Just an example to explain this notion. Let's consider these 3 simple regexes:
- [^<>]+(?=\h), which searches for the greatest range of chars different from < and >, if followed by a blank char
- [^<]+(?=\h), which searches for the greatest range of chars different from <, if followed by a blank char
- [^>]+(?=\h), which searches for the greatest range of chars different from >, if followed by a blank char
Here are the results below, with each range of chars underlined with - and the blank char underlined with ^
REGEX [^<>]+(?=\h)
<p class="oyric">This is a test </p> <p class="Tag_3">bla blahh </p> <p class="oyric"> a test</p>
- ------------------------^ --^ -^ -------------^ -^ -^ ------^
REGEX [^<]+(?=\h)
<p class="oyric">This is a test </p> <p class="Tag_3">bla blahh </p> <p class="oyric"> a test</p>
----------------------------------------^ -----^ -----------------------------^ ----^ ----------------------^
REGEX [^>]+(?=\h)
<p class="oyric">This is a test </p> <p class="Tag_3">bla blahh </p> <p class="oyric"> a test</p>
--^ ------------------------^ -----^ -------------^ ----^ ------^
As we use the greedy quantifier +, it is easy to visualize the complete zones where it is allowed to look for a blank character, underlined with ^, in each case ;-))
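The same comparison can be reproduced on a shorter, made-up sample in Python. This is only a sketch: `[ \t]` stands in for `\h`, and the sample string and dictionary keys are chosen for the illustration.

```python
import re

sample = '<p class="a">This is </p>'

patterns = {
    "no < nor >": r"[^<>]+(?=[ \t])",
    "no <":       r"[^<]+(?=[ \t])",
    "no >":       r"[^>]+(?=[ \t])",
}

# Each match is a greedy range of allowed chars that is followed by a blank
matches = {name: re.findall(pattern, sample) for name, pattern in patterns.items()}

for name, found in matches.items():
    print(f"{name:11} {found}")
```

With both < and > excluded, a range can never cross a tag boundary; excluding only one of them lets a range run from inside a tag into the text, or from the text into a tag.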
-
well done, thank you very much
-
@guy038 said:
(?1$0)(?2\x20)
Also, it can be replaced with:
(?{2}$1 )
such as:
SEARCH: (?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))
REPLACE BY: (?{2}$1 )
-
@guy038 I just reviewed this post, because I like it and remembered the same thing from a post today.
SEARCH (?s)(?:\G|<p class="oyric">)(?:(?!<|>).)*?\K(?:(^\h+)|\h+$|(?<=>)\h+|\h+(?=</p>)|(\h{2,})(?=[^<\h]))
REPLACE (?1$0)(?2\x20)
What if I have another tag inside, like <em>? So, @Robin Cruise's scenario becomes:
<p class="oyric"> Laurie Strode comes to her final confrontation <em> with Michael Myers, the masked figure who has haunted her </em>since she narrowly escaped. </p>
So, in this case, your regex does not remove the empty spaces, because of those <em> tags.
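A small Python sketch can make the reason visible. It assumes, as stated for the second regex, that the excluded characters are < and >, so the zone reachable after the opening tag stops at the first < it meets. The helper `first_zone` and the shortened sample are hypothetical.

```python
import re

# Zone reachable after the opening tag: chars that are neither < nor > (ER = <|>)
ZONE = re.compile(r'<p class="oyric">([^<>]*)')

def first_zone(text):
    """Return the text lying between the opening tag and the next < or >."""
    match = ZONE.search(text)
    return match.group(1) if match else None

sample = ('<p class="oyric">  Laurie Strode comes   <em>  with Michael Myers  </em>'
          'since she narrowly   escaped.  </p>')
reachable = first_zone(sample)
print(repr(reachable))
```

The zone ends right before `<em>`, so the blanks located after it, up to `</p>`, are never examined.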