Regex: How to remove newline characters from particular HTML tags?

Robin Cruise · Feb 27, 2022, 8:48 PM

I have this HTML tag, which is broken by a newline (\n) right after the word masuri:

<p class="mb-40px">Aceasta este o melodie alcatuita din patru masuri:
reluata apoi de catre instrumentul solist cu un cintec popular.</p>

THE OUTPUT must be:

<p class="mb-40px">Aceasta este o melodie alcatuita din patru masuri: reluata apoi de catre instrumentul solist cu un cintec popular.</p>

I tried this regex, but it doesn’t work well, because it changes the entire HTML code, not just that particular tag.

FIND: (?:<p class="mb-40px">|\G)(?:(?!</p>).)*?\K(\r\n|\r|\n)

REPLACE BY: \x20

I also found a solution by @neil-schipper on another page of this forum, but I don’t know how to integrate it with my HTML tag:

FIND: (?<=[^\r\n])\R(?=[^\r\n])
REPLACE BY: (LEAVE EMPTY)
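
To see what that second pattern does on its own, here is a minimal Python sketch. It is an approximation, not the Notepad++ behaviour: Python's re module has no \R, so the line break is spelled out as \r\n|\n|\r (an assumption that ignores the rarer line separators \R also covers), and the sample string is made up for illustration.

```python
import re

# Approximation of (?<=[^\r\n])\R(?=[^\r\n]) for Python's re module:
# a line break that has a non-line-break character on both sides.
SINGLE_BREAK = re.compile(r"(?<=[^\r\n])(?:\r\n|\n|\r)(?=[^\r\n])")

# Hypothetical sample: two consecutive lines, an empty line, then more text.
sample = "first line\nsecond line\n\nnew paragraph"

# Replace by nothing, as in the forum post.
joined = SINGLE_BREAK.sub("", sample)
print(repr(joined))
```

The break between the two consecutive lines is removed, while the empty line survives because its line breaks touch another line break. Nothing in the pattern knows about tags, so it applies to the whole file.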

Alan Kilborn · Feb 27, 2022, 8:53 PM

@robin-cruise

This is just a (by now) simple replace-but-only-between-delimiters problem; see HERE for the templatized solution.

Robin Cruise · Feb 28, 2022, 7:12 AM

@alan-kilborn THANKS, it works !!

Find: (?-i:<p class="mb-40px">|(?!\A)\G)(?s:(?!</p>).)*?\K(?-i:(?<=[^\r\n])\R(?=[^\r\n]))

Replace by: \x20

Robin Cruise · Feb 28, 2022, 7:12 AM

Another solution: (\r\n|\r|\n)

FIND: (<p class="mb-40px">)+(.)+\K(\r\n|\r|\n)(?=.*<\/p>)

REPLACE BY: \x20

The GENERIC regex formula below can be made much simpler than the one @guy038 uses in many of his other GENERIC regex formulas:

(REGION-START)+(.)+\K(FIND REGEX)(?=.*REGION-FINAL)

Alan Kilborn · Mar 10, 2022, 1:25 PM

@robin-cruise said in Regex: How to remove newline characters from particular HTML tags?:

The GENERIC regex formula below can be made much simpler than the one @guy038 uses

Why should you be believed over @guy038 ?

Hellena Crainicu · Mar 9, 2022, 7:10 AM

@alan-kilborn @guy038

Another alternative to Robin’s generic regex, a better version, could be:

(REGION-START)+(.)+\K(FIND REGEX)(?s:(?=.*(REGION-FINAL)))

guy038 · Mar 10, 2022, 1:56 PM

Hello, @robin-cruise, @alan-kilborn, @hellena-crainicu and All,

Referring to my first blog post about a generic regex, below :

https://community.notepad-plus-plus.org/post/75007

and as Robin wants to search for line-ending chars, we need to use, of course, the complete generic regex S/R :

SEARCH (?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?-si:FR)

REPLACE RR

and not the simplified single-line version

So :

The FR regex is just \R, as the associated non-capturing group, beginning with (?-si:..., is useless in this case
The RR regex is \x20
The BSR regex may be strictly the string <p class="mb-40px">, but may also be expressed in a more general form that matches any opening <p> tag
The ESR regex is, of course, the ending tag </p>, which must never occur before the next line-ending to replace

giving the functional regex S/R :

SEARCH (?-si:|(?!\A)\G)(?s-i:(?!).)*?\K\R

REPLACE \x20

Test it against that text :

<a href="https://www.w3schools.com/">We strongly suggest
to visit the
w3schools.com
site</a>

<p class="mb-40px">Aceasta
este o melodie alcatuita
din patru masuri:
reluata apoi de catre instrumentul solist
cu un cintec popular.</p>

<p class="Test">A SINGLE line</p>

<h1>this is
my very
first heading
</h1>

<p class="123-456 789">This is	
a quick
text to
verify if it
replaces line-endings
by a space char in <p>
tags ONLY</p>

ONLY the <p> tags, multi-line or not, should be affected by the replacement !

Of course, these HTML commands do not represent a legal HTML file and are just used to verify the regex S/R !
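
For readers who want to check the expected result outside Notepad++, the same "replace only between delimiters" idea can be sketched in Python. This is not the \G-based regex above: Python's re module supports neither \G nor \K, so the sketch swaps in a two-step technique (match each whole <p ...>...</p> block, then replace line breaks inside its content with a callback). The helper name join_lines_in_p and the opening-tag pattern <p\b[^>]*> are assumptions made for this illustration.

```python
import re

# Opening <p ...> tag, lazily matched content, closing </p> (assumed patterns).
P_BLOCK = re.compile(r"(<p\b[^>]*>)(.*?)(</p>)", re.DOTALL)
LINE_BREAK = re.compile(r"\r\n|\r|\n")


def join_lines_in_p(html: str) -> str:
    """Replace each line break inside a <p>...</p> block with one space."""
    def fix(match: re.Match) -> str:
        opening, content, closing = match.groups()
        return opening + LINE_BREAK.sub(" ", content) + closing
    return P_BLOCK.sub(fix, html)


sample = (
    '<p class="mb-40px">Aceasta este o melodie alcatuita din patru masuri:\n'
    "reluata apoi de catre instrumentul solist cu un cintec popular.</p>\n"
    "\n"
    "<h1>this is\n"
    "my very\n"
    "first heading\n"
    "</h1>\n"
)
print(join_lines_in_p(sample))
```

Only the paragraph is joined into a single line; the <h1> block and the empty line between the two blocks keep their line breaks.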

Now, the generic variants, proposed by @Robin-cruise and @hellena-crainicu, with a final look-ahead only, containing the ESR region, will not work, most of the time :-(

SEARCH (?-si:BSR|(?!\A)\G).*?\K(?-si:FR)(?=(?s-i:.*?ESR))

In our case, the functional regex S/R becomes :

SEARCH (?-si:|(?!\A)\G).*?\K\R(?=(?s-i:.*?))

REPLACE \x20

But if you test it against, for instance :


<p class="Test">Several
consecutive
lines</p>

<h1>this is
my very
first heading
</h1>

<p class="Test">A SINGLE line</p>

<h2>this is
my second
heading
</h2>

It would concatenate all text till the last </p> of the file, just leaving the last <h2> tag untouched. You could say : But I did add a final question mark in order to get a lazy range of chars before </p> !

You’re right ! But remember that the regex engine tries, by all means, to get a solution. So, it matches the CRLF chars which follow lines</p>, because the regex engine considers that the .*? lazy range of chars begins immediately after the line-ending and continues till right before the third and final </p>, so defining a correct look-ahead assertion !

Thus, testing if the ESR region is not reached at any position, till a NEXT FR match, seems the only method which works properly !
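
The weakness of a trailing look-ahead can be reproduced with a small Python sketch. It isolates only the look-ahead part, since Python's re module has no \G or \K, so it is an approximation of the idea and not the exact Notepad++ search. The line break is simplified to \n and the sample mirrors the test text above.

```python
import re

text = (
    '<p class="Test">Several\n'
    "consecutive\n"
    "lines</p>\n"
    "\n"
    "<h1>this is\n"
    "my very\n"
    "first heading\n"
    "</h1>\n"
    "\n"
    '<p class="Test">A SINGLE line</p>\n'
    "\n"
    "<h2>this is\n"
    "my second\n"
    "heading\n"
    "</h2>\n"
)

# Look-ahead only: a line break matches as soon as SOME </p> exists later
# in the text, even when the line break itself sits outside any <p> block.
lookahead_only = re.compile(r"\n(?=(?s:.*?</p>))")
result = lookahead_only.sub(" ", text)
print(result)
```

Every line break located before the last </p> is replaced, including those of the <h1> block. Only the <h2> block, which comes after the last </p>, is left alone, which is why the check has to be made at every position between the region start and the line break.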

Best Regards

guy038

Reminder : Move to the very beginning of text before clicking on the Find Next or Replace All button !

Alan Kilborn · Mar 10, 2022, 1:25 PM

@alan-kilborn said in Regex: How to remove newline characters from particular HTML tags?:

Why should you be believed over @guy038 ?

@guy038 said in Regex: How to remove newline characters from particular HTML tags?:

Now, the generic variants, proposed by @Robin-cruise and @hellena-crainicu, with a final look-ahead only, containing the ESR region, will not work, most of the time :-(

@Robin-cruise and @hellena-crainicu :

Be careful of posting simplifications.

Probably best to leave these things to the “Master”. :-)

Hellena Crainicu · Aug 12, 2024, 9:46 AM

The best solution is this:

(?-si:|(?!\A)\G)(?s-i:(?!).)*?\K\s+

General regex: (?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\KFR
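
Using \s+ as the FR instead of \R changes what gets replaced: every run of whitespace inside the region collapses, not just the line breaks. The Python sketch below illustrates the difference. It assumes a replacement of one space (the post above does not state one), and it uses a block-matching callback rather than the \G construct, which Python's re module lacks. The helper name replace_in_p and the tag patterns are hypothetical.

```python
import re

# Opening <p ...> tag, lazily matched content, closing </p> (assumed patterns).
P_BLOCK = re.compile(r"(<p\b[^>]*>)(.*?)(</p>)", re.DOTALL)


def replace_in_p(html: str, find_regex: str, replacement: str = " ") -> str:
    """Apply find_regex/replacement only to the content of <p> blocks."""
    def fix(match: re.Match) -> str:
        opening, content, closing = match.groups()
        return opening + re.sub(find_regex, replacement, content) + closing
    return P_BLOCK.sub(fix, html)


sample = '<p class="123-456 789">This is\t\na  quick\ntext</p>\n<h1>my\nheading</h1>'

only_breaks = replace_in_p(sample, r"\r\n|\r|\n")  # line breaks only
all_whitespace = replace_in_p(sample, r"\s+")      # any run of whitespace

print(repr(only_breaks))
print(repr(all_whitespace))
```

With the line-break pattern, the tab before the first line break and the double space are kept. With \s+, each whitespace run becomes a single space, which avoids doubled spaces when a line ends with trailing blanks.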