Regex Help to replace text except when it matches a string
-
I’m a beginner in using regex. I have a set of xml paras and I want to replace the paragraph and line breaks between paras with a single space unless the new para starts with an all caps name followed by a colon. The name could be two or more names all in caps. There could be dozens of instances of this in a given document.
So for example, if I start with:
<para>CARLYNN MAGLIANO SWEENEY: Thank you, Amy. And thank you everyone for joining us. As Amy mentioned, today we’re going to be talking about navigating your parental leave. My name also, Amy mentioned, is Carlynn Magliano Sweeney. And I’m the managing director of Preferred Transition Resources.</para>
<para>PTR is a career coaching, counseling, and outplacement company based in New York City. And we work mainly in the legal sector. Karen, why don’t you tell everybody a little bit about yourself?</para>
<para>KAREN RUBIN: Thank you. So I’m an executive coach. And I’ve been doing this for the past decade after doing a career pivot. I had a different career. And I actually off ramped in my career.</para>
<para>And one of the reasons I love this type of work is this is exactly the kind of help that I wish I had had when I was having my kids. So I see that this is obviously a joyful time in your life but it can also be a nerve wracking time.</para>
I want to end with:
<para>CARLYNN MAGLIANO SWEENEY: Thank you, Amy. And thank you everyone for joining us. As Amy mentioned, today we’re going to be talking about navigating your parental leave. My name also, Amy mentioned, is Carlynn Magliano Sweeney. And I’m the managing director of Preferred Transition Resources. PTR is a career coaching, counseling, and outplacement company based in New York City. And we work mainly in the legal sector. Karen, why don’t you tell everybody a little bit about yourself?</para><para>KAREN RUBIN: Thank you. So I’m an executive coach. And I’ve been doing this for the past decade after doing a career pivot. I had a different career. And I actually off ramped in my career. And one of the reasons I love this type of work is this is exactly the kind of help that I wish I had had when I was having my kids. So I see that this is obviously a joyful time in your life but it can also be a nerve wracking time.</para>
I could find a way to find the para breaks “</para>[\r\n]+<para>” but I don’t know how to restrict the search and replace to only those that are not followed by an all caps name and a colon.
I would appreciate any help in solving this.
Thanks,
Alfred
-
I don’t know if it will get you ALL the way there, but this seems to work for your sample data, and could be used by you as a starting point for more complex cases:
Find what box:
(?-is)</para>\R+<para>([A-Z ]+?[a-z].*?)</para>
Replace with box:
 \1</para>
(note the leading space before \1)
Wrap around checkbox: ticked
Search mode radiobutton: Regular expression
Press the Replace All button
-
@Alan-Kilborn said in Regex Help to replace text except when it matches a string:
\1</para>
Thank you so much. That gets me a lot closer. It fixes a lot of cases, except when several paras in a row need their para breaks removed. If I run the expression again, it still ignores them.
Here’s a later section of text where the para breaks are still in place. I don’t understand the syntax of regular expressions well enough to know where I would need to modify the expression and/or apply a second expression to account for these. The text below should be just two paras, with the second being the last line of the sample:
<para>KAREN RUBIN: Yeah. Absolutely. And just honestly, even just having this playbook calms a lot of anxiety and stress that people feel like, oh, OK, I sort of know what I should be doing. So for what it’s worth, just having this put together, this structure can be really helpful for a lot of people.But the first thing is when you have a better sense of the cases and projects that you’re going to be working on, put together a list of everything that’s going to need to be covered. And also think about your recommendations for staffing. So you’re closest to it. And you’re best positioned to know who those people might be.</para>
<para>So you want to be thinking about who’s got the bandwidth. Who has an affinity for this type of work, or maybe who would welcome the opportunity to stretch or get new ability? So like parental leave can be a great development opportunity for a more junior associate.And you also want to prioritize what needs to happen while you’re gone. Is there anything that could be put on hold? So once you’ve got this, and depending on if you have one partner or multiple partners, but once you get that agreement and buy in, they are the ones who should be presenting this coverage plan to your colleagues.</para>
<para>And the reason that’s important is if you are telling someone like, here, can you cover this in my absence, then it almost feels like you’re asking for a personal favor and you’re not. So you want to get this plan blessed from above so the more senior people can provide the air cover. And your role is to manage the training.</para>
<para>CARLYNN MAGLIANO SWEENEY: [INAUDIBLE].</para>
-
Actually, I just figured out the missing item. After I run your expression, I put the same expression in again, add a space before the “\R”, and it takes care of all the remaining lines. I cannot thank you enough for the assistance.
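For anyone scripting this outside Notepad++, the run-it-again workaround could be sketched in Python roughly as follows. This is an approximation under stated assumptions, not the Notepad++ behaviour itself: Python's `re` module has no `\R`, so the line break is spelled out by hand; the leading `(?-is)` is dropped because Python's `re` is case-sensitive and its dot excludes newlines by default; and the optional `[ \t]*` stands in for the extra space added before `\R`. The function name `join_continuations` and the shortened sample text are made up for illustration.

```python
import re

# Python approximation of the Notepad++ pattern from this thread.
# Assumptions: \R is written out by hand, and [ \t]* tolerates stray
# spaces or tabs between </para> and the line break.
PATTERN = re.compile(
    r"</para>[ \t]*(?:\r\n|\n|\r)+<para>([A-Z ]+?[a-z].*?)</para>"
)


def join_continuations(text: str) -> str:
    """Repeat the replacement until a pass changes nothing."""
    while True:
        # Each match consumes the closing </para> of the joined paragraph,
        # so a run of several continuation paragraphs needs more passes.
        text, count = PATTERN.subn(r" \1</para>", text)
        if count == 0:
            return text


# Shortened, hypothetical sample; note the stray space after the first </para>.
sample = (
    "<para>KAREN RUBIN: Yeah. Absolutely.</para> \n"
    "<para>So you want to be thinking about bandwidth.</para>\n"
    "<para>And the reason that is important is simple.</para>\n"
    "<para>CARLYNN MAGLIANO SWEENEY: [INAUDIBLE].</para>"
)
print(join_continuations(sample))
```

The loop is there because the pattern swallows the closing tag of the paragraph it joins, so the next break has nothing to anchor on until the following pass.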
-
Hello, @alfred-streich, @alan-kilborn and All,
Since, seemingly, none of the lines that must be joined to their previous line contains a colon ( : ), an alternative to Alan’s syntax could be:
SEARCH (?-i)</para>\h*\R+\h*<para>(?![^\r\n]+:)
REPLACE \x20
Notes:
- First, the in-line modifier (?-i) forces a case-sensitive search
- The part \h*\R+\h* matches any run of horizontal blank chars (Space and Tab), possibly empty, followed by a non-empty run of line-breaks (\r\n, \n or \r), itself followed by another possible run of blank chars
- The part (?![^\r\n]+:) is a negative look-ahead structure. It defines a condition that must hold for the overall regex to match, although it is not part of the final match: the line, from just after the literal string <para> up to its line-break, must not contain any colon character
- Note that [^\r\n]+ defines a non-empty run of characters other than EOL chars. So, any chars after <para> up to the colon symbol :
- In the replacement, the syntax \x20 is the hexadecimal representation of a space character. You may just as well type a single space char in the Replace with: zone
Best regards,
guy038
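As a rough cross-check outside Notepad++, the look-ahead idea can be tried in Python. This is only a sketch under assumptions: Python's `re` module has neither `\h` nor `\R`, so `[ \t]` and a hand-written line-break group stand in for them. The second pattern, `CAPS_NAME`, is not from this thread; it is a hypothetical stricter variant that blocks the join only when the next para opens with capital letters and spaces directly followed by a colon, which may matter if an ordinary sentence contains a colon. The sample text is invented.

```python
import re

EOL = r"(?:\r\n|\n|\r)"  # hand-written stand-in for \R

# Look-ahead as posted: leave the break alone if the next line has a colon anywhere.
ANY_COLON = re.compile(rf"</para>[ \t]*{EOL}+[ \t]*<para>(?![^\r\n]+:)")

# Hypothetical stricter variant: leave the break alone only when the next line
# opens with an all-caps name (capital letters and spaces) directly before a colon.
CAPS_NAME = re.compile(rf"</para>[ \t]*{EOL}+[ \t]*<para>(?![A-Z][A-Z ]*:)")

sample = (
    "<para>KAREN RUBIN: Thank you.</para>\n"
    "<para>Here is the point: plan early.</para>\n"
    "<para>CARLYNN MAGLIANO SWEENEY: Right.</para>"
)

loose = ANY_COLON.sub(" ", sample)   # middle para keeps its break (it has a colon)
strict = CAPS_NAME.sub(" ", sample)  # middle para is joined to the first one
print(loose)
print(strict)
```

Because these patterns match only the tags and the line break, not the paragraph text, a single pass handles several continuation paras in a row. The stricter class `[A-Z ]` would need widening for names with apostrophes, hyphens, or periods.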
-