ReGex help removing data

Dev Petty

Hidetoshi Furuzawa
Born: 1912, California
Gardener
Topaz, Tule Lake Apr 1942–19 Feb 1946
Released to Sebastopol, California

Here is how I would like that data to look (“after” data):

Hidetoshi Furuzawa

*each entry begins with Born: and ends with California. I need everything between, other than the names, removed

Terry R

@Dev-Petty

A question about the before data. I see California is shown twice, and in both cases at the end of a line. Since that word can also appear on lines in between, it cannot reliably mark the end of a record on its own.

So is it always the “Born” line and exactly 3 lines after?
Or is there a blank line between “records”?

If you could show multiple before “records” in their entirety it might help in the solution.

Terry

PeterJones

@Dev-Petty ,

each entry begins with Born: and ends with California

FIND = (?s)Born:.*?California
REPLACE = <leave empty>
SEARCH MODE = Regular Expression
REPLACE ALL

If there are too many newlines after the replacement, you could add a \R before the Born: or after the California in the regex.

guy038

Hello, @dev-petty and All,

So, I suppose that this regex should work nicely :

SEARCH (?s-i)^Born:.+?California\R
REPLACE Leave EMPTY
Check the Regular expression search mode

Best Regards,

guy038

PeterJones

@Dev-Petty ,

When @Terry-R wrote “I see California is shown twice,” I realized I hadn’t noticed that. My answer (and, I think, @guy038’s) would stop at that first California, rather than going to the end.

Like @Terry-R , I think it would be helpful if you showed a few more examples from the same set of data, so that we could see variations in things like the number of lines, or whether the match can ever end on the same line as Born: or whether it always has to end on a later line.

Terry R

@PeterJones said in ReGex help removing data:

FIND = (?s)Born:.*?California

and @guy038 , I think both of your regexes only match the first line (the Born line), because that line already ends with California. That was the reason for my questions.
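To make the point concrete, here is a minimal Python sketch of the lazy match run against the sample record from the first post. It is only an illustration: Python's re module is not the engine Notepad++ uses, and the "\n" line endings are an assumption.

```python
import re

# Sample record copied from the question; "\n" line endings are assumed.
record = (
    "Hidetoshi Furuzawa\n"
    "Born: 1912, California\n"
    "Gardener\n"
    "Topaz, Tule Lake Apr 1942–19 Feb 1946\n"
    "Released to Sebastopol, California\n"
)

# Same pattern as the FIND field above: lazy match up to the first "California".
lazy = re.compile(r"(?s)Born:.*?California")

# The lazy quantifier stops at the first "California", which is on the Born line.
first_match = lazy.search(record).group(0)

# Replacing with an empty string therefore leaves the other lines in place.
remaining = lazy.sub("", record)

print(repr(first_match))
print(remaining)
```

The Gardener, camp and release lines all survive the replacement, which is why the end of the match needs a stricter anchor than the first California.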

Terry

guy038

Hi, @dev-petty, @peterjones, @terry-r and All,

Yes, I answered too quickly, without testing in N++. My bad !

So one correct syntax could be :

SEARCH (?s-i)(?-s:^Born:.+).+?California\R
REPLACE Leave EMPTY
Check the Regular expression search mode

What does this regex mean, apart from the literal strings Born: and California ?

The first part (?s-i) contains the initial modifiers, which apply to the whole regex :
- The (?s) syntax means that any . regex char, found in the regex, may represent any single character, including the line-break chars \r and/or \n.
- The (?-i) syntax means that the search is case-sensitive ( so not insensitive ! ). Thus it will find the words Born and California but not the words born and california or BORN and CALIFORNIA. If an insensitive search is needed, just use the (?si) syntax.
The second part is (?-s:^Born:.+), which is a non-capturing group ( a group whose contents we do not need later on, in search and/or replacement ! ) of the form (?:.........), with the -s modifier which applies to this group only. Thus, this part looks for the word Born, with that exact case, at the beginning of a line ^, followed by a colon, itself followed by any standard character ., repeated +, up to the very end of the current line, as it stops at the line-break.
The third part is .+? which represents the smallest ? range of any character ., including \r and \n, repeated +, until …
The fourth part is California\R which represents the word California, with this exact case, followed by \R which stands for any kind of line-break ( \r\n for Windows files, \n for Unix files or \r for old Mac files ).
In replacement, as its zone is empty, the whole match, from the Born: line to the line ending with California ( 4 lines in the sample ), is simply deleted !
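For anyone who wants to check the four parts described above outside N++, here is a rough Python sketch. It is an approximation, not the N++ regex itself : Python's re module has no \R, so (?:\r\n|\n|\r) is used instead, the leading (?s-i) is replaced by the re.DOTALL and re.MULTILINE flags ( Python is case-sensitive by default ), and the function name strip_record_details is made up for this example.

```python
import re

# Approximate Python translation of (?s-i)(?-s:^Born:.+).+?California\R
RECORD_DETAILS = re.compile(
    r"(?-s:^Born:.+)"        # the rest of the Born: line only (dot stops at \n here)
    r".+?"                   # smallest range of any chars, line-breaks included
    r"California"            # literal word, exact case
    r"(?:\r\n|\n|\r)",       # stand-in for \R, which Python's re lacks
    re.DOTALL | re.MULTILINE,
)

def strip_record_details(text: str) -> str:
    """Delete everything from a Born: line to the next line ending in California."""
    return RECORD_DETAILS.sub("", text)

# Sample record copied from the question; "\n" line endings are assumed.
record = (
    "Hidetoshi Furuzawa\n"
    "Born: 1912, California\n"
    "Gardener\n"
    "Topaz, Tule Lake Apr 1942–19 Feb 1946\n"
    "Released to Sebastopol, California\n"
)

cleaned = strip_record_details(record)
print(repr(cleaned))
```

On this single sample only the name line is left. Like the N++ regex, the sketch needs a line-break after the final California, so a last record without a trailing line-break would not be matched.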

BR

guy038