This might need a lookbehind which is beyond me, can someone provide the solution?
-
This is a sample of the line I need to edit:
Trial by Error (1997) by Mark Garland [only as by Mark A. Garland]
The result should be:
Trial by Error (1997) by Mark A. Garland
For most of the lines like this my expression:
find:( by)(.+)( \[only as)(.+)(\])
replace:\4
has worked, however on the sample above it finds the first “by”, before the (1997), and incorrectly replaces part of the book title. The other issue is I cannot guarantee the (1997) will appear, so I’m thinking I need to search for “only as” first, then back up to find the preceding “by”, then replace using my expression.
Only issue is I cannot get my head around the “lookaround” functions. I’d welcome a solution, and maybe detail of how it works.
-
If everything you want to delete is in “[…]” at the end of a line, you could do this, I think. Search for:
\s*\[[^[]+$
Replace with nothing at all.
I believe this will find any whitespace that may precede the last “[” on a line and everything that follows it.
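A quick way to see what this search removes is to run it outside Notepad++. The sketch below uses Python's re module as a stand-in, on the assumption that it treats this particular pattern the same way as Notepad++'s engine; the variable names are only for illustration.

```python
import re

line = "Trial by Error (1997) by Mark Garland [only as by Mark A. Garland]"

# \s*    optional whitespace in front of the bracket
# \[     a literal "["
# [^[]+  one or more characters that are not "[" ...
# $      ... running up to the end of the line
stripped = re.sub(r"\s*\[[^[]+$", "", line)
print(stripped)
```

On the sample line this drops the whole bracketed part and leaves the name in front of it untouched.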
-
Notice: Terry doesn’t want to just delete what’s in the brackets… Terry wants to replace the unbracketed name with the name from the brackets, and get rid of the bracketed portion.
I had originally tried changing the first (.+) to a non-greedy (.+?), but it still matched the first “by” rather than the second “by” (probably because it “non-greedily” found the first match). I’ve been racking my brains to come up with “match a series of characters as long as it doesn’t contain ' by '”… I’ve had some ideas, but none have panned out yet.
For @Terry-R’s information, this FAQ contains links to some good regex info (in case Terry wants to do more research while waiting for us to answer).
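The reason the non-greedy version does not help is that a regex engine always reports the leftmost position where a match is possible; laziness only changes how far the match extends from that starting point. A minimal sketch in Python's re module (an assumption: it is used here only as a convenient stand-in for Notepad++'s engine, and the brackets are escaped):

```python
import re

line = "Trial by Error (1997) by Mark Garland [only as by Mark A. Garland]"

# Terry's expression, once with a greedy and once with a lazy second group.
greedy = re.compile(r"( by)(.+)( \[only as)(.+)(\])")
lazy = re.compile(r"( by)(.+?)( \[only as)(.+)(\])")

greedy_start = greedy.search(line).start()
lazy_start = lazy.search(line).start()
greedy_result = greedy.sub(r"\4", line)
lazy_result = lazy.sub(r"\4", line)

print(greedy_start, lazy_start)  # both matches begin at the first " by"
print(greedy_result)
print(lazy_result)
```

Both variants start at the " by" right after "Trial", so both swallow part of the title.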
-
Oh, there, a couple more minutes of googling found this SO answer, which helped me get it to:
- Find What :
( by)((?! by ).)*?( \[only as)(.*)(\])
So the (?! by ) negative lookahead says you cannot match that string
Putting that before the dot says that the “matching anycharacter” cannot be preceded by the " by "
Wrapping that in the parens followed by *? means non-greedily match all of those.
What I don’t like about that is that \2 then only contains the last character of the original name, rather than the full original name.
Some more experimenting:
- Find What :
( by)((?:(?! by ).)*?)( \[only as)(.*)(\])
Made the original group #2 a non-capturing match by the ?:, then wrapped a new capture group around it, to make the whole original name \2 again.
Personally, I’m curious whether @Terry-R really wanted to capture all those others, since only \4 is used in the replacement shown.
If not, something more like
- Find what :
by (?:(?! by ).)*? \[only as( by .*)\]
- Replace with :
\1
simplifies it.
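The remark above about \2 holding only the last character can be reproduced with a small sketch. It uses Python's re module, on the assumption that repeated capture groups behave there as they do in Notepad++ (each repetition overwrites the group); the variable names are illustrative.

```python
import re

line = "Trial by Error (1997) by Mark Garland [only as by Mark A. Garland]"

# Capture group inside the repetition: group 2 is overwritten on every pass.
repeated = re.search(r"( by)((?! by ).)*?( \[only as)(.*)(\])", line)

# Non-capturing repetition wrapped in a single capture group:
# group 2 now spans the whole original name.
wrapped = re.search(r"( by)((?:(?! by ).)*?)( \[only as)(.*)(\])", line)

print(repr(repeated.group(2)))
print(repr(wrapped.group(2)))
print(repr(wrapped.group(4)))
```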
With the data
Trial by Error (1997) by Mark Garland [only as by Mark A. Garland]
Normal Song by Normal Artist
Dated Song (1976) by Disco Artist
Historical Song (1776) by Ancient Artist [only as by Anonymous]
North by South by A Stream by No One [only as by Some Pseudonym]
That final expression gives me
Trial by Error (1997) by Mark A. Garland
Normal Song by Normal Artist
Dated Song (1976) by Disco Artist
Historical Song (1776) by Anonymous
North by South by A Stream by Some Pseudonym
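The same run can be sketched in Python's re module (an assumption: it stands in for Notepad++'s engine, which it should match for these constructs). Note that this sketch uses the variant of the expression in which the space sits outside the final capture group; with the space inside the group, as written above, the replacement leaves two spaces in front of "by".

```python
import re

data = """\
Trial by Error (1997) by Mark Garland [only as by Mark A. Garland]
Normal Song by Normal Artist
Dated Song (1976) by Disco Artist
Historical Song (1776) by Ancient Artist [only as by Anonymous]
North by South by A Stream by No One [only as by Some Pseudonym]"""

# (?:(?! by ).)*? consumes a character only if " by " does not start there,
# so the match has to begin at the last "by" in front of the bracket.
pattern = re.compile(r"by (?:(?! by ).)*? \[only as (by .*)\]")

result = pattern.sub(r"\1", data)
print(result)
```

Lines without an [only as ...] part are left alone, because the dot does not cross line ends by default in Python.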
-
The solution
Find what : by (?:(?! by ).)*? \[only as( by .*)\]
Replace with : \1
was very close. I just needed to add \s at the start of the find expression as it otherwise left a double space before the “by”. Now I’ve just got to understand how it works.
If anyone is interested in what I’m trying to achieve, it’s the parsing of novel series from isfdb.org. As an example, the sample text I provided came from:
http://www.isfdb.org/cgi-bin/pe.cgi?314
This is the Star Trek Universe novels. So I grabbed the text in the main window. I then use a series of regexes to parse it into a CSV format for importing into Calibre.
This step was #6 in a sequence of 18 which “massage” the data into a format that can be imported. I do things such as removing all non-fiction (essay, interviews, etc), remove line numbers where they appear (that step still has to be run 1 replace at a time so I don’t remove titles with a number at the start). I also find variant titles and include those inside brackets on the first line. With translations, sometimes the primary novel is a foreign one, so I need to confirm the “English” title and replace, otherwise translations disappear. I also add in later steps the series name, add tags for Calibre to use.
I think I was doing pretty well given I’d just taken up Notepad++ recently (learning regex in the last month), but this one had me beat.
So very appreciative of the work you guys put in. The links provided I had mostly researched beforehand. This was my last stop as I’d exhausted all other options. I think I just don’t quite get the lookarounds (YET!). I knew enough to figure this was how it might be overcome, but not enough examples out there that got the final step.
So again, many thanks. Next I’ll try to put all/most of my steps together in a macro. My first attempt there was a miserable failure, almost nothing left but a mangled wreckage of bits and bytes which needed scooping up and putting in the bin!
-
Sorry, yes, I had the extra-space problem, but had solved it with moving the space out of the rightmost parentheses:
by (?:(?! by ).)*? \[only as (by .*)\]
but apparently the one in my post was still in my copy-buffer when I pasted it.
(If you choose your own, with the \s, I would recommend switching to \h, which only matches horizontal spaces (i.e., not newline characters), or \x20, which only matches the space character (ASCII 32).)
-
Thanks for that update. I did wonder if the space before the “by” in the last bracket area might have been a better choice. So I will choose your final (amended) version over mine.
-
Hi, @terry-r, @jim-dailey, @peterjones and All,
Why not this regex S/R, which does not need any look-around :
SEARCH
(?-s)^(.+ by ).+\[only as by (.+)\]
REPLACE
\1\2
Notes :
- As you can see, this regex matches, first, everything from the beginning of the line up to the last string by + a space character that comes before the bracketed part, stored as group 1, which must be rewritten
- Then followed, further on, by the string [only as by + a space character
- And, finally, by the name, stored as group 2, which must be rewritten too, + the ] symbol
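The notes above can be checked with a short sketch in Python's re module (an assumption: it is only a stand-in for Notepad++'s engine). The (?-s) prefix is left out, because in Python the dot already stops at line ends unless re.DOTALL is set.

```python
import re

line = "Trial by Error (1997) by Mark Garland [only as by Mark A. Garland]"

# (?-s) omitted: Python's dot does not match newlines by default.
pattern = re.compile(r"^(.+ by ).+\[only as by (.+)\]", re.MULTILINE)

m = pattern.search(line)
print(repr(m.group(1)))  # text kept in front of the author name
print(repr(m.group(2)))  # name taken from the brackets

replaced = pattern.sub(r"\1\2", line)
untouched = pattern.sub(r"\1\2", "Normal Song by Normal Artist")
print(replaced)
```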
Another form could be :
SEARCH
(?-s)^.+ by \K.+\[only as by (.+)\]
REPLACE
\1
But, with that second syntax, you must use, exclusively, the Replace All button !
Now, Peter, when you say :
So the (?! by ) negative lookahead says you cannot match that string
Putting that before the dot says that the “matching anycharacter” cannot be preceded by the " by "
It would be best to say :
Putting that before the dot says that the “matching any-char” cannot be the leading space of the string " by " ( without the quotes )
Just keep in mind that, at each step of the regex search, the regex engine is, let’s say, “between two” characters. So when it finds :
- A positive or negative look-ahead, (?=.....) or (?!....), it examines if the condition is true or false, right after
- A positive or negative look-behind, (?<=.....) or (?<!....), it examines if the condition is true or false, right before
Some examples :
- A) If we consider the regex (?-s)((?=456).)+, each time, before taking the next standard character ( . ) into account, the regex engine asks itself : “Is the string 456 right after ?”. Against the string 123456789, it just matches the string 4. Logical, as the location between the digits 3 and 4 is the unique location where the string 456 can be seen and, so, it just matches the digit 4 ( note that you could replace that regex by the simple regex 4(?=56) ! )
- B) Now, if we consider the regex (?-s)(.(?=456))+ against the string 123456789, this time it matches the 3 digit, as it’s the unique location where, right after, it can see the 456 string
- C) If we choose, this time, the negative look-ahead (?-s)((?!456).)+, against the same string, it matches the string 123, then the 56789 string. Indeed, only the 4 digit must be avoided because, when the engine location is between the 3 and 4 digits, the forbidden 456 string is present right after !
- D) And with the negative look-ahead (?-s)(.(?!456))+, it matches the 12 then the 456789 strings. Indeed, the 2 digit is not followed by the 456 string, and neither is the 4 digit ! Only the 3 digit, which is followed by 456, is left out
Note that :
- In cases A) and C), the condition is evaluated before examination of the present standard character ( . )
- In cases B) and D), the condition is evaluated after examination of the present standard character ( . )
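The four cases A) to D) can be replayed with a small sketch in Python's re module (an assumption: it is a stand-in for Notepad++'s engine; the (?-s) prefix is dropped because Python's dot excludes newlines by default).

```python
import re

text = "123456789"

patterns = {
    "A": r"((?=456).)+",   # look-ahead checked before each character
    "B": r"(.(?=456))+",   # look-ahead checked after each character
    "C": r"((?!456).)+",   # negative look-ahead before each character
    "D": r"(.(?!456))+",   # negative look-ahead after each character
}

# Collect every successive match of each pattern, in order.
results = {
    label: [m.group(0) for m in re.finditer(pattern, text)]
    for label, pattern in patterns.items()
}

for label, matches in results.items():
    print(label, matches)
```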
It’s very important, also, to remember that the current location, of the regex engine, is not incremented when evaluating look-around conditions !
For instance, with the regex ^123(?=.*#)456 :
- First, the regex engine matches the string 123 from beginning of line
- Then, the regex engine, located between digits 3 and 4, examines the positive look-ahead but, whatever the result, true or false, its position does NOT change
- So, it can match the next string 456
- Finally, if a # symbol is found somewhere after the digit 6, the look-ahead condition has been evaluated as true and an overall match occurs !
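A minimal sketch of this behaviour, again using Python's re module as an assumed stand-in for Notepad++'s engine. The two test strings are made up for the illustration.

```python
import re

pattern = re.compile(r"^123(?=.*#)456")

with_hash = pattern.search("123456789#")
without_hash = pattern.search("123456789")

# The look-ahead scans ahead for "#" but consumes nothing,
# so the match itself is still only the six digits.
print(with_hash.group(0), with_hash.span())
print(without_hash)
```

The match covers only 123456 even though the look-ahead had to read as far as the # symbol; without a # on the line there is no match at all.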
Best Regards,
guy038
-
-
Hi @guy038, thanks for your input. I have tested your expression (first version) and it works for me. Even better, I can easily understand it, as you say it doesn’t involve a lookaround. Therefore I change my allegiance to you, sorry @peterjones.
What I’m not absolutely sure on is how it does work, getting the 2nd last “by” every time.
My original expression had similar motives but even with the greedy symbol “+” it wouldn’t grab that far into the line. I suppose the rest of the expression has some bearing on how the line is processed, forcing it back a small amount to guarantee the 2nd last “by”.
I guess I still have lots to learn about the nuances of the regex ways.
Thanks
Terry
-
Hi, @terry-r and All,
A very simplified version of my regex (?-s)^(.+ by ).+\[only as by (.+)\] could be :
^.+ by .+[only as by .+]
( BEWARE : This syntax is not functional, it’s just for explanations !)
As the syntax .+ stands for the longest range of standard characters which still allows the remaining part of the regex to match, this regex searches, from beginning of line, for :
- Any non-null range of standard characters, followed by the string by surrounded by two space characters
- Then a second non-null range of standard characters, followed by the string [only as by, followed by a space
- And, finally, a third non-null range of standard characters, followed by the ] symbol, ending the line
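To see where each of the three ranges ends up, the simplified regex can be written with every .+ in its own capture group. This is only a sketch in Python's re module (an assumed stand-in for Notepad++'s engine), with the bracket escaped so that it actually runs.

```python
import re

line = "North by South by A Stream by No One [only as by Some Pseudonym]"

# Same shape as the simplified regex; each ".+" is captured to show what it takes.
m = re.search(r"^(.+) by (.+)\[only as by (.+)\]", line)

print(repr(m.group(1)))  # first range: backs off to the " by " before the old name
print(repr(m.group(2)))  # second range: the old name plus its trailing space
print(repr(m.group(3)))  # third range: the name inside the brackets
```

The first .+ starts out by taking the whole line, then gives characters back until the rest of the expression fits. That is why it settles on the " by " just before the original author name and not on an earlier or later one.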
Don’t be afraid to ask the community if you feel some difficulties, when elaborating some advanced regexes !
On the other hand, you’ll get some regex documentation here
-