Regex - Positive Look Behind With *
-
Hello,
I am trying to do a find and replace on lines made of a data type prefix and a variable name concatenated together, followed by a comma, then a CR and LF.
Example lines would look like this.
intCustomerKey,
strCustomerName,
strAddress,
blnActive,
I want to insert text between the variable name and the comma, keyed on the data type prefix and the comma, without knowing what the variable name is. The result should look like this after replacing , with '', for str data types.
intCustomerKey,
strCustomerName '',
strAddress '',
blnActive,
I constructed a regex using positive look behind which I hoped would match the comma on str data types in the example but I get “Find: Invalid regular expression” error on the following:
(?<=str.*),
Testing with different syntax leads me to believe that positive look behind doesn’t support certain quantifiers.
For example, the following matches the 2nd comma.
(?<=str…),
(?<=str.{12}),
However, I get invalid regular expression error with the following.
(?<=str.{1,50}),
(?<=str.+),
Is there a different expression I can use to accomplish the matches?
I am on Notepad++ v7.8.4
Thank you,
Ray
-
How about trying this?:
Find what box:
^str(.*?),
Replace with box:
str\1",
Match case checkbox: ticked
Wrap around checkbox: ticked
Search mode radiobutton: Regular expression
Press the Replace All button
-
@Alan-Kilborn’s suggestion will hopefully work for you.
But if you’re wondering why
I get “Find: Invalid regular expression” error on the following:
(?<=str.*)
(?<=str.{1,50})
(?<=str.+)
Notepad++ uses the Boost regular expression library; the Boost docs say (emphasis added):
(?<=pattern)
consumes zero characters, only if pattern could be matched against the characters preceding the current position (pattern must be of fixed length).
Thus, the lookbehind is documented to only work with fixed length, and .* , .{1,50} , and .+ are all variable-length expressions (zero or more, 1 to 50, or one or more all have a range of lengths, not a fixed length).
-
@Alan-Kilborn and @PeterJones,
Thank you both for your prompt reply. The suggested method works perfectly, and I appreciate the explanation.
Best,
Ray
-
Plus, lookbehind is problematic with Notepad++ currently, in certain situations (example: at the beginning of a regex).
-
Hello, @ray-h, @alan-kilborn, @peterjones and All,
First, here is general information about positive/negative look-behinds and look-aheads, with the present N++ Boost regex engine :
- Firstly, a look-behind must necessarily match a fixed-length string, so any quantifier syntax, like * , + , ? , {n,} and {m,n} , is forbidden inside a look-behind. Only the .{n} syntax is allowed !
=> For instance, the syntaxes (?<=.{3})XYZ or (?<=...)XYZ or (?<=.ABC.)XYZ are allowed
=> But the syntaxes (?<=.{3,})XYZ or (?<=.{3,5})XYZ or (?<=^A.+M)XYZ are forbidden
To overcome this limitation of look-behinds, you may use the special \K syntax ( Keep text out of the match ), which resets the start of the reported match and keeps the text matched so far out of the overall regex match !
=> For instance, the wrong syntax (?-is)(?<=^A.+M)XYZ could be replaced with the correct form (?-is)^A.+M\KXYZ
However, note that if you're using the \K syntax in a search/replacement, you'll have to click on the Replace All button, exclusively !
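As a rough cross-check outside Notepad++, here is a minimal sketch in Python. It assumes Python's built-in re module, which is not the Boost engine but also insists on fixed-width look-behinds. The re module has no \K, so the sketch emulates it with a capturing group that is written back in the replacement. The sample text and the replacement string --- are made up for illustration.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Fixed-length look-behinds are accepted, variable-length ones are rejected.
allowed = [r"(?<=.{3})XYZ", r"(?<=...)XYZ", r"(?<=.ABC.)XYZ"]
forbidden = [r"(?<=.{3,})XYZ", r"(?<=.{3,5})XYZ", r"(?<=^A.+M)XYZ"]
allowed_ok = all(compiles(p) for p in allowed)
forbidden_ok = not any(compiles(p) for p in forbidden)
print(allowed_ok, forbidden_ok)

# Stand-in for ^A.+M\KXYZ : capture the prefix and re-emit it unchanged.
text = "A123MXYZ\nB123MXYZ\n"  # hypothetical sample text
result = re.sub(r"(?m)^(A.+M)XYZ", r"\g<1>---", text)
print(result)
```

Only the XYZ on the line starting with A is replaced; the captured prefix A123M is put back untouched, which is the effect \K gives in Notepad++.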
- Secondly, if a look-behind contains alternatives, each of them must be of the same length :
=> For instance, the syntaxes (?<=AAA|BBB|CCC)XYZ or (?<=A...A|B.{3}B|C789C)XYZ are allowed
=> But the syntaxes (?<=A|BB|CCC)XYZ or (?<=A...A|B.{5}C|C012345C)XYZ are forbidden
A possible work-around is to use one look-behind per alternative :
For instance, the wrong syntax (?-i)(?<=A|BB|CCC)XYZ can be replaced with the correct form (?-i)((?<=A)|(?<=BB)|(?<=CCC))XYZ
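The same work-around can be tried in Python, whose re module (again an assumption, not the Boost engine) applies a comparable same-length rule to alternatives inside a look-behind. The test strings are hypothetical.

```python
import re

# One look-behind per alternative, so each look-behind has a fixed length.
pattern = re.compile(r"(?:(?<=A)|(?<=BB)|(?<=CCC))XYZ")

samples = ["AXYZ", "BBXYZ", "CCCXYZ", "DXYZ"]  # hypothetical test strings
matched = [s for s in samples if pattern.search(s)]
print(matched)

# The single look-behind with unequal alternatives is rejected by re.
try:
    re.compile(r"(?<=A|BB|CCC)XYZ")
    rejected = False
except re.error:
    rejected = True
print(rejected)
```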
- Thirdly, look-arounds may be nested :
=> For instance, the regex (?-i)(?<=(?<!GHI)PQR)XYZ will match any string XYZ ONLY IF preceded with the string PQR, itself not preceded with the string GHI, with this exact case
In a similar way, the regex (?-i)ABC(?=GHI(?!XYZ)) would match any string ABC ONLY IF followed with the string GHI, itself not followed with the string XYZ, with this exact case
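A small sketch of both nested forms, assuming Python's re module as a stand-in for the Notepad++ engine (matching is case-sensitive by default there, so the (?-i) modifier is dropped). The subject strings are invented for the test.

```python
import re

behind = re.compile(r"(?<=(?<!GHI)PQR)XYZ")
ahead = re.compile(r"ABC(?=GHI(?!XYZ))")

# XYZ preceded by PQR, where PQR is NOT preceded by GHI
ok_behind = bool(behind.search("ABCPQRXYZ"))
# PQR is preceded by GHI here, so the outer look-behind fails
bad_behind = bool(behind.search("GHIPQRXYZ"))

# ABC followed by GHI, where GHI is NOT followed by XYZ
ok_ahead = bool(ahead.search("ABCGHIJKL"))
# GHI is followed by XYZ here, so the outer look-ahead fails
bad_ahead = bool(ahead.search("ABCGHIXYZ"))

print(ok_behind, bad_behind, ok_ahead, bad_ahead)
```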
- Fourthly, a look-around is not itself a capturing group. If you want to capture the text matched by the regex inside a look-around, or part of it, you'll have to insert capturing parentheses.
=> For instance, the regex (?-si)(?<=ABC(DEF)GHI)XYZ.+\1 would match the greatest range of standard characters, from the string XYZ up to the string DEF, ONLY IF preceded with the string ABCDEFGHI
and the regex (?-i)(?<=(ABC))XYZ\1 would match the string XYZABC, ONLY IF preceded with the string ABC, with all strings upper-case !
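To illustrate, a short sketch with Python's re module (an assumption; it is not the Boost engine, but it also lets a group captured inside a look-behind be reused after it). The subject strings are hypothetical.

```python
import re

# Group 1 is captured inside the look-behind and reused after it.
m1 = re.search(r"(?<=ABC(DEF)GHI)XYZ.+\1", "ABCDEFGHIXYZ--DEF--DEF!")
m2 = re.search(r"(?<=(ABC))XYZ\1", "ABCXYZABC")

# The look-behind text is not part of the match, but group 1 is filled.
print(m1.group(0), m1.group(1))
print(m2.group(0), m2.group(1), m2.start())
```

In both cases the overall match starts at XYZ, while the text captured by the parentheses inside the look-behind stays available through \1.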
- Fifthly, look-arounds are an atomic structure. Once a look-around condition is satisfied, the regex engine will not backtrack inside a look-around to try other permutations !
=> For instance, the regex (?=(\d+))\w+\1 , with a look-ahead, against the text 123x12, will never match ! Indeed, when the regex engine starts, the look-ahead regex \d+ matches the number 123, which is stored as group 1. As no backtracking process can occur, the value of the back-reference \1 will always be 123 ! Therefore, this regex always fails because, obviously, the end of the text is not equal to 123, despite all the possible backtracking steps of the part \w+ , located outside the look-ahead structure !
Now, if the look-ahead regex had NOT been atomic, then a backtracking process could have occurred. So, while the regex engine position is still right before the first digit 1 of the subject string, the look-ahead regex \d+ would have backtracked and matched the string 12. In that case, the part \w+ would have matched, after backtracking, the string 123x instead of the string 123x12, so that the back-reference \1 matches, of course, the remaining string 12 !
On the contrary, with the other syntax (\d+)\w+\1 , without the look-ahead, against the text 123x12, placed in a new tab, the part \d+ is not atomic. Therefore the process is :

    (\d+)    \w+     \1 = 123
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    123      x12     EMPTY    Does EMPTY match \1 ( = 123 ) ?   =>  NO   =>  Backtracking on \w+
    123      x1      2        Does 2 match \1 ( = 123 ) ?       =>  NO   =>  Backtracking on \w+
    123      x       12       Does 12 match \1 ( = 123 ) ?      =>  NO   =>  As backtracking on \w+ is IMPOSSIBLE  =>  Backtracking on \d+

    Then :

    (\d+)    \w+     \1 = 12
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    12       3x12    EMPTY    Does EMPTY match \1 ( = 12 ) ?    =>  NO   =>  Backtracking on \w+
    12       3x1     2        Does 2 match \1 ( = 12 ) ?        =>  NO   =>  Backtracking on \w+
    12       3x      12       Does 12 match \1 ( = 12 ) ?       =>  YES  =>  1 SUCCESSFUL match

So, the regex (\d+)\w+\1 does match all the subject string 123x12
Note that the regex (\d++)\w+\1 , with an atomic (possessive) quantifier at the beginning, would not match anything because the atomic part \d++ will never backtrack from value 123 to value 12 => The overall regex fails !
=> Another example : consider the regex (\d+)\K\w+\1 , with the \K syntax, against the text 123x12 , placed in a new tab. Again, as the part \d+ is not atomic, the process is :

    (\d+)    \K    \w+     \1 = 123
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    123      |     x12     EMPTY    Does EMPTY match \1 ( = 123 ) ?   =>  NO   =>  Backtracking on \w+
    123      |     x1      2        Does 2 match \1 ( = 123 ) ?       =>  NO   =>  Backtracking on \w+
    123      |     x       12       Does 12 match \1 ( = 123 ) ?      =>  NO   =>  As backtracking on \w+ is IMPOSSIBLE  =>  Backtracking on \d+

    Then :

    (\d+)    \K    \w+     \1 = 12
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    12       |     3x12    EMPTY    Does EMPTY match \1 ( = 12 ) ?    =>  NO   =>  Backtracking on \w+
    12       |     3x1     2        Does 2 match \1 ( = 12 ) ?        =>  NO   =>  Backtracking on \w+
    12       |     3x      12       Does 12 match \1 ( = 12 ) ?       =>  YES  =>  1 SUCCESSFUL match

So, the regex (\d+)\K\w+\1 does match the string 3x12 against the subject string 123x12
Note, again, that the regex (\d++)\K\w+\1 , with an atomic (possessive) quantifier at the beginning, would not match anything because the atomic part \d++ will never backtrack from value 123 to value 12 => The overall regex fails !
Of course, the regex (?<=(\d+))\w+\1 is not correct, due to the non-fixed regex \d+ , inside the look-behind. But EVEN IF this syntax had been correct, this regex would never have matched the string 123x12, due to the atomic state of look-arounds !
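The first two regexes of this point can be compared in a few lines of Python. This is only a sketch, assuming Python's re module, where look-aheads are likewise atomic; the possessive \d++ and \K variants are left out because \K is not available in re.

```python
import re

text = "123x12"

# Look-ahead version: group 1 is frozen at 123 (or a later digit run),
# so no start position yields a match.
atomic = re.search(r"(?=(\d+))\w+\1", text)

# Plain version: \d+ may backtrack from 123 to 12, so the match succeeds.
plain = re.search(r"(\d+)\w+\1", text)

print(atomic)
print(plain.group(0), plain.group(1))
```

The plain version ends with group 1 equal to 12, which is the last line of the backtracking trace above.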
Now, @Ray-H, let’s go back to your problem !
From your initial text :
intCustomerKey,
strCustomerName,
strAddress,
blnActive,
An alternate solution to the @alan-kilborn one, could be :
- SEARCH (?-si)^str.+\K(?=,)
- REPLACE ''
- Click exclusively on the Replace All button ( Due to the \K syntax, do not use the Replace button )
You get your expected text :
intCustomerKey,
strCustomerName'',
strAddress'',
blnActive,
Notes :
- First, the in-line modifier (?-si) forces the regex engine :
  - To do the search in a case-sensitive way ( -i )
  - To suppose that the dot . matches a single standard character and not any EOL character ( -s )
- Then the part ^str looks for the string str, with that exact case, at the beginning of the current line ( ^ )
- Now, the part .+ matches the greatest non-null range of chars…
- Till a comma , symbol, due to the positive look-ahead structure (?=,) which defines a condition which must be true to satisfy the overall regex
- Finally, because of the \K syntax, ONLY the empty string, between all the characters after the string str and the ending comma symbol, is matched !
- As the replacement zone is '' , this zero-length zone is simply replaced with two single quotes '
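For readers who want to try the same replacement outside Notepad++, here is a hedged Python sketch. Python's re module has no \K, so the prefix is captured and written back instead; this is a stand-in for the search above, not the same regex. Case-sensitive matching and a dot that stops at the line feed are the re defaults, and the sample text uses CR LF line endings as in the question.

```python
import re

text = "intCustomerKey,\r\nstrCustomerName,\r\nstrAddress,\r\nblnActive,\r\n"

# \K is emulated by capturing the prefix (group 1) and re-emitting it;
# the look-ahead (?=,) leaves the comma itself untouched.
result = re.sub(r"(?m)^(str.+)(?=,)", r"\g<1>''", text)
print(result)
```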
Best Regards,
guy038
P.S. :
Do you know that you may even place a look-behind AFTER the string to search for or place a look-ahead BEFORE the string to search for !?
For instance, from my complete name Guy THEVENOT , with a space char between forename and name :
- The regex (?-i)(?=....THE)Guy would find the string Guy, if followed with the string ' THE', without quotes, with this exact case
- The regex (?-i)THE(?<=Guy....) would find the string THE, if preceded with the string 'Guy ', without quotes, with this exact case
Of course, better not use these academic examples, in normal production ;-))
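Both academic forms can be checked quickly in Python (a sketch assuming the re module, with the name written with a single space between forename and name):

```python
import re

name = "Guy THEVENOT"

# Look-ahead placed BEFORE the searched string
m_ahead = re.search(r"(?=....THE)Guy", name)
# Look-behind placed AFTER the searched string
m_behind = re.search(r"THE(?<=Guy....)", name)

print(m_ahead.group(0), m_ahead.span())
print(m_behind.group(0), m_behind.span())
```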
-
Hello @guy038,
Thank you for the detailed breakdown of the problem. There is a lot of useful knowledge here. I especially appreciate you pointing out the subtleties brought upon the look-arounds by their atomic structure.
Best,
Ray