Regex - Positive Look Behind With *



  • Hello,

    I am trying to do a find and replace on lines that consist of a data type and a variable name concatenated together, followed by a comma and then a CR and LF.
    Example lines would look like this.
    intCustomerKey,
    strCustomerName,
    strAddress,
    blnActive,
    I want to insert data between the variable name and the comma using the data type and the comma without knowing what the variable name is. The result should look like this after replacing , with ‘’, for str data types.
    intCustomerKey,
    strCustomerName ‘’,
    strAddress ‘’,
    blnActive,
    I constructed a regex using positive look behind which I hoped would match the comma on str data types in the example but I get “Find: Invalid regular expression” error on the following:
    (?<=str.*),
    Testing with different syntax leads me to believe that positive look behind doesn’t support certain quantifiers.
    For example, the following matches the 2nd comma.
    (?<=str…),
    (?<=str.{12}),
    However, I get invalid regular expression error with the following.
    (?<=str.{1,50}),
    (?<=str.+),

    Is there a different expression I can use to accomplish the matches?
    I am on Notepad++ v7.8.4

    Thank you,
    Ray



  • @Ray-H

    How about trying this?:

    Find what box: ^str(.*?),
    Replace with box: str\1'',
    Match case checkbox: ticked
    Wrap around checkbox: ticked
    Search mode radiobutton: Regular expression
    Press the Replace All button
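For anyone who wants to try the same idea outside Notepad++, here is a minimal sketch in Python. It assumes the intended replacement inserts two single quotes before the comma (as in the desired output above), and it uses Python's built-in `re` module, which is not the Boost engine, so treat it only as an illustration of the capture-group approach. The sample text and variable names are made up for the example.

```python
import re

text = "intCustomerKey,\nstrCustomerName,\nstrAddress,\nblnActive,\n"

# ^str(.*?),  lazily captures everything between a leading "str" and the first comma.
# re.MULTILINE lets ^ match at the start of every line.
# Python's re is case-sensitive by default, like ticking "Match case".
pattern = re.compile(r"^str(.*?),", re.MULTILINE)

# Assumption: two single quotes go between the variable name and the comma.
result = pattern.sub(r"str\1'',", text)
print(result)
```

Lines that do not start with str are left untouched because the pattern is anchored with ^.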



  • @Ray-H,

    @Alan-Kilborn’s suggestion will hopefully work for you.

    But if you’re wondering why

    I get “Find: Invalid regular expression” error on the following:
    (?<=str.*)
    (?<=str.{1,50})
    (?<=str.+)

    Notepad++ uses the Boost regular expression library; the Boost docs say (emphasis added):

    (?<=pattern) consumes zero characters, only if pattern could be matched against the characters preceding the current position (pattern must be of fixed length).

    Thus, the lookbehind is documented to only work with fixed length, and .*, .{1,50}, and .+ are all variable-length expressions (zero or more, 1 to 50, or one or more all have a range of lengths, not a fixed length).
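The same fixed-length rule can be seen in other engines. As a rough illustration (using Python's built-in `re` module, which is not Boost but has a similar restriction on look-behinds), the sketch below checks which of the patterns from this thread compile. The helper name `compiles` is just something made up for the example.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Fixed-length look-behinds: accepted.
fixed = [r"(?<=str.{12}),", r"(?<=str...),"]
# Variable-length look-behinds: rejected.
variable = [r"(?<=str.*),", r"(?<=str.{1,50}),", r"(?<=str.+),"]

for p in fixed + variable:
    print(p, "->", "accepted" if compiles(p) else "rejected")

# "CustomerName" is exactly 12 characters, so this finds the comma.
hit = re.search(r"(?<=str.{12}),", "strCustomerName,")
```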



  • @Alan-Kilborn and @PeterJones,

    Thank you both for your prompt reply. The suggested method works perfectly, and I appreciate the explanation.

    Best,
    Ray



  • @PeterJones

    Plus, lookbehind is problematic with Notepad++ currently, in certain situations (example: at the beginning of a regex).



  • Hello, @ray-h, @alan-kilborn, @peterjones and All,

    First, here is general information about positive/negative look-behinds and look-aheads, with the present N++ Boost regex engine :

    • Firstly, a look-behind must necessarily match a fixed-length string, so variable-length quantifiers, like * , + , ? , {n,} and {m,n}, are forbidden inside a look-behind. Only the fixed-count {n} syntax is allowed !

    => For instance, the syntaxes (?<=.{3})XYZ or (?<=...)XYZ or (?<=.ABC.)XYZ are allowed

    => But the syntaxes (?<=.{3,})XYZ or (?<=.{3,5})XYZ or (?<=^A.+M)XYZ are forbidden

    To overcome this limitation of look-behinds, you may use the special \K syntax ( Keep text out of the match ), which resets the start of the reported match, so that the text matched so far is kept out of the overall regex match !

    => For instance, the wrong syntax (?-is)(?<=^A.+M)XYZ could be replaced with the correct form (?-is)^A.+M\KXYZ

    However, note that if you’re using such \K syntax, in a search/replacement, it’s worth pointing out that you’ll have to click on the Replace All button, exclusively !


    • Secondly, if a look-behind contains alternatives, each of them must be of the same length :

    => For instance, the syntaxes (?<=AAA|BBB|CCC)XYZ, or (?<=A...A|B.{3}B|C789C)XYZ are allowed

    => But the syntaxes (?<=A|BB|CCC)XYZ, or (?<=A...A|B.{5}C|C012345C)XYZ are forbidden

    A possible work-around is to use a separate look-behind for each alternative :

    For instance, the wrong syntax (?-i)(?<=A|BB|CCC)XYZ can be replaced with the correct form (?-i)((?<=A)|(?<=BB)|(?<=CCC))XYZ
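A quick way to see the "same length alternatives" rule and the work-around in action is the sketch below. It uses Python's built-in `re` module as a stand-in (an assumption: it is not the Boost engine, but it applies a comparable fixed-width check), and a non-capturing group instead of the capturing one, which does not change what is matched. The test string is invented for the example.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

same_length_ok = compiles(r"(?<=AAA|BBB|CCC)XYZ")   # alternatives all 3 chars long
mixed_length_ok = compiles(r"(?<=A|BB|CCC)XYZ")     # alternatives of 1, 2 and 3 chars

# Work-around: one look-behind per alternative, combined with |
workaround = re.compile(r"(?:(?<=A)|(?<=BB)|(?<=CCC))XYZ")
hits = [m.start() for m in workaround.finditer("AXYZ BBXYZ CCCXYZ DXYZ")]
print(same_length_ok, mixed_length_ok, hits)
```

The last XYZ is preceded by D, so none of the three look-behinds is satisfied there.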


    • Thirdly, look-arounds may be nested :

    => For instance, the regex (?-i)(?<=(?<!GHI)PQR)XYZ will match any string XYZ  ONLY IF preceded with the string PQR, itself not preceded with the string GHI, with this exact case

    In a similar way, the regex (?-i)ABC(?=GHI(?!XYZ)) would match any string ABC  ONLY IF followed with the string GHI, itself not followed with the string XYZ, with this exact case
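The two nested look-arounds above can be tried out with a small sketch. It is written for Python's built-in `re` module (an assumption, not the Boost engine), where matching is case-sensitive by default, so the (?-i) modifier is simply left out. The sample strings are invented.

```python
import re

# XYZ preceded by PQR, where that PQR is itself NOT preceded by GHI
nested_behind = re.compile(r"(?<=(?<!GHI)PQR)XYZ")
# ABC followed by GHI, where that GHI is itself NOT followed by XYZ
nested_ahead = re.compile(r"ABC(?=GHI(?!XYZ))")

behind_ok = nested_behind.search("ABCPQRXYZ")       # PQR preceded by ABC
behind_blocked = nested_behind.search("GHIPQRXYZ")  # PQR preceded by GHI
ahead_ok = nested_ahead.search("ABCGHIJKL")         # GHI followed by JKL
ahead_blocked = nested_ahead.search("ABCGHIXYZ")    # GHI followed by XYZ

print(bool(behind_ok), bool(behind_blocked), bool(ahead_ok), bool(ahead_blocked))
```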


    • Fourthly, a look-around is not itself a capturing group. If you want to capture what the regex inside a look-around matched, or part of it, you’ll have to insert capturing parentheses.

    => For instance, the regex (?-si)(?<=ABC(DEF)GHI)XYZ.+\1 would match the greatest range of standard characters, between the string XYZ and the string DEF,  ONLY IF preceded with the string ABCDEFGHI and the regex (?-i)(?<=(ABC))XYZ\1 would match the string XYZABC,  ONLY IF preceded with the string ABC, with all strings upper-case !
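Here is a small sketch of both regexes, again with Python's built-in `re` module standing in for the Boost engine (an assumption). The (?-si) and (?-i) modifiers are dropped because case-sensitive matching and "dot does not match EOL" are already Python's defaults. The subject strings are invented.

```python
import re

# Group 1 is captured INSIDE the look-behind and reused OUTSIDE it with \1
m1 = re.search(r"(?<=ABC(DEF)GHI)XYZ.+\1", "ABCDEFGHIXYZ---DEF---DEF!")
m2 = re.search(r"(?<=(ABC))XYZ\1", "ABCXYZABC")

print(m1.group(0), m1.group(1))
print(m2.group(0), m2.group(1))
```

In the first case the greedy .+ runs up to the last DEF of the line, which is why the match covers both DEF strings.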


    • Fifthly, look-arounds are an atomic structure. Once a look-around condition is satisfied, the regex engine will not backtrack inside a look-around to try other permutations !

    => For instance, the regex (?=(\d+))\w+\1, with a look-ahead, against the text 123x12, will never match ! Indeed, when the regex engine starts, the look-ahead regex \d+ matches the number 123, which is stored as group 1. As no backtracking process can occur, the value of the back-reference \1 will always be 123 ! Therefore, this regex always fails because, obviously, the end of the text is not equal to 123, despite all the possible backtracking steps of the part \w+, located outside the look-ahead structure !

    Now, if the look-ahead regex had NOT been atomic, then a backtracking process could have occurred. So, while the regex engine position is still right before the first digit 1 of the subject string, the look-ahead regex \d+ would have backtracked and matched the string 12. In that case, the part \w+ would have matched, after backtracking, from value 123x12 to the string 123x, in order that the back-reference \1 matches, of course, the remainder string 12 !

    On the contrary, with the other syntax (\d+)\w+\1, without the look-ahead, against the text 123x12, placed in a new tab, the part \d+ is not atomic. Therefore the process is :

    (\d+)    \w+      \1 = 123
    --------------------------------------------------------------------------------------------------------------------
    123      x12       EMPTY      Does EMPTY match \1 ( = 123 ) => NO => Backtracking on \w+
    123      x1        2          Does 2     match \1 ( = 123 ) => NO => Backtracking on \w+
    123      x         12         Does 12    match \1 ( = 123 ) => NO => As backtracking on \w+ is IMPOSSIBLE => Backtracking on \d+
    
    Then :
    
    (\d+)    \w+      \1 = 12
    --------------------------------------------------------------------------------------------------------------
    12       3x12      EMPTY      Does EMPTY match \1 ( = 12 ) => NO => Backtracking on \w+
    12       3x1       2          Does 2     match \1 ( = 12 ) => NO => Backtracking on \w+
    12       3x        12         Does 12    match \1 ( = 12 ) => Yes => 1 SUCCESSFUL match
    

    So, the regex (\d+)\w+\1 does match all the subject string 123x12

    Remark that the regex (\d++)\w+\1, with an atomic quantifier, at beginning, would not match anything because the atomic part \d++ will never backtrack from value 123 to value 12 => The overall regex fails !
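The difference between the atomic look-ahead and the plain capturing group can be checked with a short sketch. It uses Python's built-in `re` module as a stand-in for the Boost engine (an assumption: its look-aheads are also not re-entered once they have succeeded). The third pattern is the usual look-ahead plus back-reference trick for emulating an atomic \d+, used here so the example does not depend on possessive quantifier support.

```python
import re

subject = "123x12"

# Look-ahead version: group 1 is set inside the look-ahead and never revised,
# so \1 can never be shortened from 123 to 12.
lookahead_match = re.search(r"(?=(\d+))\w+\1", subject)

# Plain capturing group: \d+ may backtrack from 123 to 12.
plain_match = re.search(r"(\d+)\w+\1", subject)

# (?=(\d+))\1 consumes the digits atomically, much like \d++ would.
emulated_atomic = re.search(r"(?=(\d+))\1\w+\1", subject)

print(lookahead_match)                          # no match object
print(plain_match.group(0), plain_match.group(1))
print(emulated_atomic)                          # no match object
```

Note that a search also retries from later start positions ( 23, 3, 12, 2 ), but none of those captured values reappears later in the subject string, so the first and third patterns fail everywhere.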


    => Another example : consider the regex (\d+)\K\w+\1, with the \K syntax, against the text 123x12, placed in a new tab. Again, as the part \d+ is not atomic, the process is :

    (\d+)   \K    \w+    \1 = 123
    --------------------------------------------------------------------------------------------------------------------
    123     |     x12      EMPTY     Does EMPTY matches \1 ( = 123 ) => NO => Backtracking on \w+
    123     |     x1       2         Does 2     matches \1 ( = 123 ) => NO => Backtracking on \w+
    123     |     x        12        Does 12    matches \1 ( = 123 ) => NO => As backtracking on \w+ IMPOSSIBLE => Backtracking on \d+
    
    Then :
    
    (\d+)   \K    \w+     \1 = 12
    ----------------------------------------------------------------------------------------------------------------
    12      |     3x12     EMPTY     Does EMPTY match \1 ( = 12 ) => NO => Backtracking on \w+
    12      |     3x1      2         Does 2     match \1 ( = 12 ) => NO => Backtracking on \w+
    12      |     3x       12        Does 12    match \1 ( = 12 ) => Yes => 1 SUCCESSFUL match
    

    So, the regex (\d+)\K\w+\1 does match the string 3x12 against the subject string 123x12
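Python's built-in `re` module has no \K, so the sketch below only approximates it: the part after \K is wrapped in a second capturing group, and that group plays the role of "the overall match". This is an emulation under that assumption, not the Boost behaviour itself.

```python
import re

subject = "123x12"

# (\d+)\K\w+\1  is approximated by  (\d+)(\w+\1) :
# group 1 = text before \K (kept out), group 2 = what \K would report.
m = re.search(r"(\d+)(\w+\1)", subject)

kept_out = m.group(1)
reported = m.group(2)
print(kept_out, reported, m.span(2))
```

As in the table above, \d+ has to give back the digit 3 before the back-reference can succeed, so the reported part is 3x12.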

    Remark, again, that the regex (\d++)\K\w+\1, with an atomic quantifier, at beginning, would not match anything because the atomic part \d++ will never backtrack from value 123 to value 12 => The overall regex fails !

    Of course, the regex (?<=(\d+))\w+\1 is not correct, due to the non-fixed regex \d+, inside the look-behind. But  EVEN IF this syntax had been correct, this regex would never have matched the string 123x12, due to the atomic state of look-arounds !


    Now, @Ray-H, let’s go back to your problem !

    From your initial text :

    intCustomerKey,
    strCustomerName,
    strAddress,
    blnActive,
    

    An alternate solution to the @alan-kilborn one, could be :

    • SEARCH (?-si)^str.+\K(?=,)

    • REPLACE ''

    • Click exclusively on the Replace All button ( Due to the \K syntax, do not use the Replace button )

    You get your expected text :

    intCustomerKey,
    strCustomerName'',
    strAddress'',
    blnActive,
    

    Notes :

    • First the in-line modifier (?-si) forces the regex engine :

      • To do the search in a case-sensitive way ( -i )

      • To suppose that the dot . matches a single standard character and not any EOL character ( -s )

    • Then the part ^str, looks for the string str, with that exact case, at beginning of current line ( ^ )

    • Now, the part .+ matches the greatest non-null range of chars…

    • Till a comma , symbol, due to the positive look-ahead structure (?=,) which defines a condition which must be true to satisfy the overall regex

    • Finally, because of the \K syntax, ONLY  the empty string, between all the characters, after the string str, and the ending comma symbol, is matched !

    • As the replacement zone is '', this zero-length zone is simply replaced with two single quotes '
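The steps in these notes can be sketched in Python as well. Since Python's built-in `re` module has no \K (and it is not the Boost engine), the sketch captures the part before \K in group 1 and writes it back in the replacement, which should give the same result under that assumption. The sample text is the one from this thread with plain LF line endings.

```python
import re

text = "intCustomerKey,\nstrCustomerName,\nstrAddress,\nblnActive,\n"

# ^(str.+)(?=,) : "str" at line start, then the greatest range of chars
# that is still followed by a comma. Group 1 stands in for the text
# that \K would keep out of the match.
pattern = re.compile(r"^(str.+)(?=,)", re.MULTILINE)

result = pattern.sub(r"\1''", text)
print(result)

# Because .+ is greedy, only the LAST comma of a line is affected.
two_commas = pattern.sub(r"\1''", "strFirst, strSecond,")
print(two_commas)
```

The second call shows one difference from a lazy (.*?) version: with several commas on a line, the quotes land before the last comma rather than the first.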

    Best Regards,

    guy038

    P.S. :

    Do you know that you may even place a look-behind AFTER  the string to search for or place a look-ahead BEFORE  the string to search for !?

    For instance, from my complete name Guy THEVENOT, with a space char between forename and name :

    • The regex (?-i)(?=....THE)Guy would find the string Guy, if followed with the string ' THE', without quotes, with this exact case

    • The regex (?-i)THE(?<=Guy....) would find the string THE, if preceded with the string 'Guy ', without quotes, with this exact case

    Of course, better not use these academic examples, in normal production ;-))
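For the curious, these two academic regexes can be checked with a short sketch. It uses Python's built-in `re` module (an assumption, not the Boost engine), which is case-sensitive by default, so the (?-i) modifier is omitted.

```python
import re

name = "Guy THEVENOT"

# Look-ahead placed BEFORE the searched string: 4 chars, then THE, must follow
ahead_first = re.search(r"(?=....THE)Guy", name)

# Look-behind placed AFTER the searched string: Guy plus 4 chars must precede
behind_last = re.search(r"THE(?<=Guy....)", name)

print(ahead_first.group(0), ahead_first.span())
print(behind_last.group(0), behind_last.span())
```

In the second regex, the 4 chars after Guy in the look-behind are the space plus the string THE that has just been matched.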



  • Hello @guy038,

    Thank you for the detailed breakdown of the problem. There is a lot of useful knowledge here. I especially appreciate you pointing out the subtleties brought upon the look-arounds by their atomic structure.

    Best,
    Ray

