Regex query - match a block of several lines starting and ending with (but not including) the same string



  • Please forgive me if this has been asked before, but I couldn’t find exactly what I’m looking for.

    My original text is of the form

    === Line1 ===
    ab
    de
    ef
    === Line 2 ===
    gh
    ij
    kl
    mn
    op
    === Line 3 ===
    qr
    stu
    vw
    xyz
    == Line 4 ==
    zyx
    wv

    In this instance I want to match 3 cases but there can be a variable number of matches. This time the matches would be:

    === Line1 ===>>== Line 2 ===>>=== Line 3 ===
    ab>>>>>>>>>>>gh>>>>>>>>>>qr
    de>>>>>>>>>>>ij>>>>>>>>>>>stu
    ef>>>>>>>>>>>kl>>>>>>>>>>>vw
    ^ >>>>>>>>>>>mn>>>>>>>>>>xyz
    ^ >>>>>>>>>>>op

    (ignore the ^ and >; they are just there to break up the blocks in this post)
    My regex would start something like
    ^([=]{2,3}[ \s\S]+[=]{2,3}$)+
    but that always results in one match, including the == Line 4 ==. Clearly I need to look ahead to a (\n[=]{2,3}) but not include it but I can’t work out how to do that. I’m not being lazy or greedy by asking (puns intended) but I’ve spent hours looking around (another pun?) for a solution.

    Thanks in anticipation.



  • @John-Slee said in Regex query - match a block of several lines starting and ending with (but not including) the same string:

    === Line1 ===>>== Line 2 ===>>=== Line 3 ===
    ab>>>>>>>>>>>gh>>>>>>>>>>qr
    de>>>>>>>>>>>ij>>>>>>>>>>>stu
    ef>>>>>>>>>>>kl>>>>>>>>>>>vw
    ^ >>>>>>>>>>>mn>>>>>>>>>>xyz
    ^ >>>>>>>>>>>op

    I don’t understand how you get this from the original input, but then again I am utterly confused by most of it.

    I think you tried to emulate spaces and tabs by using other characters, but in reality that caused more problems. To post data so that the interpreter doesn’t alter it, consider reading our FAQ, specifically the entry titled “request for help without sufficient information…”. It suggests putting data like yours in:

    Bloc
    ~ ~ ~
    and then we can more easily see what needs to happen. Could you do that for both your input and the expected result for the example? It may then become more apparent what you need.
    
    Terry


  • OK, I’ll not try laying out in three columns for the three matches.

    The 3 matches I want are:

    === Line1 ===
    ab
    de
    ef

    • and

    === Line 2 ===
    gh
    ij
    kl
    mn
    op

    • and

    === Line 3 ===
    qr
    stu
    vw
    xyz



  • Hello, @John-slee, @terry-r and All,

    I think that a suitable regex could be :

    SEARCH (?-s)^={2,3}.+\R(?s:.+?)(?==|\Z)

    If correct, I’ll explain some details, next time !

    Best Regards,

    guy038



  • @guy038 said in Regex query - match a block of several lines starting and ending with (but not including) the same string:

    (?-s)^={2,3}.+\R(?s:.+?)(?==|\Z)

    Brilliant, thank you. Nearly right!
    I need to change my sample data to include lines which contain one or more single =

    e.g.
    === Line 3 ===
    qr
    stu = ignore this but include in capture
    vw
    xyz



  • @John-Slee
    I should probably say that I am using the Python Script plugin for Notepad++

    This is the processor that is ignoring text after the single = in a capture
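The truncation at a single = can be reproduced outside Notepad++. What follows is only a sketch using Python's standard `re` module, which is a different engine from the one behind Notepad++'s search: `re` has no `\R` and does not accept a global `(?-s)`, so `\n` and a scoped `(?s:...)` stand in for them here. The names `SAMPLE` and `FIRST_TRY` are made up for illustration.

```python
import re

# Hypothetical sample: only the third block, with a lone "=" in its body.
SAMPLE = (
    "=== Line 3 ===\n"
    "qr\n"
    "stu = ignore this but include in capture\n"
    "vw\n"
    "xyz\n"
)

# Adaptation of the first suggested regex. Its look-ahead (?==|\Z)
# accepts ANY single "=" as the end of the block.
FIRST_TRY = re.compile(r"^={2,3}.+\n(?s:.+?)(?==|\Z)", re.MULTILINE)

match = FIRST_TRY.search(SAMPLE)
print(repr(match.group(0)))  # the match stops just before the lone "="
```

The lazy `.+?` stops at the first position where the look-ahead succeeds, and here that is the lone = in the `stu` line, so the rest of the block is left out of the match.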



  • Hi, @John-slee, @terry-r and All,

    Ah, OK ! Assuming this new condition, I also tried to simplify that search regex a bit. So, a correct solution could be :

    SEARCH (?-s)^==.+\R(?s:.+?)(?===|\Z)

    Notes :

    • First, the (?-s) in-line modifier forces the regex engine to consider that the special . symbol matches a single standard character only ( not EOL chars )

    • Then the part ^== matches two = signs at the beginning of a line ( any further = signs are picked up by the .+ that follows )

    • Now, the part .+\R matches all the remaining characters of the current line ( .+ ), followed by their EOL char(s) ( \R )

    • The (?s:.+?) is a non-capturing group (?:.....) containing the in-line modifier s, which means that the part .+? will match the shortest non-null range of any chars, even EOL ones…

    • But only if it is followed by either two consecutive = signs, normally those of the next block, or the end of the file, allowing for trailing line-breaks only ( \Z ), due to the positive look-ahead structure (?=.......)


    This new regex should work against the example text, below :

    === Line1 ===
    ab
    de
    ef
    === Line 2 ===
    gh
    ij
    kl
    mn
    op
    === Line 3 ===
    qr
    stu = ignore this but include in capture
    vw
    xyz
    == Line 4 ==
    zyx
    wv
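
The notes above can be checked against this example text with a short script. This is a sketch under assumptions, not the Notepad++ search itself: it uses Python's standard `re` module, where `\R` is not available and a global `(?-s)` is not accepted, so the regex is adapted with `\n`, `re.MULTILINE` and a scoped `(?s:...)`. Also, Python's `\Z` matches only at the very end of the string. The names `TEXT` and `BLOCK` are illustrative.

```python
import re

TEXT = """\
=== Line1 ===
ab
de
ef
=== Line 2 ===
gh
ij
kl
mn
op
=== Line 3 ===
qr
stu = ignore this but include in capture
vw
xyz
== Line 4 ==
zyx
wv
"""

# Adapted regex: header line starting with "==", then the shortest run of
# any characters that is followed by "==" or by the end of the text.
BLOCK = re.compile(r"^==.+\n(?s:.+?)(?===|\Z)", re.MULTILINE)

blocks = [m.group(0) for m in BLOCK.finditer(TEXT)]
for number, block in enumerate(blocks, start=1):
    print(f"--- match {number} ---")
    print(block, end="")
```

Under these assumptions the script finds four blocks, because the `== Line 4 ==` header also begins with two = signs, and the third block keeps the whole `stu = ...` line. Note that the look-ahead is not anchored to a line start, so a body line containing two consecutive = signs would still end a block early.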
    

    Cheers,

    guy038



  • @guy038 Thank you so much. You have cracked it and stopped me cracking under the strain ;-)
    SOLVED!



  • Just a note to add that I have modified it very slightly so that it captures

    the Header line (without the EndofLine)

    and the remaining text as a second capture. The revised expression is thus:

    (?-s)(^==.+\R)((?s:.+?)(?===|\Z))



  • Hello, @John-slee,

    Ah, of course, if you want to capture part(s) of the match, for use in the replacement, you need to surround these parts with parentheses. In that case, we have to change the non-capturing group (?s:.+?) into a capturing group, with the in-line modifier inside, which gives the new syntax ((?s).+?)

    So the final regex would be :

    SEARCH (?-s)(^==.+\R)((?s).+?)(?===|\Z)

    with two groups :

    • The header line with its EOL ( group 1 )

    • The subsequent lines of each block, with their line-breaks ( group 2 )


    However, as you said :

    so that it captures the Header line (without the EndofLine)

    Then the final regex should rather be :

    SEARCH (?-s)(^==.+)\R((?s).+?)(?===|\Z)
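
For anyone processing the two groups in a script, here is a minimal sketch of the same idea with Python's standard `re` module. It is an adaptation, not the Notepad++ regex as such: `re` has no `\R`, and recent Python versions reject an in-line `(?s)` in the middle of a pattern, so `\n` and a scoped `((?s:.+?))` are used instead. The sample text and the names `TEXT` and `HEADER_AND_BODY` are hypothetical.

```python
import re

# Hypothetical two-block sample, shortened from the thread's example.
TEXT = "=== Line1 ===\nab\nde\n== Line 4 ==\nzyx\nwv\n"

# Group 1: the header line without its line-break.
# Group 2: the following lines of the block, with their line-breaks.
HEADER_AND_BODY = re.compile(r"^(==.+)\n((?s:.+?))(?===|\Z)", re.MULTILINE)

for match in HEADER_AND_BODY.finditer(TEXT):
    header, body = match.group(1), match.group(2)
    print(repr(header), repr(body))

pairs = HEADER_AND_BODY.findall(TEXT)  # list of (header, body) tuples
```

With this sample, `pairs` holds one `(header, body)` tuple per block, and the header no longer carries its end-of-line because the `\n` sits outside group 1.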

    Best Regards

    guy038

