Hello, @ian-oh, @peterjones, @alan-kilborn and All,
Sorry for not being very responsive, but I am currently on vacation and consult our forum rather rarely !
Before presenting the regexes which allow this kind of search, paste the following five lines in a new tab and let’s study some concepts. Note that line 5 contains a full stop :
Start blah
“blah blah”
“imbalanced open “blah”
“imbalanced close” blah”
This is a ““Te”st”“““abc”de”fgh. Ijkl”“mnop”” End
If we want to search for the longest area with well balanced delimiters “ and ”, we must, first, decide on the total range in which to search for these areas. Let me explain :
Case A : We may suppose that this range is the file contents ( Default case )
Case B : We may suppose that this range is limited to current line contents
Case C : Finally, we may suppose that this range is limited to a single sentence contents, within current line
If, by convention, any text, without double curly quotes, is considered as well balanced ( zero “ char and zero ” char ), we can say that :
Regarding case A :
The multi-line area, beginning from word Start, in line 1, till the string “mnop”, in line 5, forms an area with the same number of opening and closing double curly quotes ! ( The last ”, before End, is not included )
Then, the final End word, preceded with a space char, is a well-balanced area, by default !
Regarding case B :
Lines 1 and 2 are well balanced
Line 3 contains the well balanced area imbalanced open “blah”
Line 4 contains the well balanced area “imbalanced close” blah
Line 5 contains the well balanced area This is a ““Te”st”“““abc”de”fgh. Ijkl”“mnop”
Regarding case C :
Lines 1 to 4 are identical to case B
Line 5 contains :
The well balanced area This is a ““Te”st”, in the first sentence
The well balanced area ““abc”de”fgh, right before the period
The well balanced area Ijkl, preceded with a space char, in the second sentence
The well balanced area “mnop”
The well balanced area End, preceded with a space char
So, if we add +1 for any opening double curly quote and -1 for any closing double curly quote, we get this table, where any • char refers to an unmatched double curly quote !
•--------•---------------------------------------------------•
| Line 1 | Start blah                                        |
| Count  |                                                   |
•--------•---------------------------------------------------•
| Line 2 | “blah blah”                                       |
| Count  | 1         0                                       |
•--------•---------------------------------------------------•
| Line 3 | “imbalanced open “blah”                           |
| Count  | •                1    0                           |
•--------•---------------------------------------------------•
| Line 4 | “imbalanced close” blah”                          |
| Count  | 1                0     •                          |
•--------•---------------------------------------------------•
| Line 5 | This is a ““Te”st”“““abc”de”fgh. Ijkl”“mnop”” End |
| Count  |           12  1  0•12   1  0   .     •1    0•     |
•--------•---------------------------------------------------•
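The counting rule behind this table can be sketched in a few lines of Python. This is only an illustration of the +1 / -1 idea, not part of the regex solution: the helper name `quote_counts` is hypothetical, and it is assumed to receive one search range at a time (here, one sentence, as in case C).

```python
OPEN, CLOSE = "\u201c", "\u201d"   # “ and ”

def quote_counts(segment):
    """Label each curly quote of `segment` with the running count, or '•' if unmatched."""
    quotes = [(i, ch) for i, ch in enumerate(segment) if ch in (OPEN, CLOSE)]

    # First pass: pair every closing quote with the nearest pending opening quote
    stack, matched = [], set()
    for i, ch in quotes:
        if ch == OPEN:
            stack.append(i)
        elif stack:
            matched.update((stack.pop(), i))

    # Second pass: apply +1 / -1 to the matched quotes only
    labels, depth = [], 0
    for i, ch in quotes:
        if i in matched:
            depth += 1 if ch == OPEN else -1
            labels.append(str(depth))
        else:
            labels.append("•")
    return " ".join(labels)

# The two sentences of line 5, taken as separate ranges (case C)
print(quote_counts("This is a ““Te”st”“““abc”de”fgh"))   # 1 2 1 0 • 1 2 1 0
print(quote_counts(" Ijkl”“mnop”” End"))                  # • 1 0 •
```

An opening quote that never finds its closing partner inside the range, or a closing quote with no pending opening quote, gets the • mark and does not change the count.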
Thus, @ian-oh, according to the case A, B or C, you’ll execute one of the following recursive regexes, in free-spacing mode, each beginning at (?x) and ending at the (?1)+ syntax :
Case A : (?x) (?: ( [^“”] )* ( “ (?: (?1)++ | (?2) )* ” ) )+ (?1)* | (?1)+
Case B : (?x) (?: ( [^“”\r\n] )* ( “ (?: (?1)++ | (?2) )* ” ) )+ (?1)* | (?1)+
Case C : (?x) (?: ( [^“”.!?\r\n] )* ( “ (?: (?1)++ | (?2) )* ” ) )+ (?1)* | (?1)+
                  ^                 ^
                  |                 |
Groups ---------> 1                 2
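For readers who want to see what these regexes do outside of Notepad++, here is a rough, hand-written Python equivalent. It is a sketch under assumptions, not the way the regex engine works internally: the names `match_group`, `balanced_areas` and the `stop` parameter are hypothetical. `stop` plays the role of the extra excluded characters of group 1 (empty for case A, "\r\n" for case B, ".!?\r\n" for case C), and `match_group` mirrors the recursive group 2.

```python
OPEN, CLOSE = "\u201c", "\u201d"   # “ and ”

def match_group(text, pos, stop):
    """Mirror of group 2: return the index just past the balanced “...” group
    that opens at `pos`, or None if that opening quote cannot be closed."""
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch == CLOSE:
            return i + 1
        if ch == OPEN:                      # nested group, like (?2)
            i = match_group(text, i, stop)
            if i is None:
                return None
        elif ch in stop:                    # range boundary reached first
            return None
        else:                               # allowed character, like (?1)
            i += 1
    return None

def balanced_areas(text, stop=""):
    """Return the successive longest well balanced areas, scanning left to right."""
    areas, pos = [], 0
    while pos < len(text):
        i = pos
        while i < len(text):
            ch = text[i]
            if ch == OPEN:
                end = match_group(text, i, stop)
                if end is None:
                    break
                i = end
            elif ch == CLOSE or ch in stop:
                break
            else:
                i += 1
        if i > pos:
            areas.append(text[pos:i])
            pos = i
        else:
            pos += 1                        # skip the unmatched quote or boundary
    return areas

line5 = "This is a ““Te”st”“““abc”de”fgh. Ijkl”“mnop”” End"
print(balanced_areas(line5, "\r\n"))      # case B
print(balanced_areas(line5, ".!?\r\n"))   # case C
```

With line 5 of the first sample, the case C call should return the five areas listed above for that case, and the case B call should return the long area ending at “mnop”, followed by the final End preceded with a space char.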
Notes :
These regexes are derived from the end of this article, in the official N++ documentation, which explains how to search for well balanced regions with parentheses :
(?x) (?: [^()]* ( \( (?: [^()]++ | (?1) )* \) ) )+ [^()]* | [^()]+
For case C, I considered that a sentence ends at a full stop, a question mark or an exclamation mark. Add other characters to this list if necessary !
These regexes are mainly composed of non-capturing groups and contain only two groups :
A non-recursive group 1 which refers to the allowed characters, for each case ( the double curly quotes not included )
A recursive group 2, as one reference (?2) is located inside the group 2 itself, which adds some intelligence to the overall search by a recursive evaluation of the text
Note that, in case of incoherent results, it is advised to replace any (?1) syntax by its true value ( [^“”] ), or ( [^“”\r\n] ) or ( [^“”.!?\r\n] ). This may help !
Don’t try to perform backward searches : it won’t work !
Here is another text of 7 lines, with a lot of double curly quotes, for additional tests of these 3 recursive regexes :
““““ab“““cd““ef”””gh”.ij””klm””””
““ab““““cd“““ef”””gh””””ijkl””””
““““““ab“cd“ef“”””gh”ijkl””?”mn”””””
““--ab“cd“ef--gh“ij--kl”mn”o!p““qr--st”uv\wx”--”yz””------”abc
abcd--------““efghi----jk”----”lmnop------
““--ab”cdef--ghi.j--klmn---op““qr--stu”vwx--”yz”----
---abcde-----““qrs”tu--”vwxyz---
If you paste these 7 lines in a new tab, you’ll verify that, with the case A regex, the last match is the whole well-balanced area below :
abc
abcd--------““efghi----jk”----”lmnop------
““--ab”cdef--ghi.j--klmn---op““qr--stu”vwx--”yz”----
---abcde-----““qrs”tu--”vwxyz---
Which contains, exactly, eight “ opening characters and eight ” closing characters !
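This count is easy to double-check with a short Python snippet. It is only a sanity check on the text of the last match, copied from above; the helper name `is_balanced` is hypothetical.

```python
OPEN, CLOSE = "\u201c", "\u201d"   # “ and ”

# The last match reported for case A, copied line by line
last_match = (
    "abc\n"
    "abcd--------““efghi----jk”----”lmnop------\n"
    "““--ab”cdef--ghi.j--klmn---op““qr--stu”vwx--”yz”----\n"
    "---abcde-----““qrs”tu--”vwxyz---"
)

def is_balanced(text):
    """True if the running +1 / -1 count never drops below zero and ends at zero."""
    depth = 0
    for ch in text:
        if ch == OPEN:
            depth += 1
        elif ch == CLOSE:
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

print(last_match.count(OPEN), last_match.count(CLOSE))   # 8 8
print(is_balanced(last_match))                           # True
```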
Best Regards
guy038