Entering curly quote marks as UDL operators, or keywords

Ian Oh

Hi

When I enter “” or ‘’ … the curly counterparts of the quotations marks on our keyboards into any of the Operators of the UDL, to help me find unbalanced curly quotes, it does not register. Entering the straight quotes work. I can see clearly where the text has missed a closed quote. Is there a way for Notepad++ UDL to incorporate the curly quotes?

Thanks.

PeterJones

@Ian-Oh ,

The UDL implementation isn’t very strong when it comes to non-ASCII Unicode characters. There have been many bug-reports/feature-requests to improve Unicode handling in the UDL, but none of them have been addressed. Sorry.

However, you might be able to get something good enough for your purposes by using the Search > Mark dialog:

"normal"
“blah blah”
“imbalanced open “blah”
“imbalanced close” blah”

FIND = “[^”]*“[^”]*”|“[^”]*”[^“]*”, purge each search, search-mode = regular expression

That find looks for (1) an open double-quote followed by 0 or more characters that aren’t a close, followed by another open, followed by 0 or more characters that aren’t a close, followed by a close (thus finding an extra imbalanced open quote); OR (2) open followed by 0-or-more non-close, followed by close, followed by 0-or-more non-open, followed by another close (thus finding an extra imbalanced close quote).

Ian Oh

@PeterJones

Wow! That’s a thing of beauty… or as we’d say down under … Ubewdy! Thank you very much.

And thanks for explaining the limitations of UDL to ASCII.

guy038

Hello, @ian-oh, @peterjones, @alan-kilborn and All,

Sorry for not being very responsive, but I am currently on vacation and consult our forum rather rarely !

Before presenting the regexes which allow this kind of search, paste the following five lines, in a new tab and let’s study some concepts. Note that line 5 contains a full period

Start blah
“blah blah”
“imbalanced open “blah”
“imbalanced close” blah”
This is a ““Te”st”“““abc”de”fgh. Ijkl”“mnop”” End

If we’re going to search the longest area, with well balanced delimiters “ and ”, we must, first, consider the total range where to search these areas. Let me explain :

Case A : We may suppose that this range is the file contents ( Default case )
Case B : We may suppose that this range is limited to current line contents
Case C : Finally, we may suppose that this range is limited to a single sentence contents, within current line

If, by convention, any text, without double curly quotes, is considered as well balanced ( zero “ char and zero ” char ), we can say that :

Regarding case A :
- The multi-line area, beginning from word Start, in line 1, till the string “mnop”, in line 5, forms an area with the same number of opening and closing double curly quotes ! ( The last ”, before End, is not included )
- Then, the final End word, preceded with a space char, is a well-balanced area, by default !
Regarding case B :
- Line 1 and 2 are well balanced
- Line 3 contains the well balanced area imbalanced open “blah”
- Line 4 contains the well balanced area “imbalanced close” blah
- Line 5 contains the well balanced area This is a ““Te”st”“““abc”de”fgh. Ijkl”“mnop”
Regarding case C :
- Line 1 to 4 are identical to case B
- Line 5 contains :
  - The well balanced area This is a ““Te”st”, in the first sentence
  - The well balanced area ““abc”de”fgh, right before the period
  - The well balanced area ijkl, preceded with a space char, in the second sentence
  - The well balanced area “mnop”
  - The well balanced area End, preceded with a space char

So, if we add +1 for any opening double curly quote and -1 for any closing double curly quote, we get this table, where any • char refers to an unmatched double curly quote !

•--------•---------------------------------------------------•
| Line 1 | Start blah                                        |
| Count  |                                                   |
•--------•---------------------------------------------------•
| Line 2 | “blah blah”                                       |
| Count  | 1         0                                       |
•--------•---------------------------------------------------•
| Line 3 | “imbalanced open “blah”                           |
| Count  | •                1    0                           |
•--------•---------------------------------------------------•
| Line 4 | “imbalanced close” blah”                          |
| Count  | 1                0     •                          |
•--------•---------------------------------------------------•
| Line 5 | This is a ““Te”st”“““abc”de”fgh. Ijkl”“mnop”” End |
| Count  |           12  1  0•12   1  0   .     •1    0•     |
•--------•---------------------------------------------------•

Thus, @ian-oh, according to the case A, B or C, you"ll execute the following recursive regexes, in free-spacing mode, beginning at (?x), till the (?1)+ syntax :


 Case A :  (?x) (?: ( [^“”] )*        ( “ (?:  (?1)++ | (?2) )* ” ) )+ (?1)*  | (?1)+

 Case B :  (?x) (?: ( [^“”\r\n] )*    ( “ (?:  (?1)++ | (?2) )* ” ) )+ (?1)*  | (?1)+

 Case C :  (?x) (?: ( [^“”.!?\r\n] )* ( “ (?:  (?1)++ | (?2) )* ” ) )+ (?1)*  | (?1)+
                    ^                 ^
                    |                 |
Groups  --------->  1                 2

Notes :

These regexes are derived from the end of this article, in the official N++ documentation, which explains how to search for well balanced regions with parentheses :

(?x) (?: [^()]* ( \( (?: [^()]++ | (?1) )* \) ) )+ [^()]* | [^()]+

For case C, I considered that a sentence ends at a full period and at an interrogation or exclamation mark. Add other characters to this list if necessary !
These regexes are mainly composed of non-capturing groups and contain only two groups :
- A non-recursive group 1 which refers to the allowed characters, for each case ( not included the double curly quotes )
- A recursive group 2, as one reference (?2) is located inside the group 2 itself, which adds some intelligence to the overall search by a recursive evaluation of the text
Note that, in case of incoherent results, it is advised to replace any (?1) syntax by its true value ( [^“”] ), or ( [^“”\r\n] ) or ( [^“”.!?\r\n] ). This may helps !
Don’t try to perform backward searches : it won’t work !

Here is an other text of 7 lines, whith a lot of double curly quotes, for additional tests of these 3 recursive regexes :

““““ab“““cd““ef”””gh”.ij””klm””””
““ab““““cd“““ef”””gh””””ijkl””””
““““““ab“cd“ef“”””gh”ijkl””?”mn”””””
““--ab“cd“ef--gh“ij--kl”mn”o!p““qr--st”uv\wx”--”yz””------”abc
abcd--------““efghi----jk”----”lmnop------
““--ab”cdef--ghi.j--klmn---op““qr--stu”vwx--”yz”----
---abcde-----““qrs”tu--”vwxyz---

If you paste these 7 lines in a new tab, you’ll verify that, with the regexA syntax, the last match is all the well-balanced area, below :

abc
abcd--------““efghi----jk”----”lmnop------
““--ab”cdef--ghi.j--klmn---op““qr--stu”vwx--”yz”----
---abcde-----““qrs”tu--”vwxyz---

Which contains, exactly, eight “ opening characters and eight ” closing characters !

Best Regards

guy038