determining how many capture groups there are programmatically

Mark Olson

Hi all, I’m currently working on improving the NotepadPlusPlusPluginPack.Net to have something resembling editor.re{search/replace} in PythonScript, but I’m running into an issue with counting the number of distinct capture groups.

I understand that you can use SCI_GETTAG to get the value of a capture group, but I don’t see any normal Scintilla message that you can use to get the number of capture groups. Without such a message, your only choices seem to be:

You know how many capture groups there are because you’re using some regex that your program generated (e.g., a CSV-parsing regex that’s generated from a user-specified number of columns).
You only care about the first N capture groups, and declare that all regexes your program works with must have at least that many.
You call SCI_GETTAG(i) for i = 0, 1, 2, and so on until the returned tag has length 0, since SCI_GETTAG returns an empty string for an invalid capture group.
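
The third option can be sketched roughly as follows. This is only an illustration: `TagGetter` and `countGroupsByProbing` are hypothetical names, and the callback stands in for whatever code actually sends SCI_GETTAG and returns the tag text. It assumes tag 0 is the whole match, so probing starts at 1, and it caps the loop at 9 because SCI_GETTAG reportedly exposes only the first nine tags.

```cpp
#include <functional>
#include <string>

// Hypothetical stand-in for sending SCI_GETTAG after a successful search:
// returns the text of tag n, or an empty string if there is no such tag.
using TagGetter = std::function<std::string(int)>;

// Count capture groups by probing tags 1, 2, ... until one comes back empty.
// Assumption: tag 0 is the whole match, so it is skipped.
// maxTag defaults to 9 because higher tags are reportedly not reachable.
int countGroupsByProbing(const TagGetter& getTag, int maxTag = 9) {
    int count = 0;
    for (int i = 1; i <= maxTag; ++i) {
        if (getTag(i).empty()) {
            break;  // empty text: treated as "no such group"
        }
        count = i;
    }
    return count;
}
```

One weakness of this approach is that a group which matched empty text, or did not take part in the match at all, also yields an empty string, so the loop can stop early and undercount.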

I’ve poked around in the PythonScript source code, since PythonScript always knows how many capture groups there are, but I’m having trouble figuring out how PythonScript does it. Anyone care to enlighten me?

Coises

@Mark-Olson

While working on my Columns++ plugin, I stumbled across the fact that SCI_GETTAG will only get the first nine tags. (That’s not from experiment, but from reading the Scintilla code.)

I decided the only sensible thing to do was to include Boost regex in my plugin and do the searches directly, rather than through Scintilla’s interface. I’m still working on it, but at present it seems to be operating correctly. This way I get access to the number of capture groups and their content. In counting tests on large UTF-8 files, it appears to be about twice as fast as the native search in Notepad++.
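
As a rough illustration of why owning the regex object helps, here is a minimal sketch that uses std::regex as a stand-in for Boost regex (an assumption made so the example needs no external library; Boost's basic_regex offers a similarly named mark_count() member). The function names `groupCount` and `firstMatchGroups` are hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Number of capture groups defined by the pattern itself; no search needed.
// std::regex stands in for Boost regex here (assumption of this sketch).
std::size_t groupCount(const std::string& pattern) {
    const std::regex re(pattern);
    return re.mark_count();
}

// Text of every capture group for the first match in `text`.
// Index 0 of match_results is the whole match, so groups start at 1.
std::vector<std::string> firstMatchGroups(const std::string& pattern,
                                          const std::string& text) {
    std::vector<std::string> groups;
    const std::regex re(pattern);
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        for (std::size_t i = 1; i < m.size(); ++i) {
            groups.push_back(m[i].str());
        }
    }
    return groups;
}
```

Because the count comes from the compiled pattern rather than from probing tags, it is not limited to nine groups and is not confused by groups that matched empty text.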

I suspect PythonScript is using Python’s own regular expressions, not going through the Scintilla interface.

I’m still working on the release of Columns++ that will incorporate this, so it’s not yet on GitHub. (There’s a branch that’s in progress, but it’s rarely up to date.) My approach is to use the Scintilla SCI_GETGAPPOSITION and two calls to SCI_GETRANGEPOINTER (one for the part before the gap and one for the part after) and then instantiate boost::match_results with custom iterators: one for ANSI, dereferencing to char, and one for UTF-8, dereferencing to wchar_t. A better choice would be to dereference UTF-8 to char32_t, but boost::regex only handles char32_t when coupled with the ICU4C library, and I could not figure out a simple, straightforward way to incorporate that in a GitHub project that just… builds… without a lot of confusing rigamarole.
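
The general idea of a custom iterator over the two halves of the buffer can be sketched as below. This is not the Columns++ code: `TwoSpan`, `GapIterator` and `searchAcrossGap` are hypothetical names, std::regex stands in for Boost regex, only the single-byte (char) case is shown, and the two segments are plain pointers where a real plugin would presumably use the pointers obtained from SCI_GETRANGEPOINTER.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical view of a gap buffer: the text before and after the gap.
struct TwoSpan {
    const char* first;  std::size_t firstLen;
    const char* second; std::size_t secondLen;
};

// Bidirectional iterator that presents both segments as one sequence.
class GapIterator {
public:
    using iterator_category = std::bidirectional_iterator_tag;
    using value_type = char;
    using difference_type = std::ptrdiff_t;
    using pointer = const char*;
    using reference = const char&;

    GapIterator() = default;
    GapIterator(const TwoSpan* buf, std::size_t pos) : buf_(buf), pos_(pos) {}

    reference operator*() const {
        assert(buf_ != nullptr);
        // Logical positions past the first segment map into the second one.
        return pos_ < buf_->firstLen ? buf_->first[pos_]
                                     : buf_->second[pos_ - buf_->firstLen];
    }
    GapIterator& operator++() { ++pos_; return *this; }
    GapIterator operator++(int) { GapIterator t = *this; ++pos_; return t; }
    GapIterator& operator--() { --pos_; return *this; }
    GapIterator operator--(int) { GapIterator t = *this; --pos_; return t; }
    bool operator==(const GapIterator& o) const { return pos_ == o.pos_; }
    bool operator!=(const GapIterator& o) const { return pos_ != o.pos_; }

private:
    const TwoSpan* buf_ = nullptr;
    std::size_t pos_ = 0;  // logical offset, ignoring the gap
};

// Search the logical text; element 0 is the whole match, then each group.
std::vector<std::string> searchAcrossGap(const TwoSpan& buf,
                                         const std::regex& re) {
    GapIterator begin(&buf, 0), end(&buf, buf.firstLen + buf.secondLen);
    std::match_results<GapIterator> m;
    std::vector<std::string> out;
    if (std::regex_search(begin, end, m, re)) {
        for (std::size_t i = 0; i < m.size(); ++i) {
            out.push_back(m[i].str());
        }
    }
    return out;
}
```

With segments "hello wo" and "rld 42", a pattern such as `(w\w+) (\d+)` can match across the boundary without the text ever being copied into one contiguous buffer.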

Mark Olson

@Coises said in determining how many capture groups there are programmatically:

I suspect PythonScript is using Python’s own regular expressions, not going through the Scintilla interface.

No, PythonScript uses Boost for editor.{research/rereplace}.

In any case, thanks for the thoughts, and sorry it’s been a bumpy road. What I’m doing is in C#, and it’s unclear to me whether that will make this task easier or harder than it’s been for you.

Coises

@Mark-Olson said in determining how many capture groups there are programmatically:

No, PythonScript uses Boost for editor.{research/rereplace}.

Ah, I see. I can’t follow everything, but when I see:
https://github.com/bruderstein/PythonScript/blob/master/PythonScript/src/UTF8Iterator.h
it’s clear they’re taking a similar approach to what I’m doing… except it looks like they figured out how to define regex_traits for 32-bit Unicode characters without ICU. If I’m seeing what I think I’m seeing, then the PythonScript regular expression code interprets any Unicode code point as a single character, even if it is outside the Basic Multilingual Plane. (In Notepad++ search, non-BMP code points appear as two “characters” to regex, the high surrogate and the low surrogate.) Now I have another job… to see if I can grasp how they did it, and if I can, rewrite my utf8 iterator for about the sixth time…
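
To make the code point versus surrogate pair distinction concrete, here is a small sketch, not taken from PythonScript. The helper names are hypothetical and the decoder assumes well-formed UTF-8 with no validation. It decodes UTF-8 into 32-bit code points and compares the count with the number of 16-bit units the same text would need.

```cpp
#include <cstddef>
#include <string>

// Decode the code point starting at s[i] and advance i past it.
// Assumption: the input is well-formed UTF-8; nothing is validated.
char32_t decodeNext(const std::string& s, std::size_t& i) {
    const unsigned char b0 = static_cast<unsigned char>(s[i++]);
    if (b0 < 0x80) {
        return b0;  // single-byte (ASCII) sequence
    }
    // The lead byte tells how many continuation bytes follow.
    int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;
    char32_t cp = b0 & (0x3F >> extra);  // payload bits of the lead byte
    while (extra-- > 0) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    }
    return cp;
}

// What a regex working on 32-bit code points would see as "characters".
std::size_t countCodePoints(const std::string& utf8) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < utf8.size(); ++n) {
        decodeNext(utf8, i);
    }
    return n;
}

// What a regex working on 16-bit units would see: code points outside
// the Basic Multilingual Plane take two units (a surrogate pair).
std::size_t countUtf16Units(const std::string& utf8) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < utf8.size();) {
        n += decodeNext(utf8, i) > 0xFFFF ? 2 : 1;
    }
    return n;
}
```

For the text "a" followed by U+1F600, the first count is 2 and the second is 3, which is the difference a pattern like `.` would observe.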

In any case, thanks for the thoughts, and sorry it’s been a bumpy road. What I’m doing is in C#, and it’s unclear to me whether that will make this task easier or harder than it’s been for you.

If folks using your framework don’t care about using precisely the same regular expression language Notepad++ implements, they could probably use C# regular expressions; they’d have to copy the data to be searched to a buffer (translating to wide characters if it’s UTF-8). I don’t know C# at all, but I get the impression you can’t template the regex classes like we do in C++, and hence wouldn’t be able to define a custom iterator. If so, using the Scintilla data “in place” wouldn’t be an option (at least for UTF-8, unless C# has a built-in regular expression class for UTF-8).

To use the Boost regex language, I guess you’d have to do whatever is done in C# to call a set of C++ routines that could incorporate Boost regex… or just accept the limitations of using Scintilla commands (including no access to capture groups past 9, and no way to get the replacement string without actually doing a replacement).
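
One possible shape for the "set of C++ routines" is a small function exported with C linkage, which a C# caller could presumably bind to through P/Invoke. The sketch below is an assumption-laden illustration: `regex_group_count` and `SKETCH_EXPORT` are hypothetical names, and std::regex is used in place of Boost regex so the example has no external dependency.

```cpp
#include <regex>

// C linkage keeps the symbol name unmangled; on Windows the function also
// has to be exported from the DLL. SKETCH_EXPORT is a hypothetical macro.
#if defined(_WIN32)
#define SKETCH_EXPORT extern "C" __declspec(dllexport)
#else
#define SKETCH_EXPORT extern "C"
#endif

// Returns the number of capture groups in `pattern`, or -1 if the pattern
// is null or does not compile. Exceptions must not cross the C boundary,
// so regex errors are caught here and turned into an error code.
SKETCH_EXPORT int regex_group_count(const char* pattern) {
    if (pattern == nullptr) {
        return -1;
    }
    try {
        const std::regex re(pattern);
        return static_cast<int>(re.mark_count());
    } catch (const std::regex_error&) {
        return -1;
    }
}
```

Returning a plain int and taking a plain C string keeps the interface free of C++ types, which is what makes it callable from another language.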