@guy038 said in Columns++ version 1.3: All Unicode, all the time:
Note that the \p{Hex_Digit} regex is erroneous ! The right one is \p{xdigit}, at least, within Columns++
What’s going on there is that I followed the structure of Boost::regex character classes:
Character Classes that are Always Supported
Character classes that are supported by Unicode Regular Expressions
which are mainly the POSIX character classes plus Unicode General Categories interpreted as character classes. Also, note that in Boost::regex, character classes and character properties are the same thing. I didn’t make any attempt to change that. I believe this is different both from Unicode regular expressions and from PCRE.
(I did add a couple new character classes unique to Columns++: [:defined:] and [:invalid:], and aliases \i, \o and \y for [:invalid:], [:ASCII:] and [:defined:]. Also, Columns++ does not support [:Cs:]/[:Surrogate:] since Unicode in Scintilla can only be UTF-8, which cannot contain surrogates — though it can contain invalid byte sequences which appear to encode surrogates, as in WTF-8; Scintilla treats these as invalid UTF-8 bytes, and so does Columns++.)
Hex_Digit isn’t one of the Boost::regex character classes, and I never defined it. Defining it to be equivalent to xdigit would be trivial; re-defining xdigit to include non-ASCII characters is a bit more complicated:
I’ve found out a small anomaly concerning hexadecimal characters :
If I use the native Notepad++ search to match any hexadecimal character, with the regex [[:xdigit:]], against my Total_Chars.txt file, it returns 44 matches
If I use the Columns++ search to match any hexadecimal character, with the regex [[:xdigit:]], against my Total_Chars.txt file, it returns 22 matches
I suppose that the N++ answer is the right one. Indeed, in the https://www.unicode.org/reports/tr18/#Compatibility_Properties article , ( Annexe C about UNICODE REGULAR EXPRESSIONS ), it is said :
Hex_Digit contains 0-9 A-F fullwidth and halfwidth, upper and lowercase
Yes, it would seem the standard is to include those non-ASCII characters as hex digits. Further, the comments at your link under lower and upper are troublesome, as Columns++ treats them as aliases for Ll and Lu. Word and word boundaries are probably faulty as well.
I followed the Boost::regex principle that to extend the traditional POSIX mappings, the only Unicode property that is used to determine membership in a character class is the General Category.
I hard-coded (that is, they are written explicitly rather than being derived from Unicode tables) the POSIX mappings for ASCII characters, since that’s the only place they are really well-defined; plus there is a hard-coded exception for the non-ASCII character U+0085, the Next Line control character, because it should be part of \v, which is implemented in Boost::regex as [[:v:]]. I don’t see any reason [[:xdigit:]] can’t be extended with similar hard-coded logic; I just didn’t know until now that I should do it.
The other parts, though: whatever they are saying is supposed to be included in [:lower:] and [:upper:] besides letters, and whatever they are talking about in regard to word characters and boundaries… that might be problematic. I have a condensed set of tables built from a few Unicode files, instead of trying to import the ghastly large and complex ICU. Those tables include the General Category, but if that is not enough to determine membership in a character class… reorganizing them to include whatever additional information I need (it’s not yet clear to me what that will be) is not likely to be simple.
Thank you for your observation. Indeed, there are flaws. It is not yet clear to me if and how it will be practical to address them, though I can probably fix the [:xdigit:] behavior without much difficulty.