@guy038 said in Columns++ version 1.3: All Unicode, all the time:
So, I don’t see exactly which rule should be applied, regarding the word definition !?
and in Columns++ version 1.3: All Unicode, all the time:
Again, I don’t understand clearly these differences between the two last columns !
This is not a complete response yet, just some further explanation.
Even when using ICU, Boost::regex does not implement the same regex language as described in Unicode Technical Standard #18: Unicode Regular Expressions. Some of the differences are more-or-less dictated by the architecture of Boost::regex; others appear to be choices.
This is a list of category definitions used by Boost::regex when using ICU; the table comes from matching up char_pointer_range in get_default_class_id and char_class_type in lookup_classname:
alnum U_GC_L_MASK | U_GC_ND_MASK
alpha U_GC_L_MASK
blank mask_blank
cntrl U_GC_CC_MASK | U_GC_CF_MASK | U_GC_ZL_MASK | U_GC_ZP_MASK
d U_GC_ND_MASK
digit U_GC_ND_MASK
graph (0x3FFFFFFFu) & ~(U_GC_CC_MASK | U_GC_CF_MASK | U_GC_CS_MASK | U_GC_CN_MASK | U_GC_Z_MASK)
h mask_horizontal
l U_GC_LL_MASK
lower U_GC_LL_MASK
print ~(U_GC_C_MASK)
punct U_GC_P_MASK
s U_GC_Z_MASK | mask_space
space U_GC_Z_MASK | mask_space
u U_GC_LU_MASK
unicode mask_unicode
upper U_GC_LU_MASK
v mask_vertical
w U_GC_L_MASK | U_GC_ND_MASK | U_GC_MN_MASK | mask_underscore
word U_GC_L_MASK | U_GC_ND_MASK | U_GC_MN_MASK | mask_underscore
xdigit U_GC_ND_MASK | mask_xdigit
Comparison with the table you referenced shows that Boost::regex does not use the same definitions. In particular, lower and upper are defined to be identical to General Categories Ll and Lu, alpha is defined to be identical to General Category L, and word does not contain all the characters mentioned in the Unicode specification.
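To make that comparison concrete, here is a rough Python sketch (not Boost::regex or Columns++ code) that models a few rows of the table with the standard unicodedata module. The names BOOST_ICU_CLASSES and in_class are made up for illustration, and I have left out the classes that depend on Boost's internal mask_* flags, since the table does not say what those expand to.

```python
import unicodedata

# General-category sets transcribed from the table above.
# Illustrative model only: classes built on mask_* flags are omitted.
LETTERS = {"Lu", "Ll", "Lt", "Lm", "Lo"}  # U_GC_L_MASK

BOOST_ICU_CLASSES = {
    "alpha": LETTERS,
    "alnum": LETTERS | {"Nd"},
    "digit": {"Nd"},
    "lower": {"Ll"},
    "upper": {"Lu"},
    "punct": {"Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po"},  # U_GC_P_MASK
    "cntrl": {"Cc", "Cf", "Zl", "Zp"},
}


def in_class(ch, name):
    """Return True if ch falls in the named class under the table's definition."""
    return unicodedata.category(ch) in BOOST_ICU_CLASSES[name]


# U+00AA FEMININE ORDINAL INDICATOR is general category Lo, although Unicode
# gives it the Lowercase property (through Other_Lowercase).
print(in_class("\u00aa", "lower"), in_class("\u00aa", "alpha"))

# U+2160 ROMAN NUMERAL ONE is general category Nl, although Unicode treats it
# as both Uppercase and Alphabetic.
print(in_class("\u2160", "upper"), in_class("\u2160", "alpha"))
```

Characters like these two are where a definition based purely on general category parts ways with the property-based definitions in the Unicode table.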
For the most part, Columns++ follows the Boost::regex definitions, though I did not include Mn in word. Also, the Boost::regex code for isctype implements some of the classifications directly; for those, I think my versions are close, but not necessarily identical. It looks as if Boost::regex does define xdigit according to the Unicode spec.
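As a small illustration of the Mn difference, this Python sketch puts the two word definitions side by side. It is only a model: is_word_boost and is_word_columns are hypothetical names, and I am assuming that mask_underscore stands for the single character "_".

```python
import unicodedata

LETTERS = {"Lu", "Ll", "Lt", "Lm", "Lo"}  # U_GC_L_MASK


def is_word_boost(ch):
    # w / word row of the table: L | Nd | Mn | underscore
    # (assumption: mask_underscore means just "_")
    return ch == "_" or unicodedata.category(ch) in LETTERS | {"Nd", "Mn"}


def is_word_columns(ch):
    # Same definition without Mn, as described for Columns++
    return ch == "_" or unicodedata.category(ch) in LETTERS | {"Nd"}


COMBINING_ACUTE = "\u0301"  # general category Mn

for ch in ("e", COMBINING_ACUTE, "_", "-"):
    print(f"U+{ord(ch):04X}", is_word_boost(ch), is_word_columns(ch))
```

The only row where the two functions disagree is the combining mark, which is a word character under the table's definition but not under the Columns++ variation.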
I think that Boost::regex defines word boundaries in terms of word characters (i.e. \b is equivalent to (?<!\w)(?=\w)|(?<=\w)(?!\w)) and that I wouldn’t be able to change that without forking and modifying Boost::regex code.
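The lookaround form of \b can be checked in any engine that defines word boundaries in terms of \w. The sketch below uses Python's re module, which documents \b that way; it says nothing about Boost::regex itself, and Python's \w is not the same set as the word class in the table. The helper name positions is just for illustration.

```python
import re

BOUNDARY = re.compile(r"\b")
# The expansion quoted above: start of a word, or end of a word
EXPANDED = re.compile(r"(?<!\w)(?=\w)|(?<=\w)(?!\w)")


def positions(pattern, text):
    """Return the offsets of every zero-width match of pattern in text."""
    return [m.start() for m in pattern.finditer(text)]


sample = "héllo, wörld_2 -- ok"
same = positions(BOUNDARY, sample) == positions(EXPANDED, sample)
print(same)
```

Because both patterns depend only on \w, changing what counts as a word character moves the boundaries of both in the same way, which is why changing the boundary rule independently would require changing the engine.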
I think the questions are whether Boost::regex is more accurately considered wrong, or just different in its implementation of character classes; and if the latter, which is preferable.
At present, my estimate is that it would be time-consuming, but neither impossible nor fragile, to implement in Columns++ the Unicode definitions (aside from word boundaries) listed in Annex C: Compatibility Properties.
Whether that’s what should be done might still be an open question.