Columns++ version 1.3: All Unicode, all the time
-
Hello, @coises, @thomas-knoefel, @peterjones and All,
@coises, many thanks for your additional info. But, please, don’t be too upset by these regex oddities ! Of course, some class definitions seems different but, in all cases,
Columns++gives more accurate results than native N++ search, anyway !In fact, I did all these researches on the Unicode world as I wanted to clarify the status about identifiers, particularly with Perl, in order to find out a simplified formulation for the
Function ListPerl parser created by @peterjones and improved with your help, by using atomic structures !My first attempt was clearly insufficient because I only took
ASCIIcharacters into account. Peter adviced me to refer to the article, below :https://perldoc.perl.org/perldata#Identifier-parsing
which explains that, when using
UTF-8, thePerlidentifier syntax should be :/ (?[ ( \p{Word} & \p{XID_Start} ) + [_] ]) (?[ ( \p{Word} & \p{XID_Continue} ) ]) * /x or in a SINGLE line (?[ ( \p{Word} & \p{XID_Start} ) + [_] ])(?[ ( \p{Word} & \p{XID_Continue} ) ]) *Although the properties
\p{XID_Start}and\p{XID_Continue}are NOT part of theGeneral Categorylist and are not functional with theBoostregex engine, thisPerlsyntax could be expressed, in theory, with ourBoostregex engine as :(?:(?=\p{XID_Start})\w|_)(?=\p{XID_Continue})\w*
Now, with the
v17.0release of BabelMap software, I was able to get the complete and exact list of these properties :\p{WORD},\p{ID_Start},\p{ID_Continue},\p{XID_Start},\p{XID_Continue},Then, from these lists, I could deduce the Unicode characters count of the regexes
(?:(?=\p{XID_Start})\w|_)and(?=\p{XID_Continue})\w. Refer below :# ================================================================================================== # # Unicode 17.0.0 # # From article https://unicode.org/reports/tr18/tr18-23.html#word # # # Derived Property WORD : # # # Lu + Ll + Lt + Lm + Lo = # L* 145,672 = \p{lettter} or [[:alpha:]] # # + Decimal_Number # Nd 770 = \p{Decimal Digit Number} # ----------- # Total : 146,442 = Columns++ WORD chars - \x{005F} # # + Mc + Me + Mn # M* 2,543 = \p{Mark} # # + Connector_Punctuation # Pc 10 ( including the LOW LINE character \x{005F} ) # # + 200C ; Other_ID_Continue # Cf 1 ZERO WIDTH NON-JOINER ( JOIN-CONTROL character ) # # + 200D ; Other_ID_Continue # Cf 1 ZERO WIDTH JOINER ( JOIN-CONTROL character ) # # => Total = 148,997 characters # # ================================================================================================== # # From file 'DerivedCoreProperties.txt' : # # https://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt # # # Derived Property ID_Start : # # # Lu + Ll + Lt + Lm + Lo = # L* 145,672 ( = [[:alpha:]] ) # # + Letter_Number # Nl 239 # # + 1885 ; Other_ID_Start # Mn 1 MONGOLIAN LETTER ALI GALI BALUDA # # + 1886 ; Other_ID_Start # Mn 1 MONGOLIAN LETTER ALI GALI THREE BALUDA # # + 2118 ; Other_ID_Start # Sm 1 SCRIPT CAPITAL P # # + 212E ; Other_ID_Start # So 1 ESTIMATED SYMBOL # # + 309B ; Other_ID_Start # Sk 1 KATAKANA-HIRAGANA VOICED SOUND MARK # # + 309C ; Other_ID_Start # Sk 1 KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK # # - 2E2F ; # Lm 1 VERTICAL TILDE ( as INCLUDED in L* ) # # => Total = 145,916 characters # # ================================================================================================== # # Derived Property XID_Start ( ID_Start MODIFIED for closure under NFKx ) : # # # ID_Start 145,916 # # - 037A ; ID_Start # Lm 1 GREEK YPOGEGRAMMENI # # - 0E33 ; ID_Start # Lo 1 THAI CHARACTER SARA AM # # - 0EB3 ; ID_Start # Lo 1 LAO VOWEL SIGN AM # # - 309B ; Other_ID_Start # Sk 1 KATAKANA-HIRAGANA VOICED SOUND MARK # # - 309C ; Other_ID_Start # Sk 1 KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK # # - FC5E ; ID_Start # Lo 1 ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM # - FC5F ; ID_Start # Lo 1 ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM # - FC60 ; ID_Start # Lo 1 ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM # - FC61 ; ID_Start # Lo 1 ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM # - FC62 ; ID_Start # Lo 1 ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM # - FC63 ; ID_Start # Lo 1 ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM # # # - FDFA ; ID_Start # Lo 1 ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM # - FDFB ; ID_Start # Lo 1 ARABIC LIGATURE JALLAJALALOUHOU # # - FE70 ; ID_Start # Lm 1 ARABIC FATHATAN ISOLATED FORM # - FE72 ; ID_Start # Lo 1 ARABIC DAMMATAN ISOLATED FORM # - FE74 ; ID_Start # Lo 1 ARABIC KASRATAN ISOLATED FORM # - FE76 ; ID_Start # Lo 1 ARABIC FATHA ISOLATED FORM # - FE78 ; ID_Start # Lo 1 ARABIC DAMMA ISOLATED FORM # - FE7A ; ID_Start # Lo 1 ARABIC KASRA ISOLATED FORM # - FE7C ; ID_Start # Lo 1 ARABIC SHADDA ISOLATED FORM # - FE7E ; ID_Start # Lo 1 ARABIC SUKUN ISOLATED FORM # # - FF9E ; ID_Start # Lm 1 HALFWIDTH KATAKANA VOICED SOUND MARK # - FF9F ; ID_Start # Lm 1 HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK # # => Total = 145,893 characters # # ================================================================================================== # # Derived Property ID_Continue : # # # ID_Start = 145,916 # # - 1885 ; Other_ID_Start # Mn 1 MONGOLIAN LETTER ALI GALI BALUDA # # - 1886 ; Other_ID_Start # Mn 1 MONGOLIAN LETTER ALI GALI THREE BALUDA # # The TWO characters above must be SUBTRACTED because they are, both, INCLUDED in 'Other_ID_Start' and in 'Nonspacing Mark' # # + Nonspacing_Mark # Mn 2,059 # # + Spacing_Mark # Mc 471 # # + Decimal_Number # Nd 770 # # + Connector_Punctuation # Pc 10 ( including the LOW LINE char : 005F _ ) # # + 00B7 ; Other_ID_Continue # Po 1 MIDDLE DOT # + 0387 ; Other_ID_Continue # Po 1 GREEK ANO TELEIA # + 1369 ; Other_ID_Continue # No 1 ETHIOPIC DIGIT ONE # + 136A ; Other_ID_Continue # No 1 ETHIOPIC DIGIT TWO # + 136B ; Other_ID_Continue # No 1 ETHIOPIC DIGIT THREE # + 136C ; Other_ID_Continue # No 1 ETHIOPIC DIGIT FOUR # + 136D ; Other_ID_Continue # No 1 ETHIOPIC DIGIT FIVE # + 136E ; Other_ID_Continue # No 1 ETHIOPIC DIGIT SIX # + 136F ; Other_ID_Continue # No 1 ETHIOPIC DIGIT SEVEN # + 1370 ; Other_ID_Continue # No 1 ETHIOPIC DIGIT EIGHT # + 1371 ; Other_ID_Continue # No 1 ETHIOPIC DIGIT NINE # + 19DA ; Other_ID_Continue # No 1 NEW TAI LUE THAM DIGIT ONE # + 200C ; Other_ID_Continue # Cf 1 ZERO WIDTH NON-JOINER # + 200D ; Other_ID_Continue # Cf 1 ZERO WIDTH JOINER # + 30FB ; Other_ID_Continue # Po 1 KATAKANA MIDDLE DOT # + FF65 ; Other_ID_Continue # Po 1 HALFWIDTH KATAKANA MIDDLE DOT # # => Total = 149,240 characters # # ================================================================================================== # # Derived Property XID_Continue ( ID_Continue MODIFIED for closure under NFKx ) : # # # ID_Continue 149,240 # # - 037A ; ID_Continue # Lm 1 GREEK YPOGEGRAMMENI # # - 309B ; ID_Continue # Sk 1 KATAKANA-HIRAGANA VOICED SOUND MARK # # - 309C ; ID_Continue # Sk 1 KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK # # - FC5E ; ID_Continue # Lo 1 ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM # - FC5F ; ID_Continue # Lo 1 ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM # - FC60 ; ID_Continue # Lo 1 ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM # - FC61 ; ID_Continue # Lo 1 ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM # - FC62 ; ID_Continue # Lo 1 ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM # - FC63 ; ID_Continue # Lo 1 ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM # # - FDFA ; ID_Continue # Lo 1 ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM # - FDFB ; ID_Continue # Lo 1 ARABIC LIGATURE JALLAJALALOUHOU # # - FE70 ; ID_Continue # Lm 1 ARABIC FATHATAN ISOLATED FORM # - FE72 ; ID_Continue # Lo 1 ARABIC DAMMATAN ISOLATED FORM # - FE74 ; ID_Continue # Lo 1 ARABIC KASRATAN ISOLATED FORM # - FE76 ; ID_Continue # Lo 1 ARABIC FATHA ISOLATED FORM # - FE78 ; ID_Continue # Lo 1 ARABIC DAMMA ISOLATED FORM # - FE7A ; ID_Continue # Lo 1 ARABIC KASRA ISOLATED FORM # - FE7C ; ID_Continue # Lo 1 ARABIC SHADDA ISOLATED FORM # - FE7E ; ID_Continue # Lo 1 ARABIC SUKUN ISOLATED FORM # # => Total = 149,221 characters # # ================================================================================================== # # From https://perldoc.perl.org/perldate/#identifier-parsing # # # Intersection of WORD and XID_Start properties + LOW LINE char : # # # Lu + Ll + Lt + Lm + Lo = # L* 145,672 ( = \p{lettter} or [[:alpha:]] ) # # # + 005F ; Connector_Punctuation # Pc 1 LOW LINE # # + 1885 ; Other_ID_Start # Mn 1 MONGOLIAN LETTER ALI GALI BALUDA ( NON-SPACING mark, common in WORD and XID_Start ) # # + 1886 ; Other_ID_Start # Mn 1 MONGOLIAN LETTER ALI GALI THREE BALUDA ( NON-SPACING mark, common in WORD and XID_Start ) # # # - 037A ; ID_Start # Lm 1 GREEK YPOGEGRAMMENI # # - 0E33 ; ID_Start # Lo 1 THAI CHARACTER SARA AM # # - 0EB3 ; ID_Start # Lo 1 LAO VOWEL SIGN AM # # - 2E2F ; # Lm 1 VERTICAL TILDE ( as ALREADY included in L* ) # # - FC5E ; ID_Start # Lo 1 ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM # - FC5F ; ID_Start # Lo 1 ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM # - FC60 ; ID_Start # Lo 1 ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM # - FC61 ; ID_Start # Lo 1 ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM # - FC62 ; ID_Start # Lo 1 ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM # - FC63 ; ID_Start # Lo 1 ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM # # # - FDFA ; ID_Start # Lo 1 ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM # - FDFB ; ID_Start # Lo 1 ARABIC LIGATURE JALLAJALALOUHOU # # - FE70 ; ID_Start # Lm 1 ARABIC FATHATAN ISOLATED FORM # - FE72 ; ID_Start # Lo 1 ARABIC DAMMATAN ISOLATED FORM # - FE74 ; ID_Start # Lo 1 ARABIC KASRATAN ISOLATED FORM # - FE76 ; ID_Start # Lo 1 ARABIC FATHA ISOLATED FORM # - FE78 ; ID_Start # Lo 1 ARABIC DAMMA ISOLATED FORM # - FE7A ; ID_Start # Lo 1 ARABIC KASRA ISOLATED FORM # - FE7C ; ID_Start # Lo 1 ARABIC SHADDA ISOLATED FORM # - FE7E ; ID_Start # Lo 1 ARABIC SUKUN ISOLATED FORM # # - FF9E ; ID_Start # Lm 1 HALFWIDTH KATAKANA VOICED SOUND MARK # - FF9F ; ID_Start # Lm 1 HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK # # => Total = 145,653 characters, which can START an IDENTIFIER # # ================================================================================================== # # From https://perldoc.perl.org/perldate/#identifier-parsing # # # Intersection of WORD and XID_Continue properties : # # # Lu + Ll + Lt + Lm + Lo = # L* 145,672 ( = \p{lettter} or [[:alpha:]] ) # # + Nonspacing_Mark # Mn 2,059 # # + Spacing_Mark # Mc 471 # # + Decimal_Number # Nd 770 # # + Connector_Punctuation # Pc 10 ( including the LOW LINE char : 005F _ ) # # + 200C ; Other_ID_Continue # Cf 1 ZERO WIDTH NON-JOINER ( FORMAT character, common in common in WORD and XID_Continue ) # # + 200D ; Other_ID_Continue # Cf 1 ZERO WIDTH JOINER ( FORMAT character, common in common in WORD and XID_Continue ) # # # - 037A ; ID_Continue # Lm 1 GREEK YPOGEGRAMMENI # # - 2E2F ; # Lm 1 VERTICAL TILDE ( as ALREADY included in L* ) # # - FC5E ; ID_Continue # Lo 1 ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM # - FC5F ; ID_Continue # Lo 1 ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM # - FC60 ; ID_Continue # Lo 1 ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM # - FC61 ; ID_Continue # Lo 1 ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM # - FC62 ; ID_Continue # Lo 1 ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM # - FC63 ; ID_Continue # Lo 1 ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM # # - FDFA ; ID_Continue # Lo 1 ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM # - FDFB ; ID_Continue # Lo 1 ARABIC LIGATURE JALLAJALALOUHOU # # - FE70 ; ID_Continue # Lm 1 ARABIC FATHATAN ISOLATED FORM # - FE72 ; ID_Continue # Lo 1 ARABIC DAMMATAN ISOLATED FORM # - FE74 ; ID_Continue # Lo 1 ARABIC KASRATAN ISOLATED FORM # - FE76 ; ID_Continue # Lo 1 ARABIC FATHA ISOLATED FORM # - FE78 ; ID_Continue # Lo 1 ARABIC DAMMA ISOLATED FORM # - FE7A ; ID_Continue # Lo 1 ARABIC KASRA ISOLATED FORM # - FE7C ; ID_Continue # Lo 1 ARABIC SHADDA ISOLATED FORM # - FE7E ; ID_Continue # Lo 1 ARABIC SUKUN ISOLATED FORM # # => Total = 148,966 characters, which can CONTINUE an IDENTIFIER #
However, the last two results
(?:(?=\p{XID_Start})\w|_)and(?=\p{XID_Continue})\w, above, are true ONLY IF the regex engine would respect all Unicode properties. Unfortunately, from a Boost point of view, which :-
Only considers that word characters are all in the BMP
-
Generally considers that word characters are those defined prior to the Unicode
5.3release !
I verified that, presently, only
47,681characters can begin an PERL identifier and only48,011characters can continue a PERL identifier !So, @Peterjones, in all cases, the regex rules, used in
Function Listfor Perl, are a rough approximation of what they should be !Now, Peter, the goal is to get a
Perlparser using the approximative BOOST\wdefinition, without the help of atomic structures.Refer to https://community.notepad-plus-plus.org/post/104861
Best Regards,
guy038
-