Columns++ version 1.3: All Unicode, all the time
-
Hello, @coises, @thomas-knoefel, @peterjones and All,
@coises, many thanks for your additional info. But, please, don’t be too upset by these regex oddities ! Of course, some class definitions seems different but, in all cases,
Columns++gives more accurate results than native N++ search, anyway !In fact, I did all these researches on the Unicode world as I wanted to clarify the status about identifiers, particularly with Perl, in order to find out a simplified formulation for the
Function ListPerl parser created by @peterjones and improved with your help, by using atomic structures !My first attempt was clearly insufficient because I only took
ASCIIcharacters into account. Peter adviced me to refer to the article, below :https://perldoc.perl.org/perldata#Identifier-parsing
which explains that, when using
UTF-8, thePerlidentifier syntax should be :/ (?[ ( \p{Word} & \p{XID_Start} ) + [_] ]) (?[ ( \p{Word} & \p{XID_Continue} ) ]) * /x or in a SINGLE line (?[ ( \p{Word} & \p{XID_Start} ) + [_] ])(?[ ( \p{Word} & \p{XID_Continue} ) ]) *Although the properties
\p{XID_Start}and\p{XID_Continue}are NOT part of theGeneral Categorylist and are not functional with theBoostregex engine, thisPerlsyntax could be expressed, in theory, with ourBoostregex engine as :(?:(?=\p{XID_Start})\w|_)(?=\p{XID_Continue})\w*
Now, with the
v17.0release of BabelMap software, I was able to get the complete and exact list of these properties :\p{WORD},\p{ID_Start},\p{ID_Continue},\p{XID_Start},\p{XID_Continue},Then, from these lists, I could deduce the Unicode characters count of the regexes
(?:(?=\p{XID_Start})\w|_)and(?=\p{XID_Continue})\w. Refer below :# ================================================================================================== # # Unicode 17.0.0 # # From article https://unicode.org/reports/tr18/tr18-23.html#word # # # Derived Property WORD : # # # Lu + Ll + Lt + Lm + Lo = # L* 145,672 = \p{lettter} or [[:alpha:]] # # + Decimal_Number # Nd 770 = \p{Decimal Digit Number} # ----------- # Total : 146,442 = Columns++ WORD chars - \x{005F} # # + Mc + Me + Mn # M* 2,543 = \p{Mark} # # + Connector_Punctuation # Pc 10 ( including the LOW LINE character \x{005F} ) # # + 200C ; Other_ID_Continue # Cf 1 ZERO WIDTH NON-JOINER ( JOIN-CONTROL character ) # # + 200D ; Other_ID_Continue # Cf 1 ZERO WIDTH JOINER ( JOIN-CONTROL character ) # # => Total = 148,997 characters # # ================================================================================================== # # From file 'DerivedCoreProperties.txt' : # # https://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt # # # Derived Property ID_Start : # # # Lu + Ll + Lt + Lm + Lo = # L* 145,672 ( = [[:alpha:]] ) # # + Letter_Number # Nl 239 # # + 1885 ; Other_ID_Start # Mn 1 MONGOLIAN LETTER ALI GALI BALUDA # # + 1886 ; Other_ID_Start # Mn 1 MONGOLIAN LETTER ALI GALI THREE BALUDA # # + 2118 ; Other_ID_Start # Sm 1 SCRIPT CAPITAL P # # + 212E ; Other_ID_Start # So 1 ESTIMATED SYMBOL # # + 309B ; Other_ID_Start # Sk 1 KATAKANA-HIRAGANA VOICED SOUND MARK # # + 309C ; Other_ID_Start # Sk 1 KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK # # - 2E2F ; # Lm 1 VERTICAL TILDE ( as INCLUDED in L* ) # # => Total = 145,916 characters # # ================================================================================================== # # Derived Property XID_Start ( ID_Start MODIFIED for closure under NFKx ) : # # # ID_Start 145,916 # # - 037A ; ID_Start # Lm 1 GREEK YPOGEGRAMMENI # # - 0E33 ; ID_Start # Lo 1 THAI CHARACTER SARA AM # # - 0EB3 ; ID_Start # Lo 1 LAO VOWEL SIGN AM # # - 309B ; Other_ID_Start # Sk 1 KATAKANA-HIRAGANA VOICED SOUND MARK # # - 309C ; Other_ID_Start # Sk 1 KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK # # - FC5E ; ID_Start # Lo 1 ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM # - FC5F ; ID_Start # Lo 1 ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM # - FC60 ; ID_Start # Lo 1 ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM # - FC61 ; ID_Start # Lo 1 ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM # - FC62 ; ID_Start # Lo 1 ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM # - FC63 ; ID_Start # Lo 1 ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM # # # - FDFA ; ID_Start # Lo 1 ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM # - FDFB ; ID_Start # Lo 1 ARABIC LIGATURE JALLAJALALOUHOU # # - FE70 ; ID_Start # Lm 1 ARABIC FATHATAN ISOLATED FORM # - FE72 ; ID_Start # Lo 1 ARABIC DAMMATAN ISOLATED FORM # - FE74 ; ID_Start # Lo 1 ARABIC KASRATAN ISOLATED FORM # - FE76 ; ID_Start # Lo 1 ARABIC FATHA ISOLATED FORM # - FE78 ; ID_Start # Lo 1 ARABIC DAMMA ISOLATED FORM # - FE7A ; ID_Start # Lo 1 ARABIC KASRA ISOLATED FORM # - FE7C ; ID_Start # Lo 1 ARABIC SHADDA ISOLATED FORM # - FE7E ; ID_Start # Lo 1 ARABIC SUKUN ISOLATED FORM # # - FF9E ; ID_Start # Lm 1 HALFWIDTH KATAKANA VOICED SOUND MARK # - FF9F ; ID_Start # Lm 1 HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK # # => Total = 145,893 characters # # ================================================================================================== # # Derived Property ID_Continue : # # # ID_Start = 145,916 # # - 1885 ; Other_ID_Start # Mn 1 MONGOLIAN LETTER ALI GALI BALUDA # # - 1886 ; Other_ID_Start # Mn 1 MONGOLIAN LETTER ALI GALI THREE BALUDA # # The TWO characters above must be SUBTRACTED because they are, both, INCLUDED in 'Other_ID_Start' and in 'Nonspacing Mark' # # + Nonspacing_Mark # Mn 2,059 # # + Spacing_Mark # Mc 471 # # + Decimal_Number # Nd 770 # # + Connector_Punctuation # Pc 10 ( including the LOW LINE char : 005F _ ) # # + 00B7 ; Other_ID_Continue # Po 1 MIDDLE DOT # + 0387 ; Other_ID_Continue # Po 1 GREEK ANO TELEIA # + 1369 ; Other_ID_Continue # No 1 ETHIOPIC DIGIT ONE # + 136A ; Other_ID_Continue # No 1 ETHIOPIC DIGIT TWO # + 136B ; Other_ID_Continue # No 1 ETHIOPIC DIGIT THREE # + 136C ; Other_ID_Continue # No 1 ETHIOPIC DIGIT FOUR # + 136D ; Other_ID_Continue # No 1 ETHIOPIC DIGIT FIVE # + 136E ; Other_ID_Continue # No 1 ETHIOPIC DIGIT SIX # + 136F ; Other_ID_Continue # No 1 ETHIOPIC DIGIT SEVEN # + 1370 ; Other_ID_Continue # No 1 ETHIOPIC DIGIT EIGHT # + 1371 ; Other_ID_Continue # No 1 ETHIOPIC DIGIT NINE # + 19DA ; Other_ID_Continue # No 1 NEW TAI LUE THAM DIGIT ONE # + 200C ; Other_ID_Continue # Cf 1 ZERO WIDTH NON-JOINER # + 200D ; Other_ID_Continue # Cf 1 ZERO WIDTH JOINER # + 30FB ; Other_ID_Continue # Po 1 KATAKANA MIDDLE DOT # + FF65 ; Other_ID_Continue # Po 1 HALFWIDTH KATAKANA MIDDLE DOT # # => Total = 149,240 characters # # ================================================================================================== # # Derived Property XID_Continue ( ID_Continue MODIFIED for closure under NFKx ) : # # # ID_Continue 149,240 # # - 037A ; ID_Continue # Lm 1 GREEK YPOGEGRAMMENI # # - 309B ; ID_Continue # Sk 1 KATAKANA-HIRAGANA VOICED SOUND MARK # # - 309C ; ID_Continue # Sk 1 KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK # # - FC5E ; ID_Continue # Lo 1 ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM # - FC5F ; ID_Continue # Lo 1 ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM # - FC60 ; ID_Continue # Lo 1 ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM # - FC61 ; ID_Continue # Lo 1 ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM # - FC62 ; ID_Continue # Lo 1 ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM # - FC63 ; ID_Continue # Lo 1 ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM # # - FDFA ; ID_Continue # Lo 1 ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM # - FDFB ; ID_Continue # Lo 1 ARABIC LIGATURE JALLAJALALOUHOU # # - FE70 ; ID_Continue # Lm 1 ARABIC FATHATAN ISOLATED FORM # - FE72 ; ID_Continue # Lo 1 ARABIC DAMMATAN ISOLATED FORM # - FE74 ; ID_Continue # Lo 1 ARABIC KASRATAN ISOLATED FORM # - FE76 ; ID_Continue # Lo 1 ARABIC FATHA ISOLATED FORM # - FE78 ; ID_Continue # Lo 1 ARABIC DAMMA ISOLATED FORM # - FE7A ; ID_Continue # Lo 1 ARABIC KASRA ISOLATED FORM # - FE7C ; ID_Continue # Lo 1 ARABIC SHADDA ISOLATED FORM # - FE7E ; ID_Continue # Lo 1 ARABIC SUKUN ISOLATED FORM # # => Total = 149,221 characters # # ================================================================================================== # # From https://perldoc.perl.org/perldate/#identifier-parsing # # # Intersection of WORD and XID_Start properties + LOW LINE char : # # # Lu + Ll + Lt + Lm + Lo = # L* 145,672 ( = \p{lettter} or [[:alpha:]] ) # # # + 005F ; Connector_Punctuation # Pc 1 LOW LINE # # + 1885 ; Other_ID_Start # Mn 1 MONGOLIAN LETTER ALI GALI BALUDA ( NON-SPACING mark, common in WORD and XID_Start ) # # + 1886 ; Other_ID_Start # Mn 1 MONGOLIAN LETTER ALI GALI THREE BALUDA ( NON-SPACING mark, common in WORD and XID_Start ) # # # - 037A ; ID_Start # Lm 1 GREEK YPOGEGRAMMENI # # - 0E33 ; ID_Start # Lo 1 THAI CHARACTER SARA AM # # - 0EB3 ; ID_Start # Lo 1 LAO VOWEL SIGN AM # # - 2E2F ; # Lm 1 VERTICAL TILDE ( as ALREADY included in L* ) # # - FC5E ; ID_Start # Lo 1 ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM # - FC5F ; ID_Start # Lo 1 ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM # - FC60 ; ID_Start # Lo 1 ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM # - FC61 ; ID_Start # Lo 1 ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM # - FC62 ; ID_Start # Lo 1 ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM # - FC63 ; ID_Start # Lo 1 ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM # # # - FDFA ; ID_Start # Lo 1 ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM # - FDFB ; ID_Start # Lo 1 ARABIC LIGATURE JALLAJALALOUHOU # # - FE70 ; ID_Start # Lm 1 ARABIC FATHATAN ISOLATED FORM # - FE72 ; ID_Start # Lo 1 ARABIC DAMMATAN ISOLATED FORM # - FE74 ; ID_Start # Lo 1 ARABIC KASRATAN ISOLATED FORM # - FE76 ; ID_Start # Lo 1 ARABIC FATHA ISOLATED FORM # - FE78 ; ID_Start # Lo 1 ARABIC DAMMA ISOLATED FORM # - FE7A ; ID_Start # Lo 1 ARABIC KASRA ISOLATED FORM # - FE7C ; ID_Start # Lo 1 ARABIC SHADDA ISOLATED FORM # - FE7E ; ID_Start # Lo 1 ARABIC SUKUN ISOLATED FORM # # - FF9E ; ID_Start # Lm 1 HALFWIDTH KATAKANA VOICED SOUND MARK # - FF9F ; ID_Start # Lm 1 HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK # # => Total = 145,653 characters, which can START an IDENTIFIER # # ================================================================================================== # # From https://perldoc.perl.org/perldate/#identifier-parsing # # # Intersection of WORD and XID_Continue properties : # # # Lu + Ll + Lt + Lm + Lo = # L* 145,672 ( = \p{lettter} or [[:alpha:]] ) # # + Nonspacing_Mark # Mn 2,059 # # + Spacing_Mark # Mc 471 # # + Decimal_Number # Nd 770 # # + Connector_Punctuation # Pc 10 ( including the LOW LINE char : 005F _ ) # # + 200C ; Other_ID_Continue # Cf 1 ZERO WIDTH NON-JOINER ( FORMAT character, common in WORD and XID_Continue ) # # + 200D ; Other_ID_Continue # Cf 1 ZERO WIDTH JOINER ( FORMAT character, common in WORD and XID_Continue ) # # # - 037A ; ID_Continue # Lm 1 GREEK YPOGEGRAMMENI # # - 2E2F ; # Lm 1 VERTICAL TILDE ( as ALREADY included in L* ) # # - FC5E ; ID_Continue # Lo 1 ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM # - FC5F ; ID_Continue # Lo 1 ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM # - FC60 ; ID_Continue # Lo 1 ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM # - FC61 ; ID_Continue # Lo 1 ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM # - FC62 ; ID_Continue # Lo 1 ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM # - FC63 ; ID_Continue # Lo 1 ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM # # - FDFA ; ID_Continue # Lo 1 ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM # - FDFB ; ID_Continue # Lo 1 ARABIC LIGATURE JALLAJALALOUHOU # # - FE70 ; ID_Continue # Lm 1 ARABIC FATHATAN ISOLATED FORM # - FE72 ; ID_Continue # Lo 1 ARABIC DAMMATAN ISOLATED FORM # - FE74 ; ID_Continue # Lo 1 ARABIC KASRATAN ISOLATED FORM # - FE76 ; ID_Continue # Lo 1 ARABIC FATHA ISOLATED FORM # - FE78 ; ID_Continue # Lo 1 ARABIC DAMMA ISOLATED FORM # - FE7A ; ID_Continue # Lo 1 ARABIC KASRA ISOLATED FORM # - FE7C ; ID_Continue # Lo 1 ARABIC SHADDA ISOLATED FORM # - FE7E ; ID_Continue # Lo 1 ARABIC SUKUN ISOLATED FORM # # => Total = 148,966 characters, which can CONTINUE an IDENTIFIER #
However, the last two results
(?:(?=\p{XID_Start})\w|_)and(?=\p{XID_Continue})\w, above, are true ONLY IF the regex engine would respect all Unicode properties. Unfortunately, from a Boost point of view, which :-
Only considers that word characters are all in the BMP
-
Generally considers that word characters are those defined prior to the Unicode
5.3release !
I verified that, presently, only
47,681characters can begin an PERL identifier and only48,011characters can continue a PERL identifier !So, @Peterjones, in all cases, the regex rules, used in
Function Listfor Perl, are a rough approximation of what they should be !Now, Peter, the goal is to get a
Perlparser using the approximative BOOST\wdefinition, without the help of atomic structures.Refer to https://community.notepad-plus-plus.org/post/104861
Best Regards,
guy038
-
-
G guy038 referenced this topic on
Hello! It looks like you're interested in this conversation, but you don't have an account yet.
Getting fed up of having to scroll through the same posts each visit? When you register for an account, you'll always come back to exactly where you were before, and choose to be notified of new replies (either via email, or push notification). You'll also be able to save bookmarks and upvote posts to show your appreciation to other community members.
With your input, this post could be even better 💗
Register Login