    Columns++ version 1.3: All Unicode, all the time

    • guy038
      last edited by guy038

      Hello, @coises, @thomas-knoefel, @peterjones and All,

      @coises, many thanks for your additional info. But, please, don’t be too upset by these regex oddities ! Of course, some class definitions seem different but, in all cases, Columns++ gives more accurate results than the native N++ search, anyway !

      In fact, I did all this research on the Unicode world because I wanted to clarify the status of identifiers, particularly in Perl, in order to find a simplified formulation for the Function List Perl parser created by @peterjones and improved, with your help, by using atomic structures !

      My first attempt was clearly insufficient because I only took ASCII characters into account. Peter advised me to refer to the article below :

      https://perldoc.perl.org/perldata#Identifier-parsing

      which explains that, when using UTF-8, the Perl identifier syntax should be :

      /  (?[ ( \p{Word} & \p{XID_Start} ) + [_] ])
         (?[ ( \p{Word} & \p{XID_Continue} ) ]) *   /x
      
      or in a SINGLE line
      
        (?[ ( \p{Word} & \p{XID_Start} ) + [_] ])(?[ ( \p{Word} & \p{XID_Continue} ) ]) *
      

      Although the properties \p{XID_Start} and \p{XID_Continue} are NOT part of the General Category list and are not supported by the Boost regex engine, this Perl syntax could be expressed, in theory, with our Boost regex engine as :

      (?:(?=\p{XID_Start})\w|_)(?:(?=\p{XID_Continue})\w)*
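
A side note on the lookahead construction above: a lookahead only constrains the single position where it stands, so to constrain every continuation character, the lookahead and the `\w` have to be quantified together, as in `(?:(?=...)\w)*`. Here is a minimal Python sketch of that point. Python's `re` module does not support `\p{...}`, so the ASCII classes `[a-f]` and `[a-f0-9]` are only stand-ins (my assumption) for \p{XID_Start} and \p{XID_Continue} :

```python
import re

# ASSUMPTION: toy ASCII stand-ins, NOT the real Unicode properties.
#   [a-f]     plays the role of \p{XID_Start}
#   [a-f0-9]  plays the role of \p{XID_Continue}

# Lookahead tested once, followed by an unconstrained \w*
loose = re.compile(r"(?:(?=[a-f])\w|_)(?=[a-f0-9])\w*")

# Lookahead tested again before EACH continuation character
strict = re.compile(r"(?:(?=[a-f])\w|_)(?:(?=[a-f0-9])\w)*")


def first_match(pattern: re.Pattern, text: str):
    """Return the text matched at position 0, or None if there is no match."""
    m = pattern.match(text)
    return m.group() if m else None


if __name__ == "__main__":
    for sample in ("abz", "a ", "_9x", "zed"):
        print(repr(sample), first_match(loose, sample), first_match(strict, sample))
```

With the loose form, `z` is swallowed by `\w*` in `abz`, and a one-character identifier such as the `a` in `a ` is not matched at all, because the lookahead fails on the following space.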


      Now, with the v17.0 release of BabelMap software, I was able to get the complete and exact list of these properties : \p{WORD}, \p{ID_Start}, \p{ID_Continue}, \p{XID_Start} and \p{XID_Continue}.
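
For anyone without BabelMap, the General Category counts can be cross-checked roughly with Python's standard `unicodedata` module. This is only a sketch: the Unicode version bundled with your Python (shown by `unicodedata.unidata_version`) is probably not 17.0.0, so the figures may differ from the ones in this post, and the helper name `category_counts` is mine :

```python
import sys
import unicodedata
from collections import Counter


def category_counts() -> Counter:
    """Count the code points of each General Category (Lu, Ll, Mn, Nd, ...)."""
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        counts[unicodedata.category(chr(cp))] += 1
    return counts


counts = category_counts()

# L* = Lu + Ll + Lt + Lm + Lo and M* = Mn + Mc + Me
letters = sum(counts[c] for c in ("Lu", "Ll", "Lt", "Lm", "Lo"))
marks = sum(counts[c] for c in ("Mn", "Mc", "Me"))

if __name__ == "__main__":
    print("Unicode version of this Python :", unicodedata.unidata_version)
    print("L* =", letters, " M* =", marks)
    print("Nd =", counts["Nd"], " Nl =", counts["Nl"], " Pc =", counts["Pc"])
```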

      Then, from these lists, I could deduce the Unicode characters count of the regexes (?:(?=\p{XID_Start})\w|_) and (?=\p{XID_Continue})\w. Refer below :

      # ==================================================================================================
      #
      # Unicode 17.0.0
      #
      # From article https://unicode.org/reports/tr18/tr18-23.html#word
      #
      #
      # Derived Property WORD :
      #
      #
      #  Lu + Ll + Lt + Lm + Lo =     #  L*  145,672  = \p{Letter}  or  [[:alpha:]]
      #
      #  + Decimal_Number             #  Nd      770  =  \p{Decimal Digit Number}
      #                                    -----------
      # Total :                              146,442  =  Columns++ WORD chars - \x{005F}
      #
      #  + Mc + Me + Mn               #  M*    2,543  =  \p{Mark}
      #
      #  + Connector_Punctuation      #  Pc       10  ( including the LOW LINE character \x{005F} )
      #
      #  + 200C ;  Other_ID_Continue  #  Cf        1  ZERO WIDTH NON-JOINER ( JOIN-CONTROL character )
      #
      #  + 200D ;  Other_ID_Continue  #  Cf        1  ZERO WIDTH JOINER     ( JOIN-CONTROL character )
      #
      #  =>  Total = 148,997 characters
      #
      # ==================================================================================================
      #
      # From file 'DerivedCoreProperties.txt'  :
      #
      # https://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt
      #
      #
      # Derived Property ID_Start :
      #
      #
      #  Lu + Ll + Lt + Lm + Lo =     # L*  145,672  ( = [[:alpha:]] )
      #
      #  + Letter_Number              # Nl      239
      #
      #  + 1885 ;  Other_ID_Start     # Mn        1  MONGOLIAN LETTER ALI GALI BALUDA
      #
      #  + 1886 ;  Other_ID_Start     # Mn        1  MONGOLIAN LETTER ALI GALI THREE BALUDA
      #
      #  + 2118 ;  Other_ID_Start     # Sm        1  SCRIPT CAPITAL P
      #
      #  + 212E ;  Other_ID_Start     # So        1  ESTIMATED SYMBOL
      #
      #  + 309B ;  Other_ID_Start     # Sk        1  KATAKANA-HIRAGANA VOICED SOUND MARK
      #
      #  + 309C ;  Other_ID_Start     # Sk        1  KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
      #
      #  - 2E2F ;                     # Lm        1  VERTICAL TILDE   ( as INCLUDED in L* )
      #
      #  =>  Total = 145,916 characters
      #
      # ==================================================================================================
      #
      # Derived Property XID_Start ( ID_Start MODIFIED for closure under NFKx ) :
      #
      #
      #  ID_Start                           145,916
      #
      #  - 037A ;  ID_Start           # Lm        1  GREEK YPOGEGRAMMENI
      #
      #  - 0E33 ;  ID_Start           # Lo        1  THAI CHARACTER SARA AM
      #
      #  - 0EB3 ;  ID_Start           # Lo        1  LAO VOWEL SIGN AM
      #
      #  - 309B ;  Other_ID_Start     # Sk        1  KATAKANA-HIRAGANA VOICED SOUND MARK
      #
      #  - 309C ;  Other_ID_Start     # Sk        1  KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
      #
      #  - FC5E ;  ID_Start           # Lo        1  ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
      #  - FC5F ;  ID_Start           # Lo        1  ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
      #  - FC60 ;  ID_Start           # Lo        1  ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
      #  - FC61 ;  ID_Start           # Lo        1  ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
      #  - FC62 ;  ID_Start           # Lo        1  ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
      #  - FC63 ;  ID_Start           # Lo        1  ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
      #
      #
      #  - FDFA ;  ID_Start           # Lo        1  ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
      #  - FDFB ;  ID_Start           # Lo        1  ARABIC LIGATURE JALLAJALALOUHOU
      #
      #  - FE70 ;  ID_Start           # Lo        1  ARABIC FATHATAN ISOLATED FORM
      #  - FE72 ;  ID_Start           # Lo        1  ARABIC DAMMATAN ISOLATED FORM
      #  - FE74 ;  ID_Start           # Lo        1  ARABIC KASRATAN ISOLATED FORM
      #  - FE76 ;  ID_Start           # Lo        1  ARABIC FATHA ISOLATED FORM
      #  - FE78 ;  ID_Start           # Lo        1  ARABIC DAMMA ISOLATED FORM
      #  - FE7A ;  ID_Start           # Lo        1  ARABIC KASRA ISOLATED FORM
      #  - FE7C ;  ID_Start           # Lo        1  ARABIC SHADDA ISOLATED FORM
      #  - FE7E ;  ID_Start           # Lo        1  ARABIC SUKUN ISOLATED FORM
      #
      #  - FF9E ;  ID_Start           # Lm        1  HALFWIDTH KATAKANA VOICED SOUND MARK
      #  - FF9F ;  ID_Start           # Lm        1  HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
      #
      #  =>  Total = 145,893 characters
      #
      # ==================================================================================================
      #
      # Derived Property ID_Continue :
      #
      #
      #  ID_Start =                         145,916
      #
      #  - 1885 ;  Other_ID_Start     #  Mn       1  MONGOLIAN LETTER ALI GALI BALUDA
      #
      #  - 1886 ;  Other_ID_Start     #  Mn       1  MONGOLIAN LETTER ALI GALI THREE BALUDA
      #
      #  The TWO characters above must be SUBTRACTED because they are, both, INCLUDED in 'Other_ID_Start' and in 'Nonspacing Mark'
      #
      #  + Nonspacing_Mark            #  Mn   2,059
      #
      #  + Spacing_Mark               #  Mc     471
      #
      #  + Decimal_Number             #  Nd     770
      #
      #  + Connector_Punctuation      #  Pc      10  ( including the LOW LINE char : 005F _ )
      #
      #  + 00B7 ;  Other_ID_Continue  #  Po       1  MIDDLE DOT
      #  + 0387 ;  Other_ID_Continue  #  Po       1  GREEK ANO TELEIA
      #  + 1369 ;  Other_ID_Continue  #  No       1  ETHIOPIC DIGIT ONE
      #  + 136A ;  Other_ID_Continue  #  No       1  ETHIOPIC DIGIT TWO
      #  + 136B ;  Other_ID_Continue  #  No       1  ETHIOPIC DIGIT THREE
      #  + 136C ;  Other_ID_Continue  #  No       1  ETHIOPIC DIGIT FOUR
      #  + 136D ;  Other_ID_Continue  #  No       1  ETHIOPIC DIGIT FIVE
      #  + 136E ;  Other_ID_Continue  #  No       1  ETHIOPIC DIGIT SIX
      #  + 136F ;  Other_ID_Continue  #  No       1  ETHIOPIC DIGIT SEVEN
      #  + 1370 ;  Other_ID_Continue  #  No       1  ETHIOPIC DIGIT EIGHT
      #  + 1371 ;  Other_ID_Continue  #  No       1  ETHIOPIC DIGIT NINE
      #  + 19DA ;  Other_ID_Continue  #  No       1  NEW TAI LUE THAM DIGIT ONE
      #  + 200C ;  Other_ID_Continue  #  Cf       1  ZERO WIDTH NON-JOINER
      #  + 200D ;  Other_ID_Continue  #  Cf       1  ZERO WIDTH JOINER
      #  + 30FB ;  Other_ID_Continue  #  Po       1  KATAKANA MIDDLE DOT
      #  + FF65 ;  Other_ID_Continue  #  Po       1  HALFWIDTH KATAKANA MIDDLE DOT
      #
      #  =>  Total = 149,240 characters
      #
      # ==================================================================================================
      #
      # Derived Property XID_Continue ( ID_Continue MODIFIED for closure under NFKx ) :
      #
      #
      #  ID_Continue                        149,240
      #
      #  - 037A ;  ID_Continue        #  Lm       1  GREEK YPOGEGRAMMENI
      #
      #  - 309B ;  ID_Continue        #  Sk       1  KATAKANA-HIRAGANA VOICED SOUND MARK
      #
      #  - 309C ;  ID_Continue        #  Sk       1  KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
      #
      #  - FC5E ;  ID_Continue        #  Lo       1  ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
      #  - FC5F ;  ID_Continue        #  Lo       1  ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
      #  - FC60 ;  ID_Continue        #  Lo       1  ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
      #  - FC61 ;  ID_Continue        #  Lo       1  ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
      #  - FC62 ;  ID_Continue        #  Lo       1  ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
      #  - FC63 ;  ID_Continue        #  Lo       1  ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
      #
      #  - FDFA ;  ID_Continue        #  Lo       1  ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
      #  - FDFB ;  ID_Continue        #  Lo       1  ARABIC LIGATURE JALLAJALALOUHOU
      #
      #  - FE70 ;  ID_Continue        #  Lo       1  ARABIC FATHATAN ISOLATED FORM
      #  - FE72 ;  ID_Continue        #  Lo       1  ARABIC DAMMATAN ISOLATED FORM
      #  - FE74 ;  ID_Continue        #  Lo       1  ARABIC KASRATAN ISOLATED FORM
      #  - FE76 ;  ID_Continue        #  Lo       1  ARABIC FATHA ISOLATED FORM
      #  - FE78 ;  ID_Continue        #  Lo       1  ARABIC DAMMA ISOLATED FORM
      #  - FE7A ;  ID_Continue        #  Lo       1  ARABIC KASRA ISOLATED FORM
      #  - FE7C ;  ID_Continue        #  Lo       1  ARABIC SHADDA ISOLATED FORM
      #  - FE7E ;  ID_Continue        #  Lo       1  ARABIC SUKUN ISOLATED FORM
      #
      #  =>  Total = 149,221 characters
      #
      # ==================================================================================================
      #
      #  From https://perldoc.perl.org/perldata#Identifier-parsing
      #
      #
      #  Intersection of WORD and XID_Start properties + LOW LINE char :
      #
      #
      #  Lu + Ll + Lt + Lm + Lo =         # L*  145,672  ( = \p{Letter}  or  [[:alpha:]] )
      #
      #
      #  + 005F ;  Connector_Punctuation  # Pc        1  LOW LINE
      #
      #  + 1885 ;  Other_ID_Start         # Mn        1  MONGOLIAN LETTER ALI GALI BALUDA        ( NON-SPACING mark, common in WORD and XID_Start )
      #
      #  + 1886 ;  Other_ID_Start         # Mn        1  MONGOLIAN LETTER ALI GALI THREE BALUDA  ( NON-SPACING mark, common in WORD and XID_Start )
      #
      #
      #  - 037A ;  ID_Start               # Lm        1  GREEK YPOGEGRAMMENI
      #
      #  - 0E33 ;  ID_Start               # Lo        1  THAI CHARACTER SARA AM
      #
      #  - 0EB3 ;  ID_Start               # Lo        1  LAO VOWEL SIGN AM
      #
      #  - 2E2F ;                         # Lm        1  VERTICAL TILDE   ( as ALREADY included in L* )
      #
      #  - FC5E ;  ID_Start               # Lo        1  ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
      #  - FC5F ;  ID_Start               # Lo        1  ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
      #  - FC60 ;  ID_Start               # Lo        1  ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
      #  - FC61 ;  ID_Start               # Lo        1  ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
      #  - FC62 ;  ID_Start               # Lo        1  ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
      #  - FC63 ;  ID_Start               # Lo        1  ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
      #
      #
      #  - FDFA ;  ID_Start               # Lo        1  ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
      #  - FDFB ;  ID_Start               # Lo        1  ARABIC LIGATURE JALLAJALALOUHOU
      #
      #  - FE70 ;  ID_Start               # Lo        1  ARABIC FATHATAN ISOLATED FORM
      #  - FE72 ;  ID_Start               # Lo        1  ARABIC DAMMATAN ISOLATED FORM
      #  - FE74 ;  ID_Start               # Lo        1  ARABIC KASRATAN ISOLATED FORM
      #  - FE76 ;  ID_Start               # Lo        1  ARABIC FATHA ISOLATED FORM
      #  - FE78 ;  ID_Start               # Lo        1  ARABIC DAMMA ISOLATED FORM
      #  - FE7A ;  ID_Start               # Lo        1  ARABIC KASRA ISOLATED FORM
      #  - FE7C ;  ID_Start               # Lo        1  ARABIC SHADDA ISOLATED FORM
      #  - FE7E ;  ID_Start               # Lo        1  ARABIC SUKUN ISOLATED FORM
      #
      #  - FF9E ;  ID_Start               # Lm        1  HALFWIDTH KATAKANA VOICED SOUND MARK
      #  - FF9F ;  ID_Start               # Lm        1  HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
      #
      #  =>  Total = 145,653 characters, which can START an IDENTIFIER
      #
      # ==================================================================================================
      #
      #  From  https://perldoc.perl.org/perldata#Identifier-parsing
      #
      #
      #  Intersection of WORD and XID_Continue properties :
      #
      #
      #  Lu + Ll + Lt + Lm + Lo =     #  L*  145,672  ( = \p{Letter}  or  [[:alpha:]] )
      #
      #  + Nonspacing_Mark            #  Mn    2,059
      #
      #  + Spacing_Mark               #  Mc      471
      #
      #  + Decimal_Number             #  Nd      770
      #
      #  + Connector_Punctuation      #  Pc       10  ( including the LOW LINE char : 005F _ )
      #
      #  + 200C ;  Other_ID_Continue  #  Cf        1  ZERO WIDTH NON-JOINER    ( FORMAT character, common in WORD and XID_Continue )
      #
      #  + 200D ;  Other_ID_Continue  #  Cf        1  ZERO WIDTH JOINER        ( FORMAT character, common in WORD and XID_Continue )
      #
      #
      #  - 037A ;  ID_Continue        #  Lm        1  GREEK YPOGEGRAMMENI
      #
      #  - 2E2F ;                     #  Lm        1  VERTICAL TILDE   ( as ALREADY included in L* )
      #
      #  - FC5E ;  ID_Continue        #  Lo        1  ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
      #  - FC5F ;  ID_Continue        #  Lo        1  ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
      #  - FC60 ;  ID_Continue        #  Lo        1  ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
      #  - FC61 ;  ID_Continue        #  Lo        1  ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
      #  - FC62 ;  ID_Continue        #  Lo        1  ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
      #  - FC63 ;  ID_Continue        #  Lo        1  ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
      #
      #  - FDFA ;  ID_Continue        #  Lo        1  ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
      #  - FDFB ;  ID_Continue        #  Lo        1  ARABIC LIGATURE JALLAJALALOUHOU
      #
      #  - FE70 ;  ID_Continue        #  Lo        1  ARABIC FATHATAN ISOLATED FORM
      #  - FE72 ;  ID_Continue        #  Lo        1  ARABIC DAMMATAN ISOLATED FORM
      #  - FE74 ;  ID_Continue        #  Lo        1  ARABIC KASRATAN ISOLATED FORM
      #  - FE76 ;  ID_Continue        #  Lo        1  ARABIC FATHA ISOLATED FORM
      #  - FE78 ;  ID_Continue        #  Lo        1  ARABIC DAMMA ISOLATED FORM
      #  - FE7A ;  ID_Continue        #  Lo        1  ARABIC KASRA ISOLATED FORM
      #  - FE7C ;  ID_Continue        #  Lo        1  ARABIC SHADDA ISOLATED FORM
      #  - FE7E ;  ID_Continue        #  Lo        1  ARABIC SUKUN ISOLATED FORM
      #
      #  =>  Total = 148,966 characters, which can CONTINUE an IDENTIFIER
      #
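
The seven totals of the table above can be re-checked with a few lines of arithmetic. The short Python sketch below only re-adds the counts quoted in this post (it does not query any Unicode database); the variable names are mine, and the small constants are the numbers of individually listed characters in each section :

```python
# Counts quoted in the table above (Unicode 17.0.0)
L, Nd, Nl = 145_672, 770, 239            # letters, decimal digits, letter numbers
Mn, Mc, M, Pc = 2_059, 471, 2_543, 10    # marks and connector punctuation

# WORD = L* + Nd + M* + Pc + ZWNJ + ZWJ
word = L + Nd + M + Pc + 2

# ID_Start = L* + Nl + 6 Other_ID_Start - VERTICAL TILDE
id_start = L + Nl + 6 - 1

# XID_Start = ID_Start - 23 listed characters
xid_start = id_start - 23

# ID_Continue = ID_Start - 2 (already in Mn) + Mn + Mc + Nd + Pc + 16 Other_ID_Continue
id_continue = id_start - 2 + Mn + Mc + Nd + Pc + 16

# XID_Continue = ID_Continue - 19 listed characters
xid_continue = id_continue - 19

# ( WORD & XID_Start ) + LOW LINE = L* + 1 + 2 Mongolian marks - 22 listed characters
perl_start = L + 1 + 2 - 22

# WORD & XID_Continue = L* + Mn + Mc + Nd + Pc + 2 join controls - 18 listed characters
perl_continue = L + Mn + Mc + Nd + Pc + 2 - 18

if __name__ == "__main__":
    for name in ("word", "id_start", "xid_start", "id_continue",
                 "xid_continue", "perl_start", "perl_continue"):
        print(f"{name:14} = {globals()[name]:,}")
```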
      

      However, the last two results, for (?:(?=\p{XID_Start})\w|_) and (?=\p{XID_Continue})\w, above, hold ONLY IF the regex engine respects all the Unicode properties. Unfortunately, the Boost regex engine :

      • Only considers that word characters are all in the BMP

      • Generally considers that word characters are those defined prior to the Unicode 5.3 release !

      I verified that, presently, only 47,681 characters can begin a Perl identifier and only 48,011 characters can continue a Perl identifier !
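
To get a feel for how much a BMP-only engine leaves out, here is a rough Python sketch that counts the start and continue characters inside and outside the BMP. It relies on `str.isidentifier()`, which follows Python's own rules (XID_Start plus the underscore, then XID_Continue, with the Unicode version bundled in your Python), so it is NOT the Perl intersection with \p{Word} and it will not reproduce the Boost figures above; the helper names are mine :

```python
import sys

BMP_END = 0x10000  # first code point beyond the Basic Multilingual Plane


def can_start(ch: str) -> bool:
    """True if ch alone is a valid Python identifier (XID_Start or '_')."""
    return ch.isidentifier()


def can_continue(ch: str) -> bool:
    """True if ch may follow a valid first character (XID_Continue)."""
    return ("a" + ch).isidentifier()


def count(predicate, start: int, stop: int) -> int:
    """Count the code points in [start, stop) accepted by predicate."""
    return sum(1 for cp in range(start, stop) if predicate(chr(cp)))


bmp_start = count(can_start, 0, BMP_END)
all_start = bmp_start + count(can_start, BMP_END, sys.maxunicode + 1)
bmp_continue = count(can_continue, 0, BMP_END)
all_continue = bmp_continue + count(can_continue, BMP_END, sys.maxunicode + 1)

if __name__ == "__main__":
    print("start    : BMP", bmp_start, " all planes", all_start)
    print("continue : BMP", bmp_continue, " all planes", all_continue)
```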

      So, @Peterjones, in all cases, the regex rules, used in Function List for Perl, are a rough approximation of what they should be !

      Now, Peter, the goal is to get a Perl parser using the approximate BOOST \w definition, without the help of atomic structures.

      Refer to https://community.notepad-plus-plus.org/post/104861

      Best Regards,

      guy038
