Tests and impressions on the "View > Summary..." functionality



  • Hello All,

    Recently, I’ve been looking at the results given by the N++ Summary feature ( View > Summary... ). And I must say that numerous things are really weird !

For my tests, I used contents with a lot of Unicode characters, both in the Basic Multilingual Plane and, sometimes, over the BMP too, saved in the 4 N++ Unicode encodings as well as in an ANSI file containing the 256 characters of the Windows-1252 encoding :

    • ANSI
    • UTF-8
    • UTF-8-BOM
    • UCS-2 BE BOM
    • UCS-2 LE BOM

    To my mind, there are 3 major problems and some minor points :

    • The first and worst problem is the fact that, when a UTF-8[-BOM] file, containing various Unicode chars ( of the BMP only : this point is important ! ), is copied into a UCS-2 BE BOM or UCS-2 LE BOM encoded file, some results given by the Summary feature for these files are totally wrong :

      • The Characters ( without line endings ) value seems to be the number of bytes used in the corresponding UTF-8[-BOM] file

      • The Document length value seems to be the document length of the corresponding UTF-8[-BOM] file and is also displayed, unfortunately, in the status bar !

    • The second problem is that the definition of a word char by the Summary feature is definitively NOT the same as the definition of the regex \w, as explained further on !

    • Thus, the third problem is that the given number of words is totally inaccurate ! And, anyway, the number of words, although well enough defined for an English / American text, is rather a vague notion for a lot of texts written in other languages, especially Asian ones ! ( See further on )

    • Some minor things :

      • The number of lines given is, most of the time, increased by one unit

      • Presently, the Summary feature displays the document length in the Notepad++ buffer. I think it would be good to display, as well, the actual document length saved on disk. Incidentally, for just-saved documents, the difference would give the length of the possible Byte Order Mark, if its size were not explicitly displayed !

      • For UTF-8 or UTF-8-BOM encoded files, a decomposition giving the number of chars coded with 1, 2, 3 and 4 bytes ( 4 bytes for chars over the BMP ) would be welcome !

    So, in brief, in the present Summary window :

    • The Characters (without line endings): number is wrong for the UCS-2 BE BOM or UCS-2 LE BOM encodings

    • The Words number is totally wrong, given the regex definition of a word character, whatever the encoding used

    • The Lines: number is wrong, by one unit, if a line-break ends the last line of current file, in any encoding

    • The Document length value, in the N++ buffer, is wrong for the UCS-2 BE BOM or UCS-2 LE BOM encodings, as well as the Length: indication in the status bar

    Note that I’m about to create an issue for the wrong results returned for UCS-2 BE BOM and UCS-2 LE BOM encoded files !


    To begin with, let me develop the… second bug ! After numerous tests, I determined that, in the present View > Summary… feature, the characters considered as word characters are :

    • The C0 control characters, except for the Tabulation ( \x{0009} ) and the two EOL ( \x{000a} and \x{000d} ), so the regex (?![\t\r\n])[\x00-\x1F]

    • The number sign #

    • The 10 digits, so the regex [0-9]

    • The 26 uppercase and 26 lowercase letters, so the regex (?i)[A-Z]

    • The low line character _

    • All the characters of the Basic Multilingual Plane ( BMP ) with code-point over \x{007E}, so the regex (?![\x{D800}-\x{DFFF}])[\x{007F}-\x{FFFF}] for a Unicode encoded file or [\x7F-\xFF] for an ANSI encoded file

    • All the characters over the Basic Multilingual Plane, so the regex (?-s).[\x{D800}-\x{DFFF}], for a Unicode encoded file only

    To simulate the present Words: number ( which is erroneous ! ) given by the Summary feature, whatever the file encoding, simply use the regex below :

    [^\t\n\r\x20!"$%&'()*+,\-./:;<=>?@\[\\\]^`{|}~]+
    

    and click on the Count button of the Find dialog, with the Wrap around option ticked
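This erroneous count can also be reproduced outside Notepad++, for instance with Python’s re module ( a different engine than the Boost one used by N++, but this character class is pure ASCII, so the behaviour should be identical ) :

```python
import re

# Character class reproducing Notepad++'s current (erroneous) "word" tokens:
# everything EXCEPT tab, EOLs, space and most ASCII punctuation.
# Note that '#' and '_' are NOT excluded, so they count as word characters.
SUMMARY_WORD = r"[^\t\n\r\x20!\"$%&'()*+,\-./:;<=>?@\[\\\]^`{|}~]+"

def summary_word_count(text):
    """Approximate the number that the Summary dialog reports as Words:"""
    return len(re.findall(SUMMARY_WORD, text))

print(summary_word_count("Hello, world! #42_ok"))  # 3 tokens: Hello / world / #42_ok
print(summary_word_count("!!! ???"))               # 0: only excluded punctuation
```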

    Obviously, this is not correct, as a single word character should be matched by the \w regex, which is the class [\u\l\d_], where \u, \l and \d represent any Unicode uppercase letter, lowercase letter and digit or a related char, so, finally, much more than the simple [A-Za-z0-9] set !

    But, worse, it’s the notion of word itself which is, in practice, not consistent, most of the time ! Indeed, for instance, if we consider the French expression l'école ( the school ), the regex \w+ would return 2 words, which is correct, as this expression can be mentally decomposed as la école. However, this regex would wrongly say that the single word aujourd'hui ( today ) is a two-word expression. Of course, you could change the regex to [\w']+, which would return 1 word for aujourd'hui, but, this time, the expression l'école would wrongly be counted as one single word !
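The two French examples can be checked directly in Python, whose \w also matches accented Unicode letters :

```python
import re

# \w+ splits on the apostrophe: right for l'école (la + école),
# wrong for aujourd'hui (a single word)
print(re.findall(r"\w+", "l'école"))        # ['l', 'école']     -> 2 tokens
print(re.findall(r"\w+", "aujourd'hui"))    # ['aujourd', 'hui'] -> 2 tokens

# [\w']+ keeps the apostrophe: now aujourd'hui is one token, but so is l'école
print(re.findall(r"[\w']+", "l'école"))     # ["l'école"]        -> 1 token
print(re.findall(r"[\w']+", "aujourd'hui")) # ["aujourd'hui"]    -> 1 token
```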

    In addition, what can be said about languages that do not use the Space character, or where the use of the Space is discretionary ? Then, counting words is impossible, or rather non-significant ! This is developed in the Martin Haspelmath article below :

    https://zenodo.org/record/225844/files/WordSegmentationFL.pdf

    At the end of section 5, it is said : … On such a view, the claim that “all languages have words” (Radford et al. 1999: 145) would be interpretable only in the weaker sense that "all languages have a unit which falls between the minimal sign and the phrase” …

    And : … The basic problem remains the same: The units are defined in a language-specific way and cannot be equated across languages, and there is no reason to give special status to a unit called ‘word’. …

    At the beginning of section 7 : … Linguists have no good basis for identifying words across languages …

    And in the conclusion, section 10 : … I conclude, from the arguments presented in this article, that there is no definition of ‘word’ that can be applied to any language and that would yield consistent results …


    Now, the Unicode definition of a word character is :

    \p{Alphabetic} | \p{gc=Mark} | \p{gc=Decimal_Number} | \p{gc=Connector_Punctuation} | \p{Join_Control}

    https://stackoverflow.com/questions/5555613/does-w-match-all-alphanumeric-characters-defined-in-the-unicode-standard

    https://www.unicode.org/reports/tr18/#Simple_Word_Boundaries

    So, in theory, the word_character class should include :

    • All values of the derived binary property Alphabetic ( = alpha = \p{Alphabetic} ), so 132,875 chars, from the DerivedCoreProperties.txt file, which can be decomposed into :

      • Uppercase_Letter (Lu) + Lowercase_Letter (Ll) + Titlecase_Letter (Lt) + Modifier_Letter (Lm) + Other_Letter (Lo) + Letter_Number (Nl) + Other_Alphabetic, so the characters sum 1,791 + 2,155 + 31 + 260 + 127,004 + 236 + 1,398

      • Note : The last property, Other_Alphabetic, from the PropList.txt file, contains some, but not all, characters from the 3 General_Categories Spacing_Mark ( Mc ), Nonspacing_Mark ( Mn ) and Other_Symbol ( So ), so the characters sum 417 + 851 + 130

    • All values with General_Category = Decimal_Number, from the DerivedGeneralCategory.txt file, so 650 characters

      ( These are the characters with defined values in the three fields 6, 7 and 8 of the UnicodeData.txt file )

    • All values with General_Category = Connector_Punctuation, from the DerivedGeneralCategory.txt file, so 10 characters

    • All values with the binary Property Join_Control, from the PropList.txt file, so 2 characters

    So, if we include all Unicode languages, even historical ones :

    => Total number of Unicode word characters = 132,875 + 650 + 10 + 2 = 133,537 characters, with version UNICODE 13.0.0 !!
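A quick sanity check of this arithmetic, and of the General_Category counts, can be done with Python’s unicodedata module. Note that the counts depend on the Unicode version bundled with your Python build, and that the derived property Alphabetic is not exposed by unicodedata, so only the gc-based parts can be verified this way :

```python
import unicodedata
from collections import Counter

# The arithmetic of the decomposition above
assert 1791 + 2155 + 31 + 260 + 127004 + 236 + 1398 == 132875
assert 132875 + 650 + 10 + 2 == 133537

# Count General_Category values over the whole code space
counts = Counter(unicodedata.category(chr(cp)) for cp in range(0x110000))

# Nd = Decimal_Number (650 in Unicode 13.0), Pc = Connector_Punctuation (10)
print(unicodedata.unidata_version, counts["Nd"], counts["Pc"])
```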

    Notes :

    • The different files mentioned can be downloaded from the Unicode Character Database ( UCD ) or sub-directories, below :

    http://www.unicode.org/Public/UCD/latest/ucd/

    • And refer to the sites, below, for additional information to this topic :

    https://www.unicode.org/reports/tr18/#Compatibility_Properties

    https://www.unicode.org/reports/tr29/#Word_Boundaries

    https://www.unicode.org/reports/tr31/    for tables 4, 5 and 6 of section 2.4

    https://www.unicode.org/reports/tr44/#UnicodeData.txt


    If you did click on the links to the Unicode Consortium, above, you understood, very quickly, that the word character and word boundary notions are a real nightmare !

    Even if we restrict the definition of word chars to Unicode living scripts, forgetting all the historical scripts no longer in use, and also leaving aside all scripts which do not use the space char to systematically delimit words, we still have a list of about 21,000 characters which should be considered as word characters ! I tried to build up such a list, with the help of these sites :

    https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries

    https://linguistlist.org/issues/6/6-1302/

    https://unicode-org.github.io/cldr-staging/charts/37/supplemental/scripts_and_languages.html

    https://scriptsource.org/cms/scripts/page.php?item_id=script_overview

    And I ended up with the list below, of which 36 living scripts always use a Space character between words :

    •-----------------------------•----------------•-------------------•-----------------•
    |                             |    SCRIPT      |   SPACE between   |  UNICODE Script |
    |                             |      Type :    |      Words :      |      Class :    |
    |                             •----------------•-------------------•-----------------•
    |           SCRIPT            |  (L)iving      |  (Y)es            |  (R)ecommended  |
    |                             |                |  (U)nspecified    |  (L)imited      |
    |                             |  (H)istorical  |  (D)iscretionary  |  (E)xcluded     |
    |                             |                |  (N)o             |                 |
    •-----------------------------•----------------•-------------------•-----------------•
    |  ARMENIAN                   |       L        |         Y         |        R        |
    |  ADLAM                      |       L        |         Y         |        L        |
    |  ARABIC                     |       L        |         Y         |        R        |
    |  BALINESE                   |       L        |         Y         |        L        |
    |  BENGALI ( Assamese )       |       L        |         Y         |        R        |
    |  BOPOMOFO                   |       L        |         Y         |        R        |
    |  CANADIAN SYLLABICS         |       L        |         Y         |        L        |
    |  CHEROKEE                   |       L        |         Y         |        L        |
    |  CYRILLIC                   |       L        |         Y         |        R        |
    |  DEVANAGARI                 |       L        |         Y         |        R        |
    |  ETHIOPIC (Ge'ez)           |       L        |         Y         |        R        |
    |  GEORGIAN                   |       L        |         Y         |        R        |
    |  GREEK                      |       L        |         Y         |        R        |
    |  GUJARATI                   |       L        |         Y         |        R        |
    |  GURMUKHI                   |       L        |         Y         |        R        |
    |  HANGUL                     |       L        |         Y         |        R        |
    |  HEBREW                     |       L        |         Y         |        R        |
    |  KANNADA                    |       L        |         Y         |        R        |
    |  KAYAH LI                   |       L        |         Y         |        L        |
    |  LATIN                      |       L        |         Y         |        R        |
    |  LIMBU                      |       L        |         Y         |        L        |
    |  MALAYALAM                  |       L        |         D         |        R        |
    |  MANDAIC                    |       H        |         Y         |        L        |
    |  MEETEI MAYEK               |       L        |         Y         |        L        |
    |  MIAO (Pollard)             |       L        |         Y         |        L        |
    |  NEWA                       |       L        |         Y         |        L        |
    |  NKO                        |       L        |         Y         |        L        |
    |  ORIYA (Odia)               |       L        |         Y         |        R        |
    |  OSAGE                      |       L        |         Y         |        L        |
    |  SINHALA                    |       L        |         Y         |        R        |
    |  SUNDANESE                  |       L        |         Y         |        L        |
    |  SYLOTI NAGRI               |       L        |         Y         |        L        |
    |  SYRIAC                     |       L        |         Y         |        L        |
    |  TAMIL                      |       L        |         Y         |        R        |
    |  TELUGU                     |       L        |         Y         |        R        |
    |  THAANA                     |       L        |         D         |        R        |
    |  TIFINAGH (Berber)          |       L        |         Y         |        L        |
    |  WANCHO                     |       L        |         Y         |        L        |
    |  YI                         |       L        |         Y         |        L        |
    •-----------------------------•----------------•-------------------•-----------------•
    

    These scripts involve 93 Unicode blocks, from Basic Latin ( 0000 - 007F ) to Symbols for Legacy Computing ( 1FB00 - 1FBFF )


    You may, also, have a look at these sites for general information :

    https://en.wikipedia.org/wiki/List_of_Unicode_characters

    https://en.wikipedia.org/wiki/Scriptio_continua#Decline

    https://glottolog.org/glottolog/language    especially to locate the area where a language is used

    Continued discussion in the next post

    guy038



  • Hi All,

    Continuation of the previous post :

    Then, with the help of the excellent BabelMap software, updated for Unicode v13.0

    https://www.babelstone.co.uk/Software/BabelMap.html

    I succeeded in creating a list of the 21,143 remaining characters, from the living scripts above, which should truly be considered as word characters, without any ambiguity

    However, when applying the regex \t\w\t against this list ( each character surrounded with tabulations ), I got a total of 17,307 word characters only, probably because Notepad++ does not use the Boost regex library with full Unicode support :

    • The Boost definition of the regex \w does not consider the characters over the BMP

    • Some characters of the BMP, although alphabetic, are not yet considered as word chars

    For instance, in the short list below, each Unicode char, surrounded with two tabulation chars, cannot be found with the regex \t\w\t, although it is, indeed, seen as a word character by the Unicode Consortium :-((

     023D	Ƚ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER L WITH BAR
     0370	Ͱ	  ; Upper_Letter # Lu         GREEK CAPITAL LETTER HETA
     04CF	ӏ	  ; Lower_Letter # Ll         CYRILLIC SMALL LETTER PALOCHKA
     066F	ٯ	  ; Other_Letter # Lo         ARABIC LETTER DOTLESS QAF
     0D60	ൠ	  ; Other_Letter # Lo         MALAYALAM LETTER VOCALIC RR
     200D	‍	  ; Join_Control # Cf         ZERO WIDTH JOINER
     213F	ℿ	  ; Upper_Letter # Lu         DOUBLE-STRUCK CAPITAL PI
     2187	ↇ	  ; Letter_Numb. # Nl         ROMAN NUMERAL FIFTY THOUSAND
     24B6	Ⓐ	  ; Other_Alpha. # So         CIRCLED LATIN CAPITAL LETTER A
     2E2F	ⸯ	  ; Modifier_Let # Lm         VERTICAL TILDE
     A727	ꜧ	  ; Lower_Letter # Ll         LATIN SMALL LETTER HENG
     FF3F	_	  ; Conn._Punct. # Pc         FULLWIDTH LOW LINE
    1D400	𝐀	  ; Upper_Letter # Lu         MATHEMATICAL BOLD CAPITAL A
    1D70B	𝜋	  ; Lower_Letter # Ll         MATHEMATICAL ITALIC SMALL PI
    1F150	🅐	  ; Other_Alpha. # So         NEGATIVE CIRCLED LATIN CAPITAL LETTER A
    
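As a point of comparison, Python’s re engine, which works on code points rather than UTF-16 units, already matches the astral letters of this list with \w, while some other entries ( ZWJ, circled letters ) fail there too :

```python
import re

# BMP and astral letters from the list above: Python's \w matches them all,
# since Python regexes see whole code points, not surrogate pairs
for ch in ("\u023D", "\u04CF", "\U0001D400", "\U0001D70B"):
    assert re.fullmatch(r"\w", ch), hex(ord(ch))

# But ZERO WIDTH JOINER (Cf) and CIRCLED LATIN CAPITAL LETTER A
# (So, Other_Alphabetic) are not \w in Python either
assert re.fullmatch(r"\w", "\u200D") is None
assert re.fullmatch(r"\w", "\u24B6") is None
print("all checks passed")
```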

    To my mind, for all these reasons, as we cannot rely on the Word notion, the View > Summary... feature should just ignore the number of words or, at least, add the indication With caution !


    By contrast, I think that it would be useful to count the number of Non_Space strings, determined with the regex \S+. Indeed, we would get more reliable results ! The boundaries of Non_Space strings, which are the Space characters, belong to the well-defined list of the 25 Unicode characters with the binary property White_Space, from the PropList.txt file. Refer to the very beginning of this file :

    http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt

    As a reminder, the regex \s is identical to \h|\v. So, it represents the complete character class [\t\x20\xA0\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}]|[\n\x0B\f\r\x85\x{2028}\x{2029}], which can be re-ordered as :

    \s = [\t\n\x0B\f\r\x20\x85\xA0\x{1680}\x{2000}-\x{200A}\x{202F}\x{2028}\x{2029}\x{205F}\x{3000}]

    Note that, in practice, the \s regex is mainly equivalent to the simple regex [\t\n\r\x20]

    Here is the list of all Unicode characters with the property White_Space, with their name and their General_Category value :

    0009	TAB	; White_Space         # Cc       TABULATION  <control-0009>
    000A	LF	; White_Space         # Cc       LINE FEED  <control-000A>
    000B		; White_Space         # Cc       VERTICAL TABULATION  <control-000B>
    000C		; White_Space         # Cc       FORM FEED  <control-000C>
    000D	CR	; White_Space         # Cc       CARRIAGE RETURN  <control-000D>
    0020	 	; White_Space         # Zs       SPACE
    0085	…	; White_Space         # Cc       NEXT LINE  <control-0085>
    00A0	 	; White_Space         # Zs       NO-BREAK SPACE
    1680	 	; White_Space         # Zs       OGHAM SPACE MARK
    2000	 	; White_Space         # Zs       EN QUAD
    2001	 	; White_Space         # Zs       EM QUAD
    2002	 	; White_Space         # Zs       EN SPACE
    2003	 	; White_Space         # Zs       EM SPACE
    2004	 	; White_Space         # Zs       THREE-PER-EM SPACE
    2005	 	; White_Space         # Zs       FOUR-PER-EM SPACE
    2006	 	; White_Space         # Zs       SIX-PER-EM SPACE
    2007	 	; White_Space         # Zs       FIGURE SPACE
    2008	 	; White_Space         # Zs       PUNCTUATION SPACE
    2009	 	; White_Space         # Zs       THIN SPACE
    200A	 	; White_Space         # Zs       HAIR SPACE
    2028		; White_Space         # Zl       LINE SEPARATOR
    2029		; White_Space         # Zp       PARAGRAPH SEPARATOR
    202F	 	; White_Space         # Zs       NARROW NO-BREAK SPACE
    205F	 	; White_Space         # Zs       MEDIUM MATHEMATICAL SPACE
    3000	 	; White_Space         # Zs       IDEOGRAPHIC SPACE
    

    Note that I used the notations TAB, LF and CR, standing for the three characters \t, \n and \r, instead of the chars themselves
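This 25-character list is easy to verify in Python, whose \s ( for str patterns ) covers the full Unicode White_Space set :

```python
import re

# The 25 Unicode code points with the binary property White_Space
WHITE_SPACE = [
    0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020, 0x0085, 0x00A0,
    0x1680, 0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006,
    0x2007, 0x2008, 0x2009, 0x200A, 0x2028, 0x2029, 0x202F, 0x205F,
    0x3000,
]
assert len(WHITE_SPACE) == 25

# Every one of them is matched by Python's Unicode-aware \s
assert all(re.fullmatch(r"\s", chr(cp)) for cp in WHITE_SPACE)
print("all 25 White_Space characters match \\s")
```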

    So, in order to get the number of Non_Space strings, we should normally use the simple regex \S+. However, it does not give the right number. Indeed, when several characters with code-points over the BMP are consecutive, they are not seen as one global Non_Space string but as individual ones :-((

    Test my statement with this string, composed of four consecutive emoji chars 👨👩👦👧. The regex \S+ finds four Non_Space strings, whereas I would have expected only one !

    Consequently, I verified that the suitable regex to count all the Non_Space strings of a file, whatever their Unicode code-points, is rather the regex ((?!\s).[\x{D800}-\x{DFFF}]?)+ ( longer, I agree, but exact ! )
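Note that this limitation is specific to regex engines working on UTF-16 code units, like the Boost engine used by Notepad++. In Python, which works on whole code points, the plain \S+ already behaves as expected :

```python
import re

family = "\U0001F468\U0001F469\U0001F466\U0001F467"  # 👨👩👦👧

# Python sees the four consecutive emojis as ONE Non_Space string,
# because its regexes never split an astral char into surrogates
assert re.findall(r"\S+", family) == [family]

# Mixed content: three space-separated Non_Space strings
print(re.findall(r"\S+", "ab \U0001F468\U0001F469 cd"))
```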


    Now, here is a list of regexes which allow you to count different subsets of characters from the current file contents. Of course, tick the Wrap around option !

    - In a current non-ANSI ( UNICODE ) document, as the zone [\x{D800}-\x{DFFF}] represents the reserved SURROGATE area :
    
      - Number of chars, in range [U+0000 - U+007F ], WITHOUT the \r AND \n chars               =  N1  =  (?![\r\n])[\x{0000}-\x{007F}]
    
      - Number of chars, in range [U+0080 - U+07FF ]                                            =  N2  =  [\x{0080}-\x{07FF}]
    
      - Number of chars, in range [U+0800 - U+FFFF ], except in SURROGATE range [D800 - DFFF]   =  N3  =  (?![\x{D800}-\x{DFFF}])[\x{0800}-\x{FFFF}]
                                                                                                         ------------------------------------------------
    
      - Number of chars, in range [U+0000 - U+FFFF ], in BMP , WITHOUT the \r AND \n  =  N1 + N2 + N3  =  (?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]
    
      - Number of chars, in range [U+10000 - U+10FFFF], OVER the BMP                            =  N4  =  (?-s).[\x{D800}-\x{DFFF}]
                                                                                                         ------------------------------
    
      - TOTAL chars, in a NON-ANSI document, WITHOUT the \r AND \n chars         =  N1 + N2 + N3 + N4  =  [^\r\n]
    
      - Number of \r characters + Number of \n characters                                       =  N0  =  \r|\n
                                                                                                         ------------------------------
    
      - TOTAL chars, in a NON-ANSI document, WITH the \r AND \n chars       =  N0 + N1 + N2 + N3 + N4  =  (?s).
    
    
    - In a current ANSI document :
    
      - Number of characters, in range [U+0000 - U+00FF], WITHOUT the \r AND \n chars           =  N1  =  [^\r\n]
    
      - Number of \r characters + Number of \n characters                                       =  N0  =  \r|\n
                                                                                                         ------------------------------
    
      - TOTAL chars, in an ANSI document, WITH the \r AND \n chars                         =  N0 + N1  =  (?s).
    
    
    - TOTAL current DOCUMENT length <Lb> in Notepad++ BUFFER :
    
      - In an ANSI                          document        Lb  =                             N0 + N1  =  (?s).
    
      - In an UTF-8 or UTF-8-BOM            document        Lb  =  N0 + N1 + 2 × N2 + 3 × N3 + 4 × N4
    
      - In an UCS-2 BE BOM or UCS-2 LE BOM  document        Lb  =           ( N0 + N1 + N2 + N3 ) × 2  =  (?s). × 2
    
    
    
    - Byte Order Mark ( BOM = U+FEFF ) length <Bl> and encoding, for SAVED documents :
    
      - In an ANSI or UTF-8                 document        Bl  =  0 byte
    
      - In an UTF-8-BOM                     document        Bl  =  3 bytes  ( EF BB BF )
    
      - In an UCS-2 BE BOM                  document        Bl  =  2 bytes  ( FE FF )
    
      - In an UCS-2 LE BOM                  document        Bl  =  2 bytes  ( FF FE )
    
    
    
    - TOTAL current FILE length <Ld> on DISK for SAVED documents, only :
    
      - WHATEVER the encoding of the        document        Ld  =  Lb + Bl  ( = Total DOCUMENT length + BOM length )
    
    
    - NUMBER of WORDS              =  \w+  ( to be considered with CAUTION )
    
    
    - NUMBER of NON_SPACE strings  =  ((?!\s).[\x{D800}-\x{DFFF}]?)+
    
    
    - Number of LINES :
    
      - Number of true EMPTY lines                                     =  (?<![\f\x{0085}\x{2028}\x{2029}])^(?:\r\n|\r|\n)
    
      - Number of lines containing TAB and/or SPACE characters ONLY    =  (?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]+(?:\r\n|\r|\n|\z)
                                                                         -----------------------------------------------------------
    
      - Number of BLANK or EMPTY lines                                 =  (?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]*(?:\r\n|\r|\n|\z)
    
      - Number of NON BLANK and NON EMPTY lines                        =  (?-s)(?!^[\t\x20]+$)(?:.|[\f\x{0085}\x{2028}\x{2029}])+(?:\r\n|\r|\n|\z)
                                                                         ---------------------------------------------------------------------
    
    - TOTAL number of LINES                                            =  (?-s)\r\n|\r|\n|(?:.|[\f\x{0085}\x{2028}\x{2029}])\z
    

    Remarks : Although most of the regexes, above, can be easily understood, here are some additional elements :

    • The regex (?-s).[\x{D800}-\x{DFFF}] is the sole correct syntax, with our Boost regex engine, to count all the characters over the BMP

    • The regex (?s)((?!\s).[\x{D800}-\x{DFFF}]?)+, to count all the Non_Space strings, was explained before

    • In all the regexes relative to the counting of lines, you probably noticed the character class [\f\x{0085}\x{2028}\x{2029}]. It must be present because the four characters \f, \x{0085}, \x{2028} and \x{2029} are each considered as both a start and an end of line, like the assertions ^ and $ !

      • For instance, if, in a new file, you insert one Next_Line char ( NEL ), of code-point \x{0085}, and hit the Enter key, this sole line is wrongly seen as an empty line by the simple regex ^(?:\r\n|\r|\n), which matches the line-break after the Next_Line char !
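The byte-count formulas above are easy to verify in Python, by computing N0 … N4 for a small sample text. Note that the × 2 formula for UCS-2 only holds when N4 = 0, since UCS-2 cannot represent characters over the BMP ; the utf-16-le check below therefore adds 4 bytes per astral char :

```python
# Sample with 1-, 2-, 3- and 4-byte UTF-8 chars, plus a CRLF line ending
text = "A\u00E9\u20AC\U0001D400\r\n"   # A é € 𝐀 \r \n

n0 = sum(c in "\r\n" for c in text)                 # EOL chars
n1 = sum(ord(c) <= 0x7F and c not in "\r\n" for c in text)
n2 = sum(0x0080 <= ord(c) <= 0x07FF for c in text)
n3 = sum(0x0800 <= ord(c) <= 0xFFFF for c in text)  # BMP, 3 bytes in UTF-8
n4 = sum(ord(c) > 0xFFFF for c in text)             # over the BMP

# UTF-8 buffer length:  N0 + N1 + 2 × N2 + 3 × N3 + 4 × N4
assert len(text.encode("utf-8")) == n0 + n1 + 2*n2 + 3*n3 + 4*n4

# UTF-16: BMP chars take 2 bytes, astral chars a 4-byte surrogate pair
assert len(text.encode("utf-16-le")) == (n0 + n1 + n2 + n3)*2 + 4*n4
print(n0, n1, n2, n3, n4)
```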

    To end, I would like to propose a new layout of the Summary feature, which should be more informative !

    IMPORTANT : In the list below, any text after the 1st colon character of each line is only regexes, comments or descriptive areas !

    Full File Path    :  X:\....\....\
    
    Creation Date     :  MM/DD/YYYY HH:MM:SS                         UTF-8[-BOM]                         UCS-2 BE/LE BOM            ANSI
    Modification Date :  MM/DD/YYYY HH:MM:SS          --------------------------------------------------------------------------------------
    
    
    1-Byte  Chars     :  N1                         =   (?![\r\n])[\x{0000}-\x{007F}]                         idem                [^\r\n]
    2-Bytes Chars     :  N2                         =   [\x{0080}-\x{07FF}]                                   idem                   0
    3-Bytes Chars     :  N3                         =   (?![\x{D800}-\x{DFFF}])[\x{0800}-\x{FFFF}]            idem                   0
    
    Total BMP Chars   :  N1 + N2 + N3               =   (?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]        idem                [^\r\n]
    4-Bytes Chars     :  N4                         =   (?-s).[\x{D800}-\x{DFFF}]                               0                    0
    
    Chars w/o CR|LF   :  N1 + N2 + N3 + N4          =   [^\r\n]
    EOL ( CR or LF )  :  N0                         =   \r|\n
    
    TOTAL Characters  :  N0 + N1 + N2 + N3 + N4     =   (?s).
    
    
    N++ BUFFER Length :                             =   N0 + N1 + 2 × N2 + 3 × N3 + 4 × N4        ( N0 + N1 + N2 + N3 ) × 2        (?s).
    
    Byte Order Mark   :                             =   0 ( UTF-8)  or  3 ( UTF-8-BOM )                        2                     0
    
    
    DOCUMENT Length   :  BUFFER length  +  BOM
    
    FILE     Length   :  Present SIZE on DISK
    
    
    WORDS ( Caution ) :  \w+
    
    
    NON-SPACE strings :  ((?!\s).[\x{D800}-\x{DFFF}]?)+
    
    
    True EMPTY lines  :  (?<![\f\x{0085}\x{2028}\x{2029}])^(?:\r\n|\r|\n)
    
    BLANK lines       :  (?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]+(?:\r\n|\r|\n|\z)
    
    
    EMPTY/BLANK lines :  (?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]*(?:\r\n|\r|\n|\z)
    
    NON-BLANK lines   :  (?-s)(?!^[\t\x20]+$)(?:.|[\f\x{0085}\x{2028}\x{2029}])+(?:\r\n|\r|\n|\z)
    
    
    TOTAL lines       :  (?-s)\r\n|\r|\n|(?:.|[\f\x{0085}\x{2028}\x{2029}])\z
    
    
    Selection(s)      :  X characters (Y bytes) in Z ranges
    

    Best Regards,

    guy038

