Tests and impressions on the "View > Summary..." functionality
-
Hello All,
Recently, I’ve been looking at the results given by the N++ Summary feature ( View > Summary... ), and I must say that numerous things are really weird!

For my tests, I used contents with a lot of Unicode characters, both in the Basic Multilingual Plane and, sometimes, over the BMP too, saved in the 4 N++ Unicode encodings as well as in an ANSI file containing the 256 characters of the Windows-1252 encoding:

- ANSI
- UTF-8
- UTF-8-BOM
- UCS-2 BE BOM
- UCS-2 LE BOM
To my mind, there are 3 major problems and some minor points:

- The first and worst problem is the fact that, when a UTF-8[-BOM] file, containing various Unicode chars ( of the BMP only: this point is important! ), is copied into a UCS-2 BE BOM or UCS-2 LE BOM encoded file, some results given by the Summary feature for these new files are totally wrong:
  - The Characters ( without line endings ) value seems to be the number of bytes used in the corresponding UTF-8[-BOM] file
  - The Document length value seems to be the document length of the corresponding UTF-8[-BOM] file and is also displayed, unfortunately, in the status bar!
- The second problem is that the definition of a word char by the Summary feature is definitely NOT the same as the definition of the regex \w, as explained further on!
- Thus, the third problem is that the given number of words is totally inaccurate! And, anyway, the number of words, although well enough defined for an English / American text, is a rather vague notion for a lot of texts written in other languages, especially Asian ones! ( See further on )
Some minor things:

- The number of lines given is, most of the time, increased by one unit ( see the sketch just after this list )
- Presently, the Summary feature displays the document length in the Notepad++ buffer. I think it would be good to display, as well, the actual document length saved on disk. Incidentally, for just-saved documents, it would give, by difference, the length of the possible Byte Order Mark, if its size weren't explicitly displayed!
- For UTF-8 or UTF-8-BOM encoded files, a decomposition giving the number of chars coded with 1, 2, 3 and 4 bytes ( 4 bytes being for chars over the BMP ) would be welcome!
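To illustrate the first minor point, here is a small Python sketch ( my own illustration, not N++ code ) showing where the extra line unit comes from:

```python
# A minimal sketch of the off-by-one Lines: count : a naive split on EOLs
# produces a phantom empty "line" when the text ends with a line-break.
with_eol    = "one\r\ntwo\r\n"
without_eol = "one\r\ntwo"

print(len(with_eol.split("\n")))     # -> 3 "lines", although only 2 real ones
print(len(without_eol.split("\n")))  # -> 2 lines, as expected
```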
So, in brief, in the present Summary window:

- The Characters (without line endings): number is wrong for the UCS-2 BE BOM or UCS-2 LE BOM encodings
- The Words: number is totally wrong, given the regex definition of a word character, whatever the encoding used
- The Lines: number is wrong, by one unit, if a line-break ends the last line of the current file, in any encoding
- The Document length value, in the N++ buffer, is wrong for the UCS-2 BE BOM or UCS-2 LE BOM encodings, as is the Length: indication in the status bar

Note that I'm about to create an issue for the wrong results returned for UCS-2 BE BOM and UCS-2 LE BOM encoded files!
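To make the first problem concrete, here are the three lengths at play for one and the same text ( a Python sketch of my own; the sample string is arbitrary ):

```python
# The three lengths at play for the same BMP-only text. When such a text is
# stored as UCS-2, the buggy Summary reports the UTF-8 byte count instead.
text = "Being naïve costs €"

print(len(text))                      # -> 19 code-points ( expected Characters value )
print(len(text.encode("utf-8")))      # -> 22 bytes in UTF-8 ( what the bug displays )
print(len(text.encode("utf-16-le")))  # -> 38 bytes in UCS-2 ( the real byte length )
```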
To begin with, let me detail the… second bug! After numerous tests, I determined that, in the present View > Summary... feature, the characters considered word characters are:

- The C0 control characters, except for the tabulation ( \x{0009} ) and the two EOL chars ( \x{000a} and \x{000d} ), so the regex (?![\t\r\n])[\x00-\x1F]
- The number sign #
- The 10 digits, so the regex [0-9]
- The 26 uppercase and lowercase letters, so the regex (?i)[A-Z]
- The low line character _
- All the characters of the Basic Multilingual Plane ( BMP ) with code-point over \x{007E}, so the regex (?![\x{D800}-\x{DFFF}])[\x{007F}-\x{FFFF}] for a Unicode encoded file or [\x7F-\xFF] for an ANSI encoded file
- All the characters over the Basic Multilingual Plane, so the regex (?-s).[\x{D800}-\x{DFFF}], for a Unicode encoded file only
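Put together, those rules amount to a small predicate. Here is a sketch of it in Python ( my own reconstruction of the observed behaviour, NOT the actual N++ source code ):

```python
# The Summary word-char rules above, expressed as a predicate
# ( a reconstruction of the observed behaviour, not Notepad++ code ).
def is_summary_word_char(c: str) -> bool:
    cp = ord(c)
    if c in "\t\r\n":                 # tabulation and the two EOL chars
        return False
    if cp <= 0x1F:                    # the remaining C0 control characters
        return True
    if c in "#_":                     # number sign and low line
        return True
    if c.isascii() and c.isalnum():   # [0-9] and (?i)[A-Z]
        return True
    return cp >= 0x7F                 # everything from \x{007F} upward

print(is_summary_word_char("é"))   # -> True
print(is_summary_word_char("-"))   # -> False
```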
To simulate the present Words: number ( which is erroneous! ), given by the Summary feature, whatever the file encoding, simply use the regex below:

[^\t\n\r\x20!"$%&'()*+,\-./:;<=>?@\x5B\x5C\x5D^\x60{|}~]+

and click on the Count button of the Find dialog, with the Wrap around option ticked.

Obviously, this is not exact, as a single word character is matched by the \w regex, which is the class [\u\l\d_], where \u, \l and \d represent any Unicode uppercase, lowercase and digit char, or a related char, so, finally, much more than the simple [A-Za-z0-9] set!

But, worse, it is the very notion of word which is, in practice, not consistent most of the time! Indeed, for instance, if we consider the French expression l'école ( the school ), the regex \w+ would return 2 words, which is correct, as this expression can be mentally decomposed as la école. However, this regex would wrongly say that the single word aujourd'hui ( today ) is a two-word expression. Of course, you could change the regex to [\w']+, which would return 1 word, but, this time, the expression l'école would wrongly be considered a one-word string!

In addition, what can be said about languages that do not use the Space character, or where the use of the Space is discretionary? Then, counting words is impossible, or rather non-significant! This is developed in Martin Haspelmath's article, below:

https://zenodo.org/record/225844/files/WordSegmentationFL.pdf
At the end of section 5, it is said: … On such a view, the claim that "all languages have words" (Radford et al. 1999: 145) would be interpretable only in the weaker sense that "all languages have a unit which falls between the minimal sign and the phrase" …

And: … The basic problem remains the same: The units are defined in a language-specific way and cannot be equated across languages, and there is no reason to give special status to a unit called 'word'. …

At the beginning of section 7: … Linguists have no good basis for identifying words across languages …

And in the conclusion, section 10: … I conclude, from the arguments presented in this article, that there is no definition of 'word' that can be applied to any language and that would yield consistent results …
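Back to the regexes: the gap between the Summary count, the \w+ count and linguistic reality is easy to see with a small Python sketch ( my own test, using Python's re module rather than the Boost engine, so the \w tables differ slightly; the sample strings are arbitrary ):

```python
import re

# Summary-style "word" count ( the simulation regex given above )
# compared with a plain \w+ count.
summary_word = re.compile(
    r"""[^\t\n\r\x20!"$%&'()*+,\-./:;<=>?@\x5B\x5C\x5D^\x60{|}~]+"""
)

sample = "« Génial » … voilà !"
print(len(summary_word.findall(sample)))       # -> 5 : «, Génial, », … and voilà
print(len(re.findall(r"\w+", sample)))         # -> 2 : Génial and voilà only

# And the word-notion problem discussed above :
print(len(re.findall(r"\w+", "aujourd'hui")))  # -> 2, although it is ONE French word
print(len(re.findall(r"[\w']+", "l'école")))   # -> 1, although these are TWO words
```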
Now, the Unicode definition of a word character is:
\p{gc=Alphabetic} | \p{gc=Mark} | \p{gc=Decimal_Number} | \p{gc=Connector_Punctuation} | \p{Join-Control}
https://www.unicode.org/reports/tr18/#Simple_Word_Boundaries
So, in theory, the word_character class should include:

- All values of the derived property Alphabetic ( = alpha = \p{alphabetic} ), so 132,875 chars, from the DerivedCoreProperties.txt file, which can be decomposed into:
  - Uppercase_Letter ( Lu ) + Lowercase_Letter ( Ll ) + Titlecase_Letter ( Lt ) + Modifier_Letter ( Lm ) + Other_Letter ( Lo ) + Letter_Number ( Nl ) + Other_Alphabetic, so the character sum 1,791 + 2,155 + 31 + 260 + 127,004 + 236 + 1,398
  - Note: the last property, Other_Alphabetic, from the PropList.txt file, contains some, but not all, characters from the 3 General_Categories Spacing_Mark ( Mc ), Nonspacing_Mark ( Mn ) and Other_Symbol ( So ), so the character sum 417 + 851 + 130
- All values with General_Category = Decimal_Number, from the DerivedGeneralCategory.txt file, so 650 characters ( these are the characters with defined values in the three fields 6, 7 and 8 of the UnicodeData.txt file )
- All values with General_Category = Connector_Punctuation, from the DerivedGeneralCategory.txt file, so 10 characters
- All values with the binary property Join_Control, from the PropList.txt file, so 2 characters

So, if we include all Unicode languages, even historical ones:

=> Total number of Unicode word characters = 132,875 + 650 + 10 + 2 = 133,537 characters, with version Unicode 13.0.0 !!
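If you want to cross-check this total yourself, here is a sketch using the third-party Python regex module ( my own illustration; it unions the same four properties as the sum above, and the exact figure depends on the Unicode version shipped with your interpreter ):

```python
# Count the code-points matching the word_character union above
# ( sketch ; requires the third-party module : pip install regex ).
# \u200C and \u200D ( ZWNJ and ZWJ ) are the two Join_Control characters.
import regex

word_char = regex.compile(r"[\p{Alphabetic}\p{Nd}\p{Pc}\u200C\u200D]")

total = sum(
    1
    for cp in range(0x110000)
    if not (0xD800 <= cp <= 0xDFFF)       # skip the reserved surrogate range
    and word_char.fullmatch(chr(cp))
)
print(total)   # close to 133,537 with Unicode 13.0 tables
```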
Notes:

- The different files mentioned can be downloaded from the Unicode Character Database ( UCD ) or its sub-directories, below:

http://www.unicode.org/Public/UCD/latest/ucd/

- And refer to the sites, below, for additional information on this topic:

https://www.unicode.org/reports/tr18/#Compatibility_Properties
https://www.unicode.org/reports/tr29/#Word_Boundaries
https://www.unicode.org/reports/tr31/ ( for tables 4, 5 and 6 of section 2.4 )
https://www.unicode.org/reports/tr44/#UnicodeData.txt
If you did click on the links to the Unicode Consortium, above, you understood, very quickly, that the notions of word characters and word boundaries are a real nightmare!
Even if we restrict the definition of word chars to Unicode living scripts, forgetting all the historical scripts no longer in use, and also leaving aside all scripts which do not systematically use the space char to delimit words, we still have a list of about 21,000 characters which should be considered word characters! I tried to build up such a list, with the help of these sites:

https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries
https://linguistlist.org/issues/6/6-1302/
https://unicode-org.github.io/cldr-staging/charts/37/supplemental/scripts_and_languages.html
https://scriptsource.org/cms/scripts/page.php?item_id=script_overview
https://r12a.github.io/scripts/featurelist/
And I ended up with this list of 46 living scripts which always use a Space character between words:

•------------------------•----------------•-------------------•-----------------•
|                        |     SCRIPT     |   SPACE between   |  UNICODE Script |
|                        |     Type :     |      Words :      |     Class :     |
|                        •----------------•-------------------•-----------------•
|         SCRIPT         |    (L)iving    |       (Y)es       |  (R)ecommended  |
|                        |                |   (U)nspecified   |    (L)imited    |
|                        |  (H)istorical  |  (D)iscretionary  |    (E)xcluded   |
|                        |                |       (N)o        |                 |
•------------------------•----------------•-------------------•-----------------•
| ARMENIAN               |       L        |         Y         |        R        |
| ADLAM                  |       L        |         Y         |        L        |
| ARABIC                 |       L        |         Y         |        R        |
| BAMUM                  |       L        |         Y         |        L        |
| BASSA VAH              |       L        |         Y         |        E        |
| BENGALI ( Assamese )   |       L        |         Y         |        R        |
| BOPOMOFO               |       L        |         Y         |        R        |
| BUGINESE               |       L        |         D         |        E        |
| CANADIAN SYLLABICS     |       L        |         Y         |        L        |
| CHEROKEE               |       L        |         Y         |        L        |
| CYRILLIC               |       L        |         Y         |        R        |
| DEVANAGARI             |       L        |         Y         |        R        |
| ETHIOPIC ( Ge'ez )     |       L        |         Y         |        R        |
| GEORGIAN               |       L        |         Y         |        R        |
| GREEK                  |       L        |         Y         |        R        |
| GUJARATI               |       L        |         Y         |        R        |
| GURMUKHI               |       L        |         Y         |        R        |
| HANGUL                 |       L        |         Y         |        R        |
| HANIFI ROHINGYA        |       L        |         Y         |        L        |
| HEBREW                 |       L        |         Y         |        R        |
| KANNADA                |       L        |         Y         |        R        |
| KAYAH LI               |       L        |         Y         |        L        |
| LATIN                  |       L        |         Y         |        R        |
| LIMBU                  |       L        |         Y         |        L        |
| MALAYALAM              |       L        |         D         |        R        |
| MANDAIC                |       H        |         Y         |        L        |
| MEETEI MAYEK           |       L        |         Y         |        L        |
| MIAO ( Pollard )       |       L        |         Y         |        L        |
| MONGOLIAN              |       L        |         Y         |        E        |
| NEWA                   |       L        |         Y         |        L        |
| NKO                    |       L        |         Y         |        L        |
| OL CHIKI               |       L        |         Y         |        L        |
| ORIYA ( Odia )         |       L        |         Y         |        R        |
| OSAGE                  |       L        |         Y         |        L        |
| SINHALA                |       L        |         Y         |        R        |
| SUNDANESE              |       L        |         Y         |        L        |
| SYLOTI NAGRI           |       L        |         Y         |        L        |
| SYRIAC                 |       L        |         Y         |        L        |
| TAI VIET               |       L        |         Y         |        L        |
| TAMIL                  |       L        |         Y         |        R        |
| TELUGU                 |       L        |         Y         |        R        |
| THAANA                 |       L        |         D         |        R        |
| TIFINAGH ( Berber )    |       L        |         Y         |        L        |
| VAI                    |       L        |         Y         |        L        |
| WANCHO                 |       L        |         Y         |        L        |
| YI                     |       L        |         Y         |        L        |
•------------------------•----------------•-------------------•-----------------•
These scripts involve 101 Unicode blocks, from Basic Latin ( 0000 - 007F ) up to Symbols for Legacy Computing ( 1FB00 - 1FBFF )
You may also have a look at these sites for general information:

https://en.wikipedia.org/wiki/List_of_Unicode_characters
https://en.wikipedia.org/wiki/Scriptio_continua#Decline
https://glottolog.org/glottolog/language ( especially to locate the area where a language is used )
Continued discussion in the next post
guy038
-
Hi All,
Continuation of the previous post:
Then, with the help of the excellent BabelMap software, updated for Unicode v13.0:

https://www.babelstone.co.uk/Software/BabelMap.html

I succeeded in creating a list of the 21,143 remaining characters, from the living scripts above, which should truly be considered word characters, without any ambiguity.

However, when applying the regex \t\w\t against this list, I got a total of only 17,307 word characters, probably because Notepad++ does not use the Boost regex library with FULL Unicode support:

- The Boost definition of the regex \w does not consider all the characters over the BMP
- Some characters of the BMP, although alphabetic, are not yet considered word chars
For instance, in the short list below, each Unicode char, surrounded with two tabulation chars, cannot be found with the regex \t\w\t, although it is, indeed, seen as a word character by the Unicode Consortium :-((

023D   Ƚ  ;  Upper_Letter  #  Lu  LATIN CAPITAL LETTER L WITH BAR
0370   Ͱ  ;  Upper_Letter  #  Lu  GREEK CAPITAL LETTER HETA
04CF   ӏ  ;  Lower_Letter  #  Ll  CYRILLIC SMALL LETTER PALOCHKA
066F   ٯ  ;  Other_Letter  #  Lo  ARABIC LETTER DOTLESS QAF
0D60   ൠ  ;  Other_Letter  #  Lo  MALAYALAM LETTER VOCALIC RR
200D      ;  Join_Control  #  Cf  ZERO WIDTH JOINER
213F   ℿ  ;  Upper_Letter  #  Lu  DOUBLE-STRUCK CAPITAL PI
2187   ↇ  ;  Letter_Numb.  #  Nl  ROMAN NUMERAL FIFTY THOUSAND
24B6   Ⓐ  ;  Other_Alpha.  #  So  CIRCLED LATIN CAPITAL LETTER A
2E2F   ⸯ  ;  Modifier_Let  #  Lm  VERTICAL TILDE
A727   ꜧ  ;  Lower_Letter  #  Ll  LATIN SMALL LETTER HENG
FF3F   ＿ ;  Conn._Punct.  #  Pc  FULLWIDTH LOW LINE
1D400  𝐀  ;  Upper_Letter  #  Lu  MATHEMATICAL BOLD CAPITAL A
1D70B  𝜋  ;  Lower_Letter  #  Ll  MATHEMATICAL ITALIC SMALL PI
1F150  🅐  ;  Other_Alpha.  #  So  NEGATIVE CIRCLED LATIN CAPITAL LETTER A
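For anyone who wants to reproduce the test, here is how such a tab-delimited test file can be generated ( a sketch; the file name "word_chars.txt" and the code-point subset are my own placeholders ):

```python
# Build a test file where each candidate char is surrounded by tabulations,
# so that the regex \t\w\t can be counted in the N++ Find dialog
# ( sketch ; the file name and the code-point subset are placeholders ).
code_points = [0x023D, 0x0370, 0x04CF, 0x066F, 0x0D60, 0x1D400, 0x1F150]

with open("word_chars.txt", "w", encoding="utf-8", newline="") as f:
    for cp in code_points:
        f.write(f"\t{chr(cp)}\t\r\n")
```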
To my mind, for all these reasons, as we cannot rely on the notion of word, the View > Summary... feature should just omit the number of words or, at least, add the indication With caution!
By contrast, I think that it would be useful to count the number of Non_Space strings, determined with the regex \S+. Indeed, we would get more reliable results! The boundaries of Non_Space strings, which are the Space characters, belong to the well-defined list of the 25 Unicode characters with the binary property White_Space, from the PropList.txt file. Refer to the very beginning of this file:

http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
As a reminder, the regex \s is identical to \h|\v. So, it represents the complete character class [\t\x20\xA0\x{1680}\x{2000}-\x{200B}\x{202F}\x{3000}]|[\n\x0B\f\r\x85\x{2028}\x{2029}], which can be re-ordered as:

\s = [\t\n\x0B\f\r\x20\x85\xA0\x{1680}\x{2000}-\x{200B}\x{2028}\x{2029}\x{202F}\x{3000}]

Note that, in practice, the \s regex is mostly equivalent to the simple regex [\t\n\r\x20]
Here is the list of all the Unicode characters with the property White_Space, with their name and their General_Category value:

0009  TAB  ;  White_Space  #  Cc  TABULATION <control-0009>
000A  LF   ;  White_Space  #  Cc  LINE FEED <control-000A>
000B       ;  White_Space  #  Cc  VERTICAL TABULATION <control-000B>
000C       ;  White_Space  #  Cc  FORM FEED <control-000C>
000D  CR   ;  White_Space  #  Cc  CARRIAGE RETURN <control-000D>
0020       ;  White_Space  #  Zs  SPACE
0085       ;  White_Space  #  Cc  NEXT LINE <control-0085>
00A0       ;  White_Space  #  Zs  NO-BREAK SPACE
1680       ;  White_Space  #  Zs  OGHAM SPACE MARK
2000       ;  White_Space  #  Zs  EN QUAD
2001       ;  White_Space  #  Zs  EM QUAD
2002       ;  White_Space  #  Zs  EN SPACE
2003       ;  White_Space  #  Zs  EM SPACE
2004       ;  White_Space  #  Zs  THREE-PER-EM SPACE
2005       ;  White_Space  #  Zs  FOUR-PER-EM SPACE
2006       ;  White_Space  #  Zs  SIX-PER-EM SPACE
2007       ;  White_Space  #  Zs  FIGURE SPACE
2008       ;  White_Space  #  Zs  PUNCTUATION SPACE
2009       ;  White_Space  #  Zs  THIN SPACE
200A       ;  White_Space  #  Zs  HAIR SPACE
2028       ;  White_Space  #  Zl  LINE SEPARATOR
2029       ;  White_Space  #  Zp  PARAGRAPH SEPARATOR
202F       ;  White_Space  #  Zs  NARROW NO-BREAK SPACE
205F       ;  White_Space  #  Zs  MEDIUM MATHEMATICAL SPACE
3000       ;  White_Space  #  Zs  IDEOGRAPHIC SPACE
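As a quick cross-check, Python's \s also follows the Unicode White_Space property, so the 25 characters above can be verified with a few lines ( my own sketch ):

```python
import re

# The 25 White_Space code-points listed above ( 0x2000..0x200A inclusive ).
white_space = [0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020, 0x0085,
               0x00A0, 0x1680, *range(0x2000, 0x200B),
               0x2028, 0x2029, 0x202F, 0x205F, 0x3000]

print(len(white_space))                                         # -> 25
print(all(re.fullmatch(r"\s", chr(cp)) for cp in white_space))  # -> True
```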
Note that I used the notations TAB, LF and CR, standing for the three characters \t, \n and \r, instead of the chars themselves.

So, in order to get the number of Non_Space strings, we should normally use the simple regex \S+. However, it does not give the right number. Indeed, when several characters with code-points over the BMP are consecutive, they are not seen as one global Non_Space string but as individual characters :-((

Test my statement with this string, composed of four consecutive emoji chars: 👨👩👦👧. The regex \S+ returns four Non_Space strings, whereas I would have expected only one string!

Consequently, I verified that the suitable regex to count all the Non_Space strings of a file, whatever their Unicode code-points, is rather the regex ((?!\s).[\x{D800}-\x{DFFF}]?)+ ( longer, I agree, but exact! )
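The behaviour matches UTF-16 semantics: each char over the BMP is represented as a surrogate pair, i.e. two 16-bit code units, which is exactly what the optional [\x{D800}-\x{DFFF}] in the regex absorbs. A tiny Python sketch ( my own illustration ) makes the arithmetic visible:

```python
# Each char over the BMP occupies TWO 16-bit code units in UTF-16 :
# a high surrogate ( D800-DBFF ) followed by a low surrogate ( DC00-DFFF ).
s = "👨👩👦👧"                           # four emoji, all over the BMP

print(len(s))                            # -> 4 code-points
print(len(s.encode("utf-16-le")) // 2)   # -> 8 UTF-16 code units
```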
Now, here is a list of regexes which allow you to count different subsets of characters from the current file contents. Of course, tick the Wrap around option!

- Number of chars in a current non-ANSI ( UNICODE ) file, as the zone [\x{D800}-\x{DFFF}] represents the reserved SURROGATE area :

    Number of chars in range [U+0000 - U+007F], WITHOUT the \r AND \n chars          =  N1  =  (?![\r\n])[\x{0000}-\x{007F}]
    Number of chars in range [U+0080 - U+07FF]                                       =  N2  =  [\x{0080}-\x{07FF}]
    Number of chars in range [U+0800 - U+FFFF], except the SURROGATE range           =  N3  =  (?![\x{D800}-\x{DFFF}])[\x{0800}-\x{FFFF}]
    Number of chars in range [U+0000 - U+FFFF] ( BMP ), WITHOUT the \r AND \n chars  =  N1 + N2 + N3  =  (?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]  or  [^\r\n\x{D800}-\x{DFFF}]
    Number of chars in range [U+10000 - U+10FFFF], OVER the BMP                      =  N4  =  (?-s).[\x{D800}-\x{DFFF}]
    TOTAL chars, in a UNICODE file, WITHOUT the \r AND \n chars                      =  N1 + N2 + N3 + N4  =  [^\r\n]
    Number of \r characters + Number of \n characters                                =  N0  =  \r|\n
    TOTAL chars, in a UNICODE file, WITH the \r AND \n chars                         =  N0 + N1 + N2 + N3 + N4  =  (?s).

- Number of chars in a current ANSI file :

    Number of chars in range [U+0000 - U+00FF], WITHOUT the \r AND \n chars          =  N1  =  [^\r\n]
    Number of \r characters + Number of \n characters                                =  N0  =  \r|\n
    TOTAL chars, in an ANSI file, WITH the \r AND \n chars                           =  N0 + N1  =  (?s).

- TOTAL current FILE length <Fl> in the Notepad++ BUFFER :

    For an ANSI file                            Fl  =  N0 + N1  =  (?s).
    For an UTF-8 or UTF-8-BOM file              Fl  =  N0 + N1 + 2 × N2 + 3 × N3 + 4 × N4
    For an UCS-2 BE BOM or UCS-2 LE BOM file    Fl  =  ( N0 + N1 + N2 + N3 ) × 2  =  (?s). × 2

- Byte Order Mark ( BOM = U+FEFF ) length <Bl> and encoding, for SAVED files :

    For an ANSI or UTF-8 file                   Bl  =  0 byte
    For an UTF-8-BOM file                       Bl  =  3 bytes ( EF BB BF )
    For an UCS-2 BE BOM file                    Bl  =  2 bytes ( FE FF )
    For an UCS-2 LE BOM file                    Bl  =  2 bytes ( FF FE )

- TOTAL CURRENT file length on DISK, WHATEVER its encoding :  Ld  =  Fl + Bl  ( = Total FILE length + BOM length )

- NUMBER of WORDS  =  \w+ , whatever the file TYPE ( this result must be considered with CAUTION )

- NUMBER of NON_SPACE strings  =  ((?!\s).[\x{D800}-\x{DFFF}]?)+  for a UNICODE file  or  ((?!\s).)+  for an ANSI file

- Number of LINES in a UNICODE file :

    Number of true EMPTY lines                                      =  (?<![\f\x{0085}\x{2028}\x{2029}])^(?:\r\n|\r|\n)
    Number of lines containing TAB and/or SPACE characters ONLY     =  (?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]+(?:\r\n|\r|\n|\z)
    TOTAL number of BLANK or EMPTY lines                            =  (?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]*(?:\r\n|\r|\n|\z)
    Number of NON BLANK and NON EMPTY lines                         =  (?-s)(?!^[\t\x20]+$)^(?:.|[\f\x{0085}\x{2028}\x{2029}])+(?:\r\n|\r|\n|\z)
    TOTAL number of LINES in a UNICODE file                         =  (?-s)\r\n|\r|\n|(?:.|[\f\x{0085}\x{2028}\x{2029}])\z

- Number of LINES in an ANSI file :

    Number of true EMPTY lines                                      =  (?<!\f)^(?:\r\n|\r|\n)
    Number of lines containing TAB and/or SPACE characters ONLY     =  (?<!\f)^[\t\x20]+(?:\r\n|\r|\n|\z)
    TOTAL number of EMPTY or BLANK lines                            =  (?<!\f)^[\t\x20]*(?:\r\n|\r|\n|\z)
    Number of NON BLANK and NON EMPTY lines                         =  (?-s)(?!^[\t\x20]+$)^(?:.|\f)+(?:\r\n|\r|\n|\z)
    TOTAL number of LINES in an ANSI file                           =  (?-s)\r\n|\r|\n|(?:.|\f)\z
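To check the BYTE-length formula above outside Notepad++, here is a short Python cross-check of my own ( the sample text is arbitrary; here N1 already includes the \r and \n chars, i.e. N0 ):

```python
# Cross-check of : Fl = N0 + N1 + 2 × N2 + 3 × N3 + 4 × N4 for UTF-8.
def utf8_length(text: str) -> int:
    n1 = sum(1 for c in text if ord(c) <= 0x7F)             # 1-byte chars ( incl. EOLs )
    n2 = sum(1 for c in text if 0x80 <= ord(c) <= 0x7FF)    # 2-byte chars
    n3 = sum(1 for c in text if 0x800 <= ord(c) <= 0xFFFF)  # 3-byte chars
    n4 = sum(1 for c in text if ord(c) > 0xFFFF)            # 4-byte chars, over the BMP
    return n1 + 2 * n2 + 3 * n3 + 4 * n4

text = "été 中文 👨\r\n"
assert utf8_length(text) == len(text.encode("utf-8"))   # -> 19 bytes either way
print(utf8_length(text))
```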
Remarks: although most of the regexes above can be easily understood, here are some additional elements:

- The regex (?-s).[\x{D800}-\x{DFFF}] is the sole correct syntax, with our Boost regex engine, to count all the characters over the BMP
- The regex (?s)((?!\s).[\x{D800}-\x{DFFF}]?)+, to count all the Non_Space strings, was explained before
- In all the regexes relative to the counting of lines, you probably noticed the character class [\f\x{0085}\x{2028}\x{2029}]. It must be present because the four characters \f, \x{0085}, \x{2028} and \x{2029} are each considered both a start and an end of line, like the assertions ^ and $!
  - For instance, if, in a new file, you insert one Next_Line char ( NEL ), of code-point \x{0085}, and hit the Enter key, this sole line is wrongly seen as an empty line by the simple regex ^(?:\r\n|\r|\n), which matches the line-break after the Next_Line char!
To end, I would like to propose a new layout for the Summary feature, which should be more informative!

IMPORTANT: in the list below, any text before the 1st colon character of each line is the name which should be displayed in the Summary dialog!

Full File Path    : X:\....\....\
Creation Date     : MM/DD/YYYY HH:MM:SS
Modification Date : MM/DD/YYYY HH:MM:SS

                                         UTF-8[-BOM]                                      UCS-2 BE/LE BOM              ANSI
---------------------------------------------------------------------------------------------------------------------------------------------------------
1-Byte Chars      : N1 = (?![\r\n])[\x{0000}-\x{007F}]                                    idem                         [^\r\n]
2-Bytes Chars     : N2 = [\x{0080}-\x{07FF}]                                              idem                         0
3-Bytes Chars     : N3 = (?![\x{D800}-\x{DFFF}])[\x{0800}-\x{FFFF}]                       idem                         0
Total BMP Chars   : N1 + N2 + N3 = (?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]         idem                         [^\r\n]
4-Bytes Chars     : N4 = (?-s).[\x{D800}-\x{DFFF}]                                        0                            0
NON BLANK chars   : [^\r\n\t\x20]                                                         idem                         idem
Chars w/o CR|LF   : N1 + N2 + N3 + N4 = [^\r\n]                                           idem                         idem
EOL ( CR or LF )  : N0 = \r|\n                                                            idem                         idem
TOTAL Characters  : N0 + N1 + N2 + N3 + N4 = (?s).                                        idem                         idem
BYTE Length       : N0 + N1 + 2 × N2 + 3 × N3 + 4 × N4                                    ( N0 + N1 + N2 + N3 ) × 2    (?s).
Byte Order Mark   : 0 ( UTF-8 ) or 3 ( UTF-8-BOM )                                        2                            0
BUFFER Length     : BYTE Length + BOM
FILE Length       : SAVED length of CURRENT file on DISK
WORDS ( Caution ) : \w+                                                                   idem                         idem
NON-SPACE strings : ((?!\s).[\x{D800}-\x{DFFF}]?)+                                        ((?!\s).)+                   ((?!\s).)+
True EMPTY lines  : (?<![\f\x{0085}\x{2028}\x{2029}])^(?:\r\n|\r|\n)                      idem                         (?<!\f)^(?:\r\n|\r|\n)
BLANK lines       : (?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]+(?:\r\n|\r|\n|\z)          idem                         (?<!\f)^[\t\x20]+(?:\r\n|\r|\n|\z)
EMPTY/BLANK lines : (?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]*(?:\r\n|\r|\n|\z)          idem                         (?<!\f)^[\t\x20]*(?:\r\n|\r|\n|\z)
NON-BLANK lines   : (?-s)(?!^[\t\x20]+$)^(?:.|[\f\x{0085}\x{2028}\x{2029}])+(?:\r\n|\r|\n|\z)
                                                                                          idem                         (?-s)(?!^[\t\x20]+$)^(?:.|\f)+(?:\r\n|\r|\n|\z)
TOTAL lines       : (?-s)\r\n|\r|\n|(?:.|[\f\x{0085}\x{2028}\x{2029}])\z                  idem                         (?-s)\r\n|\r|\n|(?:.|\f)\z

Selection(s)      : X characters ( Y bytes ) in Z ranges
Best Regards,
guy038
-