Emulation of the "View > Summary" feature with a Python script
Hi All,

Continuation of the previous script :

Then, with the help of the excellent BabelMap software, updated for Unicode v13.0

https://www.babelstone.co.uk/Software/BabelMap.html

I succeeded in creating a list of the 21,143 remaining characters, from the living scripts above, which should truly be considered as word characters, without any ambiguity.

However, when applying the regex \t\w\t against this list, I got a total of 17,307 word characters only, probably because Notepad++ does not use the Boost regex library with FULL Unicode support :

- The Boost definition of the regex \w does not consider all the characters over the BMP
- Some characters of the BMP, although alphabetic, are not yet considered as word chars
For instance, in the short list below, each Unicode char, surrounded by two tabulation chars, cannot be found with the regex \t\w\t, although it is indeed seen as a word character by the Unicode Consortium :-((

    023D   Ƚ  ; Upper_Letter # Lu  LATIN CAPITAL LETTER L WITH BAR
    0370   Ͱ  ; Upper_Letter # Lu  GREEK CAPITAL LETTER HETA
    04CF   ӏ  ; Lower_Letter # Ll  CYRILLIC SMALL LETTER PALOCHKA
    066F   ٯ  ; Other_Letter # Lo  ARABIC LETTER DOTLESS QAF
    0D60   ൠ  ; Other_Letter # Lo  MALAYALAM LETTER VOCALIC RR
    200D      ; Join_Control # Cf  ZERO WIDTH JOINER
    213F   ℿ  ; Upper_Letter # Lu  DOUBLE-STRUCK CAPITAL PI
    2187   ↇ  ; Letter_Numb. # Nl  ROMAN NUMERAL FIFTY THOUSAND
    24B6   Ⓐ  ; Other_Alpha. # So  CIRCLED LATIN CAPITAL LETTER A
    2E2F   ⸯ  ; Modifier_Let # Lm  VERTICAL TILDE
    A727   ꜧ  ; Lower_Letter # Ll  LATIN SMALL LETTER HENG
    FF3F   _  ; Conn._Punct. # Pc  FULLWIDTH LOW LINE
    1D400  𝐀  ; Upper_Letter # Lu  MATHEMATICAL BOLD CAPITAL A
    1D70B  𝜋  ; Lower_Letter # Ll  MATHEMATICAL ITALIC SMALL PI
    1F150  🅐  ; Other_Alpha. # So  NEGATIVE CIRCLED LATIN CAPITAL LETTER A

To my mind, for all these reasons, as we cannot rely on the word notion, the View > Summary... feature should just ignore the number of words or, at least, add the indication "With caution" !
By contrast, I think that it would be useful to count the number of Non_Space strings, determined with the regex \S+. Indeed, we would get more reliable results ! The boundaries of Non_Space strings, which are the Space characters, belong to the well-defined list of the 25 Unicode characters with the binary property White_Space, from the PropList.txt file. Refer to the very beginning of this file :

http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt

As a reminder, the regex \s is identical to \h|\v. So, it represents the complete character class [\t\x20\xA0\x{1680}\x{2000}-\x{200B}\x{202F}\x{3000}]|[\n\x0B\f\r\x85\x{2028}\x{2029}], which can be re-ordered as :

    \s = [\t\n\x0B\f\r\x20\x85\xA0\x{1680}\x{2000}-\x{200B}\x{2028}\x{2029}\x{202F}\x{3000}]

Note that, in practice, the \s regex is mainly equivalent to the simple regex [\t\n\r\x20]

Here is the list of all Unicode characters with the property White_Space, with their name and their General_Category value :

    0009  TAB  ; White_Space # Cc  TABULATION <control-0009>
    000A  LF   ; White_Space # Cc  LINE FEED <control-000A>
    000B       ; White_Space # Cc  VERTICAL TABULATION <control-000B>
    000C       ; White_Space # Cc  FORM FEED <control-000C>
    000D  CR   ; White_Space # Cc  CARRIAGE RETURN <control-000D>
    0020       ; White_Space # Zs  SPACE
    0085       ; White_Space # Cc  NEXT LINE <control-0085>
    00A0       ; White_Space # Zs  NO-BREAK SPACE
    1680       ; White_Space # Zs  OGHAM SPACE MARK
    2000       ; White_Space # Zs  EN QUAD
    2001       ; White_Space # Zs  EM QUAD
    2002       ; White_Space # Zs  EN SPACE
    2003       ; White_Space # Zs  EM SPACE
    2004       ; White_Space # Zs  THREE-PER-EM SPACE
    2005       ; White_Space # Zs  FOUR-PER-EM SPACE
    2006       ; White_Space # Zs  SIX-PER-EM SPACE
    2007       ; White_Space # Zs  FIGURE SPACE
    2008       ; White_Space # Zs  PUNCTUATION SPACE
    2009       ; White_Space # Zs  THIN SPACE
    200A       ; White_Space # Zs  HAIR SPACE
    2028       ; White_Space # Zl  LINE SEPARATOR
    2029       ; White_Space # Zp  PARAGRAPH SEPARATOR
    202F       ; White_Space # Zs  NARROW NO-BREAK SPACE
    205F       ; White_Space # Zs  MEDIUM MATHEMATICAL SPACE
    3000       ; White_Space # Zs  IDEOGRAPHIC SPACE

Note that I used the notations TAB, LF and CR, standing for the three characters \t, \n and \r, instead of the chars themselves.

So, in order to get the number of Non_Space strings, we should, normally, use the simple regex \S+. However, it does not give the right number. Indeed, when several characters with a code-point over the BMP are consecutive, they are not seen as one global Non_Space string but as individual characters :-((

Test my statement with this string, composed of four consecutive emoji chars 👨👩👦👧. The regex \S+ returns four Non_Space strings, whereas I would have expected only one string !

Consequently, I verified that the suitable regex to count all the Non_Space strings of a file, whatever their Unicode code-point, is rather the regex ((?!\s).[\x{D800}-\x{DFFF}]?)+ ( longer, I agree, but exact ! )
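As a rough cross-check outside Notepad++, the Non_Space count can be sketched in Python 3, where a str is a sequence of code points, so characters over the BMP need no surrogate handling. The names WHITE_SPACE, NON_SPACE_RUN and count_non_space_strings are illustrative only, and the class is typed from the 25 White_Space code points listed above, so it is not identical to the Boost \s class quoted earlier ( which contains \x{200B} and lacks \x{205F} ) :

```python
import re

# The 25 code points with the Unicode White_Space property (PropList.txt).
# Assumption: this hand-typed class stands in for the editor's \s.
WHITE_SPACE = (
    "\t\n\x0b\x0c\r\x20\x85\xa0\u1680"
    "\u2000-\u200a"
    "\u2028\u2029\u202f\u205f\u3000"
)

# A Non_Space string is a maximal run of characters outside that class.
NON_SPACE_RUN = re.compile("[^" + WHITE_SPACE + "]+")

def count_non_space_strings(text):
    """Count maximal runs of non-White_Space characters in a str."""
    return len(NON_SPACE_RUN.findall(text))

# Four consecutive emoji (all over the BMP) form a single run here.
print(count_non_space_strings("\U0001F468\U0001F469\U0001F466\U0001F467"))
# NO-BREAK SPACE and IDEOGRAPHIC SPACE both act as boundaries.
print(count_non_space_strings("one\u00a0two\u3000three"))
```

This only helps to validate the counts on a text copied out of the editor; it says nothing about how the Boost engine itself walks UTF-16 code units.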
Now, here is a list of regexes which allow you to count different subsets of characters from the current file contents. Of course, tick the Wrap around option !

- Number of chars in a current non-ANSI ( UNICODE ) file, as the zone [\x{D800}-\x{DFFF}] represents the reserved SURROGATE area :

    - Number of chars, in range [U+0000 - U+007F], WITHOUT the \r AND \n chars  =  N1  =  (?![\r\n])[\x{0000}-\x{007F}]
    - Number of chars, in range [U+0080 - U+07FF]  =  N2  =  [\x{0080}-\x{07FF}]
    - Number of chars, in range [U+0800 - U+FFFF], except in SURROGATE range [D800 - DFFF]  =  N3  =  (?![\x{D800}-\x{DFFF}])[\x{0800}-\x{FFFF}]
    - Number of chars, in range [U+0000 - U+FFFF], in BMP, WITHOUT the \r AND \n  =  N1 + N2 + N3  =  (?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]  or  [^\r\n\x{D800}-\x{DFFF}]
    - Number of chars, in range [U+10000 - U+10FFFF], OVER the BMP  =  N4  =  (?-s).[\x{D800}-\x{DFFF}]
    - TOTAL chars, in an UNICODE file, WITHOUT the \r AND \n chars  =  N1 + N2 + N3 + N4  =  [^\r\n]
    - Number of \r characters + Number of \n characters  =  N0  =  \r|\n
    - TOTAL chars, in an UNICODE file, WITH the \r AND \n chars  =  N0 + N1 + N2 + N3 + N4  =  (?s).

- Number of chars in a current ANSI file :

    - Number of characters, in range [U+0000 - U+00FF], WITHOUT the \r AND \n chars  =  N1  =  [^\r\n]
    - Number of \r characters + Number of \n characters  =  N0  =  \r|\n
    - TOTAL chars, in an ANSI file, WITH the \r AND \n chars  =  N0 + N1  =  (?s).

- TOTAL current FILE length <Fl> in Notepad++ BUFFER :

    - For an ANSI file  Fl  =  N0 + N1  =  (?s).
    - For an UTF-8 or UTF-8-BOM file  Fl  =  N0 + N1 + 2 × N2 + 3 × N3 + 4 × N4
    - For an UCS-2 BE BOM or UCS-2 LE BOM file  Fl  =  ( N0 + N1 + N2 + N3 ) × 2  =  (?s). × 2

- Byte Order Mark ( BOM = U+FEFF ) length <Bl> and encoding, for SAVED files :

    - For an ANSI or UTF-8 file  Bl  =  0 byte
    - For an UTF-8-BOM file  Bl  =  3 bytes ( EF BB BF )
    - For an UCS-2 BE BOM file  Bl  =  2 bytes ( FE FF )
    - For an UCS-2 LE BOM file  Bl  =  2 bytes ( FF FE )

- TOTAL CURRENT file length on DISK, WHATEVER its encoding  Ld  =  Fl + Bl  ( = Total FILE length + BOM length )

- NUMBER of WORDS  =  \w+  whatever the file TYPE ( This result must be considered with CAUTION )

- NUMBER of NON_SPACE strings  =  ((?!\s).[\x{D800}-\x{DFFF}]?)+  for an UNICODE file  or  ((?!\s).)+  for an ANSI file

- Number of LINES in an UNICODE file :

    - Number of true EMPTY lines  =  (?<![\f\x{0085}\x{2028}\x{2029}])^(?:\r\n|\r|\n)
    - Number of lines containing TAB and/or SPACE characters ONLY  =  (?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]+(?:\r\n|\r|\n|\z)
    - TOTAL Number of BLANK or EMPTY lines  =  (?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]*(?:\r\n|\r|\n|\z)
    - Number of NON BLANK and NON EMPTY lines  =  (?-s)(?!^[\t\x20]+$)(?:.|[\f\x{0085}\x{2028}\x{2029}])+(?:\r\n|\r|\n|\z)
    - TOTAL number of LINES in an UNICODE file  =  (?-s)\r\n|\r|\n|(?:.|[\f\x{0085}\x{2028}\x{2029}])\z

- Number of LINES in an ANSI file :

    - Number of true EMPTY lines  =  (?<!\f)^(?:\r\n|\r|\n)
    - Number of lines containing TAB and/or SPACE characters ONLY  =  (?<!\f)^[\t\x20]+(?:\r\n|\r|\n|\z)
    - TOTAL Number of EMPTY or BLANK lines  =  (?<!\f)^[\t\x20]*(?:\r\n|\r|\n|\z)
    - Number of NON BLANK and NON EMPTY lines  =  (?-s)(?!^[\t\x20]+$)^(?:.|\f)+(?:\r\n|\r|\n|\z)
    - TOTAL number of LINES in an ANSI file  =  (?-s)\r\n|\r|\n|(?:.|\f)\z
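The byte-length formulas above can be checked against Python's own encoders. This is only a sketch under stated assumptions: count_classes, utf8_length and utf16_length are hypothetical helper names, the input is a Python 3 str without lone surrogates, and the UTF-16 formula adds a 4 × N4 term for characters over the BMP, which the UCS-2 formula above does not need.

```python
def count_classes(text):
    """Return (N0, N1, N2, N3, N4) as defined in the list above."""
    n0 = n1 = n2 = n3 = n4 = 0
    for ch in text:
        cp = ord(ch)
        if ch in "\r\n":
            n0 += 1          # line-break characters
        elif cp <= 0x7F:
            n1 += 1          # 1 byte in UTF-8
        elif cp <= 0x7FF:
            n2 += 1          # 2 bytes in UTF-8
        elif cp <= 0xFFFF:
            n3 += 1          # 3 bytes in UTF-8
        else:
            n4 += 1          # 4 bytes in UTF-8, over the BMP
    return n0, n1, n2, n3, n4

def utf8_length(text):
    n0, n1, n2, n3, n4 = count_classes(text)
    return n0 + n1 + 2 * n2 + 3 * n3 + 4 * n4

def utf16_length(text):
    # Assumption: BMP chars take 2 bytes, chars over the BMP take 4.
    n0, n1, n2, n3, n4 = count_classes(text)
    return (n0 + n1 + n2 + n3) * 2 + 4 * n4

sample = "A\u00e9\u20ac\U0001F600\r\n"
print(count_classes(sample))
print(utf8_length(sample) == len(sample.encode("utf-8")))
print(utf16_length(sample) == len(sample.encode("utf-16-le")))
```

The BOM length Bl is deliberately left out, since it only applies to the file saved on disk.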
Remarks : Although most of the regexes above can be easily understood, here are some additional elements :

- The regex (?-s).[\x{D800}-\x{DFFF}] is the sole correct syntax, with our Boost regex engine, to count all the characters over the BMP
- The regex (?s)((?!\s).[\x{D800}-\x{DFFF}]?)+, to count all the Non_Space strings, was explained before
- In all the regexes relating to the counting of lines, you probably noticed the character class [\f\x{0085}\x{2028}\x{2029}]. It must be present because each of the four characters \f, \x{0085}, \x{2028} and \x{2029} is considered as both a start and an end of line, like the assertions ^ and $ !
    - For instance, if, in a new file, you insert one Next_Line char ( NEL ), of code-point \x{0085}, and hit the Enter key, this sole line is wrongly seen as an empty line by the simple regex ^(?:\r\n|\r|\n), which matches the line-break after the Next_Line char !

To end, I would like to propose a new layout of the Summary feature, which should be more informative !

IMPORTANT : In the list below, any text before the 1st colon character of each line is the name which should be displayed in the Summary dialog !

    Full File Path      :  X:\....\....\
    Creation Date       :  MM/DD/YYYY HH:MM:SS
    Modification Date   :  MM/DD/YYYY HH:MM:SS

                                                       UTF-8[-BOM]  |  UCS-2 BE/LE BOM  |  ANSI
    ---------------------------------------------------------------------------------------------
    1-Byte  Chars       :  N1                      =  (?![\r\n])[\x{0000}-\x{007F}]  |  idem  |  [^\r\n]
    2-Bytes Chars       :  N2                      =  [\x{0080}-\x{07FF}]  |  idem  |  0
    3-Bytes Chars       :  N3                      =  (?![\x{D800}-\x{DFFF}])[\x{0800}-\x{FFFF}]  |  idem  |  0
    Total BMP Chars     :  N1 + N2 + N3            =  (?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]  |  idem  |  [^\r\n]
    4-Bytes Chars       :  N4                      =  (?-s).[\x{D800}-\x{DFFF}]  |  0  |  0
    NON BLANK chars     :                          =  [^\r\n\t\x20]  |  idem  |  idem
    Chars w/o CR|LF     :  N1 + N2 + N3 + N4       =  [^\r\n]  |  idem  |  idem
    EOL ( CR or LF )    :  N0                      =  \r|\n  |  idem  |  idem
    TOTAL Characters    :  N0 + N1 + N2 + N3 + N4  =  (?s).  |  idem  |  idem
    BYTE Length         :                          =  N0 + N1 + 2 × N2 + 3 × N3 + 4 × N4  |  ( N0 + N1 + N2 + N3 ) × 2  |  (?s).
    Byte Order Mark     :                          =  0 ( UTF-8 ) or 3 ( UTF-8-BOM )  |  2  |  0
    BUFFER Length       :  BYTE length + BOM
    FILE Length         :  SAVED length of CURRENT file on DISK
    WORDS ( Caution )   :                          =  \w+  |  idem  |  idem
    NON-SPACE strings   :                          =  ((?!\s).[\x{D800}-\x{DFFF}]?)+  |  ((?!\s).)+  |  ((?!\s).)+
    True EMPTY lines    :                          =  (?<![\f\x{0085}\x{2028}\x{2029}])^(?:\r\n|\r|\n)  |  idem  |  (?<!\f)^(?:\r\n|\r|\n)
    BLANK lines         :                          =  (?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]+(?:\r\n|\r|\n|\z)  |  idem  |  (?<!\f)^[\t\x20]+(?:\r\n|\r|\n|\z)
    EMPTY/BLANK lines   :                          =  (?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]*(?:\r\n|\r|\n|\z)  |  idem  |  (?<!\f)^[\t\x20]*(?:\r\n|\r|\n|\z)
    NON-BLANK lines     :                          =  (?-s)(?!^[\t\x20]+$)^(?:.|[\f\x{0085}\x{2028}\x{2029}])+(?:\r\n|\r|\n|\z)  |  idem  |  (?-s)(?!^[\t\x20]+$)^(?:.|\f)+(?:\r\n|\r|\n|\z)
    TOTAL lines         :                          =  (?-s)\r\n|\r|\n|(?:.|[\f\x{0085}\x{2028}\x{2029}])\z  |  idem  |  (?-s)\r\n|\r|\n|(?:.|\f)\z
    Selection(s)        :  X characters (Y bytes) in Z ranges

In each row, the three cells separated by a spaced " | " are the UTF-8[-BOM], UCS-2 BE/LE BOM and ANSI columns, in that order.

Best Regards,

guy038
- 
- 
Hello All,

Here is the updated version of my previous posts regarding the present N++ Summary feature ( View > Summary... ). And I must say that numerous things are still weird !

For tests, I used various files as well as my Total_Chars.txt file, written with the 4 N++ Unicode encodings, and also an ANSI file, containing the 256 characters of the Windows-1252 encoding :

- ANSI
- UTF-8
- UTF-8-BOM
- UTF-16 BE BOM
- UTF-16 LE BOM

To my mind, there are 3 major problems and some minor points :

- The first and worst problem is the fact that, when an UTF-8[-BOM] file, containing various Unicode chars ( of the BMP only : this point is important ! ), is copied into an UCS-2 BE BOM or UCS-2 LE BOM encoded file, some results given by the Summary feature for these new files are totally wrong :
    - The Characters (without line endings) value seems to be the number of bytes used in the corresponding UTF-8[-BOM] file
    - The Document length value seems to be the document length of the corresponding UTF-8[-BOM] file and is also displayed, unfortunately, in the status bar !!
- The second problem is that the definition of a word char by the Summary feature is definitely NOT the same as the definition of the regex \w, as explained further on !
- Thus, the third problem is that the given number of words is totally inaccurate ! And, anyway, the number of words, although well enough defined for an English / American text, is a rather vague notion for a lot of texts written in other languages, especially Asian ones ! ( See further on )
- Some minor things :
    - The number of lines given is, most of the time, increased by one unit
    - Presently, the Summary feature displays the document length in the Notepad++ buffer. I think it would be good to display, as well, the actual document length saved on disk. Incidentally, for just-saved documents, it would give, by difference, the length of the possible Byte Order Mark, if its size were not explicitly displayed !
    - For any encoded file, a decomposition giving the number of chars coded with 1, 2, 3 and 4 bytes would be welcome !

So, in brief, in the present Summary window :

- The Characters (without line endings): number is wrong for the UTF-16 BE BOM or UTF-16 LE BOM encodings
- The Words number is totally wrong, given the regex definition of a word character, whatever the encoding used
- The Lines: number is wrong, by one unit, if a line-break ends the last line of the current file, in any encoding
- The Document length value, in the N++ buffer, is wrong for the UTF-16 BE BOM or UTF-16 LE BOM encodings, as well as the Length: indication in the status bar
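To illustrate the off-by-one point on the Lines: number, here is a small Python 3 sketch of the counting convention behind the total-lines regex given earlier in this thread, where a final line-break does not open an extra line. TOTAL_LINES and count_lines are hypothetical names, and the pattern is a Python adaptation ( \Z instead of \z, and [^\r\n] instead of the dot plus \f, NEL, LS, PS class ), not the Boost expression itself :

```python
import re

# Every line break counts once, plus a last line that has no trailing break.
# Assumption: only \r\n, \r and \n are treated as line breaks here.
TOTAL_LINES = re.compile(r"\r\n|\r|\n|[^\r\n]\Z")

def count_lines(text):
    """Count lines without adding one for a trailing line break."""
    return len(TOTAL_LINES.findall(text))

with_final_break = "first\r\nsecond\r\n"
without_final_break = "first\r\nsecond"

print(count_lines(with_final_break))        # same count as the next line
print(count_lines(without_final_break))
# A naive "number of breaks + 1" count differs when a break ends the text.
print(with_final_break.count("\n") + 1)
```

With this convention, a file containing a single Next_Line char followed by Enter counts as one line, not two.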
 
To begin with, let me develop the… second bug ! After numerous tests, I determined that, in the present View > Summary... feature, the characters considered as word characters are :

- The C0 control characters, except for the Tabulation ( \x{0009} ) and the two EOL ( \x{000a} and \x{000d} ), so the regex (?![\t\r\n])[\x00-\x1F]
- The number sign #
- The 10 digits, so the regex [0-9]
- The 26 uppercase and lowercase letters, so the regex (?i)[A-Z]
- The low line character _
- All the characters of the Basic Multilingual Plane ( BMP ) with code-point over \x{007E}, so the regex (?![\x{D800}-\x{DFFF}])[\x{007F}-\x{FFFF}] for a Unicode encoded file or [\x7F-\xFF] for an ANSI encoded file
- All the characters over the Basic Multilingual Plane, so the regex (?-s).[\x{D800}-\x{DFFF}], for a Unicode encoded file only

To simulate the present Words: number ( which is erroneous ! ) given by the Summary feature, whatever the file encoding, simply use the regex below :

    [^\t\n\r\x20!"$%&'()*+,\-./:;<=>?@\x5B\x5C\x5D^\x60{|}~]+

and click on the Count button of the Find dialog, with the Wrap around option ticked.

Obviously, this is not exact, as a single word character is matched with the \w regex, which is the class [\u\l\d_], where \u, \l and \d represent any Unicode uppercase, lowercase and digit char or a related char, so, finally, much more than the simple [A-Za-z0-9] set !

But, worse, it is the notion of word itself which is, in practice, not consistent most of the time ! Indeed, for instance, if we consider the French expression l'école ( the school ), the regex \w+ would return 2 words, which is correct as this expression can be mentally decomposed as la école. However, this regex would wrongly say that the single word aujourd'hui ( today ) is a two-word expression. Of course, you could change the regex to [\w']+, which would return 1 word, but, this time, the expression l'école would wrongly be considered as a one-word string !

In addition, what can be said about languages that do not use the Space character or where the use of the Space is discretionary ? Then, counting words is impossible or rather non-significant ! This is developed in Martin Haspelmath's article, below :

https://zenodo.org/record/225844/files/WordSegmentationFL.pdf

At the end of section 5, it is said :

… On such a view, the claim that “all languages have words” (Radford et al. 1999: 145) would be interpretable only in the weaker sense that "all languages have a unit which falls between the minimal sign and the phrase” …

And :

… The basic problem remains the same: The units are defined in a language-specific way and cannot be equated across languages, and there is no reason to give special status to a unit called ‘word’. …

At the beginning of section 7 :

… Linguists have no good basis for identifying words across languages …

And in the conclusion, section 10 :

… I conclude, from the arguments presented in this article, that there is no definition of ‘word’ that can be applied to any language and that would yield consistent results …
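The l'école / aujourd'hui discussion above can be reproduced with a short Python 3 sketch. SUMMARY_WORD reuses the character class quoted above for emulating the Summary Words: value; REGEX_WORD and APOSTROPHE_WORD are illustrative names, and Python's \w is only an approximation of the Boost \w class, so the counts are indicative rather than a statement about Notepad++ itself :

```python
import re

# Class quoted above for emulating the Summary "Words:" value.
SUMMARY_WORD = re.compile(
    r"""[^\t\n\r\x20!"$%&'()*+,\-./:;<=>?@\x5B\x5C\x5D^\x60{|}~]+"""
)
REGEX_WORD = re.compile(r"\w+")          # Python's own notion of \w
APOSTROPHE_WORD = re.compile(r"[\w']+")  # \w extended with the apostrophe

def count(pattern, text):
    """Number of non-overlapping matches of pattern in text."""
    return len(pattern.findall(text))

french = "l'école aujourd'hui"
print(count(REGEX_WORD, french))        # l, école, aujourd, hui
print(count(APOSTROPHE_WORD, french))   # l'école, aujourd'hui

# '#' is a word char for the Summary emulation but not for \w.
mixed = "a#b c_d"
print(count(SUMMARY_WORD, mixed), count(REGEX_WORD, mixed))
```

Neither count gives the linguistically expected 3 words for the French sample, which is the point made above.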
Now, the Unicode definition of a word character is :

    \p{Alphabetic} | \p{gc=Mark} | \p{gc=Decimal_Number} | \p{gc=Connector_Punctuation} | \p{Join_Control}

https://www.unicode.org/reports/tr18/#Simple_Word_Boundaries

So, in theory, the word_character class should include :

- All values of the derived property Alphabetic ( = alpha = \p{alphabetic} ), so 132,875 chars, from the DerivedCoreProperties.txt file, which can be decomposed into :
    - Uppercase_Letter ( Lu ) + Lowercase_Letter ( Ll ) + Titlecase_Letter ( Lt ) + Modifier_Letter ( Lm ) + Other_Letter ( Lo ) + Letter_Number ( Nl ) + Other_Alphabetic, so the characters sum 1,791 + 2,155 + 31 + 260 + 127,004 + 236 + 1,398
    - Note : The last property Other_Alphabetic, from the PropList.txt file, contains some, but not all, characters from the 3 General_Categories Spacing_Mark ( Mc ), Nonspacing_Mark ( Mn ) and Other_Symbol ( So ), so the characters sum 417 + 851 + 130
- All values with General_Category = Decimal_Number, from the DerivedGeneralCategory.txt file, so 650 characters ( these are characters with defined values in the three fields 6, 7 and 8 of the UnicodeData.txt file )
- All values with General_Category = Connector_Punctuation, from the DerivedGeneralCategory.txt file, so 10 characters
- All values with the binary property Join_Control, from the PropList.txt file, so 2 characters

So, if we include all Unicode languages, even historical ones :

=> Total number of Unicode word characters = 132,875 + 650 + 10 + 2 = 133,537 characters, with version UNICODE 13.0.0 !!

Notes :

- The different files mentioned can be downloaded from the Unicode Character Database ( UCD ) or sub-directories, below :

    http://www.unicode.org/Public/UCD/latest/ucd/

- And refer to the sites below for additional information on this topic :

    https://www.unicode.org/reports/tr18/#Compatibility_Properties
    https://www.unicode.org/reports/tr29/#Word_Boundaries
    https://www.unicode.org/reports/tr31/ for tables 4, 5 and 6 of section 2.4
    https://www.unicode.org/reports/tr44/#UnicodeData.txt

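For readers who want to probe individual characters against the Unicode word-character definition above, here is a minimal Python 3 sketch based on the standard unicodedata module. It is an approximation under stated assumptions: is_word_char_approx, WORD_CATEGORIES and JOIN_CONTROL are hypothetical names, and because unicodedata does not expose the Other_Alphabetic property, alphabetic symbols of category So ( such as the circled letters ) are reported as non-word here :

```python
import unicodedata

# General categories covered by the definition above (approximation).
WORD_CATEGORIES = {
    "Lu", "Ll", "Lt", "Lm", "Lo", "Nl",   # letters and letter numbers
    "Mn", "Mc", "Me",                     # gc=Mark
    "Nd",                                 # gc=Decimal_Number
    "Pc",                                 # gc=Connector_Punctuation
}
# The 2 characters with the Join_Control property: ZWNJ and ZWJ.
JOIN_CONTROL = {"\u200c", "\u200d"}

def is_word_char_approx(ch):
    """Approximate test; ignores Other_Alphabetic chars of category So."""
    return ch in JOIN_CONTROL or unicodedata.category(ch) in WORD_CATEGORIES

for ch in ("\u023d", "\U0001D400", "\u200d", "\uff3f", "\u24b6", "!"):
    print("U+%04X" % ord(ch), unicodedata.category(ch), is_word_char_approx(ch))
```

A complete check would need the Alphabetic property from DerivedCoreProperties.txt, which this sketch does not parse.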
Anyone who clicked on the links to the Unicode Consortium above will have understood, very quickly, that the notions of word characters and word boundaries are a real nightmare !

Even if we restrict the definition of word chars to Unicode living scripts, forgetting all the historical scripts not in use, and also leaving aside all scripts which do not systematically use the space char to delimit words, we still have a list of about 21,000 characters which should be considered as word characters ! I tried to build up such a list, with the help of these sites :

https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries
https://linguistlist.org/issues/6/6-1302/
https://unicode-org.github.io/cldr-staging/charts/37/supplemental/scripts_and_languages.html
https://scriptsource.org/cms/scripts/page.php?item_id=script_overview
https://r12a.github.io/scripts/featurelist/

And I ended up with this list of 46 living scripts which always use a Space character between words :

    •------------------------•----------------•-------------------•-----------------•
    |                        |     SCRIPT     |   SPACE between   | UNICODE Script  |
    |                        |     Type :     |      Words :      |     Class :     |
    |                        •----------------•-------------------•-----------------•
    |         SCRIPT         |  (L)iving      |  (Y)es            |  (R)ecommended  |
    |                        |                |  (U)nspecified    |  (L)imited      |
    |                        |  (H)istorical  |  (D)iscretionary  |  (E)xcluded     |
    |                        |                |  (N)o             |                 |
    •------------------------•----------------•-------------------•-----------------•
    |  ARMENIAN              |       L        |         Y         |        R        |
    |  ADLAM                 |       L        |         Y         |        L        |
    |  ARABIC                |       L        |         Y         |        R        |
    |  BAMUM                 |       L        |         Y         |        L        |
    |  BASSA VAH             |       L        |         Y         |        E        |
    |  BENGALI ( Assamese )  |       L        |         Y         |        R        |
    |  BOPOMOFO              |       L        |         Y         |        R        |
    |  BUGINESE              |       L        |         D         |        E        |
    |  CANADIAN SYLLABICS    |       L        |         Y         |        L        |
    |  CHEROKEE              |       L        |         Y         |        L        |
    |  CYRILLIC              |       L        |         Y         |        R        |
    |  DEVANAGARI            |       L        |         Y         |        R        |
    |  ETHIOPIC (Ge'ez)      |       L        |         Y         |        R        |
    |  GEORGIAN              |       L        |         Y         |        R        |
    |  GREEK                 |       L        |         Y         |        R        |
    |  GUJARATI              |       L        |         Y         |        R        |
    |  GURMUKHI              |       L        |         Y         |        R        |
    |  HANGUL                |       L        |         Y         |        R        |
    |  HANIFI ROHINGYA       |       L        |         Y         |        L        |
    |  HEBREW                |       L        |         Y         |        R        |
    |  KANNADA               |       L        |         Y         |        R        |
    |  KAYAH LI              |       L        |         Y         |        L        |
    |  LATIN                 |       L        |         Y         |        R        |
    |  LIMBU                 |       L        |         Y         |        L        |
    |  MALAYALAM             |       L        |         D         |        R        |
    |  MANDAIC               |       H        |         Y         |        L        |
    |  MEETEI MAYEK          |       L        |         Y         |        L        |
    |  MIAO (Pollard)        |       L        |         Y         |        L        |
    |  MONGOLIAN             |       L        |         Y         |        E        |
    |  NEWA                  |       L        |         Y         |        L        |
    |  NKO                   |       L        |         Y         |        L        |
    |  OL CHIKI              |       L        |         Y         |        L        |
    |  ORIYA (Odia)          |       L        |         Y         |        R        |
    |  OSAGE                 |       L        |         Y         |        L        |
    |  SINHALA               |       L        |         Y         |        R        |
    |  SUNDANESE             |       L        |         Y         |        L        |
    |  SYLOTI NAGRI          |       L        |         Y         |        L        |
    |  SYRIAC                |       L        |         Y         |        L        |
    |  TAI VIET              |       L        |         Y         |        L        |
    |  TAMIL                 |       L        |         Y         |        R        |
    |  TELUGU                |       L        |         Y         |        R        |
    |  THAANA                |       L        |         D         |        R        |
    |  TIFINAGH (Berber)     |       L        |         Y         |        L        |
    |  VAI                   |       L        |         Y         |        L        |
    |  WANCHO                |       L        |         Y         |        L        |
    |  YI                    |       L        |         Y         |        L        |
    •------------------------•----------------•-------------------•-----------------•

These scripts involve 101 Unicode blocks, from Basic Latin ( 0000 - 007F ) to Symbols for Legacy Computing ( 1FB00 - 1FBFF )
You may also have a look at these sites for general information :

https://en.wikipedia.org/wiki/List_of_Unicode_characters
https://en.wikipedia.org/wiki/Scriptio_continua#Decline
https://glottolog.org/glottolog/language ( especially to locate the area where a language is used )

Continued discussion in the next post

guy038
- 
Hi All,

Then, with the help of the excellent BabelMap software, updated for Unicode v13.0

https://www.babelstone.co.uk/Software/BabelMap.html

I succeeded in creating a list of the 21,143 remaining characters, from the living scripts above, which should truly be considered as word characters, without any ambiguity.

On the other hand, with the help of my Total_Chars.txt file, which contains 325,590 characters, I detected 48,031 word chars with the simple search of the \w regex. This number seems large, but it includes all the Chinese characters and equivalent chars, which cannot be truly counted as word chars because of their vertical / horizontal way of writing !

In addition, when applying the regex \t\w\t against the list above, I got a total of 17,307 word characters only, probably because Notepad++ does not use the Boost regex library with FULL Unicode support.

Indeed, after some verifications :

- The Boost definition of the regex \w does not consider all the characters over the BMP
- Some characters of the BMP, although alphabetic, are not yet considered as word chars
For instance, in the short list below, each Unicode char, surrounded by two tabulation chars, cannot be found with the regex \t\w\t, although each char is indeed seen as a word character by the Unicode Consortium :-((

    24B6   Ⓐ  ; Other_Symbol     # So  CIRCLED LATIN CAPITAL LETTER A
    1D400  𝐀  ; Uppercase_Letter # Lu  MATHEMATICAL BOLD CAPITAL A
    1D70B  𝜋  ; Lowercase_Letter # Ll  MATHEMATICAL ITALIC SMALL PI
    1F150  🅐  ; Other_Symbol     # So  NEGATIVE CIRCLED LATIN CAPITAL LETTER A

To my mind, for all these reasons, as we cannot rely on the word notion, the View > Summary... feature should just ignore the number of words or, at least, add the indication "With caution" !
By contrast, I think that it would be useful to count the number of Non_Space strings, determined with the regex \S+. Indeed, we would get more reliable results ! The boundaries of Non_Space strings, which are the Space characters, belong to the well-defined list of the 25 Unicode characters with the binary property White_Space, from the PropList.txt file. Refer to the very beginning of this file :

http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt

As a reminder, the regex \s is identical to \h|\v. So, it represents the complete character class [\t\x20\xA0\x{1680}\x{2000}-\x{200B}\x{202F}\x{3000}]|[\n\x0B\f\r\x85\x{2028}\x{2029}], which can be re-ordered as :

    \s = [\t\n\x0B\f\r\x20\x85\xA0\x{1680}\x{2000}-\x{200B}\x{2028}\x{2029}\x{202F}\x{3000}]

Note that, in practice, the \s regex is mainly equivalent to the simple regex [\t\n\r\x20]

Here is the list of all Unicode characters with the property White_Space, with their name and their General_Category value :

    0009  TAB  ; White_Space # Cc  TABULATION <control-0009>
    000A  LF   ; White_Space # Cc  LINE FEED <control-000A>
    000B       ; White_Space # Cc  VERTICAL TABULATION <control-000B>
    000C       ; White_Space # Cc  FORM FEED <control-000C>
    000D  CR   ; White_Space # Cc  CARRIAGE RETURN <control-000D>
    0020       ; White_Space # Zs  SPACE
    0085       ; White_Space # Cc  NEXT LINE <control-0085>
    00A0       ; White_Space # Zs  NO-BREAK SPACE
    1680       ; White_Space # Zs  OGHAM SPACE MARK
    2000       ; White_Space # Zs  EN QUAD
    2001       ; White_Space # Zs  EM QUAD
    2002       ; White_Space # Zs  EN SPACE
    2003       ; White_Space # Zs  EM SPACE
    2004       ; White_Space # Zs  THREE-PER-EM SPACE
    2005       ; White_Space # Zs  FOUR-PER-EM SPACE
    2006       ; White_Space # Zs  SIX-PER-EM SPACE
    2007       ; White_Space # Zs  FIGURE SPACE
    2008       ; White_Space # Zs  PUNCTUATION SPACE
    2009       ; White_Space # Zs  THIN SPACE
    200A       ; White_Space # Zs  HAIR SPACE
    2028       ; White_Space # Zl  LINE SEPARATOR
    2029       ; White_Space # Zp  PARAGRAPH SEPARATOR
    202F       ; White_Space # Zs  NARROW NO-BREAK SPACE
    205F       ; White_Space # Zs  MEDIUM MATHEMATICAL SPACE
    3000       ; White_Space # Zs  IDEOGRAPHIC SPACE

Note that I used the notations TAB, LF and CR, standing for the three characters \t, \n and \r, instead of the chars themselves.

So, in order to get the number of Non_Space strings, we should, normally, use the simple regex \S+. However, it does not give the right number. Indeed, when several characters with a code-point over the BMP are consecutive, they are not seen as one global Non_Space string but as individual characters :-((

You may test my statement with this string, composed of four consecutive emoji chars 👨👩👦👧. The regex \S+ returns four Non_Space strings, whereas I would have expected only one string !

Consequently, I verified that, when the number of four-byte chars is > 0, the suitable regex to count all the Non_Space strings of a file, whatever their Unicode code-point, is rather the regex ((?!\s).[\x{D800}-\x{DFFF}]?)+ ( longer, I agree, but exact ! )
So, I would like to propose a new layout of the Summary feature, which should be more informative. It contains a list of regexes which allow you to count different subsets of characters from the current file contents. Of course, tick the Wrap around option in the Find dialog and click on the Count button for tests !

IMPORTANT : In the list below, any text before the colon character of each line is the name which should be displayed in the new Summary dialog !

    FULL File Path     :  X:\....\....\
    CREATION     Date  :  Name Month Day 22-05-26 Year
    MODIFICATION Date  :  Name Month Day 22-05-26 Year
    READ-ONLY flag     :  YES / NO
    READ-ONLY editor   :  YES / NO
    Current VIEW       :  MAIN view / SECONDARY view
    Current ENCODING   :  UTF-... / ANSI
    Current LANGUAGE   :  TXT ( Normal txt file) / ...
    Current Line END   :  Windows (CR LF) / Macintosh (CR) / Unix (LF)
    Current WRAPPING   :  YES / NO

In each row of the table below, the cells separated by a spaced " | " are, in order, the UTF-8 [-BOM], UCS-2/UTF-16 BE/LE BOM and ANSI columns :

    •------------------------------------------------•----------------------•----------------------•----------------------•
                                                      |  UTF-8 [-BOM]  |  UCS-2/UTF-16 BE/LE BOM  |  ANSI
    •------------------------------------------------•----------------------•----------------------•----------------------•
    1-BYTE  Chars      :  N1                          |  (?![\r\n])[\x{0000}-\x{007F}]  |  0  |  [^\r\n]
    2-BYTES Chars      :  N2                          |  [\x{0080}-\x{07FF}]  |  (?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]  |  0
    3-BYTES Chars      :  N3                          |  (?![\x{D800}-\x{DFFF}])[\x{0800}-\x{FFFF}]  |  0  |  0

    Sum BMP Chars      :  N1 + N2 + N3                |  (?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]  or  [^\r\n\x{D800}-\x{DFFF}]  |  idem  |  [^\r\n]
    4-BYTES Chars      :  N4                          |  (?-s).[\x{D800}-\x{DFFF}]  or  [\x{D800}-\x{DFFF}]  |  idem  |  0

    Chars w/o CR|LF    :  N1 + N2 + N3 + N4           |  [^\r\n]  |  idem  |  idem
    EOL ( CR or LF )   :  N0                          |  \r|\n  |  idem  |  idem

    TOTAL Characters   :  N0 + N1 + N2 + N3 + N4      |  (?s).  |  idem  |  idem

    BYTE Length        :                              |  N0 + N1 + 2 × N2 + 3 × N3 + 4 × N4  |  N0 × 2 + N2 × 2 + N4 × 4  |  N0 + N1
    Byte Order Mark    :                              |  0 ( UTF-8 ) or 3 ( UTF-8-BOM )  |  2  |  0
    BUFFER Length      :  BYTE length + BOM           |  |  |
    Length on DISK     :  Length CURRENT file on DISK |  |  |

    NON BLANK chars    :                              |  [^\r\n\t\x20]  |  idem  |  idem
    WORDS     count    :  (Caution !)                 |  \w+  |  idem  |  idem
    NON-SPACE count    :                              |  (?:(?!\s).[\x{D800}-\x{DFFF}]?)+  or  \S+  |  idem  |  \S+

    True EMPTY lines   :  L1                          |  (?<![\f\x{0085}\x{2028}\x{2029}])^(?:\r\n|\r|\n)  |  idem  |  (?<!\f)^(?:\r\n|\r|\n)
    True BLANK lines   :  L2                          |  (?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]+(?:\r\n|\r|\n|\z)  |  idem  |  (?<!\f)^[\t\x20]+(?:\r\n|\r|\n|\z)

    EMPTY/BLANK lines  :  L1 + L2                     |  (?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]*(?:\r\n|\r|\n|\z)  |  idem  |  (?<!\f)^[\t\x20]*(?:\r\n|\r|\n|\z)
    NON-BLANK lines    :                              |  (?-s)(?!^[\t\x20]+$)^(?:.|[\f\x{0085}\x{2028}\x{2029}])+(?:\r\n|\r|\n|\z)  |  idem  |  (?-s)(?!^[\t\x20]+$)^(?:.|\f)+(?:\r\n|\r|\n|\z)
    TOTAL lines        :                              |  (?-s)\r\n|\r|\n|(?:.|[\f\x{0085}\x{2028}\x{2029}])\z  |  idem  |  (?-s)\r\n|\r|\n|(?:.|\f)\z

    SELECTION(S)       :                              |  X characters (Y bytes) in Z ranges  |  idem  |  idem
    •------------------------------------------------•----------------------•----------------------•----------------------•

Continued discussion in the next post

guy038
- 
- 
Hi, All,

Remarks : Although most of the regexes above can be easily understood, here are some additional elements :

- The regex (?-s).[\x{D800}-\x{DFFF}] is the sole correct syntax, with our Boost regex engine, to count all the characters over the BMP. But it may fail with the message "Ran out of stack space trying to match the regular expression.". Luckily, I do not use it, because the count can be deduced from the difference Total_Standard - Total_BMP
- The regex (?s)((?!\s).[\x{D800}-\x{DFFF}]?)+, to count all the Non_Space strings, was explained before, but may fail with the message "Ran out of stack space trying to match the regular expression."
- In all the regexes relating to the counting of lines, you probably noticed the character class [\f\x{0085}\x{2028}\x{2029}]. It must be present because each of the four characters \f, \x{0085}, \x{2028} and \x{2029} is considered as both a start and an end of line, like the assertions ^ and $ !
    - For instance, if, in a new file, you insert one Next_Line char ( NEL ), of code-point \x{0085}, and hit the Enter key, this sole line is wrongly seen as an empty line by the simple regex ^(?:\r\n|\r|\n), which matches the line-break after the Next_Line char !

Here is the Python script, split over two posts :

    # encoding=utf-8

    #-------------------------------------------------------------------------
    #                    STATISTICS about the CURRENT file ( v0.6 )
    #-------------------------------------------------------------------------

    from __future__ import print_function    # for Python2 compatibility

    from Npp import *

    import re

    import os, time, datetime

    import ctypes

    from ctypes.wintypes import BOOL, HWND, WPARAM, LPARAM, UINT

    # ----------------------------------------------------------------------------------------------------
    #  From @alan-kilborn, in post https://community.notepad-plus-plus.org/topic/21733/pythonscript-different-behavior-in-script-vs-in-immediate-mode/4
    # ----------------------------------------------------------------------------------------------------

    def npp_get_statusbar(statusbar_item_number):

        WNDENUMPROC = ctypes.WINFUNCTYPE(BOOL, HWND, LPARAM)
        FindWindowW = ctypes.windll.user32.FindWindowW
        FindWindowExW = ctypes.windll.user32.FindWindowExW
        SendMessageW = ctypes.windll.user32.SendMessageW
        LRESULT = LPARAM
        SendMessageW.restype = LRESULT
        SendMessageW.argtypes = [ HWND, UINT, WPARAM, LPARAM ]
        EnumChildWindows = ctypes.windll.user32.EnumChildWindows
        GetClassNameW = ctypes.windll.user32.GetClassNameW
        create_unicode_buffer = ctypes.create_unicode_buffer

        SBT_OWNERDRAW = 0x1000
        WM_USER = 0x400; SB_GETTEXTLENGTHW = WM_USER + 12; SB_GETTEXTW = WM_USER + 13

        npp_get_statusbar.STATUSBAR_HANDLE = None

        def get_result_from_statusbar(statusbar_item_number):
            assert statusbar_item_number <= 5
            retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTLENGTHW, statusbar_item_number, 0)
            length = retcode & 0xFFFF
            type = (retcode >> 16) & 0xFFFF
            assert (type != SBT_OWNERDRAW)
            text_buffer = create_unicode_buffer(length)
            retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTW, statusbar_item_number, ctypes.addressof(text_buffer))
            retval = '{}'.format(text_buffer[:length])
            return retval

        def EnumCallback(hwnd, lparam):
            curr_class = create_unicode_buffer(256)
            GetClassNameW(hwnd, curr_class, 256)
            if curr_class.value.lower() == "msctls_statusbar32":
                npp_get_statusbar.STATUSBAR_HANDLE = hwnd
                return False  # stop the enumeration
            return True  # continue the enumeration

        npp_hwnd = FindWindowW(u"Notepad++", None)
        EnumChildWindows(npp_hwnd, WNDENUMPROC(EnumCallback), 0)
        if npp_get_statusbar.STATUSBAR_HANDLE: return get_result_from_statusbar(statusbar_item_number)
        assert False

    St_bar = npp_get_statusbar(4)    #  Zone 4 ( STATUSBARSECTION.UNICODETYPE )

See next post for continuation !
- 
- 
 Continuation of the script :

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

def number(occ):
    global num
    num += 1

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_encoding = str(notepad.getEncoding())

if Curr_encoding == 'ENC8BIT':
    Curr_encoding = 'ANSI'

if Curr_encoding == 'COOKIE':
    Curr_encoding = 'UTF-8'

if Curr_encoding == 'UTF8':
    Curr_encoding = 'UTF-8-BOM'

if Curr_encoding == 'UCS2BE':
    Curr_encoding = 'UTF-16 BE BOM'

if Curr_encoding == 'UCS2LE':
    Curr_encoding = 'UTF-16 LE BOM'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    Line_title = 95
else:
    Line_title = 75

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

File_name = notepad.getCurrentFilename()

if os.path.isfile(File_name) == True:

    Creation_date = time.ctime(os.path.getctime(File_name))

    Modif_date = time.ctime(os.path.getmtime(File_name))

    Size_length = os.path.getsize(File_name)

    RO_flag = 'YES'

    if os.access(File_name, os.W_OK):
        RO_flag = 'NO'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

RO_editor = 'NO'

if editor.getReadOnly() == True:
    RO_editor = 'YES'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if notepad.getCurrentView() == 0:
    Curr_view = 'MAIN View'
else:
    Curr_view = 'SECONDARY view'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_lang = notepad.getCurrentLang()

Lang_desc = notepad.getLanguageDesc(Curr_lang)

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if editor.getEOLMode() == 0:
    Curr_eol = 'Windows (CR LF)'

if editor.getEOLMode() == 1:
    Curr_eol = 'Macintosh (CR)'

if editor.getEOLMode() == 2:
    Curr_eol = 'Unix (LF)'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_wrap = 'NO'

if editor.getWrapMode() == 1:
    Curr_wrap = 'YES'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'[^\r\n]', number)

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    editor.research(r'(?![\r\n])[\x{0000}-\x{007F}]', number)

Total_1_byte = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    editor.research(r'[\x{0080}-\x{07FF}]', number)

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    editor.research(r'(?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]', number)  # ALL BMP vchars ( With PYTHON, the [^\r\n\x{D800}-\x{DFFF}] syntax does NOT work properly !)

Total_2_bytes = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    editor.research(r'(?![\x{D800}-\x{DFFF}])[\x{0800}-\x{FFFF}]', number)

Total_3_bytes = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Total_BMP = Total_1_byte + Total_2_bytes + Total_3_bytes

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'[^\r\n]', number)

Total_standard = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Total_4_bytes = 0  # By default

if Curr_encoding != 'ANSI':
    Total_4_bytes = Total_standard - Total_BMP

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\r|\n', number)

Total_EOL = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Total_chars = Total_EOL + Total_standard

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Curr_encoding == 'ANSI':
    Bytes_length = Total_EOL + Total_1_byte

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    Bytes_length = Total_EOL + Total_1_byte + 2 * Total_2_bytes + 3 * Total_3_bytes + 4 * Total_4_bytes

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    Bytes_length = 2 * Total_EOL + 2 * Total_BMP + 4 * Total_4_bytes

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

BOM = 0  # Default ANSI and UTF-8

if Curr_encoding == 'UTF-8-BOM':
    BOM = 3

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    BOM = 2

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Buffer_length = Bytes_length + BOM

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'[^\r\n\t\x20]', number)

Non_blank_chars = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\w+', number)

Words_count = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI' or Total_4_bytes == 0:
    editor.research(r'\S+', number)
else:
    editor.research(r'(?:(?!\s).[\x{D800}-\x{DFFF}]?)+', number)

Non_space_count = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'(?<!\f)^(?:\r\n|\r|\n)', number)
else:
    editor.research(r'(?<![\f\x{0085}\x{2028}\x{2029}])^(?:\r\n|\r|\n)', number)

Empty_lines = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'(?<!\f)^[\t\x20]+(?:\r\n|\r|\n|\z)', number)
else:
    editor.research(r'(?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]+(?:\r\n|\r|\n|\z)', number)

Blank_lines = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Emp_blk_lines = Empty_lines + Blank_lines

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'(?-s)\r\n|\r|\n|(?:.|\f)\z', number)
else:
    editor.research(r'(?-s)\r\n|\r|\n|(?:.|[\f\x{0085}\x{2028}\x{2029}])\z', number)

Total_lines = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Non_blk_lines = Total_lines - Emp_blk_lines

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Num_sel = editor.getSelections()  # Get ALL selections ( EMPTY or NOT )

# print ('Res = ', Num_sel)

if Num_sel != 0:

    Bytes_count = 0
    Chars_count = 0

    for n in range(Num_sel):

        Bytes_count += editor.getSelectionNEnd(n) - editor.getSelectionNStart(n)

        Chars_count += editor.countCharacters(editor.getSelectionNStart(n), editor.getSelectionNEnd(n))

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Chars_count < 2:
    Txt_chars = ' selected char ('
else:
    Txt_chars = ' selected chars ('

if Bytes_count < 2:
    Txt_bytes = ' selected byte) in '
else:
    Txt_bytes = ' selected bytes) in '

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Num_sel < 2 and Bytes_count == 0:
    Txt_ranges = ' EMPTY range\n'

if Num_sel < 2 and Bytes_count > 0:
    Txt_ranges = ' range\n'

if Num_sel > 1 and Bytes_count == 0:
    Txt_ranges = ' EMPTY ranges\n'

if Num_sel > 1 and Bytes_count > 0:
    Txt_ranges = ' ranges (EMPTY or NOT)\n'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

line_list = []  # empty list

line_list.append ('-' * Line_title)

line_list.append (' ' * int((Line_title - 37) / 2) + 'SUMMARY on ' + str(datetime.datetime.now()))

line_list.append ('-' * Line_title +'\n')

line_list.append (' FULL File Path : ' + File_name + '\n')

if os.path.isfile(File_name) == True:

    line_list.append(' CREATION Date : ' + Creation_date)

    line_list.append(' MODIFICATION Date : ' + Modif_date + '\n')

    line_list.append(' READ-ONLY flag : ' + RO_flag )

line_list.append (' READ-ONLY editor : ' + RO_editor + '\n\n')

line_list.append (' Current VIEW : ' + Curr_view + '\n')

line_list.append (' Current ENCODING : ' + Curr_encoding + '\n')

line_list.append (' Current LANGUAGE : ' + str(Curr_lang) + ' (' + Lang_desc + ')\n')

line_list.append (' Current Line END : ' + Curr_eol + '\n')

line_list.append (' Current WRAPPING : ' + Curr_wrap + '\n\n')

line_list.append (' 1-BYTE Chars : ' + str(Total_1_byte))

line_list.append (' 2-BYTES Chars : ' + str(Total_2_bytes))

line_list.append (' 3-BYTES Chars : ' + str(Total_3_bytes) + '\n')

line_list.append (' Sum BMP Chars : ' + str(Total_BMP))

line_list.append (' 4-BYTES Chars : ' + str(Total_4_bytes) + '\n')

line_list.append (' CHARS w/o CR & LF : ' + str(Total_standard))

line_list.append (' EOL ( CR or LF ) : ' + str(Total_EOL) + '\n')

line_list.append (' TOTAL characters : ' + str(Total_chars) + '\n\n')

if Curr_encoding == 'ANSI':
    line_list.append (' BYTES Length : ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b)')

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    line_list.append (' BYTES Length : ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b + '\
                      + str(Total_2_bytes) + ' x 2b + ' + str(Total_3_bytes) + ' x 3b + ' + str(Total_4_bytes) + ' x 4b)')

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    line_list.append (' BYTES Length : ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 2 + ' + str(Total_BMP) + ' x 2b + ' + str(Total_4_bytes) + ' x 4b)')

line_list.append (' Byte Order Mark : ' + str(BOM) + '\n')

line_list.append (' BUFFER Length : ' + str(Buffer_length))

if os.path.isfile(File_name) == True:
    line_list.append (' Length on DISK : ' + str(Size_length) + '\n\n')
else:
    line_list.append ('\n')

line_list.append (' NON-Blank Chars : ' + str(Non_blank_chars) + '\n')

line_list.append (' WORDS Count : ' + str(Words_count) + ' (Caution !)\n')

line_list.append (' NON-SPACE Count : ' + str(Non_space_count) + '\n\n')

line_list.append (' True EMPTY lines : ' + str(Empty_lines))

line_list.append (' True BLANK lines : ' + str(Blank_lines) + '\n')

line_list.append (' EMPTY/BLANK lines : ' + str(Emp_blk_lines) + '\n')

line_list.append (' NON-BLANK lines : ' + str(Non_blk_lines))

line_list.append (' TOTAL Lines : ' + str(Total_lines) + '\n\n')

line_list.append (' SELECTION(S) : ' + str(Chars_count) + Txt_chars + str(Bytes_count) + Txt_bytes + str(Num_sel) + Txt_ranges)

editor.copyText ('\r\n'.join(line_list))

notepad.new()

editor.paste()

editor.copyText('')

if St_bar != 'ANSI' and St_bar != 'UTF-8' and St_bar != 'UTF-8-BOM' and St_bar != 'UTF-16 BE BOM' and St_bar != 'UTF-16 LE BOM':

    if Curr_encoding == 'UTF-8':  # SAME value for both an 'UTF-8' or 'ANSI' file, when RE-INTERPRETED with the 'Encoding > Character Set > ...' feature

        notepad.prompt ('CURRENT file re-interpreted as ' + St_bar + ' => Possible ERRONEOUS results' + \
                        '\nSo, CLOSE the file WITHOUT saving, RESTORE it (CTRL + SHIFT + T) and RESTART script', '!!! WARNING !!!', '')

# ----Aé☀𝜜-----------------------------------------------------------------------------------------------------------------------------------------------------
 The way to use this script is quite self-explanatory. Just three points to emphasize :

- On the BYTES Length line, the values between parentheses :
  - Always begin with the number of EOL characters ( I omitted the b after x 1, on purpose ! )
  - Are followed by the number of 1-BYTE chars, for an ANSI encoded file
  - Are followed by the numbers of 1-BYTE, 2-BYTES, 3-BYTES and 4-BYTES chars, for a UTF-8 or UTF-8-BOM encoded file
  - Are followed by the numbers of BMP chars ( 2 bytes each ) and 4-BYTES chars, for a UTF-16 BE BOM or UTF-16 LE BOM encoded file

- Normally, when a file is saved, the values BUFFER Length and Length on DISK should always be equal. If not, two cases are possible :
  - The file has been modified since it was last saved ( trivial case )
  - The file is not identified with a BOM and has been re-interpreted with another NON-Unicode encoding. Then, apply the actions indicated in the pop-up message !

- For a new # file, some values are obviously absent. These are the CREATION Date, the MODIFICATION Date, the READ-ONLY flag and the Length on DISK ( size ) values

 Best Regards, guy038
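 The byte arithmetic behind the BYTES Length and BUFFER Length lines can be cross-checked outside Notepad++. The following is only a minimal sketch in plain Python 3, not part of the script: the names classify, buffer_length and BOM_SIZES are hypothetical, and it handles only the four Unicode encodings, not ANSI.

```python
# Standalone Python 3 sketch (no Npp module needed). All names are illustrative.

# BOM sizes in bytes, as used by the script's BOM variable
BOM_SIZES = {'UTF-8': 0, 'UTF-8-BOM': 3, 'UTF-16 BE BOM': 2, 'UTF-16 LE BOM': 2}

def classify(text):
    """Count non-EOL chars by their UTF-8 size (1..4 bytes) and count CR/LF chars."""
    counts = {1: 0, 2: 0, 3: 0, 4: 0}
    eol = 0
    for ch in text:
        cp = ord(ch)
        if ch in '\r\n':
            eol += 1
        elif cp < 0x80:
            counts[1] += 1
        elif cp < 0x800:
            counts[2] += 1
        elif cp < 0x10000:
            counts[3] += 1
        else:
            counts[4] += 1          # characters beyond the BMP
    return counts, eol

def buffer_length(text, label):
    """Same formulas as the script: encoded length of the text plus the BOM size."""
    counts, eol = classify(text)
    if label in ('UTF-8', 'UTF-8-BOM'):
        body = eol + counts[1] + 2 * counts[2] + 3 * counts[3] + 4 * counts[4]
    else:                           # UTF-16 : every BMP char and EOL takes 2 bytes
        bmp = counts[1] + counts[2] + counts[3]
        body = 2 * eol + 2 * bmp + 4 * counts[4]
    return body + BOM_SIZES[label]

# 'A', 'é', U+2600 and U+1D71C, followed by CR LF
sample = 'A\u00e9\u2600\U0001D71C\r\n'
print(buffer_length(sample, 'UTF-8'), len(sample.encode('utf-8')))
```

 For a saved file, the value returned for the matching label is what one would expect to see as Length on DISK.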
- 
- 
 @guy038 said in Tests and impressions on the "View > Summary..." functionality:

    editor.copyText ('\r\n'.join(line_list))
    notepad.new()
    editor.paste()
    editor.copyText('')

 Couldn't you just do

    notepad.new()
    editor.setText('\r\n'.join(line_list))

 and thus avoid overwriting the user's clipboard?
- 
 Hello, All,

- So, I followed @mark-olson's excellent suggestion to bypass the clipboard !

- Now, if a RuntimeError occurs while computing the NON-SPACE count, the script catches the exception and sets the Err_Regex variable to True, so that a warning message is added to the report. But, even when the Err_Regex variable is False, the result is not totally guaranteed either, if the analyzed file contains characters beyond the BMP.

 So, globally, whatever the Err_Regex status, the NON-SPACE count value may be off by 1, in some cases ( still unclear ) !
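 One way to sanity-check a doubtful NON-SPACE count is to recount the runs on the same text outside Notepad++. This is only a sketch under assumptions: it uses the standard Python 3 re module, not the Boost engine behind editor.research, and Python's \s does not cover exactly the same characters ( for instance, U+200B is not whitespace for Python ). The name count_non_space_runs is hypothetical.

```python
import re

# Standard-library pattern; in Python 3, str patterns are Unicode-aware
# and characters beyond the BMP are single code points.
NON_SPACE_RUN = re.compile(r'\S+')

def count_non_space_runs(text):
    """Count maximal runs of non-whitespace characters."""
    return sum(1 for _ in NON_SPACE_RUN.finditer(text))

# Two chars beyond the BMP (U+1D71C) form one run; U+00A0 acts as a separator
sample = 'one two\t\U0001D71C\U0001D71C\u00a0three\r\nfour'
print(count_non_space_runs(sample))
```

 A difference of 1 between this count and the report would at least show which file triggers the discrepancy.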
 Here is the v0.7 version of my script ( I indeed gave a version number to my successive attempts ! )

# encoding=utf-8

#-------------------------------------------------------------------------
#                    STATISTICS about the CURRENT file ( v0.7 )
#-------------------------------------------------------------------------

from __future__ import print_function    # for Python2 compatibility

from Npp import *

import re

import os, time, datetime

import ctypes

from ctypes.wintypes import BOOL, HWND, WPARAM, LPARAM, UINT

# --------------------------------------------------------------------------------------------------------------------------------------------------------------
#  From @alan-kilborn, in post https://community.notepad-plus-plus.org/topic/21733/pythonscript-different-behavior-in-script-vs-in-immediate-mode/4
# --------------------------------------------------------------------------------------------------------------------------------------------------------------

def npp_get_statusbar(statusbar_item_number):

    WNDENUMPROC = ctypes.WINFUNCTYPE(BOOL, HWND, LPARAM)
    FindWindowW = ctypes.windll.user32.FindWindowW
    FindWindowExW = ctypes.windll.user32.FindWindowExW
    SendMessageW = ctypes.windll.user32.SendMessageW
    LRESULT = LPARAM
    SendMessageW.restype = LRESULT
    SendMessageW.argtypes = [ HWND, UINT, WPARAM, LPARAM ]
    EnumChildWindows = ctypes.windll.user32.EnumChildWindows
    GetClassNameW = ctypes.windll.user32.GetClassNameW
    create_unicode_buffer = ctypes.create_unicode_buffer

    SBT_OWNERDRAW = 0x1000
    WM_USER = 0x400; SB_GETTEXTLENGTHW = WM_USER + 12; SB_GETTEXTW = WM_USER + 13

    npp_get_statusbar.STATUSBAR_HANDLE = None

    def get_result_from_statusbar(statusbar_item_number):
        assert statusbar_item_number <= 5
        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTLENGTHW, statusbar_item_number, 0)
        length = retcode & 0xFFFF
        type = (retcode >> 16) & 0xFFFF
        assert (type != SBT_OWNERDRAW)
        text_buffer = create_unicode_buffer(length + 1)  # + 1 : SB_GETTEXTW also writes the terminating NUL char
        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTW, statusbar_item_number, ctypes.addressof(text_buffer))
        retval = '{}'.format(text_buffer[:length])
        return retval

    def EnumCallback(hwnd, lparam):
        curr_class = create_unicode_buffer(256)
        GetClassNameW(hwnd, curr_class, 256)
        if curr_class.value.lower() == "msctls_statusbar32":
            npp_get_statusbar.STATUSBAR_HANDLE = hwnd
            return False  # stop the enumeration
        return True  # continue the enumeration

    npp_hwnd = FindWindowW(u"Notepad++", None)
    EnumChildWindows(npp_hwnd, WNDENUMPROC(EnumCallback), 0)
    if npp_get_statusbar.STATUSBAR_HANDLE: return get_result_from_statusbar(statusbar_item_number)
    assert False

St_bar = npp_get_statusbar(4)  # Zone 4 ( STATUSBARSECTION.UNICODETYPE )

 Continuation on next post

 guy038
- 
- 
 Hi all,

 Continuation of version v0.7 of the script :

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

def number(occ):
    global num
    num += 1

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_encoding = str(notepad.getEncoding())

if Curr_encoding == 'ENC8BIT':
    Curr_encoding = 'ANSI'

if Curr_encoding == 'COOKIE':
    Curr_encoding = 'UTF-8'

if Curr_encoding == 'UTF8':
    Curr_encoding = 'UTF-8-BOM'

if Curr_encoding == 'UCS2BE':
    Curr_encoding = 'UTF-16 BE BOM'

if Curr_encoding == 'UCS2LE':
    Curr_encoding = 'UTF-16 LE BOM'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    Line_title = 95
else:
    Line_title = 75

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

File_name = notepad.getCurrentFilename()

if os.path.isfile(File_name) == True:

    Creation_date = time.ctime(os.path.getctime(File_name))

    Modif_date = time.ctime(os.path.getmtime(File_name))

    Size_length = os.path.getsize(File_name)

    RO_flag = 'YES'

    if os.access(File_name, os.W_OK):
        RO_flag = 'NO'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

RO_editor = 'NO'

if editor.getReadOnly() == True:
    RO_editor = 'YES'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if notepad.getCurrentView() == 0:
    Curr_view = 'MAIN View'
else:
    Curr_view = 'SECONDARY view'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_lang = notepad.getCurrentLang()

Lang_desc = notepad.getLanguageDesc(Curr_lang)

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if editor.getEOLMode() == 0:
    Curr_eol = 'Windows (CR LF)'

if editor.getEOLMode() == 1:
    Curr_eol = 'Macintosh (CR)'

if editor.getEOLMode() == 2:
    Curr_eol = 'Unix (LF)'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_wrap = 'NO'

if editor.getWrapMode() == 1:
    Curr_wrap = 'YES'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'[^\r\n]', number)

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    editor.research(r'(?![\r\n])[\x{0000}-\x{007F}]', number)

Total_1_byte = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    editor.research(r'[\x{0080}-\x{07FF}]', number)

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    editor.research(r'(?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]', number)  # ALL BMP vchars ( With PYTHON, the [^\r\n\x{D800}-\x{DFFF}] syntax does NOT work properly !)

Total_2_bytes = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    editor.research(r'(?![\x{D800}-\x{DFFF}])[\x{0800}-\x{FFFF}]', number)

Total_3_bytes = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Total_BMP = Total_1_byte + Total_2_bytes + Total_3_bytes

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'[^\r\n]', number)

Total_standard = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Total_4_bytes = 0  # By default

if Curr_encoding != 'ANSI':
    Total_4_bytes = Total_standard - Total_BMP

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\r|\n', number)

Total_EOL = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Total_chars = Total_EOL + Total_standard

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Curr_encoding == 'ANSI':
    Bytes_length = Total_EOL + Total_1_byte

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    Bytes_length = Total_EOL + Total_1_byte + 2 * Total_2_bytes + 3 * Total_3_bytes + 4 * Total_4_bytes

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    Bytes_length = 2 * Total_EOL + 2 * Total_BMP + 4 * Total_4_bytes

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

BOM = 0  # Default ANSI and UTF-8

if Curr_encoding == 'UTF-8-BOM':
    BOM = 3

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    BOM = 2

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Buffer_length = Bytes_length + BOM

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'[^\r\n\t\x20]', number)

Non_blank_chars = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\w+', number)

Words_count = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Err_Regex = False

num = 0
if Curr_encoding == 'ANSI' or Total_4_bytes == 0:
    editor.research(r'\S+', number)
else:
    try:
        editor.research(r'(?:(?!\s).[\x{D800}-\x{DFFF}]?)+', number)
    except RuntimeError:
        Err_Regex = True

Non_space_count = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'(?<!\f)^(?:\r\n|\r|\n)', number)
else:
    editor.research(r'(?<![\f\x{0085}\x{2028}\x{2029}])^(?:\r\n|\r|\n)', number)

Empty_lines = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'(?<!\f)^[\t\x20]+(?:\r\n|\r|\n|\z)', number)
else:
    editor.research(r'(?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]+(?:\r\n|\r|\n|\z)', number)

Blank_lines = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Emp_blk_lines = Empty_lines + Blank_lines

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'(?-s)\r\n|\r|\n|(?:.|\f)\z', number)
else:
    editor.research(r'(?-s)\r\n|\r|\n|(?:.|[\f\x{0085}\x{2028}\x{2029}])\z', number)

Total_lines = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Non_blk_lines = Total_lines - Emp_blk_lines

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Num_sel = editor.getSelections()  # Get ALL selections ( EMPTY or NOT )

# print ('Res = ', Num_sel)

if Num_sel != 0:

    Bytes_count = 0
    Chars_count = 0

    for n in range(Num_sel):

        Bytes_count += editor.getSelectionNEnd(n) - editor.getSelectionNStart(n)

        Chars_count += editor.countCharacters(editor.getSelectionNStart(n), editor.getSelectionNEnd(n))

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Chars_count < 2:
    Txt_chars = ' selected char ('
else:
    Txt_chars = ' selected chars ('

if Bytes_count < 2:
    Txt_bytes = ' selected byte) in '
else:
    Txt_bytes = ' selected bytes) in '

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Num_sel < 2 and Bytes_count == 0:
    Txt_ranges = ' EMPTY range\n'

if Num_sel < 2 and Bytes_count > 0:
    Txt_ranges = ' range\n'

if Num_sel > 1 and Bytes_count == 0:
    Txt_ranges = ' EMPTY ranges\n'

if Num_sel > 1 and Bytes_count > 0:
    Txt_ranges = ' ranges (EMPTY or NOT)\n'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

line_list = []  # empty list

line_list.append ('-' * Line_title)

line_list.append (' ' * int((Line_title - 37) / 2) + 'SUMMARY on ' + str(datetime.datetime.now()))

line_list.append ('-' * Line_title +'\n')

line_list.append (' FULL File Path : ' + File_name + '\n')

if os.path.isfile(File_name) == True:

    line_list.append(' CREATION Date : ' + Creation_date)

    line_list.append(' MODIFICATION Date : ' + Modif_date + '\n')

    line_list.append(' READ-ONLY flag : ' + RO_flag )

line_list.append (' READ-ONLY editor : ' + RO_editor + '\n\n')

line_list.append (' Current VIEW : ' + Curr_view + '\n')

line_list.append (' Current ENCODING : ' + Curr_encoding + '\n')

line_list.append (' Current LANGUAGE : ' + str(Curr_lang) + ' (' + Lang_desc + ')\n')

line_list.append (' Current Line END : ' + Curr_eol + '\n')

line_list.append (' Current WRAPPING : ' + Curr_wrap + '\n\n')

line_list.append (' 1-BYTE Chars : ' + str(Total_1_byte))

line_list.append (' 2-BYTES Chars : ' + str(Total_2_bytes))

line_list.append (' 3-BYTES Chars : ' + str(Total_3_bytes) + '\n')

line_list.append (' Sum BMP Chars : ' + str(Total_BMP))

line_list.append (' 4-BYTES Chars : ' + str(Total_4_bytes) + '\n')

line_list.append (' CHARS w/o CR & LF : ' + str(Total_standard))

line_list.append (' EOL ( CR or LF ) : ' + str(Total_EOL) + '\n')

line_list.append (' TOTAL characters : ' + str(Total_chars) + '\n\n')

if Curr_encoding == 'ANSI':
    line_list.append (' BYTES Length : ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b)')

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    line_list.append (' BYTES Length : ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b + '\
                      + str(Total_2_bytes) + ' x 2b + ' + str(Total_3_bytes) + ' x 3b + ' + str(Total_4_bytes) + ' x 4b)')

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    line_list.append (' BYTES Length : ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 2 + ' + str(Total_BMP) + ' x 2b + ' + str(Total_4_bytes) + ' x 4b)')

line_list.append (' Byte Order Mark : ' + str(BOM) + '\n')

line_list.append (' BUFFER Length : ' + str(Buffer_length))

if os.path.isfile(File_name) == True:
    line_list.append (' Length on DISK : ' + str(Size_length) + '\n\n')
else:
    line_list.append ('\n')

line_list.append (' NON-Blank Chars : ' + str(Non_blank_chars) + '\n')

line_list.append (' WORDS Count : ' + str(Words_count) + ' (Caution !)\n')

if Err_Regex == False:
    line_list.append (' NON-SPACE Count : ' + str(Non_space_count) + '\n\n')
else:
    line_list.append (' NON-SPACE Count : ' + str(Non_space_count) + ' (ERROR : Ran out of stack space trying to match the regular expressions !)\n\n')

line_list.append (' True EMPTY lines : ' + str(Empty_lines))

line_list.append (' True BLANK lines : ' + str(Blank_lines) + '\n')

line_list.append (' EMPTY/BLANK lines : ' + str(Emp_blk_lines) + '\n')

line_list.append (' NON-BLANK lines : ' + str(Non_blk_lines))

line_list.append (' TOTAL Lines : ' + str(Total_lines) + '\n\n')

line_list.append (' SELECTION(S) : ' + str(Chars_count) + Txt_chars + str(Bytes_count) + Txt_bytes + str(Num_sel) + Txt_ranges)

notepad.new()

editor.setText('\r\n'.join(line_list))

if St_bar != 'ANSI' and St_bar != 'UTF-8' and St_bar != 'UTF-8-BOM' and St_bar != 'UTF-16 BE BOM' and St_bar != 'UTF-16 LE BOM':

    if Curr_encoding == 'UTF-8':  # SAME value for both an 'UTF-8' or 'ANSI' file, when RE-INTERPRETED with the 'Encoding > Character Set > ...' feature

        notepad.prompt ('CURRENT file re-interpreted as ' + St_bar + ' => Possible ERRONEOUS results' + \
                        '\nSo, CLOSE the file WITHOUT saving, RESTORE it (CTRL + SHIFT + T) and RESTART script', '!!! WARNING !!!', '')

# ----Aé☀𝜜-----------------------------------------------------------------------------------------------------------------------------------------------------
 So, just test this script against any file, to reveal any possible bug or limitation! I've also heard of compiled regexes in Python. Would they be of interest for this script? Best Regards, guy038 
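 On the compiled-regex question, here is a minimal sketch in plain Python, outside Notepad++, assuming only the standard re module. re.compile builds the pattern object once so that it can be reused across many calls. The names NON_SPACE_RUN, WORD_RUN and count_matches are hypothetical and not part of the script. Note that the script's editor.research calls appear to hand the pattern string to the Boost engine used by Notepad++ (the \x{...} syntax is not accepted by Python's re), so compiling would probably only help for text processed by Python itself.

```python
import re

# Compile each pattern once; the compiled objects are reused on every call.
NON_SPACE_RUN = re.compile(r'\S+')
WORD_RUN = re.compile(r'\w+')


def count_matches(pattern, text):
    """Count the non-overlapping matches of a compiled pattern in text."""
    return sum(1 for _ in pattern.finditer(text))


sample = "Hello  world,\tit's\r\na test"
non_space_total = count_matches(NON_SPACE_RUN, sample)  # runs delimited by whitespace
word_total = count_matches(WORD_RUN, sample)            # runs of word characters
print(non_space_total, word_total)
```

 The two totals differ on this sample because the apostrophe splits "it's" into two \w+ runs while it stays a single \S+ run, which is the kind of ambiguity that makes the WORDS count less reliable than the NON-SPACE count.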
- 
 Hi, All, I realized that the line endings in the Summary report were a mess. Thus, by defining a Line_end variable equal to \r\n, the results are more harmonious! One advantage: if you do not want any supplementary line-break in the Summary report, simply change the line:
Line_end = '\r\n'
to this one:
Line_end = ''
So, here is the v0.8 version of my script:
# encoding=utf-8

#-------------------------------------------------------------------------
#                    STATISTICS about the CURRENT file ( v0.8 )
#-------------------------------------------------------------------------

from __future__ import print_function    # for Python2 compatibility

from Npp import *

import re

import os, time, datetime

import ctypes

from ctypes.wintypes import BOOL, HWND, WPARAM, LPARAM, UINT

# --------------------------------------------------------------------------------------------------------------------------------------------------------------
#  From @alan-kilborn, in post https://community.notepad-plus-plus.org/topic/21733/pythonscript-different-behavior-in-script-vs-in-immediate-mode/4
# --------------------------------------------------------------------------------------------------------------------------------------------------------------

def npp_get_statusbar(statusbar_item_number):

    WNDENUMPROC = ctypes.WINFUNCTYPE(BOOL, HWND, LPARAM)
    FindWindowW = ctypes.windll.user32.FindWindowW
    FindWindowExW = ctypes.windll.user32.FindWindowExW
    SendMessageW = ctypes.windll.user32.SendMessageW
    LRESULT = LPARAM
    SendMessageW.restype = LRESULT
    SendMessageW.argtypes = [ HWND, UINT, WPARAM, LPARAM ]
    EnumChildWindows = ctypes.windll.user32.EnumChildWindows
    GetClassNameW = ctypes.windll.user32.GetClassNameW
    create_unicode_buffer = ctypes.create_unicode_buffer

    SBT_OWNERDRAW = 0x1000
    WM_USER = 0x400; SB_GETTEXTLENGTHW = WM_USER + 12; SB_GETTEXTW = WM_USER + 13

    npp_get_statusbar.STATUSBAR_HANDLE = None

    def get_result_from_statusbar(statusbar_item_number):
        assert statusbar_item_number <= 5
        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTLENGTHW, statusbar_item_number, 0)
        length = retcode & 0xFFFF
        type = (retcode >> 16) & 0xFFFF
        assert (type != SBT_OWNERDRAW)
        text_buffer = create_unicode_buffer(length)
        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTW, statusbar_item_number, ctypes.addressof(text_buffer))
        retval = '{}'.format(text_buffer[:length])
        return retval

    def EnumCallback(hwnd, lparam):
        curr_class = create_unicode_buffer(256)
        GetClassNameW(hwnd, curr_class, 256)
        if curr_class.value.lower() == "msctls_statusbar32":
            npp_get_statusbar.STATUSBAR_HANDLE = hwnd
            return False  # stop the enumeration
        return True  # continue the enumeration

    npp_hwnd = FindWindowW(u"Notepad++", None)
    EnumChildWindows(npp_hwnd, WNDENUMPROC(EnumCallback), 0)
    if npp_get_statusbar.STATUSBAR_HANDLE: return get_result_from_statusbar(statusbar_item_number)
    assert False

St_bar = npp_get_statusbar(4)  # Zone 4 ( STATUSBARSECTION.UNICODETYPE )

Continuation on next post
guy038 
- 
 Hi all, Continuation of version v0.8of the script :# -------------------------------------------------------------------------------------------------------------------------------------------------------------- def number(occ): global num num += 1 # -------------------------------------------------------------------------------------------------------------------------------------------------------------- Curr_encoding = str(notepad.getEncoding()) if Curr_encoding == 'ENC8BIT': Curr_encoding = 'ANSI' if Curr_encoding == 'COOKIE': Curr_encoding = 'UTF-8' if Curr_encoding == 'UTF8': Curr_encoding = 'UTF-8-BOM' if Curr_encoding == 'UCS2BE': Curr_encoding = 'UTF-16 BE BOM' if Curr_encoding == 'UCS2LE': Curr_encoding = 'UTF-16 LE BOM' # -------------------------------------------------------------------------------------------------------------------------------------------------------------- if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM': Line_title = 95 else: Line_title = 75 # -------------------------------------------------------------------------------------------------------------------------------------------------------------- File_name = notepad.getCurrentFilename() if os.path.isfile(File_name) == True: Creation_date = time.ctime(os.path.getctime(File_name)) Modif_date = time.ctime(os.path.getmtime(File_name)) Size_length = os.path.getsize(File_name) RO_flag = 'YES' if os.access(File_name, os.W_OK): RO_flag = 'NO' # -------------------------------------------------------------------------------------------------------------------------------------------------------------- RO_editor = 'NO' if editor.getReadOnly() == True: RO_editor = 'YES' # -------------------------------------------------------------------------------------------------------------------------------------------------------------- if notepad.getCurrentView() == 0: Curr_view = 'MAIN View' else: Curr_view = 'SECONDARY view' # 
-------------------------------------------------------------------------------------------------------------------------------------------------------------- Curr_lang = notepad.getCurrentLang() Lang_desc = notepad.getLanguageDesc(Curr_lang) # -------------------------------------------------------------------------------------------------------------------------------------------------------------- if editor.getEOLMode() == 0: Curr_eol = 'Windows (CR LF)' if editor.getEOLMode() == 1: Curr_eol = 'Macintosh (CR)' if editor.getEOLMode() == 2: Curr_eol = 'Unix (LF)' # -------------------------------------------------------------------------------------------------------------------------------------------------------------- Curr_wrap = 'NO' if editor.getWrapMode() == 1: Curr_wrap = 'YES' # -------------------------------------------------------------------------------------------------------------------------------------------------------------- num = 0 if Curr_encoding == 'ANSI': editor.research(r'[^\r\n]', number) if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM': editor.research(r'(?![\r\n])[\x{0000}-\x{007F}]', number) Total_1_byte = num # -------------------------------------------------------------------------------------------------------------------------------------------------------------- num = 0 if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM': editor.research(r'[\x{0080}-\x{07FF}]', number) if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM': editor.research(r'(?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]', number) # ALL BMP vchars ( With PYTHON, the [^\r\n\x{D800}-\x{DFFF}] syntax does NOT work properly !) 
Total_2_bytes = num # -------------------------------------------------------------------------------------------------------------------------------------------------------------- num = 0 if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM': editor.research(r'(?![\x{D800}-\x{DFFF}])[\x{0800}-\x{FFFF}]', number) Total_3_bytes = num # -------------------------------------------------------------------------------------------------------------------------------------------------------------- Total_BMP = Total_1_byte + Total_2_bytes + Total_3_bytes # -------------------------------------------------------------------------------------------------------------------------------------------------------------- num = 0 editor.research(r'[^\r\n]', number) Total_standard = num # -------------------------------------------------------------------------------------------------------------------------------------------------------------- Total_4_bytes = 0 # By default if Curr_encoding != 'ANSI': Total_4_bytes = Total_standard - Total_BMP # -------------------------------------------------------------------------------------------------------------------------------------------------------------- num = 0 editor.research(r'\r|\n', number) Total_EOL = num # -------------------------------------------------------------------------------------------------------------------------------------------------------------- Total_chars = Total_EOL + Total_standard # -------------------------------------------------------------------------------------------------------------------------------------------------------------- if Curr_encoding == 'ANSI': Bytes_length = Total_EOL + Total_1_byte if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM': Bytes_length = Total_EOL + Total_1_byte + 2 * Total_2_bytes + 3 * Total_3_bytes + 4 * Total_4_bytes if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM': Bytes_length = 2 * Total_EOL + 2 * Total_BMP + 4 * Total_4_bytes # 
-------------------------------------------------------------------------------------------------------------------------------------------------------------- BOM = 0 # Default ANSI and UTF-8 if Curr_encoding == 'UTF-8-BOM': BOM = 3 if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM': BOM = 2 # -------------------------------------------------------------------------------------------------------------------------------------------------------------- Buffer_length = Bytes_length + BOM # -------------------------------------------------------------------------------------------------------------------------------------------------------------- num = 0 editor.research(r'[^\r\n\t\x20]', number) Non_blank_chars = num # -------------------------------------------------------------------------------------------------------------------------------------------------------------- num = 0 editor.research(r'\w+', number) Words_count = num # -------------------------------------------------------------------------------------------------------------------------------------------------------------- Err_Regex = False num = 0 if Curr_encoding == 'ANSI' or Total_4_bytes == 0: editor.research(r'\S+', number) else: try: editor.research(r'(?:(?!\s).[\x{D800}-\x{DFFF}]?)+', number) except RuntimeError: Err_Regex = True Non_space_count = num # -------------------------------------------------------------------------------------------------------------------------------------------------------------- num = 0 if Curr_encoding == 'ANSI': editor.research(r'(?<!\f)^(?:\r\n|\r|\n)', number) else: editor.research(r'(?<![\f\x{0085}\x{2028}\x{2029}])^(?:\r\n|\r|\n)', number) Empty_lines = num # -------------------------------------------------------------------------------------------------------------------------------------------------------------- num = 0 if Curr_encoding == 'ANSI': editor.research(r'(?<!\f)^[\t\x20]+(?:\r\n|\r|\n|\z)', number) else: 
editor.research(r'(?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]+(?:\r\n|\r|\n|\z)', number) Blank_lines = num # -------------------------------------------------------------------------------------------------------------------------------------------------------------- Emp_blk_lines = Empty_lines + Blank_lines # -------------------------------------------------------------------------------------------------------------------------------------------------------------- num = 0 if Curr_encoding == 'ANSI': editor.research(r'(?-s)\r\n|\r|\n|(?:.|\f)\z', number) else: editor.research(r'(?-s)\r\n|\r|\n|(?:.|[\f\x{0085}\x{2028}\x{2029}])\z', number) Total_lines = num # -------------------------------------------------------------------------------------------------------------------------------------------------------------- Non_blk_lines = Total_lines - Emp_blk_lines # -------------------------------------------------------------------------------------------------------------------------------------------------------------- Num_sel = editor.getSelections() # Get ALL selections ( EMPTY or NOT ) # print ('Res = ', Num_sel) if Num_sel != 0: Bytes_count = 0 Chars_count = 0 for n in range(Num_sel): Bytes_count += editor.getSelectionNEnd(n) - editor.getSelectionNStart(n) Chars_count += editor.countCharacters(editor.getSelectionNStart(n), editor.getSelectionNEnd(n)) # -------------------------------------------------------------------------------------------------------------------------------------------------------------- if Chars_count < 2: Txt_chars = ' selected char (' else: Txt_chars = ' selected chars (' if Bytes_count < 2: Txt_bytes = ' selected byte) in ' else: Txt_bytes = ' selected bytes) in ' # -------------------------------------------------------------------------------------------------------------------------------------------------------------- if Num_sel < 2 and Bytes_count == 0: Txt_ranges = ' EMPTY range\n' if Num_sel < 2 and Bytes_count > 0: Txt_ranges = ' 
range\n' if Num_sel > 1 and Bytes_count == 0: Txt_ranges = ' EMPTY ranges\n' if Num_sel > 1 and Bytes_count > 0: Txt_ranges = ' ranges (EMPTY or NOT)\n' # -------------------------------------------------------------------------------------------------------------------------------------------------------------- line_list = [] # empty list Line_end = '\r\n' line_list.append ('-' * Line_title) line_list.append (' ' * int((Line_title - 37) / 2) + 'SUMMARY on ' + str(datetime.datetime.now())) line_list.append ('-' * Line_title + Line_end) line_list.append (' FULL File Path : ' + File_name + Line_end) if os.path.isfile(File_name) == True: line_list.append(' CREATION Date : ' + Creation_date) line_list.append(' MODIFICATION Date : ' + Modif_date + Line_end) line_list.append(' READ-ONLY flag : ' + RO_flag ) line_list.append (' READ-ONLY editor : ' + RO_editor + Line_end * 2) line_list.append (' Current VIEW : ' + Curr_view + Line_end) line_list.append (' Current ENCODING : ' + Curr_encoding + Line_end) line_list.append (' Current LANGUAGE : ' + str(Curr_lang) + ' (' + Lang_desc + ')' + Line_end) line_list.append (' Current Line END : ' + Curr_eol + Line_end) line_list.append (' Current WRAPPING : ' + Curr_wrap + Line_end * 2) line_list.append (' 1-BYTE Chars : ' + str(Total_1_byte)) line_list.append (' 2-BYTES Chars : ' + str(Total_2_bytes)) line_list.append (' 3-BYTES Chars : ' + str(Total_3_bytes) + Line_end) line_list.append (' Sum BMP Chars : ' + str(Total_BMP)) line_list.append (' 4-BYTES Chars : ' + str(Total_4_bytes) + Line_end) line_list.append (' CHARS w/o CR & LF : ' + str(Total_standard)) line_list.append (' EOL ( CR or LF ) : ' + str(Total_EOL) + Line_end) line_list.append (' TOTAL characters : ' + str(Total_chars) + Line_end * 2) if Curr_encoding == 'ANSI': line_list.append (' BYTES Length : ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b)') if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM': line_list.append 
(' BYTES Length : ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b + '\ + str(Total_2_bytes) + ' x 2b + ' + str(Total_3_bytes) + ' x 3b + ' + str(Total_4_bytes) + ' x 4b)') if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM': line_list.append (' BYTES Length : ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 2 + ' + str(Total_BMP) + ' x 2b + ' + str(Total_4_bytes) + ' x 4b)') line_list.append (' Byte Order Mark : ' + str(BOM) + Line_end) line_list.append (' BUFFER Length : ' + str(Buffer_length)) if os.path.isfile(File_name) == True: line_list.append (' Length on DISK : ' + str(Size_length) + Line_end * 2) else: line_list.append ('\n') line_list.append (' NON-Blank Chars : ' + str(Non_blank_chars) + Line_end) line_list.append (' WORDS Count : ' + str(Words_count) + ' (Caution !)' + Line_end) if Err_Regex == False: line_list.append (' NON-SPACE Count : ' + str(Non_space_count) + Line_end * 2) else: line_list.append (' NON-SPACE Count : ' + str(Non_space_count) + ' (ERROR : Ran out of stack space trying to match the regular expressions !)' + Line_end * 2) line_list.append (' True EMPTY lines : ' + str(Empty_lines)) line_list.append (' True BLANK lines : ' + str(Blank_lines) + Line_end) line_list.append (' EMPTY/BLANK lines : ' + str(Emp_blk_lines) + Line_end) line_list.append (' NON-BLANK lines : ' + str(Non_blk_lines)) line_list.append (' TOTAL Lines : ' + str(Total_lines) + Line_end * 2) line_list.append (' SELECTION(S) : ' + str(Chars_count) + Txt_chars + str(Bytes_count) + Txt_bytes + str(Num_sel) + Txt_ranges) notepad.new() editor.setText('\r\n'.join(line_list)) if St_bar != 'ANSI' and St_bar != 'UTF-8' and St_bar != 'UTF-8-BOM' and St_bar != 'UTF-16 BE BOM' and St_bar != 'UTF-16 LE BOM': if Curr_encoding == 'UTF-8': # SAME value for both an 'UTF-8' or 'ANSI' file, when RE-INTERPRETED with the 'Encoding > Character Set > ...' 
feature notepad.prompt ('CURRENT file re-interpreted as ' + St_bar + ' => Possible ERRONEOUS results' + \ '\nSo, CLOSE the file WITHOUT saving, RESTORE it (CTRL + SHIFT + T) and RESTART script', '!!! WARNING !!!', '') # ----Aé☀𝜜-----------------------------------------------------------------------------------------------------------------------------------------------------
 Best Regards, guy038 
- 
 Hi, All, You'll find, below, the v1.0 version of my script. I changed a lot of things:
- 
I added a timer to get the execution time of the script, which is written right after the current date, at the beginning of the summary 
- 
I modified some regexes, as well as the order in which they are run, in order to improve performance 
- 
I used the PythonScript methods editor.getLength(), editor.countCharacters(0, editor.getLength()) and editor.getLineCount() to get, respectively, the Bytes_length value (without a possible BOM), the Total_chars value and the Total_lines value. Note that, in the case of a UTF-8 or UTF-8-BOM encoded file, we get two relations:
- (A) Bytes_length - Total_EOL - Total_1_byte - 2 × Total_2_bytes - 3 × Total_3_bytes = 4 × Total_4_bytes
- (B) Total_chars - Total_EOL - Total_1_byte - Total_2_bytes - Total_3_bytes = Total_4_bytes
 
 Subtracting (B) from (A) gives Bytes_length - Total_chars - Total_2_bytes - 2 × Total_3_bytes = 3 × Total_4_bytes. So, we can deduce the equations:
Total_4_bytes = ( Bytes_length - Total_chars - Total_2_bytes - 2 × Total_3_bytes ) / 3
and then:
Total_1_byte = Total_chars - Total_EOL - Total_2_bytes - Total_3_bytes - Total_4_bytes
Thus, after counting Total_2_bytes and Total_3_bytes, the two results Total_4_bytes and Total_1_byte are easily deduced. This new approach makes the script 2 to 3 times faster because, most of the time, the file contains only 1-byte chars :-))
However, in the case of a UTF-16 BE BOM or UTF-16 LE BOM encoded file, the length obtained this way wrongly remains the same. Thus, I needed to calculate the Total_4_bytes and Bytes_length values from Total_2_bytes, with the relations:
Total_4_bytes = Total_chars - Total_EOL - Total_2_bytes
Bytes_length = 2 × Total_EOL + 2 × Total_2_bytes + 4 × Total_4_bytes
- 
Now, because huge files may take a long time to produce the Summary results (even with the native N++ version, BTW!), you can follow the progress of the different searches in the Python console, which is automatically shown at the beginning of the script and hidden right before the results are output
- 
At the end of the script, I simply replaced the notepad.prompt method with the notepad.messageBox method to display the warning (more logical!)
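 The byte-count relations from the list above can be checked outside Notepad++. The following is only a sketch, assuming Python 3 (where len() of a str counts code points) and a text without lone surrogates. The function names utf8_class_counts and utf16_bytes_length are hypothetical, and integer division // is used so that the 4-byte total stays an integer.

```python
def utf8_class_counts(text):
    """Deduce the 1-byte and 4-byte totals from the byte length, the
    character count and the directly counted 2-byte and 3-byte totals."""
    bytes_length = len(text.encode('utf-8'))   # length without any BOM
    total_chars = len(text)                    # code points (Python 3)
    total_eol = sum(1 for ch in text if ch in '\r\n')
    total_2 = sum(1 for ch in text if 0x80 <= ord(ch) <= 0x7FF)
    total_3 = sum(1 for ch in text if 0x800 <= ord(ch) <= 0xFFFF)
    # Relation (A) - (B): 3 x Total_4_bytes
    total_4 = (bytes_length - total_chars - total_2 - 2 * total_3) // 3
    total_1 = total_chars - total_eol - total_2 - total_3 - total_4
    return {'eol': total_eol, '1': total_1, '2': total_2,
            '3': total_3, '4': total_4, 'bytes': bytes_length}


def utf16_bytes_length(counts):
    """UTF-16 length (without BOM) deduced from the same totals."""
    total_bmp = counts['1'] + counts['2'] + counts['3']
    return 2 * counts['eol'] + 2 * total_bmp + 4 * counts['4']


# 'A', e-acute, U+2600, U+1D71C, then CR LF
sample = 'A\u00e9\u2600\U0001d71c\r\n'
counts = utf8_class_counts(sample)
print(counts, utf16_bytes_length(counts))
```

 On this sample, the deduced totals agree with what the encoders report, which is a cheap way to convince oneself that only the 2-byte and 3-byte classes need an actual search.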
 
 IMPORTANT:
- 
Never switch to another tab while this script is running. Otherwise, you'll probably get unpredictable or negative results! 
- 
If, from the console messages, you think that the results are taking too long for a specific file and you prefer to abort its Summary report, simply stop the current Python script with the usual Plugins > Python Script > Stop script menu option
 
 Now, I was a bit upset by some inconsistent results regarding the number of NON-SPACE strings, when the current file, with a Unicode encoding, contains some characters over the BMP. So, I searched among all my posts, since 2013, as well as some others used as documentation, for only those containing some four-byte characters, and here is the list of these files with the reported results:
•=============================•===========•=================•==================•============•================•
|                             |           |    Expected     |  Summary Report  |            |                |
|          Filename           |  4_BYTES  |          NON-SPACE count           | Difference |    Encoding    |
|                             |           |  (?:(?!\s).[\x{D800}-\x{DFFF}]?)+  |            |                |
•=============================•===========•=================•==================•============•================•
| Symbola_Monospacified.txt   |    11,951 |         199,891 |          199,882 |        - 9 | UTF-8-BOM      |
| Total_Chars.txt             |   262,136 |               9 |               18 |        + 9 | UTF-8-BOM      |
•=============================•===========•=================•==================•============•================•
| Caractères.txt              |     2,901 |           7,361 |            7,358 |        - 3 | UTF-8-BOM      |
| Test_2.txt                  |     1,276 |               8 |                9 |        + 1 | UTF-8          |
| Test_1.txt                  |       881 |               8 |                9 |        + 1 | UTF-8          |
| Plane_0.txt                 |         0 |               9 |               10 |        + 1 | UCS-2 BE BOM   |
| Clemens.txt                 |     3,968 |           2,816 |            2,818 |        + 2 | UTF-8-BOM      |
| Planes_0+1.txt              |    65,534 |               9 |               12 |        + 3 | UTF-8-BOM      |
•=============================•===========•=================•==================•============•================•
| Chars_Over_BMP.txt          |        28 |             455 |              455 |          0 | UTF-8-BOM      |
| Entites_by_Name.txt         |       133 |          15,968 |           15,968 |          0 | UTF-8          |
| Entites_by_Number.txt       |       133 |          15,968 |           15,968 |          0 | UTF-8          |
| Invisible_chars.txt         |        31 |           3,459 |            3,459 |          0 | UTF-8-BOM      |
| Osmanya_Tout.txt            |       119 |             605 |              605 |          0 | UTF-8-BOM      |
| Smileys.txt                 |     1,031 |          10,157 |           10,157 |          0 | UTF-8-BOM      |
| Alan_K.txt                  |       114 |          46,082 |           46,082 |          0 | UTF-8          |
| Alexolog.txt                |        13 |           2,199 |            2,199 |          0 | UTF-8          |
| André_Z.txt                 |         8 |           5,860 |            5,860 |          0 | UTF-8          |
| Bidule.txt                  |         1 |             327 |              327 |          0 | UTF-8          |
| Carypt.txt                  |         1 |           3,551 |            3,551 |          0 | UTF-8          |
| Dean_Corso.txt              |       761 |           9,632 |            9,632 |          0 | UTF-8          |
| Don_Ho.txt                  |         2 |          41,426 |           41,426 |          0 | UTF-8          |
| Durkin.txt                  |       144 |           4,638 |            4,638 |          0 | UTF-8          |
| Dylan.txt                   |        34 |           2,180 |            2,180 |          0 | UTF-8          |
| Furek.txt                   |        20 |             499 |              499 |          0 | UTF-8          |
| Gary_2.txt                  |         2 |             458 |              458 |          0 | UTF-8          |
| Haleba.txt                  |         5 |             817 |              817 |          0 | UTF-8          |
| ImSpecial.txt               |         1 |             161 |              161 |          0 | UTF-8          |
| Joss.txt                    |         6 |             105 |              105 |          0 | UTF-8          |
| JR.txt                      |        39 |           1,735 |            1,735 |          0 | UTF-8          |
| Mark_Olson.txt              |         1 |           3,652 |            3,652 |          0 | UTF-8          |
| Minus_Majus.txt             |        62 |           9,931 |            9,931 |          0 | UTF-8          |
| Niting-jain.txt             |         4 |             537 |              537 |          0 | UTF-8          |
| PeterCJ.txt                 |        31 |          37,323 |           37,323 |          0 | UTF-8          |
| Petr_jaja.txt               |        14 |           3,168 |            3,168 |          0 | UTF-8          |
| Pintas.txt                  |         4 |             614 |              614 |          0 | UTF-8          |
| Register.txt                |        20 |             242 |              242 |          0 | UTF-8          |
| Scott_3.txt                 |         4 |          42,552 |           42,552 |          0 | UTF-8          |
| Skevich.txt                 |         6 |             715 |              715 |          0 | UTF-8          |
| Statistiques.txt            |         7 |           9,012 |            9,012 |          0 | UTF-8          |
| Summary.txt                 |         7 |           4,322 |            4,322 |          0 | UTF-8          |
| Summary_NEW.txt             |        10 |           8,903 |            8,903 |          0 | UTF-8          |
| Uzivatel.txt                |         2 |             873 |              873 |          0 | UTF-8          |
| Xavier_mdq.txt              |        13 |           3,652 |            3,652 |          0 | UTF-8          |
| Text.txt                    |     2,400 |           1,000 |            1,000 |          0 | UTF-8          |
•=============================•===========•=================•==================•============•================•
From that list, I deduced that the NON-SPACE count is erroneous in very rare cases, especially when the current file contains, consecutively:
- 
All the characters of a font 
- 
All the characters of a Unicode range
- 
All the characters of all Unicode ranges
 Luckily, in all the other cases, where these four-byte chars occur at random positions, the Summary report always gives the right results for the NON-SPACE count!
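 One possible way to cross-check a NON-SPACE count outside Notepad++ is sketched below. It assumes Python 3, where a str is indexed by code point, so four-byte chars need no surrogate handling. As delimiters, it assumes the 25 White_Space code points listed in PropList.txt, which may differ slightly from what the Boost \s matches. The names WHITE_SPACE, NON_SPACE_RUN and count_non_space_runs are hypothetical.

```python
import re

# The 25 code points with the White_Space property (PropList.txt):
# 0009-000D, 0020, 0085, 00A0, 1680, 2000-200A, 2028, 2029, 202F, 205F, 3000
WHITE_SPACE = ('\t\n\x0b\x0c\r\x20\x85\xa0\u1680\u2000-\u200a'
               '\u2028\u2029\u202f\u205f\u3000')

# A NON-SPACE string is a maximal run of characters outside that set.
NON_SPACE_RUN = re.compile('[^' + WHITE_SPACE + ']+')


def count_non_space_runs(text):
    """Return the number of maximal runs of non-White_Space characters."""
    return len(NON_SPACE_RUN.findall(text))


# Mixes BMP chars, chars over the BMP and several kinds of spaces.
sample = 'abc \U0001d71c\U0001d71c\u3000x\r\n\u2003\U0001f150 end'
runs = count_non_space_runs(sample)
print(runs)
```

 Comparing such a figure with the one from the Summary report, on a file made of consecutive four-byte chars, would show whether the difference comes from the regex or from the file itself.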
 Here is the v1.0 version of my script, split over two posts:
# encoding=utf-8

#-------------------------------------------------------------------------
#                    STATISTICS about the CURRENT file ( v1.0 )
#-------------------------------------------------------------------------

from __future__ import print_function    # for Python2 compatibility

from Npp import *

import re

import os, time, datetime

import ctypes

from ctypes.wintypes import BOOL, HWND, WPARAM, LPARAM, UINT

# --------------------------------------------------------------------------------------------------------------------------------------------------------------
#  From @alan-kilborn, in post https://community.notepad-plus-plus.org/topic/21733/pythonscript-different-behavior-in-script-vs-in-immediate-mode/4
# --------------------------------------------------------------------------------------------------------------------------------------------------------------

def npp_get_statusbar(statusbar_item_number):

    WNDENUMPROC = ctypes.WINFUNCTYPE(BOOL, HWND, LPARAM)
    FindWindowW = ctypes.windll.user32.FindWindowW
    FindWindowExW = ctypes.windll.user32.FindWindowExW
    SendMessageW = ctypes.windll.user32.SendMessageW
    LRESULT = LPARAM
    SendMessageW.restype = LRESULT
    SendMessageW.argtypes = [ HWND, UINT, WPARAM, LPARAM ]
    EnumChildWindows = ctypes.windll.user32.EnumChildWindows
    GetClassNameW = ctypes.windll.user32.GetClassNameW
    create_unicode_buffer = ctypes.create_unicode_buffer

    SBT_OWNERDRAW = 0x1000
    WM_USER = 0x400; SB_GETTEXTLENGTHW = WM_USER + 12; SB_GETTEXTW = WM_USER + 13

    npp_get_statusbar.STATUSBAR_HANDLE = None

    def get_result_from_statusbar(statusbar_item_number):
        assert statusbar_item_number <= 5
        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTLENGTHW, statusbar_item_number, 0)
        length = retcode & 0xFFFF
        type = (retcode >> 16) & 0xFFFF
        assert (type != SBT_OWNERDRAW)
        text_buffer = create_unicode_buffer(length)
        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTW, statusbar_item_number, ctypes.addressof(text_buffer))
        retval = '{}'.format(text_buffer[:length])
        return retval

    def EnumCallback(hwnd, lparam):
        curr_class = create_unicode_buffer(256)
        GetClassNameW(hwnd, curr_class, 256)
        if curr_class.value.lower() == "msctls_statusbar32":
            npp_get_statusbar.STATUSBAR_HANDLE = hwnd
            return False  # stop the enumeration
        return True  # continue the enumeration

    npp_hwnd = FindWindowW(u"Notepad++", None)
    EnumChildWindows(npp_hwnd, WNDENUMPROC(EnumCallback), 0)
    if npp_get_statusbar.STATUSBAR_HANDLE: return get_result_from_statusbar(statusbar_item_number)
    assert False

St_bar = npp_get_statusbar(4)  # Zone 4 ( STATUSBARSECTION.UNICODETYPE )

Continuation on next post
guy038 
- 
- 
 Hi all, Continuation of version v1.0of the script :# -------------------------------------------------------------------------------------------------------------------------------------------------------------- def number(occ): global num num += 1 console.show() console.clear() Start_time = time.time() # -------------------------------------------------------------------------------------------------------------------------------------------------------------- Curr_encoding = str(notepad.getEncoding()) if Curr_encoding == 'ENC8BIT': Curr_encoding = 'ANSI' if Curr_encoding == 'COOKIE': Curr_encoding = 'UTF-8' if Curr_encoding == 'UTF8': Curr_encoding = 'UTF-8-BOM' if Curr_encoding == 'UCS2BE': Curr_encoding = 'UTF-16 BE BOM' if Curr_encoding == 'UCS2LE': Curr_encoding = 'UTF-16 LE BOM' # -------------------------------------------------------------------------------------------------------------------------------------------------------------- if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM': Line_title = 95 else: Line_title = 75 # -------------------------------------------------------------------------------------------------------------------------------------------------------------- File_name = notepad.getCurrentFilename() if os.path.isfile(File_name) == True: Creation_date = time.ctime(os.path.getctime(File_name)) Modif_date = time.ctime(os.path.getmtime(File_name)) Size_length = os.path.getsize(File_name) RO_flag = 'YES' if os.access(File_name, os.W_OK): RO_flag = 'NO' # -------------------------------------------------------------------------------------------------------------------------------------------------------------- RO_editor = 'NO' if editor.getReadOnly() == True: RO_editor = 'YES' # -------------------------------------------------------------------------------------------------------------------------------------------------------------- if notepad.getCurrentView() == 0: Curr_view = 'MAIN View' else: Curr_view = 'SECONDARY view' # 
-------------------------------------------------------------------------------------------------------------------------------------------------------------- Curr_lang = notepad.getCurrentLang() Lang_desc = notepad.getLanguageDesc(Curr_lang) # -------------------------------------------------------------------------------------------------------------------------------------------------------------- if editor.getEOLMode() == 0: Curr_eol = 'Windows (CR LF)' if editor.getEOLMode() == 1: Curr_eol = 'Macintosh (CR)' if editor.getEOLMode() == 2: Curr_eol = 'Unix (LF)' # -------------------------------------------------------------------------------------------------------------------------------------------------------------- Curr_wrap = 'NO' if editor.getWrapMode() == 1: Curr_wrap = 'YES' # -------------------------------------------------------------------------------------------------------------------------------------------------------------- print ('START') # -------------------------------------------------------------------------------------------------------------------------------------------------------------- Bytes_length = editor.getLength() Total_chars = editor.countCharacters(0, editor.getLength()) # -------------------------------------------------------------------------------------------------------------------------------------------------------------- num = 0 editor.research(r'\r|\n', number) Total_EOL = num print ('EOL') # -------------------------------------------------------------------------------------------------------------------------------------------------------------- Total_standard = Total_chars - Total_EOL # -------------------------------------------------------------------------------------------------------------------------------------------------------------- if Curr_encoding == 'ANSI': Total_BMP = Total_standard Total_1_byte = Total_BMP Total_2_bytes = 0 Total_3_bytes = 0 Total_4_bytes = 0 # 
--------------------------------------------------------------------------------------------------------------------------------------------------------------

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':

    num = 0
    editor.research(r'[\x{0080}-\x{07FF}]', number)
    Total_2_bytes = num

    print ('2-BYTES')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

    num = 0
    editor.research(r'[\x{0800}-\x{D7FF}\x{E000}-\x{FFFF}]', number)
    Total_3_bytes = num

    print ('3-BYTES')

# -----------------------------------------------------------------------------------------------------------------------------

    Total_4_bytes = ( Bytes_length - Total_chars - Total_2_bytes - 2 * Total_3_bytes ) / 3
    Total_1_byte = Total_standard - Total_2_bytes - Total_3_bytes - Total_4_bytes
    Total_BMP = Total_1_byte + Total_2_bytes + Total_3_bytes

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':

    num = 0
    editor.research(r'(?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]', number)  # ALL BMP chars different from '\r' and '\n'
    Total_2_bytes = num

    Total_4_bytes = Total_standard - Total_2_bytes
    Total_BMP = Total_2_bytes
    Total_1_byte = 0
    Total_3_bytes = 0
    Bytes_length = 2 * Total_EOL + 2 * Total_BMP + 4 * Total_4_bytes

    print ('2-BYTES')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

BOM = 0  # Default ANSI and UTF-8

if Curr_encoding == 'UTF-8-BOM':
    BOM = 3

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    BOM = 2

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Buffer_length = Bytes_length + BOM

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\t|\x20', number)
Non_blank_chars = Total_standard - num

print ('NON-BLANK')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\w+', number)
Words_count = num

print ('WORDS')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Err_regex = False
num = 0

if Curr_encoding == 'ANSI' or Total_4_bytes == 0:
    editor.research(r'\S+', number)
else:
    try:
        editor.research(r'(?:(?!\s).[\x{D800}-\x{DFFF}]?)+', number)
    except RuntimeError:
        Err_regex = True

Non_space_count = num

print ('NON-SPACE')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'\f^(?:\r\n|\r|\n)', number)
else:
    editor.research(r'[\f\x{0085}\x{2028}\x{2029}]^(?:\r\n|\r|\n)', number)
Special_empty = num

num = 0
editor.research(r'^(?:\r\n|\r|\n)', number)
Default_empty = num

Empty_lines = Default_empty - Special_empty

print ('EMPTY lines')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'\f^[\t\x20]+(?:\r\n|\r|\n|\z)', number)
else:
    editor.research(r'[\f\x{0085}\x{2028}\x{2029}]^[\t\x20]+(?:\r\n|\r|\n|\z)', number)
Special_blank = num

num = 0
editor.research(r'^[\t\x20]+(?:\r\n|\r|\n|\z)', number)
Default_blank = num

Blank_lines = Default_blank - Special_blank

print ('BLANK lines')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Emp_blk_lines = Empty_lines + Blank_lines

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Total_lines = editor.getLineCount()

num = 0
editor.research(r'(?-s)^.+\z', number)

if num == 0:
    Total_lines = Total_lines - 1

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Non_blk_lines = Total_lines - Emp_blk_lines

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Num_sel = editor.getSelections()  # Get ALL selections ( EMPTY or NOT )

if Num_sel != 0:

    Bytes_count = 0
    Chars_count = 0

    for n in range(Num_sel):

        Bytes_count += editor.getSelectionNEnd(n) - editor.getSelectionNStart(n)
        Chars_count += editor.countCharacters(editor.getSelectionNStart(n), editor.getSelectionNEnd(n))

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Chars_count < 2:
    Txt_chars = ' selected char ('
else:
    Txt_chars = ' selected chars ('

if Bytes_count < 2:
    Txt_bytes = ' selected byte) in '
else:
    Txt_bytes = ' selected bytes) in '

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Num_sel < 2 and Bytes_count == 0:
    Txt_ranges = ' EMPTY range\n'

if Num_sel < 2 and Bytes_count > 0:
    Txt_ranges = ' range\n'

if Num_sel > 1 and Bytes_count == 0:
    Txt_ranges = ' EMPTY ranges\n'

if Num_sel > 1 and Bytes_count > 0:
    Txt_ranges = ' ranges (EMPTY or NOT)\n'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

console.hide()

line_list = []  # empty list

Line_end = '\r\n'

line_list.append ('-' * Line_title)
line_list.append (' ' * int((Line_title - 54) / 2) + 'SUMMARY on ' + str(datetime.datetime.now()) + ' ( ' + str(time.time() - Start_time) + ' )')
line_list.append ('-' * Line_title + Line_end)

line_list.append (' FULL File Path : ' + File_name + Line_end)

if os.path.isfile(File_name) == True:
    line_list.append (' CREATION Date : ' + Creation_date)
    line_list.append (' MODIFICATION Date : ' + Modif_date + Line_end)
    line_list.append (' READ-ONLY flag : ' + RO_flag)

line_list.append (' READ-ONLY editor : ' + RO_editor + Line_end * 2)

line_list.append (' Current VIEW : ' + Curr_view + Line_end)
line_list.append (' Current ENCODING : ' + Curr_encoding + Line_end)
line_list.append (' Current LANGUAGE : ' + str(Curr_lang) + ' (' + Lang_desc + ')' + Line_end)
line_list.append (' Current Line END : ' + Curr_eol + Line_end)
line_list.append (' Current WRAPPING : ' + Curr_wrap + Line_end * 2)

line_list.append (' 1-BYTE Chars : ' + str(Total_1_byte))
line_list.append (' 2-BYTES Chars : ' + str(Total_2_bytes))
line_list.append (' 3-BYTES Chars : ' + str(Total_3_bytes) + Line_end)
line_list.append (' Sum BMP Chars : ' + str(Total_BMP))
line_list.append (' 4-BYTES Chars : ' + str(Total_4_bytes) + Line_end)
line_list.append (' CHARS w/o CR & LF : ' + str(Total_standard))
line_list.append (' EOL ( CR or LF ) : ' + str(Total_EOL) + Line_end)
line_list.append (' TOTAL characters : ' + str(Total_chars) + Line_end * 2)

if Curr_encoding == 'ANSI':
    line_list.append (' BYTES Length : ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b)')

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    line_list.append (' BYTES Length : ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b + '\
                      + str(Total_2_bytes) + ' x 2b + ' + str(Total_3_bytes) + ' x 3b + ' + str(Total_4_bytes) + ' x 4b)')

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    line_list.append (' BYTES Length : ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 2 + ' + str(Total_BMP) + ' x 2b + ' + str(Total_4_bytes) + ' x 4b)')

line_list.append (' Byte Order Mark : ' + str(BOM) + Line_end)
line_list.append (' BUFFER Length : ' + str(Buffer_length))

if os.path.isfile(File_name) == True:
    line_list.append (' Length on DISK : ' + str(Size_length) + Line_end * 2)
else:
    if Line_end == '\r\n':
        line_list.append (Line_end)

line_list.append (' NON-Blank Count : ' + str(Non_blank_chars) + Line_end)
line_list.append (' WORDS Count : ' + str(Words_count) + ' (Caution !)' + Line_end)

if Err_regex == False:
    line_list.append (' NON-SPACE Count : ' + str(Non_space_count) + Line_end * 2)
else:
    line_list.append (' NON-SPACE Count : ' + str(Non_space_count) + ' (Caution as " RuntimeError " occurred !)' + Line_end * 2)

line_list.append (' True EMPTY lines : ' + str(Empty_lines))
line_list.append (' True BLANK lines : ' + str(Blank_lines) + Line_end)
line_list.append (' EMPTY/BLANK lines : ' + str(Emp_blk_lines) + Line_end)
line_list.append (' NON-BLANK lines : ' + str(Non_blk_lines))
line_list.append (' TOTAL Lines : ' + str(Total_lines) + Line_end * 2)

line_list.append (' SELECTION(S) : ' + str(Chars_count) + Txt_chars + str(Bytes_count) + Txt_bytes + str(Num_sel) + Txt_ranges)

notepad.new()
editor.setText('\r\n'.join(line_list))

if St_bar != 'ANSI' and St_bar != 'UTF-8' and St_bar != 'UTF-8-BOM' and St_bar != 'UTF-16 BE BOM' and St_bar != 'UTF-16 LE BOM':
    if Curr_encoding == 'UTF-8':  # SAME value for both an 'UTF-8' or 'ANSI' file, when RE-INTERPRETED with the 'Encoding > Character Set > ...' feature
        notepad.messageBox ('CURRENT file re-interpreted as ' + St_bar + ' => Possible ERRONEOUS results' + \
        '\nSo, CLOSE the file WITHOUT saving, RESTORE it (CTRL + SHIFT + T) and RESTART script', '!!! WARNING !!!')

# ----Aé☀𝜜-----------------------------------------------------------------------------------------------------------------------------------------------------
Remember that you can get a shorter summary report by changing the line:

Line_end = '\r\n'

to this one:

Line_end = ''

Best Regards,

guy038
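The UTF-8 branch of the script does not search for 4-byte characters; it derives their count from the byte length. Here is a minimal sketch of that arithmetic in plain Python 3, where `str` and `encode` stand in for the Scintilla buffer. The helper name `utf8_breakdown` is hypothetical and not part of the script, and, unlike the script, EOL characters are simply counted among the 1-byte chars. It uses `//` so the result stays an integer under Python 3.

```python
def utf8_breakdown(text):
    """Return (n1, n2, n3, n4): code points needing 1, 2, 3 and 4 UTF-8 bytes."""
    bytes_length = len(text.encode('utf-8'))   # stands in for editor.getLength()
    total_chars = len(text)                    # stands in for editor.countCharacters()
    n2 = sum(1 for c in text if 0x80 <= ord(c) <= 0x7FF)
    n3 = sum(1 for c in text if 0x800 <= ord(c) <= 0xFFFF)
    # bytes - chars = n2 + 2 * n3 + 3 * n4, so n4 follows without any search
    n4 = (bytes_length - total_chars - n2 - 2 * n3) // 3
    n1 = total_chars - n2 - n3 - n4
    return n1, n2, n3, n4


# The four test characters found in the script's last comment line
print(utf8_breakdown('Aé☀𝜜'))   # (1, 1, 1, 1)
```

The identity behind it: every character costs one byte, plus one extra byte for each 2-byte char, two for each 3-byte char and three for each 4-byte char.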
- 
 
- 
Hello, @alan-kilborn and All,

Following your advice, I included the number of selected words (\w+) in the last line of the summary report, regarding the different selections.

If needed, the OP may choose this second syntax, which treats the hyphen, the apostrophe and the Right Single Quotation Mark as true word chars when they are surrounded by word chars!

SEARCH (?:(?<=\w)[-'’](?=\w)|\w)+

To do so, replace the line:

editor.research(r'\w+', number, 0, editor.getSelectionNStart(n), editor.getSelectionNEnd(n))

with this one (note the double quotes, needed because the regex itself contains an apostrophe):

editor.research(r"(?:(?<=\w)[-'’](?=\w)|\w)+", number, 0, editor.getSelectionNStart(n), editor.getSelectionNEnd(n))
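To see what the second syntax changes, here is a small sketch using Python's own `re` module rather than `editor.research`. This is an assumption made for illustration: Notepad++ uses the Boost engine, whose `\w` may differ on some characters. The helper `count_words` and the sample sentence are hypothetical.

```python
import re

SIMPLE_WORD = re.compile(r"\w+")
# Hyphen, apostrophe or right single quotation mark count as word chars
# only when they sit between two word chars
EXTENDED_WORD = re.compile(r"(?:(?<=\w)[-'’](?=\w)|\w)+")


def count_words(text, pattern):
    """Count the non-overlapping matches of a compiled pattern in text."""
    return sum(1 for _ in pattern.finditer(text))


sample = "It's a well-known rock'n'roll song"
print(count_words(sample, SIMPLE_WORD))     # 9
print(count_words(sample, EXTENDED_WORD))   # 5
```

With `\w+`, "It's" splits into "It" and "s", and "rock'n'roll" into three pieces. The extended pattern keeps each of them as one word, while a leading or trailing hyphen or apostrophe stays outside the match.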
So, here is the v1.1 version of my script, split over two posts:

# encoding=utf-8

#-------------------------------------------------------------------------
# STATISTICS about the CURRENT file ( v1.1 )
#-------------------------------------------------------------------------

from __future__ import print_function  # for Python2 compatibility

from Npp import *

import re

import os, time, datetime

import ctypes

from ctypes.wintypes import BOOL, HWND, WPARAM, LPARAM, UINT

# --------------------------------------------------------------------------------------------------------------------------------------------------------------
# From @alan-kilborn, in post https://community.notepad-plus-plus.org/topic/21733/pythonscript-different-behavior-in-script-vs-in-immediate-mode/4
# --------------------------------------------------------------------------------------------------------------------------------------------------------------

def npp_get_statusbar(statusbar_item_number):

    WNDENUMPROC = ctypes.WINFUNCTYPE(BOOL, HWND, LPARAM)
    FindWindowW = ctypes.windll.user32.FindWindowW
    FindWindowExW = ctypes.windll.user32.FindWindowExW
    SendMessageW = ctypes.windll.user32.SendMessageW
    LRESULT = LPARAM
    SendMessageW.restype = LRESULT
    SendMessageW.argtypes = [ HWND, UINT, WPARAM, LPARAM ]
    EnumChildWindows = ctypes.windll.user32.EnumChildWindows
    GetClassNameW = ctypes.windll.user32.GetClassNameW
    create_unicode_buffer = ctypes.create_unicode_buffer

    SBT_OWNERDRAW = 0x1000
    WM_USER = 0x400; SB_GETTEXTLENGTHW = WM_USER + 12; SB_GETTEXTW = WM_USER + 13

    npp_get_statusbar.STATUSBAR_HANDLE = None

    def get_result_from_statusbar(statusbar_item_number):
        assert statusbar_item_number <= 5
        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTLENGTHW, statusbar_item_number, 0)
        length = retcode & 0xFFFF
        type = (retcode >> 16) & 0xFFFF
        assert (type != SBT_OWNERDRAW)
        text_buffer = create_unicode_buffer(length)
        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTW, statusbar_item_number, ctypes.addressof(text_buffer))
        retval = '{}'.format(text_buffer[:length])
        return retval

    def EnumCallback(hwnd, lparam):
        curr_class = create_unicode_buffer(256)
        GetClassNameW(hwnd, curr_class, 256)
        if curr_class.value.lower() == "msctls_statusbar32":
            npp_get_statusbar.STATUSBAR_HANDLE = hwnd
            return False  # stop the enumeration
        return True  # continue the enumeration

    npp_hwnd = FindWindowW(u"Notepad++", None)
    EnumChildWindows(npp_hwnd, WNDENUMPROC(EnumCallback), 0)
    if npp_get_statusbar.STATUSBAR_HANDLE: return get_result_from_statusbar(statusbar_item_number)
    assert False

St_bar = npp_get_statusbar(4)  # Zone 4 ( STATUSBARSECTION.UNICODETYPE )

Continuation on next post

guy038
- 
Hi Alan and all,

Continuation of version v1.1 of the script:

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

def number(occ):
    global num
    num += 1

console.show()
console.clear()

Start_time = time.time()

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_encoding = str(notepad.getEncoding())

if Curr_encoding == 'ENC8BIT':
    Curr_encoding = 'ANSI'

if Curr_encoding == 'COOKIE':
    Curr_encoding = 'UTF-8'

if Curr_encoding == 'UTF8':
    Curr_encoding = 'UTF-8-BOM'

if Curr_encoding == 'UCS2BE':
    Curr_encoding = 'UTF-16 BE BOM'

if Curr_encoding == 'UCS2LE':
    Curr_encoding = 'UTF-16 LE BOM'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    Line_title = 95
else:
    Line_title = 75

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

File_name = notepad.getCurrentFilename().decode('utf-8')

if os.path.isfile(File_name) == True:

    Creation_date = time.ctime(os.path.getctime(File_name))
    Modif_date = time.ctime(os.path.getmtime(File_name))
    Size_length = os.path.getsize(File_name)

    RO_flag = 'YES'
    if os.access(File_name, os.W_OK):
        RO_flag = 'NO'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

RO_editor = 'NO'
if editor.getReadOnly() == True:
    RO_editor = 'YES'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if notepad.getCurrentView() == 0:
    Curr_view = 'MAIN View'
else:
    Curr_view = 'SECONDARY view'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_lang = notepad.getCurrentLang()
Lang_desc = notepad.getLanguageDesc(Curr_lang)

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if editor.getEOLMode() == 0:
    Curr_eol = 'Windows (CR LF)'

if editor.getEOLMode() == 1:
    Curr_eol = 'Macintosh (CR)'

if editor.getEOLMode() == 2:
    Curr_eol = 'Unix (LF)'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_wrap = 'NO'
if editor.getWrapMode() == 1:
    Curr_wrap = 'YES'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

print ('START')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Bytes_length = editor.getLength()
Total_chars = editor.countCharacters(0, editor.getLength())

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\r|\n', number)
Total_EOL = num

print ('EOL')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Total_standard = Total_chars - Total_EOL

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Curr_encoding == 'ANSI':
    Total_BMP = Total_standard
    Total_1_byte = Total_BMP
    Total_2_bytes = 0
    Total_3_bytes = 0
    Total_4_bytes = 0

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':

    num = 0
    editor.research(r'[\x{0080}-\x{07FF}]', number)
    Total_2_bytes = num

    print ('2-BYTES')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

    num = 0
    editor.research(r'[\x{0800}-\x{D7FF}\x{E000}-\x{FFFF}]', number)
    Total_3_bytes = num

    print ('3-BYTES')

# -----------------------------------------------------------------------------------------------------------------------------

    Total_4_bytes = ( Bytes_length - Total_chars - Total_2_bytes - 2 * Total_3_bytes ) / 3
    Total_1_byte = Total_standard - Total_2_bytes - Total_3_bytes - Total_4_bytes
    Total_BMP = Total_1_byte + Total_2_bytes + Total_3_bytes

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':

    num = 0
    editor.research(r'(?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]', number)  # ALL BMP chars different from '\r' and '\n'
    Total_2_bytes = num

    Total_4_bytes = Total_standard - Total_2_bytes
    Total_BMP = Total_2_bytes
    Total_1_byte = 0
    Total_3_bytes = 0
    Bytes_length = 2 * Total_EOL + 2 * Total_BMP + 4 * Total_4_bytes

    print ('2-BYTES')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

BOM = 0  # Default ANSI and UTF-8

if Curr_encoding == 'UTF-8-BOM':
    BOM = 3

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    BOM = 2

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Buffer_length = Bytes_length + BOM

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\t|\x20', number)
Non_blank_chars = Total_standard - num

print ('NON-BLANK')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\w+', number)
Words_total = num

print ('WORDS')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Err_regex = False
num = 0

if Curr_encoding == 'ANSI' or Total_4_bytes == 0:
    editor.research(r'\S+', number)
else:
    try:
        editor.research(r'(?:(?!\s).[\x{D800}-\x{DFFF}]?)+', number)
    except RuntimeError:
        Err_regex = True

Non_space_count = num

print ('NON-SPACE')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'\f^(?:\r\n|\r|\n)', number)
else:
    editor.research(r'[\f\x{0085}\x{2028}\x{2029}]^(?:\r\n|\r|\n)', number)
Special_empty = num

num = 0
editor.research(r'^(?:\r\n|\r|\n)', number)
Default_empty = num

Empty_lines = Default_empty - Special_empty

print ('EMPTY lines')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'\f^[\t\x20]+(?:\r\n|\r|\n|\z)', number)
else:
    editor.research(r'[\f\x{0085}\x{2028}\x{2029}]^[\t\x20]+(?:\r\n|\r|\n|\z)', number)
Special_blank = num

num = 0
editor.research(r'^[\t\x20]+(?:\r\n|\r|\n|\z)', number)
Default_blank = num

Blank_lines = Default_blank - Special_blank

print ('BLANK lines')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Emp_blk_lines = Empty_lines + Blank_lines

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Total_lines = editor.getLineCount()

num = 0
editor.research(r'(?-s)^.+\z', number)

if num == 0:
    Total_lines = Total_lines - 1

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Non_blk_lines = Total_lines - Emp_blk_lines

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Num_sel = editor.getSelections()  # Get ALL selections ( EMPTY or NOT )

if Num_sel != 0:

    Bytes_count = 0
    Chars_count = 0
    Words_count = 0

    for n in range(Num_sel):

        Bytes_count += editor.getSelectionNEnd(n) - editor.getSelectionNStart(n)
        Chars_count += editor.countCharacters(editor.getSelectionNStart(n), editor.getSelectionNEnd(n))

        num = 0
        editor.research(r'\w+', number, 0, editor.getSelectionNStart(n), editor.getSelectionNEnd(n))
        Words_count += num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Bytes_count < 2:
    Txt_bytes = ' selected byte) in '
else:
    Txt_bytes = ' selected bytes) in '

if Chars_count < 2:
    Txt_chars = ' selected char, '
else:
    Txt_chars = ' selected chars, '

if Words_count < 2:
    Txt_words = ' selected word ('
else:
    Txt_words = ' selected words ('

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Num_sel < 2 and Bytes_count == 0:
    Txt_ranges = ' EMPTY range'

if Num_sel < 2 and Bytes_count > 0:
    Txt_ranges = ' range'

if Num_sel > 1 and Bytes_count == 0:
    Txt_ranges = ' EMPTY ranges'

if Num_sel > 1 and Bytes_count > 0:
    Txt_ranges = ' ranges (EMPTY or NOT)'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

console.hide()

line_list = []  # empty list

Line_end = '\r\n'

line_list.append ('-' * Line_title)
line_list.append (' ' * int((Line_title - 54) / 2) + 'SUMMARY on ' + str(datetime.datetime.now()) + ' ( ' + str(time.time() - Start_time) + ' )')
line_list.append ('-' * Line_title + Line_end)

line_list.append (' FULL File Path : ' + File_name + Line_end)

if os.path.isfile(File_name) == True:
    line_list.append (' CREATION Date : ' + Creation_date)
    line_list.append (' MODIFICATION Date : ' + Modif_date + Line_end)
    line_list.append (' READ-ONLY flag : ' + RO_flag)

line_list.append (' READ-ONLY editor : ' + RO_editor + Line_end * 2)

line_list.append (' Current VIEW : ' + Curr_view + Line_end)
line_list.append (' Current ENCODING : ' + Curr_encoding + Line_end)
line_list.append (' Current LANGUAGE : ' + str(Curr_lang) + ' (' + Lang_desc + ')' + Line_end)
line_list.append (' Current Line END : ' + Curr_eol + Line_end)
line_list.append (' Current WRAPPING : ' + Curr_wrap + Line_end * 2)

line_list.append (' 1-BYTE Chars : ' + str(Total_1_byte))
line_list.append (' 2-BYTES Chars : ' + str(Total_2_bytes))
line_list.append (' 3-BYTES Chars : ' + str(Total_3_bytes) + Line_end)
line_list.append (' Sum BMP Chars : ' + str(Total_BMP))
line_list.append (' 4-BYTES Chars : ' + str(Total_4_bytes) + Line_end)
line_list.append (' CHARS w/o CR & LF : ' + str(Total_standard))
line_list.append (' EOL ( CR or LF ) : ' + str(Total_EOL) + Line_end)
line_list.append (' TOTAL characters : ' + str(Total_chars) + Line_end * 2)

if Curr_encoding == 'ANSI':
    line_list.append (' BYTES Length : ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b)')

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    line_list.append (' BYTES Length : ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b + '\
                      + str(Total_2_bytes) + ' x 2b + ' + str(Total_3_bytes) + ' x 3b + ' + str(Total_4_bytes) + ' x 4b)')

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    line_list.append (' BYTES Length : ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 2 + ' + str(Total_BMP) + ' x 2b + ' + str(Total_4_bytes) + ' x 4b)')

line_list.append (' Byte Order Mark : ' + str(BOM) + Line_end)
line_list.append (' BUFFER Length : ' + str(Buffer_length))

if os.path.isfile(File_name) == True:
    line_list.append (' Length on DISK : ' + str(Size_length) + Line_end * 2)
else:
    if Line_end == '\r\n':
        line_list.append (Line_end)

line_list.append (' NON-Blank Chars : ' + str(Non_blank_chars) + Line_end)
line_list.append (' WORDS Count : ' + str(Words_total) + ' (Caution !)' + Line_end)

if Err_regex == False:
    line_list.append (' NON-SPACE Count : ' + str(Non_space_count) + Line_end * 2)
else:
    line_list.append (' NON-SPACE Count : ' + str(Non_space_count) + ' (Caution as " RuntimeError " occurred !)' + Line_end * 2)

line_list.append (' True EMPTY lines : ' + str(Empty_lines))
line_list.append (' True BLANK lines : ' + str(Blank_lines) + Line_end)
line_list.append (' EMPTY/BLANK lines : ' + str(Emp_blk_lines) + Line_end)
line_list.append (' NON-BLANK lines : ' + str(Non_blk_lines))
line_list.append (' TOTAL Lines : ' + str(Total_lines) + Line_end * 2)

line_list.append (' SELECTION(S) : ' + str(Chars_count) + Txt_chars + str(Words_count) + Txt_words + str(Bytes_count) + Txt_bytes + str(Num_sel) + Txt_ranges + Line_end)

notepad.new()
editor.setText('\r\n'.join(line_list))

if St_bar != 'ANSI' and St_bar != 'UTF-8' and St_bar != 'UTF-8-BOM' and St_bar != 'UTF-16 BE BOM' and St_bar != 'UTF-16 LE BOM':
    if Curr_encoding == 'UTF-8':  # SAME value for both an 'UTF-8' or 'ANSI' file, when RE-INTERPRETED with the 'Encoding > Character Set > ...' feature
        notepad.messageBox ('CURRENT file re-interpreted as ' + St_bar + ' => Possible ERRONEOUS results' + \
        '\nSo, CLOSE the file WITHOUT saving, RESTORE it (CTRL + SHIFT + T) and RESTART script', '!!! WARNING !!!')

# ----Aé☀𝜜-----------------------------------------------------------------------------------------------------------------------------------------------------

Best Regards,

guy038
- 
Hello, @alan-kilborn and Python gurus,

I’ve just found a bug when trying to run my script against a “French” file called Numéros (which means Numbers) :-((

The Python section of my script below detects whether the current tab is associated with a true file, saved on disk, or refers to a new # file, not saved yet:

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

File_name = notepad.getCurrentFilename()

if os.path.isfile(File_name) == True:

    Creation_date = time.ctime(os.path.getctime(File_name))
    Modif_date = time.ctime(os.path.getmtime(File_name))
    Size_length = os.path.getsize(File_name)

    RO_flag = 'YES'
    if os.access(File_name, os.W_OK):
        RO_flag = 'NO'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Unfortunately, if the current name contains accented characters, like Numéros, the script wrongly supposes it’s a new # file! As soon as the file is renamed to Numeros, everything is OK again.

So, how can the script recognize the filename even if the current file name or the current path contains NON-ASCII characters?

TIA

guy038
- 
@guy038 said in Emulation of the "View > Summary" feature with a Python script:

how to recognize the filename even if current file or current path contain NON-ASCII characters ?

Short answer: This is better done with Python3, i.e., PythonScript 3.x. Then things “just work”. :-)

But, for Python2 (and PS 2.x), you can make a call to .encode('utf-8') or .decode('utf-8'), depending upon your circumstance (I’m not commenting on your specific code), in order to get what you need.

Basically, if you have a Python2 string (in a variable s) and you want to get a Unicode string (for things like Windows pathnames with non-trivial characters), use s.decode('utf-8'), and to go the other way, where you have a Unicode str (in a variable u) and you want a Python2 str, do u.encode('utf-8').
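A minimal sketch of that decode step, written so that it behaves the same under Python 2 and Python 3. The helper name `to_text` and the sample path are hypothetical, and the byte string only imitates what a Python 2 `notepad.getCurrentFilename()` call might return for a UTF-8 encoded name.

```python
def to_text(name, encoding='utf-8'):
    """Return name as a text (Unicode) string, decoding it only if it is a byte string."""
    if isinstance(name, bytes):      # in Python 2, 'bytes' is just an alias of 'str'
        return name.decode(encoding)
    return name


raw = b'C:\\Temp\\Num\xc3\xa9ros.txt'    # hypothetical path, given as UTF-8 bytes
text = to_text(raw)

print(text == u'C:\\Temp\\Num\xe9ros.txt')   # True: the two bytes C3 A9 became the single char é
print(text.encode('utf-8') == raw)           # True: encode() goes the other way
```

Passing the decoded value to `os.path.isfile` lets Python 2 use the wide-character Windows API, which is presumably why the accented name is then found.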
- 
Hi, @alan-kilborn,

Many thanks for the tip! I did some Google searches before, but only found some obscure explanations. Right now, though, I tried again with this question:

How to get "os.path.isfile(Filename)" == True: when Filename contains "NON ASCII" chars ?

The first article, named “python - UnicodeEncodeError on joining file name”, dated Jan. 05 2010, on the Stack Overflow site, says verbatim, in the middle of the article:

So I would first try filename = filename.decode('utf-8') -- that should allow the os.path.join to work

Now, I won’t bother to re-edit my script with a new version number! In my v1.1 version, above, I just changed the line:

File_name = notepad.getCurrentFilename()

to this one:

File_name = notepad.getCurrentFilename().decode('utf-8')

BR

guy038
- 
- 
Hello, @alan-kilborn and All,

Below is the v1.2 version of the Python script for an enhanced Summary feature:

- I decomposed the total number of chars into 3 parts: EOL chars, Space and Tab chars, and True chars ([^\t\x20\r\n])
- I also decomposed the total number of word chars into 3 parts: letter chars, digit chars and low_line chars
- I added a count of the paragraphs (you may adapt the corresponding regex to your needs)
- I added a count of the sentences (you may adapt the corresponding regex to your needs)
- I added some remarks at the end of the summary report, regarding the global accuracy of some results!
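The first two decompositions listed above can be sketched with Python's `re` module. This is only an approximation of what the script does through `editor.research`, since Boost's `\w` and `\d` may not cover exactly the same characters. The function name `decompose` and the dictionary keys are hypothetical.

```python
import re


def decompose(text):
    """Split the chars of text into the classes named in the v1.2 change list."""
    return {
        'eol':       len(re.findall(r'[\r\n]', text)),          # EOL chars
        'blank':     len(re.findall(r'[\t\x20]', text)),        # Space and Tab chars
        'true':      len(re.findall(r'[^\t\x20\r\n]', text)),   # True chars
        'letters':   len(re.findall(r'[^\W\d_]', text)),        # word chars, digits and '_' excluded
        'digits':    len(re.findall(r'\d', text)),
        'low_lines': text.count('_'),
    }


counts = decompose('ab_1 c2\r\n\td')
print(counts['eol'], counts['blank'], counts['true'])            # 2 2 7
print(counts['letters'], counts['digits'], counts['low_lines'])  # 4 2 1
```

The three char classes add up to the total length of the text, and the three word-char classes add up to the number of `\w` matches, which makes a handy sanity check on the reported figures.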
 
Now, Alan, I needed to change this part, regarding the selections:

for n in range(Num_sel):

    Bytes_count += editor.getSelectionNEnd(n) - editor.getSelectionNStart(n)
    Chars_count += editor.countCharacters(editor.getSelectionNStart(n), editor.getSelectionNEnd(n))

    num = 0
    editor.research(r'\w+', number, 0, editor.getSelectionNStart(n), editor.getSelectionNEnd(n))
    Words_count += num

to this one:

for n in range(Num_sel):

    Bytes_count += editor.getSelectionNEnd(n) - editor.getSelectionNStart(n)
    Chars_count += editor.countCharacters(editor.getSelectionNStart(n), editor.getSelectionNEnd(n))

    num = 0
    if Bytes_count != 0:
        editor.research(r'\w+', number, 0, editor.getSelectionNStart(n), editor.getSelectionNEnd(n))
    Words_count += num

The reason: if the unique zero-length selection was on a pure empty line, the script did write, as expected, the message:

0 selected char, 0 selected word (0 selected byte) in 1 EMPTY range

But if this unique zero-length selection was on a non-empty line, it would wrongly write, for example:

0 selected char, 568 selected words (0 selected byte) in 1 EMPTY range

given that the whole file contains 568 words.
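Note that the guard above tests the running total `Bytes_count`, so with several selections an empty range that follows a non-empty one would presumably still be searched. A possible variant tests the length of each selection on its own. Here it is sketched in plain Python, with `re` standing in for `editor.research` and character offsets standing in for Scintilla's byte positions. The function name `count_selected_words` is hypothetical.

```python
import re


def count_selected_words(text, ranges, pattern=r'\w+'):
    """Count the pattern matches inside each (start, end) range, skipping empty ranges."""
    regex = re.compile(pattern)
    total = 0
    for start, end in ranges:
        if end - start == 0:     # zero-length selection: never run the search
            continue
        total += sum(1 for _ in regex.finditer(text, start, end))
    return total


text = 'one two three'
print(count_selected_words(text, [(4, 4)]))           # 0: a lone empty selection
print(count_selected_words(text, [(0, 7), (9, 9)]))   # 2: the empty range adds nothing
```

In the script itself, the equivalent test would compare `editor.getSelectionNEnd(n)` with `editor.getSelectionNStart(n)` inside the loop.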
So, here is the v1.2 version of my script, split on two posts :

    # encoding=utf-8

    #-------------------------------------------------------------------------
    #                    STATISTICS about the CURRENT file ( v1.2 )
    #-------------------------------------------------------------------------

    from __future__ import print_function    # for Python2 compatibility

    from Npp import *

    import re

    import os, time, datetime

    import ctypes

    from ctypes.wintypes import BOOL, HWND, WPARAM, LPARAM, UINT

    # ----------------------------------------------------------------------------------------------------
    #  From @alan-kilborn, in post https://community.notepad-plus-plus.org/topic/21733/pythonscript-different-behavior-in-script-vs-in-immediate-mode/4
    # ----------------------------------------------------------------------------------------------------

    def npp_get_statusbar(statusbar_item_number):

        WNDENUMPROC = ctypes.WINFUNCTYPE(BOOL, HWND, LPARAM)
        FindWindowW = ctypes.windll.user32.FindWindowW
        FindWindowExW = ctypes.windll.user32.FindWindowExW
        SendMessageW = ctypes.windll.user32.SendMessageW
        LRESULT = LPARAM
        SendMessageW.restype = LRESULT
        SendMessageW.argtypes = [ HWND, UINT, WPARAM, LPARAM ]
        EnumChildWindows = ctypes.windll.user32.EnumChildWindows
        GetClassNameW = ctypes.windll.user32.GetClassNameW
        create_unicode_buffer = ctypes.create_unicode_buffer

        SBT_OWNERDRAW = 0x1000
        WM_USER = 0x400; SB_GETTEXTLENGTHW = WM_USER + 12; SB_GETTEXTW = WM_USER + 13

        npp_get_statusbar.STATUSBAR_HANDLE = None

        def get_result_from_statusbar(statusbar_item_number):
            assert statusbar_item_number <= 5
            retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTLENGTHW, statusbar_item_number, 0)
            length = retcode & 0xFFFF
            type = (retcode >> 16) & 0xFFFF
            assert (type != SBT_OWNERDRAW)
            text_buffer = create_unicode_buffer(length)
            retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTW, statusbar_item_number, ctypes.addressof(text_buffer))
            retval = '{}'.format(text_buffer[:length])
            return retval

        def EnumCallback(hwnd, lparam):
            curr_class = create_unicode_buffer(256)
            GetClassNameW(hwnd, curr_class, 256)
            if curr_class.value.lower() == "msctls_statusbar32":
                npp_get_statusbar.STATUSBAR_HANDLE = hwnd
                return False    # stop the enumeration
            return True    # continue the enumeration

        npp_hwnd = FindWindowW(u"Notepad++", None)
        EnumChildWindows(npp_hwnd, WNDENUMPROC(EnumCallback), 0)
        if npp_get_statusbar.STATUSBAR_HANDLE: return get_result_from_statusbar(statusbar_item_number)
        assert False

    St_bar = npp_get_statusbar(4)    # Zone 4 ( STATUSBARSECTION.UNICODETYPE )

    # ----------------------------------------------------------------------------------------------------

    def number(occ):
        global num
        num += 1

    console.show()
    console.clear()

    Start_time = time.time()

    # ----------------------------------------------------------------------------------------------------

    Curr_encoding = str(notepad.getEncoding())

    if Curr_encoding == 'ENC8BIT':
        Curr_encoding = 'ANSI'

    if Curr_encoding == 'COOKIE':
        Curr_encoding = 'UTF-8'

    if Curr_encoding == 'UTF8':
        Curr_encoding = 'UTF-8-BOM'

    if Curr_encoding == 'UCS2BE':
        Curr_encoding = 'UTF-16 BE BOM'

    if Curr_encoding == 'UCS2LE':
        Curr_encoding = 'UTF-16 LE BOM'

    # ----------------------------------------------------------------------------------------------------

    if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
        Line_title = 95
    else:
        Line_title = 75

    # ----------------------------------------------------------------------------------------------------

    File_name = notepad.getCurrentFilename().decode('utf-8')

    if os.path.isfile(File_name) == True:

        Creation_date = time.ctime(os.path.getctime(File_name))

        Modif_date = time.ctime(os.path.getmtime(File_name))

        Size_length = os.path.getsize(File_name)

        RO_flag = 'YES'

        if os.access(File_name, os.W_OK):
            RO_flag = 'NO'

    # ----------------------------------------------------------------------------------------------------

    RO_editor = 'NO'

    if editor.getReadOnly() == True:
        RO_editor = 'YES'

    # ----------------------------------------------------------------------------------------------------

    if notepad.getCurrentView() == 0:
        Curr_view = 'MAIN View'
    else:
        Curr_view = 'SECONDARY view'

    # ----------------------------------------------------------------------------------------------------

    Curr_lang = notepad.getCurrentLang()

    Lang_desc = notepad.getLanguageDesc(Curr_lang)

    # ----------------------------------------------------------------------------------------------------

    if editor.getEOLMode() == 0:
        Curr_eol = 'Windows (CR LF)'

    if editor.getEOLMode() == 1:
        Curr_eol = 'Macintosh (CR)'

    if editor.getEOLMode() == 2:
        Curr_eol = 'Unix (LF)'

    # ----------------------------------------------------------------------------------------------------

    Curr_wrap = 'NO'

    if editor.getWrapMode() == 1:
        Curr_wrap = 'YES'

Continuation on next post

guy038
Hi @alan-kilborn and all,

Continuation of version v1.2 of the script :

    # ----------------------------------------------------------------------------------------------------

    print ('START')

    # ----------------------------------------------------------------------------------------------------

    Bytes_length = editor.getLength()

    Total_chars = editor.countCharacters(0, editor.getLength())

    # ----------------------------------------------------------------------------------------------------

    num = 0
    editor.research(r'\n|\r', number)
    Total_EOL = num

    print ('EOL')

    # ----------------------------------------------------------------------------------------------------

    num = 0
    editor.research(r'\t|\x20', number)
    Blank_chars = num

    print ('BLANK')

    # ----------------------------------------------------------------------------------------------------

    Total_standard = Total_chars - Total_EOL

    True_chars = Total_chars - Total_EOL - Blank_chars

    # ----------------------------------------------------------------------------------------------------

    if Curr_encoding == 'ANSI':

        Total_BMP = Total_standard
        Total_1_byte = Total_BMP
        Total_2_bytes = 0
        Total_3_bytes = 0
        Total_4_bytes = 0

    # ----------------------------------------------------------------------------------------------------

    if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':

        num = 0
        editor.research(r'[\x{0080}-\x{07FF}]', number)
        Total_2_bytes = num

        print ('2-BYTES')

        # ------------------------------------------------------------------------------------------------

        num = 0
        editor.research(r'[\x{0800}-\x{D7FF}\x{E000}-\x{FFFF}]', number)
        Total_3_bytes = num

        print ('3-BYTES')

        # ------------------------------------------------------------------------------------------------

        Total_4_bytes = ( Bytes_length - Total_chars - Total_2_bytes - 2 * Total_3_bytes ) / 3

        Total_1_byte = Total_standard - Total_2_bytes - Total_3_bytes - Total_4_bytes

        Total_BMP = Total_1_byte + Total_2_bytes + Total_3_bytes

    # ----------------------------------------------------------------------------------------------------

    if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':

        num = 0
        editor.research(r'(?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]', number)    # ALL BMP chars different from '\r' and '\n'
        Total_2_bytes = num

        Total_4_bytes = Total_standard - Total_2_bytes

        Total_BMP = Total_2_bytes
        Total_1_byte = 0
        Total_3_bytes = 0

        Bytes_length = 2 * Total_EOL + 2 * Total_BMP + 4 * Total_4_bytes

        print ('2-BYTES')

    # ----------------------------------------------------------------------------------------------------

    BOM = 0    # Default ANSI and UTF-8

    if Curr_encoding == 'UTF-8-BOM':
        BOM = 3

    if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
        BOM = 2

    # ----------------------------------------------------------------------------------------------------

    Buffer_length = Bytes_length + BOM

    # ----------------------------------------------------------------------------------------------------

    num = 0
    editor.research(r'\d', number)
    Number_chars = num

    print ('NUMBERS')

    # ----------------------------------------------------------------------------------------------------

    num = 0
    editor.research(r'_', number)
    Lowline_chars = num

    print ('LOW_LINES')

    # ----------------------------------------------------------------------------------------------------

    num = 0
    editor.research(r'\w', number)
    Word_chars = num

    print ('WORDS')

    Letter_chars = Word_chars - Number_chars - Lowline_chars

    # ----------------------------------------------------------------------------------------------------

    num = 0
    editor.research(r'\w+', number)
    Words_total = num

    print ('WORDS+')

    # ----------------------------------------------------------------------------------------------------

    Err_regex_non_space = False

    num = 0
    if Curr_encoding == 'ANSI' or Total_4_bytes == 0:
        editor.research(r'\S+', number)
    else:
        try:
            editor.research(r'(?:(?!\s).[\x{D800}-\x{DFFF}]?)+', number)
        except RuntimeError:
            Err_regex_non_space = True
    Non_space_count = num

    print ('NON-SPACE+')

    # ----------------------------------------------------------------------------------------------------

    Err_regex_sentence = False

    num = 0
    try:
        editor.research(r'(?-s)(?:\A|(?<=[\h\r\n.?!])).+?(?:(?=[.?!](\h|\R|\z))|(?=\R|\z))', number)
    except RuntimeError:
        Err_regex_sentence = True
    Sentence_count = num

    print ('SENTENCES')

    # ----------------------------------------------------------------------------------------------------

    Err_regex_paragraph = False

    num = 0
    try:
        editor.research(r'(?-s)(?:(?:.[\x{D800}-\x{DFFF}]?)+(?:\r\n|\n|\r))+(?:\r\n|\n|\r){1,}(?:(?:.[\x{D800}-\x{DFFF}]?)+\z)?|(?:.[\x{D800}-\x{DFFF}]?)+\z', number)
    except RuntimeError:
        Err_regex_paragraph = True
    Paragraph_count = num

    print ('PARAGRAPHS')

    # ----------------------------------------------------------------------------------------------------

    num = 0
    if Curr_encoding == 'ANSI':
        editor.research(r'\f^(?:\r\n|\n|\r)', number)
    else:
        editor.research(r'[\f\x{0085}\x{2028}\x{2029}]^(?:\r\n|\n|\r)', number)
    Special_empty = num

    num = 0
    editor.research(r'^(?:\r\n|\n|\r)', number)
    Default_empty = num

    Empty_lines = Default_empty - Special_empty

    print ('EMPTY lines')

    # ----------------------------------------------------------------------------------------------------

    num = 0
    if Curr_encoding == 'ANSI':
        editor.research(r'\f^[\t\x20]+(?:\r\n|\n|\r|\z)', number)
    else:
        editor.research(r'[\f\x{0085}\x{2028}\x{2029}]^[\t\x20]+(?:\r\n|\n|\r|\z)', number)
    Special_blank = num

    num = 0
    editor.research(r'^[\t\x20]+(?:\r\n|\n|\r|\z)', number)
    Default_blank = num

    Blank_lines = Default_blank - Special_blank

    print ('BLANK lines')

    # ----------------------------------------------------------------------------------------------------

    Emp_blk_lines = Empty_lines + Blank_lines

    # ----------------------------------------------------------------------------------------------------

    Total_lines = editor.getLineCount()

    num = 0
    editor.research(r'(?-s)^.+\z', number)
    if num == 0:
        Total_lines = Total_lines - 1    # Because LAST line totally EMPTY

    # ----------------------------------------------------------------------------------------------------

    Non_blk_lines = Total_lines - Emp_blk_lines

    # ----------------------------------------------------------------------------------------------------

    Num_sel = editor.getSelections()    # Get ALL selections ( EMPTY or NOT )

    if Num_sel != 0:

        Bytes_count = 0
        Chars_count = 0
        Words_count = 0

        for n in range(Num_sel):

            Bytes_count += editor.getSelectionNEnd(n) - editor.getSelectionNStart(n)

            Chars_count += editor.countCharacters(editor.getSelectionNStart(n), editor.getSelectionNEnd(n))

            num = 0
            if Bytes_count != 0:
                editor.research(r'\w+', number, 0, editor.getSelectionNStart(n), editor.getSelectionNEnd(n))
            Words_count += num

        # ------------------------------------------------------------------------------------------------

        if Bytes_count < 2:
            Txt_bytes = ' selected byte) in '
        else:
            Txt_bytes = ' selected bytes) in '

        if Chars_count < 2:
            Txt_chars = ' selected char, '
        else:
            Txt_chars = ' selected chars, '

        if Words_count < 2:
            Txt_words = ' selected word ('
        else:
            Txt_words = ' selected words ('

        # ------------------------------------------------------------------------------------------------

        if Num_sel < 2 and Bytes_count == 0:
            Txt_ranges = ' EMPTY range'

        if Num_sel < 2 and Bytes_count > 0:
            Txt_ranges = ' range'

        if Num_sel > 1 and Bytes_count == 0:
            Txt_ranges = ' EMPTY ranges'

        if Num_sel > 1 and Bytes_count > 0:
            Txt_ranges = ' ranges (EMPTY or NOT)'

    # ----------------------------------------------------------------------------------------------------

    console.hide()

    line_list = []    # empty list

    Line_end = '\r\n'

    line_list.append ('-' * Line_title)

    line_list.append (' ' * int((Line_title - 54) / 2) + 'SUMMARY on ' + str(datetime.datetime.now()) + ' ( ' + str(time.time() - Start_time) + ' )')

    line_list.append ('-' * Line_title + Line_end)

    line_list.append (' FULL File Path : ' + File_name + Line_end)

    if os.path.isfile(File_name) == True:

        line_list.append (' CREATION Date : ' + Creation_date)

        line_list.append (' MODIFICATION Date : ' + Modif_date + Line_end)

        line_list.append (' READ-ONLY flag : ' + RO_flag)

    line_list.append (' READ-ONLY editor : ' + RO_editor + Line_end * 2)

    line_list.append (' Current VIEW : ' + Curr_view + Line_end)

    line_list.append (' Current ENCODING : ' + Curr_encoding + Line_end)

    line_list.append (' Current LANGUAGE : ' + str(Curr_lang) + ' (' + Lang_desc + ')' + Line_end)

    line_list.append (' Current Line END : ' + Curr_eol + Line_end)

    line_list.append (' Current WRAPPING : ' + Curr_wrap + Line_end * 2)

    line_list.append (' 1-BYTE Chars : ' + str(Total_1_byte))

    line_list.append (' 2-BYTES Chars : ' + str(Total_2_bytes))

    line_list.append (' 3-BYTES Chars : ' + str(Total_3_bytes) + Line_end)

    line_list.append (' Sum BMP Chars : ' + str(Total_BMP))

    line_list.append (' 4-BYTES Chars : ' + str(Total_4_bytes) + Line_end)

    line_list.append (' CHARS w/o CR & LF : ' + str(Total_standard) + Line_end * 2)

    line_list.append (' EOL ( CR or LF ) : ' + str(Total_EOL))

    line_list.append (' SPC & TAB Chars : ' + str(Blank_chars))

    line_list.append (' TRUE Chars : ' + str(True_chars) + Line_end)

    line_list.append (' TOTAL characters : ' + str(Total_chars) + Line_end * 2)

    if Curr_encoding == 'ANSI':
        line_list.append (' BYTES Length : ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b)')

    if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
        line_list.append (' BYTES Length : ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b + '\
                          + str(Total_2_bytes) + ' x 2b + ' + str(Total_3_bytes) + ' x 3b + ' + str(Total_4_bytes) + ' x 4b)')

    if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
        line_list.append (' BYTES Length : ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 2 + ' + str(Total_BMP) + ' x 2b + ' + str(Total_4_bytes) + ' x 4b)')

    line_list.append (' Byte Order Mark : ' + str(BOM) + Line_end)

    line_list.append (' BUFFER Length : ' + str(Buffer_length))

    if os.path.isfile(File_name) == True:
        line_list.append (' Length on DISK : ' + str(Size_length) + Line_end * 2)
    else:
        if Line_end == '\r\n':
            line_list.append (Line_end)

    line_list.append (' NUMBER Chars : ' + str(Number_chars) + '\t(*)')

    line_list.append (' LOW_LINE Chars : ' + str(Lowline_chars))

    line_list.append (' LETTER Chars : ' + str(Letter_chars) + '\t(*)' + Line_end)

    line_list.append (' WORD Chars : ' + str(Word_chars) + '\t(*)' + Line_end * 2)

    line_list.append (' WORDS Count : ' + str(Words_total) + '\t(*)' + Line_end)

    if Err_regex_non_space == False:
        line_list.append (' NON-SPACE Count : ' + str(Non_space_count) + '\t(**)' + Line_end * 2)
    else:
        line_list.append (' NON-SPACE Count : ' + str(Non_space_count) + '\t(Caution : a " RuntimeError " occured !)' + Line_end * 2)

    if Err_regex_sentence == False:
        line_list.append (' SENTENCES Count : ' + str(Sentence_count) + '\t(**)' + Line_end)
    else:
        line_list.append (' SENTENCES Count : ' + str(Sentence_count) + '\t(Caution : a " RuntimeError " occured !)' + Line_end)

    if Err_regex_paragraph == False:
        line_list.append (' PARAGRAPHS Count : ' + str(Paragraph_count) + '\t(**)' + Line_end * 2)
    else:
        line_list.append (' PARAGRAPHS Count : ' + str(Paragraph_count) + '\t(Caution : a " RuntimeError " occured !)' + Line_end * 2)

    line_list.append (' True EMPTY lines : ' + str(Empty_lines))

    line_list.append (' True BLANK lines : ' + str(Blank_lines) + Line_end)

    line_list.append (' EMPTY/BLANK lines : ' + str(Emp_blk_lines) + Line_end)

    line_list.append (' NON-BLANK lines : ' + str(Non_blk_lines))

    line_list.append (' TOTAL Lines : ' + str(Total_lines) + Line_end * 2)

    line_list.append (' SELECTION(S) : ' + str(Chars_count) + Txt_chars + str(Words_count) + Txt_words + str(Bytes_count) + Txt_bytes + str(Num_sel) + Txt_ranges + '\r\n' + Line_end)

    line_list.append (' (*) Our BOOST regex engine ignore all WORD, NUMBER and LETTER characters over the BMP and may ignore some others within the BMP !')

    line_list.append (' (**) The results may NOT be very accurate for "technical" or "non-regular" files !' + Line_end)

    notepad.new()

    editor.setText('\r\n'.join(line_list))

    if St_bar != 'ANSI' and St_bar != 'UTF-8' and St_bar != 'UTF-8-BOM' and St_bar != 'UTF-16 BE BOM' and St_bar != 'UTF-16 LE BOM':

        if Curr_encoding == 'UTF-8':    # SAME value for both an 'UTF-8' or 'ANSI' file, when RE-INTERPRETED with the 'Encoding > Character Set > ...' feature

            notepad.messageBox ('CURRENT file re-interpreted as ' + St_bar + ' => Possible ERRONEOUS results' + \
                                '\nSo, CLOSE the file WITHOUT saving, RESTORE it (CTRL + SHIFT + T) and RESTART script', '!!! WARNING !!!')

    # ----Aé☀𝜜--------------------------------------------------------------------------------------------

Best Regards,

guy038
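About the `Total_4_bytes` line of the UTF-8 branch : with `n1`, `n2`, `n3` and `n4` the numbers of 1, 2, 3 and 4-byte characters, we have `Bytes_length = n1 + 2*n2 + 3*n3 + 4*n4` and `Total_chars = n1 + n2 + n3 + n4`, so `Bytes_length - Total_chars = n2 + 2*n3 + 3*n4`, which gives the formula used in the script. Below is a minimal sketch to check this arithmetic outside Notepad++. It assumes Python 3 (where `len()` of a `str` counts code points), uses the standard `re` module instead of `editor.research`, and the name `utf8_breakdown` is hypothetical. Note that, unlike `Total_1_byte` in the script, the 1-byte total here still includes the `\r` and `\n` characters.

```python
import re

def utf8_breakdown(text):
    """Return (n1, n2, n3, n4) : the counts of 1, 2, 3 and 4-byte UTF-8 chars."""
    bytes_length = len(text.encode('utf-8'))    # stands in for editor.getLength()
    total_chars = len(text)                     # stands in for editor.countCharacters()

    total_2_bytes = len(re.findall('[\u0080-\u07FF]', text))
    total_3_bytes = len(re.findall('[\u0800-\uD7FF\uE000-\uFFFF]', text))

    # Bytes - Chars = n2 + 2*n3 + 3*n4  =>  n4 = (Bytes - Chars - n2 - 2*n3) / 3
    total_4_bytes = (bytes_length - total_chars - total_2_bytes - 2 * total_3_bytes) // 3

    total_1_byte = total_chars - total_2_bytes - total_3_bytes - total_4_bytes
    return total_1_byte, total_2_bytes, total_3_bytes, total_4_bytes

# The four test characters of the last comment line of the script : A, é, ☀ and 𝜜
print(utf8_breakdown('A\u00e9\u2600\U0001d71c'))
```

The integer division `//` is only there to keep the counts as integers under Python 3. The numerator is always a multiple of 3 when the text is valid UTF-8.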

