    Emulation of the "View > Summary" feature with a Python script

    • guy038

      Hi All,

      Continuation of the previous script :

      Then, with the help of the excellent Babel Map software, updated for Unicode v13.0

      https://www.babelstone.co.uk/Software/BabelMap.html

      I managed to build a list of the 21,143 remaining characters from the living scripts above, which should truly be considered word characters, without any ambiguity

      However, when applying the regex \t\w\t against this list, I got a total of only 17,307 word characters, probably because Notepad++ does not use the Boost regex library with FULL Unicode support :

      • The Boost definition of the regex \w does not consider all the characters over the BMP

      • Some characters of the BMP, although alphabetic, are not considered, yet, as word chars

      For instance, in the short list below, each Unicode char, surrounded by two tabulation chars, cannot be found with the regex \t\w\t, although it is, indeed, seen as a word character by the Unicode Consortium :-((

       023D	Ƚ	  ; Upper_Letter # Lu         LATIN CAPITAL LETTER L WITH BAR
       0370	Ͱ	  ; Upper_Letter # Lu         GREEK CAPITAL LETTER HETA
       04CF	ӏ	  ; Lower_Letter # Ll         CYRILLIC SMALL LETTER PALOCHKA
       066F	ٯ	  ; Other_Letter # Lo         ARABIC LETTER DOTLESS QAF
       0D60	ൠ	  ; Other_Letter # Lo         MALAYALAM LETTER VOCALIC RR
       200D	‍	  ; Join_Control # Cf         ZERO WIDTH JOINER
       213F	ℿ	  ; Upper_Letter # Lu         DOUBLE-STRUCK CAPITAL PI
       2187	ↇ	  ; Letter_Numb. # Nl         ROMAN NUMERAL FIFTY THOUSAND
       24B6	Ⓐ	  ; Other_Alpha. # So         CIRCLED LATIN CAPITAL LETTER A
       2E2F	ⸯ	  ; Modifier_Let # Lm         VERTICAL TILDE
       A727	ꜧ	  ; Lower_Letter # Ll         LATIN SMALL LETTER HENG
       FF3F	_	  ; Conn._Punct. # Pc         FULLWIDTH LOW LINE
      1D400	𝐀	  ; Upper_Letter # Lu         MATHEMATICAL BOLD CAPITAL A
      1D70B	𝜋	  ; Lower_Letter # Ll         MATHEMATICAL ITALIC SMALL PI
      1F150	🅐	  ; Other_Alpha. # So         NEGATIVE CIRCLED LATIN CAPITAL LETTER A
      

      To my mind, for all these reasons, as we cannot rely on the notion of Word, the View > Summary... feature should simply omit the number of words or, at least, add the indication With caution !


      By contrast, I think that it would be useful to count the number of Non_Space strings, determined with the regex \S+. Indeed, we would get more reliable results ! The boundaries of Non_Space strings, which are the Space characters, belong to the well-defined list of the 25 Unicode characters with the binary property White_Space, from the PropList.txt file. Refer to the very beginning of this file :

      http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt

      As a reminder, the regex \s is identical to \h|\v. So, it represents the complete character class [\t\x20\xA0\x{1680}\x{2000}-\x{200B}\x{202F}\x{3000}]|[\n\x0B\f\r\x85\x{2028}\x{2029}] which can be re-ordered as :

      \s = [\t\n\x0B\f\r\x20\x85\xA0\x{1680}\x{2000}-\x{200B}\x{2028}\x{2029}\x{202F}\x{3000}]

      Note that, in practice, the \s regex is mostly equivalent to the simple regex [\t\n\r\x20]

      Here is the list of all Unicode characters with the property White_Space, with their name and their General_Category value :

      0009	TAB	; White_Space         # Cc       TABULATION  <control-0009>
      000A	LF	; White_Space         # Cc       LINE FEED  <control-000A>
      000B		; White_Space         # Cc       VERTICAL TABULATION  <control-000B>
      000C		; White_Space         # Cc       FORM FEED  <control-000C>
      000D	CR	; White_Space         # Cc       CARRIAGE RETURN  <control-000D>
      0020	 	; White_Space         # Zs       SPACE
      0085		; White_Space         # Cc       NEXT LINE  <control-0085>
      00A0	 	; White_Space         # Zs       NO-BREAK SPACE
      1680	 	; White_Space         # Zs       OGHAM SPACE MARK
      2000	 	; White_Space         # Zs       EN QUAD
      2001	 	; White_Space         # Zs       EM QUAD
      2002	 	; White_Space         # Zs       EN SPACE
      2003	 	; White_Space         # Zs       EM SPACE
      2004	 	; White_Space         # Zs       THREE-PER-EM SPACE
      2005	 	; White_Space         # Zs       FOUR-PER-EM SPACE
      2006	 	; White_Space         # Zs       SIX-PER-EM SPACE
      2007	 	; White_Space         # Zs       FIGURE SPACE
      2008	 	; White_Space         # Zs       PUNCTUATION SPACE
      2009	 	; White_Space         # Zs       THIN SPACE
      200A	 	; White_Space         # Zs       HAIR SPACE
      2028		; White_Space         # Zl       LINE SEPARATOR
      2029		; White_Space         # Zp       PARAGRAPH SEPARATOR
      202F	 	; White_Space         # Zs       NARROW NO-BREAK SPACE
      205F	 	; White_Space         # Zs       MEDIUM MATHEMATICAL SPACE
      3000	 	; White_Space         # Zs       IDEOGRAPHIC SPACE
      

      Note that I used the notations TAB, LF and CR, standing for the three characters \t, \n and \r, instead of the chars themselves
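For anyone who wants to count Non_Space strings outside the Find dialog, here is a minimal Python sketch. It assumes the text is already decoded to a Python `str` and that the separators are exactly the 25 White_Space code points listed above. The names `WHITE_SPACE` and `count_non_space_strings` are only illustrative, not part of any Notepad++ or PythonScript API.

```python
import re

# The 25 code points carrying the White_Space property, as listed above.
WHITE_SPACE = (
    [0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020, 0x0085, 0x00A0, 0x1680]
    + list(range(0x2000, 0x200B))  # U+2000 .. U+200A (11 code points)
    + [0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)

# Build a negated character class such as [^\u0009\u000A...]+
_CLASS = "".join(f"\\u{cp:04X}" for cp in WHITE_SPACE)
NON_SPACE_RUN = re.compile(f"[^{_CLASS}]+")


def count_non_space_strings(text: str) -> int:
    """Count maximal runs of characters that are not White_Space."""
    return sum(1 for _ in NON_SPACE_RUN.finditer(text))
```

Note that this class follows the PropList.txt list, so U+205F is a separator and U+200B is not. The Boost class quoted earlier differs on these two characters.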

      So, in order to get the number of Non_Space strings, we should normally use the simple regex \S+. However, it does not give the right number. Indeed, when several characters with code-points over the BMP are consecutive, they are not seen as a single Non_Space string but as individual characters :-((

      Test my statement with this string, composed of four consecutive emoji chars 👨👩👦👧. The regex \S+ returns four Non_Space strings, whereas I would have expected only one string !

      Consequently, I verified that the suitable regex to count all the Non_Space strings of a file, whatever their Unicode code-points, is rather the regex ((?!\s).[\x{D800}-\x{DFFF}]?)+ ( Longer, I agree, but exact ! )
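The behaviour described above suggests that each character over the BMP is handled as two UTF-16 code units (a surrogate pair), which is why the optional [\x{D800}-\x{DFFF}] is needed. The small Python sketch below only illustrates that arithmetic on the same four emoji. It is not Notepad++ code: Python's `re` works on whole code points, so a plain `\S+` already yields a single run there.

```python
import re

# Four consecutive emoji, each one a code point above the BMP.
family = "\U0001F468\U0001F469\U0001F466\U0001F467"

# Number of Unicode code points in the string.
code_points = len(family)

# Number of 16-bit code units once encoded as UTF-16 (2 bytes per unit).
utf16_units = len(family.encode("utf-16-le")) // 2

# Python matches on code points, so the four emoji form one Non_Space run.
runs = re.findall(r"\S+", family)

print(code_points, utf16_units, len(runs))  # 4 8 1
```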


      Now, here is a list of regexes which allow you to count different subsets of characters from the current file contents. Of course, tick the Wrap around option !

      - Number of chars in a current non-ANSI ( UNICODE ) file, as the zone [\x{D800}-\x{DFFF}] represents the reserved SURROGATE area :
      
        - Number of chars, in range [U+0000 - U+007F ], WITHOUT the \r AND \n chars               =  N1  =  (?![\r\n])[\x{0000}-\x{007F}]
      
        - Number of chars, in range [U+0080 - U+07FF ]                                            =  N2  =  [\x{0080}-\x{07FF}]
      
        - Number of chars, in range [U+0800 - U+FFFF ], except in SURROGATE range [D800 - DFFF]   =  N3  =  (?![\x{D800}-\x{DFFF}])[\x{0800}-\x{FFFF}]
                                                                                                           ------------------------------------------------
      
        - Number of chars, in range [U+0000 - U+FFFF ], in BMP , WITHOUT the \r AND \n  =  N1 + N2 + N3  =  (?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]  or  [^\r\n\x{D800}-\x{DFFF}]
      
        - Number of chars, in range [U+10000 - U+10FFFF], OVER the BMP                            =  N4  =  (?-s).[\x{D800}-\x{DFFF}]
                                                                                                           ---------------------------
      
        - TOTAL chars, in an UNICODE file, WITHOUT the \r AND \n chars             =  N1 + N2 + N3 + N4  =  [^\r\n]
      
        - Number of \r characters + Number of \n characters                                       =  N0  =  \r|\n
                                                                                                           ---------
      
        - TOTAL chars, in an UNICODE file, WITH the \r AND \n chars           =  N0 + N1 + N2 + N3 + N4  =  (?s).
      
      
      - Number of chars in a current ANSI file :
      
        - Number of characters, in range [U+0000 - U+00FF], WITHOUT the \r AND \n chars           =  N1  =  [^\r\n]
      
        - Number of \r characters + Number of \n characters                                       =  N0  =  \r|\n
                                                                                                           ---------
      
        - TOTAL chars, in an ANSI file, WITH the \r AND \n chars                             =  N0 + N1  =  (?s).
      
      
      
      - TOTAL current FILE length <Fl> in Notepad++ BUFFER :
      
        - For an ANSI                          file            Fl  =                             N0 + N1  =  (?s).
      
        - For an UTF-8 or UTF-8-BOM            file            Fl  =  N0 + N1 + 2 × N2 + 3 × N3 + 4 × N4
      
        - For an UCS-2 BE BOM or UCS-2 LE BOM  file            Fl  =           ( N0 + N1 + N2 + N3 ) × 2  =  (?s). × 2
      
      
      
      - Byte Order Mark ( BOM = U+FEFF ) length <Bl> and encoding, for SAVED files :
      
        - For an ANSI or UTF-8                 file            Bl  =  0 byte
      
        - For an UTF-8-BOM                     file            Bl  =  3 bytes  ( EF BB BF )
      
        - For an UCS-2 BE BOM                  file            Bl  =  2 bytes  ( FE FF )
      
        - For an UCS-2 LE BOM                  file            Bl  =  2 bytes  ( FF FE )
      
      
      
      - TOTAL CURRENT file length on DISK, WHATEVER its encoding     Ld  =  Fl + Bl  ( = Total FILE length + BOM length )
      
      
      
      - NUMBER of WORDS                                                  =  \w+  whatever the file TYPE ( This result must be considered with CAUTION )
      
      
      - NUMBER of NON_SPACE strings                                      =  ((?!\s).[\x{D800}-\x{DFFF}]?)+ for an UNICODE file or ((?!\s).)+ for an ANSI file
      
      
      
      - Number of LINES in an UNICODE file :
      
        - Number of true EMPTY lines                                     =  (?<![\f\x{0085}\x{2028}\x{2029}])^(?:\r\n|\r|\n)
      
        - Number of lines containing TAB and/or SPACE characters ONLY    =  (?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]+(?:\r\n|\r|\n|\z)
                                                                           --------------------------------------------------------------
      
        - TOTAL Number of BLANK or EMPTY lines                           =  (?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]*(?:\r\n|\r|\n|\z)
      
        - Number of NON BLANK and NON EMPTY lines                        =  (?-s)(?!^[\t\x20]+$)^(?:.|[\f\x{0085}\x{2028}\x{2029}])+(?:\r\n|\r|\n|\z)
                                                                           --------------------------------------------------------------------------
      
        - TOTAL number of LINES in an UNICODE file                       =  (?-s)\r\n|\r|\n|(?:.|[\f\x{0085}\x{2028}\x{2029}])\z
      
      
      - Number of LINES in an ANSI file :
      
        - Number of true EMPTY lines                                     =  (?<!\f)^(?:\r\n|\r|\n)
      
        - Number of lines containing TAB and/or SPACE characters ONLY    =  (?<!\f)^[\t\x20]+(?:\r\n|\r|\n|\z)
                                                                           ------------------------------------
      
        - TOTAL Number of EMPTY or BLANK lines                           =  (?<!\f)^[\t\x20]*(?:\r\n|\r|\n|\z)
      
        - Number of NON BLANK and NON EMPTY lines                        =  (?-s)(?!^[\t\x20]+$)^(?:.|\f)+(?:\r\n|\r|\n|\z)
                                                                           -------------------------------------------------
      
        - TOTAL number of LINES in an ANSI file                          =  (?-s)\r\n|\r|\n|(?:.|\f)\z
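The character counts N0 to N4 and the length formulas above can be cross-checked with a short Python sketch. It assumes well-formed text held in a Python `str` (no lone surrogates). The function names and the encoding labels used as dictionary keys are illustrative choices only.

```python
import codecs


def count_by_utf8_width(text: str) -> dict:
    """Split the characters of text into the classes N0..N4 used above."""
    counts = {"N0": 0, "N1": 0, "N2": 0, "N3": 0, "N4": 0}
    for ch in text:
        cp = ord(ch)
        if ch in "\r\n":
            counts["N0"] += 1   # EOL characters
        elif cp <= 0x7F:
            counts["N1"] += 1   # 1 byte in UTF-8
        elif cp <= 0x7FF:
            counts["N2"] += 1   # 2 bytes in UTF-8
        elif cp <= 0xFFFF:
            counts["N3"] += 1   # 3 bytes in UTF-8
        else:
            counts["N4"] += 1   # 4 bytes in UTF-8 (over the BMP)
    return counts


def utf8_buffer_length(c: dict) -> int:
    """Fl for UTF-8 = N0 + N1 + 2 x N2 + 3 x N3 + 4 x N4."""
    return c["N0"] + c["N1"] + 2 * c["N2"] + 3 * c["N3"] + 4 * c["N4"]


def ucs2_buffer_length(c: dict) -> int:
    """Fl for UCS-2 = (N0 + N1 + N2 + N3) x 2, valid for BMP-only text."""
    return (c["N0"] + c["N1"] + c["N2"] + c["N3"]) * 2


# BOM bytes per encoding, taken from the standard codecs module.
BOM = {
    "ANSI": b"",
    "UTF-8": b"",
    "UTF-8-BOM": codecs.BOM_UTF8,         # EF BB BF
    "UCS-2 BE BOM": codecs.BOM_UTF16_BE,  # FE FF
    "UCS-2 LE BOM": codecs.BOM_UTF16_LE,  # FF FE
}


def disk_length(buffer_length: int, encoding: str) -> int:
    """Ld = Fl + Bl."""
    return buffer_length + len(BOM[encoding])
```

For example, `utf8_buffer_length(count_by_utf8_width(s))` should equal `len(s.encode("utf-8"))` for any well-formed string `s`.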
      
      

      Remarks : Although most of the regexes, above, can be easily understood, here are some additional elements :

      • The regex (?-s).[\x{D800}-\x{DFFF}] is the sole correct syntax, with our Boost regex engine, to count all the characters over the BMP

      • The regex (?s)((?!\s).[\x{D800}-\x{DFFF}]?)+, to count all the Non_Space strings, was explained before

      • In all the regexes related to the counting of lines, you probably noticed the character class [\f\x{0085}\x{2028}\x{2029}]. It must be present because each of the four characters \f, \x{0085}, \x{2028} and \x{2029} is treated as both a start and an end of line, like the assertions ^ and $ !

        • For instance, if, in a new file, you insert one Next_Line char ( NEL ), of code-point \x{0085} and hit the Enter key, this sole line is wrongly seen as an empty line by the simple regex ^(?:\r\n|\r|\n) which matches the line-break after the Next_Line char !
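The same pitfall exists in Python, where `str.splitlines()` also breaks on \f, \x85, \x{2028} and \x{2029}. Below is a minimal sketch of the line counts, assuming that only \r\n, \r and \n end a line, as in the regexes above. `classify_lines` is a hypothetical helper name, and the handling of a final line without line-break is my reading of the intent, not a verified match of the regex results.

```python
import re

# Only CRLF, CR and LF are treated as line-breaks.
EOL = re.compile(r"\r\n|\r|\n")


def classify_lines(text: str) -> dict:
    """Count total, empty, blank (TAB/SPACE only) and non-blank lines."""
    lines = EOL.split(text)
    # A line-break at the very end does not start an extra line.
    if lines and lines[-1] == "":
        lines.pop()
    empty = sum(1 for line in lines if line == "")
    blank = sum(1 for line in lines if line != "" and line.strip("\t ") == "")
    return {
        "total": len(lines),
        "empty": empty,
        "blank": blank,
        "non_blank": len(lines) - empty - blank,
    }


# A line holding only a Next_Line char ( NEL ) is not an empty line.
print(classify_lines("\x85\r\n"))
```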

      To end, I would like to propose a new layout for the Summary feature, which should be more informative !

      IMPORTANT : In the list below, any text, before the 1st colon character of each line, is the name which should be displayed in the Summary dialog !

      Full File Path    :  X:\....\....\
      
      Creation Date     :  MM/DD/YYYY HH:MM:SS
      Modification Date :  MM/DD/YYYY HH:MM:SS
                                                                      UTF-8[-BOM]                                                   UCS-2 BE/LE BOM            ANSI
                                                        -----------------------------------------------------------------------------------------------------------------
      1-Byte  Chars     :  N1                         =   (?![\r\n])[\x{0000}-\x{007F}]                                                   idem                [^\r\n]
      2-Bytes Chars     :  N2                         =   [\x{0080}-\x{07FF}]                                                             idem                   0
      3-Bytes Chars     :  N3                         =   (?![\x{D800}-\x{DFFF}])[\x{0800}-\x{FFFF}]                                      idem                   0
      
      Total BMP Chars   :  N1 + N2 + N3               =   (?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]                                  idem                [^\r\n]
      4-Bytes Chars     :  N4                         =   (?-s).[\x{D800}-\x{DFFF}]                                                         0                    0
      
      NON BLANK chars   :                             =   [^\r\n\t\x20]                                                                   idem                idem
      
      Chars w/o CR|LF   :  N1 + N2 + N3 + N4          =   [^\r\n]                                                                         idem                idem
      EOL ( CR or LF )  :  N0                         =   \r|\n                                                                           idem                idem
      
      TOTAL Characters  :  N0 + N1 + N2 + N3 + N4     =   (?s).                                                                           idem                idem
      
      
      BYTE Length       :                             =   N0 + N1 + 2 × N2 + 3 × N3 + 4 × N4                                 ( N0 + N1 + N2 + N3 ) × 2        (?s).
      
      Byte Order Mark   :                             =   0 ( UTF-8)  or  3 ( UTF-8-BOM )                                                   2                    0
      
      
      BUFFER Length     :  BYTE length  +  BOM
      
      FILE   Length     :  SAVED length of CURRENT file on DISK
      
      
      WORDS ( Caution ) :                             =   \w+                                                                             idem                idem
      
      NON-SPACE strings :                             =   ((?!\s).[\x{D800}-\x{DFFF}]?)+                                                  ((?!\s).)+          ((?!\s).)+
      
      
      True EMPTY lines  :                             =   (?<![\f\x{0085}\x{2028}\x{2029}])^(?:\r\n|\r|\n)                                idem                (?<!\f)^(?:\r\n|\r|\n)
      
      BLANK lines       :                             =   (?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]+(?:\r\n|\r|\n|\z)                    idem                (?<!\f)^[\t\x20]+(?:\r\n|\r|\n|\z)
      
      
      EMPTY/BLANK lines :                             =   (?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]*(?:\r\n|\r|\n|\z)                    idem                (?<!\f)^[\t\x20]*(?:\r\n|\r|\n|\z)
      
      NON-BLANK lines   :                             =   (?-s)(?!^[\t\x20]+$)^(?:.|[\f\x{0085}\x{2028}\x{2029}])+(?:\r\n|\r|\n|\z)       idem                (?-s)(?!^[\t\x20]+$)^(?:.|\f)+(?:\r\n|\r|\n|\z)
      
      TOTAL lines       :                             =   (?-s)\r\n|\r|\n|(?:.|[\f\x{0085}\x{2028}\x{2029}])\z                            idem                (?-s)\r\n|\r|\n|(?:.|\f)\z
      
      
      Selection(s)      :  X characters (Y bytes) in Z ranges
      

      Best Regards,

      guy038

      • guy038

        Hello All,

        Here is the updated version of my previous posts regarding the present N++ Summary feature ( View > Summary... ). And I must say that numerous things are still weird !

        For tests, I used various files as well as my Total_Chars.txt file, written with the 4 N++ Unicode encodings and also an ANSI file, containing the 256 characters of the Windows-1252 encoding :

        • ANSI
        • UTF-8
        • UTF-8-BOM
        • UTF-16 BE BOM
        • UTF-16 LE BOM

        To my mind, there are 3 major problems and some minor points :

        • The first and worst problem is that, when a UTF-8[-BOM] file containing various Unicode chars ( of the BMP only : this point is important ! ) is copied into a UCS-2 BE BOM or UCS-2 LE BOM encoded file, some results given by the Summary feature for these new files are totally wrong :

          • The characters( without line endings ) value seems to be the number of bytes used in the corresponding UTF-8[-BOM] file

          • The Document length value seems to be the document length of the corresponding UTF-8[-BOM] file and is also displayed, unfortunately, in the status bar !!

        • The second problem is that the definition of a word char used by the Summary feature is definitely NOT the same as the definition of the regex \w, as explained further on !

        • Thus, the third problem is that the given number of words is totally inaccurate ! And, anyway, the number of words, although well enough defined for an English / American text, is a rather vague notion for a lot of texts written in other languages, especially Asian ones ! ( See further on )

        • Some minor things :

          • The number of lines given is, most of the time, increased by one unit

          • Presently, the Summary feature displays the document length in the Notepad++ buffer. I think it would be good to display, as well, the actual document length saved on disk. Incidentally, for just-saved documents, the difference would give the length of the possible Byte Order Mark, if its size were not explicitly displayed !

          • For any encoded file, a decomposition, giving the number of chars coded with 1, 2, 3 and 4 bytes would be welcome !

        So, in brief, in the present Summary window :

        • The Characters (without line endings): number is wrong for the UTF-16 BE BOM or UTF-16 LE BOM encodings

        • The Words number is totally wrong, given the regex definition of a word character, whatever the encoding used

        • The Lines: number is wrong, by one unit, if a line-break ends the last line of current file, in any encoding

        • The Document length value, in N++ buffer, is wrong for the UTF-16 BE BOM or UTF-16 LE BOM encodings, as well as the Length: indication in the status bar


        To begin with, let me develop the… second bug ! After numerous tests, I determined that, in the present View > Summary... feature, the characters considered as word characters are :

        • The C0 control characters, except for the Tabulation ( \x{0009} ) and the two EOL ( \x{000a} and \x{000d} ), so the regex (?![\t\r\n])[\x00-\x1F]

        • The number sign #

        • The 10 digits, so the regex [0-9]

        • The 26 uppercase and lowercase letters, so the regex (?i)[A-Z]

        • The low line character _

        • All the characters, of the Basic Multilingual Plane ( BMP ), with code-point over \x{007E}, so the regex (?![\x{D800}-\x{DFFF}])[\x{007F}-\x{FFFF}] for a Unicode encoded file or [\x7F-\xFF] for an ANSI encoded file

        • All the characters, over the Basic Multilingual Plane, so the regex (?-s).[\x{D800}-\x{DFFF}] for an Unicode encoded file, only

        To simulate the present Words: number ( which is erroneous ! ), given by the summary feature, whatever the file encoding, simply use the regex below :

        [^\t\n\r\x20!"$%&'()*+,\-./:;<=>?@\x5B\x5C\x5D^\x60{|}~]+
        

        and click on the Count button of the Find dialog, with the Wrap around option ticked
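The same emulation can be sketched in a Python script. The pattern below is the regex above, copied unchanged into Python's `re` syntax. The assumption is that the text is available as a Python `str`, and `summary_word_count` is only an illustrative name.

```python
import re

# Same character class as the Notepad++ regex above: a "word" is any run of
# characters other than TAB, LF, CR, SPACE and the listed ASCII punctuation.
SUMMARY_WORD = re.compile(
    r"""[^\t\n\r\x20!"$%&'()*+,\-./:;<=>?@\x5B\x5C\x5D^\x60{|}~]+"""
)


def summary_word_count(text: str) -> int:
    """Emulate the ( erroneous ) Words: number of the Summary dialog."""
    return len(SUMMARY_WORD.findall(text))


# '#', '_' and control characters such as \x01 all count as word characters.
print(summary_word_count("l'école aujourd'hui #tag a_b"))  # 6
```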

        Obviously, this is not exact, as a single word character is matched by the \w regex, which is the class [\u\l\d_], where \u, \l and \d represent any Unicode uppercase, lowercase and digit char or a related char, so, finally, much more than the simple [A-Za-z0-9] set !

        But, worse, it is the very notion of word which is, in practice, not consistent most of the time ! Indeed, for instance, if we consider the French expression l'école ( the school ), the regex \w+ would return 2 words, which is correct as this expression can be mentally decomposed as la école. However, this regex would wrongly say that the single word aujourd'hui ( today ) is a two-word expression. Of course, you could change the regex to [\w']+ which would return 1 word but, this time, the expression l'école would wrongly be considered as a one-word string !
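The two French examples can be checked with a few lines of Python. This is only an illustration using Python's own `re` module, whose `\w` is Unicode-aware for `str` input. It is not the Boost engine of Notepad++, but it behaves the same way on these two expressions.

```python
import re

expressions = ("l'école", "aujourd'hui")

# Number of matches of each regex against each expression.
plain = {e: len(re.findall(r"\w+", e)) for e in expressions}
with_apostrophe = {e: len(re.findall(r"[\w']+", e)) for e in expressions}

print(plain)            # {"l'école": 2, "aujourd'hui": 2}
print(with_apostrophe)  # {"l'école": 1, "aujourd'hui": 1}
```

Neither regex gives the expected pair ( 2 words, then 1 word ), which is the whole point.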

        In addition, what can be said about languages that do not use the Space character or where the use of the Space is discretionary ? Then, counting words is impossible or, rather, not meaningful ! This is developed in the article by Martin Haspelmath, below :

        https://zenodo.org/record/225844/files/WordSegmentationFL.pdf

        At the end of section 5, it is said : … On such a view, the claim that “all languages have words” (Radford et al. 1999: 145) would be interpretable only in the weaker sense that “all languages have a unit which falls between the minimal sign and the phrase” …

        And : … The basic problem remains the same: The units are defined in a language-specific way and cannot be equated across languages, and there is no reason to give special status to a unit called ‘word’. …

        At the beginning of section 7 : … Linguists have no good basis for identifying words across languages …

        And in the conclusion, section 10 : … I conclude, from the arguments presented in this article, that there is no definition of ‘word’ that can be applied to any language and that would yield consistent results …


        Now, the Unicode definition of a word character is :

        \p{Alphabetic} | \p{gc=Mark} | \p{gc=Decimal_Number} | \p{gc=Connector_Punctuation} | \p{Join_Control}

        https://stackoverflow.com/questions/5555613/does-w-match-all-alphanumeric-characters-defined-in-the-unicode-standard

        https://www.unicode.org/reports/tr18/#Simple_Word_Boundaries

        So, in theory, the word_character class should include :

        • All values of the derived category Alphabetic ( = alpha = \p{alphabetic} ) so 132,875 chars, from the DerivedCoreProperties.txt file, which can be decomposed into :

          • Uppercase_Letter (Lu) + Lowercase_Letter (Ll) + Titlecase_Letter (Lt) + Modifier_Letter (Lm) + Other_Letter (Lo) + Letter_Number (Nl) + Other_Alphabetic, so the characters sum 1,791 + 2,155 + 31 + 260 + 127,004 + 236 + 1,398

          • Note : The last property Other_Alphabetic, from the PropList.txt file, contains some, but not all, characters from the 3 General_Categories Spacing_Mark ( Mc ), Nonspacing_Mark ( Mn ) and Other_Symbol ( So ), so the characters sum 417 + 851 + 130

        • All values with General_Category = Decimal_Number, from the DerivedGeneralCategory.txt file, so 650 characters

          ( These are characters with defined values in the three fields 6, 7 and 8 of the UnicodeData.txt file )

        • All values with General_Category = Connector_Punctuation, from the DerivedGeneralCategory.txt file, so 10 characters

        • All values with the binary Property Join_Control, from the PropList.txt file, so 2 characters

        So, if we include all Unicode languages, even historical ones :

        => Total number of Unicode word characters = 132,875 + 650 + 10 + 2 = 133,537 characters, with version UNICODE 13.0.0 !!
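A rough cross-check of this definition is possible with Python's standard `unicodedata` module. The sketch below is only an approximation and rests on stated assumptions: `unicodedata` exposes the General_Category but not the Alphabetic or Other_Alphabetic properties, so Alphabetic is approximated by the letter categories plus Nl, all marks are accepted ( gc=Mark ), and the Other_Symbol part of Other_Alphabetic ( such as Ⓐ ) is missed. The results also depend on the Unicode version bundled with the Python build, so the total will not necessarily match the 13.0.0 figure above.

```python
import unicodedata

# The two Join_Control characters: ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER.
JOIN_CONTROL = {"\u200C", "\u200D"}


def is_word_char_approx(ch: str) -> bool:
    """Approximate the Unicode word-character test from General_Category."""
    cat = unicodedata.category(ch)
    return (
        cat[0] in "LM"               # Lu, Ll, Lt, Lm, Lo and Mn, Mc, Me
        or cat in ("Nl", "Nd", "Pc") # Letter_Number, Decimal_Number, Connector_Punctuation
        or ch in JOIN_CONTROL
    )


def count_word_chars_approx() -> int:
    """Count the approximated word characters over all code points."""
    return sum(is_word_char_approx(chr(cp)) for cp in range(0x110000))


print(unicodedata.unidata_version, count_word_chars_approx())
```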

        Notes :

        • The different files mentioned can be downloaded from the Unicode Character Database ( UCD ) or sub-directories, below :

        http://www.unicode.org/Public/UCD/latest/ucd/

        • And refer to the sites, below, for additional information to this topic :

        https://www.unicode.org/reports/tr18/#Compatibility_Properties

        https://www.unicode.org/reports/tr29/#Word_Boundaries

        https://www.unicode.org/reports/tr31/    for tables 4, 5 and 6 of section 2.4

        https://www.unicode.org/reports/tr44/#UnicodeData.txt


        Anyone who followed the links to the Unicode Consortium, above, will quickly understand that the notions of word characters and word boundaries are a real nightmare !

        Even if we restrict the definition of word chars to Unicode living scripts, forgetting all the historical scripts no longer in use, and also leaving aside all scripts which do not systematically use the space char to delimit words, we still have a list of about 21,000 characters which should be considered as word characters ! I tried to build up such a list, with the help of these sites :

        https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries

        https://linguistlist.org/issues/6/6-1302/

        https://unicode-org.github.io/cldr-staging/charts/37/supplemental/scripts_and_languages.html

        https://scriptsource.org/cms/scripts/page.php?item_id=script_overview

        https://r12a.github.io/scripts/featurelist/

        And I ended up with this list of 46 living scripts which always use a Space character between words :

        •------------------------•----------------•-------------------•-----------------•
        |                        |    SCRIPT      |   SPACE between   |  UNICODE Script |
        |                        |      Type :    |      Words :      |     Class :     |
        |                        •----------------•-------------------•-----------------•
        |           SCRIPT       |  (L)iving      |  (Y)es            |  (R)ecommended  |
        |                        |                |  (U)nspecified    |  (L)imited      |
        |                        |  (H)istorical  |  (D)iscretionary  |  (E)xcluded     |
        |                        |                |  (N)o             |                 |
        •------------------------•----------------•-------------------•-----------------•
        |  ARMENIAN              |       L        |         Y         |        R        |
        |  ADLAM                 |       L        |         Y         |        L        |
        |  ARABIC                |       L        |         Y         |        R        |
        |  BAMUM                 |       L        |         Y         |        L        |
        |  BASSA VAH             |       L        |         Y         |        E        |
        |  BENGALI ( Assamese )  |       L        |         Y         |        R        |
        |  BOPOMOFO              |       L        |         Y         |        R        |
        |  BUGINESE              |       L        |         D         |        E        |
        |  CANADIAN SYLLABICS    |       L        |         Y         |        L        |
        |  CHEROKEE              |       L        |         Y         |        L        |
        |  CYRILLIC              |       L        |         Y         |        R        |
        |  DEVANAGARI            |       L        |         Y         |        R        |
        |  ETHIOPIC (Ge'ez)      |       L        |         Y         |        R        |
        |  GEORGIAN              |       L        |         Y         |        R        |
        |  GREEK                 |       L        |         Y         |        R        |
        |  GUJARATI              |       L        |         Y         |        R        |
        |  GURMUKHI              |       L        |         Y         |        R        |
        |  HANGUL                |       L        |         Y         |        R        |
        |  HANIFI ROHINGYA       |       L        |         Y         |        L        |
        |  HEBREW                |       L        |         Y         |        R        |
        |  KANNADA               |       L        |         Y         |        R        |
        |  KAYAH LI              |       L        |         Y         |        L        |
        |  LATIN                 |       L        |         Y         |        R        |
        |  LIMBU                 |       L        |         Y         |        L        |
        |  MALAYALAM             |       L        |         D         |        R        |
        |  MANDAIC               |       H        |         Y         |        L        |
        |  MEETEI MAYEK          |       L        |         Y         |        L        |
        |  MIAO (Pollard)        |       L        |         Y         |        L        |
        |  MONGOLIAN             |       L        |         Y         |        E        |
        |  NEWA                  |       L        |         Y         |        L        |
        |  NKO                   |       L        |         Y         |        L        |
        |  OL CHIKI              |       L        |         Y         |        L        |
        |  ORIYA (Odia)          |       L        |         Y         |        R        |
        |  OSAGE                 |       L        |         Y         |        L        |
        |  SINHALA               |       L        |         Y         |        R        |
        |  SUNDANESE             |       L        |         Y         |        L        |
        |  SYLOTI NAGRI          |       L        |         Y         |        L        |
        |  SYRIAC                |       L        |         Y         |        L        |
        |  TAI VIET              |       L        |         Y         |        L        |
        |  TAMIL                 |       L        |         Y         |        R        |
        |  TELUGU                |       L        |         Y         |        R        |
        |  THAANA                |       L        |         D         |        R        |
        |  TIFINAGH (Berber)     |       L        |         Y         |        L        |
        |  VAI                   |       L        |         Y         |        L        |
        |  WANCHO                |       L        |         Y         |        L        |
        |  YI                    |       L        |         Y         |        L        |
        •------------------------•----------------•-------------------•-----------------•
        

        These scripts span 101 Unicode blocks, from Basic Latin ( 0000 - 007F ) to Symbols for Legacy Computing ( 1FB00 - 1FBFF )


        You may also have a look at these sites for general information :

        https://en.wikipedia.org/wiki/List_of_Unicode_characters

        https://en.wikipedia.org/wiki/Scriptio_continua#Decline

        https://glottolog.org/glottolog/language    especially to locate the area where a language is used

        Continued discussion in the next post

        guy038

        • guy038G
          guy038
          last edited by

          Hi All,

          Then, with the help of the excellent Babel Map software, updated for Unicode v13.0

          https://www.babelstone.co.uk/Software/BabelMap.html

          I succeeded in creating a list of the 21,143 remaining characters, from the living scripts, above, which should be truly considered as word characters, without any ambiguity

          On the other hand, with the help of my Total_Chars.txt file, which contains 325,590 characters, I detected 48,031 word chars with a simple search of the \w regex. This number seems large but it includes all the Chinese characters and equivalent chars, which cannot be truly counted as word chars because of their vertical / horizontal way of writing !

          In addition, when applying the regex \t\w\t against the list above, I got a total of 17,307 word characters, only, probably because Notepad++ does not use the Boost regex library with FULL Unicode support

          Indeed, after some verifications :

          • The Boost definition of the regex \w does not consider all the characters over the BMP

          • Some characters of the BMP, although alphabetic, are not considered, yet, as word chars

          For instance, in the short list below, each Unicode char, surrounded with two tabulation chars, cannot be found with the regex \t\w\t, although each char is, indeed, seen as a word character by the Unicode Consortium :-((

           24B6   Ⓐ     ; Other_Symbol     # So         CIRCLED LATIN CAPITAL LETTER A
          1D400   𝐀     ; Uppercase_Letter # Lu         MATHEMATICAL BOLD CAPITAL A
          1D70B   𝜋     ; Lowercase_Letter # Ll         MATHEMATICAL ITALIC SMALL PI
          1F150   🅐     ; Other_Symbol     # So         NEGATIVE CIRCLED LATIN CAPITAL LETTER A
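
The General_Category values quoted in this list can be cross-checked outside Notepad++. Here is a minimal Python 3 sketch using the standard unicodedata module. The helper name describe is only illustrative, and the result depends on the Unicode version bundled with your Python build, not on the Boost engine :

```python
import unicodedata

def describe(code_point):
    """Return (General_Category, character name) for one code point."""
    ch = chr(code_point)
    return unicodedata.category(ch), unicodedata.name(ch)

# The four characters of the list above
for cp in (0x24B6, 0x1D400, 0x1D70B, 0x1F150):
    category, name = describe(cp)
    print('{:>5X}   {}   {}'.format(cp, category, name))
```

Two of them are letters ( Lu and Ll ) located over the BMP, and the two others are symbols ( So ), which gives an idea of why a strict \w definition may skip them.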
          

          To my mind, for all these reasons, as we cannot rely on the word notion, the View > Summary... feature should just ignore the number of words or, at least, add the indication With caution !


          By contrast, I think that it would be useful to count the number of Non_Space strings, determined with the regex \S+. Indeed, we would get more reliable results ! The boundaries of Non_Space strings, which are the Space characters, belong to the well-defined list of the 25 Unicode characters with the binary property White_Space, from the PropList.txt file. Refer to the very beginning of this file :

          http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt

          As a reminder, the regex \s is identical to \h|\v. So, it represents the complete character class [\t\x20\xA0\x{1680}\x{2000}-\x{200B}\x{202F}\x{3000}]|[\n\x0B\f\r\x85\x{2028}\x{2029}] which can be re-ordered as :

          \s = [\t\n\x0B\f\r\x20\x85\xA0\x{1680}\x{2000}-\x{200B}\x{2028}\x{2029}\x{202F}\x{3000}]

          Note that, in practice, the \s regex is mainly equivalent to the simple regex [\t\n\r\x20]

          Here is that Unicode list of all Unicode characters with the property White_Space, with their name and their General_Category value :

          0009  TAB  ; White_Space    # Cc    TABULATION  <control-0009>
          000A  LF   ; White_Space    # Cc    LINE FEED  <control-000A>
          000B       ; White_Space    # Cc    VERTICAL TABULATION  <control-000B>
          000C       ; White_Space    # Cc    FORM FEED  <control-000C>
          000D  CR   ; White_Space    # Cc    CARRIAGE RETURN  <control-000D>
          0020       ; White_Space    # Zs    SPACE
          0085       ; White_Space    # Cc    NEXT LINE  <control-0085>
          00A0       ; White_Space    # Zs    NO-BREAK SPACE
          1680       ; White_Space    # Zs    OGHAM SPACE MARK
          2000       ; White_Space    # Zs    EN QUAD
          2001       ; White_Space    # Zs    EM QUAD
          2002       ; White_Space    # Zs    EN SPACE
          2003       ; White_Space    # Zs    EM SPACE
          2004       ; White_Space    # Zs    THREE-PER-EM SPACE
          2005       ; White_Space    # Zs    FOUR-PER-EM SPACE
          2006       ; White_Space    # Zs    SIX-PER-EM SPACE
          2007       ; White_Space    # Zs    FIGURE SPACE
          2008       ; White_Space    # Zs    PUNCTUATION SPACE
          2009       ; White_Space    # Zs    THIN SPACE
          200A       ; White_Space    # Zs    HAIR SPACE
          2028       ; White_Space    # Zl    LINE SEPARATOR
          2029       ; White_Space    # Zp    PARAGRAPH SEPARATOR
          202F       ; White_Space    # Zs    NARROW NO-BREAK SPACE
          205F       ; White_Space    # Zs    MEDIUM MATHEMATICAL SPACE
          3000      ; White_Space    # Zs    IDEOGRAPHIC SPACE
          

          Note that I used the notations TAB, LF and CR, standing for the three characters \t, \n and \r, instead of the chars themselves
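
Comparing the \s class and the White_Space list by hand is error-prone. The small Python 3 sketch below only does set arithmetic on the code points as they are transcribed in this post. It does not query the Boost engine, and the names expand, regex_s and white_space are illustrative :

```python
def expand(ranges):
    """Expand (first, last) inclusive code-point ranges into a set."""
    return {cp for first, last in ranges for cp in range(first, last + 1)}

# The \s character class, exactly as written in this post
regex_s = expand([(0x09, 0x0D), (0x20, 0x20), (0x85, 0x85), (0xA0, 0xA0),
                  (0x1680, 0x1680), (0x2000, 0x200B), (0x2028, 0x2029),
                  (0x202F, 0x202F), (0x3000, 0x3000)])

# The White_Space list, exactly as shown in this post
white_space = expand([(0x09, 0x0D), (0x20, 0x20), (0x85, 0x85), (0xA0, 0xA0),
                      (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2029),
                      (0x202F, 0x202F), (0x205F, 0x205F), (0x3000, 0x3000)])

only_in_regex = sorted(regex_s - white_space)
only_in_list = sorted(white_space - regex_s)

print(len(regex_s), len(white_space))
print([hex(cp) for cp in only_in_regex])
print([hex(cp) for cp in only_in_list])
```

Both sets hold 25 code points, but they are not identical : the class, as written, contains the ZERO WIDTH SPACE \x{200B}, which is absent from the White_Space list, and it lacks the MEDIUM MATHEMATICAL SPACE \x{205F}.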

          So, in order to get the number of Non_Space strings, we should, normally, use the simple regex \S+. However, it does not give the right number. Indeed, when several characters, with code-point over the BMP, are consecutive, they are not seen as a global Non_Space string but as individual characters :-((

          You may test my statement with this string, composed of four consecutive emoji chars 👨👩👦👧. The regex \S+ returns four Non_Space strings, whereas I would have expected only one string !
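
A possible explanation, which I state as an assumption, is that the engine works on UTF-16 code units, where each character over the BMP is stored as a surrogate pair in the range \x{D800}-\x{DFFF}. The Python 3 sketch below ( the helper name utf16_units is illustrative ) shows the eight code units behind these four emoji, and also that Python's own re module, which works on code points, returns a single Non_Space string :

```python
import re
import struct

def utf16_units(text):
    """Return the list of UTF-16 code units used to store text."""
    raw = text.encode('utf-16-le')
    return list(struct.unpack('<{}H'.format(len(raw) // 2), raw))

emoji = '\U0001F468\U0001F469\U0001F466\U0001F467'   # man, woman, boy, girl

units = utf16_units(emoji)
print([hex(u) for u in units])

# Every code unit is a surrogate, none is a White_Space char
print(all(0xD800 <= u <= 0xDFFF for u in units))

# Number of Non_Space strings seen by Python's re module
print(len(re.findall(r'\S+', emoji)))
```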

          Consequently, I verified that, when the number of four-byte chars is > 0, the suitable regex to count all the Non_Space strings of a file, whatever their Unicode code-point, is rather the regex ((?!\s).[\x{D800}-\x{DFFF}]?)+ ( longer, I agree, but exact ! )


          So, I would like to propose a new layout of a summary feature, which should be more informative. It contains a list of regexes which allow you to count different subsets of characters from the current file contents. Of course, tick the Wrap around option in the Find dialog and click on the Count button for tests !

          IMPORTANT : In the list below, any text, before the colon character of each line, is the name which should be displayed in the new Summary dialog !

           FULL File Path    :  X:\....\....\
          
           CREATION     Date :  Name Month Day 22-05-26 Year
           MODIFICATION Date :  Name Month Day 22-05-26 Year
          
           READ-ONLY flag    :  YES / NO
           READ-ONLY editor  :  YES / NO
          
          
           Current VIEW      :  MAIN view / SECONDARY view
          
           Current ENCODING  :  UTF-... / ANSI
          
           Current LANGUAGE  :  TXT ( Normal txt file) / ...
          
           Current Line END  :  Windows (CR LF) / Macintosh (CR) / Unix (LF)
          
           Current WRAPPING  :  YES / NO
          
          •------------------------------------------------•----------------------------------------------------------------------------•------------------------------------------------•-----------------------------------•
                                                           |                                 UTF-8 [-BOM]                               |             UCS-2/UTF-16 BE/LE BOM             |                ANSI
          •------------------------------------------------•----------------------------------------------------------------------------•------------------------------------------------•-----------------------------------•
                                                           |                                                                            |                                                |
           1-BYTE  Chars     :  N1                         | (?![\r\n])[\x{0000}-\x{007F}]                                              |                        0                       |               [^\r\n]
           2-BYTES Chars     :  N2                         | [\x{0080}-\x{07FF}]                                                        | (?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}] |                  0
           3-BYTES Chars     :  N3                         | (?![\x{D800}-\x{DFFF}])[\x{0800}-\x{FFFF}]                                 |                        0                       |                  0
                                                           |                                                                            |                                                |
           Sum BMP Chars     :  N1 + N2 + N3               | (?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}] or [^\r\n\x{D800}-\x{DFFF}] |                      idem                      |               [^\r\n]
           4-BYTES Chars     :  N4                         | (?-s).[\x{D800}-\x{DFFF}]  or  [\x{D800}-\x{DFFF}]                         |                      idem                      |                  0
                                                           |                                                                            |                                                |
           Chars w/o CR|LF   :  N1 + N2 + N3 + N4          | [^\r\n]                                                                    |                      idem                      |                idem
           EOL ( CR or LF )  :  N0                         | \r|\n                                                                      |                      idem                      |                idem
                                                           |                                                                            |                                                |
           TOTAL Characters  :  N0 + N1 + N2 + N3 + N4     | (?s).                                                                      |                      idem                      |                idem
                                                           |                                                                            |                                                |
                                                           |                                                                            |                                                |
           BYTE Length       :                             | N0 + N1 + 2 × N2 + 3 × N3 + 4 × N4                                         |            2 × N0 + 2 × N2 + 4 × N4            |               N0 + N1
                                                           |                                                                            |                                                |
           Byte Order Mark   :                             | 0 ( UTF-8)  or  3 ( UTF-8-BOM )                                            |                        2                       |                  0
                                                           |                                                                            |                                                |
           BUFFER Length     :  BYTE length  +  BOM        |                                                                            |                                                |
                                                           |                                                                            |                                                |
           Length on DISK    :  Length CURRENT file on DISK|                                                                            |                                                |
                                                           |                                                                            |                                                |
                                                           |                                                                            |                                                |
           NON BLANK chars   :                             | [^\r\n\t\x20]                                                              |                       idem                     |                idem
                                                           |                                                                            |                                                |
           WORDS     count   :     (Caution !)             | \w+                                                                        |                       idem                     |                idem
                                                           |                                                                            |                                                |
           NON-SPACE count   :                             | (?:(?!\s).[\x{D800}-\x{DFFF}]?)+  or  \S+                                  |                       idem                     |                \S+
                                                           |                                                                            |                                                |
                                                           |                                                                            |                                                |
           True EMPTY lines  :  L1                         | (?<![\f\x{0085}\x{2028}\x{2029}])^(?:\r\n|\r|\n)                           |                       idem                     | (?<!\f)^(?:\r\n|\r|\n)
                                                           |                                                                            |                                                |
           True BLANK lines  :  L2                         | (?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]+(?:\r\n|\r|\n|\z)               |                       idem                     | (?<!\f)^[\t\x20]+(?:\r\n|\r|\n|\z)
                                                           |                                                                            |                                                |
                                                           |                                                                            |                                                |
           EMPTY/BLANK lines :  L1 + L2                    | (?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]*(?:\r\n|\r|\n|\z)               |                       idem                     | (?<!\f)^[\t\x20]*(?:\r\n|\r|\n|\z)
                                                           |                                                                            |                                                |
           NON-BLANK lines   :                             | (?-s)(?!^[\t\x20]+$)^(?:.|[\f\x{0085}\x{2028}\x{2029}])+(?:\r\n|\r|\n|\z)  |                       idem                     | (?-s)(?!^[\t\x20]+$)^(?:.|\f)+(?:\r\n|\r|\n|\z)
                                                           |                                                                            |                                                |
           TOTAL lines       :                             | (?-s)\r\n|\r|\n|(?:.|[\f\x{0085}\x{2028}\x{2029}])\z                       |                       idem                     | (?-s)\r\n|\r|\n|(?:.|\f)\z
                                                           |                                                                            |                                                |
                                                           |                                                                            |                                                |
           SELECTION(S)      :  X characters (Y bytes) in Z ranges                                                                      |                        idem                    |                idem
          •------------------------------------------------•----------------------------------------------------------------------------•------------------------------------------------•------------------------------------•
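
The BYTE Length row can be verified independently of Notepad++. Below is a minimal Python 3 sketch, where the function name count_classes and the sample string are my own assumptions. It classifies each character as N0 ... N4 by code point, as in the UTF-8 column, then compares both formulas with the real encoded lengths ( without BOM ). In the UTF-16 column, N2 stands for all the BMP chars, hence the sum n1 + n2 + n3 below :

```python
def count_classes(text):
    """Return (N0, N1, N2, N3, N4), following the UTF-8 column of the table."""
    n0 = n1 = n2 = n3 = n4 = 0
    for ch in text:
        cp = ord(ch)
        if ch in '\r\n':
            n0 += 1            # EOL chars ( CR or LF )
        elif cp <= 0x7F:
            n1 += 1            # 1-byte chars
        elif cp <= 0x7FF:
            n2 += 1            # 2-bytes chars
        elif cp <= 0xFFFF:
            n3 += 1            # 3-bytes chars
        else:
            n4 += 1            # 4-bytes chars, over the BMP
    return n0, n1, n2, n3, n4

# A, e with acute accent, euro sign, man emoji, then CR LF
sample = 'A\u00E9\u20AC\U0001F468\r\n'

n0, n1, n2, n3, n4 = count_classes(sample)

utf8_length = n0 + n1 + 2 * n2 + 3 * n3 + 4 * n4
utf16_length = 2 * n0 + 2 * (n1 + n2 + n3) + 4 * n4

print(utf8_length == len(sample.encode('utf-8')))
print(utf16_length == len(sample.encode('utf-16-le')))
```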
          

          Continued discussion in the next post

          guy038

          • guy038G
            guy038
            last edited by guy038

            Hi, All,

            Remarks : Although most of the regexes, above, can be easily understood, here are some additional elements :

            • The regex (?-s).[\x{D800}-\x{DFFF}] is the sole correct syntax, with our Boost regex engine, to count all the characters over the BMP. But it may fail with the message Ran out of stack space trying to match the regular expression. Luckily, the script does not need it, because this count can be deduced from the difference Total_standard - Total_BMP

            • The regex (?s)((?!\s).[\x{D800}-\x{DFFF}]?)+, to count all the Non_Space strings, was explained before but may fail with the message Ran out of stack space trying to match the regular expression.

            • In all the regexes relative to the counting of lines, you probably noticed the character class [\f\x{0085}\x{2028}\x{2029}]. It must be present because each of the four characters \f, \x{0085}, \x{2028} and \x{2029} is considered as both a start and an end of line, like the assertions ^ and $ !

              • For instance, if, in a new file, you insert one Next_Line char ( NEL ), of code-point \x{0085} and hit the Enter key, this sole line is wrongly seen as an empty line by the simple regex ^(?:\r\n|\r|\n) which matches the line-break after the Next_Line char !
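
Python's str.splitlines() method follows a similar convention, as it also treats \f, \x85, \u2028 and \u2029 as line boundaries. So it can serve as a rough analogy of the Next_Line case, although it is not the Boost engine. A minimal Python 3 sketch, with illustrative variable names :

```python
import re

text = '\x85\r\n'          # One NEL char, then the Enter key ( CR LF )

# Only CR LF, CR and LF are line-breaks : the NEL char belongs to the first line
crlf_lines = re.split(r'\r\n|\r|\n', text)
print(crlf_lines)

# splitlines() also breaks on the NEL char : the text before it is an empty line
unicode_lines = text.splitlines(True)
print(unicode_lines)
```

With the second convention, the CR LF line-break sits alone on its own line, exactly the situation that the look-behind (?<![\f\x{0085}\x{2028}\x{2029}]) is meant to exclude.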

            Here is the Python script, split over two posts

            # encoding=utf-8
            
            #-------------------------------------------------------------------------
            #                    STATISTICS about the CURRENT file ( v0.6 )
            #-------------------------------------------------------------------------
            
            from __future__ import print_function    # for Python2 compatibility
            
            from Npp import *
            
            import re
            
            import os, time, datetime
            
            import ctypes
            
            from ctypes.wintypes import BOOL, HWND, WPARAM, LPARAM, UINT
            
            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
            #  From @alan-kilborn, in post https://community.notepad-plus-plus.org/topic/21733/pythonscript-different-behavior-in-script-vs-in-immediate-mode/4
            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
            
            def npp_get_statusbar(statusbar_item_number):
            
                WNDENUMPROC = ctypes.WINFUNCTYPE(BOOL, HWND, LPARAM)
                FindWindowW = ctypes.windll.user32.FindWindowW
                FindWindowExW = ctypes.windll.user32.FindWindowExW
                SendMessageW = ctypes.windll.user32.SendMessageW
                LRESULT = LPARAM
                SendMessageW.restype = LRESULT
                SendMessageW.argtypes = [ HWND, UINT, WPARAM, LPARAM ]
                EnumChildWindows = ctypes.windll.user32.EnumChildWindows
                GetClassNameW = ctypes.windll.user32.GetClassNameW
                create_unicode_buffer = ctypes.create_unicode_buffer
            
                SBT_OWNERDRAW = 0x1000
                WM_USER = 0x400; SB_GETTEXTLENGTHW = WM_USER + 12; SB_GETTEXTW = WM_USER + 13
            
                npp_get_statusbar.STATUSBAR_HANDLE = None
            
                def get_result_from_statusbar(statusbar_item_number):
                    assert statusbar_item_number <= 5
                    retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTLENGTHW, statusbar_item_number, 0)
                    length = retcode & 0xFFFF
                    type = (retcode >> 16) & 0xFFFF
                    assert (type != SBT_OWNERDRAW)
                    text_buffer = create_unicode_buffer(length + 1)  # + 1, as SB_GETTEXTW also writes the terminating NUL char
                    retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTW, statusbar_item_number, ctypes.addressof(text_buffer))
                    retval = '{}'.format(text_buffer[:length])
                    return retval
            
                def EnumCallback(hwnd, lparam):
                    curr_class = create_unicode_buffer(256)
                    GetClassNameW(hwnd, curr_class, 256)
                    if curr_class.value.lower() == "msctls_statusbar32":
                        npp_get_statusbar.STATUSBAR_HANDLE = hwnd
                        return False  # stop the enumeration
                    return True  # continue the enumeration
            
                npp_hwnd = FindWindowW(u"Notepad++", None)
                EnumChildWindows(npp_hwnd, WNDENUMPROC(EnumCallback), 0)
                if npp_get_statusbar.STATUSBAR_HANDLE: return get_result_from_statusbar(statusbar_item_number)
                assert False
            
            St_bar = npp_get_statusbar(4)  # Zone 4 ( STATUSBARSECTION.UNICODETYPE )
            
            

            See next post for continuation !

            • guy038G
              guy038
              last edited by

              Continuation of the script :

              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              def number(occ):
                  global num
                  num += 1
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              Curr_encoding = str(notepad.getEncoding())
              
              if Curr_encoding == 'ENC8BIT':
                  Curr_encoding = 'ANSI'
              
              if Curr_encoding == 'COOKIE':
                  Curr_encoding = 'UTF-8'
              
              if Curr_encoding == 'UTF8':
                  Curr_encoding = 'UTF-8-BOM'
              
              if Curr_encoding == 'UCS2BE':
                  Curr_encoding = 'UTF-16 BE BOM'
              
              if Curr_encoding == 'UCS2LE':
                  Curr_encoding = 'UTF-16 LE BOM'
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                  Line_title = 95
              else:
                  Line_title = 75
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              File_name = notepad.getCurrentFilename()
              
              if os.path.isfile(File_name):
              
                  Creation_date = time.ctime(os.path.getctime(File_name))
              
                  Modif_date = time.ctime(os.path.getmtime(File_name))
              
                  Size_length = os.path.getsize(File_name)
              
                  RO_flag = 'YES'
              
                  if os.access(File_name, os.W_OK):
                      RO_flag = 'NO'
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              RO_editor = 'NO'
              
              if editor.getReadOnly():
                  RO_editor = 'YES'
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              if notepad.getCurrentView() == 0:
                  Curr_view = 'MAIN view'
              else:
                  Curr_view = 'SECONDARY view'
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              Curr_lang = notepad.getCurrentLang()
              
              Lang_desc = notepad.getLanguageDesc(Curr_lang)
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              if editor.getEOLMode() == 0:
                  Curr_eol = 'Windows (CR LF)'
              
              if editor.getEOLMode() == 1:
                  Curr_eol = 'Macintosh (CR)'
              
              if editor.getEOLMode() == 2:
                  Curr_eol = 'Unix (LF)'
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              Curr_wrap = 'NO'
              
              if editor.getWrapMode() == 1:
                  Curr_wrap = 'YES'
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              num = 0
              if Curr_encoding == 'ANSI':
                  editor.research(r'[^\r\n]', number)
              
              if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                  editor.research(r'(?![\r\n])[\x{0000}-\x{007F}]', number)
              
              Total_1_byte = num
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              num = 0
              if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                  editor.research(r'[\x{0080}-\x{07FF}]', number)
              
              if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
                  editor.research(r'(?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]', number)  #  ALL BMP chars ( With PYTHON, the [^\r\n\x{D800}-\x{DFFF}] syntax does NOT work properly !)
              
              Total_2_bytes = num
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              num = 0
              if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                  editor.research(r'(?![\x{D800}-\x{DFFF}])[\x{0800}-\x{FFFF}]', number)
              
              Total_3_bytes = num
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              Total_BMP = Total_1_byte + Total_2_bytes + Total_3_bytes
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              num = 0
              editor.research(r'[^\r\n]', number)
              
              Total_standard = num
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              Total_4_bytes = 0  #  By default
              
              if Curr_encoding != 'ANSI':
                  Total_4_bytes = Total_standard - Total_BMP
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              num = 0
              editor.research(r'\r|\n', number)
              
              Total_EOL = num
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              Total_chars = Total_EOL + Total_standard
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              if Curr_encoding == 'ANSI':
                  Bytes_length = Total_EOL + Total_1_byte
              
              if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                  Bytes_length = Total_EOL + Total_1_byte + 2 * Total_2_bytes + 3 * Total_3_bytes + 4 * Total_4_bytes
              
              if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
                  Bytes_length = 2 * Total_EOL + 2 * Total_BMP + 4 * Total_4_bytes
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              BOM = 0  #  Default ANSI and UTF-8
              
              if Curr_encoding == 'UTF-8-BOM':
                  BOM = 3
              
              if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
                  BOM = 2
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              Buffer_length = Bytes_length + BOM
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              num = 0
              editor.research(r'[^\r\n\t\x20]', number)
              
              Non_blank_chars = num
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              num = 0
              editor.research(r'\w+', number)
              
              Words_count = num
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              num = 0
              
              if Curr_encoding == 'ANSI' or Total_4_bytes == 0:
                  editor.research(r'\S+', number)
              else:
                  editor.research(r'(?:(?!\s).[\x{D800}-\x{DFFF}]?)+', number)
              
              Non_space_count = num
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              num = 0
              if Curr_encoding == 'ANSI':
                  editor.research(r'(?<!\f)^(?:\r\n|\r|\n)', number)
              else:
                  editor.research(r'(?<![\f\x{0085}\x{2028}\x{2029}])^(?:\r\n|\r|\n)', number)
              
              Empty_lines = num
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              num = 0
              if Curr_encoding == 'ANSI':
                  editor.research(r'(?<!\f)^[\t\x20]+(?:\r\n|\r|\n|\z)', number)
              else:
                  editor.research(r'(?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]+(?:\r\n|\r|\n|\z)', number)
              
              Blank_lines = num
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              Emp_blk_lines = Empty_lines + Blank_lines
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              num = 0
              if Curr_encoding == 'ANSI':
                  editor.research(r'(?-s)\r\n|\r|\n|(?:.|\f)\z', number)
              else:
                  editor.research(r'(?-s)\r\n|\r|\n|(?:.|[\f\x{0085}\x{2028}\x{2029}])\z', number)
              
              Total_lines = num
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              Non_blk_lines = Total_lines - Emp_blk_lines
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              Num_sel = editor.getSelections()  # Get ALL selections ( EMPTY or NOT )
              
              # print ('Res = ', Num_sel)
              
              if Num_sel != 0:
              
                  Bytes_count = 0
                  Chars_count = 0
              
                  for n in range(Num_sel):
              
                      Bytes_count += editor.getSelectionNEnd(n) - editor.getSelectionNStart(n)
              
                      Chars_count += editor.countCharacters(editor.getSelectionNStart(n), editor.getSelectionNEnd(n))
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
                  if Chars_count < 2:
                      Txt_chars = ' selected char ('
              
                  else:
                      Txt_chars = ' selected chars ('
              
              
                  if Bytes_count < 2:
                      Txt_bytes = ' selected byte) in '
              
                  else:
                      Txt_bytes = ' selected bytes) in '
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
                  if Num_sel < 2 and Bytes_count == 0:
                      Txt_ranges = ' EMPTY range\n'
              
                  if Num_sel < 2 and Bytes_count > 0:
                      Txt_ranges = ' range\n'
              
                  if Num_sel > 1 and Bytes_count == 0:
                      Txt_ranges = ' EMPTY ranges\n'
              
                  if Num_sel > 1 and Bytes_count > 0:
                      Txt_ranges = ' ranges (EMPTY or NOT)\n'
              
              # --------------------------------------------------------------------------------------------------------------------------------------------------------------
              
              line_list = []  # empty list
              
              line_list.append ('-' * Line_title)
              
              line_list.append (' ' * int((Line_title - 37) / 2) + 'SUMMARY on ' + str(datetime.datetime.now()))
              
              line_list.append ('-' * Line_title +'\n')
              
              line_list.append (' FULL File Path    :  ' + File_name + '\n')
              
              if os.path.isfile(File_name) == True:
              
                  line_list.append(' CREATION     Date :  ' + Creation_date)
              
                  line_list.append(' MODIFICATION Date :  ' + Modif_date + '\n')
              
                  line_list.append(' READ-ONLY flag    :  ' + RO_flag )
              
              line_list.append (' READ-ONLY editor  :  ' + RO_editor + '\n\n')
              
              line_list.append (' Current VIEW      :  ' + Curr_view + '\n')
              
              line_list.append (' Current ENCODING  :  ' + Curr_encoding + '\n')
              
              line_list.append (' Current LANGUAGE  :  ' + str(Curr_lang) + '  (' + Lang_desc + ')\n')
              
              line_list.append (' Current Line END  :  ' + Curr_eol + '\n')
              
              line_list.append (' Current WRAPPING  :  ' + Curr_wrap + '\n\n')
              
              line_list.append (' 1-BYTE  Chars     :  ' + str(Total_1_byte))
              
              line_list.append (' 2-BYTES Chars     :  ' + str(Total_2_bytes))
              
              line_list.append (' 3-BYTES Chars     :  ' + str(Total_3_bytes) + '\n')
              
              line_list.append (' Sum BMP Chars     :  ' + str(Total_BMP))
              
              line_list.append (' 4-BYTES Chars     :  ' + str(Total_4_bytes) + '\n')
              
              line_list.append (' CHARS w/o CR & LF :  ' + str(Total_standard))
              
              line_list.append (' EOL ( CR or LF )  :  ' + str(Total_EOL) + '\n')
              
              line_list.append (' TOTAL characters  :  ' + str(Total_chars) + '\n\n')
              
              if Curr_encoding == 'ANSI':
                  line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b)')
              
              if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                  line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b + '\
                  + str(Total_2_bytes) + ' x 2b + ' + str(Total_3_bytes) + ' x 3b + ' + str(Total_4_bytes) + ' x 4b)')
              
              if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
                  line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 2 + ' + str(Total_BMP) + ' x 2b + ' + str(Total_4_bytes) + ' x 4b)')
              
              line_list.append (' Byte Order Mark   :  ' + str(BOM) + '\n')
              
              line_list.append (' BUFFER Length     :  ' + str(Buffer_length))
              
              if os.path.isfile(File_name) == True:
                  line_list.append (' Length on DISK    :  ' + str(Size_length) + '\n\n')
              else:
                  line_list.append ('\n')
              
              line_list.append (' NON-Blank Chars   :  ' + str(Non_blank_chars) + '\n')
              
              line_list.append (' WORDS     Count   :  ' + str(Words_count) + ' (Caution !)\n')
              
              line_list.append (' NON-SPACE Count   :  ' + str(Non_space_count) + '\n\n')
              
              line_list.append (' True EMPTY lines  :  ' + str(Empty_lines))
              
              line_list.append (' True BLANK lines  :  ' + str(Blank_lines) + '\n')
              
              line_list.append (' EMPTY/BLANK lines :  ' + str(Emp_blk_lines) + '\n')
              
              line_list.append (' NON-BLANK lines   :  ' + str(Non_blk_lines))
              
              line_list.append (' TOTAL Lines       :  ' + str(Total_lines) + '\n\n')
              
              line_list.append (' SELECTION(S)      :  ' + str(Chars_count) + Txt_chars + str(Bytes_count) + Txt_bytes + str(Num_sel) + Txt_ranges)
              
              editor.copyText ('\r\n'.join(line_list))
              
              notepad.new()
              
              editor.paste()
              
              editor.copyText('')
              
              if St_bar != 'ANSI' and St_bar != 'UTF-8' and St_bar != 'UTF-8-BOM' and St_bar != 'UTF-16 BE BOM' and St_bar != 'UTF-16 LE BOM':
              
                  if Curr_encoding == 'UTF-8':  #  SAME value for both an 'UTF-8' or 'ANSI' file, when RE-INTERPRETED with the 'Encoding > Character Set > ...' feature
              
                      notepad.prompt ('CURRENT file re-interpreted as ' + St_bar + '  =>  Possible ERRONEOUS results' + \
                                      '\nSo, CLOSE the file WITHOUT saving, RESTORE it (CTRL + SHIFT + T) and RESTART script', '!!! WARNING !!!', '')
              
              # ----Aé☀𝜜-----------------------------------------------------------------------------------------------------------------------------------------------------
              

              The way to use this script is quite self-explanatory. Just three points to emphasize :

              • On the BYTES Length line, the values between parentheses :

                • Always begin with the number of EOL chars ( I omitted the b after x 1, on purpose ! )

                  • Followed by the number of 1-BYTE chars, for an ANSI encoded file

                  • Followed by the numbers of 1-BYTE, 2-BYTES, 3-BYTES and 4-BYTES chars, for an UTF-8 or UTF-8-BOM encoded file

                  • Followed by the numbers of 2-BYTES ( all BMP chars ) and 4-BYTES chars, for an UTF-16 BE BOM or UTF-16 LE BOM encoded file

              • Normally, when a file is saved, the values BUFFER Length and Length on DISK should always be equal. If not, two cases are possible :

                • The file has been recently modified ( trivial case )

                • The file is not identified with a BOM and has been re-interpreted with another NON-Unicode encoding. Then, apply the actions indicated in the pop-up message !

              • For a new # file, some values are obviously absent. These are the MODIFICATION date, the CREATION date, the READ-ONLY flag and the Length on DISK ( size ) values
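
The arithmetic shown between parentheses can be cross-checked outside Notepad++. Below is a minimal sketch in plain Python 3, limited to the UTF-8 and UTF-16 cases ( ANSI is left out, as its byte values depend on the code page ). The names `char_classes`, `buffer_length` and `BOM_SIZE` are hypothetical helpers and not part of the script :

```python
# Hypothetical helpers, NOT part of the script above : they redo the same
# arithmetic on a Python 3 str, so the result can be compared with encode()

BOM_SIZE = {'UTF-8': 0, 'UTF-8-BOM': 3, 'UTF-16 BE BOM': 2, 'UTF-16 LE BOM': 2}

def char_classes(text):
    """Return the (EOL, 1-byte, 2-bytes, 3-bytes, 4-bytes) counts, by UTF-8 size."""
    eol = one = two = three = four = 0
    for ch in text:
        cp = ord(ch)
        if ch in '\r\n':
            eol += 1
        elif cp < 0x80:
            one += 1
        elif cp < 0x800:
            two += 1
        elif cp < 0x10000:
            three += 1
        else:
            four += 1
    return eol, one, two, three, four

def buffer_length(text, encoding):
    """BYTES Length + Byte Order Mark, for the UTF-8 and UTF-16 encodings only."""
    eol, one, two, three, four = char_classes(text)
    if encoding.startswith('UTF-8'):
        size = eol + one + 2 * two + 3 * three + 4 * four
    else:
        # UTF-16 : every BMP char ( EOL included ) takes 2 bytes, the others 4
        size = 2 * eol + 2 * (one + two + three) + 4 * four
    return size + BOM_SIZE[encoding]

# One char of each size ( A, é, ☀, 𝜜 ) followed by a Windows line-break
sample = 'A\u00e9\u2600\U0001D71C\r\n'

print(char_classes(sample))                                            # (2, 1, 1, 1, 1)
print(buffer_length(sample, 'UTF-8'), len(sample.encode('utf-8')))     # 12 12
print(buffer_length(sample, 'UTF-8-BOM'), len(sample.encode('utf-8-sig')))   # 15 15
print(buffer_length(sample, 'UTF-16 LE BOM'), len(sample.encode('utf-16')))  # 16 16
```

If the computed value and the length returned by `encode()` agree for a given text, the same text saved with that encoding should show identical BUFFER Length and Length on DISK values.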

              Best Regards,

              guy038

              • Mark OlsonM
                Mark Olson @guy038
                last edited by

                @guy038 said in Tests and impressions on the "View > Summary..." functionality:

                editor.copyText ('\r\n'.join(line_list))

                notepad.new()

                editor.paste()

                editor.copyText('')

                Couldn’t you just do

                notepad.new()
                editor.setText('\r\n'.join(line_list))
                

                and thus avoid overwriting the user’s clipboard?

                • guy038G
                  guy038
                  last edited by guy038

                  Hello, All,

                  • So, I followed the excellent @mark-olson’s suggestion to bypass the clipboard functionality !

                  • Now, if a RuntimeError occurs when searching for the NON-SPACE count, the script catches the exception and sets the Err_Regex variable to True, so that a warning message is displayed. But, even when the Err_Regex variable is False, the result is not totally guaranteed either, if the analyzed file contains characters over the BMP.

                  So, globally, whatever the Err_Regex status, the NON-SPACE count value may be increased or decreased by 1, in some cases ( still unclear ) !
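
To get an independent reference value when the NON-SPACE count looks doubtful, the same count can be computed outside the Boost regex engine. This is only a sketch, assuming Python 3, where a `str` holds full code points, so a character over the BMP is a single item and never a surrogate pair. The name `non_space_count` is hypothetical, and Python's `\S` may not agree with the Notepad++ `\S` on every Unicode space character :

```python
import re

def non_space_count(text):
    """Count the maximal runs of non-whitespace characters in a Python 3 str."""
    return len(re.findall(r'\S+', text))

# Runs : 'ab', two chars over the BMP, 'cd', one char over the BMP, 'x'
sample = 'ab \U0001D71C\U0001D71C\tcd\r\n\U0001D71C x'

print(non_space_count(sample))   # 5
```

With PythonScript running on Python 3, the text of the current file could presumably be passed as `non_space_count(editor.getText())`, but I did not check this against the regex result on large files.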


                  Here is the v0.7 version of my script ( I indeed gave a version number to my successive attempts ! )

                  # encoding=utf-8
                  
                  #-------------------------------------------------------------------------
                  #                    STATISTICS about the CURRENT file ( v0.7 )
                  #-------------------------------------------------------------------------
                  
                  from __future__ import print_function    # for Python2 compatibility
                  
                  from Npp import *
                  
                  import re
                  
                  import os, time, datetime
                  
                  import ctypes
                  
                  from ctypes.wintypes import BOOL, HWND, WPARAM, LPARAM, UINT
                  
                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                  #  From @alan-kilborn, in post https://community.notepad-plus-plus.org/topic/21733/pythonscript-different-behavior-in-script-vs-in-immediate-mode/4
                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                  
                  def npp_get_statusbar(statusbar_item_number):
                  
                      WNDENUMPROC = ctypes.WINFUNCTYPE(BOOL, HWND, LPARAM)
                      FindWindowW = ctypes.windll.user32.FindWindowW
                      FindWindowExW = ctypes.windll.user32.FindWindowExW
                      SendMessageW = ctypes.windll.user32.SendMessageW
                      LRESULT = LPARAM
                      SendMessageW.restype = LRESULT
                      SendMessageW.argtypes = [ HWND, UINT, WPARAM, LPARAM ]
                      EnumChildWindows = ctypes.windll.user32.EnumChildWindows
                      GetClassNameW = ctypes.windll.user32.GetClassNameW
                      create_unicode_buffer = ctypes.create_unicode_buffer
                  
                      SBT_OWNERDRAW = 0x1000
                      WM_USER = 0x400; SB_GETTEXTLENGTHW = WM_USER + 12; SB_GETTEXTW = WM_USER + 13
                  
                      npp_get_statusbar.STATUSBAR_HANDLE = None
                  
                      def get_result_from_statusbar(statusbar_item_number):
                          assert statusbar_item_number <= 5
                          retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTLENGTHW, statusbar_item_number, 0)
                          length = retcode & 0xFFFF
                          type = (retcode >> 16) & 0xFFFF
                          assert (type != SBT_OWNERDRAW)
                          text_buffer = create_unicode_buffer(length + 1)  # + 1 for the terminating NUL char written by SB_GETTEXTW
                          retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTW, statusbar_item_number, ctypes.addressof(text_buffer))
                          retval = '{}'.format(text_buffer[:length])
                          return retval
                  
                      def EnumCallback(hwnd, lparam):
                          curr_class = create_unicode_buffer(256)
                          GetClassNameW(hwnd, curr_class, 256)
                          if curr_class.value.lower() == "msctls_statusbar32":
                              npp_get_statusbar.STATUSBAR_HANDLE = hwnd
                              return False  # stop the enumeration
                          return True  # continue the enumeration
                  
                      npp_hwnd = FindWindowW(u"Notepad++", None)
                      EnumChildWindows(npp_hwnd, WNDENUMPROC(EnumCallback), 0)
                      if npp_get_statusbar.STATUSBAR_HANDLE: return get_result_from_statusbar(statusbar_item_number)
                      assert False
                  
                  St_bar = npp_get_statusbar(4)  # Zone 4 ( STATUSBARSECTION.UNICODETYPE )
                  
                  

                  Continuation on next post

                  guy038

                  • guy038G
                    guy038
                    last edited by

                    Hi all,

                    Continuation of version v0.7 of the script :

                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    def number(occ):
                        global num
                        num += 1
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    Curr_encoding = str(notepad.getEncoding())
                    
                    if Curr_encoding == 'ENC8BIT':
                        Curr_encoding = 'ANSI'
                    
                    if Curr_encoding == 'COOKIE':
                        Curr_encoding = 'UTF-8'
                    
                    if Curr_encoding == 'UTF8':
                        Curr_encoding = 'UTF-8-BOM'
                    
                    if Curr_encoding == 'UCS2BE':
                        Curr_encoding = 'UTF-16 BE BOM'
                    
                    if Curr_encoding == 'UCS2LE':
                        Curr_encoding = 'UTF-16 LE BOM'
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                        Line_title = 95
                    else:
                        Line_title = 75
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    File_name = notepad.getCurrentFilename()
                    
                    if os.path.isfile(File_name) == True:
                    
                        Creation_date = time.ctime(os.path.getctime(File_name))
                    
                        Modif_date = time.ctime(os.path.getmtime(File_name))
                    
                        Size_length = os.path.getsize(File_name)
                    
                        RO_flag = 'YES'
                    
                        if os.access(File_name, os.W_OK):
                            RO_flag = 'NO'
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    RO_editor = 'NO'
                    
                    if editor.getReadOnly() == True:
                        RO_editor = 'YES'
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    if notepad.getCurrentView() == 0:
                        Curr_view = 'MAIN View'
                    else:
                        Curr_view = 'SECONDARY view'
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    Curr_lang = notepad.getCurrentLang()
                    
                    Lang_desc = notepad.getLanguageDesc(Curr_lang)
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    if editor.getEOLMode() == 0:
                        Curr_eol = 'Windows (CR LF)'
                    
                    if editor.getEOLMode() == 1:
                        Curr_eol = 'Macintosh (CR)'
                    
                    if editor.getEOLMode() == 2:
                        Curr_eol = 'Unix (LF)'
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    Curr_wrap = 'NO'
                    
                    if editor.getWrapMode() == 1:
                        Curr_wrap = 'YES'
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    num = 0
                    if Curr_encoding == 'ANSI':
                        editor.research(r'[^\r\n]', number)
                    
                    if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                        editor.research(r'(?![\r\n])[\x{0000}-\x{007F}]', number)
                    
                    Total_1_byte = num
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    num = 0
                    if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                        editor.research(r'[\x{0080}-\x{07FF}]', number)
                    
                    if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
                        editor.research(r'(?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]', number)  #  ALL BMP chars ( With PYTHON, the [^\r\n\x{D800}-\x{DFFF}] syntax does NOT work properly !)
                    
                    Total_2_bytes = num
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    num = 0
                    if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                        editor.research(r'(?![\x{D800}-\x{DFFF}])[\x{0800}-\x{FFFF}]', number)
                    
                    Total_3_bytes = num
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    Total_BMP = Total_1_byte + Total_2_bytes + Total_3_bytes
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    num = 0
                    editor.research(r'[^\r\n]', number)
                    
                    Total_standard = num
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    Total_4_bytes = 0  #  By default
                    
                    if Curr_encoding != 'ANSI':
                        Total_4_bytes = Total_standard - Total_BMP
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    num = 0
                    editor.research(r'\r|\n', number)
                    
                    Total_EOL = num
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    Total_chars = Total_EOL + Total_standard
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    if Curr_encoding == 'ANSI':
                        Bytes_length = Total_EOL + Total_1_byte
                    
                    if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                        Bytes_length = Total_EOL + Total_1_byte + 2 * Total_2_bytes + 3 * Total_3_bytes + 4 * Total_4_bytes
                    
                    if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
                        Bytes_length = 2 * Total_EOL + 2 * Total_BMP + 4 * Total_4_bytes
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    BOM = 0  #  Default ANSI and UTF-8
                    
                    if Curr_encoding == 'UTF-8-BOM':
                        BOM = 3
                    
                    if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
                        BOM = 2
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    Buffer_length = Bytes_length + BOM
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    num = 0
                    editor.research(r'[^\r\n\t\x20]', number)
                    
                    Non_blank_chars = num
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    num = 0
                    editor.research(r'\w+', number)
                    
                    Words_count = num
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    Err_Regex = False
                    
                    num = 0
                    
                    if Curr_encoding == 'ANSI' or Total_4_bytes == 0:
                        editor.research(r'\S+', number)
                    else:
                        try:
                            editor.research(r'(?:(?!\s).[\x{D800}-\x{DFFF}]?)+', number)
                        except RuntimeError:
                            Err_Regex = True
                    
                    Non_space_count = num
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    num = 0
                    if Curr_encoding == 'ANSI':
                        editor.research(r'(?<!\f)^(?:\r\n|\r|\n)', number)
                    else:
                        editor.research(r'(?<![\f\x{0085}\x{2028}\x{2029}])^(?:\r\n|\r|\n)', number)
                    
                    Empty_lines = num
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    num = 0
                    if Curr_encoding == 'ANSI':
                        editor.research(r'(?<!\f)^[\t\x20]+(?:\r\n|\r|\n|\z)', number)
                    else:
                        editor.research(r'(?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]+(?:\r\n|\r|\n|\z)', number)
                    
                    Blank_lines = num
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    Emp_blk_lines = Empty_lines + Blank_lines
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    num = 0
                    if Curr_encoding == 'ANSI':
                        editor.research(r'(?-s)\r\n|\r|\n|(?:.|\f)\z', number)
                    else:
                        editor.research(r'(?-s)\r\n|\r|\n|(?:.|[\f\x{0085}\x{2028}\x{2029}])\z', number)
                    
                    Total_lines = num
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    Non_blk_lines = Total_lines - Emp_blk_lines
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    Num_sel = editor.getSelections()  # Get ALL selections ( EMPTY or NOT )
                    
                    # print ('Res = ', Num_sel)
                    
                    if Num_sel != 0:
                    
                        Bytes_count = 0
                        Chars_count = 0
                    
                        for n in range(Num_sel):
                    
                            Bytes_count += editor.getSelectionNEnd(n) - editor.getSelectionNStart(n)
                    
                            Chars_count += editor.countCharacters(editor.getSelectionNStart(n), editor.getSelectionNEnd(n))
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                        if Chars_count < 2:
                            Txt_chars = ' selected char ('
                    
                        else:
                            Txt_chars = ' selected chars ('
                    
                    
                        if Bytes_count < 2:
                            Txt_bytes = ' selected byte) in '
                    
                        else:
                            Txt_bytes = ' selected bytes) in '
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                        if Num_sel < 2 and Bytes_count == 0:
                            Txt_ranges = ' EMPTY range\n'
                    
                        if Num_sel < 2 and Bytes_count > 0:
                            Txt_ranges = ' range\n'
                    
                        if Num_sel > 1 and Bytes_count == 0:
                            Txt_ranges = ' EMPTY ranges\n'
                    
                        if Num_sel > 1 and Bytes_count > 0:
                            Txt_ranges = ' ranges (EMPTY or NOT)\n'
                    
                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                    
                    line_list = []  # empty list
                    
                    line_list.append ('-' * Line_title)
                    
                    line_list.append (' ' * int((Line_title - 37) / 2) + 'SUMMARY on ' + str(datetime.datetime.now()))
                    
                    line_list.append ('-' * Line_title +'\n')
                    
                    line_list.append (' FULL File Path    :  ' + File_name + '\n')
                    
                    if os.path.isfile(File_name) == True:
                    
                        line_list.append(' CREATION     Date :  ' + Creation_date)
                    
                        line_list.append(' MODIFICATION Date :  ' + Modif_date + '\n')
                    
                        line_list.append(' READ-ONLY flag    :  ' + RO_flag )
                    
                    line_list.append (' READ-ONLY editor  :  ' + RO_editor + '\n\n')
                    
                    line_list.append (' Current VIEW      :  ' + Curr_view + '\n')
                    
                    line_list.append (' Current ENCODING  :  ' + Curr_encoding + '\n')
                    
                    line_list.append (' Current LANGUAGE  :  ' + str(Curr_lang) + '  (' + Lang_desc + ')\n')
                    
                    line_list.append (' Current Line END  :  ' + Curr_eol + '\n')
                    
                    line_list.append (' Current WRAPPING  :  ' + Curr_wrap + '\n\n')
                    
                    line_list.append (' 1-BYTE  Chars     :  ' + str(Total_1_byte))
                    
                    line_list.append (' 2-BYTES Chars     :  ' + str(Total_2_bytes))
                    
                    line_list.append (' 3-BYTES Chars     :  ' + str(Total_3_bytes) + '\n')
                    
                    line_list.append (' Sum BMP Chars     :  ' + str(Total_BMP))
                    
                    line_list.append (' 4-BYTES Chars     :  ' + str(Total_4_bytes) + '\n')
                    
                    line_list.append (' CHARS w/o CR & LF :  ' + str(Total_standard))
                    
                    line_list.append (' EOL ( CR or LF )  :  ' + str(Total_EOL) + '\n')
                    
                    line_list.append (' TOTAL characters  :  ' + str(Total_chars) + '\n\n')
                    
                    if Curr_encoding == 'ANSI':
                        line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b)')
                    
                    if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                        line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b + '\
                        + str(Total_2_bytes) + ' x 2b + ' + str(Total_3_bytes) + ' x 3b + ' + str(Total_4_bytes) + ' x 4b)')
                    
                    if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
                        line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 2 + ' + str(Total_BMP) + ' x 2b + ' + str(Total_4_bytes) + ' x 4b)')
                    
                    line_list.append (' Byte Order Mark   :  ' + str(BOM) + '\n')
                    
                    line_list.append (' BUFFER Length     :  ' + str(Buffer_length))
                    
                    if os.path.isfile(File_name) == True:
                        line_list.append (' Length on DISK    :  ' + str(Size_length) + '\n\n')
                    else:
                        line_list.append ('\n')
                    
                    line_list.append (' NON-Blank Chars   :  ' + str(Non_blank_chars) + '\n')
                    
                    line_list.append (' WORDS     Count   :  ' + str(Words_count) + ' (Caution !)\n')
                    
                    if Err_Regex == False:
                        line_list.append (' NON-SPACE Count   :  ' + str(Non_space_count) + '\n\n')
                    else:
                        line_list.append (' NON-SPACE Count   :  ' + str(Non_space_count) + ' (ERROR : Ran out of stack space trying to match the regular expressions !)\n\n')
                    
                    line_list.append (' True EMPTY lines  :  ' + str(Empty_lines))
                    
                    line_list.append (' True BLANK lines  :  ' + str(Blank_lines) + '\n')
                    
                    line_list.append (' EMPTY/BLANK lines :  ' + str(Emp_blk_lines) + '\n')
                    
                    line_list.append (' NON-BLANK lines   :  ' + str(Non_blk_lines))
                    
                    line_list.append (' TOTAL Lines       :  ' + str(Total_lines) + '\n\n')
                    
                    line_list.append (' SELECTION(S)      :  ' + str(Chars_count) + Txt_chars + str(Bytes_count) + Txt_bytes + str(Num_sel) + Txt_ranges)
                    
                    notepad.new()
                    
                    editor.setText('\r\n'.join(line_list))
                    
                    if St_bar != 'ANSI' and St_bar != 'UTF-8' and St_bar != 'UTF-8-BOM' and St_bar != 'UTF-16 BE BOM' and St_bar != 'UTF-16 LE BOM':
                    
                        if Curr_encoding == 'UTF-8':  #  SAME value for both an 'UTF-8' or 'ANSI' file, when RE-INTERPRETED with the 'Encoding > Character Set > ...' feature
                    
                            notepad.prompt ('CURRENT file re-interpreted as ' + St_bar + '  =>  Possible ERRONEOUS results' + \
                                            '\nSo, CLOSE the file WITHOUT saving, RESTORE it (CTRL + SHIFT + T) and RESTART script', '!!! WARNING !!!', '')
                    
                    # ----Aé☀𝜜-----------------------------------------------------------------------------------------------------------------------------------------------------
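As a side note, the arithmetic behind the `BYTES Length` line of the report, for an UTF-8 file, can be checked outside Notepad++. The following is only a minimal sketch in plain Python 3 : the `utf8_breakdown` helper and the sample string are hypothetical and not part of the script. It sorts each character by its UTF-8 width, rebuilds the byte length with the same formula and compares it with what Python itself reports.

```python
# Hypothetical helper (NOT part of the script): count characters per UTF-8
# width and rebuild the byte length with the same formula as the report.
def utf8_breakdown(text):
    counts = {1: 0, 2: 0, 3: 0, 4: 0}
    eol = 0
    for ch in text:
        cp = ord(ch)
        if ch in '\r\n':
            eol += 1            # CR and LF are counted apart, 1 byte each
        elif cp < 0x80:
            counts[1] += 1      # U+0000 - U+007F
        elif cp < 0x800:
            counts[2] += 1      # U+0080 - U+07FF
        elif cp < 0x10000:
            counts[3] += 1      # U+0800 - U+FFFF
        else:
            counts[4] += 1      # characters over the BMP
    bytes_length = eol + counts[1] + 2 * counts[2] + 3 * counts[3] + 4 * counts[4]
    return eol, counts, bytes_length


# One character of each width, followed by a Windows line ending
sample = u'A\u00e9\u2600\U0001d71c\r\n'
eol, counts, bytes_length = utf8_breakdown(sample)
print(eol, counts, bytes_length)
print(bytes_length == len(sample.encode('utf-8')))
```

The sketch assumes Python 3, where iterating over a string yields full code points, including those over the BMP.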
                    

                    So, just test this script against any file, to uncover any remaining bug or limitation !!

                    I’ve also heard of compiled regexes in Python. Would that be interesting for this script ?
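For context, here is what a compiled regex looks like in plain Python. This is only a sketch under assumptions : it uses the standard `re` module on an ordinary string, whereas the script sends its patterns to `editor.research()`, which, as far as I understand, is handled by the Notepad++ regex engine and not by `re`. The names `WORD_RE`, `NON_BLANK_RE` and `count_matches` are hypothetical.

```python
import re

# Compile each pattern once, then reuse the compiled objects
WORD_RE = re.compile(r'\w+')
NON_BLANK_RE = re.compile(r'[^\r\n\t\x20]')


def count_matches(compiled, text):
    # finditer() yields one match object per non-overlapping match
    return sum(1 for _ in compiled.finditer(text))


text = 'one two\tthree\r\nfour'
print(count_matches(WORD_RE, text))
print(count_matches(NON_BLANK_RE, text))
```

Note that the `re` module already keeps a cache of recently used patterns, so the gain from compiling by hand is usually modest, and it would only matter for text processed through `re` itself.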

                    Best Regards,

                    guy038


                      Hi, All,

                      I realized that the line endings in the Summary report were a mess : some lines ended with \n whereas all the lines were joined with \r\n. Thus, by defining a Line_end variable equal to \r\n, the results are more consistent !

                      One advantage : if you do not want any additional line-breaks in the Summary report, simply change the line :

                      Line_end = '\r\n'
                      

                      to this one :

                      Line_end = ''
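To illustrate the effect of this setting, here is a minimal sketch in plain Python. The `build_report` function and its three lines are hypothetical; they only mimic how the script appends `Line_end` to some entries before joining everything with `\r\n`.

```python
def build_report(line_end):
    # Hypothetical mini-report: the first entry closes a section,
    # so it receives the extra line ending, like in the script
    line_list = []
    line_list.append(' Current VIEW      :  MAIN View' + line_end)
    line_list.append(' 1-BYTE  Chars     :  10')
    line_list.append(' 2-BYTES Chars     :  0')
    return '\r\n'.join(line_list)


spaced = build_report('\r\n')   # one blank line after the first entry
compact = build_report('')      # no blank line at all

print(repr(spaced))
print(repr(compact))
```

With `'\r\n'`, the sections of the report are separated by blank lines; with `''`, all the lines follow each other directly.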
                      

                      So, here is the v0.8 version of my script :

                      # encoding=utf-8
                      
                      #-------------------------------------------------------------------------
                      #                    STATISTICS about the CURRENT file ( v0.8 )
                      #-------------------------------------------------------------------------
                      
                      from __future__ import print_function    # for Python2 compatibility
                      
                      from Npp import *
                      
                      import re
                      
                      import os, time, datetime
                      
                      import ctypes
                      
                      from ctypes.wintypes import BOOL, HWND, WPARAM, LPARAM, UINT
                      
                      # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                      #  From @alan-kilborn, in post https://community.notepad-plus-plus.org/topic/21733/pythonscript-different-behavior-in-script-vs-in-immediate-mode/4
                      # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                      
                      def npp_get_statusbar(statusbar_item_number):
                      
                          WNDENUMPROC = ctypes.WINFUNCTYPE(BOOL, HWND, LPARAM)
                          FindWindowW = ctypes.windll.user32.FindWindowW
                          FindWindowExW = ctypes.windll.user32.FindWindowExW
                          SendMessageW = ctypes.windll.user32.SendMessageW
                          LRESULT = LPARAM
                          SendMessageW.restype = LRESULT
                          SendMessageW.argtypes = [ HWND, UINT, WPARAM, LPARAM ]
                          EnumChildWindows = ctypes.windll.user32.EnumChildWindows
                          GetClassNameW = ctypes.windll.user32.GetClassNameW
                          create_unicode_buffer = ctypes.create_unicode_buffer
                      
                          SBT_OWNERDRAW = 0x1000
                          WM_USER = 0x400; SB_GETTEXTLENGTHW = WM_USER + 12; SB_GETTEXTW = WM_USER + 13
                      
                          npp_get_statusbar.STATUSBAR_HANDLE = None
                      
                          def get_result_from_statusbar(statusbar_item_number):
                              assert statusbar_item_number <= 5
                              retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTLENGTHW, statusbar_item_number, 0)
                              length = retcode & 0xFFFF
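                              draw_type = (retcode >> 16) & 0xFFFF  #  High-order word = drawing type ( avoids shadowing the built-in 'type' )
                              assert (draw_type != SBT_OWNERDRAW)
                              text_buffer = create_unicode_buffer(length + 1)  #  + 1 : room for the terminating NUL char written by SB_GETTEXTW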
                              retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTW, statusbar_item_number, ctypes.addressof(text_buffer))
                              retval = '{}'.format(text_buffer[:length])
                              return retval
                      
                          def EnumCallback(hwnd, lparam):
                              curr_class = create_unicode_buffer(256)
                              GetClassNameW(hwnd, curr_class, 256)
                              if curr_class.value.lower() == "msctls_statusbar32":
                                  npp_get_statusbar.STATUSBAR_HANDLE = hwnd
                                  return False  # stop the enumeration
                              return True  # continue the enumeration
                      
                          npp_hwnd = FindWindowW(u"Notepad++", None)
                          EnumChildWindows(npp_hwnd, WNDENUMPROC(EnumCallback), 0)
                          if npp_get_statusbar.STATUSBAR_HANDLE: return get_result_from_statusbar(statusbar_item_number)
                          assert False
                      
                      St_bar = npp_get_statusbar(4)  # Zone 4 ( STATUSBARSECTION.UNICODETYPE )
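The two bit operations inside `get_result_from_statusbar` split the value returned by the `SB_GETTEXTLENGTHW` message : the low-order word holds the text length, in characters, and the high-order word holds the drawing type. Here is a minimal sketch of that decoding, with a hypothetical helper name and simulated return values, so that it runs without Notepad++ or any Windows call.

```python
SBT_OWNERDRAW = 0x1000  # same constant as in the script


def split_statusbar_retcode(retcode):
    """Hypothetical helper: return (text_length, draw_type) from a packed value."""
    length = retcode & 0xFFFF              # low-order word
    draw_type = (retcode >> 16) & 0xFFFF   # high-order word
    return length, draw_type


# Simulated return values (assumptions, for illustration only)
plain = split_statusbar_retcode(5)
owner = split_statusbar_retcode((SBT_OWNERDRAW << 16) | 5)

print(plain)
print(owner)
print(owner[1] == SBT_OWNERDRAW)
```

An owner-drawn part has no text to read back, which is why the script asserts that the drawing type is not `SBT_OWNERDRAW` before fetching the text.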
                      
                      

                      Continuation on next post

                      guy038


                        Hi all,

                        Continuation of version v0.8 of the script :

                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        def number(occ):
                            global num
                            num += 1
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        Curr_encoding = str(notepad.getEncoding())
                        
                        if Curr_encoding == 'ENC8BIT':
                            Curr_encoding = 'ANSI'
                        
                        if Curr_encoding == 'COOKIE':
                            Curr_encoding = 'UTF-8'
                        
                        if Curr_encoding == 'UTF8':
                            Curr_encoding = 'UTF-8-BOM'
                        
                        if Curr_encoding == 'UCS2BE':
                            Curr_encoding = 'UTF-16 BE BOM'
                        
                        if Curr_encoding == 'UCS2LE':
                            Curr_encoding = 'UTF-16 LE BOM'
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                            Line_title = 95
                        else:
                            Line_title = 75
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        File_name = notepad.getCurrentFilename()
                        
                        if os.path.isfile(File_name) == True:
                        
                            Creation_date = time.ctime(os.path.getctime(File_name))
                        
                            Modif_date = time.ctime(os.path.getmtime(File_name))
                        
                            Size_length = os.path.getsize(File_name)
                        
                            RO_flag = 'YES'
                        
                            if os.access(File_name, os.W_OK):
                                RO_flag = 'NO'
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        RO_editor = 'NO'
                        
                        if editor.getReadOnly() == True:
                            RO_editor = 'YES'
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        if notepad.getCurrentView() == 0:
                            Curr_view = 'MAIN View'
                        else:
                            Curr_view = 'SECONDARY view'
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        Curr_lang = notepad.getCurrentLang()
                        
                        Lang_desc = notepad.getLanguageDesc(Curr_lang)
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        if editor.getEOLMode() == 0:
                            Curr_eol = 'Windows (CR LF)'
                        
                        if editor.getEOLMode() == 1:
                            Curr_eol = 'Macintosh (CR)'
                        
                        if editor.getEOLMode() == 2:
                            Curr_eol = 'Unix (LF)'
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        Curr_wrap = 'NO'
                        
                        if editor.getWrapMode() == 1:
                            Curr_wrap = 'YES'
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        num = 0
                        if Curr_encoding == 'ANSI':
                            editor.research(r'[^\r\n]', number)
                        
                        if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                            editor.research(r'(?![\r\n])[\x{0000}-\x{007F}]', number)
                        
                        Total_1_byte = num
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        num = 0
                        if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                            editor.research(r'[\x{0080}-\x{07FF}]', number)
                        
                        if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
                            editor.research(r'(?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]', number)  #  ALL BMP vchars ( With PYTHON, the [^\r\n\x{D800}-\x{DFFF}] syntax does NOT work properly !)
                        
                        Total_2_bytes = num
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        num = 0
                        if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                            editor.research(r'(?![\x{D800}-\x{DFFF}])[\x{0800}-\x{FFFF}]', number)
                        
                        Total_3_bytes = num
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        Total_BMP = Total_1_byte + Total_2_bytes + Total_3_bytes
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        num = 0
                        editor.research(r'[^\r\n]', number)
                        
                        Total_standard = num
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        Total_4_bytes = 0  #  By default
                        
                        if Curr_encoding != 'ANSI':
                            Total_4_bytes = Total_standard - Total_BMP
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        num = 0
                        editor.research(r'\r|\n', number)
                        
                        Total_EOL = num
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        Total_chars = Total_EOL + Total_standard
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        if Curr_encoding == 'ANSI':
                            Bytes_length = Total_EOL + Total_1_byte
                        
                        if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                            Bytes_length = Total_EOL + Total_1_byte + 2 * Total_2_bytes + 3 * Total_3_bytes + 4 * Total_4_bytes
                        
                        if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
                            Bytes_length = 2 * Total_EOL + 2 * Total_BMP + 4 * Total_4_bytes
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        BOM = 0  #  Default ANSI and UTF-8
                        
                        if Curr_encoding == 'UTF-8-BOM':
                            BOM = 3
                        
                        if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
                            BOM = 2
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        Buffer_length = Bytes_length + BOM
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        num = 0
                        editor.research(r'[^\r\n\t\x20]', number)
                        
                        Non_blank_chars = num
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        num = 0
                        editor.research(r'\w+', number)
                        
                        Words_count = num
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        Err_Regex = False
                        
                        num = 0
                        
                        if Curr_encoding == 'ANSI' or Total_4_bytes == 0:
                            editor.research(r'\S+', number)
                        else:
                            try:
                                editor.research(r'(?:(?!\s).[\x{D800}-\x{DFFF}]?)+', number)
                            except RuntimeError:
                                Err_Regex = True
                        
                        Non_space_count = num
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        num = 0
                        if Curr_encoding == 'ANSI':
                            editor.research(r'(?<!\f)^(?:\r\n|\r|\n)', number)
                        else:
                            editor.research(r'(?<![\f\x{0085}\x{2028}\x{2029}])^(?:\r\n|\r|\n)', number)
                        
                        Empty_lines = num
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        num = 0
                        if Curr_encoding == 'ANSI':
                            editor.research(r'(?<!\f)^[\t\x20]+(?:\r\n|\r|\n|\z)', number)
                        else:
                            editor.research(r'(?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]+(?:\r\n|\r|\n|\z)', number)
                        
                        Blank_lines = num
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        Emp_blk_lines = Empty_lines + Blank_lines
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        num = 0
                        if Curr_encoding == 'ANSI':
                            editor.research(r'(?-s)\r\n|\r|\n|(?:.|\f)\z', number)
                        else:
                            editor.research(r'(?-s)\r\n|\r|\n|(?:.|[\f\x{0085}\x{2028}\x{2029}])\z', number)
                        
                        Total_lines = num
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        Non_blk_lines = Total_lines - Emp_blk_lines
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        Num_sel = editor.getSelections()  # Get ALL selections ( EMPTY or NOT )
                        
                        # print ('Res = ', Num_sel)
                        
                        if Num_sel != 0:
                        
                            Bytes_count = 0
                            Chars_count = 0
                        
                            for n in range(Num_sel):
                        
                                Bytes_count += editor.getSelectionNEnd(n) - editor.getSelectionNStart(n)
                        
                                Chars_count += editor.countCharacters(editor.getSelectionNStart(n), editor.getSelectionNEnd(n))
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                            if Chars_count < 2:
                                Txt_chars = ' selected char ('
                        
                            else:
                                Txt_chars = ' selected chars ('
                        
                        
                            if Bytes_count < 2:
                                Txt_bytes = ' selected byte) in '
                        
                            else:
                                Txt_bytes = ' selected bytes) in '
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                            if Num_sel < 2 and Bytes_count == 0:
                                Txt_ranges = ' EMPTY range\n'
                        
                            if Num_sel < 2 and Bytes_count > 0:
                                Txt_ranges = ' range\n'
                        
                            if Num_sel > 1 and Bytes_count == 0:
                                Txt_ranges = ' EMPTY ranges\n'
                        
                            if Num_sel > 1 and Bytes_count > 0:
                                Txt_ranges = ' ranges (EMPTY or NOT)\n'
                        
                        # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                        
                        line_list = []  # empty list
                        
                        Line_end = '\r\n'
                        
                        line_list.append ('-' * Line_title)
                        
                        line_list.append (' ' * int((Line_title - 37) / 2) + 'SUMMARY on ' + str(datetime.datetime.now()))
                        
                        line_list.append ('-' * Line_title + Line_end)
                        
                        line_list.append (' FULL File Path    :  ' + File_name + Line_end)
                        
                        if os.path.isfile(File_name) == True:
                        
                            line_list.append(' CREATION     Date :  ' + Creation_date)
                        
                            line_list.append(' MODIFICATION Date :  ' + Modif_date + Line_end)
                        
                            line_list.append(' READ-ONLY flag    :  ' + RO_flag )
                        
                        line_list.append (' READ-ONLY editor  :  ' + RO_editor + Line_end * 2)
                        
                        line_list.append (' Current VIEW      :  ' + Curr_view + Line_end)
                        
                        line_list.append (' Current ENCODING  :  ' + Curr_encoding + Line_end)
                        
                        line_list.append (' Current LANGUAGE  :  ' + str(Curr_lang) + '  (' + Lang_desc + ')' + Line_end)
                        
                        line_list.append (' Current Line END  :  ' + Curr_eol + Line_end)
                        
                        line_list.append (' Current WRAPPING  :  ' + Curr_wrap + Line_end * 2)
                        
                        line_list.append (' 1-BYTE  Chars     :  ' + str(Total_1_byte))
                        
                        line_list.append (' 2-BYTES Chars     :  ' + str(Total_2_bytes))
                        
                        line_list.append (' 3-BYTES Chars     :  ' + str(Total_3_bytes) + Line_end)
                        
                        line_list.append (' Sum BMP Chars     :  ' + str(Total_BMP))
                        
                        line_list.append (' 4-BYTES Chars     :  ' + str(Total_4_bytes) + Line_end)
                        
                        line_list.append (' CHARS w/o CR & LF :  ' + str(Total_standard))
                        
                        line_list.append (' EOL ( CR or LF )  :  ' + str(Total_EOL) + Line_end)
                        
                        line_list.append (' TOTAL characters  :  ' + str(Total_chars) + Line_end * 2)
                        
                        if Curr_encoding == 'ANSI':
                            line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b)')
                        
                        if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                            line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b + '\
                            + str(Total_2_bytes) + ' x 2b + ' + str(Total_3_bytes) + ' x 3b + ' + str(Total_4_bytes) + ' x 4b)')
                        
                        if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
                            line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 2 + ' + str(Total_BMP) + ' x 2b + ' + str(Total_4_bytes) + ' x 4b)')
                        
                        line_list.append (' Byte Order Mark   :  ' + str(BOM) + Line_end)
                        
                        line_list.append (' BUFFER Length     :  ' + str(Buffer_length))
                        
                        if os.path.isfile(File_name) == True:
                            line_list.append (' Length on DISK    :  ' + str(Size_length) + Line_end * 2)
                        else:
                            line_list.append ('\n')
                        
                        line_list.append (' NON-Blank Chars   :  ' + str(Non_blank_chars) + Line_end)
                        
                        line_list.append (' WORDS     Count   :  ' + str(Words_count) + ' (Caution !)' + Line_end)
                        
                        if Err_Regex == False:
                            line_list.append (' NON-SPACE Count   :  ' + str(Non_space_count) + Line_end * 2)
                        else:
                            line_list.append (' NON-SPACE Count   :  ' + str(Non_space_count) + ' (ERROR : Ran out of stack space trying to match the regular expressions !)' + Line_end * 2)
                        
                        line_list.append (' True EMPTY lines  :  ' + str(Empty_lines))
                        
                        line_list.append (' True BLANK lines  :  ' + str(Blank_lines) + Line_end)
                        
                        line_list.append (' EMPTY/BLANK lines :  ' + str(Emp_blk_lines) + Line_end)
                        
                        line_list.append (' NON-BLANK lines   :  ' + str(Non_blk_lines))
                        
                        line_list.append (' TOTAL Lines       :  ' + str(Total_lines) + Line_end * 2)
                        
                        line_list.append (' SELECTION(S)      :  ' + str(Chars_count) + Txt_chars + str(Bytes_count) + Txt_bytes + str(Num_sel) + Txt_ranges)
                        
                        notepad.new()
                        
                        editor.setText('\r\n'.join(line_list))
                        
                        if St_bar != 'ANSI' and St_bar != 'UTF-8' and St_bar != 'UTF-8-BOM' and St_bar != 'UTF-16 BE BOM' and St_bar != 'UTF-16 LE BOM':
                        
                            if Curr_encoding == 'UTF-8':  #  SAME value for both an 'UTF-8' or 'ANSI' file, when RE-INTERPRETED with the 'Encoding > Character Set > ...' feature
                        
                                notepad.prompt ('CURRENT file re-interpreted as ' + St_bar + '  =>  Possible ERRONEOUS results' + \
                                                '\nSo, CLOSE the file WITHOUT saving, RESTORE it (CTRL + SHIFT + T) and RESTART script', '!!! WARNING !!!', '')
                        
                        # ----Aé☀𝜜-----------------------------------------------------------------------------------------------------------------------------------------------------
                        

                        Best Regards,

                        guy038

                        • guy038G
                          guy038
                          last edited by guy038

                          Hi, All,

                          You’ll find, below, the v1.0 version of my script. I changed a lot of things :

                          • I added a timer to measure the execution time of the script, which is written right after the current date, at the beginning of the summary

                          • I modified some regexes in order to improve their performance, and changed the order in which they are run

                          • I used the PythonScript methods editor.getLength(), editor.countCharacters(0, editor.getLength()) and editor.getLineCount() to get, respectively, the Bytes_length value ( without a possible BOM ), the Total_chars value and the Total_lines value. Note that, in case of a UTF-8 or UTF-8-BOM encoded file, we get two relations :

                            • (A) Bytes_length - Total_EOL - Total_1_byte - 2 × Total_2_bytes - 3 × Total_3_bytes = 4 × Total_4_bytes
                            • (B) Total_chars - Total_EOL - Total_1_byte - Total_2_bytes - Total_3_bytes           = Total_4_bytes

                          So, subtracting (B) from (A), we can deduce the equation :

                          Total_4_bytes = ( Bytes_length - Total_chars - Total_2_bytes - 2 × Total_3_bytes ) / 3

                          and then, from (B) :

                          Total_1_byte = Total_chars - Total_EOL - Total_2_bytes - Total_3_bytes - Total_4_bytes

                          Thus, after counting the number of Total_2_bytes and Total_3_bytes, the two results Total_4_bytes and Total_1_byte are easily deduced. This new approach cuts the execution time of the script by a factor of 2 to 3, because, most of the time, the file contains only 1-byte chars :-))

                          However, the value returned by editor.getLength() wrongly remains the same in case of a UTF-16 BE BOM or UTF-16 LE BOM encoded file. Thus, I needed to calculate the Total_4_bytes and Bytes_length values, from the number of Total_2_bytes, with the relations :

                          Total_4_bytes = Total_chars - Total_EOL - Total_2_bytes

                          Bytes_length = 2 × Total_EOL + 2 × Total_2_bytes + 4 × Total_4_bytes

                          • Now, because some huge files may take a long time to produce the Summary results ( even with the native N++ version, BTW ! ), you can follow the progress of the different searches in the Python console, which is automatically enabled at the beginning of the script and disabled right before outputting the results

                          • At the end of the script, I replaced the notepad.prompt method with the notepad.messageBox method in order to display the warning ( more logical ! )
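
The byte-count relations given in the list above can be checked outside Notepad++, on an ordinary Python 3 string. The following is only a minimal sketch: the helper names `count_classes`, `deduce_utf8` and `utf16_bytes_length` are hypothetical and not part of the script, and `len(text.encode(...))` stands in for the value that `editor.getLength()` would report.

```python
# Sketch only: plain Python 3, no PythonScript API involved.
# All function names below are hypothetical helpers for illustration.

def count_classes(text):
    """Return (eol, one, two, three, four): characters per UTF-8 byte width."""
    eol = one = two = three = four = 0
    for ch in text:
        cp = ord(ch)
        if ch in '\r\n':
            eol += 1
        elif cp < 0x80:
            one += 1
        elif cp < 0x800:
            two += 1
        elif cp < 0x10000:
            three += 1
        else:
            four += 1
    return eol, one, two, three, four


def deduce_utf8(bytes_length, total_chars, total_eol, total_2, total_3):
    """Deduce the 4-byte and 1-byte counts from relations (A) and (B)."""
    total_4 = (bytes_length - total_chars - total_2 - 2 * total_3) // 3
    total_1 = total_chars - total_eol - total_2 - total_3 - total_4
    return total_4, total_1


def utf16_bytes_length(total_eol, total_bmp, total_4):
    """UTF-16 length in bytes, without BOM: 2 per BMP char or EOL, 4 otherwise."""
    return 2 * total_eol + 2 * total_bmp + 4 * total_4


# One char of each width, an EOL pair, then a few more chars
sample = 'A\u00e9\u2600\U0001d71c\r\nabc \U0001d71c'

eol, one, two, three, four = count_classes(sample)
utf8_len = len(sample.encode('utf-8'))

print(deduce_utf8(utf8_len, len(sample), eol, two, three) == (four, one))
print(utf16_bytes_length(eol, one + two + three, four)
      == len(sample.encode('utf-16-le')))
```

Both comparisons should print True: only the 2-byte and 3-byte characters need to be counted explicitly, and the 4-byte and 1-byte totals follow from the two lengths.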


                          IMPORTANT :

                          • Never switch to another tab while this script is running. Otherwise, you’ll probably get unpredictable or negative results !

                          • If, by watching the console messages, you think that the results are taking too long for a specific file and you prefer to abort its Summary report, simply stop the current Python script with the usual Plugins > Python Script > Stop script menu option


                          Now, I was a bit upset by some inconsistent results regarding the number of NON-SPACE strings, when the current file, with a Unicode encoding, contains some characters over the BMP

                          So, I searched all my posts since 2013, as well as some other files used as documentation, keeping only those containing some four-byte characters. Here is the list of these files with the reported results :

                          •=============================•===========•=================•==================•============•================•
                          |                             |           |    Expected     |  Summary Report  |            |                |
                          |           Filename          |   4_BYTES |         NON-SPACE count            | Difference |    Encoding    |
                          |                             |           | (?:(?!\s).[\x{D800}-\x{DFFF}]?)+   |            |                |
                          •=============================•===========•=================•==================•============•================•
                          |  Symbola_Monospacified.txt  |   11,951  |     199,891     |      199,882     |      - 9   |  UTF-8-BOM     |
                          |  Total_Chars.txt            |  262,136  |           9     |           18     |      + 9   |  UTF-8-BOM     |
                          •=============================•===========•=================•==================•============•================•
                          |  Caractères.txt             |    2,901  |       7,361     |        7,358     |      - 3   |  UTF-8-BOM     |
                          |  Test_2.txt                 |    1,276  |           8     |            9     |      + 1   |  UTF-8         |
                          |  Test_1.txt                 |      881  |           8     |            9     |      + 1   |  UTF-8         |
                          |  Plane_0.txt                |        0  |           9     |           10     |      + 1   |  UCS-2 BE BOM  |
                          |  Clemens.txt                |    3,968  |       2,816     |        2,818     |      + 2   |  UTF-8-BOM     |
                          |  Planes_0+1.txt             |   65,534  |           9     |           12     |      + 3   |  UTF-8-BOM     |
                          •=============================•===========•=================•==================•============•================•
                          |  Chars_Over_BMP.txt         |       28  |         455     |          455     |        0   |  UTF-8-BOM     |
                          |  Entites_by_Name.txt        |      133  |      15,968     |       15,968     |        0   |  UTF-8         |
                          |  Entites_by_Number.txt      |      133  |      15,968     |       15,968     |        0   |  UTF-8         |
                          |  Invisible_chars.txt        |       31  |       3,459     |        3,459     |        0   |  UTF-8-BOM     |
                          |  Osmanya_Tout.txt           |      119  |         605     |          605     |        0   |  UTF-8-BOM     |
                          |  Smileys.txt                |    1,031  |      10,157     |       10,157     |        0   |  UTF-8-BOM     |
                          |  Alan_K.txt                 |      114  |      46,082     |       46,082     |        0   |  UTF-8         |
                          |  Alexolog.txt               |       13  |       2,199     |        2,199     |        0   |  UTF-8         |
                          |  André_Z.txt                |        8  |       5,860     |        5,860     |        0   |  UTF-8         |
                          |  Bidule.txt                 |        1  |         327     |          327     |        0   |  UTF-8         |
                          |  Carypt.txt                 |        1  |       3,551     |        3,551     |        0   |  UTF-8         |
                          |  Dean_Corso.txt             |      761  |       9,632     |        9,632     |        0   |  UTF-8         |
                          |  Don_Ho.txt                 |        2  |      41,426     |       41,426     |        0   |  UTF-8         |
                          |  Durkin.txt                 |      144  |       4,638     |        4,638     |        0   |  UTF-8         |
                          |  Dylan.txt                  |       34  |       2,180     |        2,180     |        0   |  UTF-8         |
                          |  Furek.txt                  |       20  |         499     |          499     |        0   |  UTF-8         |
                          |  Gary_2.txt                 |        2  |         458     |          458     |        0   |  UTF-8         |
                          |  Haleba.txt                 |        5  |         817     |          817     |        0   |  UTF-8         |
                          |  ImSpecial.txt              |        1  |         161     |          161     |        0   |  UTF-8         |
                          |  Joss.txt                   |        6  |         105     |          105     |        0   |  UTF-8         |
                          |  JR.txt                     |       39  |       1,735     |        1,735     |        0   |  UTF-8         |
                          |  Mark_Olson.txt             |        1  |       3,652     |        3,652     |        0   |  UTF-8         |
                          |  Minus_Majus.txt            |       62  |       9,931     |        9,931     |        0   |  UTF-8         |
                          |  Niting-jain.txt            |        4  |         537     |          537     |        0   |  UTF-8         |
                          |  PeterCJ.txt                |       31  |      37,323     |       37,323     |        0   |  UTF-8         |
                          |  Petr_jaja.txt              |       14  |       3,168     |        3,168     |        0   |  UTF-8         |
                          |  Pintas.txt                 |        4  |         614     |          614     |        0   |  UTF-8         |
                          |  Register.txt               |       20  |         242     |          242     |        0   |  UTF-8         |
                          |  Scott_3.txt                |        4  |      42,552     |       42,552     |        0   |  UTF-8         |
                          |  Skevich.txt                |        6  |         715     |          715     |        0   |  UTF-8         |
                          |  Statistiques.txt           |        7  |       9,012     |        9,012     |        0   |  UTF-8         |
                          |  Summary.txt                |        7  |       4,322     |        4,322     |        0   |  UTF-8         |
                          |  Summary_NEW.txt            |       10  |       8,903     |        8,903     |        0   |  UTF-8         |
                          |  Uzivatel.txt               |        2  |         873     |          873     |        0   |  UTF-8         |
                          |  Xavier_mdq.txt             |       13  |       3,652     |        3,652     |        0   |  UTF-8         |
                          |  Text.txt                   |    2,400  |       1,000     |        1,000     |        0   |  UTF-8         |
                          •=============================•===========•=================•==================•============•================•
                          

                          From that list, I deduced that the NON-SPACE count is erroneous only in very rare cases, especially when the current file contains, consecutively :

                          • All the characters of a font

                          • All the characters of an Unicode range

                          • All the characters of all Unicode ranges

                          Luckily, in all the other cases, where these four-byte chars are at random positions, the Summary report always gives the right results regarding the NON-SPACE count !
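
For anyone wanting to cross-check a NON-SPACE count independently of Notepad++, here is a minimal sketch in plain Python 3, where a character over the BMP is a single code point, so no surrogate handling is needed. The helpers `non_space_count` and `chars_over_bmp` are hypothetical names, and Python's `\s` is assumed to be close enough to the Boost definition for ordinary text, which may not hold for every exotic whitespace character.

```python
import re

# Sketch only: plain Python 3, run on the decoded text of a file.
# Both helper names are hypothetical and not part of the script.

def non_space_count(text):
    """Number of maximal runs of non-whitespace characters."""
    return len(re.findall(r'\S+', text))


def chars_over_bmp(text):
    """Number of characters above U+FFFF (4 bytes in UTF-8)."""
    return sum(1 for ch in text if ord(ch) > 0xFFFF)


# Two consecutive 4-byte chars form one run; a trailing one extends 'three'
sample = 'one \U0001d71c\U0001d71c two\r\n\tthree\U0001f600 four'

print(non_space_count(sample), chars_over_bmp(sample))
```

On a real file, the text would first be read with the matching encoding, for example `io.open(path, encoding='utf-8-sig').read()`, before calling the two helpers.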


                          Here is the v1.0 version of my script, split across two posts :

                          # encoding=utf-8
                          
                          #-------------------------------------------------------------------------
                          #                    STATISTICS about the CURRENT file ( v1.0 )
                          #-------------------------------------------------------------------------
                          
                          from __future__ import print_function    # for Python2 compatibility
                          
                          from Npp import *
                          
                          import re
                          
                          import os, time, datetime
                          
                          import ctypes
                          
                          from ctypes.wintypes import BOOL, HWND, WPARAM, LPARAM, UINT
                          
                          # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                          #  From @alan-kilborn, in post https://community.notepad-plus-plus.org/topic/21733/pythonscript-different-behavior-in-script-vs-in-immediate-mode/4
                          # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                          
                          def npp_get_statusbar(statusbar_item_number):
                          
                              WNDENUMPROC = ctypes.WINFUNCTYPE(BOOL, HWND, LPARAM)
                              FindWindowW = ctypes.windll.user32.FindWindowW
                              FindWindowExW = ctypes.windll.user32.FindWindowExW
                              SendMessageW = ctypes.windll.user32.SendMessageW
                              LRESULT = LPARAM
                              SendMessageW.restype = LRESULT
                              SendMessageW.argtypes = [ HWND, UINT, WPARAM, LPARAM ]
                              EnumChildWindows = ctypes.windll.user32.EnumChildWindows
                              GetClassNameW = ctypes.windll.user32.GetClassNameW
                              create_unicode_buffer = ctypes.create_unicode_buffer
                          
                              SBT_OWNERDRAW = 0x1000
                              WM_USER = 0x400; SB_GETTEXTLENGTHW = WM_USER + 12; SB_GETTEXTW = WM_USER + 13
                          
                              npp_get_statusbar.STATUSBAR_HANDLE = None
                          
                              def get_result_from_statusbar(statusbar_item_number):
                                  assert statusbar_item_number <= 5
                                  retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTLENGTHW, statusbar_item_number, 0)
                                  length = retcode & 0xFFFF
                                  type = (retcode >> 16) & 0xFFFF
                                  assert (type != SBT_OWNERDRAW)
                                  text_buffer = create_unicode_buffer(length + 1)  # + 1 for the terminating null written by SB_GETTEXTW
                                  retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTW, statusbar_item_number, ctypes.addressof(text_buffer))
                                  retval = '{}'.format(text_buffer[:length])
                                  return retval
                          
                              def EnumCallback(hwnd, lparam):
                                  curr_class = create_unicode_buffer(256)
                                  GetClassNameW(hwnd, curr_class, 256)
                                  if curr_class.value.lower() == "msctls_statusbar32":
                                      npp_get_statusbar.STATUSBAR_HANDLE = hwnd
                                      return False  # stop the enumeration
                                  return True  # continue the enumeration
                          
                              npp_hwnd = FindWindowW(u"Notepad++", None)
                              EnumChildWindows(npp_hwnd, WNDENUMPROC(EnumCallback), 0)
                              if npp_get_statusbar.STATUSBAR_HANDLE: return get_result_from_statusbar(statusbar_item_number)
                              assert False
                          
                          St_bar = npp_get_statusbar(4)  # Zone 4 ( STATUSBARSECTION.UNICODETYPE )
                          
                          

                          Continuation on next post

                          guy038

                          • guy038G
                            guy038
                            last edited by guy038

                            Hi all,

                            Continuation of version v1.0 of the script :

                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            def number(occ):
                                global num
                                num += 1
                            
                            console.show()
                            
                            console.clear()
                            
                            Start_time = time.time()
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            Curr_encoding = str(notepad.getEncoding())
                            
                            if Curr_encoding == 'ENC8BIT':
                                Curr_encoding = 'ANSI'
                            
                            if Curr_encoding == 'COOKIE':
                                Curr_encoding = 'UTF-8'
                            
                            if Curr_encoding == 'UTF8':
                                Curr_encoding = 'UTF-8-BOM'
                            
                            if Curr_encoding == 'UCS2BE':
                                Curr_encoding = 'UTF-16 BE BOM'
                            
                            if Curr_encoding == 'UCS2LE':
                                Curr_encoding = 'UTF-16 LE BOM'
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                                Line_title = 95
                            else:
                                Line_title = 75
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            File_name = notepad.getCurrentFilename()
                            
                            if os.path.isfile(File_name) == True:
                            
                                Creation_date = time.ctime(os.path.getctime(File_name))
                            
                                Modif_date = time.ctime(os.path.getmtime(File_name))
                            
                                Size_length = os.path.getsize(File_name)
                            
                                RO_flag = 'YES'
                            
                                if os.access(File_name, os.W_OK):
                                    RO_flag = 'NO'
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            RO_editor = 'NO'
                            
                            if editor.getReadOnly() == True:
                                RO_editor = 'YES'
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            if notepad.getCurrentView() == 0:
                                Curr_view = 'MAIN View'
                            else:
                                Curr_view = 'SECONDARY view'
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            Curr_lang = notepad.getCurrentLang()
                            
                            Lang_desc = notepad.getLanguageDesc(Curr_lang)
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            if editor.getEOLMode() == 0:
                                Curr_eol = 'Windows (CR LF)'
                            
                            if editor.getEOLMode() == 1:
                                Curr_eol = 'Macintosh (CR)'
                            
                            if editor.getEOLMode() == 2:
                                Curr_eol = 'Unix (LF)'
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            Curr_wrap = 'NO'
                            
                            if editor.getWrapMode() == 1:
                                Curr_wrap = 'YES'
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            print ('START')
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            Bytes_length = editor.getLength()
                            
                            Total_chars = editor.countCharacters(0, editor.getLength())
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            num = 0
                            editor.research(r'\r|\n', number)
                            
                            Total_EOL = num
                            
                            print ('EOL')
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            Total_standard = Total_chars - Total_EOL
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            if Curr_encoding == 'ANSI':
                            
                                Total_BMP = Total_standard
                                
                                Total_1_byte = Total_BMP
                            
                                Total_2_bytes = 0
                            
                                Total_3_bytes = 0
                            
                                Total_4_bytes = 0
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                            
                                num = 0
                                editor.research(r'[\x{0080}-\x{07FF}]', number)
                            
                                Total_2_bytes = num
                            
                                print ('2-BYTES')
                            
                                # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                                num = 0
                                editor.research(r'[\x{0800}-\x{D7FF}\x{E000}-\x{FFFF}]', number)
                            
                                Total_3_bytes = num
                            
                                print ('3-BYTES')
                            
                                # -----------------------------------------------------------------------------------------------------------------------------
                            
                                Total_4_bytes = ( Bytes_length - Total_chars - Total_2_bytes - 2 * Total_3_bytes ) // 3  #  Integer division, so the count stays an int under Python 3
                            
                                Total_1_byte = Total_standard - Total_2_bytes - Total_3_bytes - Total_4_bytes
                            
                                Total_BMP = Total_1_byte + Total_2_bytes + Total_3_bytes
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            
                            if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
                            
                                num = 0
                                editor.research(r'(?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]', number)  #  ALL BMP chars different from '\r' and '\n'
                            
                                Total_2_bytes = num
                            
                                Total_4_bytes = Total_standard - Total_2_bytes
                            
                                Total_BMP = Total_2_bytes
                            
                                Total_1_byte = 0
                            
                                Total_3_bytes = 0
                            
                                Bytes_length = 2 * Total_EOL + 2 * Total_BMP + 4 * Total_4_bytes
                            
                                print ('2-BYTES')
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            BOM = 0  #  Default ANSI and UTF-8
                            
                            if Curr_encoding == 'UTF-8-BOM':
                                BOM = 3
                            
                            if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
                                BOM = 2
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            Buffer_length = Bytes_length + BOM
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            num = 0
                            editor.research(r'\t|\x20', number)
                            
                            Non_blank_chars = Total_standard - num
                            
                            print ('NON-BLANK')
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            num = 0
                            editor.research(r'\w+', number)
                            
                            Words_count = num
                            
                            print ('WORDS')
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            Err_regex = False
                            
                            num = 0
                            
                            if Curr_encoding == 'ANSI' or Total_4_bytes == 0:
                                editor.research(r'\S+', number)
                            else:
                                try:
                                    editor.research(r'(?:(?!\s).[\x{D800}-\x{DFFF}]?)+', number)
                                except RuntimeError:
                                    Err_regex = True
                            
                            Non_space_count = num
                            
                            print ('NON-SPACE')
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            num = 0
                            if Curr_encoding == 'ANSI':
                                editor.research(r'\f^(?:\r\n|\r|\n)', number)
                            else:
                                editor.research(r'[\f\x{0085}\x{2028}\x{2029}]^(?:\r\n|\r|\n)', number)
                            
                            Special_empty = num
                            
                            num = 0
                            editor.research(r'^(?:\r\n|\r|\n)', number)
                            
                            Default_empty = num
                            
                            Empty_lines = Default_empty - Special_empty
                            
                            print ('EMPTY lines')
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            num = 0
                            if Curr_encoding == 'ANSI':
                                editor.research(r'\f^[\t\x20]+(?:\r\n|\r|\n|\z)', number)
                            else:
                                editor.research(r'[\f\x{0085}\x{2028}\x{2029}]^[\t\x20]+(?:\r\n|\r|\n|\z)', number)
                            
                            Special_blank = num
                            
                            num = 0
                            editor.research(r'^[\t\x20]+(?:\r\n|\r|\n|\z)', number)
                            
                            Default_blank = num
                            
                            Blank_lines = Default_blank - Special_blank
                            
                            print ('BLANK lines')
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            Emp_blk_lines = Empty_lines + Blank_lines
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            Total_lines = editor.getLineCount()
                            
                            num = 0
                            editor.research(r'(?-s)^.+\z', number)
                            
                            if num == 0:
                                Total_lines = Total_lines - 1
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            Non_blk_lines = Total_lines - Emp_blk_lines
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            Num_sel = editor.getSelections()  # Get ALL selections ( EMPTY or NOT )
                            
                            if Num_sel != 0:
                            
                                Bytes_count = 0
                                Chars_count = 0
                            
                                for n in range(Num_sel):
                            
                                    Bytes_count += editor.getSelectionNEnd(n) - editor.getSelectionNStart(n)
                                    Chars_count += editor.countCharacters(editor.getSelectionNStart(n), editor.getSelectionNEnd(n))
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                                if Chars_count < 2:
                                    Txt_chars = ' selected char ('
                                else:
                                    Txt_chars = ' selected chars ('
                            
                            
                                if Bytes_count < 2:
                                    Txt_bytes = ' selected byte) in '
                                else:
                                    Txt_bytes = ' selected bytes) in '
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                                if Num_sel < 2 and Bytes_count == 0:
                                    Txt_ranges = ' EMPTY range\n'
                            
                                if Num_sel < 2 and Bytes_count > 0:
                                    Txt_ranges = ' range\n'
                            
                                if Num_sel > 1 and Bytes_count == 0:
                                    Txt_ranges = ' EMPTY ranges\n'
                            
                                if Num_sel > 1 and Bytes_count > 0:
                                    Txt_ranges = ' ranges (EMPTY or NOT)\n'
                            
                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                            
                            console.hide()
                            
                            line_list = []  # empty list
                            
                            Line_end = '\r\n'
                            
                            line_list.append ('-' * Line_title)
                            
                            line_list.append (' ' * int((Line_title - 54) / 2) + 'SUMMARY on ' + str(datetime.datetime.now()) + ' ( ' + str(time.time() - Start_time) + ' )')
                            
                            line_list.append ('-' * Line_title + Line_end)
                            
                            line_list.append (' FULL File Path    :  ' + File_name + Line_end)
                            
                            if os.path.isfile(File_name) == True:
                            
                                line_list.append (' CREATION     Date :  ' + Creation_date)
                            
                                line_list.append (' MODIFICATION Date :  ' + Modif_date + Line_end)
                            
                                line_list.append (' READ-ONLY flag    :  ' + RO_flag)
                            
                            line_list.append (' READ-ONLY editor  :  ' + RO_editor + Line_end * 2)
                            
                            line_list.append (' Current VIEW      :  ' + Curr_view + Line_end)
                            
                            line_list.append (' Current ENCODING  :  ' + Curr_encoding + Line_end)
                            
                            line_list.append (' Current LANGUAGE  :  ' + str(Curr_lang) + '  (' + Lang_desc + ')' + Line_end)
                            
                            line_list.append (' Current Line END  :  ' + Curr_eol + Line_end)
                            
                            line_list.append (' Current WRAPPING  :  ' + Curr_wrap + Line_end * 2)
                            
                            line_list.append (' 1-BYTE  Chars     :  ' + str(Total_1_byte))
                            
                            line_list.append (' 2-BYTES Chars     :  ' + str(Total_2_bytes))
                            
                            line_list.append (' 3-BYTES Chars     :  ' + str(Total_3_bytes) + Line_end)
                            
                            line_list.append (' Sum BMP Chars     :  ' + str(Total_BMP))
                            
                            line_list.append (' 4-BYTES Chars     :  ' + str(Total_4_bytes) + Line_end)
                            
                            line_list.append (' CHARS w/o CR & LF :  ' + str(Total_standard))
                            
                            line_list.append (' EOL ( CR or LF )  :  ' + str(Total_EOL) + Line_end)
                            
                            line_list.append (' TOTAL characters  :  ' + str(Total_chars) + Line_end * 2)
                            
                            if Curr_encoding == 'ANSI':
                                line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b)')
                            
                            if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                                line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b + '\
                                + str(Total_2_bytes) + ' x 2b + ' + str(Total_3_bytes) + ' x 3b + ' + str(Total_4_bytes) + ' x 4b)')
                            
                            if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
                                line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 2 + ' + str(Total_BMP) + ' x 2b + ' + str(Total_4_bytes) + ' x 4b)')
                            
                            line_list.append (' Byte Order Mark   :  ' + str(BOM) + Line_end)
                            
                            line_list.append (' BUFFER Length     :  ' + str(Buffer_length))
                            
                            if os.path.isfile(File_name) == True:
                                line_list.append (' Length on DISK    :  ' + str(Size_length) + Line_end * 2)
                            else:
                                if Line_end == '\r\n':
                                    line_list.append (Line_end)
                            
                            line_list.append (' NON-Blank Count   :  ' + str(Non_blank_chars) + Line_end)
                            
                            line_list.append (' WORDS     Count   :  ' + str(Words_count) + ' (Caution !)' + Line_end)
                            
                            if Err_regex == False:
                                line_list.append (' NON-SPACE Count   :  ' + str(Non_space_count) + Line_end * 2)
                            else:
                                line_list.append (' NON-SPACE Count   :  ' + str(Non_space_count) + ' (Caution as " RuntimeError " occurred !)' + Line_end * 2)
                            
                            
                            line_list.append (' True EMPTY lines  :  ' + str(Empty_lines))
                            
                            line_list.append (' True BLANK lines  :  ' + str(Blank_lines) + Line_end)
                            
                            line_list.append (' EMPTY/BLANK lines :  ' + str(Emp_blk_lines) + Line_end)
                            
                            line_list.append (' NON-BLANK lines   :  ' + str(Non_blk_lines))
                            
                            line_list.append (' TOTAL Lines       :  ' + str(Total_lines) + Line_end * 2)
                            
                            line_list.append (' SELECTION(S)      :  ' + str(Chars_count) + Txt_chars + str(Bytes_count) + Txt_bytes + str(Num_sel) + Txt_ranges)
                            
                            notepad.new()
                            
                            editor.setText('\r\n'.join(line_list))
                            
                            if St_bar != 'ANSI' and St_bar != 'UTF-8' and St_bar != 'UTF-8-BOM' and St_bar != 'UTF-16 BE BOM' and St_bar != 'UTF-16 LE BOM':
                            
                                if Curr_encoding == 'UTF-8':  #  SAME value for both an 'UTF-8' or 'ANSI' file, when RE-INTERPRETED with the 'Encoding > Character Set > ...' feature
                            
                                    notepad.messageBox ('CURRENT file re-interpreted as ' + St_bar + '  =>  Possible ERRONEOUS results' + \
                                                    '\nSo, CLOSE the file WITHOUT saving, RESTORE it (CTRL + SHIFT + T) and RESTART script', '!!! WARNING !!!')
                            
                            # ----Aé☀𝜜-----------------------------------------------------------------------------------------------------------------------------------------------------
                            

                            Remember that you can get a shorter summary report by replacing the line :

                            Line_end = '\r\n'
                            

                            with this one :

                            Line_end = ''
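
As a rough illustration of what this setting changes, here is a minimal stand-alone sketch, which does not need Notepad++ at all. The helper `build_report` and the sample values are hypothetical, invented for the demonstration. Only the way `Line_end` is appended to some lines, before everything is joined with `'\r\n'`, mirrors the script :

```python
def build_report(line_end):
    """Join a few SAMPLE report lines the same way the script joins line_list."""
    line_list = []  # empty list
    # The values below are placeholders, not real statistics
    line_list.append(' Current VIEW      :  MAIN View' + line_end)
    line_list.append(' Current ENCODING  :  UTF-8' + line_end)
    line_list.append(' TOTAL characters  :  1234' + line_end * 2)
    line_list.append(' TOTAL Lines       :  56')
    return '\r\n'.join(line_list)


long_report = build_report('\r\n')   # default : extra EMPTY lines between the groups
short_report = build_report('')      # compact : one report line after another

print(long_report.count('\r\n'))     # number of line-breaks in the long report
print(short_report.count('\r\n'))    # number of line-breaks in the short report
```

With an empty `Line_end`, the only line-breaks left are the ones added by the final `join`, so every empty separator line disappears from the report.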
                            

                            Best Regards,

                            guy038

                            • Alan KilbornA
                              Alan Kilborn @guy038
                              last edited by

                              @guy038

                              I was considering recommending your script as a basis for the solution to THIS inquiry, but then I noticed that your script doesn’t report word-count in selected text – perhaps it should do that as well?

                              • guy038G
                                guy038
                                last edited by guy038

                                Hello, @alan-kilborn and All,

                                Following your advice, I included the number of selected words \w+ in the last line of the summary report, regarding the different selections

                                If needed, the OP may prefer this second syntax, which also treats the hyphen, the apostrophe and the Right Single Quotation Mark as true word chars, when they are surrounded by word chars !

                                SEARCH (?:(?<=\w)[-'’](?=\w)|\w)+

                                And thus, replace the line

                                        editor.research(r'\w+', number, 0, editor.getSelectionNStart(n), editor.getSelectionNEnd(n))
                                

                                by this one :

                                        editor.research(r"(?:(?<=\w)[-'’](?=\w)|\w)+", number, 0, editor.getSelectionNStart(n), editor.getSelectionNEnd(n))
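
To see the difference between the two syntaxes outside Notepad++, here is a small sketch that uses the standard Python `re` module. This is an assumption made for the demonstration : `editor.research` relies on the Boost regex engine, so results may differ for some Unicode characters. The sample sentence is invented, too :

```python
import re

# Simple definition : a word is a run of word chars
SIMPLE_WORD = re.compile(r"\w+")

# Extended definition : a hyphen, an apostrophe or a Right Single Quotation Mark (\u2019)
# is part of the word ONLY when it sits between two word chars
EXTENDED_WORD = re.compile(r"(?:(?<=\w)[-'\u2019](?=\w)|\w)+")

sample = "It's a well-known rock\u2019n\u2019roll tune - isn't it ?"

simple_words = SIMPLE_WORD.findall(sample)
extended_words = EXTENDED_WORD.findall(sample)

print(len(simple_words), simple_words)
print(len(extended_words), extended_words)
```

Note that the isolated hyphen, surrounded by spaces, is not counted by either regex, as the look-behind and look-ahead both require a word char next to it.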
                                

                                So, here is the v1.1 version of my script, split across two posts :

                                # encoding=utf-8
                                
                                #-------------------------------------------------------------------------
                                #                    STATISTICS about the CURRENT file ( v1.1 )
                                #-------------------------------------------------------------------------
                                
                                from __future__ import print_function    # for Python2 compatibility
                                
                                from Npp import *
                                
                                import re
                                
                                import os, time, datetime
                                
                                import ctypes
                                
                                from ctypes.wintypes import BOOL, HWND, WPARAM, LPARAM, UINT
                                
                                # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                #  From @alan-kilborn, in post https://community.notepad-plus-plus.org/topic/21733/pythonscript-different-behavior-in-script-vs-in-immediate-mode/4
                                # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                
                                def npp_get_statusbar(statusbar_item_number):
                                
                                    WNDENUMPROC = ctypes.WINFUNCTYPE(BOOL, HWND, LPARAM)
                                    FindWindowW = ctypes.windll.user32.FindWindowW
                                    FindWindowExW = ctypes.windll.user32.FindWindowExW
                                    SendMessageW = ctypes.windll.user32.SendMessageW
                                    LRESULT = LPARAM
                                    SendMessageW.restype = LRESULT
                                    SendMessageW.argtypes = [ HWND, UINT, WPARAM, LPARAM ]
                                    EnumChildWindows = ctypes.windll.user32.EnumChildWindows
                                    GetClassNameW = ctypes.windll.user32.GetClassNameW
                                    create_unicode_buffer = ctypes.create_unicode_buffer
                                
                                    SBT_OWNERDRAW = 0x1000
                                    WM_USER = 0x400; SB_GETTEXTLENGTHW = WM_USER + 12; SB_GETTEXTW = WM_USER + 13
                                
                                    npp_get_statusbar.STATUSBAR_HANDLE = None
                                
                                    def get_result_from_statusbar(statusbar_item_number):
                                        assert statusbar_item_number <= 5
                                        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTLENGTHW, statusbar_item_number, 0)
                                        length = retcode & 0xFFFF
                                        type = (retcode >> 16) & 0xFFFF
                                        assert (type != SBT_OWNERDRAW)
                                        text_buffer = create_unicode_buffer(length + 1)  # one extra slot, as SB_GETTEXTW also writes the terminating NUL char
                                        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTW, statusbar_item_number, ctypes.addressof(text_buffer))
                                        retval = '{}'.format(text_buffer[:length])
                                        return retval
                                
                                    def EnumCallback(hwnd, lparam):
                                        curr_class = create_unicode_buffer(256)
                                        GetClassNameW(hwnd, curr_class, 256)
                                        if curr_class.value.lower() == "msctls_statusbar32":
                                            npp_get_statusbar.STATUSBAR_HANDLE = hwnd
                                            return False  # stop the enumeration
                                        return True  # continue the enumeration
                                
                                    npp_hwnd = FindWindowW(u"Notepad++", None)
                                    EnumChildWindows(npp_hwnd, WNDENUMPROC(EnumCallback), 0)
                                    if npp_get_statusbar.STATUSBAR_HANDLE: return get_result_from_statusbar(statusbar_item_number)
                                    assert False
                                
                                St_bar = npp_get_statusbar(4)  # Zone 4 ( STATUSBARSECTION.UNICODETYPE )
                                
                                

                                Continuation on next post

                                guy038

                                • guy038G
                                  guy038
                                  last edited by guy038

                                  Hi Alan and all,

                                  Continuation of version v1.1 of the script :

                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  def number(occ):
                                      global num
                                      num += 1
                                  
                                  console.show()
                                  
                                  console.clear()
                                  
                                  Start_time = time.time()
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  Curr_encoding = str(notepad.getEncoding())
                                  
                                  if Curr_encoding == 'ENC8BIT':
                                      Curr_encoding = 'ANSI'
                                  
                                  if Curr_encoding == 'COOKIE':
                                      Curr_encoding = 'UTF-8'
                                  
                                  if Curr_encoding == 'UTF8':
                                      Curr_encoding = 'UTF-8-BOM'
                                  
                                  if Curr_encoding == 'UCS2BE':
                                      Curr_encoding = 'UTF-16 BE BOM'
                                  
                                  if Curr_encoding == 'UCS2LE':
                                      Curr_encoding = 'UTF-16 LE BOM'
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                                      Line_title = 95
                                  else:
                                      Line_title = 75
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  File_name = notepad.getCurrentFilename().decode('utf-8')
                                  
                                  if os.path.isfile(File_name) == True:
                                  
                                      Creation_date = time.ctime(os.path.getctime(File_name))
                                  
                                      Modif_date = time.ctime(os.path.getmtime(File_name))
                                  
                                      Size_length = os.path.getsize(File_name)
                                  
                                      RO_flag = 'YES'
                                  
                                      if os.access(File_name, os.W_OK):
                                          RO_flag = 'NO'
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  RO_editor = 'NO'
                                  
                                  if editor.getReadOnly() == True:
                                      RO_editor = 'YES'
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  if notepad.getCurrentView() == 0:
                                      Curr_view = 'MAIN View'
                                  else:
                                      Curr_view = 'SECONDARY view'
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  Curr_lang = notepad.getCurrentLang()
                                  
                                  Lang_desc = notepad.getLanguageDesc(Curr_lang)
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  if editor.getEOLMode() == 0:
                                      Curr_eol = 'Windows (CR LF)'
                                  
                                  if editor.getEOLMode() == 1:
                                      Curr_eol = 'Macintosh (CR)'
                                  
                                  if editor.getEOLMode() == 2:
                                      Curr_eol = 'Unix (LF)'
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  Curr_wrap = 'NO'
                                  
                                  if editor.getWrapMode() == 1:
                                      Curr_wrap = 'YES'
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  print ('START')
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  Bytes_length = editor.getLength()
                                  
                                  Total_chars = editor.countCharacters(0, editor.getLength())
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  num = 0
                                  editor.research(r'\r|\n', number)
                                  
                                  Total_EOL = num
                                  
                                  print ('EOL')
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  Total_standard = Total_chars - Total_EOL
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  if Curr_encoding == 'ANSI':
                                  
                                      Total_BMP = Total_standard
                                      
                                      Total_1_byte = Total_BMP
                                  
                                      Total_2_bytes = 0
                                  
                                      Total_3_bytes = 0
                                  
                                      Total_4_bytes = 0
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                                  
                                      num = 0
                                      editor.research(r'[\x{0080}-\x{07FF}]', number)
                                  
                                      Total_2_bytes = num
                                  
                                      print ('2-BYTES')
                                  
                                      # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                      num = 0
                                      editor.research(r'[\x{0800}-\x{D7FF}\x{E000}-\x{FFFF}]', number)
                                  
                                      Total_3_bytes = num
                                  
                                      print ('3-BYTES')
                                  
                                      # -----------------------------------------------------------------------------------------------------------------------------
                                  
                                      Total_4_bytes = ( Bytes_length - Total_chars - Total_2_bytes - 2 * Total_3_bytes ) // 3  #  INTEGER division, under both Python 2 and Python 3
                                  
                                      Total_1_byte = Total_standard - Total_2_bytes - Total_3_bytes - Total_4_bytes
                                  
                                      Total_BMP = Total_1_byte + Total_2_bytes + Total_3_bytes
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  
                                  if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
                                  
                                      num = 0
                                      editor.research(r'(?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]', number)  #  ALL BMP chars different from '\r' and '\n'
                                  
                                      Total_2_bytes = num
                                  
                                      Total_4_bytes = Total_standard - Total_2_bytes
                                  
                                      Total_BMP = Total_2_bytes
                                  
                                      Total_1_byte = 0
                                  
                                      Total_3_bytes = 0
                                  
                                      Bytes_length = 2 * Total_EOL + 2 * Total_BMP + 4 * Total_4_bytes
                                  
                                      print ('2-BYTES')
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  BOM = 0  #  Default ANSI and UTF-8
                                  
                                  if Curr_encoding == 'UTF-8-BOM':
                                      BOM = 3
                                  
                                  if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
                                      BOM = 2
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  Buffer_length = Bytes_length + BOM
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  num = 0
                                  editor.research(r'\t|\x20', number)
                                  
                                  Non_blank_chars = Total_standard - num
                                  
                                  print ('NON-BLANK')
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  num = 0
                                  editor.research(r'\w+', number)
                                  
                                  Words_total = num
                                  
                                  print ('WORDS')
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  Err_regex = False
                                  
                                  num = 0
                                  
                                  if Curr_encoding == 'ANSI' or Total_4_bytes == 0:
                                      editor.research(r'\S+', number)
                                  else:
                                      try:
                                          editor.research(r'(?:(?!\s).[\x{D800}-\x{DFFF}]?)+', number)
                                      except RuntimeError:
                                          Err_regex = True
                                  
                                  Non_space_count = num
                                  
                                  print ('NON-SPACE')
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  num = 0
                                  if Curr_encoding == 'ANSI':
                                      editor.research(r'\f^(?:\r\n|\r|\n)', number)
                                  else:
                                      editor.research(r'[\f\x{0085}\x{2028}\x{2029}]^(?:\r\n|\r|\n)', number)
                                  
                                  Special_empty = num
                                  
                                  num = 0
                                  editor.research(r'^(?:\r\n|\r|\n)', number)
                                  
                                  Default_empty = num
                                  
                                  Empty_lines = Default_empty - Special_empty
                                  
                                  print ('EMPTY lines')
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  num = 0
                                  if Curr_encoding == 'ANSI':
                                      editor.research(r'\f^[\t\x20]+(?:\r\n|\r|\n|\z)', number)
                                  else:
                                      editor.research(r'[\f\x{0085}\x{2028}\x{2029}]^[\t\x20]+(?:\r\n|\r|\n|\z)', number)
                                  
                                  Special_blank = num
                                  
                                  num = 0
                                  editor.research(r'^[\t\x20]+(?:\r\n|\r|\n|\z)', number)
                                  
                                  Default_blank = num
                                  
                                  Blank_lines = Default_blank - Special_blank
                                  
                                  print ('BLANK lines')
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  Emp_blk_lines = Empty_lines + Blank_lines
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  Total_lines = editor.getLineCount()
                                  
                                  num = 0
                                  editor.research(r'(?-s)^.+\z', number)
                                  
                                  if num == 0:
                                      Total_lines = Total_lines - 1
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  Non_blk_lines = Total_lines - Emp_blk_lines
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  Num_sel = editor.getSelections()  # Get ALL selections ( EMPTY or NOT )
                                  
                                  if Num_sel != 0:
                                  
                                      Bytes_count = 0
                                      Chars_count = 0
                                      Words_count = 0
                                  
                                      for n in range(Num_sel):
                                  
                                          Bytes_count += editor.getSelectionNEnd(n) - editor.getSelectionNStart(n)
                                          Chars_count += editor.countCharacters(editor.getSelectionNStart(n), editor.getSelectionNEnd(n))
                                  
                                          num = 0
                                          editor.research(r'\w+', number, 0, editor.getSelectionNStart(n), editor.getSelectionNEnd(n))
                                          Words_count += num
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                      if Bytes_count < 2:
                                          Txt_bytes = ' selected byte) in '
                                      else:
                                          Txt_bytes = ' selected bytes) in '
                                  
                                      if Chars_count < 2:
                                          Txt_chars = ' selected char, '
                                      else:
                                          Txt_chars = ' selected chars, '
                                  
                                      if Words_count < 2:
                                          Txt_words = ' selected word ('
                                      else:
                                          Txt_words = ' selected words ('
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                      if Num_sel < 2 and Bytes_count == 0:
                                          Txt_ranges = ' EMPTY range'
                                  
                                      if Num_sel < 2 and Bytes_count > 0:
                                          Txt_ranges = ' range'
                                  
                                      if Num_sel > 1 and Bytes_count == 0:
                                          Txt_ranges = ' EMPTY ranges'
                                  
                                      if Num_sel > 1 and Bytes_count > 0:
                                          Txt_ranges = ' ranges (EMPTY or NOT)'
                                  
                                  # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                  
                                  console.hide()
                                  
                                  line_list = []  # empty list
                                  
                                  Line_end = '\r\n'
                                  
                                  line_list.append ('-' * Line_title)
                                  
                                  line_list.append (' ' * int((Line_title - 54) / 2) + 'SUMMARY on ' + str(datetime.datetime.now()) + ' ( ' + str(time.time() - Start_time) + ' )')
                                  
                                  line_list.append ('-' * Line_title + Line_end)
                                  
                                  line_list.append (' FULL File Path    :  ' + File_name + Line_end)
                                  
                                  if os.path.isfile(File_name) == True:
                                  
                                      line_list.append (' CREATION     Date :  ' + Creation_date)
                                  
                                      line_list.append (' MODIFICATION Date :  ' + Modif_date + Line_end)
                                  
                                      line_list.append (' READ-ONLY flag    :  ' + RO_flag)
                                  
                                  line_list.append (' READ-ONLY editor  :  ' + RO_editor + Line_end * 2)
                                  
                                  line_list.append (' Current VIEW      :  ' + Curr_view + Line_end)
                                  
                                  line_list.append (' Current ENCODING  :  ' + Curr_encoding + Line_end)
                                  
                                  line_list.append (' Current LANGUAGE  :  ' + str(Curr_lang) + '  (' + Lang_desc + ')' + Line_end)
                                  
                                  line_list.append (' Current Line END  :  ' + Curr_eol + Line_end)
                                  
                                  line_list.append (' Current WRAPPING  :  ' + Curr_wrap + Line_end * 2)
                                  
                                  line_list.append (' 1-BYTE  Chars     :  ' + str(Total_1_byte))
                                  
                                  line_list.append (' 2-BYTES Chars     :  ' + str(Total_2_bytes))
                                  
                                  line_list.append (' 3-BYTES Chars     :  ' + str(Total_3_bytes) + Line_end)
                                  
                                  line_list.append (' Sum BMP Chars     :  ' + str(Total_BMP))
                                  
                                  line_list.append (' 4-BYTES Chars     :  ' + str(Total_4_bytes) + Line_end)
                                  
                                  line_list.append (' CHARS w/o CR & LF :  ' + str(Total_standard))
                                  
                                  line_list.append (' EOL ( CR or LF )  :  ' + str(Total_EOL) + Line_end)
                                  
                                  line_list.append (' TOTAL characters  :  ' + str(Total_chars) + Line_end * 2)
                                  
                                  if Curr_encoding == 'ANSI':
                                      line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b)')
                                  
                                  if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                                      line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b + '\
                                      + str(Total_2_bytes) + ' x 2b + ' + str(Total_3_bytes) + ' x 3b + ' + str(Total_4_bytes) + ' x 4b)')
                                  
                                  if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
                                      line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 2 + ' + str(Total_BMP) + ' x 2b + ' + str(Total_4_bytes) + ' x 4b)')
                                  
                                  line_list.append (' Byte Order Mark   :  ' + str(BOM) + Line_end)
                                  
                                  line_list.append (' BUFFER Length     :  ' + str(Buffer_length))
                                  
                                  if os.path.isfile(File_name) == True:
                                      line_list.append (' Length on DISK    :  ' + str(Size_length) + Line_end * 2)
                                  else:
                                      if Line_end == '\r\n':
                                          line_list.append (Line_end)
                                  
                                  line_list.append (' NON-Blank Chars   :  ' + str(Non_blank_chars) + Line_end)
                                  
                                  line_list.append (' WORDS     Count   :  ' + str(Words_total) + ' (Caution !)' + Line_end)
                                  
                                  if Err_regex == False:
                                      line_list.append (' NON-SPACE Count   :  ' + str(Non_space_count) + Line_end * 2)
                                  else:
                                      line_list.append (' NON-SPACE Count   :  ' + str(Non_space_count) + ' (Caution as " RuntimeError " occurred !)' + Line_end * 2)
                                  
                                  
                                  line_list.append (' True EMPTY lines  :  ' + str(Empty_lines))
                                  
                                  line_list.append (' True BLANK lines  :  ' + str(Blank_lines) + Line_end)
                                  
                                  line_list.append (' EMPTY/BLANK lines :  ' + str(Emp_blk_lines) + Line_end)
                                  
                                  line_list.append (' NON-BLANK lines   :  ' + str(Non_blk_lines))
                                  
                                  line_list.append (' TOTAL Lines       :  ' + str(Total_lines) + Line_end * 2)
                                  
                                  line_list.append (' SELECTION(S)      :  ' + str(Chars_count) + Txt_chars + str(Words_count) + Txt_words + str(Bytes_count) + Txt_bytes + str(Num_sel) + Txt_ranges + Line_end)
                                  
                                  notepad.new()
                                  
                                  editor.setText('\r\n'.join(line_list))
                                  
                                  if St_bar != 'ANSI' and St_bar != 'UTF-8' and St_bar != 'UTF-8-BOM' and St_bar != 'UTF-16 BE BOM' and St_bar != 'UTF-16 LE BOM':
                                  
                                      if Curr_encoding == 'UTF-8':  #  SAME value for both an 'UTF-8' or 'ANSI' file, when RE-INTERPRETED with the 'Encoding > Character Set > ...' feature
                                  
                                          notepad.messageBox ('CURRENT file re-interpreted as ' + St_bar + '  =>  Possible ERRONEOUS results' + \
                                                          '\nSo, CLOSE the file WITHOUT saving, RESTORE it (CTRL + SHIFT + T) and RESTART script', '!!! WARNING !!!')
                                  
                                  # ----Aé☀𝜜-----------------------------------------------------------------------------------------------------------------------------------------------------
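The UTF-8 branch of the script gets the number of 4-byte characters by arithmetic rather than by a regex search. Here is a minimal sketch of that arithmetic, run outside Notepad++ under Python 3, where len() counts code points. The sample string and the lowercase variable names are only illustrative, and it assumes that Total_chars, in the script, counts characters and not bytes :

```python
# Sample : 'A' (1 byte), 'é' (2 bytes), '☀' (3 bytes), '𝜜' (4 bytes), then CR + LF
text = u'A\u00e9\u2600\U0001D71C\r\n'

bytes_length = len(text.encode('utf-8'))
total_chars = len(text)                      # code points, under Python 3
total_eol = sum(1 for c in text if c in u'\r\n')
total_standard = total_chars - total_eol

total_2_bytes = sum(1 for c in text if 0x0080 <= ord(c) <= 0x07FF)
total_3_bytes = sum(1 for c in text if 0x0800 <= ord(c) <= 0xFFFF)

# bytes = n1 + 2*n2 + 3*n3 + 4*n4   and   chars = n1 + n2 + n3 + n4
# so    bytes - chars = n2 + 2*n3 + 3*n4
total_4_bytes = (bytes_length - total_chars - total_2_bytes - 2 * total_3_bytes) // 3
total_1_byte = total_standard - total_2_bytes - total_3_bytes - total_4_bytes

print(bytes_length, total_1_byte, total_2_bytes, total_3_bytes, total_4_bytes)

# UTF-16 : every BMP char ( EOL included ) takes 2 bytes, every other char takes 4 bytes
total_bmp = total_1_byte + total_2_bytes + total_3_bytes
utf16_length = 2 * total_eol + 2 * total_bmp + 4 * total_4_bytes
print(utf16_length == len(text.encode('utf-16-le')))
```

The last two lines check the UTF-16 formula of the script against the real encoded length, BOM excluded.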
                                  

                                  Best Regards,

                                  guy038

                                  • guy038G
                                    guy038
                                    last edited by guy038

                                    Hello, @alan-kilborn and Python gurus,

                                    I’ve just found a bug when trying to run my script against a “French” file called Numéros ( which means Numbers ) :-((


                                    The Python section of my script, below, detects whether the current tab is associated with a true file, saved on disk, or refers to a new # file, not saved yet

                                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                    
                                    File_name = notepad.getCurrentFilename()
                                    
                                    if os.path.isfile(File_name) == True:
                                    
                                        Creation_date = time.ctime(os.path.getctime(File_name))
                                    
                                        Modif_date = time.ctime(os.path.getmtime(File_name))
                                    
                                        Size_length = os.path.getsize(File_name)
                                    
                                        RO_flag = 'YES'
                                    
                                        if os.access(File_name, os.W_OK):
                                            RO_flag = 'NO'
                                    
                                    # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                    

                                    And, unfortunately, if the current name contains accented characters, like Numéros, it wrongly supposes that it’s a new # file !

                                    As soon as it is renamed as Numeros, everything is OK again

                                    So, how to recognize the filename even if current file or current path contain NON-ASCII characters ?

                                    TIA

                                    guy038

                                    • Alan KilbornA
                                      Alan Kilborn @guy038
                                      last edited by Alan Kilborn

                                      @guy038 said in Emulation of the "View > Summary" feature with a Python script:

                                      how to recognize the filename even if current file or current path contain NON-ASCII characters ?

                                      Short answer: This is better done with Python3, i.e., PythonScript 3.x. Then things “just work”. :-)

                                      But, for Python2, (and PS 2.x) you can make a call to .encode('utf-8') or .decode('utf-8') – depending upon your circumstance (I’m not commenting on your specific code) – in order to get what you need.

                                      Basically, if you have a Python2 string (in a variable s) and you want to get a Unicode string (for things like Windows pathnames with non-trivial characters), use s.decode('utf-8') and to go the other way, where you have a Unicode str (in a variable u) and you want a Python2 str, do u.encode('utf-8').
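A minimal sketch of that round trip, written with an explicit byte string so that it behaves the same under Python 2 and Python 3. The name Numéros.txt is just an illustration (the .txt extension is an assumption) :

```python
raw = b'Num\xc3\xa9ros.txt'      # UTF-8 bytes, i.e. what a Python 2 'str' holds

uni = raw.decode('utf-8')        # bytes  ->  Unicode string
back = uni.encode('utf-8')       # Unicode string  ->  bytes

print(len(raw), len(uni))        # one char fewer than bytes : 'é' takes 2 bytes in UTF-8
print(back == raw)
```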

                                      • guy038G
                                        guy038
                                        last edited by guy038

                                        Hi, @alan-kilborn,

                                        Many thanks for the tip ! I did some Google searches before, but only found some obscure explanations. But, just now, I tried again with this question :

                                        How to get "os.path.isfile(Filename)" == True: when Filename contains "NON ASCII" chars ?

                                        And, in the first result, named “python - UnicodeEncodeError on joining file name”, from Jan. 05 2010, on the Stack Overflow site, it is literally said, in the middle of the page :

                                        So I would first try filename = filename.decode('utf-8') -- that should allow the os.path.join to work


                                        Now, I won’t bother to re-edit my script with a new version number ! I just replaced, in my v1.1 version, above, the line :

                                        File_name = notepad.getCurrentFilename()
                                        

                                        by this one :

                                        File_name = notepad.getCurrentFilename().decode('utf-8')
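Note that a Python 3 str has no .decode() method, so the line above only suits Python 2. A possible guard is sketched below, assuming, as this thread suggests, that PythonScript 2 hands back the name as UTF-8 bytes. The helper to_unicode and the temporary file are hypothetical, for illustration only :

```python
import os
import shutil
import tempfile

def to_unicode(name):
    """Hypothetical helper : decode UTF-8 bytes, leave a text string untouched."""
    if isinstance(name, bytes):      # Python 2 'str', or Python 3 'bytes'
        return name.decode('utf-8')
    return name                      # already a Unicode string

tmp_dir = tempfile.mkdtemp()
try:
    path = os.path.join(tmp_dir, u'Num\u00e9ros.txt')
    open(path, 'w').close()          # create an empty file with a NON-ASCII name

    as_bytes = path.encode('utf-8')  # simulates the UTF-8 encoded name
    found = os.path.isfile(to_unicode(as_bytes))
    print(found)
finally:
    shutil.rmtree(tmp_dir)
```

In the script, this would amount to wrapping the call, as in File_name = to_unicode(notepad.getCurrentFilename())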
                                        

                                        BR

                                        guy038

                                        • guy038G
                                          guy038
                                          last edited by guy038

                                          Hello, @alan-kilborn and All,

                                          Below, the v1.2 version of the Python script for an enhanced Summary feature :

                                          • I decomposed the total number of chars into 3 parts : EOL chars, Space and Tab chars and True chars ( [^\t\x20\r\n] )

                                          • I also decomposed the total number of word chars into 3 parts : letter chars, digit chars and low_line chars

                                          • I added a count of the paragraphs ( You may adapt the corresponding regex to your needs )

                                          • I added a count of the sentences ( You may adapt the corresponding regex to your needs )

                                          • I added some remarks at the end of the summary report, regarding the global accuracy of some results !
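The two decompositions listed above can be sketched with Python's own re module. This is only an approximation of the script, which relies on the Boost regex engine of Notepad++, so the character classes may differ on some Unicode chars. Here, "letters" simply means word chars that are neither digits nor the low line, and the sample string is arbitrary :

```python
import re

text = u'ab_1 2\tc\r\n'

# Total chars  =  EOL chars  +  Space and Tab chars  +  True chars
eol_chars   = len(re.findall(r'[\r\n]', text))
blank_chars = len(re.findall(r'[\t\x20]', text))
true_chars  = len(re.findall(r'[^\t\x20\r\n]', text))

# Word chars  =  letters  +  digits  +  low lines
word_chars    = len(re.findall(r'\w', text))
letter_chars  = len(re.findall(r'[^\W\d_]', text))
digit_chars   = len(re.findall(r'\d', text))
lowline_chars = text.count(u'_')

print(eol_chars + blank_chars + true_chars == len(text))
print(letter_chars + digit_chars + lowline_chars == word_chars)
```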


                                          Now, Alan, I needed to replace this part, regarding the selections :

                                              for n in range(Num_sel):
                                          
                                                  Bytes_count += editor.getSelectionNEnd(n) - editor.getSelectionNStart(n)
                                                  Chars_count += editor.countCharacters(editor.getSelectionNStart(n), editor.getSelectionNEnd(n))
                                          
                                                  num = 0
                                                  editor.research(r'\w+', number, 0, editor.getSelectionNStart(n), editor.getSelectionNEnd(n))
                                                  Words_count += num
                                          

                                          by this one :

                                              for n in range(Num_sel):
                                          
                                                  Sel_start = editor.getSelectionNStart(n)
                                                  Sel_end = editor.getSelectionNEnd(n)
                                          
                                                  Bytes_count += Sel_end - Sel_start
                                                  Chars_count += editor.countCharacters(Sel_start, Sel_end)
                                          
                                                  num = 0
                                                  if Sel_end != Sel_start:  #  Test EACH range on its own : 'Bytes_count' is CUMULATIVE, so it cannot tell if the CURRENT range is EMPTY
                                                      editor.research(r'\w+', number, 0, Sel_start, Sel_end)
                                                  Words_count += num
                                          

                                          Because, if the single zero-length selection was on a truly empty line, it did write, as expected, the message :

                                          0 selected char, 0 selected word (0 selected byte) in 1 EMPTY range
                                          

                                          But, if this single zero-length selection was on a non-empty line, it would wrongly write, for example :

                                          0 selected char, 568 selected words (0 selected byte) in 1 EMPTY range
                                          

                                          given that the whole file contains 568 words
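One way to guard against this is to test each range on its own and to skip the empty ones. Here is a minimal sketch of the idea, outside Notepad++, with Python's re module standing in for editor.research. The function count_words_in_ranges and its (start, end) offsets are hypothetical :

```python
import re

def count_words_in_ranges(text, ranges):
    """Hypothetical stand-in for the selection loop : 'ranges' is a list of (start, end) offsets."""
    words_count = 0
    for start, end in ranges:
        if end == start:             # EMPTY range : nothing to search
            continue
        words_count += len(re.findall(r'\w+', text[start:end]))
    return words_count

text = u'one two three'
print(count_words_in_ranges(text, [(4, 4)]))                   # a single EMPTY range
print(count_words_in_ranges(text, [(0, 3), (4, 4), (4, 13)]))  # EMPTY and NON-empty ranges
```

The test is done per range, so an empty range placed after a non-empty one is skipped too.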


                                          So, here is the v1.2 version of my script, split over two posts :

                                          # encoding=utf-8
                                          
                                          #-------------------------------------------------------------------------
                                          #                    STATISTICS about the CURRENT file ( v1.2 )
                                          #-------------------------------------------------------------------------
                                          
                                          from __future__ import print_function    # for Python2 compatibility
                                          
                                          from Npp import *
                                          
                                          import re
                                          
                                          import os, time, datetime
                                          
                                          import ctypes
                                          
                                          from ctypes.wintypes import BOOL, HWND, WPARAM, LPARAM, UINT
                                          
                                          # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                          #  From @alan-kilborn, in post https://community.notepad-plus-plus.org/topic/21733/pythonscript-different-behavior-in-script-vs-in-immediate-mode/4
                                          # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                          
                                          def npp_get_statusbar(statusbar_item_number):
                                          
                                              WNDENUMPROC = ctypes.WINFUNCTYPE(BOOL, HWND, LPARAM)
                                              FindWindowW = ctypes.windll.user32.FindWindowW
                                              FindWindowExW = ctypes.windll.user32.FindWindowExW
                                              SendMessageW = ctypes.windll.user32.SendMessageW
                                              LRESULT = LPARAM
                                              SendMessageW.restype = LRESULT
                                              SendMessageW.argtypes = [ HWND, UINT, WPARAM, LPARAM ]
                                              EnumChildWindows = ctypes.windll.user32.EnumChildWindows
                                              GetClassNameW = ctypes.windll.user32.GetClassNameW
                                              create_unicode_buffer = ctypes.create_unicode_buffer
                                          
                                              SBT_OWNERDRAW = 0x1000
                                              WM_USER = 0x400; SB_GETTEXTLENGTHW = WM_USER + 12; SB_GETTEXTW = WM_USER + 13
                                          
                                              npp_get_statusbar.STATUSBAR_HANDLE = None
                                          
                                              def get_result_from_statusbar(statusbar_item_number):
                                                  assert statusbar_item_number <= 5
                                                  retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTLENGTHW, statusbar_item_number, 0)
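                                                  length = retcode & 0xFFFF  #  LOW word : text length, in characters
                                                  draw_type = (retcode >> 16) & 0xFFFF  #  HIGH word : drawing type of the status bar part
                                                  assert (draw_type != SBT_OWNERDRAW)
                                                  text_buffer = create_unicode_buffer(length + 1)  #  + 1 for the terminating NUL char written by SB_GETTEXTW
                                                  retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTW, statusbar_item_number, ctypes.addressof(text_buffer))
                                                  retval = '{}'.format(text_buffer[:length])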
                                                  return retval
                                          
                                              def EnumCallback(hwnd, lparam):
                                                  curr_class = create_unicode_buffer(256)
                                                  GetClassNameW(hwnd, curr_class, 256)
                                                  if curr_class.value.lower() == "msctls_statusbar32":
                                                      npp_get_statusbar.STATUSBAR_HANDLE = hwnd
                                                      return False  # stop the enumeration
                                                  return True  # continue the enumeration
                                          
                                              npp_hwnd = FindWindowW(u"Notepad++", None)
                                              EnumChildWindows(npp_hwnd, WNDENUMPROC(EnumCallback), 0)
                                              if npp_get_statusbar.STATUSBAR_HANDLE: return get_result_from_statusbar(statusbar_item_number)
                                              assert False
                                          
                                          St_bar = npp_get_statusbar(4)  # Zone 4 ( STATUSBARSECTION.UNICODETYPE )
                                          
                                          # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                          
                                          #  Callback for editor.research() : called ONCE per match, it increments the global counter 'num'
                                          def number(occ):
                                              global num
                                              num += 1
                                          
                                          console.show()
                                          
                                          console.clear()
                                          
                                          Start_time = time.time()
                                          
                                          # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                          
                                          Curr_encoding = str(notepad.getEncoding())
                                          
                                          if Curr_encoding == 'ENC8BIT':
                                              Curr_encoding = 'ANSI'
                                          
                                          if Curr_encoding == 'COOKIE':
                                              Curr_encoding = 'UTF-8'
                                          
                                          if Curr_encoding == 'UTF8':
                                              Curr_encoding = 'UTF-8-BOM'
                                          
                                          if Curr_encoding == 'UCS2BE':
                                              Curr_encoding = 'UTF-16 BE BOM'
                                          
                                          if Curr_encoding == 'UCS2LE':
                                              Curr_encoding = 'UTF-16 LE BOM'
                                          
                                          # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                          
                                          if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                                              Line_title = 95
                                          else:
                                              Line_title = 75
                                          
                                          # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                          
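                                          File_name = notepad.getCurrentFilename()
                                          
                                          if isinstance(File_name, bytes):  #  Under Python 2, a UTF-8 byte string to decode ; nothing to do when it is already text
                                              File_name = File_name.decode('utf-8')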
                                          
                                          if os.path.isfile(File_name):
                                          
                                              Creation_date = time.ctime(os.path.getctime(File_name))
                                          
                                              Modif_date = time.ctime(os.path.getmtime(File_name))
                                          
                                              Size_length = os.path.getsize(File_name)
                                          
                                              RO_flag = 'YES'
                                          
                                              if os.access(File_name, os.W_OK):
                                                  RO_flag = 'NO'
                                          
                                          # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                          
                                          RO_editor = 'NO'
                                          
                                          if editor.getReadOnly():
                                              RO_editor = 'YES'
                                          
                                          # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                          
                                          if notepad.getCurrentView() == 0:
                                              Curr_view = 'MAIN View'
                                          else:
                                              Curr_view = 'SECONDARY view'
                                          
                                          # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                          
                                          Curr_lang = notepad.getCurrentLang()
                                          
                                          Lang_desc = notepad.getLanguageDesc(Curr_lang)
                                          
                                          # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                          
                                          if editor.getEOLMode() == 0:
                                              Curr_eol = 'Windows (CR LF)'
                                          
                                          if editor.getEOLMode() == 1:
                                              Curr_eol = 'Macintosh (CR)'
                                          
                                          if editor.getEOLMode() == 2:
                                              Curr_eol = 'Unix (LF)'
                                          
                                          # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                          
                                          Curr_wrap = 'NO'
                                          
                                          if editor.getWrapMode() == 1:
                                              Curr_wrap = 'YES'
                                          
                                          

                                          Continued in the next post

                                          guy038
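
As a side note on the status bar helper above : the value returned by SB_GETTEXTLENGTHW packs two 16-bit fields, the text length in the low word and the drawing type in the high word. Here is a minimal sketch of that unpacking, in plain Python, so it can be tried outside Notepad++. The function name split_textlength_result is hypothetical and is not part of the script :

```python
SBT_OWNERDRAW = 0x1000  # same constant as in the script


def split_textlength_result(retcode):
    """Split an SB_GETTEXTLENGTHW result into (text length, drawing type).

    Hypothetical helper, for illustration only.
    """
    length = retcode & 0xFFFF               # low word: length in characters
    draw_type = (retcode >> 16) & 0xFFFF    # high word: drawing-type flags
    return length, draw_type


# A part holding 12 characters, drawn by its owner
print(split_textlength_result((SBT_OWNERDRAW << 16) | 12))
```

Since SB_GETTEXTW copies a null-terminated string, a receiving buffer of length + 1 characters is the safe size.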

                                          • guy038G
                                            guy038
                                            last edited by

                                            Hi @alan-kilborn and all,
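
The second half of the script relies on a bit of byte arithmetic for UTF-8 files : each character encoded on N bytes adds N - 1 bytes on top of the character count, so the number of 4-byte characters can be deduced without searching for them. Below is a rough sketch of that reasoning, in standalone Python 3, using the interpreter's own UTF-8 encoder instead of the editor object. The function name utf8_length_classes is made up for this illustration and, unlike the script, it counts the line-ending characters among the 1-byte characters :

```python
def utf8_length_classes(text):
    """Return (one, two, three, four): code points per UTF-8 encoded length.

    Illustration only. Assumes Python 3 and a well-formed string
    (lone surrogates cannot be encoded to UTF-8).
    """
    total_bytes = len(text.encode('utf-8'))
    total_chars = len(text)  # one item per code point in Python 3

    two = sum(1 for ch in text if 0x80 <= ord(ch) <= 0x7FF)
    three = sum(1 for ch in text if 0x800 <= ord(ch) <= 0xFFFF)

    # total_bytes - total_chars = two + 2 * three + 3 * four
    four = (total_bytes - total_chars - two - 2 * three) // 3
    one = total_chars - two - three - four
    return one, two, three, four


# 'a', U+00E9, U+20AC, U+1F600 and a line feed
print(utf8_length_classes("a\u00e9\u20ac\U0001F600\n"))
```

With that sample string, made of one 2-byte, one 3-byte and one 4-byte character plus two ASCII characters, the function returns (2, 1, 1, 1).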

                                            Continuation of version v1.2 of the script :

                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                            print ('START')
                                            
                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                            Bytes_length = editor.getLength()
                                            
                                            Total_chars = editor.countCharacters(0, editor.getLength())
                                            
                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                            num = 0
                                            editor.research(r'\n|\r', number)
                                            
                                            Total_EOL = num
                                            
                                            print ('EOL')
                                            
                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                            num = 0
                                            editor.research(r'\t|\x20', number)
                                            
                                            Blank_chars = num
                                            
                                            print ('BLANK')
                                            
                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                            Total_standard = Total_chars - Total_EOL
                                            
                                            True_chars = Total_chars - Total_EOL - Blank_chars
                                            
                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                            if Curr_encoding == 'ANSI':
                                            
                                                Total_BMP = Total_standard
                                                
                                                Total_1_byte = Total_BMP
                                            
                                                Total_2_bytes = 0
                                            
                                                Total_3_bytes = 0
                                            
                                                Total_4_bytes = 0
                                            
                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                            if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                                            
                                                num = 0
                                                editor.research(r'[\x{0080}-\x{07FF}]', number)
                                            
                                                Total_2_bytes = num
                                            
                                                print ('2-BYTES')
                                            
                                                # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                                num = 0
                                                editor.research(r'[\x{0800}-\x{D7FF}\x{E000}-\x{FFFF}]', number)
                                            
                                                Total_3_bytes = num
                                            
                                                print ('3-BYTES')
                                            
                                                # -----------------------------------------------------------------------------------------------------------------------------
                                            
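                                                #  Each N-byte char adds N - 1 bytes, so : Bytes_length - Total_chars = Total_2_bytes + 2 * Total_3_bytes + 3 * Total_4_bytes
                                                Total_4_bytes = ( Bytes_length - Total_chars - Total_2_bytes - 2 * Total_3_bytes ) // 3  #  '//' keeps an INTEGER result under Python 3, too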
                                            
                                                Total_1_byte = Total_standard - Total_2_bytes - Total_3_bytes - Total_4_bytes
                                            
                                                Total_BMP = Total_1_byte + Total_2_bytes + Total_3_bytes
                                            
                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                            
                                            if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
                                            
                                                num = 0
                                                editor.research(r'(?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]', number)  #  ALL BMP chars different from '\r' and '\n'
                                            
                                                Total_2_bytes = num
                                            
                                                Total_4_bytes = Total_standard - Total_2_bytes
                                            
                                                Total_BMP = Total_2_bytes
                                            
                                                Total_1_byte = 0
                                            
                                                Total_3_bytes = 0
                                            
                                                Bytes_length = 2 * Total_EOL + 2 * Total_BMP + 4 * Total_4_bytes
                                            
                                                print ('2-BYTES')
                                            
                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                            BOM = 0  #  Default ANSI and UTF-8
                                            
                                            if Curr_encoding == 'UTF-8-BOM':
                                                BOM = 3
                                            
                                            if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
                                                BOM = 2
                                            
                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                            Buffer_length = Bytes_length + BOM
                                            
                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                            num = 0
                                            editor.research(r'\d', number)
                                            
                                            Number_chars = num
                                            
                                            print ('NUMBERS')
                                            
                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                            num = 0
                                            editor.research(r'_', number)
                                            
                                            Lowline_chars = num
                                            
                                            print ('LOW_LINES')
                                            
                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                            num = 0
                                            editor.research(r'\w', number)
                                            
                                            Word_chars = num
                                            
                                            print ('WORDS')
                                            
                                            Letter_chars = Word_chars - Number_chars - Lowline_chars
                                            
                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                            num = 0
                                            editor.research(r'\w+', number)
                                            
                                            Words_total = num
                                            
                                            print ('WORDS+')
                                            
                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                            Err_regex_non_space = False
                                            
                                            num = 0
                                            
                                            if Curr_encoding == 'ANSI' or Total_4_bytes == 0:
                                                editor.research(r'\S+', number)
                                            else:
                                                try:
                                                    editor.research(r'(?:(?!\s).[\x{D800}-\x{DFFF}]?)+', number)
                                                except RuntimeError:
                                                    Err_regex_non_space = True
                                            
                                            Non_space_count = num
                                            
                                            print ('NON-SPACE+')
                                            
                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                            Err_regex_sentence = False
                                            
                                            num = 0
                                            
                                            try:
                                                editor.research(r'(?-s)(?:\A|(?<=[\h\r\n.?!])).+?(?:(?=[.?!](\h|\R|\z))|(?=\R|\z))', number)
                                            except RuntimeError:
                                                Err_regex_sentence = True
                                            
                                            Sentence_count = num
                                            
                                            print ('SENTENCES')
                                            
                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            Err_regex_paragraph = False
                                            
                                            num = 0
                                            
                                            try:
                                                editor.research(r'(?-s)(?:(?:.[\x{D800}-\x{DFFF}]?)+(?:\r\n|\n|\r))+(?:\r\n|\n|\r){1,}(?:(?:.[\x{D800}-\x{DFFF}]?)+\z)?|(?:.[\x{D800}-\x{DFFF}]?)+\z', number)
                                            except RuntimeError:
                                                Err_regex_paragraph = True
                                            
                                            Paragraph_count = num
                                            
                                            print ('PARAGRAPHS')
                                            
                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                            num = 0
                                            if Curr_encoding == 'ANSI':
                                                editor.research(r'\f^(?:\r\n|\n|\r)', number)
                                            else:
                                                editor.research(r'[\f\x{0085}\x{2028}\x{2029}]^(?:\r\n|\n|\r)', number)
                                            
                                            Special_empty = num
                                            
                                            num = 0
                                            editor.research(r'^(?:\r\n|\n|\r)', number)
                                            
                                            Default_empty = num
                                            
                                            Empty_lines = Default_empty - Special_empty
                                            
                                            print ('EMPTY lines')
                                            
                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                            num = 0
                                            if Curr_encoding == 'ANSI':
                                                editor.research(r'\f^[\t\x20]+(?:\r\n|\n|\r|\z)', number)
                                            else:
                                                editor.research(r'[\f\x{0085}\x{2028}\x{2029}]^[\t\x20]+(?:\r\n|\n|\r|\z)', number)
                                            
                                            Special_blank = num
                                            
                                            num = 0
                                            editor.research(r'^[\t\x20]+(?:\r\n|\n|\r|\z)', number)
                                            
                                            Default_blank = num
                                            
                                            Blank_lines = Default_blank - Special_blank
                                            
                                            print ('BLANK lines')
                                            
                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                            Emp_blk_lines = Empty_lines + Blank_lines
                                            
                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                            Total_lines = editor.getLineCount()
                                            
                                            num = 0
                                            editor.research(r'(?-s)^.+\z', number)
                                            
                                            if num == 0:
                                                Total_lines = Total_lines - 1  #  Because LAST line totally EMPTY
                                            
                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                            Non_blk_lines = Total_lines - Emp_blk_lines
                                            
                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                            Num_sel = editor.getSelections()  # Get ALL selections ( EMPTY or NOT )
                                            
                                            if Num_sel != 0:
                                            
                                                Bytes_count = 0
                                                Chars_count = 0
                                                Words_count = 0
                                            
                                                for n in range(Num_sel):
                                            
                                                    Sel_start = editor.getSelectionNStart(n)
                                                    Sel_end = editor.getSelectionNEnd(n)
                                            
                                                    Bytes_count += Sel_end - Sel_start
                                                    Chars_count += editor.countCharacters(Sel_start, Sel_end)
                                            
                                                    num = 0
                                                    if Sel_end > Sel_start:  #  Search ONLY within a NON-empty selection, whatever the PREVIOUS ones contained
                                                        editor.research(r'\w+', number, 0, Sel_start, Sel_end)
                                                    Words_count += num
                                            
                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                                if Bytes_count < 2:
                                                    Txt_bytes = ' selected byte) in '
                                                else:
                                                    Txt_bytes = ' selected bytes) in '
                                            
                                                if Chars_count < 2:
                                                    Txt_chars = ' selected char, '
                                                else:
                                                    Txt_chars = ' selected chars, '
                                            
                                                if Words_count < 2:
                                                    Txt_words = ' selected word ('
                                                else:
                                                    Txt_words = ' selected words ('
                                            
                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                                if Num_sel < 2 and Bytes_count == 0:
                                                    Txt_ranges = ' EMPTY range'
                                            
                                                if Num_sel < 2 and Bytes_count > 0:
                                                    Txt_ranges = ' range'
                                            
                                                if Num_sel > 1 and Bytes_count == 0:
                                                    Txt_ranges = ' EMPTY ranges'
                                            
                                                if Num_sel > 1 and Bytes_count > 0:
                                                    Txt_ranges = ' ranges (EMPTY or NOT)'
                                            
                                            # --------------------------------------------------------------------------------------------------------------------------------------------------------------
                                            
                                            console.hide()
                                            
                                            line_list = []  # empty list
                                            
                                            Line_end = '\r\n'
                                            
                                            line_list.append ('-' * Line_title)
                                            
                                            line_list.append (' ' * int((Line_title - 54) / 2) + 'SUMMARY on ' + str(datetime.datetime.now()) + ' ( ' + str(time.time() - Start_time) + ' )')
                                            
                                            line_list.append ('-' * Line_title + Line_end)
                                            
                                            line_list.append (' FULL File Path    :  ' + File_name + Line_end)
                                            
                                            if os.path.isfile(File_name):
                                            
                                                line_list.append (' CREATION     Date :  ' + Creation_date)
                                            
                                                line_list.append (' MODIFICATION Date :  ' + Modif_date + Line_end)
                                            
                                                line_list.append (' READ-ONLY flag    :  ' + RO_flag)
                                            
                                            line_list.append (' READ-ONLY editor  :  ' + RO_editor + Line_end * 2)
                                            
                                            line_list.append (' Current VIEW      :  ' + Curr_view + Line_end)
                                            
                                            line_list.append (' Current ENCODING  :  ' + Curr_encoding + Line_end)
                                            
                                            line_list.append (' Current LANGUAGE  :  ' + str(Curr_lang) + '  (' + Lang_desc + ')' + Line_end)
                                            
                                            line_list.append (' Current Line END  :  ' + Curr_eol + Line_end)
                                            
                                            line_list.append (' Current WRAPPING  :  ' + Curr_wrap + Line_end * 2)
                                            
                                            line_list.append (' 1-BYTE  Chars     :  ' + str(Total_1_byte))
                                            
                                            line_list.append (' 2-BYTES Chars     :  ' + str(Total_2_bytes))
                                            
                                            line_list.append (' 3-BYTES Chars     :  ' + str(Total_3_bytes) + Line_end)
                                            
                                            line_list.append (' Sum BMP Chars     :  ' + str(Total_BMP))
                                            
                                            line_list.append (' 4-BYTES Chars     :  ' + str(Total_4_bytes) + Line_end)
                                            
                                            line_list.append (' CHARS w/o CR & LF :  ' + str(Total_standard) + Line_end * 2)
                                            
                                            line_list.append (' EOL ( CR or LF )  :  ' + str(Total_EOL))
                                            
                                            line_list.append (' SPC & TAB  Chars  :  ' + str(Blank_chars))
                                            
                                            line_list.append (' TRUE       Chars  :  ' + str(True_chars) + Line_end)
                                            
                                            line_list.append (' TOTAL characters  :  ' + str(Total_chars) + Line_end * 2)
                                            
                                            if Curr_encoding == 'ANSI':
                                                line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b)')
                                            
                                            if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
                                                line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b + '\
                                                + str(Total_2_bytes) + ' x 2b + ' + str(Total_3_bytes) + ' x 3b + ' + str(Total_4_bytes) + ' x 4b)')
                                            
                                            if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
                                                line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 2 + ' + str(Total_BMP) + ' x 2b + ' + str(Total_4_bytes) + ' x 4b)')
                                            
                                            line_list.append (' Byte Order Mark   :  ' + str(BOM) + Line_end)
                                            
                                            line_list.append (' BUFFER Length     :  ' + str(Buffer_length))
                                            
                                            if os.path.isfile(File_name):
                                                line_list.append (' Length on DISK    :  ' + str(Size_length) + Line_end * 2)
                                            else:
                                                if Line_end == '\r\n':
                                                    line_list.append (Line_end)
                                            
                                            line_list.append (' NUMBER     Chars  :  ' + str(Number_chars) + '\t(*)')
                                            
                                            line_list.append (' LOW_LINE   Chars  :  ' + str(Lowline_chars))
                                            
                                            line_list.append (' LETTER     Chars  :  ' + str(Letter_chars) + '\t(*)' + Line_end)
                                            
                                            line_list.append (' WORD       Chars  :  ' + str(Word_chars) + '\t(*)' + Line_end * 2)
                                            
                                            line_list.append (' WORDS      Count  :  ' + str(Words_total) + '\t(*)' + Line_end)
                                            
                                            if not Err_regex_non_space:
                                                line_list.append (' NON-SPACE  Count  :  ' + str(Non_space_count) + '\t(**)' + Line_end * 2)
                                            else:
                                                line_list.append (' NON-SPACE  Count  :  ' + str(Non_space_count) + '\t(Caution : a " RuntimeError " occurred !)' + Line_end * 2)
                                            
                                            if not Err_regex_sentence:
                                                line_list.append (' SENTENCES  Count  :  ' + str(Sentence_count) + '\t(**)' + Line_end)
                                            else:
                                                line_list.append (' SENTENCES  Count  :  ' + str(Sentence_count) + '\t(Caution : a " RuntimeError " occurred !)' + Line_end)
                                            
                                            if not Err_regex_paragraph:
                                                line_list.append (' PARAGRAPHS Count  :  ' + str(Paragraph_count) + '\t(**)' + Line_end * 2)
                                            else:
                                                line_list.append (' PARAGRAPHS Count  :  ' + str(Paragraph_count) + '\t(Caution : a " RuntimeError " occurred !)' + Line_end * 2)
                                            
                                            line_list.append (' True EMPTY lines  :  ' + str(Empty_lines))
                                            
                                            line_list.append (' True BLANK lines  :  ' + str(Blank_lines) + Line_end)
                                            
                                            line_list.append (' EMPTY/BLANK lines :  ' + str(Emp_blk_lines) + Line_end)
                                            
                                            line_list.append (' NON-BLANK lines   :  ' + str(Non_blk_lines))
                                            
                                            line_list.append (' TOTAL Lines       :  ' + str(Total_lines) + Line_end * 2)
                                            
                                            line_list.append (' SELECTION(S)      :  ' + str(Chars_count) + Txt_chars + str(Words_count) + Txt_words + str(Bytes_count) + Txt_bytes + str(Num_sel) + Txt_ranges + Line_end * 2)
                                            
                                            line_list.append (' (*)   Our BOOST regex engine ignores all WORD, NUMBER and LETTER characters over the BMP and may ignore some others within the BMP !')
                                            
                                            line_list.append (' (**)  The results may NOT be very accurate for "technical" or "non-regular" files !' + Line_end)
                                            
                                            notepad.new()
                                            
                                            editor.setText('\r\n'.join(line_list))
                                            
                                            if St_bar != 'ANSI' and St_bar != 'UTF-8' and St_bar != 'UTF-8-BOM' and St_bar != 'UTF-16 BE BOM' and St_bar != 'UTF-16 LE BOM':
                                            
                                                if Curr_encoding == 'UTF-8':  #  SAME value for both a 'UTF-8' and an 'ANSI' file, when RE-INTERPRETED with the 'Encoding > Character Set > ...' feature
                                            
                                                    notepad.messageBox ('CURRENT file re-interpreted as ' + St_bar + '  =>  Possible ERRONEOUS results' + \
                                                                    '\nSo, CLOSE the file WITHOUT saving, RESTORE it (CTRL + SHIFT + T) and RESTART script', '!!! WARNING !!!')
                                            
                                            # ----Aé☀𝜜-----------------------------------------------------------------------------------------------------------------------------------------------------
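
As a side note, the "BYTES Length" breakdown that the script prints for UTF-8 files ( EOL x 1 + 1-byte x 1b + 2-bytes x 2b + 3-bytes x 3b + 4-bytes x 4b ) can be cross-checked outside Notepad++. The sketch below is not part of the script above: it assumes a plain Python 3 interpreter, and the function name utf8_length_breakdown is hypothetical, chosen for illustration only. It classifies each character by the number of bytes it needs in UTF-8, counting CR and LF apart, as the summary does.

```python
def utf8_length_breakdown(text):
    """Count chars by their UTF-8 byte length; CR and LF are counted apart as 'eol'."""
    counts = {'eol': 0, 1: 0, 2: 0, 3: 0, 4: 0}
    for ch in text:
        if ch in '\r\n':
            counts['eol'] += 1
        else:
            # len() of the encoded char gives 1, 2, 3 or 4 bytes in UTF-8
            counts[len(ch.encode('utf-8'))] += 1
    return counts


# The four test chars of the last comment line of the script, followed by CR + LF
sample = 'Aé☀𝜜\r\n'
counts = utf8_length_breakdown(sample)

# Same formula as the UTF-8 "BYTES Length" line of the summary
bytes_length = counts['eol'] + counts[1] + 2 * counts[2] + 3 * counts[3] + 4 * counts[4]

print(counts)
print(bytes_length, len(sample.encode('utf-8')))
```

If the two numbers printed on the last line agree, the per-length totals add up to the real UTF-8 size of the text, BOM excluded.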
                                            

                                            Best Regards,

                                            guy038
