Emulation of the "View > Summary" feature with a Python script

guy038

Hi All,

Then, with the help of the excellent BabelMap software, updated for Unicode v13.0 :

https://www.babelstone.co.uk/Software/BabelMap.html

I succeeded in creating a list of the 21,143 remaining characters, from the living scripts, above, which should be truly considered as word characters, without any ambiguity

On the other hand, with the help of my Total_Chars.txt, which contains 325,590 characters, I detected 48,031 word chars with a simple search of the \w regex. This number seems large, but it includes all the Chinese characters and equivalent chars, which cannot be truly counted as word chars because of their vertical / horizontal way of writing !

In addition, when applying the regex \t\w\t against the list above, I got a total of 17,307 word characters only, probably because Notepad++ does not use the Boost regex library with FULL Unicode support

Indeed, after some verifications :

The Boost definition of the regex \w does not consider all the characters over the BMP
Some characters of the BMP, although alphabetic, are not considered, yet, as word chars

For instance, in the short list below, each Unicode char, surrounded with two tabulation chars, cannot be found with the regex \t\w\t, although each char is, indeed, seen as a word character by the Unicode Consortium :-((

 24B6   Ⓐ     ; Other_Symbol     # So         CIRCLED LATIN CAPITAL LETTER A
1D400   𝐀     ; Uppercase_Letter # Lu         MATHEMATICAL BOLD CAPITAL A
1D70B   𝜋     ; Lowercase_Letter # Ll         MATHEMATICAL ITALIC SMALL PI
1F150   🅐     ; Other_Symbol     # So         NEGATIVE CIRCLED LATIN CAPITAL LETTER A
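As a cross-check outside Notepad++, here is a minimal Python 3 sketch which prints the General_Category of these four characters with the standard unicodedata module. The names SAMPLES and describe are only illustrative. Note that Python's own \w belongs to a different regex engine than Boost, so its answer is shown for comparison only and says nothing about the Notepad++ behaviour :

```python
import re
import unicodedata

# The four sample characters of the short list above, written as escapes
SAMPLES = ['\u24B6', '\U0001D400', '\U0001D70B', '\U0001F150']

def describe(ch):
    """Return (code point, General_Category, matched by Python's own \\w)."""
    return ('{:04X}'.format(ord(ch)),
            unicodedata.category(ch),
            re.fullmatch(r'\w', ch) is not None)

if __name__ == '__main__':
    for ch in SAMPLES:
        print(describe(ch), unicodedata.name(ch))
```

The second field is the General_Category code ( So, Lu or Ll ) which appears after the # sign in the list above.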

To my mind, for all these reasons, as we cannot rely on the word notion, the View > Summary... feature should just ignore the number of words or, at least, add the indication With caution !

By contrast, I think that it would be useful to count the number of Non_Space strings, determined with the regex \S+. Indeed, we would get more confident results ! The boundaries of Non_Space strings, which are the Space characters, belong to the well-defined list of the 25 Unicode characters with the binary property White_Space, from the PropList.txt file. Refer to the very beginning of this file :

http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt

As a reminder, the regex \s is identical to \h|\v. So, it represents the complete character class [\t\x20\xA0\x{1680}\x{2000}-\x{200B}\x{202F}\x{3000}]|[\n\x0B\f\r\x85\x{2028}\x{2029}] which can be re-ordered as :

\s = [\t\n\x0B\f\r\x20\x85\xA0\x{1680}\x{2000}-\x{200B}\x{2028}\x{2029}\x{202F}\x{3000}]

Note that, in practice, the \s regex is mainly equivalent to the simple regex [\t\n\r\x20]

Here is that Unicode list of all Unicode characters with the property White_Space, with their name and their General_Category value :

0009  TAB  ; White_Space    # Cc    TABULATION  <control-0009>
000A  LF   ; White_Space    # Cc    LINE FEED  <control-000A>
000B       ; White_Space    # Cc    VERTICAL TABULATION  <control-000B>
000C    ; White_Space    # Cc    FORM FEED  <control-000C>
000D  CR   ; White_Space    # Cc    CARRIAGE RETURN  <control-000D>
0020       ; White_Space    # Zs    SPACE
0085    ; White_Space    # Cc    NEXT LINE  <control-0085>
00A0       ; White_Space    # Zs    NO-BREAK SPACE
1680       ; White_Space    # Zs    OGHAM SPACE MARK
2000       ; White_Space    # Zs    EN QUAD
2001       ; White_Space    # Zs    EM QUAD
2002       ; White_Space    # Zs    EN SPACE
2003       ; White_Space    # Zs    EM SPACE
2004       ; White_Space    # Zs    THREE-PER-EM SPACE
2005       ; White_Space    # Zs    FOUR-PER-EM SPACE
2006       ; White_Space    # Zs    SIX-PER-EM SPACE
2007       ; White_Space    # Zs    FIGURE SPACE
2008       ; White_Space    # Zs    PUNCTUATION SPACE
2009       ; White_Space    # Zs    THIN SPACE
200A       ; White_Space    # Zs    HAIR SPACE
2028     ; White_Space    # Zl    LINE SEPARATOR
2029     ; White_Space    # Zp    PARAGRAPH SEPARATOR
202F       ; White_Space    # Zs    NARROW NO-BREAK SPACE
205F       ; White_Space    # Zs    MEDIUM MATHEMATICAL SPACE
3000   　  ; White_Space    # Zs    IDEOGRAPHIC SPACE

Note that I used the notations TAB, LF and CR, standing for the three characters \t, \n and \r, instead of the chars themselves
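To experiment with this list outside Notepad++, here is a small Python 3 sketch which builds a character class from the 25 code-points, exactly as printed above, and counts the Non_Space strings of a text. It does not rely on the \s of any regex engine, so it may differ from the Boost \s class quoted earlier. The names WHITE_SPACE, NON_SPACE_RUN and count_non_space are my own choice :

```python
import re

# The 25 code points with the White_Space property, as listed above
WHITE_SPACE = (
    [0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020, 0x0085, 0x00A0, 0x1680]
    + list(range(0x2000, 0x200B))          # U+2000 .. U+200A ( 11 chars )
    + [0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)

# Negated character class built from the explicit list of code points
_members = ''.join('\\u{:04X}'.format(cp) for cp in WHITE_SPACE)
NON_SPACE_RUN = re.compile('[^' + _members + ']+')

def count_non_space(text):
    """Count the maximal runs of characters which are not White_Space."""
    return len(NON_SPACE_RUN.findall(text))

if __name__ == '__main__':
    print(len(WHITE_SPACE))
    print(count_non_space('one\u00A0two\u3000three\tfour\r\n'))
```

With this definition, a ZERO WIDTH SPACE ( \x{200B} ) does not split a Non_Space string, because it does not have the White_Space property.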

So, in order to get the number of Non_Space strings, we should, normally, use the simple regex \S+. However, it does not give the right number. Indeed, when several characters, with code-point over the BMP, are consecutive, they are not seen as a global Non_Space string but as individual characters :-((

You may test my statement with this string, composed of four consecutive emoji chars 👨👩👦👧. The regex \S+ returns four Non_Space strings, whereas I would have expected only one string !

Consequently, I verified that, when the number of four-byte chars is > 0, the suitable regex to count all the Non_Space strings of a file, whatever their Unicode code-point, is rather the regex ((?!\s).[\x{D800}-\x{DFFF}]?)+ ( longer, I agree, but exact ! )
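The [\x{D800}-\x{DFFF}] part of this regex refers to the surrogate range. As an illustration, and assuming that the regex engine works on 16-bit UTF-16 code units, this Python 3 sketch shows that each of the four emoji is stored as two code units, both located in that range. The helper names utf16_units and is_surrogate are hypothetical. Python 3 itself works on whole code-points, so its \S+ sees a single string here :

```python
import re

def utf16_units(ch):
    """Return the UTF-16 code units of a single character, as integers."""
    raw = ch.encode('utf-16-be')
    return [int.from_bytes(raw[i:i + 2], 'big') for i in range(0, len(raw), 2)]

def is_surrogate(unit):
    """True if the 16-bit unit lies in the surrogate range D800-DFFF."""
    return 0xD800 <= unit <= 0xDFFF

# The four consecutive emoji of the test string above, written as escapes
FAMILY = '\U0001F468\U0001F469\U0001F466\U0001F467'

if __name__ == '__main__':
    for ch in FAMILY:
        print(['{:04X}'.format(u) for u in utf16_units(ch)])
    print(len(re.findall(r'\S+', FAMILY)))
```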

So, I would like to propose a new layout of a summary feature, which should be more informative. It contains a list of regexes which allow you to count different subsets of characters from the current file contents. Of course, tick the Wrap around option, in the Find dialog and click on the Count button for tests !

IMPORTANT : In the list below, any text, before the colon character of each line, is the name which should be displayed in the new Summary dialog !

 FULL File Path    :  X:\....\....\

 CREATION     Date :  Name Month Day 22-05-26 Year
 MODIFICATION Date :  Name Month Day 22-05-26 Year

 READ-ONLY flag    :  YES / NO
 READ-ONLY editor  :  YES / NO


 Current VIEW      :  MAIN view / SECONDARY view

 Current ENCODING  :  UTF-... / ANSI

 Current LANGUAGE  :  TXT ( Normal txt file) / ...

 Current Line END  :  Windows (CR LF) / Macintosh (CR) / Unix (LF)

 Current WRAPPING  :  YES / NO

•------------------------------------------------•----------------------------------------------------------------------------•------------------------------------------------•-----------------------------------•
                                                 |                                 UTF-8 [-BOM]                               |             UCS-2/UTF-16 BE/LE BOM             |                ANSI
•------------------------------------------------•----------------------------------------------------------------------------•------------------------------------------------•-----------------------------------•
                                                 |                                                                            |                                                |
 1-BYTE  Chars     :  N1                         | (?![\r\n])[\x{0000}-\x{007F}]                                              |                        0                       |               [^\r\n]
 2-BYTES Chars     :  N2                         | [\x{0080}-\x{07FF}]                                                        | (?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}] |                  0
 3-BYTES Chars     :  N3                         | (?![\x{D800}-\x{DFFF}])[\x{0800}-\x{FFFF}]                                 |                        0                       |                  0
                                                 |                                                                            |                                                |
 Sum BMP Chars     :  N1 + N2 + N3               | (?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}] or [^\r\n\x{D800}-\x{DFFF}] |                      idem                      |               [^\r\n]
 4-BYTES Chars     :  N4                         | (?-s).[\x{D800}-\x{DFFF}]  or  [\x{D800}-\x{DFFF}]                         |                      idem                      |                  0
                                                 |                                                                            |                                                |
 Chars w/o CR|LF   :  N1 + N2 + N3 + N4          | [^\r\n]                                                                    |                      idem                      |                idem
 EOL ( CR or LF )  :  N0                         | \r|\n                                                                      |                      idem                      |                idem
                                                 |                                                                            |                                                |
 TOTAL Characters  :  N0 + N1 + N2 + N3 + N4     | (?s).                                                                      |                      idem                      |                idem
                                                 |                                                                            |                                                |
                                                 |                                                                            |                                                |
 BYTE Length       :                             | N0 + N1 + 2 × N2 + 3 × N3 + 4 × N4                                         |            N0 × 2 + N2 × 2 + N4 × 4            |               N0 + N1
                                                 |                                                                            |                                                |
 Byte Order Mark   :                             | 0 ( UTF-8)  or  3 ( UTF-8-BOM )                                            |                        2                       |                  0
                                                 |                                                                            |                                                |
 BUFFER Length     :  BYTE length  +  BOM        |                                                                            |                                                |
                                                 |                                                                            |                                                |
 Length on DISK    :  Length CURRENT file on DISK|                                                                            |                                                |
                                                 |                                                                            |                                                |
                                                 |                                                                            |                                                |
 NON BLANK chars   :                             | [^\r\n\t\x20]                                                              |                       idem                     |                idem
                                                 |                                                                            |                                                |
 WORDS     count   :     (Caution !)             | \w+                                                                        |                       idem                     |                idem
                                                 |                                                                            |                                                |
 NON-SPACE count   :                             | (?:(?!\s).[\x{D800}-\x{DFFF}]?)+  or  \S+                                  |                       idem                     |                \S+
                                                 |                                                                            |                                                |
                                                 |                                                                            |                                                |
 True EMPTY lines  :  L1                         | (?<![\f\x{0085}\x{2028}\x{2029}])^(?:\r\n|\r|\n)                           |                       idem                     | (?<!\f)^(?:\r\n|\r|\n)
                                                 |                                                                            |                                                |
 True BLANK lines  :  L2                         | (?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]+(?:\r\n|\r|\n|\z)               |                       idem                     | (?<!\f)^[\t\x20]+(?:\r\n|\r|\n|\z)
                                                 |                                                                            |                                                |
                                                 |                                                                            |                                                |
 EMPTY/BLANK lines :  L1 + L2                    | (?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]*(?:\r\n|\r|\n|\z)               |                       idem                     | (?<!\f)^[\t\x20]*(?:\r\n|\r|\n|\z)
                                                 |                                                                            |                                                |
 NON-BLANK lines   :                             | (?-s)(?!^[\t\x20]+$)^(?:.|[\f\x{0085}\x{2028}\x{2029}])+(?:\r\n|\r|\n|\z)  |                       idem                     | (?-s)(?!^[\t\x20]+$)^(?:.|\f)+(?:\r\n|\r|\n|\z)
                                                 |                                                                            |                                                |
 TOTAL lines       :                             | (?-s)\r\n|\r|\n|(?:.|[\f\x{0085}\x{2028}\x{2029}])\z                       |                       idem                     | (?-s)\r\n|\r|\n|(?:.|\f)\z
                                                 |                                                                            |                                                |
                                                 |                                                                            |                                                |
 SELECTION(S)      :  X characters (Y bytes) in Z ranges                                                                      |                        idem                    |                idem
•------------------------------------------------•----------------------------------------------------------------------------•------------------------------------------------•------------------------------------•
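The BYTE Length formulas of this table can be checked outside Notepad++. Here is a minimal Python 3 sketch, where the function names are only illustrative. It sorts the characters of a string into the N0 to N4 classes of the UTF-8 column and compares the two formulas with the real encoded lengths. For UTF-16, all the BMP characters ( N1 + N2 + N3 of the UTF-8 column ) play the role of the N2 value of the UTF-16 column :

```python
def byte_classes(text):
    """Sort the characters of text into the N0..N4 classes ( UTF-8 column )."""
    counts = {'N0': 0, 'N1': 0, 'N2': 0, 'N3': 0, 'N4': 0}
    for ch in text:
        cp = ord(ch)
        if ch in '\r\n':
            counts['N0'] += 1        # EOL chars ( CR or LF )
        elif cp <= 0x7F:
            counts['N1'] += 1        # 1-byte chars
        elif cp <= 0x7FF:
            counts['N2'] += 1        # 2-byte chars
        elif cp <= 0xFFFF:
            counts['N3'] += 1        # 3-byte chars
        else:
            counts['N4'] += 1        # 4-byte chars ( over the BMP )
    return counts

def utf8_length(c):
    return c['N0'] + c['N1'] + 2 * c['N2'] + 3 * c['N3'] + 4 * c['N4']

def utf16_length(c):
    # Every BMP character, EOL included, takes 2 bytes and the others 4 bytes
    return 2 * (c['N0'] + c['N1'] + c['N2'] + c['N3']) + 4 * c['N4']

# One char of each class, followed with a Windows line-break
SAMPLE = 'A\u00E9\u2600\U0001D71C\r\n'

if __name__ == '__main__':
    c = byte_classes(SAMPLE)
    print(c)
    print(utf8_length(c), len(SAMPLE.encode('utf-8')))
    print(utf16_length(c), len(SAMPLE.encode('utf-16-le')))
```

The BOM is not included in these lengths : it is added separately, as in the BUFFER Length line.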

Continued discussion in the next post

guy038

guy038

Hi, All,

Remarks : Although most of the regexes, above, can be easily understood, here are some additional elements :

- The regex (?-s).[\x{D800}-\x{DFFF}] is the sole correct syntax, with our Boost regex engine, to count all the characters over the BMP. But it may fail with the message Ran out of stack space trying to match the regular expression. Luckily, I do not use it in the script because this count can be deduced from the difference Total_standard - Total_BMP
- The regex (?s)((?!\s).[\x{D800}-\x{DFFF}]?)+, to count all the Non_Space strings, was explained before but may also fail with the message Ran out of stack space trying to match the regular expression.
- In all the regexes, relative to the counting of lines, you probably noticed the character class [\f\x{0085}\x{2028}\x{2029}]. It must be present because each of the four characters \f, \x{0085}, \x{2028} and \x{2029} is considered as both a start and an end of line, like the assertions ^ and $ !
  - For instance, if, in a new file, you insert one Next_Line char ( NEL ), of code-point \x{0085}, and hit the Enter key, this sole line is wrongly seen as an empty line by the simple regex ^(?:\r\n|\r|\n), which matches the line-break after the Next_Line char !
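To make the intended definitions of the line counts explicit, here is a plain Python 3 restatement, which is not a translation of the Boost regexes. It assumes that only CR LF, CR and LF end a line, so that \f, \x{0085}, \x{2028} and \x{2029} are ordinary contents. The name classify_lines is hypothetical :

```python
import re

EOL = re.compile(r'\r\n|\r|\n')

def classify_lines(text):
    """Return the ( empty, blank, non_blank, total ) line counts of text."""
    lines = EOL.split(text)
    if lines and lines[-1] == '':
        lines.pop()                  # nothing after the very last line-break
    empty = sum(1 for ln in lines if ln == '')
    blank = sum(1 for ln in lines if ln != '' and ln.strip('\t ') == '')
    total = len(lines)
    return empty, blank, total - empty - blank, total

if __name__ == '__main__':
    # A line which contains a single NEL char is NOT an empty line
    print(classify_lines('\x85\r\n'))
    print(classify_lines('a\r\n\r\n \t\nb'))
```

Here, a true EMPTY line has no character at all before its line-break and a true BLANK line only contains tabulation and/or space chars.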

Here is the Python script, split across two posts

# encoding=utf-8

#-------------------------------------------------------------------------
#                    STATISTICS about the CURRENT file ( v0.6 )
#-------------------------------------------------------------------------

from __future__ import print_function    # for Python2 compatibility

from Npp import *

import re

import os, time, datetime

import ctypes

from ctypes.wintypes import BOOL, HWND, WPARAM, LPARAM, UINT

# --------------------------------------------------------------------------------------------------------------------------------------------------------------
#  From @alan-kilborn, in post https://community.notepad-plus-plus.org/topic/21733/pythonscript-different-behavior-in-script-vs-in-immediate-mode/4
# --------------------------------------------------------------------------------------------------------------------------------------------------------------

def npp_get_statusbar(statusbar_item_number):

    WNDENUMPROC = ctypes.WINFUNCTYPE(BOOL, HWND, LPARAM)
    FindWindowW = ctypes.windll.user32.FindWindowW
    FindWindowExW = ctypes.windll.user32.FindWindowExW
    SendMessageW = ctypes.windll.user32.SendMessageW
    LRESULT = LPARAM
    SendMessageW.restype = LRESULT
    SendMessageW.argtypes = [ HWND, UINT, WPARAM, LPARAM ]
    EnumChildWindows = ctypes.windll.user32.EnumChildWindows
    GetClassNameW = ctypes.windll.user32.GetClassNameW
    create_unicode_buffer = ctypes.create_unicode_buffer

    SBT_OWNERDRAW = 0x1000
    WM_USER = 0x400; SB_GETTEXTLENGTHW = WM_USER + 12; SB_GETTEXTW = WM_USER + 13

    npp_get_statusbar.STATUSBAR_HANDLE = None

    def get_result_from_statusbar(statusbar_item_number):
        assert statusbar_item_number <= 5
        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTLENGTHW, statusbar_item_number, 0)
        length = retcode & 0xFFFF
        item_type = (retcode >> 16) & 0xFFFF  # renamed from 'type' to avoid shadowing the builtin
        assert (item_type != SBT_OWNERDRAW)
        text_buffer = create_unicode_buffer(length + 1)  # + 1 leaves room for the terminating NUL written by SB_GETTEXTW
        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTW, statusbar_item_number, ctypes.addressof(text_buffer))
        retval = '{}'.format(text_buffer[:length])
        return retval

    def EnumCallback(hwnd, lparam):
        curr_class = create_unicode_buffer(256)
        GetClassNameW(hwnd, curr_class, 256)
        if curr_class.value.lower() == "msctls_statusbar32":
            npp_get_statusbar.STATUSBAR_HANDLE = hwnd
            return False  # stop the enumeration
        return True  # continue the enumeration

    npp_hwnd = FindWindowW(u"Notepad++", None)
    EnumChildWindows(npp_hwnd, WNDENUMPROC(EnumCallback), 0)
    if npp_get_statusbar.STATUSBAR_HANDLE: return get_result_from_statusbar(statusbar_item_number)
    assert False

St_bar = npp_get_statusbar(4)  # Zone 4 ( STATUSBARSECTION.UNICODETYPE )

See next post for continuation !

guy038

Continuation of the script :

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

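def number(occ):
    # Callback for editor.research() : called once per match, so 'num' holds the number of matches at the end
    global num
    num += 1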

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_encoding = str(notepad.getEncoding())

if Curr_encoding == 'ENC8BIT':
    Curr_encoding = 'ANSI'

if Curr_encoding == 'COOKIE':
    Curr_encoding = 'UTF-8'

if Curr_encoding == 'UTF8':
    Curr_encoding = 'UTF-8-BOM'

if Curr_encoding == 'UCS2BE':
    Curr_encoding = 'UTF-16 BE BOM'

if Curr_encoding == 'UCS2LE':
    Curr_encoding = 'UTF-16 LE BOM'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    Line_title = 95
else:
    Line_title = 75

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

File_name = notepad.getCurrentFilename()

if os.path.isfile(File_name) == True:

    Creation_date = time.ctime(os.path.getctime(File_name))

    Modif_date = time.ctime(os.path.getmtime(File_name))

    Size_length = os.path.getsize(File_name)

    RO_flag = 'YES'

    if os.access(File_name, os.W_OK):
        RO_flag = 'NO'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

RO_editor = 'NO'

if editor.getReadOnly() == True:
    RO_editor = 'YES'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if notepad.getCurrentView() == 0:
    Curr_view = 'MAIN View'
else:
    Curr_view = 'SECONDARY view'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_lang = notepad.getCurrentLang()

Lang_desc = notepad.getLanguageDesc(Curr_lang)

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if editor.getEOLMode() == 0:
    Curr_eol = 'Windows (CR LF)'

if editor.getEOLMode() == 1:
    Curr_eol = 'Macintosh (CR)'

if editor.getEOLMode() == 2:
    Curr_eol = 'Unix (LF)'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_wrap = 'NO'

if editor.getWrapMode() == 1:
    Curr_wrap = 'YES'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'[^\r\n]', number)

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    editor.research(r'(?![\r\n])[\x{0000}-\x{007F}]', number)

Total_1_byte = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    editor.research(r'[\x{0080}-\x{07FF}]', number)

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    editor.research(r'(?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]', number)  #  ALL BMP vchars ( With PYTHON, the [^\r\n\x{D800}-\x{DFFF}] syntax does NOT work properly !)

Total_2_bytes = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    editor.research(r'(?![\x{D800}-\x{DFFF}])[\x{0800}-\x{FFFF}]', number)

Total_3_bytes = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Total_BMP = Total_1_byte + Total_2_bytes + Total_3_bytes

# --------------------------------------------------------------------------------------------------------------------------------------------------------------
num = 0
editor.research(r'[^\r\n]', number)

Total_standard = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Total_4_bytes = 0  #  By default

if Curr_encoding != 'ANSI':
    Total_4_bytes = Total_standard - Total_BMP

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\r|\n', number)

Total_EOL = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Total_chars = Total_EOL + Total_standard

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Curr_encoding == 'ANSI':
    Bytes_length = Total_EOL + Total_1_byte

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    Bytes_length = Total_EOL + Total_1_byte + 2 * Total_2_bytes + 3 * Total_3_bytes + 4 * Total_4_bytes

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    Bytes_length = 2 * Total_EOL + 2 * Total_BMP + 4 * Total_4_bytes

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

BOM = 0  #  Default ANSI and UTF-8

if Curr_encoding == 'UTF-8-BOM':
    BOM = 3

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    BOM = 2

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Buffer_length = Bytes_length + BOM

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'[^\r\n\t\x20]', number)

Non_blank_chars = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\w+', number)

Words_count = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0

if Curr_encoding == 'ANSI' or Total_4_bytes == 0:
    editor.research(r'\S+', number)
else:
    editor.research(r'(?:(?!\s).[\x{D800}-\x{DFFF}]?)+', number)

Non_space_count = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'(?<!\f)^(?:\r\n|\r|\n)', number)
else:
    editor.research(r'(?<![\f\x{0085}\x{2028}\x{2029}])^(?:\r\n|\r|\n)', number)

Empty_lines = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'(?<!\f)^[\t\x20]+(?:\r\n|\r|\n|\z)', number)
else:
    editor.research(r'(?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]+(?:\r\n|\r|\n|\z)', number)

Blank_lines = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Emp_blk_lines = Empty_lines + Blank_lines

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'(?-s)\r\n|\r|\n|(?:.|\f)\z', number)
else:
    editor.research(r'(?-s)\r\n|\r|\n|(?:.|[\f\x{0085}\x{2028}\x{2029}])\z', number)

Total_lines = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Non_blk_lines = Total_lines - Emp_blk_lines

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Num_sel = editor.getSelections()  # Get ALL selections ( EMPTY or NOT )

# print ('Res = ', Num_sel)

if Num_sel != 0:

    Bytes_count = 0
    Chars_count = 0

    for n in range(Num_sel):

        Bytes_count += editor.getSelectionNEnd(n) - editor.getSelectionNStart(n)

        Chars_count += editor.countCharacters(editor.getSelectionNStart(n), editor.getSelectionNEnd(n))

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

    if Chars_count < 2:
        Txt_chars = ' selected char ('

    else:
        Txt_chars = ' selected chars ('


    if Bytes_count < 2:
        Txt_bytes = ' selected byte) in '

    else:
        Txt_bytes = ' selected bytes) in '

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

    if Num_sel < 2 and Bytes_count == 0:
        Txt_ranges = ' EMPTY range\n'

    if Num_sel < 2 and Bytes_count > 0:
        Txt_ranges = ' range\n'

    if Num_sel > 1 and Bytes_count == 0:
        Txt_ranges = ' EMPTY ranges\n'

    if Num_sel > 1 and Bytes_count > 0:
        Txt_ranges = ' ranges (EMPTY or NOT)\n'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

line_list = []  # empty list

line_list.append ('-' * Line_title)

line_list.append (' ' * int((Line_title - 37) / 2) + 'SUMMARY on ' + str(datetime.datetime.now()))

line_list.append ('-' * Line_title +'\n')

line_list.append (' FULL File Path    :  ' + File_name + '\n')

if os.path.isfile(File_name) == True:

    line_list.append(' CREATION     Date :  ' + Creation_date)

    line_list.append(' MODIFICATION Date :  ' + Modif_date + '\n')

    line_list.append(' READ-ONLY flag    :  ' + RO_flag )

line_list.append (' READ-ONLY editor  :  ' + RO_editor + '\n\n')

line_list.append (' Current VIEW      :  ' + Curr_view + '\n')

line_list.append (' Current ENCODING  :  ' + Curr_encoding + '\n')

line_list.append (' Current LANGUAGE  :  ' + str(Curr_lang) + '  (' + Lang_desc + ')\n')

line_list.append (' Current Line END  :  ' + Curr_eol + '\n')

line_list.append (' Current WRAPPING  :  ' + Curr_wrap + '\n\n')

line_list.append (' 1-BYTE  Chars     :  ' + str(Total_1_byte))

line_list.append (' 2-BYTES Chars     :  ' + str(Total_2_bytes))

line_list.append (' 3-BYTES Chars     :  ' + str(Total_3_bytes) + '\n')

line_list.append (' Sum BMP Chars     :  ' + str(Total_BMP))

line_list.append (' 4-BYTES Chars     :  ' + str(Total_4_bytes) + '\n')

line_list.append (' CHARS w/o CR & LF :  ' + str(Total_standard))

line_list.append (' EOL ( CR or LF )  :  ' + str(Total_EOL) + '\n')

line_list.append (' TOTAL characters  :  ' + str(Total_chars) + '\n\n')

if Curr_encoding == 'ANSI':
    line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b)')

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b + '\
    + str(Total_2_bytes) + ' x 2b + ' + str(Total_3_bytes) + ' x 3b + ' + str(Total_4_bytes) + ' x 4b)')

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 2 + ' + str(Total_BMP) + ' x 2b + ' + str(Total_4_bytes) + ' x 4b)')

line_list.append (' Byte Order Mark   :  ' + str(BOM) + '\n')

line_list.append (' BUFFER Length     :  ' + str(Buffer_length))

if os.path.isfile(File_name) == True:
    line_list.append (' Length on DISK    :  ' + str(Size_length) + '\n\n')
else:
    line_list.append ('\n')

line_list.append (' NON-Blank Chars   :  ' + str(Non_blank_chars) + '\n')

line_list.append (' WORDS     Count   :  ' + str(Words_count) + ' (Caution !)\n')

line_list.append (' NON-SPACE Count   :  ' + str(Non_space_count) + '\n\n')

line_list.append (' True EMPTY lines  :  ' + str(Empty_lines))

line_list.append (' True BLANK lines  :  ' + str(Blank_lines) + '\n')

line_list.append (' EMPTY/BLANK lines :  ' + str(Emp_blk_lines) + '\n')

line_list.append (' NON-BLANK lines   :  ' + str(Non_blk_lines))

line_list.append (' TOTAL Lines       :  ' + str(Total_lines) + '\n\n')

line_list.append (' SELECTION(S)      :  ' + str(Chars_count) + Txt_chars + str(Bytes_count) + Txt_bytes + str(Num_sel) + Txt_ranges)

editor.copyText ('\r\n'.join(line_list))

notepad.new()

editor.paste()

editor.copyText('')

if St_bar != 'ANSI' and St_bar != 'UTF-8' and St_bar != 'UTF-8-BOM' and St_bar != 'UTF-16 BE BOM' and St_bar != 'UTF-16 LE BOM':

    if Curr_encoding == 'UTF-8':  #  SAME value for both an 'UTF-8' or 'ANSI' file, when RE-INTERPRETED with the 'Encoding > Character Set > ...' feature

        notepad.prompt ('CURRENT file re-interpreted as ' + St_bar + '  =>  Possible ERRONEOUS results' + \
                        '\nSo, CLOSE the file WITHOUT saving, RESTORE it (CTRL + SHIFT + T) and RESTART script', '!!! WARNING !!!', '')

# ----Aé☀𝜜-----------------------------------------------------------------------------------------------------------------------------------------------------

The way to use this script is quite self-explanatory. Just three points to emphasize :

On the BYTES Length line, the values between parentheses :
- Always begin with the number of EOL characters ( I omitted the b after x 1, on purpose ! )
  - Followed by the number of 1-BYTE chars, for an ANSI encoded file
  - Followed by the numbers of 1-BYTE, 2-BYTES, 3-BYTES and 4-BYTES chars, for a UTF-8 or UTF-8-BOM encoded file
  - Followed by the numbers of 2-BYTES and 4-BYTES chars, for a UTF-16 BE BOM or UTF-16 LE BOM encoded file
Normally, when a file is saved, the BUFFER Length and Length on DISK values should always be equal. If not, two cases are possible :
- The file has been recently modified ( trivial case )
- The file is not identified with a BOM and has been re-interpreted with another NON-Unicode encoding. Then, apply the actions indicated in the pop-up message !
For a new # file, some values are obviously absent. These are the MODIFICATION date, the CREATION date, the READ-ONLY flag and the Length on DISK ( size ) values
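
The arithmetic behind the BYTES Length and BUFFER Length lines can be checked outside Notepad++. The sketch below is a minimal, plain-Python illustration and not part of the script : the function name summary_lengths and the sample string are assumptions made for the example, and Python 3 is assumed ( so that a string is iterated by code point ). It sorts each character by code point, applies the same formulas as the script and lets you compare the result with Python's own encoders.

```python
# Hypothetical helper, for illustration only : it mimics the formulas of the script
def summary_lengths(text, encoding):
    """Return (bytes_length, buffer_length) for a text and a Notepad++ encoding name."""
    eol = sum(1 for ch in text if ch in '\r\n')              # CR and LF are counted apart
    body = [ord(ch) for ch in text if ch not in '\r\n']      # code points of all other chars

    if encoding == 'ANSI':
        return eol + len(body), eol + len(body)              # 1 byte per char, no BOM

    n4 = sum(1 for cp in body if cp > 0xFFFF)                # chars over the BMP
    if encoding in ('UTF-8', 'UTF-8-BOM'):
        n1 = sum(1 for cp in body if cp <= 0x7F)
        n2 = sum(1 for cp in body if 0x80 <= cp <= 0x7FF)
        n3 = sum(1 for cp in body if 0x800 <= cp <= 0xFFFF)
        bytes_length = eol + n1 + 2 * n2 + 3 * n3 + 4 * n4
        bom = 3 if encoding == 'UTF-8-BOM' else 0
    elif encoding in ('UTF-16 BE BOM', 'UTF-16 LE BOM'):
        bmp = len(body) - n4                                 # every BMP char takes 2 bytes
        bytes_length = 2 * eol + 2 * bmp + 4 * n4
        bom = 2
    else:
        raise ValueError('Unsupported encoding : ' + encoding)
    return bytes_length, bytes_length + bom

# One char of each kind ( A, e acute, sun sign, math italic Lambda ) and one CR LF pair
sample = u'A\u00e9\u2600\U0001d71c\r\n'

print(summary_lengths(sample, 'UTF-8-BOM'))                  # (12, 15)
print(summary_lengths(sample, 'UTF-16 LE BOM'))              # (14, 16)
print(len(sample.encode('utf-8')), len(sample.encode('utf-16-le')))   # 12 14
```

When the BUFFER Length computed this way differs from the Length on DISK of a saved file, one of the two cases listed above is the likely reason.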

Best Regards,

guy038

Mark Olson

@guy038 said in Tests and impressions on the "View > Summary..." functionality:

editor.copyText ('\r\n'.join(line_list))

notepad.new()

editor.paste()

editor.copyText('')

Couldn’t you just do

notepad.new()
editor.setText('\r\n'.join(line_list))

and thus avoid overwriting the user’s clipboard?

guy038

Hello, All,

So, I followed @mark-olson’s excellent suggestion to bypass the clipboard !
Now, a RuntimeError may occur when searching for the NON-SPACE count. I catch this exception and set the Err_Regex variable to True, so that a warning message is displayed in the report. But, even when the Err_Regex variable is False, the result is not totally guaranteed if the analyzed file contains characters over the BMP.

So, globally, whatever the Err_Regex status, the NON-SPACE count value may be increased or decreased by 1, in some cases ( still unclear ) !
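
As the NON-SPACE count may be off by one, an independent cross-check can help to locate the difference. Here is a minimal sketch, assuming Python 3 and assuming that the text is already available as a Unicode string ( for instance, obtained beforehand with editor.getText() ). The helper name count_non_space_runs is hypothetical. Note that Python's notion of white space is neither exactly the Boost \s class nor the Unicode White_Space list, so small differences remain possible with unusual characters.

```python
def count_non_space_runs(text):
    """Count the maximal runs of non-whitespace characters in a Unicode string."""
    # str.split() without argument splits on any run of whitespace
    # and never returns empty strings
    return len(text.split())

# Four runs : 'ab', 'cd', one character over the BMP and 'ef'
sample = u'ab  cd\t\U0001d71c\r\n\r\nef'

print(count_non_space_runs(sample))    # 4
```

If this figure and the one of the Summary report disagree on a given file, bisecting the file should show which character is responsible.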

Here is the v0.7 version of my script ( I indeed gave a version number to my successive attempts ! )

# encoding=utf-8

#-------------------------------------------------------------------------
#                    STATISTICS about the CURRENT file ( v0.7 )
#-------------------------------------------------------------------------

from __future__ import print_function    # for Python2 compatibility

from Npp import *

import re

import os, time, datetime

import ctypes

from ctypes.wintypes import BOOL, HWND, WPARAM, LPARAM, UINT

# --------------------------------------------------------------------------------------------------------------------------------------------------------------
#  From @alan-kilborn, in post https://community.notepad-plus-plus.org/topic/21733/pythonscript-different-behavior-in-script-vs-in-immediate-mode/4
# --------------------------------------------------------------------------------------------------------------------------------------------------------------

def npp_get_statusbar(statusbar_item_number):

    WNDENUMPROC = ctypes.WINFUNCTYPE(BOOL, HWND, LPARAM)
    FindWindowW = ctypes.windll.user32.FindWindowW
    FindWindowExW = ctypes.windll.user32.FindWindowExW
    SendMessageW = ctypes.windll.user32.SendMessageW
    LRESULT = LPARAM
    SendMessageW.restype = LRESULT
    SendMessageW.argtypes = [ HWND, UINT, WPARAM, LPARAM ]
    EnumChildWindows = ctypes.windll.user32.EnumChildWindows
    GetClassNameW = ctypes.windll.user32.GetClassNameW
    create_unicode_buffer = ctypes.create_unicode_buffer

    SBT_OWNERDRAW = 0x1000
    WM_USER = 0x400; SB_GETTEXTLENGTHW = WM_USER + 12; SB_GETTEXTW = WM_USER + 13

    npp_get_statusbar.STATUSBAR_HANDLE = None

    def get_result_from_statusbar(statusbar_item_number):
        assert statusbar_item_number <= 5
        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTLENGTHW, statusbar_item_number, 0)
        length = retcode & 0xFFFF
        type = (retcode >> 16) & 0xFFFF
        assert (type != SBT_OWNERDRAW)
        text_buffer = create_unicode_buffer(length + 1)  # + 1 for the terminating NUL written by SB_GETTEXTW
        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTW, statusbar_item_number, ctypes.addressof(text_buffer))
        retval = '{}'.format(text_buffer[:length])
        return retval

    def EnumCallback(hwnd, lparam):
        curr_class = create_unicode_buffer(256)
        GetClassNameW(hwnd, curr_class, 256)
        if curr_class.value.lower() == "msctls_statusbar32":
            npp_get_statusbar.STATUSBAR_HANDLE = hwnd
            return False  # stop the enumeration
        return True  # continue the enumeration

    npp_hwnd = FindWindowW(u"Notepad++", None)
    EnumChildWindows(npp_hwnd, WNDENUMPROC(EnumCallback), 0)
    if npp_get_statusbar.STATUSBAR_HANDLE: return get_result_from_statusbar(statusbar_item_number)
    assert False

St_bar = npp_get_statusbar(4)  # Zone 4 ( STATUSBARSECTION.UNICODETYPE )

Continuation on next post

guy038

guy038

Hi all,

Continuation of version v0.7 of the script :

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

def number(occ):
    # Callback for editor.research() : called once per match, it just increments the global counter
    global num
    num += 1

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_encoding = str(notepad.getEncoding())

if Curr_encoding == 'ENC8BIT':
    Curr_encoding = 'ANSI'

if Curr_encoding == 'COOKIE':
    Curr_encoding = 'UTF-8'

if Curr_encoding == 'UTF8':
    Curr_encoding = 'UTF-8-BOM'

if Curr_encoding == 'UCS2BE':
    Curr_encoding = 'UTF-16 BE BOM'

if Curr_encoding == 'UCS2LE':
    Curr_encoding = 'UTF-16 LE BOM'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    Line_title = 95
else:
    Line_title = 75

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

File_name = notepad.getCurrentFilename()

if os.path.isfile(File_name) == True:

    Creation_date = time.ctime(os.path.getctime(File_name))

    Modif_date = time.ctime(os.path.getmtime(File_name))

    Size_length = os.path.getsize(File_name)

    RO_flag = 'YES'

    if os.access(File_name, os.W_OK):
        RO_flag = 'NO'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

RO_editor = 'NO'

if editor.getReadOnly() == True:
    RO_editor = 'YES'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if notepad.getCurrentView() == 0:
    Curr_view = 'MAIN View'
else:
    Curr_view = 'SECONDARY view'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_lang = notepad.getCurrentLang()

Lang_desc = notepad.getLanguageDesc(Curr_lang)

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if editor.getEOLMode() == 0:
    Curr_eol = 'Windows (CR LF)'

if editor.getEOLMode() == 1:
    Curr_eol = 'Macintosh (CR)'

if editor.getEOLMode() == 2:
    Curr_eol = 'Unix (LF)'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_wrap = 'NO'

if editor.getWrapMode() == 1:
    Curr_wrap = 'YES'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'[^\r\n]', number)

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    editor.research(r'(?![\r\n])[\x{0000}-\x{007F}]', number)

Total_1_byte = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    editor.research(r'[\x{0080}-\x{07FF}]', number)

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    editor.research(r'(?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]', number)  #  ALL BMP vchars ( With PYTHON, the [^\r\n\x{D800}-\x{DFFF}] syntax does NOT work properly !)

Total_2_bytes = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    editor.research(r'(?![\x{D800}-\x{DFFF}])[\x{0800}-\x{FFFF}]', number)

Total_3_bytes = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Total_BMP = Total_1_byte + Total_2_bytes + Total_3_bytes

# --------------------------------------------------------------------------------------------------------------------------------------------------------------
num = 0
editor.research(r'[^\r\n]', number)

Total_standard = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Total_4_bytes = 0  #  By default

if Curr_encoding != 'ANSI':
    Total_4_bytes = Total_standard - Total_BMP

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\r|\n', number)

Total_EOL = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Total_chars = Total_EOL + Total_standard

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Curr_encoding == 'ANSI':
    Bytes_length = Total_EOL + Total_1_byte

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    Bytes_length = Total_EOL + Total_1_byte + 2 * Total_2_bytes + 3 * Total_3_bytes + 4 * Total_4_bytes

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    Bytes_length = 2 * Total_EOL + 2 * Total_BMP + 4 * Total_4_bytes

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

BOM = 0  #  Default ANSI and UTF-8

if Curr_encoding == 'UTF-8-BOM':
    BOM = 3

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    BOM = 2

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Buffer_length = Bytes_length + BOM

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'[^\r\n\t\x20]', number)

Non_blank_chars = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\w+', number)

Words_count = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Err_Regex = False

num = 0

if Curr_encoding == 'ANSI' or Total_4_bytes == 0:
    editor.research(r'\S+', number)
else:
    try:
        editor.research(r'(?:(?!\s).[\x{D800}-\x{DFFF}]?)+', number)
    except RuntimeError:
        Err_Regex = True

Non_space_count = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'(?<!\f)^(?:\r\n|\r|\n)', number)
else:
    editor.research(r'(?<![\f\x{0085}\x{2028}\x{2029}])^(?:\r\n|\r|\n)', number)

Empty_lines = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'(?<!\f)^[\t\x20]+(?:\r\n|\r|\n|\z)', number)
else:
    editor.research(r'(?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]+(?:\r\n|\r|\n|\z)', number)

Blank_lines = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Emp_blk_lines = Empty_lines + Blank_lines

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'(?-s)\r\n|\r|\n|(?:.|\f)\z', number)
else:
    editor.research(r'(?-s)\r\n|\r|\n|(?:.|[\f\x{0085}\x{2028}\x{2029}])\z', number)

Total_lines = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Non_blk_lines = Total_lines - Emp_blk_lines

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Num_sel = editor.getSelections()  # Get ALL selections ( EMPTY or NOT )

# print ('Res = ', Num_sel)

if Num_sel != 0:

    Bytes_count = 0
    Chars_count = 0

    for n in range(Num_sel):

        Bytes_count += editor.getSelectionNEnd(n) - editor.getSelectionNStart(n)

        Chars_count += editor.countCharacters(editor.getSelectionNStart(n), editor.getSelectionNEnd(n))

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

    if Chars_count < 2:
        Txt_chars = ' selected char ('

    else:
        Txt_chars = ' selected chars ('


    if Bytes_count < 2:
        Txt_bytes = ' selected byte) in '

    else:
        Txt_bytes = ' selected bytes) in '

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

    if Num_sel < 2 and Bytes_count == 0:
        Txt_ranges = ' EMPTY range\n'

    if Num_sel < 2 and Bytes_count > 0:
        Txt_ranges = ' range\n'

    if Num_sel > 1 and Bytes_count == 0:
        Txt_ranges = ' EMPTY ranges\n'

    if Num_sel > 1 and Bytes_count > 0:
        Txt_ranges = ' ranges (EMPTY or NOT)\n'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

line_list = []  # empty list

line_list.append ('-' * Line_title)

line_list.append (' ' * int((Line_title - 37) / 2) + 'SUMMARY on ' + str(datetime.datetime.now()))

line_list.append ('-' * Line_title +'\n')

line_list.append (' FULL File Path    :  ' + File_name + '\n')

if os.path.isfile(File_name) == True:

    line_list.append(' CREATION     Date :  ' + Creation_date)

    line_list.append(' MODIFICATION Date :  ' + Modif_date + '\n')

    line_list.append(' READ-ONLY flag    :  ' + RO_flag )

line_list.append (' READ-ONLY editor  :  ' + RO_editor + '\n\n')

line_list.append (' Current VIEW      :  ' + Curr_view + '\n')

line_list.append (' Current ENCODING  :  ' + Curr_encoding + '\n')

line_list.append (' Current LANGUAGE  :  ' + str(Curr_lang) + '  (' + Lang_desc + ')\n')

line_list.append (' Current Line END  :  ' + Curr_eol + '\n')

line_list.append (' Current WRAPPING  :  ' + Curr_wrap + '\n\n')

line_list.append (' 1-BYTE  Chars     :  ' + str(Total_1_byte))

line_list.append (' 2-BYTES Chars     :  ' + str(Total_2_bytes))

line_list.append (' 3-BYTES Chars     :  ' + str(Total_3_bytes) + '\n')

line_list.append (' Sum BMP Chars     :  ' + str(Total_BMP))

line_list.append (' 4-BYTES Chars     :  ' + str(Total_4_bytes) + '\n')

line_list.append (' CHARS w/o CR & LF :  ' + str(Total_standard))

line_list.append (' EOL ( CR or LF )  :  ' + str(Total_EOL) + '\n')

line_list.append (' TOTAL characters  :  ' + str(Total_chars) + '\n\n')

if Curr_encoding == 'ANSI':
    line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b)')

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b + '\
    + str(Total_2_bytes) + ' x 2b + ' + str(Total_3_bytes) + ' x 3b + ' + str(Total_4_bytes) + ' x 4b)')

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 2 + ' + str(Total_BMP) + ' x 2b + ' + str(Total_4_bytes) + ' x 4b)')

line_list.append (' Byte Order Mark   :  ' + str(BOM) + '\n')

line_list.append (' BUFFER Length     :  ' + str(Buffer_length))

if os.path.isfile(File_name) == True:
    line_list.append (' Length on DISK    :  ' + str(Size_length) + '\n\n')
else:
    line_list.append ('\n')

line_list.append (' NON-Blank Chars   :  ' + str(Non_blank_chars) + '\n')

line_list.append (' WORDS     Count   :  ' + str(Words_count) + ' (Caution !)\n')

if Err_Regex == False:
    line_list.append (' NON-SPACE Count   :  ' + str(Non_space_count) + '\n\n')
else:
    line_list.append (' NON-SPACE Count   :  ' + str(Non_space_count) + ' (ERROR : Ran out of stack space trying to match the regular expressions !)\n\n')

line_list.append (' True EMPTY lines  :  ' + str(Empty_lines))

line_list.append (' True BLANK lines  :  ' + str(Blank_lines) + '\n')

line_list.append (' EMPTY/BLANK lines :  ' + str(Emp_blk_lines) + '\n')

line_list.append (' NON-BLANK lines   :  ' + str(Non_blk_lines))

line_list.append (' TOTAL Lines       :  ' + str(Total_lines) + '\n\n')

line_list.append (' SELECTION(S)      :  ' + str(Chars_count) + Txt_chars + str(Bytes_count) + Txt_bytes + str(Num_sel) + Txt_ranges)

notepad.new()

editor.setText('\r\n'.join(line_list))

if St_bar != 'ANSI' and St_bar != 'UTF-8' and St_bar != 'UTF-8-BOM' and St_bar != 'UTF-16 BE BOM' and St_bar != 'UTF-16 LE BOM':

    if Curr_encoding == 'UTF-8':  #  SAME value for both an 'UTF-8' or 'ANSI' file, when RE-INTERPRETED with the 'Encoding > Character Set > ...' feature

        notepad.prompt ('CURRENT file re-interpreted as ' + St_bar + '  =>  Possible ERRONEOUS results' + \
                        '\nSo, CLOSE the file WITHOUT saving, RESTORE it (CTRL + SHIFT + T) and RESTART script', '!!! WARNING !!!', '')

# ----Aé☀𝜜-----------------------------------------------------------------------------------------------------------------------------------------------------

So, just test this script against any file, to find any possible bug or limitation !!

I’ve also heard of compiled regexes in Python. Would that be interesting for this script ?
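
About compiled regexes : re.compile() belongs to Python's own re module, whereas editor.research() hands its pattern to the Boost regex engine of Notepad++. So, seemingly, compiling would not speed up the searches of this script. It would only matter if the counts were computed on the Python side. Here is a minimal sketch of that alternative, assuming Python 3 and a text already available as a Unicode string. The names WORD_RE, EOL_RE and count_matches are hypothetical, and Python's \w is not identical to the Boost one, so the figures may differ from those of the report.

```python
import re

# Patterns compiled once, then reused for every count
WORD_RE = re.compile(r'\w+')
EOL_RE = re.compile(r'\r\n|\r|\n')

def count_matches(pattern, text):
    """Count the non-overlapping matches of a compiled pattern in text."""
    return sum(1 for _ in pattern.finditer(text))

sample = u'one two\r\nthree\nfour'

print(count_matches(WORD_RE, sample))    # 4
print(count_matches(EOL_RE, sample))     # 2
```

The re module also keeps an internal cache of recently used patterns, so, even on the Python side, the gain of an explicit compilation is usually small.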

Best Regards,

guy038

guy038

Hi, All,

I realized that the line endings, in the Summary report, were a mess. Thus, by defining a Line_end variable equal to \r\n, the results are more harmonious !

One advantage : if you do not want any supplementary line-break, in the Summary report, simply change the line :

Line_end = '\r\n'

to this one :

Line_end = ''
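
To see what this setting changes, here is a minimal sketch of how such a Line_end variable could be used when building the report. It is an assumption about the structure, not the actual v0.8 code : the build_report helper and its rows argument are hypothetical. The point is that the supplementary line-break and the '\r\n' used to join the lines are of the same kind, instead of a mix of \n and \r\n.

```python
def build_report(rows, line_end='\r\n'):
    """Build the report text from (label, value, extra_break) tuples."""
    lines = []
    for label, value, extra_break in rows:
        line = ' {:<18}:  {}'.format(label, value)    # same layout as the script labels
        if extra_break:
            line += line_end                          # supplementary line-break, if any
        lines.append(line)
    return '\r\n'.join(lines)

rows = [('TOTAL Lines', 3, True), ('NON-BLANK lines', 2, False)]

print(repr(build_report(rows)))          # one empty line between the two rows
print(repr(build_report(rows, '')))      # no supplementary line-break
```

With line_end = '', the rows simply follow each other, which gives the compact form of the report.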

So, here is the v0.8 version of my script :

# encoding=utf-8

#-------------------------------------------------------------------------
#                    STATISTICS about the CURRENT file ( v0.8 )
#-------------------------------------------------------------------------

from __future__ import print_function    # for Python2 compatibility

from Npp import *

import re

import os, time, datetime

import ctypes

from ctypes.wintypes import BOOL, HWND, WPARAM, LPARAM, UINT

# --------------------------------------------------------------------------------------------------------------------------------------------------------------
#  From @alan-kilborn, in post https://community.notepad-plus-plus.org/topic/21733/pythonscript-different-behavior-in-script-vs-in-immediate-mode/4
# --------------------------------------------------------------------------------------------------------------------------------------------------------------

def npp_get_statusbar(statusbar_item_number):

    WNDENUMPROC = ctypes.WINFUNCTYPE(BOOL, HWND, LPARAM)
    FindWindowW = ctypes.windll.user32.FindWindowW
    FindWindowExW = ctypes.windll.user32.FindWindowExW
    SendMessageW = ctypes.windll.user32.SendMessageW
    LRESULT = LPARAM
    SendMessageW.restype = LRESULT
    SendMessageW.argtypes = [ HWND, UINT, WPARAM, LPARAM ]
    EnumChildWindows = ctypes.windll.user32.EnumChildWindows
    GetClassNameW = ctypes.windll.user32.GetClassNameW
    create_unicode_buffer = ctypes.create_unicode_buffer

    SBT_OWNERDRAW = 0x1000
    WM_USER = 0x400; SB_GETTEXTLENGTHW = WM_USER + 12; SB_GETTEXTW = WM_USER + 13

    npp_get_statusbar.STATUSBAR_HANDLE = None

    def get_result_from_statusbar(statusbar_item_number):
        assert statusbar_item_number <= 5
        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTLENGTHW, statusbar_item_number, 0)
        length = retcode & 0xFFFF
        type = (retcode >> 16) & 0xFFFF
        assert (type != SBT_OWNERDRAW)
        text_buffer = create_unicode_buffer(length + 1)  # + 1 for the terminating NUL written by SB_GETTEXTW
        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTW, statusbar_item_number, ctypes.addressof(text_buffer))
        retval = '{}'.format(text_buffer[:length])
        return retval

    def EnumCallback(hwnd, lparam):
        curr_class = create_unicode_buffer(256)
        GetClassNameW(hwnd, curr_class, 256)
        if curr_class.value.lower() == "msctls_statusbar32":
            npp_get_statusbar.STATUSBAR_HANDLE = hwnd
            return False  # stop the enumeration
        return True  # continue the enumeration

    npp_hwnd = FindWindowW(u"Notepad++", None)
    EnumChildWindows(npp_hwnd, WNDENUMPROC(EnumCallback), 0)
    if npp_get_statusbar.STATUSBAR_HANDLE: return get_result_from_statusbar(statusbar_item_number)
    assert False

St_bar = npp_get_statusbar(4)  # Zone 4 ( STATUSBARSECTION.UNICODETYPE )

Continuation on next post

guy038

guy038

Hi all,

Continuation of version v0.8 of the script :

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

def number(occ):
    # Callback for editor.research() : called once per match, it just increments the global counter
    global num
    num += 1

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_encoding = str(notepad.getEncoding())

if Curr_encoding == 'ENC8BIT':
    Curr_encoding = 'ANSI'

if Curr_encoding == 'COOKIE':
    Curr_encoding = 'UTF-8'

if Curr_encoding == 'UTF8':
    Curr_encoding = 'UTF-8-BOM'

if Curr_encoding == 'UCS2BE':
    Curr_encoding = 'UTF-16 BE BOM'

if Curr_encoding == 'UCS2LE':
    Curr_encoding = 'UTF-16 LE BOM'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    Line_title = 95
else:
    Line_title = 75

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

File_name = notepad.getCurrentFilename()

if os.path.isfile(File_name) == True:

    Creation_date = time.ctime(os.path.getctime(File_name))

    Modif_date = time.ctime(os.path.getmtime(File_name))

    Size_length = os.path.getsize(File_name)

    RO_flag = 'YES'

    if os.access(File_name, os.W_OK):
        RO_flag = 'NO'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

RO_editor = 'NO'

if editor.getReadOnly() == True:
    RO_editor = 'YES'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if notepad.getCurrentView() == 0:
    Curr_view = 'MAIN View'
else:
    Curr_view = 'SECONDARY view'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_lang = notepad.getCurrentLang()

Lang_desc = notepad.getLanguageDesc(Curr_lang)

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if editor.getEOLMode() == 0:
    Curr_eol = 'Windows (CR LF)'

if editor.getEOLMode() == 1:
    Curr_eol = 'Macintosh (CR)'

if editor.getEOLMode() == 2:
    Curr_eol = 'Unix (LF)'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_wrap = 'NO'

if editor.getWrapMode() == 1:
    Curr_wrap = 'YES'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'[^\r\n]', number)

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    editor.research(r'(?![\r\n])[\x{0000}-\x{007F}]', number)

Total_1_byte = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    editor.research(r'[\x{0080}-\x{07FF}]', number)

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    editor.research(r'(?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]', number)  #  ALL BMP vchars ( With PYTHON, the [^\r\n\x{D800}-\x{DFFF}] syntax does NOT work properly !)

Total_2_bytes = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    editor.research(r'(?![\x{D800}-\x{DFFF}])[\x{0800}-\x{FFFF}]', number)

Total_3_bytes = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Total_BMP = Total_1_byte + Total_2_bytes + Total_3_bytes

# --------------------------------------------------------------------------------------------------------------------------------------------------------------
num = 0
editor.research(r'[^\r\n]', number)

Total_standard = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Total_4_bytes = 0  #  By default

if Curr_encoding != 'ANSI':
    Total_4_bytes = Total_standard - Total_BMP

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\r|\n', number)

Total_EOL = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Total_chars = Total_EOL + Total_standard

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Curr_encoding == 'ANSI':
    Bytes_length = Total_EOL + Total_1_byte

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    Bytes_length = Total_EOL + Total_1_byte + 2 * Total_2_bytes + 3 * Total_3_bytes + 4 * Total_4_bytes

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    Bytes_length = 2 * Total_EOL + 2 * Total_BMP + 4 * Total_4_bytes

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

BOM = 0  #  Default ANSI and UTF-8

if Curr_encoding == 'UTF-8-BOM':
    BOM = 3

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    BOM = 2

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Buffer_length = Bytes_length + BOM

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'[^\r\n\t\x20]', number)

Non_blank_chars = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\w+', number)

Words_count = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Err_Regex = False

num = 0

if Curr_encoding == 'ANSI' or Total_4_bytes == 0:
    editor.research(r'\S+', number)
else:
    try:
        editor.research(r'(?:(?!\s).[\x{D800}-\x{DFFF}]?)+', number)
    except RuntimeError:
        Err_Regex = True

Non_space_count = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'(?<!\f)^(?:\r\n|\r|\n)', number)
else:
    editor.research(r'(?<![\f\x{0085}\x{2028}\x{2029}])^(?:\r\n|\r|\n)', number)

Empty_lines = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'(?<!\f)^[\t\x20]+(?:\r\n|\r|\n|\z)', number)
else:
    editor.research(r'(?<![\f\x{0085}\x{2028}\x{2029}])^[\t\x20]+(?:\r\n|\r|\n|\z)', number)

Blank_lines = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Emp_blk_lines = Empty_lines + Blank_lines

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'(?-s)\r\n|\r|\n|(?:.|\f)\z', number)
else:
    editor.research(r'(?-s)\r\n|\r|\n|(?:.|[\f\x{0085}\x{2028}\x{2029}])\z', number)

Total_lines = num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Non_blk_lines = Total_lines - Emp_blk_lines

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Num_sel = editor.getSelections()  # Get ALL selections ( EMPTY or NOT )

# print ('Res = ', Num_sel)

if Num_sel != 0:

    Bytes_count = 0
    Chars_count = 0

    for n in range(Num_sel):

        Bytes_count += editor.getSelectionNEnd(n) - editor.getSelectionNStart(n)

        Chars_count += editor.countCharacters(editor.getSelectionNStart(n), editor.getSelectionNEnd(n))

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

    if Chars_count < 2:
        Txt_chars = ' selected char ('

    else:
        Txt_chars = ' selected chars ('


    if Bytes_count < 2:
        Txt_bytes = ' selected byte) in '

    else:
        Txt_bytes = ' selected bytes) in '

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

    if Num_sel < 2 and Bytes_count == 0:
        Txt_ranges = ' EMPTY range\n'

    if Num_sel < 2 and Bytes_count > 0:
        Txt_ranges = ' range\n'

    if Num_sel > 1 and Bytes_count == 0:
        Txt_ranges = ' EMPTY ranges\n'

    if Num_sel > 1 and Bytes_count > 0:
        Txt_ranges = ' ranges (EMPTY or NOT)\n'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

line_list = []  # empty list

Line_end = '\r\n'

line_list.append ('-' * Line_title)

line_list.append (' ' * int((Line_title - 37) / 2) + 'SUMMARY on ' + str(datetime.datetime.now()))

line_list.append ('-' * Line_title + Line_end)

line_list.append (' FULL File Path    :  ' + File_name + Line_end)

if os.path.isfile(File_name) == True:

    line_list.append(' CREATION     Date :  ' + Creation_date)

    line_list.append(' MODIFICATION Date :  ' + Modif_date + Line_end)

    line_list.append(' READ-ONLY flag    :  ' + RO_flag )

line_list.append (' READ-ONLY editor  :  ' + RO_editor + Line_end * 2)

line_list.append (' Current VIEW      :  ' + Curr_view + Line_end)

line_list.append (' Current ENCODING  :  ' + Curr_encoding + Line_end)

line_list.append (' Current LANGUAGE  :  ' + str(Curr_lang) + '  (' + Lang_desc + ')' + Line_end)

line_list.append (' Current Line END  :  ' + Curr_eol + Line_end)

line_list.append (' Current WRAPPING  :  ' + Curr_wrap + Line_end * 2)

line_list.append (' 1-BYTE  Chars     :  ' + str(Total_1_byte))

line_list.append (' 2-BYTES Chars     :  ' + str(Total_2_bytes))

line_list.append (' 3-BYTES Chars     :  ' + str(Total_3_bytes) + Line_end)

line_list.append (' Sum BMP Chars     :  ' + str(Total_BMP))

line_list.append (' 4-BYTES Chars     :  ' + str(Total_4_bytes) + Line_end)

line_list.append (' CHARS w/o CR & LF :  ' + str(Total_standard))

line_list.append (' EOL ( CR or LF )  :  ' + str(Total_EOL) + Line_end)

line_list.append (' TOTAL characters  :  ' + str(Total_chars) + Line_end * 2)

if Curr_encoding == 'ANSI':
    line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b)')

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b + '\
    + str(Total_2_bytes) + ' x 2b + ' + str(Total_3_bytes) + ' x 3b + ' + str(Total_4_bytes) + ' x 4b)')

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 2 + ' + str(Total_BMP) + ' x 2b + ' + str(Total_4_bytes) + ' x 4b)')

line_list.append (' Byte Order Mark   :  ' + str(BOM) + Line_end)

line_list.append (' BUFFER Length     :  ' + str(Buffer_length))

if os.path.isfile(File_name) == True:
    line_list.append (' Length on DISK    :  ' + str(Size_length) + Line_end * 2)
else:
    line_list.append ('\n')

line_list.append (' NON-Blank Chars   :  ' + str(Non_blank_chars) + Line_end)

line_list.append (' WORDS     Count   :  ' + str(Words_count) + ' (Caution !)' + Line_end)

if Err_Regex == False:
    line_list.append (' NON-SPACE Count   :  ' + str(Non_space_count) + Line_end * 2)
else:
    line_list.append (' NON-SPACE Count   :  ' + str(Non_space_count) + ' (ERROR : Ran out of stack space trying to match the regular expressions !)' + Line_end * 2)

line_list.append (' True EMPTY lines  :  ' + str(Empty_lines))

line_list.append (' True BLANK lines  :  ' + str(Blank_lines) + Line_end)

line_list.append (' EMPTY/BLANK lines :  ' + str(Emp_blk_lines) + Line_end)

line_list.append (' NON-BLANK lines   :  ' + str(Non_blk_lines))

line_list.append (' TOTAL Lines       :  ' + str(Total_lines) + Line_end * 2)

line_list.append (' SELECTION(S)      :  ' + str(Chars_count) + Txt_chars + str(Bytes_count) + Txt_bytes + str(Num_sel) + Txt_ranges)

notepad.new()

editor.setText('\r\n'.join(line_list))

if St_bar != 'ANSI' and St_bar != 'UTF-8' and St_bar != 'UTF-8-BOM' and St_bar != 'UTF-16 BE BOM' and St_bar != 'UTF-16 LE BOM':

    if Curr_encoding == 'UTF-8':  #  SAME value for both an 'UTF-8' or 'ANSI' file, when RE-INTERPRETED with the 'Encoding > Character Set > ...' feature

        notepad.prompt ('CURRENT file re-interpreted as ' + St_bar + '  =>  Possible ERRONEOUS results' + \
                        '\nSo, CLOSE the file WITHOUT saving, RESTORE it (CTRL + SHIFT + T) and RESTART script', '!!! WARNING !!!', '')

# ----Aé☀𝜜-----------------------------------------------------------------------------------------------------------------------------------------------------

Best Regards,

guy038

guy038

Hi, All,

You’ll find, below, version v1.0 of my script. I changed a lot of things :

I added a timer to get the execution time of the script, which is written right after the current date, at the beginning of the summary
I modified some regexes, as well as the order in which they are run, in order to improve performance
I used the PythonScript methods editor.getLength(), editor.countCharacters(0, editor.getLength()) and editor.getLineCount() to get, respectively, the Bytes_length value ( without a possible BOM ), the Total_chars value and the Total_lines value. Note that, in case of an UTF-8 or UTF-8-BOM encoded file, we get two relations :
- (A) Bytes_length - Total_EOL - Total_1_byte - 2 × Total_2_bytes - 3 × Total_3_bytes = 4 × Total_4_bytes
- (B) Total_chars - Total_EOL - Total_1_byte - Total_2_bytes - Total_3_bytes = Total_4_bytes

So, by subtracting the relation (B) from the relation (A), we can deduce the equation :

Total_4_bytes = ( Bytes_length - Total_chars - Total_2_bytes - 2 × Total_3_bytes ) / 3

and then :

Total_1_byte = Total_chars - Total_EOL - Total_2_bytes - Total_3_bytes - Total_4_bytes

Thus, after counting the Total_2_bytes and Total_3_bytes values, the two results Total_4_bytes and Total_1_byte are easily deduced. This new method decreases the execution time of the script by a factor of 2 to 3 because, most of the time, the file contains only 1-byte chars :-))
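
As a quick sanity check of the two equations above, here is a minimal sketch in plain Python 3 ( run outside Notepad++, so without the editor object ). The helper names utf8_class_counts and deduce_1_and_4, as well as the sample string, are only assumptions for illustration and are not part of the script :

```python
def utf8_class_counts(text):
    """Count EOL chars and the other chars by UTF-8 width (1 to 4 bytes)."""
    eol = one = two = three = four = 0
    for ch in text:
        width = len(ch.encode('utf-8'))
        if ch in '\r\n':
            eol += 1          # CR and LF are counted apart, as in the script
        elif width == 1:
            one += 1
        elif width == 2:
            two += 1
        elif width == 3:
            three += 1
        else:
            four += 1
    return eol, one, two, three, four


def deduce_1_and_4(bytes_length, total_chars, total_eol, total_2, total_3):
    """Deduce the 1-byte and 4-byte counts from the relation (A) - (B)."""
    total_4 = (bytes_length - total_chars - total_2 - 2 * total_3) // 3
    total_1 = total_chars - total_eol - total_2 - total_3 - total_4
    return total_1, total_4


# Hypothetical sample : 'A' (1 byte), 'é' (2), '☀' (3), '𝜜' (4), CR LF, 'abc'
sample = 'A\u00e9\u2600\U0001D71C\r\nabc'

bytes_length = len(sample.encode('utf-8'))    # what editor.getLength() would give
total_chars = len(sample)                     # what editor.countCharacters() would give
eol, one, two, three, four = utf8_class_counts(sample)

deduced_1, deduced_4 = deduce_1_and_4(bytes_length, total_chars, eol, two, three)
print(deduced_1 == one, deduced_4 == four)
```

With this sample, the deduced values agree with the direct count, so only the 2-byte and 3-byte chars need a regex search.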

However, the length returned by editor.getLength() wrongly remains the same in case of an UTF-16 BE BOM or UTF-16 LE BOM encoded file. Thus, I needed to compute the Total_4_bytes and Bytes_length values, from the Total_2_bytes count, with the relations :

Total_4_bytes = Total_chars - Total_EOL - Total_2_bytes

Bytes_length = 2 × Total_EOL + 2 × Total_2_bytes + 4 × Total_4_bytes
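
The same kind of check can be done for the UTF-16 relations. Again, this is just a sketch in plain Python 3, run outside Notepad++ ; the function name utf16_deduce and the sample string are assumptions of mine, not something taken from the script :

```python
def utf16_deduce(total_chars, total_eol, total_2):
    """Deduce the 4-byte count and the byte length (without BOM) in UTF-16."""
    total_4 = total_chars - total_eol - total_2
    bytes_length = 2 * total_eol + 2 * total_2 + 4 * total_4
    return total_4, bytes_length


# Hypothetical sample : 6 BMP chars, 1 char over the BMP and one CR LF pair
sample = 'A\u00e9\u2600\U0001D71C\r\nabc'

total_chars = len(sample)
total_eol = sum(1 for ch in sample if ch in '\r\n')
# BMP chars other than CR and LF : each one takes 2 bytes in UTF-16
total_2 = sum(1 for ch in sample if ch not in '\r\n' and ord(ch) <= 0xFFFF)

total_4, bytes_length = utf16_deduce(total_chars, total_eol, total_2)

# 'utf-16-le' encodes without a BOM, so the two lengths are comparable
print(total_4, bytes_length == len(sample.encode('utf-16-le')))
```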

Now, because some huge files may need a long time before the Summary results are available ( even with the native N++ feature, BTW ! ), you can follow the progress of the different searches in the Python console, which is automatically shown at the beginning of the script and hidden right before outputting the results
At the end of the script, I replaced the notepad.prompt method with the notepad.messageBox method in order to display the warning ( more logical ! )

IMPORTANT :

Never switch to another tab while this script is running. Otherwise, you’ll probably get unpredictable or negative results !
If, when viewing the console messages, you think that the results take too long to arrive for a specific file and you prefer to abort its Summary report, simply stop the current Python script with the usual Plugins > Python Script > Stop script menu option

Now, I was a bit upset by some inconsistent results regarding the number of NON-SPACE strings, when the current file, with a Unicode encoding, contains some characters over the BMP

So, I searched among all my posts, since 2013, as well as some others used as documentation, keeping only those which contain some four-byte characters. Here is the list of these files, with the reported results :

•=============================•===========•=================•==================•============•================•
|                             |           |    Expected     |  Summary Report  |            |                |
|           Filename          |   4_BYTES |         NON-SPACE count            | Difference |    Encoding    |
|                             |           | (?:(?!\s).[\x{D800}-\x{DFFF}]?)+   |            |                |
•=============================•===========•=================•==================•============•================•
|  Symbola_Monospacified.txt  |   11,951  |     199,891     |      199,882     |      - 9   |  UTF-8-BOM     |
|  Total_Chars.txt            |  262,136  |           9     |           18     |      + 9   |  UTF-8-BOM     |
•=============================•===========•=================•==================•============•================•
|  Caractères.txt             |    2,901  |       7,361     |        7,358     |      - 3   |  UTF-8-BOM     |
|  Test_2.txt                 |    1,276  |           8     |            9     |      + 1   |  UTF-8         |
|  Test_1.txt                 |      881  |           8     |            9     |      + 1   |  UTF-8         |
|  Plane_0.txt                |        0  |           9     |           10     |      + 1   |  UCS-2 BE BOM  |
|  Clemens.txt                |    3,968  |       2,816     |        2,818     |      + 2   |  UTF-8-BOM     |
|  Planes_0+1.txt             |   65,534  |           9     |           12     |      + 3   |  UTF-8-BOM     |
•=============================•===========•=================•==================•============•================•
|  Chars_Over_BMP.txt         |       28  |         455     |          455     |        0   |  UTF-8-BOM     |
|  Entites_by_Name.txt        |      133  |      15,968     |       15,968     |        0   |  UTF-8         |
|  Entites_by_Number.txt      |      133  |      15,968     |       15,968     |        0   |  UTF-8         |
|  Invisible_chars.txt        |       31  |       3,459     |        3,459     |        0   |  UTF-8-BOM     |
|  Osmanya_Tout.txt           |      119  |         605     |          605     |        0   |  UTF-8-BOM     |
|  Smileys.txt                |    1,031  |      10,157     |       10,157     |        0   |  UTF-8-BOM     |
|  Alan_K.txt                 |      114  |      46,082     |       46,082     |        0   |  UTF-8         |
|  Alexolog.txt               |       13  |       2,199     |        2,199     |        0   |  UTF-8         |
|  André_Z.txt                |        8  |       5,860     |        5,860     |        0   |  UTF-8         |
|  Bidule.txt                 |        1  |         327     |          327     |        0   |  UTF-8         |
|  Carypt.txt                 |        1  |       3,551     |        3,551     |        0   |  UTF-8         |
|  Dean_Corso.txt             |      761  |       9,632     |        9,632     |        0   |  UTF-8         |
|  Don_Ho.txt                 |        2  |      41,426     |       41,426     |        0   |  UTF-8         |
|  Durkin.txt                 |      144  |       4,638     |        4,638     |        0   |  UTF-8         |
|  Dylan.txt                  |       34  |       2,180     |        2,180     |        0   |  UTF-8         |
|  Furek.txt                  |       20  |         499     |          499     |        0   |  UTF-8         |
|  Gary_2.txt                 |        2  |         458     |          458     |        0   |  UTF-8         |
|  Haleba.txt                 |        5  |         817     |          817     |        0   |  UTF-8         |
|  ImSpecial.txt              |        1  |         161     |          161     |        0   |  UTF-8         |
|  Joss.txt                   |        6  |         105     |          105     |        0   |  UTF-8         |
|  JR.txt                     |       39  |       1,735     |        1,735     |        0   |  UTF-8         |
|  Mark_Olson.txt             |        1  |       3,652     |        3,652     |        0   |  UTF-8         |
|  Minus_Majus.txt            |       62  |       9,931     |        9,931     |        0   |  UTF-8         |
|  Niting-jain.txt            |        4  |         537     |          537     |        0   |  UTF-8         |
|  PeterCJ.txt                |       31  |      37,323     |       37,323     |        0   |  UTF-8         |
|  Petr_jaja.txt              |       14  |       3,168     |        3,168     |        0   |  UTF-8         |
|  Pintas.txt                 |        4  |         614     |          614     |        0   |  UTF-8         |
|  Register.txt               |       20  |         242     |          242     |        0   |  UTF-8         |
|  Scott_3.txt                |        4  |      42,552     |       42,552     |        0   |  UTF-8         |
|  Skevich.txt                |        6  |         715     |          715     |        0   |  UTF-8         |
|  Statistiques.txt           |        7  |       9,012     |        9,012     |        0   |  UTF-8         |
|  Summary.txt                |        7  |       4,322     |        4,322     |        0   |  UTF-8         |
|  Summary_NEW.txt            |       10  |       8,903     |        8,903     |        0   |  UTF-8         |
|  Uzivatel.txt               |        2  |         873     |          873     |        0   |  UTF-8         |
|  Xavier_mdq.txt             |       13  |       3,652     |        3,652     |        0   |  UTF-8         |
|  Text.txt                   |    2,400  |       1,000     |        1,000     |        0   |  UTF-8         |
•=============================•===========•=================•==================•============•================•

From that list, I deduced that the NON-SPACE count is erroneous in very rare cases only, especially when the current file contains, consecutively :

All the characters of a font
All the characters of a Unicode range
All the characters of all Unicode ranges

Luckily, in all the other cases, with these four-byte chars at random positions, the Summary report always gives the right results, regarding the NON-SPACE count !
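
For anyone who wants to cross-check a NON-SPACE count outside Notepad++, here is a minimal sketch with the standard re module of Python 3, where a character over the BMP is a single item of the string, so that no surrogate handling is needed. It is only an independent sanity check, not the method used for the table above. Note that the function name non_space_count is hypothetical and that the \s class of Python may differ slightly from the one of the Boost regex engine of Notepad++ :

```python
import re


def non_space_count(text):
    """Count the runs of consecutive non-space characters (regex \\S+)."""
    return sum(1 for _ in re.finditer(r'\S+', text))


# Hypothetical sample, with three chars over the BMP ( 𝐀, 𝜋 and 🅐 )
sample = '\U0001D400\U0001D70B  abc\t\U0001F150x\r\nlast'

# The runs are : '𝐀𝜋', 'abc', '🅐x' and 'last'
print(non_space_count(sample))
```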

Here is the v1.0 version of my script, split across two posts :

# encoding=utf-8

#-------------------------------------------------------------------------
#                    STATISTICS about the CURRENT file ( v1.0 )
#-------------------------------------------------------------------------

from __future__ import print_function    # for Python2 compatibility

from Npp import *

import re

import os, time, datetime

import ctypes

from ctypes.wintypes import BOOL, HWND, WPARAM, LPARAM, UINT

# --------------------------------------------------------------------------------------------------------------------------------------------------------------
#  From @alan-kilborn, in post https://community.notepad-plus-plus.org/topic/21733/pythonscript-different-behavior-in-script-vs-in-immediate-mode/4
# --------------------------------------------------------------------------------------------------------------------------------------------------------------

def npp_get_statusbar(statusbar_item_number):

    WNDENUMPROC = ctypes.WINFUNCTYPE(BOOL, HWND, LPARAM)
    FindWindowW = ctypes.windll.user32.FindWindowW
    FindWindowExW = ctypes.windll.user32.FindWindowExW
    SendMessageW = ctypes.windll.user32.SendMessageW
    LRESULT = LPARAM
    SendMessageW.restype = LRESULT
    SendMessageW.argtypes = [ HWND, UINT, WPARAM, LPARAM ]
    EnumChildWindows = ctypes.windll.user32.EnumChildWindows
    GetClassNameW = ctypes.windll.user32.GetClassNameW
    create_unicode_buffer = ctypes.create_unicode_buffer

    SBT_OWNERDRAW = 0x1000
    WM_USER = 0x400; SB_GETTEXTLENGTHW = WM_USER + 12; SB_GETTEXTW = WM_USER + 13

    npp_get_statusbar.STATUSBAR_HANDLE = None

    def get_result_from_statusbar(statusbar_item_number):
        assert statusbar_item_number <= 5
        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTLENGTHW, statusbar_item_number, 0)
        length = retcode & 0xFFFF
        type = (retcode >> 16) & 0xFFFF
        assert (type != SBT_OWNERDRAW)
        text_buffer = create_unicode_buffer(length + 1)  # + 1 for the terminating NUL char written by SB_GETTEXTW
        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTW, statusbar_item_number, ctypes.addressof(text_buffer))
        retval = '{}'.format(text_buffer[:length])
        return retval

    def EnumCallback(hwnd, lparam):
        curr_class = create_unicode_buffer(256)
        GetClassNameW(hwnd, curr_class, 256)
        if curr_class.value.lower() == "msctls_statusbar32":
            npp_get_statusbar.STATUSBAR_HANDLE = hwnd
            return False  # stop the enumeration
        return True  # continue the enumeration

    npp_hwnd = FindWindowW(u"Notepad++", None)
    EnumChildWindows(npp_hwnd, WNDENUMPROC(EnumCallback), 0)
    if npp_get_statusbar.STATUSBAR_HANDLE: return get_result_from_statusbar(statusbar_item_number)
    assert False

St_bar = npp_get_statusbar(4)  # Zone 4 ( STATUSBARSECTION.UNICODETYPE )

Continued in the next post

guy038

guy038

Hi all,

Continuation of version v1.0 of the script :

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

def number(occ):
    global num
    num += 1

console.show()

console.clear()

Start_time = time.time()

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_encoding = str(notepad.getEncoding())

if Curr_encoding == 'ENC8BIT':
    Curr_encoding = 'ANSI'

if Curr_encoding == 'COOKIE':
    Curr_encoding = 'UTF-8'

if Curr_encoding == 'UTF8':
    Curr_encoding = 'UTF-8-BOM'

if Curr_encoding == 'UCS2BE':
    Curr_encoding = 'UTF-16 BE BOM'

if Curr_encoding == 'UCS2LE':
    Curr_encoding = 'UTF-16 LE BOM'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    Line_title = 95
else:
    Line_title = 75

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

File_name = notepad.getCurrentFilename()

if os.path.isfile(File_name) == True:

    Creation_date = time.ctime(os.path.getctime(File_name))

    Modif_date = time.ctime(os.path.getmtime(File_name))

    Size_length = os.path.getsize(File_name)

    RO_flag = 'YES'

    if os.access(File_name, os.W_OK):
        RO_flag = 'NO'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

RO_editor = 'NO'

if editor.getReadOnly() == True:
    RO_editor = 'YES'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if notepad.getCurrentView() == 0:
    Curr_view = 'MAIN View'
else:
    Curr_view = 'SECONDARY view'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_lang = notepad.getCurrentLang()

Lang_desc = notepad.getLanguageDesc(Curr_lang)

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if editor.getEOLMode() == 0:
    Curr_eol = 'Windows (CR LF)'

if editor.getEOLMode() == 1:
    Curr_eol = 'Macintosh (CR)'

if editor.getEOLMode() == 2:
    Curr_eol = 'Unix (LF)'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_wrap = 'NO'

if editor.getWrapMode() == 1:
    Curr_wrap = 'YES'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

print ('START')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Bytes_length = editor.getLength()

Total_chars = editor.countCharacters(0, editor.getLength())

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\r|\n', number)

Total_EOL = num

print ('EOL')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Total_standard = Total_chars - Total_EOL

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Curr_encoding == 'ANSI':

    Total_BMP = Total_standard
    
    Total_1_byte = Total_BMP

    Total_2_bytes = 0

    Total_3_bytes = 0

    Total_4_bytes = 0

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':

    num = 0
    editor.research(r'[\x{0080}-\x{07FF}]', number)

    Total_2_bytes = num

    print ('2-BYTES')

    # --------------------------------------------------------------------------------------------------------------------------------------------------------------

    num = 0
    editor.research(r'[\x{0800}-\x{D7FF}\x{E000}-\x{FFFF}]', number)

    Total_3_bytes = num

    print ('3-BYTES')

    # -----------------------------------------------------------------------------------------------------------------------------

    Total_4_bytes = ( Bytes_length - Total_chars - Total_2_bytes - 2 * Total_3_bytes ) // 3  #  EXACT integer division, so the count stays an integer with Python 3

    Total_1_byte = Total_standard - Total_2_bytes - Total_3_bytes - Total_4_bytes

    Total_BMP = Total_1_byte + Total_2_bytes + Total_3_bytes

# --------------------------------------------------------------------------------------------------------------------------------------------------------------


if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':

    num = 0
    editor.research(r'(?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]', number)  #  ALL BMP chars different from '\r' and '\n'

    Total_2_bytes = num

    Total_4_bytes = Total_standard - Total_2_bytes

    Total_BMP = Total_2_bytes

    Total_1_byte = 0

    Total_3_bytes = 0

    Bytes_length = 2 * Total_EOL + 2 * Total_BMP + 4 * Total_4_bytes

    print ('2-BYTES')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

BOM = 0  #  Default ANSI and UTF-8

if Curr_encoding == 'UTF-8-BOM':
    BOM = 3

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    BOM = 2

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Buffer_length = Bytes_length + BOM

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\t|\x20', number)

Non_blank_chars = Total_standard - num

print ('NON-BLANK')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\w+', number)

Words_count = num

print ('WORDS')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Err_regex = False

num = 0

if Curr_encoding == 'ANSI' or Total_4_bytes == 0:
    editor.research(r'\S+', number)
else:
    try:
        editor.research(r'(?:(?!\s).[\x{D800}-\x{DFFF}]?)+', number)
    except RuntimeError:
        Err_regex = True

Non_space_count = num

print ('NON-SPACE')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'\f^(?:\r\n|\r|\n)', number)
else:
    editor.research(r'[\f\x{0085}\x{2028}\x{2029}]^(?:\r\n|\r|\n)', number)

Special_empty = num

num = 0
editor.research(r'^(?:\r\n|\r|\n)', number)

Default_empty = num

Empty_lines = Default_empty - Special_empty

print ('EMPTY lines')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'\f^[\t\x20]+(?:\r\n|\r|\n|\z)', number)
else:
    editor.research(r'[\f\x{0085}\x{2028}\x{2029}]^[\t\x20]+(?:\r\n|\r|\n|\z)', number)

Special_blank = num

num = 0
editor.research(r'^[\t\x20]+(?:\r\n|\r|\n|\z)', number)

Default_blank = num

Blank_lines = Default_blank - Special_blank

print ('BLANK lines')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Emp_blk_lines = Empty_lines + Blank_lines

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Total_lines = editor.getLineCount()

num = 0
editor.research(r'(?-s)^.+\z', number)

if num == 0:
    Total_lines = Total_lines - 1

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Non_blk_lines = Total_lines - Emp_blk_lines

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Num_sel = editor.getSelections()  # Get ALL selections ( EMPTY or NOT )

if Num_sel != 0:

    Bytes_count = 0
    Chars_count = 0

    for n in range(Num_sel):

        Bytes_count += editor.getSelectionNEnd(n) - editor.getSelectionNStart(n)
        Chars_count += editor.countCharacters(editor.getSelectionNStart(n), editor.getSelectionNEnd(n))

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

    if Chars_count < 2:
        Txt_chars = ' selected char ('
    else:
        Txt_chars = ' selected chars ('


    if Bytes_count < 2:
        Txt_bytes = ' selected byte) in '
    else:
        Txt_bytes = ' selected bytes) in '

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

    if Num_sel < 2 and Bytes_count == 0:
        Txt_ranges = ' EMPTY range\n'

    if Num_sel < 2 and Bytes_count > 0:
        Txt_ranges = ' range\n'

    if Num_sel > 1 and Bytes_count == 0:
        Txt_ranges = ' EMPTY ranges\n'

    if Num_sel > 1 and Bytes_count > 0:
        Txt_ranges = ' ranges (EMPTY or NOT)\n'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

console.hide()

line_list = []  # empty list

Line_end = '\r\n'

line_list.append ('-' * Line_title)

line_list.append (' ' * int((Line_title - 54) / 2) + 'SUMMARY on ' + str(datetime.datetime.now()) + ' ( ' + str(time.time() - Start_time) + ' )')

line_list.append ('-' * Line_title + Line_end)

line_list.append (' FULL File Path    :  ' + File_name + Line_end)

if os.path.isfile(File_name) == True:

    line_list.append (' CREATION     Date :  ' + Creation_date)

    line_list.append (' MODIFICATION Date :  ' + Modif_date + Line_end)

    line_list.append (' READ-ONLY flag    :  ' + RO_flag)

line_list.append (' READ-ONLY editor  :  ' + RO_editor + Line_end * 2)

line_list.append (' Current VIEW      :  ' + Curr_view + Line_end)

line_list.append (' Current ENCODING  :  ' + Curr_encoding + Line_end)

line_list.append (' Current LANGUAGE  :  ' + str(Curr_lang) + '  (' + Lang_desc + ')' + Line_end)

line_list.append (' Current Line END  :  ' + Curr_eol + Line_end)

line_list.append (' Current WRAPPING  :  ' + Curr_wrap + Line_end * 2)

line_list.append (' 1-BYTE  Chars     :  ' + str(Total_1_byte))

line_list.append (' 2-BYTES Chars     :  ' + str(Total_2_bytes))

line_list.append (' 3-BYTES Chars     :  ' + str(Total_3_bytes) + Line_end)

line_list.append (' Sum BMP Chars     :  ' + str(Total_BMP))

line_list.append (' 4-BYTES Chars     :  ' + str(Total_4_bytes) + Line_end)

line_list.append (' CHARS w/o CR & LF :  ' + str(Total_standard))

line_list.append (' EOL ( CR or LF )  :  ' + str(Total_EOL) + Line_end)

line_list.append (' TOTAL characters  :  ' + str(Total_chars) + Line_end * 2)

if Curr_encoding == 'ANSI':
    line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b)')

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b + '\
    + str(Total_2_bytes) + ' x 2b + ' + str(Total_3_bytes) + ' x 3b + ' + str(Total_4_bytes) + ' x 4b)')

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 2 + ' + str(Total_BMP) + ' x 2b + ' + str(Total_4_bytes) + ' x 4b)')

line_list.append (' Byte Order Mark   :  ' + str(BOM) + Line_end)

line_list.append (' BUFFER Length     :  ' + str(Buffer_length))

if os.path.isfile(File_name) == True:
    line_list.append (' Length on DISK    :  ' + str(Size_length) + Line_end * 2)
else:
    if Line_end == '\r\n':
        line_list.append (Line_end)

line_list.append (' NON-Blank Count   :  ' + str(Non_blank_chars) + Line_end)

line_list.append (' WORDS     Count   :  ' + str(Words_count) + ' (Caution !)' + Line_end)

if Err_regex == False:
    line_list.append (' NON-SPACE Count   :  ' + str(Non_space_count) + Line_end * 2)
else:
    line_list.append (' NON-SPACE Count   :  ' + str(Non_space_count) + ' (Caution as " RuntimeError " occurred !)' + Line_end * 2)


line_list.append (' True EMPTY lines  :  ' + str(Empty_lines))

line_list.append (' True BLANK lines  :  ' + str(Blank_lines) + Line_end)

line_list.append (' EMPTY/BLANK lines :  ' + str(Emp_blk_lines) + Line_end)

line_list.append (' NON-BLANK lines   :  ' + str(Non_blk_lines))

line_list.append (' TOTAL Lines       :  ' + str(Total_lines) + Line_end * 2)

line_list.append (' SELECTION(S)      :  ' + str(Chars_count) + Txt_chars + str(Bytes_count) + Txt_bytes + str(Num_sel) + Txt_ranges)

notepad.new()

editor.setText('\r\n'.join(line_list))

if St_bar != 'ANSI' and St_bar != 'UTF-8' and St_bar != 'UTF-8-BOM' and St_bar != 'UTF-16 BE BOM' and St_bar != 'UTF-16 LE BOM':

    if Curr_encoding == 'UTF-8':  #  SAME value for both an 'UTF-8' or 'ANSI' file, when RE-INTERPRETED with the 'Encoding > Character Set > ...' feature

        notepad.messageBox ('CURRENT file re-interpreted as ' + St_bar + '  =>  Possible ERRONEOUS results' + \
                        '\nSo, CLOSE the file WITHOUT saving, RESTORE it (CTRL + SHIFT + T) and RESTART script', '!!! WARNING !!!')

# ----Aé☀𝜜-----------------------------------------------------------------------------------------------------------------------------------------------------

Remember that you can get a shorter summary report by changing the line :

Line_end = '\r\n'

to this one :

Line_end = ''
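
As a rough illustration of what this switch does, here is a minimal sketch that runs outside Notepad++. The function `build_report` and its two sample rows are hypothetical stand-ins for the real `line_list`; only the `Line_end` mechanism is taken from the script.

```python
def build_report(Line_end):
    # Hypothetical two-row report: only the Line_end mechanism mirrors the script.
    line_list = []
    line_list.append(' Current VIEW      :  MAIN View' + Line_end)
    line_list.append(' Current ENCODING  :  UTF-8' + Line_end)
    # Rows are always joined with CR LF, so a non-empty Line_end
    # adds one extra (blank) separator line after each row.
    return '\r\n'.join(line_list)

long_report = build_report('\r\n')   # spaced-out report
short_report = build_report('')      # compact report

print(len(long_report.split('\r\n')))
print(len(short_report.split('\r\n')))
```

With an empty `Line_end`, the rows are separated only by the `'\r\n'.join(...)` call, so the blank separator lines disappear.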

Best Regards,

guy038

Alan Kilborn

@guy038

I was considering recommending your script as a basis for the solution to THIS inquiry, but then I noticed that your script doesn’t report word-count in selected text – perhaps it should do that as well?

guy038

Hello, @alan-kilborn and All,

Following your advice, I now include the number of selected words (matched with \w+) in the last line of the summary report, which deals with the different selections

If needed, the OP may choose this second syntax, which also counts the hyphen, the apostrophe and the Right Single Quotation Mark as true word chars when they are surrounded by word chars !

SEARCH (?:(?<=\w)[-'’](?=\w)|\w)+

And thus, replace the line

        editor.research(r'\w+', number, 0, editor.getSelectionNStart(n), editor.getSelectionNEnd(n))

with this one :

        editor.research(r"(?:(?<=\w)[-'’](?=\w)|\w)+", number, 0, editor.getSelectionNStart(n), editor.getSelectionNEnd(n))  #  DOUBLE quotes, as the regex contains an apostrophe
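
To see the difference between the two syntaxes outside Notepad++, here is a minimal sketch using Python 3's standard `re` module. This is an assumption for illustration only: `re` is not the Boost engine used by `editor.research`, so results on unusual Unicode text may differ, and the helper name `count_matches` and the sample sentence are hypothetical.

```python
import re

WORD_SIMPLE = re.compile(r"\w+")
# Hyphen, apostrophe or Right Single Quotation Mark, accepted only BETWEEN two word chars
WORD_EXTENDED = re.compile(r"(?:(?<=\w)[-'’](?=\w)|\w)+")

def count_matches(pattern, text):
    """Count non-overlapping matches, as the `number` callback does in the script."""
    return sum(1 for _ in pattern.finditer(text))

sample = "It's a well-known trick - isn’t it?"

print(WORD_SIMPLE.findall(sample))
print(WORD_EXTENDED.findall(sample))
print(count_matches(WORD_SIMPLE, sample), count_matches(WORD_EXTENDED, sample))
```

The isolated hyphen between two spaces is not counted by either regex, because the look-behind and look-ahead both require an adjacent word char.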

So, here is the v1.1 version of my script, split across two posts :

# encoding=utf-8

#-------------------------------------------------------------------------
#                    STATISTICS about the CURRENT file ( v1.1 )
#-------------------------------------------------------------------------

from __future__ import print_function    # for Python2 compatibility

from Npp import *

import re

import os, time, datetime

import ctypes

from ctypes.wintypes import BOOL, HWND, WPARAM, LPARAM, UINT

# --------------------------------------------------------------------------------------------------------------------------------------------------------------
#  From @alan-kilborn, in post https://community.notepad-plus-plus.org/topic/21733/pythonscript-different-behavior-in-script-vs-in-immediate-mode/4
# --------------------------------------------------------------------------------------------------------------------------------------------------------------

def npp_get_statusbar(statusbar_item_number):

    WNDENUMPROC = ctypes.WINFUNCTYPE(BOOL, HWND, LPARAM)
    FindWindowW = ctypes.windll.user32.FindWindowW
    FindWindowExW = ctypes.windll.user32.FindWindowExW
    SendMessageW = ctypes.windll.user32.SendMessageW
    LRESULT = LPARAM
    SendMessageW.restype = LRESULT
    SendMessageW.argtypes = [ HWND, UINT, WPARAM, LPARAM ]
    EnumChildWindows = ctypes.windll.user32.EnumChildWindows
    GetClassNameW = ctypes.windll.user32.GetClassNameW
    create_unicode_buffer = ctypes.create_unicode_buffer

    SBT_OWNERDRAW = 0x1000
    WM_USER = 0x400; SB_GETTEXTLENGTHW = WM_USER + 12; SB_GETTEXTW = WM_USER + 13

    npp_get_statusbar.STATUSBAR_HANDLE = None

    def get_result_from_statusbar(statusbar_item_number):
        assert statusbar_item_number <= 5
        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTLENGTHW, statusbar_item_number, 0)
        length = retcode & 0xFFFF
        type = (retcode >> 16) & 0xFFFF
        assert (type != SBT_OWNERDRAW)
        text_buffer = create_unicode_buffer(length + 1)  # + 1 for the terminating NUL that SB_GETTEXTW also copies
        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTW, statusbar_item_number, ctypes.addressof(text_buffer))
        retval = '{}'.format(text_buffer[:length])
        return retval

    def EnumCallback(hwnd, lparam):
        curr_class = create_unicode_buffer(256)
        GetClassNameW(hwnd, curr_class, 256)
        if curr_class.value.lower() == "msctls_statusbar32":
            npp_get_statusbar.STATUSBAR_HANDLE = hwnd
            return False  # stop the enumeration
        return True  # continue the enumeration

    npp_hwnd = FindWindowW(u"Notepad++", None)
    EnumChildWindows(npp_hwnd, WNDENUMPROC(EnumCallback), 0)
    if npp_get_statusbar.STATUSBAR_HANDLE: return get_result_from_statusbar(statusbar_item_number)
    assert False

St_bar = npp_get_statusbar(4)  # Zone 4 ( STATUSBARSECTION.UNICODETYPE )

Continuation on next post

guy038

guy038

Hi Alan and all,

Continuation of version v1.1 of the script :

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

def number(occ):
    global num
    num += 1

console.show()

console.clear()

Start_time = time.time()

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_encoding = str(notepad.getEncoding())

if Curr_encoding == 'ENC8BIT':
    Curr_encoding = 'ANSI'

if Curr_encoding == 'COOKIE':
    Curr_encoding = 'UTF-8'

if Curr_encoding == 'UTF8':
    Curr_encoding = 'UTF-8-BOM'

if Curr_encoding == 'UCS2BE':
    Curr_encoding = 'UTF-16 BE BOM'

if Curr_encoding == 'UCS2LE':
    Curr_encoding = 'UTF-16 LE BOM'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    Line_title = 95
else:
    Line_title = 75

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

File_name = notepad.getCurrentFilename().decode('utf-8')

if os.path.isfile(File_name) == True:

    Creation_date = time.ctime(os.path.getctime(File_name))

    Modif_date = time.ctime(os.path.getmtime(File_name))

    Size_length = os.path.getsize(File_name)

    RO_flag = 'YES'

    if os.access(File_name, os.W_OK):
        RO_flag = 'NO'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

RO_editor = 'NO'

if editor.getReadOnly() == True:
    RO_editor = 'YES'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if notepad.getCurrentView() == 0:
    Curr_view = 'MAIN View'
else:
    Curr_view = 'SECONDARY view'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_lang = notepad.getCurrentLang()

Lang_desc = notepad.getLanguageDesc(Curr_lang)

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if editor.getEOLMode() == 0:
    Curr_eol = 'Windows (CR LF)'

if editor.getEOLMode() == 1:
    Curr_eol = 'Macintosh (CR)'

if editor.getEOLMode() == 2:
    Curr_eol = 'Unix (LF)'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_wrap = 'NO'

if editor.getWrapMode() == 1:
    Curr_wrap = 'YES'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

print ('START')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Bytes_length = editor.getLength()

Total_chars = editor.countCharacters(0, editor.getLength())

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\r|\n', number)

Total_EOL = num

print ('EOL')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Total_standard = Total_chars - Total_EOL

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Curr_encoding == 'ANSI':

    Total_BMP = Total_standard
    
    Total_1_byte = Total_BMP

    Total_2_bytes = 0

    Total_3_bytes = 0

    Total_4_bytes = 0

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':

    num = 0
    editor.research(r'[\x{0080}-\x{07FF}]', number)

    Total_2_bytes = num

    print ('2-BYTES')

    # --------------------------------------------------------------------------------------------------------------------------------------------------------------

    num = 0
    editor.research(r'[\x{0800}-\x{D7FF}\x{E000}-\x{FFFF}]', number)

    Total_3_bytes = num

    print ('3-BYTES')

    # -----------------------------------------------------------------------------------------------------------------------------

    Total_4_bytes = ( Bytes_length - Total_chars - Total_2_bytes - 2 * Total_3_bytes ) // 3

    Total_1_byte = Total_standard - Total_2_bytes - Total_3_bytes - Total_4_bytes

    Total_BMP = Total_1_byte + Total_2_bytes + Total_3_bytes

# --------------------------------------------------------------------------------------------------------------------------------------------------------------


if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':

    num = 0
    editor.research(r'(?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]', number)  #  ALL BMP chars different from '\r' and '\n'

    Total_2_bytes = num

    Total_4_bytes = Total_standard - Total_2_bytes

    Total_BMP = Total_2_bytes

    Total_1_byte = 0

    Total_3_bytes = 0

    Bytes_length = 2 * Total_EOL + 2 * Total_BMP + 4 * Total_4_bytes

    print ('2-BYTES')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

BOM = 0  #  Default ANSI and UTF-8

if Curr_encoding == 'UTF-8-BOM':
    BOM = 3

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    BOM = 2

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Buffer_length = Bytes_length + BOM

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\t|\x20', number)

Non_blank_chars = Total_standard - num

print ('NON-BLANK')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\w+', number)

Words_total = num

print ('WORDS')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Err_regex = False

num = 0

if Curr_encoding == 'ANSI' or Total_4_bytes == 0:
    editor.research(r'\S+', number)
else:
    try:
        editor.research(r'(?:(?!\s).[\x{D800}-\x{DFFF}]?)+', number)
    except RuntimeError:
        Err_regex = True

Non_space_count = num

print ('NON-SPACE')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'\f^(?:\r\n|\r|\n)', number)
else:
    editor.research(r'[\f\x{0085}\x{2028}\x{2029}]^(?:\r\n|\r|\n)', number)

Special_empty = num

num = 0
editor.research(r'^(?:\r\n|\r|\n)', number)

Default_empty = num

Empty_lines = Default_empty - Special_empty

print ('EMPTY lines')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'\f^[\t\x20]+(?:\r\n|\r|\n|\z)', number)
else:
    editor.research(r'[\f\x{0085}\x{2028}\x{2029}]^[\t\x20]+(?:\r\n|\r|\n|\z)', number)

Special_blank = num

num = 0
editor.research(r'^[\t\x20]+(?:\r\n|\r|\n|\z)', number)

Default_blank = num

Blank_lines = Default_blank - Special_blank

print ('BLANK lines')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Emp_blk_lines = Empty_lines + Blank_lines

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Total_lines = editor.getLineCount()

num = 0
editor.research(r'(?-s)^.+\z', number)

if num == 0:
    Total_lines = Total_lines - 1

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Non_blk_lines = Total_lines - Emp_blk_lines

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Num_sel = editor.getSelections()  # Get ALL selections ( EMPTY or NOT )

if Num_sel != 0:

    Bytes_count = 0
    Chars_count = 0
    Words_count = 0

    for n in range(Num_sel):

        Bytes_count += editor.getSelectionNEnd(n) - editor.getSelectionNStart(n)
        Chars_count += editor.countCharacters(editor.getSelectionNStart(n), editor.getSelectionNEnd(n))

        num = 0
        editor.research(r'\w+', number, 0, editor.getSelectionNStart(n), editor.getSelectionNEnd(n))
        Words_count += num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

    if Bytes_count < 2:
        Txt_bytes = ' selected byte) in '
    else:
        Txt_bytes = ' selected bytes) in '

    if Chars_count < 2:
        Txt_chars = ' selected char, '
    else:
        Txt_chars = ' selected chars, '

    if Words_count < 2:
        Txt_words = ' selected word ('
    else:
        Txt_words = ' selected words ('

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

    if Num_sel < 2 and Bytes_count == 0:
        Txt_ranges = ' EMPTY range'

    if Num_sel < 2 and Bytes_count > 0:
        Txt_ranges = ' range'

    if Num_sel > 1 and Bytes_count == 0:
        Txt_ranges = ' EMPTY ranges'

    if Num_sel > 1 and Bytes_count > 0:
        Txt_ranges = ' ranges (EMPTY or NOT)'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

console.hide()

line_list = []  # empty list

Line_end = '\r\n'

line_list.append ('-' * Line_title)

line_list.append (' ' * int((Line_title - 54) / 2) + 'SUMMARY on ' + str(datetime.datetime.now()) + ' ( ' + str(time.time() - Start_time) + ' )')

line_list.append ('-' * Line_title + Line_end)

line_list.append (' FULL File Path    :  ' + File_name + Line_end)

if os.path.isfile(File_name) == True:

    line_list.append (' CREATION     Date :  ' + Creation_date)

    line_list.append (' MODIFICATION Date :  ' + Modif_date + Line_end)

    line_list.append (' READ-ONLY flag    :  ' + RO_flag)

line_list.append (' READ-ONLY editor  :  ' + RO_editor + Line_end * 2)

line_list.append (' Current VIEW      :  ' + Curr_view + Line_end)

line_list.append (' Current ENCODING  :  ' + Curr_encoding + Line_end)

line_list.append (' Current LANGUAGE  :  ' + str(Curr_lang) + '  (' + Lang_desc + ')' + Line_end)

line_list.append (' Current Line END  :  ' + Curr_eol + Line_end)

line_list.append (' Current WRAPPING  :  ' + Curr_wrap + Line_end * 2)

line_list.append (' 1-BYTE  Chars     :  ' + str(Total_1_byte))

line_list.append (' 2-BYTES Chars     :  ' + str(Total_2_bytes))

line_list.append (' 3-BYTES Chars     :  ' + str(Total_3_bytes) + Line_end)

line_list.append (' Sum BMP Chars     :  ' + str(Total_BMP))

line_list.append (' 4-BYTES Chars     :  ' + str(Total_4_bytes) + Line_end)

line_list.append (' CHARS w/o CR & LF :  ' + str(Total_standard))

line_list.append (' EOL ( CR or LF )  :  ' + str(Total_EOL) + Line_end)

line_list.append (' TOTAL characters  :  ' + str(Total_chars) + Line_end * 2)

if Curr_encoding == 'ANSI':
    line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b)')

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b + '\
    + str(Total_2_bytes) + ' x 2b + ' + str(Total_3_bytes) + ' x 3b + ' + str(Total_4_bytes) + ' x 4b)')

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 2 + ' + str(Total_BMP) + ' x 2b + ' + str(Total_4_bytes) + ' x 4b)')

line_list.append (' Byte Order Mark   :  ' + str(BOM) + Line_end)

line_list.append (' BUFFER Length     :  ' + str(Buffer_length))

if os.path.isfile(File_name) == True:
    line_list.append (' Length on DISK    :  ' + str(Size_length) + Line_end * 2)
else:
    if Line_end == '\r\n':
        line_list.append (Line_end)

line_list.append (' NON-Blank Chars   :  ' + str(Non_blank_chars) + Line_end)

line_list.append (' WORDS     Count   :  ' + str(Words_total) + ' (Caution !)' + Line_end)

if Err_regex == False:
    line_list.append (' NON-SPACE Count   :  ' + str(Non_space_count) + Line_end * 2)
else:
    line_list.append (' NON-SPACE Count   :  ' + str(Non_space_count) + ' (Caution as " RuntimeError " occurred !)' + Line_end * 2)


line_list.append (' True EMPTY lines  :  ' + str(Empty_lines))

line_list.append (' True BLANK lines  :  ' + str(Blank_lines) + Line_end)

line_list.append (' EMPTY/BLANK lines :  ' + str(Emp_blk_lines) + Line_end)

line_list.append (' NON-BLANK lines   :  ' + str(Non_blk_lines))

line_list.append (' TOTAL Lines       :  ' + str(Total_lines) + Line_end * 2)

line_list.append (' SELECTION(S)      :  ' + str(Chars_count) + Txt_chars + str(Words_count) + Txt_words + str(Bytes_count) + Txt_bytes + str(Num_sel) + Txt_ranges + Line_end)

notepad.new()

editor.setText('\r\n'.join(line_list))

if St_bar != 'ANSI' and St_bar != 'UTF-8' and St_bar != 'UTF-8-BOM' and St_bar != 'UTF-16 BE BOM' and St_bar != 'UTF-16 LE BOM':

    if Curr_encoding == 'UTF-8':  #  SAME value for both an 'UTF-8' or 'ANSI' file, when RE-INTERPRETED with the 'Encoding > Character Set > ...' feature

        notepad.messageBox ('CURRENT file re-interpreted as ' + St_bar + '  =>  Possible ERRONEOUS results' + \
                        '\nSo, CLOSE the file WITHOUT saving, RESTORE it (CTRL + SHIFT + T) and RESTART script', '!!! WARNING !!!')

# ----Aé☀𝜜-----------------------------------------------------------------------------------------------------------------------------------------------------

Best Regards,

guy038

guy038

Hello, @alan-kilborn and Python gurus,

I’ve just found a bug when trying to run my script against a “French” file called Numéros ( which means Numbers ) :-((

The section of my script below detects whether the current tab is associated with a true file, saved on disk, or refers to a new # file, not saved yet

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

File_name = notepad.getCurrentFilename()

if os.path.isfile(File_name) == True:

    Creation_date = time.ctime(os.path.getctime(File_name))

    Modif_date = time.ctime(os.path.getmtime(File_name))

    Size_length = os.path.getsize(File_name)

    RO_flag = 'YES'

    if os.access(File_name, os.W_OK):
        RO_flag = 'NO'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Unfortunately, if the current name contains accented characters, like Numéros, the script wrongly supposes that it’s a new # file !

As soon as it is renamed as Numeros, everything is OK again

So, how to recognize the filename even if current file or current path contain NON-ASCII characters ?

TIA

guy038

Alan Kilborn

@guy038 said in Emulation of the "View > Summary" feature with a Python script:

how to recognize the filename even if current file or current path contain NON-ASCII characters ?

Short answer: This is better done with Python3, i.e., PythonScript 3.x. Then things “just work”. :-)

But, for Python2, (and PS 2.x) you can make a call to .encode('utf-8') or .decode('utf-8') – depending upon your circumstance (I’m not commenting on your specific code) – in order to get what you need.

Basically, if you have a Python2 string (in a variable s) and you want to get a Unicode string (for things like Windows pathnames with non-trivial characters), use s.decode('utf-8') and to go the other way, where you have a Unicode str (in a variable u) and you want a Python2 str, do u.encode('utf-8').
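
A minimal sketch of that round trip, written so that it runs unchanged under Python 2 and Python 3; the file name is just an illustrative value:

```python
from __future__ import print_function

raw = b'Num\xc3\xa9ros.txt'       # UTF-8 encoded bytes (a plain `str` in Python 2)
text = raw.decode('utf-8')        # -> Unicode text, suitable for Windows path functions

assert text == u'Num\xe9ros.txt'  # 'é' is now ONE character (U+00E9)
print(len(raw), len(text))        # byte count versus character count

back = text.encode('utf-8')       # -> the original UTF-8 bytes again
assert back == raw
```

The accented letter takes two bytes in UTF-8, which is why the two lengths differ by one.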

guy038

Hi, @alan-kilborn,

Many thanks for the tip ! I did some Google searches before, but only found some obscure explanations. But, right now, trying again with this question :

How to get "os.path.isfile(Filename)" == True: when Filename contains "NON ASCII" chars ?

The first result, named “python - UnicodeEncodeError on joining file name”, dated Jan. 05 2010, on the Stack Overflow site, says verbatim, in the middle of the article :

So I would first try filename = filename.decode('utf-8') -- that should allow the os.path.join to work

Now, I won’t bother to re-edit my script with a new version number ! I just changed, in my v1.1 version, above, the line :

File_name = notepad.getCurrentFilename()

to this one :

File_name = notepad.getCurrentFilename().decode('utf-8')
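
If the same script may also be run under PythonScript 3, where file names should already be Unicode text, the decoding could be made conditional. The helper `to_text` below is hypothetical (it is not part of PythonScript), and the byte string stands in for the value returned by `notepad.getCurrentFilename()`:

```python
from __future__ import print_function
import os

def to_text(name, encoding='utf-8'):
    """Return `name` as Unicode text, decoding only when it is a byte string."""
    if isinstance(name, bytes):      # Python 2 `str`, or Python 3 `bytes`
        return name.decode(encoding)
    return name                      # already Unicode text

# Hypothetical stand-in for notepad.getCurrentFilename() outside Notepad++
current = b'C:\\Temp\\Num\xc3\xa9ros.txt'

File_name = to_text(current)
print(os.path.isfile(File_name))     # True only if that file really exists on disk
```

In the script itself, this would amount to writing `File_name = to_text(notepad.getCurrentFilename())`.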

BR

guy038

guy038

Hello, @alan-kilborn and All,

Below, the v1.2 version of the Python script for an enhanced Summary feature :

I split the total number of chars into 3 parts : EOL chars, Space and Tab chars and True chars ( [^\t\x20\r\n] )
I also split the total number of word chars into 3 parts : letter chars, digit chars and low_line chars
I added a count of the paragraphs ( You may adapt the corresponding regex to your needs )
I added a count of the sentences ( You may adapt the corresponding regex to your needs )
I added some remarks at the end of the summary report, regarding the global accuracy of some results !
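
The first two points can be sketched outside Notepad++ with Python's standard `re` module. This is only an approximation: the function names are hypothetical, and Python's definition of \w is not the same as the Boost one used by `editor.research`, so the figures may differ on non-ASCII text.

```python
import re

def char_breakdown(text):
    """Split a text into EOL chars, blank chars (tab / space) and 'true' chars."""
    eol = len(re.findall(r'[\r\n]', text))
    blank = len(re.findall(r'[\t\x20]', text))
    true_chars = len(re.findall(r'[^\t\x20\r\n]', text))
    return eol, blank, true_chars

def word_char_breakdown(text):
    """Split the word chars into letters, digits and low lines (underscores)."""
    letters = len(re.findall(r'[^\W\d_]', text))   # word chars that are neither digit nor '_'
    digits = len(re.findall(r'\d', text))
    low_lines = text.count('_')
    return letters, digits, low_lines

sample = "ab c\r\n\td_1\n"

print(char_breakdown(sample))        # the three parts add up to len(sample)
print(word_char_breakdown(sample))   # the three parts add up to the number of \w chars
```

In both cases the three parts are disjoint, so their sum gives back the total, which is a handy consistency check for the report.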

Now, Alan, I needed to change this part, regarding the selections :

    for n in range(Num_sel):

        Bytes_count += editor.getSelectionNEnd(n) - editor.getSelectionNStart(n)
        Chars_count += editor.countCharacters(editor.getSelectionNStart(n), editor.getSelectionNEnd(n))

        num = 0
        editor.research(r'\w+', number, 0, editor.getSelectionNStart(n), editor.getSelectionNEnd(n))
        Words_count += num

to this one :

    for n in range(Num_sel):

        Bytes_count += editor.getSelectionNEnd(n) - editor.getSelectionNStart(n)
        Chars_count += editor.countCharacters(editor.getSelectionNStart(n), editor.getSelectionNEnd(n))

        num = 0
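        if editor.getSelectionNEnd(n) != editor.getSelectionNStart(n):  #  Test EACH selection, as 'Bytes_count' is cumulative over ALL the selections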
            editor.research(r'\w+', number, 0, editor.getSelectionNStart(n), editor.getSelectionNEnd(n))
        Words_count += num

The reason : if the single zero-length selection was on a truly empty line, the script wrote, as expected, the message :

0 selected char, 0 selected word (0 selected byte) in 1 EMPTY range

But if this single zero-length selection was on a non-empty line, the script would wrongly write, for example :

0 selected char, 568 selected words (0 selected byte) in 1 EMPTY range

given that the whole file contains 568 words
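
The idea of the guard can be illustrated outside Notepad++ with a small pure-Python sketch. It is an assumption-laden stand-in: `count_selected_words` is a hypothetical helper, the ranges are character indexes into a Python string (Scintilla positions are byte offsets), and `re` replaces `editor.research`.

```python
import re

def count_selected_words(text, ranges):
    """Count \\w+ matches inside each (start, end) range, skipping empty ranges."""
    total = 0
    for start, end in ranges:
        if end == start:          # zero-length selection : nothing to search
            continue
        total += len(re.findall(r'\w+', text[start:end]))
    return total

text = "one two three"

print(count_selected_words(text, [(5, 5)]))          # caret only, no real selection
print(count_selected_words(text, [(0, 7), (9, 9)]))  # one real range plus one empty range
```

Testing each range on its own length means that an empty range never triggers a search, even when other ranges in the same multi-selection are non-empty.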

So, here is the v1.2 version of my script, split across two posts :

# encoding=utf-8

#-------------------------------------------------------------------------
#                    STATISTICS about the CURRENT file ( v1.2 )
#-------------------------------------------------------------------------

from __future__ import print_function    # for Python2 compatibility

from Npp import *

import re

import os, time, datetime

import ctypes

from ctypes.wintypes import BOOL, HWND, WPARAM, LPARAM, UINT

# --------------------------------------------------------------------------------------------------------------------------------------------------------------
#  From @alan-kilborn, in post https://community.notepad-plus-plus.org/topic/21733/pythonscript-different-behavior-in-script-vs-in-immediate-mode/4
# --------------------------------------------------------------------------------------------------------------------------------------------------------------

def npp_get_statusbar(statusbar_item_number):

    WNDENUMPROC = ctypes.WINFUNCTYPE(BOOL, HWND, LPARAM)
    FindWindowW = ctypes.windll.user32.FindWindowW
    FindWindowExW = ctypes.windll.user32.FindWindowExW
    SendMessageW = ctypes.windll.user32.SendMessageW
    LRESULT = LPARAM
    SendMessageW.restype = LRESULT
    SendMessageW.argtypes = [ HWND, UINT, WPARAM, LPARAM ]
    EnumChildWindows = ctypes.windll.user32.EnumChildWindows
    GetClassNameW = ctypes.windll.user32.GetClassNameW
    create_unicode_buffer = ctypes.create_unicode_buffer

    SBT_OWNERDRAW = 0x1000
    WM_USER = 0x400; SB_GETTEXTLENGTHW = WM_USER + 12; SB_GETTEXTW = WM_USER + 13

    npp_get_statusbar.STATUSBAR_HANDLE = None

    def get_result_from_statusbar(statusbar_item_number):
        assert statusbar_item_number <= 5
        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTLENGTHW, statusbar_item_number, 0)
        length = retcode & 0xFFFF
        type = (retcode >> 16) & 0xFFFF
        assert (type != SBT_OWNERDRAW)
        text_buffer = create_unicode_buffer(length + 1)  # + 1 for the terminating NUL that SB_GETTEXTW also copies
        retcode = SendMessageW(npp_get_statusbar.STATUSBAR_HANDLE, SB_GETTEXTW, statusbar_item_number, ctypes.addressof(text_buffer))
        retval = '{}'.format(text_buffer[:length])
        return retval

    def EnumCallback(hwnd, lparam):
        curr_class = create_unicode_buffer(256)
        GetClassNameW(hwnd, curr_class, 256)
        if curr_class.value.lower() == "msctls_statusbar32":
            npp_get_statusbar.STATUSBAR_HANDLE = hwnd
            return False  # stop the enumeration
        return True  # continue the enumeration

    npp_hwnd = FindWindowW(u"Notepad++", None)
    EnumChildWindows(npp_hwnd, WNDENUMPROC(EnumCallback), 0)
    if npp_get_statusbar.STATUSBAR_HANDLE: return get_result_from_statusbar(statusbar_item_number)
    assert False

St_bar = npp_get_statusbar(4)  # Zone 4 ( STATUSBARSECTION.UNICODETYPE )

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

def number(occ):
    global num
    num += 1

console.show()

console.clear()

Start_time = time.time()

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_encoding = str(notepad.getEncoding())

if Curr_encoding == 'ENC8BIT':
    Curr_encoding = 'ANSI'

if Curr_encoding == 'COOKIE':
    Curr_encoding = 'UTF-8'

if Curr_encoding == 'UTF8':
    Curr_encoding = 'UTF-8-BOM'

if Curr_encoding == 'UCS2BE':
    Curr_encoding = 'UTF-16 BE BOM'

if Curr_encoding == 'UCS2LE':
    Curr_encoding = 'UTF-16 LE BOM'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    Line_title = 95
else:
    Line_title = 75

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

File_name = notepad.getCurrentFilename().decode('utf-8')

if os.path.isfile(File_name) == True:

    Creation_date = time.ctime(os.path.getctime(File_name))

    Modif_date = time.ctime(os.path.getmtime(File_name))

    Size_length = os.path.getsize(File_name)

    RO_flag = 'YES'

    if os.access(File_name, os.W_OK):
        RO_flag = 'NO'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

RO_editor = 'NO'

if editor.getReadOnly() == True:
    RO_editor = 'YES'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if notepad.getCurrentView() == 0:
    Curr_view = 'MAIN View'
else:
    Curr_view = 'SECONDARY view'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_lang = notepad.getCurrentLang()

Lang_desc = notepad.getLanguageDesc(Curr_lang)

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if editor.getEOLMode() == 0:
    Curr_eol = 'Windows (CR LF)'

if editor.getEOLMode() == 1:
    Curr_eol = 'Macintosh (CR)'

if editor.getEOLMode() == 2:
    Curr_eol = 'Unix (LF)'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Curr_wrap = 'NO'

if editor.getWrapMode() == 1:
    Curr_wrap = 'YES'

Continuation on next post

guy038

guy038

Hi @alan-kilborn and all,

Continuation of version v1.2 of the script :

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

print ('START')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Bytes_length = editor.getLength()

Total_chars = editor.countCharacters(0, editor.getLength())

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\n|\r', number)

Total_EOL = num

print ('EOL')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\t|\x20', number)

Blank_chars = num

print ('BLANK')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Total_standard = Total_chars - Total_EOL

True_chars = Total_chars - Total_EOL - Blank_chars

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Curr_encoding == 'ANSI':

    Total_BMP = Total_standard
    
    Total_1_byte = Total_BMP

    Total_2_bytes = 0

    Total_3_bytes = 0

    Total_4_bytes = 0

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':

    num = 0
    editor.research(r'[\x{0080}-\x{07FF}]', number)

    Total_2_bytes = num

    print ('2-BYTES')

    # --------------------------------------------------------------------------------------------------------------------------------------------------------------

    num = 0
    editor.research(r'[\x{0800}-\x{D7FF}\x{E000}-\x{FFFF}]', number)

    Total_3_bytes = num

    print ('3-BYTES')

    # -----------------------------------------------------------------------------------------------------------------------------

    Total_4_bytes = ( Bytes_length - Total_chars - Total_2_bytes - 2 * Total_3_bytes ) // 3  #  INTEGER division, so that the count stays an 'int' under Python 3, too

    Total_1_byte = Total_standard - Total_2_bytes - Total_3_bytes - Total_4_bytes

    Total_BMP = Total_1_byte + Total_2_bytes + Total_3_bytes

# --------------------------------------------------------------------------------------------------------------------------------------------------------------


if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':

    num = 0
    editor.research(r'(?![\r\n\x{D800}-\x{DFFF}])[\x{0000}-\x{FFFF}]', number)  #  ALL BMP chars different from '\r' and '\n'

    Total_2_bytes = num

    Total_4_bytes = Total_standard - Total_2_bytes

    Total_BMP = Total_2_bytes

    Total_1_byte = 0

    Total_3_bytes = 0

    Bytes_length = 2 * Total_EOL + 2 * Total_BMP + 4 * Total_4_bytes

    print ('2-BYTES')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

BOM = 0  #  Default ANSI and UTF-8

if Curr_encoding == 'UTF-8-BOM':
    BOM = 3

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    BOM = 2

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Buffer_length = Bytes_length + BOM

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\d', number)

Number_chars = num

print ('NUMBERS')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'_', number)

Lowline_chars = num

print ('LOW_LINES')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\w', number)

Word_chars = num

print ('WORDS')

Letter_chars = Word_chars - Number_chars - Lowline_chars

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
editor.research(r'\w+', number)

Words_total = num

print ('WORDS+')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Err_regex_non_space = False

num = 0

if Curr_encoding == 'ANSI' or Total_4_bytes == 0:
    editor.research(r'\S+', number)
else:
    try:
        editor.research(r'(?:(?!\s).[\x{D800}-\x{DFFF}]?)+', number)
    except RuntimeError:
        Err_regex_non_space = True

Non_space_count = num

print ('NON-SPACE+')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Err_regex_sentence = False

num = 0

try:
    editor.research(r'(?-s)(?:\A|(?<=[\h\r\n.?!])).+?(?:(?=[.?!](\h|\R|\z))|(?=\R|\z))', number)
except RuntimeError:
    Err_regex_sentence = True

Sentence_count = num

print ('SENTENCES')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------
Err_regex_paragraph = False

num = 0

try:
    editor.research(r'(?-s)(?:(?:.[\x{D800}-\x{DFFF}]?)+(?:\r\n|\n|\r))+(?:\r\n|\n|\r){1,}(?:(?:.[\x{D800}-\x{DFFF}]?)+\z)?|(?:.[\x{D800}-\x{DFFF}]?)+\z', number)
except RuntimeError:
    Err_regex_paragraph = True

Paragraph_count = num

print ('PARAGRAPHS')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'\f^(?:\r\n|\n|\r)', number)
else:
    editor.research(r'[\f\x{0085}\x{2028}\x{2029}]^(?:\r\n|\n|\r)', number)

Special_empty = num

num = 0
editor.research(r'^(?:\r\n|\n|\r)', number)

Default_empty = num

Empty_lines = Default_empty - Special_empty

print ('EMPTY lines')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

num = 0
if Curr_encoding == 'ANSI':
    editor.research(r'\f^[\t\x20]+(?:\r\n|\n|\r|\z)', number)
else:
    editor.research(r'[\f\x{0085}\x{2028}\x{2029}]^[\t\x20]+(?:\r\n|\n|\r|\z)', number)

Special_blank = num

num = 0
editor.research(r'^[\t\x20]+(?:\r\n|\n|\r|\z)', number)

Default_blank = num

Blank_lines = Default_blank - Special_blank

print ('BLANK lines')

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Emp_blk_lines = Empty_lines + Blank_lines

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Total_lines = editor.getLineCount()

num = 0
editor.research(r'(?-s)^.+\z', number)

if num == 0:
    Total_lines = Total_lines - 1  #  Because LAST line totally EMPTY

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Non_blk_lines = Total_lines - Emp_blk_lines

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Num_sel = editor.getSelections()  # Get ALL selections ( EMPTY or NOT )

if Num_sel != 0:

    Bytes_count = 0
    Chars_count = 0
    Words_count = 0

    for n in range(Num_sel):

        Bytes_count += editor.getSelectionNEnd(n) - editor.getSelectionNStart(n)
        Chars_count += editor.countCharacters(editor.getSelectionNStart(n), editor.getSelectionNEnd(n))

        num = 0
        if Bytes_count != 0:
            editor.research(r'\w+', number, 0, editor.getSelectionNStart(n), editor.getSelectionNEnd(n))
        Words_count += num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

    if Bytes_count < 2:
        Txt_bytes = ' selected byte) in '
    else:
        Txt_bytes = ' selected bytes) in '

    if Chars_count < 2:
        Txt_chars = ' selected char, '
    else:
        Txt_chars = ' selected chars, '

    if Words_count < 2:
        Txt_words = ' selected word ('
    else:
        Txt_words = ' selected words ('

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

    if Num_sel < 2 and Bytes_count == 0:
        Txt_ranges = ' EMPTY range'

    if Num_sel < 2 and Bytes_count > 0:
        Txt_ranges = ' range'

    if Num_sel > 1 and Bytes_count == 0:
        Txt_ranges = ' EMPTY ranges'

    if Num_sel > 1 and Bytes_count > 0:
        Txt_ranges = ' ranges (EMPTY or NOT)'

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

console.hide()

line_list = []  # empty list

Line_end = '\r\n'

line_list.append ('-' * Line_title)

line_list.append (' ' * int((Line_title - 54) / 2) + 'SUMMARY on ' + str(datetime.datetime.now()) + ' ( ' + str(time.time() - Start_time) + ' )')

line_list.append ('-' * Line_title + Line_end)

line_list.append (' FULL File Path    :  ' + File_name + Line_end)

if os.path.isfile(File_name) == True:

    line_list.append (' CREATION     Date :  ' + Creation_date)

    line_list.append (' MODIFICATION Date :  ' + Modif_date + Line_end)

    line_list.append (' READ-ONLY flag    :  ' + RO_flag)

line_list.append (' READ-ONLY editor  :  ' + RO_editor + Line_end * 2)

line_list.append (' Current VIEW      :  ' + Curr_view + Line_end)

line_list.append (' Current ENCODING  :  ' + Curr_encoding + Line_end)

line_list.append (' Current LANGUAGE  :  ' + str(Curr_lang) + '  (' + Lang_desc + ')' + Line_end)

line_list.append (' Current Line END  :  ' + Curr_eol + Line_end)

line_list.append (' Current WRAPPING  :  ' + Curr_wrap + Line_end * 2)

line_list.append (' 1-BYTE  Chars     :  ' + str(Total_1_byte))

line_list.append (' 2-BYTES Chars     :  ' + str(Total_2_bytes))

line_list.append (' 3-BYTES Chars     :  ' + str(Total_3_bytes) + Line_end)

line_list.append (' Sum BMP Chars     :  ' + str(Total_BMP))

line_list.append (' 4-BYTES Chars     :  ' + str(Total_4_bytes) + Line_end)

line_list.append (' CHARS w/o CR & LF :  ' + str(Total_standard) + Line_end * 2)

line_list.append (' EOL ( CR or LF )  :  ' + str(Total_EOL))

line_list.append (' SPC & TAB  Chars  :  ' + str(Blank_chars))

line_list.append (' TRUE       Chars  :  ' + str(True_chars) + Line_end)

line_list.append (' TOTAL characters  :  ' + str(Total_chars) + Line_end * 2)

if Curr_encoding == 'ANSI':
    line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b)')

if Curr_encoding == 'UTF-8' or Curr_encoding == 'UTF-8-BOM':
    line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 1 + ' + str(Total_1_byte) + ' x 1b + '\
    + str(Total_2_bytes) + ' x 2b + ' + str(Total_3_bytes) + ' x 3b + ' + str(Total_4_bytes) + ' x 4b)')

if Curr_encoding == 'UTF-16 BE BOM' or Curr_encoding == 'UTF-16 LE BOM':
    line_list.append (' BYTES Length      :  ' + str(Bytes_length) + ' (' + str(Total_EOL) + ' x 2 + ' + str(Total_BMP) + ' x 2b + ' + str(Total_4_bytes) + ' x 4b)')

line_list.append (' Byte Order Mark   :  ' + str(BOM) + Line_end)

line_list.append (' BUFFER Length     :  ' + str(Buffer_length))

if os.path.isfile(File_name) == True:
    line_list.append (' Length on DISK    :  ' + str(Size_length) + Line_end * 2)
else:
    if Line_end == '\r\n':
        line_list.append (Line_end)

line_list.append (' NUMBER     Chars  :  ' + str(Number_chars) + '\t(*)')

line_list.append (' LOW_LINE   Chars  :  ' + str(Lowline_chars))

line_list.append (' LETTER     Chars  :  ' + str(Letter_chars) + '\t(*)' + Line_end)

line_list.append (' WORD       Chars  :  ' + str(Word_chars) + '\t(*)' + Line_end * 2)

line_list.append (' WORDS      Count  :  ' + str(Words_total) + '\t(*)' + Line_end)

if Err_regex_non_space == False:
    line_list.append (' NON-SPACE  Count  :  ' + str(Non_space_count) + '\t(**)' + Line_end * 2)
else:
    line_list.append (' NON-SPACE  Count  :  ' + str(Non_space_count) + '\t(Caution : a " RuntimeError " occurred !)' + Line_end * 2)

if Err_regex_sentence == False:
    line_list.append (' SENTENCES  Count  :  ' + str(Sentence_count) + '\t(**)' + Line_end)
else:
    line_list.append (' SENTENCES  Count  :  ' + str(Sentence_count) + '\t(Caution : a " RuntimeError " occurred !)' + Line_end)

if Err_regex_paragraph == False:
    line_list.append (' PARAGRAPHS Count  :  ' + str(Paragraph_count) + '\t(**)' + Line_end * 2)
else:
    line_list.append (' PARAGRAPHS Count  :  ' + str(Paragraph_count) + '\t(Caution : a " RuntimeError " occurred !)' + Line_end * 2)

line_list.append (' True EMPTY lines  :  ' + str(Empty_lines))

line_list.append (' True BLANK lines  :  ' + str(Blank_lines) + Line_end)

line_list.append (' EMPTY/BLANK lines :  ' + str(Emp_blk_lines) + Line_end)

line_list.append (' NON-BLANK lines   :  ' + str(Non_blk_lines))

line_list.append (' TOTAL Lines       :  ' + str(Total_lines) + Line_end * 2)

line_list.append (' SELECTION(S)      :  ' + str(Chars_count) + Txt_chars + str(Words_count) + Txt_words + str(Bytes_count) + Txt_bytes + str(Num_sel) + Txt_ranges + '\r\n' + Line_end)

line_list.append (' (*)   Our BOOST regex engine ignores all WORD, NUMBER and LETTER characters over the BMP and may ignore some others within the BMP !')

line_list.append (' (**)  The results may NOT be very accurate for "technical" or "non-regular" files !' + Line_end)

notepad.new()

editor.setText('\r\n'.join(line_list))

if St_bar != 'ANSI' and St_bar != 'UTF-8' and St_bar != 'UTF-8-BOM' and St_bar != 'UTF-16 BE BOM' and St_bar != 'UTF-16 LE BOM':

    if Curr_encoding == 'UTF-8':  #  SAME value for both a 'UTF-8' and an 'ANSI' file, when RE-INTERPRETED with the 'Encoding > Character Set > ...' feature

        notepad.messageBox ('CURRENT file re-interpreted as ' + St_bar + '  =>  Possible ERRONEOUS results' + \
                        '\nSo, CLOSE the file WITHOUT saving, RESTORE it (CTRL + SHIFT + T) and RESTART script', '!!! WARNING !!!')

# ----Aé☀𝜜-----------------------------------------------------------------------------------------------------------------------------------------------------
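
As a side note, the Total_4_bytes formula of the UTF-8 branch can be checked outside Notepad++. Each 2-byte, 3-byte and 4-byte char adds 1, 2 and 3 bytes, respectively, on top of its count as a char, so Bytes_length - Total_chars = Total_2_bytes + 2 x Total_3_bytes + 3 x Total_4_bytes. Here is a minimal sketch in plain Python 3, which does not use the editor object at all. The function name utf8_class_counts is only an illustration and is not part of the script :

```python
def utf8_class_counts(text):
    """Return the (1-byte, 2-byte, 3-byte, 4-byte) char counts of 'text',
    CR and LF excluded, following the same arithmetic as the script."""
    bytes_length = len(text.encode('utf-8'))   # plays the role of editor.getLength()
    total_chars = len(text)                    # plays the role of editor.countCharacters()
    total_eol = sum(1 for c in text if c in '\r\n')
    total_standard = total_chars - total_eol

    total_2_bytes = sum(1 for c in text if 0x0080 <= ord(c) <= 0x07FF)
    total_3_bytes = sum(1 for c in text if 0x0800 <= ord(c) <= 0xFFFF)

    # bytes - chars = n2 + 2 * n3 + 3 * n4   =>   n4 = ( bytes - chars - n2 - 2 * n3 ) // 3
    total_4_bytes = (bytes_length - total_chars - total_2_bytes - 2 * total_3_bytes) // 3
    total_1_byte = total_standard - total_2_bytes - total_3_bytes - total_4_bytes

    return total_1_byte, total_2_bytes, total_3_bytes, total_4_bytes


# One char of each class ( A, U+00E9, U+2600, U+1D71C ) followed by CR LF
sample = 'A\u00e9\u2600\U0001d71c\r\n'
print(utf8_class_counts(sample))
```

With this sample, which holds one char of each class, the four counts should all be equal to 1.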

Best Regards,

guy038

Alan Kilborn

@guy038 said :

But if this unique zero-length selection was on a non-empty line, it would wrongly write…

I removed the if Bytes_count != 0: and tried to replicate the problem you indicated, but did not see the same issue. Can you provide more detail on your “steps to reproduce”?

Also, this line of your script gave me an error under Python3:

File_name = notepad.getCurrentFilename().decode('utf-8')

Here’s a way to make it work under Python2 or 3:

import sys
python3 = sys.version_info.major == 3
if python3:
    File_name = notepad.getCurrentFilename()
else:
    File_name = notepad.getCurrentFilename().decode('utf-8')
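
A possible variation, as a sketch only : instead of testing the Python version, test the type of the returned value and decode it only when it is a byte string. This assumes that the error under Python3 simply comes from the value already being text. The helper name to_text is hypothetical and not part of the script :

```python
def to_text(value, encoding='utf-8'):
    """Return 'value' as text, decoding it only if it is a byte string."""
    text_type = type(u'')            # 'unicode' under Python 2, 'str' under Python 3
    if isinstance(value, text_type):
        return value                 # already text : nothing to do
    return value.decode(encoding)    # byte string : decode it


print(to_text(b'license.txt'))
print(to_text(u'license.txt'))
```

In the script, the call would then read File_name = to_text(notepad.getCurrentFilename()).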

guy038

Hi, @alan-kilborn and All,

Ah… OK. No problem ! So, this script will work with both Python script 2 and 3, nice !

Regarding the bug, I can reproduce it very easily !

So, let's use this part of the script, related to selections, where the line if Bytes_count != 0: is commented out :

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Num_sel = editor.getSelections()  # Get ALL selections ( EMPTY or NOT )

if Num_sel != 0:

    Bytes_count = 0
    Chars_count = 0
    Words_count = 0

    for n in range(Num_sel):

        Bytes_count += editor.getSelectionNEnd(n) - editor.getSelectionNStart(n)
        Chars_count += editor.countCharacters(editor.getSelectionNStart(n), editor.getSelectionNEnd(n))

        num = 0
#        if Bytes_count != 0:
        editor.research(r'\w+', number, 0, editor.getSelectionNStart(n), editor.getSelectionNEnd(n))
        Words_count += num

# --------------------------------------------------------------------------------------------------------------------------------------------------------------

Then :

Open, let’s say, the license.txt file
Move the caret to the very beginning of the license.txt file ( so, before the letter C of the word COPYING )
Do not do any selection
Run the script

=> You should see, in the SELECTION(S) line, a non-null number of words :

 SELECTION(S)      :  0 selected char, 5822 selected words (0 selected byte) in 1 EMPTY range

Now, just move the caret one character to the right ( so, between the C and the O letters of the word COPYING )
Again, do not make any selection
Re-run the script

=> This time, we get, for the SELECTION(S) line, the expected results :

 SELECTION(S)      :  0 selected char, 0 selected word (0 selected byte) in 1 EMPTY range

At first sight, this bug occurs only when the caret is at the very beginning of current file !
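
Whatever the cause is, note that the guard of the script tests the running total Bytes_count and not the length of the current selection, so an empty selection that comes after a non-empty one would still reach the search. Here is a minimal sketch of a per-range guard, in plain Python, with the re module standing in for editor.research and character offsets standing in for byte positions. The function name count_words_in_ranges is only an illustration :

```python
import re


def count_words_in_ranges(text, ranges):
    """Add up the number of word matches found in each (start, end) range,
    skipping every EMPTY range instead of searching it."""
    total = 0
    for start, end in ranges:
        if end <= start:             # EMPTY selection : nothing to search
            continue
        total += len(re.findall(r'\w+', text[start:end]))
    return total


text = 'COPYING the GNU licence'
print(count_words_in_ranges(text, [(0, 0)]))            # caret at the very beginning, NO selection
print(count_words_in_ranges(text, [(0, 7), (0, 0)]))    # a NON-empty range, then an EMPTY one
```

In the script itself, the equivalent test would compare editor.getSelectionNEnd(n) with editor.getSelectionNStart(n) before calling editor.research.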

Once you find an explanation ( if any ! ), I will post the new version of the script.

BR

guy038

P.S. : Maybe this bug does not occur with Python script 3 ?