Count the occurrences of each line



  • Hello, I don’t know if this is possible with Notepad++, but this is what I want to do:

    imagine that you have a text file which contains :

    Hello my friend
    Ho
    A
    XT
    Hello my friend
    A
    Ha

    So I want to count the occurrences of each line, and the output I want would look like this:

    Hello my friend 2
    Ho 1
    A 2
    XT 1
    Ha 1

    Do you think it’s possible?

    Thank you !



  • @JSE-Faucet said:

    I don’t know if this is possible with Notepad++

    Unfortunately, not with Notepad++ alone. When paired with one of the scripting-language plugins (PythonScript, LuaScript, or “jN Notepad++ Plugin”), you could have a script do that inside Notepad++. (Or you could use any of those languages, or another programming language, to do the same thing without Notepad++'s help.)
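
    As an illustration of the "without Notepad++'s help" route, here is a minimal sketch in plain Python. The function name count_lines and the sample list are made up for this example; it counts each line and reports the counts in order of first appearance.

```python
from collections import OrderedDict

def count_lines(lines):
    """Return (line, count) pairs in order of first appearance."""
    counts = OrderedDict()  # remembers the order in which keys were first added
    for raw in lines:
        key = raw.rstrip("\r\n")  # drop only the line ending
        counts[key] = counts.get(key, 0) + 1
    return list(counts.items())

# Sample data taken from the question above
sample = ["Hello my friend", "Ho", "A", "XT", "Hello my friend", "A", "Ha"]
for text, n in count_lines(sample):
    print("{} {}".format(text, n))
```

    Because count_lines accepts any iterable of lines, an open file object can be passed instead of a list, in which case the file is read one line at a time rather than loaded whole.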



  • Since it had been a while since I’d last tested my PythonScript chops, and I had a few minutes, I decided to see if I could implement it. This replicates your results:

    # encoding=utf-8
    """in response to https://notepad-plus-plus.org/community/topic/17744/"""
    from Npp import *
    
    def forum_post17744_FunctionName():
        """
        this uses each line as a key in a dictionary, to count how many entries;
        to preserve order, it also stores the key in a list
        """
    
        # initialize dictionary and ordered list
        count = dict()
        order = list()
    
        # parse the active editor's text
        for lnum in range(editor.getLineCount()):
            editor.gotoLine(lnum)
            key = editor.getCurLine().rstrip()
            if key in count:
                count[key] = count[key] + 1
            else:
                count[key] = 1
                order.append(key)
    
        # make the changes:
        #   delete old contents,
        #   insert key + count + EOL for each of the unique lines, in original order
        editor.beginUndoAction()
        editor.selectAll()
        editor.deleteBack()
        for key in order:
            editor.addText("{} {}\r\n".format(key, count[key]))
        editor.endUndoAction()
    
    
    if __name__ == '__main__': forum_post17744_FunctionName()
    

    To use this,

    1. Install PythonScript plugin using Plugins > Plugins Admin
    2. Plugins > Python Script > New Script, give it a name like CountUniqueLines.py
    3. Paste in the code and save
    4. Open your example file in the active editor pane
    5. Plugins > Python Script > Scripts > CountUniqueLines

    Enjoy your free code-writing service.



  • Are you sure that it will work with very big files ? (90MB)



  • @JSE-Faucet

    work with very big files

    One way to find out…

    If Notepad++ itself can handle the file (and it should be fine, 90MB is not THAT big), such a script can handle processing it…



  • Sadly, I can’t use PythonScript on my computer… I don’t know why though. I had to install it with their installer because it wasn’t in the list. Now it is installed but it does not even show up in NPP :(



  • @JSE-Faucet said:

    Are you sure that it will work with very big files ? (90MB)

    No, I am not sure; though, as @Alan-Kilborn said, 90MB is pretty small in the modern scheme of things. Given your problem statement, I thought this was sufficient for getting through a one-time need. (And really, even that much is going above and beyond, since this is a forum about Notepad++, not a free code-writing service.)

    I have ideas on how the script could be modified to handle what you want. But it would take more investment of time for me, and I’ve already fulfilled my desire to see how I would give this a quick implementation.

    And actually, I remembered that nearly this functionality already exists. The Linux command uniq -c will almost get you what you want: sort infile.txt | uniq -c will put all identical lines next to each other and then count the duplicates. However, it will not preserve your original order. If that’s of interest to you, see http://gnuwin32.sourceforge.net/packages/coreutils.htm for a Windows implementation.
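
    To show why the original order is lost, here is a rough Python emulation of the sort-then-count idea. It is only a sketch of the technique, not how uniq itself is implemented; the name sort_uniq_c is invented for this example, and the real tools may order lines differently (locale rules) and format the count column differently.

```python
from itertools import groupby

def sort_uniq_c(lines):
    """Sort the lines, then count each run of identical neighbours."""
    # groupby only merges adjacent equal items, which is why sorting comes first
    return [(len(list(group)), text) for text, group in groupby(sorted(lines))]

# Sample data taken from the question above
sample = ["Hello my friend", "Ho", "A", "XT", "Hello my friend", "A", "Ha"]
for n, text in sort_uniq_c(sample):
    print("{} {}".format(n, text))
```

    The counts match the desired output, but the lines come out in sorted order ("A" first, "XT" last) rather than in order of first appearance.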



  • @JSE-Faucet said:

    Sadly, I can’t use PythonScript on my computer… I don’t know why though. I had to install it with their installer because it wasn’t in the list. Now it is installed but it does not even show up in NPP :(

    Oh, right, I forgot it doesn’t install quite right out of the box for 7.6.x and newer. @Meta-Chuh wrote a Guide to Installing PythonScript Plugin on Notepad++ 7.6.3 and above.



  • @PeterJones said:

    I’ve already fulfilled my desire to see

    I guess I hadn’t. I wanted to see if my ideas worked. They did.

    I removed the order[] array, and just re-parse the file to maintain order. I also changed the key from the line of text itself (which in real data would typically be 20-80 characters, not just the dozen or fewer in the example) to a crc32 of the line, so every line is mapped to a 32-bit (4-byte) key. (Note: 32 bits is too small to guarantee no collisions, but it’s likely good enough; if not, one could probably use a trick with two different 32-bit hashes, like maybe crc32(txt)+crc32(txt.reverse), assuming the crc of the reversed text is different; I think so, but am not sure.) On the first pass, it builds the dictionary of crc32:count pairs; on the second pass, it either appends the count to the end of the line, or deletes the line if that key's count has already been used.

    IyBlbmNvZGluZz11dGYtOA0KIiIiaW4gcmVzcG9uc2UgdG8gaHR0cHM6Ly9ub3Rl
    cGFkLXBsdXMtcGx1cy5vcmcvY29tbXVuaXR5L3RvcGljLzE3NzQ0LyIiIg0KZnJv
    bSBOcHAgaW1wb3J0ICoNCmltcG9ydCB6bGliDQoNCmNvbnNvbGUuY2xlYXIoKQ0K
    DQpkZWYgZm9ydW1fcG9zdDE3NzQ0X0Z1bmN0aW9uTmFtZSgpOg0KICAgICIiIg0K
    ICAgIHRoaXMgdXNlcyBhIGhhc2ggb2YgZWFjaCBsaW5lIGFzIGEga2V5IGluIGEg
    ZGljdGlvbmFyeSwgdG8gY291bnQgaG93IG1hbnkgZW50cmllczsNCiAgICB0byBw
    cmVzZXJ2ZSBvcmRlciwgaXQgd2lsbCBydW4gdGhyb3VnaCB0aGUgZmlsZSBhIHNl
    Y29uZCB0aW1lDQogICAgKHRoaXMgc2F2ZXMgbWVtb3J5IG9mIHdob2xlLWxpbmUg
    a2V5cywgYW5kIGFuIGFycmF5IHRvIGhvbGQgb3JkZXIpDQogICAgIiIiDQoNCiAg
    ICAjIGluaXRpYWxpemUgZGljdGlvbmFyeSBhbmQgb3JkZXJlZCBsaXN0DQogICAg
    Y291bnQgPSBkaWN0KCkNCg0KICAgICMgcGFyc2UgdGhlIGFjdGl2ZSBlZGl0b3In
    cyB0ZXh0DQogICAgZm9yIGxudW0gaW4gcmFuZ2UoZWRpdG9yLmdldExpbmVDb3Vu
    dCgpKToNCiAgICAgICAgZWRpdG9yLmdvdG9MaW5lKGxudW0pDQogICAgICAgIGtl
    eSA9IHpsaWIuY3JjMzIoZWRpdG9yLmdldEN1ckxpbmUoKS5yc3RyaXAoKSkgJiAw
    eEZGRkZGRkZGDQogICAgICAgIGlmIGtleSBpbiBjb3VudDoNCiAgICAgICAgICAg
    IGNvdW50W2tleV0gPSBjb3VudFtrZXldICsgMQ0KICAgICAgICBlbHNlOg0KICAg
    ICAgICAgICAgY291bnRba2V5XSA9IDENCg0KICAgICMgbWFrZSB0aGUgY2hhbmdl
    czoNCiAgICBlZGl0b3IuYmVnaW5VbmRvQWN0aW9uKCkNCiAgICBsbnVtID0gMA0K
    ICAgIHdoaWxlIGxudW0gPCBlZGl0b3IuZ2V0TGluZUNvdW50KCk6ICAgICAjIHVz
    ZSBhIHdoaWxlIGxvb3AgcmF0aGVyIHRoYW4gZm9yIGxvb3AsIHNvIEkgY2FuIGNo
    b29zZSBfbm90XyB0byBhZHZhbmNlIGxudW0gYWZ0ZXIgaSBkZWxldGUgdGhlIHJv
    dyAoYmVjYXVzZSAibmV4dCIgcm93IHdpbGwgaGF2ZSBzYW1lIGxudW0gYXMgdGhl
    IGRlbGV0ZWQgcm93KQ0KICAgICAgICBlZGl0b3IuZ290b0xpbmUobG51bSkNCiAg
    ICAgICAga2V5ID0gemxpYi5jcmMzMihlZGl0b3IuZ2V0Q3VyTGluZSgpLnJzdHJp
    cCgpKSAmIDB4RkZGRkZGRkYNCiAgICAgICAgaWYga2V5IGluIGNvdW50Og0KICAg
    ICAgICAgICAgI2NvbnNvbGUud3JpdGUoIns6MDEwWH18e318e31cbiIuZm9ybWF0
    KGtleSwgZWRpdG9yLmdldEN1ckxpbmUoKS5yc3RyaXAoKSwgY291bnRba2V5XSkp
    DQogICAgICAgICAgICBlZGl0b3IubGluZUVuZCgpDQogICAgICAgICAgICBlZGl0
    b3IuYWRkVGV4dCgiIHt9Ii5mb3JtYXQoY291bnRba2V5XSkpDQogICAgICAgICAg
    ICBkZWwgY291bnRba2V5XSAgIyBkb24ndCB3YW50IHRvIGhhdmUgZHVwbGljYXRl
    cywgc28gcmVtb3ZlIHRoZSBrZXkgdG8gaW5kaWNhdGUgSSdtIGRvbmUNCiAgICAg
    ICAgICAgIGxudW0gPSBsbnVtICsgMQ0KICAgICAgICBlbHNlOg0KICAgICAgICAg
    ICAgI2NvbnNvbGUud3JpdGUoIns6MDEwWH18e318e31cbiIuZm9ybWF0KGtleSwg
    ZWRpdG9yLmdldEN1ckxpbmUoKS5yc3RyaXAoKSwgIk5FRUQgVE8gREVMRVRFIExJ
    TkUiKSkNCiAgICAgICAgICAgIGVkaXRvci5saW5lRGVsZXRlKCkNCiAgICBlZGl0
    b3IuZW5kVW5kb0FjdGlvbigpDQoNCmlmIF9fbmFtZV9fID09ICdfX21haW5fXyc6
    IGZvcnVtX3Bvc3QxNzc0NF9GdW5jdGlvbk5hbWUoKQ0K
    

    (Obfuscated to avoid a spoiler for the solution. If you need a hint, that’s base64 encoded, and Notepad++ usually ships with MIME Tools plugin…)
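
    For readers who only want the shape of the two-pass idea described above, here is a minimal sketch in plain Python that works on a list of lines instead of the Notepad++ editor. It is an illustration under those assumptions, not the obfuscated script itself; the names line_key and count_two_pass are invented for this example.

```python
import zlib

def line_key(text):
    """Map a line to a 32-bit CRC key (distinct lines can collide)."""
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF

def count_two_pass(lines):
    """Return 'line count' strings in order of first appearance."""
    # Pass 1: build the crc32 -> count dictionary
    counts = {}
    for text in lines:
        key = line_key(text)
        counts[key] = counts.get(key, 0) + 1

    # Pass 2: keep a line the first time its key is seen, drop it afterwards
    out = []
    for text in lines:
        key = line_key(text)
        if key in counts:
            # pop() removes the key, marking this line as already emitted
            out.append("{} {}".format(text, counts.pop(key)))
    return out

# Sample data taken from the question at the top of the thread
sample = ["Hello my friend", "Ho", "A", "XT", "Hello my friend", "A", "Ha"]
for row in count_two_pass(sample):
    print(row)
```

    The dictionary holds one small integer key per unique line instead of the full text, which is the memory saving the post describes. In Python 3, zlib.crc32 needs bytes, hence the encode call. If two different lines ever share a CRC, they would be counted as one, which is the collision caveat mentioned above.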



  • @PeterJones

    OMG. Peter, I think this Community is like a drug for you…or maybe for more than just you.

    :)

