Count the occurrences of each line
-
Hello, I don’t know if this is possible with Notepad++, but this is what I want to do:
Imagine that you have a text file which contains:
Hello my friend
Ho
A
XT
Hello my friend
A
Ha
So I want to count the occurrences of each line; the output I want would look like this:
Hello my friend 2
Ho 1
A 2
XT 1
Ha 1
Do you think it’s possible?
Thank you!
-
@JSE-Faucet said:
I don’t know if this is possible with Notepad++
Unfortunately, not solely with Notepad++. When paired with one of the scripting-language plugins (PythonScript, LuaScript, or “jN Notepad++ Plugin”), you could have a script do that inside Notepad++. (Or you could use any of those languages, or other programming languages, to do the same thing without Notepad++'s help.)
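For the "without Notepad++'s help" route, here is a minimal sketch in standalone Python. It assumes Python 3.7 or newer (where dictionaries, and therefore Counter, keep insertion order), and the name count_lines is just a hypothetical helper for illustration, not part of any plugin or library.

```python
from collections import Counter

def count_lines(lines):
    """Return (line, count) pairs in order of first appearance.

    Assumes Python 3.7+, where Counter (a dict subclass) keeps insertion order.
    """
    counts = Counter(line.rstrip("\r\n") for line in lines)
    return list(counts.items())

# the example data from the original question
sample = ["Hello my friend", "Ho", "A", "XT", "Hello my friend", "A", "Ha"]

for text, n in count_lines(sample):
    print(text, n)
```

To run it on a real file, you could pass an open file object instead of the list, since iterating over a file yields its lines one at a time.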
-
Since it had been a while since I’d last tested my PythonScript chops, and I had a few minutes, I decided to see if I could implement it. This replicates your results:
# encoding=utf-8
"""in response to https://notepad-plus-plus.org/community/topic/17744/"""
from Npp import *

def forum_post17744_FunctionName():
    """
    this uses each line as a key in a dictionary, to count how many entries;
    to preserve order, it also stores the key in a list
    """

    # initialize dictionary and ordered list
    count = dict()
    order = list()

    # parse the active editor's text
    for lnum in range(editor.getLineCount()):
        editor.gotoLine(lnum)
        key = editor.getCurLine().rstrip()
        if key in count:
            count[key] = count[key] + 1
        else:
            count[key] = 1
            order.append(key)

    # make the changes:
    #   delete old contents,
    #   insert key + count + EOL for each of the unique lines, in original order
    editor.beginUndoAction()
    editor.selectAll()
    editor.deleteBack()
    for key in order:
        editor.addText("{} {}\r\n".format(key, count[key]))
    editor.endUndoAction()

if __name__ == '__main__': forum_post17744_FunctionName()
To use this,
- Install PythonScript plugin using Plugins > Plugins Admin
- Plugins > Python Script > New Script, give it a name like
CountUniqueLines.py
- Paste in the code and save
- Open your example file in the active editor pane
- Plugins > Python Script > Scripts > CountUniqueLines
Enjoy your free code-writing service.
-
Are you sure that it will work with very big files ? (90MB)
-
@JSE-Faucet
work with very big files
One way to find out…
If Notepad++ itself can handle the file (and it should be fine, 90MB is not THAT big), such a script can handle processing it…
-
Sadly, I can’t use PythonScript on my computer… I don’t know why though. I had to install it with their installer because it wasn’t in the list. Now it is installed but it does not even show up in NPP :(
-
@JSE-Faucet said:
Are you sure that it will work with very big files ? (90MB)
No, I am not sure; though, as @Alan-Kilborn said, 90MB is pretty small in the modern scheme of things. Given your problem statement, I thought this was sufficient for getting through a one-time need. (And really, even that much is going above and beyond, since this is a forum about Notepad++, not a free code-writing service.)
I have ideas on how the script could be modified to handle what you want. But it would take more investment of time for me, and I’ve already fulfilled my desire to see how I would give this a quick implementation.
And actually, I remembered that nearly this functionality already exists. The Linux command
uniq -c
will almost get you what you want.
sort infile.txt | uniq -c
will put all similar lines next to each other, and then count duplicates. However, it will not preserve your original order. If that’s of interest to you, see http://gnuwin32.sourceforge.net/packages/coreutils.htm for a Windows implementation.
-
@JSE-Faucet said:
Sadly, I can’t use PythonScript on my computer… I don’t know why though. I had to install it with their installer because it wasn’t in the list. Now it is installed but it does not even show up in NPP :(
Oh, right, I forgot it doesn’t install quite right out of the box for 7.6.x and newer. @Meta-Chuh wrote a Guide to Installing PythonScript Plugin on Notepad++ 7.6.3 and above.
-
@PeterJones said:
I’ve already fulfilled my desire to see
I guess I hadn’t. I wanted to see if my ideas worked. They did.
I removed the order[] array, and just re-parsed the file to maintain order. I also changed the key from being the line of text (which in real data would typically be 20-80 characters, not just the dozen or fewer in the example) to a crc32 of the line (so every line is mapped to a 32-bit = 4-byte key). (Note: 32 bits is too small to be guaranteed collision-free, but it’s likely good enough; if not, one could probably use a trick to combine two different 32-bit hashes, like maybe crc32(txt)+crc32(txt.reverse), assuming the crc of the reversed text is different; I think so, but am not sure.) On the first pass, it builds the dictionary of crc32:count pairs; on the second pass, it either adds the count to the end of the line, or it deletes the line if that key’s count has already been used.
IyBlbmNvZGluZz11dGYtOA0KIiIiaW4gcmVzcG9uc2UgdG8gaHR0cHM6Ly9ub3Rl cGFkLXBsdXMtcGx1cy5vcmcvY29tbXVuaXR5L3RvcGljLzE3NzQ0LyIiIg0KZnJv bSBOcHAgaW1wb3J0ICoNCmltcG9ydCB6bGliDQoNCmNvbnNvbGUuY2xlYXIoKQ0K DQpkZWYgZm9ydW1fcG9zdDE3NzQ0X0Z1bmN0aW9uTmFtZSgpOg0KICAgICIiIg0K ICAgIHRoaXMgdXNlcyBhIGhhc2ggb2YgZWFjaCBsaW5lIGFzIGEga2V5IGluIGEg ZGljdGlvbmFyeSwgdG8gY291bnQgaG93IG1hbnkgZW50cmllczsNCiAgICB0byBw cmVzZXJ2ZSBvcmRlciwgaXQgd2lsbCBydW4gdGhyb3VnaCB0aGUgZmlsZSBhIHNl Y29uZCB0aW1lDQogICAgKHRoaXMgc2F2ZXMgbWVtb3J5IG9mIHdob2xlLWxpbmUg a2V5cywgYW5kIGFuIGFycmF5IHRvIGhvbGQgb3JkZXIpDQogICAgIiIiDQoNCiAg ICAjIGluaXRpYWxpemUgZGljdGlvbmFyeSBhbmQgb3JkZXJlZCBsaXN0DQogICAg Y291bnQgPSBkaWN0KCkNCg0KICAgICMgcGFyc2UgdGhlIGFjdGl2ZSBlZGl0b3In cyB0ZXh0DQogICAgZm9yIGxudW0gaW4gcmFuZ2UoZWRpdG9yLmdldExpbmVDb3Vu dCgpKToNCiAgICAgICAgZWRpdG9yLmdvdG9MaW5lKGxudW0pDQogICAgICAgIGtl eSA9IHpsaWIuY3JjMzIoZWRpdG9yLmdldEN1ckxpbmUoKS5yc3RyaXAoKSkgJiAw eEZGRkZGRkZGDQogICAgICAgIGlmIGtleSBpbiBjb3VudDoNCiAgICAgICAgICAg IGNvdW50W2tleV0gPSBjb3VudFtrZXldICsgMQ0KICAgICAgICBlbHNlOg0KICAg ICAgICAgICAgY291bnRba2V5XSA9IDENCg0KICAgICMgbWFrZSB0aGUgY2hhbmdl czoNCiAgICBlZGl0b3IuYmVnaW5VbmRvQWN0aW9uKCkNCiAgICBsbnVtID0gMA0K ICAgIHdoaWxlIGxudW0gPCBlZGl0b3IuZ2V0TGluZUNvdW50KCk6ICAgICAjIHVz ZSBhIHdoaWxlIGxvb3AgcmF0aGVyIHRoYW4gZm9yIGxvb3AsIHNvIEkgY2FuIGNo b29zZSBfbm90XyB0byBhZHZhbmNlIGxudW0gYWZ0ZXIgaSBkZWxldGUgdGhlIHJv dyAoYmVjYXVzZSAibmV4dCIgcm93IHdpbGwgaGF2ZSBzYW1lIGxudW0gYXMgdGhl IGRlbGV0ZWQgcm93KQ0KICAgICAgICBlZGl0b3IuZ290b0xpbmUobG51bSkNCiAg ICAgICAga2V5ID0gemxpYi5jcmMzMihlZGl0b3IuZ2V0Q3VyTGluZSgpLnJzdHJp cCgpKSAmIDB4RkZGRkZGRkYNCiAgICAgICAgaWYga2V5IGluIGNvdW50Og0KICAg ICAgICAgICAgI2NvbnNvbGUud3JpdGUoIns6MDEwWH18e318e31cbiIuZm9ybWF0 KGtleSwgZWRpdG9yLmdldEN1ckxpbmUoKS5yc3RyaXAoKSwgY291bnRba2V5XSkp DQogICAgICAgICAgICBlZGl0b3IubGluZUVuZCgpDQogICAgICAgICAgICBlZGl0 b3IuYWRkVGV4dCgiIHt9Ii5mb3JtYXQoY291bnRba2V5XSkpDQogICAgICAgICAg ICBkZWwgY291bnRba2V5XSAgIyBkb24ndCB3YW50IHRvIGhhdmUgZHVwbGljYXRl 
cywgc28gcmVtb3ZlIHRoZSBrZXkgdG8gaW5kaWNhdGUgSSdtIGRvbmUNCiAgICAg ICAgICAgIGxudW0gPSBsbnVtICsgMQ0KICAgICAgICBlbHNlOg0KICAgICAgICAg ICAgI2NvbnNvbGUud3JpdGUoIns6MDEwWH18e318e31cbiIuZm9ybWF0KGtleSwg ZWRpdG9yLmdldEN1ckxpbmUoKS5yc3RyaXAoKSwgIk5FRUQgVE8gREVMRVRFIExJ TkUiKSkNCiAgICAgICAgICAgIGVkaXRvci5saW5lRGVsZXRlKCkNCiAgICBlZGl0 b3IuZW5kVW5kb0FjdGlvbigpDQoNCmlmIF9fbmFtZV9fID09ICdfX21haW5fXyc6 IGZvcnVtX3Bvc3QxNzc0NF9GdW5jdGlvbk5hbWUoKQ0K
(Obfuscated to avoid a spoiler for the solution. If you need a hint, that’s base64 encoded, and Notepad++ usually ships with the MIME Tools plugin…)
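If you would rather see the two-pass idea outside of Notepad++ first, here is a rough sketch in standalone Python 3. It is not the script above: it works on a plain list of strings instead of the editor buffer, the name count_lines_crc32 is made up for illustration, and encoding each line as UTF-8 before hashing is my assumption. Like the description above, it trusts that no two different lines share a CRC32.

```python
import zlib

def count_lines_crc32(lines):
    """Two-pass count keyed by CRC32; 'lines' must be re-iterable (e.g. a list).

    Assumption: CRC32 collisions between different lines are ignored, so
    colliding lines would be counted together.
    """
    def key(text):
        # mask so the key is always an unsigned 32-bit integer
        return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF

    # pass 1: build the crc32 -> count dictionary
    counts = {}
    for line in lines:
        k = key(line.rstrip())
        counts[k] = counts.get(k, 0) + 1

    # pass 2: keep the first copy of each line with its count appended;
    # popping the key marks it as used, so later copies are skipped
    result = []
    for line in lines:
        k = key(line.rstrip())
        if k in counts:
            result.append("{} {}".format(line.rstrip(), counts.pop(k)))
    return result

sample = ["Hello my friend", "Ho", "A", "XT", "Hello my friend", "A", "Ha"]
print("\n".join(count_lines_crc32(sample)))
```

The dictionary only ever holds one small integer key per unique line, which is the memory saving being described; the price is reading the data twice and the small chance of a hash collision.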
-
OMG. Peter, I think this Community is like a drug for you…or maybe for more than just you.
:)