How to remove duplicate words in a list that are not consecutive?
-
I need to remove duplicate words from a list that has one word per line. The duplicates are not consecutive; they are scattered across various line numbers.
I tried the Edit>Line Operations>Remove Duplicate Lines, but this feature doesn’t work when there are thousands of lines in a list.
So I need to know: up to how many lines does this feature work?
Is there a script that can do this for more than 30k lines?
-
@Debojit-Acharjee said in How to remove duplicate words in a list that are not consecutive?:
I tried the Edit>Line Operations>Remove Duplicate Lines,
I think it should work, even for the number of lines you mention.
The online manual reference is here, down a bit, look for Line Operations. There is a requirement that the line endings are uniform and match the end-of-line type shown in the status bar at the bottom. Have a read and check that your file meets those requirements.
Terry
-
The following PythonScript script is a fine solution:
'''
ref: https://community.notepad-plus-plus.org/topic/25492/how-to-remove-duplicate-words-in-a-list-that-are-not-consecutive
requires PythonScript: https://github.com/bruderstein/PythonScript
'''
from Npp import *

# lines already seen in the document
values = set()

def callback(match):
    line = match.group(0)
    if line in values:
        # duplicate: replace the line's text with nothing
        return ''
    values.add(line)
    return line

# (?-s) keeps . from matching line endings, so each match is one line's text
editor.rereplace('(?-s)^.*$', callback)
-
-
Notes on Mark’s script:
-
when it removes a duplicate line, it leaves an empty line at the position of the removal (perhaps the OP wants this, perhaps not; OP didn’t say)
-
naming the replacement function callback is something I don’t really like, as “callback” has a bit of a different connotation in PythonScript programming (but this is MY problem, not a problem with the script)
There are some other useful scripts for removing duplicate lines in THIS fairly old thread.
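If the empty lines left behind are not wanted, one option is to drop each duplicate together with its line ending. The following is only a minimal sketch of that idea in plain Python, not Mark's script: `dedupe_lines` is a hypothetical helper name, and it assumes the whole document fits comfortably in memory as one string.

```python
import re

# One match per line: the line's text plus its terminator (CR LF, CR or LF),
# or a final line that has no terminator at all.
_LINE = re.compile(r'[^\r\n]*(?:\r\n|\r|\n)|[^\r\n]+')

def dedupe_lines(text):
    """Return text with repeated lines removed, keeping each first occurrence.

    Hypothetical helper for illustration; lines are compared by content only,
    so the kind of line ending does not affect whether two lines are equal.
    """
    seen = set()
    kept = []
    for line in _LINE.findall(text):
        key = line.rstrip('\r\n')  # content without the terminator
        if key in seen:
            continue  # drop the whole line, terminator included
        seen.add(key)
        kept.append(line)
    return ''.join(kept)
```

Inside PythonScript, something like `editor.setText(dedupe_lines(editor.getText()))` should apply it to the current document, but I have not tested that on a 30K+ line file, so treat it as an assumption. Note that empty lines count as duplicates of each other here, so only the first one is kept.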
-
-
@Alan-Kilborn I can use a script, but I want to know why the “Remove Duplicate Lines” feature of “Line Operations” in Notepad++ doesn’t work when there are more than 30 thousand lines?
Does it have anything to do with the CPU register memory?
-
Did you check the online reference I linked to?
I just created a file of more than 30K lines with 1 word on every line. Since I built it from approximately 1500 words that I duplicated, the “Remove Duplicate Lines” option was going to remove most of the lines, and it left just over 1100 words, as seen in the image below. The removal was very quick, only about 1 second (or less).
I have shown the line ending in the file and also pointed out that the file is recognized as the same type (CR LF). That is what the online reference refers to. To show the line endings you use the View menu, then “Show Symbol”, then tick at least the “Show End of Line”.
Do that and take a picture of your file, post it here. Also copy and paste the version of your Notepad++ installation. It is under the ? menu, then “Debug Info”.
As your installation may well be using a different language, these menu names will need to be translated to your language.
Without that additional information we have no way of identifying your problem, but rest assured that the “Remove Duplicate Lines” option does work if the requirements are met.
Terry
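For anyone who wants to check the line-ending requirement outside Notepad++, here is a rough sketch in plain Python that counts each kind of terminator in a file's raw bytes. The function names are made up for illustration, and it assumes an ANSI or UTF-8 file (it will miscount UTF-16 files, where each character takes two bytes).

```python
import re

def count_line_endings(data):
    """Count CR LF, lone LF and lone CR terminators in a bytes object."""
    counts = {'CRLF': 0, 'LF': 0, 'CR': 0}
    # Try CR LF first so a Windows line ending is not counted as CR plus LF.
    for eol in re.findall(b'\r\n|\r|\n', data):
        if eol == b'\r\n':
            counts['CRLF'] += 1
        elif eol == b'\n':
            counts['LF'] += 1
        else:
            counts['CR'] += 1
    return counts

def has_mixed_line_endings(data):
    """True if more than one kind of terminator appears in data."""
    kinds_present = [n for n in count_line_endings(data).values() if n]
    return len(kinds_present) > 1
```

To use it, read the file in binary mode, for example `data = open('words.txt', 'rb').read()` (the file name is a placeholder), and pass `data` to either function. If more than one kind of ending shows up, the file does not have uniform line endings.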
-
@Debojit-Acharjee The simplest solution is to use the PythonScript script @Mark-Olson gave you. If you don’t know how to install and run a script in PythonScript, please click and read “How to install and run a script in Python Script”.
-
@Debojit-Acharjee said:
I can use a script, but I want to know why the “Remove Duplicate Lines” feature of “Line Operations” in Notepad++ doesn’t work when there are more than 30 thousand lines? Does it have anything to do with the CPU register memory?
The best way to explore this is to create an “issue” on the official bug reporting site (see HERE for info on that) and attach the 30K+ file where it fails.