Remove duplicate lines in separate files

Sarah Duong

I have 12 text files, each with a very large number of lines (most lines have about 30 characters, some only 5). I am in need of removing duplicate lines in those 12 files. I know how to use regular expressions to eliminate duplicate lines within one file, but in this case each file is so large that I cannot combine them. So how do I remove lines that are duplicated across the files?
Hope everyone understands what I’m saying. Because I use Google to translate, sometimes the presentation is incorrect. Hope you understand me. Thank you

Terry R

@Sarah-Duong said in Remove duplicate lines in separate files:

I am in need of removing duplicate lines in those 12 files

If I read correctly, you are saying that each file contains only unique lines, but a line may have a duplicate in another file. If that description is correct, NPP cannot help by itself. There may be a plugin that could help, but I don’t know of any.

I did have an idea that involves combining the files, but you say that is not possible. Are you sure you cannot? If you could, NPP would be able to remove the duplicates fairly easily in a number of steps, such as tagging each line with its file number and line number (appended at the end of each line) and then sorting. You would also need to stipulate the precedence or priority of the files. I mean, if files 1 and 3 contain the same line, which file has it removed?
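To make the tagging idea concrete, here is a minimal sketch in Python rather than NPP. It assumes the lines fit in memory and that the lower file number wins when two files share a line; that priority rule is only an assumption, since the question above is still open. The function name `dedupe_by_priority` is made up for illustration.

```python
from typing import Dict, List


def dedupe_by_priority(files: Dict[int, List[str]]) -> Dict[int, List[str]]:
    """files maps a file number to its lines; lower number = higher priority (assumed)."""
    # Tag every line with its text, file number and line number.
    tagged = [
        (text, file_no, line_no)
        for file_no, lines in files.items()
        for line_no, text in enumerate(lines)
    ]
    # Sorting puts identical texts next to each other, highest-priority copy first.
    tagged.sort()

    kept = []
    previous = None
    for text, file_no, line_no in tagged:
        if text != previous:  # first copy of this text: keep it
            kept.append((file_no, line_no, text))
            previous = text

    # Sort back into original file order and line order.
    kept.sort()
    result: Dict[int, List[str]] = {file_no: [] for file_no in files}
    for file_no, _line_no, text in kept:
        result[file_no].append(text)
    return result
```

With files 1 and 3 both containing the same line, this keeps the copy in file 1 and drops it from file 3.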

Come back with some more detail such as file sizes, number of lines in each file and how you determine priority.

Terry

Sarah Duong

@Terry-R I could not combine two files together, because by then the number of lines was more than 5 million. If I use the expression to remove duplicates in Notepad++ with 5 million lines, my machine will hang. If the 12 files are combined, my computer will go to the doctor. Actually, there are more than 100 files, not 12. The link is below.
https://github.com/kennyn510/wpa2-wordlists
They have instructions on how to eliminate duplicates. However, that seems to be possible only on Linux. On Windows, I do not know how.
After translating your words, you seem to have something difficult for me? Hope you do not misunderstand what I present. I am very grateful for your help. I only present what is displayed on my computer.

Terry R

@Sarah-Duong said in Remove duplicate lines in separate files:

you seem to have something difficult for me?

My idea was not difficult; however, there would have been a number of steps to do.

Now that you have changed the difficulty to 100 or so files, I don’t think my idea as it was would be achievable without adding more steps.

Often when a problem is too big it should be looked at a different way. So the new idea would still take a number of steps but might work.

1. Sort each file alphabetically. If the current line order must be kept, add a line number to each line before sorting.
2. Cut the file into a number of smaller files, say along alphabetical lines: all lines starting with “a” in one file, “b” in another, and so on. Depending on size, it may even need to be “aa”, “ab”, “ac”, etc.
3. Repeat steps 1 and 2 for every file.
4. Combine all the “a” files together and find the duplicates. Repeat for every letter.
5. Once the duplicates are found and removed, sort the lines by the original file they came from.
6. Combine the lines back into the original files and re-sort according to the original line number.
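For anyone willing to step outside NPP, the steps above can be sketched as a small Python script. This is only an illustration under stated assumptions: the files are UTF-8 text, `work_dir` is an empty scratch folder, the earliest file in the list keeps its copy of a duplicated line, and results go to new `.dedup` files instead of overwriting the originals. The function name and file naming are made up for the example.

```python
import os
from collections import defaultdict


def dedupe_across_files(paths, work_dir):
    """Drop lines repeated across `paths`; the earliest file keeps its copy (assumed rule)."""
    os.makedirs(work_dir, exist_ok=True)  # assumed to start out empty
    bucket_paths = set()

    # Steps 1-3: tag each line with file and line number, split by first character.
    for file_no, path in enumerate(paths):
        groups = defaultdict(list)
        with open(path, encoding="utf-8") as src:
            for line_no, raw in enumerate(src):
                text = raw.rstrip("\n")
                # Use the character code so the bucket file name is always valid.
                key = format(ord(text[0]), "x") if text else "empty"
                groups[key].append(f"{file_no}\t{line_no}\t{text}\n")
        for key, records in groups.items():
            bucket = os.path.join(work_dir, f"bucket_{key}.tsv")
            bucket_paths.add(bucket)
            with open(bucket, "a", encoding="utf-8") as out:
                out.writelines(records)

    # Step 4: within each bucket, keep only the first copy of every line.
    for bucket in sorted(bucket_paths):
        seen = set()
        kept = defaultdict(list)
        with open(bucket, encoding="utf-8") as src:
            for record in src:
                file_no, line_no, text = record.rstrip("\n").split("\t", 2)
                if text not in seen:
                    seen.add(text)
                    kept[file_no].append(f"{line_no}\t{text}\n")
        os.remove(bucket)
        for file_no, records in kept.items():
            kept_path = os.path.join(work_dir, f"kept_{file_no}.tsv")
            with open(kept_path, "a", encoding="utf-8") as out:
                out.writelines(records)

    # Steps 5-6: rebuild each file in its original line order.
    outputs = []
    for file_no, path in enumerate(paths):
        kept_path = os.path.join(work_dir, f"kept_{file_no}.tsv")
        rows = []
        if os.path.exists(kept_path):
            with open(kept_path, encoding="utf-8") as src:
                for record in src:
                    line_no, text = record.rstrip("\n").split("\t", 1)
                    rows.append((int(line_no), text))
            os.remove(kept_path)
        rows.sort()
        out_path = path + ".dedup"
        with open(out_path, "w", encoding="utf-8") as out:
            out.writelines(text + "\n" for _, text in rows)
        outputs.append(out_path)
    return outputs
```

Only one input file or one bucket is held in memory at a time, which is the point of splitting by first character. If a single bucket is still too large, the key could be the first two characters instead of one.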

This is a concept at a high level; obviously it will involve lots of steps and lots of manual work. The only other possibility I see is using some other product. You mentioned something in Linux; maybe that’s where you should direct your efforts.

Terry

Terry R

@Sarah-Duong said in Remove duplicate lines in separate files:

with 5 million lines, my machine will hang.

I’ve been re-reading the other thread you started, where @guy038 provided you with a regular expression (regex):
https://community.notepad-plus-plus.org/topic/19022/type-of-duplicate-lines/16
That regex could be causing your machine to hang. The rest of the posts in that thread went on to test some timings for @guy038’s regex. I had also explained my natural aversion to the lookahead that @guy038 uses in his regex. I think that’s why your machine is hanging. At 5 million lines, the lookahead is possibly not the best option for you.
So, some questions:

Do you need the lines in each file to stay in the same order, or can they be sorted alphabetically?
Can you provide the size of the largest file, the smallest file, the average size, and the number of files you need to work on?

If we can sort the lines in the files, then the built-in function to remove duplicate lines should work. You might need to combine the files in several groups, keeping each group below a maximum size that you will need to determine by testing. It will take some time. Of course, this is ONLY if you wish to stay on Windows and keep using Notepad++.
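As a rough illustration of those two ideas (not NPP itself), the Python sketch below sorts the lines and keeps one copy from each run of identical neighbours, and separately groups files so that each combined group stays under a size limit. The function names and the size limit are assumptions for the example; the real limit has to come from testing.

```python
from itertools import groupby


def sorted_unique(lines):
    """Sort the lines, then keep one copy from each run of identical neighbours."""
    return [text for text, _run in groupby(sorted(lines))]


def batch_by_size(sizes, max_bytes):
    """Greedily group (name, size) pairs so each batch totals at most max_bytes.

    A single file larger than max_bytes still gets a batch of its own.
    """
    batches, current, total = [], [], 0
    for name, size in sizes:
        if current and total + size > max_bytes:
            batches.append(current)
            current, total = [], 0
        current.append(name)
        total += size
    if current:
        batches.append(current)
    return batches
```

Note that removing duplicates inside each group does not catch a line that appears in two different groups, so more passes with different groupings would still be needed.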

Terry

Sarah Duong

Maybe I can only do this slowly, following your instructions, because the actual number of files is quite large. The average file has about 200,000 lines. This is really difficult for me. Thanks

Terry R

@Sarah-Duong said in Remove duplicate lines in separate files:

This is really difficult for me.

Well, you have presented quite a significant problem, mainly due to the size. The actual process (as I’ve outlined previously) is not difficult, but given the number and size of the files the solution will take some effort by you to complete.

So do you want a solution in the Windows environment, or were you considering the Linux solution, in which case this thread (collection of posts) can close?

You say the average file has 200,000 lines. If we assume an average of 18 characters per line (you said 5 to 30 characters per line), that makes an average file size of about 3.6 MB. I haven’t personally worked on a file of this size; however, I’m sure NPP is capable of handling much larger files. It can depend on the environment, such as whether the NPP used is 32-bit or 64-bit, and whether you have additional plugins loaded.
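The estimate can be checked with a few lines of Python. The sketch assumes one byte per character; the line-ending bytes and the count of exactly 100 files are extra assumptions added here to show how the total grows, not figures from the posts.

```python
def estimated_bytes(lines, avg_chars, newline_bytes=0):
    """Rough file size: lines x (average characters + line-ending bytes)."""
    return lines * (avg_chars + newline_bytes)


per_file = estimated_bytes(200_000, 18)              # 3,600,000 bytes, about 3.6 MB
per_file_crlf = estimated_bytes(200_000, 18, 2)      # 4,000,000 bytes with Windows line endings
all_files = 100 * per_file                           # 360,000,000 bytes for 100 such files
```

So one average file is small, but everything combined would be a few hundred MB, which is why the work is split into groups.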

I still think it will only be possible to use NPP if the files are broken down into groups, which will mean sorting each file first and breaking it apart by the first 1 or 2 characters, then processing each group separately.

Please advise whether you still want to consider this approach. We (on the forum) can help you, but be aware it will be a lengthy job.

Terry