How to find duplicate words in two files?

Gabriela Stefanova

Hi, I have been trying to find a solution since last week without success, so I hope you can help me.
I have two huge files (the first has 2 million lines and the second 200k). The first has names and telephone numbers, and the second has only names.
Example:
File one : John Smith:+355867522
Second file: John Smith
The files are not the same, so I can’t compare them directly, and I can’t find duplicate lines because of the telephone numbers.
I need to find how many of the people in the smaller list are also in the bigger one and, where there is a match, to add the telephone number.
In other words, filter the bigger file down to only the names from the smaller one, keeping the phone numbers.
I have tried a million things but unfortunately had no luck.
I hope someone can help me.
Thanks in advance

Alan Kilborn

@Gabriela-Stefanova

This works for me on some sample data:

add the contents of the second file to the first file at the bottom
sort the lines of the (new) first file using Edit > Line Operations > Sort Lines Lexicographically Ascending
use the Mark dialog to mark the following regular expression: ^(.+?):\+\d+(?=\R\1$) being sure to set the search mode to Regular expression and to tick the Wrap around checkbox
copy the text now marked in red to the clipboard by pressing the Copy Marked Text button (in the Mark window)
paste the clipboard into a new file – this file is your desired “second file” data
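
For files this large, the same filtering can also be sketched outside the editor. The Python below is only an illustration, not part of the steps above: it assumes every line of the big file looks like `Name:+digits`, that names match exactly (same case and spacing), and the function names are made up for this example.

```python
from typing import Iterable, Iterator, Set


def load_names(lines: Iterable[str]) -> Set[str]:
    """Collect the non-empty, stripped names from the names-only file."""
    return {line.strip() for line in lines if line.strip()}


def matching_lines(big_lines: Iterable[str], names: Set[str]) -> Iterator[str]:
    """Yield each 'Name:+digits' line whose name part is in `names`."""
    for line in big_lines:
        line = line.rstrip("\r\n")
        # Assumption: the phone number follows the last colon on the line.
        name, sep, _phone = line.rpartition(":")
        if sep and name.strip() in names:
            yield line


def filter_files(big_path: str, small_path: str, out_path: str) -> int:
    """Write matching lines to out_path and return how many were found."""
    with open(small_path, encoding="utf-8") as small:
        names = load_names(small)
    count = 0
    with open(big_path, encoding="utf-8") as big, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in matching_lines(big, names):
            out.write(line + "\n")
            count += 1
    return count
```

Calling `filter_files` with the paths of the two files and an output path (all hypothetical here) would write the matching `Name:+number` lines and return how many were found. Only the 200k names are held in memory as a set, while the 2-million-line file is read one line at a time.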

guy038

Hello, @gabriela-stefanova, @alan-kilborn and All,

Alan, you said :

sort the lines of the (new) first file using Edit > Line Operations > Sort Lines Lexicographically Ascending

After testing, it seems that your regex, which is otherwise correct, works only if the lines are first sorted in the opposite order:

Edit > Line Operations > Sort Lines Lexicographically Descending

BTW, very easy and nice solution ;-))
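
Why the sort direction matters can be checked with a small sketch. This is only an approximation of what the editor does: Python's `re` module has no `\R`, so `\n` stands in for the line break, Python's default string sort is assumed to be close enough to the editor's lexicographic sort for this sample, and the sample data and helper name are invented for illustration.

```python
import re
from typing import List

# Python's `re` has no \R, so \n stands in for the line break here.
PATTERN = re.compile(r"^(.+?):\+\d+(?=\n\1$)", re.MULTILINE)


def marked_lines(lines: List[str], descending: bool) -> List[str]:
    """Sort the combined lines, then return what the pattern would mark."""
    text = "\n".join(sorted(lines, reverse=descending))
    return [m.group(0) for m in PATTERN.finditer(text)]


# Both files combined: two name:number lines and one name-only line.
combined = ["John Smith:+355867522", "Jane Doe:+111", "John Smith"]
ascending_hits = marked_lines(combined, descending=False)
descending_hits = marked_lines(combined, descending=True)
```

The lookahead requires the name-only line to come directly after the `Name:+number` line. An ascending sort puts the shorter name-only line first, so nothing is marked, while a descending sort puts it second and the line is found.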

Regards,

guy038

Alan Kilborn

@guy038 said in How to find duplicate words in two files?:

Lexicographically Descending

Yes, dammit.
These types of typos are the death of me LATELY. :-(
I actually realized the typo while I was napping and came back to my PC to fix it, and found you already had.