How to find duplicate words in two files?
Gabriela Stefanova last edited by
Hi, I have been trying to find a solution since last week but still can’t, so I hope you can help me.
I have two huge files (the first has 2 million lines and the second 200k). The first has names and telephone numbers, and the second has only names.
File one : John Smith:+355867522
Second file: John Smith
The files are not the same, so I can’t compare them directly, and I can’t find duplicate lines because of the telephone numbers.
I need to find how many of the people in the smaller list are in the bigger one and, where there is a match, add the telephone number.
In other words, filter the bigger file down to only the names from the small one, but with the phone numbers.
I have tried a million things but unfortunately had no luck.
I hope someone can help me.
Thanks in advance
Alan Kilborn last edited by
This works for me on some sample data:
- add the contents of the second file to the first file at the bottom
- sort the lines of the (new) first file using Edit > Line Operations > Sort Lines Lexicographically Ascending
- use the Mark dialog to mark the following regular expression:
^(.+?):\+\d+(?=\R\1$)
being sure to set the search mode to Regular expression and to tick the Wrap around checkbox
- copy the text now marked in red to the clipboard by pressing the Copy Marked Text button (in the Mark window)
- paste the clipboard into a new file – this file is your desired “second file” data
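For anyone who would rather script this than do it in the editor, here is a minimal sketch of the same filtering in Python. It assumes every line of the big file looks like Name:+digits and that names match exactly; the function name filter_by_names and the sample data are made up for illustration.

```python
def filter_by_names(big_lines, names):
    """Return the 'Name:+digits' lines whose name appears in `names`."""
    # A set gives constant-time lookups, which matters with 200k names.
    wanted = {n.strip() for n in names if n.strip()}
    matches = []
    for line in big_lines:
        line = line.rstrip("\r\n")
        # Split on the first colon: the name is on the left, the phone on the right.
        name, sep, _phone = line.partition(":")
        if sep and name in wanted:
            matches.append(line)
    return matches


# Hypothetical sample data standing in for the two files.
big = ["John Smith:+355867522", "Jane Doe:+355111222", "Bob Stone:+355333444"]
small = ["John Smith", "Bob Stone", "Ann Lee"]

result = filter_by_names(big, small)
found = {line.partition(":")[0] for line in result}
print(result)
print(len(found), "of", len(small), "names found")
```

With real files, the two lists can be replaced by open file objects, since the function only iterates over lines.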
guy038 last edited by
Alan, you said :
sort the lines of the (new) first file using Edit > Line Operations > Sort Lines Lexicographically Ascending
After testing, it seems that your regex, which is correct, works only if the opposite sort has been applied beforehand:
Edit > Line Operations > Sort Lines Lexicographically Descending
BTW, very easy and nice solution ;-))
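The effect of the sort order can be checked outside the editor with a small Python sketch. It rests on two assumptions: Python's re module has no \R, so \n stands in for the line break, and Python's default string sort is taken as an approximation of the editor's lexicographic sort. The helper name marked and the sample lines are hypothetical.

```python
import re

# Same pattern as in the Mark dialog, with \n in place of \R.
PATTERN = re.compile(r"^(.+?):\+\d+(?=\n\1$)", re.MULTILINE)


def marked(lines, reverse):
    """Sort the combined lines, then return the text the pattern would mark."""
    text = "\n".join(sorted(lines, reverse=reverse))
    return [m.group(0) for m in PATTERN.finditer(text)]


# Hypothetical combined file: two lines from the big file, one from the small file.
combined = ["John Smith:+355867522", "Jane Doe:+355111222", "John Smith"]

ascending = marked(combined, reverse=False)
descending = marked(combined, reverse=True)
print("ascending: ", ascending)
print("descending:", descending)
```

In an ascending sort the bare name comes before the line with the number, so the lookahead never finds the name on the following line. In a descending sort the numbered line comes first and the pattern matches.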
Alan Kilborn last edited by Alan Kilborn