How to find duplicate words in two files?



  • Hi, I have been trying to find a solution since last week but I still can’t, so I hope you can help me.
    I have two huge files (the first has 2 million lines and the second one 200k). The first has names and telephone numbers, and the second file has only names.
    Example:
    File one : John Smith:+355867522
    Second file: John Smith
    The files are not the same, so I can’t compare them directly. I can’t find duplicate lines because of the telephone number.
    I need to find how many of the people in the smaller list are in the bigger one and, if there is a match, add the telephone number there.
    Or in other words, filter the bigger file down to only the names from the small one, but with the phone number.
    I have tried a million things but unfortunately, no luck.
    I hope someone can help me.
    Thanks in advance



  • @Gabriela-Stefanova

    This works for me on some sample data:

    1. add the contents of the second file to the first file at the bottom
    2. sort the lines of the (new) first file using Edit > Line Operations > Sort Lines Lexicographically Ascending
    3. use the Mark dialog to mark the following regular expression: ^(.+?):\+\d+(?=\R\1$) being sure to set the search mode to Regular expression and to tick the Wrap around checkbox
    4. copy the text now marked in red to the clipboard by pressing the Copy Marked Text button (in the Mark window)
    5. paste the clipboard into a new file – this file is your desired “second file” data
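
    For files this large, the same filtering can also be sketched outside the editor. The Python below is only an illustration, not part of the steps above: the function name filter_by_names and the sample data are made up, and it assumes every line of the big file looks like name:+number.

```python
def filter_by_names(big_lines, small_lines):
    """Return the 'name:+number' lines whose name appears in small_lines."""
    # A set gives fast membership checks even with 200k names.
    wanted = {line.strip() for line in small_lines if line.strip()}
    matches = []
    for line in big_lines:
        entry = line.strip()
        # Split on the last ':' so the phone number is cut off the name.
        name, sep, _number = entry.rpartition(":")
        if sep and name in wanted:
            matches.append(entry)
    return matches


# Hypothetical sample data in the format described in the question.
big = ["John Smith:+355867522", "Jane Doe:+355111222", "Bob Ray:+355999000"]
small = ["John Smith", "Bob Ray", "Nobody Here"]

result = filter_by_names(big, small)
print(result)
print(len(result), "of", len(small), "names found")
```

    With real files, the two lists could be replaced by open file objects, since the function only iterates over lines.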


  • Hello, @gabriela-stefanova, @alan-kilborn and All,

    Alan, you said :

    sort the lines of the (new) first file using Edit > Line Operations > Sort Lines Lexicographically Ascending

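    Personally, after testing, it seems that your regex, which is correct, works only if the lines are first sorted the opposite way, so that each name:+number line comes right before the matching bare name, as the look-ahead (?=\R\1$) expects :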

    Edit > Line Operations > Sort Lines Lexicographically Descending
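
    The effect of the sort order can be reproduced with a small Python sketch. This is an illustration only, not Notepad++ itself: Python's re module has no \R, so \n stands in for it, and sorted() is assumed to order these sample lines the same way the Notepad++ sort does.

```python
import re

# Same idea as the Mark regex, with \n in place of \R (an adaptation for Python).
PATTERN = re.compile(r"^(.+?):\+\d+(?=\n\1$)", re.MULTILINE)

# Hypothetical merged content: big-file lines plus one bare name.
lines = ["John Smith:+355867522", "Jane Doe:+355111222", "John Smith"]

ascending = "\n".join(sorted(lines))
descending = "\n".join(sorted(lines, reverse=True))

# Ascending puts the bare name BEFORE its name:+number line, so nothing matches.
asc_hits = [m.group(0) for m in PATTERN.finditer(ascending)]
# Descending puts the bare name right AFTER it, which the look-ahead needs.
desc_hits = [m.group(0) for m in PATTERN.finditer(descending)]

print(asc_hits)   # []
print(desc_hits)  # ['John Smith:+355867522']
```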


    BTW, very easy and nice solution ;-))

    Regards,

    guy038



  • @guy038 said in How to find duplicate words in two files?:

    Lexicographically Descending

    Yes, dammit.
    These types of typos are the death of me LATELY. :-(
    I actually realized the typo while I was napping and came back to my PC to fix it, and found you already had.

