How to find duplicate words in two files?

    Gabriela Stefanova
    last edited by Dec 23, 2020, 4:17 PM

    Hi, I have been trying to find a solution since last week but still can’t, so I hope you can help me.
    I have two huge files (the first has 2 million lines and the second 200k). The first has names and telephone numbers, and the second has only names.
    Example:
    File one: John Smith:+355867522
    Second file: John Smith
    The files are not the same, so I can’t compare them directly, and I can’t find duplicate lines because of the telephone numbers.
    I need to find how many of the people in the smaller list are in the bigger one and, where there is a match, to add the telephone number to the name.
    In other words, filter the bigger file down to only the names from the smaller one, keeping the phone numbers.
    I have tried a million things but unfortunately had no luck.
    I hope someone can help me.
    Thanks in advance

      Alan Kilborn @Gabriela Stefanova
      last edited by Dec 23, 2020, 4:35 PM

      @Gabriela-Stefanova

      This works for me on some sample data:

      1. add the contents of the second file to the bottom of the first file
      2. sort the lines of the (new) first file using Edit > Line Operations > Sort Lines Lexicographically Ascending
      3. use the Mark dialog to mark the following regular expression: ^(.+?):\+\d+(?=\R\1$) being sure to set Search Mode to Regular expression and to tick the Wrap around checkbox
      4. copy the text now marked in red to the clipboard by pressing the Copy Marked Text button (in the Mark window)
      5. paste the clipboard into a new file; this file is your desired “second file” data
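
For files of this size, a small script is another way to get the same result outside the editor. The following is only a minimal sketch in Python, not part of the Notepad++ procedure above. It assumes every line of the big file has the form name:phone with the phone number after the last colon, and that names must match exactly. The function names `load_names` and `filter_entries` are made up for illustration.

```python
def load_names(lines):
    """Collect the non-empty, whitespace-stripped names from the smaller list."""
    return {line.strip() for line in lines if line.strip()}


def filter_entries(entries, names):
    """Yield the 'name:phone' entries whose name part is in `names`.

    Assumption: the phone number follows the LAST colon on the line.
    Lines without a colon are skipped.
    """
    for entry in entries:
        entry = entry.rstrip("\r\n")
        name, sep, _phone = entry.rpartition(":")
        if sep and name.strip() in names:
            yield entry


# Tiny in-memory demo using the example from the question.
small_file = ["John Smith\n"]
big_file = ["John Smith:+355867522\n", "Jane Doe:+355000000\n"]

wanted = load_names(small_file)
matches = list(filter_entries(big_file, wanted))
print(len(matches), "match(es)")
for entry in matches:
    print(entry)
```

Because an open file is an iterable of lines, the same two functions can be fed file objects instead of lists, so the 2-million-line file is streamed rather than loaded whole. Only the 200k names are held in memory, as a set.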
        guy038
        last edited by Dec 23, 2020, 6:12 PM

        Hello, @gabriela-stefanova, @alan-kilborn and All,

        Alan, you said:

        sort the lines of the (new) first file using Edit > Line Operations > Sort Lines Lexicographically Ascending

        Personally, after testing, it seems that your regex, which is correct, works only if the lines have first been sorted in the opposite order:

        Edit > Line Operations > Sort Lines Lexicographically Descending
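
The reason is that the regex expects each name:phone line to sit directly above the bare name, and a bare name sorts before the same name followed by a colon. A rough way to see this outside the editor is the Python sketch below. It is an illustration under assumptions: Python's `re` module has no `\R`, so `\n` stands in for the line break, and Python's `sorted` (by code point) is used as a stand-in for the editor's lexicographic sort. The helper name `marked_lines` is hypothetical.

```python
import re

# Same pattern as in the Mark dialog, with \R replaced by \n for Python.
PATTERN = re.compile(r"^(.+?):\+\d+(?=\n\1$)", re.MULTILINE)


def marked_lines(lines, descending):
    """Sort the lines, join them, and return what the pattern would mark."""
    text = "\n".join(sorted(lines, reverse=descending))
    return [m.group(0) for m in PATTERN.finditer(text)]


sample = ["John Smith:+355867522", "Ann Lee:+355111222", "John Smith"]

# Ascending: "John Smith" lands above "John Smith:+355867522", so nothing matches.
print(marked_lines(sample, descending=False))  # → []

# Descending: the name:phone line lands directly above the bare name.
print(marked_lines(sample, descending=True))   # → ['John Smith:+355867522']
```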


        BTW, very easy and nice solution ;-))

        Regards,

        guy038

          Alan Kilborn @guy038
          posted Dec 23, 2020, 6:17 PM, last edited by Alan Kilborn Dec 23, 2020, 6:19 PM

          @guy038 said in How to find duplicate words in two files?:

          Lexicographically Descending

          Yes, dammit.
          These types of typos are the death of me LATELY. :-(
          I actually realized the typo while I was napping and came back to my PC to fix it, and found you already had.

          The Community of users of the Notepad++ text editor.