    How to find duplicate words in two files?

    • Gabriela Stefanova
      Gabriela Stefanova last edited by

      Hi, I have been trying to find a solution since last week but still can’t, so I hope you can help me.
      I have two huge files (the first has 2 million lines and the second 200k). The first has names and telephone numbers, and the second has only names.
      Example:
      File one: John Smith:+355867522
      Second file: John Smith
      The lines are not identical, so I can’t compare the files directly. I can’t find duplicate lines because of the telephone number.
      I need to find how many of the people in the smaller list are in the bigger one and, where there is a match, add the telephone number.
      Or in other words, filter the bigger file down to only the names from the small one, but with the phone number.
      I have tried a million things but unfortunately, no luck.
      I hope someone can help me.
      Thanks in advance

      • Alan Kilborn
        Alan Kilborn @Gabriela Stefanova last edited by

        @Gabriela-Stefanova

        This works for me on some sample data:

        1. add the contents of the second file to the first file at the bottom
        2. sort the lines of the (new) first file using Edit > Line Operations > Sort Lines Lexicographically Ascending
        3. use the Mark dialog to mark the following regular expression: ^(.+?):\+\d+(?=\R\1$) with Search Mode set to Regular expression and the Wrap around checkbox ticked
        4. copy the text now marked in red to the clipboard by pressing the Copy Marked Text button (in the Mark window)
        5. paste the clipboard into a new file – this file is your desired “second file” data
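
For anyone who would rather script this than work in the editor, here is a minimal sketch of a different technique: a set lookup instead of the sort-and-mark steps above. It assumes every line of the big file has the form `name:+digits`, as in the example from the question. The function name `filter_by_names` and the sample entries other than John Smith are made up for illustration.

```python
def filter_by_names(big_lines, names):
    """Return the lines of the big file whose name part appears in names.

    big_lines: iterable of lines assumed to look like "John Smith:+355867522"
    names:     iterable of bare names, one per line
    """
    # A set gives constant-time membership tests, which matters at 2 million lines.
    wanted = {n.strip() for n in names if n.strip()}
    matches = []
    for line in big_lines:
        line = line.rstrip("\r\n")
        # Split at the first ":+" (assumed separator between name and number).
        name, sep, _phone = line.partition(":+")
        if sep and name in wanted:
            matches.append(line)
    return matches


# Hypothetical sample data; only the John Smith line comes from the question.
big = ["John Smith:+355867522", "Ann Lee:+355111111"]
small = ["John Smith", "Bob Ray"]

result = filter_by_names(big, small)
print(result)       # the matching lines, phone numbers included
print(len(result))  # how many people from the small list were found
```

Since the function accepts any iterable of lines, two open file objects could be passed in place of the lists, so the big file is read line by line rather than loaded at once.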
        • guy038
          guy038 last edited by

          Hello, @gabriela-stefanova, @alan-kilborn and All,

          Alan, you said:

          sort the lines of the (new) first file using Edit > Line Operations > Sort Lines Lexicographically Ascending

          After testing, it seems that your regex, which is correct, works only if you first apply the opposite sort:

          Edit > Line Operations > Sort Lines Lexicographically Descending
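
The effect of the sort direction can be checked with a small sketch. The regex only matches a `name:+digits` line that is immediately followed by the bare name, and that ordering arises only when the longer line sorts first. The sketch assumes Python's `re` module as a stand-in for the editor's regex engine; `\R` is replaced by `\n` because `re` does not support `\R`, and the Ann Lee entry is made up for illustration.

```python
import re

# Same idea as ^(.+?):\+\d+(?=\R\1$), with \n standing in for \R.
pattern = re.compile(r"^(.+?):\+\d+(?=\n\1$)", re.MULTILINE)

# Merged contents of both files (hypothetical sample data).
lines = ["John Smith:+355867522", "John Smith", "Ann Lee:+355111111"]

ascending = "\n".join(sorted(lines))
descending = "\n".join(sorted(lines, reverse=True))

# Ascending puts "John Smith" BEFORE "John Smith:+355867522",
# so no phone line is followed by its bare name.
asc_hits = [m.group(0) for m in pattern.finditer(ascending)]

# Descending puts the phone line first, directly above the bare name.
desc_hits = [m.group(0) for m in pattern.finditer(descending)]

print(asc_hits)
print(desc_hits)
```

With this sample, only the descending order yields a match, because a string sorts before any longer string that starts with it.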


          BTW, very easy and nice solution ;-))

          Regards,

          guy038

          • Alan Kilborn
            Alan Kilborn @guy038 last edited by Alan Kilborn

            @guy038 said in How to find duplicate words in two files?:

            Lexicographically Descending

            Yes, dammit.
            These types of typos are the death of me LATELY. :-(
            I actually realized the typo while I was napping and came back to my PC to fix it, and found you already had.
