separate only duplicate numbers from file



  • Hi Guys,

    kindly suggest a way to separate only the duplicate data from the format below… all help is appreciated… Thanks in advance… :)

    input is,

    919913209647 02:38:47
    919979418778 02:57:03
    918980055979 02:46:12
    919428616318 02:46:32
    919512672560 02:46:33
    919512646084 02:46:52
    919512497164 02:48:13
    919512497164 02:48:13
    919913029225 02:50:23
    917567814941 03:02:35
    919537722335 03:18:41
    918980299814 03:24:49
    919727009323 03:29:44

    Output

    919512497164 02:48:13
    919512497164 02:48:13



  • I would be inclined to use the “Mark” feature, possibly with the “Bookmark line” option enabled depending upon your real purpose here.

    Give it a try:
    Find dialog box
    select Mark tab
    checkmark in Bookmark line
    checkmark in Wraparound (you may want this)
    select “Regular expression”
    everything else unchecked
    Find what: (?s)^(.*?)$\s+?^(?=.*^\1$)
    Press “Mark All”

    This will not highlight/bookmark all the duplicate lines, but it should highlight the first one of each set.

    Without knowing where you are going next with your data, it is tough to be more specific and/or suggest a better approach.
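    As an aside (not part of the original reply), what that regex does can be sketched in Python. Notepad++ uses Boost regexes where ^/$ match line boundaries by default, so the (?m) flag is added here; the lookahead checks that the captured line occurs again later in the file:

    ```python
    import re

    # Sample lines; the duplicated pair sits in the middle.
    text = """919512646084 02:46:52
    919512497164 02:48:13
    919512497164 02:48:13
    919913029225 02:50:23"""
    text = "\n".join(line.strip() for line in text.splitlines())

    # (?m): ^/$ match at line boundaries (the Notepad++ default);
    # (?s): the .* inside the lookahead may cross newlines.
    pattern = re.compile(r"(?sm)^(.*?)$\s+?^(?=.*^\1$)")
    match = pattern.search(text)
    print(match.group(1))  # the first line of the duplicated set
    ```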



  • @Scott-Sumner No Sir, it is not working…



  • It works for me when I copy your sample data from here into a file and then follow my steps exactly, including copying and pasting the “Find what:” data… it highlights and bookmarks line 7 of your sample data.



  • ^(.+?)\R(\1\R?)+

    I found a way to mark all the duplicate numbers. Can you suggest a way to separate only the marked data? The file has more than 90,000 lines…
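    As an illustration (a hypothetical translation, not part of the original post), the same pattern can be exercised outside Notepad++ in Python, with \n standing in for Boost's \R and assuming Unix line endings:

    ```python
    import re

    text = """919512646084 02:46:52
    919512497164 02:48:13
    919512497164 02:48:13
    919913029225 02:50:23
    """
    text = "".join(line.strip() + "\n" for line in text.splitlines())

    # (?m): ^ matches at line starts; each match is one run of identical lines.
    runs = [m.group(0) for m in re.finditer(r"(?m)^(.+?)\n(\1\n?)+", text)]
    for run in runs:
        print(run, end="")
    ```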



  • Here’s an AWK script that can do the trick for you:

    # If there is something other than whitespace on a line:
    NF {
        # Use the text as an array index and count how many times it appears
        Line[$0]++
    }
    
    # Once the whole file is read, print every line that appeared 2 or more
    # times, once for each time it occurred.
    #
    # If Line[line] == 1, then the line appeared only 1 time (it is unique).
    # If Line[line] > 1, then the line appeared that many times.
    END {
        for (line in Line) {
            if (Line[line] > 1) {
                for (i = 1; i <= Line[line]; i++) {
                    print line
                }
            }
        }
    }
    

    I use GNU AWK for Windows (gawk.exe). If you save the script as dup.awk, then:

    gawk -f .\dup.awk <name of your 90000 line file>  > dupout.txt
    

    will create dupout.txt with all the duplicated lines. I used the data in your original post and let the output go to standard out:

    C:\temp\awk>type input.txt
    919913209647 02:38:47
    919979418778 02:57:03
    918980055979 02:46:12
    919428616318 02:46:32
    919512672560 02:46:33
    919512646084 02:46:52
    919512497164 02:48:13
    919512497164 02:48:13
    919913029225 02:50:23
    917567814941 03:02:35
    919537722335 03:18:41
    918980299814 03:24:49
    919727009323 03:29:44
    C:\temp\awk>gawk -f .\dup.awk input.txt
    919512497164 02:48:13
    919512497164 02:48:13
    
    C:\temp\awk>
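    If gawk isn't handy, here is a Python sketch of the same counting approach (file names below are placeholders, not from the thread):

    ```python
    from collections import Counter

    def duplicated_lines(lines):
        """Count each non-blank line, then return every line that appeared
        more than once, repeated as many times as it occurred."""
        counts = Counter(line for line in lines if line.strip())
        result = []
        for line, n in counts.items():
            if n > 1:
                result.extend([line] * n)
        return result

    # Usage (placeholder file names):
    # with open("input.txt") as f:
    #     dups = duplicated_lines(ln.rstrip("\n") for ln in f)
    # with open("dupout.txt", "w") as out:
    #     out.write("\n".join(dups) + "\n")
    ```

    One difference from the AWK script: `for (line in Line)` iterates in an arbitrary order, while `Counter` keeps the lines in first-seen order.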
