separate only duplicate numbers from file



  • Hi Guys,

    kindly suggest a way to separate only the duplicate data from the format below… all help is appreciated… Thanks in advance… :)

    input is,

    919913209647 02:38:47
    919979418778 02:57:03
    918980055979 02:46:12
    919428616318 02:46:32
    919512672560 02:46:33
    919512646084 02:46:52
    919512497164 02:48:13
    919512497164 02:48:13
    919913029225 02:50:23
    917567814941 03:02:35
    919537722335 03:18:41
    918980299814 03:24:49
    919727009323 03:29:44

    Output

    919512497164 02:48:13
    919512497164 02:48:13



  • I would be inclined to use the “Mark” feature, possibly with the “Bookmark line” option enabled depending upon your real purpose here.

    Give it a try:
    Find dialog box
    select Mark tab
    checkmark in Bookmark line
    checkmark in Wraparound (you may want this)
    select “Regular expression”
    everything else unchecked
    Find what: (?s)^(.*?)$\s+?^(?=.*^\1$)
    Press “Mark All”

    This will not highlight/bookmark all the duplicate lines, but it should highlight the first one of each set.

    Without knowing where you are going next with your data, it is tough to be more specific and/or suggest a better approach.
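    As an aside (not part of the original reply), what that regex does can be sketched in Python. Notepad++ uses Boost regexes where ^/$ match line boundaries by default, so the (?m) flag is added here; the lookahead checks that the captured line occurs again later in the file:

    ```python
    import re

    # Sample lines; the duplicated pair sits in the middle.
    text = """919512646084 02:46:52
    919512497164 02:48:13
    919512497164 02:48:13
    919913029225 02:50:23"""
    text = "\n".join(line.strip() for line in text.splitlines())

    # (?m): ^/$ match at line boundaries (the Notepad++ default);
    # (?s): the .* inside the lookahead may cross newlines.
    pattern = re.compile(r"(?sm)^(.*?)$\s+?^(?=.*^\1$)")
    match = pattern.search(text)
    print(match.group(1))  # the first line of the duplicated set
    ```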



  • @Scott-Sumner No Sir, it is not working…



  • It works for me when I copy your sample data from here into a file and then follow my steps exactly, including copying and pasting the “Find what:” data… it highlights and bookmarks line 7 of your sample data.



  • ^(.+?)\R(\1\R?)+

    I found a way to mark all the duplicate numbers. Can you suggest a way to separate only the marked data? The file has more than 90,000 lines…
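    As an illustration (a hypothetical translation, not part of the original post), the same pattern can be exercised outside Notepad++ in Python, with \n standing in for Boost's \R and assuming Unix line endings:

    ```python
    import re

    text = """919512646084 02:46:52
    919512497164 02:48:13
    919512497164 02:48:13
    919913029225 02:50:23
    """
    text = "".join(line.strip() + "\n" for line in text.splitlines())

    # (?m): ^ matches at line starts; each match is one run of identical lines.
    runs = [m.group(0) for m in re.finditer(r"(?m)^(.+?)\n(\1\n?)+", text)]
    for run in runs:
        print(run, end="")
    ```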



  • Here’s an AWK script that can do the trick for you:

    # If there is something other than whitespace on a line:
    NF {
        # Use the text as an array index and count how many times it appears
        Line[$0]++
    }
    
    # Once the whole file is read, print every line that appeared 2 or more
    # times, once for each time it occurred.
    #
    # If Line[line] == 1, then the line appeared only 1 time (it is unique).
    # If Line[line] > 1, then the line appeared that many times.
    END {
        for (line in Line) {
            if (Line[line] > 1) {
                for (i = 1; i <= Line[line]; i++) {
                    print line
                }
            }
        }
    }
    

    I use GNU AWK for Windows (gawk.exe). If you save the script as dup.awk, then:

    gawk -f .\dup.awk <name of your 90000 line file>  > dupout.txt
    

    will create dupout.txt with all the duplicated lines. I used the data in your original post and let the output go to standard out:

    C:\temp\awk>type input.txt
    919913209647 02:38:47
    919979418778 02:57:03
    918980055979 02:46:12
    919428616318 02:46:32
    919512672560 02:46:33
    919512646084 02:46:52
    919512497164 02:48:13
    919512497164 02:48:13
    919913029225 02:50:23
    917567814941 03:02:35
    919537722335 03:18:41
    918980299814 03:24:49
    919727009323 03:29:44
    C:\temp\awk>gawk -f .\dup.awk input.txt
    919512497164 02:48:13
    919512497164 02:48:13
    
    C:\temp\awk>
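    If gawk isn't handy, here is a Python sketch of the same counting approach (file names below are placeholders, not from the thread):

    ```python
    from collections import Counter

    def duplicated_lines(lines):
        """Count each non-blank line, then return every line that appeared
        more than once, repeated as many times as it occurred."""
        counts = Counter(line for line in lines if line.strip())
        result = []
        for line, n in counts.items():
            if n > 1:
                result.extend([line] * n)
        return result

    # Usage (placeholder file names):
    # with open("input.txt") as f:
    #     dups = duplicated_lines(ln.rstrip("\n") for ln in f)
    # with open("dupout.txt", "w") as out:
    #     out.write("\n".join(dups) + "\n")
    ```

    One difference from the AWK script: `for (line in Line)` iterates in an arbitrary order, while `Counter` keeps the lines in first-seen order.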
