Is there a way to search for duplicate records based on a column in Notepad++ and delete the particular row



  • Re: Is there a way to search for duplicate records in Notepad++?

    Here is the content of my txt file:

    RecordNo,StartTime,CreationTime
    1,20201027000000,20201030204923
    2,20201028000000,20201030204919
    3,20201029000000,20201030155643
    4,20201029000000,20201030204820

    From the above data I need to find the duplicate records based on StartTime and then delete the one whose CreationTime is not the latest timestamp. Here StartTime has two duplicate records (Records 3 and 4), and I need to search for and delete the record whose CreationTime is NOT the latest timestamp (Record 3).

    Please help me with that. I can do it in Excel with the formula below, but I want to know: is there a way I can do it in Notepad++?
    =IF(B3=B4,IF(C3<C4,"DUPLICATE"),"NOT DUPLICATE")
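    The same logic can be sketched outside Notepad++; here is a rough Python illustration, using only the sample rows above (the field names come from the sample header, not from any real schema):

```python
import csv
import io

# Sample data from the post above.
data = """RecordNo,StartTime,CreationTime
1,20201027000000,20201030204923
2,20201028000000,20201030204919
3,20201029000000,20201030155643
4,20201029000000,20201030204820
"""

rows = list(csv.DictReader(io.StringIO(data)))

# For each StartTime keep only the row with the latest CreationTime.
# The timestamps compare correctly as plain strings because they are
# fixed-width YYYYMMDDHHMMSS values.
latest = {}
for row in rows:
    key = row["StartTime"]
    if key not in latest or row["CreationTime"] > latest[key]["CreationTime"]:
        latest[key] = row

kept = sorted(latest.values(), key=lambda r: int(r["RecordNo"]))
for row in kept:
    print(row["RecordNo"], row["StartTime"], row["CreationTime"])
```

    Record 3 is dropped because its CreationTime (20201030155643) is older than that of record 4, which shares the same StartTime.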



  • @Bharath-k said in Is there a way to search for duplicate records based on a column in Notepad++ and delete the particular row:

    but I want to know: is there a way I can do it in Notepad++?

    It looks like it should be an easy job to find duplicates. I’ll explain the steps I’d take first, and then, if you are fine with that, I can provide some regular expressions to do it.

    1. Since the record number at the start of the line would affect a numerical sort, first move it to the end of the line. The new line format would then be StartTime, followed by CreationTime, with the record number last.
    2. A numerical sort would then place the duplicate StartTime records together, and because the next field is also numerical, it would sort in increasing order within each group. Thus, where duplicate records exist, the latest CreationTime would be second (last) in that sequence.
    3. A simple regex (regular expression) would then compare the StartTime on the current line with the one on the next line. If they are the same, the current line is removed, which satisfies your criterion of keeping only the record with the latest timestamp.
    4. Lastly, move the record numbers back to the start of each line. This also works with 3 (or more) duplicates of the same StartTime, leaving ONLY the latest record.
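    Those four steps can be sketched in Python on the sample data, with substitutions similar in spirit to what Notepad++ regex replacements would do (this is only an illustration of the plan, not the actual expressions Terry had in mind):

```python
import re

text = """1,20201027000000,20201030204923
2,20201028000000,20201030204919
3,20201029000000,20201030155643
4,20201029000000,20201030204820
"""

# Step 1: move the record number to the end of each line:
#   "3,START,CREATION" -> "START,CREATION,3"
text = re.sub(r"(?m)^(\d+),(.*)$", r"\2,\1", text)

# Step 2: sort the lines; since StartTime and CreationTime are
# fixed-width digit strings, a plain lexical sort orders them
# numerically, leaving the latest CreationTime last in each group.
text = "\n".join(sorted(text.splitlines())) + "\n"

# Step 3: delete any line whose StartTime equals the StartTime of
# the following line, so only the latest duplicate survives.
text = re.sub(r"(?m)^(\d{14}),.*\n(?=\1,)", "", text)

# Step 4: move the record number back to the front of each line.
text = re.sub(r"(?m)^(.*),(\d+)$", r"\2,\1", text)

print(text)
```

    On the sample data this removes record 3 and keeps records 1, 2 and 4.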

    How does that sound to you?

    Terry



  • Hi Terry,
    Thanks for the quick reply

    Could you please provide the regular expression? I will try it and come back.

    Also, this is how my data looks after sorting:

    AID,BID,Start,End,C5,C6,StartValue,EndValue,C9,C10,C11,C12,C13,C14,C15,C16,C17,C18,TimeZone,CreationTime
    17884284,25,20201025000000,20201025235959,Y,N,8700,8750,80,0,900,1,1,-1,0,0,CST,20201030204926
    17884284,25,20201026000000,20201026235959,Y,N,8750,8800,80,0,900,1,1,-1,0,0,CST,20201030204924
    17884284,25,20201027000000,20201027235959,Y,N,8800,8850,80,0,900,1,1,-1,0,0,CST,20201030204923
    17884284,25,20201028000000,20201028235959,Y,N,8850,8900,80,0,900,1,1,-1,0,0,CST,20201030204919
    17884284,25,20201029000000,20201029235959,Y,N,8900,8902,80,0,900,1,1,-1,0,0,CST,20201030134820
    17884284,25,20201029000000,20201029235959,Y,N,8900,8902,80,0,900,1,1,-1,0,0,CST,20201030204820

    The 5th record is the duplicate that does not have the latest CreationTime timestamp when compared to the 6th record, which does.



  • @Bharath-k said in Is there a way to search for duplicate records based on a column in Notepad++ and delete the particular row:

    Also, this is how my data looks after sorting

    Already we have a problem: in your first post you said that was the content of the txt file, and now you show completely different data with a lot more fields.

    There is nothing more annoying to those of us who wish to help newcomers like yourself than providing a solution based on the data supplied, ONLY to be told later on “oh, by the way, the actual data does not look like that, it looks like this”.

    At this point I’m deciding whether I wish to proceed. The solution I gave (in sentences) still works; however, I’m rather hacked off by your sudden change of examples. If I had spent much time behind the scenes checking that my steps were achievable, I’d be REALLY hacked off, having wasted that time on invalid examples.

    You may (or may not) get a solution from me. Possibly someone else may want to provide another solution (or not), given your inability to provide valid examples. And by the way, we prefer examples to be inside the black boxes. This is achieved by inserting the examples into your post, selecting those lines and clicking on the </> button above. It helps prevent the posting engine from altering some of the characters.

    So your latest examples would appear like this:

    AID,BID,Start,End,C5,C6,StartValue,EndValue,C9,C10,C11,C12,C13,C14,C15,C16,C17,C18,TimeZone,CreationTime
    17884284,25,20201025000000,20201025235959,Y,N,8700,8750,80,0,900,1,1,-1,0,0,CST,20201030204926
    17884284,25,20201026000000,20201026235959,Y,N,8750,8800,80,0,900,1,1,-1,0,0,CST,20201030204924
    17884284,25,20201027000000,20201027235959,Y,N,8800,8850,80,0,900,1,1,-1,0,0,CST,20201030204923
    17884284,25,20201028000000,20201028235959,Y,N,8850,8900,80,0,900,1,1,-1,0,0,CST,20201030204919
    17884284,25,20201029000000,20201029235959,Y,N,8900,8902,80,0,900,1,1,-1,0,0,CST,20201030134820
    17884284,25,20201029000000,20201029235959,Y,N,8900,8902,80,0,900,1,1,-1,0,0,CST,20201030204820
    

    Terry



  • @Terry-R - Sorry for annoying you with my different examples. Thanks for your time and reply. I will figure it out myself.



  • @Bharath-k said in Is there a way to search for duplicate records based on a column in Notepad++ and delete the particular row:

    Sorry for annoying you with my different examples… I will figure it out myself.

    It might have been better for you to simply confirm what your data actually looks like. I don’t think Terry is as bothered as he seems, and he is probably still interested in helping.

    A problem that I don’t know how to solve is that, with these types of questions, people tend to “simplify” their true situation because they don’t believe something is relevant, or because some data is private.
    While this approach is great when you are trying to give someone simple, reproducible steps for bug recreation, it can do more harm than good in a data-manipulation scenario.



  • Hi @Bharath-k, All:

    Given that the data provided is sorted, this regex may do the job:

    Search: (?-s)^((.*),\w+?$\R)(?=^\2)
    Replace: [leave empty]
    

    Put the caret at the very beginning of the document, select the Regular expression search mode and click Replace All. Repeat the command if a record has several duplicates.
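    For anyone who wants to check the idea outside Notepad++, here is a rough Python equivalent of that search. Python’s `re` has no `\R`, so `\n` stands in for it (a file with CRLF line endings would need adjusting), and the loop mirrors “repeat the last command”:

```python
import re

# Header plus the two duplicate rows from the sorted sample above.
text = """AID,BID,Start,End,C5,C6,StartValue,EndValue,C9,C10,C11,C12,C13,C14,C15,C16,C17,C18,TimeZone,CreationTime
17884284,25,20201029000000,20201029235959,Y,N,8900,8902,80,0,900,1,1,-1,0,0,CST,20201030134820
17884284,25,20201029000000,20201029235959,Y,N,8900,8902,80,0,900,1,1,-1,0,0,CST,20201030204820
"""

# Same idea as the Notepad++ search: if everything before the last
# comma-separated field (the CreationTime) is repeated at the start
# of the next line, drop the current line.  Because the data is
# sorted, the surviving line is the one with the latest CreationTime.
pattern = re.compile(r"^((.*),\w+\n)(?=\2)", re.M)
while True:
    new_text = pattern.sub("", text)
    if new_text == text:
        break
    text = new_text

print(text)
```

    On this sample the row ending in 20201030134820 is removed and the header plus the latest row remain.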

    Have fun!



  • @astrosofista @Alan-Kilborn and everyone else who’s involved in this thread. As @Alan-Kilborn stated, I would probably return to supply a solution sooner or later (well, actually I had to do some real work, the boss insisted :-}}).
    So, because I was really concerned about supplying anything, I started pulling the latest examples apart, only to realise:

    1. There is no StartTime field, which the OP stated he wanted to use in his first post. Okay, it is probably easy to see that it is the Start field.
    2. The fields do NOT line up with the header line. There are 19 commas in the field headings but only 17 commas in the actual data. Note that I could see CST was the TimeZone, which meant C17 & C18 were blank (actually it could be ANY two fields between EndValue and C18); there are not even empty fields (two commas together) for those headers.

    It’s a bit difficult to see, even padded out:

    AID      BID Start          End            C5 C6 StartValue EndValue C9 C10 C11 C12 C13 C14 C15 C16 C17 C18 TimeZone CreationTime
    17884284 25  20201025000000 20201025235959 Y  N  8700       8750     80 0   900 1   1   -1  0   0           CST      20201030204926
    

    So we still have what appears to be artificially contrived data and it’s a waste of time even attempting a solution under these conditions.
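    The comma mismatch described above can be confirmed with a couple of lines of Python:

```python
# Header and first data line from the second sample post.
header = "AID,BID,Start,End,C5,C6,StartValue,EndValue,C9,C10,C11,C12,C13,C14,C15,C16,C17,C18,TimeZone,CreationTime"
record = "17884284,25,20201025000000,20201025235959,Y,N,8700,8750,80,0,900,1,1,-1,0,0,CST,20201030204926"

# 19 commas (20 fields) in the header versus 17 commas (18 fields)
# in the data line, so two header columns have no matching field.
print(header.count(","), record.count(","))
```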

    At this point I think I’m out!

    Terry

    PS the black box extended upon posting, so it’s easy to see now.



  • @Alan-Kilborn - Yes, I am not able to publish the actual data here because it is client data and I am not allowed to share it.



  • @astrosofista Thanks, this helps a lot. Thanks for your time.



  • @Bharath-k said in Is there a way to search for duplicate records based on a column in Notepad++ and delete the particular row:

    I am not able to publish the actual data here because it is client data and I am not allowed to share it.

    We understand that. When your “data sanitization” (removing secret info) just means putting in fake numbers and fake names (even fake column headers if you want), that’s one thing; the regex will likely still work, as long as each column has the same type and width of data. However, in your first post you not only “sanitized” the data, you also changed the number of columns. That means the regex is likely to change, because the number of separators will be different, etc.

    The advice below might help you ask better questions next time, and reduce the number of back-and-forth exchanges before you have a working solution.

    ----

    Do you want regex search/replace help? Then please be patient and polite, show some effort, and be willing to learn; answer questions and requests for clarification that are made of you. In particular:

    - Mark all example text as plain text using the </> toolbar button or manual Markdown syntax. Screenshots can be pasted from the clipboard into your post with Ctrl+V to show graphical items, but any text should be included as literal text so we can easily copy/paste your data.
    - Show the data you have and the text you want to get from that data; include examples of things that should match and be transformed, and things that don’t match and should be left alone; show edge cases and make sure your examples are as varied as your real data.
    - Show the regex you already tried, and why you thought it should work; tell us what’s wrong with what you do get…
    - Read the official NPP Searching / Regex docs and the forum’s Regular Expression FAQ.

    If you follow these guidelines, you’re much more likely to get helpful replies that solve your problem in the shortest number of tries.

