
    Is there way to Search for duplicate records based on the column in notepad++ and delete the particular row

    General Discussion
    • Terry R
      last edited by Terry R

      @Bharath-k said in Is there way to Search for duplicate records based on the column in notepad++ and delete the particular row:

      but i want to know is there a way i can do it notepad++ ?

      It looks like it should be an easy job to find duplicates. I’ll explain the steps I’d take first, and then, if you are fine with that, I could provide some regular expressions to do it.

      1. As the first number is going to affect a numerical sort, I’d want to move it to the end of the line. The new line format would then be StartTime, followed by CreationTime, with the record number last.
      2. A numerical sort would then place the duplicate StartTime records together, and since the next number is also numerical, it would sort those in increasing order. Thus, where duplicate records exist, the latest CreationTime would be last in that sequence.
      3. A simple regex (regular expression) would then compare the StartTime on the current line with the one on the next line. If they are the same, the current line would be removed, which satisfies your other criterion of keeping the latest timestamp (a rough sketch follows below).
      4. Lastly, we can move the record numbers back to the start of the line. This should also work when there are 3 (or more) duplicates of the same StartTime, leaving ONLY the latest record.
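
      For illustration only, step 3 might look something like the following. This is a rough sketch that assumes the rearranged lines begin with a 14-digit StartTime followed by a comma; the exact pattern would depend on your real layout.

      Search:  (?-s)^(\d{14}),.*\R(?=\1,)
      Replace: [leave empty]
      

      With Regular expression mode selected, Replace All would remove any line whose StartTime matches the StartTime of the line below it, so on sorted data only the last record (the one with the latest CreationTime) of each group survives.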

      How does that sound to you?

      Terry

    • Bharath k
        last edited by

        Hi Terry,
        Thanks for the quick reply

        Could you please provide the regular expression? I will try that and come back.

        Also, this is how my data looks after sorting:

        AID,BID,Start,End,C5,C6,StartValue,EndValue,C9,C10,C11,C12,C13,C14,C15,C16,C17,C18,TimeZone,CreationTime
        17884284,25,20201025000000,20201025235959,Y,N,8700,8750,80,0,900,1,1,-1,0,0,CST,20201030204926
        17884284,25,20201026000000,20201026235959,Y,N,8750,8800,80,0,900,1,1,-1,0,0,CST,20201030204924
        17884284,25,20201027000000,20201027235959,Y,N,8800,8850,80,0,900,1,1,-1,0,0,CST,20201030204923
        17884284,25,20201028000000,20201028235959,Y,N,8850,8900,80,0,900,1,1,-1,0,0,CST,20201030204919
        17884284,25,20201029000000,20201029235959,Y,N,8900,8902,80,0,900,1,1,-1,0,0,CST,20201030134820
        17884284,25,20201029000000,20201029235959,Y,N,8900,8902,80,0,900,1,1,-1,0,0,CST,20201030204820

        The 5th record is the duplicate; it doesn’t have the latest CreationTime timestamp when compared to the 6th record, which does.

        • Terry R
          last edited by

          @Bharath-k said in Is there way to Search for duplicate records based on the column in notepad++ and delete the particular row:

          Also, this is how my data looks after sorting

          Already we have a problem. That is, in your first post you said this was the content of the txt file. Now you show completely different data with a lot more fields.

          There is nothing more annoying to those of us who wish to help out newcomers like yourself than to provide a solution based on the data you supply, ONLY to later get information that says “oh, by the way, the actual data does not look like that, it looks like this”.

          At this point I’m deciding whether I wish to proceed. The solution I gave (in sentences) still works; however, I’m rather hacked off by your sudden change of examples. If I had spent much time behind the scenes checking that my steps were achievable, I’d be REALLY hacked off at wasting that time on invalid examples.

          You may (or may not) get a solution from me. Possibly someone else may want to provide another solution (or not), given your inability to provide valid examples. And by the way, we prefer examples to be inside the black boxes. This is achieved by inserting the examples into your post, selecting those lines and clicking on the </> button above. It helps to prevent the posting engine from altering some of the characters.

          So your latest examples would appear like this:

          AID,BID,Start,End,C5,C6,StartValue,EndValue,C9,C10,C11,C12,C13,C14,C15,C16,C17,C18,TimeZone,CreationTime
          17884284,25,20201025000000,20201025235959,Y,N,8700,8750,80,0,900,1,1,-1,0,0,CST,20201030204926
          17884284,25,20201026000000,20201026235959,Y,N,8750,8800,80,0,900,1,1,-1,0,0,CST,20201030204924
          17884284,25,20201027000000,20201027235959,Y,N,8800,8850,80,0,900,1,1,-1,0,0,CST,20201030204923
          17884284,25,20201028000000,20201028235959,Y,N,8850,8900,80,0,900,1,1,-1,0,0,CST,20201030204919
          17884284,25,20201029000000,20201029235959,Y,N,8900,8902,80,0,900,1,1,-1,0,0,CST,20201030134820
          17884284,25,20201029000000,20201029235959,Y,N,8900,8902,80,0,900,1,1,-1,0,0,CST,20201030204820
          

          Terry

          • Bharath k
            last edited by

            @Terry-R - Sorry for annoying you with my different examples. Thanks for your time and reply. I will figure out myself.

            • Alan Kilborn @Bharath k
              last edited by Alan Kilborn

              @Bharath-k said in Is there way to Search for duplicate records based on the column in notepad++ and delete the particular row:

              Sorry for annoying you with my different examples… I will figure (it) out myself.

              It might have been better for you to just confirm what your data actually looks like. I don’t think Terry is as bothered as he seems, and is probably still interested in helping.

              A problem that I don’t know how to solve is that with these types of questions, people asking tend to “simplify” their true situation because they don’t believe something is relevant, or maybe some data is private.
              While this approach is great when you are trying to give someone some simple, reproducible steps for bug-recreation, it can do more harm than good in a data-manipulation scenario.

              • astrosofista @Bharath k
                last edited by

                Hi @Bharath-k, All:

                Given that the data provided is sorted, this regex may do the job:

                Search: (?-s)^((.*),\w+?$\R)(?=^\2)
                Replace: [leave empty]
                

                 Put the caret at the very beginning of the document, select the Regular expression search mode, and click on Replace All. Repeat the last command if a record has several duplicates.
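
                 For anyone following along, here is that pattern read piece by piece (the spacing below is only for readability; use the one-line version above):

                 (?-s)      dot will not match newlines, so .* stays on one line
                 ^          start of a line
                 (          group 1: the whole current record plus its line break
                   (.*),    group 2: everything up to the last comma, i.e. every field except CreationTime
                   \w+?$    the CreationTime value at the end of the line
                   \R       the line break
                 )
                 (?=^\2)    only match if the next line starts with exactly the same group 2 text
                 

                 Because group 1 (the earlier record) is replaced with nothing while the lookahead leaves the following line untouched, the record with the later CreationTime is the one that survives. On the sample above, only the 20201029000000 line with CreationTime 20201030134820 would be deleted.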

                Have fun!

                • Terry R
                  last edited by Terry R

                  @astrosofista @Alan-Kilborn and everyone else who’s involved in this thread. As @Alan-Kilborn stated, I would probably return to supply a solution sooner or later (well, actually I had to do some real work, the boss insisted :-}}).
                  So, because I was really concerned about supplying anything, I started pulling the latest examples apart, only to realise:

                  1. There is no StartTime field, which the OP stated he wanted to use in his first post. Okay, it is probably easy enough to see that it is the Start field.
                  2. The fields do NOT line up with the header line. There are 19 commas in the field headings and only 17 commas in the actual data. I could see that CST was the TimeZone, so that meant C17 & C18 were missing (actually it could be ANY two fields between EndValue and C18); there aren’t even empty fields (two commas together) for those headers.

                  It’s a bit difficult to see even when padded out:

                  AID      BID Start          End            C5 C6 StartValue EndValue C9 C10 C11 C12 C13 C14 C15 C16 C17 C18 TimeZone CreationTime
                  17884284 25  20201025000000 20201025235959 Y  N  8700       8750     80 0   900 1   1   -1  0   0           CST      20201030204926
                  

                  So we still have what appears to be artificially contrived data and it’s a waste of time even attempting a solution under these conditions.

                  At this point I think I’m out!

                  Terry

                  PS the black box extended upon posting, so it’s easy to see now.

                  • Bharath k @Alan Kilborn
                    last edited by

                    @Alan-Kilborn - Yes, the actual data I am not able to publish here because it is client data and I am not allowed to share it.

                    • Bharath k @astrosofista
                      last edited by

                      @astrosofista Thanks, this helps a lot. Thanks for your time.

                      • PeterJones @Bharath k
                        last edited by

                        @Bharath-k said in Is there way to Search for duplicate records based on the column in notepad++ and delete the particular row:

                        the actual data I am not able to publish here because it is client data and I am not allowed to share it.

                        We understand that. When your “data sanitization” (removing secret info) is just putting in fake numbers and fake names (even fake column headers if you want), that’s one thing; the regex will likely still work, as long as it has the same type and width of data in each column. However, in your first post, you not only “sanitized” the data, you also changed the number of columns – that means that the regex is likely to change, because the number of separators will be different, etc.
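
                        As a made-up illustration (this pattern is hypothetical, not something posted in this thread), a regex written to key on the third comma-separated field, such as

                        ^(?:[^,]*,){2}([^,]*)
                        

                        captures the third field of each line. Add or remove columns in the real data and that very same pattern silently captures a different field, so a solution built against a simplified sample may have to be rewritten.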

                        The advice below might help you ask better questions next time, and reduce the number of back-and-forth exchanges before you have a working solution.

                        ----

                        Do you want regex search/replace help? Then please be patient and polite, show some effort, and be willing to learn; answer questions and requests for clarification that are made of you.

                        All example text should be marked as plain text using the </> toolbar button or manual Markdown syntax. Screenshots can be pasted from the clipboard into your post using Ctrl+V to show graphical items, but any text should be included as literal text in your post so we can easily copy/paste your data.

                        Show the data you have and the text you want to get from that data; include examples of things that should match and be transformed, and things that don’t match and should be left alone; show edge cases and make sure your examples are as varied as your real data. Show the regex you already tried, and why you thought it should work; tell us what’s wrong with what you do get…

                        Read the official NPP Searching / Regex docs and the forum’s Regular Expression FAQ. If you follow these guidelines, you’re much more likely to get helpful replies that solve your problem in the shortest number of tries.

                        1 Reply Last reply Reply Quote 1