    Is there a way to search for duplicate records in Notepad++?

    • Adam Buxbaum

      Hi All,

      I am currently using Notepad++ to review user files before uploading them into our provisioning system, and I was curious whether there is a way to search for duplicates (emails, UIDs, etc.) within Notepad++, or do I have to save the file and review it in Excel to do this?

      All assistance is greatly appreciated.

      Best,
      Adam

      • tomas-chrastina

        Hi Adam,

        I don’t think there is a dedicated way to search for duplicates, unless you count Smart Highlighting, Ctrl+F3, or a plain search.

        But there is a way to remove duplicates from a simple list without Excel (I use it a lot). So if you have a simple list of values:

        value1
        value2
        value2
        value3
        value2
        

        you can get the list of unique values:

        value1
        value2
        value3
        

        like this:

        1. You need the TextFX Characters plugin.
        2. Back up the file you are editing!
        3. Enable these options under Menu -> TextFX -> TextFX Tools:
          ✓ +Sort ascending
          ✓ +Sort outputs only UNIQUE (at column) lines
        4. Select the text.
        5. Run one of these actions from Menu -> TextFX -> TextFX Tools:
          a) Sort lines case sensitive (at column)
          b) Sort lines case insensitive (at column)
        6. Remember to DISABLE the option +Sort outputs only UNIQUE (at column) lines afterwards, so you won’t lose data when just sorting later!

        Still, this won’t work for complex multi-column data, where only Excel’s filters or its remove-duplicates feature on specific columns will help.
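For readers who want the same result outside the editor, the effect of the steps above (sort ascending, keep only unique lines) can be sketched in Python. This is a minimal sketch, not what TextFX does internally: the function name `unique_sorted_lines` is made up for illustration, and for the case-insensitive variant the thread does not say which spelling TextFX keeps, so this sketch simply keeps the first one it sees.

```python
def unique_sorted_lines(text: str, case_sensitive: bool = True) -> list[str]:
    """Return the distinct lines of `text` in ascending order."""
    lines = text.splitlines()
    if case_sensitive:
        return sorted(set(lines))
    # Case-insensitive (assumption): keep the first spelling seen per folded key.
    first_seen: dict[str, str] = {}
    for line in lines:
        first_seen.setdefault(line.casefold(), line)
    return [first_seen[key] for key in sorted(first_seen)]


sample = "value1\nvalue2\nvalue2\nvalue3\nvalue2\n"
print(unique_sorted_lines(sample))  # ['value1', 'value2', 'value3']
```

Like the TextFX approach, this treats each line as one value, so it has the same limitation with multi-column data.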


        Best regards,
        Tomas

        • rajeshp2408

          Thanks, this really helped… :-)

          • Matthias Heim

            [Adding my own answer, since this thread gets so many views and was the top result on Google]
            There is no need to use a plugin.

            You can easily find duplicate lines with the following regex:
            ^([^\r\n]+)$(?=.*?^\1$)

            This will find all occurrences of duplicate lines except the last, so you can also use search and replace to delete them.

            You can see it in action here: https://regex101.com/r/5GPJfz/1

            Just make sure that the option “. finds \r and \n” (labeled “. matches newline” in the English interface) is checked in the search dialog.
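To see what the pattern picks out, here is a small Python sketch that runs the same regex over a sample list. It is only an illustration under a few assumptions: Python's `re` module is not the engine Notepad++ uses, `re.DOTALL` stands in for the dot option mentioned above, the input is assumed to use `\n` line endings (Python's `$` does not stop before `\r`), and the helper name `earlier_duplicates` is made up.

```python
import re

# The pattern from the post, unchanged. re.MULTILINE lets ^ and $ match at
# line boundaries; re.DOTALL lets the dot in the lookahead cross line breaks.
DUPLICATE_LINE = re.compile(r"^([^\r\n]+)$(?=.*?^\1$)", re.MULTILINE | re.DOTALL)


def earlier_duplicates(text: str) -> list[tuple[int, str]]:
    """Return (1-based line number, line) for every line that recurs later."""
    hits = []
    for match in DUPLICATE_LINE.finditer(text):
        line_number = text.count("\n", 0, match.start()) + 1
        hits.append((line_number, match.group(1)))
    return hits


sample = "value1\nvalue2\nvalue2\nvalue3\nvalue2\n"
print(earlier_duplicates(sample))  # [(2, 'value2'), (3, 'value2')]
```

The last `value2` (line 5) is not reported, which matches the "all occurrences except the last" behaviour described above. Note that the match covers only the line's text, not its line ending.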

            • Alan Kilborn

              @Matthias-Heim

              For me, I like this one to do the same thing:

              ^((?-s).+?)\R(?=(?s).*?^\1(?:\R|\z))

              It has (at least) two advantages:

              • You don’t have to care about the state of the “. matches newline” checkbox

              • The last line of the file doesn’t need a line ending to be considered in the duplicate decision (the text itself decides that). Whether such a line is truly a duplicate is up for debate, but I think it is
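Both advantages can be checked with a rough Python translation of the pattern above. This is an approximation, not the same regex: Python's `re` has no `\R` or `\z` (in current releases) and no mid-pattern `(?-s)`, so the sketch substitutes `\r?\n`, `\Z`, and a scoped `(?s:...)` group, and it only handles `\n` and `\r\n` line endings. The name `drop_earlier_duplicates` is made up for illustration.

```python
import re

# Approximate translation (assumptions noted in the text above):
#   (?-s).+?  -> [^\r\n]+      line text, never crossing a line break
#   \R        -> \r?\n         LF or CRLF only
#   (?s).*?   -> (?s:.*?)      dot crosses line breaks inside this group only
#   \z        -> \Z            end of the whole text
EARLIER_DUPLICATE = re.compile(
    r"^([^\r\n]+)\r?\n(?=(?s:.*?)^\1(?:\r?\n|\Z))",
    re.MULTILINE,
)


def drop_earlier_duplicates(text: str) -> str:
    """Delete every line (with its line ending) that occurs again later."""
    return EARLIER_DUPLICATE.sub("", text)


# The final line has no line ending, yet it still counts as the duplicate.
sample = "value1\nvalue2\nvalue2\nvalue3\nvalue2"
print(repr(drop_earlier_duplicates(sample)))  # 'value1\nvalue3\nvalue2'
```

Because the match includes the line ending, replacing with nothing removes the whole line instead of leaving an empty one behind.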

              • guy038

                Hello, @matthias-heim, @alan-kilborn and All,

                Alan, I don’t think the lazy quantifier at the beginning of the regex is necessary since, obviously, the EOL chars must be matched anyway!

                Hence the syntax:

                (?-s)^(.+)\R(?=(?s).*?^\1(?:\R|\z))


                However, @matthias-heim, be aware that when a large number of lines separates the line being scanned from its nearest duplicate, the regex may completely fail to detect the correct matches :-((
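If the distance between duplicates is large enough to trip up the regex, one alternative outside Notepad++ is a single pass that remembers every line seen so far. The sketch below is a different technique (a set lookup, not a regex) and nobody in this thread proposed it; the name `later_duplicates` is made up. Note that it reports the later occurrences, whereas the patterns above match the earlier ones.

```python
from typing import Iterable, Iterator


def later_duplicates(lines: Iterable[str]) -> Iterator[tuple[int, str]]:
    """Yield (1-based line number, line) for each line already seen earlier."""
    seen: set[str] = set()
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # like the regexes above, ignore empty lines
        if line in seen:
            yield number, line
        else:
            seen.add(line)


sample = "value1\nvalue2\nvalue2\nvalue3\nvalue2\n"
print(list(later_duplicates(sample.splitlines())))  # [(3, 'value2'), (5, 'value2')]
```

Each line costs one set lookup, so the gap between a line and its duplicate does not matter. Passing an open file object instead of a list would avoid loading the whole file at once.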

                Best Regards

                guy038

                • Mohammed Asif

                  @guy038 said in Is there a way to search for duplicate records in Notepad++?:

                  (?-s)^(.+)\R(?=(?s).*?^\1(?:\R|\z))

                  Can you please tell me how to mark both lines (original + duplicate)?

                  • guy038

                    Hello, @mohammed-asif and All,

                    Before giving a practical answer to your question, could you tell us a bit about your data:

                    • Why do you want to mark all the duplicate lines? Do you intend to delete them all, copy them for some other process, or something else?

                    • Roughly how many lines are to be processed, and what is their average length?

                    • Roughly what is the maximum number of lines between two duplicate lines?

                    Maybe you could add a short example of your text?


                    I’ve already found a solution, but it mainly depends on how the data is organized and on what kind of processing is needed after bookmarking!

                    See you later,

                    Best Regards,

                    guy038

                    • Alan Kilborn

                      @guy038 said in Is there a way to search for duplicate records in Notepad++?:

                      Why do you want to mark all the duplicate lines ?

                      A practical reason (not involving copy/cut/delete) for this might be so that each duplicate line can be visited and manually edited to be made unique in some way.
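For that kind of review, a list of line numbers covering both the original and every repeat may be enough to jump between them with Ctrl+G. The thread does not show guy038's solution, so the following is only a minimal Python sketch of one way to get such a list; the name `duplicate_groups` is made up for illustration.

```python
from collections import defaultdict


def duplicate_groups(text: str) -> dict[str, list[int]]:
    """Map each repeated line to the 1-based numbers of ALL its occurrences."""
    positions: dict[str, list[int]] = defaultdict(list)
    for number, line in enumerate(text.splitlines(), start=1):
        if line:  # ignore empty lines
            positions[line].append(number)
    # Keep only lines that occur more than once (original + duplicates).
    return {line: nums for line, nums in positions.items() if len(nums) > 1}


sample = "value1\nvalue2\nvalue2\nvalue3\nvalue2\n"
print(duplicate_groups(sample))  # {'value2': [2, 3, 5]}
```

Unlike the regexes earlier in the thread, this reports the first occurrence together with its repeats, which is what "original + duplicate" asks for.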
