
    sort file removing duplicates possible?

    75 Posts 5 Posters 45.8k Views
    • patrickdrdP
      patrickdrd
      last edited by

      I had a problem with a file today and I don’t know why.
      Is there any way I can send you the file so you can take a look?

      • Scott SumnerS
        Scott Sumner @patrickdrd
        last edited by

        @patrickdrd said:

        send you the file to take a look

        Post its contents on http://textuploader.com/ and reply here with the link to it. Also indicate exactly what problem you are seeing if it isn’t obvious. No guarantees about how deeply I can get involved, but I’ll take a quick look… :-)

        • patrickdrdP
          patrickdrd
          last edited by

          ok, it’s here:
          http://textuploader.com/d4dkr

          After removing duplicates, the result still contains a duplicate line.

          • patrickdrdP
            patrickdrd
            last edited by patrickdrd

            Removing duplicates without sorting first produces this result:
            DBServerIP=10.1.249.215
            DBServerIP=10.12.77.185
            DBServerIP=10.1.249.215

            (the 1st and 3rd are exact duplicates)

            and sorting first and then removing duplicates results in:
            DBServerIP=10.12.22.129
            DBServerIP=10.12.77.185
            DBServerIP=10.12.77.185

            (the last two are duplicates)

            (using your regular expression in both cases, of course)

            • Scott SumnerS
              Scott Sumner @patrickdrd
              last edited by

              @patrickdrd

              Okay, so a quick look told me what is happening. It’s interesting. :-)

              If you run only the sort on this data, you’ll get this at the end of the file:

              [Imgur screenshot]

              You’d think that the removal of duplicate lines after this would result in only a single occurrence of DBServerIP=10.12.77.185. However, if we look closer at what is really left after the duplicate removal (turning on visibility of line-endings!), we see:

              [Imgur screenshot]

              We see that the two lines with that IP address truly are different–one has a line-ending and one doesn’t–and because these two lines are not the same, the regular-expression replacement is working correctly by leaving both of these lines after it does its work.

              All of the original lines that ended in .185 had line-endings (in other words, were exactly the same), so I’d say this is an artifact resulting from the sort operation (in my mind this is a sort BUG!).
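A minimal Python sketch of why the two lines compare as different. The sample text is reduced from the data quoted above and the variable names are illustrative only; this is plain Python, not anything Notepad++ runs internally.

```python
# Tail of the sorted data: the final line has no line-ending.
sorted_tail = (
    "DBServerIP=10.12.22.129\r\n"
    "DBServerIP=10.12.77.185\r\n"
    "DBServerIP=10.12.77.185"
)

# keepends=True keeps each line's terminator, so the raw lines are compared.
raw_lines = sorted_tail.splitlines(keepends=True)
second_last, last = raw_lines[-2], raw_lines[-1]

print(repr(second_last))
print(repr(last))
print(second_last == last)                 # raw comparison, terminator included
print(second_last.rstrip("\r\n") == last)  # comparison ignoring the terminator
```

The visible text is the same; only the terminator differs, and a regex that requires a line-ending after the second copy will treat them as different lines.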

              But we can work around it. There could be a regular expression solution, but maybe the regex that removes duplicates is complicated enough. What I’d suggest here is to modify the original macro, after the sort but before the find+replace, to:

              • move caret position to the end of file (the sort operation leaves it at the beginning of file)
              • insert a (Windows style) line-ending
              • move caret position back to beginning of file (in preparation for the find+replace)

              Thus:

              <Macro name="test sort and del dupe lines 2" Ctrl="no" Alt="no" Shift="no" Key="0">
                  <Action type="2" message="0" wParam="42059" lParam="0" sParam="" />
                  <Action type="0" message="2318" wParam="0" lParam="0" sParam="" />
                  <Action type="1" message="2170" wParam="0" lParam="0" sParam="&#x000D;" />
                  <Action type="1" message="2170" wParam="0" lParam="0" sParam="&#x000A;" />
                  <Action type="0" message="2316" wParam="0" lParam="0" sParam="" />
                  <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
                  <Action type="3" message="1601" wParam="0" lParam="0" sParam="(?-s)^(.*)(?:\R)(?s)(?=.*^\1\R)" />
                  <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
                  <Action type="3" message="1602" wParam="0" lParam="0" sParam="" />
                  <Action type="3" message="1702" wParam="0" lParam="512" sParam="" />
                  <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
              </Macro>
              

              Compare that with the version posted much earlier in this thread.

              Running this new macro on your data results in:

              [Imgur screenshot]

              which is the desired result.
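For anyone who wants to experiment outside Notepad++, here is a rough Python sketch of the same idea: sort, make sure the last line has a line-ending, then remove every line that has an identical line somewhere after it. Several things are assumptions: Python's `re` has no `\R`, so `\r?\n` stands in for it; `[^\r\n]*` replaces `.*` so the capture never swallows a `\r`; Python's `sorted()` is only a stand-in for the editor's sort; and the function name and the `fix_last_line` switch are hypothetical.

```python
import re

# Python stand-in for (?-s)^(.*)(?:\R)(?s)(?=.*^\1\R)
DUPLICATE_WITH_EOL = re.compile(
    r"^([^\r\n]*)\r?\n(?=(?s:.*)^\1\r?\n)", re.MULTILINE
)

def sort_and_remove_duplicates(text, eol="\r\n", fix_last_line=True):
    """Sort the lines, optionally terminate the last one, then drop duplicates."""
    # Joining leaves the last line without a terminator, like the sorted file.
    joined = eol.join(sorted(text.splitlines()))
    if fix_last_line and joined:
        joined += eol  # the workaround: give the last line a line-ending
    return DUPLICATE_WITH_EOL.sub("", joined)

data = (
    "DBServerIP=10.12.77.185\r\n"
    "DBServerIP=10.12.22.129\r\n"
    "DBServerIP=10.12.77.185\r\n"
)
print(repr(sort_and_remove_duplicates(data)))
print(repr(sort_and_remove_duplicates(data, fix_last_line=False)))
```

With `fix_last_line=False` the unterminated last line survives as an apparent duplicate, which mirrors the behavior described above.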

              • Scott SumnerS
                Scott Sumner @patrickdrd
                last edited by

                @patrickdrd

                For the DBServerIP=10.1.249.215 case you mentioned, if you don’t do the sort first, then a .215 line is the last line of the data you turn over to the regular expression replace operation. Thus it lacks the trailing line-ending that the earlier occurrence of a similar line has. Same issue as the .185 case…

                • patrickdrdP
                  patrickdrd
                  last edited by

                  thanks a ton!

                  I’d like a more “generic” approach, so I called the trim command (42056) first to clear empty lines,
                  because sorting puts an empty line at the top if it finds one,
                  and then continued as you suggested.

                  • Scott SumnerS
                    Scott Sumner @patrickdrd
                    last edited by Scott Sumner

                    @patrickdrd

                    I’m glad you have a solution. Yeah, the empty line thing with sorting is rather bad, but the end-user can’t control this behavior of the sorting. I think I’ve changed my mind and it is probably best to alter the regular expression a bit in order to handle the situation where there is a duplicate-but-without-line-ending at the end of the file. So I’d suggest changing it to:

                    (?-s)^(.*)\R(?s)(?=.*^\1(?:\R|\z))

                    I’ve done two things to this regex:

                    • I removed the (?: and ) around the first \R (a simplification discussed earlier so no need to say any more here)
                    • The final \R was changed to (?:\R|\z) (see discussion below)

                    The \R|\z part is what allows an almost-duplicate at the end-of-file-without-line-ending to be detected. The new part to this is the \z which roughly means “match only at the very end of the data”.

                    The (?: and ) was added so that the | only affects the \R that precedes it and the \z that follows it
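A rough Python translation of the revised regex, for trying it on sample text outside Notepad++. The translation is an assumption on my part: Python's `re` has no `\R`, so `\r?\n` is used, and `\Z` is Python's end-of-string anchor standing in for `\z`. The function name is hypothetical.

```python
import re

# Python stand-in for (?-s)^(.*)\R(?s)(?=.*^\1(?:\R|\z))
DUPLICATE_EOL_OR_EOF = re.compile(
    r"^([^\r\n]*)\r?\n(?=(?s:.*)^\1(?:\r?\n|\Z))", re.MULTILINE
)

def remove_earlier_duplicates(text):
    """Delete every line that has an identical line somewhere after it."""
    return DUPLICATE_EOL_OR_EOF.sub("", text)

# Unsorted sample whose last line has no line-ending.
unsorted_text = (
    "DBServerIP=10.1.249.215\r\n"
    "DBServerIP=10.12.77.185\r\n"
    "DBServerIP=10.1.249.215"
)
deduplicated = remove_earlier_duplicates(unsorted_text)
print(repr(deduplicated))
```

The first `.215` line is removed because the lookahead now accepts the final, unterminated copy as a match.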

                    • guy038G
                      guy038
                      last edited by

                      Hello @patrickdrd, @scott-sumner and All,

                      Generally speaking, when you want to remove duplicate lines, from a PREVIOUSLY SORTED list, just use this simple regex S/R, below :

                      SEARCH (?-s)(.*\R)\1+

                      REPLACE \1

                      This regex is quite fast, because, in case of numerous duplicates, the part \1+ grabs all the duplicates ( with their EOL characters ), at once and just rewrites the first item of each block :-))

                      IMPORTANT : the last item of your sorted list must be followed by EOL character(s) !
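A small Python sketch of this search/replace, assuming `\r?\n` as a stand-in for `\R` (Python's `re` does not support `\R`). The leading `^` is an addition that is not in the posted regex, so that in this version a match can only begin at the start of a line. The function name is hypothetical.

```python
import re

# Python stand-in for (?-s)(.*\R)\1+ ; the ^ anchor is an addition.
ADJACENT_DUPLICATES = re.compile(r"^([^\r\n]*\r?\n)\1+", re.MULTILINE)

def collapse_adjacent_duplicates(sorted_text):
    """Replace each run of identical consecutive lines with a single copy."""
    return ADJACENT_DUPLICATES.sub(r"\1", sorted_text)

terminated = "a\r\na\r\na\r\nb\r\nc\r\nc\r\n"
unterminated = "a\r\nb\r\nb"  # last line lacks its EOL

print(repr(collapse_adjacent_duplicates(terminated)))
print(repr(collapse_adjacent_duplicates(unterminated)))
```

The second call leaves both `b` lines in place, which illustrates why the last item must be followed by EOL characters.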

                      Cheers,

                      guy038

                      • patrickdrdP
                        patrickdrd
                        last edited by

                        After trying both the macro and the TextFX solution for a long time,
                        I’ve found that UltraEdit’s sorting still works much better than either of them.
                        I first rejected the macro in favor of TextFX, but I found out lately that TextFX doesn’t do a good job either; at least, I prefer the sorting done by UltraEdit, sorry.
                        I don’t have an example at the moment; I’ll post again when I do.

                        • Claudia FrankC
                          Claudia Frank @patrickdrd
                          last edited by

                          @patrickdrd

                          sorry, did not read the whole thread but what about a python one liner?

                          editor.setText('\r\n'.join(list(set(editor.getText().splitlines()))))
                          

                          Cheers
                          Claudia

                          • Claudia FrankC
                            Claudia Frank @patrickdrd
                            last edited by

                            @patrickdrd

                            forgot sorting

                            editor.setText('\r\n'.join(sorted(list(set(editor.getText().splitlines())))))
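The same logic as a standalone function, so it can be tested outside Notepad++. This is only a sketch: `unique_lines` and its parameters are hypothetical names, and the order-preserving branch is an addition (a plain `set` does not keep the original line order). In PythonScript it could presumably be wired up as `editor.setText(unique_lines(editor.getText()))`.

```python
def unique_lines(text, sort=True, eol="\r\n"):
    """Return the text with duplicate lines removed, joined with `eol`."""
    lines = text.splitlines()
    if sort:
        unique = sorted(set(lines))
    else:
        # dict.fromkeys keeps the first occurrence of each line, in order.
        unique = list(dict.fromkeys(lines))
    return eol.join(unique)

print(repr(unique_lines("b\r\na\r\nb\r\n")))
print(repr(unique_lines("b\r\na\r\nb", sort=False)))
```

Because `splitlines()` drops the terminators before comparing, a last line without a line-ending is not treated as different here. Note that the joined result has no line-ending after its last line.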
                            

                            Cheers
                            Claudia

                            • patrickdrdP
                              patrickdrd
                              last edited by

                              thanks, I’m keeping that too and I’ll let you know

                              • patrickdrdP
                                patrickdrd
                                last edited by patrickdrd

                                OK, I tested it, and somehow this doesn’t work either; at least, it doesn’t work like UE’s function does.

                                I tested with EasyList from https://easylist.to/easylist/easylist.txt (69883 lines).
                                UE’s function cuts it to 69238 lines.

                                Notepad++ results:
                                Python script: 69818 lines
                                macro (Scott): a bit slow, and results in 19610 lines (!)
                                TextFX: 69818 lines as well
                                guy038’s regular expression: 28109 lines

                                Based on my experience, I bet that UE’s result is the correct one,
                                at least to my taste.

                                • Claudia FrankC
                                  Claudia Frank @patrickdrd
                                  last edited by

                                  @patrickdrd

                                  Could you, by any chance, upload the list as reduced by UE,
                                  so we can see the differences?
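Once both outputs are available, one way to see the differences is to compare them as sets of lines. This is a hypothetical helper, not something posted in the thread. It compares lines exactly, after stripping line-endings, so any whitespace or letter-case difference counts as a different line.

```python
def compare_line_sets(text_a, text_b):
    """Return (lines only in text_a, lines only in text_b), each sorted."""
    lines_a = set(text_a.splitlines())
    lines_b = set(text_b.splitlines())
    return sorted(lines_a - lines_b), sorted(lines_b - lines_a)

# Tiny made-up inputs standing in for the two tools' outputs.
only_a, only_b = compare_line_sets("x\ny\nz\n", "y\nz\nw\n")
print(only_a)
print(only_b)
```

Lines that one tool kept and the other dropped would show up in one of the two lists.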

                                  Cheers
                                  Claudia

                                  • patrickdrdP
                                    patrickdrd
                                    last edited by

                                    yes, of course, please tell me where, pastebin doesn’t work, it’s blocked here (at work),
                                    any other suggestions?

                                    • Claudia FrankC
                                      Claudia Frank
                                      last edited by Claudia Frank

                                      Actually, Pastebin is my first choice as well, and I haven’t used others for quite some time now.

                                      I’ve heard that

                                      https://www.zippyshare.com/
                                      https://www.sendspace.com/

                                      should be good and anonymous, but I haven’t tried them so far.

                                      Cheers
                                      Claudia

                                      • Scott SumnerS
                                        Scott Sumner
                                        last edited by

                                        Yeah, I used to be a fan of regular expression replacement for this, but with “larger” datasets there always seems to be so much tweaking and experimentation needed to get it right (for a particular dataset) that it is hardly worth it, unless you like playing with regular expressions all day instead of solving a particular problem and moving on quickly.

                                        A PythonScript solution such as @Claudia-Frank’s seems fine…

                                        • patrickdrdP
                                          patrickdrd
                                          last edited by

                                          OK, but why are the results inconsistent (with large datasets)?

                                          @Claudia-Frank, unfortunately I’m not able to access zippyshare either,
                                          so I’ll upload to pastebin from home if we don’t find another solution

                                          • Claudia FrankC
                                            Claudia Frank @patrickdrd
                                            last edited by

                                            @patrickdrd

                                            I installed the ue trial version but can’t find the menu item to delete the duplicates.
                                            Is there anything I need to install in addition or am I blind and don’t see the obvious?

                                            Cheers
                                            Claudia

                                            The Community of users of the Notepad++ text editor.