    sort file removing duplicates possible?

    75 Posts 5 Posters 45.8k Views
    • Scott Sumner @patrickdrd

      @patrickdrd said:

      send you the file to take a look

      Post its contents on http://textuploader.com/ and reply here with the link to it. Also indicate exactly what problem you are seeing if it isn’t obvious. No guarantees about how deeply I can get involved, but I’ll take a quick look… :-)

      • patrickdrd

        ok, it’s here:
        http://textuploader.com/d4dkr

        After removing duplicates, the results still contain a duplicate.

        • patrickdrd

          remove duplicates without sorting produces the result:
          DBServerIP=10.1.249.215
          DBServerIP=10.12.77.185
          DBServerIP=10.1.249.215

          (1st and 3rd exact duplicates)

          and sorting first and then removing dups results in:
          DBServerIP=10.12.22.129
          DBServerIP=10.12.77.185
          DBServerIP=10.12.77.185

          (the last 2 are duplicates)

          (using your regular expression in both cases of course)

          • Scott Sumner @patrickdrd

            @patrickdrd

            Okay, so a quick look told me what is happening. It’s interesting. :-)

            If you run only the sort on this data, you’ll get this at the end of the file:

            [Screenshot (Imgur): end of the file after the sort; image not available]

            You’d think that the removal of duplicate lines after this would result in only a single occurrence of DBServerIP=10.12.77.185. However, if we look closer at what is really left after the duplicate removal (turning on visibility of line-endings!), we see:

            [Screenshot (Imgur): lines left after duplicate removal, with line-endings visible; image not available]

            We see that the two lines with that IP address truly are different–one has a line-ending and one doesn’t–and because these two lines are not the same, the regular-expression replacement is working correctly by leaving both of these lines after it does its work.

            All of the original lines that ended in .185 had line-endings (in other words, were exactly the same), so I’d say this is an artifact resulting from the sort operation (in my mind this is a sort BUG!).
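To make the "one has a line-ending and one doesn't" point concrete outside of Notepad++, here is a minimal Python sketch. The sample text is an assumption pieced together from the lines quoted in this thread (sorted data whose last line lacks a line-ending); it is not the actual uploaded file.

```python
# Assumed sample: sorted data whose final line has no line-ending.
sorted_text = (
    "DBServerIP=10.12.22.129\r\n"
    "DBServerIP=10.12.77.185\r\n"
    "DBServerIP=10.12.77.185"
)

# keepends=True keeps each line's terminator, like showing line-endings in the editor.
lines = sorted_text.splitlines(keepends=True)
for line in lines:
    print(repr(line))

# The two .185 lines look identical on screen but differ as raw text.
same_raw_text = lines[-2] == lines[-1]
same_visible_text = lines[-2].rstrip("\r\n") == lines[-1]
print(same_raw_text, same_visible_text)
```

A regex that requires a line-ending after the repeated text therefore sees two different lines here.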

            But we can work around it. There could be a regular expression solution, but maybe the regex that removes duplicates is complicated enough. What I’d suggest here is to modify the original macro, after the sort but before the find+replace, to:

            • move caret position to the end of file (the sort operation leaves it at the beginning of file)
            • insert a (Windows style) line-ending
            • move caret position back to beginning of file (in preparation for the find+replace)

            Thus:

            <Macro name="test sort and del dupe lines 2" Ctrl="no" Alt="no" Shift="no" Key="0">
                <Action type="2" message="0" wParam="42059" lParam="0" sParam="" />
                <Action type="0" message="2318" wParam="0" lParam="0" sParam="" />
                <Action type="1" message="2170" wParam="0" lParam="0" sParam="&#x000D;" />
                <Action type="1" message="2170" wParam="0" lParam="0" sParam="&#x000A;" />
                <Action type="0" message="2316" wParam="0" lParam="0" sParam="" />
                <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
                <Action type="3" message="1601" wParam="0" lParam="0" sParam="(?-s)^(.*)(?:\R)(?s)(?=.*^\1\R)" />
                <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
                <Action type="3" message="1602" wParam="0" lParam="0" sParam="" />
                <Action type="3" message="1702" wParam="0" lParam="512" sParam="" />
                <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
            </Macro>
            

            Compare that with the version posted much earlier in this thread.

            Running this new macro on your data results in:

            [Screenshot (Imgur): result of running the new macro; image not available]

            which is the desired result.
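For readers who want to try the same idea outside the macro recorder, here is a rough Python sketch of the workaround (append a line-ending, then run the duplicate-removal regex). It is only an approximation: Python's `re` has no `\R`, so the pattern below hard-codes `\r\n`, and the function names and sample text are made up for illustration.

```python
import re

EOL = "\r\n"  # Windows-style line-ending, as inserted by the macro

# Approximation of the macro's regex; "\r\n" stands in for \R (an assumption).
DEDUPE = re.compile(r"^(.*)\r\n(?=(?s:.*)^\1\r\n)", re.MULTILINE)

def append_eol(text):
    """Mimic 'go to end of file and insert a line-ending'."""
    return text + EOL

def remove_earlier_duplicates(text):
    """Delete every line that appears again, with a line-ending, later on."""
    return DEDUPE.sub("", text)

# Assumed sample: already-sorted data whose last line has no line-ending.
after_sort = EOL.join([
    "DBServerIP=10.12.22.129",
    "DBServerIP=10.12.77.185",
    "DBServerIP=10.12.77.185",
])

without_fix = remove_earlier_duplicates(after_sort)
with_fix = remove_earlier_duplicates(append_eol(after_sort))
print(repr(without_fix))
print(repr(with_fix))
```

Without the appended line-ending both .185 lines survive; with it, only one remains.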

            • Scott Sumner @patrickdrd

              @patrickdrd

              For the DBServerIP=10.1.249.215 case you mentioned, if you don’t do the sort first, then a .215 line is the last line of the data you turn over to the regular expression replace operation. Thus it lacks the trailing line-ending that the earlier occurrence of a similar line has. Same issue as the .185 case…

              • patrickdrd

                thanks a ton!

                I’d like a more “generic” approach, so I called trim 42056 to clear empty lines first,
                because sorting puts an empty line at the top if it finds one,
                and then went on as you suggested.

                • Scott Sumner @patrickdrd

                  @patrickdrd

                  I’m glad you have a solution. Yeah, the empty line thing with sorting is rather bad, but the end-user can’t control this behavior of the sorting. I think I’ve changed my mind and it is probably best to alter the regular expression a bit in order to handle the situation where there is a duplicate-but-without-line-ending at the end of the file. So I’d suggest changing it to:

                  (?-s)^(.*)\R(?s)(?=.*^\1(?:\R|\z))

                  I’ve done two things to this regex:

                  • I removed the (?: and ) around the first \R (a simplification discussed earlier so no need to say any more here)
                  • The final \R was changed to (?:\R|\z) (see discussion below)

                  The \R|\z part is what allows an almost-duplicate at the end-of-file-without-line-ending to be detected. The new part to this is the \z which roughly means “match only at the very end of the data”.

                  The (?: and ) were added so that the | only applies to the \R that precedes it and the \z that follows it.
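As a rough illustration of the `(?:\R|\z)` change, here is a Python translation. This is an assumption-laden sketch, not the Notepad++ regex itself: Python's `re` lacks `\R`, so `\n` is used, and `\Z` plays the role of `\z`. The sample is the unsorted case reported earlier in the thread.

```python
import re

# Python stand-in for (?-s)^(.*)\R(?s)(?=.*^\1(?:\R|\z)); "\n" replaces \R, \Z replaces \z.
DEDUPE_EOF_AWARE = re.compile(r"^(.*)\n(?=(?s:.*)^\1(?:\n|\Z))", re.MULTILINE)

def remove_earlier_duplicates(text):
    """Delete a line if the same text shows up later, even as a final line with no line-ending."""
    return DEDUPE_EOF_AWARE.sub("", text)

# Last line has no line-ending, as in the reported .215 case.
unsorted_case = (
    "DBServerIP=10.1.249.215\n"
    "DBServerIP=10.12.77.185\n"
    "DBServerIP=10.1.249.215"
)

result = remove_earlier_duplicates(unsorted_case)
print(repr(result))
```

The earlier .215 line is dropped and the last occurrence is kept, with no need to append a line-ending first.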

                  • guy038

                    Hello @patrickdrd, @scott-sumner and All,

                    Generally speaking, when you want to remove duplicate lines, from a PREVIOUSLY SORTED list, just use this simple regex S/R, below :

                    SEARCH (?-s)(.*\R)\1+

                    REPLACE \1

                    This regex is quite fast, because, in case of numerous duplicates, the part \1+ grabs all the duplicates ( with their EOL characters ), at once and just rewrites the first item of each block :-))

                    IMPORTANT : the last item of your sorted list must be followed by EOL character(s) !
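A small Python sketch of the same search/replace, for experimenting outside Notepad++. Assumptions: `\n` stands in for `\R`, and the leading `^` (with `re.MULTILINE`) is an addition that is not in the regex above; it is there so that a match cannot start in the middle of a line whose tail happens to equal the next line.

```python
import re

# "^" is an added safeguard (not in the posted regex); "\n" stands in for \R.
ADJACENT_DUPES = re.compile(r"^(.*\n)\1+", re.MULTILINE)

def collapse_adjacent_duplicates(sorted_text):
    """Replace each run of identical consecutive lines with a single copy."""
    return ADJACENT_DUPES.sub(r"\1", sorted_text)

collapsed = collapse_adjacent_duplicates("a\na\na\nb\nc\nc\n")
no_final_eol = collapse_adjacent_duplicates("a\na")
suffix_case = collapse_adjacent_duplicates("xab\nab\n")
print(repr(collapsed), repr(no_final_eol), repr(suffix_case))
```

The second call shows the IMPORTANT note in action: without a final line-ending, the last duplicate is not recognized.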

                    Cheers,

                    guy038

                    • patrickdrd

                      After trying both the macro and the TextFX solution for a long time,
                      I’ve found that UltraEdit’s sorting still works much better than either of them.
                      I first rejected the macro in favor of TextFX, but lately I found that TextFX doesn’t do a good job either; at least, I prefer the sorting done by UltraEdit, sorry.
                      I don’t have an example at the moment; I’ll post again when I do.

                      • Claudia Frank @patrickdrd

                        @patrickdrd

                        sorry, did not read the whole thread but what about a python one liner?

                        # set() drops duplicates but does not keep the original line order
                        editor.setText('\r\n'.join(set(editor.getText().splitlines())))
                        

                        Cheers
                        Claudia

                        • Claudia Frank @patrickdrd

                          @patrickdrd

                          forgot sorting

                          # sorted() accepts the set directly, so the extra list() is not needed
                          editor.setText('\r\n'.join(sorted(set(editor.getText().splitlines()))))
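The same idea can be tried without the `editor` object. The sketch below is plain Python 3 (an assumption; PythonScript may run an older interpreter), and the function names are made up for illustration. It shows the sorted variant next to an order-preserving one.

```python
def sorted_unique_lines(text, eol="\r\n"):
    """Sort the lines and keep one copy of each, like the one-liner above."""
    return eol.join(sorted(set(text.splitlines())))

def unique_lines_keep_order(text, eol="\r\n"):
    """Keep the first occurrence of each line, in original order (Python 3.7+ dict ordering)."""
    return eol.join(dict.fromkeys(text.splitlines()))

sample = "b\r\na\r\nb\r\nc\r\na"
print(repr(sorted_unique_lines(sample)))
print(repr(unique_lines_keep_order(sample)))
```

Because `splitlines()` strips the line-endings before comparing, a final line without a line-ending is treated like any other line here.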
                          

                          Cheers
                          Claudia

                          • patrickdrd

                            thanks, I’m keeping that too and I’ll let you know

                            • patrickdrd

                              OK, I tested it, and somehow this doesn’t work either, or at least it doesn’t work like UE’s, i.e.:

                              I tested with easylist from: https://easylist.to/easylist/easylist.txt (69883 lines);
                              UE’s function cuts it to 69238 lines.

                              NPP results:
                              Python script: 69818
                              macro (Scott): a bit slow, and results in 19610 lines (!)
                              TextFX: 69818 as well

                              guy038’s regular expression results in 28109 lines.
                              From my experience, I’d bet that UE’s result is the correct one,
                              at least to my taste.
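One way to start narrowing down why the tools disagree is to count unique lines under a few different notions of "same line". The sketch below is only a diagnostic aid: the normalizations (trailing whitespace, letter case) are guesses at what a tool might ignore, not confirmed behavior of UltraEdit, TextFX, or Notepad++, and the function names are hypothetical.

```python
def count_unique(lines, key=lambda s: s):
    """Number of distinct lines after applying a normalization function."""
    return len({key(line) for line in lines})

def dedupe_report(text):
    """Line counts under several hypothetical definitions of 'duplicate'."""
    lines = text.splitlines()
    return {
        "total": len(lines),
        "exact": count_unique(lines),
        "ignoring trailing whitespace": count_unique(lines, str.rstrip),
        "ignoring case": count_unique(lines, str.casefold),
    }

report = dedupe_report("Ad\nad\nad \nx\n")
for name, count in report.items():
    print(name, count)
```

If one of these counts lines up with a given tool's output on the same file, that hints at how that tool compares lines.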

                              • Claudia Frank @patrickdrd

                                @patrickdrd

                                Could you, by any chance, upload the list as trimmed by UE,
                                so we can see the differences?

                                Cheers
                                Claudia

                                • patrickdrd

                                  yes, of course, please tell me where, pastebin doesn’t work, it’s blocked here (at work),
                                  any other suggestions?

                                  • Claudia Frank

                                    Actually, pastebin is my first choice as well, and I haven’t used others for quite some time now.

                                    I’ve heard that

                                    https://www.zippyshare.com/
                                    https://www.sendspace.com/

                                    should be good and anonymous, but I haven’t tried them so far.

                                    Cheers
                                    Claudia

                                    • Scott Sumner

                                      Yeah, I used to be a fan of regular expression replacement when doing this, but with “larger” datasets there always seems to be so much tweaking and experimentation needed to get it right (for a particular dataset) that it is hardly worth it, unless you like playing with regular expressions all day instead of solving a particular problem and moving on quickly.

                                      A PythonScript solution such as @Claudia-Frank’s seems fine…

                                      • patrickdrd

                                        OK, but why are the results inconsistent (with large datasets)?

                                        @Claudia-Frank, unfortunately I’m not able to access zippyshare either,
                                        so I’ll upload to pastebin from home if we don’t find another solution

                                        • Claudia Frank @patrickdrd

                                          @patrickdrd

                                          I installed the ue trial version but can’t find the menu item to delete the duplicates.
                                          Is there anything I need to install in addition or am I blind and don’t see the obvious?

                                          Cheers
                                          Claudia

                                          • Claudia Frank

                                            ok - found it - obviously blind :-D

                                            Cheers
                                            Claudia

                                            The Community of users of the Notepad++ text editor.