sort file removing duplicates possible?



  • I had a problem with a file today and I don’t know why.
    Is there any way I can send you the file to take a look?



  • @patrickdrd said:

    send you the file to take a look

    Post its contents on http://textuploader.com/ and reply here with the link to it. Also indicate exactly what problem you are seeing if it isn’t obvious. No guarantees about how deeply I can get involved, but I’ll take a quick look… :-)



  • ok, it’s here:
    http://textuploader.com/d4dkr

    After removing duplicates, the results still contain a duplicate line.



  • remove duplicates without sorting produces the result:
    DBServerIP=10.1.249.215
    DBServerIP=10.12.77.185
    DBServerIP=10.1.249.215

    (1st and 3rd exact duplicates)

    and sorting first and then removing dups results in:
    DBServerIP=10.12.22.129
    DBServerIP=10.12.77.185
    DBServerIP=10.12.77.185

    (the last two are duplicates)

    (using your regular expression in both cases of course)



  • @patrickdrd

    Okay, so a quick look told me what is happening. It’s interesting. :-)

    If you run only the sort on this data, you’ll get this at the end of the file:

    [Screenshot not preserved: the end of the file after sorting]

    You’d think that the removal of duplicate lines after this would result in only a single occurrence of DBServerIP=10.12.77.185. However, if we look closer at what is really left after the duplicate removal (turning on visibility of line-endings!), we see:

    [Screenshot not preserved: the remaining lines with line-endings visible]

    We see that the two lines with that IP address truly are different (one has a line-ending and one doesn’t), and because these two lines are not the same, the regular-expression replacement is working correctly by leaving both of them in place.

    All of the original lines that ended in .185 had line-endings (in other words, were exactly the same), so I’d say this is an artifact resulting from the sort operation (in my mind this is a sort BUG!).
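
The effect can be reproduced outside Notepad++ with a minimal Python sketch. This is an approximation, not the macro itself: Python's `re` module has no `\R` and does not allow `(?-s)` in the middle of a pattern, so the pattern below substitutes `\r?\n` and a scoped `(?s:...)`. The name `DEDUP` and the sample data are made up for illustration.

```python
import re

# Rough Python port of the duplicate-removal regex (an assumption, not the
# exact Boost pattern Notepad++ runs): "\r?\n" stands in for \R and
# "[^\r\n]*" stands in for "(?-s).*".
DEDUP = re.compile(r"^([^\r\n]*)\r?\n(?=(?s:.*)^\1\r?\n)", re.MULTILINE)

# Sorted data whose final line has no line-ending, as described above.
sorted_text = (
    "DBServerIP=10.12.22.129\r\n"
    "DBServerIP=10.12.77.185\r\n"
    "DBServerIP=10.12.77.185"
)

result = DEDUP.sub("", sorted_text)
remaining = result.split("\r\n")

# The pattern only treats a line as a duplicate when a later copy is ALSO
# followed by a line-ending, so nothing is removed here.
print(remaining)
```

Both `.185` lines survive because the later copy is not followed by a line-ending, so the lookahead never finds a match for the earlier one.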

    But we can work around it. There could be a regular expression solution, but maybe the regex that removes duplicates is complicated enough. What I’d suggest here is to modify the original macro, after the sort but before the find+replace, to:

    • move caret position to the end of file (the sort operation leaves it at the beginning of file)
    • insert a (Windows style) line-ending
    • move caret position back to beginning of file (in preparation for the find+replace)

    Thus:

    <Macro name="test sort and del dupe lines 2" Ctrl="no" Alt="no" Shift="no" Key="0">
        <Action type="2" message="0" wParam="42059" lParam="0" sParam="" />
        <Action type="0" message="2318" wParam="0" lParam="0" sParam="" />
        <Action type="1" message="2170" wParam="0" lParam="0" sParam="&#x000D;" />
        <Action type="1" message="2170" wParam="0" lParam="0" sParam="&#x000A;" />
        <Action type="0" message="2316" wParam="0" lParam="0" sParam="" />
        <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
        <Action type="3" message="1601" wParam="0" lParam="0" sParam="(?-s)^(.*)(?:\R)(?s)(?=.*^\1\R)" />
        <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
        <Action type="3" message="1602" wParam="0" lParam="0" sParam="" />
        <Action type="3" message="1702" wParam="0" lParam="512" sParam="" />
        <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
    </Macro>
    

    Compare that with the version posted much earlier in this thread.

    Running this new macro on your data results in:

    [Screenshot not preserved: the result of running the new macro]

    which is the desired result.
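
For anyone who wants to check the idea without Notepad++, the same workaround can be sketched in plain Python. The function name `sort_and_dedupe` is hypothetical, and the regex is a rough port (`\r?\n` instead of `\R`) rather than the exact pattern from the macro. The only point it illustrates is that adding the missing line-ending before the replacement makes the last line comparable with the others.

```python
import re

# Approximate Python port of the macro's regex; "\r?\n" replaces \R.
DEDUP = re.compile(r"^([^\r\n]*)\r?\n(?=(?s:.*)^\1\r?\n)", re.MULTILINE)


def sort_and_dedupe(text, eol="\r\n"):
    """Sort the lines, make sure the last one ends with eol, then dedupe."""
    lines = sorted(text.splitlines())
    joined = eol.join(lines)          # last line has no line-ending yet
    if not joined.endswith(eol):
        joined += eol                 # the step the modified macro adds
    return DEDUP.sub("", joined)


sample = "b\r\na\r\nb\r\na"
print(repr(sort_and_dedupe(sample)))
```

Note that this regex keeps the last occurrence of each duplicated line, not the first; for sorted data that makes no visible difference.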



  • @patrickdrd

    For the DBServerIP=10.1.249.215 case you mentioned, if you don’t do the sort first, then a .215 line is the last line of the data you turn over to the regular expression replace operation. Thus it lacks the trailing line-ending that the earlier occurrence of a similar line has. Same issue as the .185 case…



  • thanks a ton!

    I’d like a more “generic” approach, so I called trim 42056 to clear empty lines first,
    because sorting puts an empty line at the top if it finds one,
    and then went on as you suggested



  • @patrickdrd

    I’m glad you have a solution. Yeah, the empty line thing with sorting is rather bad, but the end-user can’t control this behavior of the sorting. I think I’ve changed my mind and it is probably best to alter the regular expression a bit in order to handle the situation where there is a duplicate-but-without-line-ending at the end of the file. So I’d suggest changing it to:

    (?-s)^(.*)\R(?s)(?=.*^\1(?:\R|\z))

    I’ve done two things to this regex:

    • I removed the (?: and ) around the first \R (a simplification discussed earlier so no need to say any more here)
    • The final \R was changed to (?:\R|\z) (see discussion below)

    The \R|\z part is what allows an almost-duplicate at the end-of-file-without-line-ending to be detected. The new part to this is the \z which roughly means “match only at the very end of the data”.

    The (?: and ) were added so that the | only affects the \R that precedes it and the \z that follows it.
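
A rough Python equivalent of the revised regex, for experimenting outside Notepad++. The substitutions are my assumptions: `\r?\n` for `\R`, Python's `\Z` (end of string) for `\z`, and a scoped `(?s:...)` because Python does not accept `(?s)` mid-pattern. `DEDUP_EOF` is a made-up name.

```python
import re

# Python approximation of: (?-s)^(.*)\R(?s)(?=.*^\1(?:\R|\z))
DEDUP_EOF = re.compile(
    r"^([^\r\n]*)\r?\n(?=(?s:.*)^\1(?:\r?\n|\Z))",
    re.MULTILINE,
)

# Unsorted data whose last line duplicates the first but has no line-ending.
text = (
    "DBServerIP=10.1.249.215\r\n"
    "DBServerIP=10.12.77.185\r\n"
    "DBServerIP=10.1.249.215"
)

result = DEDUP_EOF.sub("", text)
print(result.split("\r\n"))
```

With the alternation in place, the first `.215` line is recognized as a duplicate of the final one even though the final one ends at end-of-file rather than at a line-ending.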



  • Hello @patrickdrd, @scott-sumner and All,

    Generally speaking, when you want to remove duplicate lines, from a PREVIOUSLY SORTED list, just use this simple regex S/R, below :

    SEARCH (?-s)(.*\R)\1+

    REPLACE \1

    This regex is quite fast, because, in case of numerous duplicates, the part \1+ grabs all the duplicates ( with their EOL characters ), at once and just rewrites the first item of each block :-))

    IMPORTANT : the last item of your sorted list must be followed by EOL character(s) !

    Cheers,

    guy038
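
A hedged Python sketch of the same "collapse adjacent duplicates" idea, useful for testing on sample data. It is not the Notepad++ regex verbatim: `\r?\n` replaces `\R`, and I have added a `^` anchor (with `re.MULTILINE`) that is not in the regex as posted. In this Python port, the unanchored form can start matching in the middle of a line, so `ab` followed by `b` would lose the `b` line; I have not verified whether Notepad++ behaves the same way. The names `SORTED_DEDUP` and `collapse_adjacent` are hypothetical.

```python
import re

# Anchored port of (?-s)(.*\R)\1+ ; the leading ^ is my addition.
SORTED_DEDUP = re.compile(r"^([^\r\n]*\r?\n)\1+", re.MULTILINE)


def collapse_adjacent(text):
    """Replace each run of identical consecutive lines with a single copy."""
    return SORTED_DEDUP.sub(r"\1", text)


# A run of duplicates collapses to one line.
collapsed = collapse_adjacent("a\r\na\r\na\r\nb\r\n")

# Without a trailing line-ending the last item is not recognized as a
# duplicate, which is the reason for the IMPORTANT note above.
no_final_eol = collapse_adjacent("a\r\na")

# The same pattern without the anchor, applied to two DIFFERENT lines.
unanchored = re.sub(r"([^\r\n]*\r?\n)\1+", r"\1", "ab\r\nb\r\n")

print(repr(collapsed), repr(no_final_eol), repr(unanchored))
```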



  • After trying both the macro and the TextFX solution for a long time,
    I’ve found that UltraEdit’s sorting still works much better than either of them.
    I first rejected the macro in favor of TextFX, but I found out lately that TextFX doesn’t do a good job either; at least, I prefer the sorting done by UltraEdit, sorry.
    I don’t have an example at the moment; I’ll post again when I do.



  • @patrickdrd

    sorry, did not read the whole thread but what about a python one liner?

    editor.setText('\r\n'.join(list(set(editor.getText().splitlines()))))
    

    Cheers
    Claudia



  • @patrickdrd

    forgot sorting

    editor.setText('\r\n'.join(sorted(list(set(editor.getText().splitlines())))))
    

    Cheers
    Claudia
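
For reference, the same logic as plain functions that run outside Notepad++ (the one-liners above rely on the PythonScript plugin's `editor` object). The function names are made up for illustration. Because `splitlines()` drops the line-endings before comparing, the missing-final-line-ending problem discussed earlier does not arise here. The first variant is an addition of mine: `set()` does not keep the original order, so `dict.fromkeys` is used when removing duplicates without sorting.

```python
def dedupe_keep_order(text, eol="\r\n"):
    """Remove duplicate lines, keeping the first occurrence of each."""
    lines = text.splitlines()
    unique = list(dict.fromkeys(lines))  # dicts preserve insertion order
    return eol.join(unique)


def dedupe_sorted(text, eol="\r\n"):
    """Remove duplicate lines and return the remainder sorted."""
    return eol.join(sorted(set(text.splitlines())))


sample = "b\r\na\r\nb\r\na"
print(repr(dedupe_keep_order(sample)))
print(repr(dedupe_sorted(sample)))
```

Both functions compare lines exactly, so lines differing only in case or trailing spaces count as distinct.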



  • thanks, I’m keeping that too and I’ll let you know



  • OK, I tested it, and somehow this doesn’t work either; at least it doesn’t behave like UE’s function. For example:

    I tested with easylist from: https://easylist.to/easylist/easylist.txt (69883 lines),
    ue’s function cuts it to: 69238

    Notepad++ results:
    Python script: 69818
    macro (Scott’s) is a bit slow and results in 19610 lines (!)
    TextFX results in 69818 as well

    guy038’s regular expression results in 28109 lines.
    From my experience, I bet that UE’s result is the correct one,
    at least to my taste



  • @patrickdrd

    could you, by any chance, upload the list as shortened by UE,
    so we can see the differences?

    Cheers
    Claudia
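
Once two result files are available, one way to see the differences is a small comparison script. This is only a sketch with hypothetical function names. The second helper looks for lines that differ only in case or surrounding whitespace; whether either of those explains the differing counts is an assumption to be tested, not something established in this thread.

```python
def compare_line_sets(result_a, result_b):
    """Return (lines only in a, lines only in b), each sorted."""
    a = set(result_a.splitlines())
    b = set(result_b.splitlines())
    return sorted(a - b), sorted(b - a)


def case_or_space_variants(text):
    """Group lines that differ only in case or surrounding whitespace."""
    groups = {}
    for line in text.splitlines():
        groups.setdefault(line.strip().lower(), set()).add(line)
    # Keep only keys that have more than one distinct spelling.
    return {key: sorted(v) for key, v in groups.items() if len(v) > 1}


only_a, only_b = compare_line_sets("a\nb\nc", "b\nc\nd")
variants = case_or_space_variants("Foo\nfoo \nbar")
print(only_a, only_b, variants)
```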



  • yes, of course, please tell me where, pastebin doesn’t work, it’s blocked here (at work),
    any other suggestions?



  • Actually, pastebin is my first choice as well, and I haven’t used others for quite some time now.

    I’ve heard that

    https://www.zippyshare.com/
    https://www.sendspace.com/

    should be good and anonymous, but I haven’t tried them so far.

    Cheers
    Claudia



  • Yeah, I used to be a fan of regular expression replacement for this, but with “larger” datasets it always seems to need so much tweaking and experimentation to get right (for a particular dataset) that it is hardly worth it, unless you like playing with regular expressions all day instead of solving a particular problem and moving on quickly.

    A Pythonscript solution such as @Claudia-Frank 's seems fine…



  • OK, but why are the results inconsistent (with large datasets)?

    @Claudia-Frank, unfortunately I’m not able to access zippyshare either,
    so I’ll upload to pastebin from home if we don’t find another solution



  • @patrickdrd

    I installed the ue trial version but can’t find the menu item to delete the duplicates.
    Is there anything I need to install in addition or am I blind and don’t see the obvious?

    Cheers
    Claudia

