Delete all duplicate words in the text



  • Hello

    Could you please tell me how to use regular expressions to remove all duplicate tags from the text?

    Initial text:

    FileName,Keywords
    filename1.eps,tag1;tag2;tag3
    filename2.eps,tag4;tag1;tag5
    filename3.eps,tag6;tag2;tag9
    filename3.eps,tag7;tag2;tag3;tag8

    The expected result:

    filename1.eps,tag1;tag2;tag3
    filename2.eps,tag4;tag5
    filename3.eps,tag6;tag9
    filename3.eps,tag7;tag8



  • Hello, @сергій-бородін and All,

    Assuming this initial text :

    filename1.eps,tag1;tag2;tag3
    filename2.eps,tag4;tag1;tag5
    filename3.eps,tag6;tag2;tag9
    filename3.eps,tag7;tag2;tag3;tag8
    filename4.eps,tag1;tag9;tag5
    filename5.eps,tag4;tag6;tag10;tag12
    filename5.eps,tag8;tag2;tag1;tag6;tag11
    filename6.eps,tag3;tag2;tag3;tag10;tag14
    filename7.eps,tag5;tag7;tag15
    filename8.eps,tag4;tag5;tag15;tag16
    filename8.eps,tag3;tag14;tag9;tag7
    filename8.eps,tag7;tag2;tag3;tag8
    filename9.eps,tag2;tag10;tag17
    filename10.eps,tag5;tag1;tag13
    filename10.eps,tag7;tag6;tag9;tag10
    filename11.eps,tag7;tag2;tag3;tag8;tag18
    filename11.eps,tag10;tag12;tag13;tag20
    filename12.eps,tag4;tag8;tag3;tag19
    filename13.eps,tag6;tag15;tag9;tag11
    filename14.eps,tag7;tag2;tag3;tag17;tag12;tag4
    filename15.eps,tag0;tag9,tag20
    

    If I follow your algorithm, I suppose that you expect the text below :

    filename1.eps,tag1;tag2;tag3
    filename2.eps,tag4;tag5
    filename3.eps,tag6;tag9
    filename3.eps,tag7;tag8
    filename4.eps
    filename5.eps,tag10;tag12
    filename5.eps,tag11
    filename6.eps,tag14
    filename7.eps,tag15
    filename8.eps,tag16
    filename8.eps
    filename8.eps
    filename9.eps,tag17
    filename10.eps,tag13
    filename10.eps
    filename11.eps,tag18
    filename11.eps,tag20
    filename12.eps,tag19
    filename13.eps
    filename14.eps
    filename15.eps,tag0
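
    For readers who would rather script it, the expected text above amounts to keeping only the first occurrence of each tag across the whole list. Here is a minimal Python sketch of that rule, independent of the Notepad++ road map. The function name keep_first_occurrences is hypothetical, and the sketch assumes that tags may be separated by either a semicolon or a comma, as in the last line of the sample :

```python
import re

def keep_first_occurrences(lines):
    """Drop every tag that already appeared earlier in the list."""
    seen = set()
    result = []
    for line in lines:
        # The file name is everything before the first comma
        name, _, tag_part = line.partition(",")
        fresh = []
        for tag in re.split(r"[;,]", tag_part):
            if tag and tag not in seen:
                seen.add(tag)
                fresh.append(tag)
        # A line whose tags were all seen before keeps only its file name
        result.append(name + "," + ";".join(fresh) if fresh else name)
    return result

sample = [
    "filename1.eps,tag1;tag2;tag3",
    "filename2.eps,tag4;tag1;tag5",
    "filename3.eps,tag6;tag2;tag9",
    "filename3.eps,tag7;tag2;tag3;tag8",
]
print("\n".join(keep_first_occurrences(sample)))
```

    Run on the four lines of the original question, this should print the four expected lines given there.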
    

    If so, here is a road map to achieve this task ! Let’s go :


    • Move the caret to the very beginning of the first line of your list

    • Open the Column Editor ( Alt + C )

      • Select the option Number to Insert

      • Type in the value 1 in all the numeric fields

      • Tick the Leading zeros option

      • Click on the OK button

    You should get :

    01filename1.eps,tag1;tag2;tag3
    02filename2.eps,tag4;tag1;tag5
    03filename3.eps,tag6;tag2;tag9
    04filename3.eps,tag7;tag2;tag3;tag8
    05filename4.eps,tag1;tag9;tag5
    06filename5.eps,tag4;tag6;tag10;tag12
    07filename5.eps,tag8;tag2;tag1;tag6;tag11
    08filename6.eps,tag3;tag2;tag3;tag10;tag14
    09filename7.eps,tag5;tag7;tag15
    10filename8.eps,tag4;tag5;tag15;tag16
    11filename8.eps,tag3;tag14;tag9;tag7
    12filename8.eps,tag7;tag2;tag3;tag8
    13filename9.eps,tag2;tag10;tag17
    14filename10.eps,tag5;tag1;tag13
    15filename10.eps,tag7;tag6;tag9;tag10
    16filename11.eps,tag7;tag2;tag3;tag8;tag18
    17filename11.eps,tag10;tag12;tag13;tag20
    18filename12.eps,tag4;tag8;tag3;tag19
    19filename13.eps,tag6;tag15;tag9;tag11
    20filename14.eps,tag7;tag2;tag3;tag17;tag12;tag4
    21filename15.eps,tag0;tag9,tag20
    
    • Run the menu option Edit > Line Operations > Sort Lines Lexicographically Descending ( Not ascending ! )

    So :

    21filename15.eps,tag0;tag9,tag20
    20filename14.eps,tag7;tag2;tag3;tag17;tag12;tag4
    19filename13.eps,tag6;tag15;tag9;tag11
    18filename12.eps,tag4;tag8;tag3;tag19
    17filename11.eps,tag10;tag12;tag13;tag20
    16filename11.eps,tag7;tag2;tag3;tag8;tag18
    15filename10.eps,tag7;tag6;tag9;tag10
    14filename10.eps,tag5;tag1;tag13
    13filename9.eps,tag2;tag10;tag17
    12filename8.eps,tag7;tag2;tag3;tag8
    11filename8.eps,tag3;tag14;tag9;tag7
    10filename8.eps,tag4;tag5;tag15;tag16
    09filename7.eps,tag5;tag7;tag15
    08filename6.eps,tag3;tag2;tag3;tag10;tag14
    07filename5.eps,tag8;tag2;tag1;tag6;tag11
    06filename5.eps,tag4;tag6;tag10;tag12
    05filename4.eps,tag1;tag9;tag5
    04filename3.eps,tag7;tag2;tag3;tag8
    03filename3.eps,tag6;tag2;tag9
    02filename2.eps,tag4;tag1;tag5
    01filename1.eps,tag1;tag2;tag3
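
    The leading zeros matter because the sort is lexicographic : without them, a line numbered 9 would sort above a line numbered 10. A small Python sketch of the numbering and of the descending sort, on 11 hypothetical file names. It assumes that Python's plain string comparison behaves like the Notepad++ lexicographic sort for such ASCII lines, and number_lines is a made-up helper name :

```python
def number_lines(lines):
    """Prefix each line with its 1-based rank, zero-padded to a common width."""
    width = len(str(len(lines)))
    return [f"{rank:0{width}d}{line}" for rank, line in enumerate(lines, start=1)]

names = [f"filename{n}.eps" for n in range(1, 12)]  # 11 hypothetical lines

numbered = number_lines(names)
reversed_list = sorted(numbered, reverse=True)  # plain string comparison
print(reversed_list[0], reversed_list[-1])

# Without the padding, "9..." sorts above "10..." and "11...",
# so the descending sort no longer gives the exact reverse order
unpadded = [f"{rank}{line}" for rank, line in enumerate(names, start=1)]
print(sorted(unpadded, reverse=True)[0])
```

    With the padding, the descending sort is exactly the original list in reverse order, which is what the next steps rely on.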
    

    With this simple regex S/R, we turn the whole list into a single line :

    • Open the Replace dialog ( Ctrl + H )

      • SEARCH \R

      • REPLACE #    ( any symbol not already used in the text can be chosen )

      • Select the Regular expression search mode

      • Click on the Replace All button

    We obtain the single line, below :

    21filename15.eps,tag0;tag9,tag20#20filename14.eps,tag7;tag2;tag3;tag17;tag12;tag4#19filename13.eps,tag6;tag15;tag9;tag11#18filename12.eps,tag4;tag8;tag3;tag19#17filename11.eps,tag10;tag12;tag13;tag20#16filename11.eps,tag7;tag2;tag3;tag8;tag18#15filename10.eps,tag7;tag6;tag9;tag10#14filename10.eps,tag5;tag1;tag13#13filename9.eps,tag2;tag10;tag17#12filename8.eps,tag7;tag2;tag3;tag8#11filename8.eps,tag3;tag14;tag9;tag7#10filename8.eps,tag4;tag5;tag15;tag16#09filename7.eps,tag5;tag7;tag15#08filename6.eps,tag3;tag2;tag3;tag10;tag14#07filename5.eps,tag8;tag2;tag1;tag6;tag11#06filename5.eps,tag4;tag6;tag10;tag12#05filename4.eps,tag1;tag9;tag5#04filename3.eps,tag7;tag2;tag3;tag8#03filename3.eps,tag6;tag2;tag9#02filename2.eps,tag4;tag1;tag5#01filename1.eps,tag1;tag2;tag3
    
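    • Now, here is the regex S/R which deletes any tag that occurs again further on in the line. As the lines are in reverse order, only the first occurrence of each tag, in the original order, is kept :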

      • SEARCH (?-is)[,;](\w+)(?=[,;#].*?[,;]\1([,;#]|\R|\z))

      • REPLACE Leave the zone EMPTY

    Your text is shortened as below :

    21filename15.eps,tag0#20filename14.eps#19filename13.eps#18filename12.eps;tag19#17filename11.eps;tag20#16filename11.eps;tag18#15filename10.eps#14filename10.eps;tag13#13filename9.eps;tag17#12filename8.eps#11filename8.eps#10filename8.eps;tag16#09filename7.eps;tag15#08filename6.eps;tag14#07filename5.eps;tag11#06filename5.eps;tag10;tag12#05filename4.eps#04filename3.eps,tag7;tag8#03filename3.eps,tag6;tag9#02filename2.eps,tag4;tag5#01filename1.eps,tag1;tag2;tag3
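
    To see how the look-ahead works outside Notepad++, here is a hedged Python sketch of the same search. The pattern is adapted, not identical : the (?-is) prefix is dropped, since case-sensitive matching and a dot that does not match line-breaks are already Python's defaults, and the ( \R | \z ) alternatives are replaced with $, which is enough for a single line. The input is only the tail of the single line above ( lines 06 to 01 ) :

```python
import re

# Adapted pattern: a separator and a tag, followed somewhere later
# by the same whole tag between separators (or at the very end)
DUPLICATE_TAG = re.compile(r"[,;](\w+)(?=[,;#].*?[,;]\1(?:[,;#]|$))")

# Lines 06 to 01 of the reversed single line
line = (
    "06filename5.eps,tag4;tag6;tag10;tag12#05filename4.eps,tag1;tag9;tag5"
    "#04filename3.eps,tag7;tag2;tag3;tag8#03filename3.eps,tag6;tag2;tag9"
    "#02filename2.eps,tag4;tag1;tag5#01filename1.eps,tag1;tag2;tag3"
)

shortened = DUPLICATE_TAG.sub("", line)
print(shortened)
```

    Because the look-ahead only searches forward, the result for this tail is the same as the tail of the shortened line. The separator required after \1 is what prevents tag1 from being mistaken for the start of tag10.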
    
    • Then, we use this other regex S/R to change this single line back into a multi-line list :

      • SEARCH #

      • REPLACE \r\n    ( or \n if your file is a Unix file )

    Giving :

    21filename15.eps,tag0
    20filename14.eps
    19filename13.eps
    18filename12.eps;tag19
    17filename11.eps;tag20
    16filename11.eps;tag18
    15filename10.eps
    14filename10.eps;tag13
    13filename9.eps;tag17
    12filename8.eps
    11filename8.eps
    10filename8.eps;tag16
    09filename7.eps;tag15
    08filename6.eps;tag14
    07filename5.eps;tag11
    06filename5.eps;tag10;tag12
    05filename4.eps
    04filename3.eps,tag7;tag8
    03filename3.eps,tag6;tag9
    02filename2.eps,tag4;tag5
    01filename1.eps,tag1;tag2;tag3
    
    • Run the menu option Edit > Line Operations > Sort Lines Lexicographically Ascending

    Giving :

    01filename1.eps,tag1;tag2;tag3
    02filename2.eps,tag4;tag5
    03filename3.eps,tag6;tag9
    04filename3.eps,tag7;tag8
    05filename4.eps
    06filename5.eps;tag10;tag12
    07filename5.eps;tag11
    08filename6.eps;tag14
    09filename7.eps;tag15
    10filename8.eps;tag16
    11filename8.eps
    12filename8.eps
    13filename9.eps;tag17
    14filename10.eps;tag13
    15filename10.eps
    16filename11.eps;tag18
    17filename11.eps;tag20
    18filename12.eps;tag19
    19filename13.eps
    20filename14.eps
    21filename15.eps,tag0
    
    • Finally, the last regex S/R, below :

      • Gets rid of the numbering at the beginning of each line

      • Replaces any semicolon located right after the string eps with a comma

    So :

      • SEARCH ^\d+|(?<=eps)(;)

      • REPLACE ?1,

    And, here is your final expected text ;-))

    filename1.eps,tag1;tag2;tag3
    filename2.eps,tag4;tag5
    filename3.eps,tag6;tag9
    filename3.eps,tag7;tag8
    filename4.eps
    filename5.eps,tag10;tag12
    filename5.eps,tag11
    filename6.eps,tag14
    filename7.eps,tag15
    filename8.eps,tag16
    filename8.eps
    filename8.eps
    filename9.eps,tag17
    filename10.eps,tag13
    filename10.eps
    filename11.eps,tag18
    filename11.eps,tag20
    filename12.eps,tag19
    filename13.eps
    filename14.eps
    filename15.eps,tag0
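
    The ?1, replacement used in the last step is a conditional of the Notepad++ regex engine : a comma is written only when group 1 ( the semicolon ) took part in the match. Python's re module has no such replacement syntax, so this sketch, under that assumption, uses a replacement function instead. The name strip_numbering is hypothetical and the three input lines are taken from the numbered list above :

```python
import re

def strip_numbering(text):
    """Remove leading digits; turn a ';' right after 'eps' into ','."""
    return re.sub(
        r"^\d+|(?<=eps)(;)",
        # Group 1 only takes part in the match for the semicolon branch
        lambda m: "," if m.group(1) else "",
        text,
        flags=re.MULTILINE,
    )

numbered = "04filename3.eps,tag7;tag8\n05filename4.eps\n06filename5.eps;tag10;tag12"
print(strip_numbering(numbered))
```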
    

    Best Regards,

    guy038



  • Hi, @сергій-бородін and All,

    Thinking about your problem again, here is a second method, which requires fewer steps but arranges the non-duplicated tags in a different layout : as the lines are not reversed, it is the last occurrence of each tag which is kept, instead of the first one !

    So, assuming the same initial text, below :

    filename1.eps,tag1;tag2;tag3
    filename2.eps,tag4;tag1;tag5
    filename3.eps,tag6;tag2;tag9
    filename3.eps,tag7;tag2;tag3;tag8
    filename4.eps,tag1;tag9;tag5
    filename5.eps,tag4;tag6;tag10;tag12
    filename5.eps,tag8;tag2;tag1;tag6;tag11
    filename6.eps,tag3;tag2;tag3;tag10;tag14
    filename7.eps,tag5;tag7;tag15
    filename8.eps,tag4;tag5;tag15;tag16
    filename8.eps,tag3;tag14;tag9;tag7
    filename8.eps,tag7;tag2;tag3;tag8
    filename9.eps,tag2;tag10;tag17
    filename10.eps,tag5;tag1;tag13
    filename10.eps,tag7;tag6;tag9;tag10
    filename11.eps,tag7;tag2;tag3;tag8;tag18
    filename11.eps,tag10;tag12;tag13;tag20
    filename12.eps,tag4;tag8;tag3;tag19
    filename13.eps,tag6;tag15;tag9;tag11
    filename14.eps,tag7;tag2;tag3;tag17;tag12;tag4
    filename15.eps,tag0;tag9,tag20
    

    First, this simple regex S/R turns the whole list into a single line :

    • Open the Replace dialog ( Ctrl + H )

      • SEARCH \R

      • REPLACE # ( any symbol not already used in the text can be chosen )

      • Select the Regular expression search mode

      • Click on the Replace All button

    Which gives the single line, below :

    filename1.eps,tag1;tag2;tag3#filename2.eps,tag4;tag1;tag5#filename3.eps,tag6;tag2;tag9#filename3.eps,tag7;tag2;tag3;tag8#filename4.eps,tag1;tag9;tag5#filename5.eps,tag4;tag6;tag10;tag12#filename5.eps,tag8;tag2;tag1;tag6;tag11#filename6.eps,tag3;tag2;tag3;tag10;tag14#filename7.eps,tag5;tag7;tag15#filename8.eps,tag4;tag5;tag15;tag16#filename8.eps,tag3;tag14;tag9;tag7#filename8.eps,tag7;tag2;tag3;tag8#filename9.eps,tag2;tag10;tag17#filename10.eps,tag5;tag1;tag13#filename10.eps,tag7;tag6;tag9;tag10#filename11.eps,tag7;tag2;tag3;tag8;tag18#filename11.eps,tag10;tag12;tag13;tag20#filename12.eps,tag4;tag8;tag3;tag19#filename13.eps,tag6;tag15;tag9;tag11#filename14.eps,tag7;tag2;tag3;tag17;tag12;tag4#filename15.eps,tag0;tag9,tag20
    
    • Now, here is the regex S/R which deletes any tag that occurs again further on ( the same regex as in my previous post ) :

      • SEARCH (?-is)[,;](\w+)(?=[,;#].*?[,;]\1([,;#]|\R|\z))

      • REPLACE Leave the zone EMPTY

    Your text should be shortened as below :

    filename1.eps#filename2.eps#filename3.eps#filename3.eps#filename4.eps#filename5.eps#filename5.eps#filename6.eps#filename7.eps#filename8.eps;tag16#filename8.eps;tag14#filename8.eps#filename9.eps#filename10.eps,tag5;tag1#filename10.eps#filename11.eps;tag18#filename11.eps,tag10;tag13#filename12.eps;tag8;tag19#filename13.eps,tag6;tag15;tag11#filename14.eps,tag7;tag2;tag3;tag17;tag12;tag4#filename15.eps,tag0;tag9,tag20
    
    • Finally, this regex S/R, below :

      • Replaces any semi-colon, right after the string eps with a comma

      • Replaces any # symbol with a line-break ( \r\n or \n )

    SEARCH eps;|(#)

    REPLACE ?1\r\n:eps,    OR    ?1\n:eps, if you work with a Unix file

    And we obtain the final output :

    filename1.eps
    filename2.eps
    filename3.eps
    filename3.eps
    filename4.eps
    filename5.eps
    filename5.eps
    filename6.eps
    filename7.eps
    filename8.eps,tag16
    filename8.eps,tag14
    filename8.eps
    filename9.eps
    filename10.eps,tag5;tag1
    filename10.eps
    filename11.eps,tag18
    filename11.eps,tag10;tag13
    filename12.eps,tag8;tag19
    filename13.eps,tag6;tag15;tag11
    filename14.eps,tag7;tag2;tag3;tag17;tag12;tag4
    filename15.eps,tag0;tag9,tag20
    

    As you can see, the 21 non-duplicated tags ( from tag0 to tag20 ) are arranged differently, with many lines without any tag at the beginning of the list !
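
    For comparison, the three steps of this second method can be sketched in Python as one function. This is only an illustration under assumptions : the pattern is adapted to Python ( no (?-is) prefix, $ instead of \R and \z ), the conditional replacement is emulated with a function, and keep_last_occurrences is a made-up name. The input is the last four lines of the initial text :

```python
import re

DUPLICATE_TAG = re.compile(r"[,;](\w+)(?=[,;#].*?[,;]\1(?:[,;#]|$))")

def keep_last_occurrences(lines):
    """Join the lines, delete tags that occur again later, split back."""
    if any("#" in line for line in lines):
        raise ValueError("'#' is already used in the text, pick another symbol")
    joined = "#".join(lines)
    shortened = DUPLICATE_TAG.sub("", joined)
    # 'eps;' becomes 'eps,' and each '#' becomes a line-break
    restored = re.sub(
        r"eps;|(#)",
        lambda m: "\n" if m.group(1) else "eps,",
        shortened,
    )
    return restored.split("\n")

tail = [
    "filename12.eps,tag4;tag8;tag3;tag19",
    "filename13.eps,tag6;tag15;tag9;tag11",
    "filename14.eps,tag7;tag2;tag3;tag17;tag12;tag4",
    "filename15.eps,tag0;tag9,tag20",
]
print("\n".join(keep_last_occurrences(tail)))
```

    As the look-ahead only searches forward, these four lines should come out exactly like the last four lines of the final output above.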

    Best Regards,

    guy038

