    Delete all duplicate words in the text

    • Сергій Бородін

      Hello

      Please tell me how to use regular expressions to remove all duplicates in the text?

      Initial text:

      FileName,Keywords
      filename1.eps,tag1;tag2;tag3
      filename2.eps,tag4;tag1;tag5
      filename3.eps,tag6;tag2;tag9
      filename3.eps,tag7;tag2;tag3;tag8

      It should turn out:

      filename1.eps,tag1;tag2;tag3
      filename2.eps,tag4;tag5
      filename3.eps,tag6;tag9
      filename3.eps,tag7;tag8

      • guy038

        Hello, @сергій-бородін and All,

        Assuming this initial text :

        filename1.eps,tag1;tag2;tag3
        filename2.eps,tag4;tag1;tag5
        filename3.eps,tag6;tag2;tag9
        filename3.eps,tag7;tag2;tag3;tag8
        filename4.eps,tag1;tag9;tag5
        filename5.eps,tag4;tag6;tag10;tag12
        filename5.eps,tag8;tag2;tag1;tag6;tag11
        filename6.eps,tag3;tag2;tag3;tag10;tag14
        filename7.eps,tag5;tag7;tag15
        filename8.eps,tag4;tag5;tag15;tag16
        filename8.eps,tag3;tag14;tag9;tag7
        filename8.eps,tag7;tag2;tag3;tag8
        filename9.eps,tag2;tag10;tag17
        filename10.eps,tag5;tag1;tag13
        filename10.eps,tag7;tag6;tag9;tag10
        filename11.eps,tag7;tag2;tag3;tag8;tag18
        filename11.eps,tag10;tag12;tag13;tag20
        filename12.eps,tag4;tag8;tag3;tag19
        filename13.eps,tag6;tag15;tag9;tag11
        filename14.eps,tag7;tag2;tag3;tag17;tag12;tag4
        filename15.eps,tag0;tag9,tag20
        

        If I follow your algorithm, I suppose that you expect the text below :

        filename1.eps,tag1;tag2;tag3
        filename2.eps,tag4;tag5
        filename3.eps,tag6;tag9
        filename3.eps,tag7;tag8
        filename4.eps
        filename5.eps,tag10;tag12
        filename5.eps,tag11
        filename6.eps,tag14
        filename7.eps,tag15
        filename8.eps,tag16
        filename8.eps
        filename8.eps
        filename9.eps,tag17
        filename10.eps,tag13
        filename10.eps
        filename11.eps,tag18
        filename11.eps,tag20
        filename12.eps,tag19
        filename13.eps
        filename14.eps
        filename15.eps,tag0
        

        If so, here is a road map to achieve this task. Let’s go :


        • Move the caret to the beginning of the first line of your list

        • Open the Column Editor ( Alt + C )

          • Select the option Number to Insert

          • Type the value 1 in all the numeric fields

          • Tick the Leading zeros option

          • Click on the OK button

        You should get :

        01filename1.eps,tag1;tag2;tag3
        02filename2.eps,tag4;tag1;tag5
        03filename3.eps,tag6;tag2;tag9
        04filename3.eps,tag7;tag2;tag3;tag8
        05filename4.eps,tag1;tag9;tag5
        06filename5.eps,tag4;tag6;tag10;tag12
        07filename5.eps,tag8;tag2;tag1;tag6;tag11
        08filename6.eps,tag3;tag2;tag3;tag10;tag14
        09filename7.eps,tag5;tag7;tag15
        10filename8.eps,tag4;tag5;tag15;tag16
        11filename8.eps,tag3;tag14;tag9;tag7
        12filename8.eps,tag7;tag2;tag3;tag8
        13filename9.eps,tag2;tag10;tag17
        14filename10.eps,tag5;tag1;tag13
        15filename10.eps,tag7;tag6;tag9;tag10
        16filename11.eps,tag7;tag2;tag3;tag8;tag18
        17filename11.eps,tag10;tag12;tag13;tag20
        18filename12.eps,tag4;tag8;tag3;tag19
        19filename13.eps,tag6;tag15;tag9;tag11
        20filename14.eps,tag7;tag2;tag3;tag17;tag12;tag4
        21filename15.eps,tag0;tag9,tag20
        
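        • Run the menu option Edit > Line Operations > Sort Lines Lexicographically Descending ( not Ascending ! ), so that the first occurrence of each tag, in the original order, becomes its last occurrence in the text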

        So :

        21filename15.eps,tag0;tag9,tag20
        20filename14.eps,tag7;tag2;tag3;tag17;tag12;tag4
        19filename13.eps,tag6;tag15;tag9;tag11
        18filename12.eps,tag4;tag8;tag3;tag19
        17filename11.eps,tag10;tag12;tag13;tag20
        16filename11.eps,tag7;tag2;tag3;tag8;tag18
        15filename10.eps,tag7;tag6;tag9;tag10
        14filename10.eps,tag5;tag1;tag13
        13filename9.eps,tag2;tag10;tag17
        12filename8.eps,tag7;tag2;tag3;tag8
        11filename8.eps,tag3;tag14;tag9;tag7
        10filename8.eps,tag4;tag5;tag15;tag16
        09filename7.eps,tag5;tag7;tag15
        08filename6.eps,tag3;tag2;tag3;tag10;tag14
        07filename5.eps,tag8;tag2;tag1;tag6;tag11
        06filename5.eps,tag4;tag6;tag10;tag12
        05filename4.eps,tag1;tag9;tag5
        04filename3.eps,tag7;tag2;tag3;tag8
        03filename3.eps,tag6;tag2;tag9
        02filename2.eps,tag4;tag1;tag5
        01filename1.eps,tag1;tag2;tag3
        

        With this simple regex S/R, we change the whole list into a single line :

        • Open the Replace dialog ( Ctrl + H )

          • SEARCH \R

          • REPLACE #    ( any symbol that does not occur in the text can be chosen )

          • Select the Regular expression search mode

          • Click on the Replace All button

        We obtain the single line, below :

        21filename15.eps,tag0;tag9,tag20#20filename14.eps,tag7;tag2;tag3;tag17;tag12;tag4#19filename13.eps,tag6;tag15;tag9;tag11#18filename12.eps,tag4;tag8;tag3;tag19#17filename11.eps,tag10;tag12;tag13;tag20#16filename11.eps,tag7;tag2;tag3;tag8;tag18#15filename10.eps,tag7;tag6;tag9;tag10#14filename10.eps,tag5;tag1;tag13#13filename9.eps,tag2;tag10;tag17#12filename8.eps,tag7;tag2;tag3;tag8#11filename8.eps,tag3;tag14;tag9;tag7#10filename8.eps,tag4;tag5;tag15;tag16#09filename7.eps,tag5;tag7;tag15#08filename6.eps,tag3;tag2;tag3;tag10;tag14#07filename5.eps,tag8;tag2;tag1;tag6;tag11#06filename5.eps,tag4;tag6;tag10;tag12#05filename4.eps,tag1;tag9;tag5#04filename3.eps,tag7;tag2;tag3;tag8#03filename3.eps,tag6;tag2;tag9#02filename2.eps,tag4;tag1;tag5#01filename1.eps,tag1;tag2;tag3
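For readers who prefer a script to the editor menus, the three preparation steps above (numbering, descending sort, joining with #) can be sketched in Python. This is only an illustration, not part of the Notepad++ procedure: the variable names are hypothetical and the sample is limited to the four data lines of the original question.

```python
# Hypothetical sample: the four data lines from the original question.
lines = [
    "filename1.eps,tag1;tag2;tag3",
    "filename2.eps,tag4;tag1;tag5",
    "filename3.eps,tag6;tag2;tag9",
    "filename3.eps,tag7;tag2;tag3;tag8",
]

# Zero-padded numeric prefix, similar to the Column Editor with "Leading zeros".
width = len(str(len(lines)))
numbered = [f"{i:0{width}d}{line}" for i, line in enumerate(lines, start=1)]

# Descending lexicographic sort, then join with a marker unused in the data.
numbered.sort(reverse=True)
single_line = "#".join(numbered)
print(single_line)
```

Because every prefix has the same width, the lexicographic sort behaves like a numeric sort on the line numbers.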
        
        • Now, here is the regex S/R, which deletes any tag that occurs again, further on, in the text :

          • SEARCH (?-is)[,;](\w+)(?=[,;#].*?[,;]\1([,;#]|\R|\z))

          • REPLACE Leave the zone EMPTY

        Your text is shortened as below :

        21filename15.eps,tag0#20filename14.eps#19filename13.eps#18filename12.eps;tag19#17filename11.eps;tag20#16filename11.eps;tag18#15filename10.eps#14filename10.eps;tag13#13filename9.eps;tag17#12filename8.eps#11filename8.eps#10filename8.eps;tag16#09filename7.eps;tag15#08filename6.eps;tag14#07filename5.eps;tag11#06filename5.eps;tag10;tag12#05filename4.eps#04filename3.eps,tag7;tag8#03filename3.eps,tag6;tag9#02filename2.eps,tag4;tag5#01filename1.eps,tag1;tag2;tag3
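To experiment with this search regex outside Notepad++, here is a minimal Python sketch. It is an adaptation under assumptions, not the exact Notepad++ expression: Python's `re` module does not support `\R`, so the alternatives `\R|\z` are replaced with `\Z` (end of string), which is enough for a single-line text. The sample string is the reversed, numbered form of the four lines of the original question.

```python
import re

# Reversed and numbered sample, joined with '#' (four lines of the question).
single_line = (
    "4filename3.eps,tag7;tag2;tag3;tag8#3filename3.eps,tag6;tag2;tag9"
    "#2filename2.eps,tag4;tag1;tag5#1filename1.eps,tag1;tag2;tag3"
)

# A delimiter plus a tag, provided the same whole tag appears again later.
# The trailing [,;#] or \Z prevents tag1 from matching inside tag10.
DUPLICATE_TAG = re.compile(r"[,;](\w+)(?=[,;#].*?[,;]\1(?:[,;#]|\Z))")

deduped = DUPLICATE_TAG.sub("", single_line)
print(deduped)
```

Each tag is deleted together with the delimiter in front of it, which is why some lines are left with a semi-colon right after eps and need the final clean-up step.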
        
        • Then, we use this other regex S/R to change this single line back into a multi-line list :

          • SEARCH #

          • REPLACE \r\n    ( or \n if your file is a Unix file )

        Giving :

        21filename15.eps,tag0
        20filename14.eps
        19filename13.eps
        18filename12.eps;tag19
        17filename11.eps;tag20
        16filename11.eps;tag18
        15filename10.eps
        14filename10.eps;tag13
        13filename9.eps;tag17
        12filename8.eps
        11filename8.eps
        10filename8.eps;tag16
        09filename7.eps;tag15
        08filename6.eps;tag14
        07filename5.eps;tag11
        06filename5.eps;tag10;tag12
        05filename4.eps
        04filename3.eps,tag7;tag8
        03filename3.eps,tag6;tag9
        02filename2.eps,tag4;tag5
        01filename1.eps,tag1;tag2;tag3
        
        • Run the menu option Edit > Line Operations > Sort Lines Lexicographically Ascending, which restores the original order :

        01filename1.eps,tag1;tag2;tag3
        02filename2.eps,tag4;tag5
        03filename3.eps,tag6;tag9
        04filename3.eps,tag7;tag8
        05filename4.eps
        06filename5.eps;tag10;tag12
        07filename5.eps;tag11
        08filename6.eps;tag14
        09filename7.eps;tag15
        10filename8.eps;tag16
        11filename8.eps
        12filename8.eps
        13filename9.eps;tag17
        14filename10.eps;tag13
        15filename10.eps
        16filename11.eps;tag18
        17filename11.eps;tag20
        18filename12.eps;tag19
        19filename13.eps
        20filename14.eps
        21filename15.eps,tag0
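The two steps above (splitting on # and sorting ascending) amount to very little code. A minimal Python sketch, assuming a hypothetical `deduped` string that holds the shortened single line for the four-line sample of the question:

```python
# Shortened single line for the four-line sample (hypothetical variable name).
deduped = (
    "4filename3.eps,tag7;tag8#3filename3.eps,tag6;tag9"
    "#2filename2.eps,tag4;tag5#1filename1.eps,tag1;tag2;tag3"
)

# Split on the marker, then sort ascending to restore the original order.
restored = sorted(deduped.split("#"))
for line in restored:
    print(line)
```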
        
        • Finally, the last regex S/R, below :

          • will get rid of the numbering, at the beginning of each line

          • will replace any semi-colon, located right after the string eps, with a comma

        So :

          • SEARCH ^\d+|(?<=eps)(;)

          • REPLACE ?1,
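The replacement ?1, is a conditional: a comma is written only when group 1 (the semi-colon) took part in the match, otherwise the match is replaced with nothing. Python's `re` has no such replacement syntax, so a rough equivalent, given here as an assumption-laden sketch with hypothetical names, uses a callback instead:

```python
import re

# A few numbered lines taken from the sorted list above.
numbered = [
    "05filename4.eps",
    "06filename5.eps;tag10;tag12",
    "21filename15.eps,tag0",
]

# Either the leading digits, or a semi-colon right after "eps" (group 1).
CLEANUP = re.compile(r"^\d+|(?<=eps)(;)")

def fix(match):
    # Group 1 is set only when the semi-colon alternative matched.
    return "," if match.group(1) else ""

cleaned = [CLEANUP.sub(fix, line) for line in numbered]
print(cleaned)
```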

        And, here is your final expected text ;-))

        filename1.eps,tag1;tag2;tag3
        filename2.eps,tag4;tag5
        filename3.eps,tag6;tag9
        filename3.eps,tag7;tag8
        filename4.eps
        filename5.eps,tag10;tag12
        filename5.eps,tag11
        filename6.eps,tag14
        filename7.eps,tag15
        filename8.eps,tag16
        filename8.eps
        filename8.eps
        filename9.eps,tag17
        filename10.eps,tag13
        filename10.eps
        filename11.eps,tag18
        filename11.eps,tag20
        filename12.eps,tag19
        filename13.eps
        filename14.eps
        filename15.eps,tag0
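As a cross-check of the expected result, the same "keep only the first occurrence of each tag" rule can be written without any regex trickery. The sketch below swaps in a different technique (a set of tags already seen) and is not the Notepad++ method described above; the function name `drop_repeated_tags` is hypothetical.

```python
import re

def drop_repeated_tags(lines):
    """Keep each tag only where it first appears, reading top to bottom."""
    seen = set()
    result = []
    for line in lines:
        # The file name is everything before the first comma.
        name, _, tag_field = line.partition(",")
        kept = []
        # Tags may be separated by ';' or, occasionally, by ','.
        for tag in re.split(r"[,;]", tag_field):
            if tag and tag not in seen:
                seen.add(tag)
                kept.append(tag)
        result.append(f"{name},{';'.join(kept)}" if kept else name)
    return result

sample = [
    "filename1.eps,tag1;tag2;tag3",
    "filename2.eps,tag4;tag1;tag5",
    "filename3.eps,tag6;tag2;tag9",
    "filename3.eps,tag7;tag2;tag3;tag8",
    "filename4.eps,tag1;tag9;tag5",
]
for line in drop_repeated_tags(sample):
    print(line)
```

One difference from the regex steps: any tags kept on a line are always rejoined with semi-colons here, so a stray comma between two tags would be normalised.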
        

        Best Regards,

        guy038

        • guy038

          Hi, @сергій-бородін and All,

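          Thinking back on your problem, here is a second method, which requires fewer steps. However, as the list is not numbered and reversed first, the regex keeps the last occurrence of each tag, instead of the first one, so the non-duplicated tags end up on different lines !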

          So, assuming the same initial text, below :

          filename1.eps,tag1;tag2;tag3
          filename2.eps,tag4;tag1;tag5
          filename3.eps,tag6;tag2;tag9
          filename3.eps,tag7;tag2;tag3;tag8
          filename4.eps,tag1;tag9;tag5
          filename5.eps,tag4;tag6;tag10;tag12
          filename5.eps,tag8;tag2;tag1;tag6;tag11
          filename6.eps,tag3;tag2;tag3;tag10;tag14
          filename7.eps,tag5;tag7;tag15
          filename8.eps,tag4;tag5;tag15;tag16
          filename8.eps,tag3;tag14;tag9;tag7
          filename8.eps,tag7;tag2;tag3;tag8
          filename9.eps,tag2;tag10;tag17
          filename10.eps,tag5;tag1;tag13
          filename10.eps,tag7;tag6;tag9;tag10
          filename11.eps,tag7;tag2;tag3;tag8;tag18
          filename11.eps,tag10;tag12;tag13;tag20
          filename12.eps,tag4;tag8;tag3;tag19
          filename13.eps,tag6;tag15;tag9;tag11
          filename14.eps,tag7;tag2;tag3;tag17;tag12;tag4
          filename15.eps,tag0;tag9,tag20
          

          First, this simple regex S/R changes the whole list into a single line :

          • Open the Replace dialog ( Ctrl + H )

            • SEARCH \R

            • REPLACE # ( any symbol that does not occur in the text can be chosen )

            • Select the Regular expression search mode

            • Click on the Replace All button

          Which gives the single line, below :

          filename1.eps,tag1;tag2;tag3#filename2.eps,tag4;tag1;tag5#filename3.eps,tag6;tag2;tag9#filename3.eps,tag7;tag2;tag3;tag8#filename4.eps,tag1;tag9;tag5#filename5.eps,tag4;tag6;tag10;tag12#filename5.eps,tag8;tag2;tag1;tag6;tag11#filename6.eps,tag3;tag2;tag3;tag10;tag14#filename7.eps,tag5;tag7;tag15#filename8.eps,tag4;tag5;tag15;tag16#filename8.eps,tag3;tag14;tag9;tag7#filename8.eps,tag7;tag2;tag3;tag8#filename9.eps,tag2;tag10;tag17#filename10.eps,tag5;tag1;tag13#filename10.eps,tag7;tag6;tag9;tag10#filename11.eps,tag7;tag2;tag3;tag8;tag18#filename11.eps,tag10;tag12;tag13;tag20#filename12.eps,tag4;tag8;tag3;tag19#filename13.eps,tag6;tag15;tag9;tag11#filename14.eps,tag7;tag2;tag3;tag17;tag12;tag4#filename15.eps,tag0;tag9,tag20
          
          • Now, here is the regex S/R, which deletes any duplicated tag ( The same regex, described in my previous post ) :

            • SEARCH (?-is)[,;](\w+)(?=[,;#].*?[,;]\1([,;#]|\R|\z))

            • REPLACE Leave the zone EMPTY

          Your text should be shortened as below :

          filename1.eps#filename2.eps#filename3.eps#filename3.eps#filename4.eps#filename5.eps#filename5.eps#filename6.eps#filename7.eps#filename8.eps;tag16#filename8.eps;tag14#filename8.eps#filename9.eps#filename10.eps,tag5;tag1#filename10.eps#filename11.eps;tag18#filename11.eps,tag10;tag13#filename12.eps;tag8;tag19#filename13.eps,tag6;tag15;tag11#filename14.eps,tag7;tag2;tag3;tag17;tag12;tag4#filename15.eps,tag0;tag9,tag20
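To see why this second method keeps the last occurrence of each tag, the same Python adaptation of the regex can be run on the text in its original order. As before, this is a sketch under assumptions: `\Z` stands in for `\R|\z`, and the sample is reduced to the four lines of the original question.

```python
import re

# Original order, no numbering, joined with '#'.
single_line = (
    "filename1.eps,tag1;tag2;tag3#filename2.eps,tag4;tag1;tag5"
    "#filename3.eps,tag6;tag2;tag9#filename3.eps,tag7;tag2;tag3;tag8"
)

# Same pattern as in the first method: delete a tag if it appears again later.
DUPLICATE_TAG = re.compile(r"[,;](\w+)(?=[,;#].*?[,;]\1(?:[,;#]|\Z))")

deduped = DUPLICATE_TAG.sub("", single_line)
print(deduped)
```

The lookahead only looks forward, so the earliest copies of a tag are the ones deleted and the tags pile up towards the end of the list.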
          
          • Finally, this regex S/R, below :

            • Replaces any semi-colon, right after the string eps with a comma

            • Replaces any # symbol with a line-break ( \r\n or \n )

          SEARCH eps;|(#)

          REPLACE ?1\r\n:eps,    OR    ?1\n:eps, if you work with a Unix file

          And we obtain the final output :

          filename1.eps
          filename2.eps
          filename3.eps
          filename3.eps
          filename4.eps
          filename5.eps
          filename5.eps
          filename6.eps
          filename7.eps
          filename8.eps,tag16
          filename8.eps,tag14
          filename8.eps
          filename9.eps
          filename10.eps,tag5;tag1
          filename10.eps
          filename11.eps,tag18
          filename11.eps,tag10;tag13
          filename12.eps,tag8;tag19
          filename13.eps,tag6;tag15;tag11
          filename14.eps,tag7;tag2;tag3;tag17;tag12;tag4
          filename15.eps,tag0;tag9,tag20
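The last S/R of this second method uses an if/else replacement: a line-break when group 1 (the # symbol) matched, otherwise the string eps followed by a comma. A possible Python equivalent, assuming a callback in place of the conditional syntax and a short hypothetical excerpt of the shortened line:

```python
import re

# Short excerpt of the shortened single line.
deduped = "filename8.eps;tag16#filename8.eps;tag14#filename8.eps"

# Group 1 captures the '#' marker; the other alternative is "eps;".
final = re.sub(
    r"eps;|(#)",
    lambda m: "\n" if m.group(1) else "eps,",
    deduped,
)
print(final)
```

Here "\n" is used as the line-break; for a Windows file, "\r\n" would be the counterpart of the first REPLACE variant.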
          

          As you can see, the 21 non-duplicated tags ( from tag0 to tag20 ) are arranged differently, and many lines, at the beginning of the list, are left without any tag !

          Best Regards,

          guy038
