Delete all duplicate words in the text
-
Hello
Please tell me how to use regular expressions to remove all duplicates in the text?
Initial text:
FileName,Keywords
filename1.eps,tag1;tag2;tag3
filename2.eps,tag4;tag1;tag5
filename3.eps,tag6;tag2;tag9
filename3.eps,tag7;tag2;tag3;tag8
It should turn out:
filename1.eps,tag1;tag2;tag3
filename2.eps,tag4;tag5
filename3.eps,tag6;tag9
filename3.eps,tag7;tag8
-
Hello, @сергій-бородін and All,
Assuming this initial text :
filename1.eps,tag1;tag2;tag3
filename2.eps,tag4;tag1;tag5
filename3.eps,tag6;tag2;tag9
filename3.eps,tag7;tag2;tag3;tag8
filename4.eps,tag1;tag9;tag5
filename5.eps,tag4;tag6;tag10;tag12
filename5.eps,tag8;tag2;tag1;tag6;tag11
filename6.eps,tag3;tag2;tag3;tag10;tag14
filename7.eps,tag5;tag7;tag15
filename8.eps,tag4;tag5;tag15;tag16
filename8.eps,tag3;tag14;tag9;tag7
filename8.eps,tag7;tag2;tag3;tag8
filename9.eps,tag2;tag10;tag17
filename10.eps,tag5;tag1;tag13
filename10.eps,tag7;tag6;tag9;tag10
filename11.eps,tag7;tag2;tag3;tag8;tag18
filename11.eps,tag10;tag12;tag13;tag20
filename12.eps,tag4;tag8;tag3;tag19
filename13.eps,tag6;tag15;tag9;tag11
filename14.eps,tag7;tag2;tag3;tag17;tag12;tag4
filename15.eps,tag0;tag9,tag20
If I follow your algorithm, I suppose that you expect the text below :
filename1.eps,tag1;tag2;tag3
filename2.eps,tag4;tag5
filename3.eps,tag6;tag9
filename3.eps,tag7;tag8
filename4.eps
filename5.eps,tag10;tag12
filename5.eps,tag11
filename6.eps,tag14
filename7.eps,tag15
filename8.eps,tag16
filename8.eps
filename8.eps
filename9.eps,tag17
filename10.eps,tag13
filename10.eps
filename11.eps,tag18
filename11.eps,tag20
filename12.eps,tag19
filename13.eps
filename14.eps
filename15.eps,tag0
If so, here is a road map to achieve this task! Let's go:
- Move your caret to the beginning of the first line of your list
- Open the Column Editor (Alt + C)
- Select the option Number to Insert
- Type in the value 1 in all the zones
- Tick the Leading zeros option
- Click on the OK button
- You should get:
01filename1.eps,tag1;tag2;tag3
02filename2.eps,tag4;tag1;tag5
03filename3.eps,tag6;tag2;tag9
04filename3.eps,tag7;tag2;tag3;tag8
05filename4.eps,tag1;tag9;tag5
06filename5.eps,tag4;tag6;tag10;tag12
07filename5.eps,tag8;tag2;tag1;tag6;tag11
08filename6.eps,tag3;tag2;tag3;tag10;tag14
09filename7.eps,tag5;tag7;tag15
10filename8.eps,tag4;tag5;tag15;tag16
11filename8.eps,tag3;tag14;tag9;tag7
12filename8.eps,tag7;tag2;tag3;tag8
13filename9.eps,tag2;tag10;tag17
14filename10.eps,tag5;tag1;tag13
15filename10.eps,tag7;tag6;tag9;tag10
16filename11.eps,tag7;tag2;tag3;tag8;tag18
17filename11.eps,tag10;tag12;tag13;tag20
18filename12.eps,tag4;tag8;tag3;tag19
19filename13.eps,tag6;tag15;tag9;tag11
20filename14.eps,tag7;tag2;tag3;tag17;tag12;tag4
21filename15.eps,tag0;tag9,tag20
- Run the menu option Edit > Line Operations > Sort Lines Lexicographically Descending (not Ascending!)
So:
21filename15.eps,tag0;tag9,tag20
20filename14.eps,tag7;tag2;tag3;tag17;tag12;tag4
19filename13.eps,tag6;tag15;tag9;tag11
18filename12.eps,tag4;tag8;tag3;tag19
17filename11.eps,tag10;tag12;tag13;tag20
16filename11.eps,tag7;tag2;tag3;tag8;tag18
15filename10.eps,tag7;tag6;tag9;tag10
14filename10.eps,tag5;tag1;tag13
13filename9.eps,tag2;tag10;tag17
12filename8.eps,tag7;tag2;tag3;tag8
11filename8.eps,tag3;tag14;tag9;tag7
10filename8.eps,tag4;tag5;tag15;tag16
09filename7.eps,tag5;tag7;tag15
08filename6.eps,tag3;tag2;tag3;tag10;tag14
07filename5.eps,tag8;tag2;tag1;tag6;tag11
06filename5.eps,tag4;tag6;tag10;tag12
05filename4.eps,tag1;tag9;tag5
04filename3.eps,tag7;tag2;tag3;tag8
03filename3.eps,tag6;tag2;tag9
02filename2.eps,tag4;tag1;tag5
01filename1.eps,tag1;tag2;tag3
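For readers who would rather script the numbering and the descending sort, here is a minimal Python sketch of those two steps. It is only an illustration, not part of the Notepad++ procedure, and the helper name `number_and_reverse` is made up for this example. The zero padding matters because the sort is lexicographic: without it, a plain string sort would place `9...` after `10...`.

```python
def number_and_reverse(lines):
    """Prefix each line with a zero-padded counter, then sort descending.

    Mimics the Column Editor numbering (with "Leading zeros") followed by
    "Sort Lines Lexicographically Descending". Hypothetical helper name.
    """
    width = len(str(len(lines)))  # e.g. 2 digits for 21 lines
    numbered = [f"{i:0{width}d}{line}" for i, line in enumerate(lines, start=1)]
    # A plain string sort is enough because every prefix has the same width.
    return sorted(numbered, reverse=True)


sample = [
    "filename1.eps,tag1;tag2;tag3",
    "filename2.eps,tag4;tag1;tag5",
    "filename3.eps,tag6;tag2;tag9",
    "filename3.eps,tag7;tag2;tag3;tag8",
]
for line in number_and_reverse(sample):
    print(line)
```

The numeric prefix is what later allows the original line order to be restored with a simple ascending sort.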
With this simple regex search/replace (S/R), we change the whole list into a one-line list:
- Open the Replace dialog (Ctrl + H)
- SEARCH \R
- REPLACE # (any symbol not yet used in the text can be chosen)
- Select the Regular expression search mode
- Click on the Replace All button
- We obtain the single line below:
21filename15.eps,tag0;tag9,tag20#20filename14.eps,tag7;tag2;tag3;tag17;tag12;tag4#19filename13.eps,tag6;tag15;tag9;tag11#18filename12.eps,tag4;tag8;tag3;tag19#17filename11.eps,tag10;tag12;tag13;tag20#16filename11.eps,tag7;tag2;tag3;tag8;tag18#15filename10.eps,tag7;tag6;tag9;tag10#14filename10.eps,tag5;tag1;tag13#13filename9.eps,tag2;tag10;tag17#12filename8.eps,tag7;tag2;tag3;tag8#11filename8.eps,tag3;tag14;tag9;tag7#10filename8.eps,tag4;tag5;tag15;tag16#09filename7.eps,tag5;tag7;tag15#08filename6.eps,tag3;tag2;tag3;tag10;tag14#07filename5.eps,tag8;tag2;tag1;tag6;tag11#06filename5.eps,tag4;tag6;tag10;tag12#05filename4.eps,tag1;tag9;tag5#04filename3.eps,tag7;tag2;tag3;tag8#03filename3.eps,tag6;tag2;tag9#02filename2.eps,tag4;tag1;tag5#01filename1.eps,tag1;tag2;tag3
- Now, here is the regex S/R which deletes any duplicated tag:
- SEARCH (?-is)[,;](\w+)(?=[,;#].*?[,;]\1([,;#]|\R|\z))
- REPLACE Leave the zone EMPTY
- Your text is shortened as below:
21filename15.eps,tag0#20filename14.eps#19filename13.eps#18filename12.eps;tag19#17filename11.eps;tag20#16filename11.eps;tag18#15filename10.eps#14filename10.eps;tag13#13filename9.eps;tag17#12filename8.eps#11filename8.eps#10filename8.eps;tag16#09filename7.eps;tag15#08filename6.eps;tag14#07filename5.eps;tag11#06filename5.eps;tag10;tag12#05filename4.eps#04filename3.eps,tag7;tag8#03filename3.eps,tag6;tag9#02filename2.eps,tag4;tag5#01filename1.eps,tag1;tag2;tag3
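To see what this search does outside Notepad++, here is a hedged Python sketch. Python's `re` module does not support `\R`, so the sketch uses `\r?\n`, and it uses `\Z` in place of `\z`; this is an adaptation, not the original regex. The idea is that a separator plus tag is matched only when the look-ahead finds the same whole tag again further along the line, so every occurrence except the last one in the string is deleted. The function name and the short sample string are assumptions made for the example.

```python
import re

# Python adaptation of the Notepad++ search regex
# (assumption: \R becomes \r?\n and \z becomes \Z).
DUPLICATE_TAG = re.compile(r"[,;](\w+)(?=[,;#].*?[,;]\1(?:[,;#]|\r?\n|\Z))")


def drop_earlier_duplicates(one_line):
    """Delete every ',tag' or ';tag' that occurs again later in the string."""
    return DUPLICATE_TAG.sub("", one_line)


# The four lines of the original question, numbered, sorted descending, joined with '#'.
joined = (
    "4filename3.eps,tag7;tag2;tag3;tag8#3filename3.eps,tag6;tag2;tag9"
    "#2filename2.eps,tag4;tag1;tag5#1filename1.eps,tag1;tag2;tag3"
)
print(drop_earlier_duplicates(joined))
```

Because the lines were sorted in descending order first, the "last" occurrence in the joined string is the first occurrence in the original file, which is the one to keep.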
- Then, we use this other regex S/R to change this single line back into a multi-line list:
- SEARCH #
- REPLACE \r\n (or \n if your file is a Unix file)
- Giving:
21filename15.eps,tag0
20filename14.eps
19filename13.eps
18filename12.eps;tag19
17filename11.eps;tag20
16filename11.eps;tag18
15filename10.eps
14filename10.eps;tag13
13filename9.eps;tag17
12filename8.eps
11filename8.eps
10filename8.eps;tag16
09filename7.eps;tag15
08filename6.eps;tag14
07filename5.eps;tag11
06filename5.eps;tag10;tag12
05filename4.eps
04filename3.eps,tag7;tag8
03filename3.eps,tag6;tag9
02filename2.eps,tag4;tag5
01filename1.eps,tag1;tag2;tag3
- Run the menu option Edit > Line Operations > Sort Lines Lexicographically Ascending
01filename1.eps,tag1;tag2;tag3
02filename2.eps,tag4;tag5
03filename3.eps,tag6;tag9
04filename3.eps,tag7;tag8
05filename4.eps
06filename5.eps;tag10;tag12
07filename5.eps;tag11
08filename6.eps;tag14
09filename7.eps;tag15
10filename8.eps;tag16
11filename8.eps
12filename8.eps
13filename9.eps;tag17
14filename10.eps;tag13
15filename10.eps
16filename11.eps;tag18
17filename11.eps;tag20
18filename12.eps;tag19
19filename13.eps
20filename14.eps
21filename15.eps,tag0
- Finally, the last regex S/R, below:
    - gets rid of the numbering at the beginning of lines
    - replaces any semicolon right after the string .eps with a comma
- So:
- SEARCH ^\d+|(?<=eps)(;)
- REPLACE ?1,
- And here is your final expected text ;-))
filename1.eps,tag1;tag2;tag3
filename2.eps,tag4;tag5
filename3.eps,tag6;tag9
filename3.eps,tag7;tag8
filename4.eps
filename5.eps,tag10;tag12
filename5.eps,tag11
filename6.eps,tag14
filename7.eps,tag15
filename8.eps,tag16
filename8.eps
filename8.eps
filename9.eps,tag17
filename10.eps,tag13
filename10.eps
filename11.eps,tag18
filename11.eps,tag20
filename12.eps,tag19
filename13.eps
filename14.eps
filename15.eps,tag0
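The replacement `?1,` used in this last step is a conditional: it writes a comma only when group 1 (the semicolon) took part in the match, and nothing otherwise. Python's `re` has no such replacement syntax, so the sketch below, offered as an illustration rather than an exact equivalent, uses a replacement function. The name `cleanup` and the three sample lines are assumptions for the example, and `re.MULTILINE` is needed so that `^` matches at every line start.

```python
import re

# Either a leading line number, or a ';' directly after "eps" (captured as group 1).
CLEANUP = re.compile(r"^\d+|(?<=eps)(;)", re.MULTILINE)


def cleanup(text):
    """Strip the leading numbers and turn 'eps;' into 'eps,'."""
    # Stand-in for the conditional replacement '?1,': emit ',' only when
    # group 1 (the semicolon) matched, otherwise emit nothing.
    return CLEANUP.sub(lambda m: "," if m.group(1) else "", text)


numbered = "05filename4.eps\n06filename5.eps;tag10;tag12\n07filename5.eps;tag11"
print(cleanup(numbered))
```

Only a semicolon that directly follows `eps` is touched, so the separators between the remaining tags are left alone.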
Best Regards,
guy038
-
-
Hi, @сергій-бородін and All,
Thinking back on your problem, here is a second method, which requires fewer steps but arranges the non-duplicated tags according to a different layout!
So, assuming the same initial text, below :
filename1.eps,tag1;tag2;tag3
filename2.eps,tag4;tag1;tag5
filename3.eps,tag6;tag2;tag9
filename3.eps,tag7;tag2;tag3;tag8
filename4.eps,tag1;tag9;tag5
filename5.eps,tag4;tag6;tag10;tag12
filename5.eps,tag8;tag2;tag1;tag6;tag11
filename6.eps,tag3;tag2;tag3;tag10;tag14
filename7.eps,tag5;tag7;tag15
filename8.eps,tag4;tag5;tag15;tag16
filename8.eps,tag3;tag14;tag9;tag7
filename8.eps,tag7;tag2;tag3;tag8
filename9.eps,tag2;tag10;tag17
filename10.eps,tag5;tag1;tag13
filename10.eps,tag7;tag6;tag9;tag10
filename11.eps,tag7;tag2;tag3;tag8;tag18
filename11.eps,tag10;tag12;tag13;tag20
filename12.eps,tag4;tag8;tag3;tag19
filename13.eps,tag6;tag15;tag9;tag11
filename14.eps,tag7;tag2;tag3;tag17;tag12;tag4
filename15.eps,tag0;tag9,tag20
First, this simple regex S/R changes the whole list into a one-line list:
- Open the Replace dialog (Ctrl + H)
- SEARCH \R
- REPLACE # (any symbol not yet used in the text can be chosen)
- Select the Regular expression search mode
- Click on the Replace All button
- Which gives the single line below:
filename1.eps,tag1;tag2;tag3#filename2.eps,tag4;tag1;tag5#filename3.eps,tag6;tag2;tag9#filename3.eps,tag7;tag2;tag3;tag8#filename4.eps,tag1;tag9;tag5#filename5.eps,tag4;tag6;tag10;tag12#filename5.eps,tag8;tag2;tag1;tag6;tag11#filename6.eps,tag3;tag2;tag3;tag10;tag14#filename7.eps,tag5;tag7;tag15#filename8.eps,tag4;tag5;tag15;tag16#filename8.eps,tag3;tag14;tag9;tag7#filename8.eps,tag7;tag2;tag3;tag8#filename9.eps,tag2;tag10;tag17#filename10.eps,tag5;tag1;tag13#filename10.eps,tag7;tag6;tag9;tag10#filename11.eps,tag7;tag2;tag3;tag8;tag18#filename11.eps,tag10;tag12;tag13;tag20#filename12.eps,tag4;tag8;tag3;tag19#filename13.eps,tag6;tag15;tag9;tag11#filename14.eps,tag7;tag2;tag3;tag17;tag12;tag4#filename15.eps,tag0;tag9,tag20
- Now, here is the regex S/R which deletes any duplicated tag (the same regex as in my previous post):
- SEARCH (?-is)[,;](\w+)(?=[,;#].*?[,;]\1([,;#]|\R|\z))
- REPLACE Leave the zone EMPTY
- Your text should be shortened as below:
filename1.eps#filename2.eps#filename3.eps#filename3.eps#filename4.eps#filename5.eps#filename5.eps#filename6.eps#filename7.eps#filename8.eps;tag16#filename8.eps;tag14#filename8.eps#filename9.eps#filename10.eps,tag5;tag1#filename10.eps#filename11.eps;tag18#filename11.eps,tag10;tag13#filename12.eps;tag8;tag19#filename13.eps,tag6;tag15;tag11#filename14.eps,tag7;tag2;tag3;tag17;tag12;tag4#filename15.eps,tag0;tag9,tag20
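The layout differs from the first method because the regex always keeps the last occurrence of a tag in the joined line. The Python sketch below, an illustration under stated assumptions rather than the Notepad++ procedure itself, runs the same deletion on the four lines of the original question in both orders. The regex is adapted for Python (`\R` becomes `\r?\n`, `\z` becomes `\Z`), and the first method is simplified here to a plain list reversal in place of numbering and sorting.

```python
import re

# Python adaptation of the thread's search regex
# (assumption: \R becomes \r?\n and \z becomes \Z).
DUPLICATE_TAG = re.compile(r"[,;](\w+)(?=[,;#].*?[,;]\1(?:[,;#]|\r?\n|\Z))")

lines = [
    "filename1.eps,tag1;tag2;tag3",
    "filename2.eps,tag4;tag1;tag5",
    "filename3.eps,tag6;tag2;tag9",
    "filename3.eps,tag7;tag2;tag3;tag8",
]

# Second method: join in the original order, so the LAST occurrence survives.
forward = DUPLICATE_TAG.sub("", "#".join(lines)).split("#")

# First method, simplified: reverse, dedupe, reverse back, so the FIRST survives.
backward = DUPLICATE_TAG.sub("", "#".join(reversed(lines))).split("#")[::-1]

print(forward)
print(backward)
```

In the forward run the first line loses all of its tags, while in the reversed run it keeps them, which is the difference between the two layouts.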
- Finally, this regex S/R, below:
    - replaces any semicolon right after the string eps with a comma
    - replaces any # symbol with a line break (\r\n or \n)
- SEARCH eps;|(#)
- REPLACE ?1\r\n:eps, (or ?1\n:eps, if you work with a Unix file)
And we obtain the final output:
filename1.eps
filename2.eps
filename3.eps
filename3.eps
filename4.eps
filename5.eps
filename5.eps
filename6.eps
filename7.eps
filename8.eps,tag16
filename8.eps,tag14
filename8.eps
filename9.eps
filename10.eps,tag5;tag1
filename10.eps
filename11.eps,tag18
filename11.eps,tag10;tag13
filename12.eps,tag8;tag19
filename13.eps,tag6;tag15;tag11
filename14.eps,tag7;tag2;tag3;tag17;tag12;tag4
filename15.eps,tag0;tag9,tag20
As you can see, the 21 non-duplicated tags (from tag0 to tag20) are arranged differently, with many lines without any tag at the beginning of the list!
Best Regards,
guy038
-