Collect duplicates in doc before sign
-
Good day. Can you help me solve an issue? I need to collect all the lines that have duplicates before the # sign.
This is part of my doc. Thank you guys for any help, I appreciate it.
Umbrella#17:30
T-shirt#11:43
T-shirt#12:04 Polo(M)
T-shirt#14:32
T-shirt#15:27
Cap#12:47
Jeans#10:43 LEVIS
Jeans#12:42 Coll
Jeans#15:27
Gloves#14:41 Kids
Coat#11:23 YD523(M)
Coat#12:54 YD523(L)
Jacket#14:41
This is what I need:
T-shirt#11:43
T-shirt#12:04 Polo(M)
T-shirt#14:32
T-shirt#15:27
Jeans#10:43 LEVIS
Jeans#12:42 Coll
Jeans#15:27
Coat#11:23 YD523(M)
Coat#12:54 YD523(L)
-
@Richie
How many lines of data do you have? The optimal solution for this problem can vary a lot depending on how much data you have.
Also, are all the entries with duplicate values before the # consecutive? For example, are all T-shirt entries grouped together, or are there non-T-shirt entries between the first T-shirt entry and the last?
Assuming the entries with duplicate values before # are grouped together, a reasonable solution (even for a large amount of data) would be as follows:
- Go to the find/replace form, Mark tab (Ctrl+M with default hotkeys).
- With Bookmark line selected and using Regular expression as search mode, enter the following into the Find what box, and hit Mark all:
  (?-s)(^[^#\r\n]+)#.*\R(\1#.*$\R?)+
  - Notes on this regular expression (general resources available here):
    - (?-s) means that the . metacharacter won’t match newlines.
    - (^[^#\r\n]+)# matches, at the start of a line, a sequence of characters that are not # or a line-end character, stores that sequence as capture group 1, and then matches the # that follows it.
    - .*\R matches any number of non-newline characters, then a newline.
    - (\1#.*$\R?)+ matches at least one further line that starts with the same text as capture group 1, followed immediately by #, then any number of characters up to the end of the line ($), and then possibly a newline (\R? - the only thing there could be other than a newline here is the end of the file).
- You will see all the lines with a duplicate entry before # marked.
- Use Search->Bookmark->Remove Non-Bookmarked lines from the main menu.
- Now only the entries with a duplicate before the # will remain.
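If you want to sanity-check the pattern outside Notepad++, here is a rough Python sketch of the same idea. It is a translation, not the exact regex above: Python's re module has no \R and no global (?-s), so the line breaks are spelled out and re.MULTILINE is used instead. The function name keep_duplicate_runs and the sample string are just made up for illustration, and like the steps above it assumes the duplicates are already grouped together.

```python
import re

# Python's re module has no \R, so line breaks are written out explicitly.
# re.MULTILINE lets ^ and $ match at line boundaries; because re.DOTALL is
# not set, "." does not match "\n", which stands in for (?-s).
DUPLICATE_RUN = re.compile(
    r"^([^#\r\n]+)#.*(?:\r\n|\n)(?:\1#.*$(?:\r\n|\n)?)+",
    re.MULTILINE,
)


def keep_duplicate_runs(text: str) -> str:
    """Keep only runs of consecutive lines that share the text before '#'."""
    return "".join(m.group(0) for m in DUPLICATE_RUN.finditer(text))


# Sample data copied from the question above.
sample = "\n".join([
    "Umbrella#17:30",
    "T-shirt#11:43",
    "T-shirt#12:04 Polo(M)",
    "T-shirt#14:32",
    "T-shirt#15:27",
    "Cap#12:47",
    "Jeans#10:43 LEVIS",
    "Jeans#12:42 Coll",
    "Jeans#15:27",
    "Gloves#14:41 Kids",
    "Coat#11:23 YD523(M)",
    "Coat#12:54 YD523(L)",
    "Jacket#14:41",
]) + "\n"

result = keep_duplicate_runs(sample)
print(result, end="")
```

On the sample from the question this should leave the T-shirt, Jeans and Coat lines and drop Umbrella, Cap, Gloves and Jacket, which is the same set of lines the bookmark approach keeps.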
Note that the Remove Non-Bookmarked lines operation can be rather slow if the number of lines to be removed is large (say, 10 thousand or more). If that’s a problem, you can use Copy Marked Text (the third button under the Mark all and Clear all marks buttons in the Mark form) instead to copy the text that you marked in step 2, then paste it into another buffer (or select the entire original file and paste over it). You’d have to do some simple regex-based postprocessing of the result after that, but I’ll leave that as an exercise for you.
If the entries with duplicate values are not grouped together, you will need to sort the entries by the value before the #, and then follow the steps above. This sorting is difficult to achieve without a custom script, but fortunately I and fellow forum regular AlanKilborn have already written such a script.
- Go to the find/replace form,