Collect duplicates in doc before # sign

Richie

Good day. Can you help me solve an issue? I need to collect all lines that have duplicates before the # sign.
This is part of my doc. Thank you all for any help, I appreciate it.

Umbrella#17:30
T-shirt#11:43
T-shirt#12:04 Polo(M)
T-shirt#14:32
T-shirt#15:27
Cap#12:47
Jeans#10:43 LEVIS
Jeans#12:42 Coll
Jeans#15:27
Gloves#14:41 Kids
Coat#11:23 YD523(M)
Coat#12:54 YD523(L)
Jacket#14:41

This is what I need

T-shirt#11:43
T-shirt#12:04 Polo(M)
T-shirt#14:32
T-shirt#15:27
Jeans#10:43 LEVIS
Jeans#12:42 Coll
Jeans#15:27
Coat#11:23 YD523(M)
Coat#12:54 YD523(L)

Mark Olson

@Richie
How many lines of data do you have? The optimal solution for this problem can vary a lot depending on how much data you have.

Also, are all the entries with duplicate values before the # consecutive? For example, are all T-shirt entries grouped together, or are there non-T-Shirt entries between the first T-shirt entry and the last?

Assuming the entries with duplicate values before # are grouped together, a reasonable solution (even for a large amount of data) would be as follows:

1. Go to the find/replace form, Mark tab (Ctrl+M with default hotkeys).
2. With Bookmark line selected and Regular expression as the search mode, enter (?-s)(^[^#\r\n]+)#.*\R(\1#.*$\R?)+ into the Find what box, and hit Mark all.
- Notes on this regular expression (general resources available here):
- (?-s) means that the . metacharacter won’t match newlines.
- (^[^#\r\n]+)# matches, at the start of a line, a sequence of characters that are neither # nor a line-end character, stores that sequence as capture group 1, and then matches the # that follows it.
- .*\R matches any number of non-newline characters, then a newline.
- (\1#.*$\R?)+ matches one or more further lines that begin with the same text as capture group 1, followed immediately by #, then any number of characters up to the end of the line ($), and then possibly a newline (\R? is optional because the only thing there could be other than a newline here is the end of the file).
3. You will see all the lines with a duplicate entry before # marked.
4. Use Search->Bookmark->Remove Non-Bookmarked lines from the main menu.
5. Now only the entries with a duplicate before the # will remain.
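
For anyone who wants to check the logic outside Notepad++, here is a minimal Python sketch of the same idea. It is an approximation, not the Notepad++ regex itself: Python's re module has no \R, so this version assumes line endings are plain \n, and (?-s) is dropped because . already stops at newlines by default. The names DUP_RUN and keep_grouped_duplicates are made up for this example.

```python
import re

# Assumption: line endings are "\n" (Python's re has no \R escape).
# Group 1 is the text before the first '#'; the repeated part requires
# at least one following line that starts with the same text plus '#'.
DUP_RUN = re.compile(r"^([^#\n]+)#.*\n(?:\1#.*$\n?)+", re.MULTILINE)


def keep_grouped_duplicates(text: str) -> str:
    """Return only the runs of consecutive lines sharing the text before '#'."""
    return "".join(match.group(0) for match in DUP_RUN.finditer(text))


sample = """Umbrella#17:30
T-shirt#11:43
T-shirt#12:04 Polo(M)
T-shirt#14:32
T-shirt#15:27
Cap#12:47
Jeans#10:43 LEVIS
Jeans#12:42 Coll
Jeans#15:27
Gloves#14:41 Kids
Coat#11:23 YD523(M)
Coat#12:54 YD523(L)
Jacket#14:41
"""

result = keep_grouped_duplicates(sample)
print(result, end="")
```

Like the Notepad++ approach, this only catches duplicates that sit on consecutive lines.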

Note that the Remove Non-Bookmarked lines operation can be rather slow if the number of lines to be removed is large (say, 10 thousand or more). If that’s a problem, you can use Copy Marked Text (the third button under the Mark all and Clear all marks buttons in the Mark form) instead to copy the text that you marked in step 2, then paste it into another buffer (or select the entire original file and paste over it). You’d have to do some simple regex-based postprocessing of the result after that, but I’ll leave that as an exercise for you.

If the entries with duplicate values are not grouped together, you will need to sort the entries by the value before the #, and then follow the steps above. This sorting is difficult to achieve without a custom script, but fortunately fellow forum regular AlanKilborn and I have already written such a script.
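
To illustrate the ungrouped case, here is a rough Python sketch. It is not the script mentioned above, and it swaps sorting for grouping: lines are collected by the text before the first #, and only groups with more than one line are kept, in order of first appearance rather than alphabetical order. The function name collect_duplicates and the shuffled sample are made up for this example.

```python
from collections import defaultdict


def collect_duplicates(lines):
    """Keep lines whose text before the first '#' occurs on more than one line."""
    groups = defaultdict(list)  # dicts keep keys in first-seen order
    for line in lines:
        key, sep, _rest = line.partition("#")
        if sep:  # ignore lines that contain no '#'
            groups[key].append(line)
    # Flatten the groups that have at least two members.
    return [line for group in groups.values() if len(group) > 1 for line in group]


shuffled = [
    "Jeans#10:43 LEVIS",
    "Cap#12:47",
    "T-shirt#11:43",
    "Jeans#15:27",
    "Coat#11:23 YD523(M)",
    "T-shirt#14:32",
]

grouped = collect_duplicates(shuffled)
print("\n".join(grouped))
```

Because the lines come out grouped, the result could also be pasted back into Notepad++ if further regex work is needed.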