De-Duplicate chunks of text? (screenshot)
-
Hi. I’m trying to remove duplicates from, and then sort, a huge number of Vivaldi browser bookmarks. The screenshot shows what each bookmark looks like individually. I’m looking for a way to find every chunk of text beginning with { \n “date_added”: and ending with }, treat each chunk as an individual entity, then analyze whether any duplicate chunks exist… then do the same thing for every single unique chunk… automagically…
Essentially, I need duplicate file finder software but for chunks of text.
Any ideas? Thanks much.
-
Well @John-Drachenberg, I’d try to grab the records as groups of 5 lines, from the "date_added" line through the "url" line, combining each group into 1 line (i.e. replacing the CR/LF with some other delimiter). I’d then create a "key" at the start of each line, probably the main part of the url, excluding any /folder names/. Sorting all the lines would then make possible duplicate urls line up next to each other. To my mind, any url that matches another one even where the /folder names/ portion differs would warrant further inspection.
At this point either eyeball the duplicates, or another regex could mark possible duplicates for further inspection.
Not sure if you actually want a regex to just remove duplicates in the original file, or would be happy just getting a list of possible duplicates which you could then check against the original and remove manually.
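If you’re comfortable stepping outside pure regex, the same idea (grab each chunk, flatten it to one line, key it by the main part of the url, group matching keys) can be sketched in a few lines of Python. This is only a sketch under assumptions: the sample `text` below is made up to imitate the shape shown in the screenshot, and the field names (`"date_added"`, `"url"`) are taken from your description, so adjust the patterns to match your real file.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical sample in the same shape as the screenshot: each bookmark
# is a { ... } chunk containing "date_added" and "url" fields.
text = '''{
   "date_added": "13160702866000000",
   "id": "15",
   "name": "Example A",
   "type": "url",
   "url": "https://example.com/foo/page"
}
{
   "date_added": "13160702867000000",
   "id": "16",
   "name": "Example B",
   "type": "url",
   "url": "https://example.com/bar/page"
}'''

# 1. Grab each { ... } chunk that contains a "date_added" field.
chunks = re.findall(r'\{[^{}]*?"date_added"[^{}]*?\}', text)

# 2. Key each chunk by the main part of its url (scheme + host), so the
#    same site filed under different /folder names/ still groups together.
groups = defaultdict(list)
for chunk in chunks:
    m = re.search(r'"url":\s*"([^"]+)"', chunk)
    if m:
        parts = urlparse(m.group(1))
        key = f"{parts.scheme}://{parts.netloc}"
        # Flatten the multi-line record into one line with a delimiter.
        groups[key].append(" | ".join(line.strip() for line in chunk.splitlines()))

# 3. Any key with more than one record is a possible duplicate to eyeball.
for key, records in sorted(groups.items()):
    if len(records) > 1:
        print(key)
        for record in records:
            print("   ", record)
```

With the sample above, both bookmarks key to `https://example.com`, so they’re printed together as candidates even though the paths differ — which matches the "further inspection" idea rather than deleting anything automatically.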
Terry