De-Duplicate chunks of text? (screenshot)



  • Hi. I’m trying to remove duplicates from, and sort, a huge number of Vivaldi browser bookmarks. This screenshot shows what each bookmark looks like individually. I’m looking for a way to find every chunk of text that begins with { \n “date_added”: and runs until the closing }, treat each chunk as an individual entity, and then check whether any duplicate chunks exist, repeating that for every unique chunk… automagically…

    Essentially, I need duplicate file finder software but for chunks of text.

    Any ideas? Thanks much.



  • Well @John-Drachenberg, I’d try to grab the records in groups of 5 lines, from the date_added line to the url line, and combine each group into 1 line (replacing the CR/LF with some other delimiter). I’d then create a “key” at the start of each line, possibly the main part of the url, excluding any /folder names. Then sort all the lines so that possible duplicate urls end up next to each other. To my mind, any url whose main part matches another one warrants further inspection, even if the /folder names/ portion is different.

    At this point either eyeball the duplicates, or another regex could mark possible duplicates for further inspection.

    Not sure if you actually want a regex to just remove duplicates in the original file, or would be happy just getting a list of possible duplicates which you could then check against the original and remove manually.

    Terry
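
For anyone who would rather script this than do it with search-and-replace, the steps above (flatten each record to one line, prefix a key, sort, look for repeated keys) can be sketched in Python. This is only a rough sketch under assumptions that the screenshot may not bear out: each bookmark is taken to be a flat { … } chunk that starts with "date_added", contains a "url" line, and has no nested braces. The names keyed_lines and possible_duplicates are made up for illustration, and the key used here is the host part of the url.

```python
import re
from itertools import groupby
from urllib.parse import urlsplit

# Assumption: a bookmark record is a flat {...} chunk that starts with
# "date_added" and contains no nested braces (so folders are skipped).
RECORD_RE = re.compile(r'\{\s*"date_added":[^{}]*\}')
URL_RE = re.compile(r'"url":\s*"([^"]*)"')


def keyed_lines(text, delimiter=" | "):
    """Flatten each record onto one line and pair it with a host key."""
    lines = []
    for match in RECORD_RE.finditer(text):
        chunk = match.group(0)
        url_match = URL_RE.search(chunk)
        if url_match is None:
            continue  # no url line, so not a bookmark record
        # Key = main part of the url (the host), ignoring /folder names.
        host = urlsplit(url_match.group(1)).netloc.lower()
        # Replace the line breaks (CR/LF or LF) with another delimiter.
        parts = [part.strip() for part in chunk.splitlines() if part.strip()]
        lines.append((host, delimiter.join(parts)))
    # Sorting puts records that share a key next to each other.
    return sorted(lines)


def possible_duplicates(text):
    """Return {host: [flattened records]} for hosts seen more than once."""
    groups = {}
    for host, group in groupby(keyed_lines(text), key=lambda pair: pair[0]):
        records = [flat for _, flat in group]
        if len(records) > 1:
            groups[host] = records
    return groups
```

Feeding it the text of a copy of the bookmarks file gives a list of candidates to check by hand rather than an automatic clean-up, which matches the "list of possible duplicates" option. Grouping by the full url instead of the host would report only exact url matches.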

