help removing originals and duplicates from large data set [was: n00b help pls :)]
-
n00b help pls :)
hi friends! so i was faced with a problem today: needing to remove both the original and duplicate rows from a rather large data set. in such cases i usually use excel and countif, but the max rows in excel is a little over 1m. i started looking for other ways to achieve such a result and lo and behold i came across notepad++. this is not the first time i have used this app, so i would like to thank those responsible for making such a bad-ass tool.
the answer laid out here provides the solution to my problem precisely – however, the difficulty lies with the output: nothing appears to change in the file. for example, after running the search and replace as outlined in the URL above, i am pleased to see
“Replace All: 15800 occurrences were replaced in entire file”
however, the same number of lines remain, and i can see that some of the rows have the orange (green when saved) change marker next to them.
the answer to this is going to be so stupid i know, so i hope the title is fit according to the problem.
take care guys
–
moderator renamed Topic to something meaningful -
Unfortunately, since you didn’t show any data or screenshot, we cannot begin to guess what you may have done, or why the regex supplied in the SU answer didn’t work for you.
Could you please provide the example “before” data you have, the “after” data you want, and the “wrong after” data you actually got, along with the regex you actually used (in case it’s been modified from the SU answer). Details on how to provide this are found in our Template for Search/Replace Questions – please follow the advice and format from that Template to get the best results.
so i hope the title is fit according to the problem.
To be honest, “n00b help pls :)” is the absolutely worst kind of title for a Topic, here or in any other forum on the internet. It tells readers nothing about what your discussion is. The title of a post should be an actual title about what you’re asking about. A meaningful title would have been “help removing originals and duplicates from large data set”. (It’s such a bad name that I’m going to use my moderator powers to move your old title into a header in your post, and rename the Topic to match what you’re actually asking about)
----
Other Useful References
-
@Matthew-Owen said :
needing to remove both the original and duplicate
If that’s what you’re after with your somewhat unclear posting, then no, Notepad++ can’t currently do something like that. It’s been asked for before, usually entitled “Keep only unique lines”.
Such behavior could be scripted.
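To give a rough idea of what such a script might look like: below is a sketch in the style of a Notepad++ PythonScript. The `editor.getText()`/`editor.setText()` calls are real PythonScript plugin API; the function name and everything else is my own, and the core logic runs as plain Python too.

```python
from collections import Counter

def only_unique_lines(text):
    """Keep only lines whose content occurs exactly once in the whole
    text, removing both the original and every duplicate of any
    repeated line, and preserving the order of the survivors."""
    lines = text.split("\n")
    counts = Counter(lines)
    return "\n".join(line for line in lines if counts[line] == 1)

try:
    # Inside Notepad++ (PythonScript plugin), rewrite the buffer in place.
    editor.setText(only_unique_lines(editor.getText()))
except NameError:
    pass  # running outside Notepad++: no `editor` object is defined
```

For example, `only_unique_lines("1\n1\n3\n4\n4\n4\n5\n5\n6")` returns `"3\n6"`, since 3 and 6 are the only lines that occur once.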
-
@Alan-Kilborn said in help removing originals and duplicates from large data set [was: n00b help pls :)]:
@Matthew-Owen said :
needing to remove both the original and duplicate
If that’s what you’re after with your somewhat unclear posting, then no, Notepad++ can’t currently do something like that. It’s been asked for before, usually entitled “Keep only unique lines”.
Such behavior could be scripted.
@PeterJones my apologies for the basic post, i genuinely believed it was something very trivial.
@Alan-Kilborn appreciate the clarification!
To both responses, your speed in such is of great thanks. Take care!
-
@Alan-Kilborn said in help removing originals and duplicates from large data set [was: n00b help pls :)]:
If that’s what you’re after with your somewhat unclear posting, then no, Notepad++ can’t currently do something like that. It’s been asked for before, usually entitled “Keep only unique lines”.
Well, it depends on the data. If the data is sequential, so all the duplicate lines are in the same block with the original, then it’s doable, and the
(^.+\R?)\1+
regex supplied at SU will work – it would take the input

before:
1
1
3
4
4
4
5
5
6

and convert it to

after:
3
6

… because those are the only unique lines. And it works.
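For anyone who wants to experiment with this outside Notepad++, here is a rough Python analogue of that regex (my own adaptation: Python's `re` has no `\R`, so `\n` stands in for the line break, assuming Unix line endings and a trailing newline):

```python
import re

# Python analogue of the SU regex (^.+\R?)\1+ :
# match a line plus one or more immediate repeats of it,
# and replace the whole run with nothing.
sorted_text = "1\n1\n3\n4\n4\n4\n5\n5\n6\n"
result = re.sub(r"(?m)^(.+\n)\1+", "", sorted_text)
print(result)  # only "3" and "6" remain, each on its own line
```

Because the data is sorted, every duplicate sits right next to its original, so the backreference `\1+` catches the whole run.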
However, if the original data was
before:
1
3
4
5
6
1
4
5
4
… then yes, I’d agree that the more generic version of the problem requires scripting.
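A quick sketch (again using a Python analogue of the regex, with `\n` in place of `\R`) makes the failure mode on unsorted data concrete: the pattern only removes *consecutive* repeats, and in this ordering no two identical lines are adjacent, so nothing matches at all.

```python
import re

# The same regex that works on sorted data does nothing here:
# 1, 4, and 5 all occur more than once, but never on adjacent lines,
# so the backreference \1+ never finds a repeat to anchor on.
unsorted = "1\n3\n4\n5\n6\n1\n4\n5\n4\n"
result = re.sub(r"(?m)^(.+\n)\1+", "", unsorted)
print(result == unsorted)  # True: the text is completely unchanged
```

This would also explain a "15800 occurrences were replaced" message with no visible change only if the matches replaced runs with identical text, which this regex does not do, so for unsorted data a script really is needed.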
Since @Matthew-Owen referenced that SU post, and didn’t say “but my data was in a different order”, it is entirely possible that @Matthew-Owen’s data is already sorted, and thus the SU regex should work – and thus it may have just been user error, rather than the regex not actually working for the super-secret task.
@Matthew-Owen said,
my apologies for the basic post, i genuinely believed it was something very trivial.
The problem isn’t that it’s “basic”. “Basic” would mean you asked us “how do I search for
abc
and replace with
xyz
”; if it was really that simple, we might playfully rib you for asking something so basic, but we’d likely answer. And even in that “basic” question, you would have given us data, and we could have shown you what to do. The problem is that you didn’t provide any actual information – no example of the data, and no indication of where the matches occur (whether right next to each other, like in the SU post, or randomly throughout the large document), even if it was as simple as the 1/2/3 example data that was used in SU – but instead just gave us vague hints. If you had shared a screenshot of the Replace dialog, we could have seen the
Replace All: 3 occurrences were replaced in entire file
message that it gave. Saying “nothing appears to change” doesn’t tell us the “before” or “after” – maybe there was a change, and you just didn’t see it. Or maybe it replaced all those matching lines with themselves, unchanged. Or maybe you just typed something wrong. But we cannot know, because you just made vague assertions.
And you are thus forcing us to make assumptions to be able to answer you; and as you can tell, Alan and I ended up making different assumptions. Which means that the answers you get might not be as helpful as they would be if you gave example data.
If you had shared “before” and “after” data, along with the “wrong after”, it would be much easier to help you.
Since you still refuse to share any example data, I am going to assume you’ve given up, and don’t want help anymore. Which is unfortunate for you.
(And yes, it might very well be that, depending on the order of your data, a script is really the only solution.)
-
@Matthew-Owen @PeterJones
It’s actually perfectly feasible to remove all non-empty lines that are duplicated anywhere in the file, regardless of order, without using plugins, although this unfortunately can’t be saved in a macro. This is an extension of the “group lines by content and then put them back in order” procedure that I’ve also used to efficiently empty all duplicate lines (leaving them empty). The general “group lines by content and then put them back in order” procedure is just steps 1-5 and 7-10 below, where step 6 is the part where you do whatever you want with the lines while they’re grouped by content.
For all of the find/replace operations below, Wrap around must be checked, and Search Mode must be set to Regular expression.
The steps are as follows:
- Replace ^ with \x20 (add a space at the beginning of each line)
- Move the caret to the start of the file
- Use the column editor to add line numbers at the start of each line: Number to Insert checked; Initial number, Increase by, and Repeat all set to 1; Leading = Zeros; Format = Dec
- Replace (?-s)^(\d+)\x20(.*) with ${2}\x20${1} (swap the line numbers with the rest of the line, so that each line is annotated with its location)
- Run the command Edit->Line Operations->Sort Lines Lexicographically Ascending
- Replace ^(.+?)\x20\d+(\R)(?:\1\x20\d+\R?)+ with nothing (remove all non-empty lines that aren’t unique; replace the .+? with .*? if you also want to remove empty lines)
- Replace ^(.*?)\x20(\d+)$ with ${2}\x20${1} (move the line numbers back to the start of the line)
- Remove the empty line from the end of the file, if there is one
- Run the command Edit->Line Operations->Sort Lines Lexicographically Ascending again (put the lines back in their original order)
- Replace ^\d+\x20 with nothing (remove the starting line numbers and space)
As a demonstration, this procedure can take the file

1337 1
bar
1337 1
baz
jke
quz
1337 1
quz
nhj
blah
baz
jke
103

and convert it into

bar
nhj
blah
103
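The whole annotate/sort/restore procedure can also be mirrored in plain Python as a sanity check (a sketch with my own function names, reading the demonstration file as one entry per line):

```python
# Mirror of the Notepad++ procedure above: tag each line with its line
# number, group identical content by sorting, drop every group that has
# more than one member, then sort by the saved number to restore order.
def remove_all_duplicated_lines(lines):
    # Steps 3-4: pair each line with its position (the number annotation).
    annotated = [(line, i) for i, line in enumerate(lines)]
    # Step 5: sorting puts identical lines into contiguous groups.
    annotated.sort()
    # Step 6: keep only groups of size one (content that is unique).
    survivors = []
    j = 0
    while j < len(annotated):
        k = j
        while k < len(annotated) and annotated[k][0] == annotated[j][0]:
            k += 1
        if k - j == 1:
            survivors.append(annotated[j])
        j = k
    # Steps 7-10: re-sort by the saved position to restore original order.
    survivors.sort(key=lambda pair: pair[1])
    return [line for line, _ in survivors]

demo = ["1337 1", "bar", "1337 1", "baz", "jke", "quz", "1337 1",
        "quz", "nhj", "blah", "baz", "jke", "103"]
print(remove_all_duplicated_lines(demo))  # ['bar', 'nhj', 'blah', '103']
```

Note that, as in the regex procedure, lines containing spaces (like "1337 1") are handled correctly, because grouping is done on whole-line content.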