Delete line with duplicate Number
-
**Sorry for the repost; I’m going to try to simplify my question.**
I’ve spent the first half of my day trying to figure out how to do this, and despite googling I was unable to find an exact answer.
…Random words, letters, and numbers are on each line…
Objective: find lines that have exact duplicate numbers (not letters or words).
Before example:
A dog went to the mall - #11364
The dog went to the store - #11364
A dog is at the mall - #14369
Dog to the store random - #14369
Sentence a random - #13677
The went dog to store - #11159
After example:
A random sentence - #11364
A sentence random - #14369
Sentence a random - #13677
The went dog to store - #11159
- The formula needs to at least: match lines that have identical numbers.
- The formula does NOT need to: delete one of the lines
I’m fine with manually deleting the lines that have an identical number match.
Any help is appreciated, thank you
-
Hello, @jim-erlich,
Before finding a way to solve your problem, we need some additional information:
- In your previous post, which has since been deleted, you spoke about a giant file: what is its approximate size, and how many lines does it contain?
- Do you mind if your file is modified by a preliminary sort? If you don’t mind, this could significantly simplify the whole process!
- How many digits can exist after the `#` symbol? Is `5` the maximum, or `8`, or `10` digits, or …?
- What is the maximum number of lines between two “duplicate” lines (for instance, `xxxxxxxx#12345` and `yyyyyyyy#12345`)?
- Are there more than `2` “duplicate” lines (I mean, for instance, `6` lines ending with `#12345`)?
- In case of multiple “duplicate” lines, which one do you want to keep: the first duplicate line or the last one?
Note that we can cope with a possible `space` character between the `#` symbol and the number. Not a problem!
See you later!
Best regards,
guy038
-
-
- 12,000 lines broken up into different pages. Each page can have 300 to 1,100 lines.
- Right now the lines are in Ascending order based upon the number.
For example:
The dog went to the park - #4599
The cat went to the park - #4657
The kid went to the park - #4797
The lizard went to the zoo - #5100
The cat went to the zoo - #5120
etc…
Ideally, I would like to keep the numbers in ascending order like this, and locate the lines that have duplicate numbers.
- The number of digits after the # symbol is 1 to 5… the highest number being about 14000
- Oh this is good… the duplicate line that has the duplicate number SHOULD BE on the next line. So it will be like this
For example:
The dog went to the park - #12554
The cat went to the park - #12554
^^^ The formula should say “Hey!! These two numbers are identical” and then I will delete one of them manually.
- There will be only 2 ‘duplicate’ lines. There will not be 6 lines ending with #12345
- Keep the last duplicate line, delete the first
I hope this is clear, let me know if there is anything else.
Thank you
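For readers who would rather script the check than run it in the editor, here is a minimal Python sketch of the requirement described above: flag two neighbouring lines whose trailing numbers are identical, without deleting anything. This is not an N++ feature; the helper name `find_adjacent_duplicates` and the sample lines are assumptions made for this illustration.

```python
import re

# Hypothetical helper, not part of N++: it only reports the pairs.
# The pattern captures the digits after '#', allowing one optional space.
NUMBER_AT_END = re.compile(r"#\x20?(\d+)\s*$")

def find_adjacent_duplicates(lines):
    """Return (line_no, next_line_no, number) for each adjacent pair sharing a number."""
    hits = []
    previous = None  # number found on the previous line, or None
    for line_no, line in enumerate(lines, start=1):
        match = NUMBER_AT_END.search(line)
        current = match.group(1) if match else None
        if current is not None and current == previous:
            hits.append((line_no - 1, line_no, current))
        previous = current
    return hits

sample = [
    "The dog went to the park - #12554",
    "The cat went to the park - #12554",
    "The kid went to the park - #12600",
]
duplicates = find_adjacent_duplicates(sample)
for first, second, number in duplicates:
    print(f"Lines {first} and {second} share #{number}")
```

The numbers are compared as text, so `#0125` and `#125` would not be reported as identical in this sketch.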
-
-
Hi, @jim-erlich and All,
Many thanks for all your information! It should be very easy to get the right solution!
The most important points are:
- Your file is already sorted by ascending `#` number
- And, when a line has `1` duplicate, it is located right after the original line!
Now, I don’t know how your preliminary sort behaved with numbers separated from the `#` symbol by a `space` char. For instance:
Are the lines sorted as below (case `A`):
The dog went to the park - #4599
The cat went to the park - # 4657
The kid went to the park - #4797
The lizard went to the zoo - # 5100
The cat went to the zoo - #5120
OR
Are the lines sorted like this (case `B`):
The cat went to the park - # 4657
The lizard went to the zoo - # 5100
The dog went to the park - #4599
The kid went to the park - #4797
The cat went to the zoo - #5120
Indeed, the N++ sort would place the line `# 10000` before the line `#5000`, as the code point of the `space` character is smaller than the code point of any `digit`!
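The code-point argument can be checked with a small Python sketch. Python’s built-in `sorted` stands in here for a plain lexicographic sort; it is not the N++ sort itself, and the `number_of` helper is a hypothetical name introduced only for this example.

```python
import re

tags = ["#5000", "# 10000", "#4599"]

# Plain string comparison works code point by code point:
# the space is U+0020 (32) and the digits start at U+0030 (48).
lexicographic = sorted(tags)

def number_of(tag):
    """Extract the integer after '#', ignoring one optional space (hypothetical helper)."""
    return int(re.search(r"#\x20?(\d+)", tag).group(1))

# Sorting on the numeric value restores the order described as case A.
numeric = sorted(tags, key=number_of)

print(lexicographic)  # ['# 10000', '#4599', '#5000']
print(numeric)        # ['#4599', '#5000', '# 10000']
```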
Anyway, assuming a sort like in case `A` and the initial text:
The dog went to the park - #4599
You went to the zoo - # 4640
He went to the park - # 4640
The cat went to the park - #4657
The kid went to the park - # 4657
The girl went to the park - #4900
The lizard went to the zoo - # 5100
The cat went to the zoo - #5100
I went to the park - #7500
We went to the zoo - #7500
They went to the park - #14000
Here is the road map:
- Open your file in N++
- Open the Replace dialog (`Ctrl + H`)
- SEARCH `(?-s)^.+#\x20?(\d+)\R(?=.+#\x20?\1)`
- REPLACE `Leave EMPTY`
- Tick the `Wrap around` option
- Select the `Regular expression` search mode
- Click once on the `Replace All` button, or several times on the `Replace` button
Voila!
You should get your expected list:
The dog went to the park - #4599
He went to the park - # 4640
The kid went to the park - # 4657
The girl went to the park - #4900
The cat went to the zoo - #5100
We went to the zoo - #7500
They went to the park - #14000
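The same search/replace can be reproduced outside N++ as a rough cross-check. The sketch below is an approximate Python translation, not the Boost regex itself: Python’s `re` module has no `\R` escape, so `\r?\n` is used as a stand-in, and the `(?-s)` modifier is dropped because `.` does not match a line feed by default in Python. The sample text is the initial text of the example.

```python
import re

# Approximate Python equivalent of the N++ search regex (assumption: "\n" or "\r\n" line endings).
PATTERN = re.compile(r"^.+#\x20?(\d+)\r?\n(?=.+#\x20?\1)", re.MULTILINE)

text = "\n".join([
    "The dog went to the park - #4599",
    "You went to the zoo - # 4640",
    "He went to the park - # 4640",
    "The cat went to the park - #4657",
    "The kid went to the park - # 4657",
    "The girl went to the park - #4900",
    "The lizard went to the zoo - # 5100",
    "The cat went to the zoo - #5100",
    "I went to the park - #7500",
    "We went to the zoo - #7500",
    "They went to the park - #14000",
]) + "\n"

# An empty replacement deletes the first line of each duplicate pair.
result = PATTERN.sub("", text)
print(result, end="")
```

As in the N++ version, the last line of each pair is the one that survives.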
Note that if your sort is rather like in case `B`, it shouldn’t be difficult to get back to the case `A` order and run the regex S/R afterwards ;-))
Next time, if everything was OK, I’ll explain how this regex S/R works!
Best Regards,
guy038
-
-
Flawless… absolutely amazing. You don’t understand how thankful I am for your detailed and correct response. This is a whole new language for me and I am amazed. thank you again.
-
-
Hello @jim-erlich and All,
Sorry for being late! So here are some explanations of my regex S/R:
SEARCH `(?-s)^.+#\x20?(\d+)\R(?=.+#\x20?\1)`
REPLACE `Leave EMPTY`
- First, the `(?-s)` in-line modifier ensures that any further `.` regex symbol corresponds to a single standard character only, and not to a line-break char!
- So, the next part `^.+#\x20?` searches, from the beginning of a line (`^`), for any non-null range of characters (`.+`), followed by the `#` symbol and an optional `space` char (`\x20?`)
- Then, it looks for a non-null range of digits (`\d+`), followed by the line-break character(s) (`\R`)
- So, the regex engine looks for an entire line (the digits after the `#` are stored as group `1`, as they are embedded in parentheses), but ONLY IF the next line ends with the same number!
- This condition can be expressed with a look-ahead structure `(?=......)`, which is a kind of user assertion, in the same way that, for instance, the `$` symbol is a system assertion, looking for the zero-length assertion “end of line”!
- So the current line must be followed by the regex `.+#\x20?\1`, which represents, again, a non-null range of standard characters followed by a `#`, possibly a `space` char, and finally group `1` (`\1`), which is the ending number of the current line
- Note that the `^` assertion for the second line, in the look-ahead structure, is useless, as the range `.+` comes right after the line-break char(s) `\R` anyway!
- As the replacement zone is `empty`, the current line, with its line-break, is simply deleted
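The points above can be laid out piece by piece in a commented pattern. The following is a hedged Python sketch rather than the Boost syntax used by N++: `\r?\n` replaces `\R`, the `#` is escaped because verbose mode treats a bare `#` as the start of a comment, and the names `ANNOTATED` and `STRICT` are hypothetical.

```python
import re

ANNOTATED = re.compile(
    r"""
    ^ .+ \# \x20?      # start of line, any text, the hash sign, an optional space
    (\d+)              # group 1: the number that ends the current line
    \r?\n              # the line break, consumed as part of the match
    (?=                # look-ahead: tests the next line without consuming it
        .+ \# \x20?    #   any text, the hash sign, an optional space
        \1             #   the same digits as group 1
    )
    """,
    re.MULTILINE | re.VERBOSE,
)

text = "A - #7\nB - #7\nC - #8\n"
match = ANNOTATED.search(text)
print(repr(match.group(0)))  # 'A - #7\n' : only the first line of the pair is matched
print(ANNOTATED.sub("", text), end="")

# Possible tightening (an assumption, not part of the original regex):
# (?!\d) requires that the next line's number stops right after the back-reference.
STRICT = re.compile(r"^.+#\x20?(\d+)\r?\n(?=.+#\x20?\1(?!\d))", re.MULTILINE)
prefix_case = "A - #45\nB - #450\n"
print(ANNOTATED.search(prefix_case) is not None)  # True
print(STRICT.search(prefix_case) is not None)     # False
```

The last two lines show one thing the look-ahead does not check: where the number on the next line ends. A line ending in `#45` followed by one ending in `#450` would therefore also match, which is unlikely in a densely numbered sorted file but worth knowing about.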
For a quick overview of regular expressions, see the N++ documentation below:
https://npp-user-manual.org/docs/searching/#regular-expressions
See also the main link regarding the `Boost regex` library, used by the N++ regex engine:
https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html
Finally, see this FAQ topic about regular expressions :
https://community.notepad-plus-plus.org/topic/15765/faq-desk-where-to-find-regex-documentation
Best Regards,
guy038
-