Delete line with duplicate Number
-
**Sorry for the repost, I’m going to try to simplify my question.**
I’ve spent the first half of my day trying to figure out how to do this, and googling did not turn up an answer to my exact problem.
…Random words, letters, and numbers are on each line…
Objective: find lines that have exact duplicate numbers (not letters or words).
Before example:
A dog went to the mall - #11364
The dog went to the store - #11364
A dog is at the mall - #14369
Dog to the store random - #14369
Sentence a random - #13677
The went dog to store - #11159
After example:
A random sentence - #11364
A sentence random - #14369
Sentence a random - #13677
The went dog to store - #11159
- The formula needs, at least, to match lines that have identical numbers.
- The formula does NOT need to delete one of the lines.
I’m fine with manually deleting the lines that have an identical number match.
Any help is appreciated, thank you
-
Hello, @jim-erlich,
Before finding a way to solve your problem, we need some additional information :
- In your previous post, which has been deleted by now, you spoke about a giant file : what is its approximate size and how many lines does this file contain ?
- Do you mind if your file is modified by a preliminary sort ? If you don’t mind, this could significantly simplify the whole process !
- How many digits can exist after the # symbol ? Is 5 the maximum, or 8, or 10 digits, or … ?
- What is the maximum number of lines between two “duplicate” lines ( for instance, xxxxxxxx#12345 and yyyyyyyy#12345 ) ?
- Are there more than 2 “duplicate” lines ( I mean, for instance, 6 lines ending with #12345 ) ?
- In case of multiple “duplicate” lines, which one do you want to keep : the first duplicate line or the last duplicate ?
Note that we can cope with a possible space character between the # symbol and the number. Not a problem !
See you later !
Best regards,
guy038
-
-
- 12,000 lines broken up into different pages. Each page can have 300 to 1,100 lines.
- Right now the lines are in Ascending order based upon the number.
For example:
The dog went to the park - #4599
The cat went to the park - #4657
The kid went to the park - #4797
The lizard went to the zoo - #5100
The cat went to the zoo - #5120
etc…
Ideally, I would like to keep the numbers in ascending order like this, and locate the lines that have duplicate numbers.
- The number of digits after the # symbol is 1 to 5… the highest number being about 14000
- Oh this is good… the duplicate line that has the duplicate number SHOULD BE on the next line. So it will be like this
For example:
The dog went to the park - #12554
The cat went to the park - #12554
^^^ The formula should say “Hey!! These two numbers are identical” and then I will delete one of them manually.
- There will be only 2 ‘duplicate’ lines. There will not be 6 lines ending with #12345
- Keep the last duplicate line, delete the first
I hope this is clear, let me know if there is anything else.
Thank you
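For readers who would rather script the check, the requirement described above (flag two consecutive lines that end with the same number, without deleting anything) can be sketched in Python. This is only an illustration under assumptions: the helper name `find_adjacent_duplicates` and the sample lines are hypothetical, and numbers are compared as text, so `#012` and `#12` would count as different.

```python
import re

# Trailing number: '#', one optional space, then digits up to the end of the line.
TRAILING_NUMBER = re.compile(r"#\x20?(\d+)\s*$")


def find_adjacent_duplicates(lines):
    """Return (line_no, next_line_no, number) for each adjacent pair sharing a number."""
    hits = []
    previous = None  # number found on the previous line, if any
    for index, line in enumerate(lines, start=1):
        match = TRAILING_NUMBER.search(line)
        current = match.group(1) if match else None
        if current is not None and current == previous:
            hits.append((index - 1, index, current))
        previous = current
    return hits


sample = [
    "The dog went to the park - #12554",
    "The cat went to the park - #12554",
    "The kid went to the park - #12600",
]
print(find_adjacent_duplicates(sample))  # [(1, 2, '12554')]
```

The function only reports line numbers, which leaves the decision about which line to delete to the person editing the file.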
-
-
Hi, @jim-erlich and All,
Many thanks for all your information ! It should be very easy to get the right solution !
The most important points are :
- Your file is already sorted by ascending # number
- And, in case of 1 duplicate line, it is located right after the original line !
Now, I don’t know how your preliminary sort behaved with numbers separated from the # symbol by a space char. For instance :
Are the lines sorted as below ( case A ) :
The dog went to the park - #4599
The cat went to the park - # 4657
The kid went to the park - #4797
The lizard went to the zoo - # 5100
The cat went to the zoo - #5120
OR
Are the lines sorted like this ( case B ) :
The cat went to the park - # 4657
The lizard went to the zoo - # 5100
The dog went to the park - #4599
The kid went to the park - #4797
The cat went to the zoo - #5120
Indeed, the N++ sort would place the line # 10000 before the line #5000, as the code-point of the space char is smaller than the code-point of any digit !
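The code-point remark can be checked with a small Python sketch, which also shows one possible way to get from the case B order back to the case A order by sorting on the numeric value. It is an assumption-laden illustration, not what the N++ sort does internally: the helper `trailing_number` is hypothetical, and the lines are the case B sample.

```python
import re

# Plain string comparison: the space (U+0020) sorts before any digit (U+0030 to U+0039).
print(sorted(["#5000", "# 10000"]))  # ['# 10000', '#5000']

case_b = [
    "The cat went to the park - # 4657",
    "The lizard went to the zoo - # 5100",
    "The dog went to the park - #4599",
    "The kid went to the park - #4797",
    "The cat went to the zoo - #5120",
]


def trailing_number(line):
    """Numeric value after '#', tolerating one optional space; -1 if no number is found."""
    match = re.search(r"#\x20?(\d+)\s*$", line)
    return int(match.group(1)) if match else -1


# Sorting on the integer value ignores the optional space and gives the case A order.
case_a = sorted(case_b, key=trailing_number)
for line in case_a:
    print(line)
```

Because Python's `sorted` is stable, two lines carrying the same number keep their relative order, so the line that came last stays last.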
Anyway, assuming a sort like in case A and the initial text :
The dog went to the park - #4599
You went to the zoo - # 4640
He went to the park - # 4640
The cat went to the park - #4657
The kid went to the park - # 4657
The girl went to the park - #4900
The lizard went to the zoo - # 5100
The cat went to the zoo - #5100
I went to the park - #7500
We went to the zoo - #7500
They went to the park - #14000
Here is the road map :
- Open your file in N++
- Open the Replace dialog ( Ctrl + H )
- SEARCH (?-s)^.+#\x20?(\d+)\R(?=.+#\x20?\1)
- REPLACE Leave EMPTY
- Tick the Wrap around option
- Select the Regular expression search mode
- Click once on the Replace All button or several times on the Replace button
Voila !
You should get your expected list :
The dog went to the park - #4599
He went to the park - # 4640
The kid went to the park - # 4657
The girl went to the park - #4900
The cat went to the zoo - #5100
We went to the zoo - #7500
They went to the park - #14000
Note that if your sort is rather like in case B, it shouldn’t be difficult to get back to the case A order and run the regex S/R afterwards ;-))
Next time, if everything was OK, I’ll explain how this regex S/R works !
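Outside N++, the same search and replace can be approximated with Python's `re` module. This is a sketch under assumptions rather than an exact equivalent: `re` has no `\R`, so the line break is written as `\r?\n` (Unix or Windows line endings only), and `(?-s)` is dropped because `.` already excludes `\n` by default in Python.

```python
import re

# Assumed port of the N++ SEARCH expression to Python's `re` syntax.
SEARCH = re.compile(r"^.+#\x20?(\d+)\r?\n(?=.+#\x20?\1)", re.MULTILINE)

text = "\n".join([
    "The dog went to the park - #4599",
    "You went to the zoo - # 4640",
    "He went to the park - # 4640",
    "The cat went to the park - #4657",
    "The kid went to the park - # 4657",
    "The girl went to the park - #4900",
    "The lizard went to the zoo - # 5100",
    "The cat went to the zoo - #5100",
    "I went to the park - #7500",
    "We went to the zoo - #7500",
    "They went to the park - #14000",
])

# An empty replacement deletes the first line of every duplicate pair.
result = SEARCH.sub("", text)
print(result)
```

On this sample the output is the seven-line list given above: the last line of each pair is kept.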
Best Regards,
guy038
-
-
Flawless… absolutely amazing. You don’t understand how thankful I am for your detailed and correct response. This is a whole new language for me and I am amazed. Thank you again.
-
-
Hello @jim-erlich and All,
Sorry for being late ! So, here are, below, some explanations about my regex S/R :
SEARCH (?-s)^.+#\x20?(\d+)\R(?=.+#\x20?\1)
REPLACE Leave EMPTY
- First, the (?-s) in-line modifier ensures that any further . regex symbol corresponds to a single standard character only, and not to a line-break char !
- So, the next part ^.+#\x20? searches, from the beginning of line ( ^ ), for any non-null range of characters ( .+ ), followed by the # symbol and an optional space char ( \x20? )
- Then, it looks for a non-null range of digits ( \d+ ), followed by line-break character(s) ( \R )
- So, the regex engine looks for an entire line ( the digits after the # are stored as group 1, as they are embedded in parentheses ) but ONLY IF the next line ends with the same number !
- This condition is expressed with a look-ahead structure (?=......), which is a user assertion in the same way that, for instance, the $ symbol is a system assertion, looking for the zero-length assertion “end of line” !
- So the current line must be followed by text matching the regex .+#\x20?\1, which represents, again, a non-null range of standard characters followed by a # and possibly a space char and, finally, the group 1 ( \1 ), which is the ending number of the current line
- Note that the ^ assertion for the second line, in the look-ahead structure, is useless, as the range .+ comes right after the line-break char(s) \R, anyway !
- As the replacement zone is empty, the current line, with its line-break, is just deleted
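The role of the look-ahead and of group 1 can be observed with a short Python sketch, again using `\r?\n` in place of `\R` as an assumption, since Python's `re` does not support `\R`. It also shows one edge case: the look-ahead is not anchored at the end of the next line, so a number such as 125 followed by 1250 would count as a match. The `(?!\d)` variant is a hypothetical hardening that is not part of the original expression.

```python
import re

FIRST_OF_PAIR = re.compile(r"^.+#\x20?(\d+)\r?\n(?=.+#\x20?\1)", re.MULTILINE)

text = "The dog went to the park - #12554\nThe cat went to the park - #12554\n"
match = FIRST_OF_PAIR.search(text)

# Only the first line and its line break are consumed; the look-ahead is zero-length.
print(repr(match.group(0)))  # 'The dog went to the park - #12554\n'
print(match.group(1))        # 12554

# Hypothetical hardening: refuse a match when more digits follow the repeated number.
STRICT = re.compile(r"^.+#\x20?(\d+)\r?\n(?=.+#\x20?\1(?!\d))", re.MULTILINE)

prefix_case = "A - #125\nB - #1250\n"
print(bool(FIRST_OF_PAIR.search(prefix_case)))  # True
print(bool(STRICT.search(prefix_case)))         # False
```

In a file where every duplicate sits right after its original, as described in this thread, the two variants give the same result; the stricter one only matters when a number happens to be followed by a longer number starting with the same digits.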
For a quick overview of regular expressions, see the N++ documentation, below :
https://npp-user-manual.org/docs/searching/#regular-expressions
See also the main link regarding the Boost regex library, used by the N++ regex engine :
https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html
Finally, see this FAQ topic about regular expressions :
https://community.notepad-plus-plus.org/topic/15765/faq-desk-where-to-find-regex-documentation
Best Regards,
guy038
-