Delete line with duplicate Number
-
**Sorry for the repost; I’m going to try to simplify my question.**
I’ve spent the first half of my day trying to figure out how to do this, and despite a lot of googling I was unable to find an exact answer.
…Random words, letters, and numbers are on each line…
Objective: find lines that have exact duplicate numbers (not letters or words).
Before example:
A dog went to the mall - #11364
The dog went to the store - #11364
A dog is at the mall - #14369
Dog to the store random - #14369
Sentence a random - #13677
The went dog to store - #11159
After example:
A random sentence - #11364
A sentence random - #14369
Sentence a random - #13677
The went dog to store - #11159
- The formula needs to at least: match lines that have identical numbers.
- The formula does NOT need to: delete one of the lines.
I’m fine with manually deleting the lines that have an identical number match.
Any help is appreciated, thank you
-
Hello, @jim-erlich,
Before finding a way to solve your problem, we need some additional information:
- In your previous post, which has since been deleted, you spoke about a giant file: what is its approximate size, and how many lines does it contain?
- Do you mind if your file is modified by a preliminary sort? If you don’t mind, this could significantly simplify the whole process!
- How many digits can exist after the # symbol? Is 5 the maximum, or 8, or 10 digits, or …?
- What is the maximum number of lines between two “duplicate” lines (for instance, xxxxxxxx#12345 and yyyyyyyy#12345)?
- Are there more than 2 “duplicate” lines (I mean, for instance, 6 lines ending with #12345)?
- In case of multiple “duplicate” lines, which one do you want to keep: the first duplicate line or the last one?
Note that we can cope with a possible space character between the # symbol and the number. Not a problem!
See you later!
Best regards,
guy038
-
-
- 12,000 lines broken up into different pages. Each page can have 300 to 1,100 lines.
- Right now the lines are in Ascending order based upon the number.
For example:
The dog went to the park - #4599
The cat went to the park - #4657
The kid went to the park - #4797
The lizard went to the zoo - #5100
The cat went to the zoo - #5120
etc…
Ideally, I would like to keep the numbers in ascending order like this, and locate the lines that have duplicate numbers.
- The number of digits after the # symbol is 1 to 5… the highest number being about 14000
- Oh this is good… the duplicate line that has the duplicate number SHOULD BE on the next line. So it will be like this
For example:
The dog went to the park - #12554
The cat went to the park - #12554
^^^ The formula should say “Hey!! These two numbers are identical” and then I will delete one of them manually.
- There will be only 2 ‘duplicate’ lines. There will not be 6 lines ending with #12345
- Keep the last duplicate line, delete the first
I hope this is clear, let me know if there is anything else.
Thank you
-
-
Hi, @jim-erlich and All,
Many thanks for all your information ! It should be very easy to get the right solution !
The most important points are:
- Your file is already sorted by ascending #number
- And, in case of 1 duplicate line, it is located right after the original line!
Now, I don’t know how your preliminary sort behaved with numbers separated by a space char from the # symbol. For instance:
Are lines sorted as below (case A):
The dog went to the park - #4599
The cat went to the park - # 4657
The kid went to the park - #4797
The lizard went to the zoo - # 5100
The cat went to the zoo - #5120
OR
Are lines sorted like this (case B):
The cat went to the park - # 4657
The lizard went to the zoo - # 5100
The dog went to the park - #4599
The kid went to the park - #4797
The cat went to the zoo - #5120
Indeed, the N++ sort would place the line # 10000 before the line #5000, as the code-point of the space is smaller than the code-point of a digit!
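To see the code-point effect outside N++, here is a small Python sketch. It is only an illustration: Python’s default string sort compares code points, which mimics the behaviour described above but is not the N++ sort itself, and the sample strings and the helper name number_key are made up for the example.

```python
# Hypothetical line endings: one has a space after '#', the others do not.
samples = ["#5000", "# 10000", "#4599"]

# Plain string sort: characters are compared by code point,
# and the space (32) comes before any digit (48 to 57).
lexicographic = sorted(samples)
print(lexicographic)  # ['# 10000', '#4599', '#5000']


def number_key(text: str) -> int:
    """Return the integer after '#', ignoring an optional space."""
    return int(text.lstrip("#").strip())


# Sorting on the numeric value gives the "case A" style order.
numeric = sorted(samples, key=number_key)
print(numeric)  # ['#4599', '#5000', '# 10000']
```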
Anyway, assuming a sort like in case A, and the initial text:
The dog went to the park - #4599
You went to the zoo - # 4640
He went to the park - # 4640
The cat went to the park - #4657
The kid went to the park - # 4657
The girl went to the park - #4900
The lizard went to the zoo - # 5100
The cat went to the zoo - #5100
I went to the park - #7500
We went to the zoo - #7500
They went to the park - #14000
Here is the road map:
- Open your file in N++
- Open the Replace dialog (Ctrl + H)
- SEARCH (?-s)^.+#\x20?(\d+)\R(?=.+#\x20?\1)
- REPLACE Leave EMPTY
- Tick the Wrap around option
- Select the Regular expression search mode
- Click once on the Replace All button, or several times on the Replace button
Voila!
You should get your expected list:
The dog went to the park - #4599
He went to the park - # 4640
The kid went to the park - # 4657
The girl went to the park - #4900
The cat went to the zoo - #5100
We went to the zoo - #7500
They went to the park - #14000
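For anyone who wants to try the idea outside N++, the same search can be sketched in Python. This is an approximation under a few assumptions, not the N++ engine: Python’s re module has no \R, so \r?\n stands in for it; re.MULTILINE is needed so that ^ matches at each line start; and the dot already excludes the newline by default, which plays the role of (?-s). The names SAMPLE, PATTERN, flagged and cleaned are illustrative only.

```python
import re

# The sample text from the post above.
SAMPLE = """\
The dog went to the park - #4599
You went to the zoo - # 4640
He went to the park - # 4640
The cat went to the park - #4657
The kid went to the park - # 4657
The girl went to the park - #4900
The lizard went to the zoo - # 5100
The cat went to the zoo - #5100
I went to the park - #7500
We went to the zoo - #7500
They went to the park - #14000
"""

# Python rendering of the search regex (\r?\n replaces \R).
PATTERN = re.compile(r"^.+#\x20?(\d+)\r?\n(?=.+#\x20?\1)", re.MULTILINE)

# Only locate the duplicates: the numbers of the lines that would be removed.
flagged = [match.group(1) for match in PATTERN.finditer(SAMPLE)]
print(flagged)  # ['4640', '4657', '5100', '7500']

# Replace each match with nothing, which keeps the last line of each pair.
cleaned = PATTERN.sub("", SAMPLE)
print(cleaned)
```

The flagged list covers the case where the lines should only be found and then deleted by hand, while cleaned corresponds to Replace All with an empty replacement.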
Note that if your sort is rather like in case B, it shouldn’t be difficult to get the case A order again and run the regex S/R afterwards ;-))
Next time, if everything was OK, I’ll explain how this regex S/R works!
Best Regards,
guy038
-
-
Flawless… absolutely amazing. You don’t understand how thankful I am for your detailed and correct response. This is a whole new language for me and I am amazed. Thank you again.
-
-
Hello @jim-erlich and All,
Sorry for being late! So here are some explanations about my regex S/R:
SEARCH (?-s)^.+#\x20?(\d+)\R(?=.+#\x20?\1)
REPLACE Leave EMPTY
- First, the (?-s) in-line modifier ensures that any further . regex symbol matches a single standard character only, and never a line-break char!
- So, the next part ^.+#\x20? searches, from the beginning of a line (^), for any non-empty range of characters (.+), followed by the # symbol and an optional space char (\x20?)
- Then, it looks for a non-empty range of digits (\d+), followed by the line-break character(s) (\R)
- So, the regex engine looks for an entire line (the digits after the # are stored as group 1, as they are embedded in parentheses), but ONLY IF the next line has the same number after its # symbol!
- This condition is expressed with a look-ahead structure (?=......), which is a kind of user-defined assertion, in the same way that, for instance, the $ symbol is a built-in zero-length assertion meaning “end of line”!
- So the current line must be followed by text matching the regex .+#\x20?\1, which represents, again, a non-empty range of standard characters, followed by a # and possibly a space char, and finally the group 1 (\1), which is the ending number of the current line
- Note that a ^ assertion for the second line, in the look-ahead structure, is unnecessary, as the range .+ comes right after the line-break char(s) \R anyway!
- As the replacement zone is empty, the current line, with its line-break, is simply deleted
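The breakdown above can also be written as a commented pattern. The sketch below uses Python’s re.VERBOSE mode purely as an illustration; it is not how N++ runs the search. Assumptions: \r?\n stands in for \R, the # must be escaped because verbose mode treats a bare # as a comment start, and the names ANNOTATED and STRICT are hypothetical. STRICT is a suggested variant, not part of the original regex: since the look-ahead does not check what follows \1, a line ending in #12 followed by one ending in #123 would also match, and a trailing (?!\d) rules that out.

```python
import re

ANNOTATED = re.compile(
    r"""
    ^              # start of a line (needs re.MULTILINE in Python)
    .+             # any non-empty run of characters on the current line
    \#\x20?        # the '#' symbol, then an optional space
    (\d+)          # group 1: the number that ends the current line
    \r?\n          # the line break (stand-in for \R)
    (?=            # look-ahead: the next line must contain ...
        .+\#\x20?  #   ... some text, a '#', an optional space
        \1         #   ... and the same digits as group 1
    )
    """,
    re.MULTILINE | re.VERBOSE,
)

# Hypothetical stricter variant: the repeated number must not be
# followed by another digit, so 12 no longer matches inside 123.
STRICT = re.compile(
    r"^.+#\x20?(\d+)\r?\n(?=.+#\x20?\1(?!\d))", re.MULTILINE
)

prefix_case = "a - #12\nb - #123\n"
print(repr(ANNOTATED.sub("", prefix_case)))  # 'b - #123\n'
print(repr(STRICT.sub("", prefix_case)))     # 'a - #12\nb - #123\n'

true_duplicate = "a - #12\nb - # 12\n"
print(repr(STRICT.sub("", true_duplicate)))  # 'b - # 12\n'
```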
For a quick overview of regular expressions, see the N++ documentation:
https://npp-user-manual.org/docs/searching/#regular-expressions
See also the main link regarding the Boost regex library, used by the N++ regex engine:
https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html
Finally, see this FAQ topic about regular expressions:
https://community.notepad-plus-plus.org/topic/15765/faq-desk-where-to-find-regex-documentation
Best Regards,
guy038
-