Delete line with duplicate Number

Jim Erlich

**Sorry for the repost; I’m going to try to simplify my question.**

I’ve spent the first half of my day trying to figure out how to do this and googling for an answer, without success.

…Random words, letters, and numbers are on each line…

Objective: find lines that have exact duplicate numbers (not letters or words).

Before example:

A dog went to the mall - #11364
The dog went to the store - #11364
A dog is at the mall - #14369
Dog to the store random - #14369
Sentence a random - #13677
The went dog to store - #11159

After example:

The dog went to the store - #11364
Dog to the store random - #14369
Sentence a random - #13677
The went dog to store - #11159

The formula needs to at least: match lines that have identical numbers.
The formula does NOT need to: delete one of the lines

I’m fine with manually deleting the lines that have an identical number match.

Any help is appreciated, thank you

guy038

Hello, @jim-erlich,

Before finding a way to solve your problem, we need additional information :

In your previous post, which has since been deleted, you spoke about a giant file : what is its approximate size and how many lines does it contain ?
Do you mind if your file is modified by a preliminary sort ? If you don’t mind, this could significantly simplify the whole process !
How many digits can exist after the # symbol ? Is the maximum 5, 8, 10 digits or … ?
What is the maximum number of lines between two “duplicate” lines ( for instance xxxxxxxx#12345 and yyyyyyyy#12345 ) ?
Are there more than 2 “duplicate” lines ( I mean, for instance, 6 lines ending with #12345 ) ?
In case of multiple “duplicate” lines, which one do you want to keep : the first duplicate line or the last one ?

Note that we can cope with a possible space character between the # symbol and the number. Not a problem !

See you later !

Best regards,

guy038

Jim Erlich

12,000 lines broken up into different pages. Each page can have 300 to 1,100 lines.
Right now the lines are in Ascending order based upon the number.

For example:
The dog went to the park - #4599
The cat went to the park - #4657
The kid went to the park - #4797
The lizard went to the zoo - #5100
The cat went to the zoo - #5120
etc…

Ideally, I would like to keep the numbers in Ascending order like this, and locate the lines that have duplicate numbers

The number of digits after the # symbol is 1 to 5… the highest number being about 14000
Oh this is good… the duplicate line that has the duplicate number SHOULD BE on the next line. So it will be like this

For example:
The dog went to the park - #12554
The cat went to the park - #12554

^^^ The formula should say “Hey!! These two numbers are identical” and then I will delete one of them manually.

There will be only 2 ‘duplicate’ lines. There will not be 6 lines ending with #12345
Keep the last duplicate line, delete the first

I hope this is clear, let me know if there is anything else.

Thank you

guy038

Hi, @jim-erlich and All,

Many thanks for all your information ! It should be very easy to get the right solution !

The most important points are :

Your file is already sorted by ascending #number
And, in case of 1 duplicate line, it is located right after the original line !

Now, I don’t know how your preliminary sort handled numbers that are separated from the # symbol by a space char.

For instance :

Are lines sorted, as below ( case A ) :

The dog went to the park - #4599
The cat went to the park - # 4657
The kid went to the park - #4797
The lizard went to the zoo - # 5100
The cat went to the zoo - #5120

OR

Are lines sorted, like ( case B ) :

The cat went to the park - # 4657
The lizard went to the zoo - # 5100
The dog went to the park - #4599
The kid went to the park - #4797
The cat went to the zoo - #5120

Indeed, the N++ sort would place the line # 10000 before the line #5000, as the code-point of a space is smaller than the code-point of any digit !
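
To make the difference between case A and case B concrete, here is a minimal Python sketch (not an N++ feature). It assumes the comparison is done on the text from the # symbol onward; the helper names `tail` and `number` are made up for this illustration.

```python
import re

lines = [
    "The dog went to the park - #4599",
    "The cat went to the park - # 4657",
    "The kid went to the park - #4797",
    "The lizard went to the zoo - # 5100",
    "The cat went to the zoo - #5120",
]

def tail(line):
    """Hypothetical helper: everything from the '#' symbol onward, e.g. '# 4657'."""
    return line[line.rindex("#"):]

def number(line):
    """Hypothetical helper: the integer after '#', ignoring an optional space."""
    return int(re.search(r"#\x20?(\d+)$", line).group(1))

# Character-by-character comparison of the '#...' tails: a space (U+0020)
# sorts before any digit, so the '# ' lines come first (case B).
by_text = sorted(lines, key=tail)

# Numeric comparison of the value after '#': the space no longer matters (case A).
by_number = sorted(lines, key=number)

print("\n".join(by_text))
print()
print("\n".join(by_number))
```

Sorting on a numeric key like this is also one possible way to get from a case B order back to a case A order before running the regex S/R.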

Anyway, assuming a sort, like in case A and the initial text :

The dog went to the park - #4599
You went to the zoo - # 4640
He went to the park - # 4640
The cat went to the park - #4657
The kid went to the park - # 4657
The girl went to the park - #4900
The lizard went to the zoo - # 5100
The cat went to the zoo - #5100
I went to the park - #7500
We went to the zoo - #7500
They went to the park - #14000

Here is the road map :

Open your file in N++
Open the Replace dialog ( Ctrl + H )
SEARCH (?-s)^.+#\x20?(\d+)\R(?=.+#\x20?\1)
REPLACE Leave EMPTY
Tick the Wrap around option
Select the Regular expression search mode
Click once on the Replace All button or several times on the Replace button

Voila !

You should get your expected list :

The dog went to the park - #4599
He went to the park - # 4640
The kid went to the park - # 4657
The girl went to the park - #4900
The cat went to the zoo - #5100
We went to the zoo - #7500
They went to the park - #14000
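
For anyone who wants to try the same search/replace outside N++, here is a rough Python sketch. It is an approximation, not the N++ engine : Python’s `re` module has no `\R`, so `\r?\n` stands in for it, and the `(?-s)` modifier is dropped because `.` already excludes line breaks by default in Python.

```python
import re

# Approximate Python translation of the N++ search expression:
#   - "(?-s)" is dropped: "." does not match line breaks by default in Python
#   - "\R" is replaced by "\r?\n", since Python's re module lacks \R
#   - re.MULTILINE lets "^" match at the start of every line
SEARCH = re.compile(r"^.+#\x20?(\d+)\r?\n(?=.+#\x20?\1)", re.MULTILINE)

text = """\
The dog went to the park - #4599
You went to the zoo - # 4640
He went to the park - # 4640
The cat went to the park - #4657
The kid went to the park - # 4657
The girl went to the park - #4900
The lizard went to the zoo - # 5100
The cat went to the zoo - #5100
I went to the park - #7500
We went to the zoo - #7500
They went to the park - #14000
"""

# Empty replacement: every matched line (with its line break) is removed,
# which keeps the LAST line of each duplicate pair.
result = SEARCH.sub("", text)
print(result)
```

One caveat with this sketch : nothing anchors the end of `\1` inside the look-ahead, so a line ending with #125 followed by a line ending with #1254 would also match. Writing the look-ahead as `(?=.+#\x20?\1$)` closes that gap in the Python version.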

Note that if your sort is rather like case B, it shouldn’t be difficult to get back to the case A order and run the regex S/R afterwards ;-))

Next time, if everything was OK, I’ll explain how this regex S/R works !

Best Regards,

guy038

Jim Erlich

Flawless… absolutely amazing. You don’t understand how thankful I am for your detailed and correct response. This is a whole new language for me and I am amazed. Thank you again.

guy038

Hello @jim-erlich and All,

Sorry for being late ! So, here are, below, some explanations about my regex S/R :

SEARCH (?-s)^.+#\x20?(\d+)\R(?=.+#\x20?\1)

REPLACE Leave EMPTY

First, the (?-s) in-line modifier ensures that any further . regex symbol matches a single standard character only, and never a line-break char !
So, the next part ^.+#\x20? searches, from the beginning of a line ( ^ ), for any non-empty range of characters ( .+ ), followed by the # symbol and an optional space char ( \x20? )
Then, it looks for a non-empty range of digits ( \d+ ), followed by the line-break character(s) ( \R )
So, the regex engine matches an entire line ( the digits after the # are stored as group 1, as they are enclosed in parentheses ) but ONLY IF the next line contains the # symbol followed by the same number !
This condition is expressed with a look-ahead structure (?=......), which is a kind of user-defined assertion, in the same way that, for instance, the $ symbol is a built-in assertion matching the zero-length position “end of line” !
So, the current line must be followed by text matching the regex .+#\x20?\1, which represents, again, a non-empty range of standard characters, followed by a # and possibly a space char and, finally, group 1 ( \1 ), which is the ending number of the current line
Note that a ^ assertion for the second line, inside the look-ahead structure, would be useless, as the range ( .+ ) comes right after the line-break char(s) \R, anyway !
As the replacement zone is empty, the current line, with its line-break, is simply deleted
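
The parts described above can also be laid out one per line. The sketch below is a Python approximation ( `\r?\n` stands in for `\R`, and `(?-s)` is not needed in Python ); the function name `duplicate_numbers` is made up for this illustration. Instead of deleting anything, it only reports which numbers are repeated on the next line, and it shows that the look-ahead does not consume the second line.

```python
import re

PATTERN = re.compile(
    r"""
    ^ .+ \# \x20?      # from line start: any characters, then '#' and an optional space
    (\d+)              # group 1: the number ending the current line
    \r?\n              # the line break (stand-in for \R)
    (?=                # look-ahead: test the next line without consuming it
        .+ \# \x20?    #   any characters, '#', optional space
        \1             #   the same digits as group 1
    )
    """,
    re.MULTILINE | re.VERBOSE,
)

def duplicate_numbers(text):
    """Return each number whose line is directly followed by a line with the same number."""
    return [m.group(1) for m in PATTERN.finditer(text)]

sample = (
    "The dog went to the park - #12554\n"
    "The cat went to the park - #12554\n"
    "The kid went to the park - #12600\n"
)

print(duplicate_numbers(sample))

# The overall match covers only the first line of the pair and its line break:
# the look-ahead checked the second line but left it untouched.
print(repr(PATTERN.search(sample).group(0)))
```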

For a quick overview of regular expressions, see the N++ documentation, below :

https://npp-user-manual.org/docs/searching/#regular-expressions

See also the main links regarding the Boost regex library, used by the regex N++ engine :

https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html

https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/format/boost_format_syntax.html

Finally, see this FAQ topic about regular expressions :

https://community.notepad-plus-plus.org/topic/15765/faq-desk-where-to-find-regex-documentation

Best Regards,

guy038