Delete line with duplicate Number



  • **Sorry for the repost, going to try and simplify my question.**

    I’ve spent the first half of my day trying to figure out how to do this, and googling did not turn up an exact answer.

    …Random words, letters, and numbers are on each line…

    Objective: find lines that have exact duplicate numbers (not letters or words).

    Before example:

    A dog went to the mall - #11364
    The dog went to the store - #11364
    A dog is at the mall - #14369
    Dog to the store random - #14369
    Sentence a random - #13677
    The went dog to store - #11159

    After example:

    The dog went to the store - #11364
    Dog to the store random - #14369
    Sentence a random - #13677
    The went dog to store - #11159

    • The formula needs to at least: match lines that have identical numbers.
    • The formula does NOT need to: delete one of the lines

    I’m fine with manually deleting the lines that have an identical number match.

    Any help is appreciated, thank you



  • Hello, @jim-erlich,

    Before finding a way to solve your problem, we need additional information :

    • In your previous post, which has since been deleted, you spoke about a giant file : what is its approximate size and how many lines does it contain ?

    • Do you mind if your file is modified by a preliminary sort ? If you don’t mind, this could significantly simplify the whole process !

    • How many digits can exist after the # symbol ? Is 5 the maximum or 8 or 10 digits or … ?

    • What is the maximum number of lines between two “duplicate” lines ( for instance xxxxxxxx#12345 and yyyyyyyy#12345 ) ?

    • Are there more than 2 “duplicate” lines ( I mean, for instance, 6 lines ending with #12345 ) ?

    • In case of multiple “duplicate” lines, which one do you want to keep : the first duplicate line or the last one ?

    Note that we can cope with a possible space character between the # symbol and the number. Not a problem !

    See you later !

    Best regards,

    guy038



  • @guy038 said in Delete line with duplicate Number:

    • 12,000 lines broken up into different pages. Each page can have 300 to 1,100 lines.
    • Right now the lines are in Ascending order based upon the number.

    For example:
    The dog went to the park - #4599
    The cat went to the park - #4657
    The kid went to the park - #4797
    The lizard went to the zoo - #5100
    The cat went to the zoo - #5120
    etc…

    Ideally, I would like to keep the numbers in Ascending order like this, and locate the lines that have duplicate numbers

    • The number of digits after the # symbol is 1 to 5… the highest number being about 14000
    • Oh this is good… the duplicate line that has the duplicate number SHOULD BE on the next line. So it will be like this

    For example:
    The dog went to the park - #12554
    The cat went to the park - #12554

    ^^^ The formula should say “Hey!! These two numbers are identical” and then I will delete one of them manually.

    • There will be only 2 ‘duplicate’ lines. There will not be 6 lines ending with #12345
    • Keep the last duplicate line, delete the first

    I hope this is clear, let me know if there is anything else.

    Thank you



  • Hi, @jim-erlich and All,

    Many thanks for all your information ! It should be very easy to get the right solution !

    The most important points are :

    • Your file is already sorted by ascending #number

    • And a duplicate line, when there is one, is located right after the original line !


    Now, I don’t know how your preliminary sort handled numbers that are separated from the # symbol by a space char.

    For instance :

    Are lines sorted, as below ( case A ) :

    The dog went to the park - #4599
    The cat went to the park - # 4657
    The kid went to the park - #4797
    The lizard went to the zoo - # 5100
    The cat went to the zoo - #5120
    

    OR

    Are lines sorted, like ( case B ) :

    The cat went to the park - # 4657
    The lizard went to the zoo - # 5100
    The dog went to the park - #4599
    The kid went to the park - #4797
    The cat went to the zoo - #5120
    

    Indeed, the N++ sort would place the line # 10000 before the line #5000, as the code-point of a space is smaller than the code-point of any digit !


    Anyway, assuming a sort like in case A, and this initial text :
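
    The difference between the two orders can be reproduced outside N++. The Python sketch below is only an illustration, not part of the N++ procedure: it assumes the comparison is made on the text from the # symbol onwards (the helper names `tail` and `number` are made up for this example), and it shows how a numeric key brings a case B list back to the case A order.

```python
import re

lines = [
    "The dog went to the park - #4599",
    "The cat went to the park - # 4657",
    "The kid went to the park - #4797",
    "The lizard went to the zoo - # 5100",
    "The cat went to the zoo - #5120",
]

def tail(line: str) -> str:
    """Return the text from the last '#' onwards, e.g. '# 4657' (hypothetical helper)."""
    return line[line.rindex("#"):]

def number(line: str) -> int:
    """Return the number after the last '#', ignoring an optional space (hypothetical helper)."""
    match = re.search(r"#\x20?(\d+)$", line)
    if match is None:
        raise ValueError(f"no trailing #number in {line!r}")
    return int(match.group(1))

# Text comparison: a space (U+0020) sorts before any digit, so '# ...' lines come first
case_b = sorted(lines, key=tail)

# Numeric comparison: the optional space no longer influences the order
case_a = sorted(case_b, key=number)

for line in case_b:
    print(line)
```

    With this sample, `case_b` lists the two `# ` lines first and `case_a` is the ascending order again.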

    The dog went to the park - #4599
    You went to the zoo - # 4640
    He went to the park - # 4640
    The cat went to the park - #4657
    The kid went to the park - # 4657
    The girl went to the park - #4900
    The lizard went to the zoo - # 5100
    The cat went to the zoo - #5100
    I went to the park - #7500
    We went to the zoo - #7500
    They went to the park - #14000
    

    Here is the road map :

    • Open your file in N++

    • Open the Replace dialog ( Ctrl + H )

    • SEARCH (?-s)^.+#\x20?(\d+)\R(?=.+#\x20?\1)

    • REPLACE Leave EMPTY

    • Tick the Wrap around option

    • Select the Regular expression search mode

    • Click once on the Replace All button or several times on the Replace button

    Voila !

    You should get your expected list :

    The dog went to the park - #4599
    He went to the park - # 4640
    The kid went to the park - # 4657
    The girl went to the park - #4900
    The cat went to the zoo - #5100
    We went to the zoo - #7500
    They went to the park - #14000
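
    For anyone who would rather check the result in a script first, here is a minimal Python sketch of the same search and replace on the sample text. It is an approximation, not the N++ engine: Python's `re` module has no `\R`, so `\r?\n` stands in for the line break, and `re.MULTILINE` plays the role of the line-based `^`.

```python
import re

text = """\
The dog went to the park - #4599
You went to the zoo - # 4640
He went to the park - # 4640
The cat went to the park - #4657
The kid went to the park - # 4657
The girl went to the park - #4900
The lizard went to the zoo - # 5100
The cat went to the zoo - #5100
I went to the park - #7500
We went to the zoo - #7500
They went to the park - #14000
"""

# '.' excludes line breaks by default in Python, which is what (?-s) enforces in N++.
# The look-ahead checks the next line without consuming it, so only the
# first line of each duplicate pair is removed.
pattern = re.compile(r"^.+#\x20?(\d+)\r?\n(?=.+#\x20?\1)", re.MULTILINE)

result = pattern.sub("", text)
print(result, end="")
```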
    

    Note that if your sort is rather like case B, it shouldn’t be difficult to get back to the case A order and run the regex S/R afterwards ;-))

    Next time, if everything was OK, I’ll explain how this regex S/R works !

    Best Regards,

    guy038



  • @guy038 said in Delete line with duplicate Number:

    Flawless… absolutely amazing. You don’t understand how thankful I am for your detailed and correct response. This is a whole new language for me and I am amazed. thank you again.



  • Hello @jim-erlich and All,

    Sorry for being late ! So, here are, below, some explanations about my regex S/R :

    SEARCH (?-s)^.+#\x20?(\d+)\R(?=.+#\x20?\1)

    REPLACE Leave EMPTY

    • First, the (?-s) in-line modifier ensures that any subsequent . regex symbol matches a single standard character only, and never a line-break char !

    • So, the next part ^.+#\x20? searches, from the beginning of a line ( ^ ), for any non-empty range of characters ( .+ ), followed by the # symbol and an optional space char ( \x20? )

    • Then, it looks for a non-empty range of digits ( \d+ ), followed by line-break character(s) ( \R )

    • So, the regex engine matches an entire line ( the digits after the # are stored as group 1, as they are enclosed in parentheses ) but ONLY IF the next line carries the same number !

    • This condition is expressed with a look-ahead structure (?=......), which is a user-defined zero-length assertion, in the same way that, for instance, the $ symbol is a built-in zero-length assertion meaning “end of line” !

    • So the current line must be followed by text matching the regex .+#\x20?\1, which represents, again, a non-empty range of standard characters followed by a # and possibly a space char and, finally, the group 1 ( \1 ), which is the ending number of the current line. Strictly speaking, nothing is required after \1, so a line ending with #45 followed by a line ending with #4599 would match too; writing \1$ instead should rule that case out

    • Note that a ^ assertion for the second line, inside the look-ahead structure, would be useless, as the range ( .+ ) comes right after the line-break char(s) \R, anyway !

    • As the replacement zone is empty, the current line, with its line-break, is just deleted
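
    The roles of group 1 and of the look-ahead can be observed directly with a short Python sketch. It rests on a few assumptions: `\r?\n` replaces `\R`, which Python's `re` module does not support, the names `loose` and `strict` are made up for this example, and the `strict` pattern with a trailing `$` is a hypothetical variant, not the regex given in this thread.

```python
import re

# The SEARCH pattern of this thread, adapted to Python's re module
loose = re.compile(r"^.+#\x20?(\d+)\r?\n(?=.+#\x20?\1)", re.MULTILINE)
# Hypothetical stricter variant: $ forces the next line to END with group 1
strict = re.compile(r"^.+#\x20?(\d+)\r?\n(?=.+#\x20?\1$)", re.MULTILINE)

pair = "The dog went to the park - #12554\nThe cat went to the park - #12554\n"
m = loose.search(pair)
captured = m.group(1)  # the digits after '#' on the first line
consumed = m.group(0)  # the first line and its line break only: the look-ahead consumes nothing

# A number that is merely a prefix of the next one
prefix = "A - #45\nB - #4599\n"
loose_hit = loose.search(prefix) is not None
strict_hit = strict.search(prefix) is not None

print(repr(captured), repr(consumed), loose_hit, strict_hit)
```

    On the `prefix` sample the original pattern reports a match while the anchored variant does not, which is why the extra `$` may be worth adding when short numbers can sit right before longer ones.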


    For a quick overview of regular expressions, see the N++ documentation, below :

    https://npp-user-manual.org/docs/searching/#regular-expressions

    See also the main links regarding the Boost regex library, used by the regex N++ engine :

    https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html

    https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/format/boost_format_syntax.html

    Finally, see this FAQ topic about regular expressions :

    https://community.notepad-plus-plus.org/topic/15765/faq-desk-where-to-find-regex-documentation

    Best Regards,

    guy038

