Delete line with duplicate Number

  • J
    Jim Erlich
    last edited by Jul 2, 2020, 6:23 PM

    **Sorry for the repost, going to try and simplify my question.**

    I’ve spent the first half of my day trying to figure out how to do this, and googling has not turned up an answer for my exact case.

    …Random words, letters, and numbers are on each line…

    Objective: find lines that have exact duplicate numbers (not letters or words).

    Before example:

    A dog went to the mall - #11364
    The dog went to the store - #11364
    A dog is at the mall - #14369
    Dog to the store random - #14369
    Sentence a random - #13677
    The went dog to store - #11159

    After example:

    The dog went to the store - #11364
    Dog to the store random - #14369
    Sentence a random - #13677
    The went dog to store - #11159

    • The formula needs to at least: match lines that have identical numbers.
    • The formula does NOT need to: delete one of the lines

    I’m fine with manually deleting the lines that have an identical number match.

    Any help is appreciated, thank you

    • G
      guy038
      last edited by Jul 2, 2020, 6:53 PM

      Hello, @jim-erlich,

      Before finding a way to solve your problem, we need some additional information:

      • In your previous post, which has since been deleted, you spoke about a giant file: what is its approximate size, and how many lines does it contain?

      • Do you mind if your file is modified by a preliminary sort? If you don’t mind, this could significantly simplify the whole process!

      • How many digits can follow the # symbol? Is 5 the maximum, or 8, or 10 digits, or …?

      • What is the maximum number of lines between two “duplicate” lines (for instance xxxxxxxx#12345 and yyyyyyyy#12345)?

      • Are there ever more than 2 “duplicate” lines (I mean, for instance, 6 lines ending with #12345)?

      • In case of multiple “duplicate” lines, which one do you want to keep: the first duplicate line or the last?

      Note that we can cope with a possible space character between the # symbol and the number. Not a problem!

      See you later !

      Best regards,

      guy038

      • J
        Jim Erlich @guy038
        last edited by Jul 2, 2020, 7:26 PM

        • 12,000 lines broken up into different pages. Each page can have 300 to 1,100 lines.
        • Right now the lines are in Ascending order based upon the number.

        For example:
        The dog went to the park - #4599
        The cat went to the park - #4657
        The kid went to the park - #4797
        The lizard went to the zoo - #5100
        The cat went to the zoo - #5120
        etc…

        Ideally, I would like to keep the numbers in Ascending order like this, and locate the lines that have duplicate numbers

        • The number of digits after the # symbol is 1 to 5… the highest number being about 14000
        • Oh this is good… the duplicate line that has the duplicate number SHOULD BE on the next line. So it will be like this

        For example:
        The dog went to the park - #12554
        The cat went to the park - #12554

        ^^^ The formula should say “Hey!! These two numbers are identical” and then I will delete one of them manually.

        • There will be only 2 ‘duplicate’ lines. There will not be 6 lines ending with #12345
        • Keep the last duplicate line, delete the first

        I hope this is clear, let me know if there is anything else.

        Thank you

        • G
          guy038
          last edited by guy038 Jul 2, 2020, 9:04 PM Jul 2, 2020, 8:53 PM

          Hi, @jim-erlich and All,

          Many thanks for all your information! It should be very easy to get the right solution!

          The most important points are:

          • Your file is already sorted by ascending #number

          • And when a line has a duplicate, the duplicate is located right after the original line!


          Now, I don’t know how your preliminary sort handled numbers that are separated from the # symbol by a space character.

          For instance :

          Are lines sorted, as below ( case A ) :

          The dog went to the park - #4599
          The cat went to the park - # 4657
          The kid went to the park - #4797
          The lizard went to the zoo - # 5100
          The cat went to the zoo - #5120
          

          OR

          Are lines sorted, like ( case B ) :

          The cat went to the park - # 4657
          The lizard went to the zoo - # 5100
          The dog went to the park - #4599
          The kid went to the park - #4797
          The cat went to the zoo - #5120
          

          Indeed, for otherwise identical text, the N++ sort would place a line ending in # 10000 before a line ending in #5000, because the code-point of the space character is smaller than the code-point of any digit!
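The code-point argument above can be checked with a small sketch. It uses Python's built-in text sort as a stand-in for a plain lexicographic sort. The two sample lines are hypothetical and share the same leading text, so only the part after the # decides the order.

```python
# Two hypothetical lines that differ only in what follows the '#'.
lines = [
    "The cat went to the zoo - #5000",
    "The cat went to the zoo - # 10000",
]

# A plain text sort compares the lines character by character, by code-point.
print(ord(" "), ord("5"))  # 32 53

# The space after '#' is smaller than the digit '5', so "# 10000" comes first
# even though 10000 is the larger number.
text_sorted = sorted(lines)
print(text_sorted)
```

This is only an illustration of the ordering rule, not a description of how the Notepad++ sort is implemented.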


          Anyway, assuming a sort like in case A, and this initial text:

          The dog went to the park - #4599
          You went to the zoo - # 4640
          He went to the park - # 4640
          The cat went to the park - #4657
          The kid went to the park - # 4657
          The girl went to the park - #4900
          The lizard went to the zoo - # 5100
          The cat went to the zoo - #5100
          I went to the park - #7500
          We went to the zoo - #7500
          They went to the park - #14000
          

          Here is the road map :

          • Open your file in N++

          • Open the Replace dialog ( Ctrl + H )

          • SEARCH (?-s)^.+#\x20?(\d+)\R(?=.+#\x20?\1)

          • REPLACE Leave EMPTY

          • Tick the Wrap around option

          • Select the Regular expression search mode

          • Click once on the Replace All button or several times on the Replace button

          Voila !

          You should get your expected list :

          The dog went to the park - #4599
          He went to the park - # 4640
          The kid went to the park - # 4657
          The girl went to the park - #4900
          The cat went to the zoo - #5100
          We went to the zoo - #7500
          They went to the park - #14000
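For readers who want to try the same search/replace outside Notepad++, here is a minimal Python sketch. It is a translation under stated assumptions, not the Boost regex itself: Python's `re` module has no `\R`, so `\r?\n` stands in for it, and the `.` symbol already excludes the line feed by default, so no `(?-s)` is needed. The names `ADJACENT_DUPLICATE` and `drop_first_of_each_pair` are made up for this example. The trailing `(?!\d)` is an addition of this sketch: the look-ahead as written in the post also accepts a next line whose number merely begins with the same digits (for instance #1255 followed by #12554).

```python
import re

# "\r?\n" stands in for \R, and re.MULTILINE lets ^ match at every line start.
# The final (?!\d) is an addition of this sketch, not part of the original regex.
ADJACENT_DUPLICATE = re.compile(
    r"^.+#\x20?(\d+)\r?\n(?=.+#\x20?\1(?!\d))",
    re.MULTILINE,
)


def drop_first_of_each_pair(text: str) -> str:
    """Delete every line whose #number is repeated on the line right after it."""
    return ADJACENT_DUPLICATE.sub("", text)


before = "\n".join([
    "The dog went to the park - #4599",
    "You went to the zoo - # 4640",
    "He went to the park - # 4640",
    "The cat went to the park - #4657",
    "The kid went to the park - # 4657",
    "The girl went to the park - #4900",
    "The lizard went to the zoo - # 5100",
    "The cat went to the zoo - #5100",
    "I went to the park - #7500",
    "We went to the zoo - #7500",
    "They went to the park - #14000",
]) + "\n"

after = drop_first_of_each_pair(before)
print(after, end="")
```

As in the Replace All run, the first line of each pair is removed and the last one is kept.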
          

          Note that if your sort is rather like case B, it shouldn’t be difficult to get back to the case A order and run the regex S/R afterwards ;-))
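One possible way to get from the case B order back to the case A order is to sort on the number itself rather than on the text. The sketch below is in Python rather than Notepad++, and the helper names `number_of` and `sort_by_number` are hypothetical. It assumes every line ends with a # followed by an optional space and the digits.

```python
import re

# Assumption: each line ends with '#', an optional space, then the digits.
NUMBER_AT_END = re.compile(r"#\x20?(\d+)\s*$")


def number_of(line: str) -> int:
    """Return the number that follows the '#' at the end of the line."""
    match = NUMBER_AT_END.search(line)
    if match is None:
        raise ValueError(f"no trailing #number in line: {line!r}")
    return int(match.group(1))


def sort_by_number(lines):
    # sorted() is stable, so lines with equal numbers keep their relative order.
    return sorted(lines, key=number_of)


case_b = [
    "The cat went to the park - # 4657",
    "The lizard went to the zoo - # 5100",
    "The dog went to the park - #4599",
    "The kid went to the park - #4797",
    "The cat went to the zoo - #5120",
]

case_a = sort_by_number(case_b)
print("\n".join(case_a))
```

Because the sort is stable, two lines sharing a number stay next to each other in their original order, which is what the regex S/R relies on.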

          Next time, if everything was OK, I’ll explain how this regex S/R works !

          Best Regards,

          guy038

          • J
            Jim Erlich @guy038
            last edited by Jul 2, 2020, 10:36 PM

            Flawless… absolutely amazing. You don’t understand how thankful I am for your detailed and correct response. This is a whole new language for me and I am amazed. Thank you again.

            • G
              guy038
              last edited by Jul 4, 2020, 11:10 PM

              Hello @jim-erlich and All,

              Sorry for being late! So, here are some explanations about my regex S/R:

              SEARCH (?-s)^.+#\x20?(\d+)\R(?=.+#\x20?\1)

              REPLACE Leave EMPTY

              • First, the (?-s) in-line modifier ensures that any subsequent . regex symbol matches a single standard character only, and never a line-break character!

              • So, the next part ^.+#\x20? searches, from the beginning of a line ( ^ ), for any non-empty range of characters ( .+ ), followed by the # symbol and an optional space character ( \x20? )

              • Then, it looks for a non-empty range of digits ( \d+ ), followed by the line-break character(s) ( \R )

              • So, the regex engine matches an entire line (the digits after the # are stored as group 1, as they are enclosed in parentheses), but ONLY IF the next line carries the same number after its # symbol!

              • This condition is expressed with a look-ahead structure (?=......), a zero-length assertion written by the user, in the same way that, for instance, the $ symbol is a built-in zero-length assertion meaning “end of line”!

              • So the current line must be followed by text matching the regex .+#\x20?\1, which represents, again, a non-empty range of standard characters, followed by a # and possibly a space character, and finally group 1 ( \1 ), which is the ending number of the current line

              • Note that a ^ assertion for the second line, inside the look-ahead structure, would be useless, as the range ( .+ ) comes right after the line-break character(s) \R anyway!

              • As the replacement zone is empty, the current line, together with its line-break, is simply deleted
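The pieces described above can also be laid out one per line. The sketch below uses Python's verbose regex mode and only reports the matches instead of deleting them, which covers the "just show me the duplicates" case from the first post. It is a translation under assumptions, not the Notepad++ regex itself: `\r?\n` replaces `\R`, the # must be written as `\#` in verbose mode, and the final negative look-ahead is an addition of this sketch, because the look-ahead as posted would also accept a longer number that starts with the same digits. The names `ADJACENT_DUPLICATE` and `find_adjacent_duplicates` are made up for the example.

```python
import re

ADJACENT_DUPLICATE = re.compile(
    r"""
    ^ .+ \# \x20?      # start of line, any text, the hash sign, optional space
    (\d+)              # group 1: the number that ends the current line
    \r?\n              # the line break, standing in for the Boost line-break escape
    (?=                # look-ahead: the next line is inspected but not consumed
        .+ \# \x20?    #   any text, the hash sign, optional space
        \1             #   the same digits as group 1
        (?!\d)         #   added in this sketch: no further digit may follow
    )
    """,
    re.MULTILINE | re.VERBOSE,
)


def find_adjacent_duplicates(text: str):
    """Return (line_number, number) for each line whose number repeats on the next line."""
    hits = []
    for match in ADJACENT_DUPLICATE.finditer(text):
        # Count the line feeds before the match to get a 1-based line number.
        line_number = text.count("\n", 0, match.start()) + 1
        hits.append((line_number, match.group(1)))
    return hits


sample = (
    "The dog went to the park - #12554\n"
    "The cat went to the park - #12554\n"
    "The kid went to the park - #12600\n"
)
print(find_adjacent_duplicates(sample))  # [(1, '12554')]
```

Each reported line number is the first line of a pair, that is, the line the S/R would delete.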


              For a quick overview of regular expressions, see the N++ documentation below:

              https://npp-user-manual.org/docs/searching/#regular-expressions

              See also the main links regarding the Boost regex library, which is used by the N++ regex engine:

              https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html

              https://www.boost.org/doc/libs/1_70_0/libs/regex/doc/html/boost_regex/format/boost_format_syntax.html

              Finally, see this FAQ topic about regular expressions :

              https://community.notepad-plus-plus.org/topic/15765/faq-desk-where-to-find-regex-documentation

              Best Regards,

              guy038

              The Community of users of the Notepad++ text editor.